
Intent Classification with Geometrically-Friendly Embeddings

October 12, 2020



At PolyAI, our conversational agents are powered, in part, by machine learning models that detect the intent behind what a user says. For example, in a banking environment, if a customer says “When did you send me my new card?”, the models will detect that the customer is enquiring about card arrivals, and the agent will route the conversation accordingly.

Our intent models are powered by ConveRT, our state-of-the-art sentence encoder, which has been trained on billions of real-world sentences in order to capture the “meaning” of what the user says. In software, this “meaning” is encoded as a point in 1024-dimensional space. Given some training data, our intent classifiers learn to associate these points with user intents, e.g. whether the user is asking about card arrival, erroneous charges on their statement, exchange rates, etc.

An example of intent classification. An encoded point gets mapped to the best matching intent class.

For most applications, we’ve been using neural networks to classify sentence “meanings” into particular intent categories. In this post, we’ll explore a way to do intent classification with a k-nearest neighbour (KNN) classifier.

KNN is seen as a “classic” machine learning algorithm, but in contrast to modern neural networks, it offers explainability (an insight into how the classifier makes its decision), as well as the ability to enable or disable certain intents depending on the context of the conversation, among other benefits. This usually comes at the expense of classification accuracy, but later in this post we’ll talk about how to remedy this and how we harnessed the benefits of KNN.

Intent classification with nearest neighbours

Let’s visualise the 1024-dimensional “meaning” vectors, called sentence embeddings, in 2D. We can think of a neural network classifier as predicting an intent for a sentence using learned decision boundaries. A k-nearest neighbours (KNN) classifier, in contrast, works by searching the training data for the k known embeddings closest to what the user said, and predicting the intent that the majority of those neighbours share.
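To make the majority vote concrete, here is a minimal sketch of a KNN prediction in NumPy. The function name `knn_predict`, the toy 2D points and the choice of Euclidean distance are assumptions made for illustration; they are not taken from our production code.

```python
from collections import Counter

import numpy as np


def knn_predict(query, train_embeddings, train_intents, k=3):
    """Predict an intent by majority vote among the k nearest training embeddings."""
    # Euclidean distance from the query to every training embedding (an assumed metric).
    distances = np.linalg.norm(train_embeddings - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_intents[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy 2D "embeddings" standing in for the real 1024-dimensional ones.
train_embeddings = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                             [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]])
train_intents = ["AFFIRM", "AFFIRM", "AFFIRM", "DENY", "DENY", "DENY"]

query = np.array([0.3, 0.2])
print(knn_predict(query, train_embeddings, train_intents, k=3))  # AFFIRM
```

Because the prediction is just a vote among stored points, the indices in `nearest` tell you exactly which training examples were responsible for it.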

A visualisation of KNN and neural network classification. Green and red could represent “AFFIRM” (e.g. user says “Yes”, “Agree”) and “DENY” (e.g. user says “No”, “I wouldn’t like that”) intent classes, respectively. In both cases, the query point (in white) will be predicted to be green = “AFFIRM”.

It would be relatively straightforward to apply a KNN classifier now, but the embeddings we get from our encoder model are not particularly suitable for it. In the diagram below, the encoder embeddings are on the left-hand side: we see that intent clusters can have different shapes and densities, and can lie at different distances from each other. For KNN to succeed, we’d like to have neat and tidy clusters that are well separated from each other. We can accomplish this by training another neural network that transforms embeddings into what we’ll call “geometrically-friendly” embeddings.

A visualisation of encoder embeddings (left) and embeddings we would like to have for KNN classification (right). Points from the same intent should be close together, but points from different intents should be a certain distance (margin) apart.

Learning such a clustering of embeddings is explored in the branch of machine learning called metric learning (see this paper on the topic). Since we’re training a neural network, we need a suitable loss function.

Intuitively, we would like our loss function to pull transformed points closer together if they represent the same intent (positive pairs), or push them further apart if they are of different classes/intents (negative pairs), but only if the distance between them is smaller than a given “margin”. This can be achieved with a triplet or contrastive loss, but here we use the lifted structure loss. The lifted structure loss makes use of all examples in a training batch to create positive and negative pairs, allowing the training to converge faster.
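As a rough illustration, the sketch below evaluates the lifted structure loss for one batch in NumPy, following the formulation in the original lifted structured embedding paper (Oh Song et al., 2016). The function name, the Euclidean distance and the default `margin=1.0` are assumptions, and the exact variant we trained with may differ. The code only computes the loss value; actual training would use an automatic differentiation framework.

```python
import numpy as np


def lifted_structure_loss(embeddings, labels, margin=1.0):
    """Evaluate the lifted structure loss over one batch of transformed points."""
    labels = np.asarray(labels)
    n = len(labels)
    # Pairwise Euclidean distances between all points in the batch.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    # For each point, sum exp(margin - distance) over all of its negatives.
    neg_terms = np.where(~same, np.exp(margin - dist), 0.0).sum(axis=1)

    total, num_pos = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if not same[i, j]:
                continue  # only positive pairs contribute a term
            num_pos += 1
            neg_sum = neg_terms[i] + neg_terms[j]
            # "Push" term: soft maximum of (margin - distance) over negatives of i and j.
            push = np.log(neg_sum) if neg_sum > 0 else -np.inf
            # "Pull" term: the distance between the positive pair itself.
            total += max(0.0, push + dist[i, j]) ** 2
    return total / (2 * num_pos) if num_pos else 0.0


points = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
print(lifted_structure_loss(points, [0, 0, 1, 1]))  # tight, well-separated clusters
print(lifted_structure_loss(points, [0, 1, 0, 1]))  # intents mixed together
```

The first call returns zero because same-intent points are close and every negative is far beyond the margin; the second returns a large value because each positive pair is far apart while negatives sit right next to each other.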

Now we have everything we need to assemble the new intent classification system. In production, a user’s sentence is first encoded by the encoder, then transformed by our new learned transformation function, and finally classified by the KNN.
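The three stages compose as in the sketch below. Everything here is a stand-in chosen so the example runs on its own: `encode` counts letters instead of calling the real sentence encoder, `transform` is a fixed random linear map instead of the trained network, and the final step uses K=1 for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the learned transformation: a fixed random linear map (26 -> 8 dims).
W = rng.normal(size=(26, 8))


def encode(sentence):
    """Stand-in encoder: letter counts. The real system uses a sentence encoder."""
    vec = np.zeros(26)
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def transform(embedding):
    """Stand-in for the network that produces geometrically-friendly embeddings."""
    return embedding @ W


def classify(sentence, index_points, index_intents):
    """Encode, transform, then return the intent of the single nearest stored point."""
    point = transform(encode(sentence))
    distances = np.linalg.norm(index_points - point, axis=1)
    return index_intents[int(np.argmin(distances))]


train_sentences = ["yes please", "no thanks"]
train_intents = ["AFFIRM", "DENY"]
# The KNN "index" is simply the transformed training points kept in memory.
index_points = np.stack([transform(encode(s)) for s in train_sentences])

print(classify("yes please", index_points, train_intents))
```

Note that only the last stage depends on the training examples, which is why adding or removing examples does not require touching the encoder or the transformation.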

Results and benefits of the improved classifier

Here are some of the results of this setup, some of which we would not be able to achieve with a neural network-based intent classifier:

  • Have a guaranteed 100% training set accuracy for K=1 and explainability. Occasionally, we find that the current intent model misclassifies some input sentences, so we add them to the training set and retrain the classifier. Even then, we are not guaranteed that the model will learn from or remember this example (especially if it’s a particularly difficult one). However, here, when K is set to 1, an example that is already in the training set becomes its own nearest neighbour, so we’re guaranteed to get it right. For K > 1, we can check which neighbours influenced the prediction, making debugging mistakes much easier.
  • Skip retraining the transformation network when a new data point (or a new intent class!) is added. It turns out the learned transformation is good at handling a lot of new sentences that it hasn’t seen during training, even those that belong to a completely new intent. This saves time when updating the classifier: when we don’t retrain the transformation function, updating is just a matter of adding new points to the KNN, which is cheap.
  • Use dynamic datasets and contextual classes. In certain tricky conversations, we do not know for certain what intents the user may express (or sometimes even what the training data for those intents would look like) until the agent gathers some more information. Since the training is cheap, we can actually generate new training sets on-the-fly during the conversation. The use of KNN also allows us to not predict certain intent classes based on the context of the conversation. This is done by ignoring certain classes when looking up neighbours.
  • Work towards a unified classifier for all conversational agents. We found that we are able to reuse the same transformation function across multiple domains: the same instance can, for example, work for both banking and restaurant booking agents. In fact, we found that we could use a single model for up to 10 datasets without a meaningful loss of accuracy, and this number could likely be increased by making the transformation network larger. This would allow us to share an intent classifier across many projects, reducing maintenance costs for each one.
  • Achieve classification accuracy similar to the neural network classifier. On our test datasets, we achieve classification accuracies within 1-2% of our previous neural network (NN) classifier: see the accuracy plot below comparing this “geometric” classifier with KNN and NN baselines. This is a good result because we are able to harness all the benefits mentioned above, without meaningfully sacrificing the quality of predictions.

“Geometric” classifier accuracy on our test datasets, versus KNN and neural network (NN) intent classifiers. Where a dataset was downsampled, the number in brackets indicates the number of examples per intent.
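Several of the properties listed above follow directly from how a KNN index behaves, as the sketch below tries to show. The class `KnnIntentIndex`, its method names, the intent labels and the toy 2D points are all hypothetical; the points stand in for embeddings that have already been through the transformation network.

```python
from collections import Counter

import numpy as np


class KnnIntentIndex:
    """A toy in-memory KNN index over already-transformed embeddings."""

    def __init__(self, dim):
        self.points = np.empty((0, dim))
        self.intents = []

    def add(self, points, intents):
        # "Updating the classifier" is just appending points; nothing is retrained.
        self.points = np.vstack([self.points, np.asarray(points, dtype=float)])
        self.intents.extend(intents)

    def neighbours(self, query, k=1, allowed=None):
        """Return (distance, intent, row index) for the k nearest allowed points."""
        keep = [i for i, intent in enumerate(self.intents)
                if allowed is None or intent in allowed]
        distances = np.linalg.norm(self.points[keep] - query, axis=1)
        order = np.argsort(distances)[:k]
        return [(float(distances[o]), self.intents[keep[o]], keep[o]) for o in order]

    def predict(self, query, k=1, allowed=None):
        votes = Counter(intent for _, intent, _ in self.neighbours(query, k, allowed))
        return votes.most_common(1)[0][0]


index = KnnIntentIndex(dim=2)
index.add([[0.0, 0.0], [0.1, 0.1], [2.0, 2.0]],
          ["card_arrival", "card_arrival", "exchange_rate"])

# With K=1, every stored point is its own nearest neighbour.
print(all(index.predict(p, k=1) == y for p, y in zip(index.points, index.intents)))

# Add a brand-new intent without retraining anything.
index.add([[5.0, 5.0]], ["wrong_charge"])
query = np.array([4.8, 5.1])
print(index.predict(query))  # wrong_charge

# Ignore an intent that makes no sense in the current conversational context.
print(index.predict(query, allowed={"card_arrival", "exchange_rate"}))  # exchange_rate
```

The K=1 guarantee holds as long as no two identical points carry different intents, and `neighbours` returns the row indices of the examples behind a prediction, which is what makes a misclassification easy to trace.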

Overall, this is a good example of how using a different approach to classification, combined with a “classic” machine learning algorithm, can give more freedom in what we can do with our agents.

About the Author

Edgar Liberis is a PhD student at the University of Oxford. He joined PolyAI for a remote machine learning internship over Summer 2020. As well as the work discussed in this blog post, Edgar contributed to a number of projects including zero-shot learning, out of domain point detection and assembling large text classification datasets.
