A Repository of Conversational Datasets

Matthew Henderson
15 Apr 2019 - 2 minutes read

Progress in Machine Learning is often driven by large datasets and consistent evaluation metrics. To this end, PolyAI is releasing a collection of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation framework for models of conversational response selection.

We are initially releasing three large conversational datasets: Reddit, OpenSubtitles, and AmazonQA.

Conversational response selection, the task of identifying the correct response to a given conversational context, provides a powerful signal for learning implicit semantic representations that are useful for many downstream tasks in natural language understanding. Models of conversational response selection can also be used directly to power dialogue systems, question answering features, and response suggestion systems.

We hope that these datasets can provide a common testbed for work on conversational response selection. The 1-of-100 accuracy metric, which measures how often the correct response is selected over 99 random responses, allows for direct comparison of models.
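As an illustration of the metric, here is a minimal sketch in Python. It assumes the test examples are grouped into batches of 100 (context, response) pairs, and that within each batch the other 99 responses act as the random negatives for every context. The function names, the `score` callback, and the dot-product scorer are hypothetical choices made for this example, not the API of the GitHub repository.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")  # context representation (hypothetical)
R = TypeVar("R")  # response representation (hypothetical)


def one_of_n_accuracy(
    contexts: Sequence[C],
    responses: Sequence[R],
    score: Callable[[C, R], float],
    n: int = 100,
) -> float:
    """Fraction of contexts whose true response outscores the n - 1 others.

    contexts[i] and responses[i] form a true pair. Examples are split into
    consecutive batches of size n; a trailing partial batch is ignored.
    """
    if len(contexts) != len(responses):
        raise ValueError("contexts and responses must be the same length")

    usable = (len(contexts) // n) * n
    if usable == 0:
        raise ValueError("need at least one full batch of n examples")

    correct = 0
    for start in range(0, usable, n):
        batch_contexts = contexts[start:start + n]
        batch_responses = responses[start:start + n]
        for i, context in enumerate(batch_contexts):
            scores = [score(context, r) for r in batch_responses]
            # Index of the highest-scoring response; ties go to the lowest index.
            best = max(range(n), key=scores.__getitem__)
            correct += int(best == i)
    return correct / usable


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed scorer: dot product of context and response vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Calling `one_of_n_accuracy(context_vectors, response_vectors, dot)` on encoded test pairs would return a value between 0 and 1; multiply by 100 to express it as a percentage. A model that scores at random would be expected to land near 1%.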

1-of-100 accuracy (%) results for various baselines on the three datasets. The PolyAI encoder model is a deep neural network trained to project contexts and responses into a shared high-dimensional vector space.

For full details, see the Conversational Datasets GitHub repository, and our paper on arXiv. The GitHub repository contains scripts to generate these datasets, implementations of various conversational response selection baselines, and tables of benchmark evaluation results.

We welcome contributions to the GitHub repository, including new datasets, new evaluation results, and new baselines.

Thanks to my colleagues at PolyAI.
