Last year, we got to see coherent, multi-paragraph text generated by OpenAI’s GPT-2 model. This week, a new paper from Google AI showed that a chatbot based on a gigantic neural network and huge amounts of data can hold coherent conversations, maintaining context over multiple turns and conversing on just about any topic. The chatbot is called Meena, and it’s even able to invent jokes (see picture).
As someone who works in this field, I am impressed. (I’m a co-founder of Rasa, a company that offers an open source conversational AI framework that Meena may compete with some day.) But what does this result mean for the future of AI assistants?
Known ideas, well executed
More data, more computing power, better model, right? We hear this all the time, but it’s interesting to actually put this idea to the test. How far can you push it, and where do the returns flatten out? It’s the question behind multiple headline-catching results like OpenAI’s GPT-2 model and their champion Dota system. Combine some neural architectures and algorithms that are known to work well, and see what they can achieve when trained on tens or hundreds of times more data. Meena is a sequence-to-sequence model based on a transformer architecture. These are both widely used tools, but this is the first time anyone’s trained such a model at this scale.
The “newest” part of the paper is a proposed method for evaluating the quality of the chatbot. Meena is an open domain system, meaning that it can talk about anything the user wants to. Evaluating these kinds of models is fiendishly difficult, and coming up with good metrics is a whole field in itself. This paper’s solution is strikingly simple: show people the chatbot’s responses and ask them first, does this response make sense? And second, is this response specific? A problem with previous sequence-to-sequence models is a tendency to play it safe and come up with generic responses like “I don’t know” or “ok.”
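To make the two-question evaluation concrete, here is a minimal sketch of how such human ratings could be turned into scores. The `Rating` class and `score_responses` function are hypothetical names, and two choices are assumptions made for illustration rather than details taken from the article: the two rates are simply averaged, and a response that does not make sense is never counted as specific.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    sensible: bool  # did the response make sense in context?
    specific: bool  # was it specific to the context rather than generic?


def score_responses(ratings):
    """Return (sensibleness, specificity, average) as fractions in [0, 1]."""
    if not ratings:
        raise ValueError("need at least one rating")
    n = len(ratings)
    sensibleness = sum(r.sensible for r in ratings) / n
    # Assumption: a response that does not make sense cannot count as specific.
    specificity = sum(r.sensible and r.specific for r in ratings) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2


ratings = [
    Rating(sensible=True, specific=True),    # an on-topic, detailed answer
    Rating(sensible=True, specific=False),   # "I don't know"
    Rating(sensible=True, specific=False),   # "ok"
    Rating(sensible=False, specific=False),  # nonsense
]
sensibleness, specificity, average = score_responses(ratings)
print(sensibleness, specificity, average)
```

In this toy set, the generic replies still count as sensible but pull the specificity rate down, which is exactly the play-it-safe behavior the second question is meant to penalize.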
Last year’s GPT-2 model was trained on a very simple task: go through a bunch of text and predict the next word based on the words you’ve already seen. Conceptually, the way Meena was trained is no different; the network just has to predict the next response.
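As a rough illustration of what "predict the next response" means in terms of data, the sketch below turns one conversation into (context, response) training examples. The function name `make_training_pairs` and the `max_context_turns` parameter are hypothetical, and the default window size is an arbitrary choice for the example, not a figure from the paper.

```python
def make_training_pairs(turns, max_context_turns=3):
    """Turn a list of conversation turns into (context, response) pairs.

    Each turn after the first becomes a target response; the turns just
    before it (at most max_context_turns of them) become the context.
    """
    pairs = []
    for i in range(1, len(turns)):
        start = max(0, i - max_context_turns)
        pairs.append((turns[start:i], turns[i]))
    return pairs


conversation = [
    "Hi!",
    "Hello, how are you?",
    "Great, thanks.",
    "Glad to hear it.",
]
pairs = make_training_pairs(conversation, max_context_turns=2)
for context, response in pairs:
    print(context, "->", response)
```

A sequence-to-sequence model would then be trained to produce each response from its context, one token at a time, which is why the setup is conceptually so close to next-word prediction.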
Reasons to be cautious
The first thing to mention is that for now, this work is a preprint and hasn’t been peer reviewed. The authors make some strong claims that will have to stand up to scrutiny from experts. The most obvious question is memorization: with 2.6 billion parameters to work with, the authors will have to show convincingly that Meena hasn’t merely (approximately) memorized appropriate responses to just about anything we can say. In any case, there are some clear limitations to Meena that we can already see.
Large neural networks can often seem smart and yet have significant weaknesses. Think of the adversarial patches that trick object detection algorithms but would never throw off a human. Similarly, the GPT-2 model can generate impressively coherent text, but fails at being logically consistent.
Many things that look like natural language understanding problems are actually about understanding the real world. I appreciate that Meena’s authors included some transcripts where the bot fails badly. I love Meena’s imagined anecdote about coastal Arizona: “Yup. I live in southern Arizona, so there’s plenty of surfing to be had.” This should remind us that these models don’t actually understand. That said, they are incredibly powerful tools and can be put to good (and bad!) use.
AI as a public relations risk
Looking at the sample conversations in the paper, I am dying for a chance to talk to Meena. I’m so curious and there are so many things I’d like to try. So why isn’t there a demo available?
In an accompanying blog post, Google tells us this is related to safety. In the paper, the authors are coy about their source of training data. I assume it’s Reddit, since they talk about “public domain social media conversations … in a comment tree structure”. Where else would you get 341 GB of conversation-like text?
It’s notable that Google has not released this publicly yet. Even during the research phase, the authors used an additional filter to remove “sensitive or toxic response[s]”. This isn’t easy to do. Even innocuous responses like “yes” and “no” can be deeply problematic depending on context.
Last year, Microsoft published related work on a model called DialoGPT. The authors of DialoGPT released their model, but not the decoder that would allow anyone to easily play with it. Google says that they might release Meena’s model in the future, but clearly they don’t want an online demo on a Google domain or in any way affiliated with Google branding.
What needs to happen before this has any impact on real-world chatbots
A number of things need to happen before this result can have any impact on real-world chat and voice assistants. First, the authors need to release the model and the training data so that others can build on this research. According to the blog post, Google “may choose to make it available in the coming months.”
If the model is released, we will see dozens of papers in the coming months analyzing the model and building on the ideas in the paper. Remember the BERT model from 2018? There has been so much work on analyzing its behavior that this sub-field has its own name: BERT-ology.
How can we use this research to improve chatbots? The biggest hurdle is that a model like Meena is end-to-end: you put a message in and get a response out. There is no systematic way to control what the model will talk about. The best we can do is sample a lot of candidates and hope that one of them is close to what we want. Meena’s ability to carry context across multiple turns is impressive (just re-read the joke at the top of the page), but the mechanism is hidden from us. Developers can’t just “hook in” and bring this ability into their AI assistants.
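The "sample a lot of candidates and hope" approach described above can be sketched as follows. Everything here is a stand-in: `generate` and `score` are hypothetical callables that a developer would have to supply, and the toy versions below only imitate an end-to-end model. In a real system the score might be the model's own likelihood or a separate classifier, which is an assumption and not something the article specifies.

```python
import random


def sample_and_rank(generate, score, message, num_candidates=20, seed=None):
    """Draw several candidate replies and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(message, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda candidate: score(message, candidate))


# Toy stand-ins for an end-to-end model (hypothetical, for illustration only).
CANNED = ["ok", "I don't know", "I like hiking in the mountains near my town."]


def toy_generate(message, rng):
    # A real model would sample a reply conditioned on the message.
    return rng.choice(CANNED)


def toy_score(message, candidate):
    # Crude proxy for "close to what we want": prefer less generic replies.
    return len(set(candidate.lower().split()))


best = sample_and_rank(toy_generate, toy_score, "What do you do for fun?", seed=0)
print(best)
```

Note what the sketch cannot do: the developer only filters what the model happens to produce. If none of the sampled candidates is acceptable, there is no lever to pull other than sampling again.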
Let’s say we were building a chatbot and wanted to integrate Meena just for the excellent jokes. Any time a user says “tell me a joke,” the chatbot hands over to Meena. That’s straightforward enough and should work fine. But given the size of Meena’s model and the number of parameters, I’d be surprised if you could get a response in under five seconds on typical hardware. I really hope that the training data and model are made available, so that researchers and engineers can work on bringing these ideas into a practical system. Compressing large models to make them lean and fast is an active research area at Rasa, where I work.
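A minimal sketch of such a handoff, with a latency budget to guard against slow responses, might look like the following. The names are all hypothetical: `joke_model_reply` stands in for a call to a large model (no public Meena API exists to call), `assistant_reply` is the regular chatbot logic, and the keyword check is a placeholder for a proper intent classifier.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# One background worker for calls to the (hypothetical) large model.
_executor = ThreadPoolExecutor(max_workers=1)


def respond(message, assistant_reply, joke_model_reply,
            timeout_seconds=2.0, fallback="Sorry, I'm out of jokes right now."):
    """Route joke requests to the large model, everything else to the assistant."""
    # Placeholder intent detection; a real assistant would use a classifier.
    if "joke" not in message.lower():
        return assistant_reply(message)
    future = _executor.submit(joke_model_reply, message)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The slow call keeps running in the background; we just stop waiting.
        return fallback


print(respond("What's the weather?", lambda m: "Sunny.", lambda m: "A joke."))
print(respond("Tell me a joke", lambda m: "Sunny.", lambda m: "A joke."))
```

The timeout keeps the conversation responsive, but it does not make the model any faster, which is why model compression matters for this kind of integration.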
Not to mention, if Google is being this careful about a release, few people will be comfortable letting Meena loose on their users until the acknowledged risks get addressed. According to Google: “tackling safety and bias in the models is a key focus area for us, and given the challenges related to this, we are not currently releasing an external research demo”.
So, are AI assistants about to get a lot smarter? Yes, but it won’t happen overnight. It’ll require a lot of ingenuity to take these results and translate them into practical improvements for real-world AI assistants. And while I’m biased, of course, my belief is that the open source frameworks will innovate fastest.
Alan Nichol is co-founder and CTO of Rasa, a company that offers an open source conversational AI framework.