1. Introduction to DeepPavlov Agent
Hello, my name is Mikhail Burtsev, and I'm the founder and leader of the DeepPavlov project at the Moscow Institute of Physics and Technology. Today, I will tell you about DeepPavlov Agent, an open-source framework for multi-skill conversational AI. Multi-skill is important because customer experience spans multiple domains, and each domain needs specific skills. Traditional conversational systems use a modular dialog system, where the user's prompt is converted to text and processed by a natural language understanding module. The Dialog Manager updates the dialogue state and selects an action based on the current state. Current systems rely on neural networks for understanding, a mix of neural networks and rules for dialogue management, and retrieval models or templates for natural language generation. The AI assistant lifecycle begins with a Minimum Viable Product, featuring pre-trained models for NLU and scripts for the Dialog Manager.
Hello, my name is Mikhail Burtsev, and I'm the founder and leader of the DeepPavlov project at the Moscow Institute of Physics and Technology. Today, I will tell you about DeepPavlov Agent, which is an open-source framework for multi-skill conversational AI.
So let's start with a question: why is multi-skill so important? It is important because customer experience spans multiple domains, like surveys, promotions, campaigns, customer service, technical support, and many others. And usually, to address every single domain, you need a specific skill. This is why we need to build a multi-skill digital assistant, with multiple conversational skills in our system.
You can see this if you take a look at e-commerce assistants, which are modern complex dialogue systems. For example, here is the case of AliMe Assist, which is an assistant at AliExpress. You can see here that it's a hybrid system with many different skills. For example, we have an assistance service with a slot-filling engine, a customer service with a knowledge graph engine, and a chatting service with a chat engine. So you see that it's a combination of business rules and scripted scenarios, with specific skills addressing different customer needs.
So what is the traditional way to build conversational systems right now? The dominant approach is the so-called modular dialog system. How does it work? We have a user, and the user gives some prompt to the system. This prompt is converted to text and fed into the natural language understanding (NLU) module, which basically performs three functions: domain detection, intent detection, and entity detection in the user's input. After this preprocessing, we have a formal description of the user input, also called a semantic frame, where we have an intent, here it's request movie, and we have entities, in this request genre: comedy and date: weekend. Then all this information goes to the Dialog Manager. The task of the Dialog Manager is first to update the current dialogue state, that is, to integrate this new information into the previous history of the dialogue, and then, with this updated dialogue state, to select the action the system should perform.
So the Dialog Manager consists of the dialogue state and the policy, or script, which decides what action should be selected given the current dialogue state. Here in our example, the action is request location. But this action is in an internal system representation, and we need to convert it into a natural language prompt. For that we have the last module of our system, natural language generation, which creates the surface form of our request to the user. So for the action request location, the output in natural language is "Where are you?". This is basically how current systems are built. In the NLU part, we mainly have a lot of neural nets, deep learning models; in the dialogue manager part, we have some neural networks and a lot of rules and scripted dialogues. And for natural language generation, we mostly have either retrieval models with some slot filling or templates.
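The modular pipeline just described (NLU, dialogue state update, policy, template-based NLG) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the keyword-based `understand` function stands in for trained NLU models, and the names `SemanticFrame`, `DialogState`, `policy`, and `TEMPLATES` are hypothetical, not part of any library mentioned in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticFrame:
    """Formal description of one user utterance, as produced by NLU."""
    intent: str
    entities: dict


@dataclass
class DialogState:
    """Slots collected so far plus the history of intents."""
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


def understand(text: str) -> SemanticFrame:
    # Toy keyword-based NLU; a real system would use trained models here.
    lowered = text.lower()
    entities = {}
    if "comedy" in lowered:
        entities["genre"] = "comedy"
    if "weekend" in lowered:
        entities["date"] = "weekend"
    intent = "request_movie" if "movie" in lowered else "unknown"
    return SemanticFrame(intent=intent, entities=entities)


def update_state(state: DialogState, frame: SemanticFrame) -> DialogState:
    # Integrate the new frame into the previous history of the dialogue.
    state.history.append(frame.intent)
    state.slots.update(frame.entities)
    return state


def policy(state: DialogState) -> str:
    # Scripted policy: ask for the location if it is still missing.
    if state.history and state.history[-1] == "request_movie":
        if "location" not in state.slots:
            return "request_location"
        return "inform_movie"
    return "fallback"


# Template-based NLG: one surface form per internal action.
TEMPLATES = {
    "request_location": "Where are you?",
    "inform_movie": "Here are some movies near {location}.",
    "fallback": "Sorry, I did not understand that.",
}


def generate(action: str, state: DialogState) -> str:
    # Render the internal action as a natural language prompt.
    return TEMPLATES[action].format(**state.slots)


def respond(text: str, state: DialogState):
    frame = understand(text)
    update_state(state, frame)
    action = policy(state)
    return action, generate(action, state)


state = DialogState()
action, reply = respond("Find me a comedy movie for the weekend", state)
print(action, "->", reply)
```

With the example utterance from the talk, the frame carries the intent `request_movie` with the entities genre and date, and the scripted policy selects `request_location` because no location slot has been filled yet.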
Okay, so then what is the AI assistant lifecycle? How are we building our digital assistants, our dialogue systems, with this modular technology? Usually we start with an MVP, a Minimum Viable Product. For NLU, we have some features and some models pre-trained for this domain, and on the side of the Dialog Manager, we have a few scripts. It's a very nice and clear architecture, and we understand how it works.
2. Advantages of Decomposing Complexity
We want to go beyond the complexity ceiling of the current technology by decomposing complexity between the agent and conversational skills. Our microservice architecture allows for scalability and the reuse of existing skills. With the DeepPavlov Library, we can build NLP pipelines and combine different components into conversational skills for specific domains and tasks.
The MVP covers the most important aspects of the interaction between the system and the user. Then we deploy this MVP into production, and the system starts to interact with users. Here we understand that we need to increase the coverage of the system, because users ask the same questions differently due to language variability. So we need to add more features and make the natural language understanding part of our system more complex.
We also want to cover more functions, so we add more scripts on the side of our dialogue manager. We continue with more features and more scripts, more features, more scripts, until we reach the so-called mature AI assistant stage, which is actually a mess of features and scripts. This is a solution that has already approached the maximum complexity it can have, because of all of these interdependent components. So now you're in a position where you cannot grow your product anymore.
And what we want to do with our framework, with DeepPavlov Agent, is to break this picture. We want to go beyond this complexity ceiling of the current technology. In our vision, the AI assistant lifecycle starts with the same simple, clear, and nice MVP. Then you test it and add it to the already deployed system as just one of the conversational skills. And if you want to add more functionality to your system, you just create a new conversational skill and add it to your agent. This allows you to decompose complexity between the agent, which is basically a skill orchestration framework, and the conversational skills. And this gives you a very nice microservice architecture, which can be scaled to a much more complex mature AI assistant.
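To make the idea of decomposing complexity concrete, here is a minimal Python sketch in which new functionality is added by registering a skill with an orchestrating agent rather than by editing existing scripts. The `Agent` class, its `add_skill` and `respond` methods, and both toy skills are hypothetical names chosen for illustration; this is not the DeepPavlov Agent API.

```python
from typing import Callable, Dict, Tuple

# A skill maps a user utterance to a (response, confidence) pair.
Skill = Callable[[str], Tuple[str, float]]


class Agent:
    """Hypothetical orchestrator: holds skills, returns the most confident reply."""

    def __init__(self) -> None:
        self.skills: Dict[str, Skill] = {}

    def add_skill(self, name: str, skill: Skill) -> None:
        # New functionality arrives as a new skill; existing skills stay untouched.
        self.skills[name] = skill

    def respond(self, utterance: str) -> Tuple[str, str]:
        candidates = {name: skill(utterance) for name, skill in self.skills.items()}
        best = max(candidates, key=lambda name: candidates[name][1])
        return best, candidates[best][0]


def order_skill(utterance: str) -> Tuple[str, float]:
    # The original MVP, now just one skill among others.
    if "order" in utterance.lower():
        return "Your order is on its way.", 0.9
    return "Could you rephrase that?", 0.1


def weather_skill(utterance: str) -> Tuple[str, float]:
    # A skill added later, possibly by a different team.
    if "weather" in utterance.lower():
        return "It looks sunny today.", 0.8
    return "", 0.0


agent = Agent()
agent.add_skill("order", order_skill)
first = agent.respond("Where is my order?")

agent.add_skill("weather", weather_skill)
second = agent.respond("What is the weather like?")
third = agent.respond("Where is my order?")
print(first, second, third)
```

Adding the weather skill changes nothing in the order skill, which is the property that keeps the system below the complexity ceiling described above.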
It also gives you many nice features. For example, you can have default skills: you don't need to develop them yourself, you just need to plug in your own skills. And, as I've said, it's a very scalable architecture because every skill is deployed as a microservice. It's also very handy because when you create a new product, or you want to create new skills which are similar to ones you already have, you can just reuse the old one and extend it for the new function or integrate it into your product. Another point, which is important in our global development culture right now, is that complex solutions are usually built by distributed teams. This architecture of skill orchestration and the modular structure of your conversational agent allow you to distribute the maintenance and development of separate skills to separate teams. This makes your work and the coordination between skills much more organized and much more efficient.
So this is our vision: we want to have conversational skills, and we want to have a conversational orchestration level. What are we doing right now to implement this vision? We started with the DeepPavlov Library. DeepPavlov Library is an open-source library for building NLP pipelines and conversational skills for conversational AI. You can have specific NLP models like named entity recognition, coreference resolution, intent recognition, and self-detection, question answering, dialogue policy, dialogue history, language models, and so on. Then, with our framework, you can combine these different components into conversational skills for specific domains and specific tasks, like here.
3. DeepPavlov Agent Framework and DreamSocialBot
We have task-oriented skills, factoid skills, and chit-chat skills, all orchestrated by the DeepPavlov Agent framework. An example of this architecture is the DreamSocialBot, built for the Alexa Prize competition. User input is processed by annotators, which extract information and update the dialog state. A skill selector determines the most relevant skills, and a subset of these skills is executed to produce response candidates. Candidate annotators ensure user safety, and a response selector performs final selection. This multi-skill system runs asynchronously, with the DeepPavlov library serving as the foundation for creating NLP pipelines.
We can have some task-oriented skills, like restaurant booking. We can have factoid skills, which allow you to answer factoid questions, and we can have chit-chat skills. And then we have the DeepPavlov Agent framework, which orchestrates these skills.
As an example of this architecture, I would like to present the architecture of DreamSocialBot, which is built with DeepPavlov and the DeepPavlov Agent framework for the Alexa Prize Challenge. Our team, which is a university team, participated last year in the Alexa Prize Challenge and was selected as one of the 10 teams out of 350 applications to develop a solution which is hosted inside Amazon Alexa and can be invoked by Alexa chat commands. We used our DeepPavlov Agent to build this DreamSocialBot. So, let's take a look at how it works.
First we have the user input, and this user input goes to the so-called annotators. This is the first stage of processing. The current annotators are also implemented as DeepPavlov NLP pipelines and run as microservices, and they are used to extract information from the user input. After that, this annotated user input goes to the dialog state, which is like a shared memory between all microservices, such as annotators, skills, and so on. The annotated user input and the current dialog state are then used by a skill selector to decide which skills are most relevant to the current dialog state. For the competition, we developed about 25 different skills, some of them for weather, movies, books, general chat, and so on.
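The first half of this pipeline (annotators writing into a shared dialog state, then a skill selector reading from it) might look roughly like the following Python sketch. The annotators, the layout of the state dictionary, and the skill names are all assumptions made for illustration; the real annotators are DeepPavlov NLP pipelines running as microservices, not keyword checks.

```python
from typing import Callable, Dict, List


def topic_annotator(text: str) -> str:
    # Toy topic detection by keyword; stands in for an NLP pipeline.
    lowered = text.lower()
    for topic in ("weather", "movie", "book"):
        if topic in lowered:
            return topic
    return "general"


def question_annotator(text: str) -> bool:
    return text.strip().endswith("?")


# Hypothetical registry of annotators, each run on every user input.
ANNOTATORS: Dict[str, Callable] = {
    "topic": topic_annotator,
    "is_question": question_annotator,
}

# Hypothetical mapping from detected topic to relevant skills.
SKILLS_BY_TOPIC: Dict[str, List[str]] = {
    "weather": ["weather_skill"],
    "movie": ["movie_skill"],
    "book": ["book_skill"],
}


def annotate(text: str, dialog_state: dict) -> dict:
    # Store the utterance with its annotations in the shared dialog state.
    annotations = {name: fn(text) for name, fn in ANNOTATORS.items()}
    dialog_state["utterances"].append({"text": text, "annotations": annotations})
    return dialog_state


def select_skills(dialog_state: dict) -> List[str]:
    # Pick topic-specific skills; general chat is always kept as a candidate.
    last = dialog_state["utterances"][-1]
    topic = last["annotations"]["topic"]
    return SKILLS_BY_TOPIC.get(topic, []) + ["chitchat_skill"]


state = {"utterances": []}
annotate("Will the weather be nice tomorrow?", state)
selected = select_skills(state)
print(selected)
```

Only the selected subset of skills would then be called, which matters when the agent hosts a couple of dozen skills as in the competition setup.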
Then, as I've said, only some subset of these skills is selected, and these skills are executed and produce response candidates. Every skill outputs a response candidate and some confidence in its own response. All these candidates go to candidate annotators. We need these annotators to be sure that no candidate might be harmful to our users, so we perform toxicity detection, dialog termination detection, and blacklist filtering of the response candidates. After that, we have annotated response candidates, and the response selector performs the final selection of the output of our system. This response can also be post-annotated and presented to the user. So, as you see, it's a multi-skill system, and all the elements of our pipeline run asynchronously. All the annotators and skills run asynchronously as microservices with only two points of synchronization, the skill selector and the response selector, and our dialogue state, as I've said, serves as a shared memory.
So what we have right now, what we want to build, is what we can call a DeepPavlov ecosystem. It's an ecosystem of our products. On the left-hand side here, you can see our DeepPavlov library, which is a library for creating NLP pipelines. You can also include third-party NLP models, like Hugging Face Transformers or NVIDIA NeMo, as components of your NLP pipelines. Then you can deploy all these NLP pipelines as microservices in the cloud. For example, for your AI assistant, these can be annotators or some NLP components. And then we have what we call DeepPavlov Dream.
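The second half of the turn described in this section (selected skills running concurrently, candidate annotation for safety, then a response selector as the synchronization point) can be sketched with Python's `asyncio`. The skills, the `BLACKLIST` set, and the candidate dictionary fields are hypothetical; in the real system each skill is a separate microservice and the filters include toxicity and dialog termination detection.

```python
import asyncio

# Hypothetical blacklist used by the candidate annotator.
BLACKLIST = {"stupid"}


async def weather_skill(text: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for a call to a microservice
    return {"skill": "weather", "text": "It should be sunny tomorrow.", "confidence": 0.8}


async def chitchat_skill(text: str) -> dict:
    await asyncio.sleep(0.01)
    return {"skill": "chitchat", "text": "That is a stupid question.", "confidence": 0.9}


def is_safe(candidate: dict) -> bool:
    # Toy blacklist filter over lowercase words of the candidate response.
    words = candidate["text"].lower().replace(".", " ").split()
    return not BLACKLIST.intersection(words)


async def run_turn(text: str, skills: list) -> dict:
    # The selected skills run concurrently and each returns one candidate.
    candidates = await asyncio.gather(*(skill(text) for skill in skills))
    # Candidate annotation: mark each candidate as safe or not.
    for candidate in candidates:
        candidate["safe"] = is_safe(candidate)
    safe = [c for c in candidates if c["safe"]]
    if not safe:
        return {"skill": "fallback", "text": "Let's talk about something else.",
                "confidence": 0.0, "safe": True}
    # Response selector: the synchronization point that picks the final output.
    return max(safe, key=lambda c: c["confidence"])


best = asyncio.run(run_turn("Will it rain tomorrow?", [weather_skill, chitchat_skill]))
print(best["skill"], "->", best["text"])
```

Here the chit-chat candidate has the higher confidence but is filtered out by the blacklist, so the response selector returns the weather candidate.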
4. DeepPavlov Dream and DeepPavlov Agent
We want to open source our skills and provide them as a default distribution for conversational agents. You can reuse our skills and add your own task-oriented skills. DeepPavlov Dream is a repository for different skills and templates, allowing the integration of third-party skills. The DeepPavlov Agent orchestrates all components to produce the final conversational experience. We aim to create an open hub for exchanging skills, making development easier and faster. Our DeepPavlov library integrates ML models into NLP pipelines and can also solve conversational skill problems. We provide a unique solution with our DeepPavlov Agent for multi-skill orchestration.
Why do we call this DeepPavlov Dream? Because we want to open source all the skills which we developed for the competition, to provide a default distribution for a conversational agent that others can use. Then you don't need to develop your own chit-chat skill or basic skills like weather and so on. For your solution, you can just reuse our skills and then add your own task-oriented skills.
So as I've said, DeepPavlov Dream is like a repository for different skills or templates. You can also use third-party skills here, because our architecture allows you to integrate other skills via API. So you can use Rasa or AIML skills in your pipeline as a part of DeepPavlov Dream. Then you run all these skills as microservices. And in the center here we have the DeepPavlov Agent, which orchestrates all these components, annotators and skills, to produce the final conversational experience of your AI assistant.
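Integrating a third-party skill via API usually comes down to wrapping the external engine behind a small request and response contract. The Python sketch below shows one way this could look. The JSON field names (`utterance`, `response`, `confidence`) and the functions `third_party_bot` and `handle_skill_request` are assumptions for illustration only; they are not the actual DeepPavlov Agent schema, and a real deployment would expose the handler over HTTP.

```python
import json


def third_party_bot(message: str) -> str:
    # Stand-in for an external engine, for example a Rasa or AIML bot.
    if "hello" in message.lower():
        return "Hello! How can I help?"
    return "I am not sure."


def handle_skill_request(body: str) -> str:
    """Hypothetical skill endpoint: JSON request in, JSON response out."""
    request = json.loads(body)
    reply = third_party_bot(request["utterance"])
    # Report low confidence when the wrapped bot has no real answer.
    confidence = 0.5 if reply == "I am not sure." else 0.9
    return json.dumps({"response": reply, "confidence": confidence})


raw = handle_skill_request(json.dumps({"utterance": "Hello there"}))
result = json.loads(raw)
print(result)
```

Because the orchestrator only sees the contract, the engine behind it can be replaced without touching the other skills.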
And as I've said, what we want to create can be seen as an analogue of an open-source operating system. But here we have not an operating system but a conversational, or dialog, system. We have applications, which are the conversational skills, we have services like annotators, and we have the user experience. So you can add your applications to make the user experience good. We also want to create an open hub for the exchange of skills, to make life easier for every developer of complex conversational assistants. If we exchange general-purpose skills, it will make the development of a solution much easier and the prototyping of your complex conversational agents faster.
Now, if we look at the stack of conversational AI technologies, at the bottom of this stack we have ML platforms like PyTorch and TensorFlow. Then we have NLP frameworks, which integrate ML models into NLP pipelines. Here we have spaCy, Transformers, NVIDIA NeMo, or even Stanford NLP, and our DeepPavlov library belongs to this NLP frameworks level. But not only to this level: it also takes a part of the conversational skills level, because you can create an NLP pipeline in DeepPavlov which exactly solves the problem of a conversational skill.
On this conversational skill level, we have Rasa or Pandorabots or LMONT. These are frameworks for creating separate conversational skills. And then on the highest level, the level of multi-skill orchestration, I think that in the open-source domain we provide a unique solution right now. It's our DeepPavlov Agent, which is a framework for conversational skills orchestration. It allows you to deploy your skills, to manage them, and to orchestrate them to provide a very nice user experience. In this picture, you can see the whole line of our framework, which we are building to create a full-stack open-source solution for conversational AI. So thank you for your attention. That's all. I would be very glad to answer your questions. Thank you.
Hi, once again. Hey, thanks for joining us today and giving this amazing talk.
5. Comparison with Microsoft LUIS
DeepPavlov.AI is fully open source, providing more flexibility compared to Microsoft's Language Understanding service (LUIS). While there is a steeper learning curve, our framework allows for greater customization and control over the components.
We have some questions from the audience. Shall we get right to it? Yep. All right. Well, the first question is actually from my co-emcee, AJ, and he would like to know: how do you compare DeepPavlov AI to other conversational AI, like those built with Microsoft's Language Understanding service, for example? Sorry, can you repeat the question? How would you compare DeepPavlov.AI to other conversational AI, like those built with Microsoft's Language Understanding service, LUIS, for example? Okay. So the major difference here is that our project is fully open source. With Microsoft LUIS, you are not as flexible as with our framework. Of course, this flexibility comes at a cost: you have to master all the components, so you need better knowledge of the components to build something similar to what you can build on LUIS. Yes, so it is a steeper learning curve, but you are more flexible. Yes. Yes, all right.