INF5820, Obligatory Assignment 3: Development of a Spoken Dialogue System




Pierre Lison
October 29, 2014

In this project, you will develop a full, end-to-end spoken dialogue system for an application domain of your choice. To this end, you will use the OpenDial toolkit, which allows you to design a dialogue system using probabilistic rules specified in XML. An external cloud-based API will be used for speech recognition and synthesis.

To complete this assignment, follow the step-by-step explanations in the next pages. Note that you will need a microphone for this assignment (internal microphones in laptops are sufficient). The submission deadline is set to November 19 at 23:59. I advise you to start working on this assignment as early as possible!

Step 1: Choose an application domain

The first step is to choose an application domain for your spoken dialogue system. Here are some examples of possible domains:

- A dialogue system for ordering pizzas (or sushi, books, etc.).
- An information kiosk for a public transport network (to answer queries such as "when is the next bus arriving at Carl Berners Plass?").
- A (simulated) robot able to execute some simple tasks, such as finding and fetching (simulated) objects.
- A Rogerian psychotherapist similar to the doctor.el script in Emacs.
- A (very simplified) navigation assistant that can provide driving instructions from place A to place B.
- A tutoring system that can help you practise a particular skill (e.g. a grammatical construction in a foreign language).
- A health coach that helps elderly persons with their medication and exercises.

- A virtual receptionist for a company.
- A (simulated) home automation system that allows you to control particular devices in your house.

You are of course not limited to this list of examples! The only constraint is that your system must use speech as the communication medium with the human user. You can use Norwegian or English as the language for the dialogue system. Please make sure that the application domain you select is constrained enough that you can develop the application in a reasonable time (remember that you only have 3 weeks to complete this assignment). The goal of the assignment is to give you a better grasp of how spoken dialogue systems work in practice and to explore some of the concepts seen in the lectures; in other words, the goal is not to build a professional-grade application! I would therefore advise you to adopt an iterative approach: start with a basic, toy dialogue domain, and gradually extend its coverage.

Step 2: Formalise the task

Once you have chosen the task, the next step is to formalise it. More precisely, try to answer the following questions:

1. What are the possible user inputs for the system? Which user utterances should be covered? For instance, a pizza delivery system might need to cover sentences such as "what types of pizzas do you have?", "I would like a pizza X" (with X being a pizza type), etc.

2. What are the (communicative and non-communicative) actions available to the dialogue system? In other words, what will the dialogue system be able to say or do? For the pizza delivery example, the actions might be to ask the customer for their type of pizza, ask the customer to repeat, confirm the delivery, etc.

3. Which variables need to be recorded in the dialogue state? In other words, are there contextual variables that the system needs to keep track of in your application domain (beyond the last user dialogue act and the last system action)? For instance, a navigation assistant may require the current location of the user as a contextual variable.

4. Do you need to implement external modules in addition to the core dialogue system components? If you decide to work on a simulated robot, for instance, you will need to develop a dummy simulator to represent the robot environment.
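Taken together, the answers to these questions should cover short exchanges like the following (a purely hypothetical pizza-domain interaction; the exact wordings are only illustrations):

  U: What types of pizzas do you have?
  S: We have peperoni and margherita.
  U: I would like a peperoni pizza, please.
  S: You ordered one peperoni pizza. Is that correct?
  U: Yes.
  S: Your order is confirmed.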

Once you have answered these questions, try to sketch a few interaction scenarios for your domain, and see whether these interactions can be fully covered with the user inputs and system actions you have listed. To get a more precise idea of how human users are actually going to interact in your domain, it is also useful to conduct some Wizard-of-Oz experiments: ask a friend to act as the human user, while you act as the dialogue system (preferably remotely, e.g. via Skype). You can then adapt your domain design in light of these interactions.

Step 3: Install OpenDial

The OpenDial toolkit will be used as the main platform to develop your dialogue system. OpenDial is a Java-based, domain-independent toolkit that allows system developers to build practical dialogue systems based on probabilistic rules specified in XML. You can find the toolkit and its documentation at the following address:

  http://opendial.googlecode.com

Please check out the version on the development trunk (as I will sometimes update the code to correct bugs or add new functionalities):

  svn checkout http://opendial.googlecode.com/svn/trunk/ opendial

Once the code is downloaded, you can simply compile it with "ant compile" and start the toolkit with "ant run". You should then go through the step-by-step example in the online documentation to get a better grasp of the toolkit.

Note: if you encounter bugs or strange behaviours in OpenDial, please use the issue tracker on the website to record them.

Step 4: Set up the speech recogniser

The first component to set up is the speech recogniser. We are going to use a cloud-based solution. Two cloud-based speech plugins for OpenDial are available: Nuance Speech and AT&T Speech. The Nuance speech recogniser generally has better recognition performance than AT&T, but AT&T allows you to specify a recognition grammar (while Nuance is limited to a custom vocabulary). I personally recommend you to try the Nuance plugin first, and switch to AT&T if you experience problems with the recognition. The installation instructions are detailed in the README.txt file for each plugin.

Once you have followed all installation steps, test the setup and check that you can get recognition results in the chat window of OpenDial. If you don't get any results, make sure the microphone is working properly, and that the right audio input source is selected in the OpenDial menu bar.

You will probably notice that the quality of the recognition results is not always very good. To improve the recognition performance, you should modify the language model used by the speech recogniser. For the Nuance recogniser, this is done by specifying a custom vocabulary: a collection of phrases that are especially likely in the domain (please see the documentation on the Nuance website for details). For the AT&T recogniser, this is done by specifying a context-free recognition grammar that covers the set of possible user utterances. Modify the custom vocabulary or recognition grammar until you get recognition results of reasonable accuracy for your domain.

Step 5: Construct the dialogue act recognition model

The next step is to build a natural language understanding (NLU) model that maps user utterances to high-level, logical representations of the user dialogue act. For instance, one should map an utterance such as "I would like a peperoni pizza, please!" to a predicate such as Request(PeperoniPizza,1). Such mappings can easily be specified using XML rules in OpenDial. To reuse the above example, one can write a rule such as:

  <condition operator="or">
    <if var="u_u" value="a peperoni pizza" relation="in" />
    <if var="u_u" value="one peperoni pizza" relation="in" />
    <if var="u_u" value="one pizza with peperoni" relation="in" />
  </condition>
  <effect>
    <set var="a_u" value="Request(PeperoniPizza,1)" />
  </effect>

The above rule specifies that the user dialogue act a_u will be set to Request(PeperoniPizza,1) if the user utterance contains a substring such as "a peperoni pizza", "one peperoni pizza" or "one pizza with peperoni". Of course, this is a very basic example, and you can write more complex types of rules with nested logical operators, underspecified variables, etc. See the OpenDial documentation for details.

The collection of rules should be specified within a model <model trigger="u_u"> ... </model> to get OpenDial to trigger these rules upon changes to the user utterance variable u_u. If required for your application domain, you can design more complex NLU models (using e.g. a sequence of several models) to include e.g. reference resolution or other processing tasks. It is up to you to decide how the NLU model should be structured depending on your particular needs.
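To show how these pieces fit together, here is a sketch of a complete NLU model; the <rule> and <case> wrapper elements are my reading of the structure used in the online OpenDial documentation, so verify the exact schema against the step-by-step example there before reusing this verbatim:

  <!-- NLU model: triggered whenever the user utterance u_u changes -->
  <model trigger="u_u">
    <rule>
      <case>
        <condition operator="or">
          <if var="u_u" value="a peperoni pizza" relation="in" />
          <if var="u_u" value="one peperoni pizza" relation="in" />
        </condition>
        <effect>
          <set var="a_u" value="Request(PeperoniPizza,1)" />
        </effect>
      </case>
    </rule>
  </model>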

Start by specifying a collection of probability rules that map the possible input utterances to dialogue acts expressed as logical predicates (as in the example above). Once this is done, make sure you properly test your model to ensure that all types of utterances are covered. You can check the results of the mapping process via the Dialogue State Monitor tab in OpenDial.

Note: the model will automatically erase the previous value of the user dialogue act a_u. In some application domains, it might be useful to keep track of a longer history of dialogue acts. To record the previous dialogue act, you can simply add the following rule to the NLU model, which stores it in a new variable a_u-prev:

  <effect>
    <set var="a_u-prev" value="{a_u}" />
  </effect>

Step 6: Construct the dialogue management model

Once the NLU model is in place, you are ready to construct a simple dialogue manager that selects the most appropriate system action given the last user dialogue act (and possibly other contextual variables). The dialogue management model is encoded via utility rules, which specify the utility of various system actions depending on state variables. Here is an example of a utility rule, specifying that, if the user orders an item X in quantity Y, the utility of asking the user to confirm X and Y is set to 2:

  <condition>
    <if var="a_u" value="Request({X},{Y})" />
  </condition>
  <effect util="2">
    <set var="a_m" value="Confirm({X},{Y})" />
  </effect>

The system actions may comprise both communicative actions (statements, clarification requests, etc.) and non-communicative actions (for instance, fetching an object). Again, you are free to decide how you would like to design the dialogue strategies for your application domain. Once you have constructed the model, test it to ensure that the system selects the most appropriate action in each situation. I recommend testing the system with the speech recogniser, to verify that the dialogue management model also works in the presence of errors and uncertainty.
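One common way to handle such uncertainty is to give a clarification request a small constant utility, so that the system falls back on it whenever no interpretation is confident enough to outweigh it. Here is a minimal sketch of this idea; the action name AskRepeat and the utility values are illustrative choices, not prescribed by OpenDial:

  <model trigger="a_u">
    <!-- confirm the order when a Request act has been recognised -->
    <rule>
      <case>
        <condition>
          <if var="a_u" value="Request({X},{Y})" />
        </condition>
        <effect util="2">
          <set var="a_m" value="Confirm({X},{Y})" />
        </effect>
      </case>
    </rule>
    <!-- fallback: asking the user to repeat always has a small utility,
         so it is selected when no confident interpretation is available -->
    <rule>
      <case>
        <effect util="0.5">
          <set var="a_m" value="AskRepeat" />
        </effect>
      </case>
    </rule>
  </model>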

Step 7: Construct the generation model

Communicative actions must be mapped to actual system utterances. This is the purpose of the generation model. The generation model is another utility model, which specifies the utility of various linguistic realisations of the selected system action. A typical example of a generation rule is the following:

  <condition>
    <if var="a_m" value="Confirm({X},{Y})" />
  </condition>
  <effect util="1">
    <set var="u_m" value="You ordered {Y} {X}. Is that correct?" />
  </effect>

If you would like to use several alternative realisations for a given system action (in order to introduce some variety in the system behaviour), you can do so by specifying several effects with identical utility (a sketch of this is given at the end of Step 8 below).

Step 8: Implement external modules

Depending on your application domain, you may need to implement additional system modules and connect them to the OpenDial toolkit. Such modules may be used to periodically update the dialogue state with new content (for instance, a simulator may be responsible for updating specific contextual variables related to the user location, perceived objects, etc.). They can also be used as a backend to a database or information repository (for instance, timetables for a transportation system). Or they can be used on the output side to execute non-verbal actions. See the online documentation for more information on how to integrate modules into OpenDial.

I would like to stress once more that the purpose of this assignment is not to build a professional-grade application. It is perfectly OK to fake things or introduce dummy behaviours in the external modules, as they are not the focus of this assignment.
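As promised in Step 7, here is a minimal sketch of a generation rule with two alternative realisations of the same Confirm action; the second wording is just an illustration I made up:

  <rule>
    <case>
      <condition>
        <if var="a_m" value="Confirm({X},{Y})" />
      </condition>
      <!-- two effects with identical utility: either realisation can be
           selected, introducing some variety in the system output -->
      <effect util="1">
        <set var="u_m" value="You ordered {Y} {X}. Is that correct?" />
      </effect>
      <effect util="1">
        <set var="u_m" value="So that is {Y} {X}, right?" />
      </effect>
    </case>
  </rule>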

Step 9: Construct a user act prediction model

The current understanding model is quite basic: it directly maps raw user utterances to dialogue acts, without taking into account the likelihood of various dialogue acts given the context. For instance, it is much more likely that the user will utter a sentence such as "put down the object on the floor" if the robot is actually carrying an object.

We can create such a predictive model using probability rules. Here is an example of a predictive rule stating that the probability of asking the robot to put down an object is 0.3 if an object is actually carried, and 0.01 otherwise:

  <case>
    <condition>
      <if var="carriesObject" value="true" />
    </condition>
    <effect prob="0.3">
      <set var="a_u^p" value="Request(PutDownObject)" />
    </effect>
  </case>
  <case>
    <effect prob="0.01">
      <set var="a_u^p" value="Request(PutDownObject)" />
    </effect>
  </case>

Notice the ^p suffix on the output variable, which signals to OpenDial that this rule specifies a prior probability for a future (currently unobserved) user dialogue act a_u.

Create a set of probability rules that specifies the predicted probability of user dialogue acts depending on the current context (which may correspond to the last user dialogue act, the last system action, or other variables). To help you determine the probability values, you can exploit the Wizard-of-Oz data from Step 2. (In an actual system, one would of course derive these probabilities rigorously, using statistical methods based on collected dialogue data.)

Step 10: Evaluate the dialogue system

If you have correctly followed all the steps so far, you should have a working dialogue system. The final step is to evaluate its performance. As we will discuss during the lectures, there are many ways to evaluate spoken dialogue systems. One simple approach is to ask external persons to try the system and then survey them about their user experience after the interaction.

For this assignment, we will adopt a very simple evaluation scheme: find (at least) two external persons and ask them to interact with your system a few times. Once they are done, ask them to rate their experience on a scale from 1 (worst) to 5 (best). Also ask them whether they have suggestions for future improvements.

Step 11: Write and submit a small report

To complete this assignment, please submit the following in Devilry:

- The specification of your dialogue domain in the OpenDial XML format.
- The source code for any additional modules you have developed for your domain.
- Additional data that you may have used for this project (for instance, Wizard-of-Oz data).
- A short report (about 3 pages) on the dialogue system you developed. The report should describe the application domain, your design choices, the system components, and the results of the final evaluation. In other words, the report should document what your system does, how you designed it, and what its current functionalities and limitations are. You may write the report in Norwegian or English.

Step 12: Present your work!

One week after the submission deadline (during the last gruppetime for INF5820), you will be asked to present your dialogue system to your fellow students and give a short demonstration.