
Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)
Master's thesis in Computational Linguistics
10th June 2005

A Speech-Driven Automatic Receptionist Written in VoiceXML

Katarina Matzon

Supervisors: Beáta Megyesi, Uppsala University; Tobias Öhman, Voxway AB

Abstract

This thesis describes the implementation of a speech-driven receptionist for Voxway AB. The receptionist was designed to be used by smaller Swedish companies. It answers calls coming into the company and directs each call to an employee based on speech input from the user. It also handles unrecognized names and unanswered phone calls. It was programmed in VoiceXML and ColdFusion. A database was designed and implemented to store the data needed to make the receptionist dynamic and to log call statistics. The telephony application was evaluated by test users and a user survey. A website (programmed in HTML and ColdFusion) was designed to administer the telephony application and to allow companies to customize the application as well as view statistics about their usage of it.

Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
    Purpose
    Outline
2 Dialogue Systems
    Speech Recognition
    Dialogue Management
    Design Methods
    Human Communication
    Design of Dialogue
    Generator
    VoiceXML
    ColdFusion
3 Programming the Receptionist
    Static Receptionist
    Design of Dialogue
    Basic Code
    Building Grammars for Use
    Integrating Error Handling in the Code
    Integrating Dynamics
    Building the Database
    Using ColdFusion to Integrate Dynamics
    Organizing the Code for Dynamics
    Dynamic Queries and Output
    Dynamic Grammars
    Dynamic Prompts
    Implementing Statistical Element
4 Evaluation
    Evaluation Method
    Testing
    Test Users
    Evaluation of Results
5 Designing the Web Interface
6 Concluding Remarks
    Future Improvements
A Database
Bibliography

List of Figures

2.1 The three modules of a dialogue system
2.2 The relationship between SGML, HTML, XML and VoiceXML
2.3 A simple VoiceXML example
2.4 The seven subsystems of VoiceXML
3.1 Stages of Development of Receptionist
3.2 Example Dialogue
Receptionist Application's chain of events
Example of different types of VoiceXML grammars
Example of error handling in a dialogue
Static event handling for an unanswered call
Example of a possible conversation
Query to find company name and ID
Example of ColdFusion output
Dynamic Grammar
Dynamic Prompt
Task example
Home Page
Employee List
Blank form for new employees

List of Tables

4.1 User Satisfaction Survey with Average Scores
User Satisfaction Scores

Acknowledgements

I would like to thank the people without whom this paper would not be what it is today. Thank you to both my supervisors, Beáta Megyesi and Tobias Öhman. Thank you, Bea, for your encouragement and advice in writing this thesis, and thank you, Tobias, for all your encouragement and help with the programming of the receptionist. I would like to thank Botond Pakucs at KTH for contributing advice on the evaluation of dialogue systems. I would also like to thank my friend Jens Bergqvist for helping me record incredible sound for the receptionist so that it sounds more professional. Thank you to all my friends and family who have supported me this semester and were always around to talk when I needed a break. And lastly, I especially want to thank my boyfriend, Johan, for being such an incredible support and help throughout this process. Thank you for being my bollplank (sounding board)!

1 Introduction

Natural language processing, which combines the study of linguistics and computer science, is growing every day. Everywhere people go today, computers are understanding and interpreting human language. One of the branches of computational linguistics is speech technology, where computers understand and output speech. More and more companies are using speech technology. If you call the Swedish railway company, you will speak to a computer to book your tickets, and if you call the postal service in the United States, you will speak to a computer to find the postal code you need. Soon enough we will not need to type on keyboards because it will be standard to talk to our home computers. People are already speaking to their mini-computers. For example, when a person calls a friend on their mobile, they just say the friend's name and the call is connected (Dobler, 2000). Another example is a car navigation system that recites directions for the driver to follow to the next destination (Wikipedia). These features improve our lives at home and at work.

One branch of speech technology is spoken dialogue systems. Spoken dialogue systems use speech technology to enable humans and computers to interact by means of human speech. Here both aspects of speech technology, speech recognition and speech synthesis, combine to interact with humans in the form of a dialogue. In the Merriam-Webster English Online Dictionary, dialogue is defined as follows:

Dialogue
a : a conversation between two or more persons; also : a similar exchange between a person and something else (as a computer)
b : an exchange of ideas and opinions
c : a discussion between representatives of parties to a conflict that is aimed at resolution

A spoken dialogue system can then be defined as a system designed to perform a spoken conversation between a person and a computer. One area where these systems are increasingly popular is the telephone industry.
At the end of the 1990s, telephone companies wanted to develop a common language to voice-enable the web, in other words, to build dialogue systems that work over the web and over the telephone. The result of this discussion was VoiceXML (Voice Extensible Markup Language) (W3C, 2003). VoiceXML made it much simpler for companies to build web-enabled applications that include speech over the telephone, and it expanded the possibilities for voice applications.

1.1 Purpose

The purpose of this thesis is to develop a speech-driven receptionist for Voxway AB. Voxway AB is a company specializing in developing and hosting IVR (Interactive Voice Response) applications with speech technology. This task involves developing an automatic receptionist for small companies, where the goal is to form a comfortable and efficient dialogue between the caller and the automated service. The dialogue system is programmed with VoiceXML. The receptionist is designed to expect the name or position of a person at the company. If the requested person can be reached at several numbers, the application asks which number it should connect to (mobile, home, work). Once the system knows the correct number, it connects the call. It also handles problems such as unrecognized names, busy signals, and unavailability.

Besides the dialogue aspect, the application involves designing a database and a web interface that each company can access in order to customize the application to its needs. Each company has its own application content, which is stored in a database and accessed by using the telephone number that receives the call as a key. The information in the database is managed through the website, which allows different companies to log in with a password and enter the information for each employee that the receptionist needs in order to connect a call. The website also allows companies to see statistics about the calls coming into the company and the calls transferred within the company.

1.2 Outline

This paper describes the implementation of an automatic receptionist. Chapter two gives a background on dialogue systems in order to prepare the reader for chapter three, which discusses the implementation of the receptionist from the static receptionist to the dynamic receptionist. The next chapter describes the evaluation of the implemented receptionist. The chapter that follows the evaluation describes the website design and implementation. The paper ends with concluding remarks and suggestions for future improvements.

2 Dialogue Systems

Spoken dialogue systems are systems built to handle human-computer interaction in the form of speech. A system normally consists of different modules that handle different aspects of the dialogue. A simple system consists of three modules: a speech recognizer, a dialogue manager and an output generator, as seen in Figure 2.1 (Gustafson, 2002).

Figure 2.1: The three modules of a dialogue system (Speech Recognizer, Dialogue Manager, Output Generator)

The first part is the automatic speech recognizer, which converts the speech input into text that the computer can parse. Once the text is parsed, it is sent to the dialogue manager, which decides how the system should react to the input. Often, the reaction is to send output to the output component, or generator. The output component consists of recorded prompts or text-to-speech (TTS), which converts a given output into speech to be recited to the user. Together these components form a dialogue system that can accept input as speech, parse this input, decide how to handle it, and send output via the generator.

This is how a general dialogue system works, but systems are designed with different goals in mind, and each component in the system will be formed differently depending on the goal. For example, the CU Communicator is an interactive dialogue system for travel information over the phone (Pellom and Ward, 2000). In comparison, a system with an entirely different goal is August, a multimodal dialogue system which was used to interact with people at the cultural center in Stockholm (Gustafson et al., 1999).

Since dialogue systems can differ so greatly, they are divided into three categories. The first is the task-oriented dialogue. This dialogue has well-defined goals and is usually simple. Examples include simple question-and-answer systems such as the CU Communicator mentioned above.
Another example is a system that gives train times over the telephone, such as the Philips automatic train timetable information system (Aust et al., 1995). The second type of dialogue is the explorative dialogue, where the goals are not as well defined; instead, the goal is to acquire knowledge about complex tasks or to browse information (Gustafson, 2002). An example would be an information browsing system such as AdApt, which allows users to find information about available apartments in the Stockholm area (Gustafson et al., 2000). Although there is a goal in the interaction, it is not easily defined. With AdApt, the goal may be to find an apartment to buy or simply to browse available apartments out of curiosity. The third type of dialogue is context-oriented. These dialogues are focused on the actual dialogue situation. The primary goal for the user in this interaction is to be entertained (Gustafson, 2002). This dialogue is based on the system, its location, or its surroundings. An example would be a museum guide system that talks about the exhibition it is stationed in, such as August, the system described earlier (Gustafson et al., 1999). August has no goal other than conversing.

Today, task-oriented dialogue systems are the most common, mostly because it is easy to measure the errors and effectiveness of such systems when the goals are so clear (Gustafson, 2002). The other two types are possible, however, and would greatly expand what dialogue systems can be used for. Each of the components of a dialogue system is examined in more depth below.

2.1 Speech Recognition

Automatic speech recognition (ASR) is the task of converting speech to text that can then be parsed by the computer. Determining what type of recognizer to build is one of the first steps. Many types of recognizers exist. One distinction is based on whether or not the system has prior knowledge about the user's speech characteristics. Speaker-dependent (SD) systems are designed to understand speakers who have previously trained the system, and speaker-independent (SI) systems are trained to respond to a large group of people where training for each individual would be impossible (O'Shaughnessy, 2000).
SD systems exist, for example, in mobile phones, where the speech recognizer recognizes only its owner's way of pronouncing the name of a person in the phone book. SI systems are much harder to make successful considering the large variations in speech that need to be taken into consideration. Inter-speaker variability is the difference in speech between individuals. These differences include dialect, emotion in speech, sex of the speaker, and age of the speaker. For example, the accent of a person from the south of Sweden is very different from the accent of a person from the north of Sweden. An SI recognizer needs to account for these differences in order to understand a broader range of people. Emotion also differs between speakers: for example, the level of excitement in a voice will depend on the speaker. All of these differences and more need to be considered when building an SI system.

Besides inter-speaker variability, intra-speaker variability exists. Intra-speaker variability is the variability of speech within one person. One person is unlikely to utter exactly the same thing more than once. The combination of intonation, pauses and emphasis is difficult to repeat exactly. This affects both SI and SD systems. A speech recognizer needs to be broad enough to handle these subtle differences in speech and recognize the words that are spoken, but narrow enough that it does not confuse similar words.

Besides the aspects of speech, the nonspeech aspects are important to consider as well. Background noise is a major factor in recognition. A person sitting in a crowded restaurant will be more difficult for the recognizer to understand than a person in an empty room because of all the noise in the background. Channel distortion also needs to be considered. If a person is interacting with a system via a telephone, the connection can worsen the recognition because of bandwidth limitations in the telephone network. Mobile phone connections can be bad, and if a person calls from overseas, the connection can be affected and make it more difficult for the recognizer to understand the caller. The perfect conditions for a speech recognizer are one person in a silent room interacting with the computer without a medium such as a telephone. These conditions are, of course, not that common.

Once speech is recognized and the actual text is extracted, the computer parses the input, which can be done in a couple of ways. Each speech recognizer is equipped with a linguistic component that parses the text before it is sent to the dialogue manager. The simplest parser is a static grammar, which means that the parser has an unchanging grammar against which the input is matched to find the best match. Several candidates can be similar to one another, so the system can produce a list ranking the matches from the most similar to the least similar. In more complex recognizers, a lexicon or corpus with a much larger number of words interacts with a grammar to parse the meaning of the input (Gustafson, 2002). This allows for more possibilities when it is impossible to know exactly what inputs will be entered. A more complex linguistic component allows for a more robust system. Once speech is recognized and parsed so that the system can interpret it, it is sent to the next component, the dialogue manager.
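The ranked matching against a static grammar described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the grammar entries are hypothetical, and plain text similarity (`difflib.SequenceMatcher` from the Python standard library) stands in for the acoustic and linguistic scoring that a real recognizer would use.

```python
from difflib import SequenceMatcher

# Hypothetical static grammar: the fixed set of phrases the parser accepts.
GRAMMAR = ["anna matzon", "anna mattsson", "johan berg", "kundservice"]


def n_best(recognized_text, grammar=GRAMMAR, n=3):
    """Rank grammar entries from most to least similar to the recognized text.

    Returns a list of (score, phrase) pairs, best match first.
    Text similarity is an assumption made for illustration only.
    """
    text = recognized_text.lower().strip()
    scored = [
        (SequenceMatcher(None, text, phrase).ratio(), phrase)
        for phrase in grammar
    ]
    # Sort by score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    # A slightly misrecognized name still ranks the intended entry first.
    for score, phrase in n_best("ana matzon"):
        print(f"{score:.2f}  {phrase}")
```

A dialogue manager could take the first entry of such a list when its score is high and ask the caller for confirmation when the top candidates are close to one another.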
2.2 Dialogue Management

The dialogue manager is the backbone of a dialogue system. Once a text is parsed by the recognizer, the dialogue manager has to decide what to do with the input it has received. There are several aspects to consider in the design of the dialogue manager so that it can handle input correctly and a successful dialogue can be programmed. The first and most basic is which method of design the designer chooses.

Design Methods

A few different ways to design a dialogue system exist: design by inspiration, design by observation and design by simulation (Gustafson, 2002). Designing by inspiration is when a designer decides how to design the dialogue without consulting any external party. This is a bit risky, since one person cannot think of all the possibilities in a conversation and the design relies solely on the linguistic competence of the designer (Gustafson, 2002). It can be considered an option in simple systems where the purpose is for the user to reach a goal. There it works because the user can be trained in how to reach the goal, and the dialogue system can then be considered a success. In more complex systems, it will most likely not give a good result. Designing by observation is when the designer observes communication between humans in the situation the system is meant to depict and tries to incorporate aspects of that communication into the system. Lastly, design by simulation (the Wizard-of-Oz technique) is when some or all parts of a system are simulated so that different aspects of the dialogue can be tested (Gustafson, 2002). This is quite a useful strategy: the collected dialogue is more realistic because a human is speaking to a simulated interface rather than to another human. The type of system and the possibilities available to the designer decide which design strategy is best suited for the dialogue system. Once a design method is chosen, it is important to consider certain principles that exist in human communication.

Human Communication

In order to design a successful dialogue, the designer needs to observe human dialogue and account for the unwritten rules that exist in human conversation. Only by following these rules and principles will the designer be able to design a dialogue system that people find as natural as speaking to a human. These principles and rules are discussed below.

Certain assumptions exist when humans communicate in order for a conversation to be satisfactory to all parties. Principles have been studied and defined so that communication can be more easily studied. Grice (1975) has famously written about four maxims that govern all conversation; when they are not followed, a conversation can be considered unsatisfactory. These four maxims are listed below.

Quality. In a conversation, a person should always be sincere. People expect to hear the truth and will therefore be surprised if this maxim is not followed.

Quantity. A person should say neither too little nor too much. If a person doesn't say enough, it could lead to confusion, and the same could happen if they say too much.

Relevance. What a person says should always be relevant to the conversation. If a person starts speaking of something unrelated to the current subject, it will confuse the listeners.

Manner. Avoid ambiguity. Be clear and to the point; otherwise it can lead to confusion.

All of these maxims need to be upheld in a dialogue system if the user is to feel comfortable with the conversation.

Besides the underlying principles of conversation, the conversation structure is important to follow. Conversations between humans are structured in turn construction units (TCUs). Each speech act by each partner is considered a TCU, and these TCUs are surrounded by turn relevance places (TRPs) (Norrby, 1996). For example, if one person directs a question to another person, that is considered a TCU. The answer the other person gives is another TCU, and the time in between the question and answer is a TRP. TRPs are extremely important because they signal when another party can take a turn. TRPs are the natural place to take a turn if you are participating in a conversation. They can be signalled by a longer pause, the intonation at the end of a TCU and other signals that humans perceive automatically. It is important for the dialogue system to understand whether or not a pause is a TRP; otherwise a conversation can be frustrating for the user.

These TRPs can be easier to find if the role of initiative in the dialogue is clear. When one person starts a dialogue, she has the initiative. The initiative can switch between the different parties as the conversation moves along to keep it going forward. A conversation is considered single-initiative if one party always takes the initiative (Gustafson, 2002). For example, the Danish flight ticket reservation system is a mainly system-directed, task-oriented dialogue (Bernsen et al., 1997). Mixed initiative is when either party can take the initiative (Gustafson, 2002). This can be seen in a system where the user can prompt the system for an answer to a question and the system can do the same with the user. An example of such a system is the Waxholm system, which gives boat information for the Stockholm archipelago and was designed to allow user initiative as well as system initiative (Carlson et al., 1995).

These assumptions and underlying rules of conversation need to be taken into consideration when designing a dialogue manager. Otherwise the dialogue will most likely be unpleasant for the human user. The next step is programming the actual dialogue.

Design of Dialogue

Once the design method is decided and the conversation principles are considered, the designer is ready to program the type of dialogue the manager will understand and interpret. To help in the design process, the designer can gather examples of dialogues to base the design on or, if this is not possible, use scenarios (Gustafson, 2002). With scenarios, the designer considers all the different types of dialogues that can occur with the system in order to form a successful design. Scenarios are very helpful in that they take the system through as many different dialogues as possible. With the help of the gathered examples or scenarios, a dialogue is designed.
The dialogue manager can then be programmed to interact with human users in the limited way that the system was designed for. In order for the system to reach a greater scope of information, however, the dialogue manager may interact with a database. A database stores all the information that could be relevant to the dialogue. For example, in a train booking system, where people call to book tickets, the dialogue manager must interact with the database in order to find information about the relevant trains. The database may also determine what the acceptable output may be. Once the dialogue manager has processed the input, the appropriate output is sent to the next component, the output generator.

2.3 Generator

Output can be generated in a few ways in a dialogue system. One way is through recorded prompts that are played back to the user. Another way is through a TTS system. Recorded prompts can be used when there are messages that are played in every dialogue. They are chosen because a real human voice is generally considered more pleasing to human listeners than a computer-generated voice.
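The choice between a recorded prompt and synthesized speech can be pictured with a small Python sketch. It is a minimal illustration, not the generator of any particular system: the message identifiers and audio file paths are hypothetical, and the function only decides which kind of output to produce rather than playing or synthesizing anything.

```python
# Hypothetical recorded prompts: message id -> audio file recorded by a human.
RECORDED_PROMPTS = {
    "welcome": "audio/welcome.wav",
    "please_wait": "audio/please_wait.wav",
}


def generate_output(message_id=None, text=None):
    """Decide how a message should be rendered to the caller.

    Returns ("audio", path) when a recorded prompt exists for the message,
    otherwise ("tts", text) so that the text can be synthesized.
    """
    if message_id in RECORDED_PROMPTS:
        return ("audio", RECORDED_PROMPTS[message_id])
    if text:
        return ("tts", text)
    raise ValueError("no recorded prompt and no text to synthesize")


if __name__ == "__main__":
    # A fixed message is played from a recording.
    print(generate_output(message_id="welcome"))
    # Output that cannot be known in advance falls back to TTS.
    print(generate_output(text="The next train leaves at 14:05."))
```

Fixed messages take the recorded path, while output that depends on a database lookup falls through to synthesis.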

TTS is used when the output cannot be foreseen. TTS does not sound as natural as a human voice, and therefore recorded prompts are sometimes preferred, but in many systems the output is often unique, which makes TTS extremely powerful. TTS systems generally synthesize speech from text using linguistic processing and the concatenation of small speech units. They convert input text into speech waveforms using algorithms and previously coded speech data (O'Shaughnessy, 2000).

Speech synthesizers can be characterized by the size of the speech units they concatenate and by the method used to synthesize the speech (O'Shaughnessy, 2000). Large speech units produce high-quality speech but require a lot of memory, while efficient coding reduces memory but also reduces speech quality. Most commercial synthesizers have been based on word or phone concatenation (O'Shaughnessy, 2000). Two commercial applications exist for speech synthesizers: voice-response systems, which handle input text of limited vocabulary and syntax, and TTS systems, which accept all input text (O'Shaughnessy, 2000). TTS systems construct speech from text using small speech units and much linguistic processing, whereas voice-response systems simply concatenate speech from the large units the system has stored. TTS systems are the systems of interest for most spoken dialogue systems.

Several different methods of synthesis exist for TTS systems, including formant synthesis, articulatory synthesis, linear predictive coding synthesis, and waveform synthesis. The highest-quality synthesized speech uses waveform coders and large memories (O'Shaughnessy, 2000). These synthesizers can be considered quite advanced for certain systems. Two other types of synthesizers are terminal-analogue synthesizers and articulatory synthesizers (O'Shaughnessy, 2000). With articulatory synthesis, the sound is created by modelling the actual vocal tract shapes and movements.
In terminal-analogue synthesis, only the acoustic results of speech are modelled, without taking the vocal tract into account. The choice of synthesizer is greatly influenced by the size of the vocabulary. For example, a system that requires a synthesizer that can produce unlimited text will generally be of lower quality than a system that has limited output.

The generator is the last of the three components of a dialogue system. Now I will discuss one way to implement a dialogue system. This is the implementation that is used in this thesis. To learn more about speech synthesis or speech recognition, refer to O'Shaughnessy (2000). For more information on dialogue systems, refer to Gustafson (2002).

2.4 VoiceXML

VoiceXML (Voice Extensible Markup Language) is a powerful markup language that descends from SGML (Standard Generalized Markup Language). VoiceXML has two older siblings, HTML and XML, which were developed as children of SGML (see Figure 2.2). Whereas HTML is considered a single SGML application, XML is a metalanguage just like SGML. A metalanguage is a language that is used to define other languages (Abbott, 2002). All the descendants of SGML are markup languages, which means that information content is stored with tags that describe the meaning of the content (Abbott, 2002). XML was designed to generalize the success of HTML and to allow for a broader user base than SGML by taking away some of the complexities of its parent language (Abbott, 2002). VoiceXML can be considered a young sibling of HTML.

Figure 2.2: The relationship between SGML, HTML, XML and VoiceXML

Although it is a sibling, it interacts differently with its users than HTML does: in VoiceXML applications the user speaks to the computer, whereas with HTML the user communicates visually with the computer using a mouse or keyboard (Abbott, 2002). VoiceXML was developed after discussion between telephone companies to develop a common language to voice-enable the web. The first version was released in August. A simple example is seen in Figure 2.3. The output after running this example would be a TTS rendering of the text Hello World.

    <?xml version="1.0"?>
    <vxml version="2.0" xmlns="...">
      <form>
        <block>Hello World!</block>
      </form>
    </vxml>

Figure 2.3: A simple VoiceXML example

VoiceXML can be seen as a complete dialogue system for telephony applications where the designer simply has to program the dialogue manager and build grammars for the system. This can be seen in the seven subsystems, which are listed below and illustrated in Figure 2.4.

Network Interface: Allows HTTP to communicate with a web server.
VoiceXML Interpreter: Software that can be considered the dialogue manager. This is where the programming and construction of the dialogue takes place.
TTS: As discussed above, translates text to speech.
Audio: Allows audio prompts to be played or recorded.
Speech Recognition: As discussed above, translates user utterances into text. VoiceXML uses speaker-independent speech recognition where the interactions are structured dialogues and the user is limited to a finite vocabulary.
DTMF (dual tone multi-frequency): Translates keypad input into characters.
Telephony Interface: Enables communication with telephone networks.

Figure 2.4: The seven subsystems of VoiceXML (Abbott, 2002)

By putting together speech recognition, speech synthesis, XML and the web in this one powerful language, VoiceXML is able to extend the reach of the web, since it allows the web to be accessed from anywhere. It makes the web easier to use, especially for people with disabilities such as blindness or for people who are illiterate. In addition, it increases the options for human-computer interfaces since it is an inexpensive option compared to other voice applications (Abbott, 2002). VoiceXML has taken expensive, high-end speech technology and combined it with a markup language to make speech technology available even for low-end systems.

VoiceXML works by interpreting between the user and the web server. The VoiceXML code lies on a server and is accessed through the web or through a telephone number. The code is processed and is able to form a dialogue with the caller. Although this is powerful in and of itself, it is not very exciting. It can be compared to a static web page: the results never change. In order to make it dynamic, it can be integrated with a web application server, which allows it to connect to a database. One such application server is ColdFusion.

ColdFusion

ColdFusion was created in 1995 to introduce dynamics onto the internet (Danesh and Motlagh, 2000). ColdFusion interprets commands given by the web and connects to the database to retrieve the necessary information.
For example, a website that contains many articles uses an application server such as ColdFusion to access the articles in the database. Otherwise each article would have to have its own web page. This is what makes the web dynamic. When ColdFusion is integrated with VoiceXML, it allows telephony applications to become dynamic. ColdFusion is responsible for getting information to and from the database in the same way it does with regular web pages, but with voice applications the request is interpreted by the VoiceXML gateway in order for the information to be processed and found in the database. ColdFusion code can be integrated directly into VoiceXML applications, which makes it simple and easy to learn. Simple SQL statements are used to retrieve the necessary information from the database, and this information continues to be processed by the VoiceXML code.
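The idea of a dynamic voice page can be sketched in Python. This is an illustration only: the thesis uses ColdFusion, which is replaced here by Python's standard `sqlite3` module, and the table name, column names, phone number and company name are all hypothetical. The sketch looks up a company by the number that was called and fills the result into a small VoiceXML document; `http://www.w3.org/2001/vxml` is the namespace defined by the W3C VoiceXML 2.0 specification.

```python
import sqlite3
from xml.sax.saxutils import escape


def make_demo_db():
    """Create an in-memory database with one hypothetical company row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE company (called_number TEXT PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "INSERT INTO company VALUES (?, ?)", ("0181234567", "Exempel & Co")
    )
    return conn


def welcome_vxml(conn, called_number):
    """Build a VoiceXML greeting for the company that owns called_number."""
    row = conn.execute(
        "SELECT name FROM company WHERE called_number = ?", (called_number,)
    ).fetchone()
    # Fall back to a generic name when the number is not in the database.
    name = row[0] if row else "the company"
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        "  <form>\n"
        f"    <block><prompt>Welcome to {escape(name)}.</prompt></block>\n"
        "  </form>\n"
        "</vxml>\n"
    )


if __name__ == "__main__":
    print(welcome_vxml(make_demo_db(), "0181234567"))
```

The same static markup now yields a different greeting for each number in the table, which is the sense in which an application server makes a voice application dynamic. The query is parameterized and the company name is XML-escaped so that data from the database cannot break the generated document.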

3 Programming the Receptionist

The receptionist is programmed using VoiceXML and ColdFusion. Since the other parts of a dialogue system are included in the VoiceXML system (see section 2.4), the focus of the implementation is on the design and implementation of the program code. Designing the receptionist has several stages of development (as seen in Figure 3.1).

Figure 3.1: Stages of Development of Receptionist (Static Code, Event Handlers, Database Design, Dynamic Code, Statistics)

The first stage involves designing a static receptionist, with no dynamic information, to make sure that the program can run with hard-coded information. The next step involves integrating event handlers that handle misrecognitions and other events. Once these two pieces are working, a database is developed that allows the information the receptionist uses to be dynamic. After the database is done, the static receptionist is reprogrammed to include ColdFusion Markup Language (CFML), which enables communication with the database. Once the dynamics are in place, I am able to program the statistical elements that are important for administrative purposes, such as call length, the time the call started, the phone number that the user called from, and the number the user called. After this, a website is designed that allows companies to submit, change, or delete information in the database. Each of these developments is discussed below.
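The statistical elements listed above can be pictured as one record per call. The following Python sketch is only an assumption about how such a record might look; the class and field names are hypothetical and do not come from the thesis database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CallRecord:
    """One logged call (hypothetical field names)."""

    caller_number: str   # the number the user called from
    called_number: str   # the number the user called
    started_at: datetime
    ended_at: datetime

    @property
    def length(self) -> timedelta:
        """Call length, derived from the start and end times."""
        return self.ended_at - self.started_at


if __name__ == "__main__":
    call = CallRecord(
        caller_number="0701112233",
        called_number="0181234567",
        started_at=datetime(2005, 6, 10, 9, 0, 0),
        ended_at=datetime(2005, 6, 10, 9, 1, 35),
    )
    print(call.length)
```

Storing the start and end times and deriving the length from them keeps the three values from drifting out of agreement.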

3.1 Static Receptionist

3.1.1 Design of Dialogue

Before programming the receptionist, the dialogue is designed. Since it is a simple dialogue, it is designed by inspiration and some observation of receptionist situations. The dialogue needs to uphold Grice's four maxims as discussed above, make the turn relevance places (TRPs) obvious to the caller, and keep the system's side of the dialogue simple so that users model their own dialogue on the system's. The best approach is to be direct and to the point in as few words as possible. The dialogue is designed to be single-initiative, so the system always directs the caller. More experienced users can, however, barge in, interrupting the computer while it is speaking, which makes the dialogue more efficient. An example dialogue can be seen in Figure 3.2.

(1)
Computer: Välkommen till företaget. Vem vill du prata med?
Caller: Anna Matzon.
Computer: Vill du prata med kundservice Anna Matzon?
Caller: Ja.
Computer: Vill du bli kopplad till jobbtelefon, mobilen eller hemtelefon?
Caller: Jobbtelefon.
Computer: Varsågod. Snälla vänta medans jag kopplar samtalet.
(samtalet kopplas)

(2) Translated into English
Computer: Welcome to the Company! Who would you like to speak to?
Caller: Anna Matzon.
Computer: Would you like to speak to customer service Anna Matzon?
Caller: Yes.
Computer: Would you like to be connected to work, mobile, or home phone?
Caller: Work phone.
Computer: One moment. Please wait while I transfer your call.
(call transfers)

Figure 3.2: Example Dialogue

In this conversation, quality is upheld since there is no false statement in the conversation and the system is therefore sincere. Quantity is also upheld since the questions are simple but informative, so that the user knows what response is necessary. The conversation upholds the relevance maxim since all the questions asked by the system are related to the goal of connecting the caller to a callee.
Since the questions are unambiguous, the manner maxim is also upheld. In this way, all four maxims are satisfied. Since the system mostly asks questions, the TRPs are also clear to the user, because the end of a question is an obvious TRP. The user is placed in a single-initiative situation since the questions are always directed to the user, and the user should not feel a need to ask questions in return. The goal of the receptionist is not to have a long conversation, but to connect the caller to a callee as simply and quickly as possible. This dialogue succeeds in that respect while upholding the rules of human conversation. The implementation of this design is discussed below.

3.1.2 Basic Code

The static receptionist, where all values are hard-coded, is programmed solely in VoiceXML. In the static version, the program code consists of one document that is followed linearly to connect the caller to a fixed destination. This chain of events can be seen in Figure 3.3.

Figure 3.3: Receptionist Application's chain of events (steps: Callee Name, Confirm Callee, Callee Number, Transfer Call; endpoints: Caller, Callee)

In the first part of the code, speech synthesis is used to ask who the caller would like to speak to. The caller's response has to be part of the active grammar in order to be accepted. The grammars are discussed further below. If the user gives a response recognized by the system, the system asks the caller to confirm the recognized person. If the person is not confirmed, the code starts again from the beginning. If the person is confirmed, a speech synthesis prompt asks which telephone number the caller would like to be connected to. This response is also constrained by a grammar. In the static version, the computer asks every caller whether they want to be connected to the home, work, or mobile phone, since no database exists that records whether an employee has more than one number. Once the number type is retrieved, the code moves on to the transfer section, where the call is transferred to the phone number that the caller wants to be connected to. If the number is busy, the caller is told to call back, and a similar message is given if no one answers. After the call has been transferred and has returned, the system plays a simple last message before the call disconnects. In the static code, however, the telephone number is always the same since it is hard-coded. Therefore, the static code is of little use except as a base to build on.
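The chain of events in the static version can be sketched as a single VoiceXML form, as below. This is an illustration under stated assumptions rather than the code of the actual receptionist: the grammar files `names.grxml` and `numbers.grxml`, the placeholder number `tel:+46123456789`, the timeout value, and the wording of the busy and no-answer messages are all hypothetical. The sketch also assumes that the platform provides a Swedish built-in `boolean` grammar for the confirmation step, which may not be the case on every gateway.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a static receptionist: every value is hard-coded. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="sv-SE">
  <form id="receptionist">

    <!-- Step 1: ask for the callee; the answer must match the name grammar. -->
    <field name="callee">
      <grammar src="names.grxml" type="application/srgs+xml"/>
      <prompt>Välkommen till företaget. Vem vill du prata med?</prompt>
    </field>

    <!-- Step 2: confirm the recognized name (assumes a built-in boolean grammar). -->
    <field name="confirmed" type="boolean">
      <prompt>Vill du prata med <value expr="callee"/>?</prompt>
      <filled>
        <if cond="!confirmed">
          <!-- Not confirmed: forget both answers so the form starts again. -->
          <clear namelist="callee confirmed"/>
        </if>
      </filled>
    </field>

    <!-- Step 3: ask which type of number to use. The static version ignores
         the answer, because only one number exists. -->
    <field name="phoneType">
      <grammar src="numbers.grxml" type="application/srgs+xml"/>
      <prompt>Vill du bli kopplad till jobbtelefon, mobilen eller hemtelefon?</prompt>
    </field>

    <!-- Step 4: bridged transfer to the single hard-coded placeholder number. -->
    <transfer name="call" dest="tel:+46123456789" bridge="true" connecttimeout="20s">
      <prompt>Varsågod. Snälla vänta medans jag kopplar samtalet.</prompt>
      <filled>
        <if cond="call == 'busy'">
          <prompt>Det är upptaget. Prova gärna igen senare.</prompt>
        <elseif cond="call == 'noanswer'"/>
          <prompt>Ingen svarar. Prova gärna igen senare.</prompt>
        </if>
        <!-- Simple last message before the call disconnects. -->
        <prompt>Tack för samtalet.</prompt>
      </filled>
    </transfer>
  </form>
</vxml>
```

Because the form interpretation algorithm always visits the first unfilled item, clearing `callee` and `confirmed` is enough to restart the dialogue from the first question.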
How this static code turns into useful dynamic code is discussed later in this chapter, but first grammars and event handlers are discussed.

3.1.3 Building Grammars for Use

In building the grammar for the receptionist, the goal is to keep the accepted responses short and simple so that the dialogue is efficient and, at the same time, the speech recognizer can work easily with short phrases. As discussed earlier, VoiceXML is built up of seven subsystems. One of these subsystems is the speech recognizer. In order for the recognizer to recognize user input, it needs to be told what the accepted responses are so that it can try to match them with the user input. This is done with grammars. A grammar can be built in several ways in VoiceXML. It can be a simple list of options, an inline grammar that is placed where it is used, or an external grammar that is placed in another document. Examples are found in Figure 3.4. For the static code, both grammars are external. The first grammar contains all the acceptable names a user can ask for (name grammar) and the second contains the different types of telephone numbers they could be connected to (number grammar).

    <option value="röd">röd</option>
    <option value="blå">blå</option>
    <option value="grön">grön</option>

    <rule id="number" scope="public">
      <one-of>
        <item>jobbet</item>
        <item>mobilen</item>
        <item>hemma</item>
      </one-of>
    </rule>

Figure 3.4: Example of different types of VoiceXML grammars

As seen in Figure 3.4, the external grammar is identical to the inline grammar, the only difference being that an external grammar is placed in another document instead of in the code. Both are composed of rules that are defined by listing the possibilities. The options grammar is a bit different since it has no rules; instead, a field has a set of options that defines the grammar. An external grammar is chosen for both grammars in the static code since it is neater and does not clutter the code. Since it is an external grammar, the rules can be more expansive as well. Since these grammars are what the speech recognizer tries to match to the user input, the text is written as say-as text, which is similar to orthographic transcription. For example, Matzon is written matson since the z is pronounced as an s when spoken.
Although it is written as it sounds, it is not phonetically transcribed. Once the grammars are implemented, the system recognizes an accepted name and connects the caller to the static phone number. But what happens with input that is not included in the grammar? Event handling is discussed in the next section.
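To show where each of the three grammar forms sits in a document, the sketch below places them in three fields of one VoiceXML form. The option values and the number rule are taken from Figure 3.4; the field names, the colour prompt, and the file name `names.grxml` are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="sv-SE">
  <form id="grammarForms">

    <!-- 1. Option list: the field's options themselves define the grammar. -->
    <field name="colour">
      <prompt>Vilken färg?</prompt>
      <option value="röd">röd</option>
      <option value="blå">blå</option>
      <option value="grön">grön</option>
    </field>

    <!-- 2. Inline grammar: the rules are written inside the field that uses them. -->
    <field name="phoneType">
      <prompt>Vill du bli kopplad till jobbtelefon, mobilen eller hemtelefon?</prompt>
      <grammar type="application/srgs+xml" version="1.0" root="number" mode="voice">
        <rule id="number" scope="public">
          <one-of>
            <item>jobbet</item>
            <item>mobilen</item>
            <item>hemma</item>
          </one-of>
        </rule>
      </grammar>
    </field>

    <!-- 3. External grammar: the same kind of rules live in a separate file
         (hypothetical name), where a name would be spelled as it sounds,
         for example "matson". -->
    <field name="callee">
      <prompt>Vem vill du prata med?</prompt>
      <grammar src="names.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```

Moving a rule from the inline form to an external file changes only the `grammar` element in the field, which is why the external form keeps long name lists out of the dialogue code.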

3.1.4 Integrating Error Handling in the Code

Error handling is necessary in order to handle exceptions in a way that is pleasing to the user. Errors introduced by imperfect recognition are a large problem facing dialogue systems (Choularton, 2004). Two general approaches exist to tackle this problem: error avoidance and error handling (Choularton, 2004). VoiceXML has built-in error handling for certain exceptions such as nomatch and noinput. Nomatch occurs when a person's response does not match any item in the specified grammars, whereas noinput occurs when the user gives no audible response. By default, VoiceXML handles both by playing a simple error message in a TTS voice and then reprompting the user for a response. This is a potentially frustrating scenario for a user, since they would hear the same error message every time they give an unacceptable response. It is important that the exceptions are handled differently depending on the number of times the user has given an unacceptable response. Since the system should feel natural, repeating the same question again and again is not desirable. According to Shin et al. (2002), users who are met with an error tend to rephrase or repeat their response. This user behavior can be modelled in dialogue systems to manage the dialogue when errors are introduced (Choularton, 2004). This way, the user is first prompted to repeat their answer, and the second time they are given more specific instructions to rephrase their response. This approach follows the most normal way of handling errors, even if it is not the most desirable, since the information from the user's first response is discarded (Gorrell, 2003). For example, if the user gives an unrecognized response once, the message to the user is different than if it is the third time. An example conversation with error handling is seen in Figure 3.5.

(3)
Computer: Välkommen till företaget. Vem vill du prata med?
Caller: ehm, jag vet inte.
Computer: Jag är ledsen. Jag förstod inte. Vem vill du prata med?
Caller: ehm, jag vet inte.
Computer: Jag känner inte igen det namnet. Du kan säga namnet eller funktionen av personen du vill prata med.
Caller: Jag kommer inte ihåg.
Computer: Tyvärr så förstod jag inte. Jag kopplar dig till kundtjänst.

(4) Translated to English
Computer: Welcome to the company. Who do you want to speak to?
Caller: Ummm, I don't know.
Computer: I'm sorry, I did not understand you. Who would you like to speak to?
Caller: Umm, I don't know.
Computer: I don't recognize that name. You can say the name or position of the person you would like to speak to.
Caller: I don't remember.
Computer: Unfortunately I did not understand. I will connect you to customer service.

Figure 3.5: Example of error handling in a dialogue
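The escalation in Figure 3.5 maps naturally onto VoiceXML's counted event handlers. The following is a minimal sketch of that idea, not the code of the actual receptionist: the grammar file `names.grxml`, the form names, and the placeholder customer service number `tel:+46123456789` are assumptions, while the prompts are the ones shown in the figure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="sv-SE">
  <form id="askName">
    <field name="callee">
      <grammar src="names.grxml" type="application/srgs+xml"/>
      <prompt>Välkommen till företaget. Vem vill du prata med?</prompt>

      <!-- First unrecognized response: apologize and ask again. -->
      <nomatch count="1">
        <prompt>Jag är ledsen. Jag förstod inte. Vem vill du prata med?</prompt>
      </nomatch>

      <!-- Second unrecognized response: give more specific instructions. -->
      <nomatch count="2">
        <prompt>Jag känner inte igen det namnet. Du kan säga namnet eller
          funktionen av personen du vill prata med.</prompt>
      </nomatch>

      <!-- Third unrecognized response: give up and hand over to a person. -->
      <nomatch count="3">
        <prompt>Tyvärr så förstod jag inte. Jag kopplar dig till kundtjänst.</prompt>
        <goto next="#customerService"/>
      </nomatch>
    </field>
  </form>

  <!-- Hypothetical fallback: transfer to a placeholder customer service number. -->
  <form id="customerService">
    <transfer name="fallback" dest="tel:+46123456789" bridge="true"/>
  </form>
</vxml>
```

The interpreter keeps a separate counter for each event type and picks the handler with the highest `count` that does not exceed the current counter, so a `noinput` event would need its own handlers if the same escalation is wanted for silence.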

Strategies that take longer but produce fewer errors and corrections are preferred by users (Hirschberg et al., 2000). As seen in the example above, if the system is unable to recognize an accepted answer three times in a row, the system connects the caller to customer service, which can help them. This is a simple way of handling errors, where general help is given to the user after three attempts (Gorrell, 2003). I choose to do this after three attempts since it gives the caller three opportunities to reach their desired person, each time with slightly more specific instructions. If they are still unsuccessful after the third attempt, there is obviously a problem. More advanced techniques in error handling exist which take many aspects of the conversation into consideration, as seen in Higgins, a dialogue system for investigating error handling techniques (Carlson et al., 2004). I have not implemented unique error handling for the number grammar, where the user can respond with one of three options (mobile, home, or work phone), since the options are listed for the user in the question. It is unnecessary since the error handling would simply reprompt the user. The number grammar and the name grammar are the only two grammars where error handling for the user response is necessary. Error handling is also necessary for events pertaining to the phone call. For example, error handling is necessary if the call is transferred to a number that is busy or does not answer. This is handled in the static version by simply stating that the person is busy or is not answering and thanking the caller for their call, as seen in Figure 3.6. Once the dynamics are built in, the user is given the option of trying another number or another person.

(5)
Computer: Anna Matzon svarar inte. Tack för samtalet, prova gärna igen senare.
Computer: Anna Matzon is not answering. Thank you for your call, please try again later.
Figure 3.6: Static event handling for an unanswered call

To summarize, the static code is written in VoiceXML: a person calls in, asks for a person who is in the grammar, states the type of number they want to call, and is connected to a static number. If their responses are not accepted, special event handlers take over. If the number is busy or there is no answer, the caller is informed. This code is clearly not very powerful. Its strength comes when the code becomes dynamic, and for that it needs a database to hold all the necessary information.

3.2 Integrating Dynamics

The first step in integrating dynamics into the static code is building a functional database. Once the database works, ColdFusion can be integrated with VoiceXML to connect the database to the program.

3.2.1 Building the Database

An efficient database is necessary to build an acceptable system. Without a working database, the system is not functional, which is why the database design is so important and central to the entire system. The database can be viewed in Appendix A. It consists of five tables, which are listed below.

    Company
    Employee
    Tilltal
    InCall
    TransferCall

The Company table holds information about each company. Each company has a unique ID, which is used to separate the information in the other tables between companies. The Employee table holds information about each individual employee, including their telephone numbers and position at the company. Each employee has their own unique ID, which also identifies the employee in the Tilltal table. The Tilltal table is the source of the grammar for all the names. Here, each name that can be used to reach a person is registered with that employee's ID. The last two tables, InCall and TransferCall, hold information about the calls for administrative purposes.

In order to test that these tables are efficient and functional, scenarios that can occur with a caller are designed, and how these events affect the database is tested. A few scenarios are accounted for below. All the scenarios begin with a caller calling a certain telephone number, which identifies the company in the database. Knowing which company it is, the system finds the appropriate welcome message and plays it to the caller. After the welcome message, the system asks who the caller wants to speak to. The caller then responds with a name (in our example the name is Anna). The system then searches the Tilltal table, using the company ID found above, for an entry with the name Anna. When it finds an entry, it uses the employee ID to look up the employee in the Employee table, finds the filename containing the employee Anna's full name, and asks the caller whether they want to speak to Anna Matzon. If the answer is yes, the caller is connected to one of the telephone numbers in the Employee table.
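The lookup in this scenario can be sketched as a ColdFusion page that queries the database and emits the confirmation question as VoiceXML. Only the table names Tilltal and Employee come from the text. The data source name, all column names (`EmployeeID`, `CompanyID`, `Name`, `FullName`), and the URL parameters are assumptions, and the sketch speaks the full name with TTS instead of playing the recorded file mentioned above. It also does not handle the case where no entry is found.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical ColdFusion template that confirms the callee found in Tilltal. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="sv-SE">
  <!-- Find the employee registered under the spoken name for this company.
       Column names and URL parameters are assumed for illustration. -->
  <cfquery name="findCallee" datasource="receptionist">
    SELECT e.EmployeeID, e.FullName
    FROM Tilltal t
    INNER JOIN Employee e ON e.EmployeeID = t.EmployeeID
    WHERE e.CompanyID = <cfqueryparam value="#url.companyId#" cfsqltype="cf_sql_integer"/>
      AND t.Name = <cfqueryparam value="#url.spokenName#" cfsqltype="cf_sql_varchar"/>
  </cfquery>
  <form id="confirmCallee">
    <!-- Assumes the platform offers a built-in boolean grammar for Swedish. -->
    <field name="confirmed" type="boolean">
      <!-- Only the first matching employee is offered to the caller. -->
      <cfoutput query="findCallee" maxrows="1">
        <prompt>Vill du prata med #FullName#?</prompt>
      </cfoutput>
    </field>
  </form>
</vxml>
```

Joining Tilltal to Employee on the employee ID is what allows several spoken names to lead to the same person while the telephone numbers are stored only once.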
If the answer is no, the system has to start from the beginning, but this time eliminating the employee Anna Matzon as one of the options. This is done by excluding the previous employee's ID from the search, so that the system can search through the names in the Tilltal table to find a different result.

One variation of the above scenario is when a caller wants to speak to a group, for example sales or customer service. If the caller asks for customer service, the computer finds the employee that has customer service as her position. The problem comes when the computer wants to confirm the callee with the caller. If the computer says only the callee's actual name, the caller has no idea whether it is correct or not. An example of this can be seen in Figure 3.7. A simple solution to this problem is that, instead of containing only the name, the confirmation states the callee's position along with their full name, so that a caller who does not know the callee's name still knows they are being connected to the correct person.

The next scenario is how the database should handle calls that are not connected. A first thought is that for the calls that aren't answered or are busy and aren't automatically connected to voicemail, the system could have a message system of