
Intuitive Speech Interface Technology for Information Exchange Tasks

DISSERTATION for the attainment of the academic degree of a DOKTOR-INGENIEUR (DR.-ING.) at the Fakultät für Ingenieurwissenschaften und Informatik of Universität Ulm

by Hansjörg Hofmann from Göppingen

Supervisor (Betreuer): Prof. Dr. Dr.-Ing. Wolfgang Minker
Second reviewer (Zweitgutachter): Prof. Dr. Kristiina Jokinen
Acting dean (Amtierende Dekanin): Prof. Dr. Tina Seufert

Ulm, 31 October 2014


Acknowledgements

This PhD thesis is the result of three and a half years of intense research in the Speech Dialog Systems group at Daimler AG, Ulm. First and foremost, I would like to thank my supervisor Dr.-Ing. Ute Ehrlich, who guided me through the entire project work. Her expertise, assistance and constructive comments have been invaluable for the accomplishment of this PhD thesis. Furthermore, I wish to thank Dr.-Ing. André Berton, whose support and fruitful advice helped me to manage my research work. I also owe great thanks to my doctoral advisor Prof. Dr. Dr.-Ing. Wolfgang Minker, associate director of the Institute of Communications Engineering at the University of Ulm, for supervising my PhD thesis. Thanks to Prof. Minker, my interest in spoken dialog systems grew at the beginning of my university studies. Without his long-standing support during my studies, without his efforts to make my stays abroad possible and without his supervision during my Bachelor, Master and PhD theses, my research work in the speech dialog community would not have been possible. Furthermore, I would also like to extend my highest gratitude to my reviewer Prof. Dr. Kristiina Jokinen (University of Helsinki, Finland) for her interest in my research work. I am indebted to my colleagues of the Speech Dialog Systems group at Daimler AG, Ulm. Their friendship and constant good mood created a great research atmosphere, and their professional collaboration helped me to achieve my research goals. Especially, I would like to thank Dr.-Ing. Alexander Schmitt, who supported me with his scientific expertise and experience whenever I needed advice. I remain grateful to my master students Anna Silberstein from Technical University of Berlin and Mario Hermanutz from University of Ulm. Furthermore, I would like to thank my student employees Vanessa Tobisch, Frederic Metzler, Luc Watrin and Burim Ramosaj from University of Ulm for keeping up the good work during my research. On a personal level, I would like to thank my parents and the rest of my family for their ever-helpful advice, support and encouragement.


Abstract

Smartphones are considered people's companions and help users to get instant access to content-relevant information and Web services anytime and anywhere. The utilization of smartphones and their mobile Internet capabilities seems to be without limitations. However, this only applies to situations where the actual smartphone use is in focus. In so-called dual-task scenarios, people perform two tasks in parallel, and often different priorities are given to each task. For instance, if a driver would like to send an e-mail while driving, it is more important to drive safely than to use the smartphone manually to send the e-mail. In order to optimize the performance of both tasks, intuitive spoken dialog systems (SDSs) can help to access the Internet as a secondary task in a dual-task scenario while impairing the primary task performance as little as possible. To date, there is no convincing, consistent and sophisticated speech dialog concept that allows users to control the Internet by speech. In the thesis at hand, different speech-based human-machine interaction (HMI) concepts, which enable users to access the Internet by speech as a secondary task in a dual-task scenario, are designed, realized and evaluated in user studies.

The Internet allows users to perform a large variety of tasks. Information exchange tasks (such as hotel bookings or sending e-mails) enable users to exchange information in a Web-based setting. The development of speech interfaces for information exchange tasks requires the design of multi-turn dialogs and the functionality to provide users with proactively incoming information. As these tasks demand a lot of attention from the user and thereby might impair the primary task performance, this research work focuses on information exchange tasks. Due to the urgency and the high risks, the automotive environment has been chosen as the example dual-task scenario in this thesis. To date, research has not examined speech access to the Internet in the driving environment. The goal of this thesis was the design, implementation and evaluation of an intuitive in-car spoken dialog system (SDS) which enables users to perform information exchange tasks safely while driving a car. The voice-control of the SDS had to be designed in a user-friendly manner and had to reduce driver distraction to a minimum in order to meet the requirements of the driving use case. The thesis at hand provides the first clear guidelines for in-car SDS developers, helping them to design Internet-enabling speech interfaces for the driving environment.

After an introduction to SDSs, the thesis describes the application of SDSs in the field, with a focus on the driving environment. Furthermore, background knowledge about driver distraction, a measure which indicates how much the secondary task interferes with the primary task, is provided. Related work and challenges in the field of the evaluation of speech dialog strategies and proactive behavior of SDSs are presented and discussed.

The development of speech interfaces for a new field requires wide knowledge about the domain and about how people interact with the domain by speech. There is no related research investigating people's voice-control of the Internet. Therefore, the main part of the thesis begins with the description of an initial user study, which aimed at gaining knowledge about how users would interact with Internet services by speech. The Web-based user study confirmed the strong need to develop an Internet-enabling speech interface in the driving environment. Furthermore, the study confirmed that the research work should focus on information exchange tasks, as people's speaking-style preferences were equally distributed for these tasks and therefore required further investigation.

In order to find the most appropriate speech interface for multi-turn dialogs of information exchange tasks, several in-car SDS concepts were designed, implemented and evaluated. Related work in the field has previously examined different speech dialog strategies but did not focus the dialog design and the comparison of strategies on the driving environment. In contrast to previous research, this contribution focuses strictly on the driving use case. In this research work, two different speech dialog strategies for an online hotel booking were designed for the driving environment: a command-based and a conversational dialog. Several graphical user interface (GUI) concepts (one including a human-like avatar) were designed in order to best support the respective dialog strategy and to evaluate the effect of the GUI on usability and driver distraction. The different concepts were implemented as SDS prototypes using the Daimler speech dialog framework and evaluated in a driving simulator study. When comparing the speech dialog strategies, the results showed only few differences concerning speech dialog quality. The comparison of speech dialog strategies did not reveal any differences in driver distraction. However, the results undeniably demonstrate that the use of a GUI supporting the speech dialog impaired the driving performance and increased gaze-based distraction. The presence of an avatar was clearly not appreciated by participants but affected neither the dialog performance nor the driver distraction. The results strongly indicate that in-car SDS developers have to take both speaking styles into consideration when designing an SDS for multi-turn dialogs of information exchange tasks. Furthermore, developers must consider reducing the content presented on the screen in order to reduce driver distraction.

The development of an intuitive proactive speech interface required the design, implementation and evaluation of several speech-based HMI notification concepts for in-car SDSs. Previous research only investigated the use of visual output modalities to notify the user proactively about new information, or it did not focus on the driving environment. In this thesis, speech-based notification concepts developed for the driving use case are examined for the first time. Four different speech dialog concepts and two GUI concepts (one including a human-like avatar) with different levels of obtrusiveness were designed for an e-mail use case. The SDS concepts were realized as prototypes using the Daimler speech dialog framework. The developed speech-based HMI notification concepts were evaluated in a driving simulator study with respect to usability and driver distraction. In order to investigate the user preferences for the concepts in different contextual situations, the priority of the incoming e-mails and the driver workload were varied in the course of the experiment. The results clearly show that the concept of informing the user verbally achieved the best results concerning usability and driving performance. The presence of an avatar was not accepted by the participants and led to slightly impaired steering performance.

In the end, the insights gained from this research work result in the first clear guidelines supporting the development of an Internet-enabling SDS in the driving environment. Based on the findings and achievements of this thesis, research can continue the development of a user-friendly SDS enabling users to perform online information exchange tasks while driving.

Contents

1 Introduction
  1.1 Assistive Technology for Multitasking Performance
  1.2 State-of-the-Art Speech Access to the Internet in Mobile Environments
  1.3 Thesis Contributions
  1.4 Outline of the Thesis

2 Background and Resulting Challenges
  2.1 Fundamentals of Spoken Dialog Systems
    Overview of Spoken Dialog Systems
    Fundamentals of Dialog Management
    Conversational Speech Interfaces
    Proactivity
    Evaluation of Spoken Dialog Systems
  2.2 Application of Spoken Dialog Systems
    State-of-the-Art In-Car Speech Dialog Systems
    Daimler Speech Dialog Framework
  2.3 Driver Distraction
    Definition, Sources and Outcomes
    Driver Distraction Assessment
  2.4 Summary and Discussion
    Summary
    Related Research and Challenges

3 User Study on Speech Interaction with the Internet
  3.1 Method
    Collection of Audio Data
    Questionnaire
  3.2 Results
    Questionnaire Results
    Data Collection and Speaking Style Analysis
    Discussion
  3.3 Summary
    Summary
    Implications on Research Work

4 Development of Speech-based In-Car HMI Concepts for Information Exchange Tasks
  4.1 Design of Speech-based HMI Concepts
    Functionality of the Hotel Booking Use Case
    Dialog Strategy Design
    GUI Design
  4.2 Realization
    Prototype Implementation in the Daimler Speech Dialog Framework
    Linguistic Grammar Approach for Conversational Dialog Specification
  4.3 Evaluation
    Method
    Results
    Discussion of Results
  4.4 Summary

5 Development of In-Car Speech Dialog System Notification Concepts for Incoming Proactive Events
  5.1 Design of Proactive Notification Concepts
    Functionality of the E-Mail Use Case
    Speech Dialog Design
    GUI Design
  5.2 Realization
    Prototype Implementation in the Daimler Speech Dialog Framework
  5.3 Evaluation
    Method
    Results
    Discussion
  5.4 Summary

6 Conclusion and Future Directions
  6.1 Overall Summary
    Initial User Study
    Comparison of Speech Dialog Strategies for Information Exchange Tasks
    Comparison of Proactive Speech-based Notification Concepts
    Research Contributions and Achievements
  6.2 Suggestions for Future Work

A Materials of the User Study on Speech Interaction with the Internet
  A.1 Graphically Depicted Tasks
  A.2 Questionnaire

B Materials of the Driving Simulator Study for Comparison of Speech-based HMI Concepts
  B.1 Task Descriptions
  B.2 Questionnaires
    B.2.1 Preliminary Interview
    B.2.2 SASSI Questionnaire
    B.2.3 DALI Questionnaire
    B.2.4 Final Interview

  B.3 Results
    B.3.1 Results of Session 1
    B.3.2 Results of Session 2

C Materials of the Driving Simulator Study for Comparison of Speech Dialog System Notification Concepts
  C.1 Task Descriptions
  C.2 Questionnaires
    C.2.1 Preliminary Interview
    C.2.2 DALI Questionnaire
    C.2.3 Final Interview
  C.3 Results

References


Acronyms

AAM     Alliance of Automobile Manufacturers
ABNF    Augmented Backus-Naur Form
API     Application Programming Interface
AVP     Attribute-value pair
ASR     Automatic Speech Recognition
CAN     Controller Area Network
CCE     Central Control Element
CI      Contextual Interpretation
ConTRe  Continuous Tracking and Reaction
DD      Dialog Duration
DL      Driver Workload
DM      Dialog Manager
FSM     Finite State Machine
GHSA    Governors Highway Safety Association
GIMP    GNU Image Manipulation Program
HMI     Human-Machine Interaction
ITU-T   International Telecommunication Union
IRC     Internet Relay Chat
IV      Interaction Variants
JSGF    JSpeech Grammar Format
LCT     Lane Change Test
LM      Language Model
MGD     Mean Glance Duration
MRT     Multiple Resource Theory
MDev    Mean Deviation
MV      Mean Value
NHTSA   National Highway Traffic Safety Administration
NLU     Natural Language Understanding
NoG     Number of Glances
NoT     Number of Turns
ON      Obtrusiveness
OS      Operating System
OTT     Over-the-top Content
P       Priority
PC      Personal Computer

PDT     Percent Dwell Time
PM      Proactivity Manager
PTA     Push-To-Activate
QUIS    Questionnaire for User Interaction Satisfaction
RT      Response Time
SASSI   Subjective Assessment of Speech System Interfaces
SCXML   State Chart XML
SD      Standard Deviation
SDS     Speech Dialog System
SDSs    Speech Dialog Systems
SDF     Speech Dialog Framework
SGD     Single-glance Durations
SLM     Statistical Language Model
SOAP    Simple Object Access Protocol
SRGS    Speech Recognition Grammar Specification
SUMI    Software Usability Measurement Inventory
SYNC    Synchronization component
TDDM    Task-Driven Dialog Manager
TMC     Traffic Message Channel
TGT     Total Glance Time
TS      Task Success
TTS     Text-to-Speech
W3C     World Wide Web Consortium
WOZ     Wizard-of-Oz
XML     Extensible Markup Language

List of Figures

1.1 Global Internet Devices Sales (Source: Gartner, IDC, Strategy Analytics, company filings, BI intelligence estimates)
1.2 Sending an e-mail using Apple's Siri
1.3 Looking for a restaurant using Apple's Siri
1.4 The Web Information Classification (Kellar, 2007)
2.1 Components of an SDS, similar to McTear (2004)
2.2 A grammar network for TV information, similar to McTear (2004)
2.3 Syntactical analysis of the sentence "I like the movie with David Hasselhoff", similar to Schmitt (2012)
2.4 Dialog graph for a TV information service
2.5 Frame-based dialog control of a hotel booking service
2.6 Conversational Speech Classification
2.7 General proactivity system model
2.8 Extended proactivity system model
2.9 Extended proactivity system model
2.10 Proactive speech dialog flow
2.11 Usability engineering lifecycle, similar to Möller (2010)
2.12 PTA button on the steering wheel
2.13 Mercedes S-Class infotainment system
2.14 Screenshots of COMAND Online of the 2013 Mercedes S-Class infotainment system
2.15 Screenshot of Apple's Siri
2.16 Automotive SDS architecture
2.17 Daimler SDF architecture
2.18 Hierarchy of task types, similar to Ehrlich (1999)
2.19 Screenshot of the LCT driving simulation
2.20 Screenshot of the LCT analysis software presenting the deviation from the reference lane
2.21 Screenshot of the ConTRe task driving simulation
2.22 Screenshot of the ConTRe task output presenting the deviation from the reference lane and the reaction time performances
3.1 Information seeking sample tasks
3.2 Communications sample tasks
3.3 Transactions sample tasks

3.4 Screenshot of the Flash application for recording the audio data
3.5 Demographic information about the participants
3.6 Results of the questions about the participants' attitude towards in-car voice-control and the integration of Internet services into the car
3.7 Frequency of occurrence of speaking styles related to the different task categories
4.1 Overview of the hotel booking dialog flow
4.2 Command-based dialog flow graph during parameter input
4.3 Conversational dialog state chart during parameter input
4.4 Layout of the GUI screens
4.5 Screen of the command-based dialog at the beginning of the parameter input
4.6 Screen of the command-based dialog after the first parameter input
4.7 Screen of the command-based dialog after parameter input
4.8 List presentation overlay of the command-based dialog
4.9 Screen of the conversational dialog at the beginning of the parameter input
4.10 Screen of the conversational dialog during parameter input
4.11 Screen of the conversational dialog after parameter input
4.12 List presentation screen of the conversational dialog
4.13 Screen of the conversational dialog with avatar during parameter input
4.14 List presentation screen of the conversational dialog with avatar
4.15 Screen of the command-based and conversational dialog without GUI
4.16 Extended SDF prototype architecture of the HMI concept implementation
4.17 XML message exchange during the hotel search
4.18 Task hierarchy of the hotel booking dialog
4.19 SDS output process when using the avatar
4.20 Screenshot of Charamel's CharAT software for designing avatar motion animations
4.21 Linguistic grammar concept
4.22 Grammar specification using the linguistic grammar approach
4.23 Filled-out form matching the search parameters given in the sample task, including highlights of the required parameter fields (1) and the optional parameter fields (2)
4.24 Driving Simulator Setup
4.25 Session 1: Structure of the Experiment
4.26 Session 1: Overall procedure of the Experiment
4.27 Session 1: Overall TS distribution
4.28 Session 1: TS per speech dialog. Left: overall TS rate per speech dialog strategy (orange); right: overall TS rate additionally separated according to the GUI condition (blue)
4.29 Session 1: Average DD per speech dialog. Left: average DD per speech dialog strategy (orange); right: average DD additionally separated according to the GUI condition (blue)
4.30 Session 1: Average NoT per speech dialog. Left: average NoT per speech dialog strategy (orange); right: average NoT additionally separated according to the GUI condition (blue)
4.31 Session 1: Average CER per speech dialog. Left: average CER per speech dialog strategy (orange); right: average CER additionally separated according to the GUI condition (blue)

4.32 Session 1: Overall SASSI result. Left: overall SASSI result per speech dialog strategy (orange); right: overall SASSI result separated according to the GUI condition (blue). Scale from -2 (=rejection) to 2 (=acceptance)
4.33 Session 1: Average MDev per drive. Left: average baseline MDev (green); middle: average MDev per speech dialog strategy (orange); right: average MDev additionally separated according to the GUI condition (blue)
4.34 Session 1: Average RT per drive. Left: average baseline RT (green); middle: average RT per speech dialog strategy (orange); right: average RT additionally separated according to the GUI condition (blue)
4.35 Session 1: Overall DALI result. Left: overall DALI baseline result (green); middle: overall DALI result per speech dialog strategy (orange); right: overall DALI result additionally separated according to the GUI condition (blue). Scale from -2 (=low driver workload) to 2 (=high driver workload)
4.36 Session 1: Average PDT per speech dialog. Left: average PDT per speech dialog strategy (orange); right: average PDT additionally separated according to the GUI condition (blue)
4.37 Session 1: Distribution of TGT per SDS prototype; red lines indicate the 85th percentile of the distribution
4.38 Session 1: Distribution of SGD per SDS prototype; red lines indicate the 85th percentile of the distribution
4.39 Session 2: Overall TS distribution
4.40 Session 2: TS per speech dialog
4.41 Session 2: SASSI result of the final interview. Scale from 1 (=rejection) to 10 (=acceptance)
4.42 Session 2: Average MDev per drive
4.43 Session 2: Average RT per drive
4.44 Session 2: Overall DALI result. Scale from -2 (=low driver workload) to 2 (=high driver workload)
4.45 Session 2: DALI part of the final interview. Scale from 1 (=distractive) to 10 (=not distractive)
4.46 Session 2: Distribution of TGT per SDS prototype; red lines indicate the 85th percentile of the distribution
4.47 Session 2: Distribution of SGD per SDS prototype; red lines indicate the 85th percentile of the distribution
5.1 Time line of the sound notifications
5.2 Time line of the verbal notifications
5.3 GUI screen interaction of the different notification concepts
5.4 GUI avatar screens
5.5 Extended SDF prototype architecture for the evaluation of notification concepts
5.6 XML message exchange of a new incoming e-mail
5.7 Task of the proactive dialog
5.8 Driving Simulator Lab
5.9 Structure of the proactivity experiment
5.10 Overall procedure of the proactivity experiment
5.11 Overall TS distribution

5.12 TS per speech dialog concept. Left: overall TS rate per notification category (orange); right: overall TS rate additionally separated according to the four speech dialog notification concepts (blue)
5.13 NoT per speech dialog concept. Left: overall average NoT per notification category (orange); right: overall average NoT additionally separated according to the four speech dialog notification concepts (blue)
5.14 Overall average ON per speech dialog concept. Left: overall average ON per notification category (orange); right: overall average ON additionally separated according to the four speech dialog notification concepts (blue). Scale from -1 (=insufficiently obtrusive) to 1 (=too obtrusive)
5.15 Average ON per speech dialog concept with reference to the different driver workload and priority levels. Scale from -1 (=insufficiently obtrusive) to 1 (=too obtrusive)
5.16 Overall average ON comparing the GUI concepts. Scale from -1 (=insufficiently obtrusive) to 1 (=too obtrusive)
5.17 Average ON per GUI concept with reference to the different driver workload and priority levels. Scale from -1 (=insufficiently obtrusive) to 1 (=too obtrusive)
5.18 Overall average SASSI result per speech dialog concept. Left: overall average SASSI result per notification category (orange); right: overall average SASSI result additionally separated according to the four speech dialog notification concepts (blue). Scale from -2 (=rejection) to 2 (=acceptance)
5.19 MDev before and after an incoming e-mail. Left: MDev during low driver workload; right: MDev during high driver workload
5.20 Overall average MDev per speech dialog concept. Left: overall average MDev per notification category (orange); right: overall average MDev additionally separated according to the four speech dialog notification concepts (blue)
5.21 Overall average MDev comparing the GUI concepts
5.22 Overall average RT per speech dialog concept. Left: overall average RT per notification category (orange); right: overall average RT additionally separated according to the four speech dialog notification concepts (blue)
5.23 Overall average RT comparing the GUI concepts
5.24 Overall average DALI questionnaire result per speech dialog concept. Left: overall average DALI questionnaire result per notification category (orange); right: overall average DALI questionnaire result additionally separated according to the four speech dialog notification concepts (blue). Scale from -2.5 (=low driver workload) to 2.5 (=high driver workload)
5.25 Overall average DALI questionnaire result comparing the GUI concepts. Scale from -2.5 (=low driver workload) to 2.5 (=high driver workload)

List of Tables

2.1 Telephone conversation transcript, taken from Zue and Glass (2000)
2.2 The TRINDI tick-list
2.3 SASSI: item composition and sample items
2.4 The TRINDI tick-list applied to today's embedded in-car SDSs and SDSs on mobile devices (here: Apple's Siri) in the scope of the driving environment
3.1 Information seeking task scheme, including sample tasks corresponding to Figure 3.1
3.2 Communications task scheme, including sample tasks corresponding to Figure 3.2
3.3 Transactions task scheme, including sample tasks corresponding to Figure 3.3
3.4 Popularity of Internet services on smartphones and in the car
3.5 Identified speaking styles in the participants' utterances
3.6 Number of words per utterance
4.1 Characterization of speech dialog strategies on the basis of the TRINDI tick-list
4.2 Sample noun lexicon
4.3 Sample verb lexicon
4.4 Sample syntactic-semantic rules
4.5 Evaluation measures of the experiment
4.6 Results of subjective usability questions about the avatar in the final interview; scale from -5 (=rejection) to 5 (=agreement)
4.7 Results of subjective distraction assessment questions about the avatar in the final interview; scale from -5 (=rejection) to 5 (=agreement)
4.8 Session 1: Hypotheses of the comparison of speech dialog strategies
4.9 Session 1: Hypotheses of the comparison of GUI conditions
4.10 Session 2: Hypotheses of the comparison of the avatar conditions
5.1 Speech-based HMI notification concept variants
5.2 Sample tasks
5.3 Evaluation measures of the proactivity experiment
5.4 Hypotheses of the comparison of the speech notification concepts
5.5 Hypotheses of the comparison of the GUI conditions


1 Introduction

Impressive technological change is shaping our daily lives, society and business. In 1981, when IBM introduced their first personal computer (PC) to be used by small businesses and users at home (O'Regan, 2008), the era of personal computing began. At that point in time, home PCs were mainly used for work, educational purposes and entertainment (Vitalari and Venkatesh, 1987). With the rise of the Internet in the mid-90s, the number of activities the home PC was used for increased. Over the next ten years, using the PC for information seeking, online shopping, financial management and interpersonal communication through network applications including e-mail became part of people's daily lives (Venkatesh et al., 2011). Until …, …% of European households owned a PC and 64% had Internet access (European Commission, 2012). When people started to aspire to exchange information in any situation, mobile devices became more and more popular within the last ten years. Especially the arrival of smartphones in the mid-00s has shaped new expectations towards modern mobile devices: smartphones are "mobile phone[s] that [are] able to perform many of the functions of a computer, typically having a relatively large screen and an operating system capable of running general-purpose applications" (Stevenson, 2010). Thereby, they can act as pocket PCs with mobile phone functions and with broad and easy Internet access, which allows users to be connected everywhere at any time. By downloading so-called apps (an abbreviated form of "applications") from the Internet, the user is able to extend his smartphone's functionality. The breakthrough of mobile devices is reflected in the global sales figures of the most relevant Internet-enabling devices, illustrated in Figure 1.1. Since 2000, the number of PCs sold has stagnated, whereas the sales of smartphones and tablet PCs have grown tremendously since 2009. In 2012, Gartner predicted that smartphones would dominate the Internet-enabling device market in the following five years.

Fig. 1.1. Global Internet Devices Sales (Source: Gartner, IDC, Strategy Analytics, company filings, BI intelligence estimates).

Today, smartphones are considered people's companions and support the user in making life easier in various daily situations. They allow the user to get instant access to content-relevant information and Web services anytime and anywhere. For example, if you are enjoying your free time lying on the grass in the park, you can use your smartphone to read the latest news in order to always stay up to date. Furthermore, smartphones allow the user to communicate with others via networking apps in real time. Today, in order to be always connected, people no longer only send text messages or simply call each other. Instead, people communicate via social media like facebook, e-mail, and other (instant) messaging apps like WhatsApp using their smartphone. By using these modern ways of communication, people are not restricted to sending only a limited number of characters per message. Sending so-called over-the-top content (OTT) messages 3 via WhatsApp or facebook allows users to enrich the text message with pictures, sound files or videos. This attractive and entertaining way of exchanging information has enhanced the inherent desire to communicate even more. According to Informa Telecoms & Media, each OTT-messaging user sends an average of 32.6 OTT messages every day (Clark-Dickson et al., 2013), which emphasizes the intense use of smartphones nowadays.

3 OTT-messaging apps are downloadable smartphone apps which enable users to send (instant) text messages for free, using mobile Internet access (Clark-Dickson et al., 2013).

1.1 Assistive Technology for Multitasking Performance

The utilization of smartphones seems to be ubiquitous and without limitations, but this may only apply to situations where the actual smartphone use is the main focus. In the news-reading use case illustrated above, the user's main attention is concentrated on the manual use of the smartphone and on reading the news from the display. The user does not need to pay attention to other things happening in the park. Such a so-called selective-attention task "requires a person to attend to only one of several possible sources of information" (Proctor and Van Zandt, 2011, p. 242). The user attends selectively to the displayed information on the smartphone and ignores other auditory or visual information in his environment. Some situations require attention to several sources of information simultaneously. As people perform best when they have to attend to only one task, the more simultaneous sources they have to attend to, the poorer they perform. The performance decrement in divided-attention tasks is usually reflected in decreased accuracy in perception, slower response times or higher thresholds for stimulus detection and identification (Proctor and Van Zandt, 2011). In some cases, the user may not give each task the same priority. For example, if you would like to send a facebook message while driving your car, it is more important to drive safely than to use your smartphone manually to send the facebook message to your friends. Here, driving is designated as the primary task and sending the message as the secondary one. The secondary task needs to be performed without interfering with the primary task performance in order to ensure safety. Within the last years, accessing the Internet in such a so-called dual-task scenario has gained in importance.

The Use of the Internet as Secondary Task in Dual-Task Scenarios

There are many fields of application in which users perform such dual tasks and in which Internet access is needed as the secondary task. Some possible application areas are described in the following:

Intelligent Home: Intelligent homes are considered to be environments where people are assisted and supported in their everyday activities by information technology. By integrating information, communication and sensing technologies into everyday objects, the system can monitor people's presence and activities and respond in a smooth and unobtrusive way (Röcker et al., 2005; Grinter et al., 2005). Imagine a person in the kitchen who is preparing dinner. While taking care of the sauce for the meal, he does not remember the exact quantities of the ingredients. While cooking, he would like to look up the recipe on the Internet without losing attention to the food in his pan. Here, the primary task is cooking and the request for information from the Internet is the secondary task. In an intelligent home, the system would assist the user in finding the requested information online and present the recipe using the available output modalities. For example, the system could present the recipe on a screen or read out the information. Thereby, the user can retrieve the desired information without losing attention to the primary task.

Medicine: Computers and the Internet are widely used as resources in medical education and clinical care, since most medical information and records are stored electronically. Researching information in books or patient records is time-consuming, which is why today doctors use the Internet for instant access to patient data or to drug and dosage information, for example (Masters, 2008). In the doctor's office or during surgery, physicians can look up medical information online using their mobile device. However, Smith et al. (2011) report that distraction caused by the use of mobile devices in professions like perfusion, where clinical vigilance is essential to patient care, is a big issue. The use of mobile devices to support physicians needs to be handled with care in order to prevent misuse, which could endanger the lives of patients.

Automotive: In the automotive environment, it is common to perform dual tasks. For instance, people frequently control their navigation system, manipulate the sound system or make or receive phone calls while driving their car (AAA Foundation for Traffic Safety, 2003). Today, the pervasive use of smartphones in daily situations also impacts the automotive environment. People spend much time in the car but cannot access the Internet on a regular basis yet. According to Williams (2009), Americans spend hours in their car every day. In order to stay connected, people tend to use their smartphone's Internet-enabling functionalities manually while driving. As illustrated above, people continue communicating via OTT messages while driving or use their smartphone to look for the cheapest gas station nearby. There are various applications where the driver may want to retrieve or exchange information via the Internet on the road. According to State Farm Mutual Automobile Insurance Company (2012), especially young drivers frequently access the Internet on a smartphone while driving. In this age group, accessing the Internet while driving increased from 29% in 2009 to 48%. However, using a smartphone manually while driving distracts the driver and endangers the driver's and others' safety (Governors Highway Safety Association, 2011).

Military: The military uses the Internet for communication and control-infrastructure purposes. For instance, simple text chat techniques like Internet Relay Chat (IRC) are used by the U.S. military for communicating practical information, including reporting and command and control on the battlefield. This so-called tactical chat allows for near-real-time multi-participant communication among military units. While focusing on a target, a warfighter can receive immediate clearance to fire from different military units by using tactical chat instead of having to contact several agencies one after another via radio or telephone. Here, the secondary focus of attention is on communicating with other military units while primarily targeting the enemy (Eovito, 2006; Air Land Sea Application Center, 2009).

All of the use cases above require that the secondary task does not impair the primary task performance. In order to optimize the performance of both tasks, Wickens' multiple resource theory (MRT) model can be applied (Wickens, 1984). According to Wickens, human operators have several information processing resources (e.g. visual, auditory and tactile input and output channels), which can be tapped simultaneously. Depending on the nature of the task, these resources may have to process information sequentially if the different tasks require the same resource. If different resources are required, the tasks can be performed in parallel. Therefore, Wickens concludes that performing dual tasks in parallel works best when the required user workload is distributed over several resources (Wickens and Hollands, 2000). For instance, preparing dinner demands a person visually when keeping the eyes on the meal in the pan, whereby the visual input channel is tapped. Adding ingredients to the meal requires the use of the hands, which occupies the tactile output channel. Now, if the person in the kitchen had to look up a recipe by typing on a keyboard and looking at a screen, the same resources would be tapped, which would lead to a performance decrement in at least one of the tasks. Here, the auditory channel is free and could be used to look up the recipe without impairing the primary task. An intuitive speech interface allowing the user to retrieve information from the Web would help the person to keep attention focused on the dinner preparations.

The Use of Speech Interfaces as Assistive Technology in Dual-Task Scenarios

In human-machine interaction (HMI), speech is considered a convenient interface modality. A so-called spoken dialog system (SDS) allows users to communicate and interact with an application or a complex computer system through speech (McTear, 2004). For humans, speech is the most natural way to communicate, which is why no special training is needed. Furthermore, speech as a communication channel is highly efficient, as the transmission is fast and provides the highest capacity. Additionally, speech interaction offers benefits for persons with disabilities, such as visually impaired people or people with limited physical abilities (Sakti et al., 2009). Due to the benefits of speech as a communication channel, SDSs are already used today in several dual-task scenarios in order to assist the user and to optimize task performance:

Intelligent Home: In an intelligent home, different interface modalities could be employed. However, when considering natural interaction, McLoughlin and Sharifzadeh (2008) suggest interaction by speech as the modality of choice due to its user-friendliness and its modest hardware requirements. By installing a set of microphones and loudspeakers in each room, the system can be accessed from everywhere in the home. One of the first speech-controlled intelligent-home consumer devices is Samsung's Smart TV. While watching TV, the user is able to change channels or search for TV shows by speech, for example.

Medicine: In the medical field, SDSs are used for various purposes. Free-form dictation systems employing a large-vocabulary speech recognition engine are used for the development of reports in areas like radiology, pathology and endoscopy. The doctor is able to dictate clinical reports directly into a computer, eliminating the time and reducing the costs needed for a transcription service. Furthermore, speech interfaces are applied to control medical instruments. For instance, during surgical procedures a surgeon can control a camera using speech commands while operating on a patient (Grasso, 2003).

Automotive: Speech technology has been applied to in-vehicle use for many years. Nowadays, drivers are able to dial or answer their mobile phone, input destinations for navigation or control audio devices (e.g. radio station selection) via voice (Häge et al., 2008). By using speech instead of manually controlling the infotainment system of the car, they can keep their hands on the wheel and their eyes on the road. Today, integrated SDSs exist, such as Mercedes-Benz's Linguatronic, BMW's voice control as part of BMW's iDrive system, Audi's voice control as part of Audi's MMI system, or Ford Sync, which provide the driver with the functionality mentioned above. There are also commercially available portable devices (e.g. Parrot's ASTEROID tablet) which can be installed in the car. However, their voice-controlled functions are limited.

Military: The military uses speech technology in situations where the workload is heavy. Today, a modern aircraft or helicopter is part of a complex system in which command posts, armored ground vehicles, other aircraft and military bases work together. Furthermore, adverse conditions such as stress or bad weather hamper the primary piloting task and impose a high workload on the crew (Pigeon et al., 2005). Several research projects have examined the use of SDSs in high-performance aircraft and helicopter environments. In these projects, speech recognizers have been integrated successfully for communication and navigation systems, display management and control functions (Weinstein, 1990). Furthermore, speech technology finds use as part of a multi-modal interface in operations rooms aboard ships, in the air or on land. For instance, the complexity of a sonar suite and the amount of information available exceed the mental capacity of sonar operators. Instead of increasing the number of operators at great cost, speech input could be applied to report during surveillance and to control the display or the interactive graphical tools provided by the sonar suite (Pigeon et al., 2005).

As illustrated above, voice-control finds use in several areas and research fields where dual tasks have to be performed. The obvious benefits and the wide field of application emphasize the power of voice-control for performing secondary tasks without impairing the primary task. Therefore, the use of the Internet as a secondary task should be performed by speech, too. However, speech access to the Internet has not progressed far yet and is still not fully developed. The state-of-the-art technology which enables users to access the Internet by speech is illustrated in the following section. Since the mobile use of the Internet is becoming more and more important, the focus of section 1.2 is on mobile environments. Today's speech technology, its functionality and its shortcomings are described, based on which the research questions of this thesis arise.

1.2 State-of-the-Art Speech Access to the Internet in Mobile Environments

Today's state-of-the-art technology enabling users to access the Internet by voice is not sophisticated yet. Mobile Internet access by speech is mainly found on smartphones and in vehicles. The functionality and the shortcomings of the available technology in these mobile environments are illustrated in the following. Afterwards, based on the deficiencies, the research questions of this thesis are described.

Most of today's smartphone manufacturers equip their products with a personal assistant which allows the user to interact with the mobile phone by speech. Apple's Siri and Samsung's S-Voice are the most popular and most powerful personal assistants; they are available as native apps on the respective mobile devices. Both personal assistants allow the user to control several Internet apps by speech: for example, the user is able to communicate via facebook, send e-mails, look for restaurants, ask for current weather conditions or retrieve knowledge from the Internet. The user is able to speak entire sentences or ask simple questions in order to make his request. As an example, Figure 1.2 shows screenshots of the speech interaction with Apple's Siri, where the user would like to send an e-mail.

Fig. 1.2. Sending an e-mail using Apple's Siri.

First, by presenting a question on the screen, the user is asked what he would like to do. After having indicated his intentions by speaking to the smartphone, Siri gives the user visual feedback of the recognized utterance, which is presented in the upper transcription of the second screenshot. In this example dialog, Siri does not provide auditory feedback about the recognized utterance. Afterwards, when Siri speaks to the user and asks for the e-mail's subject, additional visual feedback is given by presenting an e-mail form in which the interpretation of the previous user utterance is shown. The speech dialog continues step by step: the user tells the subject, followed by the content of the e-mail. Again, the recognition results are presented on the screen and later in the e-mail form. This voice-control seems to be appropriate when the primary focus of attention is on sending the e-mail using the mobile phone. However, when another task with higher priority is performed in parallel, this way of interacting is not feasible. As illustrated above, driving a car requires the use of the driver's hands and keeping his eyes on the road. If the driver were to send an e-mail using Siri at the same time, he would always have to look at the screen in order to find out what the system has understood, since Siri does not always give auditory feedback (e.g. when sending a text message, Siri confirms the understood recipient). Both tasks demand the visual input resource. Thus, the driver cannot keep his eyes on the road anymore and would risk causing an accident. Furthermore, concerning the performance of secondary tasks, it has not been examined yet which kind of speaking style may be preferred. Instead of speaking entire sentences, short but efficient utterances (e.g. commands) could be used.
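Viewed abstractly, the e-mail dialog illustrated above is a system-initiated slot-filling loop: the assistant keeps prompting until recipient, subject and content are known, echoing each recognized value on the screen. The following minimal Python sketch illustrates this pattern; the slot names, prompts and the recognizer stub are illustrative assumptions and do not reflect Siri's actual implementation.

# Minimal sketch of a system-initiated slot-filling e-mail dialog.
# Slot names, prompts and the recognizer stub are assumptions for
# illustration; they do not reflect Siri's actual implementation.

SLOTS = ["recipient", "subject", "content"]
PROMPTS = {
    "recipient": "To whom would you like to send the e-mail?",
    "subject": "What is the subject of your e-mail?",
    "content": "What would you like the e-mail to say?",
}

def recognize(prompt: str) -> str:
    """Stand-in for speech recognition and understanding: reads typed text."""
    return input(prompt + " ")

def email_dialog() -> dict:
    """Prompt for each slot in turn and echo the recognized value."""
    form = {}
    for slot in SLOTS:
        form[slot] = recognize(PROMPTS[slot])
        print(f"[GUI] {slot}: {form[slot]}")  # visual feedback, like the e-mail form
    return form

if __name__ == "__main__":
    filled_form = email_dialog()
    print("[TTS] Ready to send:", filled_form)

In a command-based variant, the same loop would accept short keyword utterances; in a conversational variant, a single sentence could fill several slots at once. This is precisely the design space examined in Chapter 4.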

There are only few Internet apps, including sending e-mails, which can be entirely controlled by speech. Most of the apps require a switch of input modalities during the interaction. For example, when you look for a restaurant by asking Siri by voice, it will provide you with a list of restaurants presented on the screen (see Figure 1.3). From there, no further speech interaction is possible. The user has to tap on the screen to select a certain restaurant or to tap the Siri button in order to re-activate the speech dialog. This demands the visual input and haptic output channels and is not applicable to a dual task where the primary task must not be impaired.

Fig. 1.3. Looking for a restaurant using Apple's Siri.

Another shortcoming is the handling of incoming messages, of which the user has to be notified and whose content has to be delivered. Currently, smartphones notify the user by playing certain sounds and vibrating. After the notification, the user has to tap on the mobile device to retrieve the information. Again, this is not feasible for secondary task performance since it requires a lot of attention from the user. There also exist several third-party smartphone apps like Google's Google Now or Nuance's DragonGo, which can be installed on smartphones supporting the required operating systems (OS). These apps also allow the user to access the Internet by speech and have similar functionalities, but they show the same shortcomings in speech interaction as explained above.

People's wish for mobile Internet access also impacts the automotive environment. Today, most car manufacturers offer infotainment systems like Mercedes-Benz's COMAND Online, BMW's Connected Drive or Audi Connect, which allow the user to access Internet content by providing a set of car apps. However, the in-car SDSs are very limited and mostly require haptic input during the interaction. For example, using BMW's voice control, the dictation of an e-mail requires navigating via the central control element (CCE) in order to start the creation of an e-mail. Only from that point on is the user able to dictate his message by speech. Furthermore, today's in-car SDSs are command-based and require the use of certain commands. When the number of accessible Internet apps increases, the number of possible commands will increase, too. The driver may be overloaded by the high number of speech commands he has to know in order to control the system. Because of this, the current speech dialog concept does not cover the needs which the voice-control of the Internet demands and has to be refactored. In addition, the notification issue described above has not been addressed in the automotive environment yet. Currently, there is no in-car SDS which notifies the driver about incoming Internet content. Since the number of OTT messages has increased in the last years, there is a strong need to find a speech-based HMI concept 17 which properly notifies the driver and presents the content.

17 Here, speech-based HMI concepts are to be understood as human-machine interfaces which use speech as the main input and output modality but which can also provide other output modalities to support the speech dialog.
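To make the notification requirement concrete: Chapter 5 designs speech-based notification concepts with different levels of obtrusiveness, including sound and verbal notifications, and varies the priority of incoming e-mails and the driver workload. The following Python sketch shows one hypothetical policy for choosing a notification concept from these two factors; the mapping below is an assumption for illustration, not the policy evaluated in this thesis.

# Hypothetical notification policy: how obtrusively to announce an incoming
# e-mail, chosen from message priority and current driver workload.
# The mapping below is an illustrative assumption, not the evaluated policy.

from enum import Enum

class Level(Enum):
    LOW = 0
    HIGH = 1

def choose_notification(priority: Level, workload: Level) -> str:
    """Return a notification concept for an incoming e-mail."""
    if workload is Level.HIGH and priority is Level.LOW:
        return "defer"    # least obtrusive: queue silently until workload drops
    if priority is Level.HIGH:
        return "verbal"   # spoken announcement, most obtrusive
    return "sound"        # short notification sound only

for p in Level:
    for w in Level:
        print(f"priority={p.name:<4} workload={w.name:<4} ->",
              choose_notification(p, w))

Whether such a context-dependent choice actually matches user preferences is exactly what the driving simulator study of Chapter 5 investigates.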

The overview shows that today's Internet access by speech in mobile environments is very limited and immature. There is no convincing, consistent and sophisticated speech dialog concept which allows the user to access the Internet by speech. In the context of secondary tasks, the mobile devices addressed above do not provide speech technology that leaves a primary task running in parallel unimpaired. Therefore, it is highly important to develop a speech dialog concept which allows the user to access the Internet as a secondary task. Before a new speech interaction concept for a new domain can be developed, developers have to gather information about how people may interact in this new domain by speech. Afterwards, based on these insights, new concepts can be developed. These new concepts have to be evaluated in order to find the best matching concept. In this thesis, the mentioned research steps are performed: a new speech dialog concept to access the Internet as a secondary task is designed and evaluated. In the next section, the scope of the conducted research and the core research questions are described.

1.3 Thesis Contributions

The goal of this thesis is the design, implementation and evaluation of an intuitive SDS which enables the user to access the Internet by speech while performing a primary task running in parallel. The Internet allows users to perform a large variety of tasks, which cannot be addressed entirely during the research of this thesis. Therefore, the research focuses on speech access to the most popular Internet tasks. In the following, the Internet tasks which users engage in are characterized and classified. Based on this classification, the research of this thesis is delimited and the arising research questions are described.

Research Scope of the Thesis

In a field study, Kellar (2007) monitored people's Web usage and interaction with the Web browser in order to understand what kinds of tasks users engage in when using the Internet. Based on the data collection and existing research work, Kellar developed the Web Information Classification, which categorizes the Web tasks users engage in. Concerning mobile devices, the tasks which people engage in still remain the same; however, the frequency of occurrence of each task category may change slightly. Nevertheless, the Web Information Classification can also be applied to mobile use. As illustrated in Figure 1.4, the classification consists of three main information goals: information seeking, information exchange and information maintenance.

Fig. 1.4. The Web Information Classification (Kellar, 2007).

The three sub-categories shown in Figure 1.4 consist of the following tasks:

Information Seeking: In information seeking tasks, users try to change their state of knowledge (Marchionini, 1995). These tasks consist of Fact Finding (e.g. searching for a bus schedule), Information Gathering (e.g. looking for a new laptop) and Browsing (e.g. reading and browsing movie updates). Information seeking tasks account for 51.7% of all Web usage (Kellar, 2007). Due to the high frequency of occurrence of these tasks, developing an SDS which allows users to perform information seeking tasks by speech has to be taken into consideration.

Information Maintenance: Information maintenance tasks are tasks in which the user's goal is to maintain a Web resource. They mainly occur when people update or create Web pages. Information maintenance tasks account for only 1.7% of all Web usage (Kellar, 2007).

Information Exchange: When performing information exchange tasks, the user aims at exchanging information in a Web-based setting. These tasks consist of Transactions (e.g. hotel bookings) and Communications (e.g. sending an e-mail). Information exchange tasks account for 46.7% of all Web usage (Kellar, 2007). (The classification and these usage shares are summarized in the sketch at the end of this subsection.)

Concerning the performance of information seeking tasks by speech, the difficulties lie in the preprocessing of the Internet content and not in the design of speech dialogs. There is a large variety of data sources online, available in structured or unstructured form. The requested information has to be filtered out and preprocessed in order to present the content to the user by speech. Information seeking SDSs which access Web content have been examined by Mishra and Bangalore (2010), Ankolekar et al. (2006) and Hofmann et al. (2011b). Here, the data is retrieved by scraping the content of Web sites or from semantic knowledge bases. Nowadays, there already exist Web services, such as WolframAlpha, which provide huge semantic knowledge bases for many different domains. Seeking information by speech is normally achieved in only a few dialog turns. Ideally, the user makes his search query and the system delivers the desired information. Therefore, the speech dialog often turns out to be a question-and-answer dialog. As the dialogs are simple and short, the SDS design challenges are low, because there is not much space for the conception of new dialog strategies. A user study investigating how people interact with the Internet by speech, presented in section 3, will confirm these assumptions. As the research challenges are located on the data preprocessing side and not on the speech dialog side, information seeking tasks are not further examined in this thesis.

As the frequency of occurrence of information maintenance tasks is very low and as these tasks are unlikely to be performed as secondary tasks in a mobile environment, this task category is not further examined in this thesis either.

A transaction, such as a hotel booking, requires multiple steps until the task is finished. There are many ways to design such a multi-turn dialog concerning speaking styles or dialog flow. As illustrated in the sample dialog of Figure 1.2, communication tasks also require multiple dialog steps. Furthermore, in some cases these tasks initiate themselves and are not triggered by the user: e.g. if a new facebook message comes in, the new message pops up and is presented to the user. The multiple dialog steps and the self-initiative character of information exchange tasks require a lot of attention from the user and may lower the performance of a primary task if the speech interaction is not designed in a sophisticated and intuitive way. Due to the high frequency of occurrence and the strong need for the examination of appropriate speech dialog concepts, this research work focuses on information exchange tasks. The user study of section 3 will confirm that it is fundamental to investigate speech dialog concepts for this kind of tasks.
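To make the classification concrete, the following Python sketch encodes Kellar's taxonomy together with the usage shares quoted above; the dictionary layout is merely one convenient encoding chosen here for illustration.

# Kellar's (2007) Web Information Classification with the usage shares
# quoted above; the data layout is an assumption chosen for illustration.

WEB_INFORMATION_CLASSIFICATION = {
    "information seeking": {
        "tasks": ["fact finding", "information gathering", "browsing"],
        "share_of_web_usage": 0.517,
    },
    "information exchange": {
        "tasks": ["transactions", "communications"],
        "share_of_web_usage": 0.467,
    },
    "information maintenance": {
        "tasks": ["maintenance"],
        "share_of_web_usage": 0.017,
    },
}

# This thesis restricts itself to the information exchange goal:
focus = WEB_INFORMATION_CLASSIFICATION["information exchange"]
print("Tasks in focus:", ", ".join(focus["tasks"]))

total = sum(goal["share_of_web_usage"]
            for goal in WEB_INFORMATION_CLASSIFICATION.values())
print(f"Covered Web usage: {total:.1%}")  # the three goals sum to roughly 100%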

28 10 1 Introduction Research Contribution Goal of this research work is the development of an user-friendly SDS with which the user is able to perform online information exchange tasks in a dual-task scenario. The addressed Internet task has to be performed as secondary task without impairing a primary task performed in parallel. Before developing an SDS in a new domain, a data collection from real users is needed to get to know how users would interact with the Internet by speech. Based on the strong needs and the high frequency of occurrence, the research of this thesis focuses only on information exchange tasks. In the following the research steps and challenges are presented: 1. User Study on Speech Interaction with the Internet The development of speech interfaces to a new domain requires a wide knowledge about the domain and about how people interact with the domain. Therefore, a user study has to be conducted first, which aims at getting knowledge about how users would interact with the Internet by speech. The user study will give information about people s speaking styles in the different task categories. Furthermore, the data is needed for the development of speech dialog prototypes. 2. Development of SDS Concepts for Information Exchange Tasks as Secondary Task As already mentioned above, an online transaction, such as a hotel booking or online shopping, is performed in several steps. First, the user indicates several input parameters based on which he receives a list of options. The user has to browse through the list of possible options in order to find the option of his desire. The design of the speech dialog offers a variety of possibilities. Different speaking styles (e.g. commands or entire sentences), dialog control variants (e.g. userinitiated, system-initiated) have to be taken into consideration. If the SDS is supported by a GUI, different GUI designs are conceivable. The goal of this research is to design and prototypically implement different SDS concepts. In order to find the most suitable speech interface for performing information exchange tasks as secondary tasks, the developed prototypes are evaluated in user studies in terms of usability and driver distraction measures. 3. Development of SDS Notification Concepts for Incoming Events as Secondary Task Communication tasks, such as sending s or facebook messages, also require to input several parameters. However, this problem is already covered by the previous challenge. Concerning communication tasks, the research interest is in the self-initiative behavior, where the system initiates the interaction. There are several ways of notifying the user proactively (see definition section 2.1.4) about an incoming event. A speech-based notification concept has to be developed, which notifies the user in an unobtrusive way, which does not impair the performance of the primary task. Again, the goal of this research is to design and prototypically implement different speech-based notification concepts, which are evaluated in user studies in order to find the most suitable notification concept. The research work requires the realization and evaluation of several SDS concepts, therefore, an example dual-task use case scenario has to be chosen. Section 1.1 presented several scenarios, in which the use of the Internet as secondary task is needed and where SDSs are employed as assistive technology. In the automotive environment SDSs have been established and used since many years. 
Today, people spend much time in the car (Williams, 2009) but cannot yet access the Internet there on a regular basis, which is why they have started using their smartphones manually while driving (State Farm Mutual Automobile Insurance Company, 2012). If an intuitive SDS that enables drivers to access the Internet while driving is not developed soon, drivers will continue to use their smartphones manually, which endangers the driver's safety (Governors Highway Safety Association, 2011). Due

to the urgent need and the high risks, the automotive environment is chosen as the example dual-task scenario in this thesis. The research work described in this thesis is performed in the context of the EU-funded project GetHomeSafe, which aims at developing an in-car system for safe information access and communication by speech while driving. This research project is conducted within the scope of the Seventh Framework Program of the European Commission.

1.4 Outline of the Thesis

A short introduction to the mobile Internet revolution was provided at the beginning of this Chapter. Afterwards, the use of the Internet in dual-task scenarios was described, followed by the introduction of SDSs as a promising interface technology, which already assists in many dual-task situations today. The current trends and shortcomings of today's Internet-enabling speech technology were presented, from which the research challenges of this work arise. Finally, the scope of this research work and the challenges this thesis aims to address were illustrated. The remainder of the thesis is structured as follows:

Chapter 2 presents the fundamentals to provide background for the subsequent Chapters. In section 2.1 the fundamentals of an SDS are described with focus on the dialog design and the evaluation of SDSs. Section 2.2 provides background on the application of SDSs in the field. Here, the application and challenges of the use of SDSs in the automotive environment are illustrated. Section 2.3 introduces driver distraction, which indicates how much the secondary task performance impairs the primary task performance. Based on the background, section 2.4 presents related research and derives the concrete research goal of this thesis.

Chapter 3 describes a Web-based user study, which aims at gaining knowledge about how users would interact with the Internet by speech. First, the idea and the method of the study are presented in 3.1, followed by the results (section 3.2). Section 3.3 summarizes the Chapter and presents implications for the following research.

Chapter 4 presents the development of speech-based in-car HMI concepts for the performance of information exchange tasks while driving. In section 4.1, based on the provided background, the design of the different SDS concepts is described. Section 4.2 explains the realization of the different concept prototypes using the existing SDS framework. The prototypes have been evaluated in a driving simulator study. The method and the results of the user study are described in section 4.3. Finally, section 4.4 summarizes the research work and the main findings of Chapter 4.

Chapter 5 explains the development of speech-based in-car notification concepts for incoming proactive events. The design of different notification concepts is illustrated in section 5.1. Section 5.2 explains the implementation of the different speech-based concept prototypes. For the evaluation of the prototypes a driving simulator study has been conducted. Its method and results are described in section 5.3. In section 5.4 the research work and the main findings of Chapter 5 are summarized.

Chapter 6 draws conclusions about the research work. The work of this thesis is summarized in section 6.1, followed by suggestions for future work in section 6.2.


2 Background and Resulting Challenges

This thesis investigates the control of the Internet by speech in a dual-task scenario. The user would like to access the Internet while performing a primary task in parallel. However, the performance of the primary task must not be impaired by the secondary task performance. The goal is to design, implement, and evaluate various SDS concepts, which allow users to access the Internet in an intuitive and unobtrusive way. Due to the strong need and the high frequency of occurrence, the research work of this thesis focuses on the performance of information exchange tasks by speech.

The design of an SDS requires an understanding of the different SDS components. The prototypical implementation and the evaluation of the SDS concepts require the investigation of their application in the field. Here, the automotive environment has been selected as sample use case. In order to understand the difficulties and the challenges of this environment, background knowledge about this domain needs to be provided. Therefore, this Chapter describes the technical and theoretical background about SDSs and their application in the field required for the following Chapters. Additionally, related work is introduced and discussed, and based on the shortcomings, the concrete research questions are pointed out.

In Section 2.1, the fundamentals of SDSs are described, beginning with a short overview of the topology of an SDS. Subsequently, the remainder of this Section focuses on providing background, which is the base for the dialog design and the evaluation of SDSs. Section 2.2 provides background on the application of SDSs in the field. Here, the application of SDSs in the automotive environment is illustrated. First, state-of-the-art in-car SDSs are presented, followed by the explanation of the Daimler Speech Dialog Framework, which is employed for the implementation of the SDS prototypes. Section 2.3 introduces driver distraction as a measure, which indicates how much the secondary task interferes with the primary task. Finally, in Section 2.4, the Chapter is summarized and related research and challenges are presented. This Section defines the concrete research questions of this thesis.

2.1 Fundamentals of Spoken Dialog Systems

First attempts at spoken language communication with computers were made in the 1950s. Since then, speech-based human-computer interaction has experienced rapid developments in research and in its fields of application. In recent years, speech recognition technology has sufficiently matured to allow for broad deployment in industry, and thereby new business models have arisen. In the smartphone, in the automotive environment, and in telephone-based customer self-service, countless applications exist where SDSs are employed successfully (Pieraccini and Lubensky, 2005). The complexity and the capabilities of SDSs have steadily risen in recent years. For example, the functionalities of telephone-based systems moved from information retrieval tasks providing bus

schedules towards transactional tasks that allow users to book hotels or flights. The latest generation of these systems is employed to provide technical support and customer care. In the automotive and smartphone environments, command-and-control speech interaction still predominates. However, nowadays researchers and SDS developers attempt to design spoken dialog technologies that resemble human-human communication. By allowing free and unconstrained user input, the naturalness of the spoken dialog interaction is enhanced and the SDS appears more human-like and user-friendly. Due to these new developments, the capabilities of SDSs will exceed the mere voice control of navigation and telematics systems. Especially the voice control of the Web and its countless fields of application requires the development of new speech dialog concepts (Schmitt, 2012).

Developers of SDSs need an understanding of the nature of a dialog, what it is used for and how people engage in dialog. The term dialogue is used in everyday language to describe a process of exchanging views, sometimes with the purpose of finding a solution to a problem or to resolve differences (McTear, 2004, p. 45). There is no intention of convincing one or several sides of one's opinion, such as in a debate. Instead, dialog is a collaborative process, in which two or more sides work together toward common understanding. Dialog might be distinguished from conversation, which describes informal spoken interaction used for maintaining social relationships. However, especially in research in the United States, the term conversation is frequently used for advanced human-like SDSs, whereas dialog tends to be used for more restricted SDSs. Despite this differentiation, both terms are frequently used to describe computer systems which interact with humans using spoken language (McTear, 2004). In this thesis, the term dialog is generally used for spoken language interaction regardless of the SDS's interactional competences. However, the term conversational refers to a more advanced dialog system, which displays human-like conversational competences and behavior.

The following Sections provide background on SDSs, beginning with an overview of the topology of an SDS, which briefly describes the different modules and their functionality. Section 2.1.2 explains the fundamentals of the dialog management component, which is the SDS's core component and whose features are crucial for the interaction style and the character traits of an SDS. Subsequent to this, conversational speech interfaces and their characteristics are introduced. Since several understandings exist of what a conversational speech interface is, Section 2.1.3 defines the understanding of conversational speech interaction applied in this thesis. One goal of this thesis is the development of notification concepts for incoming proactive events. In Section 2.1.4, the term proactivity is introduced and its application in SDSs is presented. In the research of this thesis, several SDS prototypes are developed. In order to evaluate the different SDSs on usability, the dialog quality has to be assessed. The last Section is therefore dedicated to usability measures and evaluation methods applied to SDSs.

2.1.1 Overview of Spoken Dialog Systems

An SDS is an interface which enables users to access information and services available on a machine or over the Internet by using spoken language as the medium of interaction (Jokinen and McTear, 2009).
An SDS comprises several components, which work together to enable a spoken interaction between humans and the machine. The architecture of an SDS is illustrated in Figure 2.1. The modules are automatic speech recognition (ASR), language understanding, dialog management, language generation and text-to-speech synthesis (TTS). As can be seen in Figure 2.1, the topology can be considered as a pipeline, in which each module processes the output of the preceding module (McTear, 2004). In order to give an overview of the functionality of the different components, the following real-life example dialog with a TV information service is presented (McTear, 2004):

Fig. 2.1 Components of an SDS, similar to McTear (2004): the user's speech input passes through automatic speech recognition, language understanding, dialog management (connected to the application), language generation, and text-to-speech synthesis to the system's speech output.

1 System: Welcome to the TV Information Service. How can I help you?
2 User: I would like to watch a movie with David Hasselhoff.
3 System: When would you like to watch the movie?
4 User: On Monday at 8pm.
5 System: There is a movie featuring David Hasselhoff shown on TV at 8:15 pm.

First of all, the SDS welcomes the user and asks him about his goals (1). Afterwards, the user responds and inputs the desired movie request (2). In order to process the user's input data, the system has to walk through the following steps until it can provide the correct answer to the user:

(a) Recognize the user's uttered words by transforming the speech signal into a sequence of words in text form (Automatic Speech Recognition).
(b) Understand the user's intentions and the meaning of his words by interpreting the recognized text (Language Understanding).
(c) Find out to which stage of the dialog the utterance belongs and decide what to do next. If the task is not clear yet, ask for confirmation or further details. If necessary, retrieve information from the application that matches the user's requirements (Dialog Management).
(d) Create a response according to the information provided by the dialog manager (Language Generation).
(e) Speak the response (Text-to-Speech Synthesis).

In the following, the different modules are briefly described, with focus on the modules relevant for the design of SDSs.

Automatic Speech Recognition

The main task of the ASR component is to translate the captured speech signal into a sequence of words before sending it to the language understanding component. Mathematically, the goal is to find the most probable word sequence Ŵ = (w1, w2, ...) matching a given set of acoustic observations O = (o1, o2, ...) (Rabiner and Juang, 1993):

Ŵ = argmax_W P(W|O). (2.1)

By applying Bayes' rule to the ASR task, the most probable word sequence can be computed by

Ŵ = argmax_W P(W|O) = argmax_W [P(O|W) P(W)] / P(O). (2.2)
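To make the decision rule of Eq. 2.2 concrete, the following minimal Python sketch decodes a toy candidate set; the candidate sequences and all probabilities are invented for illustration and do not stem from any real ASR engine:

# Toy illustration of Eq. 2.2: choose the word sequence W that
# maximizes P(O|W) * P(W) over a small hand-coded candidate set.
candidates = {
    # word sequence: (acoustic score P(O|W), language model score P(W))
    "watch a movie": (0.6, 0.50),
    "watch a mover": (0.7, 0.01),
    "what a movie":  (0.4, 0.20),
}

def decode(candidates):
    # P(O) is constant over all W and can therefore be dropped from the argmax.
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

print(decode(candidates))  # -> "watch a movie"

In this sketch, the acoustically best hypothesis ("watch a mover") loses against a slightly worse hypothesis with a much more plausible language model score, which is exactly the interplay between the acoustic model and the LM described in the following.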

In order to translate the speech signal into text, the ASR module has to perform the following operations:

- extract the set of acoustic observations O from the speech signal and compute P(O),
- compute P(O|W): the probability that the set of acoustic observations O originates from a certain word sequence,
- compute P(W): the likelihood of individual word sequences W, which can occur,
- find the word sequence which maximizes Eq. 2.2.

Most state-of-the-art ASR engines use acoustic models based on Hidden Markov Models in order to estimate the probabilities P(O|W). The acoustic models of the ASR engine used in the SDSs are not adapted for the research of this thesis. Therefore, this thesis does not further elaborate on this topic. The likelihood P(W) of possible word sequences W is estimated by the so-called language model (LM). The LM contains knowledge about possible word sequences and about which words are more likely to occur in a given sequence. The word sequences produced by the acoustic model can be analyzed in terms of whether they conform to the output of the LM. There are two main approaches to model the user's language for ASR: rule-based and statistical approaches (McTear, 2004).

Rule-based Approaches

Rule-based LMs specify all acceptable word sequences which the user might say at a certain point of the speech interaction. In so-called grammars, all possible constituents, which can be cross-linked, are specified. A constituent is a word or a group of words that functions as a unit (Fromkin et al., 2003). A grammar consists of several rules determining the cross-linking of the different constituents. In Figure 2.2, a sample grammar covering the utterances of the example is illustrated as a finite state network.

Fig. 2.2 A grammar network for TV information, similar to McTear (2004): from the start node, the paths "[I would like to] watch a movie with <actor>" and "On <day> at <time>" lead to the end node.

As a finite state machine (FSM) does not support recursion, the language modeling capabilities are limited. Today, grammar specification languages exist which have the expressive power of a context-free grammar and are widely used. The JSpeech Grammar Format (JSGF) is based on the Augmented Backus-Naur Form (ABNF) and adopts the style and the conventions of the programming language Java (Hunt, 2000). The Speech Recognition Grammar Specification (SRGS) is derived from the JSGF specification, was standardized in 2004 and is recommended by the World Wide Web Consortium (W3C) (Hunt and McGlashan, 2004). SRGS includes two alternate yet equivalent specification formats, one based on the Extensible Markup Language (XML) and one using ABNF. The ABNF format is suitable for quick hand coding, whereas XML is more suitable for automatic

integration into XML-based voice user interface design languages, since it is easy to parse. In addition, SRGS (and also JSGF) is not only able to model possible utterances but also able to produce semantic interpretations of the input. Further background about language understanding and semantic interpretation is given in the next Subsection. A JSGF code snippet illustrating the grammar example of Figure 2.2 is shown below:

1 #JSGF V1.0;
2 grammar TVSelection;
3 public <selectmovie> = ([I would like to] watch a movie with <actor>) | (on <day> at <time>);

Grammar-based LMs are useful if the domain of the application is very restricted and when all the phrases which are likely to occur are known in advance. However, since "any speaker of a human language can produce [...] an infinite number of sentences" (Fromkin et al., 2003), it is time-consuming to cover all possible utterances. The design of a wide and flexible grammar needs a lot of expert knowledge, and often legal word sequences that were not anticipated are ruled out or syntactically erroneous sentences are falsely accepted (McTear, 2004).

Statistical Language Models

Statistical language models (SLMs) provide statistical information on word sequences and are used to predict the next word in a sentence. SLMs incorporate information about the structure of the words and their order in a specific language. N-gram models are used to model possible word sequences. Commonly, bigram or trigram models are used to estimate the most likely occurring word. In order to train an SLM, a large amount of transcriptions of real (if possible, spoken) utterances is needed. For each word pair or word triple the number of occurrences is counted and the occurrence probability is calculated during the training process. N-gram-based SLMs are useful if the vocabulary is very large, since then not all permissible sentences and word combinations can be anticipated to design a grammar. In order to generate a reliable SLM, a huge amount of training data is needed. Therefore, SLMs have not been widely used in SDSs and are rather employed in automatic transcription applications (McTear, 2004).

Language Understanding

The ASR component translates a speech signal into word sequences. In the next step, the meaning of the user's utterance has to be extracted, which is a prerequisite for the dialog manager (DM) to proceed with the next dialog step (McTear, 2004). The role of the language understanding component - also known as natural language understanding (NLU) - is to extract the semantic meaning from a word sequence W = (w1, w2, ...) by applying rule-based grammatical relations, rule-based semantic grammars, template matching or statistical parsing techniques (Allen, 1995; Jurafsky and Martin, 2008). The understanding process is traditionally split into two processes (McTear, 2004):

1. Syntactic analysis: determines the constituent structure of the ASR output.
2. Semantic analysis: extracts the semantic meaning of the constituents.

By recursively determining subphrases, a sentence can be syntactically analyzed. These subphrases, such as a noun phrase (NP), a verb phrase (VP) or a prepositional phrase (PP), can consist of other subphrases as well as of single words, such as nouns (N), pronouns (Pron), verbs (V), determiners (Det), prepositions (Prep), modifiers, etc. Context-free grammar rules can be used to model subphrases and their relations. Some sample rules are illustrated below.

1 PP -> Prep NP
2 NP -> [Det] [Modifier] N [Post-Modifier]
3 NP -> Pron

These rules can be interpreted as follows: a PP consists of a Prep and an NP. An NP consists of an N, and all the other surrounding constituents in brackets are optional. Furthermore, an NP can also consist of a Pron. By applying such rules, a sentence (S) can be syntactically analyzed and transformed into a tree structure (McTear, 2004). The syntactic analysis of a sample sentence is illustrated in Figure 2.3.

Fig. 2.3 Syntactic analysis of the sentence "I like the movie with David Hasselhoff", similar to Schmitt (2012).

Based on the syntactic analysis, the semantic interpretation is performed on these constituents applying rule-based grammars. Rule-based approaches are normally developed for a certain domain and require expert knowledge. Rule-based grammars are difficult to transfer to other languages or domains. Therefore, data-driven approaches (e.g., Hidden Understanding Models) are often applied (Schmitt, 2012). Alternative approaches are so-called semantic grammars, which bypass the syntactic analysis and are widely used in current SDSs. A semantic grammar classifies the constituents in terms of their meaning or function instead of their syntactic categories. The previously introduced grammar specification language SRGS is such a semantic grammar (McTear, 2004). Here, not only the words the user might say are specified but also the meaning of the utterance. A very simple semantic grammar code snippet written in ABNF, one of the formats supported by SRGS, is presented in the following:

1 $selecttvshow = [I would like to watch] ($sitcom | $action) [please];
2 $sitcom = [the sitcom] Alf
3 {value="alf"};
4 $action = [the action TV series] Knight Rider
5 {value="knightrider"};

If a user says "I would like to watch Knight Rider", a parser would return the attribute-value pair (AVP) selecttvshow = knightrider. The semantic interpretation of the phrase is forwarded to the DM, the next module in the pipe, which decides about the next action (Schmitt, 2012). Other approaches combine the syntactic and semantic analyses (Ehrlich and Jersak, 2006).
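The mapping from an utterance to such AVPs can be sketched in a few lines of Python; the rule table below is a hypothetical stand-in for a compiled semantic grammar and is not part of any SRGS tooling:

# Minimal sketch of semantic-grammar-style interpretation: keyword
# phrases are mapped directly to attribute-value pairs, bypassing
# a full syntactic analysis. The rule table is purely illustrative.
RULES = {
    "alf": ("selecttvshow", "alf"),
    "knight rider": ("selecttvshow", "knightrider"),
}

def interpret(utterance):
    """Return the attribute-value pairs found in the utterance."""
    text = utterance.lower()
    return dict(avp for phrase, avp in RULES.items() if phrase in text)

print(interpret("I would like to watch Knight Rider"))
# -> {'selecttvshow': 'knightrider'}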

Dialog Management

The dialog manager (DM) is the central component of an SDS and generally controls the dialog flow. The DM keeps track of the information relevant to the dialog, reacts to the input of the user and decides at which point in time and in which order it will prompt the user for missing information. This central component accepts or declines user input and re-prompts questions if the user does not answer in time. Furthermore, this module clarifies occurring ambiguities and asks for confirmation if the information forwarded by the preceding modules comes with low confidence values or if the information is of great importance (e.g., bank account details). The DM serves as interface to the application in the background. By interacting with the external knowledge source, such as a Web service, it is able to provide the user with the requested information (McTear, 2004; Cohen et al., 2004). A sample dialog demonstrating the described functionalities may proceed as follows:

1 System: Welcome to the TV Information Service. How can I help you?
2 User: I would like to watch a TV show with David Hasselhoff.
3 System: You want to watch a TV show with David Hasselhoff, right?
4 User: Correct.
5 System: When would you like to watch the TV show?
6 User: I would like to watch it tomorrow at 8pm.
7 System: There is the TV series Knight Rider shown on TV on Monday at 8:15pm.

In this sample dialog, the NLU module delivers the value actor = DavidHasselhoff with a low confidence value. Due to the high uncertainty, the DM re-prompts the user in order to confirm the request (3). After the confirmation by the user (4), the system checks whether all required parameters are filled in order to accomplish the task. As this is not the case, the system prompts the user to elicit the missing date information (5). The user responds that he would like to watch the show the next day at 8pm (6). Instead of giving an absolute date, the user indicates a relative day ("tomorrow"). The system has to resolve the context and map the relative date onto the absolute date ("Monday"). As all information is now given, the DM retrieves the matching information from the data source and informs the user about the available TV shows (7) (Schmitt, 2012).

There are different ways in which the behavior of the DM can be designed. When the behavior of the DM has to be realized, different dialog management characteristics have to be taken into consideration:

- dialog initiative defines to what extent each of the agents controls the initiative in the dialog,
- dialog control defines (among other duties) what question should be prompted at what point in time and in which order,
- dialog context keeps track of the information relevant to the dialog in order to support the dialog management process,
- grounding defines the process of confirming what has been understood (McTear, 2004; Jokinen and McTear, 2009).

Since the implementation of the DM is crucial for the usability of an SDS, Section 2.1.2 elaborates on the fundamentals of dialog management and describes the implementation strategies in detail. Various approaches exist to implement DMs. Using the W3C-standardized description language VoiceXML (Oshry et al., 2007), the DM is realized as a finite state machine, which describes the structure of a specific dialog. The user is always in one dialog state at a time. A hand-crafted set of rules specifies the next action or dialog to transition to.
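A minimal sketch of such a finite-state dialog manager is given below in Python; the states, prompts and transitions are hand-crafted and purely illustrative, not taken from VoiceXML or from any framework used in this thesis:

# Minimal finite-state dialog manager: one active state at a time,
# with a hand-crafted transition table, as in a VoiceXML-style form.
PROMPTS = {
    "ask_show": "Which TV show would you like to watch?",
    "ask_day": "On which day?",
}
TRANSITIONS = {"ask_show": "ask_day", "ask_day": "done"}

def run_dialog(get_user_input):
    state, slots = "ask_show", {}
    while state != "done":
        print(PROMPTS[state])
        slots[state] = get_user_input()  # deviations from the fixed path are impossible
        state = TRANSITIONS[state]
    return slots

# Example: run_dialog(input) conducts the two-question dialog on the console.

Because every transition is fixed in advance, corrections or over-answers cannot be handled; this inflexibility is discussed further below.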
There are other rule-based frameworks, such as the TrindiKit (Larsson and Traum, 2000) and agent-based systems, such as Olympus (Bohus et al., 2007). These approaches require a lot of knowledge about the domain and the dialog scenario,

which can be gained from pretests with real participants using an SDS in a prototypical development state. Other approaches use statistical algorithms to model the behavior of the DM, such as the Bayes Net prototype implemented within the EU-funded project TALK (Young et al., 2006) or the DM implemented by Williams and Young (2007) based on Partially Observable Markov Decision Processes. However, statistical approaches require a lot of data in order to train the statistical algorithms.

Language Generation and Text-to-Speech Synthesis

When the requested information has been retrieved from the external knowledge source, the natural language message has to be constructed and the speech output signal has to be generated, which is the responsibility of the response generation and the TTS synthesis modules. As these modules are closely tied together, they are described in the same Subsection. The information the external knowledge source delivers may take a variety of forms: tables, database records, instruction sequences, etc. The response generation module transforms this into a structure and form in which it will be presented to the user. A simple yet inflexible approach is the use of canned text. Canned text is useful if the retrieved information directly represents the information requested by the user; a database containing the mapping of the retrieved information to the textual description would then be sufficient. If a message has to be generated frequently but occurs in small variations, prompt templates can be used. Template filling allows database information to be inserted into predefined text, which offers the flexibility to slightly vary the prompts (McTear, 2004). The template form for the last prompt of the example above would be the following:

1 There is the TV series $action shown on TV at $time.

The DM delivers the database information $action and $time, which are inserted into the template message. Template filling is applied in most information retrieval SDSs (McTear, 2004) and is supported in VoiceXML, cf. Larson (2002). More advanced approaches consider response generation as a planning process, which starts with a communicative goal and returns a text message based on linguistic rules (Reiter and Dale, 2000).

The prompt constructed by the response generation is transformed into the speech signal by the last module in the pipe. The system output is either pre-recorded canned speech or completely synthesized at runtime. It can also constitute a mixture of both. In the example above, the plain text could be pre-recorded and the variable content ($action and $time) could be pre-recorded or generated by a TTS engine. However, a mixture of canned and synthesized speech sounds unnatural and patchy and might not be accepted by the listener (McTear, 2004). A TTS engine converts a text in a two-stage process: First, the text is analyzed and translated into a linguistic representation. The text analysis usually converts the text to a sequence of phonemes or diphones by using a dictionary. Second, the linguistic representation is enriched with prosody information (including rhythm, intonation, loudness, and tempo), and afterwards the speech waveform is generated (McTear, 2004). The major challenge in TTS synthesis is prosody modeling. Poor prosody modeling leads to unnatural-sounding waveforms and poor understanding.
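The template-filling step described above maps directly onto Python's built-in string templates; the following sketch merely illustrates the mechanism and is not the response generation of any particular SDS:

# Minimal sketch of prompt-template filling: the DM delivers the
# database values, which are inserted into the predefined prompt text.
from string import Template

template = Template("There is the TV series $action shown on TV at $time.")
print(template.substitute(action="Knight Rider", time="8:15 pm"))
# -> "There is the TV series Knight Rider shown on TV at 8:15 pm."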
This Section gave an overview of the architecture of an SDS and briefly described the functionality of the different modules. In this thesis, different SDS concept prototypes are designed and realized. The next Section elaborates on the fundamentals of dialog management, which are crucial for the design and implementation of an SDS.

2.1.2 Fundamentals of Dialog Management

The core component of an SDS is the dialog management component, also known as the dialog manager (DM). As previously mentioned, the DM keeps track of the information relevant to the dialog, reacts to the input of the user and decides how the dialog continues. Before developers start designing a new SDS, decisions about the applied dialog management modeling technique have to be made, which influence the dialog flow as well as the quality and usability of a speech dialog. This Section elaborates on the basic concepts of dialog management and describes the dialog design strategies in detail. Furthermore, the advantages and disadvantages of the different modeling techniques are described. First, different dialog initiative strategies are illustrated, followed by the presentation of different dialog control mechanisms. Subsequently, the task of keeping track of the dialog context is explained. Finally, existing methods to provide the user with acoustic feedback are described.

Dialog Initiative

The goal of a dialog between humans is to exchange some piece of information. In the course of the dialog, the initiative may switch between the participants. For example, at one point in the dialog person A tells a story or asks many questions and person B just listens or shows his interest by backchanneling. Later, the roles might be reversed and person A is the listening partner. In a human-human conversation, each participant might introduce new topics during the dialog and thereby the initiative is fairly distributed. In contrast, in human-machine interaction (HMI) the initiative usually does not switch as frequently as in human-human dialogs. Thus, dialog initiative in SDSs can be categorized as follows: system-directed, user-directed, and mixed-initiative (McTear, 2004). The differences between the three types of dialog initiative are further explained in the following.

System-directed Dialog

In system-directed dialogs the user is guided by the system throughout the entire dialog. The system asks concrete questions, which the user has to answer in order to proceed with the dialog. Most of today's SDSs are system-directed. The following example presents a system-directed dialog, where the system requests the required parameters for a hotel booking step by step:

1 System: Where would you like to book a hotel?
2 User: In Berlin.
3 System: When do you arrive?
4 User: Today.
5 System: When do you depart?
6 User: On Monday, 25th.

The system prompts in a system-directed dialog are generally designed in such a way that the user's input utterances consist only of single words or short phrases. Thereby, the grammar and the vocabulary stay constrained and can be specified in advance according to the most probable utterances. If the user's input is constrained, the speech recognition and language understanding outputs are likely to be more accurate. However, a system-directed dialog is very inflexible. The user is restricted to only a few words or phrases and is not able to take the initiative in the dialog to ask questions or start new topics (McTear, 2004).

User-directed Dialog

In user-directed dialogs the user takes the initiative and decides about the next step in the dialog. The role of the SDS is to interpret and answer the user's requests. McTear (2004) compares a user-directed dialog with a speech interface to a database, which allows users to make spoken queries. The following TV information service example illustrates this kind of initiative:

1 User: Which TV shows featuring David Hasselhoff do you have?
2 System: Knight Rider and Baywatch.
3 User: Who else featured in Baywatch?
4 System: Pamela Anderson, Jeremy Jackson, Yasmine Bleeth, ...
5 User: Ok, I'd like to watch Baywatch.

The speech dialog mainly consists of the user's requests and the respective answers of the system. If the user's request is unclear, the system might ask some clarification questions. As the user is free in the choice of words and phrases, flexible grammars and a huge vocabulary are needed, which poses great challenges to the ASR and NLU modules. In order to achieve a successful dialog, the user needs to be aware of the utterances the SDS is able to understand (McTear, 2004).

Mixed-initiative Dialog

The third dialog initiative variant is a mixture of the two previously introduced strategies. Here, the dialog initiative can be taken by the system or the user and can switch during the speech interaction. Both participants can ask questions, request clarification, introduce new topics, etc. The hotel booking example below illustrates this kind of dialog initiative:

1 User: I would like to book a hotel in Berlin.
2 System: When do you arrive in Bern?
3 User: Change the city to Berlin.
4 System: Ok, Berlin. When do you arrive in Berlin?
5 User: Today.
6 System: When do you depart?
7 User: The day after tomorrow.

In this example, the user first makes his booking request (1). Subsequently, the system takes the initiative and requests the arrival date (2), which is one of the missing parameters needed to proceed with the booking. As a recognition error has obviously occurred ("Bern" instead of "Berlin"), the user takes the initiative again and, instead of answering the system's question, corrects the destination (3). At utterance (4), the system re-requests the missing information and continues the dialog. The term mixed-initiative dialog is also used for dialogs in which the SDS has the overall control of the dialog but allows the user some advanced flexibility to provide more information than requested by the system:

1 User: I would like to book a double room in Berlin.
2 System: When do you arrive in Berlin?
3 User: I arrive today and leave the day after tomorrow.

The user already provides several values in his initial utterance. Afterwards, the system asks for the missing values. In the following response (3), instead of only answering the system's question, the user gives more information than requested and thereby over-answers the system's request. If the system is able to interpret the additional information, the dialog is more efficient. As mentioned before, in human-human conversations the initiative is fairly distributed, which is why mixed-initiative dialogs appear more natural. However, flexible grammars and dialog control modeling are needed in order to allow this flexibility (McTear, 2004).
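Interpreting such over-answering utterances means extracting several slot values from one utterance. The following Python sketch shows the idea with a hypothetical keyword-based slot lexicon, which is far simpler than the grammars a real NLU module would need:

# Minimal sketch of multi-slot extraction for over-answering users.
# The slot lexicon is invented for illustration only.
SLOT_LEXICON = {
    "destination": ["berlin", "hamburg"],
    "room_type": ["single room", "double room"],
    "arrival": ["today", "tomorrow"],
}

def extract_slots(utterance):
    """Return every slot whose trigger phrase occurs in the utterance."""
    text = utterance.lower()
    return {slot: phrase
            for slot, phrases in SLOT_LEXICON.items()
            for phrase in phrases
            if phrase in text}

print(extract_slots("I would like to book a double room in Berlin"))
# -> {'destination': 'berlin', 'room_type': 'double room'}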

The selection of the appropriate strategy depends on the target group and the application domain. According to Peissner et al. (2011), novice users require guidance by the system, as they are not familiar with the SDS. After a while, these users would like to take the initiative and use their knowledge about the system to quickly achieve their goals. In dual-task scenarios, users' preferences among the dialog initiative strategies might differ, since they have to perform a primary task in parallel. When users have to perform a secondary task by speech, they might prefer a simple and clearly directed dialog with low mental demands, which would speak for a system-directed dialog strategy. However, this dialog strategy requires many dialog steps and demands the user's attention for a long time. A user-directed dialog allows users to speak freely but might require many clarification dialogs, due to the high demands on the grammar and the vocabulary, and might thereby frustrate the user. Using a mixed-initiative dialog, the user can input multiple parameters and thereby speed up the dialog. However, this dialog strategy is more complex and mentally more demanding, which might have a negative effect on the primary task. Due to these uncertainties, it is important to find out which strategy is the most appropriate in a dual-task scenario.

Representation of Dialog Control

The dialog management of an SDS can be distinguished not only by the type of dialog initiative but also by the method for representing and implementing the dialog flow. The three different methods for modeling the dialog control are: finite state-based, frame-based, and agent-based (McTear, 2004). The differences between these methods are explained in detail in the following.

Finite State-based Dialog Control

In finite state-based dialog control the dialog flow is determined in advance. The dialog flow can be represented as a state transition network or graph. The nodes represent the system's prompts and the transitions represent the paths through the speech dialog. Such a graph specifies all possible paths throughout the dialog. Each node represents a state in the dialog, in which some information is requested or shall be confirmed by the user. Figure 2.4 illustrates a graph which determines the speech dialog for a TV information service.

Fig. 2.4 Dialog graph for a TV information service, consisting of the states "TV Show?", "Did you say <TV Show>?", "Day?" and "Did you say <Day>?" with yes/no transitions.

According to the network in Figure 2.4, the SDS asks the user about the preferred TV show, followed by a confirmation question. If the user answers "yes", the system continues by requesting the day on which the TV show shall be broadcast. In case the user answers "no", the system goes back to the previous question. The process continues until the whole graph is traversed (McTear, 2004). The advantage of this approach is its simplicity. Well-structured dialogs, as found in a system-directed initiative, can be modeled by state transition networks. Toolkits, such as the CSLU toolkit (Sutton et al., 1998), provide a graphical representation of the dialog flow, which makes it easy for the developer to design the dialog and to maintain the overview in larger projects. However, finite state-based dialog control is very inflexible, since all the possible dialog paths are specified in

advance. Thus, the system cannot deal with unforeseen deviations from those paths. Problems arise if the user would like to correct an input parameter or gives information that was not considered at the time the dialog was designed. Despite these weaknesses, finite state-based dialog control is widely used in commercial systems, as it lowers the technical demands on the other SDS modules. The strict guidance throughout the dialog constrains the possible user utterances and thereby simplifies speech recognition (McTear, 2004). VoiceXML can be used to implement system-directed dialogs. In VoiceXML, a speech interaction involves the processing of a form, which consists of several fields that have to be filled. In a system-directed dialog, the fields are requested in sequence and filled one after the other (Oshry et al., 2007).

Frame-based Dialog Control

Similar to finite state-based dialog control, frame-based dialog control is suitable for form-filling tasks, where several input parameters need to be requested. Finite state-based SDSs follow a fixed dialog flow, whereas in frame-based dialog systems the order of the questions is not specified in advance. A frame-based system gathers information from the user and collects the given input. If some information is not yet provided, the system prompts the user for the missing information until all required form fields are filled. Mixed-initiative dialogs are implemented by applying the frame-based dialog control method. Frame-based dialog control requires three components in order to allow the flexibility in the dialog:

- a frame, which specifies the required input parameters and keeps track of the provided information,
- an extended ASR grammar, in order to allow for flexible user input, and
- a control algorithm, which evokes the next question to be asked to fill the missing contents of the frame (McTear, 2004).

Figure 2.5 presents a frame for a hotel booking dialog and illustrates the process of a frame-based dialog based on a sample dialog:

1 System: Where would you like to book a hotel?
2 User: I need a double room in Berlin.
3 System: On which day would you like to arrive in Berlin?

Hotel Booking Frame:
Destination: Berlin
Arrival Date:
Departure Date:
Room Type: double
Room Number:

Fig. 2.5 Frame-based dialog control of a hotel booking service.

At the beginning of the interaction the frame is empty and the system prompts the user for the first input parameter (1). The user over-answers the question and indicates the destination and the room type (2). In the hotel booking frame these two fields are filled, and the system continues prompting for the empty fields until all the required information is provided by the user (3). Frame-based dialogs are difficult to express formally, as the dialog flow is not predetermined and does not follow a fixed plan. One possibility is to describe the dialog structure as a state chart, using a standardized language such as State Chart XML (SCXML) (Barnett et al., 2013). SCXML is an XML-based markup language which provides a generic state-machine-based execution environment. The frame-based approach offers several advantages over the finite state-based approach. The user benefits from a greater flexibility, as the dialog flow is less constrained and offers multiple

slot-filling abilities. Thereby, over-informative answers can be processed and the number of dialog steps shrinks, resulting in a more natural and efficient dialog. However, the flexibility poses high challenges to the grammar and the ASR, since all possible input parameter combinations have to be taken into consideration (McTear, 2004). Even if the system prompts have been designed carefully, it is difficult to constrain the user in his responses to the requests (Eckert et al., 1995). There are different approaches which allow the implementation of mixed-initiative dialogs using frame-based dialog control. VoiceXML offers the functionality to collect several pieces of information from the user's initial response. After the first response, the dialog becomes system-directed and the system prompts the user for the missing fields in sequence. The Philips SpeechMania platform allows a more flexible mixed-initiative dialog, since the user can fill multiple fields at any point in the dialog, as long as the currently active grammar rule permits it (McTear, 2004). The rule-based TrindiKit (Larsson and Traum, 2000) allows the implementation of a mixed-initiative dialog as a task-oriented dialog. The TrindiKit is based on the Information State Update approach, where a dialog consists of several information states and update rules to alter the states (Traum and Larsson, 2003).

Agent-based Dialog Control

Agent-based dialog control aims at modeling the dialog as communication between intelligent agents and applies artificial intelligence techniques. This kind of dialog control is appropriate for solving problems or tasks. The communication between two agents is characterized by intelligent and cooperative behavior:

1 User: I would like to watch a movie with David Hasselhoff. Are there any today?
2 System: No, there are no movies featuring David Hasselhoff today. However, there is the TV show Knight Rider featuring David Hasselhoff today. Would you like to watch this show instead?

In this example the user asks for a movie featuring a certain actor (1). The system's answer is negative, as there are no movies which match the user's expectations. But instead of simply answering "no", the system cooperatively provides the user with another option which might match his needs (McTear, 2004). The implementation of spoken dialog agents, which allow complex task solving in a robust, advanced mixed-initiative dialog, poses high challenges to the ASR, NLU and DM components. Several attempts have been made to implement agent-based dialog control. The CU Communicator of the DARPA Communicator systems, which aims at developing robust, full mixed-initiative dialogs, is event-driven and based on a hierarchical set of frames, similar to the frame-based architecture (Pellom et al., 2001). The Conversational Architecture Project at Microsoft Research aims at building human-like SDSs by viewing dialog as a joint activity and using Bayesian user models to interpret user goals based on conversational user utterances (Horvitz and Paek, 1999; McTear, 2004).

In the research work of this thesis, different speech dialog concepts are designed, implemented and evaluated. One speech dialog concept is the system-directed dialog. Its dialog flow is determined in advance, which is why the dialog control of this prototype can be represented by the state transition networks described in this Subsection. A further speech dialog concept is based on a mixed-initiative dialog.
The dialog flow is not predetermined and does not follow a fixed plan. The applied frame-based dialog control is described as a state chart. The agent-based dialog is not investigated in the research of this thesis and has only been described here for the sake of completeness.
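To make the frame-based control algorithm tangible, the following Python sketch always prompts for the first still-empty slot of the frame, no matter in which order the user has filled the slots; the slot names and prompt texts are illustrative and not taken from the prototypes of this thesis:

# Minimal sketch of a frame-based control algorithm: the frame lists
# the required slots, and the next prompt targets the first empty slot.
FRAME = ["destination", "arrival_date", "departure_date", "room_type"]
PROMPTS = {
    "destination": "Where would you like to book a hotel?",
    "arrival_date": "When do you arrive?",
    "departure_date": "When do you depart?",
    "room_type": "Which room type do you need?",
}

def next_prompt(slots):
    """Return the prompt for the first missing slot, or None if the frame is complete."""
    for slot in FRAME:
        if slot not in slots:
            return PROMPTS[slot]
    return None

# After the over-answer "I need a double room in Berlin" two slots are filled:
print(next_prompt({"destination": "Berlin", "room_type": "double"}))
# -> "When do you arrive?"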

Dialog Context

The DM has to keep track of the information which is relevant to the dialog in order to make a successful speech dialog possible. This involves keeping track of the information given in previous utterances and interpreting the current user input in relation to this information. The interpretation task also involves the resolution of anaphoric references and ellipses. Depending on user preferences, an utterance may be interpreted in different ways (Jokinen and McTear, 2009). Especially the use of short utterances requires context resolution:

1 User: I would like to book a single room in Berlin.
2 System: When would you like to arrive?
3 User: I arrive today.
4 System: When would you like to depart?
5 User: Tomorrow.
6 System: Ok, one moment, please.
[...]
8 System: There is the NH Berlin Mitte for 110 Euro per night.
9 User: Is there a cheaper one?
10 System: There is the Ibis Berlin Mitte for 80 Euro per night.

In line (4) of the example above, the system prompts the user for the departure date. Subsequently, the user provides the required date information. However, the short utterance (a so-called "ellipsis") only specifies a date but does not indicate whether the user talked about the arrival or the departure date. In order to prevent such misunderstandings, the system has to keep track of previously given information and previously asked questions. As the user has already provided the arrival date, which has been collected by the system, the system knows that the user meant the departure date. In the further course of the dialog, the SDS presents the user a list of possible hotels. As the first hotel is too expensive (8), the user requests a cheaper hotel without using the word "hotel" or a synonym (9). The user's intention can only be understood by taking the context of the previous system prompt into account, and the DM has to be able to resolve such anaphoric references across user and system utterances.

Interpreting context-sensitive information requires the design of a context model, which stores the information elicited from the user and information about the dialog progress, and which contains knowledge about the environment. According to Jokinen and McTear (2009), the following knowledge sources may be involved in the process of resolving context in a dialog:

- dialog history: contains information about the propositions and the entities elicited during the dialog,
- task record: represents the information that has to be gathered in order to proceed with the dialog (e.g., a frame in frame-based dialog control),
- domain model: contains specific information about the domain (e.g., hotel information), which is often encoded in a database or represented as an ontology,
- model of conversational competence: involves knowledge of the basic principles of conversational behavior (e.g., turn-taking or discourse obligations),
- user preference model: contains personal information about the user (e.g., age, gender, preferences) and assumed knowledge (e.g., beliefs, intentions), which may be relevant to the dialog.

Depending on the chosen dialog strategy, these knowledge sources are applied in different ways and to different degrees. In graph-based systems, the context information is represented in the states and transitions. The data collected from the user is often stored in a database. A user preference model can be used to extend the dialog graph with elements and several paths leaving such an element. For

instance, if the user arrives at a certain element in the graph, the system could apply a mechanism to look up in the user model whether the user already has experience in using the system. Depending on the user experience, different paths (e.g., with more or less verbose instructions) could be chosen. Frame-based systems encode context information as simple sets of attribute-value pairs in the slots of the frames which represent the dialog task. Other approaches encode context information in more complex data structures. Goddeau et al. (1996) proposed a frame-based system using E-forms, which allows different priorities to be associated with each slot for different users. The priorities determine the order in which the system elicits information from the user (Jokinen and McTear, 2009). A more complex data structure used to control the dialog is the schema, which has been developed in the Carnegie Mellon Communicator system (Rudnicky et al., 1999). The schema describes a task-based strategy for dialog management, similar to the determination of an itinerary, which is represented as a hierarchical data structure. The itinerary is constructed over the course of the dialog. Although there is a default sequence of actions, the task-based strategy allows a cooperative task-solving process between the user and the system (Jokinen and McTear, 2009).

The SDS prototypes developed in this research work are designed and implemented to interpret the dialog context described in this Subsection. These prototypes especially keep track of the dialog history and of the task which has to be accomplished. Furthermore, specific information about the domain is included and the basic principles of conversational competence are followed. User preferences are not taken into consideration.

Grounding

During the speech interaction, the DM can be confronted with typical SDS problems, which need to be solved smoothly in order to achieve a successful speech dialog. Based on the output of the NLU module, the DM decides how to continue with the speech dialog. However, the output of the NLU might not correctly represent the actual input of the user for various reasons (McTear, 2004):

- The speech recognizer did not detect speech in the audio signal, although the user had spoken.
- Due to noise in the environment, the ASR detects a speech signal although the user did not speak any word.
- Only a part of the speech signal is recognized and forwarded to the NLU module. The first part of the user's utterance can be missing if the user starts speaking before the speech recognizer has started recording. This often occurs if the user begins speaking before he receives the signal to start (e.g., a beep). The end of the user's utterance might be missing if the speech recognizer stopped recording too early - typically, if the user lowers his voice at the end of the utterance.
- The whole utterance is recorded, but the speech recognition engine incorrectly recognizes some words.
- The speech recognition engine correctly recognizes all words, but the NLU module does not correctly interpret the meaning of the sentence.

If one of the problems above occurs, common ground needs to be established in order to ensure that the system has correctly understood the user's intention. There are several methods by which a DM can react in order to establish common ground. The system's grounding techniques differ depending on whether an ill-formed or incomplete input has been detected or no error has been detected.
If no error has been detected, the system should still verify whether it has correctly understood the user. The different methods are described in the following.

Clarification Subdialogs

In case of a detected error, the system could request a reformulation of the utterance. However, this method does not distinguish between the different reasons for an error and relies on the user being able to reformulate his request. If the problem which caused the ill-formed or incomplete NLU output can be determined, the system can address the problem explicitly. For example, if silence has been detected, the system could prompt the user with a message such as:

1 System: I am sorry, I could not hear you. Could you speak a little louder, please?

Most dialog design toolkits provide facilities for handling these problems. For example, VoiceXML provides a <noinput> event, which is thrown if silence is detected. The CSLU toolkit provides a default repair dialog, which is triggered if silence is detected or the input cannot be interpreted (McTear, 2004).

Verification

If no error requiring clarification is detected, the system still has to verify that it has correctly understood the user's intention. In human-human communication it is common that important information, such as address or account data, is confirmed. This process helps to ensure that the user has been understood and that common ground has been established between the dialog partners. Particularly in speech-based HMI, the process of grounding is crucial to a successful dialog performance, since many recognition or understanding errors can occur. Two methods may be applied to verify that the system has understood what the user said:

1. Explicit Verification
Using this method, the system explicitly requests confirmation of the previously requested input:

1 System: Where would you like to book a hotel?
2 User: In Berlin.
3 System: So you would like to book a hotel in London?

This method is very robust for confirming parameters. However, the speech dialog can become very tedious if several values have to be collected and afterwards confirmed, as the number of dialog steps increases with every requested parameter. If several values have to be confirmed, McTear (2004) suggests deferring the confirmation until all parameters have been collected and then explicitly asking the user about the possibly incorrect parameters:

1 System: Where would you like to book a hotel?
2 User: In Berlin.
3 System: When do you arrive?
4 User: Today.
5 System: So, you would like to book a hotel in Berlin for tomorrow?
6 User: No.
7 System: Which value would you like to correct: the destination or the arrival date?
8 User: The arrival date.
9 System: When do you arrive?

Applying this strategy, the user needs to know that he is only permitted to answer with "yes" or "no" to the confirmation request (5). If the user should be allowed to indicate the values that need to be corrected in this dialog step, a complex grammar is needed. This confirmation strategy is fairly robust; however, the dialog can become very complex if more than one value needs to be corrected.

2. Implicit Verification
Using the implicit verification method, the system embeds a repetition of the understood user utterance in the next prompt. The user is able to correct the repeated information or simply answers the next prompt. Thereby, the embedded information is implicitly confirmed:

1 System: Where would you like to book a hotel?
2 User: In Berlin.
3 System: When do you arrive in Berlin?
4 User: Today.

In the example above, the destination has been implicitly confirmed, as the user answers the follow-up question containing the implicit confirmation (3). This confirmation strategy decreases the number of dialog steps, because the number of system questions is reduced. Furthermore, when the user's input is implicitly confirmed, the dialog is more similar to human-human communication and has a more natural flow. When designing this kind of confirmation dialog, the following issues should be taken care of:

- implicit verification requests offer a wide range of possible user responses and therefore pose a great challenge to ASR and NLU, as the user is also able to combine several values,
- the number of possible verification questions rises with the number of parameters the application requires,
- implicit verification relies on the fact that the user will correct the SDS if a value is incorrect; users might not realize that an error occurred or might think that they can correct the input later (McTear, 2004).

McTear (2004) suggests combining both verification strategies. First, the SDS should try to establish common ground using implicit verification. If the implicit verification method fails, the system should move to explicit verification. As suggested by McTear (2004), the designed speech dialog concepts of this thesis employ both verification strategies. When the speech dialog works well and the system successfully elicits information from the user, the system implicitly confirms the user utterances by embedding the understood information in the next prompt. If a parameter has been misunderstood and the user asks for correction, the system explicitly asks the user which parameter he would like to correct.

Complex dialog models and dialog modeling techniques for grounding have been investigated by various researchers (e.g., Clark and Schaefer, 1989; Jokinen, 1996; Traum, 1999). For instance, Jokinen's cooperative dialog management approach (Jokinen, 1996) suggests using general principles of rational and cooperative communication in the form of inference rules in order to generate adequate responses to establish common ground. However, since the complex modeling of how to achieve mutual understanding in human-machine speech dialogs is not in the focus of this research work, this topic is not further elaborated here.

The described methods of modeling the dialog management strongly influence the dialog flow, the quality and the usability of a speech dialog. Depending on the provided features and functionalities, an SDS can appear more or less human-like. An SDS which displays human-like character traits and competences is often called a conversational SDS. However, there are different understandings of what conversational speech in human-human communication is and of which features and which behavior make an SDS conversational.
Complex dialog models and dialog modeling techniques for grounding have been investigated by various researchers (e.g. Clark and Schaefer (1989), Jokinen (1996), Traum (1999)). For instance, Jokinen's cooperative dialog management approach (Jokinen, 1996) suggests using general principles of rational and cooperative communication in the form of inference rules in order to generate adequate responses and establish common ground. However, since the complex modeling of how to achieve mutual understanding in human-machine speech dialogs is not in the focus of this research work, this topic is not further elaborated in this subsection.

The described methods of modeling the dialog management strongly influence the dialog flow, the quality and the usability of a speech dialog. Depending on the provided features and functionalities an SDS can appear more or less human-like. An SDS which displays human-like character traits and competences is often called a conversational SDS. However, there are different understandings of what constitutes conversational speech in human-human communication and of which features and which behavior make an SDS conversational. As one of the prototypes developed in this research work is considered a conversational speech interface, the next Section is dedicated to describing conversational speech and explaining the author's understanding of the competences of a conversational speech interface, which is applied in this thesis.

Conversational Speech Interfaces

HAL: Excuse me, Frank.
Frank: What is it, HAL?
HAL: You got the transmission from your parents coming in.
Frank: Fine. Put it on here, please. Take me in a bit.
HAL: Certainly.

Quote from Stanley Kubrick's 2001: A Space Odyssey (1968). The computer HAL 9000 speaks to Frank, while he is relaxing on his sun bed, approximately 1:00 hour into the film.

Stanley Kubrick (1928-1999) already predicted in 1968 a computer with human-like qualities. In his science fiction movie 2001: A Space Odyssey (Kubrick, 1968) the computer HAL 9000 possesses conversational language competences for understanding and speaking as well as artificial intelligence, which allows for logical reasoning and self-initiated behavior. These abilities give him human character traits and allow the user to interact with the machine as if he were talking to a human being (Strauss and Minker, 2010). Another vision of an intelligent, human-like machine is K.I.T.T. from the American television series Knight Rider (1982-1986) (Bilson et al., 1982). K.I.T.T. is an advanced, artificially intelligent car, which is the sidekick of the human hero Michael Knight. Similar to HAL 9000 the car has its own personality and acts like a human being. The conversations between the computer and the driver resemble human-human communication, and by making ironic comments K.I.T.T. even demonstrates human character traits.

A human-like computer like K.I.T.T. or HAL has not been developed to date. However, human characteristics appear more and more in today's computer systems, especially in speech-based HMI. Apple's Siri, for example, also has its own personality, can be addressed by name and makes funny comments on silly questions. Conversational speech interaction plays an important role and is still a hot research topic. However, when talking about conversational speech or natural speech and the competences of conversational or natural speech interfaces, people's views and expectations differ depending on their background and their research field. Therefore, this Section first illustrates the characteristics of conversational speech in human-human communication from an SDS developer's point of view. Afterwards, different basic design principles, which have to be obeyed in order to design conversational speech interfaces, are presented. Finally, the author's understanding of a conversational speech interface and its competences is described, which is applied in the remainder of the thesis.

Conversational Speech in Human Dialog

The term conversational speech is widely used in many research fields. When looking up literature one also stumbles upon terms like natural speech (Zimmerer, 2009), natural language (Thomson and Wisowaty, 1999) or spontaneous speech (Ward, 1989). They all seem to address phenomena occurring in human-human conversations but use their own definition within their own context. In order to help the reader understand the meaning of conversational speech in human dialog, this Section classifies conversational speech dialog phenomena, which have to be taken into consideration when designing and implementing an SDS.

People's easiest way to communicate is simply to speak to each other. Depending on the purpose of the dialog, humans try to convey meaning, try to find solutions to a problem or simply chat to maintain social relationships (Zimmerer, 2009; McTear, 2004).
The language people speak differs considerably depending on the situational context and the setting. For example, the language a professor uses when honoring a student at a graduation ceremony differs from the language he uses when chatting to his wife.

Furthermore, sociological factors, such as the geographical place or social background where one grew up, have a strong influence on the way one speaks. Although people increasingly communicate via computers (e.g. writing e-mails or chatting), human-human conversations are still the most common and the most natural kind of language use (Zimmerer, 2009). Zimmerer (2009) claims that conversational speech is the most frequently produced speaking style and thus the one listeners have to deal with most of the time. The assumption that conversational speech is characterized by the speaker's intention to produce speech with as little effort as necessary is widely accepted (e.g. Zimmerer (2009), Lindblom (1990)). Conversational speech differs in many respects from prepared speech, such as text read aloud (Strangert, 2004). In order to demonstrate the variety of phenomena occurring in conversational speech, the transcript of a telephone conversation between an agent (A) and a client (C) is presented in Table 2.1. Subsequently, the different phenomena are characterized and further described.

Table 2.1. Telephone conversation transcript, taken from Zue and Glass (2000).

C: Yeah, [umm] I'm lookin for the Buford Cinema.        (disfluency, pronunciation variation)
A: Ok, and you wanna know what's showing there or...?   (pronunciation variation, interruption)
C: Yes, please.
A: Are you looking for a particular movie?
C: [umm] What's showing?                                (disfluency, initiative switch)
A: OK, one moment.                                      (back-channel)
   ...
A: They're showing A Troll in Central Park.             (initiative switch)
C: No.
A: Frankenstein.                                        (ellipsis)
C: What time is that on?                                (reference)
A: Seven twenty and nine fifty.
C: Ok, and the others?                                  (sentence fragment)

The transcript shows some of the conversational speech phenomena. Disfluencies, differences in pronunciation, interruptions, initiative switches, ellipses and references are only a few examples of what can occur in human-human conversations. The variation occurring in conversational human-human speech dialogs concerns both the way humans interact in a dialog and the speech production itself. Figure 2.6 illustrates which aspects have to be taken into consideration in order to characterize conversational speech. The collection was made by the author based on literature research and focuses on characteristics relevant to conversational speech interfaces.

[Figure 2.6: Conversational Speech Classification. Human Dialog Phenomena comprise cooperativeness in dialog (negotiation ability, mixed initiative, sensitive turn taking, grounding, correction), social behavior (non-verbal cues, personality, adaptiveness) and context knowledge (context-dependent utterances, contextual situation, conversational implicature, presuppositions and misconceptions, intentional answers and generalization). Spontaneous Speech Phenomena comprise prosody (rhythm, speed, intonation, pronunciation of words) and sentence construction (ungrammatical and incomplete sentences, flexible sentence construction, disfluencies).]

The occurring speech and dialog phenomena and characteristics are described in detail in the following.

Spontaneous Speech Phenomena

In human-human conversations the speech production differs from canonical or read speech in terms of prosody and sentence construction. In the literature these variations are often referred to as phenomena occurring in spontaneous speech or conversational speech (e.g. Ward (1989), Hofmann et al. (2012c), Strangert (2004), Dufour et al. (2009)) or sometimes a mix of both, spontaneous conversational speech (e.g. Husin et al. (2012)). In order to reduce confusion, in the following we will use the term spontaneous speech when addressing variations in speech production.

"Any speaker of any human language can produce and understand an infinite number of sentences" (Fromkin et al., 2003, p. 117). Generally, human spoken language allows speakers to create sentences with the same meaning in many possible ways, e.g. by using synonyms, reordering constituents or even concatenating phrases.

In spontaneous speech, the phrases humans try to speak are produced on the fly and without planning. Phrases are filled with discourse particles (e.g. "like"), hesitation sounds (e.g. "ahm") and pauses, which are used to structure the sentence and have no semantic meaning. These spontaneous speech phenomena occur when the speaker is searching for words or has to rethink what to say next. They often even lead to syntactically ill-formed sentences. Furthermore, as the phrases are not well thought out in advance, phrases are restarted or words are repeated; people even tend to combine words or leave words out while they speak spontaneously (Hofmann et al., 2012c; Strangert, 2004). These planning problems also have a strong influence on the prosody of the sentence. In spontaneous speech people pronounce words differently and the intonation changes (Hofmann et al., 2012c). Depending on their origin, people speak in their own dialect or use different words for the same meaning (Zimmerer, 2009). The emotional state also influences the language used (Dufour et al., 2009). For example, anger and happiness in speech are normally characterized by a high mean pitch and a higher speech rate, whereas sadness is characterized by a low mean pitch and a slower speech rate (Yildirim et al., 2004).

Due to these difficulties spontaneous speech poses great challenges to today's speech recognizers. In NIST's 2003 benchmark tests, the best systems achieved word error rates (WER) as low as 10% when recognizing speech recorded under planned studio conditions. The benchmark test revealed that the WERs of the spontaneous part of the test sets were almost double those of the portion recorded under perfect conditions (Sakti et al., 2009). Among all the presented difficulties, Riley et al. (1998) consider multiple pronunciation variants to be the major problem of spontaneous ASR. With the help of recorded and annotated spontaneous speech data (Godfrey et al., 1996; Mark et al., 2005), research puts a lot of effort into improving spontaneous ASR (Davel and Barnard, 2003; Chen and Hasegawa-Johnson, 2004; Jang, 2006; Bates et al., 2007; Sakti et al., 2008; Hofmann et al., 2012c).

Human Dialog Phenomena

Human dialog is a collaborative activity between two or more partners. In the course of the dialog the partners negotiate but do not have to agree with each other. In human-human conversations the dialog initiative is fairly distributed and switches between the dialog partners, and new topics are often introduced in the course of the dialog. A dialog consists of several turns. The turns of each participant are negotiated on a turn-by-turn basis according to a complex set of rules (McTear, 2004; Sacks et al., 1974). Generally only one dialog partner speaks at a time. When the participant has finished speaking, after a short gap, the other participant takes his turn.

However, sometimes listeners interrupt their dialog partner and start speaking although the dialog partner has not finished speaking yet. As speech is sometimes imprecise or ambiguous (e.g. due to generalized answers), misunderstandings and uncertainties can occur (Berg, 2013). Therefore, speakers establish common ground to confirm that what has been said has been understood. This evidence can be provided by listeners through back-channeling, which is either a simple indication of attention (short utterances such as "uh-uh" or "hm") or an explicit acknowledgment (such as "ok" or "right"). Furthermore, common ground can be established by giving feedback, which repeats the statement of the speaker. Often the content is embedded implicitly in the feedback (McTear, 2004). If the speaker realizes that misunderstandings prevail, correction dialogs are initiated to clarify these misconceptions (Berg, 2013). If a listener feels uncertain about what he has understood, he can request the information once again.

Furthermore, people's utterances are often based on implications and assumptions, which affect the dialog flow. For instance, these implications and assumptions can be influenced by their background or their context. Dialog context is also very important for the listener in order to be able to interpret what the other participant has said. For example, if two people are conversing and another person joins the conversation later, he might not understand what the conversation is about, since he missed the preceding utterances. As described before, especially short utterances require context resolution. Ellipses and anaphoric references are common in human dialog and need to be resolved (Jokinen and McTear, 2009).

Furthermore, the relation between the dialog partners, the knowledge about each other and the contextual situation are very important (Zimmerer, 2009). People adapt their way of communicating and expressing themselves to their dialog partner and to their current situation (Berg, 2013). People's personalities impact the way they communicate and express themselves. Two business partners might speak differently to each other than two old friends drinking a beer in a bar (Zimmerer, 2009). The different personalities and the relationships to each other affect the way the dialog partners address each other and whether the conversation is rather formal or informal. Furthermore, when people make conversation, they communicate non-verbally in addition to their voice. These non-verbal cues, such as facial expressions, postures and gestures (Mohammadi and Vinciarelli, 2010), can convey information intentionally or unintentionally (McTear, 2004). Silence or pauses within one turn may occur due to unplanned spontaneous speech. However, at the end of a turn this may also indicate that the speaker has finished his turn and that the other participant may take the floor (McTear, 2004). Sometimes the occurrence of silence is interpreted by the other dialog partner, as illustrated in the following example, taken from Levinson (1983):

1 A: So I was wondering would you be in your office on Monday?
2 ...pause...
3 A: Probably not.
4 B: Hmm, yes.

In this example person A asks person B if he will be in his office on Monday. As person B does not answer directly, person A interprets this pause himself and continues with "probably not". The short pause is sufficient to trigger the inference that B might respond negatively (McTear, 2004).
Developers of SDSs try to model speech-based HMI as human-like and user-friendly as possible. However, people talk differently when they talk to machines than when they talk to a human dialog partner (Shechtman, 2002; Stent, 2001). At present research has not reached a decision on how a conversational dialog between humans and machines should be modeled. However, based on a set of rules and characteristics, there is a mutual understanding of which general conversational competences a conversational SDS should feature. The next Subsection presents principles of conversational SDSs, which help to design and characterize the capabilities of SDSs. Furthermore, the author's understanding of the competences of a conversational SDS is presented.

Conversational Speech Dialog System Design

When talking about conversational or natural SDSs, the definitions and the understandings of such a system's capabilities differ. The terms conversational and natural are somewhat problematic and not clearly defined (Edlund et al., 2008). For instance, as mentioned before, especially in research in the United States the term conversational is frequently used for advanced human-like SDSs (McTear, 2004). But what are the strengths and limitations of an advanced human-like SDS? The term natural is often (mistakenly) used to describe the TTS synthesis qualities or the capabilities of the ASR and NLU modules. However, this is not correct, since a simple command-and-control system (e.g. speech control in rooms via short commands) also understands human natural speech to a minor degree but lacks flexibility in the dialog (Berg, 2013). SDSs do not only differ in their input and output capabilities but also in their dialog behavior. Therefore, one often reads that an interaction with a natural or conversational SDS should resemble human-human speech communication (Berg, 2013; Edlund et al., 2008; Stent, 2001). For example, Jokinen (2003, p. 1) describes computers with natural interaction capabilities as "intuitive human-computer interfaces that mimic human communication". In the remainder of the thesis the term conversational SDS is used to mean an advanced SDS with human-like speech interaction capabilities.

To this day research has not reached a decision about how exactly a conversational dialog between humans and machines should be modeled. However, some basic principles and design characteristics exist, which help to develop and to compare the capabilities of an SDS. This Subsection presents these theoretical principles, which describe basic rules an SDS designer should obey and the competences an SDS should provide in order to support the user. Furthermore, a definition of the author's understanding of the conversational competences of a conversational SDS is presented.

Social science and linguistics define cooperative principles of human-human dialog, which should be obeyed in order to achieve effective communication. The principles describe what people do in cooperative conversations, which is why they can be used as guidelines for the design of SDSs. This applies only to cooperative SDSs (not to tutoring systems, etc.). Grice (1975) proposed four conversational maxims, which characterize cooperative answers and should be obeyed by an SDS:

1. Maxim of Quality: Truth
- Do not say what you believe to be false.
- Use support of adequate evidence.
In a cooperative dialog the SDS should never give an answer which might be incorrect and mislead the user. If the literal answer might lead the user to a false assumption, further information should be provided to verify the response.

2. Maxim of Quantity: Information
- Make the contribution as informative as required for the current purpose of the exchange.
- Do not provide more information than required.
An SDS should give only as much information as necessary and should not overload the user. For example, if a database is large, an answer to a question could result in an endless prompt, which the user cannot process while listening.

3. Maxim of Relation: Relevance
- A contribution should be related to the purpose of the dialog.
- The answer should be relevant to the user who asked the question.

The information provided should match the current stage of the dialog and should be related to the current topic. Depending on the user's background, the interests or intentions of the user might differ. The system's response should be adapted to the respective user. Effort has to be spent on user modeling to determine the goals and intentions of users.

4. Maxim of Manner: Clarity
- Avoid obscure and vague expressions.
- Avoid ambiguity.
- Be brief and orderly.
The answers of an SDS should be easily understandable, unambiguous and succinct. If the system responses are provided in a clear manner, no clarification subdialogs are needed, which makes a dialog more effective.

These four maxims should be satisfied in order to achieve a smooth and effective speech dialog. However, these rules do not tell how a speech dialog should be designed. There are many ways in which an SDS can successfully interact with a user and thereby obey Grice's maxims. Depending on the dialog features and the flexibility, an SDS can appear more or less conversational. The TRINDI tick-list by Bohlin et al. (1999) characterizes the dialog behavior of an SDS with the help of twelve yes-no questions. This tick-list evaluates the dialog management capabilities and is used to compare different SDSs with respect to how human-like they appear. The twelve questions are listed in Table 2.2.

Table 2.2. The TRINDI tick-list.

Q1: Is utterance interpretation sensitive to context?
Q2: Can the system deal with answers to questions that give more information than was requested?
Q3: Can the system deal with answers to questions that give different information than was actually requested?
Q4: Can the system deal with answers to questions that give less information than was requested?
Q5: Can the system deal with ambiguous designators?
Q6: Can the system deal with negatively specified information?
Q7: Can the system deal with no answer to a question at all?
Q8: Can the system deal with noisy input?
Q9: Can the system deal with help sub-dialogs initiated by the user?
Q10: Can the system deal with non-help sub-dialogs initiated by the user?
Q11: Does the system only ask appropriate follow-up questions?
Q12: Can the system deal with inconsistent information?

The twelve questions focus on different dialog management capabilities. The flexibility of the dialog is assessed by verifying whether mixed-initiative dialogs are possible (e.g. Q2, Q3, Q10). Mixed-initiative dialogs were previously defined as dialogs in which the SDS has the overall control of the dialog but allows the user some advanced flexibility to provide more information than requested (see Section 2.1.2).

E.g., Q10 verifies whether the system is able to deal with user requests which do not answer the system's question:

1 System: Which TV show would you like to watch?
2 User: Is there Knight Rider on TV today?
3 System: Yes.
4 User: Ok, then I would like to watch Knight Rider today.

Q1 and Q11 verify whether the system is able to resolve context. An SDS has to keep track of the dialog history and make decisions based upon it (Q11), and it has to be able to interpret context-sensitive utterances (Q1). The following dialog illustrates an example for Q1:

1 System: When would you like to watch a TV show?
2 User: Today.
3 System: Which TV show would you like to watch today?

Furthermore, the cooperativeness of an SDS is addressed (Q4, Q5). If the user's response is ambiguous or underinformative, a cooperative SDS should not simply decide itself on the most likely user intention. Instead, the system should re-request the information or help to establish common ground. For example, if the user's answer is ambiguous or not precise, the system could suggest possible answers in order to resolve the ambiguity (Q5):

1 System: What would you like to watch?
2 User: Stargate.
3 System: Did you mean Stargate, the movie, or Stargate, the TV show?
4 User: The movie.

Q6 and Q12 assess whether a speech interface can deal with negatively specified and inconsistent answers. The system has to be intelligent and provide some reasoning competences to resolve these user utterances. If the interpretation is unclear, the system should ask the user before continuing the dialog (e.g. Q12):

1 System: When would you like to watch a movie?
2 User: Tomorrow on Wednesday.
3 System: Excuse me, but tomorrow is Tuesday.
4 User: Ok, on Tuesday then.

Finally, the TRINDI tick-list contains further questions which evaluate the handling of user utterances that are uninterpretable due to silence (Q7) or noisy input (Q8) and which assess whether the SDS provides help functionalities (Q9).

The TRINDI tick-list is used to evaluate the conversational competences of an SDS. However, it only focuses on dialog management capabilities and does not get to the point of defining the nature of such flexible, human-like dialogs. Spontaneous speech phenomena are not addressed by the tick-list. Which user speaking styles should be allowed? Which degree of flexibility of user input is required? In order to answer these questions one can try to model a speech interface based on observations of real human-human dialog. However, opinions regarding the design of conversational interfaces by mimicking human conversations are somewhat divided (Sadek, 1999; Boves and den Os, 1999; Thomson and Wisowaty, 1999). Some researchers think that human-human dialog phenomena, such as interruptions, incomplete sentences, etc., may not contribute directly to goal-directed problem solving (Thomson and Wisowaty, 1999). On the other hand, users might feel more comfortable speaking to a user interface that acts like a human being. Therefore, many researchers build their SDSs based on analyses of human-human interactions, which can provide valuable insights (Bernsen et al., 1996; Zue and Glass, 2000; Gustafson and Merkes, 2009).
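The context resolution targeted by Q1 and Q11 can be made concrete with a minimal sketch; the function and data structures below are illustrative assumptions, not components of the systems discussed in this thesis:

    import datetime

    def resolve_in_context(utterance, pending_slot, dialog_history):
        # Context-sensitive interpretation (Q1): a bare answer such as
        # "Today" only receives its meaning from the pending question.
        text = utterance.strip().lower()
        value = text
        if pending_slot == "date":
            today = datetime.date.today()
            if text == "today":
                value = today
            elif text == "tomorrow":
                value = today + datetime.timedelta(days=1)
        # Keeping a dialog history allows the system to ask only
        # appropriate follow-up questions later (Q11).
        dialog_history.append((pending_slot, value))
        return value

For the Q1 example above, resolve_in_context("Today", "date", history) grounds the elliptical answer to a concrete date, so that the follow-up question can embed it ("Which TV show would you like to watch today?").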

Regardless of the debate about whether modeling human-machine interactions after human-human dialogs is necessary or appropriate, the author believes, as do others, that valuable rules and characteristics derived from human-human dialog can be applied to successfully design human-like conversational speech interfaces. Transcribed conversations over the phone between a client and an agent can provide valuable insights into how one should design an SDS. In face-to-face conversations people would additionally communicate non-verbally, which is unlikely to happen when they talk to a machine. Of course, if the machine looked, talked and acted like a human being, people might use non-verbal cues to communicate. However, at present there is no machine which provides any of these capabilities. Therefore, talking to an SDS which assists users in task-oriented problem solving can be compared to talking to an agent over the phone. A sample dialog which helps to derive further conversational dialog features concerning speaking styles has been presented at the beginning of this Subsection in Table 2.1 (Zue and Glass, 2000). The sample conversation illustrates spontaneous speech phenomena which have to be taken into consideration when developing an SDS.

In order to appear human-like, an SDS should not restrict the user input to a certain sentence construct or grammar. Short utterances, such as ellipses and ill-formed sentences, as well as grammatically correct sentences should be understood by the system. A conversational SDS has to detect disfluencies and other spontaneous speech phenomena without semantic meaning and must not attach meaning to them. Differences in prosody (e.g., speech rate or intonation) should not affect the ASR of an SDS. As variation in prosody sometimes conveys meaning, the NLU component should be able to understand the different user intentions. Ideally, the ASR component of an SDS should be robust against the pronunciation variation of spontaneous speech 2. The recognition result of the ASR component should not be impaired by differences in pronunciation. The SDS should be able to interpret non-verbal cues and take them into account in the course of the dialog. Furthermore, the speech output should also present human-like characteristics. By presenting non-verbal cues an SDS can appear more natural. The SDS should take situational and contextual knowledge into account. Depending on the situation and its user, the system should adapt its speaking style and dialog behavior. Giving an SDS a personality has to be handled with care and depends on the user target group, as not everybody likes the idea of a machine with a concrete human or artificial character.

To this day many SDSs which aim at providing conversational competences have been developed for industrial or research purposes. However, the developed systems only provide some of the features a human-like speech interaction demands. For instance, the SDS Let's Go (Raux et al., 2005), developed by Carnegie Mellon University, provides bus schedule information for the city of Pittsburgh. The TTS sounds human-like and the dialog enables users to correct falsely understood utterances (Berg, 2013). The telephone-based travel information service of the German railway system Deutsche Bahn 3 enables users to retrieve train information by speech. By allowing the user to input several parameters at once, understanding anaphoric references and some colloquial language, the system provides more flexible input possibilities (Berg, 2013).
However, both systems lack many conversational competences concerning dialog cooperativeness and context knowledge. E.g., the dialog control is system-directed, does not allow the user to overanswer system requests and applies explicit verification for confirmation (Berg, 2013). The Mercury system (Seneff, 2002) by the Massachusetts Institute of Technology provides users telephone access to flight information and prices. The system is based on a mixed-initiative dialog and thus understands overanswering and requests missing information. Furthermore, the SDS is able to resolve references, uses implicit verification and allows users to make corrections.

2 Handling pronunciation variation is mainly an acoustic modeling problem. The acoustic models of the ASR engine employed in the SDS prototypes of this research are provided by automotive suppliers. These models are not adapted or modified for the research of this thesis and are therefore only dealt with in a limited fashion.

Nevertheless, the system does not adapt to the situation or the user and does not understand colloquial or spontaneous speech. Among the most advanced spoken dialog systems concerning flexible input possibilities are the personal assistants on mobile devices, such as Apple's Siri, which have been addressed in the introduction of this thesis. Similar to the Mercury SDS, Apple's Siri employs mixed-initiative dialogs, understands overanswered responses, requests missing information and is able to resolve references. The strength of the system is its ability to understand ellipses and sentences with the same meaning phrased in many possible ways. However, the conversational competences of this SDS are also limited. For example, the SDS does not provide auditive feedback to the user about what the system has understood. Corrections can only be made when all the parameters have been collected.

The speech dialog of the conversational SDS prototype developed in the course of this research work has also been designed based on analyses of human-human conversations. A conversation on the phone between a client and an agent has been taken as an example to model the human-machine speech dialog. When designing the dialogs, the basic principles described in this Subsection have been adhered to.

This Subsection presented conversational speech phenomena and the conversational competences an SDS needs in order to appear more human-like. If an SDS is capable of these competences, a further step towards the fictional computer systems HAL 9000 and K.I.T.T. is taken. One of the most important character traits of a human dialog partner is its independence. HMI should not only rely on interaction requests by the user. An SDS should also be able to take its own initiative and interact proactively, as HAL 9000 and K.I.T.T. do. As this research work also investigates proactive SDSs in dual-task scenarios, the general meaning of so-called proactive behavior, its key features and proactivity in the scope of SDSs are described in the next Section.

Proactivity

Research about proactivity has emerged only recently in several research areas. For example, research in HMI and human-human communication as well as in industrial and organizational psychology investigates proactive behavior. Since it is a new research field, there are no precise definitions of proactive behavior. Dictionary definitions (American Heritage Dictionaries Editors, 2011; Merriam-Webster's online dictionary, 2013; WordNet 3.1 Princeton University, 2013) typically contain two key features of proactivity. First, an anticipatory element is emphasized, which involves acting in advance of a future situation, such as "acting in anticipation of future problems, needs or changes" (Merriam-Webster's online dictionary, 2013). Second, these definitions highlight taking control and causing change, for example: "controlling a situation by causing something to happen rather than waiting to respond to it after it happens" (WordNet 3.1 Princeton University, 2013). Both of these elements - anticipation and taking control - can be found in most conceptualizations of general proactive behavior. E.g., Parker et al. (2006, p. 636) define proactive behavior as "self-initiated anticipatory action that aims to change and improve the situation". In addition, definitions of proactive behavior often emphasize its self-initiated nature, which addresses the attempt to solve problems which have not yet occurred (Frese and Fay, 2001).
Summarizing these definitions, proactivity can be described by three key features. Proactive behavior is:

1. anticipatory - instead of reacting, it involves scanning the environment and acting in advance of a future situation;
2. change-oriented - instead of passively adapting to the situation or waiting for something to happen, being proactive means taking control or causing something to happen;

3. self-initiated - the control is taken on the actor's own initiative without being requested to do so (Parker and Collins, 2010).

Most of these definitions are applied in industrial and organizational psychology and used to describe proactive behavior of employees aiming to improve individual and organizational effectiveness. For example, a nurse who is waiting for the doctor sees a patient and prepares the equipment and data the doctor might need. Thereby, the doctor can do his work more effectively. The nurse acts anticipatorily by thinking ahead and anticipating the doctor's needs. Instead of waiting for the doctor to come, she becomes active and prepares the equipment. The initiative to do so is taken entirely by herself without being requested by the doctor (Parker and Collins, 2010).

As the definitions of proactive behavior are formulated in a general manner, they can be transferred to other research areas, such as human-human communication or HMI, too. Imagine an in-vehicle navigation system which observes the traffic density on the previously configured route while driving. As the system detects a traffic jam which would prolong the trip, it speaks up to the driver and suggests taking a different route. Here the system acts anticipatorily by observing the traffic density ahead and protecting the user from a possible traffic jam. Instead of ignoring the pending problem, the system suggests changing the route to bypass the traffic jam. The system initiates the dialog itself without a request by the user. The only difference to the proactive behavior of the employee is that the system only makes suggestions to the driver and does not decide on the new route itself. The driver keeps the control of changing the route himself.

In order to allow for successful proactive behavior, the environment has to be scanned and the participants have to anticipate and act in advance of a future situation. Hereby, several influencing factors have to interact. The following Section describes a system model for proactive behavior with the help of a simple and abstract example. The system model focuses on proactive behavior in human-human communication and HMI and is brought in line with the dual-task scenario, as this is the focus of this research work. Afterwards, proactive behavior in an SDS and its special features are described.

Proactivity System Model

Imagine two communication partners (A and B) who would like to exchange information via some system. To do so, A sends the information carrier to the system. When the system has received the information, it has two possibilities to handle it: either the system stores the information and waits until B requests it, or it proactively sends the information to B. The latter activity resembles proactive behavior. Figure 2.7 illustrates a simple and generalized system model of such a proactively acting system.

[Figure 2.7: General proactivity system model. A source (e.g., a letter, an e-mail, a traffic announcement) feeds information to the system (e.g., a postal service, a smartphone, an SDS), which either stores it or delivers it to the user through proactive interaction.]

In the example above, A would be represented by the source and B by the user in Figure 2.7. The system which manages the information exchange could be a postal service, a smartphone or an in-car SDS. For example, imagine A sends a letter to person B via a postal service. The postal service receives the letter from A. Instead of waiting until the recipient picks up the letter from the post office, the postman delivers the letter proactively to the addressee (Hermanutz, 2013).
Successful proactive behavior can only be achieved if the proactive system observes the environment in order to act in advance of a future situation and to deliver the content at the right point in time. Therefore, the system model needs to be extended by a context component, which is illustrated in Figure 2.8 (Hermanutz, 2013).

[Figure 2.8: Extended proactivity system model. A context component (e.g., location, time, situation, priority, user preferences), fed by observations of the environment, informs the system's proactive interaction with the user.]

Context-awareness can relate to the current location, time or situation, knowledge about user preferences, etc. The context knowledge can be gained by observing the environment and the user himself. For example, if an addressee has changed his residence recently, the postal service needs to be aware of the new address. If person A would like to send a letter to B without knowing the new address, the postal service has to take care that the letter arrives at the correct address. The in-vehicle navigation system of the example above also showed context-aware capabilities, as it observes the traffic density in order to detect a possible traffic jam.
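The decision logic of the extended model can be sketched in a few lines; the Context class, the priority threshold and the helper callables are illustrative assumptions and not part of the thesis prototypes:

    from dataclasses import dataclass

    @dataclass
    class Context:
        location: str            # e.g. current address or GPS position
        situation: str           # e.g. "at home", "in a meeting"
        priority: int            # priority of the incoming information
        user_prefers_push: bool  # learned or configured user preference

    def handle_information(item, context, deliver, store):
        # Proactive path: deliver immediately if the context permits it;
        # reactive path: store the item until the user requests it.
        if context.user_prefers_push and context.priority >= 2:
            deliver(item)
        else:
            store(item)

In the postal example, the context component would contribute the addressee's current residence; in the navigation example, the observed traffic density ahead.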

In order to bring the model in line with the dual-task scenario, the system model is further extended by the primary task the user has to perform (see Figure 2.9).

[Figure 2.9: Extended proactivity system model including the primary task (e.g., cooking, driving, operating machinery) the user performs in parallel; the state of the primary task feeds into the context component.]

Depending on its nature, the primary task can occupy several input and output channels simultaneously. For instance, by steering and keeping one's eyes on the road, driving a car demands a person visually and manually. The state of the primary task needs to be included in the context knowledge. Thereby, the system might know whether the user is currently able to process the information which the system tries to deliver. E.g., if the user is very busy performing the primary task, the system should not proactively interact with the user in order not to interfere with the primary task. Many primary tasks in dual-task scenarios demand visual and manual input and output channels. As the auditory channel remains free, SDSs are already used in several dual-task scenarios in order to assist the user. The next Subsection describes proactive SDSs and their characteristics.

Proactive Spoken Dialog Systems

The proactive behavior of an SDS can also be characterized by the three key features proposed by Parker and Collins (2010). A proactive SDS has to capture the spatial, temporal and user-specific context of an interaction in order to act anticipatorily in advance of a future situation, possibly even before the user has become aware of the problem. Furthermore, the system needs to understand the user's current psychological situation, intention and actions and has to keep track of the dialog history (as described in Section 2.1.2). Then it is able to assist the user in a meaningful way. A proactive SDS initiates the speech interaction itself and not only upon the user's request (Minker et al., 2009). Therefore, one can say that the speech dialog is initially system-initiated. However, in the course of the dialog the initiative can switch. Depending on the dialog model, the system guides the user through the entire dialog (system-directed) or the initiative switches in the course of the dialog (mixed-initiative). A successful and user-friendly proactive speech interaction does not only concern the way the system addresses the user. Moreover, the system has to interact with the user and then finish the dialog in a user-friendly manner. The dialog flow of a proactive speech interaction is illustrated in Figure 2.10 and the different dialog steps are described in the following.

[Figure 2.10: Proactive speech dialog flow. Starting from the user state (idle or task X), the interaction passes through Notification, Problem Solving and Task Completion, after which the user returns to the previous state.]

Before the SDS delivers new incoming information or informs the user about an upcoming problem, two different scenarios are conceivable: either the user is idle, or the user is already interacting with the system. When the SDS initiates the speech interaction, the following speech dialog steps have to be walked through:

1. Notification: First, the system has to grab the user's attention to tell him that there is new information. The manner in which an SDS interrupts the ongoing dialog or initiates the interaction should be situation-sensitive and user-friendly (Edlund et al., 2012). If the situation does not allow a speech interaction at the moment, the system should not address the user. E.g., if the user is dictating a sensitive e-mail, he should not be interrupted by the system. When the SDS decides to notify the user, in order to appear user-friendly, the SDS could allow the user to decide whether he wants to enter into the new dialog or reject talking about the newly introduced topic. Especially the dual-task scenario requires a situation-sensitive and user-friendly proactive notification strategy, as the primary task performance should not be impaired.

2. Problem Solving: In the course of the speech dialog the user interacts with the SDS on a regular basis. Depending on the dialog modeling and the competences of the SDS, the speech interaction can appear more or less conversational. The dialog can be modeled according to the techniques presented in Section 2.1.2.

3. Task Completion: When the problem has been solved or the new information has been delivered, the new task is completed. Depending on the initial state of the user, the previous task should be resumed or the SDS should hold back again. Again, by negotiating the desired process the system can leave the decision to the user in a user-friendly manner. (A minimal sketch of this three-step flow is given below.)

There are only few research projects incorporating proactive behavior in SDSs. Most of them implement system-initiated dialogs as an additional feature but do not focus on the manner in which proactive behavior should be modeled. The DARPA Communicator program 4 focused on the improvement of SDSs which allow for performing complex tasks using speech as the sole input modality. The DARPA projects helped to gain knowledge about proactive dialog management and conversational dialog design. The SmartKom project 5 aspired to complex multimodal dialogs in which the user as well as the system can initiate interactions and make further inquiries or ask clarification questions. The Neem project (Barthelmess and Ellis, 2005) focused on tools to support and guide meetings in organisational, informational and social aspects. For instance, Kwaku, a virtual meeting partner, performs organisational tasks, such as monitoring the time spent on certain agenda points, and reminds participants proactively to go to the next item if necessary. Strauss and Minker (2010) envisaged a dialog system which listens to multiparty conversations and becomes active when needed. They conducted a user study with a simulated SDS, called Helmut, which assists the users proactively in a restaurant search.
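The three dialog steps can be sketched as a simple state machine; the helper names and prompts below are illustrative assumptions and do not correspond to the dialog manager implemented in the thesis prototypes:

    def proactive_dialog(user_state, situation_allows_speech, ask, listen, solve):
        # Hold back if the context (e.g. primary task load) forbids speech.
        if not situation_allows_speech:
            return user_state
        # 1. Notification: let the user accept or reject the new topic.
        ask("You have a new message. Would you like to hear it?")
        if listen().strip().lower() != "yes":
            return user_state        # rejected: resume the previous state
        # 2. Problem solving: regular speech interaction on the new topic.
        solve()
        # 3. Task completion: resume the previous task (or stay idle).
        return user_state

The returned value restores the user state of Figure 2.10 (idle or task X), so the previous activity can be resumed after task completion.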
Today there already exist products which notify the user proactively. E.g., most of today's navigation systems employ data retrieved from the Traffic Message Channel (TMC) for routing. TMC allows for dynamic route guidance taking into account real-time traffic situations (Hamerich, 2007). Speech-enabled navigation systems prompt the driver proactively or play a warning sound if, for example, a traffic jam appears on the route. Smartphones alert users to incoming e-mails, instant messages or upcoming appointments by playing sounds. Location-based information, knowledge about the user gained from smartphone use and real-time data gathered from the Internet set the basis for successful context-awareness. The smartphone app Google Now 6 uses this context knowledge to notify the user proactively about relevant information by presenting it on the screen.

4, 5 Websites offline.

E.g., when the user enters a subway platform, he can see the schedule of the next trains leaving the station on his smartphone. These mobile devices aspire to act as personal assistants in order to support the user in many situations. Another example of a proactive, situation-aware system in a dual-task scenario is the Warning and Information Management (WIM) system by Heisterkamp and Rothe (2004), which ranks messages and warnings that can occur while driving a car. Those warnings and messages are communicated to the user only in appropriate situations. There are also products which use other input modalities to detect whether the situation allows initiating a dialog. The SemVox talking terminal 7 is integrated into a display and serves as an interactive information terminal. Based on integrated cameras it recognizes the presence of customers and initiates the dialog.

As described in Section 1.3, another goal of this research work is to examine proactive SDSs in a dual-task scenario. This thesis investigates how an idle driver should be notified about an incoming message in a user-friendly way. Depending on the contextual situation, different ways of being notified might be preferred.

The previous Sections presented fundamentals of dialog management and ways to make SDSs appear human-like. These Sections provided insights into dialog design and modeling techniques. In order to assess whether the systems perform as expected and whether they are accepted by the user, the developed SDSs have to be evaluated. The next Subsection presents evaluation methods and measures which are widely used to assess the speech dialog performance and the user acceptance of an SDS. In the course of this research work, SDS prototypes have been evaluated with regard to usability, applying the techniques presented in the next Subsection.

Evaluation of Spoken Dialog Systems

Evaluation procedures are an important means for developing SDSs and are employed several times in the usability engineering life cycle illustrated in Figure 2.11.

[Figure 2.11: Usability engineering life cycle, similar to Möller (2010): Analysis, Design, Prototyping, Usability Tests and Feedback from the field, connected through iterative design.]

At the beginning of the development, user profiles have to be established and the user behavior, the contextual situation and the environment have to be analyzed (Möller, 2010). In order to gain knowledge about the users, the situation and the environment, initial user studies (e.g. surveys) can be conducted. A popular procedure to obtain interaction data for analyzing the user behavior is the Wizard-of-Oz (WOZ) technique. This technique can be applied even though the system that shall be developed does not yet exist.

In a WOZ experiment a human operator simulates the SDS and interacts with a human participant. Ideally, the participant thinks that he is interacting with a real system and not with the human operator behind the curtain (Dahlbäck et al., 1993). Based on the analysis, initial SDS concepts are designed. One or more design concepts are realized as system prototypes, which can then be evaluated regarding usability. To this end two approaches are conceivable, which are often pursued in parallel or alternately: either empirical tests with users are conducted or different methods of expert-based evaluation are applied (Möller, 2010).

Usability evaluation assesses the quality of an interactive system regarding user-friendliness and usefulness. According to the ISO Standard (1999), usability is defined as:

Usability: The extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.

The level of usability is characterized by the following dimensions:

Effectiveness: The accuracy and completeness with which specified users can achieve specified goals in specified environments.
Efficiency: The resources expended in relation to the accuracy and completeness of the goals achieved.
Satisfaction: The comfort and acceptability of the system to its users and other people affected by its use.

Usability testing reveals usability problems and can provide valuable insights about possible system improvements. The testing procedure is performed iteratively, which can involve several cycles of re-design and evaluation. Afterwards, the new SDS can be deployed in the field. Feedback from the field can be used for further iterative optimizations and gives indications and ideas for future system generations (Möller, 2010). Usability evaluation is an important element in the usability engineering life cycle and plays an important role in this thesis. As there are several ways to assess the three dimensions of usability, the common evaluation methods and measures are described in detail in the following.

Evaluation Methods and Measures

Generally, in usability tests data are collected experimentally. The data collection process can be performed in several ways (e.g. Bortz and Döring (2006), Beywl et al. (2007), Diekmann (2007)). Beywl et al. (2007) identify three main evaluation methodologies:

Content analysis is a technique which analyzes the content of collected data. The set of data is broken down into several components, which are assigned to a system of categories and analyzed.
Observation can be used to capture verbal or non-verbal actions. An observer (e.g. the experimenter) observes a process and classifies a certain behavior into previously specified categories.
Survey is a method which can be used to collect the respondents' attitudes or opinions. A survey can be carried out in written form (e.g. a questionnaire, an e-mail) or orally (e.g. in an interview or on the telephone).

The evaluation of the developed SDS concepts in this thesis uses the content analysis and the survey technique. Usability evaluation of an SDS can also be differentiated into objective evaluation and subjective evaluation. Objective evaluation aims at rating system and interaction performance, whereas subjective evaluation assesses user judgments obtained from a subjective point of view (Möller, 2005).
Möller (2005) found that only moderate correlations between objective and subjective evaluation measures exist.

Therefore, one method cannot replace the other and, as a consequence, both methods should be applied. In this research work objective evaluation is achieved by the content analysis method. Subjective evaluation is achieved by the use of questionnaires which the participants had to fill out during the experiments. In the following, the applied approaches to objective evaluation are described, followed by the presentation of the applied subjective evaluation method.

Objective Spoken Dialog System Evaluation

Based on experimentally collected data, interaction parameters are extracted and analyzed (Möller, 2005). A common method for objective SDS evaluation is the PARADISE (PARAdigm for DIalogue System Evaluation) framework by Walker et al. (1997b), which computes user satisfaction based on task success rates and cost functions. The goal is to achieve maximum user satisfaction by maximizing task success whilst minimizing the cost function. In Möller (2010), Möller describes a list of interaction parameters which have been collected in a comprehensive literature survey and have been standardized by the International Telecommunication Union (ITU-T) in ITU-T Suppl. 24 to P-Series Rec. (2005). The different parameters can be classified into several categories. In this thesis only a subset of the most relevant interaction parameters is applied, which allows a quick but still comprehensive evaluation of SDSs. The categories and the applied parameters are described in detail in the following.

Dialog- and communication-related parameters: Interaction parameters which refer to the overall dialog give a rough indication of how the HMI takes place. These parameters are defined on a dialog 8 or turn 9 level and do not specify the communicative function of each individual utterance 10. The following parameters are applied in this thesis:

Dialog duration (DD) describes the overall duration of a dialog in [ms] or [s].
Number of Turns (NoT) describes the overall number of turns uttered in the course of a dialog.

Meta-communication-related parameters: SDSs with limited recognition, understanding and reasoning capabilities require correction and clarification utterances or sub-dialogs (e.g. help requests or help prompts) in order to recover from recognition errors or misunderstandings. Furthermore, if the user does not understand how to talk to the system or makes mistakes while giving information, correction and clarification sub-dialogs are required. The parameters belonging to this category quantify the number of these meta-communication utterances and evaluate the ability of the system to recover from interaction problems. However, the SDS prototypes of this research work do not provide any help functionalities. Therefore, these interaction parameters cannot be evaluated.

Cooperativity-related parameters: These parameters assess the level of cooperativity of an SDS based on Grice's maxims (see Section 2.1.3). As the SDS concepts which have been designed in the course of this research work satisfy these maxims to the same extent, a comparison of cooperativity-related parameters is not further taken into consideration.

Task-related parameters: SDSs assist users in solving problems or performing a certain task 11. Task-related parameters evaluate the degree of task fulfillment of a single dialog or several dialogs.
A sample task-related parameter is defined as follows:

8 Definition of dialog for evaluation purposes: "A conversation or an exchange of information. As an evaluation unit: one of several possible paths through the dialogue structure." (ITU-T Suppl. 24 to P-Series Rec., 2005)
9 Definition of turn for evaluation purposes: "Utterance. A stretch of speech, spoken by one party in a dialog, from when this party starts speaking until another party definitely takes over." (ITU-T Suppl. 24 to P-Series Rec., 2005)
10 Definition of utterance for evaluation purposes: see turn.
11 Definition of task for evaluation purposes: "All the activities which a user must develop in order to attain a fixed objective in some domain." (ITU-T Suppl. 24 to P-Series Rec., 2005)

Task Success (TS) describes whether the user has achieved his goal by the end of the dialog. Different labels are used to indicate whether the goal was reached, not reached, or reached with constraints.

Speech-input-related parameters: These parameters describe the capability to recognize words and utterances and to interpret meaning from the recognized string; thus, they assess the performance of the ASR and NLU modules. For ASR, the recognition performance can be assessed on an utterance or on a dialog level. The language understanding performance is often assessed on the basis of attribute-value pairs (AVPs) occurring in an utterance. The most relevant measures are presented in the following:

Word Error Rate (WER): computes the percentage of words of an utterance which have been incorrectly recognized. The WER of an utterance is defined as

    WER = (s_w + i_w + d_w) / n_w,    (2.3)

where n_w designates the overall number of words in the utterance, s_w the number of substituted words, i_w the number of inserted words and d_w the number of deleted words.

Concept Error Rate (CER): computes the percentage of incorrectly understood semantic entities of an utterance. The CER of an utterance is defined as

    CER = (s_AVP + i_AVP + d_AVP) / n_AVP,    (2.4)

where n_AVP designates the overall number of AVPs in the utterance and s_AVP, i_AVP and d_AVP the number of substituted, inserted and deleted AVPs, respectively (ITU-T Suppl. 24 to P-Series Rec., 2005).

Usually, when evaluating the speech input capabilities of an SDS, the WER is assessed. However, in the data analysis of the usability tests of this research, the CER instead of the WER is assessed in order to evaluate the speech input performance. This decision is explained with the help of the following example:

User:        I would like to book a hotel in Berlin.
ASR result:  I want to book a hotel in Berlin.
NLU result:  $action=book, $location=berlin

In this example the intention of the user is to book a hotel in Berlin. In the first step of the speech input process, the ASR module incorrectly recognizes the word "would" and substitutes it by "want". Furthermore, the word "like" is deleted. Based on the ASR output, the NLU extracts the meaning of the recognized sentence. As only meaningless words were misrecognized in this context, the NLU still succeeds in interpreting the user's intention and the dialog can continue without errors. The computed WER and CER of the example are:

    WER = (1 + 0 + 1) / 9 = 0.22 ≙ 22%    (2.5)
    CER = (0 + 0 + 0) / 2 = 0 ≙ 0%        (2.6)

The computed values illustrate that the CER represents the speech input capabilities of an SDS better than the WER. Therefore, in the result analysis in the experimental part of this thesis, only the CER is assessed.
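Both measures can be computed from token alignments. The following sketch uses Python's difflib to obtain the edit operations; it is an illustrative approximation (difflib does not guarantee a minimal Levenshtein alignment) and not the tooling used in the thesis experiments:

    import difflib

    def error_rate(reference, hypothesis):
        # Counts substituted, inserted and deleted tokens between the
        # reference and the hypothesis, normalized by the reference
        # length, following Eqs. (2.3) and (2.4).
        sm = difflib.SequenceMatcher(a=reference, b=hypothesis)
        errors = 0
        for op, a1, a2, b1, b2 in sm.get_opcodes():
            if op == "replace":
                # A k-vs-m replace block counts as min(k, m) substitutions
                # plus |k - m| insertions or deletions.
                errors += max(a2 - a1, b2 - b1)
            elif op == "insert":
                errors += b2 - b1
            elif op == "delete":
                errors += a2 - a1
        return errors / len(reference)

    ref = "i would like to book a hotel in berlin".split()
    hyp = "i want to book a hotel in berlin".split()
    print(error_rate(ref, hyp))   # WER of the example: 2/9 = 0.22
    nlu_ref = [("action", "book"), ("location", "berlin")]
    nlu_hyp = [("action", "book"), ("location", "berlin")]
    print(error_rate(nlu_ref, nlu_hyp))   # CER of the example: 0.0

Applied to AVP lists instead of word lists, the same function yields the CER; only the token type changes.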

Subjective Spoken Dialog System Evaluation

Several validated questionnaires exist which are used to measure the usability of user interfaces and other software subjectively. Shneiderman's Questionnaire for User Interaction Satisfaction (QUIS) (Shneiderman, 1998) is based on a theoretical model of what makes software usable. QUIS consists of several questions in order to measure overall reactions to the software and to evaluate user reactions to the display, learning and system capabilities. The Software Usability Measurement Inventory (SUMI) questionnaire developed by Kirakowski (1996) measures user satisfaction and assesses the perceived software quality on five different dimensions. The AttrakDiff questionnaire is a standardized questionnaire which assesses the overall attractiveness of any sort of interactive product (Hassenzahl et al., 2003) in terms of usability and appearance.

Although these methods have been found feasible for evaluating a large variety of applications, they do not claim to be applicable to the evaluation of SDSs. SDSs provide some unique features which are not addressed by general usability measures like SUMI or QUIS. For instance, due to their experience in human-human conversation, users have strong pre-conceived ideas about how a speech interaction should proceed. Thus, questions of naturalness, intuitiveness and habitability are important, which are not deeply covered by general scales (Hone and Graham, 2000). In order to close this gap, Hone and Graham (2000) introduced the valid and reliable Subjective Assessment of Speech System Interfaces (SASSI) questionnaire especially for SDSs. As this approach is applicable to the evaluation of dialog systems, this questionnaire is deployed in the experiments of this research work and further explained in this Subsection.

The SASSI questionnaire analyzes six dimensions, consists of 34 items and is widely used for the subjective usability evaluation of SDSs. The 34 items are rated on a Likert scale. The higher the rating, the more positive it is (except for the dimension annoyance, where the opposite holds). The SASSI dimensions are described below (Hone and Graham, 2000), followed by Table 2.3, which illustrates sample items and the number of items of each dimension. The six dimensions are:

System Response Accuracy evaluates whether the system understands the user input correctly from the perspective of the user's expectations and intentions;
Likeability consists of items addressing the user's opinion about the system as well as feeling items;
Cognitive Demand addresses the perceived level of effort needed to control the SDS and the user's feelings arising from the effort;
Annoyance assesses irritating or annoying factors;
Habitability contains items to find out how clear the interaction is for the user. The items assess whether the user knows what to say and knows what the SDS is doing;
Speed consists of items related to the speed of the system.

Table 2.3. SASSI - item composition and sample items.

Dimension                  Sample Item                                      Number of Items
System Response Accuracy   The system is accurate.                          9
Likeability                I enjoyed using the system.                      9
Cognitive Demand           I felt tense using the system.                   5
Annoyance                  The interaction with the system is irritating.   5
Habitability               I always knew what to say to the system.         4
Speed                      The interaction with the system is fast.         2
Total (6 dimensions)                                                        34
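To make the scoring procedure concrete, the following minimal Python sketch aggregates per-dimension SASSI scores. It assumes a 7-point Likert scale and shows only one sample item per dimension; the item grouping and the reverse-coding of the annoyance items are illustrative simplifications, not the exact procedure prescribed by Hone and Graham (2000):

from statistics import mean

# Abbreviated item-to-dimension mapping for illustration; the full
# questionnaire comprises 34 items across the six dimensions.
DIMENSIONS = {
    "system_response_accuracy": ["The system is accurate."],
    "likeability": ["I enjoyed using the system."],
    "cognitive_demand": ["I felt tense using the system."],
    "annoyance": ["The interaction with the system is irritating."],
    "habitability": ["I always knew what to say to the system."],
    "speed": ["The interaction with the system is fast."],
}

def sassi_scores(ratings: dict) -> dict:
    """ratings maps item text to a participant's rating on a 1..7 scale."""
    scores = {}
    for dimension, items in DIMENSIONS.items():
        values = [ratings[item] for item in items]
        # For annoyance, lower ratings are better; reverse-code so that
        # higher scores are uniformly positive across all dimensions.
        if dimension == "annoyance":
            values = [8 - v for v in values]
        scores[dimension] = mean(values)
    return scores

example = {item: 5 for items in DIMENSIONS.values() for item in items}
print(sassi_scores(example))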

In this Section the fundamentals of SDSs in general, with a focus on dialog management and evaluation, have been described. This research work addresses a dual-task scenario in which the user controls the Internet as secondary task. As the manual use of the Internet while driving impairs the primary task driving and thereby risks causing accidents, the development of an SDS to access the Internet in a driving scenario has been chosen as use case. In order to provide a better understanding of the employment of SDSs in the automotive environment, the next Subsection presents a technical overview of today's in-car SDSs and of measurements to evaluate the performance of the primary task driving.

2.2 Application of Spoken Dialog Systems

Today SDSs find their application in many different environments. As described in the introduction, SDSs already assist the user at home, in the medical field, in the automotive environment and in the military. In the automotive environment there is a strong need to develop an SDS for the Internet, as people do not refrain from using their smartphones manually while driving (State Farm Mutual Automobile Insurance Company, 2012). As the use of a smartphone while driving distracts the driver and endangers driver safety (Governors Highway Safety Association, 2011), the automotive scenario has been chosen as example use case in this research work. People's desire to stay always connected makes car and mobile device manufacturers look for solutions to make the Internet accessible in the driving environment in a user-friendly way. However, the new domain poses great challenges to SDS development in the car and on mobile devices.

This Subsection is dedicated to providing a better understanding of the application of SDSs in the automotive environment. First an overview of the currently available in-car SDS technology is given. Afterwards the SDS research framework, which is used for the prototype development in the scope of this research work, is explained in detail. As an in-car SDS has to assist the user without impairing the primary task driving, the level of driver distraction has to be assessed. Therefore, methods and measures to evaluate driver distraction are presented at the end of this Subsection.

2.2.1 State-of-the-Art In-Car Speech Dialog Systems

Speech technology has been applied for in-vehicle use for many years. In 1996, Mercedes Benz was the first car manufacturer to introduce an SDS, named Linguatronic (Dumitrache, 2009), integrated in the car's head unit[12]. The very first in-car SDS only allowed drivers to dial numbers hands-free and to switch between radio stations. In the following years other automobile brands introduced their in-car voice-control systems to the market, and the number of functions and the flexibility of the SDSs grew: e.g., destination entry advanced from simple spelling of words to single-word input up to full address entry within one utterance (Berton, 2012). Furthermore, next to the embedded speech-enabled infotainment systems, portable devices with voice-control appeared on the market. For instance, there are speech-enabled portable navigation systems and smartphone applications which shall support the user. Due to the large variety of available in-car speech technology, this Subsection gives an overview of the different technologies and their strengths and limitations.
First, the speech-enabled functionalities and the speech interaction characteristics of the embedded solutions are presented, followed by the description of the capabilities of mobile devices.

[12] A head unit is the central control unit of a modern infotainment system; it is installed in the central console of the vehicle and replaces the car radios installed in the past (Alt, 2009).

The different systems are analyzed according to the SDS characteristics identified in Section 2.1. The description and analysis of the different SDS capabilities is based on literature research, the author's expert knowledge and on qualitative analyses of the different systems conducted at the time of writing. As the development of the Internet, infotainment systems and software on mobile devices is continuously ongoing and rapidly improving, the following descriptions might be technically outdated by the time the reader follows these lines.

Embedded In-Car Spoken Dialog Systems

Nowadays, most car manufacturers provide an infotainment system with an integrated voice-control. The most advanced in-car SDSs are Mercedes Benz's Linguatronic, BMW's voice control as part of BMW's iDrive system, Audi's voice control as part of Audi's MMI system, and Ford Sync. All of the embedded voice-control systems make use of the head unit screen as additional output modality. In order to initiate the speech dialog, the driver has to press the Push-To-Activate (PTA) button (see Figure 2.12), which is generally located on the steering wheel. A sample dialog of a navigation destination input, which is generally the most advanced speech dialog, is presented in the following:

Fig. 2.12. PTA button on the steering wheel.

1 User: Enter Navigation.
2 System: Please say the name of the town, the street and the house number.
3 User: Stuttgart.
4 System: Stuttgart accepted. Would you like to enter a street?
5 User: Yes.
6 System: Please say the street.
7 User: Koenigsstrasse.
8 System: Koenigsstrasse accepted. Would you like to enter a house number?
9 ...

After pressing the PTA button the driver can indicate his intentions by speaking short elliptical utterances. If he does not enter all requested parameters at the beginning, he is guided step-by-step through the dialog until the task is accomplished. When the user selects the application in the first utterance (1), the elliptical utterance is a nominal construction which resembles a command. When the system guides the user step-by-step, common elliptical utterances are speakable. The nominal ellipsis is not common in human-human conversations, which is why this speaking style is often called command-based style. In this thesis, such a system-directed speech dialog triggered by nominal ellipsis is called a command-based dialog.

Generally, there are few synonyms available for each of the commands. Aside from some global commands, such as "next radio station", the user is restricted to inputting only the currently requested parameters (system-directed dialog initiative). Overanswering is not possible. There are a few sub-dialogs which allow users to input several parameters at once, but only those that have been explicitly requested by the SDS and only in a very restricted manner. Here the system predetermines how it expects the input from the user. Correction dialogs are only possible in a limited way at certain moments in the dialog. The in-car SDSs do not always provide auditive feedback to the user in order to establish common ground. Some systems (e.g. BMW's voice control) provide the user with visual feedback about what the system has understood.
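The described system-directed, command-based interaction style can be summarized in a minimal sketch; the slot names and prompts below are illustrative and do not stem from any of the commercial systems discussed:

PROMPTS = {
    "town": "Please say the name of the town.",
    "street": "Please say the street.",
    "house_number": "Please say the house number.",
}

def destination_entry_dialog(get_user_input):
    """Fill the destination slots strictly one at a time, in system-directed
    fashion: exactly one parameter is requested and accepted per turn."""
    slots = {}
    for slot in ("town", "street", "house_number"):
        print(f"System: {PROMPTS[slot]}")
        # Only the currently requested parameter is accepted; any additional
        # information in the utterance would simply not be understood.
        slots[slot] = get_user_input()
        print(f"System: {slots[slot]} accepted.")
    return slots

answers = iter(["Stuttgart", "Koenigsstrasse", "12"])
destination_entry_dialog(lambda: next(answers))

The sketch makes the limitation explicit: overanswering is structurally impossible, since the loop consumes exactly one value per requested slot.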

The in-car speech recognition is robust against the background noises of the driving environment and against differences in prosody (e.g. speed or intonation). However, when it comes to other spontaneous speech phenomena, such as pronunciation variation, in-car ASR reaches its limits. Furthermore, as the user is only allowed to input short elliptical utterances in a fixed command-style grammar, the problems of sentence construction in spontaneous speech are not addressed. The in-car voice-control systems do not use non-verbal cues and do not represent any kind of personality; therefore, they do not show any social behavior.

Context knowledge is applied only to a limited extent. For example, if the user has entered a certain city for navigation entry and the system requests the desired street, the system reduces the street result list to the streets which exist in the elicited city. Here the system uses context knowledge based on previous system questions, allowing only the explicit answer. Nevertheless, the SDS does not take care of the current contextual situation and does not adapt to the user.

In order to help users know what to say, in-car SDSs commonly use the speak-what-you-see strategy. Here the currently speakable commands are presented on the GUI (Niemann, 2013). There are generally two ways of presenting the commands on the screen: First, the currently presented widgets on the screen give indications about the speakable commands. For instance, in Figure 2.13(a) the screen indicates the commands "navi" or "radio". If additional help is needed, state-of-the-art in-car SDSs make use of so-called teleprompters, which present possible commands in a clearer manner (see Figure 2.13(b)). As teleprompters are very obtrusive, most of the SDSs allow switching off this additional help functionality. If the commands presented on the screen still do not suffice, the user can ask for help and the system supports the user with spoken help outputs.

Fig. 2.13. Mercedes S-Class infotainment system: (a) Screenshot of a head unit screen during the speech dialog. (b) Screenshot of a head unit screen presenting a teleprompter.

The voice-enabled proactive functions of today's embedded SDSs are very limited. For instance, drivers are notified about an incoming call by a ring tone. By speaking a certain command or pressing a button, the call is answered and they can start talking to the other person on the line. Another example: when the driver follows the navigation system's guidance on a previously specified route, the system notifies the driver verbally about changes in the traffic situation along the route which affect the future journey. However, the system only notifies the user; it is not possible to enter the speech dialog at this stage.

The three most common voice-enabled applications in the car's infotainment system concern hands-free dialing, audio playback and the navigation system. Drivers are able to dial or answer their mobile phone, input destinations for navigation or control the audio devices (e.g., radio station selection, mp3 search) (Häge et al., 2008). As mentioned in the introduction of this thesis, car manufacturers have recently started to integrate Internet functionalities into the car in order to satisfy the driver's need to be always connected. E.g., Mercedes Benz COMAND Online (see Figure 2.14), BMW Connected Drive or Audi Connect allow drivers to access Internet content. Hereto

the manufacturers are creating their own ecosystems of Internet apps, with which the driver can order and extend his infotainment system. However, the voice-control of these in-car apps has rarely been developed yet, which is why they can mainly only be controlled haptically using the CCE. There are a few Internet apps which are partly speech-enabled. For instance, in order to dictate an e-mail using BMW's infotainment system, the user has to navigate via CCE before he can start the dictation of the e-mail by speech. Furthermore, today's in-car infotainment systems do not provide the user proactively with newly incoming Internet content-related information, neither auditorily nor visually.

Fig. 2.14. Screenshots of COMAND Online of the 2013 Mercedes S-Class infotainment system.

The arrival of the Internet in the car will boost the number of apps in the infotainment system. When the number of accessible Internet apps increases, the number of speech commands will increase, too. The driver might be overloaded with the high number of speech commands he has to know in order to control the system by speech. Furthermore, the increasing number of apps also raises the complexity of the head unit's hierarchical menu, since many applications have to be clustered in order to prevent large lists. For example, the radio, web radio, the mp3 player and the CD player have to be grouped under audio. Thereby, the hierarchy depth increases and the step-by-step speech dialogs are extended. In addition, the proactive behavior which many Internet apps provide is not addressed by today's embedded in-car SDSs. Because of this, the current speech dialog concept does not cover the needs which the voice-control of the Internet demands and has to be reconsidered and possibly re-factored.

Spoken Dialog Systems on Mobile Devices for In-Car Use

As integrated infotainment systems are very expensive, mobile device manufacturers offer solutions which are cheaper and which provide similar functionalities. Commercially available car mobile devices, such as Parrot's ASTEROID tablet, can be installed in the car as a substitute for a missing infotainment system. This mobile device comes with its own PTA button, provides the basic head unit functions, can also be extended by downloadable Internet apps and can be voice-controlled. However, the voice-controlled functions are limited. Another popular solution is the use of smartphones in the car, due to the wide range of available Internet apps and because people carry them with them everywhere and at any time. Nowadays, most smartphones are equipped with personal assistants (e.g. Apple's Siri), which promise a human-like human-machine speech interaction.

There are generally two ways of installing and using the smartphone in the car. The first solution is to secure the mobile device in a special car mount. Then the smartphone has to be used with its basic touch screen input or the available buttons. A more advanced solution is the use of an in-car smartphone interface, which some car manufacturers provide. When the smartphone is connected to the interface, the regular in-car control elements (CCE, PTA button on the steering wheel) can be used and the head unit display presents contents provided by the smartphone. However, the control is then limited to only a few dedicated apps. As the SDSs on smartphones are the most popular and the most advanced speech interfaces on mobile devices, their speech interaction capabilities are explained in detail in the following.
Apple's Siri is chosen as the sample system.

After initiating the dialog haptically, the driver can indicate his intentions by speaking whole conversational sentences. Thereby, he is able to input several parameters at once. The system verifies which information is missing to perform the desired action and requests the missing parameters one after the other. In order to speed up the speech dialog, the user is able to overanswer the system's questions (mixed-initiative dialog); a minimal sketch of such a slot-filling loop is given below. At some steps in the HMI the user cannot continue the dialog by speech but has to use haptic input in order to proceed. Afterwards the speech dialog is resumed and continues. If the system has misunderstood an utterance, the driver can change the parameter and make corrections at certain moments in the dialog. The system does not always provide auditory feedback to the user to establish common ground. However, it presents the understood utterance on the screen. The sample dialog of writing an e-mail from the introduction, which illustrates the described features, is presented in the following (see also Figure 1.2):

1 User: I would like to send an e-mail to David Hasselhoff.
2 System: What's the subject of your e-mail?
3 User: Greetings from my holidays.
4 System: What would you like the e-mail to say?
5 User: Hi David, how are you? Greetings from my holidays.
6 System: Ok, Hansjoerg, here is your message to David Hasselhoff.

As the sample dialog illustrates, the driver's speech input is quite flexible concerning sentence construction and syntax. The user is able to indicate multiple parameters within one utterance. Apple's Siri allows users to speak syntactically incorrect sentences and to speak the constituents of the utterance in a free order. Furthermore, it is robust against changes in prosody concerning rhythm and intonation. The ASR performance degrades while driving, since the acoustic models are not trained on this special noise environment. Furthermore, multiple pronunciation variants still pose challenges to the ASR.

Acting as a personal assistant, the SDS has a persona. The system can be addressed by name, and sometimes Siri answers in a charming or funny way. Some short chat dialogs make the system appear more human-like. Non-verbal cues have not been addressed yet. Siri is able to understand context-related utterances and employs context knowledge. For instance, the SDS is able to interpret relative date information (e.g. "tomorrow"). Furthermore, it adapts to the user, as it remembers relations between the user and his address book entries:

1 User: Call Dad.
2 System: What is your father's name?
3 User: David Hasselhoff.
4 System: Ok, do you want me to remember that David Hasselhoff is your father?

When the relation has been established, Siri will initiate the phone call directly next time without requesting the person's name again. However, the SDS does not adapt to the current environmental situation yet.

Apple's Siri only provides visual help to the user. When the user does not say anything after initializing the speech interaction haptically, the SDS presents a teleprompter, which contains a list of possible utterances (see Figure 2.15(a)). However, this list of utterances is always the same and does not adapt to the current stage of the dialog. The user can ask for help instead of initiating the dialog by a request, but he only gets presented a list of voice-controlled apps, which he can browse haptically to get more information about possible utterances (see Figure 2.15(b)).
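The mixed-initiative slot filling described above can be pictured with a minimal sketch; the keyword pattern is a rough stand-in for a real NLU component, and all names are illustrative:

import re

SLOTS = ("recipient", "subject", "body")
PROMPTS = {
    "recipient": "Who should receive the e-mail?",
    "subject": "What's the subject of your e-mail?",
    "body": "What would you like the e-mail to say?",
}

def parse(utterance, missing_slot):
    """Very rough NLU stand-in: extract a recipient if one is mentioned,
    otherwise treat the utterance as the answer to the open question."""
    avps = {}
    match = re.search(r"e-?mail to (.+)$", utterance, re.IGNORECASE)
    if match:
        avps["recipient"] = match.group(1).rstrip(".")
    elif missing_slot:
        avps[missing_slot] = utterance
    return avps

def email_dialog(get_user_input):
    filled = parse(get_user_input(), None)   # initial, possibly over-answered
    for slot in SLOTS:
        if slot not in filled:               # only request what is missing
            print(f"System: {PROMPTS[slot]}")
            filled.update(parse(get_user_input(), slot))
    return filled

turns = iter(["I would like to send an e-mail to David Hasselhoff",
              "Greetings from my holidays",
              "Hi David, how are you?"])
print(email_dialog(lambda: next(turns)))

In contrast to the command-based sketch given earlier, the system here checks which slots are already filled and asks only for the missing ones.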
Proactive speech dialogs are very limited, similar to the embedded SDSs. Drivers are notified about incoming calls, which can be answered. Furthermore, the driver is notified about incoming text messages and can answer them by speech. However, smartphones do not act sensitively to the current situation; they do not restrict the presentation of information to situations which allow it.
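Such situation-sensitive behavior could, in principle, take the form of a simple gating rule. The following is a purely hypothetical sketch with invented signals and thresholds; it is not a feature of any of the systems discussed here:

from dataclasses import dataclass

@dataclass
class DrivingSituation:
    speed_kmh: float
    steering_activity: float   # normalized 0..1, high while maneuvering

def may_interrupt(situation: DrivingSituation) -> bool:
    # Invented rule: suppress proactive output during demanding maneuvers
    # and at very high speeds.
    return situation.steering_activity < 0.3 and situation.speed_kmh < 130

def notify(message: str, situation: DrivingSituation, queue: list):
    if may_interrupt(situation):
        print(f"System: {message}")
    else:
        queue.append(message)   # defer until the situation relaxes

pending = []
notify("New e-mail from David Hasselhoff.",
       DrivingSituation(speed_kmh=80, steering_activity=0.7), pending)
notify("New e-mail from David Hasselhoff.",
       DrivingSituation(speed_kmh=80, steering_activity=0.1), pending)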

Fig. 2.15. Screenshots of Apple's Siri: (a) Siri's teleprompter. (b) The help screen.

The voice-enabled apps generally cover the basic in-car applications: audio, hands-free dialing and the navigation system. In addition to this, users are able to control apps which access locally stored information or data from the Internet by speech. For instance, Apple's Siri provides a long list of partly voice-enabled apps. Examples of local apps are the smartphone's clock, calendar, reminders or notes applications. E-mail, weather, Facebook, Twitter or restaurant search are just a few example apps which retrieve data from the Internet.

In order to compare the speech dialog capabilities of the embedded SDSs with the systems on mobile devices, the TRINDI tick-list is applied and presented in Table 2.4.

Table 2.4: The TRINDI tick-list applied to today's embedded in-car SDSs and SDSs on mobile devices (here: Apple's Siri) in the scope of the driving environment.

Q1: Is utterance interpretation sensitive to context?
Q2: Can the system deal with answers to questions that give more information than was requested?
Q3: Can the system deal with answers to questions that give different information than was actually requested?
Q4: Can the system deal with answers to questions that give less information than was requested?

Q5: Can the system deal with ambiguous designators?
Q6: Can the system deal with negatively specified information?
Q7: Can the system deal with no answer to a question at all?
Q8: Can the system deal with noisy input?
Q9: Can the system deal with help sub-dialogs initiated by the user?
Q10: Can the system deal with non-help sub-dialogs initiated by the user?
Q11: Does the system only ask appropriate follow-up questions?
Q12: Can the system deal with inconsistent information?

When comparing the different systems by applying the TRINDI tick-list, it seems that the SDSs on mobile devices provide stronger speech dialog capabilities, which is true to some extent. Indeed, Apple's Siri appears more conversational and the speech dialog is more flexible. However, some of the features are realized only exemplarily, cannot be rediscovered at all stages of the speech interaction and cannot be applied in all applications. For example, when asking Siri to send an e-mail to a person available in the smartphone's address book, Siri does not confirm the recipient. In contrast, when sending a text message to a person in the address book, Siri verifies the recipient implicitly. These features could be implemented more consistently.

When analyzing the different systems, one quickly realizes that, compared to SDSs on smartphones, embedded SDSs have been realized especially for in-car use. The speech interaction is never interrupted by the need for haptic input, and the ASR can cope with the noisy driving environment. The large number of features and the human-like dialogs of the SDSs on mobile devices have raised users' expectations of speech technology. Today people also expect car manufacturers to develop Siri-like dialogs in the car. However, many functions are provided only exemplarily and are not developed consistently. Furthermore, this kind of dialog has not been developed for in-car use. An in-car SDS has to be attuned to the limited availability of the driver's resources and thereby must not distract from driving, while still providing high usability. Therefore, it has to be clarified whether those human-like dialogs are generally applicable in the driving environment.

Currently, there is no persuasive, consistent and sophisticated speech dialog concept in the context of a dual-task scenario which allows the user to control the Internet by speech. Therefore, it is highly important to develop a user-friendly speech dialog concept for Internet access which does not impair the primary task driving; this is the goal of this research work. In order to evaluate these concepts, system prototypes have been implemented and evaluated in driving simulation studies. The SDS prototypes have been implemented using the Daimler Speech Dialog Framework (SDF), which is explained in the following.

2.2.2 Daimler Speech Dialog Framework

The Daimler SDF is a research prototype framework which enables developers to quickly implement an SDS for in-car use. In the automotive environment, SDSs are supported by a GUI in order to additionally provide the user with visual help. Generally, an in-car SDS is multimodal and is also controlled by haptic input. However, as this research work focuses on speech interaction, haptic input is limited to pressing the PTA button and not further addressed in the architecture description. Nevertheless, the SDS architecture of Figure 2.1 has to be extended by a GUI (see Figure 2.16). The data which the dialog manager (DM) retrieves during the speech interaction either comes from locally available applications on the head unit or can be retrieved from the Internet. This extension is also illustrated in Figure 2.16.

Fig. 2.16. Automotive SDS architecture: the core SDS modules (automatic speech recognition, language understanding, dialog management, language generation, text-to-speech synthesis) are extended by a graphical user interface and by connections to local applications and Internet apps.

The Daimler SDF is originally based on the automotive SDS architecture and has been used to develop the different prototypes which are evaluated in this research work. This Subsection first presents the architecture of the framework and the interaction of the different modules. Afterwards it is explained how dialogs are modeled and specified in the Daimler SDF.

Daimler Speech Dialog Framework Architecture

The goal of the Daimler SDF is to allow developers to quickly implement new, fully functional SDS prototypes with speech input and auditive and visual output. The architecture of the research framework has to be similar to the architecture of the product development SDS. Thereby it is ensured that the demonstrated functions of the research prototypes can be integrated in the real in-car SDS. In order to demonstrate new use cases, new applications should be easily linkable to the framework. New applications can concern locally existing applications on the running machine or Web services on the Internet. As the SDS has to communicate with other components, such as the navigation system or the audio player, a simple interface to other modules is necessary. Concerning multimodal interaction, other input modalities, such as touch gestures from a touch pad, have to be fused with the speech input. In order to satisfy these requirements, the architecture of the SDF looks slightly different compared to the architecture illustrated in Figure 2.16. A simplified illustration of the SDF architecture and its different modules is presented in Figure 2.17; they are explained in the following.

Fig. 2.17. Daimler SDF architecture: the ASR, the linguistic analysis, the contextual interpretation, the task-driven dialog manager, the language generation and the text-to-speech synthesis communicate via a synchronization component (XML socket) with the GUI, local applications and Internet apps (e.g. via SOAP).

The Daimler SDF comprises the necessary SDS modules and a synchronization component which is responsible for the communication with the outside world. As described above, this architecture has been constructed with the idea in mind to easily connect the SDS to other HMI-relevant modules, such as the GUI, or to existing infotainment applications.

Automatic Speech Recognition: As described in Section 2.1, the ASR module captures the user's speech signal and translates it into text. The ASR engine employed in the prototypes of this research work is VoCon by Nuance, which is embedded onboard. As language model (LM) constraint, rule-based grammars are used to improve recognition in closed domains. The LM, which is based on word sequences, and the ASR pronunciation lexicon have to be specified in the prototyping process.

Linguistic Analysis: The linguistic analysis analyzes the output of the ASR. The linguistic analysis parser works as a phrase spotter: based on previously defined grammars, which extend the grammars used for ASR with the meaning of the utterances, the parser searches for phrases, that is, associated words such as "on Monday" or "a double room", and interprets their meaning. As spontaneous speech phenomena occur in conversational speech, the analysis process has to be robust against them (e.g. against syntactically incorrect sentences). Phrases generally do not comprise whole utterances, in order to improve robustness (Ehrlich and Jersak, 2006). The lexicon for the linguistic analysis and the syntactic-semantic grammars have to be specified if a new prototype is to be implemented.

Contextual Interpretation (CI): Section 2.1 described keeping track of the dialog context as part of the DM's responsibility. In the SDF, a separate module interprets the results of the linguistic analysis in a given context. The current context is set by the DM. The results are interpreted in the context of:

1. the utterance: Phrases are concatenated and interpreted together;

2. the dialog: E.g., when the system requests the arrival date and the user answers with "tomorrow", the system interprets the user utterance as arrival date and not as departure date;

3. the situation: Concrete values are assigned to expressions such as "tomorrow" (Ehrlich and Jersak, 2006).

The context knowledge has to be specified by the developer.

Task-Driven Dialog Manager (TDDM): The main task of the TDDM is dialog control. Based on the input from the CI, the TDDM initiates different topics or certain dialog acts, such as confirmations or re-requests. As the TDDM decides about the next dialog step, this module also comprises the language generation task and employs prompt templates to stay flexible. The system prompts and the dialog strategy have influence on the continuation of the human-machine conversation. Therefore, the TDDM is responsible for setting the context of the next user utterance. The determined context has influence not only on the CI but also on the ASR. In the ASR module, the context knowledge is used to limit the grammar to only those user utterances currently allowed in the given context, which improves the recognition accuracy. For instance, if the system asks for booking confirmation (e.g. "Would you like to book this hotel?"), the grammar can be limited to only a few confirmation or rejection utterances (e.g. "yes", "yep", "no", "nope", etc.). Furthermore, the TDDM is responsible for the interaction with the outside world. The TDDM triggers changes on the GUI and requests or receives information from external applications via the synchronization component (Ehrlich and Jersak, 2006). In order to develop an SDS prototype, the dialog control has to be specified, the system prompts have to be designed and the context restrictions have to be specified.

Text-To-Speech Synthesis: The TTS is needed to transform the prompts selected by the TDDM into the speech signal. The TTS engine receives the filled prompt template and generates a sound file. In the prototypes of the research work of this thesis, the embedded TTS engine Vocalizer for Automotive by Nuance is employed.

Synchronization Component (SYNC): As the in-car SDS architecture is multimodal, it is important to synchronize all states of the input and output modalities and applications (Ehrlich and Jersak, 2006). Therefore, the TDDM, the GUI and the applications have to communicate with each other. The communication is achieved by transmitting XML messages on a socket connection via the SYNC component. The distribution of the messages is performed by SYNC. If the state of a module changes, the currently active module updates SYNC with its new state. For example, if the user inputs a search parameter by speech, the TDDM has to update its internal state and notifies SYNC. The GUI then receives a message from SYNC to update its own internal state in order to give visual feedback about the understood utterance. When a new SDS has to be developed, the XML message exchange has to be configured.

Graphical User Interface: The GUI supports the speech dialog as visual output modality by presenting dedicated information on the screen. As the main focus of this research work is on speech interaction, haptic input (e.g. via CCE) is not addressed. The only haptic input interaction is pressing the PTA button in order to initialize the speech dialog.
The GUI is organized in different regions, which separate the GUI screen into independent areas (e.g. application line, play field, sub-function line) that can present different information at the same time. Each region is realized as a finite state machine (FSM). For each state of the FSM, different renderings of the region can be defined. The different renderings are based on presenting different background images, image-based widgets or text fields. Transitions are defined in order to move to a different state and update the screen. These state changes are triggered by XML messages received from the SYNC component, which can initially be sent from the DM as in the example above. If the

user haptically interacts with the GUI, XML messages are sent from the GUI component to SYNC. However, in this use case, this only applies when the PTA button is pressed. Here the GUI notifies the DM via SYNC that the audio capturing process should be started, as the user would like to speak to the system. In order to develop a new GUI, the different screens have to be designed and the GUI states and transitions have to be implemented.

Applications: In the course of the speech interaction, the DM has to send data to and receive data from an application. The data can either be retrieved from a local application or from an application on the Internet. Often, local applications are implemented to simulate existing head unit applications for demonstration purposes. Local apps can communicate with the SDF via SYNC using the described XML socket connection. The communication of the SDF with an Internet application requires the implementation of an interface application. The interface application encapsulates the technology of the requested Web services (e.g. the Simple Object Access Protocol (SOAP)). If a new SDS has to be developed, the local application or the interface application has to be implemented. Both types of applications are implemented in C++.

This Subsection presented an overview of the different components of the SDS architecture which is used to develop the prototypes of this research work. The development of a new SDS prototype requires the configuration of the different SDS modules. Therefore, the next Section describes how a speech dialog is implemented in the SDF.

SDF Dialog Specification

Since SDSs are very complex and consist of several single components, the development of a new system is very time-consuming, as generally all components have to be configured separately. For instance, the development of a new SDS requires the configuration of the ASR pronunciation lexicon and grammar, the NLU lexicon and grammar, the system prompts, and the specification of the dialog control. However, the communication between the different components can only be achieved by establishing common knowledge bases and interfaces in order to prevent inconsistencies and incompatibilities. The SDF has been developed based on an approach which tries to unite the different configurations within one document. The knowledge bases for all relevant speech dialog components - the ASR, the linguistic analysis, the CI and the TDDM - are configured in a common dialog specification. As XML is well structurable and easy to parse, the dialog specification is written in XML. In this XML dialog specification, the configuration of the knowledge bases and the interfaces is centrally managed (Ehrlich and Jersak, 2006). In the following, the knowledge base model based on which the dialogs are structured, the dialog specification and the grammar specifications are explained in detail.

Task Hierarchy Model

Before developing a speech dialog, an ontology about the domain has to be established. This ontology represents general knowledge of the respective topic as a set of concepts including their properties and the interrelationships of those concepts. In complex dialogs, humans do not talk about the same topic all the time. People switch between topics and sub-topics during their conversation. For example, if the user wants to travel from one place to another, he could discuss with a travel information center whether to travel by train or by car.
If he goes by train, he needs to discuss the departure and the arrival time, and if he goes by car, the best route or a possible parking lot might be discussed. Here the main topic travel has been split into two sub-topics, train and car (Ehrlich, 1999). This hierarchical structure of tasks has been taken into consideration in the dialog modeling using the TDDM of the SDF. The required ontology is structured as a task hierarchy model, in which each task consists of a hierarchy

of sub-tasks including their associated roles. The roles represent task parameters which are relevant to the respective task. In addition to the task hierarchy, sub-dialogs define how the user and the SDS actually speak about the task. Each task of the task hierarchy model can be handled in a sub-dialog (Ehrlich and Jersak, 2006). A sample task hierarchy model of the travel task is illustrated in Figure 2.18.

Fig. 2.18. Hierarchy of task types, similar to Ehrlich (1999): the task travel (roles: from, to, departure_time, arrival_time) splits into the sub-tasks train (train_station, train_departure, train_arrival), with further sub-tasks ticket, seat_reservation, public_transport and delay, and car (parking_garage, car_departure_time, car_arrival_time), with the sub-task parking_lot.

This structuring of dialogs into several sub-topics does not limit the dialog flow to a strict dialog control strategy (e.g. one given by a state transition network). Based on the defined (sub-)dialog goals and the dialog strategy, the dialog control determines the dialog continuation within each activated theme. For example, each sub-task including its roles can trigger a system reaction if the according user input is given. Thus, the dialog becomes very flexible and adapts to the user's input (Ehrlich, 1999).

Modeling an application or a dialog as a hierarchy of sub-tasks yields many advantages:

- structuring a dialog as a task hierarchy reflects human-human conversations,
- complex speech dialogs can be clearly structured into sub-topics,
- sequences of sub-topics can be modeled,
- several instances of the same sub-task can be generated (e.g., when the user asks for another train connection than the previously discussed one),
- previously discussed sub-tasks can be referenced or switched to,
- the task hierarchy allows reducing the size of the ASR grammar and lexicon for each sub-task and thereby improves the recognition accuracy,
- by modeling the speech interaction as a task hierarchy, the speech dialog becomes very flexible and user-adaptive (Ehrlich and Jersak, 2006).

Due to the large number of advantages, the task hierarchy model is employed in the dialog modeling in the SDF. The dialogs of this research work have been structured in tasks including roles.
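To make the model concrete, a task hierarchy with roles could be represented as sketched below. This is an illustrative data structure using the travel example, not the SDF's actual XML-based representation:

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    roles: dict = field(default_factory=dict)   # role name -> filled value
    subtasks: list = field(default_factory=list)

    def unfilled_roles(self):
        return [r for r, v in self.roles.items() if v is None]

train = Task("train", {"train_station": None, "train_departure": None,
                       "train_arrival": None})
car = Task("car", {"parking_garage": None, "car_departure_time": None,
                   "car_arrival_time": None})
travel = Task("travel",
              {"from": None, "to": None,
               "departure_time": None, "arrival_time": None},
              [train, car])

travel.roles["from"] = "Ulm"
print(travel.unfilled_roles())   # roles still to be elicited in the dialog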

The dialog specification, which allows configuring the relevant SDF components in order to design a speech dialog, is described in the following.

Dialog Specification

The dialog structure as a hierarchy of tasks and their roles sets the base for the TDDM and the CI. For each task, several sub-dialogs can be specified by the developer. These sub-dialogs define system reactions, such as system prompts, and the conditions which trigger these reactions. The roles can be considered as information entities which have to be filled during the speech interaction. Depending on the roles and their values, different system reactions can be triggered. Each sub-dialog is equipped with a set of prompts. Depending on the roles which have not been filled yet, the TDDM decides about the prompt selection. The prompt selection has influence on the context which is set for the next user response (see the XML snippet below):

<!-- Sample prompt definition in XML -->
<prompt promptname="requestdeparture" prompttext="When would you like to leave by train?">
  <dialogrole rolename="departure" state="request" />
</prompt>

Imagine the user has previously indicated that he would like to travel by train, and the context has been updated. Now, since the dialog role departure has not been filled yet, the TDDM selects the corresponding prompt and the user is asked about the departure date next. Thereby, the context is further limited to receive the expected departure value in the train context. When the user answers, the task of the CI is to interpret the output of the linguistic analysis in the context of the respective task. For example, the utterance "today at three p.m." shall be interpreted as train departure time if the current topic is train. If the current topic were car, the utterance would have to be interpreted as driving departure time. The configuration of the CI is based on a set of mapping rules, which define the mapping from the linguistic analysis output and its combinations to tasks and roles (Ehrlich and Jersak, 2006). The mapping rules are automatically generated from the dialog specification.
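This role-driven prompt selection can be pictured as a simple lookup over unfilled roles. The following schematic sketch reuses the names from the XML snippet above; the SDF itself realizes the mechanism via the compiled XML specification, not in Python:

PROMPTS = {
    "departure": "When would you like to leave by train?",
    "arrival": "When would you like to arrive?",
}

def select_prompt(roles):
    """roles: dict mapping role name -> value (None if unfilled)."""
    for role, value in roles.items():
        if value is None:
            # The selected role also restricts the ASR grammar and the
            # CI context for the next user utterance.
            return role, PROMPTS[role]
    return None, "Your booking is complete."

context_role, prompt = select_prompt({"departure": None, "arrival": None})
print(prompt)   # "When would you like to leave by train?"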

Grammar and Lexicon Definitions for the ASR and the Linguistic Analysis

The parser of the linguistic analysis module is a generic algorithm which employs lexica and a unification-based grammar formalism on a phrase level, which can contain both morphosyntactic and semantic knowledge. The grammars and lexica are represented in the PATR-II format (Shieber et al., 1983). The content of the linguistic analysis configuration should be consistent with the words of the ASR pronunciation lexica and the ASR grammar; other words are regarded as meaningless. By configuring the ASR and the linguistic analysis within one document, these consistencies can be better guaranteed. The grammars are based on fixed word sequences, which have to be specified manually; alternatively, external predefined grammars can be imported. For each task, possible user utterances are modeled in the grammar. The grammar takes into consideration each role of the task which might be addressed when talking about the sub-topic, and tries to cover the utterances the user might say. For example, if the system requests the departure date, the grammar might look like the following JSGF code snippet:

grammar departure;

import <departure_lexicon>;

public <date> = today | tomorrow | the day after tomorrow;
public <departure> = [I would like to] <depart> <date> | the <departure_date> is <date> | <date>;

The grammar makes use of lexica which contain important words and synonyms the user might say. A sample lexicon is illustrated in the following (written in JSGF for easier understanding):

lexicon departure_lexicon;

public <depart> = leave | depart | start;
public <departure_date> = departure | day of departure | date of departure;

The sample grammar departure imports the lexicon departure_lexicon and generates sentences such as "I would like to leave today" or "The day of departure is tomorrow". Both the ASR and the linguistic analysis make use of this grammar. In order to extract meaning from the recognized utterance, key phrases have to be defined, which the parser tries to find in each utterance. Key phrases are semantically relevant phrase constituents. For each key phrase, the semantic interpretation has to be assigned:

keyphrase date;

public <date> = today {this.from="today"} | tomorrow {this.from="tomorrow"} | the day after tomorrow {this.from="the day after tomorrow"};

If the departure date was requested and the user answers with "I would like to leave today", the linguistic analysis would apply the key phrase rule date, extract the word "today", assign the variable date_input = today and send this AVP to the CI.
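A key-phrase spotter of this kind can be sketched in a few lines; the rule below mirrors the JSGF key phrase date, and the function and attribute names are illustrative rather than taken from the SDF:

KEYPHRASES = {
    "date_input": ["the day after tomorrow", "today", "tomorrow"],
}

def spot_keyphrases(utterance):
    """Return AVPs for all key phrases found in the utterance."""
    avps = {}
    text = utterance.lower()
    for attribute, phrases in KEYPHRASES.items():
        for phrase in phrases:       # longest phrases are listed first
            if phrase in text:
                avps[attribute] = phrase
                break
    return avps

print(spot_keyphrases("I would like to leave today"))
# {'date_input': 'today'}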
The aspiration of modeling human language based on rule-based grammars is quite challenging and time-consuming. When implementing a new prototype in the SDF, the words and phrases of the ASR and linguistic analysis components have to be specified. Based on these specifications, the grammars and lexica are automatically generated. However, in the generation process, syntactically wrong utterances might be produced. Furthermore, legal word sequences which were not anticipated by the developers are often ruled out, or syntactically erroneous sentences are falsely accepted. Especially languages with a complex syntax (e.g. German) make it difficult to design a wide and flexible grammar which still produces only syntactically correct sentences.

When all the described modules have been specified in the XML specification format, the dialog configurations can be generated. As the configuration of the relevant speech dialog components is achieved within one document, inconsistencies and incompatibilities can be checked beforehand. By using an XML schema and XML tools such as XMLSPY, invalid XML structures can already be detected during the construction of the dialog specification. Further consistency checks are performed when compiling the XML specification. By validating the XML specification on syntax and compliance with the task hierarchy, the compiler can detect inconsistencies within the document. If the specification is valid, the compiler automatically generates the knowledge bases for the SDS components. Furthermore, all or special subsets of the specified phrases and their interpretations can be generated to make it easier for the developer to inspect the generated grammars.

This Subsection gave an overview of the Daimler SDF, which is employed in this research work to develop the SDS prototypes. The goal of this research work is the development of a user-friendly SDS with which the user is able to perform online information exchange tasks in a dual-task scenario. Hereby, the Internet task has to be performed as secondary task without impairing the primary task performed in parallel. In order to evaluate whether the secondary task performance negatively influences the primary task, the primary task performance has to be assessed. In this research work, the automotive use case has been chosen as the example, where the primary task is driving.

Using the developed SDS prototypes while driving must not distract the driver and thereby must not impair the driving performance. Driver distraction is a well-known metric to evaluate negative influences of secondary tasks while driving. Therefore, in this research work, the level of driver distraction is used to assess the degradation of the primary task performance. In order to provide the reader with a deeper understanding of driver distraction, the next Section defines the term and explains causes and outcomes of driver distraction. Furthermore, common evaluation measures to assess the level of distraction are presented.

2.3 Driver Distraction

Driving a car is a complex task which requires the coordination of different physical, cognitive, sensory and psychomotor skills. Furthermore, a substantial degree of concentration and attention is demanded of the driver. Although this task is very complex, people engage in other activities while driving. For instance, while driving, people talk to passengers, listen to the radio, control the infotainment system or use portable electronic devices. All these activities, which compete for the driver's attention, may degrade driving performance and impact driver safety (Young et al., 2003). A study of the Governors Highway Safety Association (GHSA), which observed 100 American drivers for a full year, found that drivers were distracted between one-quarter and one-half of the time (Governors Highway Safety Association, 2011). The U.S. Department of Transportation reported in 2009 that 20% of injury crashes involved distracted driving (National Highway Traffic Safety Administration, 2009). Driver distraction is a difficult safety problem and is addressed not only in the USA but all over the world. In the USA, Europe and Japan, driver distraction is even considered a priority issue in road safety. Due to the mobile Internet revolution and the increase of entertainment and driver assistance systems on the vehicle market, the incidence of distraction-related crashes is expected to escalate (Young and Regan, 2007).

Especially the use of mobile phones while driving has increased enormously within the last years. In a recent study conducted by McKinsey&Company (Huber, 2013), 35% of respondents admitted to using their smartphones while driving. 89% said that the smartphone use was for calls, 68% for navigation, 39% for SMS and 31% for using the Internet. Using a smartphone while driving highly demands the driver's resources and thereby distracts from driving.

In this research work, driving a car is considered as the primary task, whose performance must not be impaired by other distracting activities. In order to provide a deeper understanding of driver distraction, this Subsection is dedicated to giving information about what driver distraction is and how it can be assessed. As driving is just an example primary task in the addressed dual-task scenario, driver distraction is just a sample measure to assess the degradation of the primary task. Presenting a profound overview of the background of driver distraction, including all the developed measures and evaluation techniques, would go beyond the scope of this Section. Therefore, this overview focuses on the information the reader needs in order to understand the remainder of this thesis. In this Section, first, a definition of driver distraction is given, followed by its sources and its outcomes.
Second, this Section describes how distraction can be assessed and what kinds of assessment methods exist.

2.3.1 Definition, Sources and Outcomes

This Subsection defines and explains the term driver distraction. Furthermore, it is explained what causes distraction and what its impact on driving performance is. Although driver distraction is a worldwide research topic, until today there is no universally agreed-upon definition of driver distraction (Young and Regan, 2007). The literature agrees that, when defining driver distraction, the term first has to be distinguished from the broader category of driver inattention. Driver inattention refers to "any condition, state or event [...] that causes the driver to pay less attention than required for the driving task" (Beirness et al., 2002, p. 5). In contrast, distracted driving occurs "when a driver is delayed in the recognition of information needed to safely accomplish the driving task because some event, activity, object or person within or outside the vehicle compelled or tended to induce the driver's shifting attention away from the driving task" (Treat, 1980, p. 21). For instance, being fatigued is part of inattention, but it is not a source of driver distraction. The presence of a specific event or activity which triggers the distraction distinguishes distracted driving from the broader category of inattentive driving (Beirness et al., 2002). Streff and Spradlin (2000) define driver distraction as a shift of attention away from stimuli critical to safe driving toward stimuli that are not related to safe driving.

Young and Regan (2007) argue that the definitions in the literature involve a shifting of attention away from the driving task, but that they do not address the fact that not all events or objects which attract the driver's attention actually create a distraction and, as a result, degrade the driving performance. They claim that if the secondary task performance has no negative effect on the driving performance, then distraction has not occurred. In their opinion, driver distraction only occurs when the driver's attention is diverted away from the driving task by an event or an object to such an extent that the driver is no longer able to perform the driving task safely or adequately (Young and Regan, 2007). Based on the literature and the addressed shortcomings, Regan, Lee and Young develop the following definition of driver distraction in their book Driver Distraction: Theory, Effects and Mitigation (CRC Press, 2008): driver distraction is "a diversion of attention away from activities critical for safe driving toward a competing activity" (Lee et al., 2008, p. 34). This definition of driver distraction is also applied in the research of this thesis.

Research has categorized distraction into four different types: visual, auditory, biomechanical (physical) and cognitive distraction (Young and Regan, 2007), each of which can affect the driving performance negatively:

Visual Distraction occurs when the driver does not keep his eyes on the road and instead focuses his visual attention on another visual target for an extended time period (Young and Regan, 2007). For instance, the driver could look at the in-car head unit screen instead of at the road.

Auditory Distraction occurs when the driver temporarily or continually focuses his attention on auditory signals or sounds instead of focusing on the road environment (Young and Regan, 2007). The auditory notification signal of an incoming call is an example of auditory distraction.
Biomechanical (Physical) Distraction occurs when the driver removes one or both hands from the steering wheel for an extended time period in order to physically manipulate an object which is not related to the physical tasks required to drive safely, such as changing gears (Young and Regan, 2007). For instance, if the driver uses his smartphone manually while driving, he is physically distracted.

Cognitive Distraction includes any thoughts which reduce the driver's attention to the point that he is unable to navigate safely through the road environment (Young and Regan, 2007). Talking to another person on a mobile phone while driving is one of the most well-documented situations in which cognitive distraction occurs (Young et al., 2003).

Although the four categories of distraction are classed separately, they do not occur mutually exclusively. For instance, operating a smartphone while driving involves all four types of distraction. Here, physical distraction is caused by dialing a number haptically, visual distraction is induced by looking at the smartphone screen, auditory distraction is caused by talking to the other person on the line and cognitive distraction is induced by paying attention to the topic of the conversation rather than focusing on the road environment (Young et al., 2003).

A long debate has been carried out on the question of which form of interference leads to the greatest degradation in driving performance. According to Wickens' multiple resource theory (MRT), dual-task interference occurs if two tasks which are performed in parallel tap the same resources (Wickens, 1984). Driving is primarily a visual-spatial-manual task. According to the MRT, tasks which require visual and physical responses should then cause greater dual-task interference and thereby result in greater degradation of driving performance. In the literature there is some evidence which confirms this theory. However, just because two tasks performed in parallel use different resources does not imply that they will not cause any dual-task interference at all (Young and Regan, 2007).

The different types of driver distraction can be induced by a variety of sources. Concerning distraction caused by activities or objects inside the vehicle, Young et al. (2003) break down the sources of distraction into technology-based and non-technology-based distracters. The most relevant technology-based distracters are the use of mobile phones, route guidance systems and in-car infotainment systems. The main non-technology-based activities which drivers engage in while driving include eating, drinking, smoking and talking to passengers. Both technology-based and non-technology-based distractions can lead to an impaired driving performance. The frequency with which the driver is exposed to the distracting source has an influence on the extent to which distraction compromises driver safety (Young et al., 2003).

Several studies have investigated the role of driver distraction in crashes. The addressed sources of distraction inside the vehicle and their relative dangers have been analyzed in order to determine which source of distraction has the greatest distracting effect on drivers. Unfortunately, these studies have differed in several ways (using different variants of the same device or of the same activity, or using different methodologies), which makes it difficult to come to a conclusion about which in-vehicle device or activity is more distracting than others. However, a general trend which emerges from the literature is that the more complex an activity or a system is and the more time it takes to complete the task, the more it distracts the driver. Thus, operating complex devices, such as a route guidance system, appears to have a stronger negative influence on the driving performance compared to a relatively simple task, such as tuning the radio (Young et al., 2003). In order to assess the level of the different types of driver distraction, experimental studies can be conducted in which the driver behavior and the driving performance are tracked.
The distraction measures and methods relevant to this research work are explained in the following.

Driver Distraction Assessment

In driver distraction research, various metrics and methods have been developed to assess the level of driver distraction. As driving is just an example primary task in this research work, driver distraction is just a sample measure to assess the degradation of the primary task. Presenting an overview of all developed measures and evaluation techniques would go beyond the scope of this thesis. Therefore, this Subsection focuses on the distraction measures and methods relevant to the driver distraction assessment in this work.

Driver Distraction Measures

This Subsection presents the most commonly applied driver distraction measures, which are used today to assess the level of distraction and thus the degradation of driving performance (Young et al., 2003):

Lateral Position: The lateral position of the vehicle on the road in reference to the centre of the lane is assessed. Research examining the effect of using mobile phones while driving has revealed that the lateral position of the vehicle on the road is negatively affected and that greater lane position deviations are made (Young et al., 2003).

Speed Control: Speed control assesses how well drivers are able to keep a given speed limit. Several studies found that larger variations in driving speed occur when drivers are distracted. Interestingly, drivers even tend to lower their speed when they talk on a mobile phone while driving (Young et al., 2003).

Event Detection and Reaction Times: These metrics measure the time between the availability of a stimulus and the first measurable response to the stimulus. Furthermore, the number of missed events or incorrect responses can be assessed. Research has shown that the driver's reactions to external events or objects are impaired by the use of in-vehicle devices, especially complex ones. Various studies revealed that using a mobile phone or other in-vehicle devices while driving can increase drivers' reaction times to driving hazards or common road events (e.g., traffic light changes) by up to 30% (Young et al., 2008).

Gap Acceptance: In distraction research, gap acceptance measures assess the number of collisions initiated and the size of gaps accepted. When in-vehicle devices, such as mobile phones, are used while driving, drivers tend to accept shorter gaps in traffic than without using such a device (Young et al., 2008).

Eye Movement Measures: Eye movement measures assess the number and the duration of glances off the road that are required to complete a secondary task (Hurts et al., 2011). Research has shown that visual demand off the road induced by the use of in-vehicle devices reduces driver situation awareness (Rogers et al., 2011) and negatively influences driving performance (Horrey et al., 2006), as the time spent looking inside the vehicle is not spent looking at the road for potential crash-inducing hazards.

Driver Workload: This metric assesses the amount of cognitive resources a driver has to allocate in order to complete a task successfully. Using mobile devices while driving increases the driver workload; the more complex the control of the secondary task, the higher the workload (Young et al., 2003). Workload can be assessed by measuring physiological parameters, such as heart rate or skin conductance (Son et al., 2011), or self-reported by the use of questionnaires (Young et al., 2008).

As driver distraction is multidimensional, no single distraction measure will cover all the effects of distraction. Since many different distraction measures exist, it is difficult to decide which measure to employ. Young et al. (2008) suggest choosing the measures depending on the type of competing task. In this research work the use of different SDS concepts while driving is evaluated regarding usability and driver distraction. Driving simulator studies are applied in order to assess the lateral position of the vehicle.
As variations in speed are less safety-critical than other measures (such as the variation in lateral position), this measure is not applied in this work. Due to the strong relationship between event detection and response times on the one hand and the increased risk of crash involvement on the other, these metrics are assessed in the driving simulator study. As the driving task of the driving simulator software is not designed to evaluate gap acceptance, this value could not be measured. In the course of the conducted research work it became apparent that users accept SDS variants with different visual appearances, which makes visual distraction worth investigating.

Therefore, in one of the conducted driving simulator studies, eye movement measurement techniques are applied. Furthermore, driver workload is assessed in the experiments. The next paragraphs describe common evaluation methods employed to assess the metrics described above; again, the focus is on the methods applied in this research work. Furthermore, the concretely applied evaluation metrics are presented in the following.

Methods for Measuring Distraction

In order to measure the distracting effect of using an SDS while driving, a driving environment is needed. This can be achieved in a field test or using a driving simulator. In a field test, participants drive a real car on a test track or in real traffic, which reflects real-world conditions. The vehicle can be equipped with various instruments (e.g., cameras) to capture the driver's behavior, and the driving performance data can be collected using data loggers. However, this method is very time-consuming and expensive and is subject to strict regulations; therefore, it is rarely used to measure driver distraction. As the use of in-vehicle devices while driving is dangerous, experiments measuring their distracting effects have to be handled with care. A more popular and more frequently applied method is the use of a driving simulator. Simulators allow several driving performance measures to be collected in a relatively realistic driving environment (Young et al., 2003). As simulators provide a safe environment, this method is used in this research work. The technology of driving simulators ranges from simple gaming equipment to high-fidelity simulators based on a real vehicle whose car buses are accessed to gather driving information from the driver. A driving simulator always requires driving simulation software, which simulates the driving situation. The driving software allows the driving task and the situation to be controlled and thereby ensures a level playing field for all participants of the experiment.

A commonly employed and ISO-standardized (ISO 26022, 2010) driving task is the lane change test (LCT), developed by Mattes (2003). The LCT is a reliable and validated PC-based driving simulation, which is often used in dual-task driving experiments. In experiments using the LCT, participants are presented with a simulated road consisting of three straight lanes without any other traffic, as illustrated in Figure 2.19. Furthermore, the track contains several lane-change signs along the road, which indicate the type of lane change to be performed. Each test track has a fixed length and contains 18 road signs. The speed is limited to 60 km/h and cannot be exceeded. The participants are instructed to accelerate until 60 km/h is reached and to keep the gas pedal pressed, whereby the speed stays constant at 60 km/h. Furthermore, they are instructed to perform the lane changes as fast and as accurately as possible according to the signs. In a dual-task study participants have to follow the LCT and perform a secondary task in parallel. The analysis software of the LCT allows the lateral deviation from the ideal line or from a recorded baseline drive to be measured (see Figure 2.20) (Mattes, 2003).
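To illustrate the general principle of such an analysis, the following minimal sketch computes the mean lateral deviation of a driven path from a reference path. It is an illustration only, not the actual LCT analysis software; the assumption that both traces are sampled at the same track positions is made for simplicity.

# Minimal sketch of an LCT-style lateral deviation analysis (illustrative
# only; the actual LCT analysis tool is a separate program). Both the
# reference path and the driven path are assumed to be sampled at the
# same positions along the track.

def mean_lateral_deviation(reference, driven):
    """Mean absolute lateral deviation (in meters) between two equally
    sampled lateral position traces."""
    assert len(reference) == len(driven)
    return sum(abs(r - d) for r, d in zip(reference, driven)) / len(reference)

# Hypothetical lateral positions (m) sampled along the track:
reference = [0.0, 0.0, 1.2, 3.8, 3.8, 3.8, 1.9, 0.0]   # e.g., baseline drive
driven    = [0.1, 0.3, 0.9, 3.2, 4.1, 3.6, 2.4, 0.4]   # drive under dual-task load

print(f"mean deviation = {mean_lateral_deviation(reference, driven):.2f} m")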
However, today's research argues that the LCT only includes foreseeable events and that driving in the periods between events places almost no demands in terms of steering and none at all in terms of braking. In the LCT, drivers are only occasionally directed to change lanes (and with announcement) by conducting a rather unnatural, abrupt maneuver, combined with simple lane keeping on a straight road in between. Real driving, in contrast, mostly demands a rather continuous adjustment of steering angle and speed, without announcement of when exactly the next demand will occur and to what extent a reaction will be necessary. This might lead to strategic task completion and to an underestimation of driver distraction. Furthermore, the LCT is based on fixed tracks, which limits continuous recordings to three minutes. Due to these shortcomings, a new driving simulation software including a successor to the LCT is currently being developed and validated in the research project GetHomeSafe.

The so-called ConTRe (Continuous Tracking and Reaction) task, part of the OpenDS driving simulation software, complements the de-facto standard LCT with higher sensitivity and a more flexible driving task without restart interruptions. The employed steering task for lateral control resembles a continuous follow drive, which helps to obtain more detailed results about the effects of the two dialog strategies. Furthermore, mental demand can be assessed by an additional reaction task implemented as longitudinal control (Mahr et al., 2012).

Fig. 2.19. Screenshot of the LCT driving simulation.

Fig. 2.20. Screenshot of the LCT analysis software presenting the deviation from the reference lane.

In this research work the LCT is only used in an explorative pretest, which is not elaborated in detail. In the more comprehensive experiments, whose results are presented in detail in this thesis, the ConTRe task is applied. Therefore, only the driving task and the evaluation metrics of the ConTRe task are explained in detail in the following. Figure 2.21 presents a screenshot of the ConTRe task from the driver's perspective.

In the ConTRe task the user perceives two cylinders (one yellow, the other blue) on a unidirectional straight road consisting of two lanes. The yellow cylinder, also called the reference cylinder, moves autonomously at a constant longitudinal distance according to a simulation algorithm. The movement speed and direction of the reference cylinder are not predictable by the driver. In addition, a traffic light containing two different lights is placed on top of the reference cylinder. The lower light shines green when it is switched on, whereas the top one can light up red. Either none or only one of these lights is switched on at a time. The driver's primary task in the simulator is to turn the steering wheel and operate the brake and acceleration pedals. However, the system feedback differs from normal driving. In the driving simulation the car moves autonomously at 50 km/h. By turning the steering wheel, the lateral position of the blue cylinder can be controlled. The goal of the driver is to keep the blue cylinder overlapping with the reference cylinder as well as possible. Thus, this task resembles a continuous follow drive. If the red traffic light is switched on, the driver has to brake immediately. In case of the green light, an immediate reaction with the gas pedal is expected. As soon as the user reacts correctly, the respective traffic light is switched off. Braking and accelerating have no influence on the driving speed. In effect, this part of the task resembles an event detection and reaction task (Mahr et al., 2012). The implementation of the ConTRe task allows different control variables to be modified in order to change the difficulty level of the driving task. The changes concern the movement speed and movement direction of the reference cylinder and the frequency of the braking and acceleration situations (Mahr et al., 2012).
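The following sketch summarizes the ConTRe task logic just described. It is an illustrative simplification, not the actual OpenDS implementation; in particular, the random-walk movement of the reference cylinder and all parameter names are assumptions.

import random

# Illustrative sketch of the ConTRe task logic (not the actual OpenDS code).
# The reference cylinder moves unpredictably; the driver steers the blue
# cylinder to keep it overlapping and reacts to red/green light events.

class ConTReTask:
    def __init__(self, movement_speed=0.5, event_probability=0.02):
        self.movement_speed = movement_speed        # difficulty: reference movement speed
        self.event_probability = event_probability  # difficulty: frequency of light events
        self.reference_pos = 0.0   # lateral position of the yellow reference cylinder
        self.blue_pos = 0.0        # lateral position of the driver-controlled cylinder
        self.light = None          # None, "red" (brake) or "green" (accelerate)

    def step(self, steering_input, pedal_input, dt=0.02):
        # The reference cylinder drifts unpredictably (simplified random walk).
        self.reference_pos += random.uniform(-1, 1) * self.movement_speed * dt
        # Steering moves the blue cylinder; the speed itself is fixed at 50 km/h.
        self.blue_pos += steering_input * dt
        # Occasionally trigger a reaction event.
        if self.light is None and random.random() < self.event_probability:
            self.light = random.choice(["red", "green"])
        # A correct pedal reaction switches the light off again.
        if (self.light == "red" and pedal_input == "brake") or \
           (self.light == "green" and pedal_input == "gas"):
            self.light = None
        # The tracking error feeds the lateral deviation measure.
        return abs(self.blue_pos - self.reference_pos)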

Fig. 2.21. Screenshot of the ConTRe task driving simulation.

Fig. 2.22. Screenshot of the ConTRe task output presenting the deviation from the reference lane and the reaction time performances.

Applying the methods described above, a driving environment can be created based on which driving simulation experiments can be conducted. The objective and subjective measures that are assessed to evaluate the level of distraction in the experiments of this research work are described in the following.

Objective Driver Distraction Evaluation

Using the ConTRe task, the lateral position as well as event detection and reaction time performances can be assessed. OpenDS can be configured to present the performance results immediately after the data recording (see Figure 2.22). The lateral position metric is based on the current distance between the controllable cylinder and the reference cylinder. The driving simulation calculates the distance internally in meters but outputs the deviation value in relation to the width of the street, which is a fixed value of eight meters. Hence, 100% deviation corresponds to the maximum width of the two cylinders, which corresponds to a full lane distance between both cylinders (Mahr et al., 2012). In this research work, the following measure is used:

Mean Deviation (MDev): Average deviation of the lateral position while performing a task.

Furthermore, the driving simulation software records the response times of operating the acceleration and brake pedals and the number of false reactions or omissions (Mahr et al., 2012), which is applied in this work:

Response Time (RT): Average response time of correct reactions to events while performing a task.

Further information about the ConTRe task can be obtained from Mahr et al. (2012).
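The following minimal sketch illustrates how MDev and RT might be computed from logged simulator data. The log format (per-frame deviation samples and per-event reaction records) is an assumption for illustration, not the actual OpenDS output format.

# Minimal sketch of computing MDev and RT from logged ConTRe data
# (the log format is an assumption, not the actual OpenDS output).

STREET_WIDTH_M = 8.0  # fixed street width used to normalize the deviation

def mean_deviation(deviation_samples_m):
    """MDev in percent: average lateral deviation relative to street width."""
    mean_m = sum(deviation_samples_m) / len(deviation_samples_m)
    return 100.0 * mean_m / STREET_WIDTH_M

def response_time(reactions):
    """RT: average response time (s) over correct reactions only.
    reactions: list of (reaction_time_s, correct) tuples; omissions may be
    encoded as (None, False)."""
    correct = [t for t, ok in reactions if ok and t is not None]
    return sum(correct) / len(correct)

# Hypothetical logged data for one task:
deviations = [0.4, 0.9, 1.3, 0.7, 0.5]                # meters, sampled per frame
events = [(0.61, True), (0.84, True), (None, False)]  # one omission
print(f"MDev = {mean_deviation(deviations):.1f}%  RT = {response_time(events):.2f} s")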

Eye movement measures are related to the movement of the eye and to glances at a certain object. In general, the human eye does not move smoothly across the visual field. Instead, it makes a series of jumps, called saccades, followed by fixations, periods of relative stability during which an object is focused (Jacob, 1995). In this context, glances correspond to the fixations on an object, which demand most of the visual attention. Eye movement can be assessed by visual occlusion or eye glance studies. Visual occlusion is based on the assumption that driving requires visual attention to the road only part of the time; the rest of the time can be used for other purposes, such as controlling in-vehicle devices. Using this method, the participant has to operate an in-vehicle device while the participant's vision is partially or fully occluded through the use of a shield or visor, which opens and shuts frequently. The occluded vision is meant to resemble the driving situation. This method can be used to evaluate whether an in-vehicle task can be successfully carried out using only the small amount of visual attention that the driving scenario offers (Young et al., 2003). However, this technique does not put the driver in a driving situation and cannot be used in a dual-task study, where the user has to perform a secondary task while driving, as the vision of the participant is occluded.

In contrast, eye glance studies can be used to measure eye movements in dual-task studies. This technique measures the visual behavior of participants by recording the frequency and the duration of the participants' eye glances at objects while driving (Farber et al., 2000). When performing a secondary task while driving, the driver's eye movement is characterized by a series of short glances (one to two seconds), which are needed to complete the task. In eye glance studies the frequency and duration of glances towards the secondary task are recorded, which indicate the eyes-off-road time and thus the interference or visual demand required to perform the task (Haigney and Westerman, 2001). The eyes-off-road time is highly correlated with degraded driving performance during secondary task performance and is therefore a widely accepted and valid measure of the visual demand evoked by performing a secondary task (Haigney and Westerman, 2001; Horrey et al., 2006).

Today, sophisticated eye tracking systems are employed to measure the eye movements of participants. Head-mounted eye trackers, such as SMI's eye tracking glasses, can be worn on the head for mobile use. However, wearing a device might influence real user behavior, and the data analysis is very time-consuming. Remote eye trackers, such as tobii's X2-60 eye tracker, are stationary and often used for on-screen studies. Remote eye trackers can be applied in an environment where the participant does not move and when the investigation of eye movements is focused on a certain screen. In the driving environment the driver is seated on the driver seat and does not move, and the focus of this research is to investigate glances at the screen. Therefore, in the driving simulation studies of this research work a remote eye tracker (tobii's table-mounted eye tracker IS-Z1) was employed. Hurts et al. (2011) suggest the following eye movement measures, which are also applied in this research work:

Number of Glances (NoG): The number of distinct glances at the screen that are required to complete the task,

Mean Glance Duration (MGD): The average duration of the glances while performing a task,

Total Glance Time (TGT): The sum of the durations of the distinct glances at the screen while performing a task.

In addition, the Percent Dwell Time (PDT), the percentage of time that the participants spend looking at the screen while performing a task, is assessed.
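As an illustration, all four measures can be derived from a list of glance intervals exported by the eye tracker. The (start, end) interval representation below is an assumption for the sketch, not the native tobii data format.

# Minimal sketch of computing NoG, MGD, TGT and PDT from glance intervals
# (the (start, end) interval representation is an assumption, not the
# native tobii export format).

def glance_metrics(glances, task_duration_s):
    """glances: list of (start_s, end_s) intervals of looking at the screen."""
    durations = [end - start for start, end in glances]
    nog = len(durations)                      # Number of Glances
    tgt = sum(durations)                      # Total Glance Time (s)
    mgd = tgt / nog if nog else 0.0           # Mean Glance Duration (s)
    pdt = 100.0 * tgt / task_duration_s       # Percent Dwell Time
    return {"NoG": nog, "MGD": mgd, "TGT": tgt, "PDT": pdt}

# Hypothetical glance log for one task lasting 45 seconds:
log = [(2.0, 3.1), (10.4, 11.9), (20.0, 20.8)]
print(glance_metrics(log, task_duration_s=45.0))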
Subjective Driver Distraction Evaluation

Driver workload can be assessed by measuring physiological parameters or by the use of questionnaires. As self-reported workload measures are simple and easy to analyze, workload is assessed by questionnaires in this research work. Several subjective mental workload scales, which assess the individual's perceived workload, have been developed and used in the driving domain, such as the National Aeronautics and Space Administration Task Load Index (NASA TLX) (Hart and Stavenland, 1988), the Subjective Workload Assessment Technique (SWAT) (Reid and Nygren, 1988), the Rating Scale Mental Effort (RSME) (Zijlstra, 1993) and the Modified Cooper Harper Scale (MCH) (Wierwille and Casali, 1983).

The NASA TLX and SWAT are based on multidimensional scales in order to address different workload dimensions. A more recently developed multidimensional workload questionnaire is the Driving Activity Load Index (DALI) (Pauzie, 2008). The DALI is based on the NASA TLX, has been specifically adapted to the workload assessment of in-vehicle tasks in the driving context and has been validated in the AIDE project (Young et al., 2008). As the DALI is tailored to dual tasks in a driving environment, this questionnaire is applied in this research work and is therefore explained in detail.

The DALI covers six dimensions: effort of attention, visual demand, auditory demand, temporal demand, interference and situational stress (Pauzie, 2008). The NASA TLX included a physical demand component, which covered the physical activity required to perform the activity. Pauzie (2008) argues that this item is not relevant, since controlling a vehicle is quite automatic for an experienced driver and therefore not physically demanding. However, driving simulations do not exactly reflect real driving, which is why the physical demand should still be assessed. Therefore, in practice, the DALI questionnaire for subjective workload assessment often consists of the following dimensions (Harvey and Stanton, 2013):

Global Effort of Attention: The overall mental (e.g., deciding, thinking), visual and auditory demand required during the experiment to perform the whole activity,

Visual Demand: The visual demand required during the experiment to perform the whole activity,

Auditory Demand: The auditory demand required during the experiment to perform the whole activity,

Physical Demand: The physical demand required during the experiment,

Stress: The level of stress (e.g., irritation, insecurity, fatigue) while performing the whole activity,

Temporal Demand: The pressure and specific constraints felt due to the time pressure of completing tasks while performing the whole activity,

Interference: The disturbance of the primary driving task when completing the secondary task (e.g., controlling an in-vehicle device) in parallel.

A DALI questionnaire generally consists of one item for each dimension, which is rated on a scale from low to high (0, ..., 5).
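As a simple illustration, the sketch below collects and aggregates such ratings. The unweighted mean across dimensions as an overall score is an assumption made for the sketch; it is one common way to report an overall value, not the only possible aggregation.

# Minimal sketch of collecting and aggregating DALI ratings (illustrative).
# Each dimension is rated on a 0..5 scale; the unweighted mean as an
# overall score is an assumption, not the only possible aggregation.

DALI_DIMENSIONS = [
    "global effort of attention", "visual demand", "auditory demand",
    "physical demand", "stress", "temporal demand", "interference",
]

def dali_overall(ratings):
    """ratings: dict mapping each dimension to an integer rating in 0..5."""
    for dim in DALI_DIMENSIONS:
        if not 0 <= ratings[dim] <= 5:
            raise ValueError(f"rating for '{dim}' out of range: {ratings[dim]}")
    return sum(ratings[d] for d in DALI_DIMENSIONS) / len(DALI_DIMENSIONS)

# Hypothetical ratings from one participant after a dual-task condition:
example = {"global effort of attention": 3, "visual demand": 2,
           "auditory demand": 1, "physical demand": 1, "stress": 2,
           "temporal demand": 3, "interference": 2}
print(f"overall DALI score: {dali_overall(example):.2f}")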
In order to mitigate and manage driver distraction, several countermeasures are taken. Governmental strategies comprise enacting laws that restrict the use of distracting devices and distracting behavior while driving. Furthermore, by training and educating drivers, people can be made aware of the dangers of driver distraction. Nowadays, research tries to invent technology to manage distraction: technology lockouts, which, for instance, lock distracting functions on a device, and adaptive interfaces, which observe the current driving environment, shall help to reduce accidents caused by distraction. The most promising approach, and also the most relevant to this research work, is the development of standards and guidelines that should be followed when developing a new in-car human-machine interface with minimum distraction potential (Hurts et al., 2011).

These standards and guidelines contain performance-based goals that must be reached by in-car human-machine interfaces so that the interaction does not distract the driver while driving. The SAE-J2944 draft (2013) is a standard currently under development with the purpose of defining driving performance measures and statistics for driver distraction studies. Guidelines and recommendations by the Commission of the European Communities (2006) and the Driver Focus-Telematics Working Group (2002) can be used to design systems with minimum distraction potential. The Driver Focus-Telematics Working Group is a task force created by the Alliance of Automobile Manufacturers (AAM) and the National Highway Traffic Safety Administration (NHTSA) of the United States. Amongst other things, their guidelines include time-based criteria concerning visual-manual tasks performed while driving. The so-called AAM guidelines suggest that single-glance durations (SGD) should generally not exceed two seconds and that the TGT should not exceed 20 seconds. Here, the 85th percentile of the distribution is relevant for the evaluation, which represents a common design standard in traffic engineering (Driver Focus-Telematics Working Group, 2002):

P_85(SGD) < 2 s, (2.7)

P_85(TGT) < 20 s. (2.8)

In the United States and other countries, adherence to such guidelines is voluntary. However, in April 2002 members of the AAM (BMW Group, General Motors, Mercedes-Benz USA, Toyota and many more) signed a letter of commitment to the NHTSA. By signing this letter, the manufacturers committed to designing new products in accordance with the alliance's guidelines developed by the Driver Focus-Telematics Working Group (Hurts et al., 2011). The concept design and evaluation in this research work also follow the AAM guidelines.
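A minimal sketch of how compliance with these two criteria might be checked on logged glance data is given below; the linear percentile interpolation is an implementation assumption.

import math

# Minimal sketch of an AAM guideline check on logged glance durations
# (illustrative; the percentile interpolation is an implementation assumption).

def percentile(values, p):
    """p-th percentile with linear interpolation between sorted samples."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return s[int(k)]
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def aam_compliant(glance_durations_per_task):
    """glance_durations_per_task: one list of single-glance durations
    (seconds) per performed task."""
    all_glances = [g for task in glance_durations_per_task for g in task]
    tgts = [sum(task) for task in glance_durations_per_task]
    return percentile(all_glances, 85) < 2.0 and percentile(tgts, 85) < 20.0

# Hypothetical glance logs for three tasks:
tasks = [[0.8, 1.1, 0.9], [1.4, 0.7, 1.2, 0.6], [0.9, 1.0]]
print(aam_compliant(tasks))  # True if P_85(SGD) < 2 s and P_85(TGT) < 20 s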

This Section gave an overview of driver distraction, which negatively influences driving performance. First, driver distraction was defined and its sources and outcomes were described. Second, measures and methods to assess driver distraction were presented. As the necessary technical and theoretical background about SDSs and their application in the field has been provided, the next Section introduces and discusses related work that addressed the comparison of speech dialog strategies and the proactive behavior of SDSs. Based on identified shortcomings, the concrete research questions are pointed out.

2.4 Summary and Discussion

This Section summarizes Chapter 2 and discusses related work and the arising challenges, which are addressed in the scope of this research work.

2.4.1 Summary

In order to understand the difficulties and challenges of developing an SDS in the driving domain, the technical and theoretical background about SDSs and their application in the driving environment has been provided in Chapter 2. In Section 2.1, the fundamentals of SDSs were described, focusing on the background that forms the basis for the dialog design and the evaluation of SDSs. First, an overview of SDSs was given (Subsection 2.1.1), followed by the description of dialog management concepts and the design of dialog strategies (Subsection 2.1.2). When designing a new SDS, different methods to model the dialog initiative, control and context, as well as possible grounding techniques, have to be taken into consideration, all of which influence the human-machine speech interaction. In Subsection 2.1.3, conversational speech interfaces and their characteristics were introduced.

Until today, different conceptions of a conversational human-machine dialog exist. According to the author, a conversational human-machine dialog should exhibit the characteristics that can be found in human-human dialogs. Therefore, in order to design conversational speech interfaces, the human dialog and spontaneous speech phenomena presented in this Subsection have to be taken into consideration. Subsection 2.1.4 defined proactive behavior in general and its application in SDSs. Proactive behavior has been identified as anticipatory, change-oriented and self-initiated. In a proactive human-machine speech dialog, the user first has to be notified about an incoming event in an unobtrusive way; then the problem-solving process can be initiated. When the new task is completed, possibly paused tasks have to be resumed. Section 2.1 closes with the presentation of techniques that can be applied to evaluate the usability of SDSs (Subsection 2.1.5). In this research work objective interaction parameters are assessed in order to measure the overall dialog quality, and the SASSI questionnaire is applied to evaluate user acceptance.

Section 2.2 provided background on the application of SDSs in the field. In this thesis the application of SDSs in the automotive environment is in focus. First, in Subsection 2.2.1, state-of-the-art in-car SDSs were presented. Until today there is no persuasive, consistent and sophisticated SDS in the context of the driving environment that allows the user to control the Internet by speech. Subsection 2.2.2 explained the Daimler SDF, which is employed for the implementation of the SDS prototypes. This Subsection presented the architecture of the framework and its modules and described how dialogs are modeled and specified in the Daimler SDF.

Section 2.3 introduced driver distraction, which indicates how much the secondary task interferes with the primary task. In this Section driver distraction was defined, its sources and outcomes were presented, and metrics and methods to measure the level of distraction were illustrated. In this research work driver distraction is investigated in driving simulation studies. The ConTRe task is applied to assess driving performance measures, an eye tracker is used to measure visual demand, and the DALI questionnaire is applied for subjective workload assessment.

In the following, related work on the comparison of speech dialog strategies and the proactive behavior of SDSs is introduced and discussed. Based on identified shortcomings of previous work, the challenges and the goal of this research work are pointed out.

2.4.2 Related Research and Challenges

The goal of the research work at hand is the development of a user-friendly SDS that assists the user in performing information exchange tasks in a dual-task scenario. The Internet task has to be performed as a secondary task without impairing the primary task performed in parallel. Due to the strong need for a speech interface to the Internet while driving, the automotive use case has been chosen as the primary task. This Subsection presents previous work related to the research topic addressed in this thesis. From the shortcomings of the related work, the concrete research intents and objectives arise.

Performing Information Exchange Tasks by Speech While Driving

As pointed out, information exchange tasks are performed in several steps. First, the user indicates several input parameters, based on which he receives a list of objects.
The user has to browse through the list of possible options in order to find the desired object. One goal of this research work is the design of an optimal SDS concept for performing these required multi-step dialogs while driving. The development of speech dialogs offers a variety of design options: for instance, different speaking styles or dialog management techniques have to be taken into consideration. In the driving environment the SDS is supported by a GUI, which is why different GUI designs are conceivable.

In order to find the optimal SDS, different SDS concepts have to be designed and prototypically implemented. In order to find the most suitable speech interface for performing information exchange tasks as secondary tasks, the developed prototypes are evaluated in user studies.

The decision about the dialog initiative (user-initiated, system-initiated or mixed-initiative) has a strong influence on the usability of the speech dialog. Research has already been conducted on the comparison of SDSs employing different dialog initiative strategies. Ackermann and Libossek (2006) compare a system-directed with a user-directed dialog in a dual-task scenario, which is within the scope of this thesis. The system-directed SDS achieves significantly better results concerning usability and mental workload than the user-directed one. In the user-directed dialog, users were often stuck and did not know how to proceed because the system did not understand their requests. User-directed systems thus do not seem to be applicable and are therefore not further examined in this research work.

Mixed-initiative dialogs are the basis for conversational speech dialogs. As this dialog strategy allows users greater input variability, developers attempt to model the ASR and the NLU to cope with spontaneous speech phenomena (such as more flexible sentence constructions). In contrast, system-directed dialogs are less flexible and limit the input possibilities to certain commands and other elliptical utterances, which is why they are also called command-based dialogs. At first glance, conversational speech interfaces seem to provide greater usability due to their flexibility and should be the means of choice when it comes to the dialog strategy decision. However, Thomson and Wisowaty (1999) argue that conversational speech interfaces lead to user confusion, as users do not know how to talk to such an SDS. Untrained users are unfamiliar with the system capabilities and do not know exactly what spoken format is expected by the speech recognizer. Furthermore, Berg (2013) argues that allowing users to speak freely leads to inaccuracies in the ASR and NLU components, which might negatively affect the usability. As both remaining strategies have advantages and disadvantages, research needs to focus on the comparison of these two speech dialog strategies.

Walker et al. (1997a) compare a mixed-initiative dialog agent to a system-directed dialog agent. They conclude that the system-initiative dialog is better for inexperienced users, whereas mixed-initiative is preferred by users as they gain experience using the system successfully. Concerning the overall dialog quality, the mixed-initiative dialog strategy did not surpass the system-directed strategy across several tasks. Devillers and Bonneau-Maynard (1998) compare two SDSs allowing the user to retrieve touristic information; one dialog strategy guides the user via system suggestions, the other does not. The authors conclude that user guidance is suitable for novices and appreciated by all kinds of users. However, these studies did not address a dual-task scenario, where the users' preferences might differ, as they have to perform a primary task in parallel. When users have to perform a secondary task by speech, they might prefer a simple and clearly directed dialog with low mental demands, which would speak for a system-directed dialog strategy.
Nevertheless, this dialog strategy requires many dialog steps and demands the user's attention for a long time. Using a mixed-initiative dialog, the user can input multiple parameters at once and thereby speed up the dialog. However, this dialog strategy is more complex and mentally more demanding, which might have a negative effect on the primary task. Depending on the environmental context, one or the other strategy might be preferred. Especially in the dual-task scenario it needs to be investigated whether a fast and efficient or a simple and guided way of performing tasks by speech is the means of choice. Therefore, it is important to find out which strategy is the most appropriate for information exchange tasks in the driving scenario.

Until today, only little research has examined different speech dialog strategies while driving. Mutschler et al. (2007) investigated speech dialog strategies as part of a multimodal research question. In the EU-funded project TALK (talk-project.eurice.eu), Mutschler et al. compared two multimodal

systems in a driving scenario, one based on a command-based speech dialog, the other on a conversational speech dialog. In the experiment, participants had to control the in-car mp3 player by speech or haptic input while driving. Each speech dialog strategy was supported by the same GUI. The main research goal was to investigate multimodal interaction with a focus on modality selection. Although the conversational dialog was more efficient, the command-based dialog was more appreciated by the participants. According to Mutschler et al., the high error rate of the conversational strategy was the reason for the higher acceptance of the command-based dialog. No significant differences in driving performance were revealed between the different SDSs. However, the comparison of speech dialog strategies is based only on the available speech turns and is rather a side product of this experiment; therefore, the results have to be handled with care. As speech recognizer quality has improved enormously within the last five years, the influence of the weak speech recognition performance of Mutschler et al.'s conversational dialog may be less significant nowadays. Furthermore, the use of the same GUI for both dialog strategies could have additionally influenced the result. The GUI should be adapted to the particular dialog strategy in order to benefit the most from the advantages of the respective strategy and to allow for a comparison of optimal systems.

This research work compares a command-based and a conversational speech dialog strategy in the driving environment and focuses on speech as the only input modality. Thus, the prototypes are designed explicitly for the comparison of speech dialog strategies. In order to support the speech dialog as well as possible and to present as little output on the screen as needed, the GUI is adapted to the respective strategy. In addition, conditions without GUI are evaluated. Thereby, the full potential of the speech modality can be exploited and the effect of the GUI on usability and driver distraction can be examined. In a driving simulation study applying the ConTRe task, the different HMI concepts are evaluated concerning usability and driver distraction. Objective and subjective dialog measures are applied to assess the dialog quality and user acceptance. In order to assess driver distraction, driving performance measures and subjective workload measures are applied. Additionally, visual demand is assessed by recording the participants' glances at the screen using an eye tracker. This comprehensive investigation will provide profound results and thus deep insights into the use of speech dialog strategies in the driving environment.

Handling Proactively Incoming Events by Speech While Driving

As illustrated in the Introduction, information exchange tasks also show proactive behavior. This self-initiated characteristic might require a lot of attention from the user and might distract from the primary driving task. Therefore, research needs to investigate how proactive behavior can be integrated into SDSs in the driving environment. The proactive dialog flow consists of several steps. First, the user has to be notified about an incoming event; then the problem-solving process has to be started. Finally, the new task has to be completed and possibly paused tasks have to be resumed.
This research work focuses on the first part of the proactive dialog flow, in which the user has to be notified about incoming information. When a new message comes in, the user is either idle or already interacting with the SDS. This thesis investigates how an idle driver should be notified about an incoming message in a user-friendly way.

Proactivity in HMI in mobile environments has only recently gained attention in the research community. Vico et al. (2011) compare two proactive user interface concepts for a recommender system on a smartphone: a widget-based solution, which is embedded in the home screen of the device, and a notification-based solution, which presents new information in the status bar of the mobile device. The results showed that users prefer the widget-based concept over the notification-based concept. However, the user interaction only concerned haptic input and visual output for use on mobile devices

and did not involve any speech interaction, which is required in the automotive environment. Bader et al. (2011) conducted a user study in a real-world driving setup to examine user acceptance of a proactive recommender system. Results show that the proactive recommender system is perceived as helpful and does not distract from driving. Again, only visual output is used to inform the user about new information. A comparison of proactive speech dialog concepts has not been addressed yet. Furthermore, this study does not take the current contextual situation into account. An intelligent user interface needs to be adaptive and has to provide information according to the current contextual situation.

In this research work, several speech-based proactive notification concepts for incoming events in different contextual situations are designed, implemented and evaluated. The goal is to find out which speech interaction concept is the most adequate to inform the driver proactively, depending on the current cognitive load and the priority of the incoming information. SDS prototypes supported by a GUI employing the designed notification concepts are developed and evaluated in a driving simulator study applying the ConTRe task. Objective and subjective dialog measures are applied to assess the dialog quality and user acceptance. In order to assess driver distraction, objective driving performance measures and subjective workload measures are applied. This research work is supposed to give initial insights into the proactive behavior of SDSs and to pave the way for further investigations in this relatively young research field.

The research work is in line with the usability engineering lifecycle described in Subsection 2.1.5, which suggests a situational analysis at the beginning of the development of an SDS in a new domain. Therefore, before developing different SDS concepts, an initial user study has been conducted. The goals of the study are to gather information about how people use the Internet on smartphones, to obtain information about preferred speaking styles, and to build a corpus of utterances that can be used for the development of the later prototypes. The user study setup and its results are presented in the following Chapter.


3 User Study on Speech Interaction with the Internet

The goal of this research work is to develop a speech interface that enables users to perform Internet tasks while driving. The development of speech interfaces for a new domain requires broad knowledge about the domain and about how people interact with it. Therefore, a user study had to be conducted first. The initial user study should provide information about how people use the Internet on their smartphones and how they would use the Internet while driving. Furthermore, the study aimed at gaining knowledge about how users would interact with the Internet by speech. The user study provided information about people's speaking styles depending on the Internet task to be performed. The collected data was needed for the concept design and the implementation of the later speech dialog prototypes. As the concept design, the prototypical implementation and the evaluation of this research work were targeted at German users, this initial user study was designed for German participants. In this Chapter, first the method of the initial user study, consisting of a questionnaire and an audio data collection, is described. Subsequently, the results and their implications for the following research work are presented.

3.1 Method

Developing an SDS in a new domain is a difficult task. Glass et al. (2000) consider the development a classic chicken-and-egg problem: in order to design an SDS, you need a data collection from real users; however, to collect data that reflects real usage, you need a system the user can interact with. At an early development stage, a common research method to collect user data is the conduction of WOZ experiments. At a later stage in the development life cycle, when a first SDS has been developed, user experiments can be conducted to further improve the system. When designing tasks for such user experiments, it is always of great importance to ensure that the participants are able to perform their tasks in as natural a way as possible (Bernsen et al., 1998). The task briefing necessary to introduce the participants can be achieved through a variety of means. Bernsen et al. (1998) suggest giving written instructions or presenting graphically depicted scenarios. However, they report massive priming effects when using the text scenarios. Walker et al. (2001) encountered the same problem in their data collection experiments, where the participants were presented with fixed scenarios in a tabular format, which contained the information to be communicated to the system. By putting words into the users' mouths, no variety in utterances could be collected. In contrast, the graphic scenarios of Bernsen et al.'s experiments did not lead to priming effects.

The user study aimed at investigating the users' speech interaction when performing a large variety of Internet tasks. WOZ experiments, which are usually used to examine human-machine interaction,

are time-consuming and cost-intensive. The WOZ technique would be applicable to investigate a single or only a few Internet tasks; however, covering a large variety of Internet tasks would require a disproportionate effort. Therefore, no WOZ experiments were conducted in this study. Instead, the participants were put into imagined situations, which had to be solved by speaking to an imaginary computer system. Graphically depicted scenarios were used to inform the users while avoiding priming effects.

According to a study by BITKOM (2011), 87% of German citizens between 18 and 24 years and 79% between 25 and 34 years would like to have Internet access in the car. The elders' interest in a connected car was much lower. Therefore, according to these findings, the study was targeted at young German adults between 18 and 34 years. The whole user study was browser-based and could be performed at the participant's PC at home; participants were only required to have an Internet connection and a microphone. The link to the user study was distributed via the social network facebook, as most of today's Internet users own a facebook account and since this platform offers an easy way to reach participants of the target group. The user study was conducted at the end of 2011, shortly before the introduction of the first voice-controlled mobile assistant Siri.

The Web-based user study was split into two parts. In the first part, audio data was collected in order to gather information about people's way of interacting with the Internet by speech. The second part consisted of a questionnaire, which should help to gain insights about the participants, their use of smartphones and Internet apps or services, and their attitude towards SDSs.

3.1.1 Collection of Audio Data

In the briefing, the participants were informed in text form that they should imagine sitting in front of a speech computer with which they were able to use Internet services by speech. The system was able to search for information or to perform a certain action online. The presented tasks put situations into their minds; the speech computer was to be used to get the requested information or to perform the demanded task online. By keeping the briefing general, the users were not influenced in advance.

Depending on the Web task to be performed, people may interact differently. Therefore, the tasks were classified according to Kellar's Web information task classification (Kellar, 2007), which has been described in detail in Section 1.3:

Information Seeking: Tasks in which users try to change their state of knowledge (Marchionini, 1995) (e.g., fact finding),

Information Exchange: Actions that are performed online, subdivided into Transactions (e.g., hotel bookings) and Communications (e.g., messaging in social networks),

Information Maintenance: Visits to Web pages with the goal of maintaining Web resources (Kellar, 2007) (e.g., Web development).

As information maintenance is no activity to be performed while driving, only information seeking and information exchange tasks (subdivided into transactions and communications) were designed. For each of the three categories six tasks were designed, 18 tasks in total. (In this Section, the term Internet services is used instead of Internet apps, since it was a common synonym at the time of the user study. Back then, Internet apps had not been as strongly integrated into daily routines as today, and the term app had not reached today's status yet.
However, Internet services also referred to small applications located on the Internet.)

The tasks were designed according to a scheme in order to ensure the comparability of answers. The scheme was also used to reject wrong user utterances if misunderstandings of tasks occurred. Each task demanded from the users at least two parameters of the scheme. A sample task for each task category is presented in Figures 3.1, 3.2 and 3.3, and Tables 3.1, 3.2 and 3.3 show the corresponding task schemes for each of the presented tasks. All graphically depicted tasks can be found in Appendix A.1. There were two reasons for rejecting an utterance: first, the user did not speak all of the required parameters (or synonyms of those parameters), and second, the user added non-required information. If more than 75% of a participant's utterances matched the scheme, the whole participant was accepted; a sketch of this logic follows below. The task design process ran through multiple iterations, and the tasks were pre-tested with friendly users to find out whether the desired situation was put into the user's mind.
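A minimal sketch of this acceptance logic is shown below. The concrete representation of a scheme as a set of required parameter synonyms is an assumption for illustration; the check for added non-required information is omitted.

# Minimal sketch of the utterance/participant acceptance logic (illustrative).
# A scheme is assumed to be a list of parameter slots, each given as a set
# of accepted synonyms; an utterance matches if every required slot is
# mentioned. (Checking for added non-required information is omitted here.)

def utterance_matches(utterance, scheme):
    text = utterance.lower()
    return all(any(syn in text for syn in slot) for slot in scheme)

def participant_accepted(utterances, schemes, threshold=0.75):
    """Accept the participant if more than 75% of the utterances match."""
    matches = sum(utterance_matches(u, s) for u, s in zip(utterances, schemes))
    return matches / len(utterances) > threshold

# Hypothetical scheme for the weather task of Figure 3.1:
weather_scheme = [{"wetter", "weather"}, {"paris"}]
print(utterance_matches("Wie ist das Wetter in Paris?", weather_scheme))  # True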

Fig. 3.1. Information seeking sample tasks.

Fig. 3.2. Communications sample tasks.

Fig. 3.3. Transactions sample tasks.

Table 3.1. Information seeking task scheme including sample tasks corresponding to Figure 3.1.
  Parameters: Object | Properties | Date | Point/Geo Position
  Sample: weather | - | - | Paris
  Possible correct utterance: "Wie ist das Wetter in Paris?" (engl.: "How is the weather in Paris?")

Table 3.2. Communications task scheme including sample tasks corresponding to Figure 3.2.
  Parameters: Object | Action | Refinement | Web Service
  Sample: message | send | recipient: Alexander Müller | facebook
  Possible correct utterance: "Ich möchte eine facebook-Nachricht an Alexander Müller senden." (engl.: "I would like to send a facebook message to Alexander Müller.")

Table 3.3. Transactions task scheme including sample tasks corresponding to Figure 3.3.
  Parameters: Object | Action | Refinement | Web Service
  Sample: iPhone | bid | 450 Euro | ebay
  Possible correct utterance: "450 Euro auf das iPhone auf ebay bieten." (engl.: "Bid 450 Euro on the iPhone on ebay.")

In this part of the user study, audio data had to be recorded from the users in a browser-based setup. As Flash was the most pervasive cross-platform Web technology at that time, the Adobe Flash Player was available on most Internet-enabled PCs. Therefore, the recordings were made via a Flash application embedded in the Web page, which recorded the data and sent it via HTTP multipart to the data storage server. The Flash recording window was placed next to each task on the Website. The users were able to record, listen to and re-record their utterance if necessary (see Figure 3.4). After pressing the save button, they were led straight to the next task without a possibility to go back. In the briefing of part one of the user study, the users were able to test their microphone and to get familiar with the Flash application. After having accomplished the 18 tasks, they were asked two final questions on a 5-point scale about the difficulty of the tasks (1: very easy, 5: very difficult) and how much they would like this kind of interaction (1: very little, 5: very much).

Fig. 3.4. Screenshot of the Flash application for recording the audio data.

3.1.2 Questionnaire

The second part was a PHP-based questionnaire designed with the graftstat questionnaire tool funded by the Bundeszentrale für politische Bildung (BPB). The questionnaire consisted of 43 items in total and was organized in five short blocks. First, questions gathering demographic information about the participants were asked (age, computer expertise, use of the Internet, etc.). The next block of items addressed the participants' use of their car and of in-car voice-control. Part C investigated the participants' use of smartphones and their mobile Internet usage. When asked which kinds of applications they frequently used, some example applications were given, but the participants were also able to name their own applications. Next, participants were asked what kinds of Internet applications they would like to use in the car; again, some example applications were given, with the possibility to name their own. In the last block the participants were asked questions about the integration of Internet services into the car and their attitude towards SDSs. All answers of the questionnaire were designed as 4-point or 5-point Likert scales. Some sample questions are presented below; the whole questionnaire can be found in Appendix A.2.

Do you use mobile Internet access on your smartphone while driving?
o never  o sometimes  o often  o always

Imagine safe and undistracted driving is assured. Which Internet services would you use as a driver while driving? - Retrieve instant information: weather, stocks, news, sport, other information

A user-friendly in-car voice-control reduces distraction from driving.
o strongly disagree  o disagree  o neither nor  o agree  o strongly agree

3.2 Results

This Section presents the results of the initial user study. First, the results of the questionnaire are presented, which give information about the participants and their use of the Internet, followed by the results and the analysis of the audio data collection.

3.2.1 Questionnaire Results

The results of the questionnaire gave demographic information about the participants (see Figure 3.5). In total, 73 participants took part in the user study, of whom 63% were male and 37% female. The average age was 26.3 years (standard deviation (SD) = 3.4), which lies in the targeted age group. Most of the participants had a university or college degree (61%) and were experienced in the use of PCs. Furthermore, 74% of the participants owned a smartphone, of whom 91% owned a mobile Internet flatrate. 45% owned a car, and 85% stated that they would use voice-control if their car provided this functionality.

Fig. 3.5. Demographic information about the participants (gender, educational attainment, PC experience).

Table 3.4 shows the most popular Internet services on the smartphone and the most preferred Internet services in the car (assuming that undistracted driving is assured). The most popular Internet services used on smartphones were sending and receiving e-mails (88% use this service frequently), navigational services (86%) and the use of social networks (83%). When the users were asked about their preference of applications in the car, the ranking differed: sending and receiving e-mails was still ranked first (97%), followed by listening to Internet radio (96%) and navigational services. Using social networks, however, was only ranked 9th (68%). The participants were also asked whether they used the mobile Internet capabilities of their smartphones while driving: 50% stated that they use mobile Internet on their smartphone while driving, and 15% use it regularly.

The results of the questions about in-car voice-control and the integration of Internet services into the car can be found in Figure 3.6. Most of the participants believe that a user-friendly SDS in the car reduces distraction. 82% would use Internet services while driving if they were available. These findings match the previously mentioned results from the BITKOM study. 51% would use these services only if undistracted driving is assured. However, 36% of the participants prefer being always connected even though safe driving is not assured.

Table 3.4. Popularity of Internet services on smartphones and in the car.

Most popular Internet services on the smartphone (X% of the users use this service regularly):
  Sending and receiving e-mails (88%)
  Navigational services (86%)
  Use of social networks (e.g., facebook) (83%)
  Web search (79%)
  News reading (67%)
  Fact finding (e.g., from Wikipedia) (60%)

Most preferred Internet services in the car, assuming undistracted driving is assured (X% of the users would use this service regularly):
  Sending and receiving e-mails (97%)
  Listening to Internet radio (96%)
  Navigational services (95%)
  News reading (94%)
  Request opening times (94%)
  Making reservations (e.g., restaurant, cinema) (85%)
  Looking up train or bus schedules (69%)
  Use of social networks (e.g., facebook) (68%)
  Fact finding (e.g., from Wikipedia) (65%)

Fig. 3.6. Results of the questions about the participants' attitude towards in-car voice-control and the integration of Internet services into the car.

3.2.2 Data Collection and Speaking Style Analysis

In the audio recording experiment, 1314 utterances from the 73 participants were collected. Only the utterance set of one participant could not be used, due to the bad quality of the audio files. The remaining 1296 utterances were transcribed orthographically and evaluated against the scheme of each task. The last two items in the data collection revealed that the participants generally liked this way of interacting with Internet services (mean value (MV) = 3.9, SD = 1.0). The users rated the tasks as simple (MV = 2.2, SD = 0.8), which is reflected by the few rejections: from the remaining 72 participants, only 10 had to be rejected because more than 25% of their utterances did not match the scheme. Finally, 114 utterances of the remaining 62 participants had to be rejected due to single wrong answers, whereby 1002 utterances remained.

answers, whereby 1002 utterances remained. The utterances were rejected because the participants obviously did not understand the task and therefore the resulting utterances were no longer comparable. The remaining utterances were analyzed with regard to speaking styles, which could be classified into the categories illustrated in Table 3.5. The revealed differences in the speaking styles concern utterances spoken when the user takes the initiative in the dialog, for instance when the user initiates the speech interaction with his first utterance. Human-human communication reflects the way humans communicate among each other. Concerning SDSs, this speaking style has also been referred to as conversational (see section 2.1.3), which is also illustrated in Table 3.5. In the given tasks this category can be further subdivided into explicit and implicit demands. The second large category classifies speaking styles which can be found when humans speak to machines. Here, two further subdivisions were identified. The command style (further subdivided into explicit and implicit commands) can be found when users interact with a state-of-the-art in-car voice-control system. The second identified speaking style of human-machine interaction is the keyword style. Here the participants' utterances are influenced by a strong mental model of how they normally use the Internet with a browser: first the service is called, and afterwards further keywords or human-machine commands are used to specify the request.

Table 3.5. Identified speaking styles in the participants' utterances.

Human-human communication, conversational style:
- Explicit demand: "Sende eine facebook-Nachricht an Alexander Müller." ("Send a facebook message to Alexander Müller.")
- Implicit demand: "Ich möchte eine facebook-Nachricht an Alexander Müller senden." ("I would like to send a facebook message to Alexander Müller.")

Human-machine interaction, command style:
- Explicit command: "Alexander Müller eine facebook-Nachricht senden." (no corresponding syntax existing in English)
- Implicit command: "Facebook-Nachricht an Alexander Müller." ("Facebook message to Alexander Müller.")

Human-machine interaction, keyword style:
- Keywords: "Facebook. Neue Nachricht. Empfänger Alexander Müller." ("Facebook. New message. Recipient Alexander Müller.")
- Keywords and commands: "Facebook. Nachricht an Alexander Müller." ("Facebook. Message to Alexander Müller.")

The frequency of occurrence of the main categories of the speaking styles was analyzed (see Fig. 3.7). Across all tasks ("All"), the conversational speaking style predominates; command and keyword style are almost equal. However, when distinguishing between the three Web task categories, the situation looks different. Concerning information seeking ("InfoSeek"), 65% of the tasks were solved in a conversational style. In the case of communications ("Comm") and transactions ("Trans") the number of conversational utterances shrinks. Here, both the command style and the keyword style

grow. In communications all three styles seem to be almost equal; in transactions the conversational style still appears to occur the most frequently.

Fig. 3.7. Frequency of occurrence of speaking styles related to the different task categories ("All", "InfoSeek", "Comm", "Trans").

The number of words per utterance was analyzed in relation to the different speaking styles (see Table 3.6). When using the conversational style, more words were spoken than when using the command or the keyword style.

Table 3.6. Number of words per utterance (mean and SD) for the conversational style (human-human communication) and for the command and keyword styles (human-machine interaction).

Discussion

The results of the questionnaire showed that already by the end of 2011 almost three quarters of the participants owned a smartphone. Nine out of 10 smartphone users were using an Internet flat rate, which reflects the strong use of smartphones and the presence of an always-connected community among the participants. The most popular Internet services used on smartphones were sending and receiving e-mails, navigational services and the use of social networks. Concerning the preference of applications in the car, the ranking differed. Sending and receiving e-mails was still ranked first, followed by listening to Internet radio and navigational services. Using social networks, however, dropped to 9th place. Although young adults spend a lot of time on facebook or other social networks, it seems that the need for communication falls behind when it comes to driving. The answers to the questions about the use of the Internet while driving were alarming. Half of the participants indicated that they use their smartphone's Internet capabilities while driving. Furthermore, 36% of the participants

would use Internet services while driving although undistracted driving is not assured. More than one third of the young adults seem to consider being always connected as more important than personal safety. Fortunately, the results also showed that the users are willing to use and trust in SDSs as a technology to reduce driver distraction. The strong desire to use the Internet while driving and the ignorance of safety issues emphasize the need for an in-car speech interface to Internet services to prevent accidents among the young generation.

The analysis of the recorded audio data revealed the following three speaking style categories: conversational, command-based, and keyword style. The conversational style reflects human-human communication and corresponds to the addressed speech interaction style described in section 2.1.3. The command-based style consists of short utterances and is found in system-directed human-machine dialogs (such as in-car voice-control systems). The keyword style occurred when people followed the strong mental model of using the Internet via an Internet browser in order to solve the task. Using a browser, first the Website, in this case the Internet service, has to be opened and then the task has to be completed step by step. Therefore, in the audio data collection participants first called the service by name and afterwards further keywords or human-machine commands were used to specify the request, all in one utterance. This keyword style was a by-product of the task briefing, which was designed in order not to prime the participants.

The frequency of occurrence of the different speaking styles differed depending on the task category. Concerning information seeking tasks, the conversational speaking style predominated, since most of the users spoke queries in order to make the request. This confirms the assumption made in section 1.3 that an information seeking dialog would turn out to be a question-and-answer dialog. Concerning information exchange tasks there was no strikingly dominating speaking style. Therefore, one has to find out which dialog style users prefer most for those task types. As the keyword style is a concatenation of input parameters without a fixed grammar, this kind of phrase construction can be highly ambiguous and difficult to interpret correctly. However, all initial utterances should be covered by an SDS. Afterwards the speech interface could influence the user with the help of speech outputs to switch to a different style in order to reduce recognition errors. In the conversational speaking style more words were spoken per utterance. Generally, very short utterances are difficult to recognize, especially when the number of alternative word hypotheses possible at each point in the grammar is very high. The understanding process can sometimes benefit from longer utterances, which contain more semantically relevant constituents. If more information is given, it seems to be easier to interpret the user's intention.

3.3 Summary

This section summarizes the experimental setup and the main findings and finally presents implications for the following research work.

Summary

In this section a Web-based user study, targeted at young Germans, was presented. The initial study aimed at gaining knowledge about how people use the Internet on their smartphones and how they would use the Internet while driving.
Furthermore, the study should help to gain insights into how users would interact with the Internet by speech. In the first subsection the method of the initial user study was described. The study was split into two parts. In the first part, based on graphically depicted tasks, participants had to perform Internet activities by speech, whereby audio data were recorded. In the second part, the participants had to fill out a questionnaire about the use of the Internet on smartphones, the use of the Internet while driving and their attitude towards SDSs.

The results of the questionnaire revealed that there is a strong need to develop a speech interface to the Internet while driving, but also that users are willing to use and trust in SDSs. The speaking styles occurring in the data collection were classified into conversational, command and keyword style. Their occurrence differs depending on the Web task category. The conversational speaking style generally occurs most frequently, but concerning communications and transactions tasks all speaking styles were almost equally distributed.

Implications on Research Work

The results of the study confirm previously made assumptions. In information seeking tasks people seem to prefer to make search queries in a conversational speaking style. Using an SDS, the dialog would turn out to be a simple question-and-answer dialog, which does not pose major challenges to the dialog design. This finding confirms the assumptions made in the Introduction, and therefore information seeking tasks are not further investigated in this research work. Moreover, the further investigation should focus on information exchange tasks, as none of the identified speaking styles was strikingly dominating. The goal of the described research work is the comparison of speech-based HMI concepts when performing an information exchange task. The developed concepts and the evaluation focus on speech dialogs within one task and not across tasks or applications. As the identified keyword style would only need to be covered by an SDS when users address an application in an initial utterance, this speaking style is not covered in the concept design. The graphically depicted tasks helped to collect real speech data from the participants. The transcriptions of the audio data collection from the 73 participants contained a large variety of occurring sentence constructions, synonyms, etc. This data set was used to develop grammars for the prototypes, which were needed to evaluate the developed concepts.

In the following Chapter the development of speech-based in-car HMI concepts for the performance of information exchange tasks while driving is described. Here, different speech-based HMI concepts are designed based on the background knowledge from the literature and related work and on the insights gained from the initial user study. In particular, a command-based and a conversational dialog strategy supported by different GUI concepts are designed. These concepts are prototypically implemented using the Daimler SDF and evaluated in a driving simulator study concerning usability and driver distraction. The concept design, the realization of the prototypes, and their evaluation are presented in detail in the next Chapter.


4 Development of Speech-based In-Car HMI Concepts for Information Exchange Tasks

This Chapter describes the development of an SDS to perform information exchange tasks as secondary task in a dual-task scenario. Due to its importance, the driving environment has been chosen as use case. The development of speech dialogs offers a variety of design options. However, as presented in the related work, research has already demonstrated that the comparison of a so-called command-based and a conversational dialog should be in focus. The situational analysis described in Chapter 3 additionally confirmed the need to compare these two strategies, as people prefer these kinds of speaking styles when performing information exchange tasks by speech. Furthermore, people are used to the command-based speech dialog, which can be found in today's in-car SDSs. However, the upcoming conversational dialog style, driven by new smartphone technologies, has raised the users' expectations of speech technology. Nevertheless, it has not yet been proven whether such a novel implementation of a speech dialog is feasible in the driving environment in terms of usability and driver distraction. In this part of the research work a command-based and a conversational dialog strategy are both designed, implemented and evaluated. A hotel booking task by speech has been selected as use case. As described in Section 2.1.3, a conversational speech interface has to cover many spontaneous speech phenomena and human dialog phenomena to mimic human-human communication. Designing and implementing all of these features would go beyond the scope of this research work. Therefore, focus is placed on the most relevant features, which highlight the strengths and weaknesses of these strategies and which are the most relevant to the driving scenario. The different design concepts have been implemented using the Daimler SDF and evaluated in driving simulator studies concerning usability and driver distraction. The research work has been supported by a Master thesis supervised by the author (Silberstein, 2012).

The remainder of this Chapter is organized as follows. Based on the provided background, related research and the insights of the initial user study, different speech-based HMI concepts are designed, which is described in Section 4.1. In this Section, first, an overview of the hotel booking use case is given (subsection 4.1.1). Second, the different speech dialog concepts are described (subsection 4.1.2), followed by the GUI concepts which shall support the speech dialog (subsection 4.1.3). Section 4.2 explains the realization of the different SDS concepts as prototypes using the Daimler SDF. As the existing grammar specification did not satisfy the demands of a conversational grammar, a new approach to model conversational utterances has been developed (subsection 4.2.2). In the next step the developed prototypes had to be evaluated concerning usability and driver distraction. Section 4.3 presents the evaluation of the different speech-based HMI concept prototypes. In Section 4.3, first, the experimental method of the driving simulator study is described in detail (subsection 4.3.1). Subsequently, the results of the experiment are presented and discussed (subsections 4.3.2 and 4.3.3). Finally, Section 4.4 summarizes the Chapter and conclusions are drawn.

4.1 Design of speech-based HMI Concepts

This Section describes the designed speech-based HMI concepts. Before presenting the speech dialog and GUI concepts, a brief overview of the selected use case is given.

Functionality of the Hotel Booking Use Case

The chosen use case for the design of the HMI concepts is booking a hotel by speech while driving. Booking a hotel requires the input of several parameters, which makes it a good use case to highlight the strengths and weaknesses of the different dialog strategies. Another advantage of the hotel booking use case is that every user has a clear conception of what is needed to book a hotel. This distinct mental model makes it easier for the user to understand the voice-control concept, simplifies explanations about the functionality of the system and thereby reduces the risk of misunderstandings in the experiment. In order to access hotel data, the online hotel booking service HRS has been linked to the existing speech dialog framework. Section 4.2 gives technical information about the connection of the HRS service to the SDF. The hotel service HRS allows for various hotel search functions. After the input of several required parameters (e.g. location, arrival date, etc.), the service delivers a list of hotels which match the search criteria. Additionally, optional parameters (e.g. price range) may be entered to refine the search. The service offers a detailed description of each hotel. After a certain hotel has been selected, it can finally be booked. The mentioned functions have been taken into consideration for the design of the different HMI concepts. Each concept has been designed to allow for parameter input, result list presentation, and hotel details presentation. The HRS Web service offers many more functions. However, these functions have not been considered when designing the HMI concepts, since they would not have been of additional use for comparing the different concepts and would probably be of no use while driving. Therefore, these functions were not implemented. First, the different dialog strategies including sample dialogs are presented. Afterwards, the GUI concepts which have been designed in order to support the speech dialog are described with the aid of screenshots.

Dialog Strategy Design

Two different dialog strategies, a command-based and a conversational dialog strategy, have been designed. The general hotel booking dialog flow, which applies to both speech dialog strategies, is illustrated in Figure 4.1. First, the user has to input the required search parameters in order to get a resulting hotel list. Additionally, he can refine his search by giving additional parameters. When the result list has been retrieved, the results can be presented. If the result list does not match the user's needs, he can change the search parameters. In the list presentation sub-dialog the user is able to receive detailed information about a hotel or book the hotel directly. After having received the detailed information, the user can either go back to the list, change the search parameters or book the selected hotel. The following technical SDS features were taken into consideration when designing the different concepts: in order to speak to the system the user has to press a PTA button. Furthermore, the user is able to interrupt the system while it is speaking ("barge-in").
State-of-the-art in-car SDSs use teleprompters to inform the user visually about possible commands. However, the use of these teleprompters raises the visual attention on the head unit screen.

Fig. 4.1. Overview of the hotel booking dialog flow.

Furthermore, the design of teleprompters is difficult for conversational dialogs and thereby hampers the comparability of the two speech dialog strategies. Therefore, the user is only informed audibly about possible commands. The developed speech dialog prototypes have been specified for the German language. However, the sample dialogs given in this Section are written in English for better understanding. The characteristics of each strategy and how they differ are described in the following. When designing the different dialog strategies, the attention was particularly focused on the dialog initiative, the possibility to enter multiple input parameters and the acoustic feedback.

Command-Based Dialog Strategy

The dialog behavior of the command-based dialog strategy corresponds to the voice-control which can be found in current state-of-the-art in-car SDSs. The speech dialog is initiated by calling predefined explicit or implicit speech commands. There are several synonyms available for each command. System feedback on what was understood is given via the system reaction (e.g. execution of what was demanded) or by spoken (implicit or explicit) feedback in the system's voice prompt. If more information is needed from the user in order to fulfill the user's demands, the system guides the user. This system-directed dialog strategy is adopted for the command-based dialog strategy of the hotel booking. The whole dialog flow of the command-based hotel dialog can be represented by a graph. The graph illustrated in Figure 4.2 shows an example graph of the parameter input sub-dialog (only for the required parameters). If the user's input is missing or invalid (e.g. #city < 1 in Figure 4.2), the system has to re-request the parameter. The GUI supports the speech dialog by showing the speakable commands as widgets on the screen (see Section 4.1.3). The different sub-dialogs are explained in the following.

Parameter Input

After the first speech command the user is guided by the system and executes the next steps, which are suggested and displayed by the system. The dialog is mainly system-driven and the input possibilities are restricted to direct answers (no overanswering). Furthermore, the user is only able to set one input parameter within an utterance. After all parameters have been input, the system repeats the parameters once again.

Fig. 4.2. Command-based dialog flow graph during parameter input (required parameters only: destination, arrival, departure, room number and room type; a missing or invalid slot value leads to a re-request).

A sample dialog for the hotel booking dialog is illustrated below:

1 User: Book a hotel.
2 System: Where would you like to book a hotel?
3 User: In Stuttgart.
4 System: When would you like to arrive in Stuttgart?
5 User: Tomorrow.
6 System: How long would you like to stay in Stuttgart?
7 User: Until the day after tomorrow.
8 System: What kind of room do you prefer for the one night?
9 User: A double room.

10 System: Would you like to additionally constrain the hotel category or the price?
11 User: A three-star hotel.
12 System: You would like to book a double room in a three-star hotel in Stuttgart from October 29th, 2013 until October 30th, 2013. Would you like to see the hotel list or change the search criteria?

When all the parameters have been provided by the user, HRS is called to retrieve the list of hotels. The user can then continue the interaction by speaking certain commands, for instance for presenting the list or for changing parameters.

Changing Parameters

If recognition errors occur during the parameter input sub-dialog, the user is able to change parameters by saying the change command. This is not possible before the parameter input sub-dialog has finished. During parameter input the user has to follow the system's requests and has to wait until the system offers him the possibility to change the parameters after repeating the understood parameters. Later, when the list is presented, the user can initiate the sub-dialog to make changes at any time. In the change sub-dialog, the system asks the user which parameter he would like to change. If the user does not answer, the system makes suggestions and guides him through the sub-dialog:

1 User: Change search criteria.
2 System: What would you like to change?
3 User: ...
4 System: The destination?
5 User: No.
6 System: The arrival date?
7 User: Yes.
8 System: How did your arrival date change?
9 User: Into October 28th.
10 System: Ok, the new arrival date is October 28th. You would like to book a double room in a three-star hotel in Stuttgart from October 28th, 2013 until October 30th. Would you like...

Presenting the Result List

When all the required parameters have been provided and the system has given feedback about the understood search criteria, the user can ask for the result list. The hotels are presented one by one. For each hotel the most important facts are read out. The user can also browse through the list by using speech commands. When a hotel is found which meets his expectations, he can request detailed information or immediately book the hotel, as illustrated in the following:

1 User: Hotel list.
2 System: The three-star hotel Brita in Stuttgart-Obertuerkheim for 94 Euro.
3 User: ...
4 System: The three-star hotel Romantisches Hotel Traube in Stuttgart-Plieningen for 99 Euro.
5 User: Back.
6 System: The three-star hotel Neotel in Stuttgart-Moehringen for 102 Euro.
7 User: Show details.
8 System: The three-star hotel Neotel is located in the heart of...
9 User: Book this hotel.
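To make the system-directed behavior of Figure 4.2 more concrete, the following SCXML fragment is a minimal, hypothetical sketch of the command-based parameter input loop. All state and event names (e.g. askDestination, slot.city) are invented for illustration; the actual prototype was specified in the Daimler SDF's own dialog description format, not in SCXML. The key property is that exactly one slot is requested per turn and that an empty or invalid recognition result re-enters the requesting state:

<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="askDestination">
  <state id="askDestination">
    <!-- the system prompt of this dialog step -->
    <onentry><log expr="'Where would you like to book a hotel?'"/></onentry>
    <!-- exactly one slot is accepted per turn -->
    <transition event="slot.city" target="askArrival"/>
    <!-- missing or invalid input (#city < 1 in Figure 4.2): re-request -->
    <transition event="slot.empty" target="askDestination"/>
  </state>
  <state id="askArrival">
    <onentry><log expr="'When would you like to arrive?'"/></onentry>
    <transition event="slot.date" target="askDeparture"/>
    <transition event="slot.empty" target="askArrival"/>
  </state>
  <!-- askDeparture, askRoom etc. follow the same request/re-request pattern -->
  <state id="askDeparture"/>
</scxml>

Each state corresponds to one system question in the graph of Figure 4.2; the user cannot overanswer, because only the slot event of the currently requested parameter is handled.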

Conversational Dialog Strategy

A conversation between an agent and a client over the phone (as illustrated in Section 2.1.3) is taken as an example for the conversational dialog design. In these human-human conversations, spontaneous speech and human dialog phenomena, such as varying sentence constructions, frequent dialog initiative switches, etc., occur. As modeling all the occurring phenomena is still technologically not possible and would go beyond the scope of this research, only the most important features are incorporated in the dialog design of the conversational dialog strategy. In the conversational dialog strategy the dialog initiative switches during the speech interaction. The dialog control follows a user-adaptive dialog strategy without a globally fixed plan of the dialog. Similar to frame-based dialog control, the dialog flow is less constrained and offers multi-slot filling abilities. The user is able to speak whole sentences, whereby multiple parameters can be set within one single utterance. Thereby, the dialog can run more naturally and be flexible and efficient. Such a kind of dialog is difficult to illustrate formally. With the aid of a very simplified SCXML state chart, a rough impression of the input dialog is provided in Figure 4.3 (the state chart only takes the required parameters into consideration).

Fig. 4.3. Conversational dialog state chart during parameter input (all required parameter slots, destination, arrival, departure, room number and room type, start out as "not set"; each user turn may set several slots; missing or invalid required parameters are requested or re-requested until all are set).

The user is informed about what the system has understood by using implicit feedback. If the user has set multiple parameters in his utterance, the system does not repeat all parameters, as the system

response would be too long. Therefore, the system repeats only the contextually most important parameter. The user does not have to speak a certain command in order to start a certain sub-dialog. The system allows the user to speak sentences initiating a sub-dialog whenever he wants, if the context allows it. The GUI does not present the speakable utterances on the screen. In order to indicate the possible functions, icons are used (see Section 4.1.3). The different sub-dialogs are explained in the following.

Parameter Input

The user starts the speech interaction by speaking to the system in whole sentences. He can already mention some input parameters when addressing the system for the first time. The system checks which input parameters are missing in order to send a request to the HRS service. The system prompts the user and collects the missing information. Although the system asks for only one parameter, the user is able to give more or other information than requested. As before, after having elicited all search criteria, the system repeats the parameters once again. A sample dialog is illustrated below:

1 User: I would like to book a hotel in Stuttgart.
2 System: When would you like to arrive in Stuttgart?
3 User: I arrive tomorrow and leave the day after tomorrow.
4 System: What kind of room do you prefer for the one night?
5 User: I need a double room.
6 System: Would you like to additionally constrain the hotel category or the price?
7 User: A three-star hotel.
8 System: You would like to book a double room in a three-star hotel in Stuttgart from October 29th, 2013 until October 30th, 2013.

When the parameters have been provided by the user, HRS is called to retrieve the list of hotels. The user can now continue the interaction by speaking freely and without having to call certain commands. If the user does not say anything, the system immediately continues with the list presentation.

Changing Parameters

If the user detects a misrecognized search parameter, he can change this parameter immediately and does not have to wait until the system has elicited all the information from the user:

1 User: I would like to book a hotel in Stuttgart.
2 System: When would you like to arrive in Strassburg?
3 User: Change the destination.
4 System: What is the new destination?
5 User: Stuttgart.
6 System: Ok, Stuttgart. When would you like to arrive in Stuttgart?

Furthermore, the user is able to directly indicate the new parameter value when asking for changes. Thereby, the dialog can run more naturally and more efficiently:

1 User: I would like to book a hotel in Stuttgart.
2 System: When would you like to arrive in Strassburg?
3 User: Change the destination into Stuttgart.
4 System: Ok, Stuttgart. When would you like to arrive in Stuttgart?
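As a complement to Figure 4.3, the following SCXML fragment sketches the frame-based, multi-slot behavior of the conversational strategy. Again, this is a hypothetical illustration with invented names rather than the actual Daimler SDF specification: every user turn may fill any subset of the slot frame, and the dialog leaves the collection state as soon as all required slots are set:

<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       datamodel="ecmascript" initial="collect">
  <datamodel>
    <!-- the slot frame; all required parameters start out as "not set" -->
    <data id="destination"/>
    <data id="arrival"/>
    <data id="departure"/>
    <data id="roomNumber"/>
    <data id="roomType"/>
  </datamodel>
  <state id="collect">
    <!-- a single user turn may set several slots (multi-slot filling) -->
    <transition event="user.turn" target="collect">
      <assign location="destination" expr="_event.data.destination || destination"/>
      <assign location="arrival" expr="_event.data.arrival || arrival"/>
      <!-- ... analogous assignments for the remaining slots ... -->
    </transition>
    <!-- leave the collection state once all required slots are filled;
         otherwise the system re-prompts for the first missing slot -->
    <transition cond="destination &amp;&amp; arrival &amp;&amp; departure &amp;&amp; roomNumber &amp;&amp; roomType"
                target="searchHotels"/>
  </state>
  <final id="searchHotels"/>
</scxml>

In contrast to the command-based graph of Figure 4.2, there is no fixed order of questions; the system only requests whatever is still missing.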

Presenting the Result List

After having received the result list from the HRS service, the system presents the list of hotels. As before, the hotels are presented one by one and for each hotel the most important facts are read out. However, the user no longer has to use speech commands to browse the list. He can use conversational utterances to indicate what he would like to do:

1 User: Show me the list of hotels.
2 System: The three-star hotel Brita in Stuttgart-Obertuerkheim for 94 Euro.
3 User: ...
4 System: The three-star hotel Romantisches Hotel Traube in Stuttgart-Plieningen for 99 Euro.
5 User: The previous one.
6 System: The three-star hotel Neotel in Stuttgart-Moehringen for 102 Euro.
7 User: Show me some more information.
8 System: The three-star hotel Neotel is located in the heart of...
9 User: Book that one.

In order to give an overview of the strengths and weaknesses of the different strategies and the later prototypes, the command-based and the conversational dialog are characterized using the TRINDI ticklist in the next subsection.

Comparison of Dialog Strategies

The TRINDI ticklist from Bohlin et al. (1999), which characterizes the dialog behavior of an SDS with the help of 12 yes-no questions, gives a good overview of the implemented dialog features. Both of the SDS prototypes have been developed and differentiated corresponding to this list. The filled-out TRINDI ticklist for both dialog strategies is illustrated in Table 4.1.

Table 4.1. Characterization of the speech dialog strategies on the basis of the TRINDI ticklist (answered separately for the command-based and the conversational dialog).

Q1: Is utterance interpretation sensitive to context?
Q2: Can the system deal with answers to questions that give more information than was requested?
Q3: Can the system deal with answers to questions that give different information than was actually requested?
Q4: Can the system deal with answers to questions that give less information than was requested?
Q5: Can the system deal with ambiguous designators?
Q6: Can the system deal with negatively specified information?
Q7: Can the system deal with no answer to a question at all?

Q8: Can the system deal with noisy input? (not in the scope of the research work)
Q9: Can the system deal with help sub-dialogs initiated by the user?
Q10: Can the system deal with non-help sub-dialogs initiated by the user?
Q11: Does the system only ask appropriate follow-up questions? (no relevant dialog step in the existent hotel booking dialog)
Q12: Can the system deal with inconsistent information?

In this research work the most important dialog features, which allow for a differentiation of both dialog strategies, have been realized so far. Concerning the dialog design of the conversational dialog, high value was set on the flexibility to input parameters by speech (e.g. Q2, Q3). Dialog features which are not a beneficial characteristic of one of the dialog strategies and which would not reveal differences in the evaluation were left out to lower the development effort (e.g. Q5, Q6, Q8). The impact of the environment on the speech interaction is not in the focus of this research (Q8). The dialog flow of a hotel booking dialog is linear and does not allow for context-relevant branches, whereby Q11 becomes superfluous. In the next subsection the different GUI concepts, which have been designed to support the speech dialog, are described.

GUI Design

Several GUI concepts have been designed in order to optimally support the speech dialog strategies. The screen designs have been customized corresponding to the dialog strategies only as much as necessary, since an objective comparison is targeted. When designing the screens, the internationally standardized AAM guidelines (Driver Focus-Telematics Working Group, 2002) were adhered to, which determine the minimum font sizes, the maximum numbers of widgets, etc. in order to minimize distraction. The general layout of the different GUI screens is as follows. Each screen is split into three different regions: the top bar, the playfield and the sub-function line. In order to illustrate the different regions, Figure 4.4 presents the start screen (app selection screen), which is displayed at the very beginning of the speech interaction, and highlights the different regions of the screen. In the top bar a talking head icon is displayed. The talking head provides visual feedback to the user in order to indicate whether the system is currently listening. Thereby, the user knows if he is allowed to talk or not. For instance, if the talking head is in sleep mode, as illustrated in Figure 4.4, the user has to press the PTA button to wake the system up. In the playfield the main content is presented. Here, the input parameters or the hotel list are displayed. The sub-function line indicates the current GUI state or gives information about currently accessible sub-dialogs. Several GUI concepts have been designed: one concept which is especially adapted to the command-based dialog and one concept which is adapted to the conversational dialog. Furthermore, the conversational dialog is supported by an avatar to raise the level of naturalness in the interaction. Finally, a concept which gives almost no visual feedback is designed in order to investigate the necessity of a GUI for a successful speech dialog and its influence on driver distraction. The different GUI concepts are described in the following with the aid of screenshots.

Fig. 4.4. Layout of the GUI screens (top bar with talking head, playfield, sub-function line).

Command-based Dialog GUI

In the command-based dialog strategy the speakable input is presented on the GUI, thus supporting the speech dialog. All currently possible speech commands are displayed on the screen at all times, which may lead to a high visual distraction. Hence, in automotive terms the command-based speech dialog strategy is also called the "speak-what-you-see" strategy. The dialog starts from the app selection screen, which presents the HRS hotel application (as illustrated in Figure 4.4). When the user has called the hotel booking command, the parameter input process begins and the screen illustrated in Figure 4.5 appears. Here the first input parameter, the destination ("Ziel" in German), has to be set by the user after being requested by the system. Afterwards the user is guided step by step by the system. When the user has given the requested information, a new widget appears on the screen and the system asks the user for the corresponding input (see Figure 4.6).

Fig. 4.5. Screen of the command-based dialog at the beginning of the parameter input.
Fig. 4.6. Screen of the command-based dialog after the first parameter input.

When all the parameters have been elicited by the system and the hotel service has returned the list of hotels, the possible commands for changing the input parameters ("Suche ändern"), presenting the

result list ("Liste") and starting a new search ("Neue Suche") become visible in the sub-function line (see Figure 4.7). For instance, by calling the command "Liste" (or synonyms of the command) the list browsing sub-dialog is triggered and the hotel list is displayed (see Figure 4.8). For parameter changes no additional screen was designed; in the course of the parameter change sub-dialog the entries of the widgets of the screen of Figure 4.7 are manipulated. For presenting the details of a selected hotel or for booking confirmation there are screen overlays, similar to the list presentation.

Fig. 4.7. Screen of the command-based dialog after parameter input.
Fig. 4.8. List presentation overlay of the command-based dialog.

Conversational Dialog GUI

In the conversational dialog strategy the user can speak freely and does not have to call certain commands. There is no need to give the user visual feedback on the currently speakable input, whereby the visual distraction may be lowered. For that reason the content on the head unit screen does not have to indicate the possible options to proceed with the speech dialog. The sub-function line, which was used to indicate the available commands, is replaced by only a few symbols which resemble the current GUI state. The dialog starts from the same start screen as before (as illustrated in Figure 4.4). When the user asks to book a hotel, the form-filling screen illustrated in Figure 4.9 appears, which represents the main screen at the beginning of the parameter input dialog, where the user is already able to input several parameters at once. The respective fields of the form are filled in the course of the parameter input (see Figure 4.10). After having elicited all required (and optional) parameters, the system calls the HRS service and retrieves a list of hotels (see Figure 4.11). The symbols at the bottom of the screen resemble the GUI states for parameter input/changes and the result list. Depending on the current GUI state the respective symbol is highlighted. As illustrated in Figure 4.12, the design of the playfield of the result list screen is the same as for the list presentation screen of the command-based GUI. For parameter changes, no additional screen was designed. Visual feedback is given by updating the fields of the form. For presenting the details of a selected hotel or for booking confirmation there are further screens, whose playfield designs do not differ from the command-based GUI.

Conversational Dialog GUI with Avatar

The avatar is only used in combination with the conversational dialog strategy. The goal of using an avatar is to raise the naturalness of the human-machine interaction. By expressing gestures and facial expressions, the avatar contributes to a more human-like interaction.

Fig. 4.9. Screen of the conversational dialog at the beginning of the parameter input.
Fig. 4.10. Screen of the conversational dialog during parameter input.
Fig. 4.11. Screen of the conversational dialog after parameter input.
Fig. 4.12. List presentation screen of the conversational dialog.

When seeing a human character on the screen, the user may tend to speak more naturally, as if he were talking to a human being. This may have a positive effect on speech dialog quality and user acceptance. However, the user may also be more distracted by a human character on the screen. So far, those positive and negative effects of an SDS with avatar while driving have not been examined. The GUI concept with avatar is based on the conversational dialog GUI. A virtual character designed and developed by Charamel is integrated. The avatar overlays the background illustrated in Figure 4.9 and Figure 4.12 but does not cover the widgets which are currently important for the speech dialog (see Figures 4.13 and 4.14). The human agent is already visible on the app selection screen at the very beginning of the speech dialog. The avatar makes certain gestures to give the SDS some human character. For example, when the system asks for the arrival date, the avatar points towards the arrival date widget on the screen. When the user browses the hotel result list, the avatar makes a swipe gesture to support the scrolling in the list, as illustrated in Figures 4.13 and 4.14.

Command and Conversational Dialog without GUI

Another goal of this research was to investigate the need for visual feedback during the speech interaction. Can a speech dialog without a GUI still be performed effectively and efficiently, and will users accept such a kind of speech interaction? How strong are the influences of the GUI on driver distraction?

Fig. 4.13. Screen of the conversational dialog with avatar during parameter input.
Fig. 4.14. List presentation screen of the conversational dialog with avatar.

In order to answer these questions, the two speech dialog strategies are also evaluated without a GUI. In this case "without GUI" means that no content information is displayed on the screen. However, the visual feedback which indicates whether the user is allowed to talk is still presented in the top bar of the screen (see Figure 4.15).

Fig. 4.15. Screen of the command and conversational dialog without GUI.

In this Section the concept design of the different speech dialog strategies and GUI concepts has been described. The next Section illustrates the prototypical implementation of these concepts using the Daimler SDF.

4.2 Realization

The different concepts have been realized using the Daimler SDF. In this Section, first, the general implementation of the SDS prototypes is described. As the existing grammar implementation method of the Daimler SDF did not suffice to model the large variety of utterances which the conversational dialog requires, a new approach had to be developed. This so-called linguistic grammar approach is described in subsection 4.2.2.

Prototype Implementation in the Daimler Speech Dialog Framework

The speech-based HMI concepts have been implemented using the Daimler SDF described in Chapter 2. In order to realize the different concepts, the SDF had to be modified and extended. Figure 4.16 illustrates the new framework architecture.

Fig. 4.16. Extended SDF prototype architecture of the HMI concept implementation (automatic speech recognition; language understanding with linguistic analysis and contextual interpretation; task-driven dialog manager; synchronization component; HotelApp interface with SOAP access to the HRS hotel Web service and canned hotel lists; text-to-speech synthesis; third-party avatar engine; graphical user interface with basic GUI and avatar visualization, connected via XML sockets).

As described above, the data were received from the hotel booking Web service HRS. The Web service had to be linked to the SDF, which is described in the next subsection. In order to design the dialog, the ASR and the NLU had to be configured and the dialog flow had to be implemented in the DM. The configuration and the implementation are described in the second subsection. The basic GUI screens had to be designed and the interaction of the basic GUI screens had to be implemented (without avatar the basic GUI was simply called Graphical User Interface, as illustrated in Figure 2.17). Additionally, displaying movements and gestures of an avatar on the head unit screen required the integration and configuration of Charamel's avatar engine, which is explained in the last subsection.

Connectivity to the Hotel Booking Service

In order to receive realistic hotel data, the HRS Web service has been linked to the Daimler SDF. When calling the hotel booking service, currently available hotels and their detailed specifications were retrieved. The hotel booking process itself has only been simulated. The HotelApp interface, which manages the data exchange between the HRS Web service and the SDF, has been implemented in C++. In the following, first the data exchange between the HotelApp and the Web service is explained, followed by a description of the communication between the HotelApp and the SDF modules. The Web service has been linked into the existing framework via the provided Simple Object Access Protocol (SOAP) interface. A SOAP message consists of an envelope, which contains an optional header and a mandatory body element. The body element contains the necessary details of the client's request or the response data from the Web service. The XML structure of the requests and

the responses is specified in the HRS application programming interface (API). As the HRS API is not publicly available, the following SOAP requests and responses are just simplified examples for illustration and do not conform to the real API. The XML code snippet below illustrates the request for available hotels containing the required search criteria of the examples above.

<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Header/>
  <env:Body>
    <ns2:hotelSearchRequest xmlns:ns2="some-uri">
      <hotelSearchRequest>
        <!-- [... CREDENTIALS & SESSION ...] -->
        <searchCriteria>
          <locationName>Stuttgart</locationName>
          <hotelCategory>3</hotelCategory>
          <arrival>2013-10-29</arrival>
          <departure>2013-10-30</departure>
          <roomNumber>1</roomNumber>
          <roomType>double</roomType>
        </searchCriteria>
      </hotelSearchRequest>
    </ns2:hotelSearchRequest>
  </env:Body>
</env:Envelope>

The SOAP XML body of each request includes login credentials and a session key. Each client receives his own login credentials, since the Web service is not publicly accessible. The Web service keeps track of the request history using sessions. The session key is necessary to allocate the current request to previous requests. After the credentials and session block, the search criteria are listed. This SOAP envelope requests a double room in a three-star hotel in Stuttgart from October 29th until October 30th, 2013, as in the examples above. When the SOAP XML request is sent via the interface, the service responds with the requested list of hotels encapsulated in a SOAP XML message:

<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Header/>
  <env:Body>
    <ns2:hotelSearchResponse xmlns:ns2="some-uri">
      <hotelSearchResponse>
        <numberOfAvailableHotels>8</numberOfAvailableHotels>
        <availableHotels>
          <hotel>
            <hotelName>Brita</hotelName>
            <hotelCategory>3</hotelCategory>
            <street>Augsburger Str.</street>
            <postalCode>70329</postalCode>
            <city>Stuttgart</city>
            <cityDistrict>Obertuerkheim</cityDistrict>
            <hotelDescription>Das Hotel Brita in Stuttgart ist ein modernes Business- und Tagungshotel, das Geschaefts- und Privatreisende empfaengt.</hotelDescription>
            <localityDescription>Das Hotel Brita liegt zentral und ruhig in Stuttgart-Obertuerkheim, nahe der S-Bahnstation. Die Stuttgarter Innenstadt ist in 12 Minuten...</localityDescription>
            <parkingAvailable>true</parkingAvailable>

            <restaurantAvailable>true</restaurantAvailable>
            <internetAccessAvailable>true</internetAccessAvailable>
            <wellnessAvailable>false</wellnessAvailable>
            <noSmokingAvailable>true</noSmokingAvailable>
            <roomOffer>
              <amount>94</amount>
              <currency>EUR</currency>
            </roomOffer>
          </hotel>
          <hotel>
            <hotelName>Romantisches Hotel Traube</hotelName>
            <!-- ... -->
          </hotel>
          <hotel>
            <hotelName>Neotel</hotelName>
            <!-- ... -->
          </hotel>
          <!-- ... -->
        </availableHotels>
      </hotelSearchResponse>
    </ns2:hotelSearchResponse>
  </env:Body>
</env:Envelope>

The SOAP XML response contains the number of available hotels and detailed information about each hotel. For each hotel in the result list, the response provides the hotel name, the hotel category, the address, a detailed hotel description in prose text, the hotel facilities and room offers. The Web service provides much more information for each hotel, which is not illustrated in this code snippet, since it was not relevant for the implementation. During the research work, connectivity problems became an issue. Therefore, several hotel list responses were canned before the experiments. In order to stay flexible, the HotelApp has been implemented to run both online, retrieving live data from the HRS Web service, and offline, using the canned hotel lists (see Figure 4.16). The HotelApp is linked to the SDF via the SYNC component. In order to exchange messages between the TDDM, the GUI and the HotelApp, the SYNC XML messages had to be configured. The communication between the different modules is illustrated in Figure 4.17 and explained with the aid of an example.

Fig. 4.17. XML message exchange during the hotel search (TDDM to SYNC (1), SYNC to HotelApp interface (2), HotelApp to SYNC (3), SYNC to TDDM (4a) and to the basic GUI (4b)).

When the user has indicated all his search criteria, the system has to request the list of matching hotels. The list of hotels has to be presented on the GUI and communicated verbally to the user. The XML message flow in the SDS prototype is as follows. First, the TDDM sends the elicited parameters in the form of AVPs to the SYNC (1), which forwards the XML message to the HotelApp (2). In the next step, the HotelApp calls the HRS Web service and receives the number of available

hotels and the list of available hotels, or simply uses the canned hotel result lists. Subsequently, the HotelApp sends the number of available hotels to the SYNC (3). In the last step the SYNC forwards the information to the TDDM (4a) and the GUI (4b). The list of hotels, including the detailed specifications, is sent from the HotelApp to the SYNC in further separate messages.

Dialog Specification

The structure of the dialog specification corresponds to the sub-dialogs which have been illustrated in the previous subsection. The task hierarchy of the hotel booking dialog is illustrated in Figure 4.18. For each sub-task the roles which are required for the respective task are specified. For instance, the required parameters consist of the destination, the arrival and departure date, the room number and the room type.

Fig. 4.18. Task hierarchy of the hotel booking dialog (sub-tasks parameter input, change parameter, hotel list presentation and hotel details presentation with their respective roles; required parameters: destination, arrival, departure, room number, room type; optional parameters: hotel category, price).

Using the Daimler SDF, several sub-dialogs and the dialog roles were specified for each task. Furthermore, the prompts and the context constraints were defined. The prompts contain variables for implicit verification. The context constraints have a strong influence on the dialog behavior. In the command-based dialog the context is constrained very strictly. In each context only few ASR grammar rules, which match the current context constraint, are activated. Thereby, the user is only allowed to answer the system's request and to follow the system's guidance. In the conversational dialog the context is less constrained and the ASR grammar is less limited. Thus, the user does not have to answer the system's request. He can over-answer the system's request or even take the initiative to introduce a new sub-dialog. The context constraints are explained with the aid of XML code snippets. The code snippets only show a tiny excerpt of the dialog specification or ASR grammar in a very limited fashion in order to help the reader understand the differences. Imagine the system has already elicited the destination and requests the arrival date; the following prompt would become active:

<prompt promptName="requestDeparture" promptText="When would you like to arrive in %locationName%?">
  <dialogRole roleName="from" state="request" />
</prompt>

In the command-based dialog, the grammar would be limited to only a few commands which address the requested topic "arrival date":

//Sample command-based grammar

grammar departure;

public <date> = today | tomorrow | the day after tomorrow;

In this contextual situation the conversational dialog would allow users to answer the system's request with elliptical utterances or whole sentences (lines (5) and (6) in the code snippet below). Furthermore, the user can over-answer the system's request by indicating the arrival date and already the departure date ((7)). Additionally, if the user would like to change the destination entry, he can switch the topic instead of answering the system's question ((8)).

1 //Sample conversational grammar
2
3 grammar departure;
4
5 public <days> = today | tomorrow | the day after tomorrow;
6 public <date> = <days> | I [would like to] arrive <days> | the arrival date is <days>;
7 public <date1_date2> = I [would like to] arrive <days> and [I] [would like to] [leave | depart] <days>;
8 public <change_destination> = [I would like to] change [the] [destination | city] | [the] [destination | city] is [incorrect | wrong];
9 ...

In addition to the differences in the context constraints, the command-based and the conversational dialog specification also differed in the flexibility of the ASR and NLU lexica and grammars. The knowledge from the initial user study was used to design the grammars. The command-based dialog is restricted to commands or user inputs in the form of short elliptical utterances, whereas the conversational dialog covers both short elliptical utterances and syntactically correct full sentences. Furthermore, the conversational dialog grammar had to cover utterances which contain multiple input parameters. The hotel booking SDS offers seven different input parameters, which the user can combine within a single utterance. In addition, people make use of phrases with the same meaning ("three-star hotel", "hotel with three stars", etc.) as they speak. Due to this flexibility in the language, a large variety of possible utterances has to be taken into consideration. Some sample utterances with the same meaning in the context of a hotel booking, which contain two input parameters realized with different wordings, are illustrated below:

I would like to book a three-star hotel in Stuttgart.
I would like to book a hotel with three stars in Stuttgart.
I would like to book a three-star hotel in the city of Stuttgart.
I would like to book a hotel with three stars in the city of Stuttgart.
I am looking for a three-star hotel in Stuttgart.
I am looking for a hotel with three stars in Stuttgart.
I am looking for a three-star hotel in the city of Stuttgart.
I am looking for a hotel with three stars in the city of Stuttgart.
Book a three-star hotel in Stuttgart.
Book a hotel with three stars in Stuttgart.
...
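To give an impression of how such paraphrases can be bundled, the following SRGS-XML fragment is an illustrative sketch in English with invented rule names; the actual prototype grammars were written in German and, as the next paragraphs explain, were eventually generated with the linguistic grammar approach rather than by hand:

<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en" root="bookHotel">
  <rule id="bookHotel">
    <!-- interchangeable carrier phrases -->
    <one-of>
      <item>I would like to book</item>
      <item>I am looking for</item>
      <item>book</item>
    </one-of>
    <ruleref uri="#category"/>
    <item>in</item>
    <!-- optional constituent -->
    <item repeat="0-1">the city of</item>
    <ruleref uri="#city"/>
  </rule>
  <rule id="category">
    <one-of>
      <item>a three-star hotel</item>
      <item>a hotel with three stars</item>
    </one-of>
  </rule>
  <rule id="city">
    <one-of>
      <item>Stuttgart</item>
      <item>Ulm</item>
    </one-of>
  </rule>
</grammar>

Even this small fragment already licenses the twelve surface variants of the sample list (three carrier phrases, two category paraphrases and an optional city wording); combined with seven freely combinable parameters and the free constituent order of German, the number of variants explodes, which is exactly the problem addressed below.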

The variety of input parameters and the use of synonyms already pose great challenges to the grammar implementation and, to make matters worse, the sentence construction rules of the German language are not as strict as in other languages, such as English. In German conversational speech, speakers reorder constituents freely, which further raises the number of permutations and which is why thousands of possible utterances had to be taken into consideration. In the first iteration of the conversational SDS prototype, the utterance combinations in the grammar were generated mainly by hand. This implementation process was very time-consuming, and still many syntactically wrong sentences were generated. Furthermore, an explorative pretest revealed that still not all combinations were covered. In order to overcome this problem, a new grammar design approach was developed. The so-called linguistic grammar approach, which incorporates linguistic knowledge in the grammar design, helps to model the German language by quickly generating a large number of exclusively syntactically correct sentences. The linguistic grammar approach for conversational dialog design is described in subsection 4.2.2.

GUI Implementation

This subsection describes the design and the implementation of the basic GUI and the integration and animation of the avatar.

Basic GUI Implementation

The background of the GUI and the different widgets have been designed using the GNU Image Manipulation Program (GIMP), a free high-end image editing program. The different regions of the GUI have been organized as illustrated in Figure 4.4. For each region different renderings were specified. For instance, the playfield can present the app selection screen or the parameter input screen. In each region different widgets, which can be manipulated, were configured. The parameter input screen contained the widgets for the different search parameter fields, which could be filled with labels (e.g. the arrival date) or images (e.g. the stars of the hotel category). In order to evoke changes on the GUI, the transitions between the GUI states had to be defined. Changes on the GUI can concern switches of the background image of a region (e.g. the transition from the app selection screen to the parameter input screen at the beginning of the interaction) or manipulations of the widgets within a region (e.g. filling parameters during parameter input).

Avatar Integration and Motion Animation Specification

The presentation of an animated avatar on the GUI required the integration of Charamel's avatar engine and visualization into the Daimler SDF. The avatar visualization is rendered in an additional window with a transparent background and runs parallel to the basic GUI. Thereby, the virtual agent overlays the basic GUI and appears to move around on the display. In order to appear as natural as possible, the gestures and facial expressions of the virtual character had to be synchronized with the TTS output. For example, the lip movements had to be in line with the currently spoken words. By locating the avatar engine between the TDDM and the TTS module (see Figure 4.16), this synchronization could be achieved. The process of requesting or presenting information to the user visually and audibly is described in the following. For better understanding, the relevant modules are presented in Figure 4.19 once again. The TDDM triggers the system's multimodal output by sending a message to the avatar engine and to the basic GUI via the SYNC.
As always, SYNC forwards the message to the basic GUI, which updates its states and thereby refreshes the screens.
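A compact way to think of the basic GUI is as a state machine over regions and widgets, as described in the previous subsection. The sketch below (hypothetical Python; the state, region and widget names are illustrative assumptions, not the prototype's actual definitions) shows the two kinds of GUI changes mentioned above, a background switch and a widget manipulation:

class GuiRegion:
    def __init__(self, rendering: str):
        self.rendering = rendering  # current background image of the region
        self.widgets = {}           # widget name -> label or image

class BasicGuiState:
    def __init__(self):
        # the application field initially shows the app selection screen
        self.application_field = GuiRegion("app_selection_screen")

    def start_interaction(self):
        # switch of the region's background image at the beginning of the interaction
        self.application_field.rendering = "parameter_input_screen"

    def fill_parameter(self, widget: str, value: str):
        # widget manipulation during parameter input
        self.application_field.widgets[widget] = value

gui = BasicGuiState()
gui.start_interaction()
gui.fill_parameter("arrival date", "tomorrow")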

[Figure 4.19: SDS output process when using the avatar. The figure shows the Task-Driven Dialog Manager, the synchronization component, the avatar engine, the text-to-speech synthesis, the avatar visualization and the basic GUI (graphical user interface); third-party software is marked separately.]

The avatar engine receives a message from the TDDM, which contains the information about what the system should speak next and a link to the animation the avatar shall perform. These animations are stored in XML scripts, which define the movements of the avatar's body; how to define the movements of the virtual character is described in the next paragraph. The avatar engine forwards the content of the prompt to the TTS, and the system starts speaking. At the same time, the avatar engine receives feedback about the currently spoken phoneme and triggers the avatar visualization to animate the avatar. The virtual character executes the movements of the predefined script, and its lips move in accordance with the TTS output. Thereby, synchronization between the TTS and the avatar visualization is ensured.

For designing the animations of the virtual character, the software CharAT provided by Charamel was used. CharAT is an authoring tool which allows one to easily create high-quality 3D real-time avatar animations. Using CharAT, animations can be designed and instantly rendered in order to preview the resulting animation of the virtual character. A screenshot of the software's GUI is presented in Figure 4.20. The CharAT software allows one to configure motions, emotions, speech output and camera perspectives of the avatar. For the avatar's emotions and motions, predefined animations already exist. For example, there are different predefined presentation motions, which can be used to make the virtual character perform a presentation gesture. In order to make the avatar smile or look angry, predefined emotions exist, which define the avatar's facial expressions. The camera perspectives can be used to change the view on the virtual character. The CharAT software also allows one to configure the speech output in order to make the avatar speak; however, as the output of the TTS is specified in the dialog specification, the speech output was not configured using CharAT. The described features can be added to the timeline and conveniently attuned to each other; by overlapping the different features, a smooth and realistic animation is generated. When the design of an animation is finished, CharAT allows exporting it as an XML script. The predefined scripts are stored locally on the machine and can be executed by the avatar engine. In the SDS prototype, the links to the scripts are embedded in the prompts in the dialog specification. An example XML code snippet corresponding to the animation designed with the CharAT software presented in Figure 4.20 is illustrated below:

<prompt promptname="hoteldetailspresentation" prompttext="$(xmlscriptglobal, Scripts/presentHotellist/details.xml) %atmospheredescription%. Would you like to book the selected hotel?">
  <dialogrole rolename="book" state="request"/>
</prompt>
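The embedded script link in the prompttext attribute has to be separated from the spoken text before the prompt is played. The following sketch (hypothetical Python; the real avatar engine's parsing and TTS callback interfaces are not specified in this form, so the function names are assumptions) illustrates this step together with the phoneme-driven lip synchronization described above:

import re

PROMPT_TEXT = ("$(xmlscriptglobal, Scripts/presentHotellist/details.xml) "
               "%atmospheredescription%. Would you like to book the selected hotel?")

def split_prompt(prompt: str):
    """Separate the embedded animation script link from the text to be spoken."""
    match = re.match(r"\$\(xmlscriptglobal,\s*(\S+)\)\s*(.*)", prompt)
    return match.group(1), match.group(2)

script, text = split_prompt(PROMPT_TEXT)

def on_phoneme(phoneme: str):
    # TTS feedback callback: select the mouth shape for the current phoneme
    print("render viseme for", phoneme)

# The avatar engine would now start the body animation defined in `script`,
# hand `text` to the TTS, and receive on_phoneme() callbacks while speaking.
print("start animation:", script)
print("speak:", text)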
