Speech Recognition of a Voice-Access Automotive Telematics System using VoiceXML

Ing-Yi Chen, Tsung-Chi Huang
ichen@csie.ntut.edu.tw, rick@ilab.csie.ntut.edu.tw
Department of Computer Science and Information Engineering, National Taipei University of Technology, Taiwan, ROC

Abstract

To give drivers a safe way to retrieve information, a voice-access Telematics system is implemented based on VoiceXML and a web architecture. Noise while the vehicle is moving greatly lowers the recognition rate and forces drivers to repeat commands again and again. To raise recognition accuracy, this paper makes several improvements to the major components of the Automated Speech Recognition (ASR) engine. After these enhancements, the average recognition rate exceeds 70% even at high speed, which makes the Telematics system more practical.

Keywords: Telematics system, VoiceXML, Speech Recognition

1. INTRODUCTION

The main concept of Telematics is the combination of telecommunication and information in the vehicle. Telematics is viewed as the third revolution in the automobile industry, after the high-compression engine and microelectronic systems. The first generation of Telematics systems was composed of a GPS (Global Positioning System) receiver and a CD-ROM module for data storage, and its main function was car navigation. The second generation added a telecommunication module to connect to the call center of the service provider. In recent years, with the integration of telecommunication and Internet technology, Telematics systems can retrieve data from other content providers and give drivers more useful, real-time information.

Because Telematics systems are so convenient, drivers spend more time and attention operating them while driving. These distractions cause many car accidents.
For this reason, Telematics systems need a safer and simpler mode of operation. A voice-access interface is a suitable solution: drivers send commands by speaking to the Telematics system and receive responses as speech. In this hands-free environment, distractions are greatly reduced. As voice technologies improve, it becomes easier to apply a voice-access interface to Telematics systems.

2. ANALYSIS

Traditionally, IVR (Interactive Voice Response) systems had to be deployed on specialized PBX (Private Branch Exchange) hardware. Programmers had to develop applications in a vendor-specific environment, because each vendor provided its own proprietary programming interfaces and libraries. In addition, the voice responses used in the conversation usually had to be pre-recorded, and users could enter commands only through the telephone keypad (DTMF mode, Dual-Tone Multi-Frequency, i.e., touch-tone). Service providers therefore spent considerable effort developing IVR systems that were, in practice, only one-way voice systems (DTMF input, voice output).

VoiceXML

To provide an open, standard platform for voice systems, several CTI (Computer Telephony Integration) companies (IBM, AT&T, Lucent and Motorola) submitted the VoiceXML specification to the W3C in 2000. The key technologies of VoiceXML are TTS (Text to Speech) and ASR (Automated Speech Recognition). TTS converts large amounts of text into speech, and ASR recognizes what users say. VoiceXML-based systems are therefore truly two-way voice systems (voice input, voice output).

These benefits make it easier to implement a voice-access system on a web architecture. Replacing the markup language of a web application with VoiceXML and integrating a voice server yields a basic voice system. The Telematics system is composed of this voice system and a GPS/GSM module.

Figure 1. VoiceXML-based Telematics system.

The major difference between a traditional IVR system and a VoiceXML-based one is the speech recognition input mode, which is also the most important part of the whole system. Recognition accuracy greatly affects the practicality of a voice system, especially in the Telematics environment.
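
As a rough illustration of the web-architecture idea above, the following Python sketch builds a minimal VoiceXML page of the kind a web application might return to a voice server. It is not taken from the system described in this paper: the function name `build_menu_page`, the form and field names, the prompt text, and the option words are all hypothetical. The element names (`vxml`, `form`, `field`, `prompt`, `option`, `nomatch`) and the namespace follow the W3C VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # namespace defined by VoiceXML 2.0


def _tag(name):
    """Qualify an element name with the VoiceXML namespace."""
    return f"{{{VXML_NS}}}{name}"


def build_menu_page(prompt_text, options):
    """Return a small VoiceXML 2.0 document (as a string) with one voice field.

    prompt_text: text the voice server would speak through TTS.
    options: words the field accepts; each becomes an <option> element.
    All names used here are illustrative assumptions.
    """
    # Serialize the VoiceXML namespace as the default namespace (no prefix).
    ET.register_namespace("", VXML_NS)

    vxml = ET.Element(_tag("vxml"), {"version": "2.0"})
    form = ET.SubElement(vxml, _tag("form"), {"id": "main_menu"})
    field = ET.SubElement(form, _tag("field"), {"name": "command"})

    prompt = ET.SubElement(field, _tag("prompt"))
    prompt.text = prompt_text

    # One <option> per accepted word.
    for word in options:
        option = ET.SubElement(field, _tag("option"), {"value": word})
        option.text = word

    # Message used when the spoken input matches none of the options.
    nomatch = ET.SubElement(field, _tag("nomatch"))
    nomatch.text = "Sorry, I did not understand."

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_menu_page("Say traffic, weather, or news.",
                          ["traffic", "weather", "news"]))
```

In such a setup the web application only generates markup; playing the prompt and recognizing the spoken reply are left to the voice server's TTS and ASR components.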

To reduce distraction during driving, the recognition rate must be raised as high as possible.

Noise Issues

In the Telematics environment, the major obstacle to recognition is the noise produced while the car is moving. This noise often leads to incorrect recognition results, for the following main reasons.

i. A voice system usually plays prompt information before users select an option, but the prompt is redundant once users are familiar with the system. Most voice systems therefore provide a function called Barge-in, which lets users interrupt the prompt and go directly to the next voice layer. In the Telematics environment, the microphone installed in the car is a sensitive one so that it can pick up the driver's voice precisely. As a result, any louder noise, such as passengers' conversation or the sound of a horn, may be treated as a voice input command.

ii. The second issue is the low-frequency noise produced while the car is moving. This continuous noise does not trigger the Barge-in problem above, but it does reduce recognition accuracy. The noise mixes with the voice signal and masks the command the driver actually speaks. If the noise exceeds a certain limit, the ASR engine receives an incomplete waveform and fails to recognize the input command. This kind of noise depends on road friction, that is, on the smoothness of the road surface and on the driving speed. When the car drives on a smoother road at a lower speed, the low-frequency noise is smaller; at high speed, in contrast, the Telematics system usually fails to recognize input commands because of the low-frequency noise.

Figure 2 shows the recognition rates in different test environments.

Test environment      Recognition rate
Vehicle (80kph)       0.32
Vehicle (60kph)       0.58
Vehicle (0kph)        0.87
General Phone Line    0.91
PC-Headset            0.93

Figure 2. Recognition rates in different test environments.

Each test environment was tested with 100 single-word samples. In the PC-Headset and stationary (0kph) environments, the recognition rates meet the requirement. As the speed increases, the recognition rate drops steadily. At high speed (100kph), the system is almost unable to recognize any word because of the noise.
This result shows how serious the noise problem is.

3. SOLUTION

Figure 3 shows the basic steps of the recognition process.

Recognition Processes

Figure 3. Recognition processes.

Step 1: User Input. The microphone captures the user's voice as an analog signal.
Step 2: Digitization. The sound card digitizes the analog signal.
Step 3: Phonetic Breakdown. The signal is broken into phonemes.
Step 4: Matching. Based on the grammar, the phonetic representation, and the vocabulary library, the system returns the matching word.

The whole recognition process can be separated into a hardware part and a software part. The telephone line from the telephone company to the PBX server was a traditional analog line, which introduced many unwanted signals during transmission. The system provider therefore decided to upgrade the whole telephone system from analog to digital.

Speed             Recognition rate
High (100kph)     0.52
Middle (80kph)    0.67
Low (60kph)       0.80
Park (0kph)       0.91

Figure 4. Recognition rates in digital environment.

The test results in the digital environment are much better than before. The hardware upgrade brings an effective improvement in the conversion of signals from analog to digital (Step 1 to Step 3), but the recognition rates at higher speeds still do not meet the requirement. To reduce the effect of the noise problems and increase recognition accuracy, the Barge-in function of the Telematics system is disabled to avoid all unexpected interruptions, and several enhancements are made as follows.

Improvements in Recognition

The most important foundations of the recognition process are the vocabulary library and the grammar of the applications. When the ASR engine receives the digital signal, it recognizes each phoneme using the vocabulary library and returns a matching list to the VoiceXML application.
The VoiceXML application receives the matching list and compares it with the grammar list. If the grammar list contains a matching word, the application processes the command; otherwise, it returns a nomatch error to the user. Improving these two elements is therefore the most efficient way to raise the recognition rate.

Figure 5. Command matching flow.

i. The ASR engine contains a regular vocabulary library, which is sufficient for recognizing basic words. Telematics applications, however, usually contain many special options, such as the names of streets or restaurants, and these words are difficult to recognize, especially in a noisy environment. The first enhancement is to rebuild the vocabulary library of the recognition engine with the specific words commonly used in this Telematics system, which helps the ASR engine recognize a particular word more easily. The "Digital + new ASR engine" line in Figure 6 shows the result: the recognition rate remains above 60% after the ASR engine is rebuilt.

Figure 6. Recognition rate (%) versus car speed (km/h; Park 0, Low 60, Middle 80, High 100) for the Analog, Digital, and Digital + new ASR engine configurations.

ii. In the later stage of implementation, the W3C announced the VoiceXML 2.0 specification. Its major improvement is a stronger capability for recognition grammars, which helps the application specify individual words more effectively. After upgrading VoiceXML from 1.0 to 2.0, the recognition rate is higher than before.

Figure 7 shows the final results after these improvements. The recognition rate exceeds 70% at every tested speed, which meets the requirement.
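
The command matching flow of Figure 5 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual logic of the ASR engine or the voice server used in this work: the function names `match_command` and `handle_input`, the representation of the matching list as `(word, confidence)` pairs, and the sample words are all hypothetical.

```python
def match_command(matching_list, grammar):
    """Return the best candidate accepted by the grammar, or None.

    matching_list: (word, confidence) pairs, as an ASR engine might
                   return them (assumed format).
    grammar: collection of words the current dialog accepts.
    """
    accepted = {word.lower() for word in grammar}
    # Try the candidates from most to least confident.
    ranked = sorted(matching_list, key=lambda pair: pair[1], reverse=True)
    for word, _confidence in ranked:
        if word.lower() in accepted:
            return word.lower()
    return None  # no candidate is in the grammar list


def handle_input(matching_list, grammar):
    """Process the command if it matches, otherwise report a nomatch."""
    command = match_command(matching_list, grammar)
    if command is None:
        return "nomatch"
    return f"process:{command}"


if __name__ == "__main__":
    grammar = ["traffic", "weather", "news"]
    # The top candidate is a misrecognition; the grammar rejects it and
    # the next candidate is accepted.
    print(handle_input([("whether", 0.61), ("weather", 0.58)], grammar))
    # No candidate is in the grammar, so a nomatch is reported.
    print(handle_input([("feather", 0.40)], grammar))
```

The sketch shows why both elements matter: a word can be accepted only if the vocabulary library lets the engine place it in the matching list at all, and the grammar then filters out candidates that are not valid commands in the current dialog.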

Speed             Recognition rate
High (100kph)     0.74
Middle (80kph)    0.83
Low (60kph)       0.92
Park (0kph)       0.97

Figure 7. Final recognition rate.

4. CONCLUSION

In this paper, a Telematics system with voice-access capability is implemented using VoiceXML. To provide a safe and convenient voice interface, the most important task for the whole system is to raise the recognition rate as high as possible. After several improvements, the final average recognition rate exceeds 70% in a real driving environment, which is good enough for use in a Telematics system.

5. ACKNOWLEDGEMENT

The authors would like to thank the National Science Council (NSC 91-2213-E-033-028) for supporting this project.