TUNING SPHINX TO OUTPERFORM GOOGLE'S SPEECH RECOGNITION API

Patrick Lange 1,2,3 and David Suendermann-Oeft 1
1 DHBW, Stuttgart, Germany
2 Linguwerk GmbH, Dresden, Germany
3 Staffordshire University, Stafford, UK
[email protected], [email protected]

Abstract: In this paper, we investigate whether the open-source speech recognizer Sphinx can be tuned to outperform Google's cloud-based speech recognition API in a spoken dialog system task. Matching this target domain, we use data from CMU's Let's Go bus information system comprising 258k utterances of telephony speech recorded in the bus information dialog system of Pittsburgh. By training a domain-specific language model on the aforementioned corpus and tuning a number of Sphinx's parameters, we achieve a WER of 51.2%. This result is significantly lower than the one produced by Google's speech recognition API, whose language model is built on millions of times more training data.

1 Introduction

Cloud computing has become one of the most popular topics in the IT world during the past decade. One of the biggest benefits of a cloud-based service is the higher computational power of servers compared to customer devices. Among other applications, speech recognizers can greatly benefit from this advantage and are now available as cloud-based services offered by several vendors. Mobile devices like smartphones usually have less computational power than non-mobile ones but are often equipped with a wireless Internet connection. Since cloud computing shifts the requirement on devices from powerful hardware to reliable network connectivity, it is particularly appropriate for mobile applications. Considering further that user input on mobile devices is rather inconvenient due to the lack of a full hardware keyboard, it becomes obvious that mobile devices like smartphones are a good platform for voice-controlled applications utilizing cloud-based speech recognition.

Two of the biggest companies building voice-powered applications are Google and Microsoft. From 2007 to 2009, they provided toll-free alternatives to 4-1-1, a costly directory assistance service offered by phone companies in the USA and Canada. These free services were commonly known as GOOG-411 [1] and BING-411. However, due to the above-mentioned reasons, smartphones have proven to be better suited for voice search applications, and both GOOG-411 and BING-411 were discontinued a few years after their launch. Instead, both vendors are now focusing on providing mobile and voice-controlled versions of their search engines Google and Bing for various smartphone operating systems. Data collected during the operation of GOOG-411 and BING-411 served to bootstrap the acoustic and language models deployed in the voice search engines.

This work was supported by a grant from the Baden-Wuerttemberg Ministry of Science and Arts as part of the research project OASIS (Open-Source Automatic Speech Recognition In Smart Devices).
Although the recognizers of both companies are used by millions of users around the world, they come with some major drawbacks regarding 3rd-party applications. First of all, to the best of our knowledge, scientific evaluations of their performance are still outstanding. Secondly, the cloud-based recognizers cannot be customized regarding the models used or the internal parameters. Thirdly, no official APIs are provided to 3rd-party organizations to develop their own voice-powered applications. Finally, they are proprietary software and not open-source. In this paper, we answer the question whether it is possible to achieve similar or even better performance with an open-source speech recognizer which is customizable and provides an API usable by 3rd-party application developers.

2 State of the Art

Besides the above-mentioned vendors Google and Microsoft, there are many other companies developing speech recognizers. In this section, we present some of the most relevant commercial and open-source state-of-the-art products and motivate why we have chosen to compare Google's speech recognition API and Sphinx.

2.1 Commercial recognizers

One of the most popular speech-powered applications is Apple's personal assistant Siri [2]. First launched as a smartphone app, Siri was integrated as a key feature of the iPhone 4S in October 2011. Ever since, Siri has regularly appeared in the media as a killer app, including in the New York Times and on CNN. It is probably the most successful application combining speech recognition with understanding of natural language, context, and reasoning.

Another widely known application featuring highly sophisticated natural language and speech processing technology is the question answering system IBM Watson [3], famous for beating the two ultimate champions of the quiz show Jeopardy. IBM has announced that Watson will be made available as a customer service agent. The Ask Watson feature will be accessible through different channels such as web chat, e-mail, smartphone apps, and SMS. Furthermore, speech recognition is to make Watson accessible by voice.

As a reaction to the success of Apple's Siri, Google introduced the personal assistant Google Now in Android 4.1 in 2012 [4]. It focuses less on understanding natural language, context, and reasoning and more on predicting search queries and preemptively showing results. Hence, the biggest difference between Google Now and Siri is that the former is designed to proactively offer information when appropriate. For example, in a train station, Google Now shows arrival and departure times without a user request. Besides Google Now on smartphones, Google has released a web-based speech recognition API. This JavaScript library provides speech recognition and synthesis functionality and is available starting with Chrome version 25. Google's speech recognition API allows developers to create voice-controlled Chrome add-ons and websites.

Another vendor of cloud-based as well as local speech recognizers is Nuance Communications. Besides developing the popular speech-to-text desktop software Dragon NaturallySpeaking, Nuance also offers a Siri alternative for Android and iOS called Vlingo Virtual Assistant [5]. Furthermore, their CEO Paul Ricci recently disclosed that Nuance is powering the cloud-based speech recognition behind Apple's Siri [6].

2.2 Open-source speech recognizers

One of the most popular open-source speech recognizers is the Hidden Markov Model Toolkit (HTK) [7].
Although it can be used to train any type of hidden Markov model (HMM), HTK is primarily applied to speech recognition research. It consists of a set of libraries and tools written in C. The tools provide facilities for speech analysis, HMM training, testing, and result analysis. HTK can be used to build recognizers based on huge corpora including thousands of hours of speech [8].

Another open-source toolkit is SRI's Decipher [9]. It can recognize natural, continuous speech without requiring the user to train the system in advance. It can be distinguished from other speech recognizers by its detailed modeling of variations in pronunciation and its robustness to background noise and channel distortion. These features let Decipher excel in recognizing spontaneous speech of different dialects and make recognition less dependent on the idiosyncrasies of different microphones and acoustic environments.

Kaldi is a free, open-source toolkit for speech recognition developed at Microsoft Research [10]. It provides a speech recognition system based on finite-state transducers together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++ and is released under the Apache License v2.0 which is intrinsically nonrestrictive, making it suitable for a wide community of users.

MIT's WAMI toolkit provides a framework for developing, deploying, and evaluating web-accessible multimodal interfaces in which users interact using speech, mouse, pen, or touch [11]. The toolkit makes use of modern web-programming techniques, enabling the development of browser-based applications which rival the quality of traditional native interfaces, yet are available on a wide array of Internet-connected devices. Additionally, it provides resources for collecting, transcribing, and annotating data from multimodal user interactions.

Sphinx is the name of a group of speech recognition systems developed at Carnegie Mellon University. Among these systems are Sphinx 3 [12], a decoder for speech recognition written in C, Sphinx 4 [13], a modified recognizer written in Java, and Pocketsphinx [14], a lightweight recognizer library written in C.

RWTH Automatic Speech Recognition (RASR) is written in C++ and contains a speech recognizer and tools for the development of acoustic models [15]. Speech recognition systems developed using this framework have shown high performance in multiple international research projects and at regular evaluations, e.g. in the scope of the DARPA project GALE [16].

2.3 Experimental systems

Our aim is to show that open-source speech recognizers can be an alternative to cloud-based commercial speech recognizers that are trained on humongous amounts of data. From the list of available open-source speech recognizers provided in Section 2.2, we chose CMU's Sphinx. The main reason for this choice was the ultimate goal of this research: to optimize recognition performance in the scope of distributed industry-standard-compliant open-source dialog systems [17, 18, 19]. In addition to Sphinx's support of the ARPA language model format [20] and the Java Speech Grammar Format (JSGF), there also exists a Media Resource Control Protocol (MRCP) port [21] which we are using within a complete VoiceXML and SIP-based spoken dialog system. The architecture of this system is depicted in Figure 1. Moreover, Sphinx's license is the least restrictive and permits even commercial use of the code, which makes it very appealing for industrial partners of the authors. For this paper's experiments, we used Pocketsphinx Version 0.8. Pocketsphinx is still in an experimental state and under development.
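As an illustration of how Pocketsphinx is embedded into a host application, the following is a minimal decoding sketch using the Pocketsphinx Python bindings. All file paths are placeholders, and the exact module layout and call signatures vary between Pocketsphinx releases (version 0.8, as used here, predates some of them).

```python
# Minimal Pocketsphinx decoding sketch (Python bindings; paths are
# placeholders, and signatures vary slightly between releases).
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')    # acoustic model directory
config.set_string('-lm', 'models/language.lm')  # ARPA language model
config.set_string('-dict', 'models/words.dic')  # pronunciation dictionary
decoder = Decoder(config)

# Feed raw single-channel PCM at the model's sample rate, then print
# the best hypothesis for the utterance.
decoder.start_utt()
with open('utterance.raw', 'rb') as audio:
    while True:
        chunk = audio.read(4096)
        if not chunk:
            break
        decoder.process_raw(chunk, False, False)
decoder.end_utt()
if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)
```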
We chose Google's speech recognition API because it is the only industrial cloud-based speech recognizer with a freely accessible API.
Figure 1 - Architecture of a Distributed Industry-Standard-Compliant Open-Source Spoken Dialog System: callers reach the Asterisk telephony server via PSTN; Asterisk, the Zanzibar JVoiceXML voice browser, and the Cairo MRCP speech server (hosting Sphinx for ASR and FreeTTS for TTS) communicate via SIP, RTP audio streams, and MRCP; an Apache web server delivers VXML, JSGF, SRGS, and WAV resources via HTTP.

3 Corpus

Our experiments are based on data collected with the Let's Go bus information system in the years 2008 and 2009 [22]. This spoken dialog system created at Carnegie Mellon University provides bus schedule information to the Pittsburgh population during off-peak times. The entire corpus consists of about 250k utterances which have been fully transcribed. We randomly selected 3000 sentences as test data and 3000 sentences as development data. The remaining sentences were used to train a domain-specific language model. To estimate the difficulty of the recognition task, we counted OOV words and computed the 3-gram perplexity of each set. Utterances which could not be transcribed are tagged as NON UNDERSTANDABLE in the corpus. For our experiments, we replaced the NON UNDERSTANDABLE tags by empty transcriptions. See Table 1 for detailed corpus statistics.

Table 1 - Let's Go Bus Information System Corpus Statistics
Train:       466k words; 1819 singletons; NON UNDERSTANDABLE: 17% of sentences
Development: 3000 sentences; 25 OOVs; NON UNDERSTANDABLE: 502 (16%)
Test:        3000 sentences; 5885 words; 16 OOVs; NON UNDERSTANDABLE: 489 (16%)
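The corpus preparation described in this section (replacing NON UNDERSTANDABLE tags by empty transcriptions and drawing random 3000-sentence development and test sets) is straightforward to reproduce. The following is a minimal sketch, assuming one transcription per line in a plain-text file; the file names and the exact tag spelling are assumptions, not taken from the corpus distribution.

```python
# Hypothetical corpus split: 3000 test + 3000 development sentences,
# remainder used for language model training, as described in Section 3.
import random

with open('letsgo_transcriptions.txt', encoding='utf-8') as f:
    # Replace untranscribable utterances by empty transcriptions.
    sentences = [line.replace('NON UNDERSTANDABLE', '').strip()
                 for line in f]

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(sentences)
test, dev, train = sentences[:3000], sentences[3000:6000], sentences[6000:]

for name, subset in [('test', test), ('dev', dev), ('train', train)]:
    with open(f'letsgo.{name}.txt', 'w', encoding='utf-8') as out:
        out.write('\n'.join(subset))
```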
4 Experiments

As motivated in the introduction, the ultimate goal of our experiments was to find out whether the open-source speech recognizer Sphinx is able to outperform Google's speech recognition API (the baseline).

As a first step, we determined the baseline performance on the test set. To this end, we converted the raw audio files into the flac format supported by the Google speech recognition API and achieved a word error rate (WER) of 54.6% (3211 errors out of 5885 words). Short utterances (like yes or no) as well as times and bus numbers were recognized very well, but Google's speech recognition API struggled with grammatically incorrect phrases and with addresses. For example, Google recognized one utterance as "oakland knox avenue in homewood knoxville last bus i cannot have another else like knoxville stuff".
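WER as used here is the standard Levenshtein word error rate: the minimal number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words, with errors pooled over all utterances of the set. A minimal self-contained implementation of this metric might look as follows (the function is our illustration, not the paper's evaluation tooling).

```python
# Word error rate via Levenshtein alignment on word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edit operations turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Corpus-level WER pools errors over all utterances, e.g.
# 3211 errors out of 5885 reference words gives 54.6%.
```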
After establishing the baseline, we started experimenting with Sphinx using a set of default models and settings. In all of the following experiments, we used the Hub-4 acoustic model and dictionary [23]. All tuning experiments were run exclusively on the development set to prevent optimization on the test set. The Hub-4 language model resulted in a WER of 171% on the development set; WERs above 100% are possible because insertion errors are counted while the denominator remains the number of reference words. The WSJ-5k and WSJ-20k language models also produced WERs over 100%. The reason was obviously that their vocabularies and n-gram distributions did not match those of the Let's Go bus information system domain. To overcome this issue, we created a domain-specific language model using the language model toolkit CMUclmtk and the data of the training set (466k word tokens). This customized language model improved the WER to 83.3%.
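Such a domain-specific language model can be built with the classic CMUclmtk pipeline. Below is a sketch of the four usual steps driven from Python; the file names are assumptions, and tool options beyond those shown (vocabulary size, discounting) follow the toolkit defaults.

```python
# Hypothetical CMUclmtk pipeline producing an ARPA language model
# from the Let's Go training transcriptions (file names assumed).
import subprocess

def run(cmd, stdin=None, stdout=None):
    subprocess.run(cmd, stdin=stdin, stdout=stdout, check=True)

with open('letsgo.train.txt') as fin, open('letsgo.wfreq', 'w') as fout:
    run(['text2wfreq'], stdin=fin, stdout=fout)       # word frequencies
with open('letsgo.wfreq') as fin, open('letsgo.vocab', 'w') as fout:
    run(['wfreq2vocab'], stdin=fin, stdout=fout)      # vocabulary list
with open('letsgo.train.txt') as fin:
    run(['text2idngram', '-vocab', 'letsgo.vocab',
         '-idngram', 'letsgo.idngram'], stdin=fin)    # id n-grams
run(['idngram2lm', '-vocab', 'letsgo.vocab',
     '-idngram', 'letsgo.idngram',
     '-arpa', 'letsgo.arpa'])                         # ARPA 3-gram LM
```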
To further increase the performance of Sphinx, we then looked into optimizing recognizer parameters. After running a number of informal tests, we chose to include the following parameters in the formal experiments: frate, lw, bestpathlw, fwdflatlw, wip, pip, and uw. See Table 2 for more information about these parameters.

Table 2 - Parameters Optimized During the Experiment
Name         Default value   Description
frate        100             Frame rate
lw           6.5             Language model probability weight
bestpathlw   9.5             Language model probability weight for DAG best path search
fwdflatlw    8.5             Language model probability weight for flat lexicon decoding
wip          0.65            Word insertion penalty
pip          1               Phone insertion penalty
uw           1               Unigram weight

Although the most rewarding way to optimize the parameters would be to brute-force-test every possible combination of parameter values, this approach would have been too time-consuming given the wide value ranges. Therefore, we assumed independence of the parameters and, hence, that the parameters can be optimized individually. As this assumption is probably not entirely valid, we decided to repeat the cycle of individual parameter optimization twice (a sketch of this greedy procedure follows at the end of this section).

In the first iteration, we reduced the WER to 54.4%, which is an improvement of 37.4% compared to the custom language model with default parameter values. Figure 5 shows that the main improvement was gained through optimizing frate, lw, and bestpathlw.

Figure 5 - Absolute WER Improvement First Iteration: absolute WER improvement [%] per parameter (frate, lw, bestpathlw, fwdflatlw, wip, pip, uw)

Figures 2, 3, and 4 show how the WER depends on the values of these three parameters during the first optimization cycle.

Figure 2 - Results: Frate Optimization First Iteration (WER [%] over frate)
Figure 3 - Results: Lw Optimization First Iteration (WER [%] over lw)
Figure 4 - Results: Bestpathlw Optimization First Iteration (WER [%] over bestpathlw)

While frate and lw exhibit clearly visible local optima, the local minima of bestpathlw are not very obvious. Rather, it appears that any sufficiently high value is acceptable, meaning that the language model score is overly important during the directed acyclic graph (DAG) best path search.

In the second iteration, we were able to decrease the WER by another 2.8%, which is relatively small compared to the first iteration. Again, frate was the parameter with the highest impact, followed by lw, bestpathlw, and wip. Since the WER improved by only 2.8% during the second iteration, we decided to stop tuning parameters at this point. We believed it to be unlikely that further (slim) improvements would propagate to the test set; rather, they would overfit the development set.

In the last step, we applied the optimized parameter configuration to the test set, achieving a WER of 51.2% (3015 errors out of 5885 words). As mentioned above, Google's speech recognition API achieved a WER of 54.6% (3211 errors out of 5885 words). According to a two-sample pooled t-test, this improvement is highly statistically significant.
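The two-cycle greedy search described above is, in effect, coordinate descent over the parameter grid. A minimal sketch follows, under stated assumptions: score() is a caller-supplied hypothetical helper that decodes the full development set with a given parameter configuration and returns its WER; neither the helper nor any value grids are part of Sphinx.

```python
from typing import Callable, Dict, List

def tune(score: Callable[[Dict[str, float]], float],
         defaults: Dict[str, float],
         grids: Dict[str, List[float]],
         iterations: int = 2):
    """Greedy per-parameter (coordinate descent) tuning.

    score(params) must return the development-set WER of a full
    decoding run with the given parameter values."""
    params, best = dict(defaults), score(defaults)
    for _ in range(iterations):              # the paper uses two cycles
        for name, values in grids.items():   # one parameter at a time
            for value in values:
                trial = dict(params, **{name: value})
                wer = score(trial)
                if wer < best:               # keep the best value so far
                    best, params = wer, trial
    return params, best
```

Called with the defaults of Table 2 and illustrative value grids for frate, lw, bestpathlw, fwdflatlw, wip, pip, and uw, two iterations of this loop reproduce the procedure used above.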
Table 3 - Parameter Optimization Results

Step                Parameter    Value   WER     Total errors
default lm                               171%
custom lm                                83.3%
first iteration     frate
                    lw
                    bestpathlw   85
                    fwdflatlw            55.6%
                    wip
                    pip
                    uw
second iteration    frate                        3106
                    lw                           3078
                    bestpathlw   390             3053
                    fwdflatlw            52.2%   3053
                    wip                          3037
                    pip                          3037
                    uw           2               3021

5 Conclusion and future work

Using the Let's Go bus information system corpus, we were able to show that the open-source speech recognizer Sphinx is capable of outperforming Google's speech recognition API. The former achieved a WER of 51.2% on a test set comprising 3000 sentences. This result outperformed Google's by 3.3% absolute, a difference which is highly statistically significant. The most effective parameters tuned during Sphinx's optimization cycles were frate, lw, and bestpathlw.

Even though the improvements in the Let's Go bus information system domain were significant, the overall WER is still very high. One of the reasons is that we did not distinguish individual contexts within the spoken dialog system when training the language model. For instance, in a confirmation context, the user input (mainly yes or no) looks entirely different from a bus destination collection context, which is dominated by names of bus stops. We have shown in [24] that dividing data into micro-contexts has a strongly positive impact on performance. Another weak point of our current setting is that we were using default acoustic and lexical models. Building, or at least tuning, the acoustic model on the collected telephone speech considering the typical Pittsburghian caller and updating the lexicon with relevant vocabulary should improve performance even more, as should the application of speaker adaptation and normalization techniques [25]. We also suspect that a brute-force approach to parameter tuning is more likely to find the global optimum than the greedy two-iteration approach.

As motivated before, we are presently applying the knowledge gained in this tuning experiment to building open-source industry-standard-compliant spoken dialog systems [19]. In order to supply the best possible performance to our systems, in addition to the aforementioned measures, we are also considering comparing Sphinx with other free speech recognizers and possibly applying system combination techniques (see examples in Section 2.2).
References

[1] M. Bacchiani, F. Beaufays, J. Schalkwyk, M. Schuster, and B. Strope, Deploying GOOG-411: Early Lessons in Data, Measurement, and Testing, in Proc. of the ICASSP, Las Vegas, USA.
[2] N. Winarsky, B. Mark, and H. Kressel, The Development of Siri and the SRI Venture Creation Process, SRI International, Menlo Park, USA, Tech. Rep.
[3] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, Building Watson: An Overview of the DeepQA Project, AI Magazine, vol. 31, no. 3.
[4] G. de Melo and K. Hose, Advances in Information Retrieval. New York, USA: Springer.
[5] Z. Liu and M. Bacchiani, TechWare: Mobile Media Search Resources, Signal Processing Magazine, vol. 28, no. 4.
[6] P. Ricci, Nuance CEO Paul Ricci: The Full D11 Interview (Video), interview-video/, May.
[7] S. Young, G. Evermann, M. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Version 3.4. Cambridge, UK: Cambridge University.
[8] G. Evermann, H. Chan, M. Gales, B. Jia, D. Mrva, P. Woodland, and K. Yu, Training LVCSR Systems on Thousands of Hours of Data, in Proc. of the ICASSP, Philadelphia, USA.
[9] H. Murveit, M. Weintraub, and M. Cohen, Training Set Issues in SRI's DECIPHER Speech Recognition System, in Proc. of the DARPA Speech and Natural Language Workshop, Hidden Valley, USA.
[10] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, The Kaldi Speech Recognition Toolkit, in Proc. of the ASRU, Hawaii, USA.
[11] A. Gruenstein, I. McGraw, and I. Badr, The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces, in Proc. of the ICMI, Chania, Greece.
[12] K. Seymore, S. Chen, S. Doh, M. Eskenazi, E. Gouvea, B. Raj, M. Ravishankar, R. Rosenfeld, M. Siegler, R. Stern, and E. Thayer, The 1997 CMU Sphinx-3 English Broadcast News Transcription System, in Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, USA.
[13] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker, M. Warmuth, and P. Wolf, The CMU Sphinx-4 Speech Recognition System, in Proc. of the ICASSP, Hong Kong, China, 2003.
[14] D. Huggins-Daines, M. Kumar, A. Chan, A. Black, M. Ravishankar, and A. Rudnicky, Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices, in Proc. of the ICASSP, Toulouse, France.
[15] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit, in Proc. of the ASRU, Hawaii, USA.
[16] D. Rybach, S. Hahn, C. Gollan, R. Schlüter, and H. Ney, Advances in Arabic Broadcast News Transcription at RWTH, in Proc. of the ASRU, Kyoto, Japan.
[17] D. Suendermann, Advances in Commercial Deployment of Spoken Dialog Systems. New York, USA: Springer.
[18] A. Schmitt, M. Scholz, W. Minker, J. Liscombe, and D. Suendermann, Is It Possible to Predict Task Completion in Automated Troubleshooters?, in Proc. of the Interspeech, Makuhari, Japan.
[19] T. von Oldenburg, J. Grupp, and D. Suendermann, Towards a Distributed Open-Source Spoken Dialog System Following Industry Standards, in Proc. of the ITG, Braunschweig, Germany.
[20] D. Gibbon, R. Moore, and R. Winski, Handbook of Standards and Resources for Spoken Language Systems. New York, USA: Mouton de Gruyter.
[21] D. Prylipko, D. Schnelle-Walka, S. Lord, and A. Wendemuth, Zanzibar OpenIVR: An Open-Source Framework for Development of Spoken Dialog Systems, in Proc. of the TSD, Pilsen, Czech Republic.
[22] A. Raux, D. Bohus, B. Langner, A. Black, and M. Eskenazi, Doing Research on a Deployed Spoken Dialogue System: One Year of Let's Go! Experience, in Proc. of the Interspeech, Pittsburgh, USA.
[23] K. Seymore, S. Chen, M. Eskenazi, and R. Rosenfeld, Experiments in Evaluating Interactive Spoken Language Systems, in Proc. of the DARPA Speech Recognition Workshop, Washington, D.C., USA.
[24] D. Suendermann, J. Liscombe, K. Evanini, K. Dayanidhi, and R. Pieraccini, From Rule-Based to Statistical Grammars: Continuous Improvement of Large-Scale Spoken Dialog Systems, in Proc. of the ICASSP, Taipei, Taiwan.
[25] D. Sündermann, A. Bonafonte, H. Ney, and H. Höge, Time Domain Vocal Tract Length Normalization, in Proc. of the ISSPIT, Rome, Italy, 2004.