Tyler Fulcher
11.30.2012
Artificial Intelligence
Lydia Tapia

What's the Deal with Automatic Speech Recognition?

1. Introduction

According to [Forsberg 2003], "Automatic Speech Recognition (ASR) is the process of interpreting human speech in a computer." The history of ASR devices dates back to the late 19th century, when the first dictation machines were invented. While these are not strictly ASR machines, because they do not interpret speech as per our definition, they were the first big stepping stone in ASR's development. Today we use ASR in automated telecommunications like banking, in web browsers like Chrome, and of course in Siri, but these are not the only uses. ASR is useful but not without issues, as you may have experienced while trying to tell your phone to call a friend or while interacting with an automated telephone service for your bank. The major problems ASR faces now have to do with space complexity and speaker variability.

ASR has its roots in technology developed by Alexander Graham Bell and others to record dictations in the late 1800s, though even earlier, in the late 1700s, inventors had built machines that could emulate human speech. In the 1900s these machines were updated and improved upon, which led scientists to think about the phonetic sound range of speech and how to build machines that would emulate it. Phonemes could then not only be produced by machines but also recognized by them. This led some researchers to develop machines that could recognize vowels and digits, and thus the modern understanding of ASR was born. While it has developed and changed over the last century, it is only within the last thirty years or so that it has been formalized, popularized, and widely used by the public. The uses of ASR vary widely, and it has yet to reach its full potential, but it seems most commonly to be developed to make some jobs obsolete and others easier.
When the technology was developing, it started out as simple recording devices that captured dictations so that a personal stenographer was not needed. Then in the 1980s it became popular for automating services over the phone, such as banking and billing. Today we have the iPhone with its witty Siri program, which can query the web for answers to simple questions like directions and listings.

Being a fairly new technology, ASR is not without its problems. The main obstacles it must overcome are space complexity and speaker variability. An exhaustive ASR program would contain a very large lexicon that would be impractical to search, and it would take a large amount of memory to store the speech, phonemes, and words. Speaker variability is a much harder problem to address: even the best ASR systems make errors when interpreting speech, and this is exacerbated by the fact that no two speakers sound exactly the same; even a single speaker will never say the same word the same way twice. ASR systems must be very robust to handle this variability.

The future of ASR is promising for the consumer market. Recent popularity will most likely spawn a lot of interest in the topic, and developers will begin working on more and more devices that can be controlled simply by speaking. While ASR and telecommunications are linked and will probably only improve together, we can also look forward to household appliances like coffee makers and televisions that are controlled by voice.

2. History

Despite the recent popularity of ASR systems like the iPhone's Siri, ASR has had a long history of development, as discussed by [Juang & Rabiner 2004]. ASR is rooted in the work of Alexander Graham Bell and his Volta Graphophone company, started in 1888, where he and his fellows sold a recording device of the same name that used a stylus to create grooves in wax. Even earlier, in the late 18th century, people had started to develop machines that would mimic human speech. Christian Kratzenstein used resonance tubes connected to a pipe organ to create vowel sounds, and Wolfgang von Kempelen followed this up with an Acoustic-Mechanical Speech Machine in 1791. This was improved upon by Charles Wheatstone, who used leather resonators so the speech could be manipulated. In the 20th century, Bell Laboratories produced research on the relationship between the speech spectrum and sound characteristics, which influenced Homer Dudley to create the VODER (Voice Operation Demonstrator), an electrical version of the speech machine developed by Wheatstone.
While these machines do not strictly perform speech recognition, they show the evolution of the technology. The earliest ASR machines were based on the theory of acoustic phonetics that was instrumental in the machines that preceded them, like the VODER. Since machines like the VODER could produce phonemes, it was only a small step to develop machines that could recognize them. In 1952 Bell Laboratories did just this, developing a machine that could recognize single digits spoken by a single speaker. What followed was a decade or so of other labs, including RCA and MIT, developing their own machines to recognize sets of vowel sounds. Next came the ideas of recognizing speech at different pitches, with different inflections, and over different lengths of time.

In the 1970s, Tom Martin founded the first commercial ASR company, Threshold Technology, which created the VIP-100 system. Around the same time, U.S. government (DARPA) funding supported development of the Harpy system at Carnegie Mellon, capable of recognizing 1,000 or so words; it is notable for being the first ASR system to use a graph search to determine the words being spoken. With the development of these machines, research began to split into different areas. Some companies began developing voice-activated typewriters for office use, while Bell Laboratories worked on a system for voice dialing of a telephone that would be able to recognize a wide variety of accents and talkers. They also placed importance on recognizing certain key words in speech. These efforts led to formal mathematics being defined for the field, and the technologies began to converge on what became the hidden Markov model, now the standard model for speech recognition.

3. Uses

Though the first machines developed in the field of ASR focused on aping human speech, they eventually got people asking: what if machines could recognize, and at some level understand, speech? The first such machines were simple recording devices that allowed someone to dictate a letter. The recorded media could then be turned over to a company of stenographers who would draft up the letters, cutting down on the cost of keeping a personal stenographer. The technology stagnated here for a century before the development of voice-activated typewriters, which were used in offices and replaced many secretaries and still more stenographers. Meanwhile the technology had started to develop in other areas, specifically in use with telephones. AT&T began developing systems that would allow their customers to dial simply by speaking the numbers; such a system had to be capable of recognizing different tones, pitches, and even accents, and of determining the number to be dialed. While ultimately this endeavor was unsuccessful, it spawned many other uses of ASR.
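To make the hidden Markov model idea concrete, here is a toy Viterbi decoder, the core decoding step an HMM-based recognizer performs. This is a textbook sketch, not drawn from the sources: the two states, the observation symbols, and all probabilities are invented stand-ins for phoneme-like units and acoustic evidence.

```python
# Toy Viterbi decoder for a two-state hidden Markov model.
# States stand in for phoneme-like units; the observation symbols and
# all probabilities below are invented for illustration only.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for obs."""
    # V[t][s] = (best probability of ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev_row = V[-1]
        V.append({
            s: max(
                ((prev_row[p][0] * trans_p[p][s] * emit_p[s][o], p)
                 for p in states),
                key=lambda pair: pair[0],
            )
            for s in states
        })
    # Backtrack from the most probable final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(V) - 1, 0, -1):
        best = V[t][best][1]
        path.append(best)
    return list(reversed(path))

# Hypothetical model: a fricative-like state "s" and a stop-like state "t".
states = ["s", "t"]
start_p = {"s": 0.6, "t": 0.4}
trans_p = {"s": {"s": 0.7, "t": 0.3}, "t": {"s": 0.4, "t": 0.6}}
emit_p = {"s": {"hiss": 0.8, "burst": 0.2}, "t": {"hiss": 0.1, "burst": 0.9}}

print(viterbi(["hiss", "hiss", "burst"], states, start_p, trans_p, emit_p))
# -> ['s', 's', 't']
```

Real recognizers work the same way in spirit, but over thousands of states, continuous acoustic features, and log probabilities to avoid underflow.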
Particularly, it helped produce the hidden Markov model, which [Plannerer 2005] claims is widely used today in the development of speech recognition software. We have all used software based on this model: if you have called your bank's customer service line recently, they may have had a machine that requests that you key in or speak your account number. Similarly, if you have ever tried to troubleshoot a program online, some companies provide an interactive, instant-messaging-style program that a person can communicate with to try to solve the problem. These robots are usually just simple programs that scan what you type for key phrases and try to link you to the appropriate help articles based on those key words. The most recent and most popular use today, though, is the iPhone with its Siri program. Siri is voice activated and is a general query system with web access. A user can pose questions to Siri like asking for directions or for nearby restaurants, and Siri finds the key words and, using the phone's GPS and the web, searches for maps or restaurant listings. Siri also handles more obscure or ambiguous questions with a measure of wit and cleverness. For example, if you were to say to Siri 'Beam me up', it might respond 'Okay, stand still', or, as told by [Pinola 2011], if you say 'Siri, I need to hide a body' it will volunteer the locations of nearby dumps or metal foundries. The use of ASR today is usually tied to phones, but the technology is still in its infancy.

4. Problems

One major problem facing ASR implementations is space complexity. Storing audio, processing the audio, and splitting out the phonemes and words all require a lot of space, and the lexicon of words that the system knows consumes still more. It is possible to compress the audio and reduce the lexicon to address these problems, but compressing audio reduces its quality and creates a new set of interpretation problems in a system that already struggles to be precise. Reducing the lexicon works best but limits the number of words the computer can understand, which can cause a speaker to be misunderstood even though they are speaking correctly. In general, perfection is idealistic for an ASR system; time will improve the systems we have, and we may eventually accept that we simply have to interact with computers in a different way than we do with humans.

Another major problem that complicates ASR implementations, described by [Doe 1998], is variability. Because ASR systems are supposed to be general-use systems, they have to support multiple speakers and adapt to all the variation that introduces. There are variations in speech style, pitch, and anatomy that make each speaker unique, and things like background noise, utterances, and dialects can negatively affect the interpretation of speech. Even words that sound alike can create problems for ASR systems.
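Returning to the space problem above: one standard way to keep a large lexicon searchable is to store pronunciations in a trie (prefix tree), so that words sharing leading phonemes share storage and search effort. This is a general data-structure sketch, not a technique the cited papers prescribe, and the phoneme symbols are invented for illustration.

```python
# Sketch of a pronunciation lexicon stored as a trie (prefix tree).
# Words that share leading phonemes share nodes, so looking up a word
# costs one step per phoneme instead of a scan over the whole word list.

class TrieNode:
    def __init__(self):
        self.children = {}  # phoneme symbol -> TrieNode
        self.word = None    # word whose pronunciation ends here, if any

def add_word(root, phonemes, word):
    node = root
    for p in phonemes:
        node = node.children.setdefault(p, TrieNode())
    node.word = word

def lookup(root, phonemes):
    node = root
    for p in phonemes:
        node = node.children.get(p)
        if node is None:
            return None  # phoneme sequence not in the lexicon
    return node.word

lexicon = TrieNode()
add_word(lexicon, ["k", "ae", "t"], "cat")
add_word(lexicon, ["k", "ae", "b"], "cab")  # shares the "k ae" prefix with "cat"

print(lookup(lexicon, ["k", "ae", "t"]))  # -> cat
print(lookup(lexicon, ["d", "aa", "g"]))  # -> None (out of vocabulary)
```

The trade-off discussed above still applies: a smaller trie is faster and cheaper to store, but any word left out of it simply cannot be recognized.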
The best that developers can do is to build a robust system that handles the variability cleanly when it cannot accurately determine the meaning.

5. Future

Now that we have standardized the way to implement speech recognition, the real question is: what is the limit? The technology has recently gained popularity with consumers in the form of web browsing and Siri, so it is likely that developers are going to latch on to this trend and run with it. The future of speech recognition will most likely continue to expand in the app market, with apps controlled through ASR. Then it will branch out into our homes, offering us lights that turn on and off with simple commands or coffee makers we can program by talking to them. In business, people whose hands are occupied will be able to make notes or do clerical work simply by speaking to their computers. All these things will be available in the near future, if they are not already.

6. Conclusion

As ASR technology develops, it will become a more integral part of life. From its humble beginning as a dictation machine to the commercial popularity of Siri, it has come a long way, and it has a long way to go yet. The uses of ASR have always been intertwined with telecommunications and will in all likelihood remain there, but they will also expand into the realm of other household and workplace appliances. ASR is a complex problem and, like most new technologies, is not without its issues, but the future is bright and there are plenty of possibilities. This is by no means a scientific paper about ASR; for more detailed information about the implementation of ASR, read the paper by Plannerer mentioned in the sources section, which covers much of the theory behind the field.

7. Sources

Juang, B.H. and Lawrence R. Rabiner. Automatic Speech Recognition -- A Brief History of the Technology Development. UCSB. 8 October 2004. Accessed 19 November 2012. http://www.ece.ucsb.edu/faculty/rabiner/ece259/reprints/354_lali-asrhistory-final-10-8.pdf.

Forsberg, Markus. Why is Speech Recognition Difficult? KTH. 24 February 2003. Accessed 19 November 2012. http://www.speech.kth.se/~rolf/gslt_papers/markusforsberg.pdf.

Plannerer, B. An Introduction to Speech Recognition. 28 March 2005. Accessed 19 November 2012. http://www.speech-recognition.de/pdf/introsr.pdf.

Pinola, Melanie. Speech Recognition Through the Decades: How We Ended Up with Siri. Techworld. 6 November 2011. Accessed 19 November 2012. http://features.techworld.com/applications/3315959/speech-recognition-through-the-decades-how-weended-up-with-siri/.

Doe, Hope L. Evaluating the Effects of Automatic Speech Recognition Word Accuracy. Virginia Tech. 10 July 1998. Accessed 11 November 2012. http://scholar.lib.vt.edu/theses/available/etd-7598-165040/unrestricted/thesis1.pdf.