What's the Deal with Automatic Speech Recognition?

Tyler Fulcher
11.30.2012
Artificial Intelligence, Lydia Tapia

1. Introduction

According to [Forsberg 2003], "Automatic Speech Recognition (ASR) is the process of interpreting human speech in a computer." The history of ASR devices dates back to the late 19th century, when the first dictation machines were invented. While these were not strictly ASR machines, since they did not interpret speech as our definition requires, they were the first big stepping stone in ASR's development. Today we use ASR in automated telecommunications like banking, in web browsers like Chrome, and of course in Siri, but these are not the only uses. ASR is useful but not without issues, as you may have experienced while trying to tell your phone to call a friend or while interacting with your bank's automated telephone service. The major problems ASR faces now are space complexity and speaker variability.

ASR has its roots in technology developed by Alexander Graham Bell and others to record dictations. This led to the creation of machines that could emulate human speech in the 1800s. In the 1900s these machines were updated and improved, which led scientists to think about the phonetic sound range of speech and how to build machines that would emulate it. Phonemes could then not only be produced by machines but also recognized by them. This led some researchers to develop machines that would recognize vowels and digits, and thus the modern understanding of ASR was born. While the field developed and changed over the next century, it is only within the last thirty years or so that it has been formalized, popularized, and widely used by the public. The uses of ASR vary widely, and while it has yet to reach its full potential, it seems most commonly to be developed to make some jobs obsolete and others easier.
When the technology was developing, it started out as simple recording devices that captured dictations so that a personal stenographer was not needed. Then in the 1980s it was popularized for automating services over the phone, such as banking and billing. Today we have the iPhone with its witty Siri program, which can query the web for answers to simple questions like directions and listings.

Being a fairly new technology, ASR is not without its problems. The main obstacles it must overcome are space complexity and speaker variability. An exhaustive ASR program would contain a very large lexicon that would be impractical to search, and it would take a large amount of memory to store the speech, phonemes, and words. Speaker variability is a much harder problem to address because even the best ASR systems will always make errors when interpreting speech. This is exacerbated by the fact that no two speakers sound exactly the same; even a single speaker will never say the same word the same way twice. ASR systems must be very robust to handle this variability.

The future of ASR is promising for the consumer market. Recent popularity will most likely spawn a lot of interest in the topic, and developers will begin working on more and more devices that can be controlled simply by speaking. While ASR and telecommunications are linked and will probably only improve together, we can also look forward to household appliances like coffee makers and televisions that are controlled by voice.

2. History

Despite the recent popularity of ASR systems like the iPhone's Siri, ASR has had a long history of development, as discussed by [Juang & Rabiner 2004]. ASR is rooted in the work of Alexander Graham Bell and his Volta Graphophone Company, started in 1888, where he and his associates sold the invention of the same name: a recording device that used a stylus to create grooves in wax. Even earlier, in the late 18th century, people had started to develop machines that would mimic human speech. Christian Kratzenstein used resonance tubes connected to a pipe organ to create vowel sounds. Wolfgang von Kempelen followed this up with an Acoustic-Mechanical Speech Machine in 1791, which was improved upon by Charles Wheatstone, who used leather resonators so the speech could be manipulated. In the 20th century, Bell Laboratories produced research on the relationship between the speech spectrum and sound characteristics, which influenced Homer Dudley to create the VODER (Voice Operating Demonstrator), an electrical version of the speech machine developed by Wheatstone.
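To get a feel for the acoustic idea behind these vowel machines: a vowel-like sound can be crudely approximated by summing sinusoids at its formant (resonance) frequencies. The formant values for an "ah" vowel below are rough textbook-style figures, and this sketch ignores the source-filter structure a real synthesizer, or the VODER's human operator, would shape; it is purely illustrative.

```python
import math

SAMPLE_RATE = 8000                 # samples per second
FORMANTS_AH = [700, 1200, 2600]    # rough formant frequencies of "ah", in Hz

def synthesize(formants, seconds=0.5, rate=SAMPLE_RATE):
    """Return samples of equal-weight sinusoids summed at each formant."""
    n = int(seconds * rate)
    return [
        sum(math.sin(2 * math.pi * f * t / rate) for f in formants) / len(formants)
        for t in range(n)
    ]

samples = synthesize(FORMANTS_AH)
print(len(samples))  # 4000 samples for half a second of "sound"
```

Written to a WAV file, something like this produces a buzzy, vaguely vowel-like tone; Kratzenstein's tubes and the VODER achieved the same resonance shaping mechanically and electrically.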
While these were not strictly speech recognition machines, they show the evolution of the technology. The earliest ASR machines were based on the theory of acoustic phonetics that was instrumental in the machines that preceded them, like the VODER. Since speech could be broken down into the phonemes that machines like the VODER could produce, it was only a small step to develop machines that could recognize those phonemes. In 1952 Bell Laboratories did exactly this, building a machine that could recognize single digits from a single speaker. What followed was a decade or so of other labs, including RCA and MIT, developing their own machines to recognize sets of vowel sounds. Next came the ideas of recognizing speech at different pitches, with different inflections, and with different lengths of time. In the 1970s, Tom Martin founded the first commercial ASR company, Threshold Technology, which created the VIP-100 system. This led the U.S. government to fund development of the Harpy system, capable of recognizing 1,000 or so words and notable for being the first ASR system to use a graph search to determine the words being spoken. With the development of these machines, research began to split into different areas; for instance, some companies began developing voice-activated typewriters for office use. Bell Laboratories began to develop a system for voice dialing of a telephone that could recognize a wide variety of accents and talkers, and also placed importance on recognizing certain key words in speech. These efforts led to formal mathematics being defined for the field, and the technologies began to converge on what became the hidden Markov model, now the standard model for speech recognition.

3. Uses

Though the first machines developed in the field of ASR focused on aping human speech, they eventually got people asking: what if machines could recognize, and at some level understand, speech? The first such machines were simple recording devices that allowed someone to dictate a letter. The recorded media could then be turned over to a company of stenographers who would draft the letters, cutting down on the cost of having a personal stenographer. The technology stagnated here for a century before the development of voice-activated typewriters, which were used in offices and replaced many secretaries and still more stenographers. Meanwhile the technology had started to develop in other areas, specifically in use with telephones. AT&T began developing systems that would allow their customers to dial simply by speaking the numbers; such a system had to be capable of recognizing different tones, pitches, and even accents, and determining the number to be dialed. While this endeavor was ultimately unsuccessful, it spawned many other uses of ASR.
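The graph search and hidden Markov model mentioned in the history can be made concrete with a toy Viterbi decoder, the standard dynamic-programming search over an HMM's states. Everything here, the two phoneme-like states, the transition and emission probabilities, and the acoustic "symbols" a and b, is invented for illustration; a real recognizer searches over thousands of states driven by real acoustic features.

```python
def viterbi(states, start_p, trans_p, emit_p, observations):
    """Return the most likely state sequence for a list of observations."""
    # best[state] = (probability of the best path ending here, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prob, path = max(
                (best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1])
                for prev in states
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]

# Toy model: two phoneme-like states emitting acoustic symbols "a" and "b".
states = ["ph1", "ph2"]
start_p = {"ph1": 0.6, "ph2": 0.4}
trans_p = {"ph1": {"ph1": 0.7, "ph2": 0.3}, "ph2": {"ph1": 0.4, "ph2": 0.6}}
emit_p = {"ph1": {"a": 0.9, "b": 0.1}, "ph2": {"a": 0.2, "b": 0.8}}

print(viterbi(states, start_p, trans_p, emit_p, ["a", "a", "b"]))
# ['ph1', 'ph1', 'ph2']
```

The decoder keeps, for each state, the single best path explaining the observations so far, which is the same flavor of graph search Harpy pioneered and HMM systems later formalized.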
Most importantly, this work produced the hidden Markov model, which [Plannerer 2005] claims is widely used today for the development of speech recognition software. We have all used software based on this model: if you have called your bank's customer service line recently, a machine may have requested that you key in or speak your account number. Similarly, if you have ever tried to troubleshoot a program online, some companies provide an interactive, instant-messaging-style program that a person can communicate with to try to solve a problem. These robots are usually just simple programs that scan what you type for key phrases and try to link you to the appropriate help articles based on those key words. The most recent and most popular use today, though, is the iPhone with its Siri program. Siri is voice activated and is a general query system with web access. A user can pose questions to Siri, like asking for directions or for nearby restaurants, and Siri finds the key words and, using the phone's GPS and the web, searches for maps or restaurant listings. Siri also handles more obscure or ambiguous questions with a measure of wit and cleverness. For example, if you were to say to Siri "Beam me up," it might respond "Okay, stand still," or, as told by [Pinola 2011], if you say "Siri, I need to hide a body," it will volunteer the locations of nearby dumps or metal foundries. The use of ASR today is usually tied to phones, but the technology is still in its infancy.

4. Problems

One major problem facing ASR implementations is space complexity. Storing the audio, processing it, and splitting out the phonemes and words all require a lot of space, and the lexicon of words that the system knows consumes still more. It is possible to compress the audio and reduce the lexicon to address these problems, but compressing audio reduces its quality and creates a new set of interpretation problems in a system that already struggles to be precise. Reducing the lexicon works best but limits the number of words the computer can understand, which can cause a speaker to be misunderstood even though they are speaking correctly. In general, perfection is idealistic for an ASR system; time will improve the systems we have, and we may eventually accept that we simply have to interact with computers differently than we do with humans.

Another major problem that complicates ASR implementations, described by [Doe 1998], is variability. Because ASR systems are supposed to be general-use systems, they have to support multiple speakers and adapt to all the variations that introduces. There are variations in speech style, pitch, and anatomy that make each speaker unique. Things like background noise, utterances, and dialects can also negatively affect the interpretation of speech, and even words that sound alike can create problems for ASR systems.
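One simple way a system can stay robust to speaker variability is to fall back, when a decoded phoneme string matches no lexicon entry exactly, to the closest entry by edit distance. The lexicon, the phoneme spellings, and the use of plain Levenshtein distance below are all simplifications invented for illustration; real systems model variability statistically rather than with a hard-coded distance.

```python
# Tiny made-up pronunciation lexicon: word -> space-separated phonemes.
LEXICON = {
    "cat":  "k ae t",
    "can":  "k ae n",
    "call": "k ao l",
    "dog":  "d ao g",
}

def edit_distance(a, b):
    """Classic Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution
        prev = cur
    return prev[-1]

def nearest_word(decoded_phones):
    """Return the lexicon word whose pronunciation is closest to the input."""
    return min(LEXICON,
               key=lambda w: edit_distance(LEXICON[w].split(), decoded_phones))

# A slightly garbled decoding of "call" still maps to the right word.
print(nearest_word(["k", "aa", "l"]))  # call
```

This also shows why a bigger lexicon hurts: every extra word adds another candidate to compare against, which is exactly the search-space problem described above.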
The best that developers can do is build a robust system that handles the variability in a clean way when it cannot accurately determine the meaning.

5. Future

Now that we have standardized the way to implement speech recognition, the real question is: what is the limit? The technology has recently gained popularity with consumers in the form of web browsing and Siri, so developers are likely to latch on to this trend and run with it. Speech recognition will most likely continue to expand in the app market, with apps controlled through ASR. Then it will branch out into our homes, offering lights that turn on and off with simple commands or coffee makers we can program by talking to them. In business, people whose hands are occupied will be able to make notes or do clerical work simply by speaking to their computers. All these things will be available in the near future, if they are not already.

6. Conclusion

As ASR technology develops it will become a more integral part of life. From its humble beginning as a dictation machine to the commercial popularity of Siri, it has come a long way, and it has a long way to go yet. The uses of ASR have always been intertwined with telecommunications and will in all likelihood remain there, but they will also expand into the realm of other household and workplace appliances. ASR is a complex problem and, like most new technologies, is not without problems, but the future is bright and there are plenty of possibilities. This is by no means a scientific paper about ASR; for more detailed information about the implementation of ASR, read the paper by Plannerer listed in the sources. It covers much of the theory behind ASR.

7. Sources

Juang, B.H., and Lawrence R. Rabiner. "Automatic Speech Recognition -- A Brief History of the Technology Development." UCSB. 8 October 2004. Accessed 19 November 2012. http://www.ece.ucsb.edu/faculty/rabiner/ece259/reprints/354_lali-asrhistory-final-10-8.pdf

Forsberg, Markus. "Why is Speech Recognition Difficult?" KTH. 24 February 2003. Accessed 19 November 2012. http://www.speech.kth.se/~rolf/gslt_papers/markusforsberg.pdf

Plannerer, B. "An Introduction to Speech Recognition." 28 March 2005. Accessed 19 November 2012. http://www.speech-recognition.de/pdf/introsr.pdf

Pinola, Melanie. "Speech Recognition Through the Decades: How We Ended Up with Siri." Techworld. 6 November 2011. Accessed 19 November 2012. http://features.techworld.com/applications/3315959/speech-recognition-through-the-decades-how-weended-up-with-siri/

Doe, Hope L. "Evaluating the Effects of Automatic Speech Recognition Word Accuracy." Virginia Tech. 10 July 1998. Accessed 11 November 2012. http://scholar.lib.vt.edu/theses/available/etd-7598-165040/unrestricted/thesis1.pdf