Automated Speech to Text Transcription Evaluation

Ryan H (rhnaraki@calpoly.edu), Haikal Saliba (hsaliba@calpoly.edu), Patrick C (pcasao@calpoly.edu), Bassem Tossoun (btossoun@calpoly.edu), Chad Brantley (cbrantle@calpoly.edu), Gagandeep Kohli (gasingh@calpoly.edu)

Abstract

The California State Legislature is a state governmental body that meets regularly to discuss state legislative action. During these meetings, full transcriptions of the minutes are generally not taken; instead, the long sessions are recorded on video, should they ever need to be referenced. This presents a problem: videos are hard to extract data from. As part of a project aimed at collecting this data into a knowledge repository, we have evaluated a number of transcription software systems and services on their ability to transcribe the data properly, and we provide relevant data regarding their costs. Our results point to Microsoft's MAVIS technology as providing the highest-quality transcripts; however, it is certainly not the cheapest option, given the presence of open-source alternatives such as Julius and Sphinx, limited though they are.

I. INTRODUCTION

The California State Legislature holds various committee meetings to discuss governmental issues. These meetings are recorded on video and audio and uploaded in bulk to the California Channel website. To gain access, ordinary citizens and the media must either search the California Channel and watch the videos or visit the California State Capitol. Through the use of modern technology, we hope to make the California Legislature more easily accessible to the public. This project aims to evaluate the many transcription technologies currently available. Natural Language Processing tools such as OpenCalais will be used to obtain significant keywords such as names, places, and events. The keywords obtained from OpenCalais will be used to create an ontology map, so that documents discussing similar domains or issues are linked together, making the documents searchable.

II. BACKGROUND/RELATED WORK

Two of the many organizations that have taken the initiative to make the US legislature transparent are OpenCongress and OpenGovernment. OpenCongress is a non-profit, non-partisan public resource established when its founders noticed that the US Congress offered few channels for the general public to voice their opinions to policy makers. They state that only a few groups in the US act on and distribute valuable information about political insiders and lobbyists. Even with technology, websites such as the Library of Congress's do not offer a clear way to read and obtain documents. OpenCongress is therefore a webpage that offers governmental data obtained from news, blogs, and social networking to make the government more transparent. They aggregate the data obtained from these sources and classify bills, votes, issues, and people in Congress. Finally, they present it on a user-friendly webpage that allows the public to read and search for governmental data. In addition, they use social networks such as Facebook to let users share information with their friends. OpenGovernment is a public website that aims to make data about the United States' three branches (executive, legislative, and judicial) free and open to the public; it was created by the same founders as OpenCongress.
They believe that by making data openly available, the public is more likely to engage in governmental matters, reducing corruption, promoting better policy, and creating richer democratic institutions. As of November 2010, OpenGovernment contains information about five legislatures: California, Louisiana, Maryland, Texas, and Wisconsin. They obtain governmental information from the Open State Project, Google News, Blog Search, TransparencyData, and Project VoteSmart. Their web page is organized around sort-by buttons for browsing bills and people to obtain information about particular domains, and a track button lets users follow the latest actions in a domain. In addition, users can comment on and share bills or people's documents, contact elected officials, and organize campaigns.
III. FEATURES/REQUIREMENTS EVAL

Legislative Transparency is a long-term project with the ultimate goal of allowing the average user to easily search for information about legislative meetings and documents in a centralized place. In order to achieve this large goal, the project is broken into iterations. The initial iteration hopes to produce meta-data tags, databases, query types, a white paper detailing the work, and a prototype. As knowledge engineers, Team 2 will focus on evaluating various audio-to-text (transcription) software to find one that is less error-prone, and will provide a report on the evaluation process that will become part of the white paper. The goal is to provide Dr. Blakeslee and the rest of the Legislative team with a building block for the future. The chosen software will be used to convert audio from legislative videos into text, which will then be processed with Natural Language Processing (NLP) software. The NLP software will identify key speakers and information within the audio and its relationship to other meetings, ultimately allowing the construction of a database repository that can be queried for the desired questions.

A. Feature List
1) Evaluation of speech-to-text software
2) Cost effectiveness (money, time, computational resources) - looks at free vs. paid software and the cost of time

B. Requirement List
1) Major, minor, and proper noun errors produced by the various speech-to-text software
2) Time it takes to transcribe an audio file
3) Usability/accessibility of the API, web service, etc. - the need for human intervention, such as breaking the audio into chunks or converting its format (see the pre-processing sketch at the end of this section)

C. Evaluation List
1) A chart that displays the breakdown of errors as major, minor, and proper noun.
2) The time it takes to transcribe an audio file.
3) Usability, measured on a scale of 1-5, in which 1 means the system requires major outside help to pre-process the audio and 5 means no outside work is involved besides uploading the audio and pressing transcribe.
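To make the pre-processing burden in requirement B.3 concrete, the following minimal Python sketch converts a recording to the constrained WAV profile several services expect (8 kHz, mono, 16-bit PCM; see Section IV) and splits it into 4-minute chunks, the per-request limit of the AT&T API discussed below. The file names, chunk length, and audio parameters are illustrative assumptions, not part of any evaluated system; ffmpeg handles the format conversion.

```python
import subprocess
import wave

def convert_to_wav(src, dst="meeting.wav"):
    # Delegate format conversion to ffmpeg:
    # -ac 1 = mono, -ar 8000 = 8 kHz, pcm_s16le = 16-bit PCM
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "8000",
                    "-acodec", "pcm_s16le", dst], check=True)
    return dst

def split_wav(path, chunk_seconds=240):
    """Yield file names of successive chunks, each at most chunk_seconds long."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = "chunk_%03d.wav" % index
            with wave.open(name, "wb") as dst:
                dst.setparams(params)  # header is patched on close
                dst.writeframes(frames)
            yield name
            index += 1

# Example: chunks = list(split_wav(convert_to_wav("hearing.mp4")))
```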
IV. IMPLEMENTATION

A. Technologies Explored
- MAVIS
- AT&T
- Dragon Dictation
- Google Voice
- VoxForge/Julius

B. MAVIS

1) Overview: Microsoft Audio Video Indexing Service (MAVIS) is a Windows Azure application that uses speech recognition technology developed at Microsoft Research to enable searching of digitized spoken content. MAVIS generates automatic closed captions and keywords, which can increase the accessibility of audio and video files with speech content. MAVIS uses Deep Neural Net (DNN) based speech recognition technology, which reduces recognition errors by automatically expanding its vocabulary and storing word alternatives using a technique referred to as Probabilistic Word-Lattice Indexing. More explanation is available in the technical background on the Microsoft website. MAVIS, the technology at the foundation of the Washington Post's Truth Teller project, has been shown to transcribe sessions of Congress and fact-check them, making it worth a closer look.

2) Results:
Cost: $20 per hour
Major Errors: 28
Minor Errors: 59
Proper Noun Errors: 16
Readability (1-5): 5
Ease of Setup (1-5): 4
Ease of Continued Use (1-5): 4

3) Advantages/Strengths:
- Hosted solution in the cloud
- Transcribes multiple speakers
- No initial voice training required
- Good customer support
- Better at recognizing names than other technologies
- Words that are confidently understood are in bold script
- Wide variety of input files allowed
- Captions synced to video

4) Disadvantages/Weaknesses:
- Punctuation and capitalization can appear arbitrary at times
- Transcription of a 20-minute video can take up to 2 hours
- Words tend to be left out altogether if not understood
- Strange characters can appear in the transcript

C. AT&T

1) Overview: AT&T's Speech API is a cloud-based service that transcribes audio to text using AT&T's Watson speech engine. The service requires that you specify a relevant context for it to draw on; all contexts are built into the service, with no ability to define your own. In total, AT&T provides and maintains 7 contexts:
- Web Search
- Business Search
- Voicemail To Text
- SMS
- Question and Answer
- TV
- Generic

Being a cloud-based service, most of the hard work is done on AT&T's platform. As such, the API can be called from many different environments and languages with the same results. Requests are made to AT&T's servers over HTTP, and the servers perform speech-to-text analysis on the input files using the Watson speech engine. Input files can be of two formats:
- WAV, 16-bit PCM, single channel, 8 kHz sampling
- AMR (narrowband), 12.2 kbit/s, 8 kHz sampling (recommended)

As an additional constraint, audio can only be sent 4 minutes at a time. AT&T provides a number of APIs for the service, supporting the following environments:
- HTML5
- MS
- RESTful

As a result, most languages can issue a speech-to-text request to AT&T, including Java, Ruby, and C#. A sketch of such a request follows.
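The sketch below shows what one such transcription request looks like in Python. It is a hedged illustration of the API's documented style (OAuth bearer token, one audio chunk per HTTP POST, a named context): the endpoint URL, header names, and response layout are assumptions from our reading of the documentation, and the access token placeholder stands in for AT&T's OAuth flow.

```python
import requests

ACCESS_TOKEN = "..."  # obtained via AT&T's OAuth flow (placeholder)

def transcribe_chunk(path, context="Generic"):
    with open(path, "rb") as f:
        audio = f.read()
    resp = requests.post(
        "https://api.att.com/speech/v3/speechToText",  # assumed endpoint
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Content-Type": "audio/amr",  # AMR narrowband is recommended
            "X-SpeechContext": context,   # one of the 7 built-in contexts
        },
        data=audio,
    )
    resp.raise_for_status()
    # Assumed response layout: best hypothesis nested under "Recognition"
    return resp.json()["Recognition"]["NBest"][0]["ResultText"]

# Long recordings must be sent 4 minutes at a time (see the chunking
# sketch in Section III), then the partial transcripts concatenated:
# transcript = " ".join(transcribe_chunk(c) for c in chunks)
```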
2) Results:
Language: RESTful Java
Cost: $99/yr + $0.01/API call past 1 million/month
Major Errors: 27
Minor Errors: 65
Proper Noun Errors: 16
Readability (1-5): 2
Ease of Setup (1-5): 3
Ease of Continued Use (1-5): 4

3) Advantages:
- Cheap: a yearly fee of $99 plus $0.01 per API call past 1 million per month
- Easy to use and versatile: any language with HTTP support should be able to use it
- Works on multiple speakers
- Quick: roughly 1 minute of processing per 1 minute of audio

4) Disadvantages:
- 4 minutes at a time; long audio must be broken up
- Transcription is not very strong; many errors
- AMR audio format (mostly) required: the WAV format worked inconsistently
- Proper noun recognition is poor: it does not capitalize except at the start of a sentence, and often errs on names
- Poor punctuation: seems arbitrary at times

D. Dragon Dictation

1) Overview: Dragon Dictation is speech recognition software that lets you use your voice to create and edit text or interact with applications on your machine. It lets you create and edit documents, manage e-mail, surf the Web, and more. It also provides digital voice software for mobile devices that lets you capture notes on the go and transcribe them with Dragon Dictate. The software is not 100 percent accurate out of the box and depends on the user correcting its dictation as it is used. The more it is used and corrected, the more accurate its language model becomes. You can even use recordings made on your mobile device to build your personal language model. Although Dragon appeared to be a solid transcription technology for a single user, it proved to be intended for exactly that: a single user. Output from Dragon also had no punctuation. For our purposes, it is not worth pursuing further evaluation of Dragon.

2) Results:
Cost: $200
Platform: Windows, Mac OS X
Major Errors: 45
Minor Errors: 35
Proper Noun Errors: 16
Readability (1-5): 1
Ease of Setup (1-5): 4
Ease of Continued Use (1-5): 4

3) Advantages:
- Relatively malleable language model
- Transcribes audio relatively quickly
- Easily loads audio files in a range of different formats

4) Disadvantages:
- Requires voice training
- Intended to learn a single user's speech patterns
- No punctuation
- Proper nouns may get lost in the noise
E. Google Voice

1) Overview: The Google Voice API is a speech recognition API that supports audio-to-text automation. It allows you to use your voice to create and edit text or interact with applications on your machine. Google Voice has its own software and also provides the framework behind the Closed Captioning feature on YouTube. The software is often used to translate voicemail messages to text so that the user can read a message without having to listen to it. The Google Voice API can also be found in Android mobile phones, where it provides speech recognition and voice navigation through applications. This version of the Google Voice API is not actually public, and it can handle videos of any size. A Speech2Text program was written using this version of the Google Voice API; it takes in a WAV file and outputs the text it transcribes from the audio. The software still has a few rough edges, and a fatal flaw when processing audio files with sections of little or no sound (variability in frequency). The program does a decent job, and because its code is available and editable, we can hopefully improve it. A sketch of an equivalent call follows.
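The Speech2Text program described above is not public. As an approximation, this sketch uses the community SpeechRecognition package, which wraps the same unofficial Google recognizer; the file name is illustrative, and only WAV input is assumed, as noted above.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting_chunk.wav") as source:
    audio = recognizer.record(source)  # read the entire WAV file

try:
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    # The failure mode noted above: stretches of little or no sound
    # can leave the recognizer with nothing usable to return.
    print("[unrecognized segment]")
```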
2) Results:
Cost: $0
Major Errors: 34
Minor Errors: 43
Proper Noun Errors: 25
Readability (1-5): 1
Ease of Setup (1-5): 3
Ease of Continued Use (1-5): 2

3) Advantages/Strengths:
- Transcribes audio relatively quickly
- Free
- Can transcribe a video of any length

4) Disadvantages/Weaknesses:
- Only supports WAV files
- Has trouble with audio files that include sections of little or no sound
- No punctuation
- Proper nouns may get lost in the noise

F. VoxForge/Julius

VoxForge is the most complete open-source English speech corpus; it compiles speech into acoustic models for other software systems, such as Julius, Sphinx, and HTK, to work with. Using this data, these systems can match parts of the resulting acoustic model to words, or perform other operations on them. Julius is an open-source speech recognition system; its development began in 1997 in Japan, and it has since been adapted to many different languages. Julius requires two things to interpret speech: an acoustic model, which VoxForge provides, and a grammar of words to match the audio against. The grammar, however, must be tailored to the acoustic model, and few generic grammars seem to exist; as such, the Julius/VoxForge combination looks like a difficult option, or at least one that would require more time to set up and evaluate. A sketch of what running it would involve follows.
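As a rough sketch of that setup burden, the snippet below drives Julius from Python with a VoxForge acoustic model. All file paths and model names are assumptions, and the flags reflect our reading of Julius's documentation; the grammar must first be written as .grammar/.voca files and compiled with Julius's mkdfa.pl tool, which is exactly the tailoring effort noted above.

```python
import subprocess

result = subprocess.run(
    ["julius",
     "-h", "voxforge/hmmdefs",       # VoxForge acoustic model (assumed path)
     "-hlist", "voxforge/tiedlist",  # tied-state triphone list (assumed path)
     "-gram", "sample",              # loads sample.dfa / sample.dict from mkdfa.pl
     "-input", "rawfile"],           # read audio file names from stdin
    input="meeting_chunk.wav\n",
    capture_output=True, text=True,
)

# Julius prints each recognized sentence on a line beginning "sentence1:"
for line in result.stdout.splitlines():
    if line.startswith("sentence1:"):
        print(line)
```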
V. VALIDATION

To evaluate the various software, a 6-minute sample of a legislative meeting was extracted and manually transcribed. The sample was passed through each transcription system, which produced output transcripts. The location and number of errors made by each system were compared against the manual transcription.

A. Error Definition

An error spans from where the first error occurred to the end of where that type of error occurred. Errors are defined this way because an error's beginning is usually the root cause of the rest of a phrase being invalid (see the alignment sketch at the end of this section).

B. Error Types

1) Major (red marks):
- Continuous stream of incorrect words
- Continuous stream of missing words

2) Minor (yellow marks):
- One-word error
- Spelling error
- Grammar error (two/too)
- Capitalization error
- Period or thought-break error
- Commas are not counted as minor errors

3) Proper Nouns (green marks):
- Inability to identify proper nouns correctly (USCB, California, names, Senator).

Proper noun errors are counted as part of either a minor or a major error: major if their context includes a major error, minor otherwise. We consider uncapitalized proper nouns an error because Natural Language Processing software relies on correct use of nouns to identify key people and places. We would therefore like to minimize the errors that propagate into the Natural Language Processing stage by picking a robust transcription software.

C. Usability Criteria

Transcription software is also evaluated on several qualitative measures:
- How readable the transcript is overall (1-5, 5 = most readable). If the reader can understand the content in spite of the errors, readability is high.
- How easy the software is to set up initially (1-5, 5 = easy)
- How easy the software is to continue using after initial setup (1-5, 5 = easy)
- General advantages/strengths
- General disadvantages/weaknesses
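To make the span-based error definition above concrete, this minimal sketch aligns the manual (reference) transcript with a system's output word by word and treats each contiguous run of wrong, missing, or spurious words as one error span; classifying spans as major, minor, or proper noun remained a manual step. The example strings are invented for illustration.

```python
import difflib

def error_spans(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    spans = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # one contiguous run of substituted/missing/spurious words
            spans.append((op, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return spans

for op, expected, got in error_spans("the senator from california said",
                                     "the center from californium said"):
    print("%s: expected %r, got %r" % (op, expected, got))
```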
VI. CONCLUSION

According to our results, MAVIS is the best choice for this use case. Even though transcription of a single file may take hours, multiple files can be processed in parallel on Microsoft's cloud. The AT&T API, though comparable to MAVIS in number of errors, often produces low-readability transcripts and requires more effort to manually correct those errors. Google Voice, though free, produces highly unreadable transcripts with a large number of errors. Dragon Dictation produces highly unreadable transcripts as well, mainly because it is not tailored for this use case, training instead on a single speaker. The main concern with MAVIS is price: the software is not open source, and using it requires a paid subscription. However, even with MAVIS, the resulting transcripts are still imperfect, with some major errors as well as many proper noun errors. Though a reader would be able to follow the logic of the transcript, the document would still require manual correction to achieve a correct transcription.

TABLE I
OVERALL SYSTEM COMPARISON

System            Cost                                  Platform                       Major   Minor   Proper Noun  Readability  Ease of      Ease of Continued
                                                                                       Errors  Errors  Errors       (1-5)        Setup (1-5)  Use (1-5)
MAVIS             $20/h                                 Microsoft Azure                28      59      16           5            4            4
AT&T              $99 + $0.01/API call past 1 million   RESTful Java                   27      65      16           2            3            4
Dragon Dictation  $200                                  Windows/Mac application v. 11  45      35      16           1            4            4
Google Voice      $0                                    Windows/Mac application        34      43      25           1            3            2

(Readability: 5 = most readable. Ease of Setup and Ease of Continued Use: 5 = easiest.)

VII. FUTURE WORK

The final intent of the legislature project is to allow ordinary citizens and the media to search through California State legislature hearings. This white paper focuses on the various transcription technologies and reaches the conclusion that there is no ideal transcription software. The legislature team therefore envisions taking transcription one step further: taking each of the audio transcriptions from the various technologies and processing it through OpenCalais. OpenCalais is a web service that analyzes textual documents to find named entities, facts, and events, collectively known as metadata. With OpenCalais's help, the union of the documents' metadata can be used to reduce the noise from transcription errors; furthermore, OpenCalais provides the relevance of each piece of metadata. The relevance weight indicates how relevant and important the metadata is. As a result, metadata with a relevance score of 0.4 or above can be used as keywords or tags for searching through the document. Though many of the APIs evaluated did not output human-readable documents, we are curious whether analyzing the output through a tagging system still yields accurate tags. As such, we plan to use OpenCalais to analyze the output files from each of the evaluated APIs and retrieve the tags associated with the output. We then plan to compare the resulting tags against the actual content of the analyzed audio file to determine whether the tags are valid and represent the major themes portrayed in the file. A sketch of this planned step follows.
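The sketch below outlines the planned OpenCalais step: submit a transcript and keep entities whose relevance score is at least 0.4, per the threshold above. The endpoint URL, header names, and response layout are assumptions drawn from the service's documented REST interface and may differ in current versions; the license key is a placeholder.

```python
import requests

API_KEY = "..."  # OpenCalais license key (placeholder)

def extract_tags(transcript: str, threshold: float = 0.4):
    resp = requests.post(
        "https://api.opencalais.com/tag/rs/enrich",  # assumed endpoint
        headers={
            "x-calais-licenseID": API_KEY,  # assumed header name
            "Content-Type": "text/raw",
            "Accept": "application/json",
        },
        data=transcript.encode("utf-8"),
    )
    resp.raise_for_status()
    tags = []
    # Assumed response layout: a JSON object mapping entity URIs to
    # dicts carrying "name", "_type", and a "relevance" score in [0, 1]
    for entity in resp.json().values():
        if isinstance(entity, dict) and entity.get("relevance", 0) >= threshold:
            tags.append((entity.get("name"), entity.get("_type"),
                         entity["relevance"]))
    return tags
```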