Automated Speech to Text Transcription Evaluation




Ryan H, Email: rhnaraki@calpoly.edu
Haikal Saliba, Email: hsaliba@calpoly.edu
Patrick C, Email: pcasao@calpoly.edu
Bassem Tossoun, Email: btossoun@calpoly.edu
Chad Brantley, Email: cbrantle@calpoly.edu
Gagandeep Kohli, Email: gasingh@calpoly.edu

Abstract

The California State Legislature is a state governmental body that meets regularly to discuss state legislative action. Full transcriptions of these meetings are generally not produced; instead, the long sessions are recorded in case they ever need to be referenced. This presents a problem: it is hard to extract data from video. As part of a project aimed at collecting this data into a knowledge repository, we evaluated a number of transcription software packages and services on their ability to transcribe the recordings accurately, and we report relevant data on their costs. Our results point to Microsoft's MAVIS technology as providing the highest-quality transcript; however, it is certainly not the cheapest option, even considering the limited presence of open-source alternatives such as Julius and Sphinx.

I. INTRODUCTION

The California State Legislature holds various committee meetings to discuss governmental issues. These meetings are recorded as video and audio and uploaded in bulk to the California Channel website. To obtain access, ordinary citizens and the media must either search the California Channel and watch the videos or visit the California State Capitol. Through the use of modern technology, we hope to make the California Legislature more easily accessible to the public. This project aims to evaluate the many transcription technologies currently available. Natural Language Processing tools such as OpenCalais will be used to extract significant keywords such as names, places, and events.
The keywords obtained from OpenCalais will be used to create an ontology map so that documents discussing similar domains or issues are linked together, making the documents searchable.

II. BACKGROUND/RELATED WORK

Two of the many organizations that have taken the initiative to make the US legislature transparent are OpenCongress and OpenGovernment. OpenCongress is a non-profit, non-partisan public resource that was established when its founders noticed that the US Congress offered few channels for the general public to voice opinions to policy makers. They state that only a few groups in the US act on and distribute valuable information about political insiders and lobbyists. Even with modern technology, websites such as that of the Library of Congress do not offer a clear way to read and obtain documents. OpenCongress is therefore a website that offers governmental data obtained from news, blogs, and social networking to make the government more transparent. It aggregates the data obtained from these sources and classifies bills, votes, issues, and people in Congress. Finally, it uses a user-friendly web page to let the public read and search governmental data. In addition, it uses social networking, such as Facebook, to let users share information with their friends.

OpenGovernment is a public website, created by the founders of OpenCongress, that aims to make data about the three branches of the United States government (executive, legislative, and judicial) free and open to the public. Its founders believe that making data openly available makes the public more likely to engage in governmental matters, reduces corruption, promotes better policy, and creates a richer democratic institution. As of November 2010, OpenGovernment contains information about five legislatures: California, Louisiana, Maryland, Texas, and Wisconsin. It obtains governmental information from Open State Project, Google News, Blog Search, TransparencyData, and Project VoteSmart.
Its web page is organized around sort-by buttons for browsing bills and people to obtain information about particular domains. A track button lets a user follow the latest actions in a domain. In addition, users can comment on and share bills or people's documents, contact elected officials, and organize campaigns.

III. FEATURES/REQUIREMENTS EVAL

Legislative Transparency is a long-term project whose ultimate goal is to let the average user easily search for information about legislative meetings and documents in one centralized place. To reach this goal, the project is broken down into iterations. The initial iteration aims to produce meta-data tags, databases, query types, a white paper detailing the work, and a prototype. As knowledge engineers, Team 2 will focus on evaluating various audio-to-text (transcription) software to find the least error-prone option, and will provide a report on the evaluation process that will become part of the white paper. The goal is to provide Dr. Blakeslee and the rest of the Legislative team with a building block for the future. The chosen software will be used to convert audio from legislative videos into text, which will then be processed with Natural Language Processing (NLP) software. The NLP software will identify key speakers and information within the audio and its relationship to other meetings, and will ultimately allow the construction of a database repository that can be queried for desired questions.

A. Feature List
1) Evaluation of speech-to-text software
2) Cost effectiveness (money, time, computational resources): looks at free vs. paid software and the cost of time

B. Requirement List
1) Major, Minor, and Proper Noun errors produced by various speech-to-text software
2) Time it takes to transcribe an audio file
3) Usability/accessibility of the API, web services, etc.: the need for human intervention, such as breaking the audio into chunks or converting the format

C. Evaluation List
1) Chart that displays the breakdown of errors as Major, Minor, and Proper Noun.
2) Time it takes to transcribe an audio file
3) Usability will be measured on a scale of 1-5, in which 5 means the system requires major outside help to pre-process the audio and 1 means no outside work is involved besides uploading the audio and pressing transcribe.

IV. IMPLEMENTATION

A. Technologies Explored
- Mavis
- AT&T
- Dragon Dictation
- Google Voice
- Voxforge/Julius

B. Mavis

1) Overview: Microsoft Audio Video Indexing Service (MAVIS) is a Windows Azure application that uses speech recognition technology developed at Microsoft Research to enable searching of digitized spoken content. MAVIS generates automatic closed captions and keywords, which can increase the accessibility of audio and video files with speech content. MAVIS uses a Deep Neural Net (DNN) based speech recognition technology, which reduces errors in speech recognition by automatically expanding its vocabulary and storing word alternatives using a technique referred to as Probabilistic Word-Lattice Indexing. More explanation is available in the technical background on the Microsoft website. MAVIS, the technology at the foundation of the Washington Post's Truth Teller Project, has been shown to transcribe sessions of Congress so that they can be fact-checked. It is worth taking a look into the technology.

Cost: $20 per hour
Major Errors: 28
Minor Errors: 59
Proper Noun Errors: 16
General Readability (1-5): 5
Ease of Setup (1-5): 4
Ease of Continued Use (1-5): 4

3) Advantages/Strengths:
- Hosted solution in the cloud
- Transcribes multiple speakers
- No initial voice training required
- Good customer support
- Better at recognizing names than other technologies
- Words that are confidently understood are in bold script
- Wide variety of input files allowed
- Captions synced to video

4) Disadvantages/Weaknesses:
- Punctuation and capitalization can appear arbitrary at times
- Transcription of a 20-minute video can take up to 2 hours
- Words tend to be left out altogether if not understood
- Strange characters can appear in the transcript

C. AT&T

1) Overview: AT&T's Speech API is a cloud-based service that transcribes audio to text using AT&T's Watson speech engine. To do this, AT&T requires that you specify a relevant context for it to draw on; all contexts are built into the service, with no ability to specify your own. In total, AT&T provides and maintains 7 contexts:
- Web Search
- Business Search
- Voicemail To Text
- SMS
- Question and Answer
- TV
- Generic

Because it is a cloud-based service, most of the hard work is done on AT&T's platform. As such, the API can be called from many different environments and languages to achieve the same results. Requests are sent over HTTP to AT&T servers, which perform speech-to-text analysis on the input files using the Watson speech engine. Input files can be of two types:
- WAV, 16-bit PCM, single channel, 8 kHz sampling
- AMR (narrowband), 12.2 kbit/s, 8 kHz sampling (recommended)

As an additional constraint, audio can only be sent 4 minutes at a time. AT&T provides a number of APIs for its service, supporting the following environments:
- HTML5
- MS
- RESTful

As a result, most languages can send a speech-to-text request to AT&T, including Java, Ruby, and C#.

Language: RESTful Java
Cost: $99/yr + $0.01/API call past 1 million/mth
Proper Noun Errors: 16
Major Errors: 27
Minor Errors: 65
General Readability (1-5): 2
Ease of Setup (1-5): 3
Ease of Continued Use (1-5): 4

3) Advantages:
- Cheap: one yearly fee of $99 + $0.01 per API call past 1 million/month
- Easy to use and versatile: any language with HTTP support should be able to use it
- Works on multiple speakers
- Fast: around 1 minute of processing per 1 minute of audio

4) Disadvantages:
- 4 minutes at a time; long recordings must be broken up
- Transcription is not very strong; many errors
- AMR audio format (mostly) required: WAV format worked inconsistently
- Proper noun recognition is bad: it does not capitalize except at the start of a sentence, and names often contain errors
- Poor punctuation: seems arbitrary at times

D. Dragon Dictation

1) Overview: Dragon Dictation is speech recognition software that lets you use your voice to create and edit documents, manage e-mail, surf the Web, interact with applications on your machine, and more. It also provides digital voice software for mobile devices that lets you capture notes on the go and transcribe them with Dragon Dictate.
The software is not 100 percent accurate out of the box and depends on the user correcting its dictation as it is used. The more it is used and corrected, the more accurate its language model becomes. You can even use recordings made on your mobile device to build your personal language model. Although Dragon appeared to be a solid transcription technology for a single user, it proved to be intended for exactly that: a single user. Output from Dragon also did not have any punctuation. For our purposes, it is not worth pursuing further evaluation of Dragon.

Cost: $200
Platform: Windows, Mac OS X
Proper Noun Errors: 16
Major Errors: 45
Minor Errors: 35
General Readability (1-5): 1
Ease of Setup (1-5): 4
Ease of Continued Use (1-5): 4

3) Advantages:
- Relatively malleable language model
- Transcribes audio relatively quickly
- Can easily load audio files in a range of different formats

4) Disadvantages:
- Requires voice training
- Intended to learn a single user's speech patterns
- No punctuation
- Proper nouns may get lost in the noise

E. Google Voice

1) Overview: The Google Voice API is a speech recognition API that supports audio-to-text automation. It allows you to use your voice to create and edit text or interact with applications on your machine. Google Voice has its own software and also provides the framework behind the Closed Captioning feature on YouTube. The software is often used to convert voice mail messages to text so that the user can read a message without having to listen to it. The Google Voice API can also be found in Android mobile phones, where it provides speech recognition and navigation through applications on the phone. This version of the Google Voice API is not actually public and can support videos of any length. A Speech2Text program was written using function calls from this version of the Google Voice API; it takes in a WAV file and outputs the text it transcribes from the audio.
The software still has a few rough edges, as well as a fatal flaw when processing audio files that contain sections of little or no sound (variability in frequency). The program does a decent job, and because its code is available and editable, we hope to be able to improve it.
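Because quiet sections are the reported failure case, it may help to locate them before a file is submitted for transcription. The following is a minimal sketch, not part of the Speech2Text program: it assumes 16-bit mono PCM WAV input, and the function name `find_quiet_spans`, the 0.5-second window, and the RMS threshold of 500 are all illustrative choices rather than values from the evaluation.

```python
import array
import math
import sys
import wave


def rms(samples):
    """Root-mean-square amplitude of a sequence of integer samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_quiet_spans(path, window_s=0.5, threshold=500.0):
    """Return (start_s, end_s) spans whose windowed RMS stays below threshold.

    Assumes a 16-bit mono PCM WAV file. window_s and threshold are
    illustrative defaults and would need tuning on real recordings.
    """
    spans = []
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        frames_per_window = max(1, int(rate * window_s))
        position = 0        # number of frames consumed so far
        quiet_start = None  # start time of the quiet run in progress, if any
        while True:
            raw = wav.readframes(frames_per_window)
            if not raw:
                break
            samples = array.array("h", raw)
            if sys.byteorder == "big":
                samples.byteswap()  # WAV sample data is little-endian
            window_start = position / rate
            position += len(samples)
            if rms(samples) < threshold:
                if quiet_start is None:
                    quiet_start = window_start
            elif quiet_start is not None:
                spans.append((quiet_start, window_start))
                quiet_start = None
        if quiet_start is not None:
            spans.append((quiet_start, position / rate))
    return spans
```

The returned spans could be used to trim or split the audio before transcription; whether that avoids the flaw described above would need to be checked against the actual program.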

Cost: $0
Major Errors: 34
Minor Errors: 43
Proper Noun Errors: 25
General Readability (1-5): 1
Ease of Setup (1-5): 3
Ease of Continued Use (1-5): 2

3) Advantages/Strengths:
- Transcribes audio relatively quickly
- Free
- Can transcribe video of any length

4) Disadvantages/Weaknesses:
- Only supports WAV files
- Has trouble with audio files that include sections of little or no sound
- No punctuation
- Proper nouns may get lost in the noise

F. VoxForge/Julius

Voxforge is the most complete open-source English speech corpus; it compiles speech into acoustic models for other software systems, such as Julius, Sphinx, and HTK, to work with. Using this data, these systems can match certain sets of the resulting acoustic model to words, or perform other operations on them. Julius is an open-source speech recognition system; its development began in 1997 in Japan, and it has since been adapted to work for many different languages. Julius requires two things to interpret speech: an acoustic model, which Voxforge provides, and a grammar of words to match the audio against. The grammar, however, must be tailored to the acoustic model, and few generic grammars seem to exist; as such, the Julius/Voxforge combination seems like a difficult option, or one that might require more time to set up and evaluate.

V. VALIDATION

To evaluate the various software, a 6-minute sample of a legislative meeting was extracted and manually transcribed. The sample was passed through each transcription system, which produced an output transcript. The location and number of errors in each output were determined by comparison with the manual transcription.

A. Error Definition

An error spans from the point where the first mistake occurs to the point where that type of mistake ends. Errors are defined this way because the beginning of an error is usually the root cause of the rest of the phrase being invalid.

B. Error Types

1) Major (Red marks):
- Continuous stream of incorrect words
- Continuous stream of missing words

2) Minor (Yellow marks):
- One-word error
- Spelling error
- Grammar error (two/too)
- Capitalization error
- Period or thought break error
- Commas are not counted as minor errors

3) Proper Nouns (Green marks): Inability to identify proper nouns correctly (USCB, California, Names, Senator). Proper noun errors are counted as part of either a minor or a major error. They are major if their context includes a major error, and minor otherwise. We consider uncapitalized nouns an error because Natural Language Processing software relies on correct use of nouns to identify key people and places. Therefore, we would like to minimize the number of errors passed on to the Natural Language Processing software by picking robust transcription software.

C. Usability Criteria

Transcription software is evaluated on several qualitative measures as well:
- How readable the transcript is overall (1-5, 5 = most readable). If the reader can understand the content in spite of the errors, readability is high.
- How easy the software is to set up initially (1-5, 5 = easy)
- How easy the software is to continue using after initial setup (1-5, 5 = easy)
- General advantages / strengths
- General disadvantages / weaknesses

VI. CONCLUSION

According to our results, Mavis is the best choice for this use case. Even though transcription of a single file may take hours, more than one file can be processed in parallel on Microsoft's cloud. The AT&T API, though comparable to Mavis in terms of number of errors, often produces transcripts with low readability and requires more effort to correct those errors manually. Google Voice, though free, produces highly unreadable transcripts with a large number of errors. Dragon Dictation produces highly unreadable transcripts as well, mainly because it is not tailored for such a use case and instead trains on a single speaker.
The main concern with using Mavis is the price, as the software is not open source and using it requires a paid subscription. However, even with Mavis, the resulting transcripts are still unreadable in places, with some major errors as well as many proper noun errors. Though the reader would be able to follow the logic of the transcript, the document would still require manual correction to achieve a correct transcription.
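The error definition used in the validation (an error runs from the first wrong word to the end of that run) can be approximated automatically. The sketch below is an assumption-laden illustration, not the procedure used in this evaluation, where errors were marked by hand: it aligns the words of the manual transcript and a system transcript with Python's `difflib.SequenceMatcher`, treats each mismatched run as one error, calls a run of more than one word major and a single-word run minor, and flags a run as a proper noun error when it touches a word from a caller-supplied list. The function name `count_error_spans` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def count_error_spans(reference, hypothesis, proper_nouns=frozenset()):
    """Count major, minor, and proper noun error spans.

    Simplifying assumptions (not from the paper): a mismatched run longer
    than one word is major, a one-word run is minor, and commas are removed
    first because they are not counted as errors. Comparison is
    case-sensitive, so capitalization mistakes count as errors.
    """
    ref = reference.replace(",", "").split()
    hyp = hypothesis.replace(",", "").split()
    counts = {"major": 0, "minor": 0, "proper_noun": 0}
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # One mismatched run = one error, however many words it covers.
        span_len = max(i2 - i1, j2 - j1)
        counts["major" if span_len > 1 else "minor"] += 1
        # A proper noun error is counted in addition to its major/minor span.
        if any(word.strip(".") in proper_nouns for word in ref[i1:i2]):
            counts["proper_noun"] += 1
    return counts


# Hypothetical sentences, made up for illustration only.
manual = "Senator Smith opened the hearing on water policy in California."
system = "senator Smith opened the hearing on what are paul see in California."
result = count_error_spans(manual, system, {"Senator", "Smith", "California"})
print(result)  # {'major': 1, 'minor': 1, 'proper_noun': 1}
```

A word-level alignment like this cannot judge grammar errors or thought breaks the way a human marker can, so its counts would only be a rough stand-in for the manual marks.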

TABLE I
OVERALL SYSTEM COMPARISON

System                  Cost                                 Platform                 Major Errors  Minor Errors  Proper Noun Errors  General Readability  Ease of Setup  Ease of Continued Use
MAVIS                   $20/h                                Microsoft Azure          28            59            16                  5                    4              4
AT&T                    $99 + $0.01/API call past 1 million  RESTful Java             27            65            16                  2                    3              4
Dragon Dictation V. 11  $200                                 Windows/Mac Application  45            35            16                  1                    4              4
Google Voice            $0                                   Windows/Mac Application  34            43            25                  1                    3              2

General Readability is rated 1-5 (5 = readable); Ease of Setup and Ease of Continued Use are rated 1-5 (5 = easy).

VII. FUTURE WORK

The final intent of the legislature project is to allow ordinary citizens and the media to search through California State Legislature hearings. This white paper focuses mainly on transcription technology and reaches the conclusion that there is no ideal transcription software. Therefore, the legislature team plans to take the transcription one step further by processing the transcripts from each of the technologies through OpenCalais. OpenCalais is a web service that analyzes textual documents to find named entities, facts, and events, collectively known as metadata. With the help of OpenCalais, the union of the documents' metadata can be used to reduce the noise from transcription errors. OpenCalais also provides a relevance weight for each piece of metadata, which indicates how relevant and important that metadata is. As a result, metadata with a relevance score of 0.4 or above can be used as keywords or tags for searching through the document.

Though many of the APIs evaluated did not output human-readable documents, we are curious whether analyzing their output through a tagging system results in accurate tags. As such, we plan to use OpenCalais to analyze the output files from each of the evaluated APIs and retrieve the tags associated with the output. We then plan to compare the resulting tags against the actual content of the analyzed audio file to determine whether the tags are valid and represent the major themes of the file.
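The tag selection step described in the future work (take the union of metadata across transcripts and keep items with relevance 0.4 or above) might look like the sketch below. It does not call OpenCalais: the input is assumed to be already-extracted (name, relevance) pairs per transcription system, and the function name `select_tags`, the data layout, and the sample values are hypothetical.

```python
def select_tags(entities_by_system, min_relevance=0.4):
    """Union entities across systems and keep those at or above the threshold.

    entities_by_system maps a system name to a list of (entity, relevance)
    pairs. This layout is an assumption for illustration and is not the
    OpenCalais response format.
    """
    best = {}
    for entities in entities_by_system.values():
        for name, relevance in entities:
            if relevance >= min_relevance:
                # Keep the highest relevance seen for each entity.
                best[name] = max(relevance, best.get(name, 0.0))
    # Most relevant first; ties broken alphabetically for stable output.
    return sorted(best, key=lambda name: (-best[name], name))


# Made-up values for illustration only; not results from this evaluation.
sample = {
    "mavis": [("California", 0.7), ("budget", 0.3)],
    "att": [("California", 0.5), ("water policy", 0.45)],
}
tags = select_tags(sample)
print(tags)  # ['California', 'water policy']
```

Taking the maximum relevance per entity is one possible way to merge duplicates; averaging, or requiring an entity to appear in more than one transcript, would be equally plausible ways to reduce noise.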