
Foreign Language Audio Information Management System

Marc Shichman, Mike Gaffney, Elizabeth Cornell Fake, Lisa Sokol
Veridian, Arlington, VA 22202
Marc.Shichman@veridian.com

Abstract - Veridian created a prototype of a foreign language audio information management system that integrates speech recognition technology, machine translation, and advanced information retrieval and extraction for Mandarin Chinese. The system automatically processes audio recordings to create a data warehouse of derived information using speech recognition and machine translation technology components. The data warehouse can then be further exploited using information retrieval technology components. The prototype system provides the following capabilities:

- Automatically transforming foreign audio files into electronic text
- Transforming foreign text into English text
- Matching transcribed and translated text against the topics of interest to the analyst
- Displaying transcribed text by speaker

The conclusions imply that while automatic speech processing technology is not yet mature enough for mass market distribution, it is sufficiently advanced to help with the overload of audio and video data.

Key words: Automatic Speech Recognition, Machine Translation, Audio Mining

1 Introduction

Amid all of the media hype of the 1990s about advances in leading edge information technologies, Veridian began to explore alternatives for transcribing and processing the large volumes of foreign language audio and video data collected by the community. The volume of data makes it impossible to manually transcribe all of the recorded material, yet community analysts must be able to find and extract the relevant information embedded in the recordings. Veridian was specifically interested in technologies that would assist in retrieving information from radio and television broadcasts, telephone conversations, call center dialogs, and help desk recordings.

Veridian wanted to be able to assess and develop best-of-breed systems for our clients. Because Veridian professionals are on the periphery of the research community in automatic speech processing, we wanted to create an independent assessment to test the viability of the emerging technologies for operational use. Our investigation involved several tasks designed to produce a prototype foreign language audio information management system that utilized automatic speech processing (ASP) technology. Our first task was the completion of a literature search on the state of the art in automatic speech processing technologies. Following the literature search, we began a technology assessment of available components to meet the clients' needs. The final task involved integration of a proof-of-concept prototype from available (COTS, research or government) products for automatic speech recognition (ASR), machine translation (MT) and advanced information retrieval (IR) and extraction for Mandarin Chinese. As we have not been part of the long-established research community involved with ASP, we had to assess and evaluate all possibilities. The following discussion details the findings of all three tasks: literature review, technology assessment and prototype integration.

2 Literature Review

Three technology areas are exploited for automatic speech processing (ASP): automatic speech recognition (ASR), machine translation (MT) and information retrieval (IR). ASR technology creates native language text from spoken language data.
MT technology translates foreign language text into English text. IR technology involves the storage and retrieval of large volumes of information. Important findings from the literature search that proved useful in the prototype development are given in the following sections.

2.1 Automatic Speech Recognition (ASR)

Extensive effort has been invested in the development of automatic speech recognition products and services through research and development programs in large companies, at major universities and through government funding vehicles. Cooperative research initiatives among companies, university programs and government agencies have produced important technological advances. Combined efforts have also contributed to the development of a worldwide community of scholars who are investigating speech recognition on a continuing basis. Researchers have approached automatic speech recognition from a number of different theoretical directions. These approaches have included:

- A focus on singular acoustic units of speech utterances, called phonemes, with the use of probabilistic models to determine the sounds that precede and follow the phonemes
- The use of lexicons, or dictionaries, that are groups of words listed together with pronunciation information
- Language and linguistic models that predict how words will be grouped together as a result of semantics or grammar

The major advances in automatic speech recognition technology have come as a result of combining different research approaches, such as combining the acoustic model with the linguistic model (e.g., the acoustics of speech are transcribed and interpreted in terms of the linguistic framework of the speaker). No single approach exceeds the others. Highly sophisticated statistical models have proven to be the backbone of much of the research progress. Hidden Markov Models and speech recognition algorithms test and correlate acoustic and linguistic data intensively to produce correct transcriptions. In addition to the statistical analysis, thousands of hours of human-edited dictionary data on words and pronunciations have supplemented speech recognition databases.
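The combination of acoustic and language models described above is usually summarized by the standard noisy-channel decoding rule (a textbook formulation, not spelled out in this paper): given acoustic evidence A, the recognizer searches for the word sequence W maximizing

\[
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W)
\]

where the Hidden Markov Models supply the acoustic likelihood P(A | W) and the lexicon and language model supply the prior P(W).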
Over the past few years, several small companies have taken a leadership role in developing products for automatic speech recognition. However, given the dynamics of the small business marketplace, the small companies have planned exit strategies and have been acquired by larger businesses. Products and services originally available from Dragon Systems and BBN have been incorporated into the business offerings of ScanSoft, Virage, Inc., and Dragon Catalyst, LLC. Business strategies that govern leading edge technology products often make it difficult for integrators to find and select best-of-breed solutions.

Speech-to-text dictation products have been the most visible commercial automatic speech recognition application. Speech recognition dictation systems have a strong market niche and are clearly a growth area for the future. Several companies have delivered shrink-wrapped dictation products for commercial sale. These products feature large vocabulary systems that permit spoken input of text into the word processor of a personal computer and spoken control of Windows and other operating system functions. Published evaluations cite some systems as having up to 98% accuracy. Users report considerable frustration during the startup phase, as success can be limited by hardware, software, microphone quality and the need for intensive system training. The most successful results have been in vertical markets such as medicine and law, where the vocabularies are very specific and practitioners have traditionally used dictation. Dictation systems also include foreign language products, including non-Roman alphabet languages such as Japanese, Mandarin Chinese and Arabic. ScanSoft's Dragon NaturallySpeaking and IBM's ViaVoice are available in foreign language versions.

Dragon's broadcast news recognizer appeared to be a leading edge choice because of its capability to deploy a wide range of recognizers for a variety of bandwidth samples, but it was unavailable to us because of business issues related to the sale of Dragon to Lernout & Hauspie (L&H). Virage's VideoLogger incorporated a research version of the IBM ViaVoice automatic speech recognizer for broadcast news, but Virage had integrated modules for both audio and video transcription, not just audio.

A BBN product, BYBLOS, was identified as research software that provided a highly reliable automatic speech recognizer for both broadcast news and telephone samples involving conversational speech. The use of automatic speech recognition for broadcast news has matured considerably over the past two years. As a result of the DARPA/NIST-sponsored programs, several research software initiatives have been developed into commercial products for audio/video content management and analysis. As recently as January 2002, BBN and Virage, Inc. announced a marketing and distribution agreement for Virage to commercialize and distribute BBN's research broadcast news speech recognizer, BYBLOS, and its retrieval product, Rough'n'Ready, as a new product, Audio Indexer. The research version of Audio Indexer proved to be the tool of choice for the Veridian team.

2.2 Machine Translation

Machine translation (MT) has been an ongoing area of research for the past fifty years. Warren Weaver led the first wave of enthusiastic researchers beginning in 1952. Shortly thereafter, numerous important research efforts proliferated worldwide, with much of the work funded by government sources. Research objectives focused on the development of systems to provide FAHQT, or fully automated high-quality translation. During the late 1950s and early 1960s, machine translation research entered a period of high energy, with competing factions engaged in computational, crypto-analytic or linguistic approaches. In spite of the exceptional resources available to researchers, progress was slow and results negligible. Many of the research programs were abandoned in the mid-1960s.

A resurgent interest in machine translation surfaced in the late 1980s at the Advanced Telecommunications Research Institute in Japan with researcher Alex Waibel as the chief architect. Waibel organized C-Star in 1991, an international consortium of businesses and research organizations, to investigate speech-to-speech machine translation. Building on the experience of the fifties and sixties, this time the research effort became inclusive of all approaches, including both computational and linguistic theories, and focused on different and less ambitious goals than FAHQT. Advances in computing technology such as processor speed and storage capacity, coupled with thirty years' worth of progress in artificial intelligence, made it possible to look again at machine translation as an achievable goal. [1]

2.2.1 Computer-Assisted Translation

The 1950s goal of 100% machine translation has given way to several different approaches. Most machine translation today is computer-assisted translation, which provides tools to a human translator who manages the process of making the translation. The tools include gisting, software dictionaries and web devices.

One of the most popular machine translation techniques is called gisting. Gisting involves using a software program to provide the gist of the message content rather than a perfect word-for-word translation. An internationally recognized translation services company specifies that its goal is to achieve a 60% gist document, with the rest of the translation to be supplied by experienced translators/editors. To serve this end, machine translation software has become middleware that wraps around translation engines developed for specific vocabularies such as law, medicine, business, technology and military science. MITRE Corporation's CyberTrans system for government clients has demonstrated the feasibility of this approach. IBM's Lotus translation software also uses the middleware concept of machine translation. Both companies rely on translation engines developed by SYSTRAN Corporation, a company that has held a premier position as a translation engine developer since the 1950s.

Gisting is only one of the new trends in machine translation. Numerous computer software dictionary tools have been developed to assist the human translator. The products have limited vocabularies and often produce humorous renditions of the original text, but they are fast, convenient, easy to use and clearly viable as commercial products. Other tools include currency calculators, time change converters and travel aids.

Another recent and important development in computer-assisted translation is the emergence of web-based translators for web page and website translation. Among these, AltaVista's Babelfish (www.babelfish.com) enables users to translate short text passages and websites from nine languages (including French, Spanish, Italian, Portuguese, German, Chinese, Japanese and Russian) into English. AltaVista relies on the SYSTRAN family of translation engines and has integrated SYSTRAN engines into the AltaVista home page to support Internet searching in 26 different foreign languages.

2.2.2 Research in Machine Translation

At present, research in machine translation has dovetailed with continuous speech recognition research. As in speech recognition, the most progress has come from combining statistical and linguistic theories of language processing and from a decided shift away from large toward small, controlled vocabulary projects. Carnegie Mellon University (CMU) and the German Research Center for Artificial Intelligence (DFKI) have developed internationally recognized research programs in the combined disciplines. CMU's Diplomat, Janus, Pangloss and KANT projects and DFKI's Verbmobil have served as demonstration projects for proof of principle. The C-Star project has experimented with videoconferencing and computer-assisted speech translation. C-Star projects have been demonstrated under lab conditions in six languages: English, Japanese, Korean, French, Italian and German. The emphasis here has been on spontaneous spoken language translation, which is clearly an ambitious goal. [2] Progress has been made, however, through the use of interlingua and example-based machine translation. An interlingua approach uses a single, language-neutral representation for all language pairs. Example-based translation is a form of analogy-based translation that uses pairs of bilingual expressions stored in a database.
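As a toy illustration of the example-based idea (our own sketch; the phrase table and the greedy matching strategy are invented for the example, not drawn from any system named above), translation proceeds by covering the input with the longest bilingual expressions found in a database of stored pairs:

```python
# Toy example-based translation: greedy longest-match over a database
# of bilingual expression pairs (all entries hypothetical).
PHRASE_PAIRS = {
    ("ni", "hao"): "hello",
    ("xie", "xie"): "thank you",
    ("zai", "jian"): "goodbye",
    ("wo",): "I",
    ("hen", "gao", "xing"): "am very happy",
}

def translate(tokens):
    """Cover the input with the longest stored expressions, left to right."""
    output, i = [], 0
    while i < len(tokens):
        for span in range(len(tokens) - i, 0, -1):   # longest match first
            chunk = tuple(tokens[i:i + span])
            if chunk in PHRASE_PAIRS:
                output.append(PHRASE_PAIRS[chunk])
                i += span
                break
        else:
            output.append(f"<{tokens[i]}>")          # no stored example found
            i += 1
    return " ".join(output)

print(translate(["ni", "hao", "wo", "hen", "gao", "xing"]))
# -> hello I am very happy
```

Real example-based systems add fuzzy matching and recombination of partial examples, but the database-of-expressions core is the same.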
Recent research in multilingual information retrieval (MLIR) has extended the research agenda for machine translation to foreign language document retrieval as well. The scope of the research has broadened correspondingly to include foreign language multimedia and spoken language documents. Investigation in MLIR emphasizes the importance of document translation and processing as an integral part of document query and retrieval. Although statistical approaches have been investigated, it is apparent that success in MLIR is highly dependent on having semantic context to match with the words, phrases and fragments that make up queries. [3] In the past few years, scholars have investigated cross-language information retrieval (e.g., English-French, French-English). Research findings indicate improved retrieval results when queries are translated into the same language as the document collection to be searched.

2.3 Information Retrieval (IR)

Information retrieval (IR) for audio data poses several unique problems. The large volume of data makes it impossible to manually transcribe all of the recorded data; for this reason it is neither feasible nor desirable to produce a word-for-word, edited transcription. Another problem is the need for both push and pull retrieval. Analysts want to know when certain names, places or things appear in audio sources, as well as the exact place where the entity occurs. They have a continuing interest in the same entities over time, and will want to know where and when the reference appears. Should new topics of interest arise, they want tools to search retrospectively and then tools to alert them to new occurrences of the same entities. It is a complex research requirement, and one that has been addressed through generous government support. Developed technologies that address this requirement are called audio mining. Ongoing research initiatives are described under Audio Text Retrieval Research below.
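A minimal sketch of the push/pull distinction as applied to timestamped transcripts (the data layout and function names are ours, invented for illustration; they do not describe any product named in this paper):

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    recording: str   # source audio file
    channel: int     # speaker channel
    start: float     # seconds into the recording
    text: str        # translated transcript text

ARCHIVE: list[Utterance] = []          # previously processed recordings
STANDING_QUERIES: set[str] = set()     # entities the analyst tracks over time

def pull(keyword):
    """Retrospective search: find every place an entity already occurs."""
    return [(u.recording, u.channel, u.start)
            for u in ARCHIVE if keyword.lower() in u.text.lower()]

def push(utterance):
    """Alerting: as new audio is processed, flag standing topics of interest."""
    ARCHIVE.append(utterance)
    return [q for q in STANDING_QUERIES if q.lower() in utterance.text.lower()]

STANDING_QUERIES.add("Shanghai")
hits = push(Utterance("call_017.wav", 1, 42.5, "We arrive in Shanghai on Monday"))
print(hits, pull("Shanghai"))   # the alert fires, and the archive is searchable
```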

2.3.1 Audio Mining

Audio mining is a technology that has been developed to retrieve information contained in recorded video footage, radio and television broadcasts, telephone conversations, call center dialogues, and help desk recordings. The technology converts audio data into text that can be searched through keyword queries. BBN completed most of the pioneering work in the technology through the development of a government-sponsored product, Rough'n'Ready, which then evolved into Audio Indexer. It is a spoken document retrieval system that employs a combination of acoustic and language models to retrieve information correctly; it is not intended to be a written text transcription system.

Dragon Systems developed a product similar to BBN's Rough'n'Ready, but differentiated it enough to secure their own patent and made it commercially available to customers as a service known as Audio Mining. The product was made available commercially as a service of Lernout & Hauspie, but was sold to Dragon Catalyst, LLC at the time of the Lernout & Hauspie bankruptcy reorganization. Virage, Inc. also has an audio/video mining product that is based on IBM's ViaVoice technology for broadcast news. Virage's capabilities have been extended through integration of BBN's Audio Indexer into the Virage SmartEncode suite of products.

2.4 Audio Text Retrieval Research

Over the past ten years, NIST and DARPA have sponsored competitions that have encouraged the research community to explore and benchmark the state of the art in advanced information and text retrieval. Events have included the Message Understanding Conferences (MUC) 1-7, Hub-4 and Hub-5, Topic Detection and Tracking (TDT) 1-3 and the Text Retrieval Conferences (TREC) 1-9, including Spoken Document Retrieval (SDR) in TREC 7-9. Through the SDR competitions, audio text retrieval research has converged with research efforts in continuous speech recognition, machine translation, information extraction and text retrieval. As a result, research efforts in the area have gained considerable momentum.

Objectives for topic detection and audio tracking of spoken documents include the means to respond to a continually changing focus of user inquiry and to provide exact positioning in spoken materials that have already been processed. Users would also like a tracking service that automatically alerts them to the addition of new audio items of interest. Because the documents are spoken, not text, the retrieval methods require sophisticated recognition algorithms and continual training of the speech recognizer as new topics are identified. Research indicates that topic detection and audio tracking can be successful with a system of named entity tagging. Other research hypotheses propose using grammar models to detect people, places and things. Statistical research methods have looked at the significance of sub-words, syllabic units, phonetic units and word spotting for audio tracking. [4] Recent research has shown promising results using language chunks, or 6-9 word strings, as a search statement for retrieval. [5] Other retrieval theories propose using singular phonemes as queries for retrospective searches. It is apparent that SDR and audio tracking will provide ample opportunities for doctoral dissertation research.
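A sketch of the subword idea mentioned above (the phoneme inventory and lexicon entries are toy values of our own): a query word is expanded into a phoneme string and matched against the phoneme stream of each utterance, so that terms outside the recognizer's word vocabulary can still be spotted:

```python
# Toy word spotting at the phoneme level (lexicon entries are invented).
LEXICON = {
    "beijing":  ["b", "ey", "jh", "ih", "ng"],
    "shanghai": ["sh", "ae", "ng", "hh", "ay"],
}

def spot(query, phoneme_stream):
    """Return start offsets where the query's phoneme string occurs."""
    pattern = LEXICON[query]
    n = len(pattern)
    return [i for i in range(len(phoneme_stream) - n + 1)
            if phoneme_stream[i:i + n] == pattern]

stream = ["w", "iy", "b", "ey", "jh", "ih", "ng", "t", "ah"]
print(spot("beijing", stream))   # -> [2]
```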
3 Technology Assessment

Because of the visibility and media attention focused on automatic speech processing technologies, the team expected to integrate best-of-breed COTS technologies to build the prototype. They learned, however, that many of the component technologies that would be required were not available as commercial products. In spite of the impressive gains made through a significant commitment to research, automatic speech processing continues to be a developmental effort. Although dictation systems have been perfected for the commercial mass market, they did not meet the needs of the government for automatic speech recognition of broadcast news. Veridian was looking for a speech recognizer that could handle large vocabulary continuous speech recognition (LVCSR), which would require a much larger, already trained vocabulary (60,000-250,000 words), methods for speaker identification, recognition and segmentation of casual dialog, as well as background noise suppression. Requirements also stipulated the need for an open environment with varied sampling rates, without requirements for a broadcast quality microphone and recording studio. When faced with these requirements and the lack of commercialized products, Veridian began to search the literature for sources of research software.

Dragon Systems, Inc. (now ScanSoft and Dragon Catalyst, LLC) and BBN developed deployable systems that integrated all of the technologies required for the prototype. However, in the case of both companies, the components were available as a solution, or service, not as individual parts that could be purchased as shrink-wrapped products and integrated independently. The component parts lacked robustness and required too much technical support to be sold without a continuing service contract. Because of the classified nature of the client's work, it would have been very complicated to enter into a continuing service contract with a vendor.

Another finding of the state-of-the-art review is that the current trend is toward more and larger research investment in command and control and interactive voice applications of the technology. Large vocabulary continuous speech recognition (LVCSR) projects will continue to be an important focus of research. However, there is a decided shift away from LVCSR toward narrowly defined text-to-speech (TTS) projects. There will be more short-term progress toward the development of commercial products in TTS than in LVCSR.

For this reason it has been imperative to consider available research systems as the basis for the prototype. Although commercial speech recognizers have been developed in a number of different foreign languages, target language development has been limited to Mandarin Chinese, with the recent addition of Arabic. The other target languages, Russian and Farsi, as well as other Chinese dialects, require more research and development. Even though both Dragon Systems and BBN claimed to have a language-independent recognizer, the research community responded to the claim with skepticism. True language independence requires more developmental effort and demonstration.

Veridian tested several products for the integration. The products included:

- Lernout & Hauspie Voice Express, an English dictation system
- Lernout & Hauspie SDK 3, a dictation system for Mandarin Chinese, Simplified Chinese and Cantonese
- Dragon NaturallySpeaking Dictation System for Mandarin Chinese
- C-Star, a Windows interface for Chinese
- IBM ViaVoice for Broadcast News (beta version)
- Chant Speech Kit, a voice interface for all applications
- SYSTRAN Enterprise Translation Software, English/Chinese

Although the team did not find suitable commercial products, they located components for the prototype system from government-sponsored research projects. BBN agreed to provide access to their BYBLOS recognizer. The Linguistic Data Consortium provided 125 files of Mandarin Chinese Call Home/Call Friend data to train the recognizer for the system.

4 Prototype System

The prototype system provides an information infrastructure that can support the exploitation of foreign language audio files. The prototype has the following operational capabilities:

- Automatic transcription of foreign audio files into electronic native text
- Translation of foreign text into English text
- Matching transcribed and translated text against topics of interest to the user

4.1 System Components

The system consists of a speech processing component and a search and retrieval user interface component. The speech processing component extracts Chinese text from the audio recording, translates that text and stores the resulting information in a data warehouse. The prototype uses the BBN BYBLOS Mandarin Chinese automatic speech transcription component to create a text file of native Mandarin transcribed speech. A SYSTRAN Mandarin Chinese translation engine processes the transcribed text into English text that is then stored in an Oracle database. The Oracle database is queried through the system user interface. The search and retrieval user interface component provides the analyst with a tool to find Chinese audio recordings of interest by using English keyword searches.
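To make the warehouse contents concrete, the sketch below models one stored utterance record (the field names and types are our own guess at a reasonable layout; the paper does not publish the actual Oracle schema):

```python
from dataclasses import dataclass

@dataclass
class UtteranceRow:
    """One row of the derived-information warehouse (hypothetical layout)."""
    recording_id: str    # source audio file
    channel: int         # 1 or 2; the channel doubles as speaker identification
    start_secs: float    # utterance start time within the recording
    end_secs: float      # utterance end time
    mandarin_text: str   # BYBLOS transcription (native text)
    english_text: str    # SYSTRAN translation (the text the analyst's queries search)

row = UtteranceRow("call_017.wav", 2, 101.3, 104.8,
                   "<native Mandarin transcription>",
                   "We arrive in Shanghai on Monday")
```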
4.2 Speech Processing

Speech processing is a set of data extraction applications that perform transcription, translation and information storage functions. Figure 1 (Automatic Audio Processing Information Flow) shows the flow of recorded audio through the speech processing system.

The prototype has several audio-related assumptions buried in its model. The first is that the foreign language audio must be a recording of Mandarin Chinese. The recording must be toll quality, stored in two-channel mu-law format. It is expected that there is primarily one speaker on each channel of audio. The input data must be similar to the LDC Call Home Chinese recordings used for demonstration. All of these assumptions are potentially modifiable; e.g., the prototype could be easily adapted to use different foreign language audio recordings. The basic infrastructure of the prototype will stay constant even when the assumptions vary.

BBN's BYBLOS is the automatic speech recognition system. It was developed under the auspices of DARPA and other government agencies. BYBLOS takes the audio recording and a specification of speaker turns within the recorded data as input, and writes out a transcription of the Chinese recording in GB2312 encoding. The Chinese text is then translated using SYSTRAN. The SYSTRAN and BYBLOS output are then formatted to denote the speaker identification (the channel of the utterance) and the start and end times of each utterance. The formatted data is then loaded into an Oracle 8i data warehouse.
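Put together, the processing chain just described amounts to the following orchestration (a sketch under stated assumptions: `byblos_transcribe`, `systran_translate` and `load_rows` are hypothetical wrappers standing in for the real BYBLOS, SYSTRAN and Oracle interfaces, which the paper does not document):

```python
def process_recording(audio_path, speaker_turns,
                      byblos_transcribe, systran_translate, load_rows):
    """Audio -> Mandarin transcription -> English translation -> warehouse rows.

    speaker_turns: list of (channel, start_secs, end_secs) tuples marking
    who is speaking when, as required by the BYBLOS recognizer.
    """
    rows = []
    for channel, start, end in speaker_turns:
        mandarin = byblos_transcribe(audio_path, channel, start, end)
        english = systran_translate(mandarin)
        rows.append({
            "recording_id": audio_path,
            "channel": channel,          # channel serves as speaker identity
            "start_secs": start,
            "end_secs": end,
            "mandarin_text": mandarin,
            "english_text": english,
        })
    load_rows(rows)                      # bulk insert into the data warehouse
    return rows
```

4.3 Search and Retrieval User Interface (SRUI)

The SRUI lets the analyst use English language queries to find the audio documents of interest. The search runs against the English text derived from the foreign language audio recordings. A list of documents that contain the keywords of interest, along with pointers to both the translations and the native recordings, is returned to the user. The analyst can then select a matching document and jump automatically to all locations of the keyword within the text. The corresponding audio recording can also be played using a system multimedia tool such as RealPlayer. Either the native language text or the English translation can be viewed through the interface. Each utterance is displayed with a header detailing the audio channel where the data is found and the start and end time of the utterance. The header information is used to determine where to set the multimedia tool to start playing, so that the actual utterance can be heard while the translated or transcribed text is viewed. Figure 2 shows the web-based interface displaying the native Chinese text and the translated English text.

4.4 Performance Analysis

The intent of the system performance analysis was to identify possible metrics for the performance of the components of the audio mining system. These metrics could later be used for evaluation and comparison of competing technologies and products, and to track the improvement of current technology. The performance of the prototype system was evaluated to determine how well topics in the Chinese audio recordings transcribed and translated into English text.

Using data provided through the Linguistic Data Consortium (LDC), the system transcribed and translated 120 audio recordings of HUB-5 (Call Home/Call Friend) Mandarin Chinese data. The data set included 80 audio recordings that were used as training files for the system recognizer, 20 recordings that were development test (untrained) files and 20 audio recordings that were used as evaluation test (untrained) files. The evaluation was based on transcription and translation accuracy, to determine how well topics discussed in the Chinese conversation translated into English. A native Mandarin/English-speaking analyst evaluated both the transcription data and the translation data for 12 audio recordings. Comparisons were made between the system-transcribed Mandarin text and the official LDC transcriptions of the same Mandarin audio recordings. Comparisons between the official LDC Mandarin transcription and the English translation were also performed. The evaluation used the following criteria:

- Identification of key words: a word was identified and documented as a key word if it appeared at least three times in the translated version.
- Identification of topics: any proper name of a person, place or thing was identified and documented as a topic in the translated version.
- Transcription accuracy: evaluated how well the content of the conversation was preserved in the automatic transcription, on a subjective scale of 1-5, with 1 being the lowest score and 5 the highest.
- Translation accuracy: evaluated how well the content of the conversation was preserved in the translation of the transcribed text, on the same subjective 1-5 scale.

The first two criteria are mechanical enough to script; a rough sketch follows (the at-least-three-occurrences rule is from the paper, while the capitalization heuristic for proper names is our simplification of what the analyst did by hand):

```python
from collections import Counter

def key_words(translated_text, min_count=3):
    """Words appearing at least three times in the translated version."""
    counts = Counter(w.lower().strip(".,?!") for w in translated_text.split())
    return {w for w, n in counts.items() if n >= min_count}

def topics(translated_text):
    """Crude proper-name detection: capitalized words not starting a sentence."""
    words = translated_text.split()
    return {w.strip(".,?!") for i, w in enumerate(words)
            if w[:1].isupper() and i > 0
            and not words[i - 1].endswith((".", "?", "!"))}

text = "Li visits Shanghai. Li and Wang fly to Shanghai, then Li leaves Shanghai."
print(key_words(text))   # -> {'li', 'shanghai'} (set order may vary)
print(topics(text))      # -> {'Shanghai', 'Wang', 'Li'} (set order may vary)
```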
4.5 Results

Among the eight training audio files evaluated, the performance of the transcription system was significantly better than that of the translation engine. This suggests that the translation engine requires additional training to achieve optimal performance. This occurs because the current translation engine expects well-formed grammar and a limited vocabulary within the input text; additional training would provide the additional vocabulary and extend the grammar model.

The analyst's average rating of transcription accuracy over the 12 audio recordings was 3.03 under the evaluation criteria. This rating far surpassed the average translation accuracy of 1.56 for the translations of the Chinese transcriptions from the same 12 audio recordings, which suggests that the translation system is introducing new errors into the text. Examination of numerous unusual word groupings also indicated poor translation accuracy. The translation of the Chinese transcriptions compounds the errors from the transcription system. Further improvement in automatic speech processing technology will require improvement in the machine translation arena.

5 Conclusion

The system provides a unique prototype foreign language audio information management capability, giving an organization search and retrieval over the large volume of audio data it collects. The system automatically processes the audio recordings to create a data warehouse of information using automatic speech recognition and machine translation technology components. The data warehouse can then be further exploited using information retrieval technology components. The prototype system was able to provide the following capabilities:

- Automatically transforming foreign audio files into electronic text
- Transforming foreign text into English text
- Matching transcribed and translated text against the topics of interest to the analyst
- Displaying transcribed text by speaker

The conclusions imply that while automatic speech recognition technology is far from perfect for continuous speech recognition problems, it is sufficiently advanced to help with the overload of audio and video data.

6 References

[1] Steve Silberman, "Talking to Strangers," Wired Magazine, Vol. 8, No. 5, pp. 225-239, May 2000.
[2] Ibid., p. 233.
[3] Douglas W. Oard, "Cross-Language Information Retrieval," Annual Review of Information Science and Technology (ARIST), Vol. 33, pp. 223-256, 1998.
[4] Kenney Ng, "Subword-based Approaches for Spoken Document Retrieval," PhD dissertation, Massachusetts Institute of Technology, February 2000.
[5] Farzad Ehsani and Eva Knodt, "Speech Technology in Computer-Aided Language Learning," Language Learning and Technology, Vol. 2, No. 1, pp. 45-60, July 1998.