Phonetic-Based Dialogue Search: The Key to Unlocking an Archive's Potential


A whitepaper by Jacob Garland, Colin Blake, Mark Finlay, and Drew Lanham
Nexidia, Inc., Atlanta, GA

People who create, own, distribute, or consume media need simple, accurate, cost-effective tools for discovering it, especially if they must sift through years' worth of media files or need immediate access to their media for production. After all, if they can't find it, they can't manage or monetize it. Production and media asset management systems, generically referred to as MAMs in this paper, hold file-based metadata such as the date the footage was shot, but often there is not much descriptive metadata about the content, making it difficult to find an asset quickly. Manual logging and transcription are not only time-consuming and prohibitively expensive, but they also yield limited detail and/or introduce potentially costly delays during which the media is not available to be searched. Image recognition reveals who's pictured, but not what they're saying. Speech-to-text has had insufficient performance and accuracy to be useful, even on the clearest speech. Dialogue, on the other hand, is present in almost all program types and often provides more detailed, precise content description than any other metadata. The automated, phonetic-based search of dialogue is accurate, extremely fast, affordable, and can integrate with existing MAM systems and editing applications. It also applies broadly to any industry that creates, owns, or distributes content. This paper will discuss the technology behind the phonetic-based search of dialogue used in products such as the patented Nexidia Dialogue Search, Avid PhraseFind, and Boris Soundbite, and how it can change the way content owners, creators, and aggregators discover, use, and monetize their media assets.

Introduction

For media creators and owners, the amount of digital media in their libraries never stops growing as they continue to produce new digital content every day. Those that have also progressed to the point of digitizing their legacy audio and video assets, sometimes decades' worth, can easily be dealing with hundreds of thousands of hours of content. Content owners who don't organize and manage their data are not getting the maximum value out of their investment in their archived media, despite spending significant sums of money to manage and store it. Whether the aim is to reach new audiences, serve second (or third) screens, create completely new programming, prove compliance, license content, or some combination of those, the ultimate goal for many content owners is to monetize the assets that would otherwise languish in their media libraries. To do so, many content owners have chosen to use a media asset management (MAM) system. There are many MAM systems to choose from, at a range of price points and feature sets, and the process of choosing the right one can require a lot of research and consideration. Once the MAM application is in place, it's tempting to think that the process is over and that discovery problems are solved, but for most content owners the MAM is just the beginning. A good MAM system is a critical part of any file-based media operation, to be sure, but its search capability is only as good as the metadata that goes into it. Without rich, descriptive metadata, a MAM system may not meet the expectations of the content owner and/or justify the expense of the MAM, storage, and other systems required to utilize it.

Different Approaches to Search

Assets usually get ingested into a MAM system along with basic information such as filename, file type, date, timecode, and duration. Metadata might also include show/project name, summary, and/or relevant keywords. Unfortunately for many content owners, that's where the metadata stops and the search problems begin.

When it comes to metadata, the more you have, the better your chances usually are of finding exactly what you're looking for, but it's hard to find what you're looking for based on simple file attributes alone. It takes additional metadata that describes the content within a given media file, which usually must be entered manually using some type of logging application. That is a laborious process of watching the video and making notes about it in the logging application. The logging process is often two to four times slower than real time, so most media operations simply don't have the resources required to do it regularly and thoroughly. Simply logging the content (live or during ingest) might cost upwards of $80 per hour of video, while more involved transcription can cost as much as $160 per hour of material if timing information is included. Further, it can take days before that material is available for searching. Also, when no timing information is available, there is no synchronized link between the search results in the text document and the media. The result: the files in the MAM system often don't contain enough descriptive metadata for the MAM to be useful. And so valuable media assets sit unused and, potentially, forgotten.

Another search method is caption-based search, but this is challenging for a number of reasons. First, very few MAMs are able to use captions to inform search. Also, typically only content that has already aired actually contains captions and, for that content, the captions are rarely verbatim and frequently contain misspellings that limit effective search. In addition, many broadcast processes, such as encoding and decoding, can break the captions, so that the caption data is lost when new versions of the media are created. Finally, if the captions are embedded, the time required to extract them can be prohibitive. An additional potential category is speech-to-text based applications. To address these properly, it is necessary to look a little deeper into how they work.

Speech-to-Text

Retrieval of information from audio and speech has been a goal of many researchers over the past 20 years. The simplest approach is to apply Large Vocabulary Continuous Speech Recognition (LVCSR), perform time alignment, and produce an index of text content along with time stamps. Much of the improved performance demonstrated in current LVCSR systems comes from better linguistic modeling [1] to eliminate sequences of words that are not allowed within the language. In the LVCSR approach, the recognizer tries to transcribe all input speech as a chain of words in its vocabulary. Keyword spotting is another technique for searching audio for specific words and phrases, in which the recognizer is concerned only with occurrences of one keyword or phrase. Since the score of the single word must be computed (instead of the entire vocabulary), much less computation is required [2, 3]. Another advantage of keyword spotting is the potential for an open, constantly changing vocabulary at search time, making this technique useful in archive retrieval but not so ideal for real-time execution. When searching through tens or hundreds of thousands of hours of archived audio data, scanning must be executed many thousands of times faster than real time in order to be practical. A new class of keyword spotters has been developed that performs separate indexing and searching stages. In doing so, search speeds several thousand times faster than real time have been achieved. However, many of the same limitations regarding vocabulary still apply to this approach.

Introducing Phonetic Search

Another approach is phonetic searching, illustrated in Figure 1. This high-speed algorithm comprises two phases: indexing and searching. The first phase indexes the input speech to produce a phonetic search track and is performed only once. The second phase, performed whenever a search for a word or phrase is initiated, consists of searching the phonetic search track.

Once the indexing is completed, this search stage can be repeated for any number of queries. Since the search is phonetic, search queries do not need to be in any pre-defined dictionary, thus allowing searches for proper names, new words, misspelled words, and jargon. Note that once indexing has been completed, the original media files are not involved at all during searching. This means the search track can be generated from the highest-quality media available for the highest accuracy, but the audio can then be replaced by a compressed representation for storage and subsequent playback.

Figure 1: Index and Search Architecture

Indexing

The indexing phase begins with decoding the input media into a standard audio representation (PCM) for subsequent handling. Then, using an acoustic model, the indexing engine scans the input speech and produces the corresponding phonetic search track. An acoustic model jointly represents characteristics of both an acoustic channel (the environment in which the speech was uttered and the transducer through which it was recorded) and a natural language (in which human beings express the input speech). Audio channel characteristics include frequency response, background noise, and reverberation. Characteristics of the natural language and its speakers include gender, dialect, and accent. The end result of phonetic indexing of an audio file is the creation of a Phonetic Audio Track (PAT) file, a highly compressed representation of the phonetic content of the input speech. Unlike LVCSR, whose essential purpose is to make irreversible and possibly incorrect bindings between speech sounds and specific words, phonetic indexing merely infers the likelihood of potential phonetic content as a reduced lattice, deferring decisions about word bindings to the subsequent searching phase. To support search in any given language, a database of that language must be built. This requires roughly 100 hours of diverse content containing dialogue from a wide variety of speakers (gender, age, accent/inflection) and genres, along with the complete transcripts for that content, which are compiled and processed to create a language pack. The ability to offer broad language support is a key advantage over speech-to-text applications.
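As a rough illustration of what a phonetic search track conceptually contains, the sketch below represents indexed audio as per-frame phoneme likelihoods. This is a simplified stand-in, not the proprietary PAT format: the phoneme inventory, frame rate, and the `acoustic_model` callable are all assumptions made for illustration only.

```python
# A toy stand-in for a phonetic search track: per-frame phoneme likelihoods.
# The real PAT format is proprietary and far more compact; the point is that
# indexing keeps scores for every candidate phoneme instead of committing to
# words, so later searches can reconsider ambiguous audio.
import numpy as np

PHONEMES = ["_AA", "_AE", "_B", "_IY", "_K", "_S", "_T", "_UW"]  # tiny, hypothetical inventory
FRAMES_PER_SECOND = 100  # assume one frame per 10 ms of PCM audio

def index_audio(frames, acoustic_model):
    """Build a search track: one row of phoneme likelihoods per audio frame.

    `acoustic_model` is any callable mapping a frame of PCM samples to a
    probability distribution over PHONEMES. No word decisions are made here.
    """
    track = np.zeros((len(frames), len(PHONEMES)))
    for i, frame in enumerate(frames):
        track[i, :] = acoustic_model(frame)
    return track  # shape: (num_frames, num_phonemes)

# Example with a dummy model that returns a uniform distribution.
dummy_model = lambda frame: np.full(len(PHONEMES), 1.0 / len(PHONEMES))
pat_like = index_audio(np.zeros((500, 160)), dummy_model)  # ~5 seconds of audio
print(pat_like.shape)  # (500, 8)
```

Because this per-frame representation is computed once at ingest, every later query scans these small arrays rather than the original media.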

Searching

The system begins the searching phase by parsing the query string, which is specified as text containing one or more of the following:

- Words or phrases (e.g., "President" or "Supreme Court Justice")
- Phonetic strings (e.g., "_B _IY _T _UW _B _IY", six phonemes representing the acronym "B2B")
- Temporal operators (e.g., "Obama &15 bailout", representing two words or phrases spoken within 15 seconds of each other)

After the system parses the words, phrases, phonetic strings, and temporal operators within the query term, actual searching begins. Multiple PAT files can be scanned at high speed during a single search for likely phonetic sequences (possibly separated by offsets specified by temporal operators) that closely match the corresponding strings of phonemes in the query term. Since PAT files encode potential sets of phonemes, not irreversible bindings to sounds, the matching algorithm is probabilistic and returns multiple results, each as a 4-tuple:

- PAT file (identifies the media segment associated with the hit)
- Start time offset (the beginning of the query term within the media segment, accurate to one hundredth of a second)
- End time offset (the approximate time offset for the end of the query term)
- Confidence level (the likelihood that the query term occurs as indicated, between 0.0 and 1.0)

Key Benefits

Speed, accuracy, scalability. The indexing phase devotes its limited time allotment only to categorizing input speech sounds into potential sets of phonemes rather than making irreversible decisions about words. This approach preserves the possibility of high accuracy, so that the searching phase can make better decisions when presented with specific query terms. The architecture also separates indexing and searching, so that indexing needs to be performed only once, typically during media ingest, and the relatively fast operation (searching) can be performed as often as needed.

Open vocabulary. LVCSR systems can only recognize words found in their lexicons. Many common query terms, such as specialized terminology and names of people, places, and organizations (collectively referred to as entities), are typically omitted from LVCSR lexicons, partly to keep the lexicons small enough that LVCSR can be executed cost-effectively in real time, and partly because these kinds of query terms are notably unstable, as new terminology and names are constantly evolving. By enabling the search of entities, the search can be more specific and allow better discrimination of search results. Phonetic indexing is unconcerned with such linguistic issues, maintaining a completely open vocabulary (or, perhaps more accurately, no vocabulary at all).

Low penalty for new words. LVCSR lexicons can be updated with new terminology, names, and other words. However, this exacts a serious penalty in the cost of ownership because the entire media archive must then be reprocessed through LVCSR to recognize the new words (an operation that typically executes only slightly faster than real time at best). Also, probabilities need to be assigned to the new words, either by guessing their frequency or context or by retraining a language model that includes the new words. The dictionary within the phonetic searching architecture, on the other hand, is consulted only during the searching phase, which is relatively fast compared to indexing. Adding new words incurs only another search, and it is often unnecessary to add words at all, since the spelling-to-sound engine can handle most cases automatically, or users can simply enter sound-it-out versions of words.
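To make the spelling-to-sound idea concrete, here is a deliberately naive sketch of converting a query word to a phoneme string at search time. It is not Nexidia's engine; the lexicon entries, letter rules, and phoneme symbols are invented for illustration. The point is simply that a new or misspelled word costs one conversion at query time rather than a re-indexing of the archive.

```python
# Toy spelling-to-sound conversion at query time (illustrative only).
# A tiny, hypothetical pronunciation lexicon using ARPAbet-style symbols.
LEXICON = {
    "b2b": ["_B", "_IY", "_T", "_UW", "_B", "_IY"],
    "court": ["_K", "_AO", "_R", "_T"],
}

# Extremely naive letter-to-phoneme fallback rules for unknown words.
LETTER_RULES = {"a": "_AE", "b": "_B", "c": "_K", "d": "_D", "e": "_EH",
                "f": "_F", "i": "_IY", "k": "_K", "o": "_OW", "q": "_K",
                "s": "_S", "t": "_T", "u": "_UW"}

def to_phonemes(word):
    """Return a phoneme sequence for `word` via lookup, else sound it out."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Fallback: letter by letter, skipping letters the toy table omits.
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]

print(to_phonemes("B2B"))       # ['_B', '_IY', '_T', '_UW', '_B', '_IY']
print(to_phonemes("Kadoffee"))  # an unseen spelling still yields a phoneme query
```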

Phonetic, inexact spelling, and multiple pronunciations. Proper names are particularly useful query terms, but they are also particularly difficult for LVCSR, not only because they may not occur in the lexicon, as described above, but also because they often have multiple spellings (and any variant may be specified at search time). With phonetic searching, exact spelling is not required. This advantage becomes clear with a name that can be spelled Qaddafi, Khaddafi, Quadafy, Kaddafi, or Kadoffee, any of which could be located by phonetic searching.

User-determined depth of search. If a particular word or phrase is not spoken clearly, or if background noise interferes at that moment, then LVCSR will likely not recognize the sounds correctly. Once that decision is made, the correct interpretation is hopelessly lost to subsequent searches. Phonetic searching, however, returns multiple results, sorted by confidence level. The sound at issue may not be the first result, or even in the top 10 or 100, but it is very likely in the results list somewhere, particularly if some portion of the word or phrase is relatively unimpeded by channel artifacts. If enough time is available, and if the retrieval is sufficiently important, then a motivated user aided by an efficient human interface can drill as deeply as necessary.

Amenable to parallel execution. The phonetic searching architecture can take full advantage of any parallel processing accommodations. For example, a server with 32 cores can index nearly 32 times as fast as a single core. Additionally, PAT files can be processed in parallel by banks of computers to search more media per unit time (or search tracks can be replicated in the same implementation to handle more queries over the same media). This is a significant improvement over LVCSR, which doesn't scale linearly since it must devote more processing cores to achieve higher accuracy rates.

Broad language support. The ability to recognize different languages is unique and unmatched. New languages are being added on a regular basis.

Accuracy

Phonetic-based search results are returned as a list of potential hit locations, in descending likelihood order (see Figure 2). As a user progresses further down this list, they will find more and more instances of their query. However, they will also eventually encounter an increasing number of false alarms (results that do not correspond to the desired search term).

Figure 2: Search Results
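As a hypothetical illustration of how an application might consume such a ranked result list, the sketch below applies a confidence threshold to the 4-tuple hits described in the Searching section; the field names and example values are invented.

```python
# Hypothetical application-side filtering of ranked phonetic search hits.
from typing import NamedTuple, List

class Hit(NamedTuple):
    pat_file: str      # identifies the indexed media segment
    start_s: float     # start offset of the query term, in seconds
    end_s: float       # approximate end offset, in seconds
    confidence: float  # likelihood the query term occurs here, 0.0 to 1.0

def accept(hits: List[Hit], threshold: float) -> List[Hit]:
    """Keep hits at or above `threshold`, highest confidence first.

    A lower threshold favors finding more true occurrences at the cost of
    more false alarms; a higher threshold does the opposite.
    """
    kept = [h for h in hits if h.confidence >= threshold]
    return sorted(kept, key=lambda h: h.confidence, reverse=True)

hits = [Hit("newscast_0412.pat", 73.21, 74.02, 0.94),
        Hit("newscast_0412.pat", 1180.50, 1181.10, 0.41)]
print(accept(hits, threshold=0.5))  # only the high-confidence hit survives
```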

When using a typical North American English broadcast language pack and a query length of phonemes, you can expect, on average, to find 85% of the true occurrences, with less than one false hit per two hours of media searched. The user has the flexibility to ensure a high probability of detection by accepting results with moderate confidence scores, or to reduce false alarms by raising the score threshold and accepting only results with high confidence scores. These settings can also be calculated automatically to balance accuracy and recall according to the phonetic characteristics of the body of media about to be searched. In a word-spotting system, more phonemes in the query mean more discriminative information is available at search time. Fortunately, rather than short, single-word queries such as "no" or "the", most real-world searches are for proper names, phrases, or other interesting speech that represents longer phoneme sequences. Even when the desired phrase is short, it can almost always be interpreted as an OR of several common carrier phrases. For example, "tennis court" OR "tennis elbow" OR "tennis match" OR "tennis shoes" would be four carrier phrases that would capture most instances of the word "tennis".

Speed

Indexing speed is another important metric and is defined as the speed at which media can become searchable. Indexing requires a relatively constant amount of computation per media hour. Thus, on a single server, the indexing time for 3,200 hours of PCM content is less than an hour of real time. Put another way, a single server at full capacity can index over 76,800 hours' worth of media per day. When more cores are added, speed increases further. A final performance measure is the speed at which media can be searched once it has been indexed. Two main factors influence search speed: whether the PAT files are already in memory, and the read access time of the storage. When an application requests that a PAT file be loaded (because it expects it to be needed soon), or upon the first search of a track, the search engine loads the file into memory. Recent advancements have also enabled the creation of a higher-level index that can achieve search performance millions of times faster than real time.

Conclusions

This paper has given an overview of the issues surrounding the need to find and access clips in any body of media, including news, sports, education, government, and production content, so that it can be used to create new content and be monetized quickly and easily. A discussion of the different approaches to metadata creation and their shortcomings led to the relatively new method of phonetic search. The method breaks searching into two stages: indexing and searching. Search queries can be words, phrases, or even structured queries that allow operators such as AND, OR, and time constraints on groups of words. Search results are returned as lists containing file names or unique identifiers for the media, the search query or queries, and the corresponding time codes, along with an accompanying score giving the likelihood that a match to the query occurred at that time. Phonetic searching has several advantages over previous methods of searching media. By not constraining the pronunciation of searches, the method can find any proper name, slang term, or even misspelled word, completely avoiding the out-of-vocabulary problems of speech recognition systems.
Phonetic search is also fast. For deployments such as major broadcasting networks with tens of thousands of hours of media, users don't have to choose a subset to analyze, since all media can be indexed for search even with modest resources. Once media is ingested, it can be indexed immediately as part of the workflow and become searchable right away. Unlike other approaches, phonetic search technologies are very scalable, allowing fast and efficient searching and analysis of extremely large archives.
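To show how the query constructs summarized above might be composed in practice, here is a small hypothetical helper that builds carrier-phrase OR queries and temporal constraints in the style of the syntax described in the Searching section; the exact grammar accepted by any particular product may differ.

```python
# Hypothetical helpers for composing query strings in the style described
# earlier (phrases, OR, and the "&<seconds>" temporal operator).

def or_query(phrases):
    """OR several carrier phrases, e.g. to capture a short word like 'tennis'."""
    return " OR ".join(f'"{p}"' for p in phrases)

def within(term_a, term_b, seconds):
    """Require two terms to be spoken within `seconds` of each other."""
    return f"{term_a} &{seconds} {term_b}"

print(or_query(["tennis court", "tennis elbow", "tennis match", "tennis shoes"]))
print(within("Obama", "bailout", 15))  # -> Obama &15 bailout
```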

References

1. D. Jurafsky and J. Martin, Speech and Language Processing, Prentice-Hall.
2. J. Wilpon, L. Rabiner, C.-H. Lee, and E. Goldman, "Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, November 1990.
3. R. Wohlford, A. Smith, and M. Sambur, "The Enhancement of Wordspotting Techniques," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Denver, CO, vol. 1.
4. R. R. Sarukkai and D. H. Ballard, "Phonetic Set Indexing for Fast Lexical Access," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 78-82, January 1998.
5. D. A. James and S. J. Young, "A Fast Lattice-Based Approach to Vocabulary Independent Wordspotting," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, Australia, vol. 1, 1994.
6. P. Yu, K. Chen, C. Ma, and F. Seide, "Vocabulary-Independent Indexing of Spontaneous Speech," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, September 2005.

Applicable Patents

- Patent 7,263,484 (issued August 28, 2007): Creation and search of phonetic index for audio/video files
- Patent 7,313,521 (issued December 25, 2007): Assessment of search term quality
- Patent 7,324,939 (issued January 29, 2008): Indexing and search covering both forward and backward directions in time
- Patent 7,406,415 (issued July 29, 2008): Structured queries; combination of search terms via Boolean and time-based operators
- Patent 7,475,065 (issued January 6, 2009): Search via linguistic search term plus phonetic search term or voice command
- Patent 7,650,282, Wordspotting System Normalization (issued January 19, 2010): Structured query normalization; statistical modeling of score distributions of potential hits to characterize and reduce false alarms and improve accuracy; auto thresholding
- Patent 7,769,587 (issued August 3, 2010): Phonetic indexing and search of text

- Patent 8,170,873, Comparing Events in Word Spotting (issued May 1, 2012): Application of subword unit models to classify audio
- Patent 7,904,296, Spoken Word Spotting Queries (issued March 8, 2011): Searching audio by selecting audio clips as the search query
- Patent 7,949,527, Multiresolution Searching (issued May 24, 2011): Faster search of spoken content via multiresolution phonetic indexing and novel compression techniques
- Patent 8,311,828, Keyword Spotting Using a Phoneme-Sequence Index (issued November 13, 2012): Application of phonetic search to very large sets of data

This paper was first presented at the 2014 NAB Broadcast Engineering Conference on Wednesday, April 9, 2014, in Las Vegas, Nevada.

Copyright 2014 Nexidia Inc. All rights reserved. Nexidia, Nexidia Dialogue Search, the Nexidia logo, and combinations thereof are trademarks of Nexidia Inc. in the United States and other countries. Other product names and brands mentioned in this paper may be trademarks or registered trademarks of their respective companies.