Automatic Speech Annotation and Transcription in a Broadcast News task
Hugo Meinedo, João Neto
L2F - Spoken Language Systems Lab, INESC-ID / IST, Rua Alves Redol, 9, Lisboa, Portugal
Hugo.Meinedo,

Abstract

This paper describes our work on the development of a system for automatic speech transcription applied to a Broadcast News (BN) task for the European Portuguese language. We developed audio segmentation modules, including a scheme for tagging certain speaker clusters (anchors), and a speech recognition system for the broadcast news task using appropriate models. The tests were conducted on large quantities of BN data and show good results in terms of word error rate and processing time. This system is currently integrated in a prototype audio indexing and document retrieval system that daily processes the main news show of the national Portuguese broadcaster.

1. Automatic transcription of BN data

In the last few years we have seen large developments in speech recognition systems associated with BN tasks. These developments open the possibility for new applications in which speech recognition is a major component. During the last two years we have been working on the IST-HLT European programme project ALERT. The main goal of ALERT is to develop a system for the selective dissemination of multimedia information. The idea is to build a system capable of identifying specific information in multimedia data consisting of audio/text streams, using continuous speech recognition, audio segmentation and topic detection techniques [1, 2, 3]. This paper describes in more detail the development of our speech recognition engine, namely the acoustic model alignment and training procedures, the vocabulary, lexicon and language model building, and the decoding algorithm used. We also describe the audio segmentation, classification and clustering module used to pre-process the audio. Finally, we present our BN speech recognition results and evaluate the impact of the automatic segmentation modules on recognition.

2. Audio segmentation, classification and clustering

Audio segmentation is used to deliver to the user only the relevant information and to generate a set of acoustic cues for the speech recognition system and the topic detection algorithms. This work results in the segmentation of audio into homogeneous regions according to background conditions, speaker gender and special speaker identity (anchors). This segmentation can provide useful information, such as the division into speaker turns and speaker identities, allowing for automatic indexing and retrieval of all occurrences of a particular speaker. If we group together all segments produced by the same speaker, we can perform automatic online adaptation of the speech recognition acoustic models to improve overall system performance. We use several modules for segmentation, classification and clustering of each news show before proceeding to the speech recognition system. The purpose of the segmentation module is to generate acoustically homogeneous audio segments. The segmentation algorithm tries to detect changes in the acoustic conditions and marks those time instants as segment boundaries. Each homogeneous audio segment is then passed through the first classification stage in order to tag non-speech segments. All audio segments go through a second classification stage where they are classified according to background conditions.
Segments that were marked as containing speech are also classified according to gender and are subdivided into sentences by an endpoint detector. All labeled speech segments are clustered separately by gender in order to produce homogeneous clusters according to speaker and background conditions. In the last stage, anchor detection attempts to identify those speaker clusters that were produced by one of the pre-defined news anchors.
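As a concrete illustration of the pipeline's first stage (detailed in Section 2.1 below), the following is a minimal sketch of sliding-window acoustic change detection. The 10 ms frame shift and the 0.5/1.0/4.0 second window pairs follow the description in Section 2.1; the symmetric Kullback-Leibler distance between diagonal Gaussians and the detection threshold are stand-ins, since the exact similarity measure is not specified here.

import numpy as np

def window_distance(a, b, eps=1e-6):
    """Symmetric KL divergence between diagonal Gaussians fitted to two
    feature windows (arrays of shape frames x cepstral coefficients)."""
    mu_a, var_a = a.mean(axis=0), a.var(axis=0) + eps
    mu_b, var_b = b.mean(axis=0), b.var(axis=0) + eps
    d = (var_a / var_b + var_b / var_a - 2.0
         + (mu_a - mu_b) ** 2 * (1.0 / var_a + 1.0 / var_b))
    return 0.5 * float(d.sum())

def change_points(feats, window_frames=(50, 100, 400), thresh=60.0):
    """Slide two contiguous windows over the 10 ms frame stream; take the
    maximum distance across the three window sizes (0.5, 1.0, 4.0 s) and
    mark local maxima above a threshold as segment boundaries."""
    n = len(feats)
    scores = np.zeros(n)
    for w in window_frames:
        for t in range(w, n - w):
            scores[t] = max(scores[t],
                            window_distance(feats[t - w:t], feats[t:t + w]))
    return [t for t in range(1, n - 1)
            if scores[t] > thresh
            and scores[t] >= scores[t - 1] and scores[t] >= scores[t + 1]]

Taking the maximum score across the three window sizes is one plausible way to fuse the multi-resolution analyses; the paper does not state how the three window pairs are combined.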

2.1. Audio Segmentation

The main goal of the segmentation is to divide the input audio stream into acoustically homogeneous segments. This is accomplished by evaluating, in the cepstral domain, the similarity between two contiguous windows of fixed length that are shifted in time every 10 ms. In our system we introduce three distinct time analysis window pairs of 0.5, 1.0 and 4.0 seconds. Test results were obtained for two segmentation modules on a news program with a total duration of 1 hour. When comparing our scheme of analysis windows with different sizes (0.5, 1.0 and 4.0 sec) against a system [4] with a single window pair of 0.5 sec, we were able to significantly reduce the number of missed boundaries, from 22% to 14%, although at the cost of slightly increasing the insertion rate, from 17% to 18%.

2.2. Speech / Non-Speech Discrimination

After the acoustic segmentation stage, each segment is classified using a speech / non-speech discriminator [5, 6], tagging audio portions without speech, with too much noise, or with pure music. This stage is very important for the rest of the processing, since we are not interested in wasting time trying to recognize audio segments that do not contain useful speech. Classification tests were conducted on a subset of the ALERT development test set consisting of 4 different news programs with a total of 2 hours. With this classifier we achieved a very low error rate of 4.4% for tagging speech segments as non-speech. This is the most damaging type of error, since these segments contain useful speech that will not be recognized.

2.3. Gender and Background Classification

In our framework, gender classification is used as a means to improve speaker clustering. By clustering each gender class separately, we obtain a smaller distance matrix when evaluating cluster distances, which effectively reduces the search space. It also prevents short segments with opposite gender tags from being erroneously clustered together. Background classification can be used to switch between tuned acoustic models in recognition and can help to better detect special situations such as anchor filler sections with background music. The classification module [6] uses two MLPs estimating posterior probabilities, one for gender classification and the other for background conditions. In both cases, the output class is chosen through a maximum likelihood calculation over the audio segment. Gender classification is very precise, with a 7.1% misclassification error rate. The background classifier has a much more difficult task because the training material contains many overlapping conditions, especially music plus noise.

2.4. Sentence Division

Segments that were labeled as containing speech are divided into sentences by an energy-based endpoint detector. This is a crude and simple approximation that assumes a speech pause corresponds to an end of sentence. Unfortunately, news reporters and news anchors do not always take a breath pause at end-of-sentence points. This is the major source of incorrect sentence boundaries.
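The gender and background classifiers of Section 2.3 make one decision per segment from frame-level MLP posteriors. A minimal sketch of such a segment-level decision, assuming the maximum likelihood calculation over the audio segment amounts to accumulating frame log-posteriors:

import numpy as np

def classify_segment(frame_posteriors, classes=("female", "male")):
    """Segment-level decision from frame-level MLP posteriors
    (array of shape frames x classes): accumulate log-probabilities
    over the whole segment and pick the highest-scoring class."""
    log_post = np.log(np.clip(frame_posteriors, 1e-10, 1.0))
    return classes[int(np.argmax(log_post.sum(axis=0)))]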
2.5. Speaker Clustering

The goal of speaker clustering is to identify and group together all speech segments that were produced by the same speaker. The clusters can then be used for acoustic model adaptation in order to improve the speech recognition rate. Speaker cluster information can also be used by topic detection and story segmentation algorithms to determine speaker roles inside the news show, allowing for easier story identification. Our speaker clustering algorithm makes use of gender detection: speech segments with different gender classifications are clustered separately. We used bottom-up hierarchical clustering [4] and developed an efficient distance measure based on the Bayesian Information Criterion (BIC) [7, 8], in which an adjacency term is used instead of the BIC threshold [6]. Empirically, clusters containing adjacent speech segments are closer in time, and the probability that they belong to the same speaker should be higher. Using this we obtained a cluster purity greater than 97%, with a mean number of clusters per speaker slightly over 3.1. The adjacency term in our modified BIC expression retained a high cluster purity while significantly decreasing the number of clusters per speaker. The clustering algorithm proved to be sensitive not only to different speakers but also to different acoustic background conditions. This side effect is responsible for the high number of clusters per speaker obtained in the test set results.
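A minimal sketch of a BIC-based cluster distance of this kind, using the standard delta-BIC of [7] with full-covariance Gaussians; the adjacency bonus alpha is a stand-in, since the exact form of the paper's adjacency term is not given here:

import numpy as np

def delta_bic(x, y, lam=1.0):
    """Standard delta-BIC between two clusters of feature frames
    (frames x dims), each modelled as a single full-covariance Gaussian;
    lower values indicate the clusters are more likely the same speaker."""
    n_x, n_y, d = len(x), len(y), x.shape[1]
    logdet = lambda m: np.linalg.slogdet(
        np.cov(m, rowvar=False) + 1e-6 * np.eye(d))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n_x + n_y)
    return (0.5 * (n_x + n_y) * logdet(np.vstack([x, y]))
            - 0.5 * n_x * logdet(x) - 0.5 * n_y * logdet(y) - penalty)

def cluster_distance(x, y, adjacent, alpha=10.0):
    """Modified distance: rather than thresholding delta-BIC, subtract a
    bonus when the two clusters contain temporally adjacent segments,
    making such merges more likely (alpha is a hypothetical weighting)."""
    return delta_bic(x, y) - (alpha if adjacent else 0.0)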

2.6. Anchor Detection

Anchors introduce the news and provide a concise summary of each story. Normally this is done in studio conditions (clean background), with the anchor reading the news. Anchor speech segments convey all the story cues and are invaluable for automatic topic and summary generation algorithms; in these speech segments the recognition error rate is also the lowest. The news shows that our system currently monitors are presented by three anchors, two male and one female. We built individual speaker models for these anchors [6]. During the processing of a news show, after speaker sentence clustering, the resulting clusters are compared one by one against the special anchor cluster models to determine which of them belongs to one of the news anchors. We achieved a deletion rate (clusters not identified as belonging to the anchor) of around 9% and an insertion rate (clusters incorrectly labeled as anchor) of around 2%. These results are very promising, especially due to the very low insertion rate.

3. AUDIMUS Recognition System

AUDIMUS is a hybrid speech recognition system [9, 10] that combines the temporal modeling capabilities of Hidden Markov Models (HMMs) with the discriminative pattern classification capabilities of multilayer perceptrons (MLPs) [11]. In this hybrid HMM/MLP system a Markov process is used to model the basic temporal nature of the speech signal, while the MLP is used as the acoustic model, estimating context-independent posterior phone probabilities given the acoustic data at each frame.

[Figure 1: AUDIMUS recognition system.]

The acoustic modeling of AUDIMUS combines phone probabilities generated by several MLPs trained on distinct feature sets resulting from different feature extraction processes [12]. These probabilities are taken at the output of each MLP classifier and combined using an average in the log-probability domain. The processing stages are represented in Figure 1. All MLPs use the same phone set, constituted by 38 phones for the Portuguese language plus silence and breath noises. The combination algorithm merges the probabilities associated with the same phone. We use three different feature extraction methods and MLPs with the same basic structure, that is, an input layer with 7 context frames, a non-linear hidden layer with over 1000 sigmoidal units, and 40 softmax outputs. The feature extraction methods are PLP, log-RASTA and MSG [13]. AUDIMUS presently uses a dynamic decoder that builds the search space as the composition of three Weighted Finite-State Transducers (WFSTs) [14, 15, 16, 17].
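A minimal sketch of this combination step, merging aligned frame-level posteriors from the three feature streams by averaging in the log-probability domain (equivalently, a per-phone geometric mean); the final per-frame renormalization is an assumption, as the paper does not state it explicitly:

import numpy as np

def combine_posteriors(streams):
    """Merge phone posteriors from several MLPs (a list of arrays, each
    frames x 40 phones, aligned frame by frame) by averaging in the log
    domain, then renormalizing each frame to sum to one."""
    log_avg = np.mean([np.log(np.clip(p, 1e-10, 1.0)) for p in streams],
                      axis=0)
    merged = np.exp(log_avg)
    return merged / merged.sum(axis=1, keepdims=True)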
4. Language modeling

Over the last years we have been collecting Portuguese newspaper texts from the web, which allowed us to build a considerably large text corpus. Up to the end of 2001 we had collected a total of 24.0M sentences with 434.4M words. A language model generated only from newspaper texts becomes too adapted to the type of language used in those texts. When this kind of language model is used in a continuous speech recognition system applied to a Broadcast News task, it does not perform as well as one would expect, because the sentences spoken in broadcast news do not match the style of the sentences written in newspapers. A language model created from Broadcast News transcriptions would probably be more adequate for this kind of speech recognition task; the problem is that we do not have enough BN transcriptions to generate a satisfactory language model. However, we can improve the adaptation of a language model to the BN task by combining a model created from the newspaper texts with a model created from BN transcriptions, using linear interpolation [18]. One of the models is always generated from the newspaper text corpus, while the other is a backoff trigram model using absolute discounting, based on the training set transcriptions of our BN database. The optimal interpolation weights are computed using the transcriptions from the evaluation set of our BN database, which are also used to estimate the perplexity of all models. The interpolated model has a lower perplexity than the newspapers-only model: even using a very small model based on BN transcriptions, we can obtain some improvement in the perplexity of the interpolated model [19].
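A minimal sketch of this interpolation and of the held-out weight estimation, assuming the per-word probabilities assigned by the two component models are available; the EM update used for the mixture weight is the standard one:

import numpy as np

def interpolation_weight(p_news, p_bn, iters=20):
    """EM estimate of the weight lam minimizing held-out perplexity of
    p = lam * p_news + (1 - lam) * p_bn, where p_news and p_bn hold the
    two component models' probabilities for each held-out word."""
    p_news, p_bn = np.asarray(p_news), np.asarray(p_bn)
    lam = 0.5
    for _ in range(iters):
        post = lam * p_news / (lam * p_news + (1.0 - lam) * p_bn)
        lam = float(post.mean())
    return lam

def perplexity(p_news, p_bn, lam):
    """Perplexity of the interpolated model on the same held-out words."""
    p = lam * np.asarray(p_news) + (1.0 - lam) * np.asarray(p_bn)
    return float(np.exp(-np.mean(np.log(p))))

In practice the weight converges in a handful of iterations; a simple grid search over lam against held-out perplexity gives the same answer.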

5. Vocabulary and pronunciation lexicon

When creating the vocabulary, we have to consider that it needs to have a limited size, to avoid considerably expanding the search space during recognition. Currently we limit our vocabulary size to 64k words. From the text corpus of 335 million words, built from all the newspaper editions collected until the end of 2000, we extracted 427k distinct words. About 100k of these words occur more than 50 times in the corpus. Using this smaller set, all the words were classified according to syntactic classes. Different weights were given to each class, and a subset of 56k words was created based on the weighted frequencies of occurrence of the words. To this set we added essentially all the new words present in the transcripts of the training data of our Broadcast News database (still under development), giving a total of 57,564 words. Currently the transcripts contain 12,812 distinct words out of a total of 142,547. From the vocabulary we built the pronunciation lexicon. To obtain the pronunciations we used different lexica available in our lab. For the words not present in those lexica (mostly proper names, foreign names and some verbal forms) we used an automatic grapheme-to-phone system to generate the corresponding pronunciations. Our final lexicon has a total of 65,895 different pronunciations. For the ALERT development test set corpus, which has 5,426 distinct words in a total of 32,319 words, the number of OOVs using the 57k-word vocabulary was 444 words, representing an OOV word rate of 1.4%.

6. Weighted Finite-State Transducer based decoder

The integration of knowledge sources in large vocabulary continuous speech recognition using weighted finite-state transducers (WFSTs) is spreading in the speech recognition community [20, 21]. The approach can be briefly described as follows: all knowledge sources (such as the lexicon or the language model) are encoded as WFSTs; the knowledge sources are then combined using WFST composition into a very large integrated static network, which is optimized using well-founded algorithms such as determinization, minimization and weight pushing. In this way, the search space is built by the dynamic composition of the HMM/MLP topology transducer, the lexicon transducer and the language model transducer. The flexibility of the approach comes from the uniformity of representation and combination of the knowledge sources, allowing the easy integration of novel sources. The efficiency comes from the optimization algorithms, which make very good use of the sparsity of the integrated network. Furthermore, our decoder [14] uses a specialized algorithm that integrates the linguistic components at run time, computing the minimization, pushing and determinization of the composition between the lexicon and language model transducers on the fly [15, 16]. In our BN experiments we used a compact in-memory representation of WFSTs and were able to run in less than 512MB of RAM.
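For illustration, a minimal sketch of the WFST composition operation on which such decoders are built. This is the textbook epsilon-free case in the tropical semiring, not the specialized on-the-fly composition of [15, 16]; the transducer encoding as a plain state-to-arcs dictionary is an assumption made for the sketch:

from collections import defaultdict

def compose(t1, t2):
    """Epsilon-free composition of two weighted transducers in the
    tropical semiring (weights are negative log probabilities and add
    along a path). Each transducer maps a state to a list of arcs
    (in_label, out_label, weight, next_state); t1's output labels are
    matched against t2's input labels, and only the reachable part of
    the cross-product state space is built."""
    start = (0, 0)
    arcs, stack, seen = defaultdict(list), [start], {start}
    while stack:
        s1, s2 = stack.pop()
        for i1, o1, w1, n1 in t1.get(s1, []):
            for i2, o2, w2, n2 in t2.get(s2, []):
                if o1 == i2:
                    nxt = (n1, n2)
                    arcs[(s1, s2)].append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return dict(arcs)

Building only the reachable states is what lets the optimization algorithms exploit the sparsity of the integrated network mentioned above.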
7. Acoustic models alignment and training

To train the acoustic models of our BN speech recognition system, we started collecting broadcast news data in the scope of the ALERT project. The collected BN data was first automatically transcribed using our baseline speech recognition system AUDIMUS, which was originally developed for dictation tasks and trained on a clean read-speech corpus [9, 10] similar to WSJ0. The automatic transcriptions were then manually corrected. Each time a significant amount of manually verified BN data became available, we realigned and retrained our acoustic models; so far we have made 8 iterations. The first available data, about 5 hours from the ALERT Pilot corpus, was used to bootstrap the BN acoustic models with two iterations of forced alignment and MLP training. The third alignment and training iteration already used some data from the ALERT Speech Recognition corpus, about a full week of news programs; together with the Pilot corpus this amounted to 13 hours of BN data. For the acoustic models trained using alignment number 4, over 22.5 hours of BN data were used. In alignment number 6 we were able to use the complete ALERT Pilot and Speech Recognition corpora, consisting of 45 hours of training material.

8. Speech recognition results

Speech recognition experiments were conducted on the ALERT development test set, which has over 6 hours of BN data, using a Pentium III 1 GHz computer running Linux with 1 GB of RAM. Table 1 summarizes the word error rate (% WER) obtained by the recognition system. The lines in Table 1 show the increase in performance with each successive improvement to the recognition system. The first column of results refers to the F0 focus condition, where the sentences contain only prepared speech with low background noise and good quality audio. The second results column gives the WER over all test sentences, including noise, music, spontaneous speech, telephone speech and non-native accents, as well as the F0 focus condition sentences.

[Table 1: BN speech recognition evaluation using the complete development test set. Columns give the % WER for the F0 and All conditions and the decoding speed in xRT; rows correspond to the successive system configurations discussed below (numeric values not recoverable from this copy).]

The first line of results in Table 1 was obtained using MLPs with 1000 hidden units and our old stack decoder. Comparing with the second line, there was a significant increase in performance when we switched to the new dynamic WFST decoder, especially in decoding time, expressed in the last column as a multiple of real-time speed. The third line refers to the recognition results after alignment number 6, the first to use the complete BN training corpus of 45 hours of speech data. The fourth line shows the improvement obtained by substituting the determinized lexicon transducer with one that was also minimized. The last line shows an 8% relative improvement obtained by increasing the hidden layers to 4000 units. This increase was necessary because the acoustic model MLPs were no longer coping with all the variability present in the ALERT BN corpora used as training data.

9. Impact of segmentation in recognition

Our automatic segmentation modules pre-process the audio stream and tag the segments that are fed to the speech recognition system for transcription. Since this automatic segmentation is not perfect, we wanted to evaluate its impact on speech recognition in terms of word errors and processing time. We conducted a series of tests using a 29-minute news program from the ALERT development test set, producing the automatic transcription under three different audio pre-processing conditions. In the first, the sentences for transcription and their divisions were chosen using the manual segmentation information; the program was divided into 241 useful sentences. In the second, no pre-processing was done and the whole program was treated as a single sentence. In the third, the program was processed by our automatic audio segmentation system, which divided the news show into 366 sentences. Table 2 summarizes the word error rates obtained and the processing time, expressed as multiples of real-time speed.

[Table 2: Audio segmentation impact on BN speech recognition. Rows: manual, without, and automatic segmentation; columns: number of sentences, % WER, and xRT split into front-end and decoding (numeric values not recoverable from this copy).]

The first conclusion drawn from the results in Table 2 is that our automatic audio segmentation module produced more sentences than it should when compared with the manual segmentation: sentences are shorter and more subdivided. This is mainly due to incorrect sentence divisions at speaker breath pauses. These subdivided sentences give rise to language modeling errors near the incorrect boundaries. Without automatic segmentation, that is, when the whole program is considered a single sentence, the opposite effect occurs, although it is less problematic because the language model only introduces articles to connect different sentences; that is why the WER in this case is closer to that of the manual segmentation. Additionally, Table 2 shows that without segmentation the decoding is very time consuming because of the search space dimension (the whole news program is a single sentence). Our automatic audio segmentation takes slightly more processing time than the manual segmentation, as expected, due to the segmentation and classification modules.
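Both evaluations above report word error rate. For reference, a minimal sketch of the standard WER computation via Levenshtein alignment over word sequences:

def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / N,
    from a Levenshtein alignment over the two word sequences."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[-1][-1] / max(len(r), 1)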
10. Concluding remarks

Broadcast News speech recognition is a difficult and very resource-consuming task. Our recognition engine evolved substantially through the accumulation of relatively small improvements. We are still far from ideal recognition results, the ultimate goal; nevertheless, this technology is already able to drive very useful applications, including audio indexing and spoken document retrieval. Our automatic audio segmentation, classification and clustering modules proved fairly accurate while consuming relatively fewer computational resources than other approaches. Additionally, we developed a scheme for tagging anchor speaker clusters using trained models, which aids the story segmentation module of our document retrieval system. Our recognition system evolved substantially with the availability of large quantities of BN training data: the MLPs with 4000 units trained on all of this BN data permitted a significant decrease in WER. Furthermore, the use of interpolated language models showed that it is possible to improve a language model based on newspaper texts using a small model based on BN transcriptions. The use of the dynamic WFST decoder allowed us to reduce WER and also to improve decoding speed from over 30 times real-time to as little as 4 times real-time. Our system is currently integrated in a prototype audio indexing and document retrieval system that daily processes the main news show of the national Portuguese broadcaster [2].

11. Acknowledgments

This work was partially funded by the IST-HLT European programme project ALERT and by FCT project POSI/33846/PLP/2000. INESC-ID Lisboa had support from the POSI Program of the Quadro Comunitário de Apoio III. Hugo Meinedo is sponsored by an FCT scholarship (SFRH/BD/6125/2001).

12. References

[1] R. Amaral, T. Langlois, H. Meinedo, J. Neto, N. Souto, and I. Trancoso, The development of a Portuguese version of a media watch system, in Proc. EUROSPEECH 2001, Aalborg, Denmark, 2001.
[2] J. Neto, H. Meinedo, R. Amaral, and I. Trancoso, A system for selective dissemination of multimedia information resulting from the ALERT project, in Proc. ISCA MSDR 2003, Hong Kong, China, April 2003.
[3] R. Amaral and I. Trancoso, Segmentation and indexation of broadcast news, in Proc. ISCA MSDR 2003, Hong Kong, China, April 2003.
[4] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, Automatic segmentation, classification and clustering of broadcast news, in Proc. DARPA Speech Recognition Workshop, 1997.
[5] G. Williams and D. Ellis, Speech/music discrimination based on posterior probability features, in Proc. EUROSPEECH 1999, Budapest, Hungary, September 1999.
[6] H. Meinedo and J. Neto, Audio segmentation, classification and clustering in a broadcast news task, in Proc. ICASSP 2003, Hong Kong, China, April 2003.
[7] S. Chen and P. S. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proc. DARPA Speech Recognition Workshop, 1998.
[8] B. Zhou and J. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, in Proc. ICSLP 2000, Beijing, China, October 2000.
[9] J. Neto, C. Martins, and L. Almeida, The development of a speaker independent continuous speech recognizer for Portuguese, in Proc. EUROSPEECH 1997, Rhodes, Greece, 1997.
[10] J. Neto, C. Martins, and L. Almeida, A large vocabulary continuous speech recognition hybrid system for the Portuguese language, in Proc. ICSLP 1998, Sydney, Australia, 1998.
[11] H. Bourlard and N. Morgan, Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, Massachusetts, USA, 1994.
[12] H. Meinedo and J. Neto, Combination of acoustic models in continuous speech recognition, in Proc. ICSLP 2000, Beijing, China, 2000.
[13] B. E. Kingsbury, N. Morgan, and S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication, vol. 25, 1998.
[14] D. Caseiro and I. Trancoso, A decoder for finite-state structured search spaces, in ASR 2000 Workshop, Paris, France, September 2000.
[15] D. Caseiro and I. Trancoso, On integrating the lexicon with the language model, in Proc. EUROSPEECH 2001, Aalborg, Denmark, September 2001.
[16] D. Caseiro and I. Trancoso, Transducer composition for on-the-fly lexicon and language model integration, in Proc. ASRU 2001 Workshop, Madonna di Campiglio, Trento, Italy, December 2001.
[17] D. Caseiro and I. Trancoso, A tail-sharing WFST composition for large vocabulary speech recognition, in Proc. ICASSP 2003, Hong Kong, China, April 2003.
[18] R. Rosenfeld, Adaptive Statistical Language Modeling: A Maximum Entropy Approach, Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, April 1994; also appears as technical report CMU-CS.
[19] H. Meinedo, N. Souto, and J. Neto, Speech recognition of broadcast news for the European Portuguese language, in Proc. ASRU 2001 Workshop, Madonna di Campiglio, Trento, Italy, December 2001.
[20] M. Mohri, Finite-state transducers in language and speech processing, Computational Linguistics, vol. 23, no. 2, June 1997.
[21] M. Mohri, F. Pereira, and M. Riley, Weighted automata in text and speech processing, in Proc. ECAI 96 Workshop, Budapest, Hungary, August 1996.

CHANWOO KIM (BIRTH: APR. 9, 1976) Language Technologies Institute School of Computer Science Aug. 8, 2005 present CHANWOO KIM (BIRTH: APR. 9, 1976) 2602E NSH Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Phone: +1-412-726-3996 Email: chanwook@cs.cmu.edu RESEARCH INTERESTS Speech recognition system,

More information

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering

More information

APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA

APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA Tuija Niemi-Laitinen*, Juhani Saastamoinen**, Tomi Kinnunen**, Pasi Fränti** *Crime Laboratory, NBI, Finland **Dept. of Computer

More information

Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

More information

Lecture 12: An Overview of Speech Recognition

Lecture 12: An Overview of Speech Recognition Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated

More information

Speaker age estimation for elderly speech recognition in European Portuguese

Speaker age estimation for elderly speech recognition in European Portuguese Speaker age estimation for elderly speech recognition in European Portuguese Thomas Pellegrini 1, Vahid Hedayati 2, Isabel Trancoso 2,3 Annika Hämäläinen 4, Miguel Sales Dias 4 1 Université de Toulouse;

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

An Adaptive Approach to Named Entity Extraction for Meeting Applications

An Adaptive Approach to Named Entity Extraction for Meeting Applications An Adaptive Approach to Named Entity Extraction for Meeting Applications Fei Huang, Alex Waibel Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave. Pittsburgh,PA 15213 fhuang@cs.cmu.edu,

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

Semantic Data Mining of Short Utterances

Semantic Data Mining of Short Utterances IEEE Trans. On Speech & Audio Proc.: Special Issue on Data Mining of Speech, Audio and Dialog July 1, 2005 1 Semantic Data Mining of Short Utterances Lee Begeja 1, Harris Drucker 2, David Gibbon 1, Patrick

More information

Recent advances in Digital Music Processing and Indexing

Recent advances in Digital Music Processing and Indexing Recent advances in Digital Music Processing and Indexing Acoustics 08 warm-up TELECOM ParisTech Gaël RICHARD Telecom ParisTech (ENST) www.enst.fr/~grichard/ Content Introduction and Applications Components

More information