Automatic Speech Annotation and Transcription in a Broadcast News task
Hugo Meinedo, João Neto
L2F - Spoken Language Systems Lab, INESC-ID / IST, Rua Alves Redol, 9, Lisboa, Portugal

Abstract

This paper describes our work on the development of a system for automatic speech transcription applied to a Broadcast News (BN) task for the European Portuguese language. We developed audio segmentation modules, including a scheme for tagging certain speaker clusters (anchors), and a speech recognition system for the broadcast news task using appropriate models. Tests conducted on large quantities of BN data show good results in terms of word error rate and processing time. The system is currently integrated in a prototype audio indexing and document retrieval system that daily processes the main news show of the national Portuguese broadcaster.

1. Automatic transcription of BN data

In the last few years we have seen large developments in speech recognition systems associated with BN tasks. These developments open the possibility of new applications in which speech recognition is a major component. During the last two years we have been working on the IST-HLT European programme project ALERT, whose main goal is to develop a system for selective dissemination of multimedia information: a system capable of identifying specific information in multimedia data consisting of audio/text streams, using continuous speech recognition, audio segmentation and topic detection techniques [1, 2, 3]. This paper describes in more detail the development of our speech recognition engine, namely the acoustic model alignment and training procedures, the vocabulary, lexicon and language model building, and the decoding algorithm used. We also describe the audio segmentation, classification and clustering module used to pre-process the audio.
Finally, we present our BN speech recognition results and evaluate the impact of the automatic segmentation modules on recognition.

2. Audio segmentation, classification and clustering

Audio segmentation is used to deliver to the user only the relevant information and to generate a set of acoustic cues for the speech recognition system and the topic detection algorithms. This work results in the segmentation of the audio into homogeneous regions according to background conditions, speaker gender and special speaker identity (anchors). This segmentation can provide useful information, such as the division into speaker turns and speaker identities, allowing automatic indexing and retrieval of all occurrences of a particular speaker. If we group together all segments produced by the same speaker, we can perform automatic online adaptation of the speech recognition acoustic models to improve overall system performance.

We use several modules for segmentation, classification and clustering of each news show before proceeding to the speech recognition system. The purpose of the segmentation module is to generate homogeneous acoustic audio segments: the segmentation algorithm tries to detect changes in the acoustic conditions and marks those time instants as segment boundaries. Each homogeneous audio segment is then passed through a first classification stage that tags non-speech segments. All audio segments go through a second classification stage where they are classified according to background status. Segments marked as containing speech are also classified according to gender and are subdivided into sentences by an endpoint detector. All labeled speech segments are clustered separately by gender in order to produce homogeneous clusters according to speaker and background conditions. In the last stage, anchor detection is performed, attempting to identify those speaker clusters that were produced by one of the pre-defined news anchors.
2.1. Audio Segmentation

The main goal of the segmentation is to divide the input audio stream into acoustically homogeneous segments. This is accomplished by evaluating, in the cepstral domain, the similarity between two contiguous windows of fixed length that are shifted in time every 10 ms. In our system we introduce three distinct time analysis window pairs of 0.5, 1.0 and 4.0 seconds. Test results were obtained for two segmentation modules on a news program with a total duration of 1 hour. When comparing our scheme of analysis windows with different sizes (0.5, 1.0 and 4.0 sec) against a system [4] with a single window pair of 0.5 sec, we were able to reduce the number of missed boundaries significantly, from 22% to 14%, although at the cost of slightly increasing the insertion rate, from 17% to 18%.

2.2. Speech / non-speech discrimination

After the acoustic segmentation stage each segment is classified by a speech / non-speech discriminator [5, 6], which tags audio portions without speech, with too much noise, or containing pure music. This stage is very important for the rest of the processing, since we are not interested in wasting time trying to recognize audio segments that contain no useful speech. Classification tests were run on a subset of the ALERT development test set consisting of 4 different news programs with a total of 2 hours. The classifier achieved a very low error rate of 4.4% for tagging speech segments as non-speech. This is the worst kind of error, since these segments contain useful speech that will not be recognized.

2.3. Gender and background classification

In our framework, gender classification is used as a means to improve speaker clustering. By clustering each gender class separately we obtain a smaller distance matrix when evaluating cluster distances, which effectively reduces the search space. It also avoids short segments with opposite gender tags being erroneously clustered together.
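The boundary decision in the segmentation stage can be illustrated with a BIC-style change criterion between two adjacent feature windows. This is a generic sketch, assuming a full-covariance Gaussian per window; the penalty weight `lam` and the synthetic feature windows are illustrative choices, not the tuned values of the system described above.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """BIC-style change criterion between two adjacent feature windows
    (frames x dims). Positive values favour placing a segment boundary
    between the windows."""
    z = np.vstack([x, y])
    n, d = z.shape

    def logdet(m):
        # log-determinant of the sample covariance of window m
        return np.linalg.slogdet(np.cov(m, rowvar=False))[1]

    # Model-complexity penalty: extra mean and covariance parameters
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z)
                  - len(x) * logdet(x)
                  - len(y) * logdet(y)) - penalty

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(300, 4))   # window from one acoustic condition
b = rng.normal(3.0, 2.0, size=(300, 4))   # acoustically different window
print(delta_bic(a, b))                    # positive: boundary hypothesised
print(delta_bic(a[:150], a[150:]))        # negative: homogeneous audio
```

Running the criterion at every shift position and marking the local maxima above zero yields the candidate segment boundaries.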
Background classification can be used to switch between tuned acoustic models in recognition and can help to detect special situations, such as anchor filler sections with background music. The classification module [6] uses two MLPs estimating posterior probabilities, one for gender classification and the other for background status. In both cases, the output class is chosen through a maximum likelihood calculation over the audio segment. Gender classification is very precise, with a 7.1% misclassification error rate. The background classifier has a much harder task because the training material contains many overlapping conditions, especially music plus noise.

2.4. Sentence division

Segments that were labeled as containing speech are divided into sentences by an energy endpoint detector. This is a crude and simple approximation that assumes a speech pause corresponds to an end of sentence. Unfortunately, news reporters and news anchors do not always take a breath pause at end-of-sentence points. This is the major source of incorrect sentence boundaries.

2.5. Speaker clustering

The goal of speaker clustering is to identify and group together all speech segments that were produced by the same speaker. The clusters can then be used for acoustic model adaptation in order to improve the speech recognition rate. Speaker cluster information can also be used by topic detection and story segmentation algorithms to determine speaker roles inside the news show, allowing easier story identification. Our speaker clustering algorithm makes use of gender detection: speech segments with different gender classifications are clustered separately. We used bottom-up hierarchical clustering [4] and developed an efficient distance measure based on the Bayesian Information Criterion (BIC) [7, 8]. An adjacency term is used instead of the BIC threshold [6].
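The bottom-up clustering with an adjacency preference can be sketched as follows. The Euclidean distance between segment mean vectors and the fixed `adjacency_bonus` are hypothetical stand-ins for the modified BIC measure with adjacency term described above (see [6] for the actual expression); the segment features are invented.

```python
import numpy as np

def cluster_segments(feats, threshold=1.0, adjacency_bonus=0.5):
    """Bottom-up clustering sketch for speech segments in temporal order.
    `feats` holds one mean feature vector per segment. Merging stops when
    no pair of clusters is closer than `threshold`."""
    clusters = [[i] for i in range(len(feats))]
    while len(clusters) > 1:
        best, best_d = None, threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                d = min(np.linalg.norm(feats[i] - feats[j])
                        for i in ca for j in cb)
                if any(abs(i - j) == 1 for i in ca for j in cb):
                    d -= adjacency_bonus      # adjacent in time: favour merge
                if d < best_d:
                    best, best_d = (a, b), d
        if best is None:                      # no pair below the threshold
            break
        a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# Four segments: two close in feature space, then two from another speaker.
feats = [np.array([0.0]), np.array([0.1]), np.array([5.0]), np.array([5.1])]
print(cluster_segments(feats))               # two speaker clusters
```

In the real system the distance would be evaluated on the segment statistics rather than a single mean vector, but the greedy merge loop has the same shape.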
Empirically, clusters containing adjacent speech segments are closer in time, so the probability that they belong to the same speaker should be higher. Using this, we obtained a cluster purity greater than 97% with a mean number of clusters per speaker slightly over 3.1. The adjacency term in our modified BIC expression retained a high cluster purity while significantly decreasing the number of clusters per speaker. The clustering algorithm proved to be sensitive not only to different speakers but also to different acoustic background conditions. This side effect is responsible for the high number of clusters per speaker obtained in the test set results.

2.6. Anchor detection

Anchors introduce the news and provide a synthetic summary of each story. Normally this is done in studio conditions (clean background) with the anchor reading the news. Anchor speech segments convey all the story cues and are invaluable for automatic topic and summary generation algorithms. In these speech segments the recognition error rate is also the lowest possible. The news shows that our system is currently monitoring are presented by three anchor persons, two male and one female. We built individual speaker
models for these anchors [6]. During the processing of a news show, after speaker sentence clustering, the resulting clusters are compared one by one against the special anchor cluster models to determine which of them belong to one of the news anchors. We achieved a percentage of deletions (clusters not identified as belonging to an anchor) of around 9%, and a percentage of insertions (clusters incorrectly labeled as anchor) of around 2%. These results are very promising, especially due to the very low insertion rate.

3. AUDIMUS Recognition System

AUDIMUS is a hybrid speech recognition system [9, 10] that combines the temporal modeling capabilities of Hidden Markov Models (HMMs) with the pattern discriminative classification capabilities of multilayer perceptrons (MLPs) [11]. In this hybrid HMM/MLP system a Markov process is used to model the basic temporal nature of the speech signal, while the MLP is used as the acoustic model, estimating context-independent posterior phone probabilities given the acoustic data at each frame.

Figure 1: AUDIMUS - Recognition System.

The acoustic modeling of AUDIMUS combines phone probabilities generated by several MLPs trained on distinct feature sets resulting from different feature extraction processes [12]. These probabilities are taken at the output of each MLP classifier and combined using an average in the log-probability domain. The processing stages are represented in Figure 1. All MLPs use the same phone set, constituted by 38 phones for the Portuguese language plus silence and breath noises. The combination algorithm merges together the probabilities associated with the same phone.
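The log-domain combination of stream posteriors can be sketched in a few lines. The hard-coded posterior rows stand in for the outputs of the three stream MLPs; they are illustrative numbers, not system outputs.

```python
import numpy as np

def combine_posteriors(streams):
    """Average per-frame phone posteriors from several feature streams
    in the log-probability domain (a geometric mean) and renormalise
    each frame to sum to one."""
    avg = np.mean([np.log(p) for p in streams], axis=0)  # log-domain average
    p = np.exp(avg)
    return p / p.sum(axis=1, keepdims=True)              # renormalise frames

plp   = np.array([[0.7, 0.2, 0.1]])   # one frame, three phones (invented)
rasta = np.array([[0.6, 0.3, 0.1]])
msg   = np.array([[0.8, 0.1, 0.1]])
print(combine_posteriors([plp, rasta, msg]))
```

Averaging in the log domain means a phone must be supported by all streams to keep a high combined score, which is what makes the combination robust to a single noisy stream.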
We use three different feature extraction methods, and MLPs with the same basic structure: an input layer with 7 context frames, a non-linear hidden layer with over 1000 sigmoidal units, and 40 softmax outputs. The feature extraction methods are PLP, Log-RASTA and MSG [13]. AUDIMUS presently uses a dynamic decoder that builds the search space as the composition of three Weighted Finite-State Transducers (WFSTs) [14, 15, 16, 17].

4. Language modeling

Over the last years we have been collecting Portuguese newspaper texts from the web, which allowed us to build a considerably large text corpus. By the end of 2001 we had texts amounting to a total of 24.0M sentences with 434.4M words. A language model generated only from newspaper texts becomes overly adapted to the type of language used in those texts. When such a language model is used in a continuous speech recognition system applied to a Broadcast News task, it does not perform as well as one would expect, because the sentences spoken in Broadcast News do not match the style of the sentences written in newspapers. A language model created from Broadcast News transcriptions would probably be more adequate for this kind of speech recognition task; the problem is that we do not have enough BN transcriptions to generate a satisfactory language model. However, we can improve the adaptation of a language model to the BN task by combining a model created from the newspaper texts with a model created from BN transcriptions, using linear interpolation [18]. One of the models is always generated from the newspaper text corpus, while the other is a backoff trigram model using absolute discounting, based on the training set transcriptions of our BN database. The optimal interpolation weights are computed using the transcriptions from the evaluation set of our BN database, which are also used to estimate the perplexity of all models.
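The interpolation and weight selection described above can be sketched with toy unigram models; the two probability tables stand in for the newspaper n-gram model and the much smaller BN-transcription model, and all numbers are invented for illustration.

```python
import math

def perplexity(model, text):
    """Perplexity of a word-probability model on a list of words."""
    logp = sum(math.log(model[w]) for w in text)
    return math.exp(-logp / len(text))

def interpolate(lm_a, lm_b, w):
    """Linear interpolation: w * P_a(word) + (1 - w) * P_b(word)."""
    return {k: w * lm_a[k] + (1 - w) * lm_b[k] for k in lm_a}

# Toy unigram models (invented probabilities over a 3-word vocabulary).
newspaper = {"the": 0.5, "market": 0.4, "goal": 0.1}
bn_trans  = {"the": 0.5, "market": 0.1, "goal": 0.4}
dev = ["the", "goal", "the", "market", "goal"]   # BN-style held-out words

# Pick the interpolation weight that minimises held-out perplexity.
best_w = min((w / 10 for w in range(11)),
             key=lambda w: perplexity(interpolate(newspaper, bn_trans, w), dev))
print(best_w)
print(perplexity(interpolate(newspaper, bn_trans, best_w), dev),
      perplexity(newspaper, dev))
```

On BN-style held-out data the interpolated model ends up with a lower perplexity than the newspaper model alone, which is exactly the effect exploited in the system.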
The final interpolated model has a lower perplexity than the newspapers-only model. It is clear that even using a very small model based on BN transcriptions we can obtain some improvement in the perplexity of the interpolated model [19].

5. Vocabulary and pronunciation lexicon

When creating the vocabulary we have to consider that it needs to have a limited size, to avoid expanding the search space considerably during a speech recognition task. Currently we limit our vocabulary size to 64k words. From the text corpus with 335 million words, created from all the newspaper editions collected until the end of 2000, we extracted 427k different words. About 100k of these words occur more than 50 times in the text corpus. Using this smaller set, all the words were classified according to syntactic classes. Different weights were given to each class and a subset with 56k words was created based on the weighted frequencies of occurrence of the words. To
this set we added basically all the new words present in the transcripts of the training data of our Broadcast News database, still under development, giving a total of 57,564 words. Currently the transcripts contain 12,812 different words out of a total of 142,547. From the vocabulary we built the pronunciation lexicon. To obtain the pronunciations we used different lexica available in our lab. For the words not present in those lexica (mostly proper names, foreign names and some verbal forms) we used an automatic grapheme-to-phone system to generate the corresponding pronunciations. Our final lexicon has a total of 65,895 different pronunciations. For the ALERT development test set corpus, which has 5,426 different words in a total of 32,319 words, the number of OOVs using the 57k word vocabulary was 444 words, representing an OOV word rate of 1.4%.

6. Weighted Finite-State Transducer based decoder

The integration of knowledge sources in large vocabulary continuous speech recognition using weighted finite-state transducers (WFSTs) is spreading in the speech recognition community [20, 21]. The approach can be briefly described in the following way: all knowledge sources (such as the lexicon or the language model) are encoded as WFSTs; the knowledge sources are combined using WFST composition into a very large integrated static network, which is then optimized using well-founded algorithms such as determinization, minimization and weight pushing. In this way, the search space is built by the dynamic composition of the HMM/MLP topology transducer, the lexicon transducer and the language model transducer. The flexibility of the approach comes from the uniformity of representation and combination of the knowledge sources, allowing the easy integration of novel sources. The efficiency comes from the optimization algorithms, which make very good use of the sparsity of the integrated network.
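The core composition operation can be illustrated for epsilon-free transducers over the tropical semiring, where arc weights add along a path. This is a toy illustration of WFST composition itself, not the decoder's specialized on-the-fly algorithm, and the tiny "lexicon" and "LM" transducers are invented.

```python
from collections import defaultdict

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers over the tropical
    semiring. Each transducer is a tuple (start, finals, arcs) with arcs
    given as {state: [(in_label, out_label, weight, next_state)]}."""
    start = (t1[0], t2[0])
    arcs, finals = defaultdict(list), set()
    stack, seen = [start], {start}
    while stack:
        q1, q2 = stack.pop()
        if q1 in t1[1] and q2 in t2[1]:
            finals.add((q1, q2))            # final in both components
        for i1, o1, w1, n1 in t1[2].get(q1, []):
            for i2, o2, w2, n2 in t2[2].get(q2, []):
                if o1 != i2:
                    continue                # output of t1 must feed input of t2
                nxt = (n1, n2)
                arcs[(q1, q2)].append((i1, o2, w1 + w2, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return start, finals, dict(arcs)

# Toy "lexicon" mapping phone k to word cat, and a "LM" weighting word cat.
lexicon = (0, {1}, {0: [("k", "cat", 0.5, 1)]})
lm      = (0, {1}, {0: [("cat", "cat", 0.25, 1)]})
start, finals, arcs = compose(lexicon, lm)
print(arcs[(0, 0)])    # phone input, word output, summed weight
```

The integrated network the decoder searches is conceptually this product construction applied to the topology, lexicon and language model transducers, followed by determinization, minimization and weight pushing.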
Furthermore, the decoder [14] uses a specialized algorithm that integrates the linguistic components at run time, computing the minimization, pushing and determinization of the composition between the lexicon and language model transducers [15, 16]. In our BN experiments we used the compact in-memory representation of WFSTs and were able to run in less than 512 MB of RAM.

7. Acoustic models alignment and training

To train the acoustic models of our BN speech recognition system we started collecting broadcast news data in the scope of the ALERT project. The collected BN data was first automatically transcribed using our baseline speech recognition system AUDIMUS, which was originally developed for dictation tasks and was trained on a clean read-speech corpus [9, 10] similar to WSJ0. The automatic transcriptions were then manually corrected. Each time a significant amount of manually verified BN data became available, we realigned and retrained our acoustic models. So far we have made 8 iterations. The first available data, about 5 hours from the ALERT Pilot corpus, was used to bootstrap the BN acoustic models through two iterations of forced alignment and MLP training. The third alignment and training iteration already used some data from the ALERT Speech Recognition corpus, about a full week of news programs; together with the Pilot corpus it consisted of 13 hours of BN data. For the acoustic models trained using alignment number 4, over 22.5 hours of BN data were used. In alignment number 6 we were able to use the complete ALERT Pilot corpus and Speech Recognition corpus, consisting of 45 hours of training material.

8. Speech recognition results

Speech recognition experiments were conducted on the ALERT development test set, which has over 6 hours of BN data. The experiments were run on a Pentium III 1 GHz computer running Linux with 1 GB of RAM. Table 1 summarizes the word error rate (% WER) evaluation obtained by the recognition system.
The lines in Table 1 show the increase in performance from each successive improvement to the recognition system. The first column of results refers to the F0 focus condition, where the sentences contain only prepared speech with low background noise and good quality audio. The second results column comprises the WER obtained on all test sentences, including noise, music, spontaneous speech, telephone speech and non-native accents, as well as the F0 focus condition sentences.

Table 1: BN speech recognition evaluation using the complete development test set (% WER for the F0 condition and for all conditions, and decoding speed as a multiple of real time, xrt, for the successive system configurations described below).
The first line of results in Table 1 was obtained using MLPs with 1000 hidden units and our old stack decoder algorithm. Comparing with the second line of results, we see a significant increase in performance when we switched to the new WFST dynamic decoder, especially in decoding time, expressed in the last column as a multiple of real-time speed. The third line refers to the recognition results after alignment number 6, which was the first to use the complete BN training corpus, totaling 45 hours of speech data. The fourth line shows the improvement obtained by substituting the determinized lexicon transducer with one that was also minimized. The last line shows the 8% relative improvement obtained by increasing the hidden layers to 4000 units. This increase was necessary because the acoustic model MLPs were no longer able to cope with all the variability present in the ALERT BN corpora used as training data.

9. Impact of segmentation in recognition

Our automatic segmentation modules pre-process the audio stream and tag the segments that are fed to the speech recognition system for transcription. Since this automatic segmentation is not perfect, we wanted to evaluate its impact on speech recognition in terms of word errors and processing time. We conducted a series of tests using a 29-minute news program from the ALERT development test set. For this news program the automatic transcription was produced under three different audio pre-processing conditions. In the first, the sentences for transcription and their divisions were chosen using the manual segmentation information; the program was divided into 241 useful sentences. In the second, no pre-processing was done and the whole program was considered a single sentence. In the third, the program was processed by our automatic audio segmentation system, which divided the news show into 366 sentences.
Table 2 summarizes the word error rate results obtained and the processing time, expressed as multiples of real-time speed.

Table 2: Audio segmentation impact on BN speech recognition (number of sentences, % WER, and front-end and decoding times in multiples of real time, xrt, for the manual, no-segmentation and automatic conditions).

The first conclusion drawn from inspecting the results in Table 2 is that our automatic audio segmentation module produced more sentences than it should when compared with the manual segmentation: sentences are shorter and more subdivided. This is mainly due to incorrect sentence divisions at places where there are speaker breath pauses. These subdivided sentences cause language modeling errors near the incorrect boundaries. Without automatic segmentation, that is, when we considered the whole program as a single sentence, the opposite effect occurs, although it is not as problematic, because the language model only introduces article words to connect the different sentences. That is why the WER for this case is closer to that of the manual segmentation. Additionally, from Table 2 we see that without segmentation the decoding is very time consuming because of the dimension of the search space (the whole news program is considered a single sentence). Our automatic audio segmentation takes slightly more processing time than the manual segmentation, as expected, due to the segmentation and classification modules.

10. Concluding remarks

Broadcast News speech recognition is a difficult and very resource-consuming task. Our recognition engine evolved substantially through the accumulation of relatively small improvements. We are still far from ideal recognition results, the ultimate goal; nevertheless, this technology is already able to drive very useful applications, including audio indexing and spoken document retrieval. Our automatic audio segmentation, classification and clustering modules proved fairly accurate while consuming fewer computational resources than other approaches.
Additionally, we developed a scheme for tagging anchor speaker clusters using trained models, which aids the story segmentation module of our document retrieval system. Our recognition system evolved substantially with the availability of large quantities of BN training data. The MLPs with 4000 units trained on all of this BN data permitted a significant decrease in WER. Furthermore, the use of interpolated language models showed that it is possible to improve a language model based on newspaper texts using a small model based on BN transcriptions. The dynamic WFST decoder allowed us to reduce WER and to improve decoding speed from over 30 times to as little as 4 times real time. Our system is currently integrated in a prototype audio indexing and document retrieval system that daily processes the main news show of the national Portuguese broadcaster [2].
11. Acknowledgments

This work was partially funded by the IST-HLT European programme project ALERT and by FCT project POSI/33846/PLP/2000. INESC-ID Lisboa had support from the POSI Program of the Quadro Comunitário de Apoio III. Hugo Meinedo is sponsored by an FCT scholarship (SFRH/BD/6125/2001).

12. References

[1] R. Amaral, T. Langlois, H. Meinedo, J. Neto, N. Souto, and I. Trancoso, "The development of a Portuguese version of a media watch system," in Proc. EUROSPEECH 2001, Aalborg, Denmark, 2001.
[2] J. Neto, H. Meinedo, R. Amaral, and I. Trancoso, "A system for selective dissemination of multimedia information resulting from the ALERT project," in Proc. ISCA MSDR 2003, Hong Kong, China, April 2003.
[3] R. Amaral and I. Trancoso, "Segmentation and indexation of broadcast news," in Proc. ISCA MSDR 2003, Hong Kong, China, April 2003.
[4] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic segmentation, classification and clustering of broadcast news," in Proc. DARPA Speech Recognition Workshop, 1997.
[5] G. Williams and D. Ellis, "Speech/music discrimination based on posterior probability features," in Proc. EUROSPEECH 1999, Budapest, Hungary, September 1999.
[6] H. Meinedo and J. Neto, "Audio segmentation, classification and clustering in a broadcast news task," in Proc. ICASSP 2003, Hong Kong, China, April 2003.
[7] S. Chen and P. S. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in Proc. DARPA Speech Recognition Workshop, 1998.
[8] B. Zhou and J. Hansen, "Unsupervised audio stream segmentation and clustering via the Bayesian information criterion," in Proc. ICSLP 2000, Beijing, China, October 2000.
[9] J. Neto, C. Martins, and L. Almeida, "The development of a speaker independent continuous speech recognizer for Portuguese," in Proc. EUROSPEECH 1997, Rhodes, Greece, 1997.
[10] J. Neto, C. Martins, and L. Almeida, "A large vocabulary continuous speech recognition hybrid system for the Portuguese language," in Proc. ICSLP 1998, Sydney, Australia, 1998.
[11] H. Bourlard and N. Morgan, Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, Massachusetts, USA, 1994.
[12] H. Meinedo and J. Neto, "Combination of acoustic models in continuous speech recognition," in Proc. ICSLP 2000, Beijing, China, 2000.
[13] B. E. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, 1998.
[14] D. Caseiro and I. Trancoso, "A decoder for finite-state structured search spaces," in Proc. ASR 2000 Workshop, Paris, France, September 2000.
[15] D. Caseiro and I. Trancoso, "On integrating the lexicon with the language model," in Proc. EUROSPEECH 2001, Aalborg, Denmark, September 2001.
[16] D. Caseiro and I. Trancoso, "Transducer composition for on-the-fly lexicon and language model integration," in Proc. ASRU 2001 Workshop, Madonna di Campiglio, Trento, Italy, December 2001.
[17] D. Caseiro and I. Trancoso, "A tail-sharing WFST composition for large vocabulary speech recognition," in Proc. ICASSP 2003, Hong Kong, China, April 2003.
[18] R. Rosenfeld, Adaptive Statistical Language Modeling: A Maximum Entropy Approach, Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, April 1994.
[19] H. Meinedo, N. Souto, and J. Neto, "Speech recognition of broadcast news for the European Portuguese language," in Proc. ASRU 2001 Workshop, Madonna di Campiglio, Trento, Italy, December 2001.
[20] M. Mohri, "Finite-state transducers in language and speech processing," Computational Linguistics, vol. 23, no. 2, June 1997.
[21] M. Mohri, F. Pereira, and M. Riley, "Weighted automata in text and speech processing," in Proc. ECAI 96 Workshop, Budapest, Hungary, August 1996.
More informationArtificial Neural Network for Speech Recognition
Artificial Neural Network for Speech Recognition Austin Marshall March 3, 2005 2nd Annual Student Research Showcase Overview Presenting an Artificial Neural Network to recognize and classify speech Spoken
More informationBuilding A Vocabulary Self-Learning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
More informationEFFICIENT HANDLING OF MULTILINGUAL LANGUAGE MODELS
EFFICIENT HANDLING OF MULTILINGUAL LANGUAGE MODELS Christian Fügen, Sebastian Stüker, Hagen Soltau, Florian Metze Interactive Systems Labs University of Karlsruhe Am Fasanengarten 5 76131 Karlsruhe, Germany
More informationThe LENA TM Language Environment Analysis System:
FOUNDATION The LENA TM Language Environment Analysis System: The Interpreted Time Segments (ITS) File Dongxin Xu, Umit Yapanel, Sharmi Gray, & Charles T. Baer LENA Foundation, Boulder, CO LTR-04-2 September
More informationSpeech Transcription
TC-STAR Final Review Meeting Luxembourg, 29 May 2007 Speech Transcription Jean-Luc Gauvain LIMSI TC-STAR Final Review Luxembourg, 29-31 May 2007 1 What Is Speech Recognition? Def: Automatic conversion
More informationEVALUATION OF AUTOMATIC TRANSCRIPTION SYSTEMS FOR THE JUDICIAL DOMAIN
EVALUATION OF AUTOMATIC TRANSCRIPTION SYSTEMS FOR THE JUDICIAL DOMAIN J. Lööf (1), D. Falavigna (2),R.Schlüter (1), D. Giuliani (2), R. Gretter (2),H.Ney (1) (1) Computer Science Department, RWTH Aachen
More informationTED-LIUM: an Automatic Speech Recognition dedicated corpus
TED-LIUM: an Automatic Speech Recognition dedicated corpus Anthony Rousseau, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans, France firstname.lastname@lium.univ-lemans.fr
More informationTranscription System Using Automatic Speech Recognition for the Japanese Parliament (Diet)
Proceedings of the Twenty-Fourth Innovative Appications of Artificial Intelligence Conference Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet) Tatsuya Kawahara
More informationUsing the Amazon Mechanical Turk for Transcription of Spoken Language
Research Showcase @ CMU Computer Science Department School of Computer Science 2010 Using the Amazon Mechanical Turk for Transcription of Spoken Language Matthew R. Marge Satanjeev Banerjee Alexander I.
More informationSENTIMENT EXTRACTION FROM NATURAL AUDIO STREAMS. Lakshmish Kaushik, Abhijeet Sangwan, John H. L. Hansen
SENTIMENT EXTRACTION FROM NATURAL AUDIO STREAMS Lakshmish Kaushik, Abhijeet Sangwan, John H. L. Hansen Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas
More informationExtension of the LECTRA corpus: classroom LECture TRAnscriptions in European Portuguese
Extension of the LECTRA corpus: classroom LECture TRAnscriptions in European Portuguese Thomas Pellegrini (1), Helena Moniz (1,2), Fernando Batista (1,3), Isabel Trancoso (1,4), Ramon Astudillo (1) (1)
More informationABSTRACT 2. SYSTEM OVERVIEW 1. INTRODUCTION. 2.1 Speech Recognition
The CU Communicator: An Architecture for Dialogue Systems 1 Bryan Pellom, Wayne Ward, Sameer Pradhan Center for Spoken Language Research University of Colorado, Boulder Boulder, Colorado 80309-0594, USA
More informationComparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances
Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and
More informationAn Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,
More informationSUCCESSFUL PREDICTION OF HORSE RACING RESULTS USING A NEURAL NETWORK
SUCCESSFUL PREDICTION OF HORSE RACING RESULTS USING A NEURAL NETWORK N M Allinson and D Merritt 1 Introduction This contribution has two main sections. The first discusses some aspects of multilayer perceptrons,
More information2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
More informationMyanmar Continuous Speech Recognition System Based on DTW and HMM
Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-
More informationADVANCES IN ARABIC BROADCAST NEWS TRANSCRIPTION AT RWTH. David Rybach, Stefan Hahn, Christian Gollan, Ralf Schlüter, Hermann Ney
ADVANCES IN ARABIC BROADCAST NEWS TRANSCRIPTION AT RWTH David Rybach, Stefan Hahn, Christian Gollan, Ralf Schlüter, Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department,
More informationCollecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
More informationA System for Searching and Browsing Spoken Communications
A System for Searching and Browsing Spoken Communications Lee Begeja Bernard Renger Murat Saraclar AT&T Labs Research 180 Park Ave Florham Park, NJ 07932 {lee, renger, murat} @research.att.com Abstract
More informationUsing Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents
Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents Michael J. Witbrock and Alexander G. Hauptmann Carnegie Mellon University ABSTRACT Library
More informationEricsson T18s Voice Dialing Simulator
Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of
More informationSample Cities for Multilingual Live Subtitling 2013
Carlo Aliprandi, SAVAS Dissemination Manager Live Subtitling 2013 Barcelona, 03.12.2013 1 SAVAS - Rationale SAVAS is a FP7 project co-funded by the EU 2 years project: 2012-2014. 3 R&D companies and 5
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationSecure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Anastasis Kounoudes 1, Anixi Antonakoudi 1, Vasilis Kekatos 2 1 The Philips College, Computing and Information Systems
More informationMixed-Source Multi-Document Speech-to-Text Summarization
Mixed-Source Multi-Document Speech-to-Text Summarization Ricardo Ribeiro INESC ID Lisboa/ISCTE/IST Spoken Language Systems Lab Rua Alves Redol, 9 1000-029 Lisboa, Portugal rdmr@l2f.inesc-id.pt David Martins
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationMIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
More informationIEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL.?, NO.?, MONTH 2009 1
IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL.?, NO.?, MONTH 2009 1 Data balancing for efficient training of Hybrid ANN/HMM Automatic Speech Recognition systems Ana Isabel García-Moral,
More informationHow To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3
Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web. By C.Moreno, A. Antolin and F.Diaz-de-Maria. Summary By Maheshwar Jayaraman 1 1. Introduction Voice Over IP is
More informationLEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION Ivan Himawan 1, Petr Motlicek 1, David Imseng 1, Blaise Potard 1, Namhoon Kim 2, Jaewon
More informationIMPLEMENTING SRI S PASHTO SPEECH-TO-SPEECH TRANSLATION SYSTEM ON A SMART PHONE
IMPLEMENTING SRI S PASHTO SPEECH-TO-SPEECH TRANSLATION SYSTEM ON A SMART PHONE Jing Zheng, Arindam Mandal, Xin Lei 1, Michael Frandsen, Necip Fazil Ayan, Dimitra Vergyri, Wen Wang, Murat Akbacak, Kristin
More informationFace Locating and Tracking for Human{Computer Interaction. Carnegie Mellon University. Pittsburgh, PA 15213
Face Locating and Tracking for Human{Computer Interaction Martin Hunke Alex Waibel School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Eective Human-to-Human communication
More informationDeNoiser Plug-In. for USER S MANUAL
DeNoiser Plug-In for USER S MANUAL 2001 Algorithmix All rights reserved Algorithmix DeNoiser User s Manual MT Version 1.1 7/2001 De-NOISER MANUAL CONTENTS INTRODUCTION TO NOISE REMOVAL...2 Encode/Decode
More informationProsodic Phrasing: Machine and Human Evaluation
Prosodic Phrasing: Machine and Human Evaluation M. Céu Viana*, Luís C. Oliveira**, Ana I. Mata***, *CLUL, **INESC-ID/IST, ***FLUL/CLUL Rua Alves Redol 9, 1000 Lisboa, Portugal mcv@clul.ul.pt, lco@inesc-id.pt,
More informationImplementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31
Disclaimer: This document was part of the First European DSP Education and Research Conference. It may have been written by someone whose native language is not English. TI assumes no liability for the
More information31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
More informationKL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION Dong Yu 1, Kaisheng Yao 2, Hang Su 3,4, Gang Li 3, Frank Seide 3 1 Microsoft Research, Redmond,
More informationObjective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,
More informationTHE RWTH ENGLISH LECTURE RECOGNITION SYSTEM
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM Simon Wiesler 1, Kazuki Irie 2,, Zoltán Tüske 1, Ralf Schlüter 1, Hermann Ney 1,2 1 Human Language Technology and Pattern Recognition, Computer Science Department,
More informationTracking Groups of Pedestrians in Video Sequences
Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESC-ID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal
More informationOffline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke
1 Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models Alessandro Vinciarelli, Samy Bengio and Horst Bunke Abstract This paper presents a system for the offline
More informationNovelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
More informationGerman Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings Haojin Yang, Christoph Oehlke, Christoph Meinel Hasso Plattner Institut (HPI), University of Potsdam P.O. Box
More informationEmotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
More informationThe effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
Forensic Science International 146S (2004) S95 S99 www.elsevier.com/locate/forsciint The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications A.
More informationMUSICAL INSTRUMENT FAMILY CLASSIFICATION
MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.
More informationEFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION
EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION By Ramasubramanian Sundaram A Thesis Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the
More informationScaling Shrinkage-Based Language Models
Scaling Shrinkage-Based Language Models Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights, NY 10598 USA {stanchen,mangu,bhuvana,sarikaya,asethy}@us.ibm.com
More informationAutomatic structural metadata identification based on multilayer prosodic information
Proceedings of Disfluency in Spontaneous Speech, DiSS 2013 Automatic structural metadata identification based on multilayer prosodic information Helena Moniz 1,2, Fernando Batista 1,3, Isabel Trancoso
More informationLanguage Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
More informationSpeech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction
: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic, Nagpur Institute Sadar, Nagpur 440001 (INDIA)
More informationChapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
More informationSpeech Recognition in the Informedia TM Digital Video Library: Uses and Limitations
Speech Recognition in the Informedia TM Digital Video Library: Uses and Limitations Alexander G. Hauptmann School of Computer Science, Carnegie Mellon University, Pittsburgh PA, USA Abstract In principle,
More informationAn Arabic Text-To-Speech System Based on Artificial Neural Networks
Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department
More informationA COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS
A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS Caldas, Carlos H. 1 and Soibelman, L. 2 ABSTRACT Information is an important element of project delivery processes.
More informationSemantic Video Annotation by Mining Association Patterns from Visual and Speech Features
Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering
More informationEstonian Large Vocabulary Speech Recognition System for Radiology
Estonian Large Vocabulary Speech Recognition System for Radiology Tanel Alumäe, Einar Meister Institute of Cybernetics Tallinn University of Technology, Estonia October 8, 2010 Alumäe, Meister (TUT, Estonia)
More informationThe Influence of Topic and Domain Specific Words on WER
The Influence of Topic and Domain Specific Words on WER And Can We Get the User in to Correct Them? Sebastian Stüker KIT Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der
More informationTHREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
More informationMovie Classification Using k-means and Hierarchical Clustering
Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India dharak_shah@daiict.ac.in Saheb Motiani
More informationCITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學. Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理
CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學 Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理 Submitted to Department of Electronic Engineering 電 子 工 程 學 系 in Partial Fulfillment
More informationAutomatic Transcription of Czech Language Oral History in the MALACH Project: Resources and Initial Experiments
Automatic Transcription of Czech Language Oral History in the MALACH Project: Resources and Initial Experiments Josef Psutka 1, Pavel Ircing 1, Josef V. Psutka 1, Vlasta Radová 1, William Byrne 2, Jan
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationCHANWOO KIM (BIRTH: APR. 9, 1976) Language Technologies Institute School of Computer Science Aug. 8, 2005 present
CHANWOO KIM (BIRTH: APR. 9, 1976) 2602E NSH Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Phone: +1-412-726-3996 Email: chanwook@cs.cmu.edu RESEARCH INTERESTS Speech recognition system,
More informationCrowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach
Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering
More informationAPPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA
APPLYING MFCC-BASED AUTOMATIC SPEAKER RECOGNITION TO GSM AND FORENSIC DATA Tuija Niemi-Laitinen*, Juhani Saastamoinen**, Tomi Kinnunen**, Pasi Fränti** *Crime Laboratory, NBI, Finland **Dept. of Computer
More informationNumerical Field Extraction in Handwritten Incoming Mail Documents
Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr
More informationLecture 12: An Overview of Speech Recognition
Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated
More informationSpeaker age estimation for elderly speech recognition in European Portuguese
Speaker age estimation for elderly speech recognition in European Portuguese Thomas Pellegrini 1, Vahid Hedayati 2, Isabel Trancoso 2,3 Annika Hämäläinen 4, Miguel Sales Dias 4 1 Université de Toulouse;
More information6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationAn Adaptive Approach to Named Entity Extraction for Meeting Applications
An Adaptive Approach to Named Entity Extraction for Meeting Applications Fei Huang, Alex Waibel Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave. Pittsburgh,PA 15213 fhuang@cs.cmu.edu,
More informationTesting Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
More informationSemantic Data Mining of Short Utterances
IEEE Trans. On Speech & Audio Proc.: Special Issue on Data Mining of Speech, Audio and Dialog July 1, 2005 1 Semantic Data Mining of Short Utterances Lee Begeja 1, Harris Drucker 2, David Gibbon 1, Patrick
More informationRecent advances in Digital Music Processing and Indexing
Recent advances in Digital Music Processing and Indexing Acoustics 08 warm-up TELECOM ParisTech Gaël RICHARD Telecom ParisTech (ENST) www.enst.fr/~grichard/ Content Introduction and Applications Components
More information