EVALITA-ISTC Comparison of Open Source Tools on Clean and Noisy Digits Recognition Tasks
|
|
|
- Louise Wright
- 10 years ago
- Views:
Transcription
1 EVALITA-ISTC Comparison of Open Source Tools on Clean and Noisy Digits Recognition Tasks Piero Cosi Mauro Nicolao ISTITUTO DI SCIENZE E TECNOLOGIE DELLA COGNIZIONE SEDE DI PADOVA - FONETICA E DIALETTOLOGIA Consiglio Nazionale delle Ricerche Via Martiri della libertà, Padova (Italy) {piero.cosi, mauro.nicolao}@pd.istc.cnr.it - www: EVALITA Connected Digits Recognition task Reggio Emilia (Italia), 12 Dicembre 2009 Copyright, 2009 ISTC-SPFD-CNR
2 Introduction Data EVALITA speech tasks: connected digits recognition Training, Development, Test ASR Architectures CSLU Speech Toolkit SONIC Feature extraction PMVDR: Perceptual Minimum Variance Distortionless Response SPHINX Results Concluding Remarks Outline
3 Commissione di Gestione della base dei dati vocali italiani Ministero delle Poste e Telecomunicazioni ForumTAL Coordinamento delle iniziative di ricerca e di sviluppo nel campo del Trattamento Automatico del Linguaggio Ministero delle Comunicazioni 2002 Introduction EVALITA Evaluation campaign of Natural Language Processing tools AI*IA - NLP working group 2007 EVALITA Speech Tasks Connected Digits Recognition Dialogue System Evaluation Speaker Identity Verification (Application & Forensic) AISV 2009
4 Data Sub Set Clean Audio Files Noisy Audio Files Clean Digit Sequences Noisy Digit Sequences Training Development Test [dz E r o] 1 [u n o] 2 [d u e] 3 [t r E] 4 [k w a t r o] 5 [ts i n k w e] 6 [s E I] 7 [s E t e] 8 [O t o] 9 [n O v e] simple grammar [<any> (<digit> [silence]) + <any>] Within the EVALITA framework only the orthographic transcriptions are available so one of our previously-created generalpurpose recognizer [ 9 ] has been used to create the phonetically aligned transcriptions needed from CSLU and SONIC systems to start the training.
5 CSLU Speech Toolkit John-Paul Hosom P.O. Box 91000, Portland Oregon USA CLSU Toolkit:
6 Cosa sono? Language Resources Ricerca di base: Speech Recognition Natural Language Understanding Dialogue Modeling Speech Generation (TTS) Talking Heads Integrazione di Sistema: CSLU Toolkit Dialogue Authoring Tools / Tutorials Documentation / Visualization Tools / Labeling Trasferimento di Tecnologia: High Schools Universities Researchers Enthusiasts Industry download:
7 Il sistema di riconoscimento (classifica questo frame) analisi finestra spettrale contesto j E s Viterbi sil j E yes s n ou neural network classifier time vocabolario, grammatica classifica
8 Standard HMM: Il sistema di riconoscimento Le probabilità degli stati sono stimate da ANN invece che da GMM sistemi ibridi HMM/ANN (cfr. Bourlard, Morgan)
9 Sistemi ibridi HMM/ANN HMM $alv < E <E> E > $pld NN $alv < E <E> E > $pld output nodes hidden nodes input nodes /s Ette/
10 CSLU Speech Toolkit: EVALITA three-layer fully connected feed-forward network trained to estimate, at every frame, the probability of 98 contextdependent phonetic categories; these categories were created by splitting each Acoustic Unit (AU), into one, two, or three parts, depending on the length of the AU and how much the AU was thought to be influenced by co-articulatory effects. Silence and closure are 1-part units, vowels are 3-part units, unvoiced plosive is 1-part right dependent unit, voiced plosive, affricate, fricative, nasal, liquid retroflex and glide are all 2-part units 100 iterations; the best network iteration (baseline network - B) was determined by evaluation on the EVALITA clean and noisy digits development sets respectively after a comparison among the CSLU system driven by different feature types, we have found that 13-coefficient PLP plus 13- coefficient MFCC with CMS computed, once every 10 ms obtained the best score.
11 SONIC University of Colorado, Center for Spoken Language Research
12 ASR: sonic Front-End Lexicon Acoustic Model Language Model Search Adaptation University of Colorado, Center for Spoken Language Research (Bryan PELLOM)
13 ASR Structure Lexicon (phoneme-based) Language Model (n-gram) Feature Extraction (spectral analysis) Speech Text + Optimization (Viterbi Search) Timing + Confidence Acoustic Model (HMM)
14 Feature Extraction: PMVDR Perceptual Minimum Distortionless Response Cepstral Coefficients (Yapanel & Hansen, Eurospeech 2003)
15 MVDR Minimum Variance Distortionless Response Spectral envelopes: LP (solid), MVDR (dashed)
16 HMM the acoustic models consists of decision-tree state-clustered HMMs with associated gamma probability density functions to model statedurations. the acoustic models have a fixed 3-state topology each HMM state can be modelled with variable number of multivariate mixture Gaussian distributions the training process consists of first performing state-based alignment of the training audio followed by an expectation-maximization (EM) step in which decision-tree state-clustered HMMs are estimated acoustic model parameters (means, covariances, and mixture weights) are estimated in the maximum likelihood sense the training process can be iterated between alignment of data and model estimation to gradually achieve adequate parameter estimation
17 Viterbi-based Training of Italian Speech Models SAMPA Example SAMPA Example SAMPA Example SAMPA Example i pini i1 così d dente dz magia e velo e1 mercé g gatto m mano E aspetto E1 caffè f faro n nave a vai a1 bontà s sole J legna o polso o1 Roma S sci nf anfora O cosa O1 però v via ng ingordo u punta U1 più z peso l palo j piume t torre ts pizza L soglia w quando k caldo ts pece r remo p pera b botte dz zero SIL silence
18 SONIC: EVALITA 12 PMVDR cepstral parameters were retained and augmented with normalized log frame energy plus and 39-dimensional feature vector is computed, once every 10 ms the model developed in [ 9 ] was inserted in the first alignment step to provide a good segmentation to start from and a first acousticmodel estimation was computed at the end of further 8 loops of phonetic alignment and acoustic model re-estimation, the final AM is considered well trained [9] Cosi, P., Hosom, J.P. (2000). High Performance General Purpose Phonetic Recognition for Italian, Proceedings of ICSLP 2000, International Conference on Spoken Language Processing, Beijing, China (2000), vol. II, pp
19 Carnegie Mellon University SPHINX School of Computer Science Department of Electrical & Computer Engineering CMU SPHINX
20 SPHINX 3 Lexical model: The lexical or pronunciation model contains pronunciations for all the words of interest to the decoder. Sphinx-3 uses phonetic units to build word pronunciations. Currently, the pronunciation lexicon is almost entirely handcrafted. Acoustic model: Sphinx uses acoustic models based on statistical hidden Markov models (HMMs). The acoustic model is trained from acoustic training data using the Sphinx-3 trainer. The trainer is capable of building acoustic models with a wide range of structures, such as discrete, semi-continuous, or continuous. However, the s3.3 decoder is only capable of handling continuous acoustic models. Language model (LM): Sphinx-3 uses a conventional backoff bigram or trigram language model.
21 SPHINX 3 training is an iterative sequence of alignments and AM-estimations; it starts from an audio segmentation aligned to training-data transcriptions and it estimates a raw first AM from them this is the starting point of the following loops of Baum-Welch probability density functions estimation and transcription alignment; models can be computed either for each phoneme (Contest Independent, CI) or, considering phoneme context (Contest Dependent, CD) SPHINX acoustic models are trained over MFCC + + feature vectors SPHINX-3, is a C-based state-of-the art large-vocabulary continuousmodel ASR, and it is limited to 3 or 5-state left-to-right HMM topologies and to a bigram or trigram language model the decoder is based on the conventional Viterbi search algorithm and beam search heuristics it uses a lexical-tree search structure, too, in order to prune the state transitions
22 SPHINX 3: EVALITA no previously developed AM was applied and a simple uniform segmentation was chosen as starting point after a raw first-am estimation, 4 loops of re-alignment and CI (contest-independent) AM re-estimations were done the last CI trained model was employed to create a minimum-error segmentation and train contest-dependent AMs an all-state (untied) AM was computed, and then 4 loops of CD state-tied segmentation re-estimation were done
23 Results: CSLU Speech Toolkit development WA % IA FA FB1 FB2 FB3 clean AM on clean 99,82 99,75 99,94 99,82 99,75 noisy AM on noisy 90,15 90,93 92,11 91,75 91,49 full AM on clean+noisy 93,86 94,12 94,28 94,28 94,2 test FB1 WA % SA % clean AM on clean 99,10 94,80 noisy AM on noisy 94,00 82,00 full AM on clean + noisy 95,00 87,20
24 Results: SONIC development WA % full AM clean AM noisy AM clean 99,70 99,80 99,70 noisy 94,20 89,90 94,80 clean + noisy 96,71 94,42 97,04 test WA % SA % clean AM on clean 99,60 97,30 noisy AM on noisy 96,30 87,90 full AM on clean + noisy 97,30 90,60
25 Results: SPHINX 3 development WA % full AM clean AM noisy AM clean 99,40 99,40 98,80 noisy 93,30 78,70 92,60 clean + noisy 96,10 88,31 95,43 test WA % SA % clean AM on clean 98,90 94,50 noisy AM on noisy 91,70 72,70 full AM on clean + noisy 95,50 86,00
26 Results test CSLR SONIC SPHINX WA % SA % WA % SA % WA % SA % clean AM on clean 99,10 94,80 99,60 97,30 98,90 94,50 noisy AM on noisy 94,00 82,00 96,30 87,90 91,70 72,70 full AM on clean + noisy 95,00 87,20 97,30 90,60 95,50 86,00
27 Concluding Remarks 3 of the most used open source ASR tools were considered in this work, i.e. CSLU Toolkit, SONIC, and SPHINX, because promising results were obtained in the past on similar digit recognition tasks beyond the fact that finding similarity among the three ASR systems was one of the main difficulties, an homogeneous and unique test framework for comparing different Italian ASR systems was quite possible and effective if 3-gram LM weight is set to 0 and the results produced by the best WA-score configuration were compared for each system CSLU Toolkit is good in recognizing clean digit sequences, but it is not so good at recognizing clean-plus-noisy audio; SONIC is the best system in all situations and we believe this is mainly due to the adoption of the PMVDR features; SPHINX is quite more sensible to AM specialization than other systems and clean models can not recognize noisy speech with high performance. finally we should conclude that the EVALITA Speech campaign was quite effective in forcing various Italian research groups to focus on similar recognition tasks working on common data thus comparing and improving various different recognition methodologies and strategies
28 Future Work we hope more complex tasks and data will be exploited in the future we are looking for CHILDREN Speech evaluation campaign EVALEU Interspeech Satellite Workshop????
29 Thank You!!!... and WELCOME to Florence 2011
Connected Digits Recognition Task: ISTC CNR Comparison of Open Source Tools
Connected Digits Recognition Task: ISTC CNR Comparison of Open Source Tools Piero Cosi, Mauro Nicolao Istituto di Scienze e Tecnologie della Cognizione, C.N.R. via Martiri della libertà, 2 35137 Padova
Towards the Italian CSLU Toolkit
Towards the Italian CSLU Toolkit Piero Cosi *, John-Paul Hosom and Fabio Tesser *** * Istituto di Fonetica e Dialettologia C.N.R. Via G. Anghinoni, 10-35121 Padova (ITALY), e-mail: [email protected]
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction
: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction Urmila Shrawankar Dept. of Information Technology Govt. Polytechnic, Nagpur Institute Sadar, Nagpur 440001 (INDIA)
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics
Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics Anastasis Kounoudes 1, Anixi Antonakoudi 1, Vasilis Kekatos 2 1 The Philips College, Computing and Information Systems
Turkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey [email protected], [email protected]
Hardware Implementation of Probabilistic State Machine for Word Recognition
IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2
Artificial Neural Network for Speech Recognition
Artificial Neural Network for Speech Recognition Austin Marshall March 3, 2005 2nd Annual Student Research Showcase Overview Presenting an Artificial Neural Network to recognize and classify speech Spoken
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language Hugo Meinedo, Diamantino Caseiro, João Neto, and Isabel Trancoso L 2 F Spoken Language Systems Lab INESC-ID
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus Yousef Ajami Alotaibi 1, Mansour Alghamdi 2, and Fahad Alotaiby 3 1 Computer Engineering Department, King Saud University,
Myanmar Continuous Speech Recognition System Based on DTW and HMM
Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-
Automatic slide assignation for language model adaptation
Automatic slide assignation for language model adaptation Applications of Computational Linguistics Adrià Agustí Martínez Villaronga May 23, 2013 1 Introduction Online multimedia repositories are rapidly
Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN
PAGE 30 Membering T M : A Conference Call Service with Speaker-Independent Name Dialing on AIN Sung-Joon Park, Kyung-Ae Jang, Jae-In Kim, Myoung-Wan Koo, Chu-Shik Jhon Service Development Laboratory, KT,
Lecture 12: An Overview of Speech Recognition
Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated
ADAPTIVE AND DISCRIMINATIVE MODELING FOR IMPROVED MISPRONUNCIATION DETECTION. Horacio Franco, Luciana Ferrer, and Harry Bratt
ADAPTIVE AND DISCRIMINATIVE MODELING FOR IMPROVED MISPRONUNCIATION DETECTION Horacio Franco, Luciana Ferrer, and Harry Bratt Speech Technology and Research Laboratory, SRI International, Menlo Park, CA
Ericsson T18s Voice Dialing Simulator
Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of
Evaluating grapheme-to-phoneme converters in automatic speech recognition context
Evaluating grapheme-to-phoneme converters in automatic speech recognition context Denis Jouvet, Dominique Fohr, Irina Illina To cite this version: Denis Jouvet, Dominique Fohr, Irina Illina. Evaluating
Automatic Evaluation Software for Contact Centre Agents voice Handling Performance
International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Automatic Evaluation Software for Contact Centre Agents voice Handling Performance K.K.A. Nipuni N. Perera,
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium [email protected]
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson Talk held by Goran Doychev Selected Topics in Information Security and
Advanced Signal Processing and Digital Noise Reduction
Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings
German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings Haojin Yang, Christoph Oehlke, Christoph Meinel Hasso Plattner Institut (HPI), University of Potsdam P.O. Box
9RLFH$FWLYDWHG,QIRUPDWLRQ(QWU\7HFKQLFDO$VSHFWV
Université de Technologie de Compiègne UTC +(8',$6
SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA
SPEAKER IDENTIFICATION FROM YOUTUBE OBTAINED DATA Nitesh Kumar Chaudhary 1 and Shraddha Srivastav 2 1 Department of Electronics & Communication Engineering, LNMIIT, Jaipur, India 2 Bharti School Of Telecommunication,
Robust Methods for Automatic Transcription and Alignment of Speech Signals
Robust Methods for Automatic Transcription and Alignment of Speech Signals Leif Grönqvist ([email protected]) Course in Speech Recognition January 2. 2004 Contents Contents 1 1 Introduction 2 2 Background
TED-LIUM: an Automatic Speech Recognition dedicated corpus
TED-LIUM: an Automatic Speech Recognition dedicated corpus Anthony Rousseau, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans, France [email protected]
Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition
, Lisbon Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition Wolfgang Macherey Lars Haferkamp Ralf Schlüter Hermann Ney Human Language Technology
Online Diarization of Telephone Conversations
Odyssey 2 The Speaker and Language Recognition Workshop 28 June July 2, Brno, Czech Republic Online Diarization of Telephone Conversations Oshry Ben-Harush, Itshak Lapidot, Hugo Guterman Department of
ABSTRACT 2. SYSTEM OVERVIEW 1. INTRODUCTION. 2.1 Speech Recognition
The CU Communicator: An Architecture for Dialogue Systems 1 Bryan Pellom, Wayne Ward, Sameer Pradhan Center for Spoken Language Research University of Colorado, Boulder Boulder, Colorado 80309-0594, USA
The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
Forensic Science International 146S (2004) S95 S99 www.elsevier.com/locate/forsciint The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications A.
Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke
1 Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models Alessandro Vinciarelli, Samy Bengio and Horst Bunke Abstract This paper presents a system for the offline
Developing an Isolated Word Recognition System in MATLAB
MATLAB Digest Developing an Isolated Word Recognition System in MATLAB By Daryl Ning Speech-recognition technology is embedded in voice-activated routing systems at customer call centres, voice dialling
Creation of a Speech to Text System for Kiswahili
Creation of a Speech to Text System for Kiswahili Dr. Katherine Getao and Evans Miriti University of Nairobi Abstract The creation of a speech to text system for any language is an onerous task. This is
CALL-TYPE CLASSIFICATION AND UNSUPERVISED TRAINING FOR THE CALL CENTER DOMAIN
CALL-TYPE CLASSIFICATION AND UNSUPERVISED TRAINING FOR THE CALL CENTER DOMAIN Min Tang, Bryan Pellom, Kadri Hacioglu Center for Spoken Language Research University of Colorado at Boulder Boulder, Colorado
Information Leakage in Encrypted Network Traffic
Information Leakage in Encrypted Network Traffic Attacks and Countermeasures Scott Coull RedJack Joint work with: Charles Wright (MIT LL) Lucas Ballard (Google) Fabian Monrose (UNC) Gerald Masson (JHU)
Modeling Cantonese Pronunciation Variations for Large-Vocabulary Continuous Speech Recognition
Computational Linguistics and Chinese Language Processing Vol. 11, No. 1, March 2006, pp. 17-36 17 The Association for Computational Linguistics and Chinese Language Processing Modeling Cantonese Pronunciation
Establishing the Uniqueness of the Human Voice for Security Applications
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM
THE RWTH ENGLISH LECTURE RECOGNITION SYSTEM Simon Wiesler 1, Kazuki Irie 2,, Zoltán Tüske 1, Ralf Schlüter 1, Hermann Ney 1,2 1 Human Language Technology and Pattern Recognition, Computer Science Department,
Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University
Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP
DynaSpeak: SRI s Scalable Speech Recognizer for Embedded and Mobile Systems
DynaSpeak: SRI s Scalable Speech Recognizer for Embedded and Mobile Systems Horacio Franco, Jing Zheng, John Butzberger, Federico Cesari, Michael Frandsen, Jim Arnold, Venkata Ramana Rao Gadde, Andreas
Experiments with Signal-Driven Symbolic Prosody for Statistical Parametric Speech Synthesis
Experiments with Signal-Driven Symbolic Prosody for Statistical Parametric Speech Synthesis Fabio Tesser, Giacomo Sommavilla, Giulio Paci, Piero Cosi Institute of Cognitive Sciences and Technologies, National
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane Carnegie Mellon University Language Technology Institute {ankurgan,fmetze,ahw,lane}@cs.cmu.edu
Weighting and Normalisation of Synchronous HMMs for Audio-Visual Speech Recognition
ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing 27 (AVSP27) Hilvarenbeek, The Netherlands August 31 - September 3, 27 Weighting and Normalisation of Synchronous HMMs for
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
MUSICAL INSTRUMENT FAMILY CLASSIFICATION
MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.
Emotion in Speech: towards an integration of linguistic, paralinguistic and psychological analysis
Emotion in Speech: towards an integration of linguistic, paralinguistic and psychological analysis S-E.Fotinea 1, S.Bakamidis 1, T.Athanaselis 1, I.Dologlou 1, G.Carayannis 1, R.Cowie 2, E.Douglas-Cowie
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine George E. Dahl, Marc Aurelio Ranzato, Abdel-rahman Mohamed, and Geoffrey Hinton Department of Computer Science University of Toronto
Improving Automatic Forced Alignment for Dysarthric Speech Transcription
Improving Automatic Forced Alignment for Dysarthric Speech Transcription Yu Ting Yeung 2, Ka Ho Wong 1, Helen Meng 1,2 1 Human-Computer Communications Laboratory, Department of Systems Engineering and
Slovak Automatic Transcription and Dictation System for the Judicial Domain
Slovak Automatic Transcription and Dictation System for the Judicial Domain Milan Rusko 1, Jozef Juhár 2, Marian Trnka 1, Ján Staš 2, Sakhia Darjaa 1, Daniel Hládek 2, Miloš Cerňak 1, Marek Papco 2, Róbert
Solutions to Exam in Speech Signal Processing EN2300
Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.
THE BAVIECA OPEN-SOURCE SPEECH RECOGNITION TOOLKIT. Daniel Bolaños. Boulder Language Technologies (BLT), Boulder, CO, 80301 USA
THE BAVIECA OPEN-SOURCE SPEECH RECOGNITION TOOLKIT Daniel Bolaños Boulder Language Technologies (BLT), Boulder, CO, 80301 USA [email protected] ABSTRACT This article describes the design of Bavieca, an opensource
Dimension Reduction. Wei-Ta Chu 2014/10/22. Multimedia Content Analysis, CSIE, CCU
1 Dimension Reduction Wei-Ta Chu 2014/10/22 2 1.1 Principal Component Analysis (PCA) Widely used in dimensionality reduction, lossy data compression, feature extraction, and data visualization Also known
CONATION: English Command Input/Output System for Computers
CONATION: English Command Input/Output System for Computers Kamlesh Sharma* and Dr. T. V. Prasad** * Research Scholar, ** Professor & Head Dept. of Comp. Sc. & Engg., Lingaya s University, Faridabad, India
Developing speech recognition software for command-and-control applications
Developing speech recognition software for command-and-control applications Author: Contact: Ivan A. Uemlianin [email protected] Contents Introduction Workflow Set up the project infrastructure
Using Segmentation Constraints in an Implicit Segmentation Scheme for Online Word Recognition
1 Using Segmentation Constraints in an Implicit Segmentation Scheme for Online Word Recognition Émilie CAILLAULT LASL, ULCO/University of Littoral (Calais) Christian VIARD-GAUDIN IRCCyN, University of
NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models
NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Part of Speech (POS) Tagging Given a sentence X, predict its
Evaluation of speech technologies
CLARA Training course on evaluation of Human Language Technologies Evaluations and Language resources Distribution Agency November 27, 2012 Evaluation of speaker identification Speech technologies Outline
THE goal of Speaker Diarization is to segment audio
1 The ICSI RT-09 Speaker Diarization System Gerald Friedland* Member IEEE, Adam Janin, David Imseng Student Member IEEE, Xavier Anguera Member IEEE, Luke Gottlieb, Marijn Huijbregts, Mary Tai Knox, Oriol
CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content
CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content Martine Garnier-Rizet 1, Gilles Adda 2, Frederik Cailliau
Editorial Manager(tm) for Journal of the Brazilian Computer Society Manuscript Draft
Editorial Manager(tm) for Journal of the Brazilian Computer Society Manuscript Draft Manuscript Number: JBCS54R1 Title: Free Tools and Resources for Brazilian Portuguese Speech Recognition Article Type:
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania [email protected]
HIDDEN MARKOV MODELS FOR SPECTRAL SIMILARITY OF SONGS. Arthur Flexer, Elias Pampalk, Gerhard Widmer
Proc. of the th Int. Conference on Digital Audio Effects (DAFx 5), Madrid, Spain, September 2-22, 25 HIDDEN MARKOV MODELS FOR SPECTRAL SIMILARITY OF SONGS Arthur Flexer, Elias Pampalk, Gerhard Widmer The
Automated Content Analysis of Discussion Transcripts
Automated Content Analysis of Discussion Transcripts Vitomir Kovanović [email protected] Dragan Gašević [email protected] School of Informatics, University of Edinburgh Edinburgh, United Kingdom [email protected]
An Arabic Text-To-Speech System Based on Artificial Neural Networks
Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department
Hidden Markov Model Toolkit (HTK) - Quick Start by David Philippou-Hübner
Hidden Markov Model Toolkit (HTK) - Quick Start by David Philippou-Hübner This 5 page step-by-step tutorial shows how to build a (minimal) automated speech recognizer using HTK. It is a compact version
Machine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION
LEARNING FEATURE MAPPING USING DEEP NEURAL NETWORK BOTTLENECK FEATURES FOR DISTANT LARGE VOCABULARY SPEECH RECOGNITION Ivan Himawan 1, Petr Motlicek 1, David Imseng 1, Blaise Potard 1, Namhoon Kim 2, Jaewon
Implementation of the Discrete Hidden Markov Model in Max / MSP Environment
Implementation of the Discrete Hidden Markov Model in Max / MSP Environment Paul Kolesnik and Marcelo M. Wanderley Sound Processing and Control Lab, Centre for Interdisciplinary Research in Music Media
Direct Loss Minimization for Structured Prediction
Direct Loss Minimization for Structured Prediction David McAllester TTI-Chicago [email protected] Tamir Hazan TTI-Chicago [email protected] Joseph Keshet TTI-Chicago [email protected] Abstract In discriminative
FRANCESCO BELLOCCHIO S CURRICULUM VITAE ET STUDIORUM
FRANCESCO BELLOCCHIO S CURRICULUM VITAE ET STUDIORUM April 2011 Index Personal details and education 1 Research activities 2 Teaching and tutorial activities 3 Conference organization and review activities
SPEECH DATA MINING, SPEECH ANALYTICS, VOICE BIOMETRY. www.phonexia.com, 1/41
SPEECH DATA MINING, SPEECH ANALYTICS, VOICE BIOMETRY www.phonexia.com, 1/41 OVERVIEW How to move speech technology from research labs to the market? What are the current challenges is speech recognition
A HAND-HELD SPEECH-TO-SPEECH TRANSLATION SYSTEM. Bowen Zhou, Yuqing Gao, Jeffrey Sorensen, Daniel Déchelotte and Michael Picheny
A HAND-HELD SPEECH-TO-SPEECH TRANSLATION SYSTEM Bowen Zhou, Yuqing Gao, Jeffrey Sorensen, Daniel Déchelotte and Michael Picheny IBM T. J. Watson Research Center Yorktown Heights, New York 10598 zhou, yuqing,
Emotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION
KL-DIVERGENCE REGULARIZED DEEP NEURAL NETWORK ADAPTATION FOR IMPROVED LARGE VOCABULARY SPEECH RECOGNITION Dong Yu 1, Kaisheng Yao 2, Hang Su 3,4, Gang Li 3, Frank Seide 3 1 Microsoft Research, Redmond,
