YLAB@RU at Spoken Term Detection Task in NTCIR-10 SpokenDoc-2
Iori Sakamoto
Masanori Morise (Dr. Morise currently works for Yamanashi University.)
Kook Cho (cho@slp.is)
Yoichi Yamashita (yama@media)

ABSTRACT
The development of spoken term detection (STD) techniques, which detect a given word or phrase in spoken documents, is widely conducted in order to realize easy access to the large amount of multimedia content that includes speech. This paper describes an improvement of the STD method that is based on vector quantization (VQ) and was proposed at NTCIR-9 SpokenDoc. Spoken documents are represented as sequences of VQ codes, and they are matched with the text query to be detected based on the V-P score, which measures the relationship between a VQ code and a phoneme. The matching score between VQ codes and phonemes is calculated after normalization for each phoneme in a query term, to avoid scoring that is biased toward particular phonemes. The score normalization improves the STD performance by 6% in F-measure.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
spoken term detection, out-of-vocabulary, vector quantization, score normalization

Team Name
YLAB

Subtasks / Languages
Spoken Term Detection / Japanese

External Resources Used
CSJ (Corpus of Spontaneous Japanese)

1. INTRODUCTION
The rapid increase of multimedia content that includes spoken messages is raising the need for information retrieval over speech, to facilitate access to the spoken documents that we want. Spoken term detection (STD) is an information retrieval task for speech data that finds words or phrases matching a given query term [2]. STD can be accomplished by combining two techniques, automatic speech recognition (ASR) and text search. This simple approach is not sufficient, however, because ASR cannot recognize speech data completely without recognition errors and cannot recognize out-of-vocabulary (OOV) words, which are not contained in the ASR dictionary. One of the important issues in STD is how to detect OOV words [8].

Several methods have been proposed to avoid the OOV word problem. One promising approach is speech recognition based on sub-word units, such as phonemes and syllables [5, 11]. Speech segments are detected by matching two sub-word sequences, one obtained from the input text query and one from speech recognition. In this approach, the speech recognition converts spoken documents into sub-word sequences in a vocabulary-free manner and avoids the OOV word problem, because the set of sub-word units covers all words and sentences. Theoretically, any word can be recognized correctly. However, the accuracy of speech recognition based on sub-word units is lower than that of word-based speech recognition. Another approach is word spotting, a technique widely studied in the 1980's [4, 6, 7, 13] in order to realize speech understanding based on keyword recognition rather than information retrieval. Word spotting based on Hidden Markov Models (HMMs) detects speech segments similar to a query term by calculating the likelihood of the sequence of acoustic parameters of the speech segment.
Word spotting takes much time to calculate the likelihood, although its accuracy is high. This characteristic makes word spotting inappropriate for an on-line STD task over a large database of spoken documents.

The representation scheme of spoken documents is a crucial issue for accurate STD. The sub-word is one symbolic representation of spoken documents; some acoustic properties may be lost in the conversion from an acoustic parameter sequence into sub-word symbols. On the other hand, the acoustic parameters retain the full acoustic properties of the speech, but they consume much time in the STD task. The authors have proposed an STD method based on vector quantization (VQ), which uses the sequence of VQ codes as an alternative representation scheme of spoken documents for STD [12, 10]. A query term is detected by matching the VQ code sequence against the input text query, using a co-occurrence score between phonemes and VQ codes that is defined for each speaker. This paper describes a normalization method for the matching score that evaluates the phonemes in the query equally, in order to reject candidates with an unnatural structure of phoneme durations.

2. STD METHOD BASED ON VECTOR QUANTIZATION
The overview of the STD method based on VQ is shown in Figure 1. The VQ process converts the spoken documents in the database into sequences of VQ codes with a clustering technique applied for each speaker. Automatic speech recognition (ASR) also converts the spoken documents into sequences of phonemes with time alignment information. The V-P score, which is defined as a co-occurrence score of a phoneme for a VQ code, is trained for each speaker-dependent VQ codebook. The continuous DTW matching technique compares the VQ code sequences of the spoken documents with the phoneme sequence of the input text query and detects segments using threshold logic.

Figure 1: The overview of the STD method based on Vector Quantization. (Spoken documents are converted by ASR into phoneme sequences and by vector quantization into VQ code sequences; the V-P score links the two, and DTW matching against the phoneme sequence of the text query yields candidates of detection and the detected segments.)

2.1 Vector Quantization of Acoustic Parameters
Spoken documents are analyzed with 20 [ms] frames at 10 [ms] intervals, and 12-dimensional MFCC parameters are obtained for each frame. The feature vector of a frame consists of 60 parameters, including the 12 MFCCs of the two preceding and two following frames as well as those of the current frame. The VQ process converts the 60-dimensional parameter vector into a VQ code frame by frame.

2.2 V-P score: co-occurrence score of a phoneme for a VQ code
The V-P score, the co-occurrence score of a phoneme for a VQ code, is trained beforehand in order to compare the VQ code sequences of spoken documents with the input query. The V-P score s(v, p) of a phoneme p for a VQ code v is defined based on the occurrence count of the phoneme for the VQ code, as

    s(v, p) = log( C_v(p) / N_v ) - log( C_v(p_best) / N_v ),        (1)

where C_v(p) is the number of frames that are labeled with the phoneme p and quantized into the VQ code v, N_v is the total number of frames quantized into v, and p_best is the phoneme that appears most often for v.
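As a concrete illustration of Sections 2.1 and 2.2, the following Python sketch stacks MFCC context frames into 60-dimensional vectors, builds a speaker-dependent codebook, and estimates the V-P score of equation (1) from frame-level co-occurrence counts. The use of scikit-learn's KMeans as the clustering technique, the count flooring, and all function and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def stack_context(mfcc, left=2, right=2):
    """Stack the 12-dim MFCCs of the two preceding and two following frames
    with the current frame, giving 60-dimensional feature vectors."""
    n_frames = mfcc.shape[0]
    padded = np.pad(mfcc, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n_frames] for i in range(left + right + 1)])

def train_codebook(features, codebook_size=1024, seed=0):
    """Speaker-dependent VQ codebook; k-means is assumed as the clustering step."""
    return KMeans(n_clusters=codebook_size, random_state=seed, n_init=4).fit(features)

def train_vp_score(vq_codes, phoneme_labels, codebook_size, n_phonemes, floor=1e-6):
    """V-P score of eq. (1): s(v, p) = log(C_v(p)/N_v) - log(C_v(p_best)/N_v)."""
    counts = np.full((codebook_size, n_phonemes), floor)
    for v, p in zip(vq_codes, phoneme_labels):        # frame-level co-occurrence counts
        counts[v, p] += 1.0
    log_rel = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_rel - log_rel.max(axis=1, keepdims=True)   # 0 for p_best, negative otherwise
```

Here `vq_codes` and `phoneme_labels` would be the per-frame VQ code indices and time-aligned phoneme indices of one speaker's documents, so the returned matrix realizes equation (1) up to the count flooring.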
2.3 Continuous DTW matching
The continuous DTW matching compares the VQ code sequences with the phoneme sequence of the input query. Figure 2 shows a sample of DTW matching and the DP path. The V-P score is used as the local score between a VQ code and a phoneme.

Figure 2: A sample of DTW matching. (The phoneme sequence of the query is placed along one axis and the VQ code sequence along the other; the DP path connects the matched frame-phoneme pairs.)

Let p_j (1 <= j <= K) be the phoneme sequence of the input query and K the number of phonemes in the query. Let v_i (1 <= i <= L) be a VQ code sequence and L the number of VQ codes in a spoken document. The maximum accumulated score S_{i,j} at frame i for the input query is calculated as follows.

    1) S_{0,j} = 0.0   (1 <= j <= K)                                             (2)
    2) repeat 3), 4), 5) for i = 1, 2, ..., L
    3)     S_{i,0} = 0.0
    4)     repeat 5) for j = 1, 2, ..., K
    5)     S_{i,j} = S_{i-1,j-1} + s(v_i, p_j)   ( S_{i-1,j-1} >  S_{i-1,j} )
           S_{i,j} = S_{i-1,j}   + s(v_i, p_j)   ( S_{i-1,j-1} <= S_{i-1,j} )    (3)

The duration-normalized accumulated intermediate score S̄_{i,j} is defined as

    S̄_{i,j} = S_{i,j} / ( i - start(i, j) + 1 ),                                (4)

where start(i, j) indicates the starting frame, determined by backward tracking of the path that matches the i-th VQ code with the j-th phoneme. The continuous DTW matching calculates the duration-normalized accumulated score S̄(i) (= S̄_{i,K}). If S̄(i) is a local maximum and is larger than a threshold, the segment from frame start(i, K) to frame i becomes a candidate of detection.
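A minimal sketch of the continuous DTW matching of equations (2)-(4), assuming `vq_codes` and `query_phonemes` are integer index sequences and `vp_score` is the matrix sketched above; tracking the start frame forward through the recursion is one possible realization of the backward tracking described in the text.

```python
import numpy as np

def continuous_dtw(vq_codes, query_phonemes, vp_score):
    """Continuous DTW of eqs. (2)-(4): for every end frame i, return the
    duration-normalized score and the start frame of the best path."""
    L, K = len(vq_codes), len(query_phonemes)
    S = np.zeros((L + 1, K + 1))                 # S[0, j] = S[i, 0] = 0, eq. (2)
    start = np.zeros((L + 1, K + 1), dtype=int)
    start[:, 0] = np.arange(1, L + 2)            # a path leaving column 0 at row i starts at frame i + 1
    norm_score = np.full(L + 1, -np.inf)
    starts = np.zeros(L + 1, dtype=int)
    for i in range(1, L + 1):                    # frames of the spoken document
        for j in range(1, K + 1):                # phonemes of the query
            local = vp_score[vq_codes[i - 1], query_phonemes[j - 1]]
            if S[i - 1, j - 1] > S[i - 1, j]:    # eq. (3): pick the better predecessor
                S[i, j] = S[i - 1, j - 1] + local
                start[i, j] = start[i - 1, j - 1]
            else:
                S[i, j] = S[i - 1, j] + local
                start[i, j] = start[i - 1, j]
        norm_score[i] = S[i, K] / (i - start[i, K] + 1)   # eq. (4) at j = K
        starts[i] = start[i, K]
    return norm_score, starts
```

Candidate segments would then be the ranges [starts[i], i] for which norm_score[i] is a local maximum above a threshold.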
2.4 Re-evaluation of Candidates Considering Duration Structure
Preliminary experiments uncovered that term detection using only a threshold on S̄(i) generates many false detections, including cases where (1) too few frames match a phoneme in the term, or (2) a few phonemes in the term occupy most of the detected segment. Speech segment candidates detected by the threshold logic on the matching score S̄(i) are therefore re-evaluated in terms of the durations of the phonemes. First, a candidate segment is removed if the frame length of a phoneme in it is lower than a threshold; the threshold is set for each phoneme based on the average frame length of that phoneme. Secondly, the distortion of phoneme duration in a candidate segment is calculated, and candidates are re-evaluated by a unified score that incorporates both the matching score and the naturalness of the phoneme duration structure.

2.4.1 Distortion of Phoneme Duration
The distortion of phoneme duration, D(i), is defined as

    D(i) = (1/K) Σ_{j=1}^{K} ( d_l(p_j)/L_l - d_d(p_j)/L_d )^2,        (5)

where d_l(p) is the average frame length of a phoneme p obtained from the speech recognition results of the same speaker, and d_d(p) is the frame length of the phoneme p in the detected segment. The estimated total duration L_l of the term is given by

    L_l = Σ_{j=1}^{K} d_l(p_j).        (6)

The actual total duration L_d of the detected segment is given by

    L_d = Σ_{j=1}^{K} d_d(p_j).        (7)

2.4.2 Unified Score
The unified score evaluates the candidates for detection in terms of both the matching and the naturalness of the duration structure. The matching score S̄(i) is already normalized by the total duration of the candidate segment, as shown in equation (4), and is again normalized based on the statistical distribution of the matching scores over all candidate segments. The statistically normalized matching score PS(i) is calculated as follows:

    PS(i) = ( S̄(i) - μ_S ) / σ_S,        (8)

    μ_S = (1/N) Σ_{i=1}^{N} S̄(i),        (9)

    σ_S = sqrt( (1/N) Σ_{i=1}^{N} ( S̄(i) - μ_S )^2 ),        (10)

where N is the number of candidate segments. For the distortion of phoneme duration D(i), the statistically normalized duration distortion PD(i) is calculated in the same manner. The unified score that evaluates the candidates is defined as

    P(i) = PS(i) - PD(i).        (11)

The final decision of detection is made by threshold logic on the unified score P(i).
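The duration-based re-evaluation of equations (5)-(11) might be sketched as follows, assuming that the average and observed frame lengths of each query phoneme are already available for every candidate; the helper names are hypothetical.

```python
import numpy as np

def duration_distortion(avg_len, seg_len):
    """Eq. (5): mean squared deviation between the average phoneme durations
    d_l(p_j) and the observed durations d_d(p_j), each normalized by the
    total durations L_l and L_d of eqs. (6) and (7)."""
    avg_len = np.asarray(avg_len, dtype=float)
    seg_len = np.asarray(seg_len, dtype=float)
    return float(np.mean((avg_len / avg_len.sum() - seg_len / seg_len.sum()) ** 2))

def z_normalize(x):
    """Eqs. (8)-(10): statistical normalization over all N candidate segments."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def unified_scores(match_scores, distortions):
    """Eq. (11): combine the normalized matching score and duration distortion."""
    return z_normalize(match_scores) - z_normalize(distortions)
```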
3. PROPOSED METHOD: INTRA-PHONEME NORMALIZATION
The STD method described in Section 2 uses vector quantization and calculates the matching score between VQ codes and phonemes. In usual STD methods, an HMM-based decoder converts speech into a sequence of phonemes or words while considering the time-variable properties of speech. The VQ process, in contrast, independently converts the feature vector of acoustic parameters into a VQ code frame by frame. A sequence of VQ codes for a spoken document therefore has only weak constraints on the time structure, although each feature vector covers a segment of 5 frames, and our STD method is more likely to generate false detections with an unnatural time structure.

In order to reduce such false detections, we introduced the additional metric measuring the unnaturalness of the duration structure of candidates, which is mentioned in 2.4.1. However, our method still detects some false segments with an unnatural duration structure. In this paper, we propose a new way of calculating the matching score for a further reduction of false detections. The new matching score is defined as

    N(i) = (1/K) Σ_{j=1}^{K} (1/n_j) Σ_{m=m_j}^{m_j+n_j-1} s(v_m, p_j),        (12)

where n_j is the number of frames matching the j-th phoneme of the query term, and m_j is the starting frame of the segment of the j-th phoneme. This equation performs an intra-phoneme normalization by averaging the scores within a phoneme, so that no particular phoneme has too large an effect on the total score. The new STD method uses this score for calculating the unified score of 2.4.2, replacing S̄(i) by N(i) and statistically normalizing N(i) in the same manner to obtain PN(i). The new final score is defined as

    P(i) = PN(i) - PD(i).        (13)
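A sketch of the intra-phoneme normalization of equation (12); it assumes that back-tracking the DTW path has already produced, for each query phoneme, the start frame m_j and length n_j of its matched frames (passed here as `seg_starts` and `seg_lens`, both hypothetical names).

```python
import numpy as np

def intra_phoneme_score(vq_codes, query_phonemes, vp_score, seg_starts, seg_lens):
    """Eq. (12): average the V-P scores within each phoneme's matched frames,
    then average the per-phoneme means over the K phonemes of the query."""
    per_phoneme = []
    for p, m, n in zip(query_phonemes, seg_starts, seg_lens):
        frames = vq_codes[m:m + n]                      # frames m .. m + n - 1
        per_phoneme.append(np.mean([vp_score[v, p] for v in frames]))
    return float(np.mean(per_phoneme))
```

N(i) then takes the place of S̄(i) in the statistical normalization of 2.4.2 to give PN(i).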
4. EVALUATION
4.1 Experiment Setup
The proposed method was evaluated on 177 spoken lectures in the CORE set of the Corpus of Spontaneous Japanese (CSJ) [9]. Each lecture in the CSJ is divided into segments, called Inter-Pausal Units (IPUs), by pauses that are no shorter than 200 [ms]. The IPUs detected by the proposed methods are judged as to whether they include a specified query term or not, with the same measure as the formal run of the STD task [2]. The query terms are the 50 terms in the STD test collection proposed by the Spoken Document Processing Working Group [1]. The size of the VQ codebook is 1024 for all speakers. Our VQ-based STD method needs phoneme label information for the spoken documents in order to define the V-P score. We used the 1-best results of continuous phoneme recognition provided by the task organizer of NTCIR-9 SpokenDoc [3]. The evaluation measures are F-measure and mean average precision (MAP).

4.2 Evaluation Results
Table 1 and Figure 3 compare the evaluation results of several STD methods. The baseline STD is based on matching between two phoneme sequences, the phoneme sequence of the input query and the phoneme sequences recognized by ASR for the spoken documents, using the edit distance as the local score of DTW. The old version and the proposed STD are the methods described in Sections 2 and 3, respectively. The performance of the old version is comparable to the baseline method. The proposed method improves the performance by about 6% over the other two methods.

Table 1: Performance of STD methods (F-measure [%] and MAP [%] for the baseline, the old version, and the proposed method).

Figure 3: Recall and precision of STD methods (curves for the baseline, the old version, and the proposed method).

4.3 Formal Run Result
For the formal run on SDPWS [2], we trained the V-P score using the 1-best results of REF-SYLLABLE-UNMATCHED, which are produced by ASR with a syllable-based trigram language model trained on unmatched language resources. Figure 4 and Table 2 show the STD performance for the SDPWS data in the formal run [2]. We compared the proposed method with the baseline BL-3, which was provided by the task organizer and detects query terms by DP-based detection over two kinds of phoneme sequences generated by continuous syllable recognition and continuous word recognition with matched language models [2]. The performance is degraded in comparison with the results in 4.2, not only for the proposed method but also for the baseline. We guess that the accuracy of speech recognition is lower for the SDPWS data than for the CSJ data. Our STD method assumes that the accuracy of speech recognition is high enough for the V-P score to be trained reliably. The low performance of speech recognition therefore degrades our VQ-based method more than the baseline method of phoneme matching.

Figure 4: Recall and precision for SDPWS data in the formal run (curves for the baseline BL-3 and the proposed method).

Table 2: STD performance for SDPWS data in the formal run (F-measure [%] at the maximum and at the specified threshold, and MAP [%], for the baseline BL-3 and the proposed method).

5. INEXISTENT SPOKEN TERM DETECTION
We also conducted inexistent spoken term detection (iSTD), a task newly introduced in NTCIR-10 SpokenDoc-2 that judges whether a query term is existent or inexistent [2]. The collection of spoken documents is the same as in 4.1 and consists of 177 spoken lectures in the CORE set of the CSJ. The measure for judging the existence of a term is the unified score described in Section 3. Figure 5 shows our iSTD result. The maximum F-measure is 73.3%.

Figure 5: Recall and precision of iSTD.
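For the iSTD judgment, a minimal sketch under the assumption that the decision is simply a threshold on the best unified score found anywhere in the collection; tuning this threshold is what the conclusions below identify as future work.

```python
def judge_existence(candidate_scores, threshold):
    """iSTD: declare a query term existent if any candidate segment in the
    collection reaches a unified score P(i) above the threshold."""
    best = max(candidate_scores, default=float("-inf"))
    return best >= threshold
```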
6. CONCLUSIONS
This paper describes a method of normalizing the matching score of the VQ-based STD method that we had previously proposed. The intra-phoneme normalization, which averages the scores for each phoneme in a query term, improves the STD performance by about 6% in both F-measure and MAP. However, the proposed method shows low performance for the SDPWS data in the formal run. We also conducted iSTD for the CSJ data and obtained a maximum F-measure of 73.3%. The automatic definition of the threshold is future work for iSTD.
REFERENCES
[1] T. Akiba, K. Aikawa, Y. Itou, T. Kawahara, H. Nanjo, H. Nishizaki, N. Yasuda, Y. Yamashita, and K. Itou. Developing an SDR test collection from Japanese lecture audio data. In Proceedings of the 1st Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2009), page 6.
[2] T. Akiba, H. Nishizaki, K. Aikawa, X. Hu, Y. Itoh, T. Kawahara, S. Nakagawa, H. Nanjo, and Y. Yamashita. Overview of the NTCIR-10 SpokenDoc-2 task. In Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies.
[3] T. Akiba, H. Nishizaki, K. Aikawa, T. Kawahara, and T. Matsui. Overview of the IR for Spoken Documents Task in NTCIR-9 Workshop. In Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-lingual Information Access.
[4] E. M. Hofsteller and R. C. Rose. Techniques for task independent word spotting in continuous speech messages. In Proceedings of ICASSP.
[5] Y. Itoh, T. Otake, K. Iwata, K. Kojima, M. Ishigame, K. Tanaka, and S.-W. Lee. Two-stage vocabulary-free spoken document retrieval: subword identification and re-recognition of the identified sections. In Proceedings of INTERSPEECH 2006.
[6] G. J. F. Jones, J. T. Foote, K. Sparck Jones, and S. J. Young. Video mail retrieval: The effect of word spotting accuracy on precision. In Proceedings of ICASSP 1995.
[7] K. M. Knill and S. J. Young. Fast implementation methods for Viterbi-based word-spotting. In Proceedings of ICASSP 1996.
[8] B. Logan and J. M. V. Thong. Confusion-based query expansion for OOV words in spoken document retrieval. In Proceedings of ICSLP 2002.
[9] K. Maekawa, H. Koiso, S. Furui, and H. Isahara. Spontaneous speech corpus of Japanese. In Proceedings of LREC.
[10] T. Matsunaga, K. Cho, and Y. Yamashita. Spoken term detection using the segment quantization of acoustic features of spoken documents. In Proceedings of the 6th Spoken Document Processing Workshop, SDPWS2012-7.
[11] T. Mertens, R. Wallace, and D. Schneider. Cross-site combination and evaluation of subword spoken term detection systems. In Proceedings of Content-Based Multimedia Indexing (CBMI), pages 61-66.
[12] Y. Yamashita, T. Matsunaga, and K. Cho. YLAB@RU at spoken term detection task in NTCIR-9 SpokenDoc. In Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-lingual Information Access.
[13] Y. Yamashita and R. Mizoguchi. Keyword spotting using F0 contour matching. In Proceedings of the 5th Conference on Speech Communication and Technology (Eurospeech '97), volume 1.