CHAPTER 5 SPEAKER IDENTIFICATION USING SPEAKER-SPECIFIC-TEXT


5.1 MOTIVATION FOR USING SPEAKER-SPECIFIC-TEXT

Better classification accuracy can be achieved if the training technique is able to capture the unique features of a class, that is, the features that discriminate one class from another. In Chapter 4, a GMM technique was proposed that enables a classifier to capture the unique features of a class and to make decisions based on those unique features alone. During testing, the feature vectors that are unique to a class are derived, and the classification accuracy is thereby increased. This approach has two drawbacks. First, if the test utterance does not contain the unique features, the classification accuracy can drop drastically. Second, the unique features have to be identified from the test utterances during testing, which increases the computation time. If the speaker could utter words containing only the unique features, the computation time would be reduced. However, even when the unique feature vectors are known, a speaker cannot be expected (or forced) to utter speech segments that contain these features alone. On the other hand, if the list of unique phonemes is known a priori, a text to be uttered can be formulated using those phonemes alone. In this chapter, we investigate the effect, on a speaker recognition task, of a subset of phonemes that are unique to a speaker in the acoustic sense. The proposed technique involves three main steps:

1. Find the confusing speaker for each speaker.
2. Derive, for each speaker, the set of phonemes that are acoustically dissimilar when compared with his/her confusing speaker.
3. Test the system using utterances that contain the maximum number of acoustically dissimilar phonemes.

The proposed technique is evaluated on a speaker identification task using the TIMIT speech corpus, and the results are compared with the performance of a conventional GMM-based classifier.

5.2 EXPERIMENTAL SETUP

The TIMIT speech corpus is used for both training and testing. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT, and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). Each speaker has 10 utterances, each approximately 3 seconds long. For the current study, only the female speakers (192 in number) are considered, because the classification accuracy for female speakers is inferior to that for male speakers. For each speaker, the first 8 of the ten sentences are used for training and the last 2 are used for testing.
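The conventional GMM-based classifier used as the baseline identifies a speaker by scoring the test feature vectors under every speaker model and choosing the highest-scoring model. The sketch below shows that decision rule in Python with NumPy. It is a minimal illustration under assumptions that do not come from the source: the models are taken to be diagonal-covariance GMMs whose parameters are already trained (EM training is not shown), each model is a hypothetical `(weights, means, variances)` tuple, and `gmm_frame_loglik` and `identify` are illustrative names. In this chapter each model would have 64 components over 39-dimensional MFCC vectors.

```python
import numpy as np

def gmm_frame_loglik(X, weights, means, variances):
    """Per-frame log-likelihood of X (T x D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D). In this chapter M = 64 and
    D = 39 (13 static + 13 delta + 13 acceleration MFCCs).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    diff = X[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)     # (M,)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)  # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # Numerically stable log-sum-exp over the mixture components.
    top = log_comp.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(log_comp - top).sum(axis=1, keepdims=True))).ravel()

def identify(X, models):
    """Baseline identification: average frame log-likelihood under every model,
    then return the index of the highest-scoring speaker and all scores."""
    scores = np.array([gmm_frame_loglik(X, *m).mean() for m in models])
    return int(np.argmax(scores)), scores

# Toy usage: two single-component 2-D models (illustrative values only).
model_a = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
model_b = (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]]))
best, scores = identify(np.array([[0.1, -0.2], [0.0, 0.3]]), [model_a, model_b])
```

Averaging the frame log-likelihoods rather than summing them does not change which model wins, but it keeps scores comparable across test segments of different lengths.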

The total number of training utterances is 1536 and the total number of test utterances is 385. For each speaker, a GMM with 64 mixture components has been trained using Mel-frequency cepstral coefficients (13 static + 13 dynamic + 13 acceleration, i.e., 39 coefficients per frame) as the features. The training utterances of each speaker have been tested against all 192 speaker models, and the two best-scoring models, by log-likelihood, have been retained. The second of these is taken as the closely resembling (confusing) speaker. This process is repeated for all 192 speakers to obtain a confusing-speaker list.

To derive the speaker-specific-text of a speaker, the first step is to find the acoustically dissimilar phonemes of that speaker. The phonemes common to the speaker and her confusing speaker in the training utterances (that is, the corresponding speech segments) are tested with her model and with her confusing speaker's model. The average log-likelihood of each phoneme (see footnote 2) is computed under both models. If the difference between the two averages is greater than a specific threshold, the phoneme is considered an acoustically dissimilar phoneme. The same process is repeated for the phonemes of all the speakers.

During testing, the speaker-specific-text (utterances containing acoustically dissimilar phonemes) is used. Since the TIMIT corpus is used, a speaker-specific-text cannot be formulated using only the acoustically dissimilar phonemes. Therefore, the speaker-specific-text is derived from the two test utterances by taking the words that contain the maximum number of acoustically dissimilar phonemes. The results obtained with these words were compared with those obtained with words chosen without considering the acoustically dissimilar phonemes. When the system is tested using speech utterances that correspond to the speaker-specific-text, the confusion error is reduced considerably compared with the conventional GMM-based classification technique, as discussed in the next section.

Footnote 2: Since the number of examples for each of the phonemes used in this work is small, the product of likelihood-Gaussians used in the feature-level approach cannot be used.

5.3 PERFORMANCE ANALYSIS

The performance of the system has been analyzed using acoustically dissimilar phonemes. Various values of the threshold (the average log-likelihood difference between the speaker and her confusing speaker) are set, and different constraints are used for testing the performance of the system. Since the TIMIT corpus is used, the test text cannot be formulated using only the acoustically dissimilar phonemes. To capture sufficient speaker characteristics, the constraint set in our work is that each test utterance (word) should have at least six phonemes, and a minimum of three of these should be acoustically dissimilar phonemes (ADPs); for a six-phoneme word, this means that at least 50% of its phonemes are ADPs. For each speaker, one word satisfying these constraints has been chosen for testing. The performance of this system is tabulated in Table 5.1.

Table 5.1 Speaker identification performance of the system for different thresholds and constraints (ADP: acoustically dissimilar phoneme)

Case | Threshold | Constraints (No. of phonemes in test utterance / No. of ADPs) | No. of speakers satisfying the constraints | No. of speakers recognized correctly | Identification accuracy (%)
1 | ≥ 9 | 6 / ≥ 3 | | |
2 | ≥ 10 | 6 / ≥ 3 | | |
3 | ≥ 11 | 6 / ≥ 3 | | |
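The selection steps described above (confusing speaker, acoustically dissimilar phonemes, constrained test words) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical; the confusing speaker is taken as the best-scoring model other than the speaker's own, which equals the "second best" whenever her own model ranks first; the dissimilarity score is assumed to be the signed difference (her model minus the confusing speaker's model) of mean per-segment log-likelihoods; and the comparison uses ≥ to match the threshold column of Table 5.1. The toy numbers are illustrative only and are not on the scale of the thresholds 9, 10 and 11.

```python
import numpy as np

def confusing_speakers(score_matrix):
    """For each speaker i, return the index of her closest competing model.

    score_matrix[i, j]: average log-likelihood of speaker i's training
    utterances under speaker j's GMM. The speaker's own model is masked out.
    """
    S = np.array(score_matrix, dtype=float)  # copy; caller's data is untouched
    np.fill_diagonal(S, -np.inf)
    return S.argmax(axis=1)

def dissimilar_phonemes(phoneme_scores, threshold):
    """Select acoustically dissimilar phonemes (ADPs) for one speaker.

    phoneme_scores maps each phoneme common to the speaker and her confusing
    speaker to (own_lls, rival_lls): per-segment average log-likelihoods under
    her model and under the confusing speaker's model (assumed input format).
    """
    return {ph for ph, (own, rival) in phoneme_scores.items()
            if np.mean(own) - np.mean(rival) >= threshold}

def select_test_words(words, adps, min_phonemes=6, min_adps=3):
    """Keep words with >= min_phonemes phonemes of which >= min_adps are ADPs.

    words: list of (word, phoneme_list). Returns (word, n_adps), most ADPs first.
    """
    kept = []
    for word, phones in words:
        n_adp = sum(p in adps for p in phones)
        if len(phones) >= min_phonemes and n_adp >= min_adps:
            kept.append((word, n_adp))
    return sorted(kept, key=lambda wn: wn[1], reverse=True)

# Toy 3-speaker score matrix (rows: test speaker, columns: model).
scores = np.array([[-50.0, -52.0, -60.0],
                   [-55.0, -49.0, -51.0],
                   [-58.0, -54.0, -48.0]])
rivals = confusing_speakers(scores)

# Toy per-phoneme scores for one speaker (TIMIT-style labels).
phoneme_scores = {"iy": ([-40.0, -42.0], [-52.0, -50.0]),  # difference 10.0
                  "s": ([-45.0], [-46.0]),                  # difference 1.0
                  "ae": ([-38.0, -40.0], [-49.0, -50.0])}   # difference 10.5
adps = dissimilar_phonemes(phoneme_scores, threshold=9)

# Hypothetical words with phoneme transcriptions.
words = [("word_a", ["iy", "s", "ae", "iy", "t", "n"]),
         ("word_b", ["s", "t", "n", "iy", "d", "k"]),
         ("word_c", ["ae", "iy"]),
         ("word_d", ["ae", "iy", "ae", "iy", "s", "t", "n"])]
test_words = select_test_words(words, adps)
```

Counting ADP occurrences (tokens) rather than distinct ADPs is a further assumption; the text does not say which is meant.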

From Table 5.1, it can be noted that even with a single word containing three or more acoustically dissimilar phonemes, the classification accuracy is reasonably good (above 80%). Further, the performance varies only slightly across the different thresholds (see footnote 3), which shows that the performance of the system is not very sensitive to the threshold.

Speaker identification performance is next compared between utterances selected using the acoustically dissimilar phonemes and utterances selected without considering them. As before, to capture sufficient speaker characteristics, each test utterance should contain at least six phonemes. Since each phoneme lasts approximately 80 ms, six phonemes span about 6 × 80 = 480 ms; therefore, each test utterance is divided into 500 ms speech segments, which are given for testing. A 500 ms segment may contain both acoustically similar and acoustically dissimilar phonemes (segments corresponding to silences longer than 100 ms are not considered).

Table 5.2 Speaker identification performance of the system without considering the acoustically dissimilar phonemes (the speakers satisfying the constraints of cases 1, 2 and 3 in Table 5.1 are considered for testing)

Case | No. of speakers for testing (as in Table 5.1) | No. of 500 ms speech utterances | No. of times recognized correctly | Identification accuracy (%)
1 | | | |
2 | | | |
3 | | | |

Footnote 3: Since the TIMIT corpus is used, the authors do not have control over the number of speakers who satisfy the constraints.
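The 500 ms segmentation can be sketched as follows, working from phone-level segments in the style of a TIMIT .PHN file (start sample, end sample, label) at 16 kHz. Several details are assumptions rather than statements from the source: the silence labels are taken to be TIMIT's `h#`, `pau` and `epi`; silences longer than 100 ms are cut out and the remaining speech is concatenated before chunking; and a trailing remainder shorter than 500 ms is dropped. The function name `speech_chunks` is hypothetical.

```python
import numpy as np

FS = 16_000                              # TIMIT sampling rate (Hz)
SILENCE_LABELS = {"h#", "pau", "epi"}    # assumed set of non-speech labels

def speech_chunks(signal, segments, chunk_ms=500, max_sil_ms=100, fs=FS):
    """Split a waveform into fixed-length chunks after removing long silences.

    segments: list of (start_sample, end_sample, label), as in a TIMIT .PHN file.
    Silence segments longer than max_sil_ms are dropped; all other segments are
    concatenated in order. A trailing remainder shorter than chunk_ms is discarded.
    """
    max_sil = int(max_sil_ms * fs / 1000)        # 100 ms -> 1600 samples
    kept = [signal[s:e] for s, e, lab in segments
            if not (lab in SILENCE_LABELS and e - s > max_sil)]
    speech = np.concatenate(kept) if kept else np.zeros(0, dtype=signal.dtype)
    n = int(chunk_ms * fs / 1000)                # 500 ms -> 8000 samples
    return [speech[i:i + n] for i in range(0, len(speech) - n + 1, n)]

# Toy usage: a 2 s "signal" whose sample values equal their indices.
signal = np.arange(32_000)
segments = [(0, 4_000, "h#"),        # 250 ms leading silence: dropped
            (4_000, 20_000, "iy"),
            (20_000, 21_000, "pau"), # 62.5 ms pause: kept
            (21_000, 32_000, "s")]
chunks = speech_chunks(signal, segments)
```

Each returned chunk would then be scored against all speaker models, as in the conventional GMM-based identification.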

From Table 5.1 and Table 5.2, it can be noted that there is a 16% performance improvement from using speaker-specific-text, as seen in row 2 of Table 5.1.

The speaker identification performance is also measured as a function of the number of acoustically dissimilar phonemes in the test utterance. From each test utterance, words with at least six phonemes and at most two acoustically dissimilar phonemes have been taken for testing. Similarly, words with at least six phonemes and three or more acoustically dissimilar phonemes have been taken for testing. The number of speakers taken for this experiment is 40, and the results are tabulated in Table 5.3.

Table 5.3 Speaker identification performance based on the number of acoustically dissimilar phonemes

Case | No. of acoustically dissimilar phonemes | No. of speakers | No. of speakers recognized correctly | Identification accuracy (%)
1 | ≤ 2 | | |
2 | ≥ 3 | | |

From Table 5.3, it can be noted that the classification performance improves as the number of acoustically dissimilar phonemes increases.

Finally, the speaker identification performance is compared between the acoustically similar and the acoustically dissimilar phonemes in the test utterance. Feature vectors of acoustically similar and dissimilar phonemes are extracted and given for testing separately; that is, each test uses feature vectors extracted from the utterance of a single phoneme. The experimental results show that even with a single acoustically dissimilar phoneme, the speakers can be identified with reasonable accuracy, as shown in Figure 5.1.

Figure 5.1 Comparison between acoustically similar and dissimilar phonemes (ADP: acoustically dissimilar phonemes; ASP: acoustically similar phonemes)

From Figure 5.1, it can be noted that the acoustically dissimilar phonemes yield higher accuracy than the acoustically similar phonemes. Speakers 9, 10 and 11 have lower accuracy for the acoustically dissimilar phonemes; however, the majority of the speakers were identified even with a single acoustically dissimilar phoneme. This result shows that if the test utterance contains only acoustically dissimilar phonemes, the confusion error can be reduced and the classification accuracy increased. The computation time is also reduced, because the unique features (acoustically dissimilar phonemes) are identified before testing, so testing is done using only the speech utterances that correspond to a speaker-specific-text. Further, this shows that the duration of the test utterances can be reduced drastically without compromising the classification accuracy.
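Tables 5.2 and 5.3 and Figure 5.1 all report identification accuracy over groups of test trials: by case, by the number of acoustically dissimilar phonemes, or by speaker and phoneme type. A small helper for that bookkeeping might look like the sketch below. The trial record fields (`true`, `predicted`, `n_adp`, `kind`) and the function names are hypothetical, and the trials are toy values, not results from this chapter.

```python
from collections import defaultdict

def accuracy_by_group(trials, group_of):
    """Group identification trials and compute accuracy per group.

    trials: iterable of dicts with at least 'true' and 'predicted' speaker ids.
    group_of: function mapping a trial to a group key.
    Returns {group: (n_trials, n_correct, accuracy_percent)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for t in trials:
        c = counts[group_of(t)]
        c[0] += 1
        c[1] += int(t["predicted"] == t["true"])
    return {g: (n, k, 100.0 * k / n) for g, (n, k) in counts.items()}

def adp_bucket(trial, split=3):
    """Table 5.3 style grouping: '<=2' or '>=3' acoustically dissimilar phonemes."""
    return f">={split}" if trial["n_adp"] >= split else f"<={split - 1}"

# Toy trials (illustrative only).
trials = [{"true": 1, "predicted": 1, "n_adp": 3, "kind": "ADP"},
          {"true": 1, "predicted": 2, "n_adp": 1, "kind": "ASP"},
          {"true": 2, "predicted": 2, "n_adp": 4, "kind": "ADP"},
          {"true": 2, "predicted": 2, "n_adp": 0, "kind": "ASP"}]

by_adp_count = accuracy_by_group(trials, adp_bucket)
# Figure 5.1 style: per speaker, ADP versus ASP test segments.
by_speaker_kind = accuracy_by_group(trials, lambda t: (t["true"], t["kind"]))
```

Passing the grouping as a function keeps one counting routine for all three comparisons.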

SUMMARY

In this chapter, we have proposed using speech utterances that correspond to a speaker-specific-text for speaker recognition tasks. Here, the speaker-specific-text is formed using the unique phonemes of a speaker, in other words, a set of phonemes that are acoustically dissimilar when compared with those of a competing (acoustically closely resembling) speaker. We have shown that, in a speaker identification task, the classification accuracy is considerably higher than that of a conventional GMM-based technique when speech utterances corresponding to the unique phonemes are used. Further, we have shown that even with a single phoneme, if it is unique to a speaker, the classification accuracy is quite satisfactory. These results show that the duration of the test utterances can also be reduced considerably without compromising the accuracy.


More information

CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學

CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學 CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學 Audio Musical Genre Classification using Convolutional Neural Networks and Pitch and Tempo Transformations 使 用 捲 積 神 經 網 絡 及 聲 調 速 度 轉 換 的 音 頻 音 樂 流 派 分 類 研 究 Submitted

More information

EMOTION IN SPEECH: RECOGNITION AND APPLICATION TO CALL CENTERS

EMOTION IN SPEECH: RECOGNITION AND APPLICATION TO CALL CENTERS EMOTION IN SPEECH: RECOGNITION AND APPLICATION TO CALL CENTERS VALERY A. PETRUSHIN Andersen Consulting 3773 Willow Rd. Northbrook, IL 60062 petr@cstar.ac.com ABSTRACT The paper describes two experimental

More information

A CHINESE SPEECH DATA WAREHOUSE

A CHINESE SPEECH DATA WAREHOUSE A CHINESE SPEECH DATA WAREHOUSE LUK Wing-Pong, Robert and CHENG Chung-Keng Department of Computing, Hong Kong Polytechnic University Tel: 2766 5143, FAX: 2774 0842, E-mail: {csrluk,cskcheng}@comp.polyu.edu.hk

More information

EFFECTS OF BACKGROUND DATA DURATION ON SPEAKER VERIFICATION PERFORMANCE

EFFECTS OF BACKGROUND DATA DURATION ON SPEAKER VERIFICATION PERFORMANCE Uludağ Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi, Cilt 18, Sayı 1, 2013 ARAŞTIRMA EFFECTS OF BACKGROUND DATA DURATION ON SPEAKER VERIFICATION PERFORMANCE Cemal HANİLÇİ * Figen ERTAŞ * Abstract:

More information

Creation of a Speech to Text System for Kiswahili

Creation of a Speech to Text System for Kiswahili Creation of a Speech to Text System for Kiswahili Dr. Katherine Getao and Evans Miriti University of Nairobi Abstract The creation of a speech to text system for any language is an onerous task. This is

More information

IBM Research Report. CSR: Speaker Recognition from Compressed VoIP Packet Stream

IBM Research Report. CSR: Speaker Recognition from Compressed VoIP Packet Stream RC23499 (W0501-090) January 19, 2005 Computer Science IBM Research Report CSR: Speaker Recognition from Compressed Packet Stream Charu Aggarwal, David Olshefski, Debanjan Saha, Zon-Yin Shae, Philip Yu

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL.?, NO.?, MONTH 2009 1

IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL.?, NO.?, MONTH 2009 1 IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL.?, NO.?, MONTH 2009 1 Data balancing for efficient training of Hybrid ANN/HMM Automatic Speech Recognition systems Ana Isabel García-Moral,

More information

TED-LIUM: an Automatic Speech Recognition dedicated corpus

TED-LIUM: an Automatic Speech Recognition dedicated corpus TED-LIUM: an Automatic Speech Recognition dedicated corpus Anthony Rousseau, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans, France firstname.lastname@lium.univ-lemans.fr

More information

Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification

Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,

More information

The Prognosis is Good: Speech Recognition Software Can Increase Productivity in the Medical Environment

The Prognosis is Good: Speech Recognition Software Can Increase Productivity in the Medical Environment The Prognosis is Good: Speech Recognition Software Can Increase Productivity in the Medical Environment Introduction Traditionally viewed as simply a means of dictating text into a personal computer, today

More information

Problems and Prospects in Collection of Spoken Language Data

Problems and Prospects in Collection of Spoken Language Data Problems and Prospects in Collection of Spoken Language Data Kishore Prahallad+*, Suryakanth V Gangashetty*, B. Yegnanarayana*, D. Raj Reddy+ *Language Technologies Research Center (LTRC) International

More information

Using the Amazon Mechanical Turk for Transcription of Spoken Language

Using the Amazon Mechanical Turk for Transcription of Spoken Language Research Showcase @ CMU Computer Science Department School of Computer Science 2010 Using the Amazon Mechanical Turk for Transcription of Spoken Language Matthew R. Marge Satanjeev Banerjee Alexander I.

More information

Vector Quantization and Clustering

Vector Quantization and Clustering Vector Quantization and Clustering Introduction K-means clustering Clustering issues Hierarchical clustering Divisive (top-down) clustering Agglomerative (bottom-up) clustering Applications to speech recognition

More information

Spillemyndigheden s change management programme. Version 1.3.0 of 1 July 2012

Spillemyndigheden s change management programme. Version 1.3.0 of 1 July 2012 Version 1.3.0 of 1 July 2012 Contents 1 Introduction... 3 1.1 Authority... 3 1.2 Objective... 3 1.3 Target audience... 3 1.4 Version... 3 1.5 Enquiries... 3 2. Framework for managing system changes...

More information

Gender Identification using MFCC for Telephone Applications A Comparative Study

Gender Identification using MFCC for Telephone Applications A Comparative Study Gender Identification using MFCC for Telephone Applications A Comparative Study Jamil Ahmad, Mustansar Fiaz, Soon-il Kwon, Maleerat Sodanil, Bay Vo, and * Sung Wook Baik Abstract Gender recognition is

More information

Direct Loss Minimization for Structured Prediction

Direct Loss Minimization for Structured Prediction Direct Loss Minimization for Structured Prediction David McAllester TTI-Chicago mcallester@ttic.edu Tamir Hazan TTI-Chicago tamir@ttic.edu Joseph Keshet TTI-Chicago jkeshet@ttic.edu Abstract In discriminative

More information

A secure face tracking system

A secure face tracking system International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 10 (2014), pp. 959-964 International Research Publications House http://www. irphouse.com A secure face tracking

More information

Phonetic and phonological properties of the final pitch accent in Catalan declaratives

Phonetic and phonological properties of the final pitch accent in Catalan declaratives Abstract Phonetic and phonological properties of the final pitch accent in Catalan declaratives Eva Estebas-Vilaplana * This paper examines the phonetic and phonological properties of the last pitch accent

More information

EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION

EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION EFFECTS OF TRANSCRIPTION ERRORS ON SUPERVISED LEARNING IN SPEECH RECOGNITION By Ramasubramanian Sundaram A Thesis Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the

More information

Transcription Format

Transcription Format Representing Discourse Du Bois Transcription Format 1. Objective The purpose of this document is to describe the format to be used for producing and checking transcriptions in this course. 2. Conventions

More information

QMeter Tools for Quality Measurement in Telecommunication Network

QMeter Tools for Quality Measurement in Telecommunication Network QMeter Tools for Measurement in Telecommunication Network Akram Aburas 1 and Prof. Khalid Al-Mashouq 2 1 Advanced Communications & Electronics Systems, Riyadh, Saudi Arabia akram@aces-co.com 2 Electrical

More information

Evaluation of speech technologies

Evaluation of speech technologies CLARA Training course on evaluation of Human Language Technologies Evaluations and Language resources Distribution Agency November 27, 2012 Evaluation of speaker identification Speech technologies Outline

More information

Separation and Classification of Harmonic Sounds for Singing Voice Detection

Separation and Classification of Harmonic Sounds for Singing Voice Detection Separation and Classification of Harmonic Sounds for Singing Voice Detection Martín Rocamora and Alvaro Pardo Institute of Electrical Engineering - School of Engineering Universidad de la República, Uruguay

More information

How To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3

How To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3 Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web. By C.Moreno, A. Antolin and F.Diaz-de-Maria. Summary By Maheshwar Jayaraman 1 1. Introduction Voice Over IP is

More information

ACCENT CLASSIFICATION: LEARNING A DISTANCE METRIC OVER PHONETIC STRINGS

ACCENT CLASSIFICATION: LEARNING A DISTANCE METRIC OVER PHONETIC STRINGS ACCENT CLASSIFICATION: LEARNING A DISTANCE METRIC OVER PHONETIC STRINGS by Swetha Machanavajhala A thesis submitted to the faculty of The University of Utah in partial fulfillment of the requirements for

More information

Application Note. Using PESQ to Test a VoIP Network. Prepared by: Psytechnics Limited. 23 Museum Street Ipswich, Suffolk United Kingdom IP1 1HN

Application Note. Using PESQ to Test a VoIP Network. Prepared by: Psytechnics Limited. 23 Museum Street Ipswich, Suffolk United Kingdom IP1 1HN Using PESQ to Test a VoIP Network Application Note Prepared by: Psytechnics Limited 23 Museum Street Ipswich, Suffolk United Kingdom IP1 1HN t: +44 (0) 1473 261 800 f: +44 (0) 1473 261 880 e: info@psytechnics.com

More information

Information Leakage in Encrypted Network Traffic

Information Leakage in Encrypted Network Traffic Information Leakage in Encrypted Network Traffic Attacks and Countermeasures Scott Coull RedJack Joint work with: Charles Wright (MIT LL) Lucas Ballard (Google) Fabian Monrose (UNC) Gerald Masson (JHU)

More information

ANN Based Fault Classifier and Fault Locator for Double Circuit Transmission Line

ANN Based Fault Classifier and Fault Locator for Double Circuit Transmission Line International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Special Issue-2, April 2016 E-ISSN: 2347-2693 ANN Based Fault Classifier and Fault Locator for Double Circuit

More information

Speech Analytics Data Reliability: Accuracy and Completeness

Speech Analytics Data Reliability: Accuracy and Completeness Speech Analytics Data Reliability: Accuracy and Completeness THE PREREQUISITE TO OPTIMIZING CONTACT CENTER PERFORMANCE AND THE CUSTOMER EXPERIENCE Summary of Contents (Click on any heading below to jump

More information

Integration of Negative Emotion Detection into a VoIP Call Center System

Integration of Negative Emotion Detection into a VoIP Call Center System Integration of Negative Detection into a VoIP Call Center System Tsang-Long Pao, Chia-Feng Chang, and Ren-Chi Tsao Department of Computer Science and Engineering Tatung University, Taipei, Taiwan Abstract

More information

Improving Automatic Forced Alignment for Dysarthric Speech Transcription

Improving Automatic Forced Alignment for Dysarthric Speech Transcription Improving Automatic Forced Alignment for Dysarthric Speech Transcription Yu Ting Yeung 2, Ka Ho Wong 1, Helen Meng 1,2 1 Human-Computer Communications Laboratory, Department of Systems Engineering and

More information

School Class Monitoring System Based on Audio Signal Processing

School Class Monitoring System Based on Audio Signal Processing C. R. Rashmi 1,,C.P.Shantala 2 andt.r.yashavanth 3 1 Department of CSE, PG Student, CIT, Gubbi, Tumkur, Karnataka, India. 2 Department of CSE, Vice Principal & HOD, CIT, Gubbi, Tumkur, Karnataka, India.

More information

THE problem of tracking multiple moving targets arises

THE problem of tracking multiple moving targets arises 728 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 4, MAY 2008 Binaural Tracking of Multiple Moving Sources Nicoleta Roman and DeLiang Wang, Fellow, IEEE Abstract This paper

More information

Exploring the Structure of Broadcast News for Topic Segmentation

Exploring the Structure of Broadcast News for Topic Segmentation Exploring the Structure of Broadcast News for Topic Segmentation Rui Amaral (1,2,3), Isabel Trancoso (1,3) 1 Instituto Superior Técnico 2 Instituto Politécnico de Setúbal 3 L 2 F - Spoken Language Systems

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

Uitspraakevaluatie & training met behulp van spraaktechnologie. Pronunciation assessment & training by means of speech technology.

Uitspraakevaluatie & training met behulp van spraaktechnologie. Pronunciation assessment & training by means of speech technology. Uitspraakevaluatie & training met behulp van spraaktechnologie Pronunciation assessment & training by means of speech technology Helmer Strik and many others Centre for Language and Speech Technology (CLST),

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information