LINGUISTIC DISSECTION OF SWITCHBOARD-CORPUS AUTOMATIC SPEECH RECOGNITION SYSTEMS
Steven Greenberg and Shuangyu Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704, USA

ABSTRACT

A diagnostic evaluation of eight Switchboard-corpus recognition systems was conducted in order to ascertain whether word-error patterns are attributable to a specific set of linguistic factors. Each recognition system's output was converted to a common format and scored relative to a reference transcript derived from phonetically hand-labeled data. This reference material was analyzed with respect to ca. forty acoustic, linguistic and speaker characteristics, which, in turn, were correlated with recognition-error patterns via decision trees and other forms of statistical analysis. The most consistent factors associated with superior recognition performance pertain to accurate classification of phonetic segments and articulatory-acoustic features. Other factors correlated with word recognition are syllable structure, prosodic stress and speaking rate (in terms of syllables per second).

1. INTRODUCTION

The present study represents an effort to dissect the functional architecture of large-vocabulary speech recognition systems used in the annual NIST-sponsored Switchboard-corpus evaluation. The Switchboard corpus [5] has been used in recent years (in tandem with the Call Home and Broadcast News corpora) to assess the state of automatic speech recognition (ASR) for spoken English. Switchboard is unique among the large-vocabulary corpora in having a substantial amount of material that has been phonetically labeled and segmented by linguistically trained individuals (the Switchboard Transcription Project [6] [7]) and thus provides a crucial set of reference materials with which to assess the phonetic and lexical classification capabilities of current-generation ASR systems.
In a separate paper [8] we describe in detail the methods used to evaluate the Switchboard recognition systems developed by eight separate sites (cf. Figures 1-3), as well as a few key macroscopic analyses of the diagnostic material (cf. Figures 4-6). In that earlier study we concluded that the properties most predictive of lexical recognition in ASR systems pertain to the accuracy of phonetic classification (at both the phone and articulatory-feature level; cf. Figure 4 and Section 6). However, this conclusion is based primarily on the results of a decision-tree analysis (cf. Section 6) that may be too coarse to reveal certain linguistic patterns germane to ASR performance. For this reason, the current study examines the contribution of such factors as syllable structure, prosodic stress and speaking rate, as well as those pertaining to phonetic and articulatory-feature classification, in order to identify the linguistic parameters most highly correlated with word-recognition performance among the eight ASR systems.

2. CORPUS MATERIALS

The evaluation was performed on a fifty-four-minute, phonetically annotated subset of the Switchboard corpus. The material had previously been manually segmented at the syllabic and lexical levels and was segmented into phonetic segments using an automatic procedure trained on a seventy-two-minute subset of the phonetic-transcription material that had been previously segmented by hand. The resulting segmentation was verified using a combination of manual and automatic methods.

3. EVALUATION FORMAT

Eight separate sites participated in the evaluation: AT&T, BBN, Cambridge University (CU), Dragon Systems (DRAG), Johns Hopkins University (JHU), Mississippi State University (MSU), SRI International and the University of Washington (UW).
Each site was asked to submit two different sets of material: (1) the word and phonetic-segment output of the recognition system used for the competitive (i.e., non-diagnostic) portion of Switchboard, and (2) the word and phone-level output of forced alignments associated with the same material (provided by six sites). In order to score the submissions in terms of phone segments and words correct, as well as to perform detailed analyses of the error patterns, it was necessary to convert the submissions into a common format. This required that:

(1) Each site's phonetic symbol set be mapped onto a common reference similar to that used to phonetically annotate the Switchboard corpus (STP). Care was taken to ensure that the mapping was conservative, so that a site would not be penalized for using a symbol set distinct from STP. In addition, phonetic symbols not contained in a site's inventory were mapped to the more fine-grained STP phone set ("phoneval").

(2) A reference set of materials at the word, syllable and phone levels be created in order to score the material submitted. This reference material included: (a) word-to-phone mapping, (b) syllable-to-phone mapping, (c) word-to-syllable mapping, and (d) time points for the phones and words in the reference materials.

Proceedings of the ISCA Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris, 2000.
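The conservative symbol mapping described in (1) can be sketched as a simple table-driven conversion. The symbol inventory below is a hypothetical illustration, not the actual STP or site phone sets; the point is that each site symbol maps to the reference symbol (or symbol sequence) it should never be penalized for matching, and unknown symbols pass through unchanged.

```python
# Map a site's phone symbols onto a common reference set before scoring.
# The mapping table is an invented stand-in for the real STP inventory.
SITE_TO_REFERENCE = {
    "ax": "ax",         # schwa kept as-is
    "axr": "er",        # site lacks a separate r-colored schwa
    "el": ("ax", "l"),  # syllabic /l/ expands to schwa + lateral
}

def to_reference(site_phones):
    """Convert a site's phone sequence to the common reference symbol set."""
    out = []
    for p in site_phones:
        mapped = SITE_TO_REFERENCE.get(p, p)  # unknown symbols pass through
        if isinstance(mapped, tuple):
            out.extend(mapped)  # one site symbol may expand to several
        else:
            out.append(mapped)
    return out

print(to_reference(["ax", "el", "k"]))  # -> ['ax', 'ax', 'l', 'k']
```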
(3) Time-mediated synchronization of the phone and word output of the submission material with that of the reference set.

The conversion process (Figure 1) was required so that the submissions could be scored at the word and phonetic-segment levels using SC-Lite, a program developed at the National Institute of Standards and Technology (NIST) to score competitive ASR evaluation submissions.

Figure 1: The initial phase of the diagnostic evaluation. Materials submitted by each site are converted into a format designed for scoring (CTM files) relative to the reference transcript (at the phonetic, syllable and word levels). From [8].

4. SCORING THE RECOGNITION SYSTEMS

SC-Lite scores each word (and phone) in terms of being correct or not, as well as designating each error as one of three types: a substitution (i.e., a -> b), an insertion (a -> a+b) or a deletion (a -> ø). A fourth category, null, occurs when the error cannot be clearly associated with one of the other three categories (and usually implies that the error is due to some form of formatting discrepancy). For both phone- and word-scoring it was necessary to develop a method enabling each segment (either word or phone) in the submission to be unambiguously associated with a corresponding symbol (or set of symbols) in the reference material. This was accomplished by using time-mediated boundaries as synchronizing delimiters. Because the word and phone segmentation of the submission materials often deviates from that of the STP-based reference materials, an algorithm was developed to minimize the time-alignment discrepancy.

Figure 2: The evaluation's second phase involves time-mediated scoring of both the word- and phone-level output of the recognition and forced-alignment materials. The scored output is used to compile summary tables. From [8].
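The substitution/insertion/deletion taxonomy can be illustrated with a minimal edit-distance alignment that labels each aligned token. This is a sketch of the error categories only, not NIST's SC-Lite (which additionally uses the time-mediated boundaries described above).

```python
# Illustrative word-level scoring: align hypothesis to reference by
# minimum edit distance, labeling each position as correct (C),
# substitution (S), insertion (I) or deletion (D).
def score(ref, hyp):
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to recover the label sequence.
    labels, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            labels.append("C" if ref[i - 1] == hyp[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            labels.append("D")  # reference word absent from hypothesis
            i -= 1
        else:
            labels.append("I")  # hypothesis word with no reference mate
            j -= 1
    return labels[::-1]

print(score("the cat sat".split(), "the bat sat down".split()))  # -> ['C', 'S', 'C', 'I']
```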
Files (in NIST's CTM format) were generated for each site's submission (separate files for recognition and forced alignment), and this material was processed along with the CTM files associated with the word- and phone-level reference material (Figure 2). The resulting output was used as the basis for generating the data contained in the summary tables ("big lists"; cf. [8]), which form the foundation of the current analyses.

Figure 3: Phase three of the analysis consists of computing several dozen parameters associated with the phone- and word-level representations of the speech signal and compiling these into summary tables ("big lists"). A complete list of the parameters computed can be found in [8].

5. WORD AND PHONE ERROR PATTERNS

Word (Figure 4) and phone (Figures 4-6) error patterns were computed for both the forced-alignment (based on word-level transcripts) and unconstrained recognition material. Phone recognition error is relatively high for forced alignment (35-49%; Figure 5), only slightly less than the phone classification error for unconstrained recognition (39-55%; Figure 6). The difference in performance between the two conditions is much smaller than anticipated, suggesting that the ASR systems may not be optimized for classification of phonetic segments.

Figure 4: A comparison of the word and phonetic-segment error for the recognition component of the diagnostic Switchboard evaluation for all eight participating sites. From [8].

Figure 5: Phonetic-segment errors in the forced-alignment component of the Switchboard diagnostic evaluation for the six participating sites. From [8].

The word error rate for the ASR systems ranges between 27 and 43%, about 50% higher than that observed for the competitive portion of the evaluation [10]. The higher error rate is probably due to several factors. First, the diagnostic component of the evaluation contains relatively short utterances (mean duration = 4.76 sec) from hundreds of different speakers. In contrast, the competitive evaluation is composed of complete dialogues lasting ca. five minutes and produced by only forty different speakers [10]. Most (if not all) of the recognition systems normally use some form of speaker adaptation, which works most effectively over long spans of speech. Short utterances, such as those used in the diagnostic evaluation, are likely to diminish the beneficial effect of speaker adaptation.

Figure 4 illustrates the relationship between phone- and word-error magnitude across submission sites. The correlation between the two (r) is 0.78, suggesting that word recognition may largely depend on the accuracy of recognition at the phonetic-segment level. Certain sites, such as AT&T and Dragon, deviate from this pattern in that they exhibit a lower word error than would be expected based solely on recognition at the phonetic-segment level. These systems may possess extremely good pronunciation models that partially compensate for relative deficiencies in phone classification.

6. DECISION-TREE ANALYSIS OF ERRORS

In order to gain some initial insight into the factors governing word errors in recognition performance, the STP-based reference component of the Switchboard corpus was analyzed with respect to ca. forty separate parameters pertaining to speaker, linguistic and acoustic properties of the speech materials, including energy level, duration, stress pattern, syllable structure, speaking rate and so on (cf. [8] for a complete list of parameters). Decision trees [13] were used as a means of identifying the most important factors associated with word-error rate across sites. The error data were partitioned into four separate domains:

(1) Substitutions versus all other data (both correct and incorrect),
(2) Deletions versus all other data (both correct and incorrect),
(3) Substitutions versus deletions (i.e., excluding words correctly recognized), and
(4) Substitutions versus insertions (i.e., excluding words correctly recognized).

Figure 6: The percentage of phone errors for the recognition component of the Switchboard diagnostic evaluation. Data are from all eight participating sites. From [8].

Substitution Errors (versus All Else)

Half of the word errors involve substitutions (Figure 6), and it is therefore of interest to identify the parameters associated with this single most important component of recognition performance. For seven of the eight submissions the parameter dominating the decision tree at the highest (or second-highest) node level is the number of phonetic-segment substitution errors within a word. The probability of a word being incorrectly recognized increases significantly when more than (an average of) ca. 1.5 phones are misclassified. Other important parameters in the decision trees are the articulatory-acoustic feature distance (AFDIST) between the correct and hypothesized word (which is also related to the probability of correct phone classification), the (unigram) frequency of the reference word (WDFREQ), and whether the preceding (PREWDER) or following (PSTWDER) word is incorrectly recognized.

Deletions (versus All Else)

Deletions account for ca. 25% of the word errors.
For all sites, the dominant factor associated with deletion errors pertains to either the number of phonetic segments correctly recognized (PHNCOR) or the articulatory-acoustic, phonetic-feature distance between the reference and hypothesized word. Other important parameters are the number of phone insertions (PHNINS) and substitutions (PHNSUB), word frequency, and the duration of the reference word (REFDUR).

Substitution versus Deletion Errors

Additional information concerning the source of word errors can be obtained by analyzing factors distinguishing different types of error. For distinguishing substitution from deletion errors, two sets of parameters appear to be most important: phonetic-segment classification (PHNSUB, PHNINS, PHNCOR, AFDIST) and the duration of the reference (REFDUR) and hypothesized (HYPDUR) words.

Substitution versus Insertion Errors

The duration of the hypothesized word is the most important parameter distinguishing substitution from insertion errors, followed in importance by phonetic-segment classification factors (PHNSUB, PHNINS, PHNDEL, AFDIST), the frequency of occurrence of the phonetic segments in a word (PHNFREQ), and the error status of the preceding and following words (PREWDER, PSTWDER).

General Trends of the Decision-Tree Error Analysis

The most important parameters associated with word-recognition error are those pertaining to the correct identification of a word's phonetic composition (at either the phone or articulatory-acoustic, phonetic-feature level). The results of the decision-tree analyses (cf. [8] for further detail) are also of interest because of the parameters that are absent from the decision trees: prosodic prominence, syllable structure and speaking rate. All of these parameters have been suggested as important factors associated with word-error rate. The decision trees suggest that such parameters do not account for word errors across the board in the way that acoustic-phonetic factors do.

7. PHONE ERROR PATTERNS IN WORDS

It is of interest to ascertain whether the number of phonetic segments in a word bears any relation to the pattern of phone errors in both correctly and incorrectly recognized words (cf. Figure 7 and [3]).

Figure 7: The number of phonetic-segment errors per word as a function of the number of phones contained in the word. The data are partitioned into two broad classes, correctly and incorrectly recognized words. Within each subset the data are divided into four classes, based on the type of error. Data represent averages across the eight ASR systems. Words of designated length 5+ contain five or more phones.
The tolerance for phonetic-segment errors in correctly recognized words is not linearly related to the length of the word. The tolerance for error (ca. phones) is roughly constant for word lengths of four phones or less. This pattern is observed regardless of the form of error. The relatively low tolerance for phone misclassification (except for words of very short length) implies that the pronunciation and language models possess only a limited capacity to compensate for errors at the phonetic-segment level. In contrast, the average number of phones misclassified in incorrectly recognized words does increase in quasi-linear fashion as a function of word length (with the possible exception of insertions), a pattern consistent with the importance of phonetic classification for accurate word recognition.

8. ARTICULATORY-FEATURE ANALYSIS

Phonetic segments can be decomposed into more elementary constituents based on their articulatory bases, such as place (e.g., labial, labio-dental, alveolar, velar), manner (e.g., stop, fricative, affricate, nasal, liquid, glide, vocalic), voicing and lip-rounding. Two additional dimensions were also used in the current analyses: front-back articulation (vowels only) and the general distinction between consonantal and vocalic segmental forms.

Figure 8: The average number of articulatory-feature errors per phone as a function of the phonetic segment's position within the word. The data are partitioned depending on whether the word was correctly recognized or not. Data represent averages across the eight ASR systems.

Figure 9: The average error in classification of articulatory features associated with each phonetic segment as a function of the position of the phone within the syllable, in this instance for consonantal onsets. CV..CV indicates that the context was a polysyllabic word. Data represent averages across the eight ASR systems.

There are approximately three times as many articulatory-feature (AF) errors (per phone) when the word is misrecognized, regardless of the position of the segment within the word (Figure 8). This contrast between correctly and incorrectly recognized words is maintained when the data are partitioned into articulatory-feature classes. Because words are organized into a distinct set of syllabic forms in English, it is of interest to partition the AF-error data according to lexical syllable structure (i.e., consonant and vowel sequences) and separately examine the pattern of errors for consonantal onsets, vocalic nuclei and consonantal codas. These data, illustrated in Figures 9-11, exhibit certain interesting patterns. The articulatory features associated with consonantal onsets (Figure 9) exhibit a relatively low tolerance for error. Moreover, the error rate is four to five times greater for AFs in misclassified words than in correctly recognized lexical items. Place and manner features are particularly prone to error in misclassified words, suggesting that these onset-consonant AFs are particularly important for correctly distinguishing among words. The AF error patterns for consonantal codas (Figure 10) are similar to those associated with consonantal onsets, except that there is a slightly higher (ca. 50%) tolerance for error among the former. Both onsets and codas also exhibit an interesting pattern with respect to polysyllabic words. Lip-rounding appears to play a much smaller role in these longer words, while vowel/consonant confusions seem to be more pronounced in such contexts.
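Counting AF errors per phone amounts to comparing the feature decompositions of the reference and hypothesized segments dimension by dimension. The feature table below is a toy fragment for illustration; the actual analysis used place, manner, voicing, lip-rounding, front-back articulation and the consonant/vowel distinction over the full segment inventory.

```python
# Count articulatory-feature mismatches between a reference and a
# hypothesized phone. The table is an invented three-dimension fragment.
FEATURES = {
    "p": {"place": "labial",   "manner": "stop",      "voicing": "voiceless"},
    "b": {"place": "labial",   "manner": "stop",      "voicing": "voiced"},
    "s": {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
}

def af_errors(ref_phone, hyp_phone):
    """Number of articulatory-feature dimensions on which two phones differ."""
    ref, hyp = FEATURES[ref_phone], FEATURES[hyp_phone]
    return sum(ref[dim] != hyp[dim] for dim in ref)

print(af_errors("p", "b"))  # differ in voicing only -> 1
print(af_errors("p", "s"))  # differ in place and manner -> 2
```

A per-word AF distance (in the spirit of the AFDIST parameter above) would sum such counts over the aligned phones of the reference and hypothesized words.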
The AF error patterns associated with vocalic nuclei (Figure 11) display a pattern different from those associated with onsets and codas. There is a much higher tolerance of error in AF classification for correctly recognized words, particularly marked for the place and front/back features. Moreover, there is a considerably higher degree of AF classification error among the nuclei compared to onsets and codas, particularly along the place and front-back dimensions. There is also an exceedingly high proportion of vowel/consonant confusions. Such data imply that classification of vocalic nuclei is considerably less precise than for onsets and codas.

Figure 10: The average error in classification of articulatory features associated with each phonetic segment as a function of the position of the phone within the syllable, in this instance for consonantal codas. CV..CVC indicates that the context was a polysyllabic word. Data represent averages across the eight ASR systems.

Figure 11: The average error in classification of articulatory features associated with each phonetic segment as a function of the position of the phone within the syllable, in this instance for vocalic nuclei. CV..CV indicates that the context was a polysyllabic word. Data represent averages across the eight ASR systems.

9. SYLLABLE STRUCTURE ANALYSIS

The AF classification patterns suggest that syllabic structure may help to provide a keener understanding of recognition errors. Figure 12 shows word error as a function of syllable form. The highest error rates are associated with vowel-initial syllables. Syllables with complex (i.e., consonant-cluster) onsets and codas tend to exhibit a relatively low word-error rate, as do polysyllabic words. This effect is particularly pronounced with respect to deletions, and vowel-initial syllables are particularly prone to such errors. There is a certain degree of variability in the error patterns associated with syllable form across the eight ASR systems (Figure 13). Certain sites (such as AT&T, BBN and UW) exhibit a similar error rate across consonant-initial words. Other sites (e.g., Dragon, JHU and SRI) do particularly well on polysyllabic words. Virtually all sites show a pronounced increase in errors associated with vowel-initial words (AT&T is an exception). The pattern of errors associated with syllable form suggests that future-generation ASR systems would benefit from an accurate means of parsing the acoustic signal into syllabic segments (cf. [14]) and reliably distinguishing vocalic constituents from their consonantal counterparts (cf. [2]).

Figure 12: Word error as a function of lexical syllable form (C = consonant, V = vowel), partitioned into substitution and deletion errors. The proportion of the corpus associated with each syllabic form is also indicated (e.g., V forms are 10% of the total).

Figure 13: Word error as a function of syllable form for each of the eight ASR systems (and the mean, replotted from Figure 12).
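Partitioning words by syllable form presupposes a mapping from each syllable's phone sequence to its broad consonant/vowel template. A minimal sketch, assuming a simplified ARPAbet-like vowel inventory rather than the full STP segment set:

```python
# Derive a syllable's broad form (C/V template) from its phone sequence.
# The vowel set is a simplified illustrative subset.
VOWELS = {"aa", "ae", "ah", "ax", "eh", "ih", "iy", "uw", "er",
          "ao", "ow", "ey", "ay"}

def cv_form(phones):
    """Map a phone sequence to its C/V template, e.g. ['s','t','aa','p'] -> 'CCVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phones)

print(cv_form(["s", "t", "aa", "p"]))  # complex onset -> CCVC
print(cv_form(["ih", "t"]))            # vowel-initial -> VC
```

Grouping scored words by `cv_form` of their syllables yields the partition underlying Figures 12 and 13.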
Figure 14: The average error rate for the eight ASR systems as a function of the maximum stress associated with a word. The data are partitioned into substitution and deletion errors. The total error is the sum of the two (insertions were excluded from the analysis). A maximum stress of 0 indicates that the word was completely unstressed; 1 indicates that at least one syllable in the word was fully stressed; an intermediate level of stress is associated with a value between these extremes.

Figure 15: The average number of word deletions as a function of the maximum stress level associated with a word for each of the eight ASR systems.

10. THE EFFECT OF PROSODIC STRESS

Prosodic stress is an important means of providing informational emphasis in spontaneous speech [1] [9] and affects the pronunciation of phonetic elements [4] [6]. The diagnostic portion of the Switchboard corpus was prosodically labeled by two linguistically trained individuals, and this material ( ~steveng/prosody) was used to ascertain the relation between error rate and prosodic stress. There is a 50% higher probability of a recognition error when a word is entirely unstressed (Figure 14). The relation between lexical stress and word-error rate is particularly apparent for deletions (Figure 14) and is manifest across all ASR systems (Figure 15). The relation between deletion errors and prosodic stress suggests that ASR systems may benefit from incorporating methods for automatic classification of acoustic prominence (cf. [15] [16]).

11. THE EFFECT OF SPEAKING RATE

ASR systems generally have more difficulty recognizing speech of particularly fast [11] [12] or slow [12] tempo. A variety of methods have been proposed for automatically estimating speaking rate from the acoustic signal as a means of adapting recognition algorithms to the speaker's tempo [11] [12]. The speaking rate of each utterance in the diagnostic material was measured using two different metrics.
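The transcription-derived measure of speaking rate, syllables spoken per second, can be sketched directly; the utterance record format is hypothetical, and the slow/fast thresholds mirror the 3 and 6 syllables/sec boundaries discussed below.

```python
# A linguistic speaking-rate measure: syllables per second, computed from
# time-aligned syllable transcripts (record format is an assumption).
def syllable_rate(num_syllables, start_sec, end_sec):
    """Syllables per second over the span of one utterance."""
    return num_syllables / (end_sec - start_sec)

def tempo_band(rate, slow=3.0, fast=6.0):
    """Label an utterance's tempo relative to the normal speaking range."""
    if rate < slow:
        return "slow"
    if rate > fast:
        return "fast"
    return "normal"

rate = syllable_rate(num_syllables=11, start_sec=0.0, end_sec=2.5)
print(rate, tempo_band(rate))  # 4.4 syllables/sec -> normal
```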
The first, MRATE, derives its estimate of speaking rate from the modulation spectrum of the acoustic signal [12]. Because the modulation spectrum is systematically related to syllable duration [6], MRATE should in principle provide an indirect measure of syllable rate. The second metric is based directly on the number of syllables spoken per second and is derived from the transcription material.

Figure 16 illustrates the relation between MRATE and word-error rate. Word error does not change very much as a function of MRATE. In many instances the highest error rates are associated with the middle of the MRATE range, while the flanks of the range often exhibit a slightly lower proportion of word errors. Figure 17 illustrates the relation between word-error rate and syllables per second. In contrast to MRATE, this linguistic metric exhibits a much higher correlation between abnormal speech tempo and ASR performance. Utterances slower than 3 syllables/sec or faster than 6 syllables/sec have 50% more word-recognition errors than their counterparts in the core of the normal speaking range. Such data imply that algorithms based on some form of linguistic segmentation related to the syllable are more likely to provide accurate estimates of speaking rate than those based on purely acoustic properties of the speech signal (which may include hesitations, filled pauses and other non-linguistic events that potentially distort the estimate of speaking rate).

Figure 16: The relationship between word-error rate for each of the eight ASR systems (as well as the mean) and an acoustic measure of speaking rate (MRATE).

Figure 17: The relationship between word-error rate for each of the eight ASR systems (as well as the mean) and a linguistic measure of speaking rate (syllables per second).

12. CONCLUSIONS

The diagnostic evaluation of eight Switchboard-corpus recognition systems suggests that superior recognition performance is closely associated with the ability to accurately classify a word's phonetic constituents. This conclusion is supported by several different forms of analysis, including decision trees and correlations between the magnitude of word and phonetic-segment/articulatory-feature errors. Additional analyses indicate that syllable structure is also an important factor accounting for recognition performance, as are prosodic stress and speaking rate. Together, these analyses suggest that future-generation ASR systems should strive for highly accurate phonetic classification and integrate information about syllable structure and prosodic stress into the recognition process.

ACKNOWLEDGEMENTS

The authors wish to thank our colleagues at AT&T, BBN, Cambridge University, Dragon Systems, Johns Hopkins University, Mississippi State University, SRI International and the University of Washington for providing the material upon which the diagnostic evaluation of the Switchboard recognition systems is based.
We would also like to express our appreciation to Jeff Good and Leah Hitchcock for prosodically labeling the diagnostic component of the Switchboard corpus, to Joy Hollenback and Rosaria Silipo for assistance with the statistical analyses and data collection, and to George Doddington, Jon Fiscus, Hollis Fitch, Jack Godfrey and Joe Kupin for their helpful comments and advice concerning the analyses described. This research was supported by the U.S. Department of Defense.

REFERENCES

[1] Beckman, M., Stress and Non-Stress Accent, Foris, Dordrecht.
[2] Chang, S., Shastri, L. and Greenberg, S., "Automatic phonetic transcription of spontaneous speech (American English)," Proc. Int. Conf. Spoken Lang. Proc.
[3] Doddington, G., "Evidence of differences between lexical and actual phonetic realizations," presentation at the NIST Speech Transcription Workshop, College Park, MD, May 18.
[4] Fosler-Lussier, E., Greenberg, S. and Morgan, N., "Incorporating contextual phonetics into automatic speech recognition," Proc. XIVth Int. Cong. Phon. Sci.
[5] Godfrey, J. J., Holliman, E. C. and McDaniel, J., "SWITCHBOARD: Telephone speech corpus for research and development," Proc. IEEE Int. Conf. Acoust. Speech Sig. Proc.
[6] Greenberg, S., "Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation," Speech Communication, 29.
[7] Greenberg, S., "The Switchboard Transcription Project," Research Report #24, 1996 Large Vocabulary Continuous Speech Recognition Summer Research Workshop Technical Report Series, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD.
[8] Greenberg, S., Chang, S. and Hollenback, J., "An introduction to the diagnostic evaluation of Switchboard-corpus automatic speech recognition systems," Proc. NIST Speech Transcription Workshop.
[9] Lehiste, I., "Suprasegmentals," in Principles of Experimental Phonetics, N. Lass, editor, St. Louis: Mosby.
[10] Martin, A., Przybocki, M., Fiscus, J. and Pallett, D., "The 2000 NIST evaluation for recognition of conversational speech over the telephone," presentation at the NIST Speech Transcription Workshop, College Park, MD, May 17.
[11] Mirghafori, N., Fosler, E. and Morgan, N., "Fast speakers in large vocabulary continuous speech recognition: Analysis and antidotes," Proc. Eurospeech.
[12] Morgan, N., Fosler, E. and Mirghafori, N., "Speech recognition using on-line estimation of speaking rate," Proc. Eurospeech.
[13] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
[14] Shastri, L., Chang, S. and Greenberg, S., "Syllable detection and segmentation using temporal flow model neural networks," Proc. XIVth Int. Cong. Phon. Sci.
[15] Silipo, R. and Greenberg, S., "Prosodic stress revisited: Reexamining the role of fundamental frequency," Proc. NIST Speech Transcription Workshop, College Park, MD.
[16] Silipo, R. and Greenberg, S., "Automatic transcription of prosodic prominence for spontaneous English discourse," Proc. XIVth Int. Cong. Phon. Sci.
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
Things to remember when transcribing speech
Notes and discussion Things to remember when transcribing speech David Crystal University of Reading Until the day comes when this journal is available in an audio or video format, we shall have to rely
Articulatory Phonetics. and the International Phonetic Alphabet. Readings and Other Materials. Review. IPA: The Vowels. Practice
Supplementary Readings Supplementary Readings Handouts Online Tutorials The following readings have been posted to the Moodle course site: Contemporary Linguistics: Chapter 2 (pp. 34-40) Handouts for This
The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
Forensic Science International 146S (2004) S95 S99 www.elsevier.com/locate/forsciint The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications A.
Comparative Error Analysis of Dialog State Tracking
Comparative Error Analysis of Dialog State Tracking Ronnie W. Smith Department of Computer Science East Carolina University Greenville, North Carolina, 27834 [email protected] Abstract A primary motivation
Pronunciation in English
The Electronic Journal for English as a Second Language Pronunciation in English March 2013 Volume 16, Number 4 Title Level Publisher Type of product Minimum Hardware Requirements Software Requirements
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language
AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language Hugo Meinedo, Diamantino Caseiro, João Neto, and Isabel Trancoso L 2 F Spoken Language Systems Lab INESC-ID
A CHINESE SPEECH DATA WAREHOUSE
A CHINESE SPEECH DATA WAREHOUSE LUK Wing-Pong, Robert and CHENG Chung-Keng Department of Computing, Hong Kong Polytechnic University Tel: 2766 5143, FAX: 2774 0842, E-mail: {csrluk,cskcheng}@comp.polyu.edu.hk
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Healthcare Measurement Analysis Using Data mining Techniques
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik
Reading errors made by skilled and unskilled readers: evaluating a system that generates reports for people with poor literacy
University of Aberdeen Department of Computing Science Technical Report AUCS/TR040 Reading errors made by skilled and unskilled readers: evaluating a system that generates reports for people with poor
Speech Production 2. Paper 9: Foundations of Speech Communication Lent Term: Week 4. Katharine Barden
Speech Production 2 Paper 9: Foundations of Speech Communication Lent Term: Week 4 Katharine Barden Today s lecture Prosodic-segmental interdependencies Models of speech production Articulatory phonology
Understanding Impaired Speech. Kobi Calev, Morris Alper January 2016 Voiceitt
Understanding Impaired Speech Kobi Calev, Morris Alper January 2016 Voiceitt Our Problem Domain We deal with phonological disorders They may be either - resonance or phonation - physiological or neural
Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text
Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Matthew Cooper FX Palo Alto Laboratory Palo Alto, CA 94034 USA [email protected] ABSTRACT Video is becoming a prevalent medium
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
Intonation difficulties in non-native languages.
Intonation difficulties in non-native languages. Irma Rusadze Akaki Tsereteli State University, Assistant Professor, Kutaisi, Georgia Sopio Kipiani Akaki Tsereteli State University, Assistant Professor,
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
Generating Training Data for Medical Dictations
Generating Training Data for Medical Dictations Sergey Pakhomov University of Minnesota, MN [email protected] Michael Schonwetter Linguistech Consortium, NJ [email protected] Joan Bachenko
Text-To-Speech Technologies for Mobile Telephony Services
Text-To-Speech Technologies for Mobile Telephony Services Paulseph-John Farrugia Department of Computer Science and AI, University of Malta Abstract. Text-To-Speech (TTS) systems aim to transform arbitrary
PTE Academic. Score Guide. November 2012. Version 4
PTE Academic Score Guide November 2012 Version 4 PTE Academic Score Guide Copyright Pearson Education Ltd 2012. All rights reserved; no part of this publication may be reproduced without the prior written
Lecture 12: An Overview of Speech Recognition
Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated
Aspects of North Swedish intonational phonology. Bruce, Gösta
Aspects of North Swedish intonational phonology. Bruce, Gösta Published in: Proceedings from Fonetik 3 ; Phonum 9 Published: 3-01-01 Link to publication Citation for published version (APA): Bruce, G.
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
PERCENTAGE ARTICULATION LOSS OF CONSONANTS IN THE ELEMENTARY SCHOOL CLASSROOMS
The 21 st International Congress on Sound and Vibration 13-17 July, 2014, Beijing/China PERCENTAGE ARTICULATION LOSS OF CONSONANTS IN THE ELEMENTARY SCHOOL CLASSROOMS Dan Wang, Nanjie Yan and Jianxin Peng*
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Modeling Cantonese Pronunciation Variations for Large-Vocabulary Continuous Speech Recognition
Computational Linguistics and Chinese Language Processing Vol. 11, No. 1, March 2006, pp. 17-36 17 The Association for Computational Linguistics and Chinese Language Processing Modeling Cantonese Pronunciation
SPEECH OR LANGUAGE IMPAIRMENT EARLY CHILDHOOD SPECIAL EDUCATION
I. DEFINITION Speech or language impairment means a communication disorder, such as stuttering, impaired articulation, a language impairment (comprehension and/or expression), or a voice impairment, that
L2 EXPERIENCE MODULATES LEARNERS USE OF CUES IN THE PERCEPTION OF L3 TONES
L2 EXPERIENCE MODULATES LEARNERS USE OF CUES IN THE PERCEPTION OF L3 TONES Zhen Qin, Allard Jongman Department of Linguistics, University of Kansas, United States [email protected], [email protected]
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
Easily Identify Your Best Customers
IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do
Speech Analytics. Whitepaper
Speech Analytics Whitepaper This document is property of ASC telecom AG. All rights reserved. Distribution or copying of this document is forbidden without permission of ASC. 1 Introduction Hearing the
Measuring and synthesising expressivity: Some tools to analyse and simulate phonostyle
Measuring and synthesising expressivity: Some tools to analyse and simulate phonostyle J.-Ph. Goldman - University of Geneva EMUS Workshop 05.05.2008 Outline 1. Expressivity What is, how to characterize
KNOWLEDGE-BASED IN MEDICAL DECISION SUPPORT SYSTEM BASED ON SUBJECTIVE INTELLIGENCE
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 22/2013, ISSN 1642-6037 medical diagnosis, ontology, subjective intelligence, reasoning, fuzzy rules Hamido FUJITA 1 KNOWLEDGE-BASED IN MEDICAL DECISION
Method of Fault Detection in Cloud Computing Systems
, pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,
Earnings Announcement and Abnormal Return of S&P 500 Companies. Luke Qiu Washington University in St. Louis Economics Department Honors Thesis
Earnings Announcement and Abnormal Return of S&P 500 Companies Luke Qiu Washington University in St. Louis Economics Department Honors Thesis March 18, 2014 Abstract In this paper, I investigate the extent
The LENA TM Language Environment Analysis System:
FOUNDATION The LENA TM Language Environment Analysis System: The Interpreted Time Segments (ITS) File Dongxin Xu, Umit Yapanel, Sharmi Gray, & Charles T. Baer LENA Foundation, Boulder, CO LTR-04-2 September
PTE Academic Preparation Course Outline
PTE Academic Preparation Course Outline August 2011 V2 Pearson Education Ltd 2011. No part of this publication may be reproduced without the prior permission of Pearson Education Ltd. Introduction The
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that
D2.4: Two trained semantic decoders for the Appointment Scheduling task
D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive
The syllable as emerging unit of information, processing, production
The syllable as emerging unit of information, processing, production September 27-29, 2012 Dartmouth College, Hanover NH Neukom Institute for Computational Science; Linguistics and Cognitive Science Program
Recognition of Emotions in Interactive Voice Response Systems
Recognition of Emotions in Interactive Voice Response Systems Sherif Yacoub, Steve Simske, Xiaofan Lin, John Burns HP Laboratories Palo Alto HPL-2003-136 July 2 nd, 2003* E-mail: {sherif.yacoub, steven.simske,
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
Stricture and Nasal Place Assimilation. Jaye Padgett
Stricture and Nasal Place Assimilation Jaye Padgett Stricture Stricture features determine the degree of constriction in the vocal tract; [son], [ cons], [cont] [-cont]: Verschluss im mediosagittalen Bereich
Building A Vocabulary Self-Learning Speech Recognition System
INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes
The Effect of Long-Term Use of Drugs on Speaker s Fundamental Frequency
The Effect of Long-Term Use of Drugs on Speaker s Fundamental Frequency Andrey Raev 1, Yuri Matveev 1, Tatiana Goloshchapova 2 1 Speech Technology Center, St. Petersburg, RUSSIA {raev, matveev}@speechpro.com
Learning is a very general term denoting the way in which agents:
What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);
Data Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
Robustness of a Spoken Dialogue Interface for a Personal Assistant
Robustness of a Spoken Dialogue Interface for a Personal Assistant Anna Wong, Anh Nguyen and Wayne Wobcke School of Computer Science and Engineering University of New South Wales Sydney NSW 22, Australia
Comprehensive Reading Assessment Grades K-1
Comprehensive Reading Assessment Grades K-1 User Information Name: Doe, John Date of Birth: Jan 01, 1995 Current Grade in School: 3rd Grade in School at Evaluation: 1st Evaluation Date: May 17, 2006 Background
Input Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device
International Journal of Signal Processing Systems Vol. 3, No. 2, December 2015 Input Support System for Medical Records Created Using a Voice Memo Recorded by a Mobile Device K. Kurumizawa and H. Nishizaki
CHARTES D'ANGLAIS SOMMAIRE. CHARTE NIVEAU A1 Pages 2-4. CHARTE NIVEAU A2 Pages 5-7. CHARTE NIVEAU B1 Pages 8-10. CHARTE NIVEAU B2 Pages 11-14
CHARTES D'ANGLAIS SOMMAIRE CHARTE NIVEAU A1 Pages 2-4 CHARTE NIVEAU A2 Pages 5-7 CHARTE NIVEAU B1 Pages 8-10 CHARTE NIVEAU B2 Pages 11-14 CHARTE NIVEAU C1 Pages 15-17 MAJ, le 11 juin 2014 A1 Skills-based
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
Improving Automatic Forced Alignment for Dysarthric Speech Transcription
Improving Automatic Forced Alignment for Dysarthric Speech Transcription Yu Ting Yeung 2, Ka Ho Wong 1, Helen Meng 1,2 1 Human-Computer Communications Laboratory, Department of Systems Engineering and
Program curriculum for graduate studies in Speech and Music Communication
Program curriculum for graduate studies in Speech and Music Communication School of Computer Science and Communication, KTH (Translated version, November 2009) Common guidelines for graduate-level studies
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
Establishing the Uniqueness of the Human Voice for Security Applications
Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson Talk held by Goran Doychev Selected Topics in Information Security and
Emotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
Dual-route theory of word reading. Frequency-by-consistency interaction (SM89) Seidenberg and McClelland (1989, Psych. Rev.) Sum Squared Error
Dual-route theory of word reading Systematic spelling-sound knowledge takes the form of grapheme-phoneme correspondence (GPC) rules (e.g., G /g/, A E /A/) Applying GPC rules produces correct pronunciations
DRA2 Word Analysis. correlated to. Virginia Learning Standards Grade 1
DRA2 Word Analysis correlated to Virginia Learning Standards Grade 1 Quickly identify and generate words that rhyme with given words. Quickly identify and generate words that begin with the same sound.
A Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil [email protected] 2 Network Engineering
Thirukkural - A Text-to-Speech Synthesis System
Thirukkural - A Text-to-Speech Synthesis System G. L. Jayavardhana Rama, A. G. Ramakrishnan, M Vijay Venkatesh, R. Murali Shankar Department of Electrical Engg, Indian Institute of Science, Bangalore 560012,
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
SYSTEM DESIGN AND THE IMPORTANCE OF ACOUSTICS
SYSTEM DESIGN AND THE IMPORTANCE OF ACOUSTICS n Will your communication or emergency notification system broadcast intelligible speech messages in addition to alarm tones? n Will your system include multiple
SASSC: A Standard Arabic Single Speaker Corpus
SASSC: A Standard Arabic Single Speaker Corpus Ibrahim Almosallam, Atheer AlKhalifa, Mansour Alghamdi, Mohamed Alkanhal, Ashraf Alkhairy The Computer Research Institute King Abdulaziz City for Science
Prosodic Phrasing: Machine and Human Evaluation
Prosodic Phrasing: Machine and Human Evaluation M. Céu Viana*, Luís C. Oliveira**, Ana I. Mata***, *CLUL, **INESC-ID/IST, ***FLUL/CLUL Rua Alves Redol 9, 1000 Lisboa, Portugal [email protected], [email protected],
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
ETS Automated Scoring and NLP Technologies
ETS Automated Scoring and NLP Technologies Using natural language processing (NLP) and psychometric methods to develop innovative scoring technologies The growth of the use of constructed-response tasks
Quarterly Progress and Status Report. Preaspiration in Southern Swedish dialects
Dept. for Speech, Music and Hearing Quarterly Progress and Status Report Preaspiration in Southern Swedish dialects Tronnier, M. journal: Proceedings of Fonetik, TMH-QPSR volume: 44 number: 1 year: 2002
