
MÜSLI: A Classification Scheme For Laryngealizations

A. Batliner, S. Burger, B. Johne, A. Kießling

Ludwig-Maximilians-Universität München
Friedrich-Alexander-Universität Erlangen-Nürnberg

Report 45
December 1994

December 1994

A. Batliner, S. Burger, B. Johne, A. Kießling

Institut für Deutsche Philologie, Ludwig-Maximilians-Universität München, Schellingstr. 3, D-80799 München
Lehrstuhl für Mustererkennung (Inf. 5), Friedrich-Alexander-Universität Erlangen-Nürnberg, Martensstr. 3, D-91058 Erlangen
Tel.: (089) 2180-2916, e-mail: ue102ac@cd1.lrz-muenchen.de

Related project sections: 3.11, 3.12, 6.4

The research underlying this report was funded by the German Federal Minister for Research and Technology (BMFT) under grants 01 IV 102 H/0 and 01 IV 102 C 6. Responsibility for the contents of this work lies with the authors.

MÜSLI: A Classification Scheme For Laryngealizations

A. Batliner (1), S. Burger (1), B. Johne (1), A. Kießling (2)

(1) Ludwig-Maximilians-Universität München, Institut für Deutsche Philologie, Schellingstr. 3, 80799 München, F.R. of Germany
(2) Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung (Informatik 5), Martensstr. 3, 91058 Erlangen, F.R. of Germany

ABSTRACT

We developed a classification scheme for laryngealizations that can be used to discriminate the many different shapes of laryngealizations by means of different feature values. Potential applications are phonetic transcription and automatic detection. The scheme was developed and tested with a database from 4 speakers that contains more than 1200 laryngealizations.

INTRODUCTION

The normal speech register (modal voice) comprises an F0 range from about 60 to 250 Hz for male speakers and from about 120 to 550 Hz for female speakers. Below this register there is a special phonation type whose production mechanisms are not yet fully understood and whose linguistic functions have so far been little investigated. There is a variety of terms for this phenomenon, used more or less synonymously: creak, vocal fry, creaky voice, pulse register, laryngealization, etc. We use "laryngealization" (henceforth LA) as a cover term for all these phenomena that show up as irregular, voiced stretches of speech. Normally, LAs do not disturb pitch perception but are perceived as suprasegmental irritations modulated onto the pitch curve. Although LAs can be found not only in pathological speech but also in normal conversational speech, they have mostly not been objects of investigation; they were rather regarded as an irritating phenomenon to be discarded. On the other hand, it has recently been recognized that LAs often occur at word or morpheme boundaries and could thus be exploited in speech recognition; efforts towards their investigation and classification have been undertaken [2] [3].

In the time signal, LAs can look quite different (cf. figures 1-6), and it is not far-fetched to claim that the only common denominator of the different types is their irregularity. LAs can be produced with different means and different states of the glottis, but it is not yet clear whether there is a regular relationship between different production mechanisms and different types of LAs showing up in the time signal. In [2] four different types of LAs are characterized (cf. below). We will follow another approach and use non-binary features for our description scheme, so that it can be applied by different transcribers in a consistent way. An overspecification can be reduced in a second step. It should be possible to extract the features automatically with standard pattern recognition algorithms.

MATERIAL

We investigated a database of 1329 sentences from 4 speakers (3 female, 1 male; 30 minutes of speech in total). One third of the database consists of real spontaneous utterances gained from human-human clarification dialogues; the rest consists of the same utterances read by the same speakers nine months afterwards (own utterances and partners' utterances). Recording conditions were comparable to a quiet office environment. The utterances were digitized with 12 bit at 10 kHz; for more details cf. [1]. Two trained phoneticians classified the voiced passages as [+/- laryngealized] with the help of a segmentation program (time waveform presented on the screen and iterative listening to the segmented part). 4.8% of the speech in total (7.4% of the voiced parts) was laryngealized (henceforth la).

The mean duration of the LAs was 64.1 ms with a standard deviation of 35.1 ms; minimum = 12.8 ms (1 frame), maximum = 332.8 ms (26 frames). 16% of the LAs extend across a phoneme boundary. The non-la passages will not be considered in this paper. The la parts were plotted with their non-la context at a constant resolution, and a group of 6 experts tried to cluster a subset of these plots manually using different criteria (the 4 classes in [2] as well as phoneme-, context-, and speaker-specific peculiarities). Based on the similarities between tokens within a cluster and the dissimilarities between tokens of different clusters, several features were chosen for characterizing the LAs adequately. Afterwards, a classification scheme was developed heuristically and subsequently tested and verified with the whole material.

THE CLASSIFICATION SCHEME FOR LARYNGEALIZATIONS

In MÜSLI (Münchner Schema für Laryngalisierungs-Identifikation), six different features in four different domains (cf. table 1) are used for describing LAs. The values of these features can be determined independently of each other and are coded with integers within the ranges 1-3 or 1-4. Thus, every LA is determined by a sextuple of integers. In this paper, we will deal only with these features and not with other, e.g. speaker- or context-specific, phenomena. For lack of space, not every feature value can be illustrated in figures 1-6, but some of the values can be seen in the captions. The features and their values (given in square brackets) are described in the following. In parentheses, the percentage of all LAs assigned to the specific value is given. Values that can probably be combined into one single value (i.e. a reduction of overspecification) are given in curly brackets at the end of the description of each feature. Reasons for combining are: either one of the values, e.g. [3] in AMPSYN, occurs very seldom, or the two values might not be told apart with great certainty by e.g. an automatic classification; at the same time, the values do not discriminate different LA-types, such as e.g. the values [1] and [2] in F0SYN and F0PAR, cf. table 1.

1. NUMBER = number of glottal pulses:
   [1] many periods (83.5%)
   [2] two to three periods (8.8%)
   [3] one period (7.3%)
   {2, 3}

2. DAMPING = special form of the damped wave:
   [1] relatively normal damping (42.4%)
   [2] strong exponential decay of the amplitude (2.6%)
   [3] "delta-like", triangular damping (24.4%)
   [4] "unusual" damping (30.1%)
   {2, 4}

3. AMPSYN = amplitude compared with left and right context (syntagmatic aspect):
   [1] normal (76.2%)
   [2] lower (23.3%)
   [3] higher (0.4%)
   {1, 3}

4. AMPPAR = amplitude variations inside the LA (paradigmatic aspect):
   [1] regular envelope, no variations (23.3%)
   [2] slightly irregular envelope (45.3%)
   [3] "diplophonic", i.e. regular alternation between high and low amplitude (17.8%)
   [4] breakdown of the envelope (12.7%)
   {1, 2}

5. F0SYN = F0 compared with context (syntagmatic aspect):
   [1] regular, no variations (38.0%)
   [2] slightly irregular (15.1%)
   [3] subharmonic (25.2%)
   [4] extremely long period(s) or pause (20.3%)
   {1, 2}

6. F0PAR = F0 variations inside the LA (paradigmatic aspect):
   [1] regular, no variations (39.7%)
   [2] slightly irregular (28.1%)
   [3] strong variations (25.3%)
   [4] periods not detectable (6.7%)
   {1, 2}

The feature value [1] is always the default value, as it is found regularly in non-la speech as well. A value was assigned if it showed up during more than half of the la passage.
A "compound type" (56 occurrences in the database) was assigned if the la passage consisted of two or more clearly distinct parts that could be classified on their own. These parts were treated separately. In total, 1251 LAs were labeled with MÜSLI.
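To make the sextuple coding concrete, here is a minimal sketch in Python (not part of the original report; all identifiers such as FEATURES, VALUE_RANGES, and MERGES are illustrative) that represents an LA token as a tuple of the six feature values in the order NUMBER, DAMPING, AMPSYN, AMPPAR, F0SYN, F0PAR, checks it against the value ranges listed above, and applies the candidate merges given in curly brackets as one possible way of reducing the overspecification.

# Illustrative sketch of the MUESLI sextuple coding; identifiers are hypothetical.
from typing import Dict, Tuple

# Feature order used throughout: NUMBER, DAMPING, AMPSYN, AMPPAR, F0SYN, F0PAR.
FEATURES: Tuple[str, ...] = ("NUMBER", "DAMPING", "AMPSYN", "AMPPAR", "F0SYN", "F0PAR")

# Allowed integer values per feature (cf. the feature list above).
VALUE_RANGES: Dict[str, set] = {
    "NUMBER":  {1, 2, 3},
    "DAMPING": {1, 2, 3, 4},
    "AMPSYN":  {1, 2, 3},
    "AMPPAR":  {1, 2, 3, 4},
    "F0SYN":   {1, 2, 3, 4},
    "F0PAR":   {1, 2, 3, 4},
}

# Candidate merges for reducing overspecification (the curly-bracket pairs above);
# here the second value of each pair is simply recoded as the first.
MERGES: Dict[str, Dict[int, int]] = {
    "NUMBER":  {3: 2},   # {2, 3}
    "DAMPING": {4: 2},   # {2, 4}
    "AMPSYN":  {3: 1},   # {1, 3}
    "AMPPAR":  {2: 1},   # {1, 2}
    "F0SYN":   {2: 1},   # {1, 2}
    "F0PAR":   {2: 1},   # {1, 2}
}

def validate(sextuple: Tuple[int, ...]) -> None:
    """Raise ValueError if any of the six feature values is outside its range."""
    if len(sextuple) != len(FEATURES):
        raise ValueError("expected a sextuple of six feature values")
    for name, value in zip(FEATURES, sextuple):
        if value not in VALUE_RANGES[name]:
            raise ValueError(f"{name}={value} not in {sorted(VALUE_RANGES[name])}")

def reduce_overspecification(sextuple: Tuple[int, ...]) -> Tuple[int, ...]:
    """Recode each feature value according to the candidate merges."""
    return tuple(MERGES[name].get(value, value)
                 for name, value in zip(FEATURES, sextuple))

# Example: the glottalization token of figure 1, coded as (2, 3, 1, 2, 4, 3).
token = (2, 3, 1, 2, 4, 3)
validate(token)
print(reduce_overspecification(token))   # -> (2, 3, 1, 1, 4, 3)

Which of the two merge candidates is kept as the common code is arbitrary in this sketch; the report only states which pairs of values are candidates for merging.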

RESULTS AND DISCUSSION

Out of all 1251 LAs, 81% could be classified unequivocally and completely. A further 18% could also be classified, but with a disagreement in at least one feature value between the two phoneticians. In only 18 cases was there at least one feature value that could not be determined at all (feature value 0, cf. figure 6). The numbers given in the following always refer to all LAs except these 18 cases.

For a grouping of the LAs into distinct LA-types, we first chose those combinations of feature values (sextuples) that occurred at least 10 times. These sextuples were grouped such that (near-)default values were combined with as few non-default values as possible. We distinguish four different domains in the time signal: Number, Damping, Amplitude, and Frequency. In the following description, parentheses contain one or more of: 1. the relevant domain(s), 2. the number of the figure showing an example, 3. the term used in [2] if it differs. Three LA-types could be differentiated with the help of different domains: glottalization (Number and Frequency, figure 1), damping (Damping, figure 2, creak), and diplophonia (Amplitude, figure 3). Two LA-types could be differentiated within one single domain, namely subharmonic (figure 4, creak) and aperiodicity (figure 5, creak or creaky voice), both having different values within Frequency for F0SYN and F0PAR. In figure 6, the waste paper basket LA-type is illustrated with an example in which two feature values (for AMPPAR and F0SYN) could not be defined. AMPSYN is not a "distinctive feature" because it does not discriminate LA-types, but it can characterize LAs in general. In figures 1-6, the sextuple of feature values is given in parentheses in each caption. Although a "standard" glottalization has only one period followed by a long pause, the example given in figure 1 represents roughly half of all glottalizations in our material.

Table 1: LA-types and their characterization with MÜSLI

                   Number     Damping    Amplitude             Frequency
LA-type            NUMBER     DAMPING    AMPSYN     AMPPAR     F0SYN      F0PAR
(number of cases)
glottalization     '[23]'     3          1          [12]       '4'        [13]
(61/114/116)                  [1234]     [123]                            [123]
damping            1          '[34]'     1          [12]       1          [12]
(161/292/680)      [123]      [234]      [123]      [124]      [12]       [12]
diplophonia        1          1          1          '3'        [12]       [12]
(122/166/222)                 [1234]     [123]
subharmonic        1          [134]      1          [12]       '3'        '[12]'
(109/157/190)                 [1234]     [123]      [124]
aperiodicity       1          [134]      [12]       2          [34]       '[34]'
(158/242/384)                 [1234]     [123]      [1234]

Table 1 shows the five LA-types and their characterization with special feature values. The columns can be interpreted as regular terms: conjunction holds between columns, disjunction within brackets. Combinatorially, 3 × 4 × 3 × 4 × 4 × 4 = 2304 different sextuples can occur. In the first line of each LA-type, those combinations are shown that entail at least 10 cases (narrow condition: 56 possible, 24 occurring sextuples). Weakening the conditions, more cases can be classified; cf. the possible feature combinations in the second line of each LA-type (broad condition: 780 possible, 178 occurring sextuples). In the second line, cells are left empty whose terms do not differ from the corresponding terms in the first line. The cases comprised in line two are kept disjoint, i.e. there is no intersection of two LA-types; they represent, so to speak, pure LA-types. However, if we use as criterion only the "distinctive feature" values quoted in line one, i.e. all values are valid for the other features (very broad condition), we get 3552 possible and 247 occurring sextuples. In this case, 532 cases belong to more than one LA-type, 83% of them forming an intersection of damping with other LA-types. In the first column of table 1, the number of cases for the narrow/broad/very broad condition is given in parentheses below the name of each LA-type.
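Read this way, Table 1 is a small rule system: within each cell the bracketed values are alternatives, and the six feature conditions of a line are conjoined. The following minimal sketch in Python (not part of the original report; identifiers are illustrative, and the patterns follow the narrow-condition lines of the reconstructed table above) matches a sextuple against these rules and reproduces two counts quoted in the text: 56 admissible sextuples under the narrow condition and 3 × 4 × 3 × 4 × 4 × 4 = 2304 sextuples in the full feature space.

# Illustrative encoding of Table 1 (narrow condition); identifiers are hypothetical.
import math
from typing import Dict, List, Tuple

# Feature order: NUMBER, DAMPING, AMPSYN, AMPPAR, F0SYN, F0PAR.
# Per feature, the set of admissible values; conjunction across features,
# disjunction within each set (the "regular terms" reading of Table 1).
NARROW: Dict[str, Tuple[frozenset, ...]] = {
    "glottalization": (frozenset({2, 3}), frozenset({3}),       frozenset({1}),
                       frozenset({1, 2}), frozenset({4}),       frozenset({1, 3})),
    "damping":        (frozenset({1}),    frozenset({3, 4}),    frozenset({1}),
                       frozenset({1, 2}), frozenset({1}),       frozenset({1, 2})),
    "diplophonia":    (frozenset({1}),    frozenset({1}),       frozenset({1}),
                       frozenset({3}),    frozenset({1, 2}),    frozenset({1, 2})),
    "subharmonic":    (frozenset({1}),    frozenset({1, 3, 4}), frozenset({1}),
                       frozenset({1, 2}), frozenset({3}),       frozenset({1, 2})),
    "aperiodicity":   (frozenset({1}),    frozenset({1, 3, 4}), frozenset({1, 2}),
                       frozenset({2}),    frozenset({3, 4}),    frozenset({3, 4})),
}

def match(sextuple: Tuple[int, ...], patterns=NARROW) -> List[str]:
    """Return all LA-types whose pattern the sextuple satisfies."""
    return [name for name, sets in patterns.items()
            if all(value in allowed for value, allowed in zip(sextuple, sets))]

# Admissible sextuples per LA-type and in total under the narrow condition.
sizes = {name: math.prod(len(s) for s in sets) for name, sets in NARROW.items()}
print(sizes)                      # 8, 8, 4, 12, 24 respectively
print(sum(sizes.values()))        # 56, as quoted in the text
print(match((2, 3, 1, 2, 4, 3)))  # figure 1 example -> ['glottalization']

# The full feature space contains 3 * 4 * 3 * 4 * 4 * 4 = 2304 possible sextuples.
assert 3 * 4 * 3 * 4 * 4 * 4 == 2304

Under the broad and very broad conditions, only the per-feature value sets change; the matching logic itself stays the same.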

[Figures 1-6 show time-signal examples of the LA-types (waveforms not reproduced here). Figure 1: glottalization (231243); Figure 2: damping (141211); Figure 3: diplophonia (111311); Figure 4: subharmonic (111231); Figure 5: aperiodicity (131243); Figure 6: waste paper basket (111004).]

FINAL REMARKS

It can be doubted that the features are distinctive phonologically, but at least some of them might constitute allophones occurring in different contexts, while others might simply describe free variants. Yet, to our knowledge, the feature matrix in table 1 is the first attempt to describe a large corpus of LAs systematically and exhaustively with a feature approach; it seems to work reasonably well. MÜSLI should, however, not be taken as the final classification scheme for LAs but rather as a starting point for further investigation. Other possible features, e.g. spectral tilt, breathiness, or (partial) devoicing, could be taken into consideration as well. The next step will be the automatic extraction of the different features and then, hopefully, a more straightforward but at the same time more robust feature description and a reduction of the overspecification. It should be investigated further whether different LA-types can be discriminated perceptually, whether different LA-types have different functions such as e.g. boundary marking, and whether the different LA-types are speaker-, language-, or register-specific.

Acknowledgements

This work was supported by the German Ministry for Research and Technology (BMFT) in the joint research project ASL/VERBMOBIL and by the Deutsche Forschungsgemeinschaft (DFG). Only the authors are responsible for the contents of this paper.

References

[1] A. Batliner, C. Weiand, A. Kießling, and E. Nöth. Why sentence modality in spontaneous speech is more difficult to classify and why this fact is not too bad for prosody. In this volume.

[2] D. Huber. Aspects of the Communicative Function of Voice in Text Intonation. PhD thesis, Chalmers University, Göteborg/Lund, 1988.

[3] A. Kießling, R. Kompe, E. Nöth, and A. Batliner. Irregularitäten im Sprachsignal: störend oder informativ? In R. Hoffmann, editor, Elektronische Signalverarbeitung, volume 8 of Studientexte zur Sprachkommunikation, pages 104-108. TU Dresden, 1991.