
Application of speaker adaptation methods used in speech synthesis for speaker de-identification
Miran Pobar, Department of Informatics, University of Rijeka, Croatia
COST Action IC1206 Meeting, Mataró, Spain, November 25-26, 2013

Outline
- Introduction
- Motivation
- Previous research
- The idea / planned work
- Conclusions and future work

Introduction
Speaker de-identification aims at a natural- and intelligible-sounding voice that does not reveal the speaker's identity [Jin et al., 2009]. It is not concerned with identification based on message content. Applications include call centers and anonymous help lines.

Motivation
Apply voice transformation (VT) for de-identification without requiring prior enrollment of new speakers.

Previous research
Jin et al. (2009), "Voice Convergin: Speaker De-Identification by Voice Transformation":
- Applied voice transformation to speaker de-identification.
- Goal: the voice sounds natural and intelligible but does not reveal the speaker's identity.
- Results show successful de-identification while preserving intelligibility.
- A speaker enrollment phase limits applicability.

Previous research: speaker adaptation in speech synthesis, especially bilingual speech synthesis:
- A Bilingual HMM-Based Speech Synthesis System for Closely Related Languages [Justin et al., 2012]
- A Comparison of Two Approaches to Bilingual HMM-Based Speech Synthesis [Pobar et al., 2013]
- The basic concepts were introduced earlier in [Latorre et al., 2006], [Wu et al., 2009], [Liang et al., 2008]

Bilingual speech synthesis
A bilingual speech system synthesizes speech in two languages with the same speaker identity. Many languages have only small database resources, and it is hard to recruit speakers fluent in both languages, so the synthetic voice must be built in the target language even though no data in that language exists for the target speaker.
Joint phoneme set approach
- Data from multiple speakers, adapted to the target speaker.
- Data from both languages is treated as monolingual and labeled with a joint phoneme set.
- Adaptation to the target speaker uses CMLLR.

State mapping approach
- Speaker-independent models are trained separately for both languages.
- The mapping between states is derived automatically from the minimum KL divergence between states.
- The mapping is used to apply transforms calculated for one language to the other.

How can we automate finding the phoneme/state similarities and differences between spoken data in similar languages? The method is similar to de-identification: transformation of one voice into a target voice.
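The minimum-KL state mapping above can be sketched in a few lines. This is a toy numpy sketch under the assumption that each HMM state is modeled by a single diagonal-covariance Gaussian (real systems use Gaussian mixtures and decision-tree-clustered states); the function names are mine:

```python
import numpy as np

def kl_gauss_diag(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * float(np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    ))

def map_states(src_states, tgt_states):
    """For each source-language state, find the target-language state with
    minimum symmetrised KL divergence, as in the state mapping approach.
    Each state is a (mean, variance) pair of 1-D arrays."""
    mapping = {}
    for i, (mu_s, var_s) in enumerate(src_states):
        best_j, best_d = None, np.inf
        for j, (mu_t, var_t) in enumerate(tgt_states):
            # symmetrised KL, so the distance does not depend on direction
            d = (kl_gauss_diag(mu_s, var_s, mu_t, var_t)
                 + kl_gauss_diag(mu_t, var_t, mu_s, var_s))
            if d < best_d:
                best_j, best_d = j, d
        mapping[i] = best_j
    return mapping
```

The exhaustive nearest-neighbour search is quadratic in the number of states, which is affordable offline since the mapping is computed once per language pair.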

HMM bilingual system overview
A truly bilingual model: a speaker-independent bilingual average voice, trained on data from multiple speakers in both languages with speaker adaptive training [Yamagishi et al., 2003]. The obtained average voice is used as the basis for adaptation to the target speaker identity, via CMLLR adaptation of the average-voice models. (Diagram adopted from [Yamagishi et al., 2009].)

Database preparation
The data is labeled with a common phoneme set, prepared with help from a phonetic expert with knowledge of both languages. Since the Slovene and Croatian databases already had phonetic dictionaries closely related to the SAMPA phonetic alphabets of the respective languages, we decided to map the phonemes with the help of the SAMPA alphabet.

State mapping approach
- Train two average-voice models, one in the source and one in the target language. The target speaker speaks one language; we want synthesis in both.
- Perform intra-lingual CMLLR adaptation for the target speaker's language.
- Apply the transforms trained in one language to the other via state mapping [Wu et al., 2009]:
  1. Establish state mappings between the source and target state models, i.e. find the nearest state model in the target language for each state model in the source language.
  2. Train the transforms for the average-voice model in the target language using the adaptation data.
  3. Adapt each state model in the source language using the transform attached to the mapped state model in the target language.
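Step 3 above, applying a CMLLR transform borrowed via the state mapping, amounts to an affine transform of each source-state mean. A minimal numpy sketch, assuming the transforms (A, b) have already been estimated (e.g. by HTS/HTK tools) and, for simplicity, that one transform is attached per target state; real systems tie transforms over regression classes rather than individual states:

```python
import numpy as np

def cross_lingual_adapt(src_state_means, mapping, transforms):
    """Adapt each source-language state mean with the CMLLR transform
    attached to its mapped target-language state: mu_hat = A @ mu + b.
    `mapping[i]` is the target state mapped to source state i;
    `transforms[j]` is the (A, b) pair estimated for target state j."""
    adapted = []
    for i, mu in enumerate(src_state_means):
        A, b = transforms[mapping[i]]
        adapted.append(A @ mu + b)
    return adapted
```

With an identity transform the means are unchanged; any non-trivial (A, b) shifts the source-language voice toward the target speaker.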

Test configurations and evaluation results
Figure: mean opinion scores for the systems (left: Croatian synthesis language; right: Slovene synthesis language).
Figure: similarity to the target speaker, degradation mean opinion score (left: Croatian synthesis language; right: Slovene synthesis language).

Summary of results
- Results are similar for both approaches in the intra-lingual situation.
- Results are asymmetric in the cross-lingual situation.
- The results suggest that, although there are differences between the two phoneme mapping approaches, the target speaker itself may play an important role in the performance of bilingual systems.
- In the S1* cases the duration model for each phoneme is also shared between the trained models, which is not the case in S2*.

De-identification using VT
(Source: Tanja Schultz, Qin Jin, Arthur Toth, Alan W Black: Speaker De-Identification, COST Action IC1206 Meeting, Zagreb, July 8-9, 2013.)

Speaker de-identification via VT [Jin et al., 2009]
Various strategies were tested against GMM and phonetic speaker identification systems:
- Standard voice transformation
- De-duration voice transformation
- Double voice transformation: composition of de-duration and standard VT
- Transterpolated VT: linear inter-/extrapolation of MCEPs, x = s + f(v - s)
De-identification rates range from 92% for the GMM system and 42% for the phonetic system with standard voice transformation, up to 100% for GMM and 87.5% for phonetic with transterpolated VT. Better results were obtained on a large database (95 male and 102 female speakers). Human listening tests covered intelligibility, quality of de-identification, and a comparison with a naive pitch-shifting approach.
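The transterpolation rule x = s + f(v - s) is just frame-wise linear inter-/extrapolation of the mel-cepstral coefficients. A minimal numpy sketch (function name is mine):

```python
import numpy as np

def transterpolate(source_mcep, converted_mcep, f):
    """Transterpolated voice transformation: x = s + f * (v - s), applied
    frame-wise to mel-cepstral (MCEP) features.
    0 < f < 1 interpolates between the source voice s and the converted
    voice v; f > 1 extrapolates past the converted voice, pushing the
    identity further away from the source."""
    s = np.asarray(source_mcep, dtype=float)
    v = np.asarray(converted_mcep, dtype=float)
    return s + f * (v - s)
```

At f = 0 the source is returned unchanged and at f = 1 the standard converted voice is returned; the reported gains of transterpolated VT over standard VT come from choosing f away from these endpoints.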

Speaker de-identification via VT [Jin et al., 2009]: lessons
- Voice transformation strategies can be used to de-identify speech.
- Standard VT performed well against the GMM system but poorly against the phonetic system.
- Of the VT strategies investigated, transterpolation worked best.
- Human tests: de-identified voices are understandable while the transmitted content remains secure.
- VT targets frame-level and some prosodic characteristics, BUT speaker identification remains possible from higher-level aspects such as prosody and word choice.
- Secure ways to re-identify a (selection of) speaker(s) are needed.
- The need for enrollment of new speakers limits applications.

Speaker de-identification via VT: planned improvements
Can we avoid the enrollment phase? The idea is similar to state mapping in bilingual speech synthesis: apply a transform learned on one speaker pair to another speaker.
- Synthesis: find a suitable transform via minimum KL divergence between states.
- De-identification: find a suitable transform via speaker identification (SID) over a closed set of source speakers.

Learning phase:
- Collect a database of a number of male and female speakers.
- Train a SID system on the closed set of speakers in the database.
- Calculate transforms from each speaker in the database towards one target speaker for de-identification.

De-identification phase:
- Force classification of the new speaker into one of the speakers from the SID set (the source speaker).
- Transform the speech using the transform learned on that source-target speaker pair.
- If enough data is collected from the new speaker, learn a speaker GMM and transform for them and add both to the source speaker set, expanding the set of de-identification transforms.

Planned setup
Database: the multilingual COST278 Broadcast News corpus (COST278BN), built for audio indexing of broadcast news in different languages; it contains complete news shows broadcast by different TV stations in different countries.
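The de-identification phase above, forced classification followed by transform lookup, can be sketched as follows. A toy numpy sketch in which each source speaker is scored with a single diagonal Gaussian standing in for the GMM/UBM scoring an ALIZE-style SID system would do; all names are mine and the transforms are treated as opaque objects:

```python
import numpy as np

def diag_gauss_loglik(X, mu, var):
    """Mean per-frame log-likelihood of feature frames X under a
    diagonal-covariance Gaussian speaker model."""
    X = np.asarray(X, dtype=float)
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
    return float(np.mean(np.sum(ll, axis=1)))

def select_transform(features, speaker_models, transform_pool):
    """Force-classify an unenrolled speaker into the closed source-speaker
    set (highest likelihood wins) and reuse that source speaker's
    precomputed de-identification transform.
    `speaker_models`: speaker id -> (mean, variance) model;
    `transform_pool`: speaker id -> transform trained towards the
    de-identification target speaker."""
    scores = {spk: diag_gauss_loglik(features, mu, var)
              for spk, (mu, var) in speaker_models.items()}
    best = max(scores, key=scores.get)
    return best, transform_pool[best]
```

Because classification is forced, every new speaker gets some transform immediately; enrollment is only needed later, if enough data accumulates to add the speaker to the pool.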
- 16 TV stations, 11 languages (Basque, Croatian, Czech, Dutch, Galician, Greek, Hungarian, Portuguese, Slovak, Slovenian (2 sets), Spanish).
- Audio (16 kHz, 16-bit, WAV format) and video; only 6 data sets also come with video, mostly in RealMedia format.
- The aim is to provide data for testing, model adaptation, and research on front-end processing algorithms such as speech/non-speech and speaker-turn segmentation, background classification, and speaker clustering, which are supposed to be only weakly language dependent.

Planned tools
- Speaker identification: ALIZE, an open-source GMM/UBM-based platform.
- Voice transformation: HTS? MATLAB?
- Evaluation: automatic speaker identification using ALIZE (baseline), plus subjective human listening tests.

Conclusions
De-identification of speakers using VT. The suggested improvements increase the possibilities for application:
- Goal: eliminate the need for enrollment of new speakers to be de-identified, which significantly broadens the applicability of de-identification.
- Maintain a pool of transformations and apply a transformation according to the speaker identification result.
- Expand the set of transformations using data collected from new speakers.

References
- Jin, Q., Toth, A. R., Schultz, T., & Black, A. W. (2009). Voice convergin: Speaker de-identification by voice transformation. In Proc. ICASSP 2009, pp. 3909-3912. IEEE.
- Pobar, M., Justin, T., Žibert, J., Mihelič, F., & Ipšić, I. (2013). A comparison of two approaches to bilingual HMM-based speech synthesis. In Text, Speech, and Dialogue, Lecture Notes in Computer Science, vol. 8082, pp. 44-51.
- Vandecatseye, A., Martens, J.-P., Neto, J., Meinedo, H., Garcia-Mateo, C., Dieguez, J., Mihelic, F., Zibert, J., Nouza, J., David, P., Pleva, M., Cizmar, A., Papageorgiou, H., & Alexandris, C. (2004). The COST278 pan-European broadcast news database. In Proc. LREC (Lisbon), pp. 873-876.
- Zibert, J., Mihelic, F., Martens, J.-P., Meinedo, H., Neto, J., Docio, L., Garcia-Mateo, C., David, P., Zdansky, J., Pleva, M., Cizmar, A., Zgank, A., Kacic, Z., Teleki, C., & Vicsi, K. (2005). The COST278 broadcast news segmentation and speaker clustering evaluation: Overview, methodology, systems, results. In Proc. Interspeech (Lisbon), pp. 629-632.
- Yamagishi, J., Kobayashi, T., Nakano, Y., Ogata, K., & Isogai, J. (2009). Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 17(1), 66-83.