BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION

P. Vanroose
Katholieke Universiteit Leuven, div. ESAT/PSI
Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium
Peter.Vanroose@esat.kuleuven.ac.be

This work was partially supported by the K.U.Leuven Research Fund and by the E.U. 5th Framework project MUSA.

This paper considers the problem of improving the automatic speech recognition (ASR) of audio fragments containing background music. The problem is put in the framework of linear source separation, where the music component is subtracted from the signal; the aim is better speech recognition, not necessarily better subjective audio quality.

PROBLEM FORMULATION

Consider the setup where one wants to apply ASR in the presence of background music, as is often the case with broadcast news or documentary programmes. The speech in such material is of good quality and a human listener has no problem understanding it, but an ASR system, although very robust against moderate white noise, fails badly when the noise is more structured, since music disturbs the frequency spectrum in a selective way. Typical audio fragments of this type can be found at, e.g., the news website of the BBC (http://news.bbc.co.uk/cbbcnews/), the news flashes on VRT radio (see http://www.radio1.be) or NOS (see http://omroep.nl/nieuws); moreover, almost any documentary programme adds background music to most, if not all, commentary, since this gives the programme a livelier flavour.

The type of music used for this purpose is chosen such that it neither disturbs human understanding of the spoken content nor distracts attention. Hence the music often consists of relatively long constant-frequency tones, as can be seen in the spectrogram of Figure 1. For the example of Figure 1, an audio fragment of 31 seconds from a BBC documentary was sampled at 16 kHz and the FFT spectrum was calculated on 512 samples at a time, with a time shift of 10 ms. These are standard values for speech recognition, see e.g. [1].
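As an illustration, this frame-based analysis could be coded along the following lines; a minimal NumPy sketch, assuming a mono 16 kHz signal in a 1-D array. The Hamming window and all names are illustrative choices, not details of the actual system:

    import numpy as np

    def spectrogram(signal, frame_len=512, shift=160):
        """FFT magnitudes per frame: 512-sample frames, 10 ms (160-sample) shift at 16 kHz."""
        window = np.hamming(frame_len)  # the window choice is an assumption
        n_frames = 1 + (len(signal) - frame_len) // shift
        frames = [signal[i * shift : i * shift + frame_len] * window
                  for i in range(n_frames)]
        # Magnitudes of the positive-frequency half of each 512-point FFT;
        # plotting these columns against time gives a spectrogram like Figure 1.
        return np.abs(np.fft.rfft(frames, axis=1))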

Figure 1: Spectrogram of a 31 second audio fragment with mixed voice and music

The spectrogram plots these FFT values (vertical axis) against time (horizontal axis), where darker pixels correspond to higher FFT values. In this spectrogram of a combined voice + music audio fragment, the music contribution, and especially its most disturbing aspect for ASR, is visible as horizontal lines.

A speech recogniser will typically start from a simplified version of the spectrogram, viz. the Mel-scale cepstral coefficients, which essentially downsample each column of the spectrogram in a nonlinear way to 24 values; see the upper third of Figure 2, where these 24 values per time instant are visualised in the same way as the spectrogram. Since not only the Mel values but also their temporal changes are important for speech recognition, Figure 2 also visualises the 48 additional feature values often used by a speech recogniser, viz. the first and second order time derivatives of the Mel cepstrum. This results in 72-dimensional feature vectors, one per 10 ms time frame, which are used to perform a Viterbi search in an HMM phoneme network [2].
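The assembly of these 72-dimensional vectors can be sketched as follows, assuming mel_cepstrum is an (n_frames, 24) array produced by a standard MFCC front end; simple numerical differencing stands in for the regression-based deltas that recognisers typically use:

    import numpy as np

    def add_deltas(mel_cepstrum):
        """Stack the 24 Mel cepstra with their first and second time derivatives."""
        delta = np.gradient(mel_cepstrum, axis=0)        # first order, 24 dims
        delta2 = np.gradient(delta, axis=0)              # second order, 24 dims
        return np.hstack([mel_cepstrum, delta, delta2])  # (n_frames, 72)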

Figure 2: Mel-scale cepstral coefficients and derivatives of the same audio fragment

It is clear that the long horizontal lines in the spectrogram, stemming from the music, are still present in the HMM features. It is therefore no surprise that the Viterbi search maps the affected frames to completely different phonemes, in contrast with the situation under moderate white Gaussian noise, where the Mel cepstral coefficients are much less affected.

A RELATED PROBLEM

Removal of additive noise from a signal is a widely studied problem [3]. It is typically tackled by separating signal and noise, making use of assumed statistical properties of both, such as the stationarity of the noise or the statistical independence of signal and noise, plus of course some discriminative feature which distinguishes the two, e.g. the assumption that the noise is white. The noise removal problem can be seen as a special case of the more general blind source separation problem [4], in which the sum of two or more statistically independent (stationary) sources is observed and the task is to extract the individual components.

This differs from the related problem of speech/music segmentation, as considered in e.g. [5, 6], where the task is to label audio fragments as either speech or music. That problem can be tackled directly within the ASR HMM framework, by training the silence state with music fragments. Unfortunately, a similar approach for background music reduction (applied to the other HMM states) fails, probably because the phoneme models trained in this way start to overlap too much. But from the speech/music segmentation problem we can learn which discriminative features are successful, and use these as guidelines for solving the source separation problem. In [5], two useful features are found to be the momentary frame entropy and the change rate of the audio. From [6] we learn that music fragments are more ordered than speech, i.e., the entropy of music is much more constant in time than the entropy of speech.
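To make the entropy feature concrete, a possible per-frame estimate and a "constancy" measure are sketched below. The histogram-based estimator and the use of the temporal standard deviation are our reading of the idea in [5, 6], not the exact definitions used there:

    import numpy as np

    def frame_entropy(spectral_frame, bins=32):
        """Entropy of one spectral frame, via a normalised magnitude histogram."""
        counts, _ = np.histogram(spectral_frame, bins=bins)
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log(p))

    def entropy_constancy(spec):
        """Temporal spread of frame entropy; small values suggest music-like audio."""
        h = np.array([frame_entropy(f) for f in spec])
        return h.std()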

INDEPENDENT COMPONENT ANALYSIS

Our task is thus to subtract the music component from the signal, where we may assume that the music and speech components are statistically independent. Independent component analysis (ICA) [7] is a well-known technique for doing this in an unsupervised way. The efficient FastICA algorithm [8], for which a freely downloadable MATLAB implementation exists, was applied to the 72-dimensional feature vectors. The result is a decomposition of these feature vectors into 72 linear components which are as independent from each other as possible.

The 72 independent components cannot be used directly: on the one hand we only need two components, and on the other hand ICA does not tell us which component(s) stem from the speech part of the signal and which from the music. Here the results from the speech/music segmentation problem can be used: we assume that music has a more constant frame entropy than speech. Applying ICA to several audio fragments, it was found that the best ASR results were obtained when subtracting from the signal the 10 ICA components with the lowest entropy in the 24 first order derivative components (see Figures 3 and 4).

The obtained improvement in speech recognition is not spectacular: the word error rate drops from 45% to around 35%, which is still far too high, but at least this proves that blind source separation of speech and music can be useful for improving speech recognition.
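A sketch of this selection step is given below, using the scikit-learn implementation of FastICA. The entropy estimate and the way a component is scored on the 24 first-derivative dimensions (columns 24:48 of the feature vector) are one possible reading of the procedure, not the exact recipe:

    import numpy as np
    from sklearn.decomposition import FastICA

    def reduce_music(features, n_remove=10):
        """Zero out the n_remove lowest-entropy ICA components of (n_frames, 72) features."""
        ica = FastICA(n_components=72, random_state=0)
        sources = ica.fit_transform(features)       # one column per independent component
        scores = []
        for k in range(sources.shape[1]):
            # Contribution of component k to the 24 first-derivative dimensions.
            contrib = np.outer(sources[:, k], ica.mixing_[24:48, k])
            counts, _ = np.histogram(contrib, bins=32)
            p = counts[counts > 0] / counts.sum()
            scores.append(-np.sum(p * np.log(p)))   # entropy of that contribution
        music = np.argsort(scores)[:n_remove]       # lowest entropy = most music-like
        sources[:, music] = 0.0
        return ica.inverse_transform(sources)       # cleaned 72-dimensional features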

Figure 3: ICA signal component of the Mel cepstrum

Figure 4: ICA music component of the Mel cepstrum

CONCLUSION

Removal of background music from mixed speech/music audio fragments is a difficult problem. A first attempt to use ICA for this purpose shows that it is indeed possible to reduce the speech recognition word error rate substantially.

REFERENCES

[1] P. Vanroose, G. Kalberer, P. Wambacq, L. Van Gool, "From speech to 3D face animation", in Proceedings of the 23rd Symposium on Information Theory in the Benelux, Louvain-la-Neuve, pp. 255-260, 2002.

[2] X.D. Huang, Y. Ariki, M.A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990.

[3] K. Hermus, P. Wambacq, D. Van Compernolle, "Improved noise robustness for speech recognition by adaptive SVD-based filtering", in Proceedings of the 20th Symposium on Information Theory in the Benelux, Enschede, pp. 117-124, 1999.

[4] T.W. Lee, M.S. Lewicki, M. Girolami, T.J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations", IEEE Signal Processing Letters, 6(4), pp. 87-90, 1999.

[5] J. Ajmera, I. McCowan, H. Bourlard, "Speech/music segmentation using entropy and dynamism features in a HMM classification framework", Speech Communication, 40, pp. 351-363, 2003.

[6] J. Pinquier, J.-L. Rouas, R. André-Obrecht, "Robust speech/music classification in audio documents", in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP), pp. 2005-2008, 2002.

[7] P. Comon, "Independent component analysis, a new concept?", Signal Processing, 36(3), pp. 287-314, 1994.

[8] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.