MUSICAL INSTRUMENT FAMILY CLASSIFICATION

Ricardo A. Garcia
Media Lab, Massachusetts Institute of Technology
20 Ames Street, Room E5-40, Cambridge, MA 02139, USA
PH: 67-53-0  FAX: 67-58-664
e-mail: rago@media.mit.edu

A method to classify sounds of musical instruments (single monophonic notes) is introduced. The classification uses, in parallel, two sets of perceptual features extracted from the sounds. The models are mixtures of gaussians whose parameters were found by training over a database of target sound families. The feature extraction procedure, the model training, and the use of the models for classification are explained. An implementation of the method and its results are shown and discussed.

Introduction

Humans are very good at classifying types of musical instruments, even under the most adverse conditions (noise, polyphonic sound, environmental perturbations), but training a computer to distinguish between two instruments of the same family is not an easy task: the concept of a family of sounds is difficult to teach explicitly to a computer. A method that uses a set of perceptually derived features to train models of families of sounds is introduced. The sound is analyzed and mapped into two perceptual feature spaces: the spectral contour and the cepstral coefficients.

Approach

The proposed method uses a parallel feature space, modeled with gaussian mixtures, to approximate the families of musical instruments to be classified. The models are trained with expectation-maximization. Two feature spaces are used, spectral contour and cepstral coefficients, calculated on a frame-by-frame basis for each musical note in the training set. Normalization and dimensionality reduction are applied to produce the final training set for each musical instrument type. An independent gaussian mixture model is trained for each feature space of each instrument family.
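Both feature spaces are computed frame by frame before any modeling. As a minimal sketch of that first step (the paper's implementation was in Matlab; plain Python and a signal stored as a list of samples are assumptions made here for illustration), overlapping framing can be written as:

```python
def make_frames(signal, frame_len, overlap=0.3):
    """Split a signal into fixed-length frames; `overlap` is the
    fraction of each frame shared with the next (30% in the paper)."""
    hop = max(1, int(frame_len * (1.0 - overlap)))  # samples advanced per frame
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames
```

Each returned frame would then be passed to the spectral-contour and cepstrum computations described next.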
These models are used in parallel to compute, in every feature space, the probability that an unknown note belongs to a particular instrument type.

1 Feature space

Digital audio signals have high sampling rates, so the amount of data produced per second is very large. Fortunately, the characteristics of the sounds of musical instruments vary slowly in time, which allows a meaningful (and much smaller) set of data to be extracted from the original audio signal. It is also important to account, to some degree, for the human perceptual element when classifying musical instruments [4]. Features computed on a frame-by-frame basis, with about 0 ms per frame and 30 percent overlap, have been shown to work well for the analysis of musical sounds [5].

1.1 Spectral contour

Each frame is transformed with a DFT into the discrete frequency domain. The basic spectral contour is defined as the smoothed energy spectrum. In this project a modified spectral contour is used: the frequency axis is mapped non-linearly onto the bark scale [2]. The

bark scale is a perceptually derived scale composed of non-regular divisions of the frequency domain, each division corresponding to one critical band. The mapping is done using [2]:

    z = 13 arctan( 0.76 f / 1000 ) + 3.5 arctan( (f / 7500)^2 )    ( 1 )

The spectral contour is calculated for every frame, in all critical bands. Figure 1 shows its evolution in time versus the critical bands for different musical instruments (the log of the magnitude is plotted for easier reading).

[Figure 1: Spectral contour of (a) piano and (b) saxophone.]

1.2 Cepstral coefficients

A technique originally developed for speech analysis can be applied to extract information about the structure of the sound [7],[5]. The real cepstrum (or simply cepstrum) is computed as the inverse discrete Fourier transform of the logarithm of the magnitude of the energy spectrum of the frame:

    cepstrum = IFFT( log( |FFT( frame(t) )| ) )    ( 2 )

For each frame, a set of the first cepstral coefficients is selected. These first coefficients are related to the distribution of the broadband energy in the frequency domain. If a sound is regarded as a time-invariant linear system excited by a quasi-periodic source, the first coefficients of the cepstrum are related to the structure of the system, regardless of the frequency of the excitation [7]. This is very convenient for the task at hand, because the goal is to classify sounds regardless of their pitch; the underlying assumption is that each family of instruments preserves some of its characteristics (cepstral coefficients) across the pitch range. Figure 2 shows the evolution of the cepstral coefficients for the same instruments as in the previous figure.
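The two feature computations above can be sketched directly from their formulas. The following Python fragment (an illustrative stand-in for the original Matlab code, using a naive O(N^2) DFT instead of an FFT) implements the Hz-to-bark mapping and the real cepstrum defined above:

```python
import cmath
import math

def hz_to_bark(f):
    """Critical-band (bark) index for a frequency f in Hz (formula above)."""
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)

def dft(x):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(x):
    """Inverse DFT matching dft() above."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def real_cepstrum(frame, eps=1e-12):
    """cepstrum = IDFT( log |DFT(frame)| ); eps guards against log(0)."""
    log_mag = [math.log(abs(c) + eps) for c in dft(frame)]
    return [c.real for c in idft(log_mag)]
```

For instance, hz_to_bark(1000.0) falls between critical bands 8 and 9, and the first few values of real_cepstrum(frame) are the coefficients kept as features.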

[Figure 2: Cepstral coefficients for (a) piano and (b) saxophone.]

2 Feature data manipulation

Because feature extraction typically yields a high number of dimensions, dimensionality reduction is necessary to manage the data efficiently. Normalization helps to isolate the effects of the different energy levels of different frames and sound samples.

2.1 Normalization

Many of the feature values are related to the energy of the spectrum (or the log of the energy), which in turn depends on the energy of the original sound (the sound level). This can cause similar sounds at different sound levels to be classified into different families. The approach taken is to scale each frame to unit magnitude:

    frame = frame / ||frame||    ( 3 )

Note that frames that fade out in time will lose this time-dependent characteristic.

2.2 Dimensionality reduction

It is convenient to reduce the number of components in each frame: it lowers the computation time of the algorithms and improves classification performance by removing redundant (or noisy) information from the data sets. The method used is Principal Component Analysis (PCA) [3]:

    y = W x

2.3 Model of a musical instrument family: gaussian mixture

Each musical instrument family is modeled by two independent gaussian mixtures [1],[3],[6], one for each feature space. The representation of a single instrument family is therefore the linear combination of these gaussian mixtures:

    p(x | type) = A * sum_{j=1..M_contour} P_contour(j) p_contour(x | j)
                + B * sum_{j=1..M_cepstrum} P_cepstrum(j) p_cepstrum(x | j)    ( 4 )

where each p_contour/cepstrum(x | j) is a multivariate, full-covariance gaussian distribution with dimension equal to that of its feature space. The weights A and B reflect prior confidence in the measurements of each feature space:

    A + B = 1,    A, B >= 0    ( 5 )

2.3.1 Training: expectation maximization

An iterative procedure, expectation maximization (EM) [1],[3], is used to estimate an optimal set of parameters for the models. The algorithm selects a random initial guess for the parameters to be optimized (the priors P(j), the means u_j and the covariances S_j) and computes the expected probability of each training point under the mixture. A new set of parameters is then estimated, and the procedure is repeated until a good solution is found or a limit on the number of iterations is reached.

2.3.2 Classification procedure

Classification of a new sound sample has the following steps:
- Feature extraction: extract the frames and compute the spectral contour and cepstral coefficients.
- Normalization: normalize each frame to unit magnitude.
- Dimensionality reduction: for each feature set and each type of musical instrument, a different transform matrix is used to reduce the number of dimensions.
- Compute p(x | type) for all instrument types, all frames of the sound, and all feature spaces.
- Select the highest p(x | type) for each frame.
- Classify the sound as the type that appears most often across all classified frames and feature spaces.

3 Implementation

The suggested algorithm was implemented and tested. The resources used and the results obtained are described below.

3.1 Program and required resources

Sound database: McGill University Master Samples (MUMS library).
- CD 1, Solo Strings: solo violin, viola and cello. 3 notes
- CD 2, Woodwind and brass: flutes, clarinets, bassoons, trumpets, and trombones.
- CD 3, Piano: grand pianos. 64 notes

Each set was randomly divided into 80% training and 20% testing data.
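A compact sketch of the per-frame classification pipeline described above (unit-magnitude normalization, projection y = Wx with a precomputed PCA matrix, per-frame winner selection, and the final majority vote) might look as follows; Python is used here only as a sketch, and the instrument names and probabilities in the example are invented for illustration:

```python
import math
from collections import Counter

def normalize_frame(frame):
    """Scale a feature frame to unit Euclidean magnitude."""
    mag = math.sqrt(sum(v * v for v in frame))
    return [v / mag for v in frame] if mag else list(frame)

def project(w, x):
    """Dimensionality reduction with a precomputed PCA matrix W: y = W x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def classify(frame_scores):
    """frame_scores holds one dict per (frame, feature space) pair, mapping
    instrument type -> p(x | type). Pick the winner of each frame, then
    the type that wins most often overall."""
    votes = Counter(max(scores, key=scores.get) for scores in frame_scores)
    return votes.most_common(1)[0][0]

# Invented example: three frame/feature-space scores, two candidate families.
winner = classify([{'strings': 0.7, 'brass': 0.2},
                   {'strings': 0.4, 'brass': 0.6},
                   {'strings': 0.9, 'brass': 0.1}])
```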
454 notes in total.

The programs were written in Matlab 6.0 and run on a dual-processor Pentium III at 750 MHz with 56 MB of RAM. Finding the parameters for a model using 0 dimensions and 7 gaussians over the full training data set took on average 800 seconds. Because a good EM estimate depends on a good choice of initial conditions, each model had to be estimated an average of 3 times to avoid numerical-precision problems (a covariance matrix shrinking to zero).

4 Discussion

Some interesting practical aspects of EM and gaussian mixtures were noted during this project. With many dimensions (more than 8) and many gaussians (more than 6), the parameter estimation became very sensitive to its initial conditions; many runs were stopped because of numerical problems in the evaluation after a bad choice of initial conditions.
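The two safeguards that proved necessary (re-running EM from several random starts and, as discussed below, keeping a gaussian from shrinking too far) can be illustrated with a toy one-dimensional mixture. The paper's actual models are full-covariance multivariate mixtures trained in Matlab, so this Python fragment is only a sketch of the strategy:

```python
import math
import random

def gauss_pdf(x, mu, var):
    """Univariate gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_1d(data, k=2, iters=50, seed=0, var_floor=1e-3):
    """One EM run for a 1-D, k-component gaussian mixture.
    var_floor stops a component's variance from collapsing to zero."""
    rng = random.Random(seed)
    mus = [rng.choice(data) for _ in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * gauss_pdf(x, mus[j], variances[j]) for j in range(k)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and (floored) variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj,
                var_floor)
    ll = sum(math.log(max(sum(weights[j] * gauss_pdf(x, mus[j], variances[j])
                              for j in range(k)), 1e-300)) for x in data)
    return ll, weights, mus, variances

def best_of_restarts(data, restarts=3):
    """Re-run EM from different random starts; keep the best log-likelihood."""
    return max(em_1d(data, seed=s) for s in range(restarts))
```

The variance floor plays the role of the restriction on shrinking gaussians, and best_of_restarts keeps the run with the highest log-likelihood.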

A couple of restrictions had to be enforced to prevent a gaussian from shrinking too much in any given dimension, which would make it flat in that dimension and therefore worthless. The number of dimensions needed to express the feature spaces was around 8 for the spectral contour and about 5 for the cepstral coefficients; the number of gaussians was about 6 and 3, respectively.

5 References

[1] Duda, R. O., Hart, P. E., Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973
[2] Garcia, R. A., Digital Watermarking of Audio Signals using a Psychoacoustic Auditory Model and Spread Spectrum Theory, Master's thesis, University of Miami, 1999
[3] Gershenfeld, N., The Nature of Mathematical Modeling, Cambridge University Press, New York, 1999
[4] Martin, K. D., Sound Source Recognition: A Theory and Computational Model, Ph.D. dissertation, MIT, 1999
[5] Oppenheim, A. V., Schafer, R. W., Buck, J. R., Discrete-Time Signal Processing, Second Edition, Prentice Hall, New Jersey, 1999
[6] Papoulis, A., Probability, Random Variables, and Stochastic Processes, Third Edition, McGraw-Hill, 1991
[7] Rabiner, L. R., Schafer, R. W., Digital Processing of Speech Signals, Prentice Hall, New Jersey, 1978