Tutorial: Pattern Recognition in Acoustic Signal Processing


Tutorial: Pattern Recognition in Acoustic Signal Processing Mark Hasegawa-Johnson These slides: http://www.isle.uiuc.edu/slides/2009/hasegawa-johnson09asa1.pdf ASA Spring Meeting, May 21, 2009

Outline
1 Why Use Pattern Recognition?
2 Algorithm Selection
3 Tutorial: Discriminative Methods
   Hypothesis Space: Universal Approximators
   Training Criteria: Differentiable Error Metric
   Training Algorithm: Chain Rule
   Wrinkle #1: Recognition, Tracking
   Wrinkle #2: Small Training Corpus
4 Tutorial: Bayesian Methods
   Hypothesis Space: Latent Variables
   Training Criteria: Maximum Likelihood, MAP, MaxEnt
   Training and Inference Algorithms: Bayes Rule
   Wrinkle #1: HMM Regression, Switching Kalman Smoothers
5 Tutorial: Hybrids
6 Conclusions

Why Use Pattern Recognition? The Scientific Method: Hypothesize-Measure-Test, $y = h(x)$. 1 Based on knowledge of the physical situation, form: (a) a hypothesis, (b) a null hypothesis. 2 Collect data: $(x_i, y_i),\ 1 \le i \le N$. 3 Test the hypothesis: measure $P(\mathrm{data} \mid \mathrm{null\ hypothesis})$.

Why Use Pattern Recognition? The Pattern Recognition Method: Hypothesize-Measure-Learn-Test, $y = h(x)$. 1 Form an infinite set of hypotheses (called the hypothesis space), usually a parameterized universal approximator. 2 Collect: (a) training data $(x_i, y_i),\ 1 \le i \le N$; (b) testing data $(x_i, y_i),\ N+1 \le i \le N+M$. 3 Train the hypothesis: maximize $P(\mathrm{hypothesis} \mid \mathrm{training\ data})$. 4 Test the hypothesis: measure $P(\mathrm{testing\ data} \mid \mathrm{hypothesis})$.
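To make the train/test loop concrete, here is a minimal sketch (mine, not from the slides) of the procedure above: fit a simple parameterized hypothesis on training pairs, then score it on held-out testing pairs.

```python
import numpy as np

# Training data (x_i, y_i), 1 <= i <= N, and testing data, N+1 <= i <= N+M.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=150)
y = 2.0 * x + 0.1 * rng.normal(size=150)
N = 100
x_train, y_train, x_test, y_test = x[:N], y[:N], x[N:], y[N:]

# "Train the hypothesis": least-squares fit of the parameterized family h_theta(x) = a*x + b.
a, b = np.polyfit(x_train, y_train, deg=1)

# "Test the hypothesis": measure how well the trained hypothesis explains held-out data.
test_mse = np.mean((a * x_test + b - y_test) ** 2)
print(a, b, test_mse)
```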

Why Use Pattern Recognition? Example: PR in the Scientific Method

Algorithm Selection Criteria for Choosing a Pattern Recognizer 1 Structure of the Model 1 Discriminative Training: All parameters in the model can be simultaneously adjusted to minimize global error metric 2 Bayesian Training: Components must be separately trained, then combined without blowing up

Algorithm Selection Criteria for Choosing a Pattern Recognizer 1 Structure of the Model 1 Discriminative 2 Bayesian 2 Size of the Training Database 1 Empirical Risk Minimization: Training database includes 10,000 independent trials; train model to minimize training database error 2 Structural Risk Minimization: Training database smaller than 1000 trials; train model to minimize the bound $P(\mathrm{Error}) \le (\mathrm{Training\ Corpus\ Error}) + \lambda\,\frac{\mathrm{Model\ Complexity}}{\mathrm{Training\ Corpus\ Size}}$

Algorithm Selection Criteria for Choosing a Pattern Recognizer 1 Structure of the Model 1 Discriminative 2 Bayesian 2 Size of the Training Database 1 Empirical Risk Minimization 2 Structural Risk Minimization 3 Dynamic State 1 y = h(x) has no hidden state (classification, regression) 2 y = h(x) has hidden state (recognition, tracking)

Algorithm Selection Criteria for Choosing a Pattern Recognizer 1 Structure of the Model 1 Discriminative 2 Bayesian 2 Size of the Training Database 1 Empirical Risk Minimization 2 Structural Risk Minimization 3 Dynamic State 1 y = h(x) has no hidden state (classification, regression) 2 y = h(x) has hidden state (recognition, tracking) 4 Function Range 1 y = h(x) is an integer (classification, recognition) 2 y = h(x) is a real-valued vector (regression, tracking)

Tutorial: Discriminative Methods Discriminative Training Gradient Descent Methods 1 Choose a hypothesis space (a universal approximator) 2 Choose a differentiable error metric 3 Apply the Chain Rule

Tutorial: Discriminative Methods Hypothesis Space: Universal Approximators. Universal Approximator: Definition. A parameterized function space $h_\Theta(x)$, with parameter vector $\Theta \in \mathbb{R}^{\alpha K}$, is called a universal approximator if, for any bounded $h(x)$ with finite domain, $\lim_{K \to \infty} \min_\Theta \left\| h_\Theta(x) - h(x) \right\| = 0$. Example: Sigmoidal Neural Network. 1 Sigmoidal Neural Network ($\Theta = \{c_1, \dots, c_K, w_1, \dots, w_K\}$): $h_\Theta(x) = \sum_{k=1}^K c_k \frac{1}{1 + e^{-x^T w_k}}$

Tutorial: Discriminative Methods Hypothesis Space: Universal Approximators. More Examples: Universal Approximators. 1 Sigmoidal Neural Network ($\Theta = \{c_1, \dots, c_K, w_1, \dots, w_K\}$). 2 Mixture Gaussian ($\Theta = \{c_1, \dots, c_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K\}$): $h_\Theta(x) = \sum_{k=1}^K c_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$. 3 Classification and Regression Tree (CART), K Nearest Neighbors (KNN), etc. ($\Theta = [b_1, \dots, b_K, w_1, \dots, w_K, R_1, \dots, R_K]$): $h_\Theta(x) = x^T w_k + b_k$ if $x \in R_k$, where $R_k$ is a region with piecewise-linear boundaries.
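As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of two of these approximator families; the function names and the toy parameter values are assumptions made for the example.

```python
import numpy as np

def sigmoidal_net(x, c, W):
    """Sigmoidal neural network: h(x) = sum_k c_k * sigmoid(x . w_k)."""
    return np.sum(c / (1.0 + np.exp(-W @ x)))

def mixture_gaussian(x, c, mus, Sigmas):
    """Mixture Gaussian: h(x) = sum_k c_k * exp(-0.5 (x - mu_k)' Sigma_k^{-1} (x - mu_k))."""
    total = 0.0
    for c_k, mu_k, Sigma_k in zip(c, mus, Sigmas):
        d = x - mu_k
        total += c_k * np.exp(-0.5 * d @ np.linalg.solve(Sigma_k, d))
    return total

# Toy usage with K = 3 components in 2 dimensions (illustrative values only).
rng = np.random.default_rng(0)
K, D = 3, 2
x = rng.normal(size=D)
c, W = rng.normal(size=K), rng.normal(size=(K, D))
mus = rng.normal(size=(K, D))
Sigmas = np.stack([np.eye(D)] * K)
print(sigmoidal_net(x, c, W), mixture_gaussian(x, c, mus, Sigmas))
```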

Tutorial: Discriminative Methods Training Criteria: Differentiable Error Metric. Minkowski Norm Error Metrics: $E_L = \frac{1}{N} \sum_{i=1}^N \left\| h_\Theta(x_i) - y_i \right\|_L^L$. The Best Metric: The Zero Norm: $\left\| h_\Theta(x_i) - y_i \right\|_0 = 0$ if $y_i = h_\Theta(x_i)$, and $1$ otherwise. Problem: if $E_0 \ne 0$, what do we do next?

Tutorial: Discriminative Methods Training Criteria: Differentiable Error Metric. Minkowski Norm Error Metrics: $E_L = \frac{1}{N} \sum_{i=1}^N \left\| h_\Theta(x_i) - y_i \right\|_L^L$. Differentiable Error Metrics: $L = 1$, $L = 2$. 1 The One Norm (Manhattan Distance, $L = 1$): $\frac{\partial E_1}{\partial h_\Theta(x_i)} = \mathrm{sign}(h_\Theta(x_i) - y_i)$. 2 The Two Norm (Euclidean Distance, $L = 2$): $\frac{\partial E_2}{\partial h_\Theta(x_i)} = h_\Theta(x_i) - y_i$.
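A minimal sketch (mine) of these error metrics and their gradients with respect to the predictions, assuming NumPy arrays of predictions and targets:

```python
import numpy as np

def minkowski_error(h, y, L):
    """E_L = (1/N) * sum_i ||h_i - y_i||_L^L over N training examples."""
    return np.mean(np.sum(np.abs(h - y) ** L, axis=-1))

def grad_wrt_prediction(h, y, L):
    """dE/dh for the differentiable cases: sign of the residual (L=1) or the residual itself (L=2)."""
    if L == 1:
        return np.sign(h - y)
    if L == 2:
        return h - y
    raise ValueError("only L = 1 and L = 2 are handled here")

h = np.array([[0.2, 0.9], [0.7, 0.1]])   # predictions h_theta(x_i)
y = np.array([[0.0, 1.0], [1.0, 0.0]])   # targets y_i
print(minkowski_error(h, y, 2), grad_wrt_prediction(h, y, 2))
```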

Tutorial: Discriminative Methods Training Algorithm: Chain Rule Apply the Chain Rule. The Error Back-Propagation Algorithm. 1 Initialize: choose some initial parameter set $\Theta^{(0)}$. 2 Iterate: for $t = 1, \dots$ until $E(\Theta)$ stops changing, $\Theta^{(t)} = \Theta^{(t-1)} - \eta \nabla_\Theta E$, where $\nabla_\Theta E = \sum_{i=1}^N \left( \frac{\partial E}{\partial h_\Theta(x_i)} \right) \nabla_\Theta h_\Theta(x_i)$.
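Below is a minimal gradient-descent sketch (mine, not from the slides) applying this chain rule to the two-norm error and the single-layer sigmoidal network defined earlier; the learning rate, step count, and toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_sigmoidal_net(X, y, K=8, eta=0.05, steps=2000, seed=0):
    """Chain-rule gradient descent on E_2 for h(x) = sum_k c_k * sigmoid(x . w_k)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    c, W = rng.normal(scale=0.1, size=K), rng.normal(scale=0.1, size=(K, D))
    for _ in range(steps):
        S = sigmoid(X @ W.T)                  # N x K hidden activations
        h = S @ c                             # predictions h_theta(x_i)
        dE_dh = (h - y) / N                   # dE_2/dh for the mean squared error
        grad_c = S.T @ dE_dh                  # chain rule: dE/dc_k
        grad_W = (dE_dh[:, None] * c * S * (1 - S)).T @ X   # chain rule: dE/dw_k
        c -= eta * grad_c
        W -= eta * grad_W
    return c, W

# Toy usage: learn a noisy linear target from 2-D inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.01 * rng.normal(size=200)
c, W = fit_sigmoidal_net(X, y)
```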

Tutorial: Discriminative Methods Wrinkle #1: Recognition, Tracking. A Recursive Neural Net is a neural net with one or more hidden state variables: $h_\Theta(x_i) = \sum_{k=1}^K c_k \frac{1}{1 + e^{-[x_i^T,\ h_\Theta(x_{i-1})] w_k}}$. Training is performed using Back-Propagation Through Time: $\frac{\partial E_2}{\partial h_\Theta(x_i)} = (h_\Theta(x_i) - y_i) + (h_\Theta(x_{i+1}) - y_{i+1}) \frac{\partial h_\Theta(x_{i+1})}{\partial h_\Theta(x_i)} + \dots$
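A sketch (mine) of the forward pass of such a recurrent network, assuming a scalar output per time step that is fed back as the hidden state; the variable names are assumptions.

```python
import numpy as np

def rnn_forward(X, c, W, h0=0.0):
    """Recurrent forward pass: h_i = sum_k c_k * sigmoid([x_i, h_{i-1}] . w_k)."""
    h_prev, outputs = h0, []
    for x_i in X:                               # X has shape (T, D)
        z = np.concatenate([x_i, [h_prev]])     # augmented input [x_i, h_{i-1}]
        h_prev = np.sum(c / (1.0 + np.exp(-W @ z)))
        outputs.append(h_prev)
    return np.array(outputs)

# Toy usage: K = 4 hidden units, 2-D inputs over 10 time steps.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))
c, W = rng.normal(size=4), rng.normal(size=(4, 3))   # 3 = D + 1 for the recurrent term
print(rnn_forward(X, c, W))
```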

Tutorial: Discriminative Methods Wrinkle #1: Recognition, Tracking Recursive Neural Nets: Example Application. Task description (figure legend): blue = $x_i$ (F0, the RNN input); yellow = $y_i$ (pitch accent, the target RNN output); pink = $f(x_i)$ (the RNN-estimated pitch accent probability). [Figure: example results.]

Tutorial: Discriminative Methods Wrinkle #2: Small Training Corpus. Test Corpus Error Bounds based on the Central Limit Theorem. Example Bounds: $P(\mathrm{Error}) \le E(X, Y, \Theta) + G(\Theta)$. Minimum Description Length: $K(\Theta)$ describes $h_\Theta(x)$ as a binary program, and $G(\Theta) \propto K(\Theta)$. Support Vector Machines: $\psi(\Theta)$ describes $h_\Theta(x)$ as a linear classifier in an augmented feature space, and $G(\Theta) \propto \|\psi(\Theta)\|_2^2$.
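A minimal illustration (mine) of structural risk minimization in the SVM-style form above: minimize training error plus a squared-norm complexity penalty. The hinge loss, regularization weight, and subgradient update are standard choices, not taken from the slides.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, eta=0.01, steps=1000):
    """Minimize (training corpus error) + lam * ||w||_2^2 using hinge loss and subgradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                     # examples violating the margin
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / N
        grad_b = -y[active].sum() / N
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Toy usage: labels in {-1, +1}, 2-D features.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=-1, size=(50, 2)), rng.normal(loc=+1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
w, b = train_linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))   # training accuracy
```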

Tutorial: Discriminative Methods Wrinkle #2: Small Training Corpus Support Vector Machines: Example Application. Consonant-vowel transitions, and similar manner-change landmarks, are good places to look for information about speech [Stevens, Interspeech 2000]. Some landmarks occur relatively infrequently, so it's hard to learn what they sound like. SVMs do the job [Niyogi, Burges, and Ramesh, 1999; Borys and Hasegawa-Johnson, 2005; chance = 50%]:

SVM            %ACC    SVM            %ACC
-+Silence      92.1    +-Silence      91.6
-+Continuant   79.7    +-Continuant   81.1
-+Sonorant     86.4    +-Sonorant     91.1
-+Syllabic     88.6    +-Syllabic     78.5
-+Consonantal  78.1    +-Consonantal  73.1

Tutorial: Bayesian Methods. Bayesian Classification and Recognition: $y = \arg\max_y p_\Theta(y \mid x)$. Bayesian Regression and Tracking: $y = E\{y \mid x\}$. Advantages and Disadvantages. Disadvantage: one must learn $p(x, y)$; this usually requires more data, and is subject to more error, than learning $y = h(x)$ directly. Advantage: Bayesian inference allows modeling of latent variables, state dynamics, and extra sources of information in a principled manner.

Tutorial: Bayesian Methods Hypothesis Space: Latent Variables. Example: Hidden Markov Model. Let $X = [x_1, \dots, x_N]$ be the observations and $Y = [y_1, \dots, y_N]$ be the labels. A Hidden Markov Model posits the existence of some latent variables $Q = [q_1, \dots, q_N]$ such that $h_\Theta(X) = \arg\max_Y p_\Theta(X, Y)$, where $p_\Theta(X, Y) = \sum_Q \prod_{i=1}^N p_\Theta(y_i \mid y_{i-1})\, p_\Theta(q_i \mid q_{i-1}, y_i)\, p_\Theta(x_i \mid q_i)$. Here $p_\Theta(y_i \mid y_{i-1})$ (the "language model") is a lookup table, $p_\Theta(q_i \mid q_{i-1}, y_i)$ (the "pronunciation model") is a lookup table, and $p_\Theta(x_i \mid q_i, y_i)$ (the "acoustic model") is a Gaussian, with mean vector $\mu_q$ and covariance matrix $\Sigma_q$.
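A simplified sketch (mine) of evaluating an HMM likelihood with the forward algorithm, collapsing the label and pronunciation layers into a single transition table and using scalar Gaussian acoustic models; the variable names and this simplification are assumptions.

```python
import numpy as np
from scipy.stats import norm

def hmm_log_likelihood(x, log_pi, log_A, means, stds):
    """Forward algorithm: log p(x_1..x_N) = log sum_Q prod_i p(q_i | q_{i-1}) p(x_i | q_i)."""
    log_B = norm.logpdf(x[:, None], loc=means, scale=stds)    # N x K emission log-probs
    alpha = log_pi + log_B[0]                                  # log alpha_1(q)
    for t in range(1, len(x)):
        # log-sum-exp over the previous state, then add the new emission term
        alpha = log_B[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

# Toy usage: 2 hidden states, 1-D observations.
x = np.array([0.1, 0.3, 2.2, 2.0])
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
means, stds = np.array([0.0, 2.0]), np.array([0.5, 0.5])
print(hmm_log_likelihood(x, log_pi, log_A, means, stds))
```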

Tutorial: Bayesian Methods Training Criteria: Maximum Likelihood, MAP, MaxEnt Learn the Distributions. Maximum Likelihood Parameter Estimation: $\Theta = \arg\max_\Theta \log p_\Theta(X, Y)$. What About Small Training Corpora? Maximum Likelihood is a form of empirical risk minimization. Related forms of structural risk minimization include: MAP (maximum a posteriori probability), $\Theta = \arg\max_\Theta \left( \log p(X, Y \mid \Theta) + \log p(\Theta) \right)$; MaxEnt (maximum entropy), $\Theta = \arg\max_\Theta \left( \log p_\Theta(X, Y) + H(p_\Theta) \right)$.
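A small worked example (mine, not from the slides) contrasting maximum likelihood and MAP estimation of a Gaussian mean with a Gaussian prior; the closed-form expressions are standard.

```python
import numpy as np

def ml_mean(x):
    """Maximum likelihood: argmax_theta log p(x | theta) for a Gaussian mean is the sample mean."""
    return np.mean(x)

def map_mean(x, sigma2, prior_mean=0.0, prior_var=1.0):
    """MAP: argmax_theta [log p(x | theta) + log p(theta)] with a Gaussian prior on the mean.

    Closed form: a precision-weighted average of the sample mean and the prior mean.
    """
    n = len(x)
    post_precision = n / sigma2 + 1.0 / prior_var
    return (np.sum(x) / sigma2 + prior_mean / prior_var) / post_precision

x = np.array([1.8, 2.1, 2.4])            # a small training corpus
print(ml_mean(x))                         # the sample mean
print(map_mean(x, sigma2=1.0))            # shrunk toward the prior mean 0.0
```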

Tutorial: Bayesian Methods Training and Inference Algorithms: Bayes Rule Apply Bayes Rule. Bayes Rule (a.k.a. the Definition of Conditional Probability): $p_\Theta(X, Y) = \sum_Q \prod_{i=1}^N p_\Theta(y_i \mid y_{i-1})\, p_\Theta(q_i \mid q_{i-1}, y_i)\, p_\Theta(x_i \mid q_i)$. Training a Bayesian Classifier: Maximum Likelihood, $\Theta = \arg\max_\Theta p_\Theta(X, Y)$. Testing a Bayesian Classifier: Minimum Probability of Error, $Y = \arg\max_Y p_\Theta(X, Y)$.

Tutorial: Bayesian Methods Wrinkle #1: HMM Regression, Switching Kalman Smoothers Bayesian Regression and Tracking. Hidden Markov Regression. Suppose $y_i \in \mathbb{R}^D$ is a real-valued vector, and $(x_i, y_i)$ are jointly Gaussian: $p(x, y \mid q) \propto \exp\left( -\frac{1}{2} \begin{bmatrix} x_i - \bar{x}_q \\ y_i - \bar{y}_q \end{bmatrix}^T \begin{bmatrix} A_q & B_q \\ B_q^T & C_q \end{bmatrix}^{-1} \begin{bmatrix} x_i - \bar{x}_q \\ y_i - \bar{y}_q \end{bmatrix} \right)$. Then the $h(x_i) = \arg\min E_2$ solution is $E\{y_i \mid X\} = \sum_{q_i} P(q_i \mid X) \left( \bar{y}_q + B_q^T A_q^{-1} (x - \bar{x}_q) \right)$.
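A sketch (mine) of this regression step: given per-state posteriors $P(q \mid X)$ and per-state joint-Gaussian parameters, form the posterior-weighted conditional mean. The names and toy parameter values are assumptions.

```python
import numpy as np

def hmm_regression_estimate(x, post_q, x_bars, y_bars, A, B):
    """E{y | X} = sum_q P(q|X) * (y_bar_q + B_q' A_q^{-1} (x - x_bar_q))."""
    estimate = np.zeros_like(y_bars[0])
    for p_q, x_bar_q, y_bar_q, A_q, B_q in zip(post_q, x_bars, y_bars, A, B):
        estimate += p_q * (y_bar_q + B_q.T @ np.linalg.solve(A_q, x - x_bar_q))
    return estimate

# Toy usage: 2 states, 3-D acoustic vector x, 2-D articulator vector y.
post_q = np.array([0.7, 0.3])                    # P(q | X) from the HMM
x = np.array([0.5, -0.1, 0.2])
x_bars = np.zeros((2, 3))
y_bars = np.array([[1.0, 0.0], [0.0, 1.0]])
A = np.stack([np.eye(3)] * 2)                    # per-state A_q (acoustic covariance)
B = np.stack([np.full((3, 2), 0.1)] * 2)         # per-state B_q (acoustic-articulator cross-covariance)
print(hmm_regression_estimate(x, post_q, x_bars, y_bars, A, B))
```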

Tutorial: Bayesian Methods Wrinkle #1: HMM Regression, Switching Kalman Smoothers Switching Kalman Smoother. [Figure: graphical model with switching states $s_{t-1}, s_t, s_{t+1}$, targets $y_{t-1}, y_t, y_{t+1}$, and observations $x_{t-1}, x_t, x_{t+1}$.] Setup: exactly like HMM regression, except that $(x_i, y_i, y_{i-1})$ are jointly Gaussian. Result: exactly like HMM regression, except that $\bar{y}_{i|q,X}$, $A_{i|q,X}$, and $B_{i|q,X}$ must be updated using interacting multiple Kalman filters: $E\{y_i \mid X\} = \sum_{q_i} P(q_i \mid X) \left( \bar{y}_{i|q,X} + B_{i|q,X}^T A_{i|q,X}^{-1} (x - \bar{x}_q) \right)$.

Tutorial: Bayesian Methods Wrinkle #1: HMM Regression, Switching Kalman Smoothers Switching Kalman Smoother: Example. Task Description: $x_i$ is an acoustic spectrum; $y_i$ is the corresponding vector of speech articulator positions. Results (unpublished): $E_2(\text{Switching Kalman Smoother}) < E_2(\text{HMM Regression})$; the difference is consistent but very small.

Tutorial: Hybrids Hybrid Discriminative-Bayesian Systems. Task Scenario: $y_i$ is very difficult to classify without dynamic information (e.g., speech recognition: $y_i$ = words, $x_i$ = short-time spectrum). An auxiliary variable $f_i$ can be inferred very accurately using a local classifier (e.g., $f_i$ = phonological distinctive features): $\hat{f}_i = h_\Theta(x_i)$. The training database includes $X = [x_1, \dots, x_N]$, $Y = [y_1, \dots, y_N]$, and $F = [f_1, \dots, f_N]$; the testing database includes only $X = [x_{N+1}, \dots, x_{N+M}]$.

Tutorial: Hybrids Hybrid Training Methods. Training Algorithm. Train $h_\Theta(x)$ using discriminative methods: minimize $E_L = \frac{1}{N} \sum_{i=1}^N \left\| f_i - h_\Theta(x_i) \right\|_L^L$, so that $h_\Theta(x_i)$ is a real-valued vector that approximates $f_i$. Then train a probability model $p_\Theta(F, X, Y)$ using Bayesian methods. Using $p_\Theta(F, X, Y)$, Bayesian inference can model dynamics of hidden states, incorporate multiple knowledge sources, etc.
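A high-level sketch (mine) of this two-stage recipe: a discriminatively trained local classifier produces real-valued feature estimates, and a simple Bayesian model (label bigram plus Gaussian feature likelihoods) is then fit on top of them by maximum likelihood. All names, the logistic-regression front end, and the toy data are assumptions.

```python
import numpy as np

def train_local_classifier(X, F, eta=0.1, steps=500):
    """Stage 1 (discriminative): logistic regression mapping acoustics x_i to a feature f_i in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= eta * X.T @ (p - F) / len(F)
    return w

def train_bayesian_model(F_hat, Y, n_labels):
    """Stage 2 (Bayesian): label bigram p(y_i | y_{i-1}) and Gaussian p(f_hat | y), by maximum likelihood."""
    trans = np.ones((n_labels, n_labels))                       # add-one smoothing
    for a, b in zip(Y[:-1], Y[1:]):
        trans[a, b] += 1
    trans /= trans.sum(axis=1, keepdims=True)
    means = np.array([F_hat[Y == y].mean() for y in range(n_labels)])
    stds = np.array([F_hat[Y == y].std() + 1e-3 for y in range(n_labels)])
    return trans, means, stds

# Toy usage: a 1-D "distinctive feature" and two labels.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
F = (X[:, 0] > 0).astype(float)                                 # auxiliary variable f_i
Y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)      # labels y_i
w = train_local_classifier(X, F)
F_hat = 1.0 / (1.0 + np.exp(-X @ w))                            # real-valued approximation of f_i
trans, means, stds = train_bayesian_model(F_hat, Y, n_labels=2)
```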

Tutorial: Hybrids Example: Landmark-Based Speech Recognizer SVM-HMM Hybrid Landmark-Based Speech Recognizer. The SVM computes a real-valued distinctive feature that optimally discriminates between the cases $f_i = +1$ (a landmark of the specified type is present) and $f_i = -1$ (landmark absent). The HMM computes $p(\text{phoneme sequence} \mid \text{landmark sequence})$.

Tutorial: Hybrids Phone Recognition Accuracy vs. Mixture Size, Telephone Speech MFCC = mel frequency cepstral coefficients Landmark = detect manner-to-manner landmarks, e.g., obstruent-to-sonorant Manner = detect manner-onset landmarks, e.g., onset of sonorant region

Tutorial: Hybrids Example: RNN with Kalman Smoothing (Mitra et al., in review). [Figure: spectrogram of a synthetic utterance with tract-variable trajectories (VEL, TBCL, TBCD, TTCL, TTCD) plotted over time, comparing ANN+Kalman (AP) estimates against the true data.]

Conclusions. Pattern recognition (especially discriminative training): it's easy! Choose a hypothesis space (a family of universal approximators), then use gradient descent to minimize error. Bayesian learning simplifies the use of structured models: hidden-state dynamics, external sources of information. Hybrid discriminative-Bayesian methods sometimes give the best of both worlds: discriminative training = minimum error locally; Bayesian inference = principled integration of disparate information sources.

Conclusions Thank You! http://www.isle.uiuc.edu/slides/2009/hasegawa-johnson09asa1.pdf