# Machine Learning I Week 14: Sequence Learning Introduction


1 Machine Learning I Week 14: Sequence Learning Introduction Alex Graves Technische Universität München 29. January 2009

2 Literature
- Pattern Recognition and Machine Learning, Chapter 13: Sequential Data. Christopher M. Bishop
- Machine Learning for Sequential Data: A Review. Thomas G. Dietterich (review paper)
- Markovian Models for Sequential Data. Yoshua Bengio (review paper)
- Supervised Sequence Labelling with Recurrent Neural Networks. Alex Graves (Ph.D. thesis)
- On Intelligence. Jeff Hawkins

3 What is Sequence Learning? Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data. But many interesting data types are not i.i.d.; in particular, successive points in sequential data are strongly correlated. Sequence learning is the study of machine learning algorithms designed for sequential data. These algorithms should: not assume data points to be independent; be able to deal with sequential distortions; and make use of context information.

4 What is Sequence Learning Used for? Time-Series Prediction Tasks where the history of a time series is used to predict the next point. Applications include stock market prediction, weather forecasting, object tracking, disaster prediction... Sequence Labelling Tasks where a sequence of labels is applied to a sequence of data. Applications include speech recognition, gesture recognition, protein secondary structure prediction, handwriting recognition... For now we will concentrate on sequence labelling, but most algorithms are applicable to both

5 Definition of Sequence Labelling Sequence labelling is a supervised learning task where pairs (x, t) of input sequences and target label sequences are used for training. The inputs x come from the set X = (R^m)* of sequences of m-dimensional real-valued vectors, for some fixed m. The targets t come from the set T = L* of strings over the alphabet L of labels used for the task. In each pair (x, t) the target sequence is at most as long as the input sequence: |t| ≤ |x|. They are not necessarily the same length. Definition (Sequence Labelling): given a training set A and a test set B, both drawn independently from a fixed distribution D over X × T, the goal is to use A to train an algorithm h : X → T to label B in a way that minimises some task-specific error measure.

6 Comments We assume that the distribution D over X × T that generates the data is stationary, i.e. the probability of drawing a given pair (x, t) from D remains constant over time (strictly speaking this does not apply to e.g. financial and weather data, because markets and climates change over time). Therefore the sequences (but not the individual data points) are i.i.d. This means that much of the reasoning underlying standard machine learning algorithms also applies here, only at the level of sequences and not points.

7 Motivating Example Online handwriting recognition is the recognition of words and letters from sequences of pen positions. The inputs are the x and y coordinates of the pen, so m is 2. The label alphabet L is just the usual Latin alphabet, possibly with extra labels for punctuation marks etc. The error measure is the edit distance between the output of the classifier and the target sequence. [Slide figure: example classifier outputs for a handwritten word: "input", "itput", "utput", "output".]
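The edit distance used as the error measure can be computed by dynamic programming. A minimal sketch (the function name and test strings below are illustrative, not from the lecture):

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions, deletions
    and substitutions needed to turn sequence a into sequence b."""
    m, n = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

For example, `edit_distance("utput", "output")` is 1 (one insertion).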

8 Online Handwriting with Non-Sequential Algorithms If we assume the data points are independent, we should classify each coordinate separately. Clearly impossible! Each point is only meaningful in the context of its surroundings. The obvious solution is to classify an input window around each point. This is the usual approach when standard ML algorithms (SVMs, MLPs, etc.) are applied to sequential data.


10 Context and Input Windows One problem is that it is difficult to determine in advance how big the window should be. Too small gives poor performance; too big is computationally infeasible (too many parameters). The window size has to be hand-tuned for the dataset, depending on writing style, input resolution etc.


12 Sequential Distortion and Input Windows A deeper problem is that the same patterns often appear stretched or compressed along the time axis in different sequences. In handwriting this is caused by variations in writing style; in speech it comes from variations in speaking rate, prosody etc. Input windows are not robust to this because they ignore the relationship between the data points. Even a 1-pixel shift looks like a completely different image! This means poor generalisation and lots of training data needed.

13 Hidden State Architectures A better solution is to use an architecture with an internal hidden state that depends on both the previous state and the current data point The chain of hidden states acts like a memory of the data Can be extended to look several states back with no change to the basic structure (but an increase in computational cost) There are many types of hidden state architecture, including Hidden Markov models, recurrent neural networks, linear dynamical systems, extended Kalman filters...

14 Advantages of Hidden State Architectures Context is passed along by the memory stored in the previous states, so the problem of fixed-size input windows is avoided: information from an earlier input x_i can influence a later state s_n through the chain

Pr(s_n | x_i) = Pr(s_n | s_{n-1}) Pr(s_{n-1} | s_{n-2}) ... Pr(s_i | x_i)

Hidden state architectures also typically require fewer parameters than input windows. Sequential distortions can be accommodated by slight changes to the hidden state sequence, so similar sequences look similar to the algorithm. The general principle is that the structure of the architecture matches the structure of the data. Put another way, hidden state architectures are biased towards sequential data.

15 Hidden Markov Models Hidden Markov models (HMMs) are a generative hidden state architecture where sequences of discrete hidden states are matched to observation sequences. Recap: generative models attempt to determine the probability of the inputs (observations) given some class or label: Pr(x | C_k). Fitting a mixture of Gaussians to data is a well known example of a generative model with a hidden state. HMMs can be thought of as a sequential version of a mixture model.

16 Hidden Markov Models In a simple mixture model the observations are conditioned on the states. This is still true for HMMs, but now the states are conditioned on the previous states as well. This creates the following joint distribution over states and observations:

Pr(x, s | θ) = Pr(s_1 | π) [ ∏_{i=2}^{N} Pr(s_i | s_{i-1}, A) ] ∏_{i=1}^{N} Pr(x_i | s_i, φ)

where θ = {π, A, φ} are the HMM parameters and N is the sequence length.
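For a fixed state sequence this joint distribution is just a product of known factors. A sketch with a hypothetical two-state HMM over discrete observations {0, 1, 2} (all parameter values below are made up for illustration):

```python
# Hypothetical toy HMM parameters theta = {pi, A, phi}
pi = [0.6, 0.4]                  # pi[k]   = Pr(s_1 = k)
A = [[0.7, 0.3],
     [0.4, 0.6]]                 # A[j][k] = Pr(s_i = k | s_{i-1} = j)
phi = [[0.5, 0.4, 0.1],
       [0.1, 0.3, 0.6]]          # phi[k][x] = Pr(x_i = x | s_i = k)

def joint_prob(x, s):
    """Pr(x, s | theta): initial prob times transition probs times emissions."""
    p = pi[s[0]]
    for i in range(1, len(s)):
        p *= A[s[i - 1]][s[i]]   # state transitions, i = 2..N
    for i in range(len(x)):
        p *= phi[s[i]][x[i]]     # emissions, i = 1..N
    return p
```

For instance, `joint_prob([0, 2], [0, 1])` evaluates 0.6 * 0.3 * 0.5 * 0.6.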

17 HMM Parameters Pr(s_1 = k | π) = π_k are the initial probabilities of the states. Pr(s_i = k | s_{i-1} = j, A) = A_{jk} is the matrix of transition probabilities between states. Note that some of its entries may be zero, since not all transitions are necessarily allowed. Pr(x_i | s_i, φ) are the emission probabilities of the observations given the states. The form of Pr(x_i | s_i, φ) depends on the task, and the performance of HMMs depends critically on choosing a distribution able to accurately model the data. For cases where a single Gaussian is not flexible enough, mixtures of Gaussians and neural networks are common choices. For a mixture of n Gaussians:

Pr(x_i | s_i = k, φ) = ∑_{j=1}^{n} a_{kj} N(x_i | μ_{kj}, Σ_{kj})

18 Training and Using HMMs Like most parametric models, HMMs are trained by adjusting the parameters to maximise the log-likelihood of the training data:

log Pr(A | θ) = ∑_{(x,t) ∈ A} log ∑_s Pr(x, s | θ)

This can be done efficiently with the Baum-Welch algorithm. Once trained, we use the HMM to label a new data sequence x by finding the state sequence s* that gives the highest joint probability:

s* = arg max_s Pr(x, s | θ)

This can be done with the Viterbi algorithm. Note: HMMs can be seen as a special case of probabilistic graphical models (Bishop, chapter 8). In this context Baum-Welch and Viterbi are special cases of the sum-product and max-sum algorithms.

19 Evaluation Problem Given an observation sequence x and parameters θ, what is the probability Pr(x | θ)? First we need to compute Pr(s | θ). For example, with s = s_1 s_2 s_3:

Pr(s | θ) = Pr(s_1, s_2, s_3 | θ)
          = Pr(s_1 | θ) Pr(s_2, s_3 | s_1, θ)
          = Pr(s_1 | θ) Pr(s_2 | s_1, θ) Pr(s_3 | s_2, θ)
          = π_{s_1} A_{s_1 s_2} A_{s_2 s_3}

Then compute Pr(x | s, θ):

Pr(x | s, θ) = Pr(x_1 x_2 x_3 | s_1 s_2 s_3, θ) = Pr(x_1 | s_1, θ) Pr(x_2 | s_2, θ) Pr(x_3 | s_3, θ)   (1)

20 Evaluation Problem Then use the sum rule for probabilities to get

Pr(x | θ) = ∑_s Pr(s | θ) Pr(x | s, θ)   (2)

PROBLEM: the number of possible state sequences is |S|^N, which grows exponentially with the sequence length
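The blow-up is easy to see in code: a direct implementation of equation (2) must enumerate every state sequence. A sketch using a made-up two-state HMM over discrete observations {0, 1, 2} (all parameter values are illustrative):

```python
from itertools import product

# Hypothetical toy HMM parameters
pi = [0.6, 0.4]                  # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]     # transition probabilities
phi = [[0.5, 0.4, 0.1],          # emission probabilities per state
       [0.1, 0.3, 0.6]]

def prob_obs_brute_force(x):
    """Pr(x | theta) by summing Pr(s | theta) Pr(x | s, theta)
    over all |S|^N state sequences."""
    K, N = len(pi), len(x)
    total = 0.0
    for s in product(range(K), repeat=N):   # |S|^N terms
        p = pi[s[0]] * phi[s[0]][x[0]]
        for i in range(1, N):
            p *= A[s[i - 1]][s[i]] * phi[s[i]][x[i]]
        total += p
    return total
```

Even for this tiny model, a sequence of length 40 would already require 2^40 terms, which is why the forward recursion on the next slides matters.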

21 Unfold the HMM If we unfold the state transition diagram of the above example, we obtain a lattice, or trellis, representation of the latent states. This makes it easier to understand the following derivations.

22 A smart solution for the Evaluation Problem Define the forward variables

α_t(i) = Pr(x_1 x_2 ... x_t, s_t = i)
α_1(i) = π_i Pr(x_1 | s_1 = i)
α_{t+1}(j) = Pr(x_{t+1} | s_{t+1} = j) ∑_i A_{ij} α_t(i)

Therefore the probability of a given observation sequence, and the probability of it ending in state i, can be computed as follows:

Pr(x_1 x_2 ... x_t) = ∑_{i=1}^{N} α_t(i)
Pr(s_t = i | x_1 x_2 ... x_t) = α_t(i) / ∑_{j=1}^{N} α_t(j)

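The forward recursion reduces the cost from exponential to O(N K^2) for K states. A minimal sketch, reusing the same made-up two-state parameters as before (values are illustrative):

```python
# Hypothetical toy HMM parameters
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
phi = [[0.5, 0.4, 0.1],
       [0.1, 0.3, 0.6]]

def forward(x):
    """Forward algorithm: alpha[i] = Pr(x_1..x_t, s_t = i), updated in place.
    Returns Pr(x | theta) = sum_i alpha_N(i)."""
    K = len(pi)
    alpha = [pi[i] * phi[i][x[0]] for i in range(K)]               # alpha_1
    for t in range(1, len(x)):
        alpha = [phi[j][x[t]] * sum(A[i][j] * alpha[i] for i in range(K))
                 for j in range(K)]                                 # alpha_{t+1}
    return sum(alpha)
```

This gives exactly the same value as the brute-force sum over all state sequences, at a tiny fraction of the cost.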

24 The Viterbi Algorithm Given an observation sequence x, which is the state sequence s with the highest probability?

arg max_s Pr(s | x_1 x_2 ... x_T)
  = arg max_s Pr(x_1 x_2 ... x_T | s) Pr(s) / Pr(x_1 x_2 ... x_T)   (by Bayes' rule)
  = arg max_s Pr(x_1 x_2 ... x_T | s) Pr(s)

Again: dynamic programming to the rescue!

25 The Viterbi Algorithm The variable δ_t(i) is the maximum, over all state paths s_1 s_2 ... s_{t-1} ending in state i, of the joint probability of the path and the observations x_1 x_2 ... x_t:

δ_t(i) = max_{s_1 s_2 ... s_{t-1}} Pr(s_1 s_2 ... s_{t-1}, s_t = i, x_1 x_2 ... x_t)


27 The Viterbi Algorithm So for any δ_t(j) we are looking for the most probable path of length t that has i and j as its last two states. But this is the most probable path (of length t-1) to i, followed by the transition from i to j and the corresponding observation x_t. Thus the most probable path to j has i* as its penultimate state, with

i* = arg max_i δ_{t-1}(i) A_{ij} Pr(x_t | s_t = j)

Hence

δ_t(j) = δ_{t-1}(i*) A_{i*j} Pr(x_t | s_t = j)

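Combining the recursion with backpointers and a final trace-back gives the full algorithm. A sketch, again with made-up two-state parameters (illustrative only):

```python
# Hypothetical toy HMM parameters
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
phi = [[0.5, 0.4, 0.1],
       [0.1, 0.3, 0.6]]

def viterbi(x):
    """Return the state sequence s* maximising Pr(s, x | theta)."""
    K = len(pi)
    delta = [pi[i] * phi[i][x[0]] for i in range(K)]    # delta_1
    back = []                                            # backpointers per step
    for t in range(1, len(x)):
        prev, delta, ptr = delta, [], []
        for j in range(K):
            # penultimate state i* for the best path ending in j
            i_star = max(range(K), key=lambda i: prev[i] * A[i][j])
            ptr.append(i_star)
            delta.append(prev[i_star] * A[i_star][j] * phi[j][x[t]])
        back.append(ptr)
    s = [max(range(K), key=lambda i: delta[i])]          # best final state
    for ptr in reversed(back):                           # trace back
        s.append(ptr[s[-1]])
    return s[::-1]
```

Like the forward algorithm, this costs O(N K^2) instead of enumerating all K^N paths; only the sum is replaced by a max.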

29 Sequence Labelling with HMMs Could define a simple HMM with one state per label But for most data multi-state models are needed for each label, such as this one used for a phoneme in speech recognition Note that only left-to-right and self transitions are allowed. This ensures that all observation sequences generated by the label pass through similar stages For good performance, the states should correspond to independent observation segments within the label Can concatenate label models to get higher level structures, such as words

30 N-Gram Label Models Using separate label models assumes that the observation sequences generated by the labels are independent. But in practice this often isn't true: e.g. in speech, the pronunciation of a phoneme is influenced by those around it (co-articulation). The usual solution is to use n-gram label models (e.g. triphones), with the label generating the observations in the centre. This improves performance, but also increases the number of models (from |L| to |L|^n for a label alphabet L) and the amount of training data required. The parameter explosion can be reduced by tying similar states in different n-grams.

31 Duration Modelling The only way an HMM can stay in the same state is by repeatedly making self-transitions. This means the probability of spending T timesteps in some state k decays exponentially with T:

Pr(T) = (A_kk)^T (1 - A_kk) ∝ exp(T ln A_kk)

However this is usually an unrealistic model of state duration. One solution is to remove the self-transitions and model the duration probability p(T | k) explicitly. When state k is entered, a value for T is first drawn from p(T | k), and T successive observations are then drawn from p(x | k).
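The geometric decay is easy to verify numerically (the self-transition probability below is arbitrary):

```python
A_kk = 0.8  # arbitrary self-transition probability for state k

def duration_prob(T):
    """Pr of exactly T self-transitions before leaving state k:
    (A_kk)^T * (1 - A_kk), a geometric distribution."""
    return A_kk ** T * (1 - A_kk)

# successive probabilities shrink by the constant factor A_kk,
# i.e. exponentially in T, regardless of how plausible that is
ratios = [duration_prob(T + 1) / duration_prob(T) for T in range(5)]
```

Explicit duration modelling replaces this fixed geometric shape with any distribution p(T | k) fitted to the data.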

32 Variant HMM structures Many variant architectures can be created by modifying the basic HMM dependency structure. Three of them are described below. Input-output HMMs: turn HMMs into a discriminative model by reversing the dependency between the states and the observations and introducing extra output variables.

33 Variant HMM structures Autoregressive HMMs: add explicit dependencies between the observations to improve long-range context modelling. Factorial HMMs: add extra chains of hidden states, thereby moving from a single-valued to a distributed architecture and increasing the memory capacity of HMMs from O(log N) to O(N).


### Feature Extraction and Selection. More Info == Better Performance? Curse of Dimensionality. Feature Space. High-Dimensional Spaces

More Info == Better Performance? Feature Extraction and Selection APR Course, Delft, The Netherlands Marco Loog Feature Space Curse of Dimensionality A p-dimensional space, in which each dimension is a

### Bayesian Decision Theory. Prof. Richard Zanibbi

Bayesian Decision Theory Prof. Richard Zanibbi Bayesian Decision Theory The Basic Idea To minimize errors, choose the least risky class, i.e. the class for which the expected loss is smallest Assumptions

### Coding and decoding with convolutional codes. The Viterbi Algor

Coding and decoding with convolutional codes. The Viterbi Algorithm. 8 Block codes: main ideas Principles st point of view: infinite length block code nd point of view: convolutions Some examples Repetition

### SPEAKER ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS SYSTEM USING MLLR

SPEAKER ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS SYSTEM USING MLLR Masatsune Tamura y, Takashi Masuko y, Keiichi Tokuda yy, and Takao Kobayashi y ytokyo Institute of Technology, Yokohama, 6-8 JAPAN yynagoya

### The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Machine Learning Overview

Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 1. As a scientific Discipline 2. As an area of Computer Science/AI

### Hidden Markov Models

8.47 Introduction to omputational Molecular Biology Lecture 7: November 4, 2004 Scribe: Han-Pang hiu Lecturer: Ross Lippert Editor: Russ ox Hidden Markov Models The G island phenomenon The nucleotide frequencies

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Clustering - example. Given some data x i X Find a partitioning of the data into k disjunctive clusters Example: k-means clustering

Clustering - example Graph Mining and Graph Kernels Given some data x i X Find a partitioning of the data into k disjunctive clusters Example: k-means clustering x!!!!8!!! 8 x 1 1 Clustering - example

### Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia

### Discovering Structure in Continuous Variables Using Bayesian Networks

In: "Advances in Neural Information Processing Systems 8," MIT Press, Cambridge MA, 1996. Discovering Structure in Continuous Variables Using Bayesian Networks Reimar Hofmann and Volker Tresp Siemens AG,

### Information Retrieval Statistics of Text

Information Retrieval Statistics of Text James Allan University of Massachusetts, Amherst (NTU ST770-A) Fall 2002 All slides copyright Bruce Croft and/or James Allan Outline Zipf distribution Vocabulary

### Guest Editors Introduction: Machine Learning in Speech and Language Technologies

Guest Editors Introduction: Machine Learning in Speech and Language Technologies Pascale Fung (pascale@ee.ust.hk) Department of Electrical and Electronic Engineering Hong Kong University of Science and

### Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including

### Probabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm

Probabilistic Graphical Models 10-708 Homework 1: Due January 29, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 1-3. You must complete all four problems to

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### CS540 Machine learning Lecture 14 Mixtures, EM, Non-parametric models

CS540 Machine learning Lecture 14 Mixtures, EM, Non-parametric models Outline Mixture models EM for mixture models K means clustering Conditional mixtures Kernel density estimation Kernel regression GMM

### Interpreting American Sign Language with Kinect

Interpreting American Sign Language with Kinect Frank Huang and Sandy Huang December 16, 2011 1 Introduction Accurate and real-time machine translation of sign language has the potential to significantly

### 6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

### Learning Organizational Principles in Human Environments

Learning Organizational Principles in Human Environments Outline Motivation: Object Allocation Problem Organizational Principles in Kitchen Environments Datasets Learning Organizational Principles Features

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Activity recognition in ADL settings. Ben Kröse b.j.a.krose@uva.nl

Activity recognition in ADL settings Ben Kröse b.j.a.krose@uva.nl Content Why sensor monitoring for health and wellbeing? Activity monitoring from simple sensors Cameras Co-design and privacy issues Necessity

### Machine Learning and Statistics: What s the Connection?

Machine Learning and Statistics: What s the Connection? Institute for Adaptive and Neural Computation School of Informatics, University of Edinburgh, UK August 2006 Outline The roots of machine learning