Lecture 9: Introduction to Pattern Analysis




- Features, patterns and classifiers
- Components of a PR system
- An example
- Probability definitions
- Bayes Theorem
- Gaussian densities

Features, patterns and classifiers

- Feature
  - A feature is any distinctive aspect, quality or characteristic. Features may be symbolic (e.g., color) or numeric (e.g., height).
  - The combination of d features is represented as a d-dimensional column vector called a feature vector: x = [x1, x2, ..., xd]^T.
  - The d-dimensional space defined by the feature vector is called the feature space.
  - Objects are represented as points in feature space; this representation is called a scatter plot. A small sketch follows.

[Figure: a feature vector, a 3-D feature space, and a 2-D scatter plot of Feature 1 vs. Feature 2 with three classes]
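
To make the vector representation concrete, here is a minimal NumPy sketch (not from the slides); the three features and their values are invented for illustration.

```python
# A minimal sketch of the feature-vector idea; feature names and values
# are hypothetical.
import numpy as np

# Three objects, each described by d = 3 features
# (say: length, width, average intensity).
x1 = np.array([4.2, 1.1, 0.35])
x2 = np.array([3.9, 0.9, 0.41])
x3 = np.array([7.5, 2.3, 0.12])

# Stacking feature vectors gives an (n_objects x d) data matrix;
# each row is one point in the d-dimensional feature space.
X = np.vstack([x1, x2, x3])
print(X.shape)  # (3, 3)

# A 2-D scatter plot is a projection of feature space onto two features,
# e.g. plt.scatter(X[:, 0], X[:, 1]) with matplotlib.
```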

Features, patterns and classifiers

- Pattern
  - A pattern is a composite of traits or features characteristic of an individual.
  - In classification, a pattern is a pair of variables {x, ω}, where x is a collection of observations or features (the feature vector) and ω is the concept behind the observation (the label).
- What makes a good feature vector?
  - The quality of a feature vector is related to its ability to discriminate examples from different classes: examples from the same class should have similar feature values, while examples from different classes should have different feature values.

[Figure: scatter plots contrasting good features with bad features]

Features, patterns and classifiers

- More feature properties: linear separability, non-linear separability, multi-modality, highly correlated features.

[Figure: example scatter plots illustrating each of these feature properties]

- Classifiers
  - The goal of a classifier is to partition feature space into class-labeled decision regions.
  - Borders between decision regions are called decision boundaries.

[Figure: a feature space partitioned into decision regions R1-R4]

Components of a pattern rec. system

- A typical pattern recognition system contains:
  - A sensor
  - A preprocessing mechanism
  - A feature extraction mechanism (manual or automated)
  - A classification or description algorithm
  - A set of examples (training set) already classified or described

[Figure: block diagram. The real world feeds a sensor, then preprocessing and enhancement, then feature extraction, then a classification algorithm (class assignment), a clustering algorithm (cluster assignment), or a regression algorithm (predicted variables), with a feedback/adaptation loop]

An example

- Consider the following scenario*:
  - A fish processing plant wants to automate the process of sorting incoming fish according to species (salmon or sea bass).
  - The automation system consists of:
    - a conveyor belt for incoming products
    - two conveyor belts for sorted products
    - a pick-and-place robotic arm
    - a vision system with an overhead CCD camera
    - a computer to analyze images and control the robot arm

[Figure: the sorting station: CCD camera and computer above the incoming conveyor belt, a robot arm, and separate conveyor belts for salmon and bass]

*Adapted from Duda, Hart and Stork, Pattern Classification, 2nd Ed.

An example

- Sensor
  - The camera captures an image as a new fish enters the sorting area.
- Preprocessing
  - Adjustments for average intensity levels.
  - Segmentation to separate fish from background.
- Feature extraction
  - Suppose we know that, on average, sea bass is larger than salmon.
- Classification
  - Collect a set of examples from both species and plot a distribution of lengths for both classes.
  - Determine a decision boundary (threshold) that minimizes the classification error; a sketch of this step follows the slide.
  - We estimate the system's probability of error and obtain a discouraging result of 40%. What is next?

[Figure: histogram of counts vs. length for salmon and sea bass, with the decision boundary between the two modes]
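
As a rough illustration of the threshold step, this sketch searches for the length threshold with the smallest empirical error; the Gaussian lengths and class means are invented stand-ins for the plotted histograms, not values from the lecture.

```python
# Brute-force search for the error-minimizing length threshold
# on synthetic (assumed) length data.
import numpy as np

rng = np.random.default_rng(0)
salmon = rng.normal(10.0, 2.0, 500)    # hypothetical salmon lengths
seabass = rng.normal(14.0, 2.0, 500)   # hypothetical sea bass lengths

def error(t):
    # Rule: length > t -> sea bass. Count both kinds of mistakes.
    mistakes = (salmon > t).sum() + (seabass <= t).sum()
    return mistakes / (len(salmon) + len(seabass))

candidates = np.linspace(4.0, 20.0, 200)
best = min(candidates, key=error)
print(f"threshold = {best:.2f}, error = {error(best):.3f}")
```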

An example

- Improving the performance of our PR system
  - Committed to achieving a recognition rate of 95%, we try a number of features (width, area, position of the eyes w.r.t. the mouth...), only to find out that these features contain no discriminatory information.
  - Finally we find a good feature: the average intensity of the scales.

[Figure: histogram of counts vs. average scale intensity, with a decision boundary separating sea bass from salmon]

  - We combine length and average intensity of the scales to improve class separability.
  - We compute a linear discriminant function to separate the two classes (sketched below) and obtain a classification rate of 95.7%.

[Figure: scatter plot of length vs. average scale intensity with a linear decision boundary between sea bass and salmon]
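
A hedged sketch of the two-feature step: scikit-learn's LinearDiscriminantAnalysis stands in for whatever linear discriminant the lecture computes, and the (length, intensity) data are synthetic.

```python
# Fitting a linear discriminant on (length, avg. scale intensity);
# all class means and spreads are invented.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
salmon = rng.normal([10.0, 0.7], [2.0, 0.1], size=(500, 2))
seabass = rng.normal([14.0, 0.4], [2.0, 0.1], size=(500, 2))

X = np.vstack([salmon, seabass])
y = np.array([0] * 500 + [1] * 500)   # 0 = salmon, 1 = sea bass

clf = LinearDiscriminantAnalysis().fit(X, y)
print("training accuracy:", clf.score(X, y))
# The decision boundary is the line w . x + b = 0:
print("w =", clf.coef_, "b =", clf.intercept_)
```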

An example

- Cost versus classification rate
  - Is classification rate the best objective function for this problem?
    - The cost of misclassifying salmon as sea bass is that the end customer will occasionally find a tasty piece of salmon when he purchases sea bass.
    - The cost of misclassifying sea bass as salmon is a customer upset when he finds a piece of sea bass purchased at the price of salmon.
  - We could intuitively shift the decision boundary to minimize an alternative cost function; a sketch follows.

[Figure: the length vs. average scale intensity scatter plot with the old decision boundary and a new decision boundary shifted toward the salmon class]
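
One way to realize the shift is to minimize expected cost instead of error rate; in this sketch the 5:1 cost ratio and the synthetic lengths are assumptions, not values from the lecture.

```python
# Cost-sensitive threshold selection on the 1-D length feature.
import numpy as np

rng = np.random.default_rng(0)
salmon = rng.normal(10.0, 2.0, 500)
seabass = rng.normal(14.0, 2.0, 500)

COST_BASS_AS_SALMON = 5.0   # upset customer: expensive mistake
COST_SALMON_AS_BASS = 1.0   # pleasant surprise: cheap mistake

def expected_cost(t):
    # Rule: length > t -> sea bass.
    fp = (salmon > t).sum() * COST_SALMON_AS_BASS
    fn = (seabass <= t).sum() * COST_BASS_AS_SALMON
    return (fp + fn) / (len(salmon) + len(seabass))

candidates = np.linspace(4.0, 20.0, 200)
best = min(candidates, key=expected_cost)
print(f"cost-minimizing threshold = {best:.2f}")
# The threshold moves toward the salmon class, so fewer sea bass
# end up on the salmon side of the boundary.
```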

An example

- The issue of generalization
  - The recognition rate of our linear classifier (95.7%) met the design specs, but we still think we can improve the performance of the system.
  - We then design an artificial neural network with five hidden layers and a combination of logistic and hyperbolic tangent activation functions, train it with the Levenberg-Marquardt algorithm, and obtain an impressive classification rate of 99.9975% with the following decision boundary.

[Figure: scatter plot of length vs. average scale intensity with a convoluted decision boundary that separates every training example]

  - Satisfied with our classifier, we integrate the system and deploy it to the fish processing plant.
  - A few days later the plant manager calls to complain that the system is misclassifying an average of 25% of the fish. What went wrong?

Review of probability theory

- Probability
  - Probabilities are numbers assigned to events that indicate how likely it is that the event will occur when a random experiment is performed.

[Figure: a probability law mapping events A1, A2, A3, A4 in the sample space to their probabilities]

- Conditional probability
  - If A and B are two events, the probability of event A when we already know that event B has occurred, P[A|B], is defined by the relation

      P[A|B] = P[A ∩ B] / P[B],   for P[B] > 0

  - P[A|B] is read as "the conditional probability of A conditioned on B", or simply "the probability of A given B". A small numerical check follows.
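
A tiny enumeration check of the definition, using one roll of a fair die (my example, not the slides'): with A = "even" and B = "greater than 3", P[A|B] = (2/6)/(3/6) = 2/3.

```python
# Verifying P[A|B] = P[A ∩ B] / P[B] by enumerating a die roll.
from fractions import Fraction

omega = range(1, 7)                    # sample space of a fair die
A = {w for w in omega if w % 2 == 0}   # "roll is even"
B = {w for w in omega if w > 3}        # "roll is greater than 3"

P = lambda E: Fraction(len(E), 6)      # uniform probability law
print(P(A & B) / P(B))                 # 2/3
```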

Review of probability theory

- Conditional probability: graphical interpretation

[Figure: Venn diagrams of the sample space S; once B has occurred, only the part of A inside A ∩ B contributes to the probability of A]

- Theorem of total probability
  - Let B1, B2, ..., BN be mutually exclusive events whose union is the sample space. Then

      P[A] = P[A|B1]P[B1] + ... + P[A|BN]P[BN] = Σ_{k=1..N} P[A|Bk]P[Bk]

[Figure: a partition B1, ..., BN of the sample space, with event A overlapping several of the Bk]
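
A worked check of the formula with an assumed two-event partition (numbers invented for illustration):

```python
# Total probability with P[B1] = 0.3, P[B2] = 0.7.
p_B = [0.3, 0.7]
p_A_given_B = [0.9, 0.2]           # P[A|Bk], assumed
P_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(P_A)                         # 0.9*0.3 + 0.2*0.7 = 0.41
```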

Review of probability theory

- Bayes Theorem
  - Given B1, B2, ..., BN, a partition of the sample space S, suppose that event A occurs; what is the probability of event Bj?
  - Using the definition of conditional probability and the theorem of total probability we obtain

      P[Bj|A] = P[A ∩ Bj] / P[A] = P[A|Bj]P[Bj] / Σ_{k=1..N} P[A|Bk]P[Bk]

  - Bayes Theorem is definitely the fundamental relationship in Statistical Pattern Recognition.

Rev. Thomas Bayes (1702-1761)
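
A quick numeric check of the theorem (the standard diagnostic-test example, not from the slides), with the denominator supplied by the theorem of total probability:

```python
# B1 = "has condition", B2 = "does not"; A = "test is positive".
p_B = [0.01, 0.99]           # prior partition of the sample space
p_A_given_B = [0.95, 0.05]   # P[A|Bk], assumed

p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))  # total probability
posterior = p_A_given_B[0] * p_B[0] / p_A               # Bayes Theorem
print(f"P[B1|A] = {posterior:.3f}")                     # ~0.161
```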

Review of probability theory

- For pattern recognition, Bayes Theorem can be expressed as

      P(ωj|x) = P(x|ωj)P(ωj) / Σ_{k=1..N} P(x|ωk)P(ωk) = P(x|ωj)P(ωj) / P(x)

  where ωj is the j-th class and x is the feature vector.
- Each term in Bayes Theorem has a special name, which you should be familiar with:
  - P(ωj): prior probability (of class ωj)
  - P(ωj|x): posterior probability (of class ωj given the observation x)
  - P(x|ωj): likelihood (conditional probability of x given class ωj)
  - P(x): a normalization constant that does not affect the decision
- Two commonly used decision rules are
  - Maximum A Posteriori (MAP): choose the class ωj with the highest P(ωj|x)
  - Maximum Likelihood (ML): choose the class ωj with the highest P(x|ωj)
  - ML and MAP are equivalent for non-informative priors (P(ωj) = constant); the sketch below contrasts the two rules.
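
The sketch below contrasts the two rules using class-conditional Gaussians for the fish lengths; the priors, means and variances are invented, and the point is that a strong prior can flip the ML decision.

```python
# MAP vs. ML decisions with assumed 1-D Gaussian likelihoods.
from scipy.stats import norm

priors = {"salmon": 0.8, "seabass": 0.2}      # P(w), assumed
likelihood = {"salmon": norm(10.0, 2.0),      # p(x|w), assumed
              "seabass": norm(14.0, 2.0)}

def classify(x, rule="MAP"):
    def score(w):
        L = likelihood[w].pdf(x)
        return L * priors[w] if rule == "MAP" else L
    return max(priors, key=score)

x = 12.5
print("ML :", classify(x, "ML"))    # seabass (higher likelihood at x)
print("MAP:", classify(x, "MAP"))   # salmon (the prior tips the decision)
```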

Review of probability theory

- Characterizing features/vectors
  - Complete characterization: the probability mass/density function.

[Figure: a pdf for a person's weight in lb, and the pmf (all values 1/6) for rolling a fair die]

  - Partial characterization: statistics.
    - Expectation: represents the center of mass of a density.
    - Variance: represents the spread about the mean.
    - Covariance (only for random vectors): the tendency of each pair of features to vary together, i.e., to co-vary.
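
The sample counterparts of these statistics, sketched with NumPy on arbitrary synthetic draws:

```python
# Sample mean, variance and covariance of 2-D feature vectors.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))     # 1000 two-dimensional feature vectors

print(X.mean(axis=0))              # expectation: center of mass
print(X.var(axis=0))               # variance: spread about the mean
print(np.cov(X, rowvar=False))     # covariance: how the features co-vary
```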

Review of probability theory

- The covariance matrix (cont.)
  - Cov[X] = Σ = E[(X − μ)(X − μ)^T], the matrix whose (i,k) entry is cik = E[(xi − μi)(xk − μk)]; its diagonal entries are the variances σi².
  - The covariance terms can be expressed as cii = σi² and cik = ρik σi σk, where ρik is called the correlation coefficient.
  - Graphical interpretation:

[Figure: scatter plots of (xi, xk) for cik = −σiσk (ρik = −1), cik = −½σiσk (ρik = −½), cik = 0 (ρik = 0), cik = +½σiσk (ρik = +½), and cik = +σiσk (ρik = +1)]
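
A numerical check of the relation cik = ρik σi σk on correlated synthetic data (my construction, not the slides'):

```python
# Covariance vs. correlation-times-standard-deviations.
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=2000)
X = np.column_stack([z, 0.5 * z + rng.normal(scale=0.8, size=2000)])

C = np.cov(X, rowvar=False)        # covariance matrix (entries c_ik)
R = np.corrcoef(X, rowvar=False)   # correlation coefficients rho_ik
sigma = np.sqrt(np.diag(C))        # standard deviations sigma_i
print(np.allclose(C, R * np.outer(sigma, sigma)))   # True
```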

Review of probability theory

- Meet the multivariate Normal or Gaussian density N(μ, Σ):

      fX(x) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp(−½ (x − μ)^T Σ⁻¹ (x − μ))

  - For a single dimension, this expression reduces to the familiar

      fX(x) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²))

- Gaussian distributions are very popular because:
  - The parameters (μ, Σ) are sufficient to uniquely characterize the normal distribution.
  - If the xi's are mutually uncorrelated (cik = 0), then they are also independent; the covariance matrix becomes diagonal, with the individual variances on the main diagonal.
  - Marginal and conditional densities of a Gaussian are themselves Gaussian.
  - Linear transformations of a Gaussian are Gaussian.
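
As a sanity sketch, the density can be evaluated directly from the formula above and compared against scipy.stats.multivariate_normal; the μ, Σ and x below are arbitrary.

```python
# Multivariate Normal density, by hand and via SciPy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.5])

n = len(mu)
d = x - mu
by_hand = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
          np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
print(by_hand, multivariate_normal(mu, Sigma).pdf(x))   # identical values
```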