3F3: Signal and Pattern Processing




3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term

Classification

We will represent data by vectors in some vector space. Let x denote a data point with elements x = (x_1, x_2, ..., x_D). The elements of x, e.g. x_d, represent measured (observed) features of the data point; D denotes the number of measured features of each point.

The data set D consists of N pairs of data points and corresponding discrete class labels:

D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}

where y^(n) ∈ {1, ..., C} and C is the number of classes.

The goal is to classify new inputs correctly (i.e. to generalize).

Examples: spam vs non-spam; normal vs disease; 0 vs 1 vs 2 vs 3 ... vs 9

Classification Example: Iris Dataset

3 classes, 4 numeric attributes, 150 instances. A data set with 150 points and 3 classes. Each point is a random sample of measurements of flowers from one of three iris species (setosa, versicolor, and virginica) collected by Anderson (1935). Used by Fisher (1936) for his linear discriminant function technique. The measurements are sepal length, sepal width, petal length, and petal width, in cm.

Data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
...
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
...
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica

[Scatter plot: sepal width vs sepal length, three classes]

Linear Classification

Data set D: D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}

Assume y^(n) ∈ {0, 1} (i.e. two classes) and x^(n) ∈ R^D, and let x̃^(n) = (1, x^(n)).

Linear Classification (Deterministic):

y^(n) = H(β^T x̃^(n))

where H(z) = { 1 if z ≥ 0; 0 if z < 0 } is known as the Heaviside, threshold, or step function.
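As an illustration, the deterministic rule above can be sketched in Python (rather than the Matlab/Octave used later in the course; the function names here are my own):

```python
import numpy as np

def heaviside(z):
    """Step function H(z): 1 if z >= 0, else 0."""
    return (z >= 0).astype(int)

def linear_classify(X, beta):
    """Deterministic linear classifier y = H(beta^T x_tilde),
    where x_tilde prepends a constant 1 to each input."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # augment with a bias column
    return heaviside(X_tilde @ beta)

# Toy 1-D example: beta = (-1, 1) classifies x >= 1 as class 1
X = np.array([[0.0], [0.5], [1.0], [2.0]])
beta = np.array([-1.0, 1.0])
print(linear_classify(X, beta))  # [0 0 1 1]
```

The leading 1 in x̃ lets the threshold (the bias −β_0/β_1 here) be learned as just another weight.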

Linear Classification

Data set D: D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}

Assume y^(n) ∈ {0, 1} (i.e. two classes) and x^(n) ∈ R^D, and let x̃^(n) = (1, x^(n)).

Linear Classification (Probabilistic):

P(y^(n) = 1 | x^(n), β) = σ(β^T x̃^(n))

where σ(·) is a monotonic increasing function σ : R → [0, 1]. For example, if σ(z) is the sigmoid or logistic function

σ(z) = 1 / (1 + exp(−z)),

this is called logistic regression. Compare this to...

Linear Regression: y^(n) = β^T x̃^(n) + ε_n

[Plot: the logistic function σ(z)]

Logistic Classification

Data set: D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}.

P(y^(n) = 1 | x^(n), β) = σ(β^T x̃^(n))
P(y^(n) = 0 | x^(n), β) = 1 − P(y^(n) = 1 | x^(n), β) = σ(−β^T x̃^(n))

Since σ(z) = 1/(1 + exp(−z)), note that 1 − σ(z) = σ(−z).

This defines a Bernoulli distribution over y at each point (cf. θ^y (1 − θ)^(1−y)).

The likelihood for β is:

P(y | X, β) = ∏_n P(y^(n) | x^(n), β) = ∏_n σ(β^T x̃^(n))^(y^(n)) (1 − σ(β^T x̃^(n)))^(1−y^(n))

Given the data, this is just some function of β which we can optimize...

This is usually called Logistic Regression, but it's really solving a classification problem, so I'm going to correct this misnomer.
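A minimal numerical sketch of this Bernoulli likelihood (my own function names; inputs are assumed already augmented with a leading 1):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z)); note 1 - sigma(z) = sigma(-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(beta, X, y):
    """P(y | X, beta) = prod_n sigma(z_n)^y_n * sigma(-z_n)^(1 - y_n),
    with z_n = beta^T x^(n)."""
    z = X @ beta
    return np.prod(sigmoid(z) ** y * sigmoid(-z) ** (1 - y))

# Two points, one per class, correctly scored by beta = [2]:
X = np.array([[1.0], [-1.0]])
y = np.array([1, 0])
print(likelihood(np.array([2.0]), X, y))  # sigma(2)^2, roughly 0.776
```

Writing the 1 − σ(z) factors as σ(−z) keeps the product in one symmetric expression, which is what makes the gradient on the next slide come out so cleanly.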

Logistic Classification

Maximum Likelihood learning:

L(β) = ln P(y | X, β) = ∑_n ln P(y^(n) | x^(n), β)
     = ∑_n [ y^(n) ln σ(β^T x̃^(n)) + (1 − y^(n)) ln σ(−β^T x̃^(n)) ]

Taking derivatives, using ∂ln σ(z)/∂z = σ(−z), and defining z_n := β^T x̃^(n), we get:

∂L(β)/∂β = ∑_n [ y^(n) σ(−z_n) x̃^(n) − (1 − y^(n)) σ(z_n) x̃^(n) ]
         = ∑_n [ y^(n) σ(−z_n) x̃^(n) + y^(n) σ(z_n) x̃^(n) − σ(z_n) x̃^(n) ]
         = ∑_n (y^(n) − σ(z_n)) x̃^(n)

Intuition?
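The closed-form gradient ∑_n (y^(n) − σ(z_n)) x̃^(n) is easy to check numerically against finite differences of the log-likelihood; a sketch (my own names, random toy data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(beta, X, y):
    """L(beta) = sum_n [ y ln sigma(z_n) + (1 - y) ln sigma(-z_n) ]."""
    z = X @ beta
    return np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(sigmoid(-z)))

def grad_log_lik(beta, X, y):
    """dL/dbeta = sum_n (y_n - sigma(z_n)) x^(n), as matrix algebra."""
    return X.T @ (y - sigmoid(X @ beta))

# Central finite differences agree with the closed form:
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
beta = rng.normal(size=3)
eps = 1e-6
num = np.array([(log_lik(beta + eps * e, X, y) - log_lik(beta - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(num, grad_log_lik(beta, X, y), atol=1e-5)
```

The intuition the slide asks for: each point pushes β in the direction of its input, weighted by the prediction error y^(n) − σ(z_n); correctly and confidently classified points contribute almost nothing.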

Logistic Classification

The gradient:

∂L(β)/∂β = ∑_n (y^(n) − σ(z_n)) x̃^(n)

A learning rule (steepest gradient ascent in the likelihood):

β^[t+1] = β^[t] + η ∂L(β^[t])/∂β^[t]

where η is a learning rate. Which gives:

β^[t+1] = β^[t] + η ∑_n (y^(n) − σ(z_n)) x̃^(n)

This is a batch learning rule, since all N points are considered at once. An online learning rule can consider one data point at a time...

β^[t+1] = β^[t] + η (y^(t) − σ(z_t)) x̃^(t)

This is useful for real-time learning and adaptation applications. One can also use any standard optimizer (e.g. conjugate gradients, Newton's method).
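The online rule can be sketched in a few lines of Python (the course suggests Matlab/Octave; this is just an illustrative translation, with my own function names and a hypothetical learning-rate/epoch choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_online(X, y, eta=0.1, epochs=200, seed=0):
    """Online rule: beta <- beta + eta * (y_t - sigma(z_t)) * x_t,
    sweeping the data in random order for a fixed number of epochs."""
    rng = np.random.default_rng(seed)
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # augment with bias
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            beta += eta * (y[n] - sigmoid(beta @ X_tilde[n])) * X_tilde[n]
    return beta

# Toy separable data: class 1 iff x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
beta = fit_logistic_online(X, y)
preds = (sigmoid(np.hstack([np.ones((4, 1)), X]) @ beta) > 0.5).astype(int)
print(preds)  # expect [0 0 1 1]
```

Note that on perfectly separable data like this, plain maximum likelihood will keep growing ‖β‖ without bound, which is exactly the issue the MAP slide below addresses.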

Logistic Classification

Demo of the online logistic classification learning rule:

β^[t+1] = β^[t] + η (y^(t) − σ(z_t)) x̃^(t)

where z_t = β^[t]T x̃^(t).

[Plot: lrdemo — linear decision boundary learned on 2-D data]

Maximum a Posteriori (MAP) and Bayesian Learning

MAP: Assume we define a prior on β, for example a Gaussian prior with variance λ:

P(β) = (2πλ)^(−(D+1)/2) exp{ −(1/(2λ)) β^T β }

Then instead of maximizing the likelihood we can maximize the posterior probability of β:

ln P(β | D) = ln P(D | β) + ln P(β) + const = L(β) − (1/(2λ)) β^T β + const

The second term penalizes the length of β. Why would we want to do this?

Bayesian Learning: We can also do Bayesian learning of β by trying to compute or approximate P(β | D) rather than optimize it.
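A small sketch of the penalized objective (my own names), showing why the penalty matters: on separable data the log-likelihood alone always prefers a longer β, whereas the log-posterior trades fit against ‖β‖ and so has a finite maximizer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(beta, X, y, lam):
    """ln P(beta | D) up to a constant: L(beta) - beta^T beta / (2 * lam)."""
    z = X @ beta
    L = np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(sigmoid(-z)))
    return L - beta @ beta / (2 * lam)

# One point per class, perfectly separable; likelihood alone favors huge beta,
# but the Gaussian prior makes a moderate beta win:
X = np.array([[1.0], [-1.0]])
y = np.array([1, 0])
print(log_posterior(np.array([1.0]), X, y, lam=1.0))  # better...
print(log_posterior(np.array([5.0]), X, y, lam=1.0))  # ...than this
```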

Nonlinear Logistic Classification

Easy: just map the inputs into a higher dimensional space by computing some nonlinear functions of the inputs. For example:

(x_1, x_2) → (x_1, x_2, x_1^2, x_2^2, x_1 x_2)

Then do logistic classification using these new inputs.

[Plot: nlrdemo — nonlinear decision boundary on 2-D data]
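The quadratic feature map above is one line of code; a sketch (function name my own):

```python
import numpy as np

def quadratic_features(X):
    """Map each row (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

print(quadratic_features(np.array([[2.0, 3.0]])))  # [[2. 3. 4. 9. 6.]]
```

Feeding these five features (plus the constant 1) into the same logistic learning rule gives decision boundaries that are quadratic in the original (x_1, x_2) space, even though the classifier is still linear in the feature space.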

Multinomial Classification

How do we do multiple classes? y ∈ {1, ..., C}

Answer: Use a softmax function, which generalizes the logistic function. Each class c has a vector β_c:

P(y = c | x, β) = exp{β_c^T x} / ∑_{c'=1}^{C} exp{β_{c'}^T x}
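The softmax formula can be sketched directly (my own names; the max-subtraction is a standard numerical-stability trick, not part of the formula, and does not change the result because it cancels in the ratio):

```python
import numpy as np

def softmax_probs(x, B):
    """P(y = c | x, beta) = exp(beta_c^T x) / sum_c' exp(beta_c'^T x).
    B is a (C, D) matrix whose rows are the class vectors beta_c."""
    logits = B @ x
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Three classes in two dimensions:
x = np.array([1.0, 2.0])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
p = softmax_probs(x, B)
print(p.sum())  # probabilities sum to 1
```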

Some exercises:
1. Show how using the softmax function is equivalent to doing logistic classification when C = 2.
2. Implement the online logistic classification rule in Matlab/Octave.
3. Implement the nonlinear extension.
4. Investigate what happens when the classes are well separated, and what role the prior has in MAP learning.