Multi-Class and Structured Classification



Multi-Class and Structured Classification [slides taken from the course cs294-10, UC Berkeley (2006 / 2009)] http://www.cs.berkeley.edu/~jordan/courses/294-fall09

Basic Classification in ML: input → output. Examples: spam filtering (binary output), character recognition (multi-class output). [thanks to Ben Taskar for slide!]

Structured Classification: input → output. Examples: handwriting recognition (structured output: the word "brace"), 3D object recognition (structured output: labels such as building, tree). [thanks to Ben Taskar for slide!]

Multi-Class Classification
Multi-class classification: direct approaches
- Nearest Neighbor
- Generative approach & Naïve Bayes
- Linear classification: geometry
  - Perceptron
  - K-class (polychotomous) logistic regression
  - K-class SVM
Multi-class classification through binary classification
- One-vs-All
- All-vs-All
- Others
- Calibration

Multi-label classification. Example questions: Is it eatable? Is it sweet? Is it a fruit? Is it yellow? Is it round? Is it a banana? Is it an apple? Is it an orange? Is it a pineapple? Different structures: Nested/Hierarchical, Exclusive/Multi-class, General/Structured.

Nearest Neighbor, Decision Trees. From the classification lecture: NN and k-NN were already phrased in a multi-class framework. For decision trees, we want purity of the leaves, measured by the proportion of each class (one class should be clearly dominant).

Generative models. As in the binary case: 1. Learn p(y) and p(x|y). 2. Use Bayes' rule: p(y|x) ∝ p(y) p(x|y). 3. Classify as argmax_y p(y|x).

Generative models. Advantages: Fast to train: only the data from class k is needed to learn the k-th model (a reduction by a factor of k compared with other methods). Works well with little data, provided the model is reasonable. Drawbacks: Depends on the quality of the model. Doesn't model p(y|x) directly. With a lot of datapoints, doesn't perform as well as discriminative methods.

Naïve Bayes. Assumption: given the class Y, the features X_1, ..., X_5 are independent (graphical model: class node Y with one child node per feature). Bag-of-words models. If the features are discrete: p(x|y) = ∏_i p(x_i|y), with each p(x_i|y) estimated from counts.
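
As a concrete illustration (not from the original slides), here is a minimal multinomial Naïve Bayes sketch for discrete bag-of-words counts; the class name, smoothing constant and toy data are illustrative assumptions.

```python
import numpy as np

class MultinomialNaiveBayes:
    """Generative multi-class classifier: learn p(y) and p(x|y),
    then classify with Bayes' rule  argmax_y p(y) * prod_i p(x_i|y)."""

    def fit(self, X, y, alpha=1.0):
        # X: (n_samples, n_features) word-count matrix, y: class labels 0..K-1
        self.classes_ = np.unique(y)
        n_features = X.shape[1]
        self.log_prior_ = np.empty(len(self.classes_))
        self.log_lik_ = np.empty((len(self.classes_), n_features))
        for idx, k in enumerate(self.classes_):
            Xk = X[y == k]                                        # only class-k data is needed
            self.log_prior_[idx] = np.log(len(Xk) / len(X))       # p(y = k)
            counts = Xk.sum(axis=0) + alpha                       # Laplace smoothing
            self.log_lik_[idx] = np.log(counts / counts.sum())    # p(word | y = k)
        return self

    def predict(self, X):
        # log p(y|x) up to a constant: log p(y) + sum_i x_i * log p(word_i | y)
        scores = X @ self.log_lik_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]

# tiny usage example with made-up word counts
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])
y = np.array([0, 0, 1, 1])
print(MultinomialNaiveBayes().fit(X, y).predict(X))   # -> [0 0 1 1]
```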

Linear classification. Each class has a parameter vector (w_k, b_k). x is assigned to class k iff w_k·x + b_k ≥ w_{k'}·x + b_{k'} for all k'. Note that we can break the symmetry and choose (w_1, b_1) = 0. For simplicity set b_k = 0 (add a dimension and include it in w_k). So the learning goal, given separable data: choose w_k s.t. w_{y_i}·x_i > w_k·x_i for all k ≠ y_i, for every datapoint (x_i, y_i).
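
A minimal sketch (not from the slides) of this decision rule, with the bias folded into w_k via an appended constant feature; the weight values are made up.

```python
import numpy as np

def predict_class(W, x):
    """W: (K, d) matrix whose k-th row is w_k (bias already folded in).
    Assign x to the class with the largest score w_k . x."""
    return int(np.argmax(W @ x))

x = np.array([0.5, -1.0, 1.0])           # last entry = 1 plays the role of the bias
W = np.array([[0.0, 0.0, 0.0],           # w_1 fixed to 0 to break the symmetry
              [1.0, 0.2, -0.3],
              [-0.5, 1.5, 0.1]])
print(predict_class(W, x))
```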

Three discriminative algorithms

Geometry of linear classification: Perceptron, K-class logistic regression, K-class SVM.

Multiclass Perceptron. Online: for each datapoint, Predict: ŷ = argmax_k w_k·x; Update (on a mistake): w_y ← w_y + x, w_ŷ ← w_ŷ − x. Advantages: Extremely simple updates (no gradient to calculate). No need to have all the data in memory (some points stay correctly classified after a while). Drawbacks: If the data is not separable, the updates do not converge (in practice, decrease the learning rate slowly).
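
A minimal sketch of the multiclass perceptron update described above; the function name, epoch count and toy data are illustrative assumptions.

```python
import numpy as np

def multiclass_perceptron(X, y, n_classes, n_epochs=10):
    """Online multiclass perceptron: predict argmax_k w_k.x, and on a mistake
    add x to the true class weights and subtract it from the predicted ones."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = int(np.argmax(W @ xi))   # Predict
            if y_hat != yi:                  # Update only on mistakes
                W[yi] += xi
                W[y_hat] -= xi
    return W

# toy separable data: 3 classes in 2D (plus a constant feature)
X = np.array([[1, 0, 1], [2, 0, 1], [0, 1, 1], [0, 2, 1], [-1, -1, 1], [-2, -1, 1]], float)
y = np.array([0, 0, 1, 1, 2, 2])
W = multiclass_perceptron(X, y, n_classes=3)
print(np.argmax(X @ W.T, axis=1))            # should recover [0 0 1 1 2 2]
```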

Polychotomous logistic regression: distribution in exponential form, p(y = k | x) = exp(w_k·x) / Σ_j exp(w_j·x). Online: stochastic gradient for each datapoint. Batch: any descent method. Especially in large dimension, use regularization. Advantages: smooth function; get probability estimates (a hard label (0,0,1) becomes soft probabilities such as (.1,.1,.8), i.e. a small flip-label probability). Drawbacks: non-sparse.
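
A minimal batch gradient-descent sketch of K-class (softmax) logistic regression with L2 regularization; the learning rate, regularization strength and toy data are made-up assumptions.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, n_classes, lr=0.1, reg=1e-2, n_steps=500):
    """Minimize the regularized negative log-likelihood of
    p(y=k|x) = exp(w_k.x) / sum_j exp(w_j.x) by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(n_steps):
        P = softmax(X @ W.T)                      # (n, K) predicted probabilities
        grad = (P - Y).T @ X / n + reg * W        # gradient of NLL + L2 penalty
        W -= lr * grad
    return W

X = np.array([[1, 0, 1], [2, 0, 1], [0, 1, 1], [0, 2, 1], [-1, -1, 1], [-2, -1, 1]], float)
y = np.array([0, 0, 1, 1, 2, 2])
W = fit_softmax_regression(X, y, n_classes=3)
print(np.round(softmax(X @ W.T), 2))              # soft (non-sparse) probability estimates
```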

Multi-class SVM. Intuitive formulation (without regularization, for the separable case). Primal problem: a QP. Solved in the dual formulation, also a quadratic program. Main advantages: sparsity (but not systematic); speed with SMO (heuristic use of sparsity); sparse solutions. Drawbacks: need to recalculate or store x_i^T x_j; outputs are not probabilities.
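
The slide discusses the dual QP solved with SMO; as an illustration only, here is a simpler primal subgradient-descent sketch of a Crammer-Singer-style multi-class hinge loss. The step size, regularization and data are made-up assumptions, not the lecture's formulation.

```python
import numpy as np

def multiclass_svm_subgradient(X, y, n_classes, lam=0.01, lr=0.05, n_steps=1000):
    """Minimize  lam/2 ||W||^2 + (1/n) sum_i max_k [ 1{k != y_i} + w_k.x_i - w_{y_i}.x_i ]
    (Crammer-Singer multi-class hinge loss) by subgradient descent."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    for _ in range(n_steps):
        G = lam * W.copy()
        for xi, yi in zip(X, y):
            margins = W @ xi - W[yi] @ xi + 1.0
            margins[yi] = 0.0                     # no margin penalty for the true class
            k = int(np.argmax(margins))
            if margins[k] > 0:                    # hinge is active: subgradient update
                G[k] += xi / n
                G[yi] -= xi / n
        W -= lr * G
    return W

X = np.array([[1, 0, 1], [2, 0, 1], [0, 1, 1], [0, 2, 1], [-1, -1, 1], [-2, -1, 1]], float)
y = np.array([0, 0, 1, 1, 2, 2])
W = multiclass_svm_subgradient(X, y, n_classes=3)
print(np.argmax(X @ W.T, axis=1))                 # decision scores, not probabilities
```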

Real-world classification problems: digit recognition (10 classes), automated protein classification (300-600 classes), object recognition (100 classes, http://www.glue.umd.edu/~zhelin/recog.html), phoneme recognition (50 classes) [Waibel, Hanzawa, Hinton, Shikano, Lang 1989]. The number of classes is sometimes big, and the multi-class algorithm can be heavy.

Combining binary classifiers. One-vs-all: for each class, build a classifier for that class vs the rest. Often very imbalanced classifiers (use asymmetric regularization). All-vs-all: for each pair of classes, build a classifier for that pair. A priori a large number of classifiers to build, but the pairwise classifiers are much faster to train and the classifications are balanced (easier to find the best regularization), so in many cases it is clearly faster than one-vs-all.
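
A minimal sketch of both reductions on top of an arbitrary binary learner; `train_binary` is a hypothetical stand-in (here a toy class-mean scorer) for whatever binary classifier you actually use, and the data are made up.

```python
import numpy as np
from itertools import combinations

def train_binary(X, t):
    """Toy binary scorer (class-mean difference); replace with a real classifier."""
    w = X[t == 1].mean(axis=0) - X[t == 0].mean(axis=0)
    return lambda x: 1.0 / (1.0 + np.exp(-(x @ w)))

def one_vs_all(X, y, classes, train_binary):
    """Train one 'class k vs rest' scorer per class; predict by highest score."""
    scorers = {k: train_binary(X, (y == k).astype(int)) for k in classes}
    return lambda x: max(classes, key=lambda k: scorers[k](x))

def all_vs_all(X, y, classes, train_binary):
    """Train one classifier per pair of classes; predict by majority vote."""
    voters = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)                   # pairwise problems are much smaller
        voters[(a, b)] = train_binary(X[mask], (y[mask] == a).astype(int))
    def predict(x):
        votes = {k: 0 for k in classes}
        for (a, b), score in voters.items():
            votes[a if score(x) > 0.5 else b] += 1
        return max(votes, key=votes.get)
    return predict

X = np.array([[1, 0], [2, 0], [0, 1], [0, 2], [-1, -1], [-2, -1]], float)
y = np.array([0, 0, 1, 1, 2, 2])
clf = all_vs_all(X, y, classes=[0, 1, 2], train_binary=train_binary)
print([clf(x) for x in X])                           # -> [0, 0, 1, 1, 2, 2]
```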

Confusion Matrix (actual classes as rows, predicted classes as columns). Visualize which classes are more difficult to learn. Can also be used to compare two different classifiers. Cluster classes and go hierarchical [Godbole, 02]. Examples: classification of 20 newsgroups [Godbole, 02]; BLAST classification of proteins in 850 superfamilies.
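
A short illustrative sketch of computing a confusion matrix from actual vs. predicted labels (toy data).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """C[i, j] = number of examples whose actual class is i and predicted class is j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
print(confusion_matrix(y_true, y_pred, n_classes=3))
# large off-diagonal entries flag the classes that are hard to separate
```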

Calibration. How to measure the confidence in a class prediction? Crucial for: 1. comparison between different classifiers; 2. ranking the predictions for a ROC/Precision-Recall curve; 3. in several application domains, having a measure of confidence for each individual answer is very important (e.g. tumor detection). Some methods have an implicit notion of confidence, e.g. for an SVM the distance to the class boundary relative to the size of the margin; others, like logistic regression, have an explicit one.

Calibration. Definition: the decision function f of a classifier is said to be calibrated (or well-calibrated) if, informally, f(x) is a good estimate of the probability of classifying correctly a new datapoint x which would have output value f(x). Intuitively, if the raw output of a classifier is g, you can calibrate it by estimating the probability of x being well classified given that g(x) = y, for all possible values y.

Calibration. Example: a logistic regression, or more generally calculating a Bayes posterior, should yield a reasonably well-calibrated decision function.
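
A minimal histogram-binning sketch of the idea above: on held-out data, estimate the probability of being correct given the raw classifier output g(x), then map new raw scores through that estimate. The bin count and the simulated scores are made-up assumptions.

```python
import numpy as np

def fit_binned_calibrator(scores, correct, n_bins=10):
    """Estimate P(correct | g(x) in bin) on held-out data, then map new raw
    scores to that empirical probability (simple histogram binning)."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, scores, side='right') - 1, 0, n_bins - 1)
    prob = np.array([correct[bins == b].mean() if np.any(bins == b) else 0.5
                     for b in range(n_bins)])
    def calibrate(new_scores):
        b = np.clip(np.searchsorted(edges, new_scores, side='right') - 1, 0, n_bins - 1)
        return prob[b]
    return calibrate

# simulated held-out raw scores and whether each prediction was correct
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
correct = (rng.random(1000) < 1 / (1 + np.exp(-2 * scores))).astype(float)
cal = fit_binned_calibrator(scores, correct)
print(cal(np.array([-2.0, 0.0, 2.0])))   # roughly increasing confidence estimates
```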

Combining OVA calibrated classifiers. Train one calibrated one-vs-all classifier per class (Class 1, ..., Class 4), obtaining calibrated scores p_1, p_2, p_3, p_4 (plus p_other). Renormalize to get a consistent distribution (p_1, p_2, ..., p_4, p_other).
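
A tiny sketch of the renormalization step (the calibrated scores below are made up; the slide also carries a p_other term, omitted here for brevity).

```python
import numpy as np

p = np.array([0.70, 0.20, 0.15, 0.05])     # calibrated one-vs-all outputs p_1..p_4
posterior = p / p.sum()                    # consistent multi-class distribution
print(posterior, posterior.sum())          # sums to 1
```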

Other methods for calibration. Simple calibration: logistic regression; intraclass density estimation + Naïve Bayes; isotonic regression. More sophisticated calibration: calibration for A-vs-A by Hastie and Tibshirani.

Structured classification

Local Classification: b r a r e. Classify using local information. Ignores correlations! [thanks to Ben Taskar for slide!]

Local Classification: building, tree, shrub, ground. [thanks to Ben Taskar for slide!]

Structured Classification: b r a c e. Use local information. Exploit correlations. [thanks to Ben Taskar for slide!]

Structured Classification: building, tree, shrub, ground. [thanks to Ben Taskar for slide!]

Structured Classification
Structured classification: direct approaches
- Generative approach: Markov Random Fields (Bayesian modeling with graphical models)
- Linear classification:
  - Structured Perceptron
  - Conditional Random Fields (counterpart of logistic regression)
  - Large-margin structured classification

Structured classification, simple example: an HMM. Label sequence: b r a c e; observation sequence: the character images. Optical Character Recognition.

Tree model 1: tree-structured labels with associated observations.

Tree model 1 Eye color inheritance: haplotype inference

Tree Model 2: Hierarchical Text Classification. X: webpage (e.g. "Cannes Film Festival schedule..."). Y: label in a tree of categories (from ODP): Root → Arts, Business, Computers; Arts → Movies, Television; Business → Jobs, Real Estate; Computers → Internet, Software.

Grid model: image segmentation. Segmented = labeled image.

Structured Model. Main idea: define a scoring function which decomposes as a sum of feature scores k over parts p: score(x, y) = Σ_p Σ_k w_k f_k(x, y_p). Label examples by looking for the max score: ŷ = argmax_{y ∈ Y(x)} score(x, y), where Y(x) is the space of feasible outputs. Parts = nodes, edges, etc.

Cliques and Features (example: b r a c e). In undirected graphs: cliques = groups of completely interconnected variables. In directed graphs: cliques = a variable plus its parents.

Exponential form. Once the graph is defined, the model can be written in exponential form: p_w(y|x) = exp(w·f(x, y)) / Z_w(x), with parameter vector w and feature vector f(x, y). Comparing two labellings with the likelihood ratio: p_w(y|x) / p_w(y'|x) = exp(w·f(x, y) − w·f(x, y')), in which Z_w(x) cancels.
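
An illustrative sketch of this: unnormalized scores exp(w·f(x, y)) and a likelihood ratio between two labellings, for which the partition function cancels. The feature function, weights and input are made-up toys.

```python
import numpy as np

def joint_features(x, y):
    """Toy feature vector f(x, y) for a 2-class, 2-feature problem:
    the input features are placed in the block of the chosen label."""
    f = np.zeros(4)
    f[2 * y: 2 * y + 2] = x
    return f

w = np.array([1.0, -0.5, -1.0, 0.5])        # parameter vector
x = np.array([2.0, 1.0])

score = lambda y: w @ joint_features(x, y)  # log of the unnormalized probability
# Comparing two labellings: p(y|x) / p(y'|x) = exp(w.f(x,y) - w.f(x,y')); Z cancels
print(np.exp(score(0) - score(1)))
```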

Decoding and Learning. Three important operations on a general structured (e.g. graphical) model:
- Decoding: find the right label sequence
- Inference: compute probabilities of labels
- Learning: find the model + parameters w so that decoding works
HMM example (b r a c e):
- Decoding: Viterbi algorithm
- Inference: forward-backward algorithm
- Learning: e.g. transition and emission counts (case of learning a generative model from fully labeled training data)
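
To make the HMM decoding step concrete, here is a minimal Viterbi decoder in log-space; the transition/emission matrices and observation sequence are made-up toys, not the lecture's.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely label sequence for an HMM.
    log_pi[k]: initial log-prob, log_A[j, k]: transition j->k, log_B[k, o]: emission."""
    K, T = log_A.shape[0], len(obs)
    delta = np.empty((T, K))            # best score of a path ending in state k at time t
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_A          # (previous state, next state)
        back[t] = np.argmax(cand, axis=0)
        delta[t] = cand[back[t], np.arange(K)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                     # follow the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy 2-state HMM with 3 observation symbols
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(log_pi, log_A, log_B, obs=[0, 1, 2, 2]))
```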

Decoding and Learning.
- Decoding: algorithm on the graph (e.g. max-product)
- Inference: algorithm on the graph (e.g. sum-product, belief propagation, junction tree, sampling)
Both use dynamic programming to take advantage of the structure.
- Learning: inference + optimization
1. Focus of the graphical models class.
2. Need two essential concepts: 1. cliques: variables that directly depend on one another; 2. features (of the cliques): some functions of the cliques.

Our favorite (discriminative) algorithms: the devil is in the details...

(Averaged) Perceptron. For each datapoint: Predict: ŷ = argmax_y w·f(x, y). Update (on a mistake): w ← w + f(x, y) − f(x, ŷ). Averaged perceptron: at test time, use the average of the weight vectors seen over all steps.
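
A minimal sketch of the averaged perceptron over a joint feature map f(x, y); the feature map (the multiclass block encoding of the next slide), epoch count and toy data are illustrative assumptions.

```python
import numpy as np

def averaged_perceptron(data, feat, labels, dim, n_epochs=10):
    """Perceptron over a joint feature map feat(x, y); returns the average of the
    weight vector over all steps, which is what is used at test time."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(n_epochs):
        for x, y in data:
            y_hat = max(labels, key=lambda k: w @ feat(x, k))   # Predict
            if y_hat != y:
                w += feat(x, y) - feat(x, y_hat)                # Update
            w_sum += w                                          # running sum for the average
    return w_sum / (n_epochs * len(data))

# multiclass block encoding: f(x, y) puts x in the y-th block (see next slide)
def feat(x, y, n_classes=3):
    f = np.zeros(n_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

data = [(np.array([1., 0., 1.]), 0), (np.array([0., 1., 1.]), 1), (np.array([-1., -1., 1.]), 2)]
w_avg = averaged_perceptron(data, feat, labels=[0, 1, 2], dim=9)
print([max([0, 1, 2], key=lambda k: w_avg @ feat(x, k)) for x, _ in data])   # -> [0, 1, 2]
```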

Example: multiclass setting. Feature encoding: f(x, y) places x in the y-th block of a long vector (zeros elsewhere). Predict: ŷ = argmax_y w·f(x, y), which equals argmax_y w_y·x. Update (on a mistake): w ← w + f(x, y) − f(x, ŷ), i.e. w_y ← w_y + x and w_ŷ ← w_ŷ − x.

CRF: conditioned on all the observations; Z is difficult to compute with complicated graphs. Introduction by Hannah M. Wallach: http://www.inference.phy.cam.ac.uk/hmw26/crf/ ; MEMM & CRF by Mayssam Sayyadian and Rob McCann: anhai.cs.uiuc.edu/courses/498ad-fall04/local/my-slides/crf-students.pdf. M3 net: no Z; the margin penalty can factorize according to the problem structure. Introduction by Simon Lacoste-Julien: http://www.cs.berkeley.edu/~slacoste/school/cs281a/project_report.html

Object Segmentation Results [thanks to Ben Taskar for slide!]. Data: [Stanford Quad by Segbot] (Laser Range Finder, Segbot: M. Montemerlo, S. Thrun). Labels: building, tree, shrub, ground. Trained on a 30,000-point scene, tested on 3,000,000-point scenes, evaluated on a 180,000,000-point scene. [Taskar+al 04, Anguelov+Taskar+al 05]
Model | Error
Local learning, local prediction | 32%
Local learning + smoothing | 27%
Structured method | 7%