Class #6: Non-linear classification. ML4Bio 2012, February 17th, 2012. Quaid Morris




Overview
Review
Linear separability
Non-linear classification
Linear Support Vector Machines (SVMs)
Picking the best linear decision boundary
Non-linear SVMs
Decision trees
Ensemble methods: Bagging and Boosting

Objective functions. E(X, Θ) is a function of the data (X) and the parameters (Θ). It measures the fit between a model and the data. For example, the K-means objective function is the sum of the squared Euclidean distances from each data point to its closest mean (or centroid); the parameters are the means. When fitting a statistical model, we use E(X, Θ) = log P(X | Θ) = Σ_i log P(x_i | Θ).
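
As a concrete illustration (not from the slides), here is a minimal sketch of the K-means objective described above, written in Python with made-up toy data:

```python
import numpy as np

def kmeans_objective(X, centroids):
    """Sum of squared Euclidean distances from each point to its closest centroid."""
    # distances[i, k] = squared distance from data point i to centroid k
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.min(axis=1).sum()

# Toy example: two well-separated clusters in 2-D
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.05, 0.1], [5.1, 4.95]])
print(kmeans_objective(X, centroids))  # small value: the centroids fit the data well
```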

Examples of objective functions. Clustering: mixture of Gaussians: log likelihood of the data; affinity propagation: sum of similarities with exemplars. Regression: linear regression: sum of squared errors (aka residuals), i.e. the log likelihood under Gaussian noise. Classification: logistic regression: sum of the log probabilities of the classes under the logistic function; Fisher's criterion: ratio of within-class to between-class variance (σ²).

Overfitting. [Figure: error (i.e. model fit) versus number of parameters or complexity, showing training set error decreasing while generalization (test set) error eventually rises.] More complex models better fit the training data, but past some point, more complex models generalize worse to new data.

Dealing with overfitting by regularization. Basic idea: add a term to the objective function that penalizes the number of parameters or the model complexity. Now: E(X, Θ) = datafit(X, Θ) - λ complexity(Θ). The hyperparameter λ controls the strength of regularization; it may have a natural value, or it may need to be set by cross-validation. Increases in model complexity must be balanced by improved model fit (each unit of added complexity costs λ units of data fit).

Examples of regularization (N is the number of parameters, M is the number of datapoints). Clustering: Bayes IC (BIC): LL - 0.5 N log M; Akaike IC (AIC): LL - N. Regression: L1 (aka LASSO): LL - λ Σ_i |b_i|; L2 (aka ridge): LL - λ Σ_i b_i^2; elastic net: LL - λ_1 Σ_i |b_i| - λ_2 Σ_i b_i^2. Classification: logistic regression: same as linear regression.
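
A minimal sketch (not from the slides) of the penalized regression objectives above; the log-likelihood value and the coefficients b are made up for illustration:

```python
import numpy as np

def regularized_objective(log_likelihood, b, lam1=0.0, lam2=0.0):
    """Penalized objective: LL - lam1 * sum(|b_i|) - lam2 * sum(b_i^2).

    lam1 > 0, lam2 = 0  -> L1 (LASSO); lam1 = 0, lam2 > 0 -> L2 (ridge);
    both > 0            -> elastic net.
    """
    return log_likelihood - lam1 * np.abs(b).sum() - lam2 * (b ** 2).sum()

b = np.array([0.0, 2.5, -1.0])
print(regularized_objective(-10.0, b, lam1=0.1))            # LASSO
print(regularized_objective(-10.0, b, lam2=0.1))            # ridge
print(regularized_objective(-10.0, b, lam1=0.1, lam2=0.1))  # elastic net
```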

Linear classification. [Figure: 2-D data with a linear decision boundary defined by the weight vector b; axes x_1 and x_2.] Decision boundary: b_1 x_1 + b_2 x_2 = x^T b = t. The discriminant function (or value) is f(x) = x^T b. We can define a decision boundary with a vector of feature weights, b, and a threshold, t: if x^T b > t, predict +, otherwise predict -. We can trade off false positives versus false negatives by changing t.
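
A minimal sketch (not from the slides) of this decision rule, with an arbitrary weight vector and threshold chosen for illustration:

```python
import numpy as np

def linear_predict(X, b, t):
    """Predict +1 if x^T b > t, else -1."""
    scores = X @ b          # discriminant values f(x) = x^T b
    return np.where(scores > t, 1, -1)

X = np.array([[2.0, 1.0], [0.5, 0.2], [-1.0, -2.0]])
b = np.array([1.0, 1.0])
print(linear_predict(X, b, t=0.0))   # [ 1  1 -1]
print(linear_predict(X, b, t=1.0))   # raising t flips borderline positives to negatives
```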

Linear separability. [Figure: two panels, "Linearly separable" and "Not linearly separable"; axes x_1 and x_2.] Two classes are linearly separable if you can perfectly classify them with a linear decision boundary. Why is this important? Linear classifiers can only draw linear decision boundaries.

Non-linear classification. [Figure: data with a non-linear decision boundary; axes x_1 and x_2.] Non-linear classifiers have non-linear, and possibly discontinuous, decision boundaries.

Non-linear decision boundaries. [Figure: a circular boundary x_3 = 1 centered at m; axes x_1 and x_2.] Set x_3 = (x_1 - m_1)^2 + (x_2 - m_2)^2. Decision rule: predict + if x_3 < 1.

Non-linear classification: Idea #1. Fit a linear classifier to non-linear functions of the input features. E.g., transform the features (quadratic): {1, x_1, x_2} → {1, x_1, x_2, x_1 x_2, x_1^2, x_2^2}. Now we can fit arbitrary quadratic decision boundaries by fitting a linear classifier to the transformed features. This is feasible for quadratic transforms, but what about other transforms (e.g. powers of 10)?
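
As an illustrative sketch of this idea (not the lecture's code, and assuming scikit-learn is available): expand two input features to their quadratic terms, then fit an ordinary linear classifier in the expanded space.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Toy data: class 1 inside a circle of radius 1, class 0 outside (not linearly separable)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Quadratic transform: {1, x1, x2} -> {1, x1, x2, x1^2, x1*x2, x2^2}
quad = PolynomialFeatures(degree=2)
X_quad = quad.fit_transform(X)

# A linear classifier in the transformed space draws a quadratic boundary in the original space
clf = LogisticRegression(max_iter=1000).fit(X_quad, y)
print(clf.score(X_quad, y))  # high accuracy: the circular boundary is linear in the expanded space
```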

Support Vector Machines. [Photo: Vladimir Vapnik.]

Picking the best decision boundary. [Figure: several candidate linear decision boundaries; axes x_1 and x_2.] What is the best decision boundary to choose when more than one separates the classes? That is, which one is most likely to generalize well to new datapoints?

LDA & Fisher solution (generative). [Figure: axes x_1 and x_2.] Best decision boundary (in 2-D): parallel to the direction of greatest variance. This is the Bayes-optimal classifier (if the mean and covariance are correct).

Unregularized logistic regression (discriminative). [Figure: axes x_1 and x_2.] All solutions are equally good because each one perfectly discriminates the classes. (Note: this changes if there is regularization.)

Linear SVM solution. [Figure: the maximum margin boundary, with the margin shown; axes x_1 and x_2.] The SVM chooses the maximum margin boundary.

SVM decision boundaries. [Figure: the maximum margin boundary and its margin, with the support vectors highlighted; axes x_1 and x_2.] Here, the maximum margin boundary is specified by three points, called the support vectors.

SVM objective function (where the targets are y_i = -1 or y_i = +1, x_i are the input features, and b are the feature weights). Primal objective function: E(b) = -C Σ_i ξ_i - b^T b / 2, where y_i (b^T x_i) ≥ 1 - ξ_i for all i (the first term is the data-fit term, the second is the L2 regularizer). ξ_i ≥ 0 is the degree of misclassification under the hinge loss for datapoint i, and C > 0 is the hyperparameter balancing regularization and data fit.
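
A minimal sketch (not from the slides) that evaluates this primal objective for a given weight vector, computing the hinge-loss slacks ξ_i explicitly; the data and weights are made up:

```python
import numpy as np

def svm_primal_objective(b, X, y, C):
    """E(b) = -C * sum_i xi_i - b^T b / 2, with xi_i = max(0, 1 - y_i * (x_i^T b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ b))   # hinge-loss slacks
    return -C * xi.sum() - 0.5 * b @ b

X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(svm_primal_objective(np.array([1.0, 1.0]), X, y, C=1.0))  # separable here: all slacks are 0
print(svm_primal_objective(np.array([0.1, 0.1]), X, y, C=1.0))  # small weights: margin violations
```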

Visualization of ξ_i. [Figure: a misclassified point lying a distance ξ_i beyond the margin; axes x_1 and x_2.] If the data are linearly separable, all values of ξ_i = 0.

Another SVM objective function. Dual: E(α) = Σ_i α_i - Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j / 2, under the constraints that Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C. It depends only on dot products! We can then define the weights b = Σ_i α_i y_i x_i and the discriminant function b^T x = Σ_i α_i y_i x_i^T x. The datapoints i such that α_i > 0 are the support vectors.
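
A minimal sketch (not from the slides) of the dual objective and the dot-product-only discriminant function; the α values below are purely illustrative and merely satisfy the constraint Σ_i α_i y_i = 0, they are not an optimized solution:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """E(alpha) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def dual_discriminant(alpha, X, y, x_new):
    """b^T x = sum_i alpha_i y_i x_i^T x, using only dot products with the training data."""
    return (alpha * y) @ (X @ x_new)

X = np.array([[2.0, 2.0], [1.0, 1.5], [-1.5, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.0, 0.3, 0.3, 0.0])   # illustrative values satisfying sum_i alpha_i y_i = 0
print(dual_objective(alpha, X, y))
print(dual_discriminant(alpha, X, y, np.array([1.0, 1.0])))  # positive => predict +
```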

Dot products of transformed vectors as kernel functions. Let x = (x_1, x_2) and φ_2(x) = (√2 x_1 x_2, x_1^2, x_2^2). Here φ_2(x) [or φ_p(x)] is a vector-valued function that returns all degree-two [or degree-p] products of x_1 and x_2. Then φ_2(x)^T φ_2(z) = (x^T z)^2 = K(x, z), a kernel function; in general, φ_p(x)^T φ_p(z) = (x^T z)^p. Every kernel function K(x, z) that satisfies some simple conditions corresponds to a dot product of some transformation φ(x) (aka projection function).
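
A quick numerical check (not from the slides) that the transformed-space dot product equals the kernel value, using arbitrary example vectors:

```python
import numpy as np

def phi2(x):
    """phi_2(x) = (sqrt(2) * x1 * x2, x1^2, x2^2)."""
    x1, x2 = x
    return np.array([np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi2(x) @ phi2(z))   # dot product in the transformed space
print((x @ z) ** 2)        # kernel K(x, z) = (x^T z)^2 -- same value, computed without phi
```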

Non-linear SVM objective function. Objective: E(α) = Σ_i α_i - Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) / 2, under the constraints that Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C. Discriminant function: f(x) = Σ_i α_i y_i K(x_i, x), where α_i y_i plays the role of a feature weight w_i.

Kernelization (FYI). Often, instead of explicitly writing out the non-linear feature set, one simply calculates a kernel function of the two input vectors, i.e., K(x_i, x_j) = φ(x_i)^T φ(x_j). Here is why: any kernel function K(x_i, x_j) that satisfies certain conditions, e.g., K(x_i, x_j) is a symmetric positive semi-definite function (which implies that the matrix K, where K_ij = K(x_i, x_j), is a symmetric positive semi-definite matrix), corresponds to a dot product K(x_i, x_j) = φ(x_i)^T φ(x_j) for some projection function φ(x). It is often easy to define a kernel function that captures some notion of similarity. Non-linear SVMs use the discriminant function f(x; w) = Σ_i w_i K(x_i, x).

Some popular kernel functions:
K(x_i, x_j) = (x_i^T x_j + 1)^P -- inhomogeneous polynomial kernel of degree P
K(x_i, x_j) = (x_i^T x_j)^P -- homogeneous polynomial kernel
K(x_i, x_j) = exp{-(x_i - x_j)^T (x_i - x_j) / s^2} -- Gaussian radial basis function kernel
K(x_i, x_j) = Pearson(x_i, x_j) + 1 -- a bioinformatics kernel
Also, you can make a kernel out of an interaction network!
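
A minimal sketch (not from the slides) of the polynomial and Gaussian RBF kernels listed above, evaluated on arbitrary example vectors:

```python
import numpy as np

def poly_kernel(xi, xj, P, inhomogeneous=True):
    """(x_i^T x_j + 1)^P if inhomogeneous, else (x_i^T x_j)^P."""
    return (xi @ xj + (1.0 if inhomogeneous else 0.0)) ** P

def rbf_kernel(xi, xj, s):
    """Gaussian RBF kernel: exp(-(x_i - x_j)^T (x_i - x_j) / s^2)."""
    d = xi - xj
    return np.exp(-(d @ d) / s ** 2)

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
print(poly_kernel(xi, xj, P=3))                        # inhomogeneous polynomial, degree 3
print(poly_kernel(xi, xj, P=3, inhomogeneous=False))   # homogeneous polynomial
print(rbf_kernel(xi, xj, s=1.0))                       # Gaussian radial basis function
```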

Example: radial basis function kernels. [Figure: a non-linear decision boundary with the support vectors highlighted; axes x_1 and x_2.] Φ(x) = {N(x; x_1, σ^2 I), N(x; x_2, σ^2 I), ..., N(x; x_N, σ^2 I)}, where N(x; m, Σ) is the probability density of x under a multivariate Gaussian with mean m and covariance Σ.

SVM summary. A sparse linear classifier that uses, as features, kernel functions that measure similarity to each data point in the training set.

Neural networks (aka multilayer perceptrons). [Figure: network diagram with inputs x_1, x_2 and hidden units h_1, ..., h_4.] f(x, θ) = Σ_k v_k h_k(x), where each hidden unit is h_k(x) = σ(Σ_i w_ik x_i). Often σ(x) = 1 / (1 + e^{-x}). v and W are the parameters. Notes: fitting the hidden units has become a lot easier using deep belief networks; one could also fit the means of RBFs for the hidden units.
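
A minimal sketch (not from the slides) of the forward pass defined above, with made-up weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W, v):
    """f(x) = sum_k v_k * h_k(x), with hidden units h_k(x) = sigma(sum_i W[i, k] * x_i)."""
    h = sigmoid(x @ W)   # hidden unit activations
    return h @ v         # weighted sum of hidden units

x = np.array([0.5, -1.0])                  # two input features
W = np.array([[1.0, -0.5, 0.3],            # weights from x_1 to h_1..h_3
              [0.2,  0.8, -1.0]])          # weights from x_2 to h_1..h_3
v = np.array([1.0, -1.0, 0.5])             # output weights
print(mlp_forward(x, W, v))                # discriminant value f(x, theta)
```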

K-nearest neighbour. [Figure: decision boundaries for K = 15 and K = 1; images from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (available online).] Classify x in the same way that a majority of the K nearest neighbours to x are labelled. You can also use the number of nearest neighbours that are positive as a discriminant value. Smaller values of K lead to more complex classifiers; you can set K using (nested) cross-validation.
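
A minimal sketch (not from the slides) of the majority-vote rule, with labels in {-1, +1} and a toy training set:

```python
import numpy as np

def knn_predict(X_train, y_train, x, K):
    """Majority vote among the K nearest training points; labels y in {-1, +1}."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    # The sum of labels is also a discriminant value (# positive minus # negative neighbours)
    return 1 if y_train[nearest].sum() > 0 else -1

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1]), K=3))   # -1
print(knn_predict(X_train, y_train, np.array([0.9, 1.0]), K=3))   # +1
```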

Decision trees. [Figure: recursive axis-aligned splits of the data; axes x_1 and x_2.] For each leaf: 1) choose a dimension to split along, and choose the best split using a split criterion (see below); 2) split along that dimension, creating two new leaves; return to (1) until the leaves are pure. To reduce overfitting, it is advisable to prune the tree by removing leaves with small numbers of examples. Split criteria: C4.5 (Quinlan 1993): expected decrease in (binary) entropy; CART (Breiman et al. 1984): expected decrease in Gini impurity (1 minus the sum of squared class probabilities).
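
A minimal sketch (not from the slides) of the two split criteria, computing the expected decrease in entropy and in Gini impurity for a candidate split of a toy label vector:

```python
import numpy as np

def entropy(p):
    """Entropy (in bits) of a class-probability vector p."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(p):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - (p ** 2).sum()

def class_probs(y):
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def split_gain(y, y_left, y_right, impurity):
    """Expected decrease in impurity when splitting y into y_left and y_right."""
    n, nl, nr = len(y), len(y_left), len(y_right)
    return (impurity(class_probs(y))
            - (nl / n) * impurity(class_probs(y_left))
            - (nr / n) * impurity(class_probs(y_right)))

y = np.array([0, 0, 0, 1, 1, 1])
print(split_gain(y, y[:3], y[3:], entropy))  # 1.0: a perfect split removes all entropy
print(split_gain(y, y[:3], y[3:], gini))     # 0.5: Gini gain for the same split
```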

Decision trees. [Figure: the first split, "x_2 > a?" (yes/no), divides the input space at x_2 = a; axes x_1 and x_2.]

Decision trees. [Figure: after a second split, "x_2 > b?" (yes/no), and so on, the space is partitioned further; thresholds a and b shown; axes x_1 and x_2.]


Decision tree summary. Decision trees learn recursive splits of the data along individual features that partition the input space into homogeneous groups of data points with the same labels.

Ensemble classification. Combining multiple classifiers by training them separately and then averaging their predictions is a good way to avoid overfitting. Bagging (Breiman 1996): resample the training set, train a separate classifier on each sample, and have the classifiers vote for the classification. Boosting (e.g. AdaBoost, Freund and Schapire 1997): iteratively reweight the training set based on the errors of a weighted average of classifiers: train a classifier ("weak learner") to minimize the weighted error on the training set; weight the new classifier according to its prediction error and reweight the training set according to the prediction errors; repeat. Boosting minimizes the exponential loss on the training set over a convex set of functions.

Bagging. [Diagram: the training set (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) is resampled into M bootstrap samples, e.g. (x_1, y_1), (x_1, y_1), (x_3, y_3), ..., (x_{N-1}, y_{N-1}); a separate classifier f_j(x) is trained on each sample.] The bagged prediction is bagged(x) = Σ_j f_j(x) / M.
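
A minimal sketch of bagging (not the lecture's code; assumes scikit-learn is available for the base decision trees), trained on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M, rng):
    """Train M trees, each on a bootstrap sample (drawn with replacement) of the training set."""
    models = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """bagged(x) = average of the individual predictions (majority vote for 0/1 labels)."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes > 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
models = bagging_fit(X, y, M=25, rng=rng)
print(bagging_predict(models, np.array([[1.0, 1.0], [-1.0, -1.0]])))  # expected: [1 0]
```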

Boosting. [Diagram: train a classifier f_1(x) on the training set, assess its performance, resample the hard cases and set the weight w_t, train the next classifier f_t(x), assess again, and repeat; the legend marks correctly and incorrectly classified points (x_i, y_i).] The boosted prediction is boost(x) = Σ_j w_j f_j(x).
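
A minimal sketch of AdaBoost with decision stumps (not the lecture's code; assumes scikit-learn for the weak learners, and the data are synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T):
    """AdaBoost with decision stumps as weak learners; labels y must be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                              # example weights
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()               # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # weight of this classifier
        w *= np.exp(-alpha * y * pred)                   # upweight misclassified examples
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """boost(x) = sign(sum_j w_j f_j(x))."""
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # diagonal boundary: hard for a single stump
models, alphas = adaboost_fit(X, y, T=50)
print((adaboost_predict(models, alphas, X) == y).mean())  # training accuracy of the ensemble
```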

Ensemble learning summary. Bagging: classifiers are trained separately (in parallel) on different bootstrapped samples of the training set and cast equally weighted votes. Boosting: classifiers are trained sequentially, weighted by performance, on samples of the training set that focus on hard examples; the final classification is based on weighted votes.

Random Forests (Breiman 2001). Construct M bootstrapped samples of the training set (of size N). For each sample, build a decision tree using CART (no pruning), but split optimally on randomly chosen features; the random feature choice reduces the correlation among trees, which is a good thing. Since the bootstrap resamples the training set with replacement, it leaves out, on average, (1 - 1/N)^N × 100% ≈ 100/e % ≈ 36.8% of the examples; these out-of-bag samples can be used to estimate performance. Bag the predictions (i.e. average them). The importance of a feature can be assessed by evaluating the performance of the trees containing that feature.
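
A quick numerical check (not from the slides) of the out-of-bag fraction quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
idx = rng.integers(0, N, size=N)            # one bootstrap sample (with replacement)
out_of_bag = np.setdiff1d(np.arange(N), idx)

print(len(out_of_bag) / N)                  # empirically about 0.37
print((1 - 1 / N) ** N, np.exp(-1))         # (1 - 1/N)^N approaches 1/e ~ 0.368
```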

Advantages of decision trees. It is relatively easy to combine continuous and discrete (ordinal or categorical) features within the same classifier; this is harder to do for other methods. Random Forests can be trained in parallel (unlike boosted decision trees), and relatively quickly. Random Forests tend to do quite well in empirical tests (but see the No Free Lunch theorem).

Networks as kernels. Matrix representations of graphs can be used to generate kernels. One popular graph-based kernel is the diffusion kernel: K = (λI + L)^{-1}, where L = D - W is the graph Laplacian, D is a diagonal matrix containing the row sums of W, and W is the matrix representation of the graph. GeneMANIA label propagation: f = (λI + L)^{-1} y.
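
A minimal sketch (not from the slides) of this graph kernel on a tiny chain graph; the "+" sign and the L = D - W convention follow the reconstruction above and should be checked against the original slides:

```python
import numpy as np

def diffusion_kernel(W, lam):
    """K = (lambda * I + L)^-1, with graph Laplacian L = D - W (sign convention assumed)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    return np.linalg.inv(lam * np.eye(len(W)) + L)

# Tiny 4-node chain graph: 0 - 1 - 2 - 3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
K = diffusion_kernel(W, lam=1.0)
y = np.array([1.0, 0.0, 0.0, -1.0])   # labels on nodes 0 and 3, zeros for unlabelled nodes
print(K @ y)                           # label-propagation scores f = (lambda*I + L)^-1 y
```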