# Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Save this PDF as:

Size: px
Start display at page:

Download "Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris"

## Transcription

1 Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1

2 Module #: Title of Module 2

3 Review Overview Linear separability Non-linear classification Linear Support Vector Machines (SVMs) Picking the best linear decision boundary Non-linear SVMs Decision Trees Ensemble methods: Bagging and Boosting 3

4 Objective functions E(X, Θ) -- function of the data (X) and the parameters (Θ) They measure data fit between a model and the data, e.g.: The K-means objective function is the sum of the squared Euclidean distance from each data point to the closest mean (or centroid ), parameters are the mean When fitting a statistical model, we use: E(X, Θ) = log P(X Θ) = Σ i log P(Xi Θ) 4

5 Examples of objective functions Clustering: Cuse Mix of Gaussians: log likelihood of data Affinity propagation: sum of similarities with exemplars Regression: Linear: Sum of squared errors (aka residuals), aka log likelihood under Gaussian noise Classification: Logistic regression: Sum of log probability of class under logistic function Fisher s: ratio of σ 2 within class vs σ 2 between class 5

6 Overfitting 1 mod del fit) Err ror (i.e. Generalization (test set) error Training set error # of parameters or complexity More complex models better fit the training data. But after some point, more complex models generalize worse to new data. 6

7 Dealing with overfitting by regularization Basic idea: add a term to the objective function that penalizes # of parameters or model complexity, Now: E(X, Θ) = datafit(x, Θ) λ complexity(θ) Hyper-parameter parameter λ controls the strength of regularization could have a natural value, or need to set this by cross-validation, Increases in model complexity need to be balanced by improved model fit. (each unit of added complexity costs λ units of data fit) 7

8 Examples of regularization Clustering: (N is # params, a M is # datapoints) apo Bayes IC: LL 0.5 N log M Akaike IC: LL N Regression: L1 (aka LASSO): LL λ Σ i b i L2 (aka Ridge): LL λ Σ i b i 2 Elastic net: LL λ 1 Σ i b i - λ 2 Σ i b i 2 Classification: Logistic regression: Same as linear regression 8

9 Decision boundary: b +b = T 1 x 1 2 x 2 x b = t x 1 f(x) = x T b discriminant function (or value) Linear classification b Can define decision boundary with a vector of feature weights, b, and a threshold, t, if x T b > t, predict + and otherwise predict -, can trade off false positives versus false negatives by changing t x 2

10 Linear separability Linearly separable Not linearly separable x 1 x 1 x 2 x 2 Two classes are linearly separable, if you can perfectly classify them with a linear decision boundary. Why is this important? Linear classifiers can only draw linear decision i boundaries.

11 Non-linear classification x 1 x 2 Non-linear classifiers have non-linear, and possibly discontinous decision boundaries 11

12 Non-linear decision boundaries Set x 3 = (x 1 m 1 ) 2 +(x 2 m 2 ) 2 x 3 = 1 x 1 m Decision rule: Predict if x 3 < 1 x 2 12

13 Non-linear classification: Idea #1 Fit a linear classifier to non-linear functions of the input features. E.g.: Transform features (quadratic): {1, x 1, x 2 } {1, x 1, x 2, x 1 x 2, x 12, x 22 } Now can fit arbitrary quadratic decision boundaries by fitting a linear classifier to transformed features. This is feasible for quadratic transforms, but what about other transforms ( power of 10 )? 13

14 Support Vector Machines Vladimir Vapnik 14

15 Picking the best decision boundary Decision boundaries x 1 What s the best decision boundary to choose when one? That is, what is the one most likely to generalize well to new datapoints? x 2

16 LDA & Fisher sol n (generative) Best decision boundary y( (in 2-d): parallel to the direction of greatest variance. x 1 This is the Bayes optimal classifier (if mean and covariance are correct). x 2

17 Unregularized logistic regression (discriminative) x 1 All solutions are equally good because each one perfectly discriminates the classes. (note: this changes if there s regularization) x 2 17

18 Margin Linear SVM solution Maximum margin boundary. x 1 x 2 Vladimir Vapnik 18

19 Margin SVM decision boundaries Here, the maximum margin boundary is specified by three points, called the support vectors. x 1 Support vectors x 2 Vladimir Vapnik 19

20 SVM objective function (Where targets: y i = -1 or y i = 1, x i input features, b are feature weights) Primal objective function: E(b) = -C Σ i ξ i b T b/2 where y i (b T x i ) > 1-ξ i for all i d a t a L 2 r ξ i >= 0 is degree eof mis-classification under g hinge loss f for datapoint i, C>0 is the i hyperparameter eter balancing regularization and data t fit. 20

21 Visualization of ξ i Misclassified point ξ i If linearly separable, all values of ξ i = 0 x 1 Vladimir Vapnik x 2 21

22 Another SVM objective function Dual : E(α) = Σ i α i - Σ i Σ j α i α j y i y j x it x j /2 Under the constraints that: Σ i α i y i = 0 and 0 α i C Depends only on dot products!!! Then can define weights b = Σ i α i y i x i and discriminant function: b T x = Σ i α i y i x it x Data i such that α i > 0 are support vectors 22

23 Dot products of transformed vector as kernel functions Let x = (x () 2 2 1, x 2 ) and 2 (x) = ( 2x 1 x 2, x 12, x 22 ) Here 2 (x) [or p (x)] is a vector valued function that returns all possible powers of two [or p] of x 1 and x 2 Then 2 (x) T 2 (z) = (x T z) 2 = K(x,z) kernel function And in general, p (x) T p (z) = (x T z) p Every kernel function K(x,z) that satisfies some simple conditions corresponds to a dot product of some transformation (x) (aka projection function ) 23

24 Non-linear SVM objective function Objective: E(α) = Σ i α i - Σ i Σ j α i α j y i y j K(x i,x j )/2 Under the constraints that: Σ i α i y i = 0 and 0 α i C Discriminant function: f(x) = Σ i α i y i K(x i,x) w i feature weight 24

25 Kernelization (FYI) Often instead of explicitly writing out the non-linear feature set, one simply calculates l a kernel function of the two input vectors, i.e., K(x i, x j ) = φ(x i ) T φ(x j ) Here s why: any kernel function K(x i, x j ) that satisfies certain conditions, e.g., K(x i, x j ) is a symmetric positive semi-definite function (which implies that the matrix K, where K ij = K(x i, x j ) is a symmetric positive semi-definite matrix), corresponds to a dot product K(x i, x j ) = φ(x i ) T φ(x j ) for some projection function φ(x). Often it s easy to think of defining a kernel function that captures some notation of similarity Non-linear SVMs use the discriminant function f(x; w) =Σ i w i K(x i,x)

26 Some popular kernel functions K(x i, x j ) = (x it x j + 1) P Inhomogeneous Polynomial kernel of degree P K(x i, x j ) = (x it x j ) P Homogeneous Polynomial kernel K(x i, x j ) = exp{-(x i -x j ) T (x i -x j )/ s 2 } Gaussian radial basis function kernel K(x i, x j ) = Pearson(x i,x j ) + 1 Bioinformatics kernel A Also, can make a kernel out of an interaction network!

27 Example: radial basis function kernels support vectors x 1 x 2 Φ(x) = {N(x; x 1, σ 2 I), N(x; x 2, σ 2 I),, N(x; x 2, σ 2 I)} Where N(x; m, Σ) is the probability density of x under a multivariate Gaussian with mean m, and covariance Σ

28 SVM summary A sparse linear classifier that uses as features kernel functions that measure similarity to each data point in the training. 28

29 Neural Networks (aka multilayer perceptrons) h 4 f(x,θ) = Σ k v k h k (x) Where hidden unit : h k (x) = σ(σ i w ik x i ) x 1 Often: σ(x) = 1 / (1 + e -x ) h 1 v and W are parameters x 2 h 2 h 3 Notes: -Fitting the hidden units has become a lot easier using deep belief networks -Could also fit means of RBF for hidden units

30 K-Nearest Neighbour K = 15 Classify x in the same way that a majority of the K nearest neighbours to x are labeled. Can also used # of nearest neighbors that are positive as a discriminant value. K = 1 Smaller values of K lead to more complex classifiers, you can set K using (nested) cross-validation. Images from Elements of Stat Learning, Hastie, Tibshirani, Efron (available online)

31 x 1 Decision trees For each leaf: 1) Choose a dimension to split along, and choose best split using a split criterion (see below) 2) Split along dimension, creating two new leaves, return to (1) until leaves are pure. To reduce overfitting, it is advisable to prune the tree by removing leaves with small numbers of examples x 2 Split criterion: C4.5 (Quinlan 1993): Expected decrease in (binary) entropy CART (Breiman et al 1984): Expected decrease in Gini impurity (1-sum of squared class probabilities)

32 Decision trees x 2 > a? yes no x 1 x 2 a

33 Decision trees x 2 > a? yes no x 1 x 2 > b? yes no etc b x 2 a

34 Decision trees x 2 > a? yes no x 1 x 2 > b? yes no etc b x 2 a

35 Decision tree summary Decision trees learn a recursive splits of the data along individual features that partition the input space into homogeneous groups of data points with the same labels. l 35

36 Ensemble classification Combining multiple classifiers together by training them separately and then averaging their predictions is a good way to avoid overfitting. Bagging (Breiman 1996): Resample training set, train separate classifiers on each sample, have the classifiers vote for the classification Boosting (e.g. Adaboost, Freund and Schapire 1997): Iteratively reweight training sets based on errors of a weighted average of classifiers: Train classifier ( weak learner ) to minimize weighted error on training set Weight new classifier according to prediction error, reweight training i set according to prediction error Repeat Minimizes exponential loss on training set over a convex set of functions

37 (x 1, y 1 ), (x 2, y 2 ),, (x N, y N ) Bootstrap samples Bagging Train separate classifiers (x 1, y 1 ), (x 1, y 1 ), (x 3, y 3 ),, (x N-1, y N-1 ) f 1 (x) (x 1, y 1 ), (x 2, y 2 ), (x 4, y 4 ),, (x N, y N ) f 2 (x) etc (M samples in total) bagged(x) = Σ j f j (x)/m 37

38 Boosting (x 1, y 1 ), (x 2, y 2 ),, (x N, y N ) (x 2, y 2 ), (x 2, y 2 ),, (x N-1 1, y N-1 ) Assess performance f 1 (x) Train classifier Resample hard cases, Set w t f t (x) Train classifier Assess performance (x 1, y 1 ), (x 2, y 2 ),, (x N, y N ) (x 2, y 2 ), (x 2, y 2 ),, (x N-1, y N-1 ) Legend (x i, y i ) -- correct (x i, y i ) -- incorrect boost(x) = Σ j w j f j (x) 38

39 Ensemble learning summary Bagging: classifiers trained separately (in parallel) on different bootstrapped samples of the training set, and make equally weighted votes Boosting: classifiers trained sequentially, weighted by performance, on samples of the training set that focus on hard examples. Final classification is based on weighted votes. 39

40 Random Forests (Breiman 2001) Construct M bootstrapped samples of the training set (of size N) For each sample, build DT using CART (no pruning), but split optimally on randomly chosen features random feature choice reduces correlation among trees, this is a good thing. Since bootstrap resamples training set with replacement, leaves out, on average, (1-1/N) N x 100% of the examples (~100/e% = 36.7%), can use these out-of-bag samples to estimate performance Bag predictions (i.e. average them) Can assess importance of features by evaluating performance of trees containing those features

41 Advantages of Decision Trees Relatively easy to combine continuous and discrete (ordinal or categorical) features within the same classifier, this is harder to do for other methods Random Forests can be trained in parallel (unlike boosted decision trees), and relatively quickly. Random Forests tend to do quite well in empirical tests (but see No Free Lunch theorem)

42 Networks as kernels Can use matrix representations of graphs to generate kernels. One popular graph-based kernel is the diffusion kernel: K =(λi L) -1 Where L = D W, D is a diagonal matrix with the row sums, and W is the matrix representation of the graph. GeneMANIA label propagation: f = (λi L) -1 y

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Classification: Basic Concepts, Decision Trees, and Model Evaluation. General Approach for Building Classification Model

10 10 Classification: Basic Concepts, Decision Trees, and Model Evaluation Dr. Hui Xiong Rutgers University Introduction to Data Mining 1//009 1 General Approach for Building Classification Model Tid Attrib1

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Support Vector Machine (SVM)

Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

### CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

### Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

### Support Vector Machines

Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

### Classifiers & Classification

Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin

### Trees and Random Forests

Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG037392-01 Cache Valley, Utah Utah State University Leo

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### Decompose Error Rate into components, some of which can be measured on unlabeled data

Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### 1. Bag of visual words model: recognizing object categories

1. Bag of visual words model: recognizing object categories 1 1 1 Problem: Image Classification Given: positive training images containing an object class, and negative training images that don t Classify:

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

### Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### The Operational Value of Social Media Information. Social Media and Customer Interaction

The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia

### L25: Ensemble learning

L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### 6 Classification and Regression Trees, 7 Bagging, and Boosting

hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### Introduction to Machine Learning. Rob Schapire Princeton University. schapire

Introduction to Machine Learning Rob Schapire Princeton University www.cs.princeton.edu/ schapire Machine Learning studies how to automatically learn to make accurate predictions based on past observations

### Support Vector Machines

Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Data Mining Classification: Alternative Techniques. Instance-Based Classifiers. Lecture Notes for Chapter 5. Introduction to Data Mining

Data Mining Classification: Alternative Techniques Instance-Based Classifiers Lecture Notes for Chapter 5 Introduction to Data Mining by Tan, Steinbach, Kumar Set of Stored Cases Atr1... AtrN Class A B

### Final Exam, Spring 2007

10-701 Final Exam, Spring 2007 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you brought:

### PCA, Clustering and Classification. By H. Bjørn Nielsen strongly inspired by Agnieszka S. Juncker

PCA, Clustering and Classification By H. Bjørn Nielsen strongly inspired by Agnieszka S. Juncker Motivation: Multidimensional data Pat1 Pat2 Pat3 Pat4 Pat5 Pat6 Pat7 Pat8 Pat9 209619_at 7758 4705 5342

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Classification Techniques for Remote Sensing

Classification Techniques for Remote Sensing Selim Aksoy Department of Computer Engineering Bilkent University Bilkent, 06800, Ankara saksoy@cs.bilkent.edu.tr http://www.cs.bilkent.edu.tr/ saksoy/courses/cs551

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Neural Networks and Support Vector Machines

INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Neural Networks & Boosting

Neural Networks & Boosting Bob Stine Dept of Statistics, School University of Pennsylvania Questions How is logistic regression different from OLS? Logistic mean function for probabilities Larger weight

### Introduction to Machine Learning NPFL 054

Introduction to Machine Learning NPFL 054 http://ufal.mff.cuni.cz/course/npfl054 Barbora Hladká hladka@ufal.mff.cuni.cz Martin Holub holub@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and

### Local features and matching. Image classification & object localization

Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to

### A Survey of Kernel Clustering Methods

A Survey of Kernel Clustering Methods Maurizio Filippone, Francesco Camastra, Francesco Masulli and Stefano Rovetta Presented by: Kedar Grama Outline Unsupervised Learning and Clustering Types of clustering

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Breaking SVM Complexity with Cross-Training

Breaking SVM Complexity with Cross-Training Gökhan H. Bakır Max Planck Institute for Biological Cybernetics, Tübingen, Germany gb@tuebingen.mpg.de Léon Bottou NEC Labs America Princeton NJ, USA leon@bottou.org

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status

Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Ensemble Approach for the Classification of Imbalanced Data

Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Introduction to Neural Networks : Revision Lectures

Introduction to Neural Networks : Revision Lectures John A. Bullinaria, 2004 1. Module Aims and Learning Outcomes 2. Biological and Artificial Neural Networks 3. Training Methods for Multi Layer Perceptrons

### Notes on Support Vector Machines

Notes on Support Vector Machines Fernando Mira da Silva Fernando.Silva@inesc.pt Neural Network Group I N E S C November 1998 Abstract This report describes an empirical study of Support Vector Machines

### Comparison of machine learning methods for intelligent tutoring systems

Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu

### Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### Data Mining for Knowledge Management. Classification

1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher