# Cross-validation for detecting and preventing overfitting

 To view this video please enable JavaScript, and consider upgrading to a web browser that supports HTML5 video
Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures. Feel free to use these slides verbatim, or to modif them to fit our own needs. PowerPoint originals are available. If ou make use of a significant portion of these slides in our own lecture, please include this message, or the following link to the source repositor of Andrew s tutorials: Comments and corrections gratefull received. Andrew W. Moore Professor School of Computer Science Carnegie Mellon Universit Copright Andrew W. Moore Slide 1

2 A Regression Problem = f() + noise Can we learn f from this data? Let s consider three methods Copright Andrew W. Moore Slide 2

3 Linear Regression Copright Andrew W. Moore Slide 3

4 Linear Regression Univariate Linear regression with a constant term: X 3 1 : Y 7 3 : X= = : : 1 =(3).. 1 =7.. Originall discussed in the previous Andrew Lecture: Neural Nets Copright Andrew W. Moore Slide 4

5 Linear Regression Univariate Linear regression with a constant term: X 3 1 : Y 7 3 : X= = : : Z= 1 =(3).. 1 =7.. = : z 1 =(1,3).. 1 =7.. z k =(1, k ) 7 3 : Copright Andrew W. Moore Slide 5

6 Linear Regression Univariate Linear regression with a constant term: X 3 1 : Y 7 3 : X= = : : Z= 1 =(3).. 1 =7.. = : 7 3 : β=(z T Z) -1 (Z T ) z 1 =(1,3).. 1 =7.. est = β 0 + β 1 z k =(1, k ) Copright Andrew W. Moore Slide 6

7 Quadratic Regression Copright Andrew W. Moore Slide 7

8 Quadratic Regression X 3 1 : Z= Y 7 3 : : z=(1,, 2,) 9 1 X= = 3 1 : =(3,2).. 1 =7.. = 7 3 : : Much more about this in the future Andrew Lecture: Favorite Regression Algorithms β=(z T Z) -1 (Z T ) est = β 0 + β 1 + β 2 2 Copright Andrew W. Moore Slide 8

9 Join-the-dots Also known as piecewise linear nonparametric regression if that makes ou feel better Copright Andrew W. Moore Slide 9

10 Which is best? Wh not choose the method with the best fit to the data? Copright Andrew W. Moore Slide 10

11 What do we reall want? Wh not choose the method with the best fit to the data? How well are ou going to predict future data drawn from the same distribution? Copright Andrew W. Moore Slide 11

12 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set Copright Andrew W. Moore Slide 12

13 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set 3. Perform our regression on the training set (Linear regression eample) Copright Andrew W. Moore Slide 13

14 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set (Linear regression eample) Mean Squared Error = Perform our regression on the training set 4. Estimate our future performance with the test set Copright Andrew W. Moore Slide 14

15 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set (Quadratic regression eample) Mean Squared Error = Perform our regression on the training set 4. Estimate our future performance with the test set Copright Andrew W. Moore Slide 15

16 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set (Join the dots eample) Mean Squared Error = Perform our regression on the training set 4. Estimate our future performance with the test set Copright Andrew W. Moore Slide 16

17 Good news: Ver ver simple The test set method Can then simpl choose the method with the best test-set score Bad news: What s the downside? Copright Andrew W. Moore Slide 17

18 Good news: Ver ver simple The test set method Can then simpl choose the method with the best test-set score Bad news: Wastes data: we get an estimate of the best method to appl to 30% less data If we don t have much data, our test-set might just be luck or unluck We sa the test-set estimator of performance has high variance Copright Andrew W. Moore Slide 18

19 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record Copright Andrew W. Moore Slide 19

20 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset Copright Andrew W. Moore Slide 20

21 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints Copright Andrew W. Moore Slide 21

22 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) Copright Andrew W. Moore Slide 22

23 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. Copright Andrew W. Moore Slide 23

24 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV = 2.12 Copright Andrew W. Moore Slide 24

25 LOOCV for Quadratic Regression For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =0.962 Copright Andrew W. Moore Slide 25

26 LOOCV for Join The Dots For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =3.33 Copright Andrew W. Moore Slide 26

27 Which kind of Cross Validation? Test-set Leaveone-out Downside Variance: unreliable estimate of future performance Epensive. Has some weird behavior Upside Cheap Doesn t waste data..can we get the best of both worlds? Copright Andrew W. Moore Slide 27

28 k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) Copright Andrew W. Moore Slide 28

29 k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. Copright Andrew W. Moore Slide 29

30 k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. Copright Andrew W. Moore Slide 30

31 k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points. Copright Andrew W. Moore Slide 31

32 k-fold Cross Validation Linear Regression MSE 3FOLD =2.05 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points. Then report the mean error Copright Andrew W. Moore Slide 32

33 k-fold Cross Validation Quadratic Regression MSE 3FOLD =1.11 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points. Then report the mean error Copright Andrew W. Moore Slide 33

34 k-fold Cross Validation Joint-the-dots MSE 3FOLD =2.93 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points. Then report the mean error Copright Andrew W. Moore Slide 34

35 Which kind of Cross Validation? Test-set Leaveone-out 10-fold 3-fold R-fold Downside Variance: unreliable estimate of future performance Epensive. Has some weird behavior Wastes 10% of the data. 10 times more epensive than test set Wastier than 10-fold. Epensivier than test set Identical to Leave-one-out Upside Cheap Doesn t waste data Onl wastes 10%. Onl 10 times more epensive instead of R times. Slightl better than testset Copright Andrew W. Moore Slide 35

36 Which kind of Cross Validation? Downside Upside Test-set Leaveone-out 10-fold 3-fold R-fold Variance: unreliable Cheap estimate of future performance But note: One of Epensive. Andrew s Doesn t jos waste in data life is Has some weird behavioralgorithmic tricks for Wastes 10% of the data. making Onl wastes these cheap 10%. Onl 10 times more epensive 10 times more epensive than testset instead of R times. Wastier than 10-fold. Epensivier than testset Identical to Leave-one-out Slightl better than testset Copright Andrew W. Moore Slide 36

37 CV-based Model Selection We re tring to decide which algorithm to use. We train each machine and make a table i f i TRAINERR 10-FOLD-CV-ERR Choice 1 f 1 2 f 2 3 f 3 4 f 4 5 f 5 6 f 6 Copright Andrew W. Moore Slide 37

38 CV-based Model Selection Eample: Choosing number of hidden units in a onehidden-laer neural net. Step 1: Compute 10-fold CV error for si different model classes: Algorithm TRAINERR 10-FOLD-CV-ERR Choice 0 hidden units 1 hidden units 2 hidden units 3 hidden units 4 hidden units 5 hidden units Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Copright Andrew W. Moore Slide 38

39 CV-based Model Selection Eample: Choosing k for a k-nearest-neighbor regression. Step 1: Compute LOOCV error for si different model classes: Algorithm TRAINERR 10-fold-CV-ERR Choice K=1 K=2 K=3 K=4 K=5 K=6 Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Copright Andrew W. Moore Slide 39

40 Algorithm K=1 K=2 K=3 K=4 K=5 K=6 CV-based Model Selection Eample: Choosing k for a k-nearest-neighbor regression. Step 1: Compute LOOCV error for si different NN (and model all other nonparametric classes: Wh did we use 10-fold-CV for neural nets and LOOCV for k- nearest neighbor? TRAINERR And wh stop LOOCV-ERR at K=6 Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum? What should we do if we are depressed at the epense of doing LOOCV for K= 1 through 1000? The reason is Computational. For k- methods) LOOCV happens to be as cheap as regular predictions. No good reason, ecept it looked like things were getting worse as K was increasing Choice Sadl, no. And in fact, the relationship can be ver bump. Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64 K=1024 Idea Two: Hillclimbing from an initial guess at K Copright Andrew W. Moore Slide 40

41 CV-based Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Copright Andrew W. Moore Slide 41

42 CV-based Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Degree of polnomial in polnomial regression Whether to use full, diagonal or spherical Gaussians in a Gaussian Baes Classifier. The Kernel Width in Kernel Regression The Kernel Width in Locall Weighted Regression The Baesian Prior in Baesian Regression These involve choosing the value of a real-valued parameter. What should we do? Copright Andrew W. Moore Slide 42

43 CV-based Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Degree of polnomial in polnomial regression Whether to use full, diagonal or spherical Gaussians in a Gaussian Baes Classifier. The Kernel Width in Kernel Regression The Kernel Width in Locall Weighted Regression The Baesian Prior in Baesian Regression These involve choosing the value of a real-valued parameter. What should we do? Idea One: Consider a discrete set of values (often best to consider a set of values with eponentiall increasing gaps, as in the K-NN eample). LOOCV Idea Two: Compute and then do gradianet descent. Parameter Copright Andrew W. Moore Slide 43

44 CV-based Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Degree of polnomial in polnomial regression Whether to use full, diagonal or spherical Gaussians in a Gaussian Baes Classifier. The Kernel Width in Kernel Regression The Kernel Width in Locall Weighted Regression The Baesian Prior in Baesian Regression These involve choosing the value of a real-valued parameter. What should we do? Also: The scale factors of a nonparametric distance metric Idea One: Consider a discrete set of values (often best to consider a set of values with eponentiall increasing gaps, as in the K-NN eample). LOOCV Idea Two: Compute and then do gradianet descent. Parameter Copright Andrew W. Moore Slide 44

45 CV-based Algorithm Choice Eample: Choosing which regression algorithm to use Step 1: Compute 10-fold-CV error for si different model classes: Algorithm TRAINERR 10-fold-CV-ERR Choice 1-NN 10-NN Linear Reg n Quad reg n LWR, KW=0.1 LWR, KW=0.5 Step 2: Whichever algorithm gave best CV score: train it with all the data, and that s the predictive model ou ll use. Copright Andrew W. Moore Slide 45

46 Alternatives to CV-based model selection Model selection methods: 1. Cross-validation 2. AIC (Akaike Information Criterion) 3. BIC (Baesian Information Criterion) 4. VC-dimension (Vapnik-Chervonenkis Dimension) Onl directl applicable to choosing classifiers Described in a future Lecture Copright Andrew W. Moore Slide 46

47 Which model selection method is best? 1. (CV) Cross-validation 2. AIC (Akaike Information Criterion) 3. BIC (Baesian Information Criterion) 4. (SRMVC) Structural Risk Minimize with VC-dimension AIC, BIC and SRMVC advantage: ou onl need the training error. CV error might have more variance SRMVC is wildl conservative Asmptoticall AIC and Leave-one-out CV should be the same Asmptoticall BIC and carefull chosen k-fold should be same You want BIC if ou want the best structure instead of the best predictor (e.g. for clustering or Baes Net structure finding) Man alternatives---including proper Baesian approaches. It s an emotional issue. Copright Andrew W. Moore Slide 47

48 Other Cross-validation issues Can do leave all pairs out or leave-allntuples-out if feeling resourceful. Some folks do k-folds in which each fold is an independentl-chosen subset of the data Do ou know what AIC and BIC are? If so LOOCV behaves like AIC asmptoticall. k-fold behaves like BIC if ou choose k carefull If not Nardel nardel noo noo Copright Andrew W. Moore Slide 48

49 Cross-Validation for regression Choosing the number of hidden units in a neural net Feature selection (see later) Choosing a polnomial degree Choosing which regressor to use Copright Andrew W. Moore Slide 49

50 Supervising Gradient Descent This is a weird but common use of Test-set validation Suppose ou have a neural net with too man hidden units. It will overfit. As gradient descent progresses, maintain a graph of MSE-testset-error vs. Iteration Mean Squared Error Use the weights ou found on this iteration Training Set Test Set Iteration of Gradient Descent Copright Andrew W. Moore Slide 50

51 Supervising Gradient Descent This is a weird but common use of Test-set validation Suppose ou have a neural net with too Relies on an intuition that a not-full- man hidden units. It will overfit. As gradient descent having fewer progresses, parameters. maintain a graph of MSE-testset-error Works prett well in vs. practice, Iteration apparentl minimized set of weights is somewhat like Mean Squared Error Use the weights ou found on this iteration Training Set Test Set Iteration of Gradient Descent Copright Andrew W. Moore Slide 51

52 Cross-validation for classification Instead of computing the sum squared errors on a test set, ou should compute Copright Andrew W. Moore Slide 52

53 Cross-validation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset. Copright Andrew W. Moore Slide 53

54 Cross-validation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset. What s LOOCV of 1-NN? What s LOOCV of 3-NN? What s LOOCV of 22-NN? Copright Andrew W. Moore Slide 54

55 Cross-validation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset. But there s a more sensitive alternative: Compute log P(all test outputs all test inputs, our model) Copright Andrew W. Moore Slide 55

56 Cross-Validation for classification Choosing the pruning parameter for decision trees Feature selection (see later) What kind of Gaussian to use in a Gaussianbased Baes Classifier Choosing which classifier to use Copright Andrew W. Moore Slide 56

57 Cross-Validation for densit estimation Compute the sum of log-likelihoods of test points Eample uses: Choosing what kind of Gaussian assumption to use Choose the densit estimator NOT Feature selection (testset densit will almost alwas look better with fewer features) Copright Andrew W. Moore Slide 57

58 Feature Selection Suppose ou have a learning algorithm LA and a set of input attributes { X 1, X 2.. X m } You epect that LA will onl find some subset of the attributes useful. Question: How can we use cross-validation to find a useful subset? Four ideas: Another fun area in which Forward selection Andrew has spent a lot of his Backward elimination wild outh Hill Climbing Stochastic search (Simulated Annealing or GAs) Copright Andrew W. Moore Slide 58

59 Ver serious warning Intensive use of cross validation can overfit. How? What can be done about it? Copright Andrew W. Moore Slide 59

60 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You tr 1000 linear regression models, each one using one of the attributes. What can be done about it? Copright Andrew W. Moore Slide 60

61 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You tr 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good! What can be done about it? Copright Andrew W. Moore Slide 61

62 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You tr 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good! But ou realize it would have looked good even if the output had been purel random! What can be done about it? Hold out an additional testset before doing an model selection. Check the best model performs well even on the additional testset. Or: Randomization Testing Copright Andrew W. Moore Slide 62

63 What ou should know Wh ou can t use training-set-error to estimate the qualit of our learning algorithm on our data. Wh ou can t use training set error to choose the learning algorithm Test-set cross-validation Leave-one-out cross-validation k-fold cross-validation Feature selection methods CV for classification, regression & densities Copright Andrew W. Moore Slide 63

### K-nearest-neighbor: an introduction to machine learning

K-nearest-neighbor: an introduction to machine learning Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Outline Types of learning Classification:

### Lecture 13: Validation

Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model

### Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business

Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business 1 1. Introduction 2. Measuring Accuracy 3. Out-of Sample Predictions 4.

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

### Final Exam, Spring 2007

10-701 Final Exam, Spring 2007 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you brought:

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

### L13: cross-validation

Resampling methods Cross validation Bootstrap L13: cross-validation Bias and variance estimation with the Bootstrap Three-way data partitioning CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU

### Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

### Classification and Regression Trees

Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

### 5. Multiple regression

5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

### We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap

Statistical Learning: Chapter 5 Resampling methods (Cross-validation and bootstrap) (Note: prior to these notes, we'll discuss a modification of an earlier train/test experiment from Ch 2) We discuss 2

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Cluster Analysis: Basic Concepts and Algorithms

Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

### Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

### Mathematical goals. Starting points. Materials required. Time needed

Level A7 of challenge: C A7 Interpreting functions, graphs and tables tables Mathematical goals Starting points Materials required Time needed To enable learners to understand: the relationship between

### Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

A. THE STANDARD PARABOLA Graphing Quadratic Functions The graph of a quadratic function is called a parabola. The most basic graph is of the function =, as shown in Figure, and it is to this graph which

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

### INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4. Example 1

Chapter 1 INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4 This opening section introduces the students to man of the big ideas of Algebra 2, as well as different was of thinking and various problem solving strategies.

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

### Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

### Introduction to Statistical Machine Learning

CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

### Advice for applying Machine Learning

Advice for applying Machine Learning Andrew Ng Stanford University Today s Lecture Advice on how getting learning algorithms to different applications. Most of today s material is not very mathematical.

### Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

### Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

### Supporting Online Material for

www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### 4.1 Piecewise-Defined Functions

Section 4.1 Piecewise-Defined Functions 335 4.1 Piecewise-Defined Functions In preparation for the definition of the absolute value function, it is etremel important to have a good grasp of the concept

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Evaluating Machine Learning Algorithms. Machine DIT

Evaluating Machine Learning Algorithms Machine Learning @ DIT 2 Evaluating Classification Accuracy During development, and in testing before deploying a classifier in the wild, we need to be able to quantify

### Decompose Error Rate into components, some of which can be measured on unlabeled data

Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

### Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

### Lecture Slides for INTRODUCTION TO. Machine Learning. By: Postedited by: R.

Lecture Slides for INTRODUCTION TO Machine Learning By: alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml Postedited by: R. Basili Learning a Class from Examples Class C of a family car Prediction:

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

### Model Validation Techniques

Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

### Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu

Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel

### INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### Discovering Structure in Continuous Variables Using Bayesian Networks

In: "Advances in Neural Information Processing Systems 8," MIT Press, Cambridge MA, 1996. Discovering Structure in Continuous Variables Using Bayesian Networks Reimar Hofmann and Volker Tresp Siemens AG,

### Cross-Validation. Synonyms Rotation estimation

Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

### Find the Relationship: An Exercise in Graphing Analysis

Find the Relationship: An Eercise in Graphing Analsis Computer 5 In several laborator investigations ou do this ear, a primar purpose will be to find the mathematical relationship between two variables.

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

### Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

### Better credit models benefit us all

Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

### Drug Store Sales Prediction

Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

### Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Introduction to nonparametric regression: Least squares vs. Nearest neighbors Patrick Breheny October 30 Patrick Breheny STA 621: Nonparametric Statistics 1/16 Introduction For the remainder of the course,

### 8. Bilateral symmetry

. Bilateral smmetr Our purpose here is to investigate the notion of bilateral smmetr both geometricall and algebraicall. Actuall there's another absolutel huge idea that makes an appearance here and that's

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Lecture 1 Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x-5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Unsupervised Learning: Clustering with DBSCAN Mat Kallada

Unsupervised Learning: Clustering with DBSCAN Mat Kallada STAT 2450 - Introduction to Data Mining Supervised Data Mining: Predicting a column called the label The domain of data mining focused on prediction:

### Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### Section 6: Model Selection, Logistic Regression and more...

Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building

### 6. The given function is only drawn for x > 0. Complete the function for x < 0 with the following conditions:

Precalculus Worksheet 1. Da 1 1. The relation described b the set of points {(-, 5 ),( 0, 5 ),(,8 ),(, 9) } is NOT a function. Eplain wh. For questions - 4, use the graph at the right.. Eplain wh the graph

### Neural Networks and Support Vector Machines

INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines

### Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### Math 1400/1650 (Cherry): Quadratic Functions

Math 100/1650 (Cherr): Quadratic Functions A quadratic function is a function Q() of the form Q() = a + b + c with a 0. For eample, Q() = 3 7 + 5 is a quadratic function, and here a = 3, b = 7 and c =

### Review of Computer Engineering Research WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM. Geeta R.B.* Shobha R.B.

Review of Computer Engineering Research journal homepage: http://www.pakinsight.com/?ic=journal&journal=76 WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM Geeta R.B.* Department

### Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Polynomial and Rational Functions

Chapter 5 Polnomial and Rational Functions Section 5.1 Polnomial Functions Section summaries The general form of a polnomial function is f() = a n n + a n 1 n 1 + +a 1 + a 0. The degree of f() is the largest

### Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit