Crossvalidation for detecting and preventing overfitting


 Christian Little
 2 years ago
 Views:
Transcription
1 Crossvalidation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures. Feel free to use these slides verbatim, or to modif them to fit our own needs. PowerPoint originals are available. If ou make use of a significant portion of these slides in our own lecture, please include this message, or the following link to the source repositor of Andrew s tutorials: Comments and corrections gratefull received. Andrew W. Moore Professor School of Computer Science Carnegie Mellon Universit Copright Andrew W. Moore Slide 1
2 A Regression Problem = f() + noise Can we learn f from this data? Let s consider three methods Copright Andrew W. Moore Slide 2
3 Linear Regression Copright Andrew W. Moore Slide 3
4 Linear Regression Univariate Linear regression with a constant term: X 3 1 : Y 7 3 : X= = : : 1 =(3).. 1 =7.. Originall discussed in the previous Andrew Lecture: Neural Nets Copright Andrew W. Moore Slide 4
5 Linear Regression Univariate Linear regression with a constant term: X 3 1 : Y 7 3 : X= = : : Z= 1 =(3).. 1 =7.. = : z 1 =(1,3).. 1 =7.. z k =(1, k ) 7 3 : Copright Andrew W. Moore Slide 5
6 Linear Regression Univariate Linear regression with a constant term: X 3 1 : Y 7 3 : X= = : : Z= 1 =(3).. 1 =7.. = : 7 3 : β=(z T Z) 1 (Z T ) z 1 =(1,3).. 1 =7.. est = β 0 + β 1 z k =(1, k ) Copright Andrew W. Moore Slide 6
7 Quadratic Regression Copright Andrew W. Moore Slide 7
8 Quadratic Regression X 3 1 : Z= Y 7 3 : : z=(1,, 2,) 9 1 X= = 3 1 : =(3,2).. 1 =7.. = 7 3 : : Much more about this in the future Andrew Lecture: Favorite Regression Algorithms β=(z T Z) 1 (Z T ) est = β 0 + β 1 + β 2 2 Copright Andrew W. Moore Slide 8
9 Jointhedots Also known as piecewise linear nonparametric regression if that makes ou feel better Copright Andrew W. Moore Slide 9
10 Which is best? Wh not choose the method with the best fit to the data? Copright Andrew W. Moore Slide 10
11 What do we reall want? Wh not choose the method with the best fit to the data? How well are ou going to predict future data drawn from the same distribution? Copright Andrew W. Moore Slide 11
12 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set Copright Andrew W. Moore Slide 12
13 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set 3. Perform our regression on the training set (Linear regression eample) Copright Andrew W. Moore Slide 13
14 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set (Linear regression eample) Mean Squared Error = Perform our regression on the training set 4. Estimate our future performance with the test set Copright Andrew W. Moore Slide 14
15 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set (Quadratic regression eample) Mean Squared Error = Perform our regression on the training set 4. Estimate our future performance with the test set Copright Andrew W. Moore Slide 15
16 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set (Join the dots eample) Mean Squared Error = Perform our regression on the training set 4. Estimate our future performance with the test set Copright Andrew W. Moore Slide 16
17 Good news: Ver ver simple The test set method Can then simpl choose the method with the best testset score Bad news: What s the downside? Copright Andrew W. Moore Slide 17
18 Good news: Ver ver simple The test set method Can then simpl choose the method with the best testset score Bad news: Wastes data: we get an estimate of the best method to appl to 30% less data If we don t have much data, our testset might just be luck or unluck We sa the testset estimator of performance has high variance Copright Andrew W. Moore Slide 18
19 LOOCV (Leaveoneout Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record Copright Andrew W. Moore Slide 19
20 LOOCV (Leaveoneout Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset Copright Andrew W. Moore Slide 20
21 LOOCV (Leaveoneout Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R1 datapoints Copright Andrew W. Moore Slide 21
22 LOOCV (Leaveoneout Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R1 datapoints 4. Note our error ( k, k ) Copright Andrew W. Moore Slide 22
23 LOOCV (Leaveoneout Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. Copright Andrew W. Moore Slide 23
24 LOOCV (Leaveoneout Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV = 2.12 Copright Andrew W. Moore Slide 24
25 LOOCV for Quadratic Regression For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =0.962 Copright Andrew W. Moore Slide 25
26 LOOCV for Join The Dots For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =3.33 Copright Andrew W. Moore Slide 26
27 Which kind of Cross Validation? Testset Leaveoneout Downside Variance: unreliable estimate of future performance Epensive. Has some weird behavior Upside Cheap Doesn t waste data..can we get the best of both worlds? Copright Andrew W. Moore Slide 27
28 kfold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) Copright Andrew W. Moore Slide 28
29 kfold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the testset sum of errors on the red points. Copright Andrew W. Moore Slide 29
30 kfold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the testset sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the testset sum of errors on the green points. Copright Andrew W. Moore Slide 30
31 kfold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the testset sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the testset sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the testset sum of errors on the blue points. Copright Andrew W. Moore Slide 31
32 kfold Cross Validation Linear Regression MSE 3FOLD =2.05 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the testset sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the testset sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the testset sum of errors on the blue points. Then report the mean error Copright Andrew W. Moore Slide 32
33 kfold Cross Validation Quadratic Regression MSE 3FOLD =1.11 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the testset sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the testset sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the testset sum of errors on the blue points. Then report the mean error Copright Andrew W. Moore Slide 33
34 kfold Cross Validation Jointthedots MSE 3FOLD =2.93 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the testset sum of errors on the red points. For the green partition: Train on all the points not in the green partition. Find the testset sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find the testset sum of errors on the blue points. Then report the mean error Copright Andrew W. Moore Slide 34
35 Which kind of Cross Validation? Testset Leaveoneout 10fold 3fold Rfold Downside Variance: unreliable estimate of future performance Epensive. Has some weird behavior Wastes 10% of the data. 10 times more epensive than test set Wastier than 10fold. Epensivier than test set Identical to Leaveoneout Upside Cheap Doesn t waste data Onl wastes 10%. Onl 10 times more epensive instead of R times. Slightl better than testset Copright Andrew W. Moore Slide 35
36 Which kind of Cross Validation? Downside Upside Testset Leaveoneout 10fold 3fold Rfold Variance: unreliable Cheap estimate of future performance But note: One of Epensive. Andrew s Doesn t jos waste in data life is Has some weird behavioralgorithmic tricks for Wastes 10% of the data. making Onl wastes these cheap 10%. Onl 10 times more epensive 10 times more epensive than testset instead of R times. Wastier than 10fold. Epensivier than testset Identical to Leaveoneout Slightl better than testset Copright Andrew W. Moore Slide 36
37 CVbased Model Selection We re tring to decide which algorithm to use. We train each machine and make a table i f i TRAINERR 10FOLDCVERR Choice 1 f 1 2 f 2 3 f 3 4 f 4 5 f 5 6 f 6 Copright Andrew W. Moore Slide 37
38 CVbased Model Selection Eample: Choosing number of hidden units in a onehiddenlaer neural net. Step 1: Compute 10fold CV error for si different model classes: Algorithm TRAINERR 10FOLDCVERR Choice 0 hidden units 1 hidden units 2 hidden units 3 hidden units 4 hidden units 5 hidden units Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Copright Andrew W. Moore Slide 38
39 CVbased Model Selection Eample: Choosing k for a knearestneighbor regression. Step 1: Compute LOOCV error for si different model classes: Algorithm TRAINERR 10foldCVERR Choice K=1 K=2 K=3 K=4 K=5 K=6 Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Copright Andrew W. Moore Slide 39
40 Algorithm K=1 K=2 K=3 K=4 K=5 K=6 CVbased Model Selection Eample: Choosing k for a knearestneighbor regression. Step 1: Compute LOOCV error for si different NN (and model all other nonparametric classes: Wh did we use 10foldCV for neural nets and LOOCV for k nearest neighbor? TRAINERR And wh stop LOOCVERR at K=6 Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum? What should we do if we are depressed at the epense of doing LOOCV for K= 1 through 1000? The reason is Computational. For k methods) LOOCV happens to be as cheap as regular predictions. No good reason, ecept it looked like things were getting worse as K was increasing Choice Sadl, no. And in fact, the relationship can be ver bump. Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64 K=1024 Idea Two: Hillclimbing from an initial guess at K Copright Andrew W. Moore Slide 40
41 CVbased Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Copright Andrew W. Moore Slide 41
42 CVbased Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Degree of polnomial in polnomial regression Whether to use full, diagonal or spherical Gaussians in a Gaussian Baes Classifier. The Kernel Width in Kernel Regression The Kernel Width in Locall Weighted Regression The Baesian Prior in Baesian Regression These involve choosing the value of a realvalued parameter. What should we do? Copright Andrew W. Moore Slide 42
43 CVbased Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Degree of polnomial in polnomial regression Whether to use full, diagonal or spherical Gaussians in a Gaussian Baes Classifier. The Kernel Width in Kernel Regression The Kernel Width in Locall Weighted Regression The Baesian Prior in Baesian Regression These involve choosing the value of a realvalued parameter. What should we do? Idea One: Consider a discrete set of values (often best to consider a set of values with eponentiall increasing gaps, as in the KNN eample). LOOCV Idea Two: Compute and then do gradianet descent. Parameter Copright Andrew W. Moore Slide 43
44 CVbased Model Selection Can ou think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Degree of polnomial in polnomial regression Whether to use full, diagonal or spherical Gaussians in a Gaussian Baes Classifier. The Kernel Width in Kernel Regression The Kernel Width in Locall Weighted Regression The Baesian Prior in Baesian Regression These involve choosing the value of a realvalued parameter. What should we do? Also: The scale factors of a nonparametric distance metric Idea One: Consider a discrete set of values (often best to consider a set of values with eponentiall increasing gaps, as in the KNN eample). LOOCV Idea Two: Compute and then do gradianet descent. Parameter Copright Andrew W. Moore Slide 44
45 CVbased Algorithm Choice Eample: Choosing which regression algorithm to use Step 1: Compute 10foldCV error for si different model classes: Algorithm TRAINERR 10foldCVERR Choice 1NN 10NN Linear Reg n Quad reg n LWR, KW=0.1 LWR, KW=0.5 Step 2: Whichever algorithm gave best CV score: train it with all the data, and that s the predictive model ou ll use. Copright Andrew W. Moore Slide 45
46 Alternatives to CVbased model selection Model selection methods: 1. Crossvalidation 2. AIC (Akaike Information Criterion) 3. BIC (Baesian Information Criterion) 4. VCdimension (VapnikChervonenkis Dimension) Onl directl applicable to choosing classifiers Described in a future Lecture Copright Andrew W. Moore Slide 46
47 Which model selection method is best? 1. (CV) Crossvalidation 2. AIC (Akaike Information Criterion) 3. BIC (Baesian Information Criterion) 4. (SRMVC) Structural Risk Minimize with VCdimension AIC, BIC and SRMVC advantage: ou onl need the training error. CV error might have more variance SRMVC is wildl conservative Asmptoticall AIC and Leaveoneout CV should be the same Asmptoticall BIC and carefull chosen kfold should be same You want BIC if ou want the best structure instead of the best predictor (e.g. for clustering or Baes Net structure finding) Man alternativesincluding proper Baesian approaches. It s an emotional issue. Copright Andrew W. Moore Slide 47
48 Other Crossvalidation issues Can do leave all pairs out or leaveallntuplesout if feeling resourceful. Some folks do kfolds in which each fold is an independentlchosen subset of the data Do ou know what AIC and BIC are? If so LOOCV behaves like AIC asmptoticall. kfold behaves like BIC if ou choose k carefull If not Nardel nardel noo noo Copright Andrew W. Moore Slide 48
49 CrossValidation for regression Choosing the number of hidden units in a neural net Feature selection (see later) Choosing a polnomial degree Choosing which regressor to use Copright Andrew W. Moore Slide 49
50 Supervising Gradient Descent This is a weird but common use of Testset validation Suppose ou have a neural net with too man hidden units. It will overfit. As gradient descent progresses, maintain a graph of MSEtestseterror vs. Iteration Mean Squared Error Use the weights ou found on this iteration Training Set Test Set Iteration of Gradient Descent Copright Andrew W. Moore Slide 50
51 Supervising Gradient Descent This is a weird but common use of Testset validation Suppose ou have a neural net with too Relies on an intuition that a notfull man hidden units. It will overfit. As gradient descent having fewer progresses, parameters. maintain a graph of MSEtestseterror Works prett well in vs. practice, Iteration apparentl minimized set of weights is somewhat like Mean Squared Error Use the weights ou found on this iteration Training Set Test Set Iteration of Gradient Descent Copright Andrew W. Moore Slide 51
52 Crossvalidation for classification Instead of computing the sum squared errors on a test set, ou should compute Copright Andrew W. Moore Slide 52
53 Crossvalidation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset. Copright Andrew W. Moore Slide 53
54 Crossvalidation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset. What s LOOCV of 1NN? What s LOOCV of 3NN? What s LOOCV of 22NN? Copright Andrew W. Moore Slide 54
55 Crossvalidation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset. But there s a more sensitive alternative: Compute log P(all test outputs all test inputs, our model) Copright Andrew W. Moore Slide 55
56 CrossValidation for classification Choosing the pruning parameter for decision trees Feature selection (see later) What kind of Gaussian to use in a Gaussianbased Baes Classifier Choosing which classifier to use Copright Andrew W. Moore Slide 56
57 CrossValidation for densit estimation Compute the sum of loglikelihoods of test points Eample uses: Choosing what kind of Gaussian assumption to use Choose the densit estimator NOT Feature selection (testset densit will almost alwas look better with fewer features) Copright Andrew W. Moore Slide 57
58 Feature Selection Suppose ou have a learning algorithm LA and a set of input attributes { X 1, X 2.. X m } You epect that LA will onl find some subset of the attributes useful. Question: How can we use crossvalidation to find a useful subset? Four ideas: Another fun area in which Forward selection Andrew has spent a lot of his Backward elimination wild outh Hill Climbing Stochastic search (Simulated Annealing or GAs) Copright Andrew W. Moore Slide 58
59 Ver serious warning Intensive use of cross validation can overfit. How? What can be done about it? Copright Andrew W. Moore Slide 59
60 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You tr 1000 linear regression models, each one using one of the attributes. What can be done about it? Copright Andrew W. Moore Slide 60
61 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You tr 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good! What can be done about it? Copright Andrew W. Moore Slide 61
62 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You tr 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good! But ou realize it would have looked good even if the output had been purel random! What can be done about it? Hold out an additional testset before doing an model selection. Check the best model performs well even on the additional testset. Or: Randomization Testing Copright Andrew W. Moore Slide 62
63 What ou should know Wh ou can t use trainingseterror to estimate the qualit of our learning algorithm on our data. Wh ou can t use training set error to choose the learning algorithm Testset crossvalidation Leaveoneout crossvalidation kfold crossvalidation Feature selection methods CV for classification, regression & densities Copright Andrew W. Moore Slide 63
Knearestneighbor: an introduction to machine learning
Knearestneighbor: an introduction to machine learning Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Outline Types of learning Classification:
More informationLecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Resampling techniques g Threeway data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
More informationIntroduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business
Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business 1 1. Introduction 2. Measuring Accuracy 3. Outof Sample Predictions 4.
More informationIntroduction to Machine Learning
Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationFinal Exam, Spring 2007
10701 Final Exam, Spring 2007 1. Personal info: Name: Andrew account: Email address: 2. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you brought:
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationData Mining  Evaluation of Classifiers
Data Mining  Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationLocal classification and local likelihoods
Local classification and local likelihoods November 18 knearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
More informationL13: crossvalidation
Resampling methods Cross validation Bootstrap L13: crossvalidation Bias and variance estimation with the Bootstrap Threeway data partitioning CSCE 666 Pattern Analysis Ricardo GutierrezOsuna CSE@TAMU
More informationChapter 7. Feature Selection. 7.1 Introduction
Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature
More informationClassification and Regression Trees
Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationCross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
More information5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
More informationWe discuss 2 resampling methods in this chapter  crossvalidation  the bootstrap
Statistical Learning: Chapter 5 Resampling methods (Crossvalidation and bootstrap) (Note: prior to these notes, we'll discuss a modification of an earlier train/test experiment from Ch 2) We discuss 2
More informationIntroduction to machine learning and pattern recognition Lecture 1 Coryn BailerJones
Introduction to machine learning and pattern recognition Lecture 1 Coryn BailerJones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationC19 Machine Learning
C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks
More informationImproving Generalization
Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. CrossValidation 4. Weight Restriction
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationCluster Analysis: Basic Concepts and Algorithms
Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering Kmeans Intuition Algorithm Choosing initial centroids Bisecting Kmeans Postprocessing Strengths
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationMathematical goals. Starting points. Materials required. Time needed
Level A7 of challenge: C A7 Interpreting functions, graphs and tables tables Mathematical goals Starting points Materials required Time needed To enable learners to understand: the relationship between
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationIntroduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
More informationGraphing Quadratic Functions
A. THE STANDARD PARABOLA Graphing Quadratic Functions The graph of a quadratic function is called a parabola. The most basic graph is of the function =, as shown in Figure, and it is to this graph which
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationW6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
More informationINVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4. Example 1
Chapter 1 INVESTIGATIONS AND FUNCTIONS 1.1.1 1.1.4 This opening section introduces the students to man of the big ideas of Algebra 2, as well as different was of thinking and various problem solving strategies.
More informationNeural Networks. CAP5610 Machine Learning Instructor: GuoJun Qi
Neural Networks CAP5610 Machine Learning Instructor: GuoJun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationLearning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.
Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose
More informationData Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationIntroduction to Statistical Machine Learning
CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,
More informationAdvice for applying Machine Learning
Advice for applying Machine Learning Andrew Ng Stanford University Today s Lecture Advice on how getting learning algorithms to different applications. Most of today s material is not very mathematical.
More informationOverview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.unisb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
More informationComparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Nonlinear
More informationSupporting Online Material for
www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationExample: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering
Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? KMeans Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19  Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19  Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.standrews.ac.uk twk@standrews.ac.uk Tom Kelsey ID505919B &
More information4.1 PiecewiseDefined Functions
Section 4.1 PiecewiseDefined Functions 335 4.1 PiecewiseDefined Functions In preparation for the definition of the absolute value function, it is etremel important to have a good grasp of the concept
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationEvaluating Machine Learning Algorithms. Machine DIT
Evaluating Machine Learning Algorithms Machine Learning @ DIT 2 Evaluating Classification Accuracy During development, and in testing before deploying a classifier in the wild, we need to be able to quantify
More informationDecompose Error Rate into components, some of which can be measured on unlabeled data
BiasVariance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data BiasVariance Decomposition for Regression BiasVariance Decomposition for Classification BiasVariance
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationLecture Slides for INTRODUCTION TO. Machine Learning. By: Postedited by: R.
Lecture Slides for INTRODUCTION TO Machine Learning By: alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml Postedited by: R. Basili Learning a Class from Examples Class C of a family car Prediction:
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 20092010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network?  Perceptron learners  Multilayer networks What is a Support
More informationModel Validation Techniques
Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost
More informationMachine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu
Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machinelearning Logistics Lectures M 9:3011:30 am Room 4419 Personnel
More informationINTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbookchapterslides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationMachine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationDiscovering Structure in Continuous Variables Using Bayesian Networks
In: "Advances in Neural Information Processing Systems 8," MIT Press, Cambridge MA, 1996. Discovering Structure in Continuous Variables Using Bayesian Networks Reimar Hofmann and Volker Tresp Siemens AG,
More informationCrossValidation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
More informationFind the Relationship: An Exercise in Graphing Analysis
Find the Relationship: An Eercise in Graphing Analsis Computer 5 In several laborator investigations ou do this ear, a primar purpose will be to find the mathematical relationship between two variables.
More informationNew Work Item for ISO 35345 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 35345 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
More informationKMeans Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
KMeans Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring  Overview Random Forest  Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationDrug Store Sales Prediction
Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract  In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,
More informationIntroduction to nonparametric regression: Least squares vs. Nearest neighbors
Introduction to nonparametric regression: Least squares vs. Nearest neighbors Patrick Breheny October 30 Patrick Breheny STA 621: Nonparametric Statistics 1/16 Introduction For the remainder of the course,
More information8. Bilateral symmetry
. Bilateral smmetr Our purpose here is to investigate the notion of bilateral smmetr both geometricall and algebraicall. Actuall there's another absolutel huge idea that makes an appearance here and that's
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.
Lecture 1 Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationUnsupervised Learning: Clustering with DBSCAN Mat Kallada
Unsupervised Learning: Clustering with DBSCAN Mat Kallada STAT 2450  Introduction to Data Mining Supervised Data Mining: Predicting a column called the label The domain of data mining focused on prediction:
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster
More informationSection 6: Model Selection, Logistic Regression and more...
Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building
More information6. The given function is only drawn for x > 0. Complete the function for x < 0 with the following conditions:
Precalculus Worksheet 1. Da 1 1. The relation described b the set of points {(, 5 ),( 0, 5 ),(,8 ),(, 9) } is NOT a function. Eplain wh. For questions  4, use the graph at the right.. Eplain wh the graph
More informationNeural Networks and Support Vector Machines
INF5390  Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF539013 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines
More informationSemiSupervised Support Vector Machines and Application to Spam Filtering
SemiSupervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationCLASSIFICATION AND CLUSTERING. Anveshi Charuvaka
CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training
More informationARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
More informationMath 1400/1650 (Cherry): Quadratic Functions
Math 100/1650 (Cherr): Quadratic Functions A quadratic function is a function Q() of the form Q() = a + b + c with a 0. For eample, Q() = 3 7 + 5 is a quadratic function, and here a = 3, b = 7 and c =
More informationReview of Computer Engineering Research WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM. Geeta R.B.* Shobha R.B.
Review of Computer Engineering Research journal homepage: http://www.pakinsight.com/?ic=journal&journal=76 WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM Geeta R.B.* Department
More informationSummary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen
Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationPolynomial and Rational Functions
Chapter 5 Polnomial and Rational Functions Section 5.1 Polnomial Functions Section summaries The general form of a polnomial function is f() = a n n + a n 1 n 1 + +a 1 + a 0. The degree of f() is the largest
More informationResponse variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.
Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables  logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit
More informationInsurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationInformation Gain. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University.
te to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them
More information