Lecture 9: Evaluating Predictive Models


36-350: Data Mining
October 9, 2006

So far in the course, we have largely been concerned with descriptive models, which try to summarize our data in compact and comprehensible ways, and with information retrieval. Now, in the second half, we are going to focus on predictive models, which attempt to foresee what will happen with new data. We have already seen an example of this in classification problems: algorithms like the nearest-neighbor method try to guess which classes new data belong to, based on old data. You are also familiar with linear regression, which can be used both to describe trends in existing data and to predict the value of the dependent variable on new data.

In both cases, we can gauge how well the predictive model does by looking at its accuracy, or equivalently at its errors. For classification, the usual measure of error is the fraction of cases mis-classified, called the mis-classification rate or just the error rate. For linear regression, the usual measure of error is the sum of squared errors, or equivalently 1 - R², and the corresponding measure of accuracy is R². In the method of maximum likelihood, the accuracy is just the likelihood value, and the error is conventionally the negative log-likelihood.

What we would like, ideally, is a predictive model which has zero error on future data. We basically never achieve this, partly because our models are imperfect and partly because the world is just noisy and stochastic, so even the ideal model will not have zero error all the time. Instead, we would like to minimize the expected error, or risk, on future data.

If we didn't care about future data, this would be easy. We have various possible models, each with different parameter settings, conventionally written θ. We also have a collection of data x_1, x_2, ..., x_n, collectively written x. For each possible model, then, we can compute the error on the data, L(x, θ), called the in-sample loss or the empirical risk.
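The three error measures just named can be made concrete in a few lines. This is a minimal sketch, not part of the original notes; the toy arrays are made up purely for illustration.

```python
import numpy as np

# Classification: mis-classification rate (fraction of cases mis-classified)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])
error_rate = np.mean(y_true != y_pred)

# Regression: sum of squared errors, and R^2 = 1 - SSE / (total sum of squares)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 2.8, 4.3])
sse = np.sum((y - y_hat) ** 2)
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)
```

Minimizing the sum of squared errors and maximizing R² are the same problem, since the total sum of squares does not depend on the model.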
The simplest strategy is then to pick the model, the value of θ, which minimizes the in-sample loss. This means picking the classifier with the lowest in-sample error rate, or the regression which minimizes the sum of squared errors, or the likelihood-maximizing parameter value; this is what you've usually done in statistics courses so far.

There is, however, a potential problem here, because L(x, θ) is not what we really want to minimize. What we really want to minimize is E[L(X, θ)], the expected loss on new data drawn from the same distribution. This is also called the risk, as I said, or the out-of-sample loss, or the generalization error (because it involves generalizing from the old data to new data). The in-sample loss equals the risk plus sampling

noise:

    L(x, θ) = E[L(X, θ)] + ε(θ)

Here ε(θ) is a random term which has mean zero, and represents the effects of having only a finite data sample, rather than the complete probability distribution. (I write it ε(θ) as a reminder that different models are going to be affected differently by the same sampling fluctuations.) The problem, then, is that the model which minimizes the in-sample loss could be one with good generalization performance (E[L(X, θ)] is small), or it could be one which got very lucky (ε(θ) was large and negative).

To see how this can matter, consider choosing among models with different degrees of complexity, also called model selection. In particular, consider Figure 1. Here, I have fitted ten different models, ten different polynomials, to the same data set. The higher-order polynomials give, visually at least, a significantly better fit. This is confirmed by looking at the residual sum of squares:

    degree    RSS
       1     17.19
       2     16.62
       3     16.02
       4     14.53
       5     13.54
       6     13.01
       7     12.25
       8     10.80
       9      9.77
      10      8.65

Since there are only twenty data points, if I continued this out to polynomials of degree twenty, I could get the residual sum of squares down to zero: apparently perfect prediction. The problem comes when we take these models and try to generalize to new data, such as an extra 200 data points drawn from exactly the same distribution (Figure 2):

    degree    Sum of Squared Errors
       1     453.68
       2     455.28
       3     475.53
       4     476.36
       5     484.18
       6     486.97
       7     488.77
       8     518.39
       9     530.25
      10     545.37
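The pattern in the first table, in-sample error falling steadily as the model grows more flexible, is easy to reproduce. Here is a hedged sketch in Python; the data-generating process follows the figure captions, and the seed and exact numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
n = 20
x = rng.standard_normal(n)              # X is standard Gaussian
y = 0.5 * x + rng.standard_normal(n)    # Y = 0.5 X + standard Gaussian noise

rss = []
for degree in range(1, 11):
    coefs = np.polyfit(x, y, degree)    # least-squares polynomial fit
    rss.append(np.sum((y - np.polyval(coefs, x)) ** 2))

# Each higher-degree polynomial family nests the lower-degree ones, so the
# in-sample RSS can only decrease (up to numerical round-off) as degree grows.
```

The monotone decrease is automatic and says nothing about generalization; that is exactly the point of the second table.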

Figure 1: Fits of polynomials from degree 1 to 10 to twenty data points (X is standard Gaussian, Y = 0.5X plus another standard Gaussian). [Plot "Polynomial Fits to Original Data" omitted; axes are the independent and dependent variables.]

Figure 2: Predictions of the polynomial models on 200 new data points. [Plot "Polynomial Fits to New Data" omitted; axes are the independent and dependent variables.]

What's going on here is that the more complicated models, the higher-order polynomials with more terms and parameters, were not actually fitting the generalizable features of the data. Instead, they were fitting the sampling noise, the accidents which don't repeat. That is, the more complicated models over-fit the data. In terms of our earlier notation, ε is bigger for the more flexible models. The model which does best on new data is the linear model because, as it happens, the true model here is linear. If the true model had been, say, quadratic, then we should have seen the second-degree polynomials do best, and the linear models would have under-fit the data, because the (real, generalizable) curvature would have been missed.

After bad data, over-fitting may well be the single biggest problem confronting data mining. People have accordingly evolved many ways to combat it.
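The whole experiment can be run end to end: fit on twenty points, then score on fresh points from the same distribution. This is a sketch under the assumptions of the figure captions; the seed is arbitrary, and the qualitative gap between in-sample and out-of-sample error, not the exact numbers, is the point.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(20)
y = 0.5 * x + rng.standard_normal(20)
x_new = rng.standard_normal(200)                  # fresh data, same distribution
y_new = 0.5 * x_new + rng.standard_normal(200)

sse = {}
for degree in (1, 10):
    coefs = np.polyfit(x, y, degree)
    in_sample = np.sum((y - np.polyval(coefs, x)) ** 2)
    out_sample = np.sum((y_new - np.polyval(coefs, x_new)) ** 2)
    sse[degree] = (in_sample, out_sample)

# The degree-10 fit chases the noise in the twenty training points, so its
# out-of-sample error is far larger than its in-sample error suggests.
```

Note that both models have larger out-of-sample than in-sample error here simply because the test set is ten times bigger; the fair comparison is per-point, or between models on the same test set.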

Big Data

The simplest approach to dealing with over-fitting is to hope that it will go away. As we get more and more data, the law of large numbers (and other limit theorems) tells us that it should become more and more representative of the true data-generating distribution, assuming there is one. Thus ε, the difference between the empirical risk and the generalization risk, should grow smaller and smaller. If you are dealing with a lot of data, you can hope that ε is very small, and that minimizing the in-sample loss will give you a model which generalizes almost as well as possible. Unfortunately, if your model is very flexible, "a lot of data" can mean an exponentially large amount of data.

Penalization

The next bright idea is to say that if the problem is over-flexible models, we should penalize flexibility. That is, instead of minimizing L(x, θ), minimize L(x, θ) + λg(θ), where g(θ) is some kind of indication of the complexity or flexibility of θ, say the number of parameters, and λ is our trade-off factor. Standard linear regression implements a scheme like this in the form of adjusted R². It can work very well, but the trick is in choosing the penalty term; the number of parameters is often used, but is often not appropriate.

Cross-Validation

The most reliable trick, in many ways, is related to what I did above. Divide your data at random into two parts. Call the first part the training set, and use it to fit your models. Then evaluate their performance on the other part, the testing set. Because you divided the data up randomly, the performance on the test data should be an unbiased estimate of the generalization performance. (But unbiased doesn't necessarily mean close.) There are many wrinkles here, some of which we'll see, like multi-fold cross-validation: repeat the division q times, and choose the model which comes out best on average over the q different test sets.
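The train/test recipe above can be sketched in a few lines. Assume, hypothetically, the same kind of simulated data as in the figures and five candidate polynomial degrees; the seed and the fifty-fifty split are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100)
y = 0.5 * x + rng.standard_normal(100)

# Divide the data at random into a training set and a testing set
idx = rng.permutation(100)
train, test = idx[:50], idx[50:]

# Fit each candidate model on the training half, score it on the held-out half
test_sse = {}
for degree in range(1, 6):
    coefs = np.polyfit(x[train], y[train], degree)
    resid = y[test] - np.polyval(coefs, x[test])
    test_sse[degree] = np.sum(resid ** 2)

best = min(test_sse, key=test_sse.get)  # degree with smallest held-out error
```

Multi-fold cross-validation repeats the split-fit-score loop q times and averages the held-out errors before picking `best`.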