Machine Learning and Data Mining: Regression (adapted from slides by Prof. Alexander Ihler)
Overview
- Regression problem definition and parameters θ
- Prediction using θ as parameters
- Measuring the error
- Finding good parameters θ (direct minimization)
- Non-linear regression
Example of Regression: vehicle price estimation
Features x: fuel type, number of doors, engine size. Target y: price of the vehicle.

Training data (used to fit the model):

  Fuel type   # of doors   Engine size (61–326)   Price
  1           4            70                     11000
  1           2            120                    17000
  2           2            310                    41000

Testing data (used to evaluate the model):

  Fuel type   # of doors   Engine size (61–326)   Price
  2           4            150                    21000
Example of Regression: vehicle price estimation (continued)
With θ = [-300, -200, 700, 130] (bias, fuel type, # of doors, engine size), the prediction is θ0 + θ1·(fuel type) + θ2·(# of doors) + θ3·(engine size):

  Ex #1: -300 + (-200·1) + (700·4) + (130·70)  = 11400   (actual 11000, error 400)
  Ex #2: -300 + (-200·1) + (700·2) + (130·120) = 16500   (actual 17000, error 500)
  Ex #3: -300 + (-200·2) + (700·2) + (130·310) = 41000   (actual 41000, error 0)

Mean absolute training error = (400 + 500 + 0) / 3 = 300

Test example: -300 + (-200·2) + (700·4) + (130·150) = 21600   (actual 21000, error 600)
Mean absolute testing error = 600 / 1 = 600
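The worked example above can be sketched in code; the θ vector and data values are the slide's, while the function and variable names are mine.

```python
# Sketch of the slide's worked example: a linear predictor with
# theta = [-300, -200, 700, 130] = [bias, fuel type, # doors, engine size].
def predict(theta, x):
    """Evaluate theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1]."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

theta = [-300, -200, 700, 130]
train_X = [(1, 4, 70), (1, 2, 120), (2, 2, 310)]
train_y = [11000, 17000, 41000]

preds = [predict(theta, x) for x in train_X]
mae = sum(abs(y - p) for y, p in zip(train_y, preds)) / len(train_y)
print(preds, mae)   # -> [11400, 16500, 41000] 300.0
```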
Supervised learning
Notation: features x (input variables), targets y (output variables), predictions ŷ, parameters θ.
The training data (features plus target values) feed a program, the learner, which is characterized by some parameters θ and a procedure that uses θ to output a prediction ŷ. The error is the distance between y and ŷ; using this feedback, the learning algorithm changes θ to improve performance, and the model is evaluated by measuring the error.
Linear regression
Predictor: evaluate the line ŷ = θ0 + θ1·x1 and return ŷ (the black line in the figure).
Example: for the line y = 5 + 1.5·x1, a new instance with x1 = 8 gets predicted value ŷ = 5 + 1.5·8 = 17.
[Figure: target y plotted against feature x1, with the fitted line.]
We define the form of the function f(x) explicitly, then find a good f(x) within that family.
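A minimal sketch of the predictor described above, using the slide's example line y = 5 + 1.5·x1 (the function name is mine):

```python
# Evaluate the line y_hat = theta0 + theta1 * x1.
def predict_line(theta0, theta1, x1):
    return theta0 + theta1 * x1

print(predict_line(5, 1.5, 8))   # -> 17.0
```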
More dimensions?
[Figure: two 3-D surface plots of the predicted ŷ over features x1 and x2; with two features the predictor is a plane rather than a line.]
Notation
ŷ = θ0 + θ1·x1 + … + θn·xn defines a hyperplane in (n+1)-dimensional space, where n = the number of features in the dataset.
Define a constant feature x0 = 1; then ŷ = θ0·x0 + θ1·x1 + … + θn·xn, a single dot product θ·x.
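The x0 = 1 trick can be sketched with NumPy; the feature values and θ below are illustrative, not from the slides:

```python
import numpy as np

# Prepend a constant-1 column so the prediction is one dot product
# per row: y_hat = X1 @ theta, with theta[0] acting as the bias.
X = np.array([[70.], [120.], [310.]])           # one feature per row (illustrative)
X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # add the constant feature x0 = 1
theta = np.array([100., 130.])                  # [theta0, theta1], illustrative values
y_hat = X1 @ theta
print(y_hat)                                    # predictions 9200, 15700, 40400
```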
Measuring error
Red points = the real target values y; black line = the predictions ŷ = θ0 + θ1·x; blue lines = the errors, or residuals: the differences between each real value y and predicted value ŷ.
[Figure: observations, prediction line, and residuals.]
Mean Squared Error
How can we quantify the error? Average the squared residuals, where y is the real target value in the dataset and ŷ = θ·x is the predicted value:

  J(θ) = (1/m) Σᵢ (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)²,   m = number of data instances

Training error: m = the number of training instances.
Testing error: the same formula computed on a held-out partition of the data to check the predicted values; m = the number of testing instances.
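The MSE above can be sketched directly (the numbers reuse the vehicle-price example; the function name is mine):

```python
# J = (1/m) * sum of squared residuals.
def mse(y, y_hat):
    m = len(y)
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / m

# Residuals from the vehicle example were 400, 500, and 0:
print(mse([11000, 17000, 41000], [11400, 16500, 41000]))   # -> 136666.66...
```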
MSE cost function
Rewrite using matrix form: X = the input variables in the dataset (one row per instance), y = the output variable, m = the number of instances, n = the number of features. With θ as a row vector,

  J(θ) = (1/m) (y − θXᵀ)(y − θXᵀ)ᵀ

(Matlab) >> e = y - th*X';  J = e*e'/m;
Visualizing the error function
J is the error function: the surface shows the value of J over parameter space, not a plane fitted to the output values. The dimensions are θ0 and θ1 instead of x1 and x2, and the height is J instead of the target value y.
The contour plot shows J in 2-D: the inner red circles have lower values of J, the outer red circles higher values.
[Figure: 3-D surface of J(θ) and its contour plot over θ0 and θ1.]
Finding good parameters
We want to find the parameters θ which minimize our error. Think of a cost surface: the error residual J(θ) at each value of θ.
MSE Minimum (m ≤ n+1)
Consider a simple problem: one feature and two data points, so n+1 = 1+1 = 2 unknowns and m = 2 equations (m = number of data instances, n = number of features):

  y⁽¹⁾ = θ0 + θ1·x⁽¹⁾
  y⁽²⁾ = θ0 + θ1·x⁽²⁾

We can solve this system directly: θ gives a line (or plane) that fits all the target values exactly.
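A sketch of this exactly-determined case; the two data points are hypothetical, chosen to lie on the earlier example line y = 5 + 1.5·x:

```python
import numpy as np

# Two unknowns (theta0, theta1), two equations: solve exactly.
x = np.array([2., 8.])
y = np.array([8., 17.])                    # hypothetical points on y = 5 + 1.5x
A = np.column_stack([np.ones_like(x), x])  # rows [1, x_i]
theta = np.linalg.solve(A, y)
print(theta)                               # approximately [5.0, 1.5]
```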
SSE Minimum (m > n+1)
Most of the time m > n+1, for example n+1 = 1+1 = 2 unknowns but m = 3 equations, and there may be no linear function that hits all the data exactly. At the minimum of the cost the gradient is equal to zero (the tangent is horizontal). Writing θ as a column vector, setting the gradient to zero and reordering, we have

  ∇J(θ) = 0   ⇒   XᵀX θ = Xᵀy   ⇒   θ = (XᵀX)⁻¹ Xᵀy

We just need to know how to compute these quantities to find the parameters.
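The normal equations can be sketched with NumPy; the three data points below are hypothetical:

```python
import numpy as np

# m = 3 instances, n+1 = 2 unknowns: no exact fit in general, so solve
# the normal equations (X^T X) theta = X^T y for the least-squares theta.
x = np.array([0., 1., 2.])
y = np.array([1., 3., 4.])                 # hypothetical, not exactly on a line
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                               # approximately [1.167, 1.5]
```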
Effects of Mean Square Error choice: outlier data
An outlier is an observation that lies an abnormal distance from the other values. Under MSE, a residual of 16 contributes 16² = 256 to the cost: the heavy penalty for large errors drags the fitted line away from the other points.
[Figure: fitted lines with and without the outlier.]
Absolute error
[Figure comparing fits: MSE on the original data, absolute error on the original data, and absolute error on the outlier data; the absolute-error fit is far less affected by the outlier.]
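The difference between the two losses is easy to see numerically; the residual values below are hypothetical, with the 16 matching the outlier residual from the previous slide:

```python
# One large residual dominates the squared loss (16^2 = 256)
# but contributes only linearly (16) to the absolute loss.
residuals = [1, -2, 1, 16]   # hypothetical residuals; 16 is the outlier
mse = sum(r ** 2 for r in residuals) / len(residuals)
mae = sum(abs(r) for r in residuals) / len(residuals)
print(mse, mae)   # -> 65.5 5.0
```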
Error functions for regression
  Mean squared error:    J(θ) = (1/m) Σᵢ (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)²
  Mean absolute error:   J(θ) = (1/m) Σᵢ |y⁽ⁱ⁾ − ŷ⁽ⁱ⁾|
  Something else entirely (???)
Arbitrary error functions can't be solved in closed form, so as an alternative we use gradient descent.
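Gradient descent on the MSE cost can be sketched as follows; the data, step size, and iteration count are illustrative choices of mine:

```python
# One gradient-descent step on J(theta) = (1/m) * sum((theta0 + theta1*x - y)^2).
def grad_step(theta, xs, ys, alpha):
    m = len(ys)
    g0 = sum((theta[0] + theta[1] * x - y) for x, y in zip(xs, ys)) * 2 / m
    g1 = sum((theta[0] + theta[1] * x - y) * x for x, y in zip(xs, ys)) * 2 / m
    return (theta[0] - alpha * g0, theta[1] - alpha * g1)

xs, ys = [0., 1., 2., 3.], [1., 3., 5., 7.]   # hypothetical data on y = 1 + 2x
theta = (0., 0.)
for _ in range(5000):
    theta = grad_step(theta, xs, ys, alpha=0.05)
print(theta)   # converges close to (1.0, 2.0)
```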
Nonlinear functions
Single feature x, predict target y: ŷ = θ0 + θ1·x.
Add features: ŷ = θ0 + θ1·x + θ2·x² + θ3·x³. This is still linear regression, just in the new features (1, x, x², x³).
It is sometimes useful to think of this as a feature transform: convert a non-linear function of x into a linear function of the transformed features, then solve it as before.
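The feature-transform idea can be sketched as polynomial regression; the target function here is a hypothetical cubic:

```python
import numpy as np

# Build the transformed features [1, x, x^2, x^3], then fit an
# ordinary least-squares linear regression on them.
x = np.array([-1., 0., 1., 2., 3.])
y = 2 + x - x ** 3                            # hypothetical cubic target
Phi = np.column_stack([x ** p for p in range(4)])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)                                  # approximately [2, 1, 0, -1]
```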
Higher-order polynomials

  ŷ = θ0
  ŷ = θ0 + θ1·x
  ŷ = θ0 + θ1·x + θ2·x² + θ3·x³

Are more features better? These are nested hypotheses: a 2nd-order model is more general than a 1st-order one, a 3rd-order more general than a 2nd, and so on, so a higher-order model always fits the observed data at least as well.
[Figure: 1st-order, 3rd-order, and higher-order polynomial fits to the same data.]
Test data
After training the model, go out and get more data from the world: new observations (x, y). How well does our model perform on them?
[Figure: training data vs. new, test data.]
Training versus test error
Plot MSE as a function of model complexity (polynomial order). Training error decreases with order: a more complex function fits the training data better. What about new data?
  0th to 2nd order: test error decreases (underfitting region)
  Higher order: test error increases (overfitting region)
[Figure: mean squared error vs. polynomial order for the training data and for new, test data; underfitting on the left, overfitting on the right.]
Summary
- Regression problem definition: vehicle price estimation
- Prediction using θ
- Measuring the error: the difference between y and ŷ, e.g. absolute error or MSE
- Direct minimization: the two cases m ≤ n+1 and m > n+1
- Non-linear regression: finding the best nth-order polynomial for each problem (neither overfitting nor underfitting)