Regression analysis. MTAT Data Mining Anna Leontjeva
1 Regression analysis MTAT Data Mining 2016 Anna Leontjeva
2 Previous lecture Supervised vs. Unsupervised Learning?
3 Previous lecture Supervised vs. Unsupervised Learning? Iris setosa Iris versicolor Iris virginica
4 Previous lecture Supervised vs. Unsupervised Learning? R packages and their dependencies
5 Previous lecture Supervised vs. Unsupervised Learning? The goal of the supervised approach is to learn a function that maps input x to output y, given a labeled set of pairs D = {(x_i, y_i)}_{i=1..N}. The goal of the unsupervised approach is to learn interesting patterns given only an input D = {x_i}_{i=1..N}.
6 Previous lecture Classification or regression?
7 Previous lecture Classification or regression? D = {(x_i, y_i)}_{i=1..N}. Classification: y_i ∈ {1, …, C}. Regression: y_i ∈ ℝ.
8 Agenda KNN Linear regression Logistic regression Overfitting Regularization
9 Parametric and non-parametric methods (illustration by Kerby Rosanes) A parametric model has a fixed number of parameters (model-based); in a non-parametric model the number of parameters grows with the amount of training data (instance-based).
10 Parametric and non-parametric methods (illustration by Kerby Rosanes) A parametric model has a fixed number of parameters (model-based), e.g. regression; in a non-parametric model the number of parameters grows with the amount of training data (instance-based), e.g. K-nearest neighbors.
11 Parametric and non-parametric methods (illustration by Kerby Rosanes) Parametric: + faster to use; − stronger assumptions about the data distribution. Non-parametric: + more flexible; − computationally challenging.
12 Non-parametric: K-nearest neighbors (KNN) To classify a new point x: - look at the K points in the training set that are closest to x - count the members of each class in this set - assign a class to x by majority voting (or return a fraction for a class)
13 K-nearest neighbors (KNN) Define a distance, e.g. Euclidean. To classify a new point x: - look at the K points in the training set that are closest to x - count the members of each class in this set - assign a class to x by majority voting (or return a fraction for a class)
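The three steps above can be sketched directly in code. A minimal illustrative version in Python (the course itself uses R; the function names and the toy data here are our own, not from the slides):

```python
# Minimal KNN sketch: Euclidean distance + majority vote over the K closest points.
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x, k):
    # train: list of (features, label) pairs; x: new point; k: number of neighbours
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]  # majority vote

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
print(knn_predict(train, (0.2, 0.1), k=3))  # two nearby "a" points outvote one "b"
```

Returning `votes[label] / k` instead of the winning label gives the "fraction for a class" variant mentioned on the slide.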
14 KNN Pros: simple concept; easy to implement; asymptotically optimal. Cons: difficult to implement efficiently; not interpretable (instance-based); suffers from the curse of dimensionality.
16 The curse of dimensionality An increase in dimensionality (the number of features) leads to sparsity of the data points; definitions of density and distance between points become less meaningful, and algorithms may perform poorly on high-dimensional data.
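The "distance becomes less meaningful" effect is easy to demonstrate numerically: as dimensionality grows, the ratio between the nearest and the farthest neighbour distance approaches 1, so every point looks roughly equally far away. An illustrative Python sketch (the helper name and the uniform toy data are our own assumptions, not from the slides):

```python
# Contrast between nearest and farthest distance shrinks as dimension grows.
import math, random

def distance_contrast(dim, n_points=200, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the demo deterministic
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [0.5] * dim        # centre of the unit hypercube
    dists = [math.dist(query, p) for p in points]
    return min(dists) / max(dists)  # ratio near 1 means all points look equally far

low = distance_contrast(2)
high = distance_contrast(500)
print(low, high)  # the contrast ratio rises toward 1 in high dimensions
```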
17 KNN Linear regression Logistic regression Overfitting Regularization
18 Parametric model: Linear regression
19 Simple linear regression Task: given a list of observations D = {(x_i, y_i)}_{i=1..N}, find a line ŷ = ax + b that approximates the correspondence in the data.
20 Simple linear regression y = β₀ + β₁x + ε, where y is the output (dependent variable, response) and x is the input (independent variable, feature, explanatory variable, etc.)
21 Simple linear regression y = β₀ + β₁x + ε: β₀ is the intercept (bias), the mean of y when x = 0; β₁ is the coefficient (slope, or weight w), showing how much the output increases if the input increases by one unit; ε is the noise (error term, residual), which captures what we are not able to predict with x.
22 Simple linear regression y = β₀ + β₁x + ε
23 Simple linear regression We search for a function ŷ = f(x) that minimizes the mean squared error (MSE): (1/N) Σ_{i=1..N} (y_i − ŷ_i)² = (1/N) Σ_{i=1..N} (y_i − β₀ − β₁x_i)², which means finding the derivatives with respect to β₀ and β₁ and solving the system of equations ∂MSE/∂β₀ = 0, ∂MSE/∂β₁ = 0.
24 Simple linear regression Solving ∂MSE/∂β₀ = 0, ∂MSE/∂β₁ = 0 gives: β₀ = ȳ − β₁x̄ and β₁ = Σ_{i=1..N}(x_i − x̄)(y_i − ȳ) / Σ_{i=1..N}(x_i − x̄)², where x̄ = (1/N) Σ_{i=1..N} x_i and ȳ = (1/N) Σ_{i=1..N} y_i.
25 Simple linear regression This is a closed-form solution*: β₀ = ȳ − β₁x̄, β₁ = Σ_{i=1..N}(x_i − x̄)(y_i − ȳ) / Σ_{i=1..N}(x_i − x̄)². *it solves a given problem in terms of functions and mathematical operations from a generally accepted set of operations
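The closed-form estimates translate directly into code. A sketch in plain Python (the lecture's own examples use R's lm(); `simple_ols` and the toy data are illustrative names of ours):

```python
# Closed-form OLS for simple linear regression:
#   beta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  beta0 = y_bar - beta1 * x_bar
def simple_ols(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
            / sum((x - x_bar) ** 2 for x in xs)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

b0, b1 = simple_ols([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 1 + 2x
print(b0, b1)  # → 1.0 2.0
```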
26 Simple linear regression: example
Built-in R dataset: a collection of observations of the Old Faithful geyser in Yellowstone National Park, USA. The waiting column is the length of the waiting period until the next eruption (in mins); eruptions is the duration of the geyser eruptions (in mins).
> data(faithful)
> head(faithful)
> dim(faithful)
> model <- lm(data=faithful, eruptions ~ waiting)
What model do we define here? What is the input and what is the output?
27 Simple linear regression: example
> summary(model)
Call: lm(formula = eruptions ~ waiting, data = faithful)
The output lists the residual quartiles and the coefficient table: both the intercept and waiting have p-values < 2e-16 (***). Residual standard error on 270 degrees of freedom; F-statistic: 1162 on 1 and 270 DF, p-value < 2.2e-16.
R² = 1 − Σ(y_i − f(x_i))² / Σ(y_i − ȳ)²
The fitted model is: eruptions = β̂₀ + β̂₁ × waiting (with the estimates from the coefficient table).
28 Simple linear regression: example in R The fitted model is eruptions = β̂₀ + β̂₁ × waiting. What is the eruption time if waiting was 70?
29 Simple linear regression: example in R
The fitted model is eruptions = β̂₀ + β̂₁ × waiting, so for waiting = 70:
> coef(model)[[1]] + coef(model)[[2]]*70
Calculating predictions for a new set:
> test_set = data.frame(waiting=c(70,80,100))
> predict(model, newdata=test_set)
30 Machine learning secret sauce (diagram): split the Data into a Train set and a Test set, and remember to hold the test set out.
31 Simple linear regression: example in R
train_idx <- sample(nrow(faithful), 172)
train <- faithful[train_idx,]
test <- faithful[-train_idx,]
model <- lm(data=train, eruptions ~ waiting)
test$predictions <- predict(newdata=test, model)
MSE <- (1/nrow(test))*sum((test$eruptions - test$predictions)^2)
> MSE
ggplot(train, aes(x=waiting, y=eruptions)) + geom_point() + geom_smooth(method='lm') + theme_bw()
ggplot(test, aes(x=eruptions, y=predictions)) + geom_point(color='red') + theme_bw() + geom_abline(intercept = 0, slope = 1)
32 Multivariate linear regression All the same, but instead of one feature, x is a k-dimensional vector x_i = (x_{i1}, x_{i2}, …, x_{ik}); the model is a linear combination of all features: ŷ = β₀ + β₁x₁ + … + β_k x_k, or in matrix representation ŷ = Xβ, where X is the N × (k+1) design matrix (a column of ones for the intercept followed by the feature columns) and β = (β₀, β₁, …, β_k).
33 Multivariate linear regression Recall from simple regression the system of equations ∂MSE/∂β₀ = 0, ∂MSE/∂β₁ = 0 with solution β₀ = ȳ − β₁x̄, β₁ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)². For multivariate regression the MSE is defined as MSE = (1/N)(y − Xβ)ᵀ(y − Xβ), with gradient ∂MSE/∂β ∝ −2Xᵀy + 2XᵀXβ.
34 Multivariate linear regression Setting −2Xᵀy + 2XᵀXβ = 0 gives β = (XᵀX)⁻¹Xᵀy. The complexity of the matrix inverse is high (O(n³) for the naive algorithm), so in practice iterative methods are used (e.g. gradient descent).
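A minimal sketch of such an iterative method: plain gradient descent on the simple-regression MSE (illustrative Python; the learning rate and step count are arbitrary choices of ours, not from the slides):

```python
# Gradient descent on MSE for y = b0 + b1*x, instead of the closed-form solution.
def gd_linreg(xs, ys, lr=0.01, steps=5000):
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # gradients of MSE with respect to intercept and slope
        g0 = (-2 / n) * sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = (-2 / n) * sum((y - (b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

b0, b1 = gd_linreg([1, 2, 3, 4], [3, 5, 7, 9])  # data generated by y = 1 + 2x
print(round(b0, 3), round(b1, 3))  # converges to ≈ 1.0 2.0
```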
35 Assumptions - the relationship between x and y is linear - y is distributed normally at each value of x - no heteroscedasticity (i.e. the variance is not systematically changing) - independence and normality of errors - lack of multicollinearity (non-correlated features)
36 Multivariate linear regression A linear model requires the parameters to be linear, not the features! This is a linear model: y = β₀ + β₁x₁ + β₂x₂. This is also a linear model: y = β₀ + β₁x₁ + β₂x₂², obtained by transforming a feature, x′ = φ(x), e.g. x², √x, log(x)… This is not a linear model: one where the parameters enter non-linearly, e.g. y = β₀ + x^{β₁}.
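The point about transformed features can be checked numerically: fitting y against z = x² with the ordinary closed-form estimator from slide 24 recovers a quadratic relationship exactly, because the model is still linear in the parameters. An illustrative Python sketch (names and toy data are ours):

```python
# Fit y = b0 + b1*x^2 by running plain OLS on the transformed feature z = x^2.
def simple_ols(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
         / sum((x - xb) ** 2 for x in xs)
    return yb - b1 * xb, b1

xs = [1, 2, 3, 4, 5]
ys = [2 + 3 * x * x for x in xs]   # data generated from y = 2 + 3x²
zs = [x * x for x in xs]           # feature transform z = x²
b0, b1 = simple_ols(zs, ys)
print(b0, b1)  # → 2.0 3.0 — the quadratic fit is still a linear model in β
```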
37 Multivariate linear regression > head(prestige) education income women prestige census type GOV.ADMINISTRATORS prof GENERAL.MANAGERS prof ACCOUNTANTS prof PURCHASING.OFFICERS prof CHEMISTS prof PHYSICISTS prof
38 Multivariate linear regression
> model_multivariate <- lm(data=prestige, prestige ~ education + log(income, base=10) + women)
> summary(model_multivariate)
Call: lm(formula = prestige ~ education + log(income, base = 10) + women, data = prestige)
The coefficient table shows that the intercept, education, and log(income, base = 10) are all highly significant (***); women is not. Residual standard error on 98 degrees of freedom; Adjusted R-squared: 0.83; F-statistic on 3 and 98 DF, p-value < 2.2e-16.
Interpret the coefficients.
39 KNN Linear regression Logistic regression Overfitting Regularization
40 Logistic regression is not regression!
41 Logistic regression is not regression!* It is classification: y_i ∈ {1, …, C}. *it is called so due to its similarity to linear regression
42 Binary logistic regression Binary means that y is binary: (0, 1). ŷ = β₀ + β₁x₁ + … + β_k x_k
43 Binary logistic regression y is binary: (0, 1). Logistic regression models the log odds of the probability of "success" as a linear function of the input features: logit(π) = ln(π / (1 − π)) = β₀ + β₁x₁ + … + β_k x_k, where π = P(y = 1 | x) and 1 − π = P(y = 0 | x).
44 Binary logistic regression Let π = P(y = 1 | x) and denote M := β₀ + β₁x₁ + … + β_k x_k. Then ln(π / (1 − π)) = M ⇒ π / (1 − π) = e^M ⇒ π = e^M (1 − π) ⇒ π + π e^M = e^M ⇒ π = e^M / (1 + e^M), the sigmoid function.
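A quick numeric check that the sigmoid really inverts the log odds, as the derivation above claims (illustrative Python, not part of the slides):

```python
# sigmoid(M) = exp(M) / (1 + exp(M)); applying the logit to it recovers M.
import math

def sigmoid(m):
    return math.exp(m) / (1 + math.exp(m))

for m in [-2.0, 0.0, 1.5]:
    p = sigmoid(m)
    # log odds of the sigmoid output equals the original linear predictor M
    assert abs(math.log(p / (1 - p)) - m) < 1e-9

print(sigmoid(0.0))  # → 0.5, the midpoint of the S-curve
```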
45 Binary logistic regression Sigmoid means S-shaped; it is also known as a squashing function, since it maps the real line to [0, 1], which is necessary if the output needs to be interpreted as a probability.
46 Binary logistic regression If we threshold the output at 0.5, we create a decision rule of the form ŷ = 1 ⟺ p(y = 1 | x) > 0.5
47 Binary logistic regression: example in R
logit <- glm(data=train, as.factor(danger)~waiting, family='binomial')
summary(logit)
The coefficient table shows significant estimates (***) for both the intercept and waiting. Recall that logit(π) = ln(π / (1 − π)) = β₀ + β₁x₁ + … + β_k x_k, so the raw coefficients are log odds and difficult to interpret directly. We can take the exponent of the coefficients:
> exp(coef(logit))
(Intercept) waiting
"0.000" "1.893"
48 Binary logistic regression: example in R (Intercept) "0.000", waiting "1.893". Now the coefficients express odds: P(y = 1 | x) / P(y = 0 | x). Write down the model and interpret it. If an exponentiated coefficient is close to 1, it is not interesting, as a one-unit increase of x does not change the odds of success.
49 Binary logistic regression: example in R
logit <- glm(data=train, as.factor(danger)~waiting, family='binomial')
test$predictions_probability <- predict(newdata=test, logit, type = 'response')
test$predictions_binary <- ifelse(test$predictions_probability<=0.5,0,1)
table(real=test$danger, predictions=test$predictions_binary)
What is this table called?
50 Binary logistic regression: ROC curve
> roc_obj <- roc(response=test$danger, predictor=test$predictions_probability)
> roc_obj$auc
Area under the curve:
51 Intuition behind ROC: Step I: sort your data according to the score. Step II: following the sorted order, write down the true class of each point. Step III: walking down the list, go up for a 1 and right for a 0.
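The three steps above can be turned into a small AUC computation: after sorting, every "right" step accumulates the current height of the curve, and normalizing by the rectangle area gives the AUC. An illustrative Python sketch (the function name and toy scores are our own; it assumes no tied scores):

```python
# AUC via the sort-and-walk construction: up for a positive, right for a negative.
def roc_auc(labels, scores):
    # step I + II: rank true labels by decreasing score
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    tp = area = 0
    for label in ranked:
        if label == 1:
            tp += 1        # step up
        else:
            area += tp     # step right: add the current height of the curve
    return area / (n_pos * n_neg)

print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # perfect ranking → 1.0
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # one mistake earlier → 0.75
```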
52 Intuition behind ROC: model1: Area under the curve: … model2: Area under the curve: … model2 makes mistakes earlier; both our model and the second one predict correctly when they are confident in their scores; the diagonal corresponds to a random guess.
53 KNN Linear regression Logistic regression Overfitting Regularization
54 Under- and Overfitting
55 How to detect overfitting! Slides by Digvijay Singh
56 How to detect overfitting
57 How to detect overfitting Model does not generalize
58 Bias-variance tradeoff (plot: prediction error vs. model complexity)
59 Bias-variance tradeoff (plot: prediction error vs. model complexity) We want to choose a model that both accurately captures patterns in the training data and generalizes well to unseen data.
60 Bias-variance tradeoff (plot: prediction error vs. model complexity) We want to choose a model that both accurately captures patterns in the training data and generalizes well to unseen data. Unfortunately, there is a tradeoff between the two.
61 Bias-variance tradeoff Bias - error from erroneous assumptions in the learning algorithm (underfitting) Variance - error from sensitivity to small fluctuations (overfitting)
62 Bias-variance tradeoff To reduce variance: dimensionality reduction, feature selection, a larger training set. To reduce bias: adding features. For both: tuning of hyperparameters.
64 Tuning of hyperparameters Method — remedy against over-/underfitting: linear and logistic regression — regularization; K-nearest neighbors — increase of k; decision trees — pruning.
65 KNN Linear regression Logistic regression Overfitting Regularization
66 Regularization (for regression) Recall that in regression we minimized the MSE: (1/N) Σ_{i=1..N} (y_i − ŷ_i)². It is the loss function L for regression. Different methods have different loss functions that describe how to penalize errors.
67 Regularization (for regression) Recall that in regression we minimized the MSE: (1/N) Σ_{i=1..N} (y_i − ŷ_i)², the loss function L for regression. Regularization imposes a penalty R on the size of the coefficients: L = MSE + λR, where R can be the ℓ1 norm or the squared ℓ2 norm.
68 Regularization (for regression) L = MSE + λR, where R can be the ℓ1 norm (Lasso) or the squared ℓ2 norm (Ridge). Lasso results in many coefficients being zero, thus performing feature selection; Ridge regression tends to keep all coefficients, but decrease them to small numbers.
69 Regularization (for regression) L = MSE + λR, where R = Σ_{j=1..p} |β_j| (Lasso, ℓ1 norm) or R = Σ_{j=1..p} β_j² (Ridge, ℓ2 norm). Lasso results in many coefficients being zero, thus performing feature selection; Ridge regression tends to keep all coefficients, but decrease them to small numbers.
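The shrinkage behaviour of Ridge is visible even in the one-feature case: for centered data with no intercept, minimizing Σ(y − βx)² + λβ² gives β = Σxy / (Σx² + λ), so the coefficient is pulled toward zero as λ grows. An illustrative Python sketch (this simplified setting and the names are our own assumptions, not the lecture's code):

```python
# One-feature ridge regression, centered data, no intercept:
# minimizing sum((y - b*x)^2) + lam*b^2 gives b = sum(x*y) / (sum(x^2) + lam).
def ridge_slope(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [-2, -1, 1, 2]
ys = [-4, -2, 2, 4]              # centered data with true slope 2
print(ridge_slope(xs, ys, 0.0))   # → 2.0: lambda = 0 recovers plain OLS
print(ridge_slope(xs, ys, 10.0))  # → 1.0: a larger lambda shrinks the coefficient
```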
70 Summary: KNN, linear regression, logistic regression, overfitting, regularization
71 Recommended literature
Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.
More informationJetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
More informationCSE 473: Artificial Intelligence Autumn 2010
CSE 473: Artificial Intelligence Autumn 2010 Machine Learning: Naive Bayes and Perceptron Luke Zettlemoyer Many slides over the course adapted from Dan Klein. 1 Outline Learning: Naive Bayes and Perceptron
More informationA Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn
A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps
More informationModule 5: Multiple Regression Analysis
Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationWeek 5: Multiple Linear Regression
BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School
More informationThe Probit Link Function in Generalized Linear Models for Data Mining Applications
Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications
More informationChapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples
More informationPart 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More information1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
More informationCSC 411: Lecture 07: Multiclass Classification
CSC 411: Lecture 07: Multiclass Classification Class based on Raquel Urtasun & Rich Zemel s lectures Sanja Fidler University of Toronto Feb 1, 2016 Urtasun, Zemel, Fidler (UofT) CSC 411: 07-Multiclass
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationE(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F
Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,
More informationPenalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
More informationIAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results
IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is R-squared? R-squared Published in Agricultural Economics 0.45 Best article of the
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationI L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of
More informationIntroduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
More informationMachine Learning Methods for Demand Estimation
Machine Learning Methods for Demand Estimation By Patrick Bajari, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang Over the past decade, there has been a high level of interest in modeling consumer behavior
More informationChapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
More information11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
More information