Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff
Reading: Chapter 2
Stats 202: Data Mining and Analysis
Lester Mackey
September 23, 2015
(Slide credits: Sergio Bacallado)
Announcements
Homework 1 is online. Submit solutions to Gradescope by 1:30pm (PT) next Wednesday. Remember our course policy: late homework is not accepted, but you can submit partial solutions, and we will drop the lowest homework score.
We've created a pinned post on Piazza to help you find Kaggle competition teammates.
I'd like to get into the habit of repeating your questions (please remind me if I forget!)
Supervised vs. unsupervised learning
In unsupervised learning we seek to understand the relationships between observations or variables in a data matrix:
Datapoints, examples, or observations
Variables or features
Quantitative variables: weight, height, number of children, ...
Qualitative (categorical) variables: college major, profession, gender, ...
Supervised vs. unsupervised learning
In unsupervised learning we seek to understand the relationships between observations or variables in a data matrix. Standard unsupervised learning goals include:
Finding lower-dimensional representations of the data for improved visualization, interpretation, or compression. Dimensionality reduction: PCA, ICA, Isomap, locally linear embeddings, etc.
Finding meaningful groupings of the data. Clustering.
Unsupervised learning is also known in Statistics as exploratory data analysis.
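Not from the slides: a minimal sketch of both unsupervised goals on synthetic data, assuming scikit-learn as the library (the slides do not prescribe one). PCA gives a lower-dimensional representation; k-means gives a grouping.

```python
# Sketch only: PCA for dimensionality reduction, k-means for clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # data matrix: 100 observations, 5 features

# Lower-dimensional representation for visualization / compression
Z = PCA(n_components=2).fit_transform(X)   # each row is now a 2-D summary

# Meaningful groupings of the data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(Z.shape, labels[:10])
```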
Supervised vs. unsupervised learning
In supervised learning, there are input variables and output variables:
Datapoints, examples, or observations
Input variables (also called variables, features, or predictors)
Output variable: if quantitative, we say this is a regression problem; if qualitative / categorical, we say this is a classification problem
Supervised vs. unsupervised learning
In supervised learning, there are input variables and output variables. If X is the vector of inputs for a particular sample, a quantitative output variable Y is typically modeled as
Y = f(X) + ε
where f is fixed and unknown and captures the systematic information that X provides about Y, and ε represents the unpredictable noise (random error) in the problem.
Supervised learning methods seek to learn the relationship between X and Y (e.g., by estimating the function f) using a set of training examples.
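A hedged illustration, not part of the slides: simulating the model Y = f(X) + ε with an invented f and noise level, to make the systematic-plus-noise split concrete.

```python
# Sketch: data drawn from Y = f(X) + eps for a hypothetical f.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                               # fixed f; unknown to the learner in practice
    return np.sin(2 * x) + 0.5 * x

n = 200
x = rng.uniform(0, 5, size=n)           # inputs X
eps = rng.normal(0, 0.3, size=n)        # random error, Var(eps) = 0.09
y = f(x) + eps                          # outputs Y
print(x[:3], y[:3])
```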
Supervised vs. unsupervised learning
Y = f(X) + ε, where ε is random error.
Why learn the relationship between X and Y?
Prediction: useful when the input variable is readily available, but the output variable is not. Example: predict stock prices next month using data from last year.
Inference: a model for f can help us understand the structure of the data: which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g., linear or non-linear? Example: infer which genes are associated with the incidence of heart disease.
Parametric and nonparametric methods
Most supervised learning methods fall into one of two classes:
Parametric methods assume that f has a fixed functional form with a fixed number of parameters, for example, a linear form f(X) = X_1 β_1 + ... + X_p β_p with parameters β_1, ..., β_p. Estimating f then reduces to estimating or fitting the parameters.
Non-parametric methods do not assume a fixed functional form with a fixed number of parameters but often still limit how wiggly or rough the function f can be.
Parametric vs. nonparametric prediction
[Figures 2.4 and 2.5: Income as a function of Years of Education and Seniority, one parametric and one non-parametric fitted surface.]
Parametric methods are fundamentally limited in their fit quality. Non-parametric methods keep improving as we add more data to fit. Parametric methods are often simpler to interpret and are less prone to overfitting to noise in the data.
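As a sketch of the contrast, assuming scikit-learn and synthetic data (neither is prescribed by the slides): a parametric linear fit estimates a fixed number of parameters, while a non-parametric KNN regression lets the data drive the shape of the fit.

```python
# Parametric (linear regression) vs. non-parametric (KNN regression) fits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=(200, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.3, size=200)

param = LinearRegression().fit(x, y)                      # 2 parameters, fixed form
nonparam = KNeighborsRegressor(n_neighbors=9).fit(x, y)   # no fixed functional form

grid = np.linspace(0, 5, 50).reshape(-1, 1)
print(param.predict(grid)[:3], nonparam.predict(grid)[:3])
```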
Evaluating supervised learning methods
Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Prediction function estimate: f̂.
Question: how do we know if f̂ is a good estimate?
Intuitive answer: f̂ is good if it predicts well: if (x_0, y_0) is a new datapoint (not used in training), then f̂(x_0) and y_0 should be close. A popular measure of closeness is squared error (y_0 − f̂(x_0))² (but other measures of prediction error are also used).
Given many test datapoints {(x_i, y_i) : i = 1, ..., m} (not used in training), it is common and reasonable to use average test prediction error, e.g., test mean squared error (MSE)
(1/m) Σ_{i=1}^m (y_i − f̂(x_i))²
as a measure of the prediction quality of f̂.
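A minimal helper, assuming NumPy, that computes the test MSE just defined; the names (`test_mse`, `f_hat`) are ours, not the slides'.

```python
# Test MSE: average squared error over m held-out test points.
import numpy as np

def test_mse(f_hat, x_test, y_test):
    """Mean of (y_i - f_hat(x_i))^2 over the test set."""
    preds = np.array([f_hat(x) for x in x_test])
    return np.mean((np.asarray(y_test) - preds) ** 2)
```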
Evaluating supervised learning methods
Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Prediction function estimate: f̂.
Question: what if you don't have extra test data lying around?
Tempting solution: use training error, e.g., training MSE
(1/n) Σ_{i=1}^n (y_i − f̂(x_i))²
as your measure of prediction quality instead.
Problem: small training error does not imply small test error! (See the sketch below.)
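One way to see the problem, sketched with polynomial fits of increasing degree standing in for "flexibility" (our choice of example, not the slides'): training MSE falls monotonically while test MSE eventually rises.

```python
# Training MSE vs. test MSE as model flexibility (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(2)
def f(x): return np.sin(2 * x)

x_tr = rng.uniform(0, 5, 30);  y_tr = f(x_tr) + rng.normal(0, 0.4, 30)
x_te = rng.uniform(0, 5, 200); y_te = f(x_te) + rng.normal(0, 0.4, 200)

for deg in (1, 3, 8, 15):
    coef = np.polyfit(x_tr, y_tr, deg)   # fit a degree-`deg` polynomial
    tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```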
Figure 2.9
[Figure 2.9: left, Y vs. X with three fitted curves; right, Mean Squared Error vs. Flexibility.]
Three estimates f̂ are shown: 1. Linear regression. 2. Splines (very smooth). 3. Splines (quite rough).
Figure 2.10
[Figure 2.10: left, Y vs. X; right, Mean Squared Error vs. Flexibility.]
The function f is now almost linear.
Figure 2.11
[Figure 2.11: left, Y vs. X; right, Mean Squared Error vs. Flexibility.]
When the noise ε has small variance, the third method (the rough spline) does well.
The bias-variance decomposition
Let x_0 be a fixed test point, y_0 = f(x_0) + ε_0, and f̂ be estimated from n training samples (x_1, y_1), ..., (x_n, y_n). Let E denote the expectation over y_0 and the training outputs (y_1, ..., y_n). Then the expected mean squared error (MSE) at x_0 decomposes into three interpretable terms:
E(y_0 − f̂(x_0))² = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε_0),
where Var(ε_0) is the irreducible error, Var(f̂(x_0)) = E[f̂(x_0) − E f̂(x_0)]² is the variance of f̂(x_0), and Bias(f̂(x_0)) = E f̂(x_0) − f(x_0).
Take-aways from the bias-variance decomposition
E(y_0 − f̂(x_0))² = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε_0).
The expected MSE is never smaller than the irreducible error.
Both small bias and small variance are needed for small expected MSE. It is easy to find zero-variance procedures with high bias (predict a constant value) and very low-bias procedures with high variance (choose f̂ passing through every training point).
In practice, the best expected MSE is achieved by incurring some bias to decrease variance and vice-versa: this is the bias-variance trade-off.
Rule of thumb: more flexible methods have higher variance and lower bias. Example: a linear fit may not change substantially with the training data (low variance) but has high bias if the true f is very non-linear.
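A Monte Carlo sketch of the decomposition at a fixed test point, under an assumed setup (a made-up f, Gaussian noise, and a simple KNN-style estimator, none of which come from the slides): redraw the training set many times, refit, and compare the estimated variance and squared bias of f̂(x_0) to the irreducible error.

```python
# Estimate Var(f_hat(x0)) and Bias(f_hat(x0))^2 by resimulation.
import numpy as np

rng = np.random.default_rng(3)
def f(x): return np.sin(2 * x)
x0, sigma, n, reps = 2.5, 0.3, 50, 2000

def knn_fit_predict(x, y, x0, k=5):
    idx = np.argsort(np.abs(x - x0))[:k]   # k nearest training inputs to x0
    return y[idx].mean()

preds = np.empty(reps)
for r in range(reps):                      # fresh training set each repetition
    x = rng.uniform(0, 5, n)
    y = f(x) + rng.normal(0, sigma, n)
    preds[r] = knn_fit_predict(x, y, x0)

var_fhat = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2
print(f"Var: {var_fhat:.4f}  Bias^2: {bias_sq:.4f}  "
      f"Irreducible: {sigma**2:.4f}  Sum: {var_fhat + bias_sq + sigma**2:.4f}")
```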
Figure 2.12
[Figure 2.12, three panels: "Squiggly f, high noise", "Linear f, high noise", and "Squiggly f, low noise", each plotting MSE, Bias, and Var as a function of Flexibility.]
Classification problems
In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of car attributes, then the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.
We will adopt some additional notation:
P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷ_i: prediction of y_i given x_i.
Loss function for classification
There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss: 1(y_0 ≠ ŷ_0).
As with squared error, we can compute the average test prediction error (called the test error rate under 0-1 loss) using previously unseen test data {(x_i, y_i) : i = 1, ..., m}:
(1/m) Σ_{i=1}^m 1(y_i ≠ ŷ_i).
Similarly, we can compute the (often optimistic) training error rate
(1/n) Σ_{i=1}^n 1(y_i ≠ ŷ_i).
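A one-line helper for the error rates just defined, assuming NumPy; the name `error_rate` is ours.

```python
# Error rate under 0-1 loss: fraction of mismatched labels.
import numpy as np

def error_rate(y_true, y_pred):
    """Average of 1(y_i != yhat_i) over the given points."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

print(error_rate(["Ford", "Toyota", "Ford"], ["Ford", "Ford", "Ford"]))  # 0.333...
```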
Bayes classifier
[Figure 2.13: simulated two-class data with axes X_1 and X_2.]
In practice, we never know the joint probability P. However, we can assume that it exists. The Bayes classifier assigns
ŷ_i = argmax_j P(Y = j | X = x_i).
It can be shown that this is the best classifier under the 0-1 loss.
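A toy sketch of the Bayes classifier in a case where P(Y | X) is known exactly (a luxury we never have in practice, as the slide notes); the two-class conditional below is invented for illustration.

```python
# Bayes classifier for a hypothetical known P(Y | X) with classes {0, 1}.
import numpy as np

def p_y1_given_x(x):
    """Invented P(Y = 1 | X = x); P(Y = 0 | X = x) is its complement."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def bayes_classify(x):
    # argmax over j of P(Y = j | X = x)
    return int(p_y1_given_x(x) >= 0.5)

print([bayes_classify(x) for x in (-1.0, 0.2, 3.0)])   # [0, 1, 1]
```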
K-nearest neighbors
To assign a color to the input, we look at its K = 3 nearest neighbors. We predict the color of the majority of the neighbors.
[Figure 2.14: a test point and its K = 3 nearest neighbors.]
K-nearest neighbors also has a decision boundary
[Figure 2.15: KNN decision boundary with K = 10, axes X_1 and X_2.]
Higher K gives a smoother decision boundary
[Figure 2.16: KNN decision boundaries with K = 1 and K = 100.]
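A sketch of the figures' point, assuming scikit-learn and synthetic two-class data (our choices): with K = 1 the classifier interpolates the training set, giving zero training error and a very wiggly boundary, while larger K smooths the fit.

```python
# KNN classification with small vs. large K.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # K = 1 always has zero training error: each point is its own nearest neighbor.
    print(f"K={k:3d}: training error {np.mean(knn.predict(X) != y):.3f}")
```

Note that training error is misleading here for exactly the reason on the earlier slides: K = 1 looks perfect on the training data, but its test error can be far worse than that of a smoother (larger-K) fit.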