Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff
Reading: Chapter 2
Stats 202: Data Mining and Analysis
Lester Mackey
September 23, 2015
(Slide credits: Sergio Bacallado)
Announcements
Homework 1 is online. Submit solutions to Gradescope by 1:30pm (PT) next Wednesday. Remember our course policy: late homework is not accepted, but you can submit partial solutions, and we will drop the lowest homework score.
We've created a pinned post on Piazza to help you find Kaggle competition teammates.
I'd like to get into the habit of repeating your questions (please remind me if I forget!)
Supervised vs. unsupervised learning
In unsupervised learning we seek to understand the relationships between observations or variables in a data matrix:
Datapoints, examples, or observations
Variables or features
Quantitative variables: weight, height, number of children, ...
Qualitative (categorical) variables: college major, profession, gender, ...
Supervised vs. unsupervised learning
In unsupervised learning we seek to understand the relationships between observations or variables in a data matrix. Standard unsupervised learning goals include:
Finding lower-dimensional representations of the data for improved visualization, interpretation, or compression. Dimensionality reduction: PCA, ICA, Isomap, locally linear embeddings, etc.
Finding meaningful groupings of the data. Clustering.
Unsupervised learning is also known in Statistics as exploratory data analysis.
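Not from the slides: a minimal sketch of both unsupervised goals on synthetic data, assuming scikit-learn as the library (the slides do not prescribe one). PCA gives a lower-dimensional representation; k-means gives a grouping.

```python
# Sketch only: PCA for dimensionality reduction, k-means for clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # data matrix: 100 observations, 5 features

# Lower-dimensional representation for visualization / compression
Z = PCA(n_components=2).fit_transform(X)   # each row is now a 2-D summary

# Meaningful groupings of the data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(Z.shape, labels[:10])
```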
Supervised vs. unsupervised learning
In supervised learning, there are input variables and output variables:
Datapoints, examples, or observations
Input variables (also called variables, features, or predictors)
Output variable: if quantitative, we say this is a regression problem; if qualitative / categorical, we say this is a classification problem
Supervised vs. unsupervised learning
In supervised learning, there are input variables and output variables. If X is the vector of inputs for a particular sample, a quantitative output variable Y is typically modeled as
Y = f(X) + ε
where f is fixed and unknown and captures the systematic information that X provides about Y, and ε represents the unpredictable noise (random error) in the problem.
Supervised learning methods seek to learn the relationship between X and Y (e.g., by estimating the function f) using a set of training examples.
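A hedged illustration, not part of the slides: simulating the model Y = f(X) + ε with an invented f and noise level, to make the systematic-plus-noise split concrete.

```python
# Sketch: data drawn from Y = f(X) + eps for a hypothetical f.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                               # fixed f; unknown to the learner in practice
    return np.sin(2 * x) + 0.5 * x

n = 200
x = rng.uniform(0, 5, size=n)           # inputs X
eps = rng.normal(0, 0.3, size=n)        # random error, Var(eps) = 0.09
y = f(x) + eps                          # outputs Y
print(x[:3], y[:3])
```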
Supervised vs. unsupervised learning
Y = f(X) + ε, where ε is random error.
Why learn the relationship between X and Y?
Prediction: useful when the input variable is readily available, but the output variable is not. Example: predict stock prices next month using data from last year.
Inference: a model for f can help us understand the structure of the data: which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g., linear or non-linear? Example: infer which genes are associated with the incidence of heart disease.
Parametric and nonparametric methods
Most supervised learning methods fall into one of two classes:
Parametric methods assume that f has a fixed functional form with a fixed number of parameters, for example, a linear form f(X) = X_1 β_1 + ... + X_p β_p with parameters β_1, ..., β_p. Estimating f then reduces to estimating or fitting the parameters.
Non-parametric methods do not assume a fixed functional form with a fixed number of parameters but often still limit how wiggly or rough the function f can be.
Parametric vs. nonparametric prediction
[Figures 2.4 and 2.5: Income as a function of Years of Education and Seniority, one parametric and one non-parametric fitted surface.]
Parametric methods are fundamentally limited in their fit quality. Non-parametric methods keep improving as we add more data to fit. Parametric methods are often simpler to interpret and are less prone to overfitting to noise in the data.
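As a sketch of the contrast, assuming scikit-learn and synthetic data (neither is prescribed by the slides): a parametric linear fit estimates a fixed number of parameters, while a non-parametric KNN regression lets the data drive the shape of the fit.

```python
# Parametric (linear regression) vs. non-parametric (KNN regression) fits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=(200, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.3, size=200)

param = LinearRegression().fit(x, y)                      # 2 parameters, fixed form
nonparam = KNeighborsRegressor(n_neighbors=9).fit(x, y)   # no fixed functional form

grid = np.linspace(0, 5, 50).reshape(-1, 1)
print(param.predict(grid)[:3], nonparam.predict(grid)[:3])
```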
Evaluating supervised learning methods
Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Prediction function estimate: f̂.
Question: how do we know if f̂ is a good estimate?
Intuitive answer: f̂ is good if it predicts well: if (x_0, y_0) is a new datapoint (not used in training), then f̂(x_0) and y_0 should be close. A popular measure of closeness is squared error (y_0 − f̂(x_0))² (but other measures of prediction error are also used).
Given many test datapoints {(x_i, y_i) : i = 1, ..., m} (not used in training), it is common and reasonable to use average test prediction error, e.g., test mean squared error (MSE)
(1/m) Σ_{i=1}^m (y_i − f̂(x_i))²
as a measure of the prediction quality of f̂.
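A minimal helper, assuming NumPy, that computes the test MSE just defined; the names (`test_mse`, `f_hat`) are ours, not the slides'.

```python
# Test MSE: average squared error over m held-out test points.
import numpy as np

def test_mse(f_hat, x_test, y_test):
    """Mean of (y_i - f_hat(x_i))^2 over the test set."""
    preds = np.array([f_hat(x) for x in x_test])
    return np.mean((np.asarray(y_test) - preds) ** 2)
```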
Evaluating supervised learning methods
Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Prediction function estimate: f̂.
Question: what if you don't have extra test data lying around?
Tempting solution: use training error, e.g., training MSE
(1/n) Σ_{i=1}^n (y_i − f̂(x_i))²
as your measure of prediction quality instead.
Problem: small training error does not imply small test error! (See the sketch below.)
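One way to see the problem, sketched with polynomial fits of increasing degree standing in for "flexibility" (our choice of example, not the slides'): training MSE falls monotonically while test MSE eventually rises.

```python
# Training MSE vs. test MSE as model flexibility (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(2)
def f(x): return np.sin(2 * x)

x_tr = rng.uniform(0, 5, 30);  y_tr = f(x_tr) + rng.normal(0, 0.4, 30)
x_te = rng.uniform(0, 5, 200); y_te = f(x_te) + rng.normal(0, 0.4, 200)

for deg in (1, 3, 8, 15):
    coef = np.polyfit(x_tr, y_tr, deg)   # fit a degree-`deg` polynomial
    tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```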
Figure 2.9
[Figure 2.9: left, Y vs. X with three fitted curves; right, Mean Squared Error vs. Flexibility.]
Three estimates f̂ are shown: 1. Linear regression. 2. Splines (very smooth). 3. Splines (quite rough).
Figure 2.10
[Figure 2.10: left, Y vs. X; right, Mean Squared Error vs. Flexibility.]
The function f is now almost linear.
Figure 2.11
[Figure 2.11: left, Y vs. X; right, Mean Squared Error vs. Flexibility.]
When the noise ε has small variance, the third method (the rough spline) does well.
The bias-variance decomposition
Let x_0 be a fixed test point, y_0 = f(x_0) + ε_0, and f̂ be estimated from n training samples (x_1, y_1), ..., (x_n, y_n). Let E denote the expectation over y_0 and the training outputs (y_1, ..., y_n). Then the expected mean squared error (MSE) at x_0 decomposes into three interpretable terms:
E(y_0 − f̂(x_0))² = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε_0),
where Var(ε_0) is the irreducible error, Var(f̂(x_0)) = E[f̂(x_0) − E f̂(x_0)]² is the variance of f̂(x_0), and Bias(f̂(x_0)) = E f̂(x_0) − f(x_0).
Take-aways from the bias-variance decomposition
E(y_0 − f̂(x_0))² = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε_0).
The expected MSE is never smaller than the irreducible error.
Both small bias and small variance are needed for small expected MSE. It is easy to find zero-variance procedures with high bias (predict a constant value) and very low-bias procedures with high variance (choose f̂ passing through every training point).
In practice, the best expected MSE is achieved by incurring some bias to decrease variance and vice-versa: this is the bias-variance trade-off.
Rule of thumb: more flexible methods have higher variance and lower bias. Example: a linear fit may not change substantially with the training data (low variance) but has high bias if the true f is very non-linear.
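A Monte Carlo sketch of the decomposition at a fixed test point, under an assumed setup (a made-up f, Gaussian noise, and a simple KNN-style estimator, none of which come from the slides): redraw the training set many times, refit, and compare the estimated variance and squared bias of f̂(x_0) to the irreducible error.

```python
# Estimate Var(f_hat(x0)) and Bias(f_hat(x0))^2 by resimulation.
import numpy as np

rng = np.random.default_rng(3)
def f(x): return np.sin(2 * x)
x0, sigma, n, reps = 2.5, 0.3, 50, 2000

def knn_fit_predict(x, y, x0, k=5):
    idx = np.argsort(np.abs(x - x0))[:k]   # k nearest training inputs to x0
    return y[idx].mean()

preds = np.empty(reps)
for r in range(reps):                      # fresh training set each repetition
    x = rng.uniform(0, 5, n)
    y = f(x) + rng.normal(0, sigma, n)
    preds[r] = knn_fit_predict(x, y, x0)

var_fhat = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2
print(f"Var: {var_fhat:.4f}  Bias^2: {bias_sq:.4f}  "
      f"Irreducible: {sigma**2:.4f}  Sum: {var_fhat + bias_sq + sigma**2:.4f}")
```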
Figure 2.12
[Figure 2.12, three panels: "Squiggly f, high noise", "Linear f, high noise", and "Squiggly f, low noise", each plotting MSE, Bias, and Var as a function of Flexibility.]
Classification problems
In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of car attributes, then the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.
We will adopt some additional notation:
P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷ_i: prediction of y_i given x_i.
Loss function for classification
There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss: 1(y_0 ≠ ŷ_0).
As with squared error, we can compute the average test prediction error (called the test error rate under 0-1 loss) using previously unseen test data {(x_i, y_i) : i = 1, ..., m}:
(1/m) Σ_{i=1}^m 1(y_i ≠ ŷ_i).
Similarly, we can compute the (often optimistic) training error rate
(1/n) Σ_{i=1}^n 1(y_i ≠ ŷ_i).
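A one-line helper for the error rates just defined, assuming NumPy; the name `error_rate` is ours.

```python
# Error rate under 0-1 loss: fraction of mismatched labels.
import numpy as np

def error_rate(y_true, y_pred):
    """Average of 1(y_i != yhat_i) over the given points."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

print(error_rate(["Ford", "Toyota", "Ford"], ["Ford", "Ford", "Ford"]))  # 0.333...
```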
Bayes classifier
[Figure 2.13: simulated two-class data with axes X_1 and X_2.]
In practice, we never know the joint probability P. However, we can assume that it exists. The Bayes classifier assigns
ŷ_i = argmax_j P(Y = j | X = x_i).
It can be shown that this is the best classifier under the 0-1 loss.
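A toy sketch of the Bayes classifier in a case where P(Y | X) is known exactly (a luxury we never have in practice, as the slide notes); the two-class conditional below is invented for illustration.

```python
# Bayes classifier for a hypothetical known P(Y | X) with classes {0, 1}.
import numpy as np

def p_y1_given_x(x):
    """Invented P(Y = 1 | X = x); P(Y = 0 | X = x) is its complement."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def bayes_classify(x):
    # argmax over j of P(Y = j | X = x)
    return int(p_y1_given_x(x) >= 0.5)

print([bayes_classify(x) for x in (-1.0, 0.2, 3.0)])   # [0, 1, 1]
```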
K-nearest neighbors
To assign a color to the input, we look at its K = 3 nearest neighbors. We predict the color of the majority of the neighbors.
[Figure 2.14: a test point and its K = 3 nearest neighbors.]
K-nearest neighbors also has a decision boundary
[Figure 2.15: KNN decision boundary with K = 10, axes X_1 and X_2.]
Higher K gives a smoother decision boundary
[Figure 2.16: KNN decision boundaries with K = 1 and K = 100.]
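A sketch of the figures' point, assuming scikit-learn and synthetic two-class data (our choices): with K = 1 the classifier interpolates the training set, giving zero training error and a very wiggly boundary, while larger K smooths the fit.

```python
# KNN classification with small vs. large K.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # K = 1 always has zero training error: each point is its own nearest neighbor.
    print(f"K={k:3d}: training error {np.mean(knn.predict(X) != y):.3f}")
```

Note that training error is misleading here for exactly the reason on the earlier slides: K = 1 looks perfect on the training data, but its test error can be far worse than that of a smoother (larger-K) fit.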