1 Introduction to Machine Learning. Prof. Alexander Ihler, Prof. Max Welling. icamp Tutorial, July 22
2 What is machine learning? The ability of a machine to improve its performance based on previous results: learn from experience. Observe the world (data); change our behavior accordingly. Typical examples: predicting outcomes; explaining observations; finding interesting or unusual data. Why? To automate very large or very fast predictions. These days: web-scale data problems.
3 Classification: discriminating between two (or more) types of data. Example: spam filtering. Bad: "Cures fast and effective! Canadian *** Pharmacy # Internet Inline Drugstore. Viagra: our price $.5. Cialis: our price $.99." Good: "Interested in your research on graphical models. Dear Prof. Ihler, I have read some of your papers on probabilistic graphical models. Because I"
4 Classification. Example: face detection.
5 Regression: based on past history, predict future outcomes. Examples: Wall Street, Netflix.
6 Data Mining & Understanding. Massive volumes of data are available (webpages, Google Books, ...), too large to hand-curate or organize. How does Google decide the most relevant documents? How can we look for text documents about law, medicine, etc.? What makes a document similar? It gets even harder for images, video, ...
7 How does machine learning work? "Meta-programming". Predict: apply rules to examples. Score: get feedback on performance. Learn: change the predictor to do better. Diagram: the training data (examples) supply features to the program / predictor, which is characterized by some parameters $\theta$ and is a procedure (using $\theta$) that outputs a prediction; the prediction is compared with the feedback / target values to score performance; the learning algorithm changes $\theta$ to improve performance.
9 Nearest neighbor classifier. [Figure: labeled training points in the ($X_1$, $X_2$) feature plane, with a new point marked "?"]
10 Nearest neighbor classifier. Predictor: given new features, find the nearest example and return its value. [Figure: feature plane, axes $X_1$ and $X_2$, new point marked "?"]
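The predictor on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the function name, the toy data, the +1/-1 labels, and the choice of Euclidean distance are all assumptions.

```python
import math

def nearest_neighbor_predict(train_x, train_y, query):
    """Return the target value of the training example closest to `query`."""
    # Euclidean distance is an assumption; the slide does not name a metric.
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[nearest]

# Hypothetical toy data: two features (X1, X2) per example, labels +1 / -1.
train_x = [(1.0, 1.0), (1.5, 2.0), (4.0, 4.5), (5.0, 4.0)]
train_y = [-1, -1, +1, +1]

print(nearest_neighbor_predict(train_x, train_y, (1.2, 1.4)))
print(nearest_neighbor_predict(train_x, train_y, (4.4, 4.1)))
```

Note that there is no training step: the "parameters" are the stored examples themselves.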
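11 Nearest neighbor classifier. The decision boundary separates all points where we decide one class from all points where we decide the other. [Figure: feature plane, axes $X_1$ and $X_2$, new point marked "?"]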
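12 Nearest neighbor classifier. Predictor: evaluate the line $r = \theta_0 + \theta_1 X_1 + \theta_2 X_2$; if $r > 0$, return one class, else return the other. [Figure: feature plane, axes $X_1$ and $X_2$, new point marked "?"]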
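13 Contrast: linear classifier. Predictor: evaluate the line $r = \theta_0 + \theta_1 X_1 + \theta_2 X_2$; if $r > 0$, return one class, else return the other. Linear decision boundary ($r = 0$). [Figure: feature plane, axes $X_1$ and $X_2$]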
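A linear classifier of this kind might look as follows in Python. This is a sketch under stated assumptions: the line is taken to be r = theta[0] + theta[1]*X1 + theta[2]*X2, the two classes are encoded as +1 and -1, and the parameter values are hypothetical, chosen by hand rather than learned.

```python
def linear_classify(theta, x):
    """Evaluate r = theta[0] + theta[1]*x[0] + theta[2]*x[1] + ... and threshold at 0."""
    r = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return +1 if r > 0 else -1   # +1 / -1 class encoding is an assumption

# Hypothetical parameters: the decision boundary r = 0 is the line X1 + X2 = 5.
theta = (-5.0, 1.0, 1.0)

print(linear_classify(theta, (1.2, 1.4)))  # r = -2.4, below the boundary
print(linear_classify(theta, (4.4, 4.1)))  # r = 3.5, above the boundary
```

Unlike the nearest neighbor predictor, the whole classifier is described by three numbers, whatever the amount of training data.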
14 More data points? A linear decision boundary is always linear. Complexity of a classifier. Parametric: describe the form explicitly in terms of some parameters. Nonparametric: the number of parameters required increases with the amount of data.
15 Questions to consider. How would we select a good linear classifier? (How do we measure error?) How are these two methods related? How do we pick between them?
16 Regression; scatter plots. [Figure: scatter plot of target y versus feature x] Suggests a relationship between x and y. Prediction: given a new x, what is y?
17 Predicting new examples. [Figure: target y versus feature x, with a new point $x^{(m+1)}$ and $y^{(m+1)} = ?$] Regression: given the observed data, estimate $y^{(m+1)}$ given new $x^{(m+1)}$.
18 Nearest neighbor regression. [Figure: target y versus feature x, with a new point $x^{(m+1)}$ and $y^{(m+1)} = ?$] Find the training datum $x^{(i)}$ closest to $x^{(m+1)}$; predict $y^{(i)}$.
19 Nearest neighbor regression. Predictor: given new features, find the nearest example and return its value. Defines a function f(x) implicitly; its form is piecewise constant. [Figure: target y versus feature x]
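The piecewise-constant shape of f(x) is easy to see in a small sketch. The data and the function name below are hypothetical, and a single feature is assumed.

```python
def nn_regress(train, x_new):
    """Predict y for x_new by copying the y of the training pair with the closest x."""
    _, y_nearest = min(train, key=lambda pair: abs(pair[0] - x_new))
    return y_nearest

# Hypothetical training pairs (x, y).
train = [(0.0, 1.0), (5.0, 2.0), (10.0, 4.0)]

# The prediction is constant between midpoints of neighboring training x values.
for x in (1.0, 2.0, 3.0, 7.0, 9.0):
    print(x, nn_regress(train, x))
```

The prediction only changes when x crosses the midpoint between two neighboring training points (here 2.5 and 7.5), which is what makes the function piecewise constant.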
20 Linear regression. Predictor: evaluate the line $r = \theta_0 + \theta_1 x$; return $r$. Define the form of the function f(x) explicitly; find a good f(x) within that family. [Figure: target y versus feature x]
21 Regression vs. classification. Regression: features x; real-valued target t; predict a continuous function ŷ(x). Classification: features x; discrete class c (usually 0/1 or +1/-1); predict a discrete function ŷ(x). [Figure: y versus x for each case; "flatten"]
23 Measuring error. [Figure: feature plane, axes $X_1$ and $X_2$]
25 Measuring error. Misclassification rate: the fraction of training data whose prediction is wrong. (We might also care about the type of mistake.) [Figure: feature plane, axes $X_1$ and $X_2$]
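As a small worked example, the misclassification rate and a breakdown by type of mistake could be computed like this. The labels and predictions are made up for illustration, and the +1/-1 encoding is an assumption.

```python
def misclassification_rate(predictions, targets):
    """Fraction of examples whose predicted class differs from the target class."""
    wrong = sum(1 for p, t in zip(predictions, targets) if p != t)
    return wrong / len(targets)

# Hypothetical targets and predictions, classes encoded as +1 / -1.
targets     = [+1, +1, -1, -1, -1]
predictions = [+1, -1, -1, -1, +1]

rate = misclassification_rate(predictions, targets)

# "Type of mistake": predicting +1 for a -1 example, or the reverse.
false_positives = sum(1 for p, t in zip(predictions, targets) if p == +1 and t == -1)
false_negatives = sum(1 for p, t in zip(predictions, targets) if p == -1 and t == +1)

print(rate, false_positives, false_negatives)
```

Two classifiers with the same rate can differ in which kind of mistake they make, which matters in cases such as spam filtering, where losing a good email may cost more than letting a spam message through.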
26 Measuring error. What makes a good predictor?
29 Measuring error. [Figure labels: observation, prediction, and the error or residual between them]
30 Sum of squared error. How can we quantify the error? $J(\theta) = \sum_j \big(y^{(j)} - \hat{y}(x^{(j)})\big)^2$. We could choose something else, of course. It is computationally convenient (more later), measures the variance of the residuals, and corresponds to a Gaussian model of the noise.
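A minimal sketch of the sum of squared errors for a straight-line predictor, assuming the form ŷ = theta[0] + theta[1]*x. The data points and both candidate parameter settings are hypothetical.

```python
def sse(theta, data):
    """Sum of squared residuals for the line y_hat = theta[0] + theta[1] * x."""
    return sum((y - (theta[0] + theta[1] * x)) ** 2 for x, y in data)

# Hypothetical data lying exactly on the line y = 1 + 2x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

perfect = sse((1.0, 2.0), data)  # every residual is 0
shifted = sse((0.0, 2.0), data)  # every residual is 1

print(perfect, shifted)
```

The score gives a single number per parameter setting, so two candidate lines can be compared directly: the lower value is the better fit on this data.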
32 Visualizing the cost function
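33 Finding good parameters. We want to find the parameters $\theta$ which minimize our error. Think of a cost "surface": the error residual for that $\theta$.
34 Gradient descent? How do we change $\theta$ to improve $J(\theta)$? Choose a direction in which $J(\theta)$ is decreasing.
35 Gradient descent. How do we change $\theta$ to improve $J(\theta)$? Choose a direction in which $J(\theta)$ is decreasing. Derivative $\partial J / \partial \theta$: positive => $J$ is increasing; negative => $J$ is decreasing.
36 Gradient descent in more dimensions. The gradient vector $\nabla J(\theta) = \big[\partial J/\partial\theta_0,\ \partial J/\partial\theta_1,\ \ldots\big]$ indicates the direction of steepest ascent (its negative is the direction of steepest descent).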
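37 Gradient descent. Initialize $\theta$. Do { $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$ } while ( $\|\nabla_\theta J\| > \epsilon$ ). Annotations: initialization; step size $\alpha$, which can change as a function of iteration; gradient direction; stopping condition.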
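The loop on this slide can be sketched for a single parameter as follows. The cost function, the default step size, the tolerance, and the iteration cap are illustrative assumptions, not values from the tutorial.

```python
def gradient_descent(grad, theta, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Repeat theta <- theta - alpha * grad(theta) until the gradient is small."""
    for _ in range(max_iter):      # max_iter is a safety guard added for this sketch
        g = grad(theta)
        if abs(g) <= eps:          # stopping condition: gradient magnitude below eps
            break
        theta = theta - alpha * g  # step against the gradient
    return theta

# Hypothetical cost J(theta) = (theta - 3)^2, with derivative 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_star)
```

With this quadratic cost every step shrinks the distance to the minimum at 3 by a constant factor, so the loop stops well before the iteration cap.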
38 Gradient for the SSE: $\nabla J(\theta) = ?$
39 Gradient descent. Initialize $\theta$. Do { $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$ } while ( $\|\nabla_\theta J\| > \epsilon$ ). For the SSE, each term of the gradient is a product of two factors: the error magnitude and direction for datum $j$, and the sensitivity to each $\theta_i$.
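40 Derivative of SSE. For a linear predictor $\hat{y}(x) = \theta \cdot x$: $\frac{\partial J}{\partial \theta_i} = -2 \sum_j \big(y^{(j)} - \theta \cdot x^{(j)}\big)\, x_i^{(j)}$, where $\big(y^{(j)} - \theta \cdot x^{(j)}\big)$ is the error magnitude and direction for datum $j$ and $x_i^{(j)}$ is the sensitivity to each $\theta_i$. Rewrite using matrix form (Matlab), with th a 1-by-n row vector, x an m-by-n matrix, and y an m-by-1 column vector: >> e = y' - th*x'; DJ = -2*e*x; th = th - al*DJ;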
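The same computation can be written out in plain Python to make the two factors explicit. This is a sketch under assumptions: the predictor is linear, ŷ = theta · x, the first feature is a constant 1 so that theta[0] acts as an offset, and the data, step size, and iteration count are hypothetical.

```python
def sse_gradient(theta, xs, ys):
    """Gradient of J(theta) = sum_j (y_j - theta . x_j)^2 with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        e = y - sum(t * xi for t, xi in zip(theta, x))  # error for datum j
        for i, xi in enumerate(x):
            grad[i] += -2.0 * e * xi                    # times sensitivity to theta_i
    return grad

# Hypothetical data on the line y = 1 + 2x; each x is (constant 1, feature).
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
ys = [1.0, 3.0, 5.0]

theta = [0.0, 0.0]
alpha = 0.05                        # fixed step size, chosen by hand
for _ in range(2000):
    g = sse_gradient(theta, xs, ys)
    theta = [t - alpha * gi for t, gi in zip(theta, g)]

print(theta)  # approaches [1.0, 2.0]
```

The step size here is small enough for this particular data set; a noticeably larger value would make the iterates diverge.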
41 Gradient descent on cost function
42 Comments on gradient descent. Very general algorithm: we'll see it many times. Local minima: sensitive to starting point.
43 Comments on gradient descent. Very general algorithm: can be used in almost any problem! Local minima: sensitive to starting point. Step size: too large? Too small? Automatic ways to choose? We may want the step size to decrease with iteration. Common choices: fixed; linear, C/(iteration); more advanced methods (e.g., Newton's method).
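The effect of the step size can be illustrated on a toy cost. Everything in this sketch is an assumption made for illustration: the cost J(theta) = theta^2, the three step-size rules, and the constants in them.

```python
def run(step, iters=50, theta=1.0):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2 * theta."""
    for k in range(1, iters + 1):
        theta -= step(k) * 2.0 * theta   # step(k) is the step size at iteration k
    return theta

too_large = run(lambda k: 1.1)       # fixed: each step multiplies theta by -1.2
too_small = run(lambda k: 0.001)     # fixed: each step multiplies theta by 0.998
decaying  = run(lambda k: 1.1 / k)   # C / iteration with C = 1.1

print(too_large, too_small, decaying)
```

The large fixed step overshoots and diverges, the small one has barely moved after 50 iterations, and the decaying schedule overshoots at first but then settles toward the minimum at 0.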
45 Adding features. A linear classifier can't learn some functions. 1D example: not linearly separable. Add quadratic features: linearly separable in the new features.
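A small sketch of this idea, using made-up 1D data in which one class sits in the middle and the other on both sides. The data, the candidate thresholds, and the hand-picked quadratic parameters are all hypothetical.

```python
# Hypothetical 1D data: +1 on the outside, -1 in the middle.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

def classify(theta, features):
    """Linear classifier: threshold theta[0] + theta[1:] . features at zero."""
    r = theta[0] + sum(t * f for t, f in zip(theta[1:], features))
    return +1 if r > 0 else -1

def errors(theta, feature_map):
    """Number of misclassified points under the given feature map."""
    return sum(classify(theta, feature_map(x)) != y for x, y in zip(xs, ys))

def linear_features(x):
    return (x,)

def quadratic_features(x):
    return (x, x * x)

# Try every threshold between data points, in both orientations (r = s * (x - t)).
thresholds = (-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5)
best_linear = min(errors((-s * t, s), linear_features)
                  for t in thresholds for s in (+1, -1))

# With the extra feature x^2, the rule r = x^2 - 2.5 separates the classes.
quad_errors = errors((-2.5, 0.0, 1.0), quadratic_features)

print(best_linear, quad_errors)
```

The classifier itself is unchanged; it is still linear in its inputs. Only the inputs were enlarged from (x) to (x, x^2).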
46 Higher-order polynomials. Are more features better? Nested hypotheses: 2nd order is more general than 1st, 3rd order than 2nd, and so on. Fits the observed data better.
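"Nested" here means that every lower-order polynomial is also a higher-order polynomial with the extra coefficients set to zero. A short check of that claim, with a hypothetical helper and made-up coefficients:

```python
def poly_predict(theta, x):
    """Evaluate theta[0] + theta[1]*x + theta[2]*x**2 + ... at x."""
    return sum(t * x ** k for k, t in enumerate(theta))

first_order  = (1.0, 2.0)        # 1 + 2x
second_order = (1.0, 2.0, 0.0)   # 1 + 2x + 0*x^2: the same function

xs = [-2.0, 0.0, 0.5, 3.0]
same = all(poly_predict(first_order, x) == poly_predict(second_order, x) for x in xs)
print(same)
```

Because the 2nd-order family contains every 1st-order fit, its best training error can never be worse, which is why higher orders fit the observed data at least as well.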
47 Test data. After training the model, go out and get more data from the world: new observations (x, y). How well does our model perform?
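The gap between training and test performance can be seen with a nearest neighbor regressor, which reproduces its training data exactly. This is only a sketch: the data are made up, and mean squared error is an assumed choice of score.

```python
def nn_regress(train, x_new):
    """Predict y by copying the y of the training pair with the closest x."""
    _, y_nearest = min(train, key=lambda pair: abs(pair[0] - x_new))
    return y_nearest

def mse(predict, data):
    """Mean squared error of `predict` over (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

# Hypothetical training data and new observations gathered afterwards.
train = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]
test  = [(0.4, 0.5), (1.6, 1.5), (2.4, 2.5)]

def predict(x):
    return nn_regress(train, x)

train_err = mse(predict, train)  # each training point is its own nearest neighbor
test_err = mse(predict, test)

print(train_err, test_err)
```

A training error of zero says nothing about how the model behaves on new observations, which is why the error on held-out data is the number to watch.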
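48 How overfitting affects prediction. [Figure: predictive error versus model complexity, with one curve for error on test data and one for error on training data; underfitting at low complexity, overfitting at high complexity, and an ideal range for model complexity in between]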