Attribution. Modified from Stuart Russell's slides (Berkeley). Parts of the slides are inspired by Dan Klein's lecture material for CS 188 (Berkeley).

Transcription

1 Machine Learning

3 Outline
Inductive learning
Decision tree learning
Measuring learning performance
Statistical learning
Naive Bayes
Learning
Classification
Evaluation

4 Inductive Learning
Training set: data of $N$ examples of input-output pairs $(x_1, y_1), \ldots, (x_N, y_N)$ such that each $y_i$ is generated by an unknown function $y = f(x)$.
Learning: discover a hypothesis function $h$ that approximates the true function $f$.
A test set is used to measure the accuracy of the hypothesis $h$.
A hypothesis $h$ generalizes well if it correctly predicts the value of $y$ for novel examples.
Related notions: the hypothesis space, and realizability (the hypothesis space can express the target function).

5 Kinds of Learning
Three types of feedback determine the main kinds of (machine) learning:
Supervised learning: requires a collection of sample input-output pairs (problem instance, correct answer), so that it learns a function that maps from input to output. In other words, it requires a teacher.
Unsupervised learning: learns patterns from the input without specific feedback, e.g., clustering. Requires no teacher.
Reinforcement learning: occasional rewards occur to reinforce or inhibit certain sequences of actions. It is harder, but requires no teacher.
Semi-supervised learning: there are too few labeled examples, and the labels are not necessarily very accurate.

6 Inductive learning (a.k.a. Science)
Simplest form: learn a function from examples (tabula rasa, Latin for "blank slate").
$f$ is the target function.
An example is an input-output pair $(x, f(x))$, e.g., ([Figure: a tic-tac-toe board with marks O, O, X, X, X], +1).
Problem: find a hypothesis $h$ such that $h \approx f$, given a training set of examples.

12 Inductive learning method
Construct/adjust $h$ to agree with $f$ on the training set ($h$ is consistent if it agrees with $f$ on all examples).
E.g., curve fitting. [Figure: data points with candidate fitted curves; axes $x$ and $f(x)$.]
Ockham's razor (William of Ockham): maximize a combination of consistency and simplicity.

13 Learning Decision Trees
A decision tree represents a function that takes as input a vector of attribute values and returns a decision: a single output value.

A   B   A xor B
F   F   F
F   T   T
T   F   T
T   T   F

[Figure: decision tree for A xor B. The root tests A and each branch then tests B; the leaves are F, T under A = F and T, F under A = T.]
We will now outline a supervised learning method for constructing decision trees given labeled data (input-output pairs).
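As a minimal sketch of "a function from a vector of attribute values to a single decision", the xor tree can be written as nested data and evaluated by walking from the root to a leaf. The tuple/dict encoding and the names `XOR_TREE` and `decide` are assumptions made for illustration; they are not from the slides.

```python
# A decision tree as nested tuples: (attribute, {value: subtree_or_leaf}).
# Leaves are plain booleans. This encoding is illustrative only.
XOR_TREE = ("A", {
    False: ("B", {False: False, True: True}),
    True: ("B", {False: True, True: False}),
})


def decide(tree, example):
    """Follow the example's attribute values from the root down to a leaf."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


# Print the function computed by the tree, one truth-table row per line.
for a in (False, True):
    for b in (False, True):
        print(a, b, decide(XOR_TREE, {"A": a, "B": b}))
```

Each printed row corresponds to exactly one root-to-leaf path in the tree.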

14 Attribute-based representations
Examples are described by attribute values (Boolean, discrete, continuous, etc.).
E.g., situations where I will/won't wait for a table (attributes Alt through Est; target WillWait):

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some  $$$    F     T    French   0-10   T
X2       T    F    F    T    Full  $      F     F    Thai     ?      F
X3       F    T    F    F    Some  $      F     F    Burger   0-10   T
X4       T    F    T    T    Full  $      F     F    Thai     ?      T
X5       T    F    T    F    Full  $$$    F     T    French   >60    F
X6       F    T    F    T    Some  $$     T     T    Italian  0-10   T
X7       F    T    F    F    None  $      T     F    Burger   0-10   F
X8       F    F    F    T    Some  $$     T     T    Thai     0-10   T
X9       F    T    T    F    Full  $      T     F    Burger   >60    F
X10      T    T    T    T    Full  $$$    F     T    Italian  ?      F
X11      F    F    F    F    None  $      F     F    Thai     0-10   F
X12      T    T    T    T    Full  $      F     F    Burger   ?      T

Est entries shown as ? are missing from this transcription.
Classification of examples is positive (T) or negative (F).

15 Decision trees
One possible representation for hypotheses.
E.g., here is the "true" tree for deciding whether to wait:
[Figure: decision tree rooted at Patrons? (None: F; Some: T; Full: WaitEstimate?). Below WaitEstimate? the tree tests Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining?, each with No/Yes branches ending in T or F leaves.]

16 Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, each truth table row corresponds to a path to a leaf:

A   B   A xor B
F   F   F
F   T   T
T   F   T
T   T   F

[Figure: decision tree for A xor B, testing A at the root and B on each branch.]
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless $f$ is nondeterministic in $x$), but it probably won't generalize to new examples.
Prefer to find more compact decision trees.

23 Hypothesis spaces
How many distinct decision trees with $n$ Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with $2^n$ rows
= $2^{2^n}$
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.
How many purely conjunctive hypotheses (e.g., $Hungry \land \lnot Rain$)?
Each attribute can be in (positive), in (negative), or out, which gives $3^n$ distinct conjunctive hypotheses.
A more expressive hypothesis space:
  increases the chance that the target function can be expressed;
  increases the number of hypotheses consistent with the training set, so it may give worse predictions.
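The two counts on this slide are easy to check numerically. The sketch below evaluates $2^{2^n}$ and $3^n$, and also enumerates the conjunctive hypotheses for a small $n$ by brute force; the function names are illustrative, not taken from the slides.

```python
from itertools import product


def count_boolean_functions(n):
    """Number of distinct truth tables over n Boolean attributes: 2^(2^n)."""
    return 2 ** (2 ** n)


def count_conjunctions(n):
    """Each attribute is positive, negated, or left out: 3^n."""
    return 3 ** n


def enumerate_conjunctions(n):
    """Brute-force list of all choices (positive / negative / out) per attribute."""
    return list(product(("positive", "negative", "out"), repeat=n))


print(count_boolean_functions(6))      # 18446744073709551616
print(count_conjunctions(6))           # 729
print(len(enumerate_conjunctions(3)))  # 27
```

The gap between the two numbers is what the slide's last point is about: the conjunctive space is far smaller, so far fewer hypotheses can be consistent with a given training set.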

24 Decision tree learning
Aim: find a small tree consistent with the training examples.
Idea 1: (recursively) choose the "most significant" attribute as the root of each (sub)tree, i.e., the attribute to branch on next.
Idea 2: a good attribute splits the examples into subsets that are (ideally) all positive or all negative.
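One common way to make "most significant attribute" concrete is information gain, which the summary slide names; the slides do not spell out the formula, so the entropy-based definition below is an assumption. The sketch uses only the Patrons and Type columns of the 12 restaurant examples, and the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

# (Patrons, Type, WillWait) for examples X1..X12 of the restaurant table.
EXAMPLES = [
    ("Some", "French", True), ("Full", "Thai", False), ("Some", "Burger", True),
    ("Full", "Thai", True), ("Full", "French", False), ("Some", "Italian", True),
    ("None", "Burger", False), ("Some", "Thai", True), ("Full", "Burger", False),
    ("Full", "Italian", False), ("None", "Thai", False), ("Full", "Burger", True),
]
COLUMNS = {"Patrons": 0, "Type": 1}


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(examples, column):
    """Entropy of the labels minus the expected entropy after splitting on column."""
    labels = [e[-1] for e in examples]
    remainder = 0.0
    for value in {e[column] for e in examples}:
        subset = [e[-1] for e in examples if e[column] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(EXAMPLES, col) for name, col in COLUMNS.items()}
best = max(gains, key=gains.get)
print({name: round(g, 3) for name, g in gains.items()}, "best:", best)
```

On this data Patrons has a gain of about 0.541 bits while Type has essentially none, because every Type value leaves an even mix of positive and negative examples.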

25 Example contd.
Decision tree learned from the 12 examples:

Patrons?
  None: F
  Some: T
  Full: Hungry?
    No: F
    Yes: Type?
      French: T
      Italian: F
      Thai: Fri/Sat?
        No: F
        Yes: T
      Burger: T

Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by the small amount of data.

26 Performance measurement
How do we know that $h \approx f$?
1) Use theorems of computational/statistical learning theory.
2) Try $h$ on a new test set of examples (use the same distribution over the example space as for the training set).
Learning curve = % correct on the test set as a function of training set size.
[Figure: learning curve; x-axis: training set size; y-axis: % correct on test set.]

27 Performance measurement contd.
The learning curve depends on:
  realizable (can express the target function) vs. non-realizable hypothesis spaces; non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., a thresholded linear function);
  redundant expressiveness (e.g., loads of irrelevant attributes).
[Figure: % correct (up to 1) vs. # of examples, with curves labeled realizable, redundant, and nonrealizable.]

28 Performance measurement contd. II
Still, how do we know that $h \approx f$?
Hume's Problem of Induction (Wikipedia): The problem of induction is the philosophical question of whether inductive reasoning leads to knowledge understood in the classic philosophical sense, since it focuses on the lack of justification for either:
1. Generalizing about the properties of a class of objects based on some number of observations of particular instances of that class (for example, the inference that "all swans we have seen are white, and therefore all swans are white", before the discovery of black swans), or
2. Presupposing that a sequence of events in the future will occur as it always has in the past (for example, that the laws of physics will hold as they have always been observed to hold). Hume called this the principle of uniformity of nature.

29 Classes of Learning Problems
Classification: the output $y$ of the true function that we learn takes one of a finite set of values, e.g., wait or leave in a restaurant; sunny, cloudy, or rainy.
Regression: the output $y$ of the true function that we learn is a number, e.g., tomorrow's temperature.
Sometimes the function $f$ is stochastic. Strictly speaking, it is then not a function of $x$, so what we learn is a conditional probability distribution $P(Y \mid x)$.

30 Statistical learning
Training set (data): evidence, i.e., instantiations of all or some of the random variables describing the domain.
Hypotheses are probabilistic theories of how the domain works.

31 Learning a Probability Model
Training set: data of $N$ examples of input-output pairs $(x_1, y_1), \ldots, (x_N, y_N)$ such that each $y_i$ is generated by an unknown function $y = f(x)$.
Inductive learning: discover a hypothesis function $h$ that approximates the true function $f$, e.g., decision trees.
Statistical learning: given a fixed structure of a probability model of the domain, discover its parameters from data (parameter learning). As a result, given the parameters of a problem instance, the learned probability model can be used to answer queries about problem instances.
Classification: the observed parameters of a given instance, together with the learned probability model of a domain, provide probabilistic information on the likelihood of a particular classification.

32 Classification Problems
Classification is the task of predicting labels (class variables) for inputs.
Commercially and scientifically important. Examples:
  Spam filtering
  Optical character recognition (OCR)
  Medical diagnoses
  Part-of-speech tagging
  Semantic role labeling / information extraction
  Automatic essay grading
  Fraud detection

33 Probabilistic Models
A naive Bayes model:
$P(Cause, Effect_1, \ldots, Effect_n) = P(Cause) \prod_i P(Effect_i \mid Cause)$   (1)
[Figure: two networks, Cavity with children Toothache and Catch, and Cause with children $Effect_1, \ldots, Effect_n$.]
Here Cause is taken to be the class variable, which is to be predicted. The attribute (parameter) variables are the leaves, the Effects.
The model is "naive": it assumes the parameter variables to be independent of each other given the class.
Model training: use the training set to uncover the conditional probability distributions of the parameters, $P(Effect_i \mid Cause_j)$.
Once the model is trained, given the values of the parameters of a problem instance, we can use (1) to classify the instance.
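To see equation (1) at work, the sketch below evaluates it for the Cavity network and normalizes over the class values. All probabilities are hypothetical numbers chosen for illustration (the slides give none), and the names `PRIOR`, `P_EFFECT_TRUE`, `joint`, and `posterior` are likewise assumptions.

```python
# Hypothetical numbers for the Cavity network; they are NOT from the slides.
PRIOR = {True: 0.2, False: 0.8}              # P(Cavity)
P_EFFECT_TRUE = {                            # P(effect = true | Cavity)
    "Toothache": {True: 0.6, False: 0.1},
    "Catch": {True: 0.9, False: 0.2},
}


def joint(cause, effects):
    """Equation (1): P(cause) times the product of P(effect_i | cause)."""
    p = PRIOR[cause]
    for name, observed in effects.items():
        p_true = P_EFFECT_TRUE[name][cause]
        p *= p_true if observed else 1.0 - p_true
    return p


def posterior(effects):
    """Normalize the joint over the class values to get P(cause | effects)."""
    scores = {cause: joint(cause, effects) for cause in PRIOR}
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}


print(posterior({"Toothache": True, "Catch": True}))
```

Classification only needs the class with the largest joint value, so the normalization step is optional; it is included here to show how the same quantities yield a probability.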

34 Independence as Abstraction
The model is "naive": it assumes the parameter variables to be independent.
This may lead to overconfidence.
Indeed, all-caps text in spam is not independent of $$ symbols.
Yet it is often a fine abstraction, and a computationally tractable one.

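35 Optical Character Recognition Example: Training a Model
Given a labeled collection $M$ of digits in digital form (an $n \times n$ grid).
Features: $Pixel_{i,j}$ = on or off; $Adj$.
A naive Bayes model:
$P(Digit, Pixel_{1,1}, \ldots, Pixel_{n,n}, Adj) = P(Digit) \prod_{i,j} P(Pixel_{i,j} \mid Digit) \, P(Adj \mid Digit)$
Model training process, for the collection $M$:
$P(0) = \frac{count(M, 0)}{|M|}, \ldots, P(9) = \frac{count(M, 9)}{|M|}$
$P(pixel_{1,1} = on \mid 0) = \frac{count(M, 0, on, 1, 1)}{count(M, 0)}, \ldots$
$P(pixel_{1,1} = off \mid 0) = 1 - P(pixel_{1,1} = on \mid 0), \ldots$
Here $count(M, 0)$ is the number of examples in $M$ labeled 0, and $count(M, 0, on, 1, 1)$ is the number of those in which pixel (1,1) is on.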

36 Example: Classification in OCR
Given the parameters (attributes, features) of an unseen instance and the trained model, we can compute
$P(0, pixel_{1,1} = on, \ldots, pixel_{n,n} = off, adj = true) = x_0$
...
$P(9, pixel_{1,1} = on, \ldots, pixel_{n,n} = off, adj = true) = x_9$
and then pick the most likely class, i.e., the class that corresponds to the maximum value among $x_0, \ldots, x_9$.
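The counting-based training and the pick-the-maximum classification described for OCR can be sketched on a toy problem. The data below is made up (two classes on a 2x2 grid rather than ten digits on an n x n grid), the Adj feature is left out for brevity, and the names `train`, `joint`, and `classify` are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy labeled collection M: (label, pixels), with pixels a tuple of 4 on/off bits.
# The data is invented for illustration only.
M = [
    ("0", (1, 1, 1, 1)),
    ("0", (1, 1, 1, 0)),
    ("1", (0, 1, 0, 1)),
    ("1", (0, 1, 1, 1)),
]


def train(data):
    """Estimate P(label) and P(pixel = on | label) by relative frequencies."""
    label_counts = Counter(label for label, _ in data)
    on_counts = defaultdict(Counter)  # on_counts[label][pixel index]
    for label, pixels in data:
        for idx, bit in enumerate(pixels):
            on_counts[label][idx] += bit
    n_pixels = len(data[0][1])
    prior = {d: c / len(data) for d, c in label_counts.items()}
    p_on = {d: [on_counts[d][i] / label_counts[d] for i in range(n_pixels)]
            for d in label_counts}
    return prior, p_on


def joint(label, pixels, prior, p_on):
    """P(label, pixels) under the naive Bayes factorization."""
    p = prior[label]
    for idx, bit in enumerate(pixels):
        p *= p_on[label][idx] if bit else 1.0 - p_on[label][idx]
    return p


def classify(pixels, prior, p_on):
    """Pick the label whose joint probability with the pixels is largest."""
    return max(prior, key=lambda d: joint(d, pixels, prior, p_on))


prior, p_on = train(M)
print(classify((1, 1, 1, 1), prior, p_on), classify((0, 1, 0, 1), prior, p_on))
```

Because the estimates are raw relative frequencies, any pixel value never seen with a class drives that class's joint probability to 0, which is the overfitting issue slide 39 raises.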

37 Evaluation
Split the labeled data into three categories (e.g., 80/10/10 or 60/20/20):
1. Training set
2. Held-out set
3. Test set
Decide on features (parameters, attributes): attribute-value pairs that characterize each instance.
Experimentation-evaluation cycle:
1. Learn parameters (e.g., model probabilities) on the training set.
2. Tune the set of features on the held-out set.
3. Compute accuracy on the test set: accuracy is the fraction of instances predicted correctly.
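A three-way split and the accuracy measure can be sketched in a few lines. The shuffling, the fixed seed, and the helper names `split` and `accuracy` are assumptions for illustration; the slides only prescribe the proportions and the definition of accuracy.

```python
import random


def split(data, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle, then cut into training, held-out, and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_train = round(fractions[0] * len(items))
    n_held = round(fractions[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_held],
            items[n_train + n_held:])


def accuracy(predict, labeled):
    """Fraction of (input, label) pairs that predict gets right."""
    correct = sum(1 for x, y in labeled if predict(x) == y)
    return correct / len(labeled)


# Made-up labeled data: the label says whether the number is even.
data = [(x, x % 2 == 0) for x in range(100)]
train_set, held_out, test_set = split(data)
print(len(train_set), len(held_out), len(test_set))
print(accuracy(lambda x: x % 2 == 0, test_set))
```

Keeping the test set untouched until the final step is what makes the reported accuracy an honest estimate; feature tuning should only ever look at the held-out set.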

38 Feature Engineering
Feature engineering is crucial!
Features translate into the hypothesis space.
Too few features: cannot fit the data.
Too many features: overfitting.

39 Generalization and Overfitting
Relative-frequency parameters will overfit the training data.
That the training set contained no 3 with pixel $(i,j)$ on does not mean that such a 3 does not exist (but note that we would assign probability 0 to such an event!).
It is unlikely that every occurrence of "minute" is 100% spam.
It is unlikely that every occurrence of "seriously" is 100% ham.
Similarly, what happens to the words that never occur in the training set?
Unseen events should not be assigned probability 0.
To generalize better, smoothing is essential.

40 Estimation: Smoothing
Intuitions behind smoothing:
We have some prior expectation about the parameters.
Given little evidence, we should prefer the prior.
Given a lot of evidence, the data should rule.
The maximum likelihood estimate
$P_{ML}(x) = \frac{count(x)}{\text{total samples}}$
does not account for the above intuitions.
Consider three coin flips: Head, Head, Tail. What is $P_{ML}(x)$?

41 Estimation: Laplace Smoothing
Laplace's estimate:
$P_{LAP}(x) = \frac{count(x) + 1}{\text{total samples} + |X|}$
Pretend that every outcome appeared once more than it did.
Note how it elegantly deals with previously unseen events.
Laplace's estimate extended with a strength factor $k$:
$P_{LAP,k}(x) = \frac{count(x) + k}{\text{total samples} + k|X|}$
Consider three coin flips: Head, Head, Tail. What are $P_{ML}(x)$, $P_{LAP}(x)$, and $P_{LAP,k}(x)$?
There are many ways to introduce smoothing, as well as methods to account for unknown events.
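The coin-flip question can be worked through directly from the two formulas. The sketch below uses exact fractions; the function names and the choice k = 100 for the strength factor are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def p_ml(x, samples):
    """Maximum likelihood estimate: count(x) / total samples."""
    return Fraction(Counter(samples)[x], len(samples))


def p_lap(x, samples, outcomes, k=1):
    """Laplace estimate with strength k: (count(x) + k) / (total + k * |X|)."""
    return Fraction(Counter(samples)[x] + k, len(samples) + k * len(outcomes))


flips = ["H", "H", "T"]
outcomes = ["H", "T"]
print(p_ml("H", flips))                    # 2/3
print(p_lap("H", flips, outcomes))         # 3/5
print(p_lap("H", flips, outcomes, k=100))  # 102/203
```

With k = 0 the Laplace estimate reduces to the maximum likelihood estimate, and as k grows the estimate is pulled toward the uniform prior 1/2, which matches the intuition that little evidence should defer to the prior.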

42 Summary
Learning is needed for unknown environments and lazy designers.
The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation.
For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples.
Decision tree learning using information gain.
Learning performance = prediction accuracy measured on the test set.
Learning models: naive Bayes nets.
Classification problems by means of naive Bayes nets.
Smoothing.
Evaluation concepts.