Machine Learning: Statistical Methods I
Introduction - 15.04.2009
Stefan Roth, Department of Computer Science, GRIS
Machine Learning I
Lecturer: Prof. Stefan Roth, Ph.D. <sroth AT cs.tu-...>
Teaching Assistant: Qi Gao <qgao AT gris.tu-...>
Announcements:
- Course web page: http://www.gris.informatik.tu-darmstadt.de/teaching/courses/ss09/ml/index.en.htm
- Mailing list: see subscription information on the web page.
- Forum: http://d120.de/forum/viewforum.php?f=292
The course language will be English. This applies to lectures, exercises, announcements, etc. Why? Essentially all machine learning publications and books are written in English, so knowing the original terms is crucial. If you strongly prefer, you may contact the course staff in German. English is encouraged, though, because we may use your (anonymized) question to clarify points for the entire class.
Organization
Lecture: 2 hours a week, Wednesdays, 9:50-11:30, S3/05 room 073. We will cover the foundational aspects of each topic.
Exercise: 2 hours a week, Wednesdays, 11:40-13:20, S3/05 room 073. We will cover some practical aspects and discuss the homework assignments.
Exam & Bonus System
Most likely there will be an oral exam, likely during the first week after classes end. It can be taken in English or German. Details depend on how many students end up taking the class.
There will be a bonus of up to a full grade for those who do well on the homework assignments. Details TBA.
Exercises: In order to get credit for 2+2 SWS, you need to actively participate in the exercises / turn in the homework assignments. If you do not hand in homework assignments regularly, you can only get credit for the lecture part (2 SWS).
Style
Lectures: I would like the lectures to be at least partly interactive, maybe more interactive than you are used to. This is supposed to be helpful for both you and me. You are encouraged to ask questions!
Exercises: Mostly interactive. You are encouraged to ask detailed questions!
Your participation counts: bonus for the final exam.
Homework assignments
Mix of written and programming assignments; we will have around 4-5 assignments.
Programming assignments in MATLAB, the standard environment for scientific computing.
- Goal: Work with some real data to get first-hand knowledge of how the techniques we will learn actually work.
- Introduction during the first exercise (next week).
Also pen-and-paper exercises.
The last assignment may be a larger, project-like one: stay tuned...
Readings
Course book: Christopher Bishop: Pattern Recognition and Machine Learning. Springer, 2006. ISBN 0-387-31073-8 (very good book, but not an easy read). Should be on reserve in the library.
Other useful books:
- Duda, Hart & Stork: Pattern Classification. Wiley-Interscience, 2nd edition, 2000. ISBN 0-471-05669-3 (new version of a classic).
- David J. C. MacKay: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. ISBN 0-521-64298-1 (free download at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html).
- Gelman et al.: Bayesian Data Analysis. CRC Press, 2nd edition, 2004. ISBN 1-584-88388-X.
- Hastie, Tibshirani & Friedman: The Elements of Statistical Learning. Springer, 2001. ISBN 0-387-95284-5 (the statistical perspective).
- Tom Mitchell: Machine Learning. McGraw-Hill, 1997. ISBN 0-07-042807-7 (a classic, but getting outdated).
Readings
Additional readings: At times I will post papers and tutorials; they will be available or linked from the course web page.
I will often assign weekly readings: please read them and come to class prepared!
The Bishop book is a good investment, because it is also a very useful reference.
How does it fit into your course plan?
Diplom: Anwendungsbezogene Informatik. Possibly Praktische or Theoretische Informatik if you can find someone who will count ML I toward this.
- Note that I will not be able to offer an exam in Theoretische Informatik.
B.Sc. / M.Sc.: Human Computer Systems (see Modulhandbuch), not Data Knowledge Engineering.
If you are strongly interested in machine learning, you should:
- take ML: Statistical Methods for HCS credit, and
- take ML: Symbolische Methoden for DKE credit.
How does it fit into your course plan?
Related classes:
- Human Computer Systems (WS, Schiele/Fellner): prerequisite
- Machine Learning: Statistical Methods II (WS, Schiele)
- Computer Vision I (SS, Schiele/Schindler) and II (WS, Roth)
- Maschinelles Lernen - Symbolische Verfahren (WS, Fürnkranz)
- Bildverarbeitung (SS, Sakas)
Theses and projects: Topics in machine learning with applications in computer vision. Please talk to me if you are interested.
Machine Learning
What is ML? What is its goal? Develop a machine / an algorithm that learns to perform a task from past experience.
Why? What for?
- Fundamental component of every intelligent and/or autonomous system
- Discovering rules and patterns in data
- Automatic adaptation of systems
- Attempting to understand human / biological learning
Machine Learning in Action
Machine Learning: Examples
Example 1: Recognition of handwritten digits. These digits are given to us as small digital images, and we have to build a machine to decide which digit it is. Obvious challenge: there are many different ways in which people handwrite.
Machine Learning: Examples
Example 2: Classification of fish (salmon vs. sea bass).
[Figure 1.1 from Duda, Hart & Stork, Pattern Classification, Wiley, 2001: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either salmon or sea bass.]
[Figure 1.2, same source: Histograms of the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average.]
Machine Learning: Examples
More examples:
- Email filtering
- Speech recognition
- Vehicle control
Impact & Successes
- Recognition of speech, letters, faces, ...
- Autonomous vehicle navigation
- Games: backgammon world champion; chess: Deep Blue vs. Kasparov
- Google
- Finding new astronomical structures
- Fraud detection (credit card applications)
- ...
Machine Learning
Develop a machine / an algorithm that learns to perform a task from past experience.
Put more abstractly: our task is to learn a mapping from input to output, f : I → O. Put differently, we want to predict the output from the input: y = f(x; θ).
- Input: x ∈ I (images, text, other measurements, ...)
- Output: y ∈ O
- Parameter(s): θ ∈ Θ (that is/are being learned)
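To make the abstract mapping y = f(x; θ) concrete, here is a minimal Python sketch (the course assignments use MATLAB; this illustration and all names in it are hypothetical, not course material): a one-parameter linear predictor whose parameter θ is "learned" from past experience by least squares.

```python
# Toy illustration of y = f(x; theta): the parameter theta is learned
# from observed (input, output) pairs, then used to predict new outputs.

def f(x, theta):
    """Prediction function: maps input x to output y, given parameter theta."""
    return theta * x

def learn(pairs):
    """Fit theta by least squares on (input, output) training pairs."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, y in pairs)
    return num / den

theta = learn([(1.0, 2.0), (2.0, 4.0)])  # "past experience"
print(f(3.0, theta))                     # prediction for an unseen input: 6.0
```

The same structure (fit parameters on training data, then predict) recurs in essentially every method covered later in the course.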
Classification vs. Regression
Classification: Learn a mapping into a discrete space, e.g.:
- O = {0, 1}
- O = {0, 1, 2, 3, ...}
- O = {verb, noun, noun phrase, ...}
Examples: spam / not spam, sea bass vs. salmon, parsing a sentence, recognizing digits, etc.
Classification vs. Regression
Regression: Learn a mapping into a continuous space, e.g.:
- O = ℝ
- O = ℝ³
Examples: curve fitting, financial analysis, ...
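The distinction is only in the output space. A hypothetical Python sketch (illustrative only; the threshold, coefficients, and the fish rule are made up, not from the lecture):

```python
# Classification maps into a discrete set; regression into a continuous one.

def classify(length, threshold=10.0):
    """Discrete output space O = {'salmon', 'sea bass'} (toy decision rule)."""
    return 'salmon' if length < threshold else 'sea bass'

def regress(length, a=0.5, b=1.0):
    """Continuous output space O = R: a simple linear curve."""
    return a * length + b

print(classify(7.0))   # a discrete label: salmon
print(regress(7.0))    # a real number: 4.5
```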
General Paradigm
Training: training data → learn model → learned parameters θ.
Testing: test data (different from the training data) → predict → predicted output, e.g. the digit labels 0, 1, 2, 8, 4, 6, 6, 7, 8, 9.
What data do we have for training?
- Data with labels (input / output pairs): supervised learning. E.g. images with digit labels, or sensory data for a car with the intended steering control.
- Data without labels: unsupervised learning. E.g. automatic clustering (grouping) of sounds, or clustering of text according to topics.
- Data with and without labels: semi-supervised learning.
- No examples, learning-by-doing: reinforcement learning.
Some Key Challenges
We need generalization! We cannot simply memorize the training set. What if we see an input that we haven't seen before?
- A different shape of the digit image (unknown writer)
- Dirt on the picture, etc.
We need to learn what is important for carrying out our task. This is one of the most crucial points, and we will return to it many times.
Generalization
How do we achieve generalization?
[Figure 1.5 from Duda, Hart & Stork, Pattern Classification, Wiley, 2001 (width vs. lightness plot of salmon and sea bass): Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass.]
Generalization
How do we achieve generalization? Occam's Razor: we should not make the model overly complex!
[Figure 1.6, same source: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of the classifier, thereby giving the highest accuracy on new patterns.]
Some Key Challenges
Input x - features:
- Choosing the right features is very important.
- Coding and use of domain knowledge.
- May allow for invariance (e.g. to volume and pitch of voice).
Curse of dimensionality: if the features are too high-dimensional, we will run into trouble - more later. Dimensionality reduction can help.
Some Key Challenges
How do we measure performance? 99% correct classification in speech recognition: what does that really mean? We understand the meaning of the sentence? We understand every word? For all speakers?
We need more concrete numbers:
- % of correctly classified letters
- average distance driven (until an accident...)
- % of games won
- % of correctly recognized words, sentences, etc.
Training vs. testing performance!
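One such concrete number is classification accuracy on held-out test data. A small Python sketch (illustrative only; the labels are made-up, echoing the digit predictions on the earlier paradigm slide):

```python
# Accuracy: fraction of test items whose predicted label matches the truth.
predicted = ['0', '1', '2', '8', '4', '6', '6', '7', '8', '9']
truth     = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(accuracy)  # 0.8 (two of ten digits misclassified)
```

Crucially, this number must be computed on test data that was not used for training; otherwise it measures memorization, not generalization.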
Some Key Challenges
We also need to define the right error metric: which is better? The Euclidean distance (L2 norm) might be useless.
Some Key Challenges
Which is the right model? The learned parameters can mean a lot of different things:
- w may characterize the family of functions or the model space
- w may index the hypothesis space
- w may be a vector, an adjacency matrix, a graph, ...
Some Key Challenges
Even if we have solved the other problems, computation is usually quite hard:
- Learning often involves some kind of optimization: find (search for) the best model parameters.
- Often we have to deal with thousands of training examples.
- Given a model, we have to compute the prediction efficiently.
Why is machine learning interesting (for you)?
Machine learning is a challenging problem that is far from being solved. Our learning systems are primitive compared to us humans: think about what, and how quickly, a child can learn!
It combines insights and tools from many fields and disciplines:
- traditional artificial intelligence (logic, semantic networks, ...)
- statistics
- complexity theory
- artificial neural networks
- psychology
- adaptive control, ...
Why is machine learning interesting (for you)?
It allows you to apply theoretical skills that you may otherwise use only rarely.
It has lots of applications:
- computer vision
- computational linguistics
- search (think Google)
- digital assistants
- computer systems
- ...
It is a growing field: many major companies are hiring people with machine learning knowledge. Anecdote: at a recent workshop on computer graphics, about 2/3 of the groups said they would benefit from more machine learning knowledge.
Preliminary Syllabus
Subject to change!
Fundamentals (~ 3 weeks):
- Bayes decision theory, maximum likelihood, Bayesian inference
- Performance evaluation
- Probability density estimation
- Mixture models, expectation maximization
Linear Methods (~ 3-4 weeks):
- Linear regression
- PCA, robust PCA
- Fisher linear discriminant
- Generalized linear models
Preliminary Syllabus
Large-Margin Methods (~ 3-4 weeks):
- Statistical learning theory
- Support vector machines
- Kernel methods
Miscellaneous (~ 3 weeks):
- Model averaging (bagging & boosting)
- Graphical models (basic introduction)
Credits
Large parts of the lecture material have been developed by Prof. Bernt Schiele for previous iterations of this course. Many figures that I will use are taken directly from the books by Chris Bishop and by Duda, Hart & Stork.
Brief Review of Basic Probability
What you should already know. Consider two boxes, red (B = r) and blue (B = b), each containing fruit (F = a or F = o). Random picking: the red box is chosen 60% of the time, the blue box 40% of the time; from the chosen box, each piece of fruit is picked with equal probability:
p(B = r) = 0.6, p(B = b) = 0.4
p(F = a | B = r) = p(F = o | B = b) = 0.25
p(F = o | B = r) = p(F = a | B = b) = 0.75
Brief Review of Basic Probability
We usually do not mention the random variable (RV) explicitly (for brevity). Instead of p(X = x) we write:
- p(x) if we want to denote the probability distribution of a particular random variable X.
- p(x) if we want to denote the value of the probability of the random variable being x.
It should be obvious from the context whether we mean the random variable itself or a value that the random variable can take.
Some people use upper case P(X = x) for (discrete) probability distributions. I usually don't.
Brief Review of Basic Probability
Joint probability p(X, Y): the probability distribution of random variables X and Y taking on a configuration jointly. For example: p(B = b, F = o).
Conditional probability p(X | Y): the probability distribution of random variable X given the fact that random variable Y takes on a specific value. For example: p(B = b | F = o).
Basic Rules I
Probabilities are always non-negative: p(x) ≥ 0.
Probabilities sum to 1: Σ_x p(x) = 1, hence 0 ≤ p(x) ≤ 1.
Sum rule or marginalization:
p(x) = Σ_y p(x, y),  p(y) = Σ_x p(x, y)
p(x) and p(y) are called the marginal distributions of the joint distribution p(x, y).
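The sum rule can be checked numerically on the box-and-fruit example from the probability review. A small Python sketch (illustrative, not course material; the `marginal` helper is a hypothetical name):

```python
# Joint distribution p(B, F) built via p(B=b, F=f) = p(F=f | B=b) p(B=b),
# using the numbers from the box-and-fruit example.
joint = {
    ('r', 'a'): 0.6 * 0.25,
    ('r', 'o'): 0.6 * 0.75,
    ('b', 'a'): 0.4 * 0.75,
    ('b', 'o'): 0.4 * 0.25,
}

def marginal(joint, index, value):
    """Sum rule: sum the joint over all configurations whose index-th RV equals value."""
    return sum(p for k, p in joint.items() if k[index] == value)

print(round(marginal(joint, 0, 'r'), 4))  # p(B=r): 0.6, recovering the prior
print(round(marginal(joint, 1, 'a'), 4))  # p(F=a): 0.45
```

Note that marginalizing the joint over F exactly recovers the prior p(B) we started from, as the sum rule demands.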
Basic Rules II
Product rule: p(x, y) = p(x | y) p(y) = p(y | x) p(x).
From this directly follows Bayes' rule, or Bayes' theorem:
p(y | x) = p(x | y) p(y) / p(x)
We will use these rules widely. (Rev. Thomas Bayes, 1701-1761)
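Applied to the box-and-fruit example, Bayes' rule answers: given that we picked an orange, how likely did it come from the red box? A hedged Python sketch (illustrative only, using the probabilities stated on the review slide):

```python
# Bayes' rule: p(B=r | F=o) = p(F=o | B=r) p(B=r) / p(F=o).
p_B = {'r': 0.6, 'b': 0.4}            # prior over boxes
p_o_given_B = {'r': 0.75, 'b': 0.25}  # likelihood of an orange per box

# Sum rule for the evidence p(F=o), then Bayes' rule for the posterior.
p_o = sum(p_o_given_B[b] * p_B[b] for b in p_B)
posterior_r = p_o_given_B['r'] * p_B['r'] / p_o

print(round(p_o, 4))          # p(F=o): 0.55
print(round(posterior_r, 4))  # p(B=r | F=o): 0.8182, i.e. 9/11
```

Observing the orange raises the probability of the red box from the prior 0.6 to roughly 0.82, because oranges are three times as likely there.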
Continuous RVs
What if we have continuous random variables, say X = x ∈ ℝ? Any single value has zero probability; we can only assign a probability to the random variable being in a range of values: Pr(x₀ < X < x₁) = Pr(x₀ ≤ X ≤ x₁).
Instead we use the probability density p(x):
Pr(x₀ ≤ X ≤ x₁) = ∫ from x₀ to x₁ of p(x) dx
Cumulative distribution function: P(z) = ∫ from -∞ to z of p(x) dx, and P'(x) = p(x).
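The integral relation can be checked numerically. A Python sketch (illustrative; the uniform density on [0, 2] is a made-up example, not from the slides), approximating Pr(x₀ ≤ X ≤ x₁) by a midpoint Riemann sum:

```python
# Pr(x0 <= X <= x1) as the integral of the density over [x0, x1],
# illustrated with the uniform density p(x) = 0.5 on [0, 2].

def p(x):
    """Uniform probability density on [0, 2]."""
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def prob(x0, x1, n=100000):
    """Midpoint Riemann-sum approximation of the integral of p over [x0, x1]."""
    dx = (x1 - x0) / n
    return sum(p(x0 + (i + 0.5) * dx) for i in range(n)) * dx

print(round(prob(0.5, 1.5), 4))  # 0.5: half the mass lies in [0.5, 1.5]
print(round(prob(0.0, 2.0), 4))  # 1.0: the density integrates to 1
```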
Continuous RVs
p(x): probability density function (pdf). P(x): cumulative distribution function (cdf).
We can work with a density (pdf) as if it were a probability distribution; for simplicity we usually use the same notation for both.
Basic rules for pdfs
What are the rules?
Non-negativity: p(x) ≥ 0.
Summing to 1: ∫ p(x) dx = 1. But: p(x) ≤ 1 does not hold in general; a density can exceed 1 (e.g. the uniform density on [0, 1/2] has p(x) = 2).
Marginalization: p(x) = ∫ p(x, y) dy, p(y) = ∫ p(x, y) dx.
Product rule: p(x, y) = p(x | y) p(y) = p(y | x) p(x).
Expectations
The average value of a function f(x) under a probability distribution p(x) is the expectation:
E[f] = E[f(x)] = Σ_x f(x) p(x)  or  E[f] = ∫ f(x) p(x) dx
For joint distributions we sometimes write E_x[f(x, y)].
Conditional expectation: E_{x|y}[f] = E_x[f | y] = Σ_x f(x) p(x | y).
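The discrete expectation is a weighted sum. A short Python sketch (illustrative; the fair die is a made-up example, not from the slides):

```python
# Discrete expectation E[f] = sum_x f(x) p(x), illustrated with a fair die.
p = {x: 1 / 6 for x in range(1, 7)}

def expectation(f, p):
    """Weighted average of f(x) under the distribution p."""
    return sum(f(x) * px for x, px in p.items())

mean = expectation(lambda x: x, p)        # E[x] = 3.5
second = expectation(lambda x: x * x, p)  # E[x^2] = 91/6
print(round(mean, 4), round(second, 4))
```

The same two moments reappear on the next slide: var[x] = E[x²] - E[x]² follows directly from them.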
Variance and Covariance
Variance of a single RV: var[x] = E[(x − E[x])²] = E[x²] − E[x]².
Covariance of two RVs: cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[xy] − E[x] E[y].
Random vectors: everything we have said so far applies not only to scalar random variables but also to random vectors. In particular, we have the covariance matrix:
cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])ᵀ] = E_{x,y}[x yᵀ] − E[x] E[y]ᵀ.
Preview
- Review of some basics about probability
- Bayesian decision theory
- Loss functions
Disclaimer: It will get quite a bit more mathematical than this :) Don't get scared away, but be aware that this will not be a walk in the park.
Readings for next week
- Introduction to ML: Bishop 1.0, 1.1
- Review of the basics of probability: Bishop 1.2 (you can skip 1.2.5 and 1.2.6 for now)
- Decision theory: Bishop 1.5
For the curious:
- Probability: you could also look at MacKay, Chapter 2
- Brush up on information theory: Bishop 1.6