Machine Learning: Statistical Methods - I


Machine Learning: Statistical Methods - I
Introduction - 15.04.2009
Stefan Roth, Department of Computer Science, GRIS

Machine Learning I
Lecturer: Prof. Stefan Roth, Ph.D. <sroth AT cs.tu-...>
Teaching Assistant: Qi Gao <qgao AT gris.tu-...>
Announcements:
- Course web page: http://www.gris.informatik.tu-darmstadt.de/teaching/courses/ss09/ml/index.en.htm
- Mailing list: see subscription information on the web page.
- Forum: http://d120.de/forum/viewforum.php?f=292

The course language will be English. This applies to lectures, exercises, announcements, etc. Why? Essentially all machine learning publications and books are written in English, so knowing the original terminology is crucial. If you strongly prefer, you may contact the course staff in German. English is encouraged, though, because we may use your (anonymized) questions to clarify points for the entire class.

Organization
Lecture: 2 hours a week. Wednesdays, 9:50-11:30, S3/05 room 073. We will cover the foundational aspects of each topic.
Exercise: 2 hours a week. Wednesdays, 11:40-13:20, S3/05 room 073. We will cover some practical aspects and discuss the homework assignments.

Exam & Bonus System
Most likely there will be an oral exam, probably during the first week after classes end. It can be taken in English or German; details depend on how many students end up taking the class.
There will be a bonus of up to a full grade for those who do well on the homework assignments. Details TBA.
Exercises: In order to get credit for 2+2 SWS, you need to actively participate in the exercises and turn in the homework assignments. If you do not hand in homework assignments regularly, you can only get credit for the lecture part (2 SWS).

Style
Lectures: I would like the lectures to be at least partly interactive, maybe more interactive than you are used to. This is meant to be helpful for both you and me. You are encouraged to ask questions!
Exercises: Mostly interactive. You are encouraged to ask detailed questions!
Your participation counts: bonus for the final exam.

Homework Assignments
A mix of written and programming assignments; we will have around 4-5 assignments.
Programming assignments are in MATLAB, a standard environment for scientific computing.
- Goal: Work with some real data to get first-hand knowledge of how the techniques we will learn actually work.
- Introduction during the first exercise (next week).
There will also be pen-and-paper exercises. The last assignment may be a larger, project-like one: stay tuned...

Readings
Course book: Christopher Bishop: Pattern Recognition and Machine Learning. Springer, 2006. ISBN 0-387-31073-8 (a very good book, but not an easy read). Should be on reserve in the library.
Other useful books:
- Duda, Hart & Stork: Pattern Classification. Wiley-Interscience, 2nd edition, 2000. ISBN 0-471-05669-3 (new version of a classic).
- David J. C. MacKay: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. ISBN 0-521-64298-1 (free download at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html).
- Gelman et al.: Bayesian Data Analysis. CRC Press, 2nd edition, 2004. ISBN 1-584-88388-X.
- Hastie, Tibshirani & Friedman: The Elements of Statistical Learning. Springer, 2001. ISBN 0-387-95284-5 (the statistical perspective).
- Tom Mitchell: Machine Learning. McGraw-Hill, 1997. ISBN 0-07-042807-7 (a classic, but getting outdated).

Readings
Additional readings: At times I will post papers and tutorials; they will be available on, or linked from, the course web page. I will often assign weekly readings: please read them and come to class prepared!
The Bishop book is a good investment, because it is also a very useful reference.

How does it fit into your course plan?
Diplom: Anwendungsbezogene Informatik. Possibly Praktische or Theoretische Informatik, if you can find someone who will count ML I toward this.
- Note that I will not be able to offer an exam in Theoretische Informatik.
B.Sc. / M.Sc.: Human Computer Systems (see Modulhandbuch), not Data & Knowledge Engineering.
If you are strongly interested in machine learning, you should:
- take ML: Statistical Methods for HCS credit, and
- take ML: Symbolische Methoden for DKE credit.

How does it fit into your course plan?
Related classes:
- Human Computer Systems (WS, Schiele/Fellner): prerequisite
- Machine Learning: Statistical Methods II (WS, Schiele)
- Computer Vision I (SS, Schiele/Schindler) and II (WS, Roth)
- Maschinelles Lernen - Symbolische Verfahren (WS, Fürnkranz)
- Bildverarbeitung (SS, Sakas)
Theses and projects: topics in machine learning with applications in computer vision. Please talk to me if you are interested.

Machine Learning
What is ML? What is its goal? To develop a machine / an algorithm that learns to perform a task from past experience.
Why? What for?
- A fundamental component of every intelligent and/or autonomous system
- Discovering rules and patterns in data
- Automatic adaptation of systems
- Attempting to understand human / biological learning

Machine Learning in Action

Machine Learning: Examples
Example 1: Recognition of handwritten digits. The digits are given to us as small digital images, and we have to build a machine that decides which digit each image shows. The obvious challenge: there are many different ways in which people write by hand.

Machine Learning: Examples
Example 2: Classification of fish (salmon vs. sea bass).
[Figure 1.1: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted, and finally the classification is emitted, here either salmon or sea bass. Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response of later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright (c) 2001 by John Wiley & Sons, Inc.]
[Figure 1.2: Histograms of the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: ibid.]

Machine Learning: Examples
More examples: email filtering, speech recognition, vehicle control.

Impact & Successes
- Recognition of speech, letters, faces, ...
- Autonomous vehicle navigation
- Games: backgammon (world-champion level), chess (Deep Blue vs. Kasparov)
- Google
- Finding new astronomical structures
- Fraud detection (credit card applications)
- ...

Machine Learning
Develop a machine / an algorithm that learns to perform a task from past experience. Put more abstractly, our task is to learn a mapping from input to output; put differently, we want to predict the output from the input:
$f : \mathcal{I} \rightarrow \mathcal{O}, \qquad y = f(x; \theta)$
Input: $x \in \mathcal{I}$ (images, text, other measurements, ...)
Output: $y \in \mathcal{O}$
Parameter(s): $\theta \in \Theta$ (that is/are being "learned")
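To make this concrete, here is a minimal MATLAB sketch (with synthetic data invented purely for illustration) that learns the parameters $\theta$ of an affine mapping $f(x; \theta) = \theta_1 + \theta_2 x$ from example input/output pairs by least squares, and then predicts the output for a new input:

```matlab
% Minimal sketch: learn theta of a mapping y = f(x; theta) = theta(1) + theta(2)*x
% from input/output pairs. Synthetic data, for illustration only.
x = (0:0.5:5)';                        % example inputs
y = 2 + 3*x + 0.1*randn(size(x));      % noisy outputs of a "true" mapping
A = [ones(size(x)), x];                % design matrix of the affine model
theta = A \ y;                         % least-squares estimate of the parameters
y_new = theta(1) + theta(2) * 7;       % predict the output for an unseen input
```

Much of the course is about choosing richer model families $f$ and better ways of estimating $\theta$.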

Classification vs. Regression
Classification: learn a mapping into a discrete space, e.g.:
- $\mathcal{O} = \{0, 1\}$
- $\mathcal{O} = \{0, 1, 2, 3, \ldots\}$
- $\mathcal{O} = \{\text{verb}, \text{noun}, \text{noun phrase}, \ldots\}$
Examples: spam / not spam, sea bass vs. salmon, parsing a sentence, recognizing digits, etc.

Classification vs. Regression
Regression: learn a mapping into a continuous space, e.g.:
- $\mathcal{O} = \mathbb{R}$
- $\mathcal{O} = \mathbb{R}^3$
Examples: curve fitting, financial analysis, ...
[Figure: example regression plots.]

General Paradigm
Training: training data → learn model → learned parameters $\theta$.
Testing: test data (different from the training data) → predict → predicted output, e.g. 0, 1, 2, 8, 4, 6, 6, 7, 8, 9.
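As a concrete (made-up) instance of this paradigm, the MATLAB sketch below learns a single length threshold on a synthetic one-dimensional "fish length" problem during training, and then applies it to test data the learner has never seen:

```matlab
% Sketch of the train/predict paradigm on a synthetic 1-D "fish length" task.
rng(1);
x_train = [10 + 2*randn(100,1); 15 + 2*randn(100,1)];  % lengths
y_train = [zeros(100,1); ones(100,1)];                 % 0 = salmon, 1 = sea bass

% Training: pick the threshold with the fewest errors on the training data.
candidates = min(x_train):0.1:max(x_train);
err = arrayfun(@(t) mean((x_train > t) ~= y_train), candidates);
[~, best] = min(err);
threshold = candidates(best);          % the learned "parameter"

% Testing: predict on new data that was not used for training.
x_test = [10 + 2*randn(50,1); 15 + 2*randn(50,1)];
y_test = [zeros(50,1); ones(50,1)];
test_accuracy = mean((x_test > threshold) == y_test)
```

The number that matters is the accuracy on the test data, not on the training data.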

What data do we have for training?
Data with labels (input / output pairs): supervised learning
- E.g. images with digit labels, or sensory data for a car together with the intended steering control.
Data without labels: unsupervised learning
- E.g. automatic clustering (grouping) of sounds, or clustering of texts according to topic (see the sketch below).
Data with and without labels: semi-supervised learning
No examples, learning by doing: reinforcement learning
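Since clustering is the canonical unsupervised example, here is a minimal k-means sketch in MATLAB (synthetic 2-D data, plain loops instead of toolbox functions, a fixed iteration count, and no handling of empty clusters) that groups points without ever seeing a label:

```matlab
% Minimal k-means sketch: group 2-D points into k clusters without labels.
rng(0);
X = [randn(50,2); bsxfun(@plus, randn(50,2), [4 4])];  % two point clouds
k = 2;
mu = X(randperm(size(X,1), k), :);     % initialize means with random data points
for iter = 1:20
    % Assignment step: squared distance from every point to every mean.
    D = zeros(size(X,1), k);
    for j = 1:k
        D(:,j) = sum(bsxfun(@minus, X, mu(j,:)).^2, 2);
    end
    [~, z] = min(D, [], 2);            % cluster index for each point
    % Update step: each mean becomes the average of its assigned points.
    for j = 1:k
        mu(j,:) = mean(X(z == j, :), 1);
    end
end
```

The algorithm discovers the two groups purely from the structure of the inputs.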

Some Key Challenges
We need generalization! We cannot simply memorize the training set. What if we see an input that we haven't seen before?
- A different shape of a digit image (unknown writer)
- Dirt on the picture, etc.
We need to learn what is important for carrying out our task. This is one of the most crucial points, and we will return to it many times.

Generalization
How do we achieve generalization?
[Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked "?" is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright (c) 2001 by John Wiley & Sons, Inc.]

Generalization
How do we achieve generalization? Occam's razor: we should not make the model overly complex!
[Figure 1.6: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of the classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright (c) 2001 by John Wiley & Sons, Inc.]
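The tradeoff is easy to reproduce in a small experiment. This MATLAB sketch (synthetic data with an assumed noise level) fits polynomials of increasing degree to the same noisy samples and compares training and test error; the most complex model wins on the training set but loses on fresh data:

```matlab
% Sketch: overfitting with polynomial curve fitting on synthetic data.
rng(2);
f = @(x) sin(2*pi*x);                   % underlying "true" function
x_train = rand(10,1);                   % small training set
y_train = f(x_train) + 0.2*randn(10,1); % noisy observations
x_test  = rand(200,1);                  % fresh test data
y_test  = f(x_test) + 0.2*randn(200,1);
for degree = [1 3 9]
    p = polyfit(x_train, y_train, degree);
    train_mse = mean((polyval(p, x_train) - y_train).^2);
    test_mse  = mean((polyval(p, x_test)  - y_test ).^2);
    fprintf('degree %d: train MSE %.3f, test MSE %.3f\n', ...
            degree, train_mse, test_mse);
end
% Typically the degree-9 fit has near-zero training error but a much larger
% test error than the degree-3 fit: it has memorized the noise.
```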

Some Key Challenges
Input $x$ and its features: choosing the right features is very important. This is where domain knowledge is encoded and put to use, and good features may provide invariance (e.g. to the volume and pitch of a voice).
Curse of dimensionality: if the features are too high-dimensional, we will run into trouble (more on this later). This motivates dimensionality reduction.

Some Key Challenges
How do we measure performance? 99% correct classification in speech recognition: what does that really mean? That we understand the meaning of the sentence? That we understand every word? For all speakers?
We need more concrete numbers:
- % of correctly classified letters
- average distance driven (until an accident...)
- % of games won
- % of correctly recognized words, sentences, etc.
And always: training vs. testing performance!

Some Key Challenges
We also need to define the right error metric: given two candidate outputs, which is better? A generic choice such as the Euclidean distance (L2 norm) might be useless for the task at hand.

Some Key Challenges
Which is the right model? The learned parameters $w$ can mean a lot of different things:
- $w$ may characterize the family of functions or the model space
- $w$ may index the hypothesis space
- $w$ may be a vector, an adjacency matrix, a graph, ...

Some Key Challenges
Even if we have solved the other problems, computation is usually quite hard:
- Learning often involves some kind of optimization: finding (searching for) the best model parameters.
- Often we have to deal with thousands of training examples.
- Given a model, the prediction must be computed efficiently.

Why is machine learning interesting (for you)?
Machine learning is a challenging problem that is far from being solved; our learning systems are primitive compared to us humans. Think about what, and how quickly, a child can learn!
It combines insights and tools from many fields and disciplines:
- traditional artificial intelligence (logic, semantic networks, ...)
- statistics
- complexity theory
- artificial neural networks
- psychology
- adaptive control, ...

Why is machine learning interesting (for you)?
It allows you to apply theoretical skills that you may otherwise use only rarely. It has lots of applications:
- computer vision
- computational linguistics
- search (think Google)
- digital assistants
- computer systems, ...
It is a growing field: many major companies are hiring people with machine learning knowledge. An anecdote: at a recent workshop on computer graphics, about 2/3 of the groups said they would benefit from more machine learning knowledge.

Preliminary Syllabus
Subject to change!
Fundamentals (~ 3 weeks):
- Bayes decision theory, maximum likelihood, Bayesian inference
- Performance evaluation
- Probability density estimation
- Mixture models, expectation maximization
Linear Methods (~ 3-4 weeks):
- Linear regression
- PCA, robust PCA
- Fisher linear discriminant
- Generalized linear models

Preliminary Syllabus
Large-Margin Methods (~ 3-4 weeks):
- Statistical learning theory
- Support vector machines
- Kernel methods
Miscellaneous (~ 3 weeks):
- Model averaging (bagging & boosting)
- Graphical models (basic introduction)

Credits
Large parts of the lecture material were developed by Prof. Bernt Schiele for previous iterations of this course. Many of the figures I will use are taken directly from the books by Chris Bishop and by Duda, Hart & Stork.

Brief Review of Basic Probability
What you should already know: consider a red box and a blue box, each containing apples ($F = a$) and oranges ($F = o$). Random picking: we choose the red box ($B = r$) 60% of the time and the blue box ($B = b$) 40% of the time, and then pick a fruit from the chosen box with equal probability for each piece of fruit in it:
$p(B = r) = 0.6, \qquad p(B = b) = 0.4$
$p(F = a \mid B = r) = p(F = o \mid B = b) = 0.25$
$p(F = o \mid B = r) = p(F = a \mid B = b) = 0.75$

Brief Review of Basic Probability
For brevity, we usually do not mention the random variable (RV) explicitly. Instead of $p(X = x)$ we write:
- $p(X)$ if we want to denote the probability distribution of a particular random variable $X$;
- $p(x)$ if we want to denote the value of the probability of the random variable taking the value $x$.
It should be obvious from the context whether we mean the random variable itself or a value that the random variable can take. Some people use upper case $P(X = x)$ for (discrete) probability distributions; I usually don't.

Brief Review of Basic Probability
Joint probability $p(X, Y)$: the probability distribution of random variables $X$ and $Y$ taking on a configuration jointly. For example: $p(B = b, F = o)$.
Conditional probability $p(X \mid Y)$: the probability distribution of random variable $X$ given the fact that random variable $Y$ takes on a specific value. For example: $p(B = b \mid F = o)$.

Basic Rules I
Probabilities are always non-negative: $p(x) \geq 0$.
Probabilities sum to 1: $\sum_x p(x) = 1$, and hence $0 \leq p(x) \leq 1$.
Sum rule or marginalization:
$p(x) = \sum_y p(x, y), \qquad p(y) = \sum_x p(x, y)$
$p(x)$ and $p(y)$ are called marginal distributions of the joint distribution $p(x, y)$.

Basic Rules II
Product rule: $p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$
From this directly follows Bayes' rule or Bayes' theorem (Rev. Thomas Bayes, 1701-1761):
$p(y \mid x) = \dfrac{p(x \mid y)\,p(y)}{p(x)}$
We will make extensive use of these rules.
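To see the rules in action, we can work the fruit-box example from above all the way through: the sum and product rules give the marginal probability of picking an apple, and Bayes' rule then reverses the conditioning to tell us which box the apple probably came from.

$p(F = a) = \sum_{b} p(F = a \mid B = b)\, p(B = b) = 0.25 \cdot 0.6 + 0.75 \cdot 0.4 = 0.45$

$p(B = r \mid F = a) = \dfrac{p(F = a \mid B = r)\, p(B = r)}{p(F = a)} = \dfrac{0.25 \cdot 0.6}{0.45} = \dfrac{1}{3}$

So although the red box is chosen more often, observing an apple makes the blue box twice as likely as the red one.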

Continuous RVs
What if we have continuous random variables, say $X = x \in \mathbb{R}$? Any single value has zero probability; we can only assign a probability to the random variable lying in a range of values:
$\Pr(x_0 < X < x_1) = \Pr(x_0 \leq X \leq x_1)$
Instead we use the probability density $p(x)$:
$\Pr(x_0 \leq X \leq x_1) = \int_{x_0}^{x_1} p(x)\, dx$
Cumulative distribution function:
$P(z) = \int_{-\infty}^{z} p(x)\, dx \qquad \text{and} \qquad P'(x) = p(x)$
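These relationships are easy to check numerically. A small MATLAB sketch for a standard normal density (the pdf is written out explicitly so no toolbox is needed; trapz, cumtrapz, and gradient are base MATLAB):

```matlab
% Numeric check of the pdf/cdf relationship for a standard normal density.
x = linspace(-6, 6, 10001);
p = exp(-x.^2 / 2) / sqrt(2*pi);   % pdf of a standard normal
trapz(x, p)                        % integral over (most of) the real line: ~1
P = cumtrapz(x, p);                % cdf by numerical integration
max(abs(gradient(P, x) - p))       % ~0, i.e. P'(x) = p(x)
```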

Continuous RVs
[Figure: a probability density function (pdf) $p(x)$ with a small interval $\delta x$ marked, and the corresponding cumulative distribution function (cdf) $P(x)$.]
We can work with a density (pdf) much as if it were a probability distribution. For simplicity we usually use the same notation for both.

Basic Rules for pdfs
What are the rules?
Non-negativity: $p(x) \geq 0$
Normalization: $\int p(x)\, dx = 1$. But note: $p(x) \leq 1$ does not hold in general; a density can exceed 1.
Marginalization: $p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx$
Product rule: $p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$

Expectations
The average value of a function $f(x)$ under a probability distribution $p(x)$ is the expectation:
$\mathbb{E}[f] = \mathbb{E}[f(x)] = \sum_x f(x)\,p(x) \qquad \text{or} \qquad \mathbb{E}[f] = \int f(x)\,p(x)\, dx$
For joint distributions we sometimes write $\mathbb{E}_x[f(x, y)]$.
Conditional expectation: $\mathbb{E}_{x \mid y}[f] = \mathbb{E}_x[f \mid y] = \sum_x f(x)\,p(x \mid y)$
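For samples drawn from $p(x)$, an expectation is approximated by a simple average over the samples. A short MATLAB sketch (standard normal samples and the example choice $f(x) = x^2$, for which $\mathbb{E}[x^2] = 1$):

```matlab
% Monte Carlo estimate of an expectation: E[f(x)] ~ average of f over samples.
rng(3);
x = randn(1e6, 1);        % samples from a standard normal p(x)
f = @(x) x.^2;            % example function; E[x^2] = 1 for this p(x)
mean(f(x))                % should be close to 1
```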

Variance and Covariance
Variance of a single RV:
$\operatorname{var}[x] = \mathbb{E}\left[(x - \mathbb{E}[x])^2\right] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$
Covariance of two RVs:
$\operatorname{cov}(x, y) = \mathbb{E}_{x,y}\left[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\right] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$
Random vectors: everything we have said so far applies not only to scalar random variables, but also to random vectors. In particular, we have the covariance matrix:
$\operatorname{cov}(\mathbf{x}, \mathbf{y}) = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y} - \mathbb{E}[\mathbf{y}])^T\right] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^T] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}]^T$
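The two forms of each identity can be checked on samples; a small sketch with made-up correlated data (by construction, the true covariance is 0.5):

```matlab
% Numeric check of the variance and covariance identities on samples.
rng(4);
x = randn(1e6, 1);
y = 0.5*x + randn(1e6, 1);                 % correlated with x by construction
var_def   = mean((x - mean(x)).^2);        % E[(x - E[x])^2]
var_short = mean(x.^2) - mean(x)^2;        % E[x^2] - E[x]^2, same value
cov_def   = mean((x - mean(x)) .* (y - mean(y)));
cov_short = mean(x.*y) - mean(x)*mean(y);  % both approx. 0.5
```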

Preview
Coming up: review of some basics about probability, Bayesian decision theory, and loss functions.
Disclaimer: It will get quite a bit more mathematical than this :) Don't get scared away, but be aware that this will not be a walk in the park.

Readings for next week
Introduction to ML: Bishop 1.0, 1.1
Review of the basics of probability: Bishop 1.2 (you can skip 1.2.5 and 1.2.6 for now)
Decision theory: Bishop 1.5
For the curious:
- Probability: you could also look at MacKay, Chapter 2
- Brush up on information theory: Bishop 1.6