PMR 2728 / 5228 Probability Theory in AI and Robotics. Machine Learning. Fabio G. Cozman - Office MS08 -

1 PMR 2728 / 5228 Probability Theory in AI and Robotics Machine Learning Fabio G. Cozman - Office MS08 - fgcozman@usp.br November 12, 2012

2 Machine learning Quite general term: learning from explanations, from examples, from data... Several topics: Logical models (e.g., inductive logic programming). Rule-based learning. Memory-based learning. Statistical learning (estimation, classification).

3 Part I 1 Statistical learning and classification: basics. 2 Optimal classifiers. 3 Supervised/unsupervised learning. 4 Bias, variance, overfitting.

4 Statistical learning That is, to build a statistical model from data. Most often geared towards classification, less often towards scientific understanding. Classification is also known as 1 Pattern recognition. 2 Discrimination. 3 Data mining (this is a very general term!).

5 Classification Consider variables $X = \{X_1, \ldots, X_n\}$ that are observed; they are called features or attributes. A classifier is a function $g(X_1, \ldots, X_n)$ from attributes to labels. Labels are values of a class variable $Y$. Thus $\hat{Y} = g(X_1, \ldots, X_n)$, and the goal is to have $\hat{Y}$ equal to $Y$. So learning here means producing a function $g$ from data; this is referred to as training the classifier.

6 Error rates To evaluate classifiers, the usual metric is the probability of error: $e_g = P(Y \neq g(X))$. So we use the joint distribution $p(X, Y)$. The empirical error rate is $\hat{e}_g = \frac{1}{N} \sum_{i=1}^{N} I_{\{Y \neq g(X)\}}(X_i, Y_i)$.
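
As a minimal sketch of the empirical error rate, the function below counts the fraction of records that a classifier labels incorrectly. The toy records and the threshold classifier are hypothetical, chosen only for illustration.

```python
def empirical_error(g, records):
    """Fraction of records (x, y) for which g(x) differs from y."""
    mistakes = sum(1 for x, y in records if g(x) != y)
    return mistakes / len(records)


# Hypothetical toy data: one numeric feature and a binary label.
records = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.55, 0)]


def threshold_classifier(x):
    """Hypothetical classifier: predict 1 when the feature is at least 0.5."""
    return 1 if x >= 0.5 else 0


error = empirical_error(threshold_classifier, records)
```

Only the record (0.55, 0) is misclassified here, so one record out of five counts as an error.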

7 Data A collection of data is a database. Usually a database contains tables: each row contains observed values for variables X 1,..., X n. If a variable is never observed, it is hidden; often referred to as a latent variable. We have missing data whenever a variable is not observed. Missing data may be missing at random (MAR). Missing data may be missing due to some systematic reason. Databases may be processed at once (batch mode) or sequentially as observations are gathered.

8 Supervised/unsupervised classification Consider a database containing records (X, Y). When every record contains a label (no missing label), we have supervised learning. When every label Y is missing, we have unsupervised learning. General case, with some labels missing: semi-supervised learning. Note: in all cases, records may also contain missing data. Unsupervised learning is also referred to as clustering.

9 Training and testing data Usually a classifier is learned using a portion of the available data: the training data. The remainder of the data are used to test the classifier (for instance, by estimating the probability of error): the testing data.
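
A possible way to separate training and testing data is sketched below. The function name, the fixed seed, and the toy records are assumptions made for this illustration.

```python
import random


def train_test_split(records, test_fraction, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records (feature, label).
records = [(i, i % 2) for i in range(10)]
training_data, testing_data = train_test_split(records, test_fraction=0.2)
```

Shuffling before splitting matters when the database is stored in some systematic order, for example sorted by label.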

10 Overfitting and cross-validation Testing data are important to detect overfitting, that is, a classifier that is excellent on the training data but fails on other data. (Example: a polynomial of degree 1000 can interpolate 100 given points exactly, yet do poorly at other points.) If no separate testing data are available, at least cross-validation must be used. Idea: set aside a fraction of the data for testing, train on the rest, then repeat so that every part of the database is used for testing once. Five-fold and ten-fold cross-validation are very common. Leave-one-out validation is also common, but it demands more computation.
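
The cross-validation idea can be sketched as follows. This is an illustration under stated assumptions: folds are contiguous (in practice the records would usually be shuffled first), and `train_majority` is a hypothetical trivial learner used only to make the example runnable.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of nearly equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_error(train, records, k):
    """Average held-out error over k folds; train(records) returns a classifier."""
    errors = []
    for fold in k_fold_indices(len(records), k):
        held_out = set(fold)
        training = [r for i, r in enumerate(records) if i not in held_out]
        testing = [records[i] for i in fold]
        g = train(training)
        errors.append(sum(g(x) != y for x, y in testing) / len(testing))
    return sum(errors) / k


def train_majority(training):
    """Hypothetical learner: always predict the most frequent training label."""
    labels = [y for _, y in training]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# Hypothetical data: every third record has label 1.
records = [(i, 1 if i % 3 == 0 else 0) for i in range(30)]
cv_error = cross_validated_error(train_majority, records, k=5)
```

Setting `k` equal to the number of records gives leave-one-out validation, which trains the classifier once per record.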

11 Optimal classification The optimal classifier is $g^* = \arg\min_g e_g = \arg\min_g P(Y \neq g(X))$. If $Y$ were binary, and we had $p(X, Y)$, what is $g^*$?
$$\arg\min_g P(Y \neq g(X)) = \arg\min_{g(X)} P(Y \neq g(X) \mid X) = \arg\min_{g(X)} \left[ I_{\{g(X)=0\}}(X)\, P(Y = 1 \mid X) + I_{\{g(X)=1\}}(X)\, P(Y = 0 \mid X) \right].$$
Thus:
$$g^*(X) = \begin{cases} 0 & \text{if } P(Y = 1 \mid X) < 1/2, \\ 1 & \text{if } P(Y = 1 \mid X) \geq 1/2. \end{cases}$$
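
To make the optimal rule concrete, the sketch below applies it to a small joint distribution that is assumed to be known. The table values are hypothetical.

```python
# Hypothetical joint distribution p(x, y) over a discrete feature x and binary y.
joint = {
    ("low", 0): 0.30, ("low", 1): 0.10,
    ("high", 0): 0.15, ("high", 1): 0.45,
}


def posterior_y1(x):
    """P(Y = 1 | X = x), computed from the joint table."""
    return joint[(x, 1)] / (joint[(x, 0)] + joint[(x, 1)])


def bayes_classifier(x):
    """Predict 1 exactly when P(Y = 1 | X = x) is at least 1/2."""
    return 1 if posterior_y1(x) >= 0.5 else 0


# Error of the optimal rule: total probability of the label that is not chosen.
bayes_error = sum(p for (x, y), p in joint.items() if bayes_classifier(x) != y)
```

For this table the rule predicts 0 for "low" and 1 for "high", and no other classifier on these two feature values can have a smaller probability of error.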

12 Plug-in classifiers In practice we usually do not have $p(X, Y)$. A common strategy is to estimate $p(X, Y)$ and use the estimate in the optimal classifier scheme.

13 Warning If $p(X, Y)$ is not given, there is no universally optimal method to generate classifiers. For any method, there is always a distribution $p(X, Y)$ such that the method is worse than some other method. Thus, it makes sense to look for simple and intuitive methods that work often.

14 Example: Nearest neighbor Simple idea: $g(X)$ is equal to the majority label in a $k$-neighborhood of $X$, given some distance. If $k = k(n)$ is such that $k(n) \to \infty$ and $k(n)/n \to 0$, then $\lim_{n \to \infty} E[e_{g_n}] = e_{g^*}$ (very nice). Moreover, for 1-nearest neighbor, $\lim_{n \to \infty} E[e_{g_n}] \leq 2 e_{g^*}$ (amazing). Digression: if a method generates classifiers $g_n$ for training data of size $n$, and $E[e_{g_n}] \to e_{g^*}$, the method is consistent. Thus nearest neighbor classifiers are (in a special way) consistent.
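
A minimal sketch of the nearest neighbor rule follows, assuming Euclidean distance and a brute-force search over the training data. The training points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(x, training, k):
    """Majority label among the k training points closest to x."""
    by_distance = sorted(training, key=lambda record: math.dist(x, record[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups in the plane.
training = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
            ((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 1)]

prediction_near_origin = knn_classify((0.15, 0.15), training, k=3)
prediction_near_ones = knn_classify((0.95, 1.0), training, k=3)
```

Sorting all training points costs time proportional to n log n per query; this keeps the sketch short but would be too slow for large databases.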

15 In short, the basic schemes: There are two basic schemes when this distribution is not available but data is collected: The distribution is estimated and the classifier is built using the estimates. The classifier is directly built from data, possibly using estimates to evaluate the process. Maybe some estimate of the error rate is minimized (maybe the empirical error rate). Maybe some approximating function is selected.

16 Bias and variance The optimal classifier uses $p = P(Y = 1 \mid X)$. In practice we often use an estimate $\hat{p}$. Whatever the estimator $\hat{p}$, it has an expected value and a variance (it is a random variable). Consider the expected quadratic error in estimating $p$:
$$E\left[(\hat{p} - p)^2\right] = E\left[\hat{p}^2\right] - 2p\,E[\hat{p}] + p^2$$
$$= E\left[\hat{p}^2\right] - 2p\,E[\hat{p}] + p^2 + E[\hat{p}]^2 - E[\hat{p}]^2$$
$$= \left(E[\hat{p}] - p\right)^2 + E\left[(\hat{p} - E[\hat{p}])^2\right] = \text{bias}^2 + \text{variance}.$$
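
The decomposition can be checked numerically. The sketch below simulates a deliberately biased estimator of a Bernoulli parameter; the estimator, the true value 0.3, the sample size, and the seed are all assumptions chosen for illustration.

```python
import random

random.seed(0)
p, n, trials = 0.3, 20, 5000


def shrunk_estimate(sample):
    """Hypothetical biased estimator: adds one pseudo-count for each outcome."""
    return (sum(sample) + 1) / (len(sample) + 2)


estimates = []
for _ in range(trials):
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    estimates.append(shrunk_estimate(sample))

mean_estimate = sum(estimates) / trials
mse = sum((e - p) ** 2 for e in estimates) / trials
bias_squared = (mean_estimate - p) ** 2
variance = sum((e - mean_estimate) ** 2 for e in estimates) / trials
```

With empirical averages in place of expectations, `mse` equals `bias_squared + variance` up to floating-point rounding, mirroring the algebra above.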

17 The bias/variance tradeoff Sometimes a simple estimator has large bias but small variance, and works well. Central quantity is the bias on $P(Y = 1 \mid X)$. Digression: classifiers that represent only $P(Y = 1 \mid X)$ are called diagnostic classifiers; classifiers that represent $p(X, Y)$ are called generative classifiers.

18 In practice We never know exactly which classifier to use. Simple ones work well: nearest neighbor, Naive Bayes, decision trees... Many classifiers can be understood as Bayesian networks, but the most complex ones do not always win. Other classifiers: neural networks, support vector machines... It is important to test, so as to select a reasonable one!

19 Part II 1 Text classification (and related issues). 2 Image classification.

20 Application: Text classification Topic of enormous economic importance (examples: spam detection, document retrieval). Main problem: given a piece of text, classify it. Often, classify into a set of given labels. Clustering: also define the labels. Bag of words: just count the words (or word stems) in the document; each count is a feature. More sophisticated: hierarchical model, with concepts that group words.
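
A bag-of-words feature vector can be sketched as below. The vocabulary and the example text are hypothetical, and the tokenization is simplified to lowercasing and splitting on whitespace, with no stemming or punctuation handling.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and message.
vocabulary = ["free", "money", "meeting"]
features = bag_of_words("Free money now free offer", vocabulary)
```

Each position in `features` is one attribute of the document, so any classifier that accepts numeric features can be trained on these vectors.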

21 Text classification with hyperlinks (Getoor, Segal, Taskar, Koller 2001)

22 Text classification: results

23 Some other applications Classification of handwritten digits. Segmentation of images, object recognition in images. Detection of obstacles in robotics. And many applications in commercial data mining: client classification, marketing impact, etc.

24 Application: Expression detection

25 The Mona Lisa is happy (!)

26 Part III 1 Mixture of Gaussians. 2 Support vector machines. 3 Classification trees. 4 Regression, logistic regression. 5 Neural networks.

27 Mixture of Gaussians Assume: for each class $y$, features are Gaussian. The result is $p(X, Y = y) = N(X; \mu(y), \Sigma(y))\, P(Y = y)$, where $\mu(y)$ is the mean and $\Sigma(y)$ is the covariance matrix for class $y$. It is necessary to estimate the parameters (means, covariances, and probabilities $P(Y = y)$). Usually done by maximum likelihood. Supervised learning: just counting. Unsupervised learning: numerical optimization (usually EM).
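
For the supervised case, maximum likelihood reduces to class frequencies and per-class sample moments. The sketch below assumes a single numeric feature to keep the code short; the function names and data are hypothetical.

```python
import math


def fit_gaussian_classes(records):
    """Maximum likelihood estimates (P(y), mean, variance) for each class."""
    params = {}
    for label in set(y for _, y in records):
        values = [x for x, y in records if y == label]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        params[label] = (len(values) / len(records), mean, variance)
    return params


def log_joint(x, prior, mean, variance):
    """Logarithm of N(x; mean, variance) times the class probability."""
    return (math.log(prior) - 0.5 * math.log(2 * math.pi * variance)
            - (x - mean) ** 2 / (2 * variance))


def classify(x, params):
    """Plug-in rule: choose the class with the largest estimated joint density."""
    return max(params, key=lambda label: log_joint(x, *params[label]))


# Hypothetical labeled data (feature, label).
records = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (9.0, 1)]
params = fit_gaussian_classes(records)
```

The variance estimate divides by the class size rather than the class size minus one, which is the maximum likelihood form; a class with a single record would give zero variance and break `log_joint`.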

28 Mixtures of Gaussians: separation For two labels, the classifier selects label 1 if $r_1^2 < r_0^2 + 2 \log\bigl(P(1)/P(0)\bigr) + \log\bigl(|\Sigma_0| / |\Sigma_1|\bigr)$, where $r_i^2 = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)$. When the covariance matrices are equal, classes are separated by hyperplanes (!). The result is then basically identical to Fisher linear discriminant analysis. The general case leads to quadratic boundaries.

29 Regression Instead of trying to estimate the distribution, one might instead estimate (or approximate) the function $r(X) = P(Y = 1 \mid X)$. The function $r(X)$ is often called the regression function. Common strategy: find parameters $\alpha$ of $r_\alpha$ by least squares, $\min_\alpha \sum_{i=1}^{N} \bigl(y_i - r_\alpha(x_i)\bigr)^2$.

30 Linear and logistic regression Linear regression: $r_\alpha(X) = \alpha_0 + \sum_{j=1}^{n} \alpha_j X_j$. Logistic regression: $r_\alpha(X) = \dfrac{\exp\bigl(\alpha_0 + \sum_{j=1}^{n} \alpha_j X_j\bigr)}{1 + \exp\bigl(\alpha_0 + \sum_{j=1}^{n} \alpha_j X_j\bigr)}$. In these cases the parameters are obtained by numerical optimization.
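
One way to carry out the numerical optimization for logistic regression is sketched below. Note the swapped-in criterion: this sketch uses gradient ascent on the log-likelihood, the usual choice for logistic regression, rather than least squares. The step size, iteration count, and data are assumptions.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(alpha, x):
    """r_alpha(x) with alpha[0] as the intercept."""
    return logistic(alpha[0] + sum(a * v for a, v in zip(alpha[1:], x)))


def fit_logistic(records, steps=2000, rate=0.1):
    """Gradient ascent on the average log-likelihood; records are (features, label)."""
    n = len(records[0][0])
    alpha = [0.0] * (n + 1)
    for _ in range(steps):
        gradient = [0.0] * (n + 1)
        for x, y in records:
            residual = y - predict_probability(alpha, x)
            gradient[0] += residual
            for j, v in enumerate(x):
                gradient[j + 1] += residual * v
        alpha = [a + rate * g / len(records) for a, g in zip(alpha, gradient)]
    return alpha


# Hypothetical data with overlapping classes, so the estimates stay finite.
records = [([0.0], 0), ([1.0], 0), ([1.5], 0), ([1.4], 1), ([2.0], 1), ([3.0], 1)]
alpha = fit_logistic(records)
```

Thresholding `predict_probability(alpha, x)` at 1/2 then gives a plug-in version of the optimal classifier.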

31 Logistic regression and Gaussian mixtures The expressions for logistic regression can be made very similar to classifiers for Gaussian mixtures. However, logistic regression is not affected by the marginal P(X). Logistic regression tends to be better at classification than Gaussian mixtures.

32 Neural networks Neural networks are nonlinear regressors where $r_\alpha(X)$ is $\alpha_0 + \sum_k \alpha_k\, g\bigl(\alpha_{k0} + \sum_{j=1}^{n} \alpha_{kj} X_j\bigr)$, where $g$ is a smooth function. Usually $g(x) = \dfrac{e^x}{1 + e^x}$. The structure can be much more complicated than this. Parameters must be found by optimization.
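
The forward computation of such a regressor, with one layer of hidden units, can be sketched as follows. The parameter names and the weight values are hypothetical, and training (the optimization step) is not shown.

```python
import math


def sigmoid(z):
    """Equal to e^z / (1 + e^z), written in a form that avoids overflow for large z."""
    return 1.0 / (1.0 + math.exp(-z))


def network_output(x, output_bias, output_weights, hidden_biases, hidden_weights):
    """Output bias plus a weighted sum of sigmoid hidden units, each a linear function of x."""
    hidden = [sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
              for bias, weights in zip(hidden_biases, hidden_weights)]
    return output_bias + sum(a * h for a, h in zip(output_weights, hidden))


# Hypothetical parameters: two hidden units, two input features.
value = network_output(
    x=(1.0, 2.0),
    output_bias=0.5,
    output_weights=[1.0, -1.0],
    hidden_biases=[0.0, 0.0],
    hidden_weights=[[1.0, -0.5], [1.0, -0.5]],
)
```

Both hidden units receive the same input here, so their contributions cancel and the output equals the output bias; distinct hidden weights are what make the function nonlinear in `x`.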

33 Support vector machines Consider just two classes (binary classification). Take the classifier $g(X) = \operatorname{sgn}\bigl(\alpha_0 + \sum_{j=1}^{n} \alpha_j X_j\bigr)$. The assumption here is that the separation surface is linear. To select the separating hyperplane, maximize the margin (the minimum distance from the hyperplane to a point in either class). This is a quadratic problem, with an efficient solution (!). The resulting hyperplane is an SVM. It is possible to generalize using kernels, etc.
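
The margin of a candidate hyperplane can be computed directly, as in the sketch below. It only evaluates given hyperplanes and does not solve the quadratic problem; labels are assumed to be coded as -1 and +1, and the data are hypothetical.

```python
import math


def geometric_margin(alpha0, alpha, records):
    """Smallest signed distance from the hyperplane to any record (labels in {-1, +1})."""
    norm = math.sqrt(sum(a * a for a in alpha))
    return min(y * (alpha0 + sum(a * v for a, v in zip(alpha, x))) / norm
               for x, y in records)


# Hypothetical linearly separable data in the plane.
records = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 0.0), 1), ((4.0, 1.0), 1)]

margin_centered = geometric_margin(-2.0, (1.0, 0.0), records)  # hyperplane x1 = 2
margin_shifted = geometric_margin(-1.5, (1.0, 0.0), records)   # hyperplane x1 = 1.5
```

Both hyperplanes classify every record correctly, but the one at x1 = 2 leaves more room between the classes, which is the property a larger margin measures. A negative margin would indicate that some record is on the wrong side.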

34 Classification trees (XLMiner)
