Introduction to Machine Learning
1 Introduction to Machine Learning
Felix Brockherde (1,2), Kristof Schütt (1)
(1) Technische Universität Berlin, (2) Max Planck Institute of Microstructure Physics
IPAM Tutorial 2013
2 What is Machine Learning?
Data with Pattern -> ML Algorithm -> ML Model -> Inferred Structure
ML is about learning structure from data.
3 Examples
- Drug discovery
- Face recognition
- BCI
- Recommender systems
- Search engines
- DNA splice site detection
- Speech recognition
4 This Talk
Part 1: Learning Theory and Supervised ML
- Basic Ideas of Learning Theory
- Support Vector Machines
- Kernels
- Kernel Ridge Regression
Part 2: Unsupervised ML and Application
- PCA
- Model Selection
- Feature Representation
Not covered: Probabilistic Models, Neural Networks, Online Learning, Reinforcement Learning, Semi-supervised Learning, etc.
5 Supervised Learning
Classification: $y_i \in \{-1, +1\}$; Regression: $y_i \in \mathbb{R}$.
Given: points $X = (x_1, \ldots, x_N)$ with $x_i \in \mathbb{R}^d$ and labels $Y = (y_1, \ldots, y_N)$, generated by some joint probability distribution.
Learn the underlying unknown mapping $f(x) = y$.
Important: performance on unseen data.
6 Basic Ideas in Learning Theory
Risk minimization (RM)
Learn a model function $f$ from examples $(x_1, y_1), \ldots, (x_N, y_N) \in \mathbb{R}^d \times \mathbb{R}$ (or $\times \{+1, -1\}$), generated from $P(x, y)$, such that the expected error on test data (drawn from $P(x, y)$),
$R[f] = \int \frac{1}{2}\,|f(x) - y|^2 \, dP(x, y),$
is minimal.
Problem: the distribution $P(x, y)$ is unknown.
Empirical Risk Minimization (ERM)
Replace the average over $P(x, y)$ by the average over the training samples (i.e. minimize the training error):
$R_{emp}[f] = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2}\,|f(x_i) - y_i|^2$
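To make ERM concrete, here is a minimal sketch (not from the original slides; NumPy assumed, data and candidate model invented for illustration) that evaluates the empirical risk of a classifier on a training sample:

import numpy as np

def empirical_risk(f, X, y):
    # R_emp[f] = (1/N) sum_i 1/2 |f(x_i) - y_i|^2; for labels in {-1, +1}
    # each misclassification contributes 1/2 * 2^2 = 2 to the sum.
    preds = np.array([f(x) for x in X])
    return np.mean(0.5 * (preds - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # toy inputs (invented)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])             # toy labels in {-1, +1}
f = lambda x: np.sign(0.9 * x[0] + 0.6 * x[1])   # some candidate classifier
print(empirical_risk(f, X, y))                   # training error only; says nothing about R[f]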
7–8 Law of large numbers: $R_{emp}[f] \to R[f]$ as $N \to \infty$.
Question: Does $\min_f R_{emp}[f]$ give us $\min_f R[f]$ for sufficiently large $N$?
No: uniform convergence is needed.
Error bound for classification
With probability of at least $1 - \eta$:
$R[f] \le R_{emp}[f] + \sqrt{\frac{D\,(\log\frac{2N}{D} + 1) - \log(\frac{\eta}{4})}{N}}$
where $D$ is the VC dimension (Vapnik and Chervonenkis (1971)).
Introduce structure on the set of possible functions and use Structural Risk Minimization (SRM).
9 The linear function class (in $\mathbb{R}^2$) has VC dimension $D = 3$.
SRM: $\min_f R_{emp}[f] + \text{Complexity}[f]$
10–13 Support Vector Machines (SVM)
(Figure build-up slides.)
14 Support Vector Machines (SVM)
Hyperplane $\{x \mid w \cdot x + b = 0\}$ with margin hyperplanes $\{x \mid w \cdot x + b = +1\}$ and $\{x \mid w \cdot x + b = -1\}$.
Normalize $w$ so that $\min_i |w \cdot x_i + b| = 1$. Then for points on the two margins,
$w \cdot x_1 + b = +1$ and $w \cdot x_2 + b = -1$,
so $w \cdot (x_1 - x_2) = 2$ and $\frac{w}{\|w\|} \cdot (x_1 - x_2) = \frac{2}{\|w\|}$,
i.e. the margin width is $\frac{2}{\|w\|}$.
15 VC Dimension of Hyperplane Classifiers
Theorem (Cortes and Vapnik (1995)): Hyperplanes in canonical form have VC dimension
$D \le \min\{R^2 \|w\|^2 + 1,\; N + 1\}$
where $R$ is the radius of the smallest sphere containing the data.
SRM bound: $R[f] \le R_{emp}[f] + \sqrt{\frac{D\,(\log\frac{2N}{D} + 1) - \log(\frac{\eta}{4})}{N}}$
Maximal margin = minimal $\|w\|^2$ = good generalization, i.e. low risk:
$\min_{w,b} \|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for $i = 1, \ldots, N$
16–17 Slack variables
Introduce slack variables $\xi_i$ to allow margin violations:
$\min_{w,b,\xi} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
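As a practical illustration (a sketch, not part of the slides; scikit-learn and a made-up toy data set assumed), the trade-off controlled by $C$ can be seen directly in a soft-margin SVM:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C: wide margin, many slack violations tolerated.
# Large C: narrow margin, violations penalized heavily.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)  # number of support vectors per class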
18–19 Non-linear hyperplanes
Map into a higher-dimensional feature space:
$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$
20 Dual SVM
Primal:
$\min_{w,b,\xi} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to $y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1, \ldots, N$
Dual:
$\max_\alpha \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (\Phi(x_i) \cdot \Phi(x_j))$
subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$
Data points $x_i$ only appear in scalar products $(\Phi(x_i) \cdot \Phi(x_j))$.
21 The Kernel Trick
Replace scalar products with a kernel function (Müller et al. (2001)): $k(x, y) = \Phi(x) \cdot \Phi(y)$
- Compute the kernel matrix $K_{ij} = k(x_i, x_j)$, i.e. never use $\Phi$ directly.
- The underlying mapping $\Phi$ can be unknown.
- Kernels can be adapted to a specific task, e.g. using prior knowledge (kernels for graphs, trees, strings, ...).
Common kernels:
- Gaussian kernel: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
- Linear kernel: $k(x, y) = x \cdot y$
- Polynomial kernel: $k(x, y) = (x \cdot y + c)^d$
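A minimal sketch (NumPy only; toy data invented) of computing the Gaussian kernel matrix $K_{ij} = k(x_i, x_j)$ directly, without ever forming $\Phi$:

import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)) for rows of X and Y
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(X, X)
print(K.shape, np.allclose(K, K.T))  # (5, 5) True: symmetric, ones on the diagonal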
22 The Support Vectors in SVM
$\max_\alpha \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (\Phi(x_i) \cdot \Phi(x_j))$
subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$
KKT conditions:
$y_i [w \cdot \Phi(x_i) + b] > 1 \Rightarrow \alpha_i = 0$: $x_i$ is irrelevant
$y_i [w \cdot \Phi(x_i) + b] = 1 \Rightarrow x_i$ lies on/in the margin: $x_i$ is a Support Vector
The old model $f(x) = w \cdot \Phi(x) + b$ becomes, via $w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$:
$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b = \sum_{x_i \in SV} \alpha_i y_i k(x_i, x) + b$
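To see the support-vector expansion in practice, here is a hedged sketch (scikit-learn assumed; in sklearn's SVC, dual_coef_ holds the products $\alpha_i y_i$ for the support vectors) that reconstructs $f(x)$ by hand and checks it against the library's decision function:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))                     # toy data (invented)
y = np.sign(X[:, 0] ** 2 + X[:, 1] - 0.5)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum over support vectors of alpha_i y_i k(x_i, x) + b
sv = X[clf.support_]                             # only support vectors enter the sum
K = rbf_kernel(X[:5], sv, gamma=0.5)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(f_manual, clf.decision_function(X[:5])))  # True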
23 Kernel Ridge Regression (KRR)
Ridge Regression:
$\min_w \sum_{i=1}^{N} |y_i - w \cdot x_i|^2 + \lambda \|w\|^2$
Setting the derivative to zero gives
$w = \left(\lambda I + \sum_{i=1}^{N} x_i x_i^\top\right)^{-1} \sum_{i=1}^{N} y_i x_i$
Linear model: $f(x) = w \cdot x$
24 Kernelizing Ridge Regression
Setting $X = (x_1, \ldots, x_N) \in \mathbb{R}^{d \times N}$ and $Y = (y_1, \ldots, y_N) \in \mathbb{R}^N$:
$w = (\lambda I + X X^\top)^{-1} X Y$
Apply the Woodbury matrix identity:
$w = X (X^\top X + \lambda I)^{-1} Y$
Introduce $\alpha$:
$\alpha = (K + \lambda I)^{-1} Y$ and $w = \sum_{i=1}^{N} \Phi(x_i) \alpha_i$
Kernel model: $f(x) = w \cdot \Phi(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x)$
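The closed form above translates directly into code. A self-contained sketch (NumPy; the sine toy problem and all parameter values are invented for illustration):

import numpy as np

def gaussian_kernel(X, Y, sigma):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

sigma, lam = 1.0, 1e-2
K = gaussian_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lambda I)^{-1} Y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = gaussian_kernel(X_test, X, sigma) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)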
25 Unsupervised Learning
- Learn structure from unlabeled data
- Fit an assumed model / distribution to the data
Examples: clustering, blind source separation, outlier detection, dimensionality reduction
(Figure: illustration of the K-means algorithm on the re-scaled Old Faithful data set, showing the initial cluster centres, alternating E and M steps, and convergence; Bishop, Fig. 9.1.)
26 Principal Component Analysis (PCA)
Given a centered data matrix $X = (x_1, \ldots, x_N) \in \mathbb{R}^{N \times D}$:
- best linear approximation: $w_1 = \arg\min_{\|w\|=1} \|X - X w w^\top\|^2$
- direction of largest variance: $w_1 = \arg\max_{\|w\|=1} \|X w\|^2$
- matrix deflation for further components: $X_{k+1} = X_k - X_k w w^\top$
Pearson (1901)
27–28 Principal Component Analysis (PCA)
Given a centered data matrix $X \in \mathbb{R}^{N \times D}$:
- decompose the correlated data matrix into uncorrelated, orthogonal PCs
- diagonalize the covariance matrix $\Sigma = \frac{1}{N} X^\top X$: $\Sigma w_k = \sigma_k^2 w_k$
- order the principal components $w_k$ by variance $\sigma_k^2$
- project the data onto the first $n$ principal components
What about nonlinear correlations?
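A short sketch of the eigendecomposition view of PCA (NumPy; random toy data invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                     # PCA assumes centered data

Sigma = X.T @ X / len(X)                   # covariance matrix, D x D
var, W = np.linalg.eigh(Sigma)             # eigh returns ascending eigenvalues
order = np.argsort(var)[::-1]              # order PCs by variance, largest first
var, W = var[order], W[:, order]

n = 2
Z = X @ W[:, :n]                           # project onto the first n PCs
print(Z.shape)                             # (200, 2)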
29–34 Kernel Principal Component Analysis (kPCA)
Transformation to feature space $X \to X_f$:
$\Sigma_f = \frac{1}{N} X_f^\top X_f, \quad K = X_f X_f^\top, \quad K_{ij} = k(x_i, x_j)$
$\Sigma_f w_k = \sigma_k^2 w_k \;\Leftrightarrow\; X_f^\top X_f w_k = N \sigma_k^2 w_k$
Ansatz $w_k = X_f^\top \alpha_k$:
$X_f^\top X_f X_f^\top \alpha_k = N \sigma_k^2 X_f^\top \alpha_k$
Multiply by $X_f$ from the left:
$K^2 \alpha_k = N \sigma_k^2 K \alpha_k$
Multiply by $K^{-1}$:
$K \alpha_k = N \sigma_k^2 \alpha_k$
35 Kernel Principal Component Analysis (kPCA)
Projection:
$x_f \cdot w_k = x_f \cdot X_f^\top \alpha_k = \sum_{i=1}^{N} \alpha_{k,i}\, k(x, x_i)$
Schölkopf et al. (1997)
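A sketch of the full kPCA pipeline (NumPy; toy data invented). Note one assumption beyond the slides: in practice $K$ must be centered in feature space, which is added here:

import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

K = gaussian_kernel(X, X)
N = len(X)
J = np.eye(N) - np.ones((N, N)) / N
Kc = J @ K @ J                                 # center K in feature space (not on the slides)

eig, A = np.linalg.eigh(Kc)                    # solves K alpha_k = N sigma_k^2 alpha_k
order = np.argsort(eig)[::-1]                  # largest variance first
eig, A = eig[order], A[:, order]

A2 = A[:, :2] / np.sqrt(eig[:2])               # scale alpha_k so that ||w_k|| = 1
Z = Kc @ A2                                    # projections sum_i alpha_{k,i} k(., x_i)
print(Z.shape)                                 # (100, 2)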
36 Model Selection
- Find the model that best fits the data distribution
- We can only estimate this distribution
- Consider the noise ratio / distribution and the data correlation
37 Hyperparameters
- adjust model complexity: regularization, kernel parameters, etc.
- have to be tuned using examples not used for training
- standard solution: exhaustive search over a parameter grid
Example (figure: train/test fit of $f(x) = \sin(x)$) with Gaussian-kernel KRR:
$f(x) = \sum_i \alpha_i \exp\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right), \quad \alpha = (K + \tau I)^{-1} y$
38 Grid Search
(Figures: RMSE over the $(\sigma, \tau)$ grid and the resulting fits $f(x)$ for selected parameter values.)
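A bare-bones version of such a grid search for the KRR example above (NumPy; grid values, train/validation split, and toy data all invented for illustration; the kernel uses the slide's $\exp(-\|x - x_i\|^2 / \sigma^2)$ convention):

import numpy as np

def kern(X, Y, sigma):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)            # matches the slide's kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]   # hold out validation data

best = (np.inf, None)
for sigma in [0.1, 0.3, 1.0, 3.0]:
    for tau in [1e-4, 1e-2, 1e0]:
        alpha = np.linalg.solve(kern(Xtr, Xtr, sigma) + tau * np.eye(40), ytr)
        pred = kern(Xva, Xtr, sigma) @ alpha
        rmse = np.sqrt(np.mean((pred - yva) ** 2))
        best = min(best, (rmse, (sigma, tau)))
print(best)  # lowest validation RMSE and the (sigma, tau) that achieved it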
39–40 k-fold cross-validation
Split the data:
- model selection: training/test folds, 4x, inner loop
- evaluation: training/test folds, 5x, outer loop
Don't even think about looking at the test set!
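A compact sketch of this nested scheme with scikit-learn (assumed available; parameter values are placeholders): the 4x inner loop does model selection, the 5x outer loop does the evaluation, so the outer test folds are never seen during tuning.

from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

param_grid = {"alpha": [1e-3, 1e-1, 1e1], "gamma": [1e-2, 1e-1, 1e0]}
inner = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=4)   # model selection
scores = cross_val_score(inner, X, y, cv=5)                         # evaluation
print(scores.mean())   # test folds were never touched by the inner loop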
41 From objects to vectors
How to represent complex objects for kernel methods?
- explicit map to a vector space $\phi: M \to \mathbb{R}^n$, then use a standard kernel (e.g., linear, polynomial, Gaussian) $k: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ on the mapped features
- direct use of a kernel function $k: M \times M \to \mathbb{R}$
42 Feature Representation
Given a physical object (molecule, crystal, etc.) and a property of interest, what is a good ML representation?
- no loss of valuable information
- support generalization
- remove invariances
- decompose the problem
- incorporation of domain knowledge
Depends on the data set, the target function, and the learning method.
43 Feature Representation - Molecules
Coulomb matrix:
$C_{ij} = \begin{cases} 0.5\, Z_i^{2.4} & \text{if } i = j \\ \frac{Z_i Z_j}{\|r_i - r_j\|} & \text{if } i \ne j \end{cases}$
(Rupp et al., 2012; Montavon et al., 2012)
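A direct transcription of the definition into code (NumPy; the water-like geometry and unit conventions are invented for illustration):

import numpy as np

def coulomb_matrix(Z, R):
    # C_ij = 0.5 Z_i^2.4 on the diagonal, Z_i Z_j / ||r_i - r_j|| off it
    n = len(Z)
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

Z = np.array([8, 1, 1])                          # nuclear charges: O, H, H
R = np.array([[0.0, 0.0, 0.0],                   # toy coordinates
              [0.96, 0.0, 0.0],
              [-0.24, 0.93, 0.0]])
print(coulomb_matrix(Z, R))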
44 Feature Representation - Molecules
PCA of Coulomb matrices with atom permutations
Montavon et al. (2013)
45 Results - Molecules
46 Feature Representation - Crystals
Features $g_{\alpha\beta}(r)$ tabulated per element pair and radius:

element pair | r_1       | ... | r_n
α α          | g_αα(r_1) | ... | g_αα(r_n)
α β          | g_αβ(r_1) | ... | g_αβ(r_n)
β α          | g_βα(r_1) | ... | g_βα(r_n)
β β          | g_ββ(r_1) | ... | g_ββ(r_n)
47 Results - Crystals
Learning curve of predictions of the DOS at the Fermi energy.
K.T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K.-R. Müller, E.K.U. Gross. How to represent crystal structures for machine learning: towards fast prediction of electronic properties. arXiv, 2013.
48–52 Machine Learning ...
... has been successfully applied to various research fields.
... is based on statistical learning theory.
... provides fast and accurate predictions on previously unseen data.
... is able to model non-linear relationships of high-dimensional data.
Feature representation is key!
53 Literature I
- Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3).
- Montavon, G., Hansen, K., Fazli, S., Rupp, M., Biegler, F., Ziehe, A., Tkatchenko, A., von Lilienfeld, O. A., and Müller, K.-R. (2012). Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems.
- Montavon, G., Rupp, M., Gobre, V., Vazquez-Mayagoitia, A., Hansen, K., Tkatchenko, A., Müller, K.-R., and von Lilienfeld, O. A. (2013). Machine learning of molecular electronic properties in chemical compound space. arXiv preprint.
- Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2).
- Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11).
- Rupp, M., Tkatchenko, A., Müller, K.-R., and von Lilienfeld, O. A. (2012). Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5).
- Schölkopf, B., Smola, A., and Müller, K.-R. (1997). Kernel principal component analysis. In Artificial Neural Networks, ICANN'97. Springer.
- Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2).