Prototype based methods: Mathematical foundations, interpretability, and data visualization


1 Prototype based methods: Mathematical foundations, interpretability, and data visualization
Barbara Hammer, Xibin Zhu
CITEC Centre of Excellence, Bielefeld University
ijcnn14_tutorial.html

2

3 Why LVQ? [Machine Learning that Matters, Kiri L. Wagstaff, ICML 2012]
... of 152 non-cross-conference papers published at ICML 2011: there is a need for machine learning techniques which facilitate a direct interpretation of the results

4 Why LVQ?
- LVQ is a prime example of a machine learning model which is intuitive and interpretable
- but classical LVQ is a mere heuristic
- this tutorial: modern LVQ variants and their mathematics

5 Prototypes
- prototypes are points in the data space: $\vec w_i \in \mathbb{R}^n$
- they decompose the space into receptive fields: $R(\vec w_i) = \{\vec x \mid \|\vec w_i - \vec x\|^2 \le \|\vec w_j - \vec x\|^2 \ \forall j \ne i\}$
- and thereby induce a classification
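As an illustration, a minimal sketch (not from the slides) of winner determination and nearest-prototype classification with numpy; W, c, and X are assumed arrays of prototypes, prototype labels, and data points:

```python
import numpy as np

def winner(x, W):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return int(np.argmin(((W - x) ** 2).sum(axis=1)))

def classify(X, W, c):
    """Each point receives the label of the prototype whose receptive field it falls into."""
    return np.array([c[winner(x, W)] for x in X])
```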

6 Prototypes
- prototypes offer a sparse encoding
- prototypes represent data
- manual inspection possible

7 Prototypes (figure, WSOM 2005, Paris)

8 Prototypes (figure, WSOM 2005, Paris)

9 Prototype learning
- supervised: classes are known a priori; training set $P = \{(\vec x_i, y_i) \mid i = 1,\dots,p\} \subset \mathbb{R}^n \times \{1,\dots,C\}$; methods: LVQ, GLVQ, RSLVQ, ...
- unsupervised: clusters are not known a priori; methods: NG, GTM, AP, ...
- ... usually a solid mathematical foundation is available

10 LVQ
Learning vector quantization [Kohonen, 1988]
init positions of $\vec w_j$, labels are $c(\vec w_j)$
repeat:
  pick a data point $(\vec x_i, y_i)$ randomly
  determine the winner $\vec w_I$
  if $y_i = c(\vec w_I)$: $\vec w_I \mathrel{+}= \eta\,(\vec x_i - \vec w_I)$
  otherwise: $\vec w_I \mathrel{-}= \eta\,(\vec x_i - \vec w_I)$
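A minimal sketch of this LVQ1 loop; the learning rate eta and the epoch count are hypothetical hyperparameters, W and c as in the previous sketch:

```python
import numpy as np

def train_lvq1(X, y, W, c, eta=0.05, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    W = W.copy()
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):                  # pick data points randomly
            I = np.argmin(((W - X[i]) ** 2).sum(axis=1))   # determine the winner
            if c[I] == y[i]:
                W[I] += eta * (X[i] - W[I])                # attract the correct winner
            else:
                W[I] -= eta * (X[i] - W[I])                # repel the wrong winner
    return W
```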

11 LVQ
LVQ 2.1 [Kohonen, 1990]
init positions of $\vec w_j$, labels are $c(\vec w_j)$
repeat:
  pick a data point $(\vec x_i, y_i)$ randomly
  determine the closest prototype $\vec w^+$ with $y_i = c(\vec w^+)$
  determine the closest prototype $\vec w^-$ with $y_i \ne c(\vec w^-)$
  if the prototypes fall into a window around the decision boundary:
    $\vec w^+ \mathrel{+}= \eta\,(\vec x_i - \vec w^+)$
    $\vec w^- \mathrel{-}= \eta\,(\vec x_i - \vec w^-)$
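A minimal sketch of one LVQ2.1 step including the window rule; the window width w is a hypothetical choice:

```python
import numpy as np

def lvq21_step(x, y, W, c, eta=0.05, w=0.25):
    d = ((W - x) ** 2).sum(axis=1)
    plus = np.argmin(np.where(c == y, d, np.inf))    # closest prototype with correct label
    minus = np.argmin(np.where(c != y, d, np.inf))   # closest prototype with wrong label
    dp, dm = d[plus], d[minus]
    # update only if x falls into a window around the decision boundary
    if min(dp / dm, dm / dp) > (1 - w) / (1 + w):
        W[plus] += eta * (x - W[plus])
        W[minus] -= eta * (x - W[minus])
```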

12

13 Online detection of faults (figure: sensors)

14 Online detection of faults [T. Bojer et al., 2003]
Setting: high-dimensional features, few training data, online training
LVQ: close to 100% accuracy; prototypes can be stored and inspected

15 Clinical proteomics
unhappy because possibly ill ... take serum, put it into a mass spectrometer, and observe a characteristic spectrum which tells us more about the peptides in the serum

16 Clinical proteomics [F.-M. Schleif et al., 2009]
prostate cancer [National Cancer Institute, Prostate Cancer Dataset]:
- 318 examples, SELDI-TOF from blood serum, 130 dim after preprocessing (normalization, peak detection)
- 2 classes (healthy versus cancer in different states)
Accuracy: LVQ 62.5%, GRLVQ 93.7%, SVM 92.7%

17 Steroid metabolomics
unhappy because possibly ill ... take serum, extract steroid markers (32 selected steroid metabolites) by means of GC/MS; ACC / ACA

18 Steroid metabolomics [W. Arlt, M. Biehl et al., 2011] (figure)

19 Object recognition [S. Kirstein, H. Wersing, H.-M. Gross, E. Körner, 2012] (figure)

20 Take home message
- LVQ offers an intuitive classifier with high potential for industrial applications
- interpretability of the technique is a big plus

21 LVQ code
- LVQ_PAK: only basic versions
- included in popular software such as WEKA: only basic versions
- SOM toolbox: also GLVQ, matrix learning
- mloss: also GLVQ, matrix learning
- see also the material at the tutorial web site, in particular for the advanced versions covered in the following

22

23 LVQ
- LVQ 1 does not have a valid cost function: $\sum_i f_{LVQ}(d^+, d^-)$ where $d^\pm = (\vec x_i - \vec w^\pm)^2$ is the squared distance to the closest correct / wrong prototype, and $f_{LVQ}(a, b) = a$ if $a \le b$, else $-b$

24 LVQ2.1
- LVQ2.1 has a valid cost function: $\sum_i f_{LVQ2.1}(d^+, d^-)$ where $d^\pm = (\vec x_i - \vec w^\pm)^2$ is the squared distance to the closest correct / wrong prototype, and $f_{LVQ2.1}(a, b) = (a - b)$ restricted to a window around the decision boundary
- but this is unbounded!

25 LVQ2.1
- behavior without the window in simple model situations: the generalization error of LVQ depends on its initialization, and the result can be far from the optimum [Biehl, Ghosh, Hammer, 2007] (figure, prior probabilities $p^+ > p^-$)
- so a tricky choice of the window is necessary ...

26 More reasonable cost functions for LVQ
- based on margin maximization: GLVQ [Sato/Yamada 1996, Hammer/Villmann 2002, Crammer et al. 2002, Schneider et al. 2009]
- based on probabilistic modeling: RSLVQ [Seo/Obermayer 2003]

27 COLT for LVQ in a nutshell
- function class F given by the possible LVQ networks
- training data $(x_i, y_i)$
- the machine learner yields an LVQ function f in F
- often: $f(x_i) = y_i$ for training points (i.e. small empirical error)
- desired: $P(f(x) = y)$ should be large (i.e. small real error)

28 COLT for LVQ in a nutshell
safe vs. insecure classification
- (hypothesis) margin of $x_i$: $m(x_i) = d^- - d^+$, where $d^+$ / $d^-$ is the squared distance to the closest correct / wrong prototype
- mathematics: the generalization error is bounded by $E/m + O\big(p^2 (B^3 \ln 1/\delta)^{1/2} / (\rho\, m^{1/2})\big)$, where E = number of misclassified training data with margin smaller than $\rho$ (including errors), $\delta$ = confidence, m = number of examples, B = support, p = number of prototypes
- good bounds for few training errors and large margin; the bound does not include the dimensionality

29 COLT for LVQ in a nutshell
the same bound with the terms annotated: $E/m$ is the empirical error term (data with a (too) small margin), and the summand $O\big(p^2 (B^3 \ln 1/\delta)^{1/2} / (\rho\, m^{1/2})\big)$ is the margin term; good bounds for few training errors and large margin, and the bound does not include the dimensionality

30 Margin maximization
- mathematical objective: maximize the margin (figure)

31 Margin maximization
- mathematical objective: $\min \sum_i \big(d^+(\vec x_i) - d^-(\vec x_i)\big)$, but this is unbounded

32 Margin maximization
- mathematical objective: $\min \sum_i \dfrac{d^+(\vec x_i) - d^-(\vec x_i)}{d^+(\vec x_i) + d^-(\vec x_i)}$
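A minimal sketch of this cost; d_plus and d_minus are assumed per-sample arrays of squared distances to the closest correct and closest wrong prototype:

```python
import numpy as np

def glvq_cost(d_plus, d_minus):
    # each summand lies in (-1, 1); negative means correct classification
    mu = (d_plus - d_minus) / (d_plus + d_minus)
    return mu.sum()
```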

33 Generalized LVQ (GLVQ) [Sato/Yamada 1996]
(figure: derivatives of the GLVQ cost)

34 Generalized LVQ (GLVQ)
(figure: derivatives of the GLVQ cost)

35 Generalized LVQ (GLVQ)
(figure: derivatives; the GLVQ update resembles LVQ2.1 with an additional scaling factor)

36 Probabilistic modeling: mixture of Gaussians with labels

37 Robust soft LVQ (RSLVQ)

38 RSLVQ

39 RSLVQ

40 Prototype locations

41 Take home
- LVQ can be substantiated by large margin generalization bounds (independent of the dimensionality)
- LVQ can be based on cost functions:
  - probabilistic modeling: excellent results, but the bandwidth is a very critical parameter (the crisp limit does not perform well) and the prototypes are not always representative
  - margin maximization: very good results, parameters are not critical, prototypes are representative for the data
- this enables stable training and principled mathematical modelling

42

43 Why metric learning?
Example: acceptance of papers at some conference. L - layout, T - technical quality, I - interesting subject, F - famous author, S - appropriate subject, Q - overall quality, P - author registers for conference, E - appropriate length, B - likes beer, P - looks pretty, G - gives good talks, K - knows program committee, M - member of program committee, C - special session, R - has red hair

44 Why metric learning?
- data are usually represented by feature vectors
- feature vectors are compared using the Euclidean distance
- but this might tell you nothing useful; e.g. features (smell, head, belly, human): (42,42,42,0,...) vs. (41,43,44,1,...) vs. (-41,43,44,1,...)

45 Why metric learning?

46 Metric parameterization

47 Metric learning: Generalized Relevance LVQ (GRLVQ)
- mathematical objective: $\min \sum_i \dfrac{d_\lambda^+(\vec x_i) - d_\lambda^-(\vec x_i)}{d_\lambda^+(\vec x_i) + d_\lambda^-(\vec x_i)}$ where $d_\lambda(\vec x, \vec y) = \sum_l \lambda_l (x_l - y_l)^2$
- normalize the relevance terms: relevance learning

48 GRLVQ
- mathematical objective: $\min \sum_i \big(d_\lambda^+(\vec x_i) - d_\lambda^-(\vec x_i)\big) / \big(d_\lambda^+(\vec x_i) + d_\lambda^-(\vec x_i)\big)$ (figure: derivatives)
- intuitive, fast, well founded, flexible, suited for large dimensions

49 GRLVQ
- same objective; the derivatives resemble LVQ2.1 with a scaling factor, plus a relevance update (a sketch follows below)
- intuitive, fast, well founded, flexible, suited for large dimensions
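A minimal sketch of the relevance update under the GRLVQ cost above; x, w_plus, w_minus, and the learning rate eta_lam are assumed, and the gradient is that of $(d_\lambda^+ - d_\lambda^-)/(d_\lambda^+ + d_\lambda^-)$ with respect to each $\lambda_l$:

```python
import numpy as np

def grlvq_relevance_step(lam, x, w_plus, w_minus, eta_lam=0.01):
    d_plus = lam @ (x - w_plus) ** 2        # weighted squared distances
    d_minus = lam @ (x - w_minus) ** 2
    # derivative of (d+ - d-)/(d+ + d-) with respect to each lambda_l
    grad = (2 * d_minus * (x - w_plus) ** 2
            - 2 * d_plus * (x - w_minus) ** 2) / (d_plus + d_minus) ** 2
    lam = np.clip(lam - eta_lam * grad, 0.0, None)   # keep relevances non-negative
    return lam / lam.sum()                           # normalize the relevance terms
```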

50 GRLVQ (figure: 2D data embedded in 10D with noise / noisy copies)

51 Generalized Matrix LVQ (GMLVQ)
Substitute the metric by a general quadratic form: $d_\Lambda(\vec x, \vec w) = (\vec x - \vec w)^T \Lambda\, (\vec x - \vec w)$
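A minimal sketch of this distance with the parametrization $\Lambda = \Omega^T \Omega$ (used on the following slides), which keeps $\Lambda$ positive semidefinite by construction; Omega may be square (full GMLVQ) or rectangular (low rank):

```python
import numpy as np

def gmlvq_distance(x, w, Omega):
    # d_Lambda(x, w) = (x - w)^T Omega^T Omega (x - w)
    diff = Omega @ (x - w)
    return float(diff @ diff)
```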

52 LGMLVQ

53 UCI benchmarks ...

54 Interpretability: Steroid metabolomics [W. Arlt, M. Biehl et al., 2011] (figure)

55

56 GMLVQ yields (local) matrices, i.e. (local) scalings and rotations of the space
- GRLVQ: global scaling
- GMLVQ: global scaling and rotation
- LGMLVQ: local scaling and rotation

57 GMLVQ
- GMLVQ with positive semidefinite matrices: $\Lambda = \Omega^T \Omega$
- quadratic complexity w.r.t. the data dimensionality

58 Low rank GMLVQ
- GMLVQ with positive semidefinite low rank matrices: $\Lambda = \Omega^T \Omega$ with rectangular $\Omega$
- linear complexity w.r.t. the data dimensionality
- equivalent to the full version (if the data are intrinsically low dimensional)

59 Low rank GMLVQ

60 LiRaM LVQ [Bunte et al. 2012]
(figure: global vs. local low rank matrices)
$\Omega$ induces a global projection $f: \vec x \mapsto \Omega\, \vec x$
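A minimal sketch of this induced projection for a rank-2 $\Omega$, which is what enables the discriminative 2D visualization on the next slide; Omega is assumed to come from (Li)RaM LVQ training:

```python
import numpy as np

def project_2d(X, Omega):
    """Map the rows of X into the 2D subspace given by the 2 x n matrix Omega."""
    return X @ Omega.T
```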

61 Discriminative visualization Example: USPS digits

62

63 Stationary solutions of GMLVQ
- assume fixed receptive fields; what is the optimum metric?
- the update of the matrix has the form of a repeated multiplication (the prefactor indicates the sign; x centered at the prototype) plus normalization
- similar to the von Mises iteration
- converges to the first eigenvector of the driving matrix; in particular, convergence to a low rank matrix!
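A minimal sketch of the von Mises (power) iteration that the matrix update resembles: repeated multiplication plus normalization converges to the dominant eigenvector of a symmetric matrix A:

```python
import numpy as np

def power_iteration(A, n_steps=100, seed=0):
    v = np.random.default_rng(seed).normal(size=A.shape[0])
    for _ in range(n_steps):
        v = A @ v                    # repeated multiplication ...
        v /= np.linalg.norm(v)       # ... plus normalization
    return v                         # approximates the first eigenvector of A
```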

64 Stationary solution
(figure: terms contributing with + and terms contributing with -)

65

66 Interpretation of matrix terms
infra-red spectral data: 124 wine samples, 256 wavelengths, 30 training data, 94 test spectra (figure: high / medium / low alcohol content)

67 Interpretation of matrix terms
- often: the diagonal terms are interpreted as relevances
- problem: for high-dimensional data, the classification is identical for all matrices whose differences lie in the null space of $C = X X^T$

68 Interpretation of matrix terms
- dividing out the null space yields the relevance profile
- a direct interpretation of the relevance profile is misleading for high-dimensional data; get rid of the null space first!
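A minimal sketch (an assumption on my part, not the slides' code) of dividing out the null space before reading off the relevance profile; X is the data matrix with rows as samples, Lambda the learned relevance matrix:

```python
import numpy as np

def corrected_relevance_profile(Lambda, X, tol=1e-10):
    C = X.T @ X                                  # data correlation matrix
    eigval, eigvec = np.linalg.eigh(C)
    V = eigvec[:, eigval > tol * eigval.max()]   # orthonormal basis of the data span
    P = V @ V.T                                  # projector that divides out the null space
    return np.diag(P @ Lambda @ P)               # relevance profile after correction
```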

69 Interpretation of matrix terms
(figure: GMLVQ over-fitting effect; best performance with 7 dimensions remaining after null-space correction, P = 30 dimensions)

70 Take home
- metric adaptation increases the accuracy without deteriorating the generalization ability
- a low rank matrix allows efficient training and data visualization, with no restriction as compared to the optimum metric
- interpretation: by looking at the feature weighting; for high-dimensional data, normalization (null-space correction) is necessary

71 Schneider, Biehl, Hammer: ... matrix learning is cool! Neural Computation 2009

72

73 Dissimilarity or similarity data
- feature extraction → vectorial data (size, softness, color, curvature, ...) → (20, 7, ...)
- pairwise (dis)similarity measurement → (dis)similarity matrix

74 (Dis)similarity data
(dis)similarity measures, e.g.:
1. alignment (e.g. of the sequences GTTACAGGT, GGTACACGT, GTGACAAGT)
2. normalized compression distance
3. graph structure kernels
4. ...

75 LVQ for dis-/similarities
- kernel GLVQ (Suganthan et al.)
- differentiable kernel GLVQ (Villmann et al.)
- relational GLVQ / RSLVQ (Zhu et al.)
- kernel RSLVQ (Hofmann et al.)
- ...

76 Relational GLVQ
Assumption: prototypes are expressed as linear combinations $\vec w_i = \sum_j \alpha_{ij} \vec x_j$ where $\sum_j \alpha_{ij} = 1$.
Fact: for every symmetric bilinear form and a linear representation as above we find $\|\vec x_j - \vec w_i\|^2 = (D \alpha_i)_j - \tfrac12\, \alpha_i^T D\, \alpha_i$.
Method: substitute all terms $\|\vec x_j - \vec w_i\|^2$ in the original methods and use this identity.
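A minimal sketch of this identity in numpy; D is the dissimilarity matrix and alpha_i the coefficient vector of prototype $\vec w_i$ (summing to one):

```python
import numpy as np

def relational_distances(D, alpha_i):
    # d(x_j, w_i) = (D alpha_i)_j - 0.5 * alpha_i^T D alpha_i, for all j at once
    return D @ alpha_i - 0.5 * (alpha_i @ D @ alpha_i)
```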

77 Relational GLVQ
assume prototypes have the form above; then the GLVQ costs become ... ugly formulas

78 Benchmark data

79 Similarities/dissimilarities
- Euclidean: $\|\vec x_i - \vec x_j\|^2$ and $\langle \vec x_i, \vec x_j \rangle$; general: $d_{ij} = d(x_i, x_j)$ and $s_{ij} = s(x_i, x_j)$
- assumptions: symmetric, $d_{ij} = d_{ji}$ and $s_{ij} = s_{ji}$; zero diagonal, $d_{ii} = 0$; normalization of s is possible, $s_{ii} = 1$

80 Similarities/dissimilarities
from similarities to dissimilarities: $d_{ij} = s_{ii} - 2 s_{ij} + s_{jj}$

81 Similarities/dissimilarities
from dissimilarities to similarities (double centering): $s_{ij} = -\tfrac12 \big( d_{ij} - \tfrac1n \sum_l d_{il} - \tfrac1n \sum_l d_{lj} + \tfrac{1}{n^2} \sum_{l,l'} d_{ll'} \big)$
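A minimal sketch of both conversions; sim_from_dissim implements the double-centering formula above via the centering matrix $J = I - \mathbf{1}\mathbf{1}^T/n$:

```python
import numpy as np

def dissim_from_sim(S):
    d = np.diag(S)
    return d[:, None] - 2 * S + d[None, :]   # d_ij = s_ii - 2 s_ij + s_jj

def sim_from_dissim(D):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return -0.5 * J @ D @ J                  # double centering
```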

82 Pseudo-Euclidean embedding
$\|\vec x_i - \vec x_j\|^2_{pq} = \|\vec x^1_i - \vec x^1_j\|^2 - \|\vec x^2_i - \vec x^2_j\|^2$ and $\langle \vec x_i, \vec x_j \rangle_{pq} = \langle \vec x^1_i, \vec x^1_j \rangle - \langle \vec x^2_i, \vec x^2_j \rangle$, with $d_{ij} = d(x_i, x_j)$, $s_{ij} = s(x_i, x_j)$
signature $(p, q, n - p - q)$; Euclideanity can be obtained by clip / flip

83 Pseudo-Euclidean space
For every symmetric D, a vector space embedding in pseudo-Euclidean space exists; the symmetric bilinear form induces the dissimilarities. (figure: example embedding with P1=(6.1,1), P2=(-6.1,1), P3=(0.1,0), P4=(-0.1,0), P5=(4,-1), P6=(-4,-1); signs +1 / -1 for the two parts)

84 LVQ for dis-/similarities
- classification based on $\|\vec x_i - \vec w_j\|^2 = \|\vec x_i\|^2 - 2 \langle \vec x_i, \vec w_j \rangle + \|\vec w_j\|^2$
- training optimizes $f\big(\|\vec x_i - \vec w_j\|^2\big)_{i,j}$

85 LVQ for dis-/similarities
- classification and training as before, with prototypes as linear combinations $\vec w_j = \sum_i \gamma_{ji} \vec x_i$
- possible assumptions: $\sum_i \gamma_{ji} = 1$, $\gamma_{ji} \ge 0$

86 LVQ for dis-/similarities
- kernel approach: $\|\vec x_i - \vec w_j\|^2 = s_{ii} - 2 \sum_l \gamma_{jl}\, s_{il} + \sum_{l,l'} \gamma_{jl}\gamma_{jl'}\, s_{ll'}$

87 LVQ for dis-/similarities
- relational approach: $\|\vec x_i - \vec w_j\|^2 = \sum_l \gamma_{jl}\, d_{il} - \tfrac12 \sum_{l,l'} \gamma_{jl}\gamma_{jl'}\, d_{ll'}$ for normalized $\gamma_{jl}$

88 LVQ for dis-/similarities
optimize $f\Big(\sum_l \gamma_{jl}\, d_{il} - \tfrac12 \sum_{l,l'} \gamma_{jl}\gamma_{jl'}\, d_{ll'}\Big)_{i,j}$ or $f\Big(s_{ii} - 2 \sum_l \gamma_{jl}\, s_{il} + \sum_{l,l'} \gamma_{jl}\gamma_{jl'}\, s_{ll'}\Big)_{i,j}$
gradient descent with respect to $\gamma_{jl}$, followed by normalization → relational GLVQ / RSLVQ

89 LVQ for dis-/similarities
gradient descent with respect to the prototype $\vec w_j = \sum_l \gamma_{jl} \vec x_l$: $\partial f\big(\|\vec x_i - \vec w_j\|^2\big) / \partial \vec w_j = -2 f'\, (\vec x_i - \vec w_j) = -2 f'\, \big(\vec x_i - \sum_l \gamma_{jl} \vec x_l\big)$; this can be decomposed into contributions of the coefficients, but only for the Euclidean form! → kernel GLVQ / RSLVQ

90 LVQ for dis-/similarities
- GLVQ: similarities, gradient w.r.t. the coefficients; RSLVQ: dissimilarities, gradient w.r.t. the prototypes
- only in the Euclidean case do the kernel variants resemble the gradient w.r.t. $\vec w$
- large margin generalization bounds; interpretation as likelihood ratio

91 Results

92 Computational effort
Size of the full n x n matrix in double precision ($n^2 \cdot 8$ bytes):
n = 10,000: ~800 MB; n = 20,000: ~3.2 GB; n = 50,000: ~20 GB; n = 200,000: ~320 GB

93 Computational effort?
$\|\vec x_i - \vec w_j\|^2 = s_{ii} - 2 \sum_l \gamma_{jl}\, s_{il} + \sum_{l,l'} \gamma_{jl}\gamma_{jl'}\, s_{ll'} = e_i^T S\, e_i - 2\, e_i^T S\, \gamma_j + \gamma_j^T S\, \gamma_j$
sample m landmarks only; approximate $S \approx S_{n,m}\, S_{m,m}^{-1}\, S_{m,n}$ [Nyström approximation, Williams/Seeger]
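A minimal sketch of the Nyström approximation; in practice one would compute only the n x m landmark block rather than the full S, which this toy version slices out for clarity:

```python
import numpy as np

def nystroem(S, m, seed=0):
    idx = np.random.default_rng(seed).choice(S.shape[0], size=m, replace=False)
    S_nm = S[:, idx]                          # n x m block: all points vs. m landmarks
    S_mm_pinv = np.linalg.pinv(S_nm[idx])     # pseudo-inverse of the m x m landmark block
    return S_nm @ S_mm_pinv @ S_nm.T          # S ~ S_{n,m} S_{m,m}^{-1} S_{m,n}
```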

94 Experiments

95

96 Take home
- there exist cool methods which enable the application of LVQ to similarities / dissimilarities
- quadratic complexity; the Nyström approximation reduces this to linear complexity for low rank data
- metric adaptation is possible in a similar way as for GMLVQ: adapt w.r.t. the similarity/dissimilarity parameters (has been done for the alignment distance → ESANN 14)

97

98 Confidence measures: how certain is a classification for a new point x?

99 Conformal prediction
framework to accompany pointwise classification of online methods by provable guarantees: a classifier trained on N (exchangeable) data points and a conformity measure yield a set of possible labels such that, for a new point, the predicted label set contains the true label with probability at least $1 - \epsilon$ [Shafer & Vovk, 2008]

100 Conformal prediction
- pick a conformity measure
- it induces two terms: credibility (how sure we are that the prediction is correct) and confidence (how sure we are that ALL OTHER labels are incorrect)
- any measure is valid, but some measures are more useful (figure: higher credibility / higher confidence vs. lower credibility / lower confidence)

101 Conformal prediction algorithm [Shafer,Vovk]

102 Simplified conformal prediction
given training data and a new point x:
1. train the model on the training data
2. compute the nonconformity values of the training set
3. for every candidate label y, compute the nonconformity of (x, y)
4. compare the values: the r-value of y is the fraction of training nonconformities at least as large
5. output the label with the best r-value
credibility: largest r-value; confidence: 1 - second largest r-value
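A minimal sketch of these steps; nonconf is a hypothetical nonconformity function (for LVQ, e.g. the ratio $d^+/d^-$), and scores holds the precomputed nonconformity values of the training set:

```python
import numpy as np

def conformal_predict(x, labels, scores, nonconf):
    # r-value of a label: fraction of training points at least as nonconforming
    r = {y: float(np.mean(scores >= nonconf(x, y))) for y in labels}
    ranked = sorted(r.values(), reverse=True)
    y_hat = max(r, key=r.get)
    credibility = ranked[0]          # largest r-value
    confidence = 1.0 - ranked[1]     # 1 - second largest r-value
    return y_hat, credibility, confidence
```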

103 Qualitative result

104 Growing conformal semi-supervised LVQ
given labeled data and unlabeled data:
init the model with a minimum number of prototypes; train the model on the labeled data
Loop:
  predict confidence/credibility on the unlabeled data
  predict labels on the unlabeled data and consider the secure part
  add the part of the unlabeled data with high confidence/credibility to the training set
  identify regions with poor confidence/credibility and generate a new prototype there

105 Growing conformal LVQ

106 Semi-supervised growing conformal LVQ

107 Example evaluations

108 Take home
- conformal prediction enables us to accompany classification results by confidence values
- it can be realised efficiently for LVQ based on distance measures
- it allows incremental versions (also for the relational setting and semi-supervised training)

109

110 Literature
- T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1997.
- T. Kohonen. Learning vector quantization. In: M.A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, 1995.
- M. Biehl, B. Hammer, P. Schneider, T. Villmann. Metric learning for prototype-based classification. In: Innovations in Neural Information Paradigms and Applications, M. Bianchini, M. Maggini, F. Scarselli, L.C. Jain (eds.), Springer Studies in Computational Intelligence, Vol. 247, 2009.
- M. Biehl, B. Hammer, F.-M. Schleif, P. Schneider, T. Villmann. Stationarity of matrix relevance learning vector quantization. Machine Learning Reports 01/2009, Univ. Leipzig, 2009.
- M. Biehl, A. Ghosh, B. Hammer. Dynamics and generalization ability of LVQ algorithms. Journal of Machine Learning Research 8(Feb), 2007.
- W. Arlt, M. Biehl, A.E. Taylor, S. Hahner, R. Libe, B.A. Hughes, P. Schneider, D.J. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat, F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C.H.L. Shackleton, X. Bertagna, M. Fassnacht, P.M. Stewart. Urine steroid metabolomics as a biomarker tool for detecting malignancy in adrenal tumors. Journal of Clinical Endocrinology & Metabolism 96, 2011.
- F.-M. Schleif, T. Villmann, M. Kostrzewa, B. Hammer, A. Gammerman. Cancer informatics by prototype networks in mass spectrometry. Artificial Intelligence in Medicine 45(2-3), 2009.
- S. Kirstein, H. Wersing, H.-M. Gross, E. Körner. A life-long learning vector quantization approach for interactive learning of multiple categories. Neural Networks 28, 2012.
- S. Seo, K. Obermayer. Soft learning vector quantization. Neural Computation 15(7), 2003.
- B. Hammer, D. Hofmann, F.-M. Schleif, X. Zhu. Learning vector quantization for (dis-)similarities. Neurocomputing 131:43-51, 2014.
- M. Strickert, B. Hammer, T. Villmann, M. Biehl. Regularization and improved interpretation of linear data mappings and adaptive distance measures. CIDM 2013:10-17.
- A. Sato, K. Yamada. Generalized learning vector quantization. NIPS 1996.

111 Literature
- B. Mokbel, B. Paassen, B. Hammer. Adaptive distance measures for sequential data. In: M. Verleysen (ed.), ESANN, 2014.
- D. Hofmann, F.-M. Schleif, B. Paassen, B. Hammer. Learning interpretable kernelized prototype-based models. Neurocomputing, accepted, 2013.
- X. Zhu, F.-M. Schleif, B. Hammer. Semi-supervised vector quantization for proximity data. In: ESANN, pages 89-94, 2013.
- F.-M. Schleif, X. Zhu, B. Hammer. Sparse conformal prediction for dissimilarity data. Annals of Mathematics and Artificial Intelligence (AMAI), 2014.
- B. Hammer, D. Hofmann, F.-M. Schleif, X. Zhu. Learning vector quantization for (dis-)similarities. Neurocomputing 131:43-51, 2014.
- X. Zhu, F.-M. Schleif, B. Hammer. Patch processing for relational learning vector quantization. In: J. Wang, G.G. Yen, M.M. Polycarpou (eds.), Advances in Neural Networks - ISNN 2012, 9th International Symposium on Neural Networks, Shenyang, China, July 11-14, Proceedings, Part I, volume 7367, Springer, 2012.
- A. Gisbrecht, B. Mokbel, F.-M. Schleif, X. Zhu, B. Hammer. Linear time relational prototype based learning. Int. J. Neural Syst. 22(5), 2012.
- K. Bunte, P. Schneider, B. Hammer, F.-M. Schleif, T. Villmann, M. Biehl. Limited rank matrix learning, discriminative dimension reduction and visualization. Neural Networks 26, 2012.
- P. Schneider, K. Bunte, H. Stiekema, B. Hammer, T. Villmann, M. Biehl. Regularization in matrix relevance learning. IEEE Transactions on Neural Networks 21, 2010.
- M. Biehl, B. Hammer, F.-M. Schleif, P. Schneider, T. Villmann. Stationarity of matrix relevance learning vector quantization. Technical Report 01/2009, University of Leipzig, 2009.
- P. Schneider, M. Biehl, B. Hammer. Adaptive relevance matrices in learning vector quantization. Neural Computation 21(12), 2009.
- K. Crammer, R. Gilad-Bachrach, A. Navot, N. Tishby. Margin analysis of the LVQ algorithm. NIPS 2002.
- G. Shafer, V. Vovk. A tutorial on conformal prediction. JMLR 9, 2008.
