Intelligent Data Analysis & Data Mining, a.k.a. DM2 2007/2008. Alfredo Vellido. Probability & Data Mining 3. Bayesian Neural Networks & Latent models


2 Contents of the course (hopefully)
1. Introduction to DM and its methodologies
2. Visual DM: Exploratory DM through visualization
3. Pattern recognition 1
4. Pattern recognition 2
5. Feature extraction
6. Feature selection
7. Error estimation
8. Linear classifiers, kernels and SVMs
9. Probability in Data Mining
10. Latency, generativity, manifolds and all that SML
11. Applications of GTM: from medicine to ecology
12. DM Case studies

3 Bayes & supervised neural networks

4 Recap: Bayesian Neural Networks
Probability theory should lie at the foundation of any learning algorithm; otherwise, the reasoning it performs risks being inconsistent in some cases (heuristics).
Embedding probability theory into machine learning techniques requires modeling assumptions to be made explicit; it also automatically satisfies the likelihood principle and provides a natural framework to handle uncertainty (what type of uncertainty?).
Probability theory is ideally suited as a theoretical foundation for pattern recognition.

5 Step back: plain Neural Networks
From a simple perceptron to the MLP.
MLP: two- and three-layered networks as universal approximators.
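As a reminder of what a plain MLP computes, here is a minimal sketch of the forward pass of a two-layer network. The tanh hidden units, the linear output, the layer sizes and all variable names are assumptions made for illustration; the equations of the original slide are not available.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: tanh hidden units followed by a linear output layer."""
    hidden = np.tanh(x @ W1 + b1)   # hidden-unit activations
    return hidden @ W2 + b2         # linear outputs (regression setting)

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 3, 5, 1   # hypothetical sizes

W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)

X = rng.normal(size=(4, n_inputs))        # four example input vectors
Y = mlp_forward(X, W1, b1, W2, b2)
print(Y.shape)
```

A single perceptron is the special case with no hidden layer; the hidden layer is what gives the MLP its approximation power.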

6 Step back: plain Neural Networks

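7 Bayesian Neural Networks (3)
The Bayesian framework works at different levels of the development of a supervised neural network:
Models: $P(H|D) = \frac{P(D|H)\,P(H)}{P(D)}$
Hyperparameters: $P(\alpha,\beta|D) = \frac{P(D|\alpha,\beta)\,P(\alpha,\beta)}{P(D)}$
Weights: $P(\omega|D) = \frac{P(D|\omega)\,P(\omega)}{P(D)}$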

8 Bayesian Neural Networks (7)
The Bayesian framework works at different levels of the development of a supervised neural network:
$$P(\omega|D) = \frac{P(D|\omega)\,P(\omega)}{P(D)}$$
We start from an a priori distribution of the weights
$$P(\omega) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W), \quad \text{where} \quad E_W = \frac{1}{2}\sum_{i=1}^{W} \omega_i^2$$
This corresponds to the use of a weight decay regularizer.

9 Bayesian Neural Networks (8)
The Bayesian framework works at different levels of the development of a supervised neural network:
$$P(\omega|D) = \frac{P(D|\omega)\,P(\omega)}{P(D)}$$
starting from an expression for the likelihood
$$P(D|\omega) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \quad \text{where} \quad E_D = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n;\omega) - t_n\}^2$$
and assuming that the targets have been generated by a smooth function with added Gaussian noise (whose inverse variance is beta!).

10 Bayesian Neural Networks (9)
The Bayesian framework works at different levels of the development of a supervised neural network:
$$P(\omega|D) = \frac{P(D|\omega)\,P(\omega)}{P(D)}$$
Then, the a posteriori probability is proportional to:
$$P(\omega|D) \propto \exp\left(-(\beta E_D + \alpha E_W)\right)$$
Notice (!!!) that its maximization is almost equivalent to the ML solution for large data sets.

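11 Bayesian Neural Networks (10)
The Bayesian framework works at different levels of the development of a supervised neural network:
$$P(\omega|D) = \frac{P(D|\omega)\,P(\omega)}{P(D)}$$
In what sense do we say that the maximization is almost equivalent to the ML solution for large data sets?
The most probable (likely) solution for the weights, $\omega_{MP}$, corresponds to the minimum of the error
$$S(\omega) = \beta E_D + \alpha E_W = \frac{\beta}{2}\sum_{n=1}^{N} \{y(x_n;\omega) - t_n\}^2 + \frac{\alpha}{2}\sum_{i=1}^{W} \omega_i^2$$
and to the maximum of the posterior probability
$$P(\omega|D) \propto \exp\left(-(\beta E_D + \alpha E_W)\right)$$
The data term $E_D$ grows with the number of patterns $N$ whereas the regularization term $E_W$ does not, so for large data sets the likelihood term dominates and $\omega_{MP}$ approaches the ML solution.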
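The relation between the most probable weights and the ML solution can be checked numerically. The sketch below swaps the neural network for a model that is linear in its weights (an assumption made so that both solutions have closed forms), with a Gaussian prior of hyperparameter alpha and Gaussian noise of inverse variance beta. The data, the true weights and the function name are hypothetical.

```python
import numpy as np

def map_ml_gap(n_points, alpha=1.0, beta=4.0, seed=0):
    """Distance between the MAP and ML weights of a linear model y = w0 + w1*x."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n_points)
    Phi = np.column_stack([np.ones(n_points), x])        # design matrix
    noise = rng.normal(scale=beta ** -0.5, size=n_points)
    t = Phi @ np.array([1.0, 2.0]) + noise               # hypothetical true weights
    # ML: minimizes E_D only.
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
    # MAP: minimizes beta*E_D + alpha*E_W (weight decay).
    w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(2),
                            beta * Phi.T @ t)
    return np.linalg.norm(w_map - w_ml)

gaps = {n: map_ml_gap(n) for n in (10, 100, 10000)}
for n, gap in gaps.items():
    print(n, gap)
```

The gap shrinks as the data set grows because the data term scales with the number of points while the prior term stays fixed.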

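12 Bayesian Neural Networks (11)
The Bayesian framework works at different levels of the development of a supervised neural network:
Distribution of the network outputs:
$$p(t|x,D) = \int p(t|x,\omega)\,P(\omega|D)\,d\omega$$
Error bars in regression:
$$\sigma_t^2 = \frac{1}{\beta} + g^T A^{-1} g$$
where $g = \nabla_\omega y|_{\omega_{MP}}$ and $A$ is the Hessian of $S(\omega) = \beta E_D + \alpha E_W$ evaluated at $\omega_{MP}$.
Two terms: one from the intrinsic noise, another from the width of the posterior distribution of the weights.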
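A small numerical sketch of the two-term error bar, assuming the standard Gaussian-approximation result (as in Bishop, Ch. 10) that the predictive variance is 1/beta plus g^T A^{-1} g, with g the gradient of the network output with respect to the weights and A the Hessian of the regularized error. The values of g, A and beta below are hypothetical.

```python
import numpy as np

def predictive_variance(g, A, beta):
    """Predictive variance: intrinsic noise term plus weight-uncertainty term."""
    noise_term = 1.0 / beta
    weight_term = g @ np.linalg.solve(A, g)   # g^T A^{-1} g without forming A^{-1}
    return noise_term + weight_term

g = np.array([1.0, 1.0])     # hypothetical output gradient at the most probable weights
A = 2.0 * np.eye(2)          # hypothetical Hessian of the regularized error
beta = 4.0                   # hypothetical inverse noise variance

variance = predictive_variance(g, A, beta)
print(variance)              # 0.25 (noise) + 1.0 (weights) = 1.25
```

A sharply peaked posterior (large A) shrinks the second term, leaving only the intrinsic noise.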

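13 Bayesian Neural Networks (12)
The Bayesian framework works at different levels of the development of a supervised neural network:
$$P(\alpha,\beta|D) = \frac{P(D|\alpha,\beta)\,P(\alpha,\beta)}{P(D)}$$
Variance and regularization hyperparameters can be calculated by differentiating with respect to the hyperparameters, obtaining iterative formulae for their estimation:
$$\alpha^{new} = \frac{\gamma}{2E_W}, \qquad \beta^{new} = \frac{N-\gamma}{2E_D}, \qquad \gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha}$$
where $\lambda_i$ are the eigenvalues of the Hessian of $\beta E_D$ evaluated at $\omega_{MP}$.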

14 Remember: Over- and under-fitting
A STEP ASIDE: OVERFITTING
When fitting a model to noisy data (ALWAYS), we assume that the data have been generated from some TRUE model by making predictions at given values of the inputs and then adding some amount of noise to each point, where the noise is drawn from a normal distribution with an unknown variance. Our task is to discover both this model and the width of the noise distribution.
In doing so, we aim for a compromise between bias, where our model does not follow the right trend in the data (and so does not match the underlying truth well), and variance, where our model fits the data points too closely, fitting the noise rather than capturing the true distribution. These two extremes are known as underfitting and overfitting.
IMPORTANT: the number of parameters in a model matters; the higher it is, the more closely the model can fit the data. If the number of parameters in our model is larger than that of the true one, we risk overfitting; if our model contains fewer parameters than the truth, we could underfit.

15 Over- and under-fitting (2)
A STEP ASIDE: OVERFITTING
The illustration shows how increasing the number of parameters in the model can result in overfitting. The 9 data points are generated from a cubic polynomial, which contains 4 parameters (the true model), with added noise. We can see that by selecting candidate models containing more parameters than the truth, we can reduce, and even eliminate, any mismatch between the data points and our model. The mismatch vanishes when the number of parameters is the same as the number of data points (an 8th order polynomial has 9 parameters).
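The experiment described in this slide can be reproduced with a few lines of NumPy. This is a sketch under assumptions: the cubic coefficients, the noise level and the random seed are hypothetical, not those of the original illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 9)                  # 9 input values
true_coeffs = [1.0, -2.0, 0.5, 3.0]            # hypothetical cubic, lowest degree first
y = P.polyval(x, true_coeffs) + rng.normal(scale=0.2, size=x.size)

def training_rmse(degree):
    """Fit a polynomial of the given degree and return its error on the 9 points."""
    coeffs = P.polyfit(x, y, degree)
    residuals = y - P.polyval(x, coeffs)
    return np.sqrt(np.mean(residuals ** 2))

errors = {degree: training_rmse(degree) for degree in (1, 3, 8)}
for degree, rmse in errors.items():
    print(f"degree {degree}: {degree + 1} parameters, training RMSE = {rmse:.2e}")
```

The 8th order polynomial drives the training error to numerical zero because it interpolates the noisy points; that says nothing about how well it matches the underlying cubic between them.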

16 Bayesian Neural Networks (13)
What else does this Bayesian formalism offer us?
Variable selection: Automatic Relevance Determination (ARD). Regularization terms are associated to each network input. The a priori distribution of the weights is defined as:
$$P(\omega) = \frac{1}{Z_W} \exp\left(-\sum_{c} \alpha_c E_{W_c}\right)$$
where $E_{W_c}$ is the weight decay term for the group of weights $c$.
Automatic regularization: the regularization coefficients associated to irrelevant inputs are inferred to have high values, which make the corresponding weights tend to 0. This can be interpreted as soft input pruning. Inspection of the final values $\{\alpha_c\}$ indicates the relative relevance of each variable.

17 Bayesian Neural Networks (14)
Bayesian formalism in a nutshell
1. Initialization of the hyperparameters α and β, and initialization of the weights with values obtained from the prior distribution.
2. Network training using standard optimization techniques to minimize S(ω) = βE_D + αE_W.
3. Every few iterations of the algorithm, re-estimate α and β, which entails the evaluation of the Hessian of the error function and the extraction of its eigenvalues.
4. [Optionally, repeat the three previous steps to limit the negative effect of local minima of the error function.]
5. Repeat the previous steps for a sample of different possible models or, alternatively, use a committee of networks or select models with high evidence to define a mixture of experts.
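The train-then-re-estimate loop can be sketched as follows. To keep the example short and exact, the network is replaced by a model that is linear in its weights (an assumption, not the slide's setting), so that "training" has a closed form and the Hessian of the data term is simply Phi^T Phi. The re-estimation formulae are the standard evidence-framework ones (alpha = gamma / 2E_W, beta = (N - gamma) / 2E_D); the data, function name and initial values are hypothetical.

```python
import numpy as np

def evidence_procedure(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Alternate between finding w_MP and re-estimating alpha and beta."""
    n_points, n_weights = Phi.shape
    gram = Phi.T @ Phi                                   # Hessian of E_D (linear model)
    gram_eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    for _ in range(n_iter):
        # "Training": minimize S(w) = beta*E_D + alpha*E_W (closed form here).
        A = beta * gram + alpha * np.eye(n_weights)      # Hessian of S
        w_mp = np.linalg.solve(A, beta * Phi.T @ t)
        E_D = 0.5 * np.sum((Phi @ w_mp - t) ** 2)
        E_W = 0.5 * np.sum(w_mp ** 2)
        # Re-estimation from the eigenvalues of the Hessian of beta*E_D.
        lam = beta * gram_eigvals
        gamma = np.sum(lam / (lam + alpha))              # well-determined parameters
        alpha = gamma / (2.0 * E_W)
        beta = (n_points - gamma) / (2.0 * E_D)
    return w_mp, alpha, beta, gamma

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
Phi = np.vander(x, 6, increasing=True)                   # polynomial features up to x^5
t = 1.0 - 2.0 * x + 3.0 * x ** 3 + rng.normal(scale=0.1, size=x.size)

w_mp, alpha, beta, gamma = evidence_procedure(Phi, t)
noise_std_estimate = beta ** -0.5
print(alpha, beta, gamma, noise_std_estimate)
```

For a real MLP, the closed-form solve would be replaced by a nonlinear optimizer and the Hessian would have to be evaluated at the current weights, which is where local minima (and the optional restarts) come in.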

18 Bayesians
C.M. Bishop, Neural Networks for Pattern Recognition (Ch. 10). Oxford University Press.
C.M. Bishop, Pattern Recognition and Machine Learning. Springer Verlag.
D.J. MacKay, Information Theory, Inference & Learning Algorithms. Cambridge University Press.
E.T. Jaynes, 2003, Probability Theory: The Logic of Science. Cambridge University Press.
Tom Loredo, Bayesian Inference, a practical primer (tutorial).

19 Let's hold the Bayesian horses for a while

20 Latency, Projection and Generativity (or towards unsupervised Bayesian Neural Networks)

21 Latent models

22 What is a latent model?
First of all, what is a latent variable?
According to WIKIPEDIA: Latent variables, as opposed to observable variables, are those variables that cannot be directly observed but are rather inferred from other variables that can be observed and directly measured. Examples of latent variables include quality of life, business confidence, morale, happiness, conservatism. Latent variables are also called hypothetical variables or hypothetical constructs. The use of latent variables is common in social sciences and, to an extent, in the economics domain. The exact definition of latent variables varies in different domains.

23 What is a latent model? (2)
First of all, what is a latent variable?
Still in WIKIPEDIA: One advantage of using latent variables is that it reduces the dimensionality of data. A large number of observable variables can be aggregated to represent an underlying concept, making it easier for human beings to understand and assimilate information.
Haven't we heard all this before?

24 How to deal with multivariate and usually high-dimensional data?
It depends:
Low dimensionality (1-3D)
Medium dimensionality (4-10D)
High dimensionality (>10D)

25 low-medium dimensionality <10D
Spatial coordinates: 3D requires interactivity.
Further pre-cognitive elements allow us to add dimensions: color, movement, shape, ...
Fancy solutions: glyphs (Chernoff faces, stick figures, whiskers...)

26

27 high dimensional data
How do we visualize high dimensional data without losing much on the way? Some alternatives are relatively simple, others are not.
Eliminate redundant or uninformative dimensions (variables); at least the problem is alleviated. FEATURE SELECTION: Lluis
Divide & Conquer: create multiple simultaneous low-dimensional visualizations.
Then latent and projection methods!

28 What is a latent model? (3)
Latent Variable Models and Factor Analysis by David Bartholomew, Martin Knott (2nd ed., 1999)
LVMs provide an important tool for the analysis of multivariate data [...] a conceptual framework within which many disparate methods can be unified and a base from which new methods can be developed.
A statistical model specifies the joint distribution of a set of random variables, $P(X) = P(x_1, x_2, \ldots, x_D)$, and it becomes a LVM when some of the variables are unobservable.
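To make the definition concrete, here is a sketch of one simple latent variable model: a linear-Gaussian (factor-analysis-style) generator in which a single unobserved variable drives five observed ones. The choice of model, the loadings, the noise level and the sample size are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_observed, n_latent = 20000, 5, 1

# Hypothetical parameters, chosen only for illustration.
loadings = np.array([[1.0], [0.8], [0.6], [-0.5], [0.3]])   # latent -> observed
noise_std = 0.3

z = rng.normal(size=(n_samples, n_latent))                  # latent variable (never observed)
noise = rng.normal(scale=noise_std, size=(n_samples, n_observed))
X = z @ loadings.T + noise                                  # observed variables x_1..x_5

# The latent variable shows up only through the correlations among observables:
# the model implies Cov(X) = loadings loadings^T + noise_std^2 * I.
implied_cov = loadings @ loadings.T + noise_std ** 2 * np.eye(n_observed)
sample_cov = np.cov(X, rowvar=False)
max_gap = np.max(np.abs(sample_cov - implied_cov))
print(max_gap)
```

Only X would be available in practice; fitting the model means recovering the loadings and the noise level from the interrelationships among the observed variables.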

29 What is a latent model? (4)
Latent Variable Models and Factor Analysis by David Bartholomew, Martin Knott (2nd ed., 1999)
The interesting question concerns why LVs should be introduced into a model.
One reason is to reduce dimensionality: if the information contained in the interrelationships can be conveyed in a much smaller set, our ability to see the structure in the data will be much improved.
Another reason is that latent quantities figure prominently in many fields to which statistical methods are applied: social sciences, business and marketing...
What is Business Confidence? And what is General Intelligence? And Perception of Risk? They are treated as if they were measurable quantities, but no measuring instruments exist for them.

30 What is a latent model? (Example)
9th GVU's WWW Users Survey, produced by the Graphics, Visualization & Usability Center (Kehoe and Pitkow, 1998).
Self-selection of respondents.
Internet experience above the population average, but demographic variables well balanced.
Marketing-wise, the over-representation of the experienced could be beneficial.

31 What is a latent model? (Example)
SELECTED FACTORS (factor, description, attributes)
1. Shopping experience: Control and convenience; Compatibility
2. Environmental control: Consumer risk perception; Trust and security
3. Affordability
4. Shopping experience: Effort; Ease of use
5. Shopping experience: Customer service; Responsiveness and empathy

32 What is a latent model? (5)
Latent Variable Models and Factor Analysis by David Bartholomew, Martin Knott (2nd ed., 1999)
We prefer to speak of LVs since this accurately conveys the idea of something underlying the observed.
A LV can be real in the sense that it could, in principle, be measured; for example, Personal Wealth.
Business Confidence is not something which exists in the sense that Personal Wealth does.
Much of the philosophical debate which takes place on LVMs centres on REIFICATION: REALISTS vs. INSTRUMENTALISTS.

33 What is a latent model? (6)
                      LATENT VARIABLES
OBSERVED VARIABLES    CONTINUOUS               CATEGORICAL
CONTINUOUS            Factor Analysis          Latent Profile Analysis
CATEGORICAL           Latent Trait Analysis    Latent Class Analysis
LATENT TRAIT ANALYSIS: Tino, Kaban, Sun (2004) A Generative Probabilistic Approach to Visualizing Sets of Symbolic Sequences, Proceedings of the ACM SIGKDD