# Discovering Structure in Continuous Variables Using Bayesian Networks

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 In: "Advances in Neural Information Processing Systems 8," MIT Press, Cambridge MA, Discovering Structure in Continuous Variables Using Bayesian Networks Reimar Hofmann and Volker Tresp Siemens AG, Central Research Otto-Hahn-Ring München, Germany Abstract We study Bayesian networks for continuous variables using nonlinear conditional density estimators. We demonstrate that useful structures can be extracted from a data set in a self-organized way and we present sampling techniques for belief update based on Markov blanket conditional density models. 1 Introduction One of the strongest types of information that can be learned about an unknown process is the discovery of dependencies and even more important of independencies. A superior example is medical epidemiology where the goal is to find the causes of a disease and exclude factors which are irrelevant. Whereas complete independence between two variables in a domain might be rare in reality (which would mean that the joint probability density of variables A and B can be factored: p(a, B) = p(a)p(b)), conditional independence is more common and is often a result from true or apparent causality: consider the case that A is the cause of B and B is the cause of C, then p(c A, B) =p(c B) and A and C are independent under the condition that B is known. Precisely this notion of cause and effect and the resulting independence between variables is represented explicitly in Bayesian networks. Pearl (1988) has convincingly argued that causal thinking leads to clear knowledge representation in form of conditional probabilities and to efficient local belief propagating rules. hofmannr

2 Bayesian networks form a complete probabilistic model in the sense that they represent the joint probability distribution of all variables involved. Two of the powerful features of Bayesian networks are that any variable can be predicted from any subset of known other variables and that Bayesian networks make explicit statements about the certainty of the estimate of the state of a variable. Both aspects are particularly important for medical or fault diagnosis systems. More recently, learning of structure and of parameters in Bayesian networks has been addressed allowing for the discovery of structure between variables (Buntine, 1994, Heckerman, 1995). Most of the research on Bayesian networks has focused on systems with discrete variables, linear Gaussian models or combinations of both. Except for linear models, continuous variables pose a problem for Bayesian networks. In Pearl s words (Pearl, 1988): representing each [continuous] quantity by an estimated magnitude and a range of uncertainty, we quickly produce a computational mess. [Continuous variables] actually impose a computational tyranny of their own. In this paper we present approaches to applying the concept of Bayesian networks towards arbitrary nonlinear relations between continuous variables. Because they are fast learners we use Parzen windows based conditional density estimators for modeling local dependencies. We demonstrate how a parsimonious Bayesian network can be extracted out of a data set using unsupervised self-organized learning. For belief update we use local Markov blanket conditional density models which in combination with Gibbs sampling allow relatively efficient sampling from the conditional density of an unknown variable. 2 Bayesian Networks This brief introduction of Bayesian networks follows closely Heckerman, Considering a joint probability density 1 p(x) over a set of variables {x 1,...,x N } we can decompose using the chain rule of probability p(x) = N p(x i x 1,...,x i 1 ). (1) i=1 For each variable x i, let the parents of x i denoted by P i {x 1,...,x i 1 } be a set of variables 2 that renders x i and {x 1,...,x i 1 } independent, that is p(x i x 1,...,x i 1 )=p(x i P i ). (2) Note, that P i does not need to include all elements of {x 1,...,x i 1 } which indicates conditional independence between those variables not included in P i and x i given that the variables in P i are known. The dependencies between the variables are often depicted as directed acyclic 3 graphs (DAGs) with directed arcs from the members of P i (the parents) to x i (the child). Bayesian networks are a natural description of dependencies between variables if they depict causal relationships between variables. Bayesian networks are commonly used as a representation of the 1 For simplicity of notation we will only treat the continuous case. Handling mixtures of continuous and discrete variables does not impose any additional difficulties. 2 Usually the smallest set will be used. Note that in P i is defined with respect to a given ordering of the variables. 3 i.e. not containing any directed loops.

3 knowledge of domain experts. Experts both define the structure of the Bayesian network and the local conditional probabilities. Recently there has been great emphasis on learning structure and parameters in Bayesian networks (Heckerman, 1995). Most of previous work concentrated on models with only discrete variables or on linear models of continuous variables where the probability distribution of all continuous given all discrete variables is a multidimensional Gaussian. In this paper we use these ideas in context with continuous variables and nonlinear dependencies. 3 Learning Structure and Parameters in Nonlinear Continuous Bayesian Networks Many of the structures developed in the neural network community can be used to model the conditional density distribution of continuous variables p(x i P i ). Under the usual signal-plus independent Gaussian noise model a feedforward neural network NN(.) is a conditional density model such that p(x i P i )=G(x i ; NN(P i ),σ 2 ), where G(x; c, σ 2 ) is our notation for a normal density centered at c and with variance σ 2. More complex conditional densities can, for example, be modeled by mixtures of experts or by Parzen windows based density estimators which we used in our experiments (Section 5). We will use p M (x i P i ) for a generic conditional probability model. The joint probability model is then p M (x) = N p M (x i P i ). (3) i=1 following Equations 1 and 2. Learning Bayesian networks is usually decomposed into the problems of learning structure (that is the arcs in the network) and of learning the conditional density models P M (x i P i ) given the structure 4. First assume the structure of the network is given. If the data set only contains complete data, we can train conditional density models P M (x i P i ) independently of each other since the log-likelihood of the model decomposes conveniently into the individual likelihoods of the models for the conditional probabilities. Next, consider two competing network structures. We are basically faced with the well-known bias-variance dilemma: if we choose a network with too many arcs, we introduce large parameter variance and if we remove too many arcs we introduce bias. Here, the problem is even more complex since we also have the freedom to reverse arcs. In our experiments we evaluate different network structures based on the model likelihood using leave-one-out cross-validation which defines our scoring function for different network structures. More explicitly, the score for network structure S is Score = log(p(s)) + L cv, where p(s) is a prior over the network structures and L cv = D k=1 log(pm (x k S, X {x k })) is the leave-one-out cross-validation loglikelihood (later referred to as cv-log-likelihood). X = {x k } D k=1 is the set of training samples, and p M (x k S, X {x k }) is the probability density of sample x k given the structure S and all other samples. Each of the terms p M (x k S, X {x k }) can be computed from local densities using Equation 3. 4 Differing from Heckerman we do not follow a fully Bayesian approach in which priors are defined on parameters and structure; a fully Bayesian approach is elegant if the occurring integrals can be solved in closed form which is not the case for general nonlinear models or if data are incomplete.

4 Even for small networks it is computationally impossible to calculate the score for all possible network structures and the search for the global optimal network structure is NP-hard. In the Section 5 we describe a heuristic search which is closely related to search strategies commonly used in discrete Bayesian networks (Heckerman, 1995). 4 Prior Models In a Bayesian framework it is useful to provide means for exploiting prior knowledge, typically introducing a bias for simple structures. Biasing models towards simple structures is also useful if the model selection criteria is based on cross-validation, as in our case, because of the variance in this score. In the experiments we added a penalty per arc to the log-likelihood i.e. log p(s) αn A where N A is the number of arcs and the parameter α determines the weight of the penalty. Given more specific knowledge in form of a structure defined by a domain expert we can alternatively penalize the deviation in the arc structure (Heckerman, 1995). Furthermore, prior knowledge can be introduced in form of a set of artificial training data. These can be treated identical to real data and loosely correspond to the concept of a conjugate prior. 5 Experiment In the experiment we used Parzen windows based conditional density estimators to model the conditional densities p M (x i P i ) from Equation 2, i.e. p M (x i P i )= D k=1 G((x i, P i ); (x k i, Pk i ),σ2 i ) D k=1 G(P i; P k i,σ2 i ), (4) where {x j } D j=1 is the training set. The Gaussians in the nominator are centered at (x k i, Pk i ) which is the location of the k-th sample in the joint input/output (or parent/child) space and the Gaussians in the denominator are centered at (Pi k) which is the location of the k-th sample in the input (or parent) space. For each conditional model, σ i was optimized using leave-one-out cross validation 5. The unsupervised structure optimization procedure starts with a complete Bayesian model corresponding to Equation 1, i.e. a model where there is an arc between any pair of variables 6. Next, we tentatively try all possible arc direction changes, arc removals and arc additions which do not produce directed loops and evaluate the change in score. After evaluating all legal single modifications, we accept the change which improves the score the most. The procedure stops if every arc change decreases the score. This greedy strategy can get stuck in local minima which could in principle be avoided if changes which result in worse performance are also accepted with a nonzero probability 7 (such as in annealing strategies, Heckerman, 1995). Calculating the new score at each step requires only local computation. 5 Note that if we maintained a global σ for all density estimators, we would maintain likelihood equivalence which means that each network displaying the same independence model gets the same score on any test set. 6 The order of nodes determining the direction of initial arcs is random. 7 In our experiments we treated very small changes in score as if they were exactly zero thus allowing small decreases in score.

5 log likelihood log likelihood Number of Iterations Number of inputs Figure 1: Left: evolution of the cv-log-likelihood (dashed) and of the log-likelihood on the test set (continuous) during structure optimization. The curves are averages over 20 runs with different partitions of training and test sets and the likelihoods are normalized with respect to the number of cv- or test-samples, respectively. The penalty per arc was α =0.1. The dotted line shows the Parzen joint density model commonly used in statistics, i.e. assuming no independencies and using the same width for all Gaussians in all conditional density models. Right: log-likelihood of the local conditional Parzen model for variable 3 (p M (x 3 P 3 )) on the test set (continuous) and the corresponding cv-log-likelihood (dashed) as a function of the number of parents (inputs) crime rate 2 percent land zoned for lots 3 percent nonretail business 4 located on Charles river? 5 nitrogen oxide concentration 6 average number of rooms 7 percent built before weighted distance to employment center 9 access to radial highways 10 tax rate 11 pupil/teacher ratio 12 percent black 13 percent lower-status population 14 median value of homes Figure 2: Final structure of a run on the full data set. The removal or addition of an arc corresponds to a simple removal or addition of the corresponding dimension in the Gaussians of the local density model. However, after each such operation the widths of the Gaussians σ i in the affected local models have to be optimized. An arc reversal is simply the execution of an arc removal followed by an arc addition. In our experiment, we used the Boston housing data set, which contains 506 samples. Each sample consists of the housing price and 14 variables which supposedly influence the housing price in a Boston neighborhood (Figure 2). Figure 1 (left) shows an experiment where one third of the samples was reserved as a test set to monitor the process. Since the algorithm never sees the test data the increase in likelihood of the model on the test data is an unbiased estimator for how much the model has improved by the extraction of structure from the data. The large increase in the log-likelihood can be understood by studying Figure 1 (right). Here we picked a single variable (node 3) and formed a density model to predict this variable from the remaining 13 variables. Then we removed input variables in the order

6 of their significance. After the removal of a variable, σ 3 is optimized. Note that the cv-log-likelihood increases until only three input variables are left due to the fact that irrelevant variables or variables which are well represented by the remaining input variables are removed. The log-likelihood of the fully connected initial model is therefore low (Figure 1 left). We did a second set of 15 runs with no test set. The scores of the final structures had a standard deviation of only 0.4. However, comparing the final structures in terms of undirected arcs 8 the difference was 18% on average. The structure from one of these runs is depicted in Figure 2 (right). In comparison to the initial complete structure with 91 arcs, only 18 arcs are left and 8 arcs have changed direction. One of the advantages of Bayesian networks is that they can be easily interpreted. The goal of the original Boston housing data experiment was to examine whether the nitrogen oxide concentration (5) influences the housing price (14). Under the structure extracted by the algorithm, 5 and 14 are dependent given all other variables because they have a common child, 13. However, if all variables except 13 are known then they are independent. Another interesting question is what the relevant quantities are for predicting the housing price, i.e. which variables have to be known to render the housing price independent from all other variables. These are the parents, children, and children s parents of variable 14, that is variables 8, 10, 11, 6, 13 and 5. It is well known that in Bayesian networks, different constellations of directions of arcs may induce the same independencies, i.e. that the direction of arcs is not uniquely determined. It can therefore not be expected that the arcs actually reflect the direction of causality. 6 Missing Data and Markov Blanket Conditional Density Model Bayesian networks are typically used in applications where variables might be missing. Given partial information (i. e. the states of a subset of the variables) the goal is to update the beliefs (i. e. the probabilities) of all unknown variables. Whereas there are powerful local update rules for networks of discrete variables without (undirected) loops, the belief update in networks with loops is in general NP-hard. A generally applicable update rule for the unknown variables in networks of discrete or continuous variables is Gibbs sampling. Gibbs sampling can be roughly described as follows: for all variables whose state is known, fix their states to the known values. For all unknown variables choose some initial states. Then pick a variable x i which is not known and update its value following the probability distribution p(x i {x 1,...,x N }\{x i }) p(x i P i ) p(x j P j ). (5) x i P j Do this repeatedly for all unknown variables. Discard the first samples. Then, the samples which are generated are drawn from the probability distribution of the unknown variables given the known variables. Using these samples it is easy to calculate the expected value of any of the unknown variables, estimate variances, 8 Since the direction of arcs is not unique we used the difference in undirected arcs to compare two structures. We used the number of arcs present in one and only one of the structures normalized with respect to the number of arcs in a fully connected network.

7 covariances and other statistical measures such as the mutual information between variables. Gibbs sampling requires sampling from the univariate probability distribution in Equation 5 which is not straightforward in our model since the conditional density does not have a convenient form. Therefore, sampling techniques such as importance sampling have to be used. In our case they typically produce many rejected samples and are therefore inefficient. An alternative is sampling based on Markov blanket conditional density models. The Markov blanket of x i, M i is the smallest set of variables such that p(x i {x 1,...,x N }\x i )=p(x i M i ) (given a Bayesian network, the Markov blanket of a variable consists of its parents, its children and its children s parents.). The idea is to form a conditional density model p M (x i M i ) p(x i M i ) for each variable in the network instead of computing it according to Equation 5. Sampling from this model is simple using conditional Parzen models: the conditional density is a mixture of Gaussians from which we can sample without rejection 9. Markov blanket conditional density models are also interesting if we are only interested in always predicting one particular variable, as in most neural network applications. Assuming that a signal-plus-noise model is a reasonably good model for the conditional density, we can train an ordinary neural network to predict the variable of interest. In addition, we train a model for each input variable predicting it from the remaining variables. In addition to having obtained a model for the complete data case, we can now also handle missing inputs and do backward inference using Gibbs sampling. 7 Conclusions We demonstrated that Bayesian models of local conditional density estimators form promising nonlinear dependency models for continuous variables. The conditional density models can be trained locally if training data are complete. In this paper we focused on the self-organized extraction of structure. Bayesian networks can also serve as a framework for a modular construction of large systems out of smaller conditional density models. The Bayesian framework provides consistent update rules for the probabilities i.e. communication between modules. Finally, consider input pruning or variable selection in neural networks. Note, that our pruning strategy in Figure 1 can be considered a form of variable selection by not only removing variables which are statistically independent of the output variable but also removing variables which are represented well by the remaining variables. This way we obtain more compact models. If input values are missing then the indirect influence of the pruned variables on the output will be recovered by the sampling mechanism. References Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research 2: Heckerman, D. (1995). A tutorial on learning Bayesian networks. Microsoft Research, TR. MSR-TR-95-06, Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann. 9 There are, however, several open issues concerning consistency between the conditional models.

### Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

### Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

### Bayesian Statistics: Indian Buffet Process

Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note

### Model-based Synthesis. Tony O Hagan

Model-based Synthesis Tony O Hagan Stochastic models Synthesising evidence through a statistical model 2 Evidence Synthesis (Session 3), Helsinki, 28/10/11 Graphical modelling The kinds of models that

### The Optimality of Naive Bayes

The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

### Linear Model Selection and Regularization

Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

### Learning 3D Object Recognition Models from 2D Images

From: AAAI Technical Report FS-93-04. Compilation copyright 1993, AAAI (www.aaai.org). All rights reserved. Learning 3D Object Recognition Models from 2D Images Arthur R. Pope David G. Lowe Department

### Graphical Models. CS 343: Artificial Intelligence Bayesian Networks. Raymond J. Mooney. Conditional Probability Tables.

Graphical Models CS 343: Artificial Intelligence Bayesian Networks Raymond J. Mooney University of Texas at Austin 1 If no assumption of independence is made, then an exponential number of parameters must

### INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

### Introduction to Discrete Probability Theory and Bayesian Networks

Introduction to Discrete Probability Theory and Bayesian Networks Dr Michael Ashcroft September 15, 2011 This document remains the property of Inatas. Reproduction in whole or in part without the written

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Final Exam, Spring 2007

10-701 Final Exam, Spring 2007 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you brought:

### A Method of Computing Generalized Bayesian Probability Values For Expert Systems

A Method of Computing Generalized Bayesian Probability Values For Expert Systems Peter Cheese man SRI International, 333 Ravenswood Road Menlo Park, California 94025 Abstract This paper presents a new

### In Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, pages 842-846, Warsaw, Poland, December 2-4, 1999

In Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, pages 842-846, Warsaw, Poland, December 2-4, 1999 A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

### Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0

Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0 This primer provides an overview of basic concepts and definitions in probability and statistics. We shall

### Chapter 14 Managing Operational Risks with Bayesian Networks

Chapter 14 Managing Operational Risks with Bayesian Networks Carol Alexander This chapter introduces Bayesian belief and decision networks as quantitative management tools for operational risks. Bayesian

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

### THE UNIVERSITY OF AUCKLAND

COMPSCI 369 THE UNIVERSITY OF AUCKLAND FIRST SEMESTER, 2011 MID-SEMESTER TEST Campus: City COMPUTER SCIENCE Computational Science (Time allowed: 50 minutes) NOTE: Attempt all questions Use of calculators

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Feature Extraction and Selection. More Info == Better Performance? Curse of Dimensionality. Feature Space. High-Dimensional Spaces

More Info == Better Performance? Feature Extraction and Selection APR Course, Delft, The Netherlands Marco Loog Feature Space Curse of Dimensionality A p-dimensional space, in which each dimension is a

### Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning SS 2005 Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes

### GeNIeRate: An Interactive Generator of Diagnostic Bayesian Network Models

GeNIeRate: An Interactive Generator of Diagnostic Bayesian Network Models Pieter C. Kraaijeveld Man Machine Interaction Group Delft University of Technology Mekelweg 4, 2628 CD Delft, the Netherlands p.c.kraaijeveld@ewi.tudelft.nl

### Probabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm

Probabilistic Graphical Models 10-708 Homework 1: Due January 29, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 1-3. You must complete all four problems to

### Test of Hypotheses. Since the Neyman-Pearson approach involves two statistical hypotheses, one has to decide which one

Test of Hypotheses Hypothesis, Test Statistic, and Rejection Region Imagine that you play a repeated Bernoulli game: you win \$1 if head and lose \$1 if tail. After 10 plays, you lost \$2 in net (4 heads

### Neural Networks for Machine Learning. Lecture 13a The ups and downs of backpropagation

Neural Networks for Machine Learning Lecture 13a The ups and downs of backpropagation Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed A brief history of backpropagation

### Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible

### Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Tutorial on variational approximation methods. Tommi S. Jaakkola MIT AI Lab

Tutorial on variational approximation methods Tommi S. Jaakkola MIT AI Lab tommi@ai.mit.edu Tutorial topics A bit of history Examples of variational methods A brief intro to graphical models Variational

### Tensor Factorization for Multi-Relational Learning

Tensor Factorization for Multi-Relational Learning Maximilian Nickel 1 and Volker Tresp 2 1 Ludwig Maximilian University, Oettingenstr. 67, Munich, Germany nickel@dbs.ifi.lmu.de 2 Siemens AG, Corporate

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce

Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce Erik B. Reed Carnegie Mellon University Silicon Valley Campus NASA Research Park Moffett Field, CA 94035 erikreed@cmu.edu

### A Probabilistic Causal Model for Diagnosis of Liver Disorders

Intelligent Information Systems VII Proceedings of the Workshop held in Malbork, Poland, June 15-19, 1998 A Probabilistic Causal Model for Diagnosis of Liver Disorders Agnieszka Oniśko 1, Marek J. Druzdzel

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### A Bayesian Approach for on-line max auditing of Dynamic Statistical Databases

A Bayesian Approach for on-line max auditing of Dynamic Statistical Databases Gerardo Canfora Bice Cavallo University of Sannio, Benevento, Italy, {gerardo.canfora,bice.cavallo}@unisannio.it ABSTRACT In

### Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124

### Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

### Towards running complex models on big data

Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

### Graphical Modeling for Genomic Data

Graphical Modeling for Genomic Data Carel F.W. Peeters cf.peeters@vumc.nl Joint work with: Wessel N. van Wieringen Mark A. van de Wiel Molecular Biostatistics Unit Dept. of Epidemiology & Biostatistics

### CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

### A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka Onisko, M.S., 1,2 Marek J. Druzdzel, Ph.D., 1 and Hanna Wasyluk, M.D.,Ph.D.

Research Report CBMI-99-27, Center for Biomedical Informatics, University of Pittsburgh, September 1999 A Bayesian Network Model for Diagnosis of Liver Disorders Agnieszka Onisko, M.S., 1,2 Marek J. Druzdzel,

### Decision-Theoretic Troubleshooting: A Framework for Repair and Experiment

Decision-Theoretic Troubleshooting: A Framework for Repair and Experiment John S. Breese David Heckerman Abstract We develop and extend existing decisiontheoretic methods for troubleshooting a nonfunctioning

### degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

1.3 Neural Networks 19 Neural Networks are large structured systems of equations. These systems have many degrees of freedom and are able to adapt to the task they are supposed to do [Gupta]. Two very

### Discovering Markov Blankets: Finding Independencies Among Variables

Discovering Markov Blankets: Finding Independencies Among Variables Motivation: Algorithm: Applications: Toward Optimal Feature Selection. Koller and Sahami. Proc. 13 th ICML, 1996. Algorithms for Large

10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

### CHAPTER 6 NEURAL NETWORK BASED SURFACE ROUGHNESS ESTIMATION

CHAPTER 6 NEURAL NETWORK BASED SURFACE ROUGHNESS ESTIMATION 6.1. KNOWLEDGE REPRESENTATION The function of any representation scheme is to capture the essential features of a problem domain and make that

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Monotonicity Hints. Abstract

Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

### Decision tree algorithm short Weka tutorial

Decision tree algorithm short Weka tutorial Croce Danilo, Roberto Basili Machine leanring for Web Mining a.a. 2009-2010 Machine Learning: brief summary Example You need to write a program that: given a

### CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Lecture 1 Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x-5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### D A T A M I N I N G C L A S S I F I C A T I O N

D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.

### Modeling Contextual Factors of Click Rates

Modeling Contextual Factors of Click Rates Hila Becker Columbia University 500 W. 120th Street New York, NY 10027 Christopher Meek Microsoft Research 1 Microsoft Way Redmond, WA 98052 David Maxwell Chickering

### Bayesian Learning. CSL465/603 - Fall 2016 Narayanan C Krishnan

Bayesian Learning CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

### Logic, Probability and Learning

Logic, Probability and Learning Luc De Raedt luc.deraedt@cs.kuleuven.be Overview Logic Learning Probabilistic Learning Probabilistic Logic Learning Closely following : Russell and Norvig, AI: a modern

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

### Computing with Finite and Infinite Networks

Computing with Finite and Infinite Networks Ole Winther Theoretical Physics, Lund University Sölvegatan 14 A, S-223 62 Lund, Sweden winther@nimis.thep.lu.se Abstract Using statistical mechanics results,

### Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

### Machine Learning

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 8, 2011 Today: Graphical models Bayes Nets: Representing distributions Conditional independencies

### Tutorial on Markov Chain Monte Carlo

Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,

### The Big 50 Revision Guidelines for S1

The Big 50 Revision Guidelines for S1 If you can understand all of these you ll do very well 1. Know what is meant by a statistical model and the Modelling cycle of continuous refinement 2. Understand

### Feature Engineering in Machine Learning. Institute of Formal and Applied Linguistics, Charles University in Prague

Feature Engineering in Machine Learning Zdeněk Žabokrtský Institute of Formal and Applied Linguistics, Charles University in Prague Used resources http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf

### Process Modelling from Insurance Event Log

Process Modelling from Insurance Event Log P.V. Kumaraguru Research scholar, Dr.M.G.R Educational and Research Institute University Chennai- 600 095 India Dr. S.P. Rajagopalan Professor Emeritus, Dr. M.G.R

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

### Putting MAIDs in Context

Putting MAIs in Context Mark Crowley epartment of Computer Science University of British Columbia Vancouver, BC V6T 1Z4 crowley@cs.ubc.ca Abstract Multi-Agent Influence iagrams (MAIs) are a compact modelling

### Bayesian Networks Tutorial I Basics of Probability Theory and Bayesian Networks

Bayesian Networks 2015 2016 Tutorial I Basics of Probability Theory and Bayesian Networks Peter Lucas LIACS, Leiden University Elementary probability theory We adopt the following notations with respect

### Posterior probability!

Posterior probability! P(x θ): old name direct probability It gives the probability of contingent events (i.e. observed data) for a given hypothesis (i.e. a model with known parameters θ) L(θ)=P(x θ):

### THE MULTINOMIAL DISTRIBUTION. Throwing Dice and the Multinomial Distribution

THE MULTINOMIAL DISTRIBUTION Discrete distribution -- The Outcomes Are Discrete. A generalization of the binomial distribution from only 2 outcomes to k outcomes. Typical Multinomial Outcomes: red A area1

### Machine Learning

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 23, 2015 Today: Graphical models Bayes Nets: Representing distributions Conditional independencies

### 3. The Junction Tree Algorithms

A Short Course on Graphical Models 3. The Junction Tree Algorithms Mark Paskin mark@paskin.org 1 Review: conditional independence Two random variables X and Y are independent (written X Y ) iff p X ( )

### Dependency Networks for Collaborative Filtering and Data Visualization

264 UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000 Dependency Networks for Collaborative Filtering and Data Visualization David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite,

### Supervised and unsupervised learning - 1

Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

### Big Data, Machine Learning, Causal Models

Big Data, Machine Learning, Causal Models Sargur N. Srihari University at Buffalo, The State University of New York USA Int. Conf. on Signal and Image Processing, Bangalore January 2014 1 Plan of Discussion

### Data Mining Techniques Chapter 7: Artificial Neural Networks

Data Mining Techniques Chapter 7: Artificial Neural Networks Artificial Neural Networks.................................................. 2 Neural network example...................................................

### Data Matching Optimal and Greedy

Chapter 13 Data Matching Optimal and Greedy Introduction This procedure is used to create treatment-control matches based on propensity scores and/or observed covariate variables. Both optimal and greedy

### bayesian inference: principles and practice

bayesian inference: and http://jakehofman.com july 9, 2009 bayesian inference: and motivation background bayes theorem bayesian probability bayesian inference would like models that: provide predictive

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### An Introduction to Bayesian Statistics

An Introduction to Bayesian Statistics Robert Weiss Department of Biostatistics UCLA School of Public Health robweiss@ucla.edu April 2011 Robert Weiss (UCLA) An Introduction to Bayesian Statistics UCLA

### Intellectual Property / Copyright Material

What is Data Mining? Author: BALWANT RAI Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 02/27/04 Email: erg@evaltech.com Abstract: In this paper we will be going

### Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Belyaev Mikhail 1,2,3, Burnaev Evgeny 1,2,3, Kapushev Yermek 1,2 1 Institute for Information Transmission

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Calculating Interval Forecasts

Calculating Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between

### Linear concepts and hidden variables: An empirical study

Linear concepts and hidden variables An empirical study Adam J. Grove NEC Research Institute 4 Independence Way Princeton NJ 854 grove@research.nj.nec.com Dan Roth Department of Computer Science University