# Modelling Relational Data using Bayesian Clustered Tensor Factorization

Save this PDF as:

Size: px
Start display at page:

Download "Modelling Relational Data using Bayesian Clustered Tensor Factorization"

## Transcription

1 Modelling Relational Data using Bayesian Clustered Tensor Factorization Ilya Sutskever University of Toronto Ruslan Salakhutdinov MIT Joshua B. Tenenbaum MIT Abstract We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us understand a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data. 1 Introduction Learning with relational data, or sets of propositions of the form (object, relation, object), has been important in a number of areas of AI and statistical data analysis. AI researchers have proposed that by storing enough everyday relational facts and generalizing appropriately to unobserved propositions, we might capture the essence of human common sense. For instance, given propositions such as (cup, used-for, drinking), (cup, can-contain, juice), (cup, can-contain, water), (cup, can-contain, coffee), (glass, can-contain, juice), (glass, can-contain, water), (glass, can-contain, wine), and so on, we might also infer the propositions (glass, used-for, drinking), (glass, can-contain, coffee), and (cup, can-contain, wine). Modelling relational data is also important for more immediate applications, including problems arising in social networks [2], bioinformatics [16], and collaborative filtering [18]. We approach these problems using probabilistic models that define a joint distribution over the truth values of all conceivable relations. Such a model defines a joint distribution over the binary variables T(a, r, b) {0, 1}, where a and b are objects, r is a relation, and the variable T(a, r, b) determines whether the relation (a, r, b) is true. Given a set of true relations S = {(a, r, b)}, the model predicts that a new relation (a, r, b) is true with probability P(T(a, r, b) = 1 S). In addition to making predictions on new relations, we also want to understand the data that is, to find a small set of interpretable laws that explains a large fraction of the observations. By introducing hidden variables over simple hypotheses, the posterior distribution over the hidden variables will concentrate on the laws the data is likely to obey, while the nature of the laws depends on the model. For example, the Infinite Relational Model (IRM) [8] represents simple laws consisting of partitions of objects and partitions of relations. To decide whether the relation (a, r, b) is valid, the IRM simply checks that the clusters to which a, r, and b belong are compatible. The main advantage of the IRM is its ability to extract meaningful partitions of objects and relations from the observational data, 1

2 which greatly facilitates exploratory data analysis. More elaborate proposals consider models over more powerful laws (e.g., first order formulas with noise models or multiple clusterings), which are currently less practical due to the computational difficulty of their inference problems [7, 6, 9]. Models based on matrix or tensor factorization [18, 19, 3] have the potential of making better predictions than interpretable models of similar complexity, as we demonstrate in our experimental results section. Factorization models learn a distributed representation for each object and each relation, and make predictions by taking appropriate inner products. Their strength lies in the relative ease of their continuous (rather than discrete) optimization, and in their excellent predictive performance. However, it is often hard to understand and analyze the learned latent structure. The tension between interpretability and predictive power is unfortunate: it is clearly better to have a model that has both strong predictive power and interpretability. We address this problem by introducing the Bayesian Clustered Tensor Factorization (BCTF) model, which combines good interpretability with excellent predictive power. Specifically, similarly to the IRM, the BCTF model learns a partition of the objects and a partition of the relations, so that the truth-value of a relation (a, r, b) depends primarily on the compatibility of the clusters to which a, r, and b belong. At the same time, every entity has a distributed representation: each object a is assigned the two vectors a L,a R (one for a being a left argument in a relation and one for it being a right argument), and a relation r is assigned the matrix R. Given the distributed representations, the truth of a relation (a, r, b) is determined by the value of a L Rb R, while the object partition encourages the objects within a cluster to have similar distributed representations (and similarly for relations). The experiments show that the BCTF model achieves better predictive performance than a number of related probabilistic relational models, including the IRM, on several datasets. The model is scalable, and we apply it on the Movielens [15] and the Conceptnet [10] datasets. We also examine the structure found in BCTF s clusters and learned vectors. Finally, our results provide an example where the performance of a Bayesian model substantially outperforms a corresponding MAP estimate for large sparse datasets with minimal manual hyperparameter selection. 2 The Bayesian Clustered Tensor Factorization (BCTF) We begin with a simple tensor factorization model. Suppose that we have a fixed finite set of objects O and a fixed finite set of relations R. For each object a O the model maintains two vectors a L,a R R d (the left and the right arguments of the relation), and for each relation r R it maintains a matrix R R d d, where d is the dimensionality of the model. Given a setting of these parameters (collectively denoted by θ), the model independently chooses the truth-value of each relation (a, r, b) from the distribution P(T(a, r, b) = 1 θ) = 1/(1 + exp( a L Rb R)). In particular, given a set of known relations S, we can learn the parameters by maximizing a penalized log likelihood log P(S θ) Reg(θ). The necessity of having a pair of parameters a L,a R, instead of a single distributed representation a, will become clear later. Next, we define a prior over the vectors {a L }, {a R }, and {R}. Specifically, the model defines a prior distribution over partitions of objects and partitions of relations using the Chinese Restaurant Process. Once the partitions are chosen, each cluster C samples its own prior mean and prior diagonal covariance, which are then used to independently sample vectors {a L,a R : a C} that belong to cluster C (and similarly for the relations, where we treat R as a d 2 -dimensional vector). As a result, objects within a cluster have similar distributed representations. When the clusters are sufficiently tight, the value of a L Rb R is mainly determined by the clusters to which a, r, and b belong. At the same time, the distributed representations help generalization, because they can represent graded similarities between clusters and fine differences between objects in the same cluster. Thus, given a set of relations, we expect the model to find both meaningful clusters of objects and relations, as well as predictive distributed representations. More formally, assume that O = {a 1,...,a N } and R = {r 1,..., r M }. The model is defined as follows: P(obs, θ, c, α, α DP ) = P(obs θ, σ 2 )P(θ c, α)p(c α DP )P(α DP, α, σ 2 ) (1) where the observed data obs is a set of triples and their truth values {(a, r, b), t}; the variable c = {c obj, c rel } contains the cluster assignments (partitions) of the objects and the relations; the variable θ = {a L,a R,R} consists of the distributed representations of the objects and the relations, and 2

3 Figure 1: A schematic diagram of the model, where the arcs represent the object clusters and the vectors within each cluster are similar. The model predicts T(a, r, b) with a L Rb R. {σ 2, α, α DP } are the model hyperparameters. Two of the above terms are given by P(obs θ) = N(t a LRb R, σ 2 ) (2) {(a,r,b),t} obs P(c α DP ) = CRP(c obj α DP )CRP(c rel α DP ) (3) where N(t µ, σ 2 ) denotes the Gaussian distribution with mean µ and variance σ 2, and CRP(c α) denotes the probability of the partition induced by c under the Chinese Restaurant Process with concentration parameter α. The Gaussian likelihood in Eq. 2 is far from ideal for modelling binary data, but, similarly to [19, 18], we use it instead of the logistic function because it makes the model conjugate and Gibbs sampling easier. Defining P(θ c, α) takes a little more work. Given the partitions, the sets of parameters {a L }, {a R }, and {R} become independent, so P(θ c, α) = P({a L } c obj, α obj )P({a R } c obj, α obj )P({R} c rel, α rel ) (4) The distribution over the relation-vectors is given by P({R} c rel, α rel ) = c rel k=1 µ,σ i:c rel,i =k N(R i µ, Σ)dP(µ, Σ α rel ) (5) where c rel is the number of clusters in the partition c rel. This is precisely a Dirichlet process mixture model [13]. We further place a Gaussian-Inverse-Gamma prior over (µ, Σ): P(µ, Σ α rel ) = P(µ Σ)P(Σ α rel ) = N(µ 0, Σ) d IG(σ 2 d α rel, 1) (6) exp ( d µ 2 d /2 + 1 σ 2 d ) ( σ αrel 1 d ) where Σ is a diagonal matrix whose entries are σ 2 d, the variable d ranges over the dimensions of R i (so 1 d d 2 ), and IG(x α, β) denotes the inverse-gamma distribution with shape parameter α and scale parameter β. This prior makes many useful expectations analytically computable. The terms P({a L } c obj, α obj ) and P({a R } c obj, α obj ) are defined analogously to Eq. 5. Finally, we place an improper P(x) x 1 scale-uniform prior over each hyperparameter independently. Inference We now briefly describe the MCMC algorithm used for inference. Before starting the Markov chain, we find a MAP estimate of the model parameters using the method of conjugate gradient (but we do not optimize over the partitions). The MAP estimate is then used to initialize the Markov chain. Each step of the Markov chain consists of a number of internal steps. First, given the parameters θ, the chain updates c = (c rel, c obj ) using a collapsed Gibbs sampling sweep and a step of the split-and-merge algorithm (where the launch state was obtained with two sweeps of Gibbs sampling starting from a uniformly random cluster assignment) [5]. Next, it samples from the posterior mean d (7) 3

4 and covariance of each cluster, which is the distribution proportional to the term being integrated in Eq. 5. Next, the Markov chain samples the parameters {a L } given {a R }, {R}, and the cluster posterior means and covariances. This step is tractable since the conditional distribution over the object vectors {a L } is Gaussian and factorizes into the product of conditional distributions over the individual object vectors. This conditional independence is important, since it tends to make the Markov chain mix faster, and is a direct consequence of each object a having two vectors, a L and a R. If each object a was only associated with a single vector a (and not a L,a R ), the conditional distribution over {a} would not factorize, which in turn would require the use of a slower sequential Gibbs sampler. In the current setting, we can further speed up the inference by sampling from conditional distributions in parallel. The speedup could be substantial, particularly when the number of objects is large. The disadvantage of using two vectors for each object is that the model cannot as easily capture the position-independent properties of the object, especially in the sparse regime. Sampling {a L } from the Gaussian takes time proportional to d 3 N, where N is the number of objects. While we do the same for {a R }, we run a standard hybrid Monte Carlo to update the matrices {R} using 10 leapfrog steps of size 10 5 [12]. Each matrix, which we treat as a vector, has d 2 dimensions, so direct sampling from the Gaussian distribution scales as d 6 M, which is slow even for small values of d (e.g. 20). Finally, we make a small symmetric multiplicative change to each hyperparameter and accept or reject its new value according to the Metropolis-Hastings rule. 3 Evaluation In this section, we show that the BCTF model has excellent predictive power and that it finds interpretable clusters by applying it to five datasets and comparing its performance to the IRM [8] and the Multiple Relational Clustering (MRC) model [9]. We also compare BCTF to its simpler counterpart: a Bayesian Tensor Factorization (BTF) model, where all the objects and the relations belong to a single cluster. The Bayesian Tensor Factorization model is a generalization of the Bayesian probabilistic matrix factorization [17], and is closely related to many other existing tensor-factorization methods [3, 14, 1]. In what follows, we will describe the datasets, report the predictive performance of our and of the competing algorithms, and examine the structure discovered by BCTF. 3.1 Description of the Datasets We use three of the four datasets used by [8] and [9], namely, the Animals, the UML, and the Kinship dataset, as well the Movielens [15] and the Conceptnet datasets [10]. 1. The animals dataset consists of 50 animals and 85 binary attributes. The dataset is a fully observed matrix so there is only one relation. 2. The kinship dataset consists of kinship relationships among the members of the Alyawarra tribe [4]. The dataset contains 104 people and 26 relations. This dataset is dense and has = observations, most of which are The UML dataset [11] consists of a 135 medical terms and 49 relations. The dataset is also fully observed and has = (mostly 0) observations. 4. The Movielens [15] dataset consists of observed integer ratings of 6041 movies on a scale from 1 to 5, which are rated by 3953 users. The dataset is 95.8% sparse. 5. The Conceptnet dataset [10] is a collection of common-sense assertions collected from the web. It consists of about common-sense assertions such as (hockey, is-a, sport). There are 19 relations and objects. To make our experiments faster, we used only the 7000 most frequent objects, which resulted in true facts. For the negative data, we sampled twice as many random object-relation-object triples and used them as the false facts. As a result, there were binary observations in this dataset. The dataset is 99.9% sparse. 3.2 Experimental Protocol To facilitate comparison with [9], we conducted our experiments the following way. First, we normalized each dataset so the mean of its observations was 0. Next, we created 10 random train/test 4

5 animals kinship UML movielens conceptnet algorithm RMSE AUC RMSE AUC RMSE AUC RMSE AUC RMSE AUC MAP MAP BTF BCTF BTF BCTF IRM [8] MRC [9] Table 1: A quantitative evaluation of the algorithms using 20 and 40 dimensional vectors. We report the performance of the following algorithms: the MAP-based Tensor Factorization, the Bayesian Tensor Factorization (BTF) with MCMC (where all objects belong to a single cluster), the full Bayesian Clustered Tensor Factorization (BCTF), the IRM [8] and the MRC [9]. O1 O2 O3 O4 O5 O6 O7 O8 O9 killer whale, blue whale, humpback whale, seal, walrus, dolphin O1 antelope, dalmatian, horse, giraffe, zebra, deer O2 mole, hamster, rabbit, mouse hippopotamus, elephant, rhinoceros O3 spider monkey, gorilla, chimpanzee moose, ox, sheep, buffalo, pig, cow beaver, squirrel, otter Persian cat, skunk, chihuahua, collie grizzly bear, polar bear F1 F2 F3 F1 F2 F1 F2 F3 F4 F5 F6 F7 F8 F9 flippers, strainteeth, swims, fish, arctic, coastal, ocean, water hooves, vegetation, grazer, plains, fields paws, claws, solitary bulbous, slow, inactive jungle, tree big, strong, group walks, quadrapedal, ground small, weak, nocturnal, hibernate, nestspot tail, newworld, oldworld, timid O1 O2 O3 Figure 2: Results on the Animals dataset. Left: The discovered clusters. Middle: The biclustering of the features. Right: The covariance of the distributed representations of the animals (bottom) and their attributes (top). splits, where 10% of the data was used for testing. For the Conceptnet and the Movielens datasets, we used only two train/test splits and at most 30 clusters, which made our experiments faster. We report test root mean squared error (RMSE) and the area under the precision recall curve (AUC) [9]. For the IRM 1 we make predictions as follows. The IRM partitions the data into blocks; we compute the smoothed mean of the observed entries of each block and use it to predict the test entries in the same block. 3.3 Results We first applied BCTF to the Animals, Kinship, and the UML datasets using 20 and 40-dimensional vectors. Table 1 shows that BCTF substantially outperforms IRM and MRC in terms of both RMSE and AUC. In fact, for the Kinship and the UML datasets, the simple tensor factorization model trained by MAP performs as well as BTF and BCTF. This happens because for these datasets the number of observations is much larger than the number of parameters, so there is little uncertainty about the true parameter values. However, the Animals dataset is considerably smaller, so BTF performs better, and BCTF performs even better than the BTF model. We then applied BCTF to the Movielens and the Conceptnet datasets. We found that the MAP estimates suffered from significant overfitting, and that the fully Bayesian models performed much better. This is important because both datasets are sparse, which makes overfitting difficult to combat. For the extremely sparse Conceptnet dataset, the BCTF model further improved upon simpler 1 The code is available at ckemp/code/irm.html 5

6 a) b) d) f) c) e) g) Figure 3: Results on the Kinship dataset. Left: The covariance of the distributed representations {al } learned for each person. Right: The biclustering of a subset of the relations Amino Acid, Peptide, or Protein, Biomedical or Dental Material, Carbohydrate,... Amphibian, Animal, Archaeon, Bird, Fish, Human,... Antibiotic, Biologically Active Substance, Enzyme, Hazardous or Poisonous Substance, Hormone,... Biologic Function, Cell Function, Genetic Function, Mental Process,... Classification, Drug Delivery Device, Intellectual Product, Manufactured Object,... Body Part, Organ, Cell, Cell Component,... Alga, Bacterium, Fungus, Plant, Rickettsia or Chlamydia, Virus Age Group, Family Group, Group, Patient or Disabled Group,... Cell / Molecular Dysfunction, Disease or Syndrome, Model of Disease, Mental Dysfunction,... Daily or Recreational Activity, Educational Activity, Governmental Activity,... Environmental Effect of Humans, Human-caused Phenomenon or Process,... Acquired Abnormality, Anatomical Abnormality, Congenital Abnormality, Injury or Poisoning Health Care Related Organization, Organization, Professional Society,... Affects interacts with causes Figure 4: Results on the medical UML dataset. Left: The covariance of the distributed representations {al } learned for each object. Right: The inferred clusters, along with the biclustering of a subset of the relations. BTF model. We do not report results for the IRM, because the existing off-the-shelf implementation could not handle these large datasets. We now examine the latent structure discovered by the BCTF model by inspecting a sample produced by the Markov chain. Figure 2 shows some of the clusters learned by the model on the Animals dataset. It also shows the biclustering, as well as the covariance of the distributed representations of the animals and their attributes, sorted by their clusters. By inspecting the covariance, we can determine the clusters that are tight and the affinities between the clusters. Indeed, the cluster structure is reflected in the block-diagonal structure of the covariance matrix. For example, the covariance of the attributes (see Fig. 2, top-right panel) shows that cluster F1, containing {flippers, stainteeth,swims} is similar to cluster F4, containing {bulbous, slow, inactive}, but is very dissimilar to F2, containing {hooves, vegetation, grazer}. Figure 3 displays the learned representation for the Kinship dataset. The kinship dataset has 104 people with complex relationships between them: each person belongs to one of four sections, which strongly constrains the other relations. For example, a person in section 1 has a father in section 3 and a mother in section 4 (see [8, 4] for more details). After learning, each cluster was almost completely localized in gender, section, and age. For clarity of presentation, we sort the clusters first by their section, then by their gender, and finally by their age, as done in [8]. Figure 3 (panels (b-g)) displays some of the relations according to this clustering, and panel (a) shows the covariance between the vectors {al } learned for each person. The four sections are clearly visible in the covariance structure of the distributed representations. Figure 4 shows the inferred clusters for the medical UML dataset. For example, the model discovers that {Amino Acid, Peptide, Protein} Affects {Biologic Function, Cell Function, Genetic Function}, 6

8 model significantly outperformed its MAP counterpart, and in particular, it noticeably outperformed BTF on the Conceptnet dataset. A surprising aspect of the Bayesian model is the ease with which it worked after automatic hyperparameter selection was implemented. Furthermore, the model performs well even when the initial MAP estimate is very poor, as was the case for the 40-dimensional models on the Conceptnet dataset. This is particularly important for large sparse datasets, since finding a good MAP estimate requires careful cross-validation to select the regularization hyperparameters. Careful hyperparameter selection can be very labour-expensive because it requires careful training of a large number of models. Acknowledgments The authors acknowledge the financial support from NSERC, Shell, NTT Communication Sciences Laboratory, AFOSR FA , and AFOSR MURI. References [1] Edoardo Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. In NIPS, pages MIT Press, [2] P.J. Carrington, J. Scott, and S. Wasserman. Models and methods in social network analysis. Cambridge University Press, [3] W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, [4] W. Denham. The Detection of Patterns in Alyawarra Nonverbal Behavior. PhD thesis, Department of Anthropology, University of Washington, [5] S. Jain and R.M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1): , [6] Y. Katz, N.D. Goodman, K. Kersting, C. Kemp, and J.B. Tenenbaum. Modeling Semantic Cognition as Logical Dimensionality Reduction. In Proceedings of Thirtieth Annual Meeting of the Cognitive Science Society, [7] C. Kemp, N.D. Goodman, and J.B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of Thirtieth Annual Meeting of the Cognitive Science Society, [8] C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 381. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, [9] S. Kok and P. Domingos. Statistical predicate invention. In Proceedings of the 24th international conference on Machine learning, pages ACM New York, NY, USA, [10] H. Liu and P. Singh. ConceptNeta practical commonsense reasoning tool-kit. BT Technology Journal, 22(4): , [11] A.T. McCray. An upper-level ontology for the biomedical domain. Comparative and Functional Genomics, 4(1):80 84, [12] R.M. Neal. Probabilistic inference using Markov chain Monte Carlo methods, [13] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of computational and graphical statistics, pages , [14] Ian Porteous, Evgeniy Bart, and Max Welling. Multi-HDP: A non parametric bayesian model for tensor factorization. In Dieter Fox and Carla P. Gomes, editors, AAAI, pages AAAI Press, [15] J. Riedl, J. Konstan, S. Lam, and J. Herlocker. Movielens collaborative filtering data set, [16] J.F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G.F. Berriz, F.D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, et al. Towards a proteome-scale map of the human protein protein interaction network. Nature, 437(7062): , [17] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th international conference on Machine learning, pages ACM New York, NY, USA, [18] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. Advances in neural information processing systems, 20, [19] R. Speer, C. Havasi, and H. Lieberman. AnalogySpace: Reducing the dimensionality of common sense knowledge. In Proceedings of AAAI,

### Bayesian Statistics: Indian Buffet Process

Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note

### Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Statistics Graduate Courses

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### Bayesian Factorization Machines

Bayesian Factorization Machines Christoph Freudenthaler, Lars Schmidt-Thieme Information Systems & Machine Learning Lab University of Hildesheim 31141 Hildesheim {freudenthaler, schmidt-thieme}@ismll.de

### Probabilistic Matrix Factorization

Probabilistic Matrix Factorization Ruslan Salakhutdinov and Andriy Mnih Department of Computer Science, University of Toronto 6 King s College Rd, M5S 3G4, Canada {rsalakhu,amnih}@cs.toronto.edu Abstract

### Cognitive Abilities Test Practice Activities. Te ach e r G u i d e. Form 7. Verbal Tests. Level5/6. Cog

Cognitive Abilities Test Practice Activities Te ach e r G u i d e Form 7 Verbal Tests Level5/6 Cog Test 1: Picture Analogies, Levels 5/6 7 Part 1: Overview of Picture Analogies An analogy draws parallels

### Lecture 6: The Bayesian Approach

Lecture 6: The Bayesian Approach What Did We Do Up to Now? We are given a model Log-linear model, Markov network, Bayesian network, etc. This model induces a distribution P(X) Learning: estimate a set

### Tensor Factorization for Multi-Relational Learning

Tensor Factorization for Multi-Relational Learning Maximilian Nickel 1 and Volker Tresp 2 1 Ludwig Maximilian University, Oettingenstr. 67, Munich, Germany nickel@dbs.ifi.lmu.de 2 Siemens AG, Corporate

### Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

### AMERICAN MUSEUM OF NATURAL HISTORY SCAVENGER HUNT

AMERICAN MUSEUM OF NATURAL HISTORY SCAVENGER HUNT Begin on the 4 th floor. Take the stairs since they are faster than the elevators. Look but do not touch while in the museum. Keep your voices low but

### Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

### My Animal Report. (Write the full name of your animal on this line.)

Name: 1 My Animal Report (Write the full name of your animal on this line.) Draw a large picture of your animal, or paste a picture of your animal in the space above. If you are drawing the animal, be

### Data, Measurements, Features

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

### MUNCH! CRUNCH! SLURP! SMACK!

MUNCH! CRUNCH! SLURP! SMACK! A Mini Unit on What and How Animals Eat By Nancy VandenBerge Firstgradewow.blogspot.com Graphics by thistle girl, scrappin doodles, dj inkers and melonheadz This little unit

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### English for Spanish Speakers. Second Edition. Caroline Nixon & Michael Tomlinson

English for Spanish Speakers Second Edition s 1 Caroline Nixon & Michael Tomlinson 1 Red and yellow and pink and green, Orange and purple and blue. I can sing a rainbow, Sing a rainbow, Sing a rainbow

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### Preliminary English Test

Preliminary English Test Placement Test Time allowed: 2 hours QUESTION PAPER DO NOT write on this paper Instructions: Please answer all questions DO NOT USE a dictionary Write all answers on the separate

### Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

### Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

### Learning is a very general term denoting the way in which agents:

What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

### Movies and Leadership

Movies and Leadership Many movies contain characters that display leadership qualities or situations that can help you explore various leadership topics. Make sure that if you are using films to study

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Instructor: Erik Sudderth Graduate TAs: Dae Il Kim & Ben Swanson Head Undergraduate TA: William Allen Undergraduate TAs: Soravit

### English for Spanish Speakers. Second Edition. Caroline Nixon & Michael Tomlinson

English for Spanish Speakers Second Edition s 2 Caroline Nixon & Michael Tomlinson 1 2 There are pencils in the classroom, yes there are. There s a cupboard on the pencils, yes there is. There s a ruler

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### DOG Pets cat - dog - horse - hamster - rabbit - fish

CAT Pets cat - dog - horse - hamster - rabbit - fish DOG Pets cat - dog - horse - hamster - rabbit - fish HORSE Pets cat - dog - horse - hamster - rabbit - fish HAMSTER Pets cat - dog - horse - hamster

### Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

### Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

### Macmillan Children's Readers

Macmillan Children's Readers 1 Level 1 9781405057172 Colin's Colours 700 2 Level 1 9780230469198 Clothes We Wear NEW 700 3 Level 1 9780230010062 Eddie's Exercise 700 4 Level 1 9780230010048 Fantastic Freddy

### In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### A Perspective on Statistical Tools for Data Mining Applications

A Perspective on Statistical Tools for Data Mining Applications David M. Rocke Center for Image Processing and Integrated Computing University of California, Davis Statistics and Data Mining Statistics

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Learning outcomes. Knowledge and understanding. Competence and skills

Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

### My favourite animal is the cheetah. It lives in Africa in the savannah, It eats and gazel es. It is big and yel ow with black spots.

The crocodile is big and green. It has got a long tail, a long body and a big mouth with big teeth. It has got four short legs. The crocodile eats fish and other animals. He lives in Africa and America,

### The Need for Training in Big Data: Experiences and Case Studies

The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor

### 1. Squares are faster than circles, hexagons are slower than triangles, and hexagons are faster than squares. Which of these shapes is the slowest?

1. WARM UP PROBLEMS Try the following problems to see how many you are able to solve. If you have trouble to solve some, read the following section to learn some skills. Then come back to working on these

### Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation

Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation Levent Bolelli 1, Şeyda Ertekin 2, and C. Lee Giles 3 1 Google Inc., 76 9 th Ave., 4 th floor, New York, NY 10011, USA 2

### Fry Instant Word List

First 100 Instant Words the had out than of by many first and words then water a but them been to not these called in what so who is all some oil you were her sit that we would now it when make find he

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### 6. Which of the following is not a basic need off all animals a. food b. *friends c. water d. protection from predators. NAME SOL 4.

NAME SOL 4.5 REVIEW - Revised Habitats, Niches and Adaptations POPULATION A group of the same species living in the same place at the same time. COMMUNITY-- All of the populations that live in the same

### Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

### 1 What is Machine Learning?

COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #1 Scribe: Rob Schapire February 4, 2008 1 What is Machine Learning? Machine learning studies computer algorithms for learning to do

### Learning Systems of Concepts with an Infinite Relational Model

Learning Systems of Concepts with an Infinite Relational Model Charles Kemp and Joshua B. Tenenbaum Department of Brain and Cognitive Science Massachusetts Institute of Technology {ckemp,jbt}@mit.edu Thomas

### Tutorial on Markov Chain Monte Carlo

Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,

### Fry Instant Words High Frequency Words

Fry Instant Words High Frequency Words The Fry list of 600 words are the most frequently used words for reading and writing. The words are listed in rank order. First Hundred Group 1 Group 2 Group 3 Group

### Multiply and Divide with Scientific Notation Mississippi Standard: Multiply and divide numbers written in scientific notation.

Multiply and Divide with Scientific Notation Mississippi Standard: Multiply and divide numbers written in scientific notation. You can use scientific notation to simplify computations with very large and/or

### Grade Level and Subject Area All About Animals Elementary Math 1 st grade

Animals Developed as part of Complementary Learning: Arts-integrated Math and Science Curricula generously funded by the Martha Holden Jennings Foundation Unit Overview Animals Lesson/Day 1: Basic Needs

### Supported by. A seven part series exploring the fantastic world of science.

Supported by A seven part series exploring the fantastic world of science. Find out about the different types of teeth in your mouth. Milk Teeth As a child you have 20 milk teeth. Your first tooth appears

### Model-based Synthesis. Tony O Hagan

Model-based Synthesis Tony O Hagan Stochastic models Synthesising evidence through a statistical model 2 Evidence Synthesis (Session 3), Helsinki, 28/10/11 Graphical modelling The kinds of models that

### Learning Organizational Principles in Human Environments

Learning Organizational Principles in Human Environments Outline Motivation: Object Allocation Problem Organizational Principles in Kitchen Environments Datasets Learning Organizational Principles Features

### 1. Hallo, hallo. 2. Good morning. Good morning, good morning. How are you? Fine, thank you.

1. Hallo, hallo Hallo, hallo it's english time, hallo, hallo it's english time. clap your hands, clap your hands it's english time, clap your hands, clap your hands it's english time. 2. Good morning Good

### Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

### The Chinese Restaurant Process

COS 597C: Bayesian nonparametrics Lecturer: David Blei Lecture # 1 Scribes: Peter Frazier, Indraneel Mukherjee September 21, 2007 In this first lecture, we begin by introducing the Chinese Restaurant Process.

### Principled Hybrids of Generative and Discriminative Models

Principled Hybrids of Generative and Discriminative Models Julia A. Lasserre University of Cambridge Cambridge, UK jal62@cam.ac.uk Christopher M. Bishop Microsoft Research Cambridge, UK cmbishop@microsoft.com

### Probabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm

Probabilistic Graphical Models 10-708 Homework 1: Due January 29, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 1-3. You must complete all four problems to

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Towards robust symbolic reasoning about information from the web

Towards robust symbolic reasoning about information from the web Steven Schockaert School of Computer Science & Informatics, Cardiff University, 5 The Parade, CF24 3AA, Cardiff, United Kingdom s.schockaert@cs.cardiff.ac.uk

### Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### 10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

### Gaussian Processes in Machine Learning

Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Explore and Discover... Food chains and keys

Explore and Discover... Identify predator and prey and the food eaten by different animals and link this information into a food chain. Use keys to identify animals. Galleries visited (please see accompanying

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Statement of Purpose/Focus Conventions

4 2 Revised & Edited: Let me tell you how polar bears survive in the arctic. First, the skin beneath their fur soaks up the heat from the sun, so it can stay warm. Second, their claws help them walk and

### Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution

Görür D, Rasmussen CE. Dirichlet process Gaussian mixture models: Choice of the base distribution. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 5(4): 615 66 July 010/DOI 10.1007/s11390-010-1051-1 Dirichlet

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

### Activity recognition in ADL settings. Ben Kröse b.j.a.krose@uva.nl

Activity recognition in ADL settings Ben Kröse b.j.a.krose@uva.nl Content Why sensor monitoring for health and wellbeing? Activity monitoring from simple sensors Cameras Co-design and privacy issues Necessity

### Supporting Online Material for

www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

### 13.3 Inference Using Full Joint Distribution

191 The probability distribution on a single variable must sum to 1 It is also true that any joint probability distribution on any set of variables must sum to 1 Recall that any proposition a is equivalent

### Activity 1 Exploring Animal Diets and Sizes

Activity 1 Exploring Animal Diets and Sizes Objective & Overview: Using measurement and books, students will gain a better understanding of animal size, diversity, and diet through the fun study of wildlife.

### Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

### Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0

### Learn Words About a New Subject

Learn Words About a New Subject Directions The panels below are a storyboard for a video. Look at the pictures and the dialogue. Think about how the boldface words are connected to the big idea of how

### Deterministic Sampling-based Switching Kalman Filtering for Vehicle Tracking

Proceedings of the IEEE ITSC 2006 2006 IEEE Intelligent Transportation Systems Conference Toronto, Canada, September 17-20, 2006 WA4.1 Deterministic Sampling-based Switching Kalman Filtering for Vehicle

### Lecture 5: Introduction to Knowledge Representation

Lecture 5: Introduction to Knowledge Representation Dr. Roman V Belavkin BIS4410 Contents 1 Knowledge Engineering Knowledge Engineering Definition 1 (Knowledge Engineering). The process of designing knowledgebased

### Elementary Statistics

Elementary Statistics Chapter 1 Dr. Ghamsary Page 1 Elementary Statistics M. Ghamsary, Ph.D. Chap 01 1 Elementary Statistics Chapter 1 Dr. Ghamsary Page 2 Statistics: Statistics is the science of collecting,

### Supplement to Regional climate model assessment. using statistical upscaling and downscaling techniques

Supplement to Regional climate model assessment using statistical upscaling and downscaling techniques Veronica J. Berrocal 1,2, Peter F. Craigmile 3, Peter Guttorp 4,5 1 Department of Biostatistics, University

### Little Present Open Education Project OKFN, India Little Present 1

1 Open Knowledge Foundation Network, India : Open Education Project Help spreading the light of education. Use and share our books. It is FREE. Educate a child. Educate the economically challenged. Share

### Lecture 5: Human Concept Learning

Lecture 5: Human Concept Learning Cognitive Systems II - Machine Learning WS 2005/2006 Part I: Basic Approaches of Concept Learning Rules, Prototypes, Examples Lecture 5: Human Concept Learning p. 108

### DIAGNOSTIC EVALUATION

Servicio de Inspección Educativa Hezkuntzako Ikuskapen Zerbitzua 2 0 1 1 / 1 2 DIAGNOSTIC EVALUATION 4th YEAR of PRIMARY EDUCATION ENGLISH LITERACY Name / surname(s):... School:... Group:... City / Town:.

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address