The Information Bottleneck EM Algorithm
Gal Elidan and Nir Friedman
School of Computer Science & Engineering, Hebrew University

Abstract

Learning with hidden variables is a central challenge in probabilistic graphical models that has important implications for many real-life problems. The classical approach is to use the Expectation Maximization (EM) algorithm. This algorithm, however, can get trapped in local maxima. In this paper we explore a new approach that is based on the Information Bottleneck principle. In this approach, we view the learning problem as a tradeoff between two information theoretic objectives. The first is to make the hidden variables uninformative about the identity of specific instances. The second is to make the hidden variables informative about the observed attributes. By exploring different tradeoffs between these two objectives, we can gradually converge on a high-scoring solution. As we show, the resulting Information Bottleneck Expectation Maximization (IB-EM) algorithm manages to find solutions that are superior to standard EM methods.

1 Introduction

In recent years there has been a great deal of research on learning graphical models in general, and Bayesian networks in particular, from data [14, 9]. A central challenge in learning graphical models is learning with hidden (or latent) variables. Hidden variables typically serve as a summarizing mechanism that "captures" information from some of the observed variables and "passes" this information to some other part of the network. As such, hidden variables can simplify the network structure and consequently lead to better generalization.

The classical approach to learning with hidden variables is based on the Expectation Maximization (EM) algorithm [3, 12]. This algorithm performs a greedy search of the likelihood surface and is proven to converge to a local stationary point (usually a local maximum). Unfortunately, in hard real-life learning problems, there are many local maxima that can trap EM in a poor solution. Attempts to address this problem use a variety of strategies (e.g., [4, 8, 11, 15]).

In this paper, we introduce a new approach to learning Bayesian networks with hidden variables. In this approach, we view the learning problem as a tradeoff between two information theoretic objectives. The first objective is to compress information about the training data. The second objective is to make the hidden variables informative about the observed attributes, to ensure they preserve the relevant information. By exploring different tradeoffs between these two objectives, we gradually converge on a high-scoring solution.

Our approach builds on the Information Bottleneck framework of Tishby et al. [17] and its multivariate extension [6]. This framework provides methods for constructing new variables T that are stochastic functions of one set of variables Y and at the same time provide information on another set of variables X. The intuition is that the new variables T capture the relevant aspects of Y that are informative about X. We show how to pose the learning problem within the multivariate Information Bottleneck framework and derive a target Lagrangian for the hidden variables. We then show that this Lagrangian is an extension of the Lagrangian formulation of EM of Neal and Hinton [13], with an additional regularization term. By controlling the strength of this regularization term with a scale parameter, we can explore a range of target functions.
At one end of the spectrum there is a trivial target where compression of the data is total and all relevant information is lost. At the other end is the target function of EM. The continuity of target functions allows us to learn using a procedure that is motivated by the deterministic annealing approach [15]. We start with the optimum of the trivial target function and slowly change the scale parameter while tracking the solution at each step along the way. We present an alternative view of the optimization problem in the joint space of the model parameters and the scaling factor. This provides an appealing method for scanning the range of solutions, as in homotopy continuation [19]. Our procedure automatically increases the effective cardinality of the hidden variables as it progresses. Thus, by stopping the procedure at an earlier stage, we can reach a solution that has better generalization properties. We extend our Information Bottleneck EM (IB-EM) framework to multiple hidden variables and any Bayesian network structure. We further show that the relation between the Information Bottleneck and EM holds in this case and for variational EM [10]. Finally, we show that the IB-EM algorithm is effective in learning models for hard real-life problems, and is often superior to EM-based methods.
2 Bayesian Networks

Consider a finite set X = {X_1, ..., X_n} of random variables. A Bayesian network (BN) is an annotated directed acyclic graph that encodes a joint probability distribution over X. The nodes of the graph correspond to the random variables. Each node is annotated with a conditional probability distribution (CPD) of the random variable given its parents Pa_i in the graph G. The joint distribution can then be written as

    P(X_1, ..., X_n) = ∏_i P(X_i | Pa_i)

The graph G represents independence properties that are assumed to hold in the underlying distribution: each X_i is independent of its non-descendants given its parents Pa_i.

Suppose we have a network structure G over X. Given a training set D = {x[1], ..., x[M]} that consists of instances of X, we want to learn parameters for the network. In the Maximum Likelihood setting we want to maximize the log-likelihood function log P(D | G, θ) = Σ_m log P(x[m] | G, θ). This function can be equivalently written as E_P̂[log P(X | G, θ)], where P̂ is the empirical distribution (frequencies) in D. In practice, we also add a prior P(θ) on the parameters, and then try to maximize E_P̂[log P(X | θ)] + (1/M) log P(θ), which results in MAP estimation (the 1/M term compensates for the rescaling of the likelihood in the expectation). These priors can be thought of as adding imaginary instances, distributed according to a certain distribution (e.g., uniform), to the training data [9]. Consequently, from this point on we view priors as modifying the empirical distribution with additional instances, and then apply the maximum likelihood principle.

3 Multivariate Information Bottleneck

The Information Bottleneck method [17] is a general information theoretic clustering framework. Given a joint distribution Q(X, Y) of two variables, it attempts to find a bottleneck variable T, defined by a stochastic function Q(T | Y), that compresses the information in Y while preserving the information that is relevant to X. For example, the variable T might be used to define a soft clustering of words appearing in documents while preserving the information relevant to the topic of these documents.

The multivariate extension of this framework [6] allows the definition of several cluster variables as well as the desired interactions between them and the observed variables. The desired interactions are represented via two Bayesian networks: one, called G_in, represents the required compression, and the other, called G_out, represents the independencies that we are striving for between the bottleneck variables and the target variables. In Figure 1, G_in specifies that we want to minimize the information between Y and T (that compresses Y), and G_out specifies that we want T to make Y and the variables X_i independent of each other. These two objectives are conflicting, and the tradeoff between them "squeezes" the information Y contains about the X_i's.

[Figure 1: Definition of G_in and G_out for the Multivariate Information Bottleneck framework.]

Formally, the framework of Friedman et al. [6] attempts to minimize the following objective function:

    L[P, Q] = I_Q(Y; T) + γ D(Q(Y, T, X) || P(Y, T, X))    (1)

where D is the Kullback-Leibler divergence [2], and Q and P are joint distributions that can be represented by the networks G_in and G_out, respectively. Minimization is over possible parameterizations of Q(T | Y) (the marginal Q(Y, X) is given and fixed) and over possible parameterizations of P(Y, T, X) that can be represented by G_out.
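To make Eq. (1) concrete, the following minimal Python sketch evaluates the two terms of the Lagrangian for small discrete joint tables. The function names and table layout are our own illustration, not part of the paper; we assume Q and P are given as explicit joint arrays over (Y, T, X), which is only feasible for toy-sized domains.

```python
import numpy as np

def mutual_information(q_yt):
    """I_Q(Y;T) from a joint table q_yt[y, t] that sums to 1."""
    q_y = q_yt.sum(axis=1, keepdims=True)   # marginal Q(Y)
    q_t = q_yt.sum(axis=0, keepdims=True)   # marginal Q(T)
    mask = q_yt > 0
    return float((q_yt[mask] * np.log(q_yt[mask] / (q_y * q_t)[mask])).sum())

def ib_lagrangian(q_ytx, p_ytx, gamma):
    """Eq. (1): L[P,Q] = I_Q(Y;T) + gamma * D(Q(Y,T,X) || P(Y,T,X)).

    q_ytx, p_ytx: arrays of shape (|Y|, |T|, |X|) holding the joint
    distributions represented by G_in and G_out, respectively.
    """
    q_yt = q_ytx.sum(axis=2)                # marginalize out X
    mask = q_ytx > 0
    kl = float((q_ytx[mask] * np.log(q_ytx[mask] / p_ytx[mask])).sum())
    return mutual_information(q_yt) + gamma * kl
```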
In other words, we want to compress Y in such a way that the distribution defined by G_in is as close as possible to the desired distribution of G_out. The scale parameter γ balances these two factors. When γ is zero we are only interested in compressing the variable Y, and we resort to the trivial solution of a single cluster (or an equivalent parameterization). When γ is high we concentrate on choosing Q(T | Y) that is close to a distribution satisfying the independencies encoded by G_out.

4 Learning a Hidden Variable

The main focus of the Information Bottleneck is on the learned distribution Q(T | Y). This distribution can be thought of as a soft clustering of the original data. Our emphasis in this work is somewhat different. Given a dataset D = {x[1], ..., x[M]}, we are interested in learning a better generative model describing the distribution of the observed attributes X. That is, we want to give high probability to new data instances from the same source. In the learned network, the hidden variables will serve to "summarize" some part of the data while retaining the relevant information on some or all of the observed variables X. We start by extending the multivariate information bottleneck framework for the task of generalization where, in addition to the task of clustering, we are also interested in learning the generative model P. We first consider the case of a single hidden variable T, and extend the framework to several hidden variables in Section 5.

4.1 The Information Bottleneck EM Lagrangian

Intuitively, the task of generalization requires that we "forget" the specific training examples at hand but preserve the general form of the distribution over the observed variables. That is, we want to compress the identity of specific instances. On the other hand, since the observed variables are
deterministically known given the identity of a specific instance, we expect to lose information about the observed variables while performing this compression. This defines a tradeoff between the compression of the identity of specific instances and the preservation of the information relevant to the observed variables. We now formalize this idea.

We define a new variable Y that denotes the instance identity. That is, Y takes values in {1, ..., M} and Y[m] = m. We define Q(Y, X) to be the empirical distribution of the attributes X in the data, augmented with the distribution of the new variable Y. For each instance y, x[y] is the value X takes in that specific instance. We now apply the Information Bottleneck with the graph G_in of Figure 1. The choice of the graph G_out depends on the network we want to learn. We take it to be the target Bayesian network with the additional variable Y, with T as its parent. For simplicity, we consider the simple clustering model of G_out where T is the parent of X_1, ..., X_n. In practice, as will be seen in Section 6, any choice of G_out can be used.

We can now apply the Information Bottleneck framework to this pair of graphs. This will attempt to define a conditional probability Q(T | Y) so that Q(T, Y, X) = Q(T | Y)Q(Y, X) can be approximated by a distribution that factorizes according to G_out. This construction will aim to find a T that captures the relevant information the instance identity has about the observed attributes. We start by determining the objective function for the particular choice of G_in and G_out we are dealing with.

Proposition 4.1: Minimizing the information bottleneck objective function Eq. (1) is equivalent to minimizing the Lagrangian

    L_EM = I_Q(T; Y) − γ (E_Q[log P(X, T)] − E_Q[log Q(T)])

as a function of Q(T | Y) and P(X, T).

Proof: Using the chain rule and the structure of P, we can write P(Y, X, T) = P(Y | T)P(X, T). Similarly, since X is independent of T given Y, we can write Q(Y, X, T) = Q(Y | T)Q(T)Q(X | Y). Thus,

    D(Q(Y, X, T) || P(Y, X, T)) = E_Q[log (Q(Y | T)Q(T)Q(X | Y)) / (P(Y | T)P(X, T))]
                                = D(Q(Y | T) || P(Y | T)) + E_Q[log Q(X | Y)] + E_Q[log Q(T)] − E_Q[log P(X, T)]

By setting P(Y | T) = Q(Y | T), the first term reaches zero, its minimal value. The second term is a constant (since we cannot change Q(X | Y)). Thus, we need to minimize the last two terms, and the result follows immediately.

An immediate question is how this target function relates to standard maximum likelihood learning. To explore the connection, we use a formulation of EM introduced by Neal and Hinton [13]. Although EM is usually thought of in terms of changing the parameters of the target function P, Neal and Hinton show how to view it as a dual optimization of P and an auxiliary distribution Q. Using the same notation, we can write the functional defined by Neal and Hinton as

    F[Q, P] = E_Q[log P(X, T)] + H_Q(T | Y)

where H_Q(T | Y) = E_Q[−log Q(T | Y)], and Q(X) is fixed to be the observed empirical distribution.

Theorem 4.2 [13]: If (Q*, P*) is a stationary point of F, then P* is a stationary point of the log-likelihood function E_Q[log P(X)].

Moreover, Neal and Hinton show that an EM iteration corresponds to maximizing F over Q(T | Y) while holding P fixed, and then maximizing over P while holding Q(T | Y) fixed. The form of F[Q, P] is quite similar to the IB-EM Lagrangian, and indeed we can relate the two.
Proposition 4.3:

    L_EM = (1 − γ) I_Q(T; Y) − γ F[Q, P]

Proof: Using the identity H_Q(T | Y) = −E_Q[log Q(T)] − I_Q(T; Y), we can write F[Q, P] = E_Q[log P(X, T)] − E_Q[log Q(T)] − I_Q(T; Y), and the result follows immediately.

As a consequence, minimizing the IB-EM Lagrangian is equivalent to maximizing the EM functional combined with an information theoretic regularization term. When γ = 1 the Lagrangian and the EM functional coincide (up to sign), and finding a local minimum of L_EM is equivalent to finding a local maximum of the likelihood function.

4.2 The IB-EM Algorithm

For a specific value of γ, the Information Bottleneck EM (IB-EM) algorithm can be described as iterations similar to the EM iterations of Neal and Hinton [13]:

E-step: Minimize L_EM by optimizing Q(T | Y) while holding P fixed.
M-step: Minimize L_EM by optimizing P while holding Q fixed.

The M-step is essentially the standard maximum likelihood optimization of Bayesian networks. To see this, note that the only term that involves P is E_Q[log P(X, T)]. This term has the form of a log-likelihood function with the "empirical" distribution Q. Since the distribution is over all the variables, we can use sufficient statistics of P for efficient estimates. Thus, the M-step consists of computing expected sufficient statistics given Q, and then using a "plug-in" formula for choosing the parameters of P.
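As an illustration, here is a minimal sketch of the M-step for the simple clustering model where T is the parent of X_1, ..., X_n. The function names and the smoothing pseudo-count are our own assumptions; the paper itself only prescribes expected sufficient statistics followed by a plug-in estimate.

```python
import numpy as np

def m_step(q_t_given_y, data, n_states):
    """Plug-in parameter estimates from expected sufficient statistics.

    q_t_given_y: (M, K) array; row y holds Q(T | Y=y) over K cluster states.
    data:        (M, n) integer array; data[y, j] is the value of X_j in instance y.
    n_states:    number of values each X_j can take.
    Returns P(T) and a table p[j, v, t] = P(X_j = v | T = t).
    """
    M, K = q_t_given_y.shape
    n = data.shape[1]
    p_t = q_t_given_y.mean(axis=0)                 # expected cluster frequencies
    p_x_given_t = np.full((n, n_states, K), 1e-6)  # tiny pseudo-count (our assumption)
    for j in range(n):
        for v in range(n_states):
            # expected count of instances with X_j = v falling in each cluster
            p_x_given_t[j, v] += q_t_given_y[data[:, j] == v].sum(axis=0)
    p_x_given_t /= p_x_given_t.sum(axis=1, keepdims=True)
    return p_t, p_x_given_t
```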
The E-step is a bit more involved. We need to optimize Q(T | Y). To do this we use the following two results, which follow from the results of Friedman et al. [6].

Proposition 4.4: Q(T | Y) is a stationary point of L_EM with respect to a fixed choice of P if and only if, for all values t and y of T and Y respectively,

    Q(t | y) = (1/Z(y)) Q(t)^(1−γ) exp{γ · EP(t, y)}    (2)

where EP(t, y) = log P(x[y], t) and Z(y) is a normalizing constant.

Proposition 4.5: A stationary point of L_EM is achieved by iteratively applying the self-consistent equations of Proposition 4.4.

In most cases the stationary convergence point reached by applying these self-consistent equations will be a local minimum. Combining this result with the result of Neal and Hinton showing that optimization of P increases F[Q, P], we conclude that both the E-step and the M-step decrease L_EM until we reach a stationary point.

4.3 Bypassing Local Maxima using Continuation

As discussed above, the parameter γ balances the desire to compress the data and the desire to fit the parameters of G_out. When γ is close to 0, our only objective is compressing the data, and the effective dimensionality of T will be 1, leading to a trivial solution. At larger values of γ we pay more and more attention to the distribution of G_out, and we can expect additional states of T to be utilized. Ultimately, we can expect each sample to be assigned to a different cluster (if the dimensionality of T allows it). Proposition 4.3 tells us that in the limit of γ = 1 our solution will actually converge to one of the possible standard EM solutions.

Naively, we could allow a high cardinality for the hidden variable, set γ to a "high" value, and find the bottleneck solution at that point. There are several drawbacks to this approach. First, we will typically converge to a sub-optimal solution for a given cardinality and γ, all the more so for γ = 1 where there are many such maxima. Second, we often do not know the correct cardinality that should be assigned to the hidden variable. If we use a cardinality for T that is too large, learning will be less robust and might become intractable. If T has too low a dimensionality, we will not fully utilize the potential of the hidden variable. We would like to somehow identify the beneficial number of clusters without having to simply try many options.

To cope with this task, we adopt the deterministic annealing strategy [15]. In this strategy, we start with γ = 0, where a single-cluster solution (high entropy) is optimal and compression is total. We then progress toward higher values of γ. This gradually introduces additional structure into the learned model. There are several ways of executing this general strategy. The common approach is simply to increase γ in fixed steps, and after each increment apply the iterative algorithm to re-attain a (local) minimum with the new value of γ. On the problems we examine in Section 6, this naive approach did not prove successful. Instead, we use a more refined approach that utilizes continuation methods for executing this strategy. This approach provides means for automatically tuning the appropriate change in γ, and also ensures that we keep track of the solution from one iteration to the next. To perform continuation, we view the optimization problem in the joint space of the parameters and γ. In this space we want to follow a smooth path from the trivial solution at γ = 0 to a solution at γ = 1.
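Before turning to the path-following machinery, here is a hedged sketch of the E-step as the self-consistent update of Eq. (2), written for the same toy setting as the M-step sketch above. The iteration count and the log-space stabilization are our own choices; with gamma = 1 the update reduces to the familiar EM posterior.

```python
import numpy as np

def e_step(q_t_given_y, log_p_xt, gamma, n_iters=20):
    """Iterate Eq. (2): Q(t|y) = Q(t)^(1-gamma) * exp(gamma * EP(t,y)) / Z(y).

    q_t_given_y: (M, K) initial Q(T | Y), rows summing to 1.
    log_p_xt:    (M, K) array of EP(t, y) = log P(x[y], t) for the current P.
    """
    q = q_t_given_y.copy()
    for _ in range(n_iters):
        q_t = q.mean(axis=0)                          # Q(T) under uniform Q(Y)
        logits = (1.0 - gamma) * np.log(q_t + 1e-300) + gamma * log_p_xt
        logits -= logits.max(axis=1, keepdims=True)   # numerical stabilization
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)             # the 1/Z(y) normalization
    return q
```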
Furthermore, we would like to require that the fixed-point equations hold at all points along the path. Continuation theory [19] guarantees that, excluding degenerate cases, such a path, free of discontinuities, indeed exists.

We start by characterizing such paths. Note that once we fix the parameters Q(T | Y), the parameters of the optimal choice of P are determined as a function of Q. Thus, we take Q(T | Y) and γ as the only free parameters in our problem. As we have shown in Proposition 4.4, when the gradient of the Lagrangian is zero, Eq. (2) holds for each value of t and y. Thus, we want to consider paths where all of these equations hold. We define

    G_{t,y}(Q, γ) = −log Q(t | y) + (1 − γ) log Q(t) + γ EP(t, y) − log Z(y)    (3)

Clearly, G_{t,y}(Q, γ) = 0 exactly when Eq. (2) holds for t and y. Our goal is then to follow an equi-potential path where all the G_{t,y}(Q, γ) functions are zero, starting from some small value of γ up to the desired EM solution at γ = 1.

Suppose we are at a point (Q_0, γ_0) where all the G_{t,y} functions are equal to 0. We want to move in a direction d = (dQ, dγ) so that (Q_0 + dQ, γ_0 + dγ) also satisfies the fixed-point equations. To do so, we want to find a direction d so that

    for all t, y:  ∇_{Q,γ} G_{t,y}(Q_0, γ_0) · d = 0    (4)

Computing the derivatives of G_{t,y}(Q, γ) with respect to each of the parameters results in a derivative matrix H(Q, γ). Rows of the matrix correspond to each of the L = |T| × |Y| functions of Eq. (3), and columns correspond to the L parameters of Q as well as γ. The entries are the partial derivatives of the function associated with the row with respect to the parameter associated with the column. To find a direction that satisfies Eq. (4), we need to satisfy the matrix equation

    H(Q, γ) · d = 0    (5)

In other words, we are trying to find a vector in the null-space of H(Q, γ). Note that H is an L × (L + 1) matrix
and the direction vector d is of length L + 1. Thus, the null-space is of dimension L + 1 − Rank(H(Q, γ)). Numerically, excluding measure-zero cases [19], we expect Rank(H(Q, γ)) to be full, i.e., L. Thus there is a unique (up to scaling) solution to Eq. (5).

Finding this direction, however, can be costly. Notice that H(Q, γ) is of size L × (L + 1). This number is quadratic in the training set size, and so just computing the matrix is prohibitively expensive, even for small datasets. Instead, we resort to approximating H(Q, γ) by a matrix that contains only the diagonal entries ∂G_{t,y}/∂Q(t | y) and the last column ∂G_{t,y}/∂γ. In our case these are also the most significant values of the matrix, since many terms in the off-diagonal derivatives vanish. Once we make this approximation, we can solve Eq. (5) in time linear in L.

Note that once we find a vector d that satisfies Eq. (5), we still need to decide on the size of the step we want to take in that direction. There are various standard approaches, such as normalizing the direction vector to a predetermined size. However, in our problem we have a natural measure of progress. Recall that I(T; Y) increases when T captures information about the samples. In all runs I(T; Y) starts at 0, and is upper-bounded by the log of the cardinality of T. Moreover, the "interesting" steps in the learning process occur when I(T; Y) grows. These are exactly the points where the balance between the two terms in the Lagrangian changes and the second term grows sufficiently to allow the first term to increase I(T; Y). With this intuition at hand, we want to normalize the step size by the expected change in I(T; Y). If we are at a region where changes in the parameters do not influence I(T; Y), then we can make a big step. On the other hand, if the change has a strong influence on I(T; Y), then we want to carefully track the solution. Formally, we compute ∇_{Q,γ} I_Q(T; Y) and rescale the direction vector so that

    (∇_{Q,γ} I_Q(T; Y))ᵀ · d = ε

where ε is a predetermined step size. We also bound the minimal and maximal change in γ so that we do not get trapped in too many steps or, alternatively, overlook the regions of change.

Finally, although the continuation method takes us in the correct direction, the approximation as well as inherent numerical instability can lead us to a sub-optimal path. To cope with this situation, we adopt a heuristic commonly used in deterministic annealing. At each value of γ, we slightly perturb the current solution and re-solve the self-consistent equations. If this perturbation leads to a better (lower) Lagrangian, we take it as our current solution.

To summarize, our procedure works as follows: we start with γ = 0, for which only the trivial solution exists. At each stage we compute the direction of γ and Q(T | Y) that leaves the fixed-point equations intact. We then take a small step in this direction and apply IB-EM iterations to re-attain the fixed-point equilibrium at the new value of γ. We repeat these iterations until we reach γ = 1.

4.4 Regularization and Generalization

In the discussion so far we were concerned with reaching the value of γ = 1 with a good solution. As the experimental results below show, by using continuation methods our algorithm manages to reach solutions that are superior to running standard EM. However, in many domains, maximizing the likelihood can lead to overfitting and poor generalization.
Thus, it is common in machine learning to use regularization to counterweight the tendency to overfit the training examples. We can use priors to reduce this effect. However, they do not guarantee the best generalization performance. As discussed above, we want to learn hidden variables with a large cardinality and use regularization to determine their effective cardinality.

The Information Bottleneck formalization provides a form of regularization that arises naturally from the definition of the learning problem. When we learn with γ < 1, the compression term counteracts the tendency to overfit the data. Thus, we might get better generalization with parameters estimated at intermediate values of γ. Indeed, when we examine models learned at different continuation iterations, we see that later iterations, when γ is closer to 1, can degrade the generalization performance. This observation suggests that, for learning purposes, we would prefer to learn a model at some γ* that is usually smaller than 1. The technical challenge is how to select γ*. A relatively straightforward but somewhat costly approach is to use a cross-validation (CV) test. In this approach we perform k runs of our algorithm, each one learning from (k − 1)/k of the training data. In each run we evaluate intermediate models on the remaining 1/k of the data. We can then estimate the generalization at different values of γ by averaging the log-likelihood of the held-out data at each value of γ across the k folds. We use this evaluation to estimate γ*, and then perform the continuation process on the whole training data up to the critical γ*. It is important to stress that this approach utilizes only the training data in order to predict the value of γ* at which generalization will be best.

5 Multiple Hidden Variables

The framework we described in the previous section can easily accommodate learning networks with multiple hidden variables, simply by treating T as a vector of hidden variables. In this case, the distribution Q(T | Y) describes the joint distribution of the hidden variables for each value of Y, and P(T, X) describes their joint distribution with the attributes X in the desired network. Unfortunately, if the number of variables in T is large, the representation of Q(T | Y) grows exponentially, and this approach becomes
infeasible.

[Figure 2: Definition of the networks for the Multivariate Information Bottleneck framework with multiple hidden variables. (a) shows G_in with the mean field assumption. (b) shows a possible hierarchy for G_out.]

One strategy to alleviate this problem is to force Q(T | Y) to have a factorized form. This reduces the cost of representing Q and also the cost of performing inference. As an example, we can require that Q(T | Y) factors as a product ∏_i Q(T_i | Y). This assumption is similar to the mean field variational approximation (e.g., [10]). In the Multivariate Information Bottleneck framework, different factorizations of Q(T | Y) correspond to different choices of the network G_in. For example, the mean field factorization is achieved when G_in is such that the only parent of each T_i is Y, as in Figure 2. In general, we can consider other choices where we introduce edges between the different T_i's.

For any such choice of G_in, we get exactly the same Lagrangian as in the case of a single hidden variable. The main difference is that since Q has a factorized form, we can decompose I_Q(T; Y). For example, if we use the mean field factorization, we get I_Q(T; Y) = Σ_i I_Q(T_i; Y). Similarly, we can decompose E_Q[log P(X, T)] into a sum of terms, one for each family in P. These two factorizations can lead to tractable computation of the first two terms of the Lagrangian as written in Proposition 4.1. Unfortunately, the last term E_Q[log Q(T)] cannot be evaluated efficiently. Thus, we approximate this term as Σ_i E_Q[log Q(T_i)]. For the mean field factorization, the resulting Lagrangian (with this approximation) has the form

    L̃_EM = Σ_i I_Q(T_i; Y) − γ (E_Q[log P(X, T)] − Σ_i E_Q[log Q(T_i)])

As in the case of a single hidden variable, we can characterize fixed-point equations that hold at stationary points of the Lagrangian.

Proposition 5.1: Assuming a mean field approximation for Q(T | Y), a (local) minimum of L̃_EM is achieved by iteratively solving the following self-consistent equations for every hidden variable i independently:

    Q(t_i | y) = (1/Z(i, y)) Q(t_i)^(1−γ) exp{γ · EP(t_i, y)}

where EP(t_i, y) = E_{Q(T | t_i, y)}[log P(x[y], T)] and Z(i, y) is a normalizing constant.

The proof follows the lines of Theorem 7.1 of [6]. The only difference in the computation of the fixed-point equations is in the derivative of E_Q[log P(x[y], T)] with respect to Q(t_i | y), resulting in the different form for EP(t_i, y), which can still be computed efficiently. It is easy to see that when a single hidden variable is considered, the two forms coincide. A more interesting consequence of this discussion is that when γ = 1, minimizing L̃_EM is equivalent to performing mean field EM [10]. Thus, by using the modified Lagrangian we generalize this variational learning principle and, as we show below, manage to reach better solutions. We can easily extend the same idea to describe a correspondence between different choices of G_in and the "matching" structural approximation when applied to standard EM. For lack of space, we do not go into the details here, although they are fairly straightforward.

To summarize, the IB-EM algorithm of Section 4.2 can easily be generalized to handle multiple hidden variables by simply altering the form of EP(t_i, y) in the fixed-point equations. All other details, such as the continuation method, remain unchanged.

6 Experimental Validation

To evaluate the IB-EM method, we examine its generalization performance for several types of models on three real-life datasets.
In each architecture, we consider networks with hidden variables of different cardinalities, where for simplicity we use the same cardinality for all hidden variables in the same network. We now briefly describe the datasets and the model architectures we use.

The Stock dataset records up/same/down daily changes of 20 major US technology stocks over a period of several years [1]. The training set includes 1213 samples and the test set includes 303 instances. We trained a Naive Bayes hidden variable model where the hidden variable is a parent of all the observations.

The Digits dataset contains 400 instances sampled from the USPS (US Postal Service) dataset of handwritten digits. This data includes 40 images for each digit. An image is represented by 256 variables, each denoting the gray level of one pixel in a 16 x 16 matrix. We discretized pixel values into 10 equal bins. We used 320 images as a training set and 80 as a test set. On this data we tried several network architectures. The first is a Naive Bayes model with a single hidden variable. In addition, we examined more complex hierarchical models. In these models we recursively introduce a hidden parent for each quadrant of the image. The 3-level hierarchy has a hidden parent for each 8x8 quadrant, and then another hidden variable that is the parent of these four hidden variables. The 4-level hierarchy starts with 4x4 blocks, and has an additional intermediate level of hidden variables (a total of 21 hidden variables).

The Yeast dataset contains expression measurements of the baker's yeast genes in 173 experiments [7].
[Table 1: Comparison of the IB-EM algorithm, 50 runs of EM with random starting points, and 50 runs of mean field EM from the same random starting points. Shown are train and test log-likelihood per instance for the best and 80th-percentile random runs, as well as the percentile of the runs that are worse than the IB-EM result. Models include a Naive Bayes model for the Stock dataset and the Digit dataset; 3- and 4-level hierarchical models for the Digit dataset (DigH3 and DigH4); and a hierarchical model for the Yeast dataset, each at several cardinalities of the hidden variables (first column).]

[Figure 4: Comparison of the test performance of the IB-EM algorithm to the cumulative performance of 50 random EM and mean field EM runs on the 3-level hierarchy model with binary variables for the Digit domain. x-axis: percentage of random runs.]

These experiments measure the yeast response to changes in its environmental conditions. For each experiment the expression of 6152 genes was measured. We discretized the expression of genes into the ranges down/same/up using a threshold of ±1 standard deviation from the gene's mean expression across all experiments. In this data, we treat each gene as an instance that is described by its behavior in the different experiments. We randomly partitioned the data into 4922 training instances (genes) and 1230 test instances. The model we learned for this data has a hierarchical structure with 19 hidden variables in a 3-level hierarchy that was determined by the nature of the different experiments. The middle level has a hidden parent for all similar conditions (e.g., different types of heat shock), and the top level contains a single variable that is the parent of all variables in the middle level.

As a first sanity check, for each model (and each cardinality of hidden variables) we performed 50 runs of EM with random starting points. These resulted in parameters that have a wide range of likelihoods, both on the training set and the test set. These results (which we elaborate on below) indicate that these learning problems are challenging, in the sense that EM runs are trapped in different local maxima.

Next, we considered the application of IB-EM to these problems. We performed a single IB-EM run on each problem and compared it to the 50 random EM runs, and also to 50 random mean field EM runs. For example, Figure 4 compares the test set performance (log-likelihood per instance) of these runs on the Digit dataset with a 3-level hierarchy of binary hidden variables. The solid line shows the performance of the IB-EM solution at γ = 1. The two dotted lines show the cumulative performance of the random runs. As we can see, the IB-EM model is superior to all the mean field EM runs, and to 94% of the exact EM runs. It is important to note the time required by these runs, all on a Pentium IV 2.4 GHz machine. A single mean field EM run requires approximately 1 minute, an exact EM random run requires roughly 25 minutes, and the single IB-EM run took 27 minutes.
Thus, IB-EM takes about the same time as the mean field EM runs but reaches solutions that none of those runs achieve. The IB-EM run is roughly similar to a single exact EM run in terms of time, but we would need about 20 such EM runs to get a comparable solution. These proportions hold for other datasets, although for harder problems the standard EM runs are somewhat
more expensive (4-5 times) than the IB-EM runs.

[Figure 3: Continuation performance for two runs: (a) DigH3, C=2; (b) Digit, Naive Bayes, C=25. (top panel) Log-likelihood per instance vs. γ; the dotted lines show the best and 80th percentile of 50 random EM runs. (middle panel) Log-likelihood of unseen test data. (bottom panel) Predicted test performance using cross-validation on the training data. The vertical line denotes the CV estimate of γ*. Circles mark values of γ for which the Lagrangian was evaluated during the continuation process.]

Table 1 summarizes the results of these comparisons on the different learning problems. It shows both the training set and test set performance of the different methods. We show the best and the 80th-percentile run for each group of 50 random starting points, as well as the relative percentile of the IB-EM solution among these runs. As we can see, IB-EM outperforms the random restarts for the Naive Bayes models in most cases. For the harder problems that involve a hierarchy of hidden variables, the situation is more complex. When we compare IB-EM to mean field EM, we see that it is consistently better, often by a non-trivial margin. When we compare IB-EM to exact EM, we see that in the simpler 3-level Digit hierarchical models IB-EM is better than most of the EM runs. However, in the more complex hierarchical models the performance of IB-EM is worse than most EM runs. Our hypothesis is that this drop in performance is due to the inherent limitations of the mean field approximation in these models. This approximation loses much of the information about interactions between the hidden variables. We stress, however, that we selected models where we can perform exact inference over the hidden variables so that we can compare to exact EM. In many applications exact inference is infeasible, and approximations are needed. Clearly the mean field approximation is quite crude. Yet, as we discussed above, our framework allows us to use more refined variational approximations, and we expect that these will improve the performance of both variational EM and IB-EM.

We also compared the IB-EM method to the perturbation method of Elidan et al. [4]. Briefly, their method alters the landscape of the likelihood by perturbing the relative weights of the samples and progressively diminishing this perturbation as a function of a temperature parameter. On the Stock dataset, the perturbation method, initialized with a starting temperature of 4 and a cooling factor of 0.95, had performance similar to that of IB-EM. However, the running time of the perturbation method was an order of magnitude larger. For the other examples we considered above, running the perturbation method with the same parameters proved to be prohibitively expensive. When run with more efficient parameter settings, the perturbation method's performance was inferior to that of IB-EM. These results are consistent with those of Elidan et al. [4], who showed some improvement for the case of parameter learning but mainly focused on structure learning, with and without hidden variables.

We now turn to examine the continuation process more closely. Figure 3(a) illustrates the progression of IB-EM (on DigH3, C=2). The top panel shows the training log-likelihood per instance of the parameters at intermediate points in the process. This panel also shows the values of γ evaluated during the continuation process (circles).
As we can see, the continuation procedure focuses on the region where there are significant changes in the likelihood. The middle panel shows the likelihood on test data. As we can see, although the training set likelihood increases as we increase γ, the test set performance deteriorates at some stage, due to overfitting the data. This example suggests that we can improve performance by using the model at γ* instead of at γ = 1. In Section 4.4 we suggested using a CV test that is based only on the training data to estimate the critical value γ*. The lower panel shows the CV estimate of the test set likelihood using only the training data. Although these estimates are
biased, we see that they allow the learning algorithm to pinpoint the value of γ*. To demonstrate the effectiveness of this approach more clearly, we examine a situation where we have an excessive number of parameters, namely the Naive Bayes model on the Digit dataset with C = 25. In this scenario, we expect the learned model to overfit the training data, and thus we expect that γ* is lower than 1. Indeed, as we can see in Figure 3(b), at early stages of learning the model clearly overfits the data. However, using the CV estimate of γ*, the procedure learns a model whose test set performance is much better than that of all other Naive Bayes models learned on this dataset.

7 Discussion and Future Work

In this work we set out to learn models with hidden variables. We described a method for reaching a high-scoring solution by starting with a simple solution and following a trajectory to a high-scoring one. The contribution of this work is threefold. First, we made a formal connection between the Information Bottleneck principle [17, 6] and maximum likelihood learning for graphical models. The Information Bottleneck and its extensions were originally viewed as methods for understanding the structure of a distribution. We showed that, in some sense, the Information Bottleneck and maximum likelihood estimation are two sides of the same coin. The Information Bottleneck focuses on the distribution of the variables in each instance, while maximum likelihood focuses on the projection of this distribution on the estimated model. This understanding extends to general Bayesian networks the recent results of Slonim and Weiss [16], which relate the original Information Bottleneck and maximum likelihood estimation in univariate mixture distributions. Second, the introduction of the IB-EM principle allowed us to use an approach that starts with a solution at γ = 0 and progresses toward a solution in the more complex landscape of γ = 1. This general scheme is common in deterministic annealing approaches [15, 18], which "flatten" the landscape by raising the likelihood to the power of γ. The main technical difference of our approach is the introduction of a regularization term that is derived from the structure of the approximation of the probability of the latent variables in each instance. Third, we applied continuation methods for traversing the path from the trivial solution at γ = 0 to a solution at γ = 1. Unlike standard approaches in deterministic annealing and the Information Bottleneck, our procedure can automatically detect important regions where the solution changes drastically and ensure that they are tracked closely. In preliminary experimental results (not shown) the continuation method was clearly superior to standard annealing strategies.

As we show in our experimental results, applying IB-EM to hard learning problems leads to solutions that are often superior to the equivalent version of EM. Moreover, by using an early stopping rule, we can find solutions at intermediate values of γ that are better at generalization. The methods presented here can be extended in several directions. First, by relaxing the mean field variational approximation, we can explore a better tradeoff between tractability and the quality of the learned model. Second, we have used an approximate solution for performing continuation. Better approximations might lead to more accurate results (with fewer steps).
Third, we showed that by using cross-validation we can detect γ* values where generalization is better. Deriving a principled method for understanding this generalization has theoretical implications and can also lead to faster and more accurate learning. Finally, the same principle can be generalized to problems of structure learning, by replacing the parameter optimization step with a structure learning procedure. This results in a natural extension of the structural EM framework [5] for learning the structure as well as the parameters of the Bayesian network of interest.

Acknowledgements

We thank R. Bachrach, Y. Barash, G. Chechik, D. Koller, M. Ninio, T. Tishby, and Y. Weiss for discussions and comments on earlier drafts of this paper. This work was supported, in part, by a grant from the Israeli Ministry of Science. G. Elidan was also supported by the Horowitz fellowship. N. Friedman was also supported by an Alon Fellowship and by the Harry & Abe Sherman Senior Lectureship in Computer Science.

References

[1] X. Boyen, N. Friedman, and D. Koller. Learning the structure of complex dynamic systems. In UAI, 1999.
[2] T. M. Cover and J. A. Thomas. Elements of Information Theory, 1991.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1-38, 1977.
[4] G. Elidan, M. Ninio, N. Friedman, and D. Schuurmans. Data perturbation for escaping local maxima in learning. In AAAI, 2002.
[5] N. Friedman. The Bayesian structural EM algorithm. In UAI, 1998.
[6] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In UAI, 2001.
[7] A. P. Gasch et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Bio. Cell, 11, 2000.
[8] F. Glover and M. Laguna. Tabu search. In Modern Heuristic Techniques for Combinatorial Problems.
[9] D. Heckerman. A tutorial on learning with Bayesian networks. In Learning in Graphical Models, 1998.
[10] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models, 1998.
[11] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, May 1983.
[12] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 1995.
[13] R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. In Learning in Graphical Models, 1998.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems, 1988.
[15] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 1998.
[16] N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck. In NIPS, 2002.
[17] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. 37th Allerton Conference on Communication, Control, and Computing, 1999.
[18] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11, 1998.
[19] L. T. Watson. Theory of globally convergent probability-one homotopies for nonlinear programming. Technical report TR-00-04, Computer Science, Virginia Tech, 2000.
Christfried Webers. Canberra February June 2015
c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic
Statistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce
Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce Erik B. Reed Carnegie Mellon University Silicon Valley Campus NASA Research Park Moffett Field, CA 94035 [email protected]
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma
CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are
Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009
Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths
Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing
CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and
Covariance and Correlation
Covariance and Correlation ( c Robert J. Serfling Not for reproduction or distribution) We have seen how to summarize a data-based relative frequency distribution by measures of location and spread, such
Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34
Machine Learning Javier Béjar cbea LSI - FIB Term 2012/2013 Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34 Outline 1 Introduction to Inductive learning 2 Search and inductive learning
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski [email protected] Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
Machine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Statistical Machine Translation: IBM Models 1 and 2
Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation
Practical Guide to the Simplex Method of Linear Programming
Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
Statistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
Supplement to Call Centers with Delay Information: Models and Insights
Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290
Optimizing Prediction with Hierarchical Models: Bayesian Clustering
1 Technical Report 06/93, (August 30, 1993). Presidencia de la Generalidad. Caballeros 9, 46001 - Valencia, Spain. Tel. (34)(6) 386.6138, Fax (34)(6) 386.3626, e-mail: [email protected] Optimizing Prediction
Ira J. Haimowitz Henry Schwarz
From: AAAI Technical Report WS-97-07. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Clustering and Prediction for Credit Line Optimization Ira J. Haimowitz Henry Schwarz General
2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering
2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering Compulsory Courses IENG540 Optimization Models and Algorithms In the course important deterministic optimization
Bayesian Networks. Read R&N Ch. 14.1-14.2. Next lecture: Read R&N 18.1-18.4
Bayesian Networks Read R&N Ch. 14.1-14.2 Next lecture: Read R&N 18.1-18.4 You will be expected to know Basic concepts and vocabulary of Bayesian networks. Nodes represent random variables. Directed arcs
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing
CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate
Lecture 8 February 4
ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
פרויקט מסכם לתואר בוגר במדעים )B.Sc( במתמטיקה שימושית
המחלקה למתמטיקה Department of Mathematics פרויקט מסכם לתואר בוגר במדעים )B.Sc( במתמטיקה שימושית הימורים אופטימליים ע"י שימוש בקריטריון קלי אלון תושיה Optimal betting using the Kelly Criterion Alon Tushia
Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams
Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams André Ciré University of Toronto John Hooker Carnegie Mellon University INFORMS 2014 Home Health Care Home health care delivery
Tracking Groups of Pedestrians in Video Sequences
Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESC-ID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
Lecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
Comparison of machine learning methods for intelligent tutoring systems
Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu
A hidden Markov model for criminal behaviour classification
RSS2004 p.1/19 A hidden Markov model for criminal behaviour classification Francesco Bartolucci, Institute of economic sciences, Urbino University, Italy. Fulvia Pennoni, Department of Statistics, University
Learning Vector Quantization: generalization ability and dynamics of competing prototypes
Learning Vector Quantization: generalization ability and dynamics of competing prototypes Aree Witoelar 1, Michael Biehl 1, and Barbara Hammer 2 1 University of Groningen, Mathematics and Computing Science
