Selection Sampling from Large Data sets for Targeted Inference in Mixture Modeling


Ioanna Manolopoulou, Cliburn Chan and Mike West

December 29, 2009

Abstract

One of the challenges of Markov chain Monte Carlo in large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provide little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full data is used to draw a further set of observations from a low-probability region of interest, and describe how inferences can be made efficiently by reducing the dimensionality of the problem. Finally, we extend our method to a Sequential Monte Carlo framework whereby the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. We implement our algorithm on a flow cytometry dataset, providing higher-resolution inferences for rare cell subtypes.

Ioanna Manolopoulou is Postdoctoral Fellow, Department of Statistical Science; Cliburn Chan is Professor, Department of Biostatistics and Bioinformatics; Mike West is Professor, Department of Statistical Science, Duke University, Durham NC.
1 Introduction

Following technological advances, in many biological fields a vast amount of data is available; take, for example, flow cytometry, where tens of thousands to millions of individual cells, each with multiple different fluorescent-tagged antibody labels, are assayed in a single blood or other fluid sample (see Chan et al., 2008). Although Markov chain Monte Carlo is a very powerful tool for drawing inferences, it requires calculating the likelihood of the full data at each iteration. This is a serious drawback in the case of big datasets, often rendering it computationally prohibitive. Several approaches have been developed in order to address this problem. In most cases, very large datasets are addressed by drawing inferences on computationally manageable subsamples which are drawn randomly from the full data. Ridgeway and Madigan (2002) proposed a two-step algorithm of drawing subsamples in a Sequential Monte Carlo sampler without a mutation step, which was then improved by Balakrishnan and Madigan (2006) by introducing a rejuvenation step based on a kernel smoothing approximation similar to Liu and West (2000). In this paper we are interested in drawing inferences about low-probability regions in sample space when large amounts of data from a mixture model are available, yielding few observations in the region of interest. Computational methods in mixture models have been studied extensively and provide a very flexible tool for modelling complex distributions; see, for example, MacEachern (1998), MacEachern et al. (1999) and Müller et al. (1996). The motivating application arises in flow cytometry, where a vast number of observations (corresponding to cells) is available, with several markers for each cell (see Chan et al., 2008). The data are assumed to follow a Gaussian mixture model, with individual components or groups of components representing cell types.
Specific interest lies in characterizing a given cell subtype, which may often be significantly rare. For example, polyfunctional lymphocyte subsets that are of interest in predicting vaccine efficacy (Seder et al., 2008) may have frequencies of 0.01% or less of the total peripheral blood cell population. As a result, random subsamples typically contain very few observations of the rare subtype. The key idea is to use an initial random subsample in order to construct a weight function directed around the region of interest, which is subsequently used to draw a targeted subsample. Using nonparametric Bayesian mixture models, we implement a two-step Markov chain Monte Carlo approach of first using the random subsample to draw inferences, and then combining it with the targeted
subsample. We extend the method to a Sequential Monte Carlo algorithm whereby the targeted subsample is augmented sequentially as more information becomes available, until no more informative data points appear to be present in the full data. The idea of selective sampling through a weight function has been used in the context of discovery models; see West (1994) and West (1996). We assume that the data follow the mixture distribution

f(x) = \sum_{j=1}^{J} \pi_j f_j(x).

Owing to the flow cytometry application, we assume data which follow a Gaussian mixture model as implemented by Chan et al. (2008) (see Appendix A), where cell subtypes correspond to groups of Gaussian components; our algorithms, however, may easily be adapted for non-Gaussian mixtures. The region about which we aim to draw inferences is determined by the scientific question at hand, and need not be a low-probability region. In this paper we focus on drawing inferences about the parameters φ_K = (µ_K, Σ_K) of a low-probability component K of a Gaussian mixture with Dirichlet process mixing weights characterized by θ = (µ, Σ, π, z, V, α), specified, e.g., as the component centered closest to a specific point.

2 Markov chain Monte Carlo approach

The objective is to identify and analyze subsamples of the data which contain information about the specific subset of the parameters of interest. The key idea is to obtain a rough estimate of the low-probability component K based on a random subsample, which is subsequently used to draw weighted subsamples of the data that are more likely to be relevant to our analysis, providing us with higher resolution about the structure of the distribution in the region of interest. The direct approach is to follow a two-step procedure of Markov chain Monte Carlo samplers. We use an initial, randomly drawn subsample from the data in order to obtain an estimate of the parameters, and use this estimate to draw a more informative subsample.
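To fix ideas, the mixture setting above can be simulated directly. The following is a minimal Python sketch (the authors' own implementation is in MATLAB and is not reproduced here); the component weights, means and covariances are purely illustrative, with a rare component playing the role of component K.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, weights, means, covs, rng):
    """Draw n points from a finite Gaussian mixture; returns data and labels z."""
    weights = np.asarray(weights, dtype=float)
    z = rng.choice(len(weights), size=n, p=weights / weights.sum())
    x = np.empty((n, len(means[0])))
    for k in range(len(weights)):
        idx = z == k
        # fill in the rows assigned to component k
        x[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.sum())
    return x, z

# Two abundant components and one rare one (component K = 2, weight 0.1%).
weights = [0.699, 0.300, 0.001]
means = [np.zeros(2), np.array([4.0, 0.0]), np.array([8.0, 8.0])]
covs = [np.eye(2), np.eye(2), 0.25 * np.eye(2)]
X, z = sample_mixture(50_000, weights, means, covs, rng)
```

With a weight of 0.1%, a random subsample of 5,000 points would be expected to contain only about 5 observations from the rare component, which is the difficulty the targeted subsample is designed to address.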
The two subsamples are then combined in a joint Markov chain Monte Carlo sampler to provide us with more accurate estimates of φ_K. Although interest specifically lies in estimating the parameters of component K, given by µ_K, Σ_K, inference on the full set of µ, Σ is required in order to carry out the analysis.
We denote the two subsamples (random and targeted) by X^R and X^T, of sizes n_R and n_T respectively. The first is drawn randomly from the data, whereas the second is drawn according to weights w_i, 1 ≤ i ≤ N. We aim to choose the weights so that the targeted subsample contains mostly observations from component K; thus we may choose w_i = w(x_i) = N(x_i | m, τS), where m, S are estimates of µ_K, Σ_K from the initial analysis (based on the random subsample), and τ is a tuning scalar. The constant τ determines how far from the initial estimates the targeted subsample may spread. In other words, we choose the second subsample of the data according to how well it fits the estimated distribution of component K, possibly allowing for a wider distribution through the constant τ. The likelihood of the data (X^R, X^T) in component k then takes the following form. For observations in the random subsample:

f(x_i^R \mid z_i = k, \mu, \Sigma) = N(x_i^R \mid \mu_k, \Sigma_k), \quad i = 1, \dots, n_R,

and

f(x_i^R \mid \mu, \Sigma) = \sum_{k=1}^{K} \pi_k N(x_i^R \mid \mu_k, \Sigma_k), \quad i = 1, \dots, n_R.

For observations in the targeted subsample:

f(x_i^T \mid X^R, z_i = k, \mu, \Sigma) \propto w(x_i^T) N(x_i^T \mid \mu_k, \Sigma_k) \propto N(x_i^T \mid \tilde\mu_k, \tilde\Sigma_k), \quad i = 1, \dots, n_T,

where

\tilde\Sigma_k = (\Sigma_k^{-1} + (\tau S)^{-1})^{-1} \quad \text{and} \quad \tilde\mu_k = \tilde\Sigma_k (\Sigma_k^{-1} \mu_k + (\tau S)^{-1} m),

and

f(x_i^T \mid X^R, \mu, \Sigma) = \sum_{k=1}^{K} \tilde\pi_k(\theta) N(x_i^T \mid \tilde\mu_k, \tilde\Sigma_k), \quad i = 1, \dots, n_T,
where

\tilde\pi_k(\theta) = \frac{\pi_k N(\mu_k \mid m, \tau S + \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} N(\mu_{k'} \mid m, \tau S + \Sigma_{k'})}.

Note here that we are using only the unnormalized weights w(x_i) ∝ N(x_i | m, τS) even though we are drawing without replacement, assuming that \sum_{i=1}^{N} N(x_i \mid m, \tau S) remains unchanged after drawing each of the targeted data points; in other words, that the unnormalized weights sum to infinity. This means that we assume a very large number of data points within the region of non-negligible support of the weight function w(x). The first Markov chain Monte Carlo sampler is a standard blocked Gibbs sampler (see Ishwaran and James, 2002) with target distribution p(µ, Σ, π, z, V, α | X^R). In order to carry out the second Markov chain Monte Carlo sampler, based on the random and targeted subsamples combined, the posterior distributions of the parameters z, π, µ, Σ, α have to be recalculated so that efficient proposals can be constructed. The posterior for z is multinomial, with probabilities

p(z_i = k \mid X^R, X^T, \mu, \Sigma) \propto \pi_k f(x_i \mid z_i = k, \mu_k, \Sigma_k)

for both subsamples. The posterior distribution of π | X^R, X^T, z, µ, Σ does not follow a closed-form distribution; see Equation (A1) in Appendix B. The contribution of the targeted subsample to the posterior becomes more significant as τS increases, allowing observations in the targeted subsample to belong to components other than K. The posterior for α only depends on the data through V and thus has the usual posterior distribution (see Ishwaran and James, 2002):

\alpha \sim \text{Gamma}\Big(\eta_1 + K - 1,\; \eta_2 - \sum_{k=1}^{K-1} \log(1 - V_k)\Big). \quad (1)
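The weighted draw and the tilted component probabilities can be sketched numerically. The following Python functions are an illustrative implementation (function names are ours, not the paper's), with the weight function w(x_i) ∝ N(x_i | m, τS) and the tilted weights proportional to π_k N(µ_k | m, τS + Σ_k) as in the text.

```python
import numpy as np

def gauss_pdf(x, m, cov):
    """Multivariate normal density N(x | m, cov), evaluated row-wise on x."""
    d = x - m
    prec = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->i', d, prec, d)
    norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def draw_targeted(x, m, S, tau, n_T, rng):
    """Draw n_T points without replacement with weights w_i ∝ N(x_i | m, tau*S)."""
    w = gauss_pdf(x, m, tau * S)
    p = w / w.sum()
    idx = rng.choice(len(x), size=n_T, replace=False, p=p)
    return x[idx]

def tilted_weights(pi, mu, Sigma, m, S, tau):
    """Tilted component probabilities ∝ pi_k N(mu_k | m, tau*S + Sigma_k)."""
    num = np.array([pi[k] * gauss_pdf(mu[k][None, :], m, tau * S + Sigma[k])[0]
                    for k in range(len(pi))])
    return num / num.sum()
```

In practice the weights for observations far from m underflow to zero, which is harmless here: the draw then concentrates on the region of non-negligible support of w(x), exactly as intended.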
The posterior for µ_k can be calculated exactly as µ_k | X, z, Σ_k ~ N(m_k^µ, S_k^µ), where S_k^µ, m_k^µ may be readily calculated through Equation (A2) in Appendix B. The posterior for Σ does not follow an Inverse Wishart and cannot be easily sampled from (see Equation (A3) in Appendix B). Due to the nonstandard posterior distributions and the dimensionality of the problem, approximating them in order to construct efficient proposals is crucial.

2.1 Markov chain Monte Carlo updates

After obtaining the targeted subsample, we construct a Markov chain Monte Carlo sampler with target distribution

p(\mu, \Sigma, \pi, z, V, \alpha \mid X^R, X^T) \propto p(X^T \mid z^T, X^R, \theta)\, p(z^T \mid X^R, \theta)\, p(X^R \mid z^R, \theta)\, p(z^R \mid \theta)\, p(\theta).

The chain is initialized by drawing µ, Σ, π, z, V, α from their priors, then iterates through the following steps.

1. Update z by generating from the posterior p(z | X^R, X^T, π, µ, Σ) ∝ π_z f(X^R, X^T | µ, Σ).

2. Update π through a Metropolis-Hastings step by generating from the posterior p(V | X^R), setting \pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), and accepting the proposed move with probability

\min\Big(1, \prod_{k=1}^{K} \big(\tilde\pi_k(\theta^*) / \tilde\pi_k(\theta)\big)^{n_k^T}\Big),

where θ* denotes the proposed values. If the targeted subsample is indeed drawn such that almost all of its points belong to component K, the acceptance probability will be close to 1.
3. Update α from its posterior given V, given in Equation (1).

4. Update µ through a Gibbs step using µ_k | X, z, Σ_k ~ N(m_k^µ, S_k^µ) above.

5. The posterior distribution of Σ does not take closed form. We construct a proposal distribution q(Σ_k | X^R, X^T, z, µ) for a Metropolis-Hastings step. Using that

f(x_i^T \mid X^R, z_i = k, \mu, \Sigma) = \tilde\pi_k(\theta) N(x_i^T \mid \tilde\mu_k, \tilde\Sigma_k), \quad \text{where} \quad \tilde\pi_k(\theta) = \frac{\pi_k N(\mu_k \mid m, \tau S + \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} N(\mu_{k'} \mid m, \tau S + \Sigma_{k'})},

we can use the inverse transformation to obtain \tilde{X}_i^T \mid X^R, z_i = k, \mu, \Sigma \sim N(\mu_k, \Sigma_k), where

\tilde{X}_i^T = \Sigma_k \big(\tilde\Sigma_k^{-1} X_i^T - (\tau S)^{-1} m\big).

In practice, of course, Σ_k is not known, and the transformation of X^T can only be approximated using an estimate of Σ_k, e.g. the value from the previous iteration. We then propose

q(\Sigma_k \mid X^R, X^T, z, \mu) = IW(W_k + S_0,\; n_k + s_0 + p - 1), \quad \text{where} \quad W_k = \sum_{z_i = k} (\tilde{X}_i - \bar{\tilde{X}}_k)(\tilde{X}_i - \bar{\tilde{X}}_k)^T,

with \tilde{X}^R = X^R (observations in the random subsample are left untransformed). In addition, a discount factor may be used in order to increase the variance of the proposal kernel. The Markov chain Monte Carlo sampler sweeps through the updates described above, yielding estimates for the posterior distribution of the parameters of interest. However, due to the high number of parameters to be estimated and the difficulty in defining efficient proposals, the acceptance rate quickly drops to zero for targeted subsamples of moderate size.

3 Focusing on the low-probability component

The dimensionality of the problem, combined with the difficulty of constructing efficient proposals, results in Markov chain Monte Carlo samplers which require very long running times in order to
eventually be sampling from the true posterior. At the same time, the approach described above does not exploit the results from the initial run based on the random sample, except for extracting the estimates of µ_K, Σ_K. We describe how the dimensionality of the problem can be greatly reduced using the posterior distribution estimates obtained from the initial Markov chain Monte Carlo simulation. Notice that the objective is to draw inferences about a region in sample space which has very low probability. Consequently, very few points in the initial random sample will belong to that region. On the other hand, the targeted sample will, generally, contain observations from the low-probability region. This implies that the posterior distribution of the parameters based on both the random and targeted samples (X^R, X^T),

p(\mu, \Sigma \mid X^R, X^T) = \sum_{z^R} p(\mu, \Sigma \mid X^R, X^T, z^R)\, p(z^R \mid X^R, X^T),

can be approximated as

p(\pi, \mu, \Sigma \mid X^R, X^T) \approx \sum_{z^R} \underbrace{p(\pi, \mu, \Sigma \mid X^R, X^T, z^R)}_{(a)}\; \underbrace{p(z^R \mid X^R)}_{(b)},

using that p(z^R | X^R, X^T) ≈ p(z^R | X^R). Here (a) requires integrating over a much smaller set of parameters z^T and can be calculated much more efficiently, and (b) is known from the first Markov chain Monte Carlo run. This decouples the z-dependence of the random and the targeted sample, greatly reducing the dimensionality of the second analysis. The second Markov chain Monte Carlo sampler is then adapted to a set of chains, one for each of a set of particles drawn from the posterior distribution estimate of the first chain. For particles l = 1 : L, draw a sample of (z, π, µ, Σ)^l | X^R from the posterior distribution estimates obtained in the first Markov chain Monte Carlo sampler, and carry out the second sampler for each particle only on µ_K, Σ_K | X^R, X^T, (z^R, π, φ_{-K})^l, combining samples at the end.
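The particle-wise second stage can be sketched as below. As a deliberate simplification (a hypothetical sketch, not the paper's sampler), only µ_K is updated per particle, via random-walk Metropolis under a plain Gaussian likelihood for the targeted points; the full algorithm would also update Σ_K and account for the weight-function tilt and the prior.

```python
import numpy as np

def log_lik(x, mu, cov_inv, logdet):
    """Gaussian log-likelihood of the rows of x under N(mu, cov)."""
    d = x - mu
    return -0.5 * (np.einsum('ij,jk,ik->i', d, cov_inv, d).sum()
                   + len(x) * (logdet + x.shape[1] * np.log(2 * np.pi)))

def focused_mh(x_T, particles_mu, Sigma_K, n_steps, step, rng):
    """For each first-stage particle of mu_K, run a short random-walk Metropolis
    chain targeting p(mu_K | X^T, Sigma_K); returns the updated particle set."""
    cov_inv = np.linalg.inv(Sigma_K)
    logdet = np.linalg.slogdet(Sigma_K)[1]
    out = []
    for mu in particles_mu:
        ll = log_lik(x_T, mu, cov_inv, logdet)
        for _ in range(n_steps):
            prop = mu + step * rng.standard_normal(mu.shape)
            ll_prop = log_lik(x_T, prop, cov_inv, logdet)
            if np.log(rng.uniform()) < ll_prop - ll:  # flat prior in this sketch
                mu, ll = prop, ll_prop
        out.append(mu)
    return np.array(out)
```

Because each particle carries its own configuration of the remaining parameters, the chains are independent and trivially parallelizable, which is the computational point of the focused approach.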
This approach greatly reduces both the complexity of the calculations per sweep and the total number of samples required in order to obtain a good approximation of the posterior distribution. However, because the posteriors µ_K, Σ_K | X^R and µ_K, Σ_K | X^R, X^T may differ greatly, the sampler still suffers from very low acceptance rates, and with a moderately sized targeted subsample it can fail to reach the region of high posterior probability in parameter space.
4 Sequential Monte Carlo approach

The focused approach drastically reduces the dimensionality of the algorithm, and as a result the computational complexity. However, Metropolis-Hastings updates still show low acceptance rates, because the two posteriors, given X^R in the one case and X^R, X^T in the other, are very different. In addition, the size of the targeted subsample is chosen manually rather than through an automated procedure. Both drawbacks may be addressed by drawing the targeted sample through a Sequential Monte Carlo simulation rather than using a two-step procedure. Sequential Monte Carlo methods provide simulation-based inferences from a sequence of probability distributions. A large number of random samples (particles) is used to approximate the sequence of distributions, so that asymptotically it converges to the true target distribution; see Doucet et al. (2001) and Lopes et al. Here Sequential Monte Carlo can be used instead of the two-step procedure described above (whereby an initial random sample X^R is drawn, subsequently giving rise to the targeted sample X^T). We use a sequential scheme such that the targeted sample is selected one (or more) data point at a time, at each draw updating the estimates of the parameters of component K for a set of particles. In other words, we use the fact that the likelihood of the data may be expressed as

p(X_{1:n} \mid \mu, \Sigma) = \prod_{i=1}^{n} p(x_i \mid X_{1:i-1}, \mu, \Sigma).

For each of a set of particles, draw a sample of (z, π, µ, Σ) | X^R from the posterior distribution estimates obtained in the Markov chain Monte Carlo sampler. Then repeatedly augment the targeted subsample and mutate the parameter estimates through the following steps. For j = 1 : J and for a fixed sequence of τ_{1:J}:

1. Draw u ~ U{1 : J} and set m^{j-1} = {µ_K^{j-1}}_u and S^{j-1} = {Σ_K^{j-1}}_u, where {φ_k^j}_u is the sample of the u-th particle at step j for component k.

2.
Draw another batch of targeted observations X^{T_j} without replacement according to weights w_i ∝ N(x_i | m^{j-1}, τ_{j-1} S^{j-1}).

3. Update the configuration indicators z using the posterior weights π_k N(x | µ_k, Σ_k).
4. Using a fixed number of Metropolis-Hastings steps following the iterates described in the Markov chain Monte Carlo approach above, update

\mu_k, \Sigma_k, \pi_k, \alpha \mid X^R, X^{T_{1:j}}, z.

The posterior distribution of µ_k now becomes µ_k | X^R, X^{T_{1:j}}, z, Σ_k ~ N(m_k^µ, S_k^µ), where

S_k^\mu = \Big(\Sigma_k^{-1}/t_0 + n_k^R \Sigma_k^{-1} + \sum_{i=1}^{j} n_k^{T_i} \big((\tau_i S^i)^{-1} \Sigma_k + I\big)^{-1} \Sigma_k^{-1}\Big)^{-1},

m_k^\mu = S_k^\mu \Big(n_k \Sigma_k^{-1} \bar{x}_k - \sum_{i=1}^{j} n_k^{T_i} \big((\tau_i S^i)^{-1} \Sigma_k + I\big)^{-1} (\tau_i S^i)^{-1} m^i + \Sigma_k^{-1}\mu_0/t_0\Big),

where n_k is the total number of data points in component k and n_k^{T_i} is the number of data points in that component coming from the i-th targeted batch. It can be shown that, asymptotically (as the number of particles tends to infinity), the approximation of the target distribution converges to the true density, with the error being of order 1/\sqrt{N}. The parameter τ is a tuning parameter which allows monitoring both the dispersal of the targeted sample and the validity of the assumption of infinite weights. Although in the example presented here (see Subsection 4.1) the parameter τ is held fixed at τ_i = 1 for all i, values greater or smaller than 1 may be more beneficial (see Appendix C). Owing to the way in which the parameters m, S of the weight function are fixed at each step of the resampling, weight functions located around different regions of sample space may be chosen. When the low-probability component follows a mixture distribution across different regions of sample space, this will be reflected in the estimates obtained from each particle, resulting in each particle corresponding to a different draw. Through our adaptive algorithm, the sample space is explored flexibly and posterior estimates of the parameters are updated incrementally as the targeted subsample is augmented, allowing more efficient inferences. This approach immediately poses the question of when to stop drawing observations for the targeted subsample. Ideally, we would like the targeted sample to contain all data points of component
K. In order to address this, we introduce a decision rule such that the targeted sample stops being augmented when no more data points in the remaining original data show a high probability of belonging to component K. A natural quantity to use is the Bayes factor for that component; see West and Harrison (1997). In other words, we introduce an extra decision step.

5a. If there are no unsampled observations with Bayes factor

BF_K(x_i) = \frac{\pi_K^*(x_i)/(1 - \pi_K^*(x_i))}{\pi_K/(1 - \pi_K)}

exceeding a given threshold, where \pi_K^*(x_i) \propto \pi_K N(x_i \mid \mu_K, \Sigma_K), stop.

The calculation of the Bayes factor is computationally demanding; as an alternative, the stopping rule may be expressed purely as a function of the weights. In other words,

5b. If there are fewer than N_threshold unsampled observations within a c_threshold contour of the weight function, stop.

The Sequential Monte Carlo approach provides an efficient method of drawing inferences about parameters relevant to a low-probability region of sample space, while at the same time allowing the algorithm to automatically monitor the number of observations in the region of interest.

4.1 Example: flow cytometry

The motivating example for this study is a problem arising in flow cytometry, where cellular subtypes may be associated with one (or more) components of a Gaussian mixture model (see Chan et al., 2008). Flow cytometers detect fluorescent reporter markers that typically correspond to specific cell-surface or intracellular proteins on individual cells, and can assay millions of such cells in a fluid stream in minutes. Datasets are typically very large, and as a result inference on the full data is computationally prohibitive. Interest lies in identifying and characterizing rare cell subtypes using a mixture model fitted on those markers. The ability to identify such rare cell subsets plays an important role in many medical contexts; for example, the detection of antigen-specific cells with MHC
class I or class II markers, identification of polyfunctional T lymphocytes that correlate with vaccine efficacy or host resistance to pathogens, or in resolving variants of already low-frequency cell types, e.g. subtypes of conventional dendritic cells. We use a dataset of 50,000 data points from human peripheral blood cells, with 6 marker measurements each: Forward Scatter, Side Scatter, CD4, IFNg+IL2, CD8, CD3.[1] The objective is to provide higher resolution on the structure and patterns of covariation of cells of a specific cell subtype, specifically CD3+CD4+ and CD3+CD8+ cells secreting IL2/IFNg when challenged with a specific viral antigen. The data show a clear component structure for some of the markers (see Figure 1), whereas in others the rare cell subtypes of interest are not separated. We specify the statistical question as drawing inferences about the component centered closest to the markers corresponding to a specific cell of known rare subtype. To illustrate our methods, and for ease of exposition, we adapt our algorithm by targeting inferences towards the component with the highest CD4 centre. An initial sample of size 5,000 is drawn, providing us with initial estimates m, S for the mean and covariance of the component closest to the high-CD4+ region. Due to the strong covariation between the markers, several components are needed (see Figure 3) in order to capture the inhomogeneity of the data. Using initial weights w(x) ∝ N(x | m, S), we apply our Sequential Monte Carlo algorithm to obtain a complete targeted subsample in terms of the stopping rule, as well as posterior samples for all our parameters.
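The sequential augmentation with stopping rule 5b can be sketched as follows. This is an illustrative Python version (function names and the density-based contour criterion are our own parametrization of "a c_threshold contour"), and the parameter-mutation step of the full algorithm is omitted for brevity.

```python
import numpy as np

def gauss_logpdf(x, m, cov):
    """Row-wise multivariate normal log-density log N(x | m, cov)."""
    d = x - m
    prec = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->i', d, prec, d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (quad + logdet + x.shape[1] * np.log(2 * np.pi))

def augment_until_stopped(x, m, S, tau, batch, n_threshold, c_threshold, rng):
    """Sequentially draw targeted batches with weights ∝ N(x_i | m, tau*S); stop
    when fewer than n_threshold unsampled points remain inside the c_threshold
    contour (rule 5b). The mutation of (m, S) between batches is omitted here."""
    unsampled = np.ones(len(x), dtype=bool)
    targeted = []
    while True:
        logw = gauss_logpdf(x[unsampled], m, tau * S)
        # "inside the contour": density above c_threshold times the peak density
        peak = gauss_logpdf(m[None, :], m, tau * S)[0]
        inside = logw > peak + np.log(c_threshold)
        if inside.sum() < n_threshold:
            break
        p = np.exp(logw - logw.max())
        p /= p.sum()
        take = rng.choice(np.flatnonzero(unsampled),
                          size=min(batch, inside.sum()), replace=False, p=p)
        targeted.append(x[take])
        unsampled[take] = False
    return np.vstack(targeted) if targeted else np.empty((0, x.shape[1]))
```

Each pass removes the drawn points from the pool, so the loop terminates once the region of non-negligible weight has been exhausted, which is exactly the behaviour the stopping rule is designed to detect.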
Looking at the posterior distribution of the total number of components, based first on the initial MCMC sampler given the random subsample, and subsequently on the SMC sampler given both the random and targeted subsamples, we observe that the targeted approach has indeed provided a better fit for the structure of the data, reflected through the increased number of components (see Figure 2). More specifically, observing samples from the mixture model in the CD4 and IFNg markers before and after the targeted subsample (see Figure 3), we see that the targeted approach has led to the emergence of more Gaussian components around the region of the rare cell subtypes, providing higher resolution about the structure and covariation of their markers.

[1] Data from an NIAID/BD IntraCellular Staining Quality Assurance Panel (ICS QAP), kindly provided by the Duke Center for AIDS Research (CFAR) Immune Monitoring Core.
Figure 1: Pair plots for the last 4 markers: CD4, IFNg, CD8 and CD3. The complete data set is shown in yellow. We aim to use the random subsample (shown in red) in order to obtain samples from the initial posterior p(µ, Σ, π, α | X^R) and draw the targeted subsample (shown in blue) using estimates of the distribution of the data (superimposed as a contour plot).

More importantly, our targeted approach has revealed components in the low-probability subregion which emerge due to the covariation with the remaining markers. These findings agree with the biologists' expectation that cell subtypes may have a non-Gaussian structure.
Figure 2: Posterior distributions for the number of components in the Gaussian mixture model, given only the random subsample, p(K | X^R), shown in black, and given both the random and targeted subsamples, p(K | X^R, X^T), shown in white.

5 Additional comments

One of the key aspects of this work consists in defining the low-probability region of interest and specifying the weight function used to draw the targeted sample. Naturally, the low-probability region in sample space is strongly driven by the scientific question at hand. Based on that, and taking into account algorithmic tractability and efficiency, different weight functions may be used. In this work we presented methods relating to inferences about a specific component, defined in terms of an identifying criterion. In the flow cytometry example used in this paper, this was chosen as the component with mean closest to a specific point. Although the weight function used had a Gaussian shape, the analysis revealed a non-Gaussian structure in the region of interest; using mixtures of components as a weight function would be a straightforward extension of our methods. In fact, a hierarchical model using mixtures of mixtures may provide a better fit to the non-Gaussian, inhomogeneous structure of the flow cytometry data; our targeted subsampling approach can be implemented with such models at little additional computational cost.
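The extension to mixture weight functions mentioned above amounts to replacing the single Gaussian weight with a sum of Gaussians; a minimal sketch (the centres and covariances below are illustrative, not estimates from the paper's data):

```python
import numpy as np

def mixture_weight(x, means, covs):
    """Weight function w(x) = sum_j N(x | m_j, S_j), spreading the targeted
    draw over several regions of sample space at once."""
    w = np.zeros(len(x))
    for m, S in zip(means, covs):
        d = x - m
        prec = np.linalg.inv(S)
        quad = np.einsum('ij,jk,ik->i', d, prec, d)
        norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(S))
        w += np.exp(-0.5 * quad) / norm
    return w
```

Since only the (unnormalized) weights enter the targeted draw, this drops into the earlier sampling step unchanged, which is why the extension carries little additional computational cost.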
Figure 3: Sample realizations of the mixture model fitted to the flow cytometry data using the Sequential Monte Carlo targeted resampling algorithm, (a) based on the random subsample and (b) based on both the random subsample and the targeted subsample. Crosses are shown at the mean of each component, with 50% contours drawn.

A natural extension to the weight functions used in this work stems from the fact that, in the original flow cytometry data, the identifying criterion for the component of interest is not defined on a fixed number of dimensions. Instead, it is defined in terms of the set of markers which are significant in identifying the component in the region of low probability in sample space, and this set is itself unknown. In other words, the Gaussian mixture may be defined only on an (unknown) subset of the p markers, such that we draw inferences about the parameters of the mixture p(θ_q | X), x_i ∈ R^q, for variable dimension q, q ≤ p. The targeted learning about θ_q can be incorporated in the analysis such that, within the sequential design, the weight function w(x) ∝ N(x | m, S) is updated at each round of resampling, both in terms of the mean m and covariance S of the Gaussian distribution, and in terms of the markers over which the weight function is defined. In the case of flow cytometry data, this can be viewed as soft gating of cells into cell subtypes, based both on the values of the individual markers and on the set of significant markers. One of the main challenges in drawing inferences about targeted subsamples is constructing efficient proposals for the parameters of interest, as the convergence of the algorithms is influenced by several factors. The size of the targeted subsample in relation to the random subsample plays
a significant role. This becomes especially important when the assumption of an infinite number of observations within the region of interest is breached, as this leads to a likelihood for the targeted subsample which deviates severely from the true likelihood because of sampling without replacement. The multiplicative constant τ also plays a significant role in constructing a weight function which is wide enough not to violate the infinite-weights assumption, while at the same time targeting the region of interest. Finally, our algorithms were implemented in MATLAB and the code is freely available upon request.
A Gaussian Mixture Model

We are given data X comprising a total of N data points from a p-dimensional Gaussian mixture distribution

f(x_i \mid \mu, \Sigma) = \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k),

using a standard truncated Dirichlet process mixing distribution (see Ishwaran and James, 2002). Here N(x | µ, Σ) represents the probability density function of a normal distribution with mean µ and covariance matrix Σ, and the parameters π_k represent the mixing weights. Let θ = {π_{1:K}, φ_{1:K}}, φ_j = {µ_j, Σ_j}. The mixture model can be realized through the configuration indicators z_i for each observation x_i, so that we obtain the standard hierarchical model

(x_i \mid z_i = k, \phi_k) \sim N(x_i \mid \mu_k, \Sigma_k), \quad (\phi_k \mid G) \sim G, \quad (G \mid \alpha, G_0) \sim DP(\alpha, G_0),

where G(·) is an uncertain distribution function, G_0(·) is the prior mean of G(·) and α > 0 is the total mass, or precision, of the DP. From the Pólya urn scheme,

\theta_i \mid \theta_1, \dots, \theta_{i-1} \sim \frac{\alpha}{i - 1 + \alpha}\, G_0(\cdot) + \frac{1}{i - 1 + \alpha} \sum_{j=1}^{i-1} \delta_{\theta_j}(\cdot).

For conditional conjugacy, it is convenient to use normal-inverse Wishart priors, i.e.,

G_0(\mu, \Sigma) = N(\mu \mid \mu_0, t_0 \Sigma)\, IW(\Sigma \mid s_0, S_0).

Finally, we assume a Gamma prior for the Dirichlet precision parameter, α ~ Gamma(η_1, η_2), and the mixing probabilities are such that \pi_k = V_k \prod_{i=1}^{k-1} (1 - V_i), where V_i ~ Beta(1, α).

B Posterior Distributions

Given both the random and the targeted subsample, the posterior distributions of the parameters take the following form.
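The truncated stick-breaking construction of the mixing weights can be sketched in a few lines of Python (an illustration, not the paper's MATLAB code):

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking: V_k ~ Beta(1, alpha), pi_k = V_k * prod_{i<k}(1 - V_i),
    with V_K = 1 so that the K weights sum to one."""
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0  # truncation: the last break takes the remaining stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining
```

Smaller values of the precision α concentrate the weights on the first few components, while larger values spread mass across many components, which is what makes the Gamma prior on α a prior on the effective number of components.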
The posterior for z is multinomial with probabilities

p(z_i = k \mid X^R, X^T, \mu, \Sigma) \propto \pi_k f(x_i \mid z_i = k, \mu_k, \Sigma_k)

for both subsamples. The π_k's can be realized through a set of stick-breaking weights V (see Ishwaran and James, 2002), such that, given the random subsample,

V_k \mid X^R, z \sim \text{Beta}(\gamma_1, \gamma_2), \quad \gamma_1 = 1 + n_k, \quad \gamma_2 = \alpha + \sum_{l=k+1}^{K} n_l,

with V_K = 1. The posterior distribution of π given both the random and targeted subsamples is given by

p(\pi \mid X^R, X^T, z, \mu, \Sigma) \propto p(\pi \mid X^R, z^R, \mu, \Sigma) \prod_{k=1}^{K} \tilde\pi_k^{n_k^T}, \quad \tilde\pi_k = \frac{\pi_k N\big(\mu_k \mid m, ((\tau S)^{-1} + \Sigma_k^{-1})^{-1}\big)}{\sum_{j=1}^{K} \pi_j N\big(\mu_j \mid m, ((\tau S)^{-1} + \Sigma_j^{-1})^{-1}\big)}. \quad (A1)

The contribution of the targeted subsample to the posterior distribution for π provides little additional information about the distribution of π when τS is small. The posterior for α depends on the data only through V and thus has the usual posterior distribution

\alpha \sim \text{Gamma}\Big(\eta_1 + K - 1,\; \eta_2 - \sum_{k=1}^{K-1} \log(1 - V_k)\Big).

The posterior for µ_k can be calculated exactly as µ_k | X, z, Σ_k ~ N(m_k^µ, S_k^µ),
where

S_k^\mu = \Sigma_k \Big( (1/t_0 + n_k^R)\, I + n_k^T \big((\tau S)^{-1} \Sigma_k + I\big)^{-1} \Big)^{-1},

m_k^\mu = S_k^\mu \Big( n_k \Sigma_k^{-1} \bar{x}_k - n_k^T \big((\tau S)^{-1} \Sigma_k + I\big)^{-1} (\tau S)^{-1} m + \Sigma_k^{-1}\mu_0/t_0 \Big), \quad (A2)

where n_k is the total number of data points in component k and n_k^T is the number of data points in that component coming from the targeted subsample. Notice that the contribution of the targeted subsample to the posterior precision of µ_k is n_k^T ((τS)^{-1}Σ_k + I)^{-1}, and since S is an estimate of Σ_k, this quantity is of the order n_k^T τ/(τ + 1), implying that the narrower the weight function, the less information about µ_k is available, which is intuitive. The posterior for Σ does not follow an Inverse Wishart distribution, and has the form

p(\Sigma_k \mid X, z, \mu_k) \propto |\Sigma_k|^{-s_0}\, |\Sigma_k|^{-n_k^R/2}\, \big|(\tau S)^{-1} + \Sigma_k^{-1}\big|^{n_k^T/2} \exp\Big\{ -\tfrac{1}{2}\mathrm{tr}(S_0 \Sigma_k^{-1}) - \tfrac{1}{2}\sum_{i=1}^{n_k} x_i^T \Sigma_k^{-1} x_i + n_k^R\, \mu_k^T \Sigma_k^{-1} \bar{x}_k^R + n_k^T\, \mu_k^T \Sigma_k^{-1} \bar{x}_k^T - \tfrac{n_k^R}{2}\, \mu_k^T \Sigma_k^{-1} \mu_k - \tfrac{n_k^T}{2}\, \mu_k^T \big((\tau S)^{-1}\Sigma_k + I\big)^{-1} \Sigma_k^{-1} \mu_k - n_k^T\, \mu_k^T \big((\tau S)^{-1}\Sigma_k + I\big)^{-1} (\tau S)^{-1} m - \tfrac{n_k^T}{2}\, m^T \big(\Sigma_k^{-1}\tau S + I\big)^{-1} (\tau S)^{-1} m \Big\}. \quad (A3)

C Weight functions

In both the Markov chain Monte Carlo and Sequential Monte Carlo approaches described above, the targeted sample was weighted proportionally to N(x_i | m, τS), where m and S are estimates of the mean and covariance of the low-probability component K. The multiplicative constant τ works as a tuning parameter. A larger value allows for wider dispersal of the targeted subsample, accounting for uncertainty in the initial estimate of φ_K. As τ decreases, the weights w_i in the targeted sample become heavily concentrated around a small number of data points. As a result, the assumption of an infinite number of points with non-negligible weight becomes invalid. If our initial estimate of µ_K, Σ_K is poor, a small τ will restrict the targeted sample to a region away from the full low-probability region of interest. Within the context of the Metropolis-Hastings updates,
Within the context of the Metropolis-Hastings updates, as \tau increases, the acceptance rate for (\mu, \Sigma) increases, since the targeted sample looks more like the random sample; the posterior distribution of \phi_K is then not pulled too far from the proposed values. At the same time, as \tau increases, the acceptance rate for \pi decreases, because the information about \pi carried by the targeted sample becomes significant, and the proposed values (which are based only on the random subsample) may be poor.

Consider the one-dimensional case where w(x_i) \propto N(x_i \mid m, \tau S) with p = 1, and assume that \mu, \Sigma, \pi are all known and that there is an infinite number of data points. The weight function becomes w(x_i) \propto N(x_i \mid \mu_K, \tau \Sigma_K), and the coefficient \tau may be chosen such that the probability of drawing data points from the low-probability component is maximized.

Figure 4: Example in one dimension. The blue curve represents the mixture f(x \mid \pi, \mu, \Sigma) and the red line the density of the low-probability component N(x \mid \mu_K, \Sigma_K). The black curve represents the weight function N(x \mid \mu_K, \tau \Sigma_K), and the green curve the density of the targeted sample. Ideally we want the common area under the green and red curves to be maximized.

Considering the overlap between the distribution of the targeted subsample and the low-probability component, we plot the common area for varying \tau, obtaining the graph shown in Figure 5.

Figure 5: Example of S(\tau) for several values of (\mu_K, \pi_K), using a numerical approximation of the integral to calculate the common area.

As seen from Figure 5, the value of \tau that maximizes the overlap between the low-probability component and the targeted subsample varies: the closer the remaining components are to the component of interest (and, similarly, the larger their variance), the lower the optimal \tau, and the same happens as the weight of the component of interest decreases. Combining these results with the fact that a large \tau improves the acceptance rate for (\mu, \Sigma) but reduces the acceptance rate for \pi, and taking into account the uncertainty in S = \hat{\Sigma}_K, it is apparent that the optimal coefficient \tau is not simply 1, and plays a significant role affecting many levels of the analysis.

Acknowledgements

Research was partially supported by grants to Duke University from the NSF (DMS ) and the National Institutes of Health (grant P50GM and contract HHSN C). Aspects of the research were also partially supported by the NSF grant DMS to the Statistical
and Applied Mathematical Sciences Institute. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.

References

S. Balakrishnan and D. Madigan. A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis, 1(2).

C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. Kepler. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry A, 73.

A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer.

H. Ishwaran and L. James. Approximate Dirichlet process computing in finite normal mixtures: Smoothing and prior information. Journal of Computational and Graphical Statistics, 11.

J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York.

H. F. Lopes, N. G. Polson, and M. Taddy. Particle learning for general mixtures. Submitted.

S. N. MacEachern. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7(2).

S. N. MacEachern, M. Clyde, and J. S. Liu. Sequential importance sampling for nonparametric Bayes models: The next generation. The Canadian Journal of Statistics, 27(2).

P. Muller, A. Erkanli, and M. West. Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83(1):67.

G. Ridgeway and D. Madigan. Bayesian analysis of massive datasets via particle filters. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 5-13, New York, NY, USA. ACM.

R. A. Seder, P. A. Darrah, and M. Roederer. T-cell quality in memory and protection: implications for vaccine design. Nature Reviews Immunology, 8(4).

M. West. Discovery sampling and selection models. In Decision Theory and Related Topics.

M. West. Inference in successive sampling discovery models. Journal of Econometrics, 75(1).

M. West and P. J. Harrison. Bayesian Forecasting and Dynamic Models. Springer-Verlag, New York.