Selection Sampling from Large Data sets for Targeted Inference in Mixture Modeling


Ioanna Manolopoulou, Cliburn Chan and Mike West

December 29, 2009

Abstract

One of the challenges of Markov chain Monte Carlo in large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provide little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full data is used to draw a further set of observations from a low-probability region of interest, and describe how inferences can be made efficiently by reducing the dimensionality of the problem. Finally, we extend our method to a Sequential Monte Carlo framework whereby the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. We implement our algorithm on a flow cytometry dataset, providing higher-resolution inferences for rare cell subtypes.

Ioanna Manolopoulou is Postdoctoral Fellow, Department of Statistical Science; Cliburn Chan is Professor, Department of Biostatistics and Bioinformatics; Mike West is Professor, Department of Statistical Science, Duke University, Durham NC.
1 Introduction

Following technological advances, in many biological fields a vast amount of data is available; take, for example, flow cytometry, where tens of thousands to millions of individual cells, each with multiple different fluorescent-tagged antibody labels, are assayed in a single blood or other fluid sample (see Chan et al., 2008). Although Markov chain Monte Carlo is a very powerful tool for drawing inferences, it requires calculating the likelihood of the full data at each iteration. This is a serious drawback in the case of big datasets, often rendering it computationally prohibitive. Several approaches have been developed in order to address this problem. In most cases, very large datasets are addressed by drawing inferences on computationally manageable subsamples which are drawn randomly from the full data. Ridgeway and Madigan (2002) proposed a two-step algorithm of drawing subsamples in a Sequential Monte Carlo sampler without a mutation step, which was then improved by Balakrishnan and Madigan (2006) by introducing a rejuvenation step based on a kernel smoothing approximation similar to Liu and West (2000). In this paper we are interested in drawing inferences about low-probability regions in sample space when large amounts of data from a mixture model are available, yielding few observations in the region of interest. Computational methods in mixture models have been studied extensively and provide a very flexible tool for modelling complex distributions; see, for example, MacEachern (1998), MacEachern et al. (1999) and Müller et al. (1996). The motivating application arises in flow cytometry, where a vast number of observations (corresponding to cells) is available, with several markers for each cell (see Chan et al., 2008). The data are assumed to follow a Gaussian mixture model, with individual components or groups of components representing cell types.
Specific interest lies in characterizing a given cell subtype, which may often be significantly rare. For example, polyfunctional lymphocyte subsets that are of interest in predicting vaccine efficacy (Seder et al., 2008) may have frequencies of 0.01% or less of the total peripheral blood cell population. As a result, random subsamples typically contain very few observations of the rare subtype. The key idea is to use an initial random subsample in order to construct a weight function directed around the region of interest, which is subsequently used to draw a targeted subsample. Using nonparametric Bayesian mixture models, we implement a two-step Markov chain Monte Carlo approach of first using the random subsample to draw inferences, and then combining it with the targeted
subsample. We extend the method to a Sequential Monte Carlo algorithm whereby the targeted subsample is augmented sequentially as more information becomes available, until no more informative data points appear to be present in the full data. The idea of selective sampling through a weight function has been used in the context of discovery models; see West (1994) and West (1996). We assume that the data follow the mixture distribution

f(x) = \sum_{j=1}^{J} \pi_j f_j(x).

Owing to the flow cytometry application, we assume data which follow a Gaussian mixture model as implemented by Chan et al. (2008) (see Appendix A), where cell subtypes correspond to groups of Gaussian components; our algorithms, however, may easily be adapted for non-Gaussian mixtures. The region about which we aim to draw inferences is determined by the scientific question at hand, and need not be a low-probability region. In this paper we focus on drawing inferences about the parameters φ_K = (µ_K, Σ_K) of a low-probability component K of a Gaussian mixture with Dirichlet process mixing weights characterized by θ = (µ, Σ, π, z, V, α), specified, e.g., as the component centered closest to a specific point.

2 Markov chain Monte Carlo approach

The objective is to identify and analyze subsamples of the data which contain information about the specific subset of the parameters of interest. The key idea is to obtain a rough estimate of the low-probability component K based on a random subsample, which is subsequently used to draw weighted subsamples of the data that are more likely to be relevant to our analysis, providing us with higher resolution about the structure of the distribution in the region of interest. The direct approach is to follow a two-step procedure of Markov chain Monte Carlo samplers. We use an initial, randomly drawn subsample from the data in order to obtain an estimate of the parameters, and use this estimate to draw a more informative subsample.
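To fix ideas, the mixture setting above can be simulated directly. The following is a minimal Python sketch (the authors' own implementation is in MATLAB and is not reproduced here); the component weights, means and covariances are purely illustrative, with a rare component playing the role of component K.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, weights, means, covs, rng):
    """Draw n points from a finite Gaussian mixture; returns data and labels z."""
    weights = np.asarray(weights, dtype=float)
    z = rng.choice(len(weights), size=n, p=weights / weights.sum())
    x = np.empty((n, len(means[0])))
    for k in range(len(weights)):
        idx = z == k
        # fill in the rows assigned to component k
        x[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.sum())
    return x, z

# Two abundant components and one rare one (component K = 2, weight 0.1%).
weights = [0.699, 0.300, 0.001]
means = [np.zeros(2), np.array([4.0, 0.0]), np.array([8.0, 8.0])]
covs = [np.eye(2), np.eye(2), 0.25 * np.eye(2)]
X, z = sample_mixture(50_000, weights, means, covs, rng)
```

With a weight of 0.1%, a random subsample of 5,000 points would be expected to contain only about 5 observations from the rare component, which is the difficulty the targeted subsample is designed to address.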
The two subsamples are then combined in a joint Markov chain Monte Carlo sampler to provide us with more accurate estimates of φ_K. Although interest specifically lies in estimating the parameters of component K, given by µ_K, Σ_K, inference on the full set of µ, Σ is required in order to carry out the analysis.
We denote the two subsamples (random and targeted) by X^R and X^T, of sizes n_R and n_T respectively. The first is drawn randomly from the data, whereas the second is drawn according to weights w_i, 1 ≤ i ≤ N. We aim to choose the weights so that the targeted subsample contains mostly observations from component K; thus we may choose w_i = w(x_i) = N(x_i | m, τS), where m, S are estimates of µ_K, Σ_K from the initial analysis (based on the random subsample), and τ is a tuning scalar. The constant τ determines how far from the initial estimates the targeted subsample may spread. In other words, we choose the second subsample of the data according to how well it fits the estimated distribution of component K, possibly allowing for a wider distribution through the constant τ. The likelihood of the data (X^R, X^T) in component k then takes the following form. For observations in the random subsample:

f(x_i^R \mid z_i = k, \mu, \Sigma) = N(x_i^R \mid \mu_k, \Sigma_k), \quad i = 1, \dots, n_R,

and

f(x_i^R \mid \mu, \Sigma) = \sum_{k=1}^{K} \pi_k N(x_i^R \mid \mu_k, \Sigma_k), \quad i = 1, \dots, n_R.

For observations in the targeted subsample:

f(x_i^T \mid X^R, z_i = k, \mu, \Sigma) \propto w(x_i^T) N(x_i^T \mid \mu_k, \Sigma_k) \propto N(x_i^T \mid \tilde\mu_k, \tilde\Sigma_k), \quad i = 1, \dots, n_T,

where

\tilde\Sigma_k = (\Sigma_k^{-1} + (\tau S)^{-1})^{-1} \quad \text{and} \quad \tilde\mu_k = \tilde\Sigma_k (\Sigma_k^{-1} \mu_k + (\tau S)^{-1} m),

and

f(x_i^T \mid X^R, \mu, \Sigma) = \sum_{k=1}^{K} \tilde\pi_k(\theta) N(x_i^T \mid \tilde\mu_k, \tilde\Sigma_k), \quad i = 1, \dots, n_T,
where

\tilde\pi_k(\theta) = \frac{\pi_k N(\mu_k \mid m, \tau S + \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} N(\mu_{k'} \mid m, \tau S + \Sigma_{k'})}.

Note here that we are using only the unnormalized weights w(x_i) ∝ N(x_i | m, τS) even though we are drawing without replacement, assuming that \sum_{i=1}^{N} N(x_i \mid m, \tau S) remains unchanged after drawing each of the targeted data points; in other words, that the unnormalized weights sum to infinity. This means that we assume a very large number of data points within the region of non-negligible support of the weight function w(x). The first Markov chain Monte Carlo sampler is a standard blocked Gibbs sampler (see Ishwaran and James, 2002) with target distribution p(µ, Σ, π, z, V, α | X^R). In order to carry out the second Markov chain Monte Carlo sampler, based on the random and targeted subsamples combined, the posterior distributions of the parameters z, π, µ, Σ, α have to be recalculated so that efficient proposals can be constructed. The posterior for z is multinomial, with probabilities

p(z_i = k \mid X^R, X^T, \mu, \Sigma) \propto \pi_k f(x_i \mid z_i = k, \mu_k, \Sigma_k)

for both subsamples. The posterior distribution of π | X^R, X^T, z, µ, Σ does not follow a closed-form distribution; see Equation (A1) in Appendix B. The contribution of the targeted subsample to the posterior becomes more significant as τS increases, allowing observations in the targeted subsample to belong to components other than K. The posterior for α only depends on the data through V and thus has the usual posterior distribution (see Ishwaran and James, 2002):

\alpha \sim \text{Gamma}\Big(\eta_1 + K - 1,\; \eta_2 - \sum_{k=1}^{K-1} \log(1 - V_k)\Big). \quad (1)
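The weighted draw and the tilted component probabilities can be sketched numerically. The following Python functions are an illustrative implementation (function names are ours, not the paper's), with the weight function w(x_i) ∝ N(x_i | m, τS) and the tilted weights proportional to π_k N(µ_k | m, τS + Σ_k) as in the text.

```python
import numpy as np

def gauss_pdf(x, m, cov):
    """Multivariate normal density N(x | m, cov), evaluated row-wise on x."""
    d = x - m
    prec = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->i', d, prec, d)
    norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def draw_targeted(x, m, S, tau, n_T, rng):
    """Draw n_T points without replacement with weights w_i ∝ N(x_i | m, tau*S)."""
    w = gauss_pdf(x, m, tau * S)
    p = w / w.sum()
    idx = rng.choice(len(x), size=n_T, replace=False, p=p)
    return x[idx]

def tilted_weights(pi, mu, Sigma, m, S, tau):
    """Tilted component probabilities ∝ pi_k N(mu_k | m, tau*S + Sigma_k)."""
    num = np.array([pi[k] * gauss_pdf(mu[k][None, :], m, tau * S + Sigma[k])[0]
                    for k in range(len(pi))])
    return num / num.sum()
```

In practice the weights for observations far from m underflow to zero, which is harmless here: the draw then concentrates on the region of non-negligible support of w(x), exactly as intended.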
The posterior for µ_k can be calculated exactly as µ_k | X, z, Σ_k ~ N(m_k^µ, S_k^µ), where S_k^µ, m_k^µ may be readily calculated through Equation (A2) in Appendix B. The posterior for Σ does not follow an Inverse Wishart and cannot be easily sampled from (see Equation (A3) in Appendix B). Due to the nonstandard posterior distributions and the dimensionality of the problem, approximating them in order to construct efficient proposals is crucial.

2.1 Markov chain Monte Carlo updates

After obtaining the targeted subsample, we construct a Markov chain Monte Carlo sampler with target distribution

p(\mu, \Sigma, \pi, z, V, \alpha \mid X^R, X^T) \propto p(X^T \mid z^T, X^R, \theta)\, p(z^T \mid X^R, \theta)\, p(X^R \mid z^R, \theta)\, p(z^R \mid \theta)\, p(\theta).

The chain is initialized by drawing µ, Σ, π, z, V, α from their priors, then iterates through the following steps.

1. Update z by generating from the posterior p(z | X^R, X^T, π, µ, Σ) ∝ π_z f(X^R, X^T | µ, Σ).

2. Update π through a Metropolis-Hastings step by generating from the posterior p(V | X^R), setting \pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), and accepting the proposed move with probability

\min\Big(1, \prod_{k=1}^{K} \big(\tilde\pi_k(\theta^*) / \tilde\pi_k(\theta)\big)^{n_k^T}\Big),

where θ* denotes the proposed values. If the targeted subsample is indeed drawn such that almost all of its points belong to component K, the acceptance probability will be close to 1.
3. Update α from its posterior given V, given in Equation (1).

4. Update µ through a Gibbs step using µ_k | X, z, Σ_k ~ N(m_k^µ, S_k^µ) above.

5. The posterior distribution of Σ does not take closed form. We construct a proposal distribution q(Σ_k | X^R, X^T, z, µ) for a Metropolis-Hastings step. Using that

f(x_i^T \mid X^R, z_i = k, \mu, \Sigma) = \tilde\pi_k(\theta) N(x_i^T \mid \tilde\mu_k, \tilde\Sigma_k), \quad \text{where} \quad \tilde\pi_k(\theta) = \frac{\pi_k N(\mu_k \mid m, \tau S + \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} N(\mu_{k'} \mid m, \tau S + \Sigma_{k'})},

we can use the inverse transformation to obtain \tilde{X}_i^T \mid X^R, z_i = k, \mu, \Sigma \sim N(\mu_k, \Sigma_k), where

\tilde{X}_i^T = \Sigma_k \big(\tilde\Sigma_k^{-1} X_i^T - (\tau S)^{-1} m\big).

In practice, of course, Σ_k is not known, and the transformation of X^T can only be approximated using an estimate of Σ_k, e.g. the value from the previous iteration. We then propose

q(\Sigma_k \mid X^R, X^T, z, \mu) = IW(W_k + S_0,\; n_k + s_0 + p - 1), \quad \text{where} \quad W_k = \sum_{z_i = k} (\tilde{X}_i - \bar{\tilde{X}}_k)(\tilde{X}_i - \bar{\tilde{X}}_k)^T,

with \tilde{X}^R = X^R (observations in the random subsample are left untransformed). In addition, a discount factor may be used in order to increase the variance of the proposal kernel. The Markov chain Monte Carlo sampler sweeps through the updates described above, yielding estimates for the posterior distribution of the parameters of interest. However, due to the high number of parameters to be estimated and the difficulty in defining efficient proposals, the acceptance rate quickly drops to zero for targeted subsamples of moderate size.

3 Focusing on the low-probability component

The dimensionality of the problem, combined with the difficulty of constructing efficient proposals, results in Markov chain Monte Carlo samplers which require very long running times in order to
eventually be sampling from the true posterior. At the same time, the approach described above does not exploit the results from the initial run based on the random sample, except for extracting the estimates of µ_K, Σ_K. We describe how the dimensionality of the problem can be greatly reduced using the posterior distribution estimates obtained from the initial Markov chain Monte Carlo simulation. Notice that the objective is to draw inferences about a region in sample space which has very low probability. Consequently, very few points in the initial random sample will belong to that region. On the other hand, the targeted sample will, generally, contain observations from the low-probability region. This implies that the posterior distribution of the parameters based on both the random and targeted samples (X^R, X^T),

p(\mu, \Sigma \mid X^R, X^T) = \sum_{z^R} p(\mu, \Sigma \mid X^R, X^T, z^R)\, p(z^R \mid X^R, X^T),

can be approximated as

p(\pi, \mu, \Sigma \mid X^R, X^T) \approx \sum_{z^R} \underbrace{p(\pi, \mu, \Sigma \mid X^R, X^T, z^R)}_{(a)}\; \underbrace{p(z^R \mid X^R)}_{(b)},

using that p(z^R | X^R, X^T) ≈ p(z^R | X^R). Here (a) requires integrating over a much smaller set of parameters z^T and can be calculated much more efficiently, and (b) is known from the first Markov chain Monte Carlo run. This decouples the z-dependence of the random and the targeted sample, greatly reducing the dimensionality of the second analysis. The second Markov chain Monte Carlo sampler is then adapted to a set of chains, one for each of a set of particles drawn from the posterior distribution estimate of the first chain. For particles l = 1 : L, draw a sample of (z, π, µ, Σ)^l | X^R from the posterior distribution estimates obtained in the first Markov chain Monte Carlo sampler, and carry out the second sampler for each particle only on µ_K, Σ_K | X^R, X^T, (z^R, π, φ_{-K})^l, combining samples at the end.
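The particle-wise second stage can be sketched as below. As a deliberate simplification (a hypothetical sketch, not the paper's sampler), only µ_K is updated per particle, via random-walk Metropolis under a plain Gaussian likelihood for the targeted points; the full algorithm would also update Σ_K and account for the weight-function tilt and the prior.

```python
import numpy as np

def log_lik(x, mu, cov_inv, logdet):
    """Gaussian log-likelihood of the rows of x under N(mu, cov)."""
    d = x - mu
    return -0.5 * (np.einsum('ij,jk,ik->i', d, cov_inv, d).sum()
                   + len(x) * (logdet + x.shape[1] * np.log(2 * np.pi)))

def focused_mh(x_T, particles_mu, Sigma_K, n_steps, step, rng):
    """For each first-stage particle of mu_K, run a short random-walk Metropolis
    chain targeting p(mu_K | X^T, Sigma_K); returns the updated particle set."""
    cov_inv = np.linalg.inv(Sigma_K)
    logdet = np.linalg.slogdet(Sigma_K)[1]
    out = []
    for mu in particles_mu:
        ll = log_lik(x_T, mu, cov_inv, logdet)
        for _ in range(n_steps):
            prop = mu + step * rng.standard_normal(mu.shape)
            ll_prop = log_lik(x_T, prop, cov_inv, logdet)
            if np.log(rng.uniform()) < ll_prop - ll:  # flat prior in this sketch
                mu, ll = prop, ll_prop
        out.append(mu)
    return np.array(out)
```

Because each particle carries its own configuration of the remaining parameters, the chains are independent and trivially parallelizable, which is the computational point of the focused approach.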
This approach greatly reduces both the complexity of the calculations per sweep and the total number of samples required in order to obtain a good approximation of the posterior distribution. However, because the posteriors µ_K, Σ_K | X^R and µ_K, Σ_K | X^R, X^T may differ greatly, the sampler still suffers from very low acceptance rates, and with a moderately sized targeted subsample it can fail to reach the region of high posterior probability in parameter space.
4 Sequential Monte Carlo approach

The focused approach drastically reduces the dimensionality of the algorithm, and as a result the computational complexity. However, Metropolis-Hastings updates still show low acceptance rates, because the two posteriors, given X^R in the one case and X^R, X^T in the other, are very different. In addition, the size of the targeted subsample is chosen manually rather than through an automated procedure. Both drawbacks may be addressed by drawing the targeted sample through a Sequential Monte Carlo simulation rather than using a two-step procedure. Sequential Monte Carlo methods provide simulation-based inferences from a sequence of probability distributions. A large number of random samples (particles) is used to approximate the sequence of distributions, so that asymptotically it converges to the true target distribution; see Doucet et al. (2001) and Lopes et al. Here Sequential Monte Carlo can be used instead of the two-step procedure described above (whereby an initial random sample X^R is drawn, subsequently giving rise to the targeted sample X^T). We use a sequential scheme such that the targeted sample is selected one (or more) data point at a time, at each draw updating the estimates of the parameters of component K for a set of particles. In other words, we use the fact that the likelihood of the data may be expressed as

p(X_{1:n} \mid \mu, \Sigma) = \prod_{i=1}^{n} p(x_i \mid X_{1:i-1}, \mu, \Sigma).

For each of a set of particles, draw a sample of (z, π, µ, Σ) | X^R from the posterior distribution estimates obtained in the Markov chain Monte Carlo sampler. Then repeatedly augment the targeted subsample and mutate the parameter estimates through the following steps. For j = 1 : J and for a fixed sequence of τ_{1:J}:

1. Draw u ~ U{1 : J} and set m^{j-1} = {µ_K^{j-1}}_u and S^{j-1} = {Σ_K^{j-1}}_u, where {φ_k^j}_u is the sample of the u-th particle at step j for component k.

2.
Draw another batch of targeted observations X^{T_j} without replacement according to weights w_i ∝ N(x_i | m^{j-1}, τ_{j-1} S^{j-1}).

3. Update the configuration indicators z using the posterior weights π_k N(x | µ_k, Σ_k).
4. Using a fixed number of Metropolis-Hastings steps following the iterates described in the Markov chain Monte Carlo approach above, update

\mu_k, \Sigma_k, \pi_k, \alpha \mid X^R, X^{T_{1:j}}, z.

The posterior distribution of µ_k now becomes µ_k | X^R, X^{T_{1:j}}, z, Σ_k ~ N(m_k^µ, S_k^µ), where

S_k^\mu = \Big(\Sigma_k^{-1}/t_0 + n_k^R \Sigma_k^{-1} + \sum_{i=1}^{j} n_k^{T_i} \big((\tau_i S^i)^{-1} \Sigma_k + I\big)^{-1} \Sigma_k^{-1}\Big)^{-1},

m_k^\mu = S_k^\mu \Big(n_k \Sigma_k^{-1} \bar{x}_k - \sum_{i=1}^{j} n_k^{T_i} \big((\tau_i S^i)^{-1} \Sigma_k + I\big)^{-1} (\tau_i S^i)^{-1} m^i + \Sigma_k^{-1}\mu_0/t_0\Big),

where n_k is the total number of data points in component k and n_k^{T_i} is the number of data points in that component coming from the i-th targeted batch. It can be shown that, asymptotically (as the number of particles tends to infinity), the approximation of the target distribution converges to the true density, with the error being of order 1/\sqrt{N}. The parameter τ is a tuning parameter which allows monitoring both the dispersal of the targeted sample and the validity of the assumption of infinite weights. Although in the example presented here (see Subsection 4.1) the parameter τ is held fixed at τ_i = 1 for all i, values greater or smaller than 1 may be more beneficial (see Appendix C). Owing to the way in which the parameters m, S of the weight function are fixed at each step of the resampling, weight functions located around different regions of sample space may be chosen. When the low-probability component follows a mixture distribution across different regions of sample space, this will be reflected in the estimates obtained from each particle, resulting in each particle corresponding to a different draw. Through our adaptive algorithm, the sample space is explored flexibly and posterior estimates of the parameters are updated incrementally as the targeted subsample is augmented, allowing more efficient inferences. This approach immediately poses the question of when to stop drawing observations for the targeted subsample. Ideally, we would like the targeted sample to contain all data points of component
K. In order to address this, we introduce a decision rule such that the targeted sample stops being augmented when no more data points in the remaining original data show a high probability of belonging to component K. A natural quantity to use is the Bayes factor for that component; see West and Harrison (1997). In other words, we introduce an extra decision step.

5a. If there are no unsampled observations with Bayes factor

BF_K(x_i) = \frac{\pi_K^*(x_i)/(1 - \pi_K^*(x_i))}{\pi_K/(1 - \pi_K)}

exceeding a given threshold, where \pi_K^*(x_i) \propto \pi_K N(x_i \mid \mu_K, \Sigma_K), stop.

The calculation of the Bayes factor is computationally demanding; as an alternative, the stopping rule may be expressed purely as a function of the weights. In other words,

5b. If there are fewer than N_threshold unsampled observations within a c_threshold contour of the weight function, stop.

The Sequential Monte Carlo approach provides an efficient method of drawing inferences about parameters relevant to a low-probability region of sample space, while at the same time allowing the algorithm to automatically monitor the number of observations in the region of interest.

4.1 Example: flow cytometry

The motivating example for this study is a problem arising in flow cytometry, where cellular subtypes may be associated with one (or more) components of a Gaussian mixture model (see Chan et al., 2008). Flow cytometers detect fluorescent reporter markers that typically correspond to specific cell-surface or intracellular proteins on individual cells, and can assay millions of such cells in a fluid stream in minutes. Datasets are typically very large, and as a result inference on the full data is computationally prohibitive. Interest lies in identifying and characterizing rare cell subtypes using a mixture model fitted on those markers. The ability to identify such rare cell subsets plays an important role in many medical contexts; for example, the detection of antigen-specific cells with MHC
class I or class II markers, identification of polyfunctional T lymphocytes that correlate with vaccine efficacy or host resistance to pathogens, or in resolving variants of already low-frequency cell types, e.g. subtypes of conventional dendritic cells. We use a dataset of 50,000 data points from human peripheral blood cells, with 6 marker measurements each: Forward Scatter, Side Scatter, CD4, IFNg+IL2, CD8, CD3.[1] The objective is to provide higher resolution on the structure and patterns of covariation of cells of a specific cell subtype, specifically CD3+CD4+ and CD3+CD8+ cells secreting IL2/IFNg when challenged with a specific viral antigen. The data show a clear component structure for some of the markers (see Figure 1), whereas in others the rare cell subtypes of interest are not separated. We specify the statistical question as drawing inferences about the component centered closest to the markers corresponding to a specific cell of known rare subtype. To illustrate our methods, and for ease of exposition, we adapt our algorithm by targeting inferences towards the component with the highest CD4 centre. An initial sample of size 5,000 is drawn, providing us with initial estimates m, S for the mean and covariance of the component closest to the high-CD4+ region. Due to the strong covariation between the markers, several components are needed (see Figure 3) in order to capture the inhomogeneity of the data. Using initial weights w(x) ∝ N(x | m, S), we apply our Sequential Monte Carlo algorithm to obtain a complete targeted subsample in terms of the stopping rule, as well as posterior samples for all our parameters.
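The sequential augmentation with stopping rule 5b can be sketched as follows. This is an illustrative Python version (function names and the density-based contour criterion are our own parametrization of "a c_threshold contour"), and the parameter-mutation step of the full algorithm is omitted for brevity.

```python
import numpy as np

def gauss_logpdf(x, m, cov):
    """Row-wise multivariate normal log-density log N(x | m, cov)."""
    d = x - m
    prec = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->i', d, prec, d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (quad + logdet + x.shape[1] * np.log(2 * np.pi))

def augment_until_stopped(x, m, S, tau, batch, n_threshold, c_threshold, rng):
    """Sequentially draw targeted batches with weights ∝ N(x_i | m, tau*S); stop
    when fewer than n_threshold unsampled points remain inside the c_threshold
    contour (rule 5b). The mutation of (m, S) between batches is omitted here."""
    unsampled = np.ones(len(x), dtype=bool)
    targeted = []
    while True:
        logw = gauss_logpdf(x[unsampled], m, tau * S)
        # "inside the contour": density above c_threshold times the peak density
        peak = gauss_logpdf(m[None, :], m, tau * S)[0]
        inside = logw > peak + np.log(c_threshold)
        if inside.sum() < n_threshold:
            break
        p = np.exp(logw - logw.max())
        p /= p.sum()
        take = rng.choice(np.flatnonzero(unsampled),
                          size=min(batch, inside.sum()), replace=False, p=p)
        targeted.append(x[take])
        unsampled[take] = False
    return np.vstack(targeted) if targeted else np.empty((0, x.shape[1]))
```

Each pass removes the drawn points from the pool, so the loop terminates once the region of non-negligible weight has been exhausted, which is exactly the behaviour the stopping rule is designed to detect.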
Looking at the posterior distribution of the total number of components, based first on the initial MCMC sampler given the random subsample, and subsequently on the SMC sampler given both the random and targeted subsamples, we observe that the targeted approach has indeed provided a better fit for the structure of the data, reflected through the increased number of components (see Figure 2). More specifically, observing samples from the mixture model in the CD4 and IFNg markers before and after the targeted subsample (see Figure 3), we see that the targeted approach has led to the emergence of more Gaussian components around the region of the rare cell subtypes, providing higher resolution about the structure and covariation of their markers.

[1] Data from an NIAID/BD IntraCellular Staining Quality Assurance Panel (ICS QAP), kindly provided by the Duke Center for AIDS Research (CFAR) Immune Monitoring Core.
Figure 1: Pair plots for the last 4 markers: CD4, IFNg, CD8 and CD3. The complete data set is shown in yellow. We aim to use the random subsample (shown in red) in order to obtain samples from the initial posterior p(µ, Σ, π, α | X^R) and draw the targeted subsample (shown in blue) using estimates of the distribution of the data (superimposed as a contour plot).

More importantly, our targeted approach has revealed components in the low-probability subregion which emerge due to the covariation with the remaining markers. These findings agree with the biologists' expectation that cell subtypes may have a non-Gaussian structure.
Figure 2: Posterior distributions for the number of components in the Gaussian mixture model, given only the random subsample, p(K | X^R), shown in black, and given both the random and targeted subsamples, p(K | X^R, X^T), shown in white.

5 Additional comments

One of the key aspects of this work consists in defining the low-probability region of interest and specifying the weight function used to draw the targeted sample. Naturally, the low-probability region in sample space is strongly driven by the scientific question at hand. Based on that, and taking into account algorithmic tractability and efficiency, different weight functions may be used. In this work we presented methods relating to inferences about a specific component, defined in terms of an identifying criterion. In the flow cytometry example used in this paper, this was chosen as the component with mean closest to a specific point. Although the weight function used had a Gaussian shape, the analysis revealed a non-Gaussian structure in the region of interest; using mixtures of components as a weight function would be a straightforward extension of our methods. In fact, a hierarchical model using mixtures of mixtures may provide a better fit to the non-Gaussian, inhomogeneous structure of the flow cytometry data; our targeted subsampling approach can be implemented with such models at little additional computational cost.
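The extension to mixture weight functions mentioned above amounts to replacing the single Gaussian weight with a sum of Gaussians; a minimal sketch (the centres and covariances below are illustrative, not estimates from the paper's data):

```python
import numpy as np

def mixture_weight(x, means, covs):
    """Weight function w(x) = sum_j N(x | m_j, S_j), spreading the targeted
    draw over several regions of sample space at once."""
    w = np.zeros(len(x))
    for m, S in zip(means, covs):
        d = x - m
        prec = np.linalg.inv(S)
        quad = np.einsum('ij,jk,ik->i', d, prec, d)
        norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(S))
        w += np.exp(-0.5 * quad) / norm
    return w
```

Since only the (unnormalized) weights enter the targeted draw, this drops into the earlier sampling step unchanged, which is why the extension carries little additional computational cost.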
Figure 3: Sample realizations of the mixture model fitted to the flow cytometry data using the Sequential Monte Carlo targeted resampling algorithm, (a) based on the random subsample and (b) based on both the random subsample and the targeted subsample. Crosses are shown at the mean of each component, with 50% contours drawn.

A natural extension to the weight functions used in this work stems from the fact that, in the original flow cytometry data, the identifying criterion for the component of interest is not defined on a fixed number of dimensions. Instead, it is defined in terms of the set of markers which are significant in identifying the component in the region of low probability in sample space, and this set is itself unknown. In other words, the Gaussian mixture may be defined only on an (unknown) subset of the p markers, such that we draw inferences about the parameters of the mixture p(θ_q | X), x_i ∈ R^q, for variable dimension q, q ≤ p. The targeted learning about θ_q can be incorporated in the analysis such that, within the sequential design, the weight function w(x) ∝ N(x | m, S) is updated at each round of resampling, both in terms of the mean m and covariance S of the Gaussian distribution, and in terms of the markers over which the weight function is defined. In the case of flow cytometry data, this can be viewed as soft gating of cells into cell subtypes, based both on the values of the individual markers and on the set of significant markers. One of the main challenges in drawing inferences about targeted subsamples is constructing efficient proposals for the parameters of interest, as the convergence of the algorithms is influenced by several factors. The size of the targeted subsample in relation to the random subsample plays
a significant role. This becomes especially important when the assumption of an infinite number of observations within the region of interest is breached, as this leads to a likelihood for the targeted subsample which deviates severely from the true likelihood because of sampling without replacement. The multiplicative constant τ also plays a significant role in constructing a weight function which is wide enough not to violate the infinite-weights assumption, while at the same time targeting the region of interest. Finally, our algorithms were implemented in MATLAB and the code is freely available upon request.
A Gaussian Mixture Model

We are given data X comprising a total of N data points from a p-dimensional Gaussian mixture distribution

f(x_i \mid \mu, \Sigma) = \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k),

using a standard truncated Dirichlet process mixing distribution (see Ishwaran and James, 2002). Here N(x | µ, Σ) represents the probability density function of a normal distribution with mean µ and covariance matrix Σ, and the parameters π_k represent the mixing weights. Let θ = {π_{1:K}, φ_{1:K}}, φ_j = {µ_j, Σ_j}. The mixture model can be realized through the configuration indicators z_i for each observation x_i, so that we obtain the standard hierarchical model

(x_i \mid z_i = k, \phi_k) \sim N(x_i \mid \mu_k, \Sigma_k), \quad (\phi_k \mid G) \sim G, \quad (G \mid \alpha, G_0) \sim DP(\alpha, G_0),

where G(·) is an uncertain distribution function, G_0(·) is the prior mean of G(·) and α > 0 is the total mass, or precision, of the DP. From the Pólya urn scheme,

\theta_i \mid \theta_1, \dots, \theta_{i-1} \sim \frac{\alpha}{i - 1 + \alpha}\, G_0(\cdot) + \frac{1}{i - 1 + \alpha} \sum_{j=1}^{i-1} \delta_{\theta_j}(\cdot).

For conditional conjugacy, it is convenient to use normal-inverse Wishart priors, i.e.,

G_0(\mu, \Sigma) = N(\mu \mid \mu_0, t_0 \Sigma)\, IW(\Sigma \mid s_0, S_0).

Finally, we assume a Gamma prior for the Dirichlet precision parameter, α ~ Gamma(η_1, η_2), and the mixing probabilities are such that \pi_k = V_k \prod_{i=1}^{k-1} (1 - V_i), where V_i ~ Beta(1, α).

B Posterior Distributions

Given both the random and the targeted subsample, the posterior distributions of the parameters take the following form.
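The truncated stick-breaking construction of the mixing weights can be sketched in a few lines of Python (an illustration, not the paper's MATLAB code):

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking: V_k ~ Beta(1, alpha), pi_k = V_k * prod_{i<k}(1 - V_i),
    with V_K = 1 so that the K weights sum to one."""
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0  # truncation: the last break takes the remaining stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining
```

Smaller values of the precision α concentrate the weights on the first few components, while larger values spread mass across many components, which is what makes the Gamma prior on α a prior on the effective number of components.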
The posterior for z is multinomial with probabilities

p(z_i = k \mid X^R, X^T, \mu, \Sigma) \propto \pi_k f(x_i \mid z_i = k, \mu_k, \Sigma_k)

for both subsamples. The π_k's can be realized through a set of stick-breaking weights V (see Ishwaran and James, 2002), such that, given the random subsample,

V_k \mid X^R, z \sim \text{Beta}(\gamma_1, \gamma_2), \quad \gamma_1 = 1 + n_k, \quad \gamma_2 = \alpha + \sum_{l=k+1}^{K} n_l,

with V_K = 1. The posterior distribution of π given both the random and targeted subsamples is given by

p(\pi \mid X^R, X^T, z, \mu, \Sigma) \propto p(\pi \mid X^R, z^R, \mu, \Sigma) \prod_{k=1}^{K} \tilde\pi_k^{n_k^T}, \quad \tilde\pi_k = \frac{\pi_k N\big(\mu_k \mid m, ((\tau S)^{-1} + \Sigma_k^{-1})^{-1}\big)}{\sum_{j=1}^{K} \pi_j N\big(\mu_j \mid m, ((\tau S)^{-1} + \Sigma_j^{-1})^{-1}\big)}. \quad (A1)

The contribution of the targeted subsample to the posterior distribution for π provides little additional information about the distribution of π when τS is small. The posterior for α depends on the data only through V and thus has the usual posterior distribution

\alpha \sim \text{Gamma}\Big(\eta_1 + K - 1,\; \eta_2 - \sum_{k=1}^{K-1} \log(1 - V_k)\Big).

The posterior for µ_k can be calculated exactly as µ_k | X, z, Σ_k ~ N(m_k^µ, S_k^µ),
where

S_k^\mu = \Sigma_k \Big( (1/t_0 + n_k^R)\, I + n_k^T \big((\tau S)^{-1} \Sigma_k + I\big)^{-1} \Big)^{-1},

m_k^\mu = S_k^\mu \Big( n_k \Sigma_k^{-1} \bar{x}_k - n_k^T \big((\tau S)^{-1} \Sigma_k + I\big)^{-1} (\tau S)^{-1} m + \Sigma_k^{-1}\mu_0/t_0 \Big), \quad (A2)

where n_k is the total number of data points in component k and n_k^T is the number of data points in that component coming from the targeted subsample. Notice that the contribution of the targeted subsample to the posterior precision of µ_k is n_k^T ((τS)^{-1}Σ_k + I)^{-1}, and since S is an estimate of Σ_k, this quantity is of the order n_k^T τ/(τ + 1), implying that the narrower the weight function, the less information about µ_k is available, which is intuitive. The posterior for Σ does not follow an Inverse Wishart distribution, and has the form

p(\Sigma_k \mid X, z, \mu_k) \propto |\Sigma_k|^{-s_0}\, |\Sigma_k|^{-n_k^R/2}\, \big|(\tau S)^{-1} + \Sigma_k^{-1}\big|^{n_k^T/2} \exp\Big\{ -\tfrac{1}{2}\mathrm{tr}(S_0 \Sigma_k^{-1}) - \tfrac{1}{2}\sum_{i=1}^{n_k} x_i^T \Sigma_k^{-1} x_i + n_k^R\, \mu_k^T \Sigma_k^{-1} \bar{x}_k^R + n_k^T\, \mu_k^T \Sigma_k^{-1} \bar{x}_k^T - \tfrac{n_k^R}{2}\, \mu_k^T \Sigma_k^{-1} \mu_k - \tfrac{n_k^T}{2}\, \mu_k^T \big((\tau S)^{-1}\Sigma_k + I\big)^{-1} \Sigma_k^{-1} \mu_k - n_k^T\, \mu_k^T \big((\tau S)^{-1}\Sigma_k + I\big)^{-1} (\tau S)^{-1} m - \tfrac{n_k^T}{2}\, m^T \big(\Sigma_k^{-1}\tau S + I\big)^{-1} (\tau S)^{-1} m \Big\}. \quad (A3)

C Weight functions

In both the Markov chain Monte Carlo and Sequential Monte Carlo approaches described above, the targeted sample was weighted proportionally to N(x_i | m, τS), where m and S are estimates of the mean and covariance of the low-probability component K. The multiplicative constant τ works as a tuning parameter. A larger value allows for wider dispersal of the targeted subsample, accounting for uncertainty in the initial estimate of φ_K. As τ decreases, the weights w_i in the targeted sample become heavily concentrated around a small number of data points. As a result, the assumption of an infinite number of points with non-negligible weight becomes invalid. If our initial estimate of µ_K, Σ_K is poor, a small τ will restrict the targeted sample to a region away from the full low-probability region of interest. Within the context of the Metropolis-Hastings updates,
Within the context of the Metropolis-Hastings updates, as \tau increases, the acceptance rate for (\mu, \Sigma) increases, since the targeted sample looks more like the random sample; the posterior distribution of \phi_K is then not pulled too far from the proposed values. At the same time, as \tau increases, the acceptance rate for \pi decreases, because the information about \pi carried by the targeted sample becomes significant, and the proposed values (which are based only on the random subsample) may be poor.

Consider the one-dimensional case where w(x_i) \propto N(x_i \mid m, \tau S) with p = 1, and assume that \mu, \Sigma, \pi are all known and that there is an infinite number of data points. The weight function becomes w(x_i) \propto N(x_i \mid \mu_K, \tau \Sigma_K), and the coefficient \tau may be chosen such that the probability of drawing data points from the low-probability component is maximized.

Figure 4: Example in one dimension. The blue curve represents the mixture f(x \mid \pi, \mu, \Sigma) and the red line the density of the low-probability component N(x \mid \mu_K, \Sigma_K). The black curve represents the weight function N(x \mid \mu_K, \tau \Sigma_K), and the green curve the density of the targeted sample. Ideally we want the common area under the green and red curves to be maximized.

Considering the overlap between the distribution of the targeted subsample and the low-probability component, we plot the common area for varying \tau, obtaining the graph shown in Figure 5.

Figure 5: Example of S(\tau) for several values of (\mu_K, \pi_K), using a numerical approximation of the integral to calculate the common area.

As seen from Figure 5, the value of \tau that maximizes the overlap between the low-probability component and the targeted subsample varies: the closer the remaining components are to the component of interest (and, similarly, the larger their variance), the lower the optimal \tau, and the same happens as the weight of the component of interest decreases. Combining these results with the fact that a large \tau improves the acceptance rate for (\mu, \Sigma) but reduces the acceptance rate for \pi, and taking into account the uncertainty in S = \hat{\Sigma}_K, it is apparent that the optimal coefficient \tau is not simply 1, and plays a significant role affecting many levels of the analysis.

Acknowledgements

Research was partially supported by grants to Duke University from the NSF (DMS ) and the National Institutes of Health (grant P50GM and contract HHSN C). Aspects of the research were also partially supported by the NSF grant DMS to the Statistical
and Applied Mathematical Sciences Institute. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.

References

S. Balakrishnan and D. Madigan. A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis, 1(2).

C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. Kepler. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry A, 73.

A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer.

H. Ishwaran and L. James. Approximate Dirichlet process computing in finite normal mixtures: Smoothing and prior information. Journal of Computational and Graphical Statistics, 11.

J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York.

H. F. Lopes, N. G. Polson, and M. Taddy. Particle learning for general mixtures. Submitted.

S. N. MacEachern. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7(2).

S. N. MacEachern, M. Clyde, and J. S. Liu. Sequential importance sampling for nonparametric Bayes models: The next generation. The Canadian Journal of Statistics, 27(2).

P. Muller, A. Erkanli, and M. West. Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83(1):67.

G. Ridgeway and D. Madigan. Bayesian analysis of massive datasets via particle filters. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 5-13, New York, NY, USA. ACM.

R. A. Seder, P. A. Darrah, and M. Roederer. T-cell quality in memory and protection: implications for vaccine design. Nature Reviews Immunology, 8(4).

M. West. Discovery sampling and selection models. In Decision Theory and Related Topics.

M. West. Inference in successive sampling discovery models. Journal of Econometrics, 75(1).

M. West and P. J. Harrison. Bayesian Forecasting and Dynamic Models. Springer-Verlag, New York.