# INFERENCE FOR NORMAL MIXTURES IN MEAN AND VARIANCE

## Transcription

Statistica Sinica 18 (2008)

Jiahua Chen$^1$, Xianming Tan$^2$ and Runchu Zhang$^2$

$^1$University of British Columbia and $^2$LPMC Nankai University

Abstract: A finite mixture of normal distributions, in both mean and variance parameters, is a typical finite mixture in the location and scale families. Because the likelihood function is unbounded for any sample size, the ordinary maximum likelihood estimator is not consistent. Applying a penalty to the likelihood function to control the estimated component variances is thought to restore the optimal properties of the likelihood approach. Yet this proposal lacks practical guidelines, has not been indisputably justified, and has not been investigated in the most general setting. In this paper, we present a new and solid proof of consistency when the putative number of components is equal to, and when it is larger than, the true number of components. We also provide conditions on the required size of the penalty and study the invariance properties. The finite sample properties of the new estimator are also demonstrated through simulations and an example from genetics.

Key words and phrases: Bernstein inequality, invariant estimation, mixture of normal distributions, penalized maximum likelihood, strong consistency.

### 1. Introduction

Finite mixture models have wide applications in scientific disciplines, especially in genetics (Schork, Allison and Thiel (1996)). In particular, the normal mixture in both mean and variance was first applied to crab data in Pearson (1894), and is the most popular model for analysis of quantitative trait loci; see Roeder (1994), Chen and Chen (2003), Chen and Kalbfleisch (2005), and Tadesse, Sha and Vannucci (2005). In general, let $f(x;\lambda)$ be a parametric density function with respect to some $\sigma$-finite measure and parameter space $\Lambda$, usually a subset of some Euclidean space.
The density function of a finite mixture model is given by

$$f(x;G) = \sum_{j=1}^{p} \pi_j f(x;\lambda_j),$$

where $p$ is the number of components or the order of the model, $\lambda_j \in \Lambda$ is the parameter of the $j$th component density, $\pi_j$ is the proportion of the $j$th component density, and $G$ is the mixing distribution, which can be written as $G(\lambda) = \sum_{j=1}^{p} \pi_j I(\lambda_j \le \lambda)$ with $I(\cdot)$ being the indicator function.

In this paper, we focus on inference problems related to the univariate normal mixture distribution with parameter $\lambda$ representing the mean and variance $(\mu,\sigma^2)$. Let $\phi(x) = (1/\sqrt{2\pi})\exp\{-x^2/2\}$. In normal mixture models, the component density is given by $f(x;\mu,\sigma) = \sigma^{-1}\phi(\sigma^{-1}(x-\mu))$. The parameter space of $G$ can be written as

$$\Gamma = \Big\{ G = (\pi_1,\ldots,\pi_p,\mu_1,\ldots,\mu_p,\sigma_1,\ldots,\sigma_p) : \sum_{j=1}^{p}\pi_j = 1,\ \pi_j \ge 0,\ \sigma_j \ge 0 \text{ for } j = 1,\ldots,p \Big\}.$$

For convenience, we use $G$ to represent both the mixing distribution and its relevant parameters. We understand that permuting the order of the components does not change the model. Hence, without loss of generality, we assume $\sigma_1 \le \sigma_2 \le \cdots \le \sigma_p$.

Let $X_1,\ldots,X_n$ be a random sample from a finite normal mixture distribution $f(x;G)$. A fundamental statistical problem is to estimate the mixing distribution $G$. Pearson (1894) proposed the method of moments for estimating the parameters in the univariate normal mixture. Many other approaches have been proposed, such as those discussed in McLachlan and Basford (1987) and McLachlan and Peel (2000). The maximum likelihood estimator (MLE), known for its asymptotic efficiency for regular statistical models, is one of the most commonly used approaches (Lindsay (1995)). However, in the case of finite normal mixture distributions in both mean and variance, the MLE is not well defined. Note that the log-likelihood function is

$$l_n(G) = \sum_{i=1}^{n}\log f(X_i;G) = \sum_{i=1}^{n}\log\Big\{\sum_{j=1}^{p}\frac{\pi_j}{\sigma_j}\phi\Big(\frac{X_i-\mu_j}{\sigma_j}\Big)\Big\}.$$

By letting $\mu_1 = X_1$ and $\sigma_1 \to 0$ with the other parameters fixed, we have $l_n(G) \to \infty$. That is, the ordinary maximum likelihood estimator of $G$ is not well defined (Day (1969) and Kiefer and Wolfowitz (1956)).

To avoid this difficulty, researchers often turn to estimators on constrained parameter spaces. For example, Redner (1981) proved that the maximum likelihood estimator of $G$ exists and is globally consistent in every compact subparameter space containing the true parameter $G_0$. When $p$ is known, Hathaway (1985) proposed estimating $G$ by maximizing the likelihood function within a restricted parameter space.
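The unboundedness described above is easy to see numerically. The following is a minimal sketch, not part of the original paper: the data values, the mixing proportions, and the second component's parameters are arbitrary assumptions, and `mixture_loglik` is a hypothetical helper name. It fixes $\mu_1 = X_1$ and shrinks $\sigma_1$.

```python
import math

def mixture_loglik(xs, pis, mus, sigmas):
    """Log-likelihood l_n(G) of a univariate normal mixture."""
    total = 0.0
    for x in xs:
        dens = sum(p / s * math.exp(-0.5 * ((x - m) / s) ** 2) / math.sqrt(2.0 * math.pi)
                   for p, m, s in zip(pis, mus, sigmas))
        total += math.log(dens)
    return total

# Arbitrary illustrative data (an assumption, not from the paper).
xs = [-1.2, -0.4, 0.3, 0.9, 1.7, 2.5, 3.1, 4.0]

# Put the first component's mean exactly on X_1 and let sigma_1 shrink.
shrinking = [1e-2, 1e-4, 1e-8, 1e-16]
logliks = [mixture_loglik(xs, [0.5, 0.5], [xs[0], 1.0], [s1, 2.0]) for s1 in shrinking]
for s1, ll in zip(shrinking, logliks):
    print(f"sigma_1 = {s1:g}: l_n(G) = {ll:.3f}")
```

Each further reduction of $\sigma_1$ adds roughly $-\log\sigma_1$ to the log-likelihood through the single observation $X_1$, while the remaining observations are still explained by the second component.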
Despite the elegant results of Redner (1981) and Hathaway (1985), these methods suffer, at least theoretically, from the risk that the true mixing distribution $G_0$ may not satisfy the constraint imposed.

We advocate the approach of adding a penalty term to the ordinary log-likelihood function. We define the penalized log-likelihood as

$$pl_n(G) = l_n(G) + p_n(G), \qquad (1.1)$$

so that $p_n(G) \to -\infty$ as $\min\{\sigma_j : j = 1,\ldots,p\} \to 0$. We then estimate $G$ with the penalized maximum likelihood estimator (PMLE) $\hat G_n = \arg\max_G pl_n(G)$. The penalized-likelihood-based method is a promising approach for countering the unboundedness of $l_n(G)$ while keeping the parameter space $\Gamma$ unaltered. However, to make the PMLE work, one has to consider what penalty functions $p_n(G)$ are suitable. This task proves challenging. Ridolfi and Idier (1999, 2000) proposed a class of penalty functions based on a Bayesian conjugate prior distribution, but the asymptotic properties of the corresponding PMLE were not discussed. Under some conditions on $p_n(G)$ and with $p$ assumed known, Ciuperca, Ridolfi and Idier (2003) provided an insightful proof of strong consistency of the PMLE of $G$ under the normal mixture model. Their proof was for the case where $p = p_0$ is known, and it contains a few loose steps that do not seem to have quick fixes; see Tan (2005).

We use a novel technique to establish the strong consistency of the PMLE for a class of penalty functions, whether or not the true value of $p$ is known. In addition, the proper order of the penalty is established.

The paper is organized as follows. We first introduce two important technical lemmas in Section 2, and then present a detailed proof of the strong consistency of the PMLE in Section 3. In Section 4, we present some simulation results and a data example. We summarize the paper in Section 5.

### 2. Technical Lemmas

To make the penalized likelihood approach work, we use a penalty to counter the effect of observations close to the location parameters. For this purpose, we assess the number of observations falling in a small neighborhood of the location parameters in $G$. Let

$$\Omega_n(\sigma) = \sup_{\mu}\sum_{i=1}^{n} I(0 < X_i - \mu < -\sigma\log\sigma)$$

be the number of observations on the positive side of a small neighborhood of $\mu$. We are interested in $\Omega_n(\sigma)$ only when $\sigma$ is small.
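For a concrete sample, the quantity $\Omega_n(\sigma)$ is the largest number of observations that fit inside an open interval of length $-\sigma\log\sigma$. A minimal sketch of one way to compute it, assuming $0 < \sigma < 1$ (the function name `omega_n` and the toy data are illustrative assumptions, not from the paper):

```python
import math

def omega_n(xs, sigma):
    """sup over mu of #{i : 0 < x_i - mu < -sigma * log(sigma)}, for 0 < sigma < 1."""
    width = -sigma * math.log(sigma)
    pts = sorted(xs)
    best, left = 0, 0
    for right in range(len(pts)):
        # A group of sorted points fits in an open interval of length `width`
        # exactly when its spread (max minus min) is strictly below `width`.
        while pts[right] - pts[left] >= width:
            left += 1
        best = max(best, right - left + 1)
    return best

xs = [0.0, 0.1, 0.15, 1.0, 1.05]
print(omega_n(xs, 0.1), omega_n(xs, 0.05), omega_n(xs, 0.001))
```

The sliding window visits each sorted observation once as a right end point, so the cost is dominated by the sort.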
The number of observations on the negative side of $\mu$ can be assessed in the same way. Let $F_n(x) = n^{-1}\sum_{i=1}^{n} I(X_i \le x)$ be the empirical distribution function. We have

$$\Omega_n(\sigma) = n\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)].$$

Let $F = E(F_n)$ be the true cumulative distribution function. We take $M = \max\{\sup_x f(x;G_0), 8\}$ and $\delta_n(\sigma) = -M\sigma\log(\sigma) + n^{-1}$, where $G_0$ is the true mixing distribution. The following lemma uses Bahadur's representation to give an order assessment of $n^{-1}\Omega_n(\sigma)$. With a slight abuse of the probability concept, when an inequality involving random quantities holds as $n \to \infty$ except for a zero probability event, we claim that the inequality is true almost surely. Further, if there is no risk of confusion, we omit the phrase almost surely.

Lemma 1. Under the finite normal mixture model assumption, as $n \to \infty$ and almost surely, we have:

1. for each given $\sigma$ between $\exp(-2)$ and $8/(nM)$, $\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \le 2\delta_n(\sigma)$;
2. uniformly for $\sigma$ between $0$ and $8/(nM)$, $\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \le 2(\log n)^2/n$.

Proof. 1. Let $\eta_0,\eta_1,\ldots,\eta_n$ be such that $\eta_0 = -\infty$, $F(\eta_i) = i/n$ for $i = 1,\ldots,n-1$, and $\eta_n = \infty$. We have

$$\begin{aligned}
\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)]
&\le \max_j[F_n(\eta_j - \sigma\log\sigma) - F_n(\eta_{j-1})] \\
&\le \max_j\big[\{F_n(\eta_j - \sigma\log\sigma) - F_n(\eta_{j-1})\} - \{F(\eta_j - \sigma\log\sigma) - F(\eta_{j-1})\}\big] \\
&\quad + \max_j[F(\eta_j - \sigma\log\sigma) - F(\eta_{j-1})].
\end{aligned}$$

By the Mean Value Theorem and for some $\eta_j \le \xi_j \le \eta_j - \sigma\log\sigma$, we have

$$F(\eta_j - \sigma\log\sigma) - F(\eta_{j-1}) = F(\eta_j - \sigma\log\sigma) - F(\eta_j) + n^{-1} = f(\xi_j;G_0)|\sigma\log\sigma| + n^{-1} \le M|\sigma\log\sigma| + n^{-1} = \delta_n(\sigma).$$

In summary, we have $\max_j[F(\eta_j - \sigma\log\sigma) - F(\eta_{j-1})] \le \delta_n(\sigma)$. Further, for $j = 1,\ldots,n$, define

$$\Delta_{nj} = \{F_n(\eta_j - \sigma\log\sigma) - F_n(\eta_{j-1})\} - \{F(\eta_j - \sigma\log\sigma) - F(\eta_{j-1})\}.$$

By the Bernstein Inequality (Serfling (1980)), for any $t > 0$ we have

$$P\{|\Delta_{nj}| \ge t\} \le 2\exp\Big\{-\frac{n^2t^2}{2n\delta_n(\sigma) + \frac{2}{3}nt}\Big\}. \qquad (2.1)$$

Since $|\sigma\log\sigma|$ is monotonic in $\sigma$ for $\exp(-2) > \sigma > 8/(nM)$,

$$|\sigma\log\sigma| \ge \frac{8}{nM}\log\frac{nM}{8} \ge \frac{8\log n}{nM}.$$

By letting $t = \delta_n(\sigma)$ in (2.1), we obtain

$$P\{|\Delta_{nj}| \ge \delta_n(\sigma)\} \le 2\exp\Big\{-\frac{3}{8}n\delta_n(\sigma)\Big\} \le 2\exp\Big\{-\frac{3}{8}Mn|\sigma\log\sigma|\Big\} \le 2n^{-3}.$$
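The arithmetic behind the last display can be spelled out. This is a worked step rather than part of the original text; it assumes the Bernstein bound (2.1) has denominator $2n\delta_n(\sigma) + \tfrac{2}{3}nt$, which is the form consistent with the constant $3/8$ appearing in the proof.

```latex
\begin{align*}
\left.\frac{n^2 t^2}{2n\delta_n(\sigma) + \tfrac{2}{3}nt}\right|_{t=\delta_n(\sigma)}
  &= \frac{n^2\delta_n^2(\sigma)}{\tfrac{8}{3}\,n\,\delta_n(\sigma)}
   = \frac{3}{8}\,n\,\delta_n(\sigma), \\
\frac{3}{8}\,n\,\delta_n(\sigma)
  &\ge \frac{3}{8}\,Mn\,|\sigma\log\sigma|
   \ge \frac{3}{8}\,Mn\cdot\frac{8\log n}{nM}
   = 3\log n, \\
2\exp\Big\{-\frac{3}{8}\,n\,\delta_n(\sigma)\Big\}
  &\le 2\exp\{-3\log n\} = 2n^{-3}.
\end{align*}
```

The middle line uses $\delta_n(\sigma) = M|\sigma\log\sigma| + n^{-1} \ge M|\sigma\log\sigma|$ and the lower bound on $|\sigma\log\sigma|$; the step $\log(nM/8) \ge \log n$ is where $M \ge 8$ is needed.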

Thus for any $\sigma$ in this range,

$$P\{\max_j|\Delta_{nj}| \ge \delta_n(\sigma)\} \le \sum_{j=1}^{n}P\{|\Delta_{nj}| \ge \delta_n(\sigma)\} \le 2n^{-2}.$$

Linking this inequality back to $\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)]$, we get

$$P\Big\{\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \ge 2\delta_n(\sigma)\Big\} \le P\Big\{\max_j|\Delta_{nj}| \ge \delta_n(\sigma)\Big\} \le 2n^{-2}.$$

The conclusion then follows from the Borel-Cantelli Lemma.

2. When $0 < \sigma < 8/(nM)$, we choose $t = n^{-1}(\log n)^2$ in (2.1). For $n$ large enough, $2\delta_n(\sigma) < t/3$. Hence $P\{|\Delta_{nj}| \ge t\} \le 2\exp\{-nt\} \le n^{-3}$. The conclusion is then obvious.

The claims in Lemma 1 are made for each $\sigma$ in the range of consideration. The bounds can be violated by a zero-probability event for each $\sigma$, and the union of zero-probability events may have non-zero probability as there are uncountably many $\sigma$ in the range. Our next lemma strengthens the conclusion in Lemma 1.

Lemma 2. Except for a zero-probability event not depending on $\sigma$, and under the same normal mixture assumption, we have for all large enough $n$:

1. for $\sigma$ between $\exp(-2)$ and $8/(nM)$, $\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \le 4\delta_n(\sigma)$;
2. for $\sigma$ between $0$ and $8/(nM)$, $\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \le 2(\log n)^2/n$.

Proof. Let $\tilde\sigma_0 = 8/(nM)$, choose $\tilde\sigma_{j+1}$ by $|\tilde\sigma_{j+1}\log\tilde\sigma_{j+1}| = 2|\tilde\sigma_j\log\tilde\sigma_j|$ for $j = 0,1,2,\ldots$, and let $s(n)$ be the largest integer such that $\tilde\sigma_{s(n)} \le \exp(-2)$. Simple algebra shows that $s(n) \le 2\log n$. By Lemma 1, for $j = 1,\ldots,s(n)$, we have

$$P\Big\{\sup_{\mu}[F_n(\mu - \tilde\sigma_j\log\tilde\sigma_j) - F_n(\mu)] \ge 2\delta_n(\tilde\sigma_j)\Big\} \le 2n^{-2}.$$

Define $D_n = \bigcup_{j=1}^{s(n)}\{\sup_{\mu}[F_n(\mu - \tilde\sigma_j\log\tilde\sigma_j) - F_n(\mu)] \ge 2\delta_n(\tilde\sigma_j)\}$. It can be seen that

$$\sum_{n=1}^{\infty}P(D_n) \le \sum_{n=1}^{\infty}\sum_{j=1}^{s(n)}P\Big\{\sup_{\mu}[F_n(\mu - \tilde\sigma_j\log\tilde\sigma_j) - F_n(\mu)] \ge 2\delta_n(\tilde\sigma_j)\Big\} \le \sum_{n=1}^{\infty}4n^{-2}\log n < \infty.$$

By the Borel-Cantelli Lemma, $P(D_n \text{ i.o.}) = 0$. The event $D_n$ is defined for a countable number of $\sigma$ values, and our next step is to allow all $\sigma$. For each $\sigma$ in the range of consideration, there exists a $j$ such that $|\tilde\sigma_j\log\tilde\sigma_j| \le |\sigma\log\sigma| \le |\tilde\sigma_{j+1}\log\tilde\sigma_{j+1}|$. Hence, almost surely,

$$\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \le \sup_{\mu}[F_n(\mu - \tilde\sigma_{j+1}\log\tilde\sigma_{j+1}) - F_n(\mu)] \le 2\delta_n(\tilde\sigma_{j+1}) \le 4\delta_n(\sigma).$$

This proves the first conclusion of the lemma.

With the same $\tilde\sigma_0 = 8/(nM)$, we have

$$P\Big\{\sup_{\mu}[F_n(\mu - \tilde\sigma_0\log\tilde\sigma_0) - F_n(\mu)] \ge 2n^{-1}(\log n)^2\Big\} \le n^{-3}.$$

That is, almost surely, $\sup_{\mu}[F_n(\mu - \tilde\sigma_0\log\tilde\sigma_0) - F_n(\mu)] \le 2n^{-1}(\log n)^2$. For $0 < \sigma < 8/(nM)$, we always have

$$\sup_{\mu}[F_n(\mu - \sigma\log\sigma) - F_n(\mu)] \le \sup_{\mu}[F_n(\mu - \tilde\sigma_0\log\tilde\sigma_0) - F_n(\mu)],$$

and hence the second conclusion of the lemma follows.

In summary, we have shown that almost surely,

$$\sup_{\mu}\sum_{i=1}^{n}I(|X_i - \mu| < |\sigma\log\sigma|) \le 8n\delta_n(\sigma), \quad \text{for } \sigma \in \Big(\frac{8}{nM}, e^{-2}\Big], \qquad (2.2)$$

$$\sup_{\mu}\sum_{i=1}^{n}I(|X_i - \mu| < |\sigma\log\sigma|) \le 4(\log n)^2, \quad \text{for } \sigma \in \Big(0, \frac{8}{nM}\Big]. \qquad (2.3)$$

It is worth observing that the normality assumption does not play a crucial role in the proofs. Furthermore, the two factors 8 and 4 in (2.2) and (2.3), respectively, carry no specific meaning; they are chosen for simplicity and could be replaced by any positive numbers.

### 3. Strong Consistency of the PMLE

We now proceed to prove the consistency of the PMLE for a class of penalty functions. The penalty must be large enough to counter the effect of the observations in a small neighborhood of the location parameters, and small enough to retain the optimal properties of the likelihood method. In addition, we prefer penalty functions that enable efficient numerical computation.

#### 3.1. Conditions on penalty functions

We require the penalty functions to satisfy the following conditions.

C1. $p_n(G) = \sum_{j=1}^{p}\tilde p_n(\sigma_j)$.

C2. $\sup_{\sigma>0}\max\{0, \tilde p_n(\sigma)\} = o(n)$ and $\tilde p_n(\sigma) = o(n)$ at any fixed $\sigma > 0$.

C3. For any $\sigma \in (0, 8/(nM)]$, we have $\tilde p_n(\sigma) \le 4(\log n)^2\log\sigma$ for large enough $n$.
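Condition C3 is an asymptotic requirement, and "large enough $n$" can matter in practice. The sketch below checks C3 on a grid for a per-component penalty of the form $-(S_x\sigma^{-2} + \log\sigma^2)/n$, the form used later in the simulation section. The values $M = 8$ and $S_x = 1$, the grid, and the function names are illustrative assumptions; $M$ depends on the unknown $G_0$ and equals 8 only when the true density is bounded by 8.

```python
import math

def penalty(sigma, n, s_x=1.0):
    """Hypothetical per-component penalty -(s_x / sigma^2 + log sigma^2) / n."""
    return -(s_x / sigma ** 2 + math.log(sigma ** 2)) / n

def c3_holds(n, big_m=8.0, s_x=1.0, grid=200):
    """Check penalty(sigma) <= 4 (log n)^2 log(sigma) on a log-spaced grid in (0, 8/(n*big_m)]."""
    upper = 8.0 / (n * big_m)
    for k in range(grid):
        # Grid runs from `upper` down to `upper * 1e-6`.
        sigma = upper * 10.0 ** (-6.0 * k / (grid - 1))
        if penalty(sigma, n, s_x) > 4.0 * math.log(n) ** 2 * math.log(sigma):
            return False
    return True

for n in (100, 1000, 10000):
    print(n, c3_holds(n))
```

Under these assumed values the inequality fails at the boundary $\sigma = 8/(nM)$ for $n = 100$ and holds on the whole grid for $n = 10000$, which is consistent with C3 being stated only for large $n$. A grid check is of course not a proof for all $\sigma$ in the interval.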

When the penalty functions depend on the data, the above conditions are understood to hold almost surely. These three conditions are flexible, and functions satisfying them can easily be constructed. Some examples are given in the simulation section. More specifically, C2 rules out functions that substantially elevate or depress the penalized likelihood at any parameter value. At the same time, C2 allows the penalty to be very severe in a shrinking neighborhood of $\sigma = 0$ (condition C3).

#### 3.2. Consistency of the PMLE when $p = p_0 = 2$

For clarity, we first consider the case where $p = p_0 = 2$. Let $K_0 = E_0\log f(X;G_0)$, where $E_0(\cdot)$ means expectation with respect to the true density $f(x;G_0)$. It can be seen that $|K_0| < \infty$. Let $\epsilon_0$ be a small positive constant such that

1. $0 < \epsilon_0 < \exp(-2)$;
2. $16M\epsilon_0(\log\epsilon_0)^2 \le 1$;
3. $-\log\epsilon_0 - (\log\epsilon_0)^2/2 \le 2K_0 - 4$.

It can easily be seen that as $\epsilon_0 \to 0$, the inequalities are satisfied. Hence, the existence of $\epsilon_0$ is assured. The value of $\epsilon_0$ carries no specific meaning. For some small $\tau_0 > 0$, we define the regions

$$\Gamma_1 = \{G : \sigma_1 \le \sigma_2 \le \epsilon_0\}, \quad \Gamma_2 = \{G : \sigma_1 \le \tau_0,\ \sigma_2 > \epsilon_0\}, \quad \Gamma_3 = \Gamma - (\Gamma_1 \cup \Gamma_2).$$

See Figure 3.1. The exact size of $\tau_0$ will be specified later.

Figure 3.1. Partition of Parameter Space $\Gamma$ (axes $\sigma_1$ and $\sigma_2$; regions I, II, III).

These three regions represent three situations. One is when the mixing distribution has both scale parameters close to zero. In this case, the number of observations near either one of the location parameters is assessed in the last section; their likelihood contributions are large, but are countered by the penalty, so the PMLE has a diminishing probability of being in $\Gamma_1$. In the second case, the likelihood has two major sources: the observations near a location parameter with a small scale parameter, and the remaining observations. The first source is countered by the penalty, and the likelihood from the second source is not large enough to exceed the likelihood at the true mixing distribution, so the PMLE also has a diminishing probability of being in $\Gamma_2$.

The following theorem shows that the penalized log-likelihood function is bounded on $\Gamma_1$ in some sense.

Theorem 1. Assume that the random sample is from the normal mixture model with $p = p_0 = 2$, and let $pl_n(G)$ be defined as in (1.1) with the penalty function $p_n(G)$ satisfying C1-C3. Then $\sup_{G\in\Gamma_1}pl_n(G) - pl_n(G_0) \to -\infty$ almost surely when $n \to \infty$.

Proof. Let $A_1 = \{i : |X_i - \mu_1| < |\sigma_1\log\sigma_1|\}$ and $A_2 = \{i : |X_i - \mu_2| < |\sigma_2\log\sigma_2|\}$. For any index set, say $S$, we define

$$l_n(G;S) = \sum_{i\in S}\log\Big[\frac{\pi}{\sigma_1}\phi\Big(\frac{X_i-\mu_1}{\sigma_1}\Big) + \frac{1-\pi}{\sigma_2}\phi\Big(\frac{X_i-\mu_2}{\sigma_2}\Big)\Big],$$

hence $l_n(G) = l_n(G;A_1) + l_n(G;A_1^cA_2) + l_n(G;A_1^cA_2^c)$. We investigate the asymptotic order of these three terms. Let $n(A)$ be the number of observations in $A$. From the fact that the mixture density is no larger than $1/\sigma_1$, we get $l_n(G;A_1) \le -n(A_1)\log\sigma_1$ and, with a slight refinement, we get $l_n(G;A_1^cA_2) \le -n(A_1^cA_2)\log\sigma_2 \le -n(A_2)\log\sigma_2$. By the bounds for $n(A_1)$ and $n(A_1^cA_2)$ given in Lemma 2, almost surely, we have

$$\begin{aligned}
l_n(G;A_1) &\le \begin{cases} -4(\log n)^2\log\sigma_1, & 0 < \sigma_1 \le \dfrac{8}{nM}, \\[4pt] -8\log\sigma_1 + 8Mn\sigma_1(\log\sigma_1)^2, & \dfrac{8}{nM} < \sigma_1 < \epsilon_0, \end{cases} \\
l_n(G;A_1^cA_2) &\le \begin{cases} -4(\log n)^2\log\sigma_2, & 0 < \sigma_2 \le \dfrac{8}{nM}, \\[4pt] -8\log\sigma_2 + 8Mn\sigma_2(\log\sigma_2)^2, & \dfrac{8}{nM} < \sigma_2 < \epsilon_0. \end{cases}
\end{aligned} \qquad (3.1)$$

From (3.1) and condition C3, we obtain that $l_n(G;A_1) + \tilde p_n(\sigma_1) < 0$ when $0 < \sigma_1 \le 8/(nM)$. When $8/(nM) < \sigma_1 < \epsilon_0$, based on the choice of $\epsilon_0$ we have, almost surely,

$$l_n(G;A_1) + \tilde p_n(\sigma_1) \le 8Mn\sigma_1(\log\sigma_1)^2 - 8\log\sigma_1 \le 8Mn\epsilon_0(\log\epsilon_0)^2 + 9\log n.$$
The two bounds just obtained can be unified as $l_n(G;A_1) + \tilde p_n(\sigma_1) \le 8Mn\epsilon_0(\log\epsilon_0)^2 + 9\log n$.

Similarly, we can show that

$$l_n(G;A_1^cA_2) + \tilde p_n(\sigma_2) \le 8Mn\epsilon_0(\log\epsilon_0)^2 + 9\log n.$$

For observations falling outside both $A_1$ and $A_2$, the log-likelihood contributions are bounded by

$$\log\{\pi\sigma_1^{-1}\phi(-\log\sigma_1) + (1-\pi)\sigma_2^{-1}\phi(-\log\sigma_2)\} \le -\log\epsilon_0 - \frac{(\log\epsilon_0)^2}{2},$$

which is negative. At the same time it is easy to show that, almost surely as $n \to \infty$,

$$n(A_1^cA_2^c) \ge n - \{n(A_1) + n(A_2)\} \ge n/2.$$

Hence we get the bound $l_n(G;A_1^cA_2^c) \le (n/2)\{-\log\epsilon_0 - (\log\epsilon_0)^2/2\}$. Combining the three bounds and recalling the choice of $\epsilon_0$, we conclude that when $G \in \Gamma_1$,

$$\begin{aligned}
pl_n(G) &= [l_n(G;A_1) + \tilde p_n(\sigma_1)] + [l_n(G;A_1^cA_2) + \tilde p_n(\sigma_2)] + l_n(G;A_1^cA_2^c) \\
&\le 16Mn\epsilon_0(\log\epsilon_0)^2 + \Big(\frac{n}{2}\Big)\Big[-\log\epsilon_0 - \frac{(\log\epsilon_0)^2}{2}\Big] + 18\log n \\
&\le n + \Big(\frac{n}{2}\Big)(2K_0 - 4) + 18\log n \\
&= n(K_0 - 1) + 18\log n.
\end{aligned}$$

At the same time, by the Strong Law of Large Numbers, $n^{-1}pl_n(G_0) \to K_0$ almost surely. Hence, $\sup_{G\in\Gamma_1}pl_n(G) - pl_n(G_0) \le -n + 18\log n \to -\infty$ almost surely as $n \to \infty$. This completes the proof.

To establish a similar result for $\Gamma_2$, we define

$$g(x;G) = a_1\frac{\pi}{\sqrt2}\phi\Big(\frac{x-\mu_1}{\sqrt2\sigma_1}\Big) + a_2\frac{1-\pi}{\sigma_2}\phi\Big(\frac{x-\mu_2}{\sigma_2}\Big),$$

with $a_1 = I(\sigma_1 \ne 0, \mu_1 \ne \pm\infty)$ and $a_2 = I(\mu_2 \ne \pm\infty)$ for all $G$ in the compactified $\Gamma_2$. Note that the first part is not a normal density function as it lacks $\sigma_1$ in the denominator of the coefficient. Because of this, the function is well behaved when $\sigma_1$ is close to 0 and at 0. It is easy to show that the function $g(x;G)$ has the following properties:

1. $g(x;G)$ is continuous in $G$ almost surely w.r.t. $f(x;G_0)$;
2. $E_0\log\{g(X;G)/f(X;G_0)\} < 0$ for all $G \in \Gamma_2$ (the Jensen inequality);
3. $\sup\{g(x;G) : G \in \Gamma_2\} \le \epsilon_0^{-1}$.

Without loss of generality, we can choose $\epsilon_0$ small enough so that $G_0 \notin \Gamma_2$. Consequently we can easily show, as in Wald (1949), that

$$\sup_{G\in\Gamma_2}\Big\{\frac{1}{n}\sum_{i=1}^{n}\log\Big(\frac{g(X_i;G)}{f(X_i;G_0)}\Big)\Big\} \to -\delta(\tau_0) < 0, \quad \text{a.s., as } n \to \infty. \qquad (3.2)$$

Note that $\delta(\tau_0) > 0$ is a decreasing function of $\tau_0$. Hence, we can find a $\tau_0$ such that (a) $\tau_0 < \epsilon_0$, and (b) $8M\tau_0(\log\tau_0)^2 \le 2\delta(\epsilon_0)/5 < 2\delta(\tau_0)/5$. Let $\tau_0$ satisfy these two conditions. Then the PMLE cannot be in $\Gamma_2$, as is stated in the following theorem.

Theorem 2. Assume the conditions of Theorem 1. Then $\sup_{G\in\Gamma_2}pl_n(G) - pl_n(G_0) \to -\infty$ a.s. as $n \to \infty$.

Proof. It is easily seen that the log-likelihood contribution of observations in $A_1$ is no larger than $-\log\sigma_1 + \log g(X_i;G)$. For other observations the log-likelihood contributions are less than $\log g(X_i;G)$. This follows since, if $|x - \mu_1| \ge |\sigma_1\log\sigma_1|$ and $\sigma_1$ is sufficiently small,

$$\frac{1}{\sigma_1}\exp\Big\{-\frac{(x-\mu_1)^2}{2\sigma_1^2}\Big\} \le \exp\Big\{-\frac{(x-\mu_1)^2}{4\sigma_1^2}\Big\}.$$

Combined with the properties of the penalty function and (3.2), we have

$$\begin{aligned}
\sup_{\Gamma_2}pl_n(G) - pl_n(G_0)
&\le \sup_{\sigma_1\le\tau_0}\Big\{\sum_{i\in A_1}\log\Big(\frac{1}{\sigma_1}\Big) + \tilde p_n(\sigma_1)\Big\} + \sup_{\Gamma_2}\sum_{i=1}^{n}\log\Big\{\frac{g(X_i;G)}{f(X_i;G_0)}\Big\} + p_n(G_0) \\
&\le 8Mn\tau_0(\log\tau_0)^2 + 9\log n - \frac{9\delta(\tau_0)n}{10} + p_n(G_0) \\
&\le -\frac{\delta(\tau_0)n}{2} + 9\log n + p_n(G_0),
\end{aligned}$$

which goes to $-\infty$ as $n \to \infty$ in view of condition C2 on $p_n(G_0)$.

We now claim the strong consistency of the PMLE.

Theorem 3. Assume the conditions of Theorem 1. For any mixing distribution $G_n = G_n(X_1,\ldots,X_n)$ satisfying $pl_n(G_n) - pl_n(G_0) > c > -\infty$, we have $G_n \to G_0$ a.s. as $n \to \infty$.

Proof. By Theorems 1 and 2, with probability one, $G_n \in \Gamma_3$ as $n \to \infty$. Confining the mixing distribution $G$ in $\Gamma_3$ is equivalent to placing a positive constant lower bound for the variance parameters. Thus, consistency is covered by the result in Kiefer and Wolfowitz (1956). Note that their proof can be modified to accommodate a penalty of size $o(n)$ due to (2.12) of that paper.

Let $\hat G_n$ be the PMLE that maximizes $pl_n(G)$. By definition, $pl_n(\hat G_n) - pl_n(G_0) > 0$, and therefore $\hat G_n \to G_0$ almost surely.

Corollary 1. Under the conditions of Theorem 1, the PMLE $\hat G_n$ is strongly consistent.
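The elementary inequality used at the start of the proof of Theorem 2 can be checked directly. The following worked step is an addition, assuming the inequality reads $\sigma_1^{-1}\exp\{-(x-\mu_1)^2/(2\sigma_1^2)\} \le \exp\{-(x-\mu_1)^2/(4\sigma_1^2)\}$; the explicit threshold $\sigma_1 \le e^{-4}$ is one sufficient choice derived here, not a constant stated in the paper.

```latex
\begin{align*}
\frac{1}{\sigma_1}\exp\Big\{-\frac{y^2}{2\sigma_1^2}\Big\} \le \exp\Big\{-\frac{y^2}{4\sigma_1^2}\Big\}
  &\iff \exp\Big\{-\frac{y^2}{4\sigma_1^2}\Big\} \le \sigma_1
   \iff \frac{y^2}{4\sigma_1^2} \ge -\log\sigma_1,
   \qquad y = x - \mu_1. \\
\intertext{If $|y| \ge |\sigma_1\log\sigma_1|$, then}
\frac{y^2}{4\sigma_1^2} &\ge \frac{(\log\sigma_1)^2}{4} \ge -\log\sigma_1
   \quad\text{whenever } -\log\sigma_1 \ge 4, \text{ i.e. } \sigma_1 \le e^{-4}.
\end{align*}
```

So "sufficiently small" can be taken to mean $\sigma_1 \le e^{-4}$ in this step.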

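#### 3.3. Strong consistency of the PMLE when $p = p_0 > 2$

The strong consistency of the PMLE for the case where $p_0 > 2$ can be proved in the same manner. For sufficiently small positive constants $\epsilon_{10} \ge \epsilon_{20} \ge \cdots \ge \epsilon_{p0}$, we partition the parameter space $\Gamma$ into

$$\Gamma_k = \{G : \sigma_1 \le \cdots \le \sigma_{p-k+1} \le \epsilon_{k0};\ \epsilon_{(k-1)0} < \sigma_{p-k+2} \le \cdots \le \sigma_p\},$$

for $k = 1,\ldots,p$, and $\Gamma_{p+1} = \Gamma - \bigcup_{k=1}^{p}\Gamma_k$. The choice of $\epsilon_{k0}$ ($k = 2,\ldots,p$) will be given after $\epsilon_{(k-1)0}$ is selected. Let

$$g_k(x;G) = \sum_{j=1}^{p-k+1}\frac{\pi_j}{\sqrt2}\phi\Big(\frac{x-\mu_j}{\sqrt2\sigma_j}\Big)I(\sigma_j \ne 0, \mu_j \ne \pm\infty) + \sum_{j=p-k+2}^{p}\frac{\pi_j}{\sigma_j}\phi\Big(\frac{x-\mu_j}{\sigma_j}\Big)I(\mu_j \ne \pm\infty).$$

We can show that

$$\sup_{G\in\Gamma_k}\Big\{n^{-1}\sum_{i}\log\Big(\frac{g_k(X_i;G)}{f(X_i;G_0)}\Big)\Big\} \to -\delta(\epsilon_{k0}) < -\delta(\epsilon_{(k-1)0}) < 0 \qquad (3.3)$$

almost surely as $n \to \infty$. The constants $\epsilon_{k0}$ are then chosen so that (a) $\epsilon_{k0} < \epsilon_{(k-1)0}$, and (b) $8(p-k+1)M\epsilon_{k0}(\log\epsilon_{k0})^2 < 2\delta(\epsilon_{(k-1)0})/5$. In this way, $\Gamma_1,\Gamma_2,\ldots,\Gamma_p$ are defined successively. Observe that (3.3) holds since none of $\Gamma_1,\ldots,\Gamma_p$ contains $G_0$. This fact will be used again later.

The proof of the general case is accomplished in three general steps. First, the probability that the PMLE belongs to $\Gamma_1$ goes to zero; this is true because all the $\sigma_k$'s are small. Second, we show the same for $\Gamma_k$, $k = 2,3,\ldots,p$. Third, when $G$ is confined to $\Gamma_{p+1}$, consistency of the PMLE is covered by Kiefer and Wolfowitz (1956).

Step 1. For $k = 1,\ldots,p$, let $A_k = \{i : |X_i - \mu_k| \le |\sigma_k\log\sigma_k|\}$. For sufficiently small $\epsilon_{10}$ and $G \in \Gamma_1$, we have

$$l_n(G;A_1^cA_2^c\cdots A_{k-1}^cA_k) + \tilde p_n(\sigma_k) \le 8Mn\epsilon_{10}(\log\epsilon_{10})^2 + 9\log n$$

for $k = 1,\ldots,p$, almost surely. (The factor $n$ in the leading term matches the corresponding bound in the proof of Theorem 1 and in Step 2.) Therefore, the likelihood contribution of the $X_i$'s in $A_1,\ldots,A_p$, plus the penalty term, is

$$\sum_{k=1}^{p}\{l_n(G;A_1^cA_2^c\cdots A_{k-1}^cA_k) + \tilde p_n(\sigma_k)\} \le 8pMn\epsilon_{10}(\log\epsilon_{10})^2 + 9p\log n.$$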

At the same time, the total likelihood contributions of the $X_i$ not in $A_1,\ldots,A_p$ are bounded since

$$l_n(G;A_1^c\cdots A_p^c) \le \frac{n}{2}\{-\log\epsilon_{10} - (\log\epsilon_{10})^2/2\}.$$

A sufficiently small $\epsilon_{10}$ not depending on $n$ can hence be found such that $pl_n(G) - pl_n(G_0) < -n + 9p\log n + p_n(G_0)$ almost surely and uniformly for $G \in \Gamma_1$. The fact that the upper bound goes to $-\infty$ as $n \to \infty$ leads to the conclusion of the first step.

Step 2. The definition of $g_k(x;G)$ is useful in this step. As before, $\sup_{\Gamma_k}E_0\log\{g_k(X;G)/f(X;G_0)\} < 0$. Hence, using the same idea as for $p_0 = 2$, we get

$$\begin{aligned}
\sup_{\Gamma_k}pl_n(G) - pl_n(G_0)
&\le \sum_{j=1}^{p-k+1}\sup_{\sigma_j<\epsilon_{k0}}\Big\{\sum_{i\in A_j}\log\Big(\frac{1}{\sigma_j}\Big) + \tilde p_n(\sigma_j)\Big\} + \sup_{\Gamma_k}\sum_{i=1}^{n}\log\Big\{\frac{g_k(X_i;G)}{f(X_i;G_0)}\Big\} \\
&\le (p-k+1)\{8Mn\epsilon_{k0}(\log\epsilon_{k0})^2 + 9\log n\} - \frac{9\delta(\epsilon_{k0})n}{10} \\
&\le -\frac{\delta(\epsilon_{k0})n}{2} + 9(p-k+1)\log n.
\end{aligned}$$

The last step is a consequence of the choice of these constants. In conclusion, the PMLE is not in $\Gamma_1,\ldots,\Gamma_p$ except for a zero probability event.

Step 3. Again, confining $G$ to $\Gamma_{p+1}$ amounts to setting up a positive constant lower bound on $\sigma_k$, $k = 1,\ldots,p$. Thus, the consistency proof of the PMLE is covered by Kiefer and Wolfowitz (1956).

In summary, the PMLE of $G$ when $p = p_0 > 2$ is also consistent.

Theorem 4. Assume that $p_n(G)$ satisfies C1-C3, and that $pl_n(G)$ is defined as in (1.1). Then for any sequence $G_n = G_n(X_1,\ldots,X_n)$ with $p = p_0$ components satisfying $pl_n(G_n) - pl_n(G_0) > c > -\infty$ for all $n$, we have $G_n \to G_0$ almost surely.

#### 3.4. Convergence of the PMLE when $p \ge p_0$

The exact number of components is often unknown in applications. Thus, it is particularly important to be able to estimate $G$ consistently when only an upper bound $p$ is known. One such estimator is the PMLE with at most $p$ components. Other than in Kiefer and Wolfowitz (1956) and Leroux (1992), whose results do not apply to finite mixtures of normal models, there has been limited discussion of this problem. When $p_0 < p < \infty$, we cannot expect that every part of $G$ converges to that of $G_0$. We measure their difference by

$$H(G,G_0) = \int_{R\times R^+}|G(\lambda) - G_0(\lambda)|\exp\{-|\mu| - \sigma^2\}\,d\mu\,d\sigma^2, \qquad (3.4)$$

where $\lambda = (\mu,\sigma^2)$. It is easily seen that $H(G_n,G_0) \to 0$ implies $G_n \to G_0$ in distribution. An estimator $\hat G_n$ is strongly consistent if $H(\hat G_n,G_0) \to 0$ almost surely.

Theorem 5. Under the conditions of Theorem 1 except that $p_0 \le p < \infty$, for any mixing distribution $G_n = G_n(X_1,\ldots,X_n)$ satisfying $pl_n(G_n) - pl_n(G_0) \ge c > -\infty$, we have $H(G_n,G_0) \to 0$ almost surely as $n \to \infty$.

Most intermediate steps in the proof of consistency of the PMLE when $p = p_0 \ge 2$ are still applicable; some need minor changes. We use many of these results and notations to establish a brief proof.

For an arbitrarily small positive number $\delta$, take $H(\delta) = \{G : G \in \Gamma, H(G,G_0) \ge \delta\}$. That is, $H(\delta)$ contains all mixing distributions with up to $p$ components that are at least $\delta > 0$ distance from the true mixing distribution $G_0$.

Since $G_0 \notin H(\delta)$, we have $E[\log\{g_k(X;G)/f(X;G_0)\}] < 0$ for any $G \in H(\delta)\cap\Gamma_k$, $k = 2,3,\ldots,p$. Thus (3.3) remains valid if revised to

$$\sup_{G\in H(\delta)\cap\Gamma_k}n^{-1}\sum_{i=1}^{n}\log\Big\{\frac{g_k(X_i;G)}{f(X_i;G_0)}\Big\} \to -\delta(\epsilon_{k0})$$

for some $\delta(\epsilon_{k0}) > 0$. Because of this, the derivations in Section 3.3 still apply after $\Gamma_k$ is replaced by $H(\delta)\cap\Gamma_k$. That is, with proper choice of $\epsilon_{k0}$, we can get $\sup_{G\in H(\delta)\cap\Gamma_k}pl_n(G) - pl_n(G_0) \to -\infty$ for all $k = 1,\ldots,p$.

With what we have proved, it can be seen that the penalized maximum likelihood estimator of $G$, $\hat G_n$, must almost surely belong to $H^c(\delta)\cup\Gamma_{p+1}$, where $H^c(\delta)$ is the complement of $H(\delta)$. Since $\delta$ is arbitrarily small, $\hat G_n \in H^c(\delta)$ implies $H(\hat G_n,G_0) \to 0$. On the other hand, $\hat G_n \in \Gamma_{p+1}$ is equivalent to putting a positive lower bound on the component variances, which also implies $H(\hat G_n,G_0) \to 0$ by Kiefer and Wolfowitz (1956).
That is, consistency of the PMLE is also true when $p \ge p_0$.

### 4. Simulation and a Data Example

In this section, we present some simulation results and a data example.

#### 4.1. The EM algorithm

The EM algorithm is a preferred numerical method in finite mixture models due to its simplicity in coding, and guaranteed convergence to some local maximum under general conditions (Wu (1983)). The EM algorithm can also be easily modified to work with the penalized likelihood method. Often, the penalized log-likelihood function also increases after each EM iteration (Green (1990)) and the algorithm converges quickly.

Let $z_{ik}$ equal 1 when the $i$th observation is from the $k$th component, and equal 0 otherwise. The complete observation log-likelihood under the normal mixture model is given by

$$l_c(G) = \sum_{i}\sum_{k}z_{ik}\{\log\pi_k - \log\sigma_k - (2\sigma_k^2)^{-1}(X_i - \mu_k)^2\}.$$

Given the current parameter value $G^{(m)} = (\pi_1^{(m)},\ldots,\pi_p^{(m)},\mu_1^{(m)},\ldots,\mu_p^{(m)},\sigma_1^{(m)},\ldots,\sigma_p^{(m)})$, the EM algorithm iterates as follows.

In the E-step, we compute the conditional expectation

$$\pi_{ik}^{(m+1)} = E\{z_{ik} \mid x;G^{(m)}\} = \frac{\pi_k^{(m)}\phi(X_i;\mu_k^{(m)},\sigma_k^{2(m)})}{\sum_{j=1}^{p}\pi_j^{(m)}\phi(X_i;\mu_j^{(m)},\sigma_j^{2(m)})}$$

to arrive at

$$\begin{aligned}
Q(G;G^{(m)}) &= E\{l_c(G) + p_n(G) \mid x;G^{(m)}\} \\
&= \sum_{j=1}^{p}(\log\pi_j)\sum_{i=1}^{n}\pi_{ij}^{(m+1)} - \frac{1}{2}\sum_{j=1}^{p}(\log\sigma_j^2)\sum_{i=1}^{n}\pi_{ij}^{(m+1)} - \frac{1}{2}\sum_{j=1}^{p}\sigma_j^{-2}\sum_{i=1}^{n}\pi_{ij}^{(m+1)}(X_i - \mu_j)^2 + p_n(G).
\end{aligned}$$

In the M-step, we maximize $Q(G;G^{(m)})$ with respect to $G$, and an explicit solution is often possible. For example, when we choose

$$p_n(G) = -a_n\Big\{S_x\sum_{j=1}^{p}\sigma_j^{-2} + \sum_{j=1}^{p}\log(\sigma_j^2)\Big\} \qquad (4.1)$$

with $S_x$ being a function of the data, $Q(G;G^{(m)})$ is maximized at $G = G^{(m+1)}$ with

$$\pi_j^{(m+1)} = \frac{1}{n}\sum_{i=1}^{n}\pi_{ij}^{(m+1)}, \qquad
\mu_j^{(m+1)} = \frac{\sum_{i=1}^{n}\pi_{ij}^{(m+1)}X_i}{\sum_{i=1}^{n}\pi_{ij}^{(m+1)}}, \qquad
\sigma_j^{2(m+1)} = \frac{2a_nS_x + S_j^{(m+1)}}{\sum_{i=1}^{n}\pi_{ij}^{(m+1)} + 2a_n},$$

where $S_j^{(m+1)} = \sum_{i=1}^{n}\pi_{ij}^{(m+1)}(X_i - \mu_j^{(m+1)})^2$.

It is worth observing here that adding the penalty function (4.1) results in a soft constraint on the scale parameters. From the update formula on $\sigma_j^2$, we can easily see that $2a_nS_x/(n + 2a_n) \le \sigma_j^{2(m)}$. We choose the initial values of the location parameters to be within the data range and then it follows that $\sigma_j^{2(m)} \le (2a_nS_x + n^2)/(2a_n)$. At the same time, from a Bayesian point of view, the penalty function (4.1) puts an Inverse Gamma distribution prior on $\sigma_j^2$, where $S_x$ is the mode of the prior distribution or a prior estimate of $\sigma_j^2$, and a large value of $a_n$ implies a strong conviction in the prior estimate.

#### 4.2. Simulation results for $p = p_0$

When $p = p_0$, it is meaningful to investigate the bias and variance properties of individual parts of the PMLE. To obtain the results in this section, we generated data from two- and three-component normal mixture models. Two sample sizes, $n = 100$ and $n = 300$, were chosen to examine consistency. We computed the bias and standard deviation of the PMLEs based on 5,000 replicates.

The EM algorithm might miss the global maximum. In our simulation, we used true values as initial values. The EM algorithm was terminated when $\|\lambda^{(m)} - \lambda^{(m+1)}\|_\infty$ was smaller than a preset tolerance, where $\|v\|_\infty$ denotes the maximal absolute value among the elements of the vector $v$. We found that the outcomes were satisfactory.

A desirable property of statistical inference for location-scale models is invariance. In this context, given any two real numbers $a$ and $b$ with $a \ne 0$, this means that the PMLE $\tilde G$ based on $Y_i = aX_i + b$ and the PMLE $\hat G$ based on $X_i$, $i = 1,\ldots,n$, have the functional relationship $\tilde G(a\mu + b, a\sigma) = \hat G(\mu,\sigma)$. This is true for the ordinary MLE in general, but is not necessarily true for the PMLE unless we choose our penalty function carefully.
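The E-step and the closed-form M-step of Section 4.1 under penalty (4.1) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the simulated data, the starting values, the stopping tolerance, and the choice $a_n = 1/n$ with $S_x$ taken as the sample variance of the middle half of the data are assumptions made for the example.

```python
import math
import random
import statistics

def normal_pdf(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def middle_half_variance(xs):
    """Sample variance of the observations between the two sample quartiles (rough version)."""
    s = sorted(xs)
    return statistics.variance(s[len(s) // 4: (3 * len(s)) // 4])

def penalized_em(xs, pis, mus, variances, a_n, s_x, tol=1e-8, max_iter=500):
    """EM for a normal mixture with penalty -a_n * sum_j (s_x / var_j + log var_j)."""
    n, p = len(xs), len(pis)
    for _ in range(max_iter):
        # E-step: posterior probability that observation i belongs to component j.
        weights = []
        for x in xs:
            dens = [pis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(p)]
            total = sum(dens)
            weights.append([d / total for d in dens])
        # M-step: closed-form updates; the variance update carries the penalty terms.
        new_pis, new_mus, new_vars = [], [], []
        for j in range(p):
            n_j = sum(w[j] for w in weights)
            mu_j = sum(w[j] * x for w, x in zip(weights, xs)) / n_j
            s_j = sum(w[j] * (x - mu_j) ** 2 for w, x in zip(weights, xs))
            new_pis.append(n_j / n)
            new_mus.append(mu_j)
            new_vars.append((2.0 * a_n * s_x + s_j) / (n_j + 2.0 * a_n))
        old = pis + mus + variances
        pis, mus, variances = new_pis, new_mus, new_vars
        if max(abs(a - b) for a, b in zip(old, pis + mus + variances)) < tol:
            break
    return pis, mus, variances

# Simulated data from 0.5 N(0, 1) + 0.5 N(3, 9); the seed and size are arbitrary.
random.seed(2008)
xs = [random.gauss(0.0, 1.0) if random.random() < 0.5 else random.gauss(3.0, 3.0)
      for _ in range(200)]
n = len(xs)
s_x = middle_half_variance(xs)
a_n = 1.0 / n
srt = sorted(xs)
start_mus = [srt[n // 4], srt[(3 * n) // 4]]  # starting means inside the data range
pis, mus, variances = penalized_em(xs, [0.5, 0.5], start_mus, [s_x, s_x], a_n, s_x)
print(pis, mus, variances)
```

Because every variance update has numerator at least $2a_nS_x$ and denominator at most $n + 2a_n$, the fitted variances cannot fall below $2a_nS_x/(n + 2a_n)$, which is the soft constraint noted in the text.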
For illustration, from a large number of possible penalty functions satisfying conditions C1-C3, we select the penalty functions

P0: $p_n(G) = -\{S_x\sum_{j=1}^{p}\sigma_j^{-2} + \sum_{j=1}^{p}\log(\sigma_j^2)\}/n$,

P1: $p_n(G) = -0.4\sum_{j=1}^{p}(\sigma_j^{-2} + \log\sigma_j^2)$.

Note that C3 is satisfied because when $\sigma < 8/(nM)$, $\sigma^{-2} \ge n^2$. The quantity $S_x$ in P0 is chosen as the sample variance of the observations between two sample quartiles, 25% and 75%, in our simulations. Unlike P1, P0 is invariant under location-scale transformations. The choice of the constant 0.4 in P1 is somewhat arbitrary. A sensible choice should depend on $n$ and the (unknown) values of the true variances. Replacing the true variances with an estimate brings us back to P0. In the case of P0, the influence of the penalty is minor when the $\sigma_j$'s are not close to 0, yet it effectively stops the irregularity. Replacing $1/n$ by $1/\sqrt{n}$ or 1 does not markedly change our simulation results. We included P1 in the simulation to illustrate the importance of the invariance property. For this reason, we computed the PMLEs based on $Y_i = X_i/a$, $i = 1,\ldots,n$, for $a = 3.0$, $5.0$, $10.0$.

In applications, it is a common practice to estimate $G$ with a good local maximum $\hat G$ of the likelihood function such that $\hat\sigma_j^2 \ne 0$ for all $j$. Although there are few theoretical guidelines for choosing among the local maxima, we can often identify one that best fits the data by some standard. We regard as the MLE the local maximum located by the EM algorithm with the true mixing distribution as the initial value. When the EM algorithm leads to a local maximum with $\hat\sigma_j^2 = 0$ for some $j$, this outcome is removed; the simulated bias and standard deviation are based on outcomes where no $\hat\sigma_j^2 = 0$. The results provide a yardstick for the proposed PMLEs.

Example 1. We consider a two-component normal mixture model with

$$G_0 = (\pi_0,\mu_{10},\sigma_{10}^2,\mu_{20},\sigma_{20}^2) = (0.5, 0, 1, 3, 9).$$

The density function of this model has two modes. The biases and standard deviations (in brackets) of the parameter estimators are presented in Table 1. To make the comparison more sensible, we compute the relative bias and standard deviation of $\hat\sigma_j^2$ in terms of $(\hat\sigma_j^2 - \sigma_{j0}^2)/\sigma_{j0}^2$ instead of $\hat\sigma_j^2 - \sigma_{j0}^2$. The rows marked $P1^1$, $P1^2$, $P1^3$ are the biases and standard deviations of the PMLEs of P1 calculated based on transformed data with $a = 3.0$, $5.0$, and $10.0$ respectively. In addition, these values were transformed back to the original scale for easy comparisons.

We note that P0 and P1 have similar performance to the MLE and therefore are both very efficient.
The PMLE of P1 is not invariant, and its performance worsens as a increases; the invariance consideration is important when selecting appropriate penalty functions. When a decreases, though, the performance of P1 does not deteriorate. When the sample size increases, all biases and standard deviations decrease, reflecting the consistency of the PMLE. The PMLE based on P1 still suffers from not being invariant but the effect is not as severe. Probably due to the well separated kernel densities, and the use of the true mixing distribution as initial values, the EM algorithm converged to a reasonable local maximum in all cases. Example 2. In this example, we choose the two-component normal mixture model with G 0 = (π 0, 10,σ10 2, 20,σ20 2 ) = (0.5,0,1,1.5,3). In contrast to the model used in Example 1, the density function of this model has only one mode.

Table 1. Simulation Results for Example 1, Bias and Standard Deviation.

| n | Estimator | $\pi$ (= 0.5) | $\mu_1$ (= 0) | $\mu_2$ (= 3) | $\sigma_1^2$ (= 1) | $\sigma_2^2$ (= 9) |
|---|---|---|---|---|---|---|
| 100 | MLE | 0.052(0.13) | 0.038(0.27) | 0.519(1.05) | 0.126(0.58) | (0.32) |
| 100 | P0 | (0.13) | 0.038(0.27) | 0.521(1.05) | 0.127(0.58) | (0.32) |
| 100 | P1 | (0.13) | 0.045(0.24) | 0.589(1.09) | 0.155(0.60) | (0.32) |
| 100 | $P1^1$ | (0.11) | 0.107(0.26) | 0.917(1.00) | 0.536(0.54) | (0.32) |
| 100 | $P1^2$ | (0.08) | 0.184(0.35) | 1.188(0.82) | 1.072(0.65) | (0.32) |
| 100 | $P1^3$ | (0.15) | 0.865(0.56) | 1.488(1.07) | 5.361(10.7) | 2.892(3.97) |
| 300 | MLE | 0.015(0.07) | 0.006(0.12) | 0.145(0.53) | 0.027(0.27) | (0.16) |
| 300 | P0 | (0.07) | 0.006(0.12) | 0.145(0.53) | 0.027(0.27) | (0.16) |
| 300 | P1 | (0.07) | 0.005(0.12) | 0.156(0.54) | 0.030(0.27) | (0.16) |
| 300 | $P1^1$ | (0.07) | 0.030(0.12) | 0.301(0.56) | 0.189(0.26) | (0.18) |
| 300 | $P1^2$ | (0.06) | 0.074(0.13) | 0.534(0.57) | 0.446(0.26) | (0.19) |
| 300 | $P1^3$ | (0.05) | 0.242(0.14) | 0.951(0.47) | 1.374(0.36) | (0.18) |

The EM algorithm may not be able to locate a reasonable local maximum. Otherwise, the set-up is the same as in Example 1. The simulation results are presented in Table 2.

The EM algorithm converged to a local maximum with $\hat\sigma_j^2 = 0$ in the case of the ordinary MLE 46 of 5,000 times when $n = 100$, even though the true parameter $G_0$ was used as the initial value; this number was 1 of 5,000 when $n = 300$. We note that the biases and standard deviations decrease when $n$ increases. In general, the precisions of mixing proportion estimators are not high when the two mixing components are close, as documented in Redner and Walker (1984). The performances of $P1^1$, $P1^2$, and $P1^3$ are poor, reaffirming the importance of invariance.

Example 3. In this example, we consider a three-component normal mixture model with

$$G_0 = (\pi_{10},\mu_{10},\sigma_{10}^2,\pi_{20},\mu_{20},\sigma_{20}^2,\pi_{30},\mu_{30},\sigma_{30}^2) = (0.2, -3.0, 1, 0.5, 0, 0.01, 0.3, 3, 0.5).$$

The simulation results are presented in Table 3. The performances of the MLE and of the PMLE with P0 or P1 are satisfactory. We note again that the invariance issue is important. Probably due to the well-separated component densities, the EM algorithm converged in all cases.
We remark here that when the component densities are not well separated, much larger sample sizes are needed to achieve precision similar to that in our simulation.
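The degeneracy reported above (EM runs ending at a zero variance estimate) reflects the unboundedness of the ordinary likelihood. The following is a minimal sketch, not the authors' code: it places one component mean on a data point and shrinks that component's variance, so the plain log-likelihood keeps growing. The toy data, the function names, and the penalty form $-w \sum_j \{s/\sigma_j^2 + \log \sigma_j^2\}$ are illustrative assumptions and are not claimed to match the paper's P0 or P1.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def mixture_loglik(data, pi, mean1, var1, mean2, var2):
    """Log-likelihood of a two-component normal mixture (log-sum-exp per point)."""
    total = 0.0
    for x in data:
        a = math.log(pi) + normal_logpdf(x, mean1, var1)
        b = math.log(1.0 - pi) + normal_logpdf(x, mean2, var2)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def illustrative_penalty(variances, scale, weight):
    """Hypothetical variance penalty; it tends to -infinity as any variance -> 0."""
    return -weight * sum(scale / v + math.log(v) for v in variances)

# Toy data (assumed for illustration only).
data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.6, 2.2, 3.1]
n = len(data)
sample_mean = sum(data) / n
sample_var = sum((x - sample_mean) ** 2 for x in data) / n

plain, penalized = [], []
for var1 in (1e-2, 1e-4, 1e-8):
    # Component 1 sits exactly on the first observation with a shrinking variance.
    ll = mixture_loglik(data, 0.5, data[0], var1, sample_mean, sample_var)
    plain.append(ll)
    penalized.append(ll + illustrative_penalty((var1, sample_var), sample_var, 1.0 / n))

print("plain log-likelihood:    ", plain)
print("penalized log-likelihood:", penalized)
```

In this sketch the plain log-likelihood increases as the variance shrinks, while the penalized version decreases, which is the qualitative behaviour a variance penalty is meant to produce.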

Table 2. Simulation Results for Example 2, Bias and Standard Deviation.

| n | Method | $\pi$ (= 0.5) | $\theta_1$ (= 0) | $\theta_2$ (= 1.5) | $\sigma_1^2$ (= 1) | $\sigma_2^2$ (= 3) |
|---|---|---|---|---|---|---|
| 100 | MLE | 0.147(0.24) | 0.089(0.40) | 0.987(1.25) | 0.080(0.56) | (0.44) |
| 100 | P | (0.24) | 0.088(0.40) | 0.990(1.25) | 0.079(0.56) | (0.44) |
| 100 | P | (0.22) | 0.105(0.38) | 1.108(1.22) | 0.116(0.50) | (0.40) |
| 100 | P | (0.17) | 0.236(0.33) | 0.716(0.73) | 0.651(1.15) | 0.161(0.57) |
| 100 | P | (0.19) | 0.571(0.31) | 0.705(0.77) | 1.887(3.72) | 4.007(3.21) |
| 100 | P | (0.20) | 0.722(0.29) | 0.482(1.06) | 6.340(19.3) | 30.34(7.26) |
| 300 | MLE | 0.095(0.20) | 0.067(0.23) | 0.547(0.87) | 0.047(0.38) | (0.32) |
| 300 | P | (0.20) | 0.067(0.23) | 0.548(0.87) | 0.047(0.38) | (0.32) |
| 300 | P | (0.19) | 0.077(0.22) | 0.615(0.89) | 0.070(0.36) | (0.31) |
| 300 | P | (0.11) | 0.129(0.17) | 0.566(0.53) | 0.264(0.25) | (0.25) |
| 300 | P | (0.11) | 0.280(0.20) | 0.573(0.36) | 0.623(1.09) | 0.358(1.10) |
| 300 | P | (0.14) | 0.706(0.25) | 1.214(1.05) | 3.495(12.8) | 27.21(10.7) |

Table 3. Simulation Results for Example 3, Bias and Standard Deviation.

| Method | $\pi_1$ (= 0.2), n=100 | $\pi_2$ (= 0.5), n=100 | $\pi_3$ (= 0.3), n=100 | $\pi_1$ (= 0.2), n=300 | $\pi_2$ (= 0.5), n=300 | $\pi_3$ (= 0.3), n=300 |
|---|---|---|---|---|---|---|
| MLE | 0.000(0.04) | 0.001(0.05) | (0.05) | 0.000(0.02) | 0.000(0.03) | 0.000(0.03) |
| P | (0.04) | 0.001(0.05) | (0.05) | 0.000(0.02) | 0.000(0.03) | 0.000(0.03) |
| P | (0.04) | 0.002(0.05) | (0.05) | (0.02) | 0.001(0.03) | 0.000(0.03) |
| P | (0.04) | 0.006(0.05) | (0.05) | (0.02) | 0.002(0.03) | (0.03) |
| P | (0.06) | (0.08) | 0.005(0.09) | (0.02) | 0.003(0.03) | (0.03) |
| P | (0.19) | (0.00) | 0.662(0.19) | (0.09) | (0.22) | 0.512(0.31) |

| Method | $\theta_1$ (= 3), n=100 | $\theta_2$ (= 0), n=100 | $\theta_3$ (= 3), n=100 | $\theta_1$ (= 3), n=300 | $\theta_2$ (= 0), n=300 | $\theta_3$ (= 3), n=300 |
|---|---|---|---|---|---|---|
| MLE | 0.005(0.25) | 0.000(0.01) | 0.001(0.13) | (0.13) | 0.000(0.01) | 0.000(0.08) |
| P | (0.25) | 0.000(0.01) | 0.001(0.13) | (0.13) | 0.000(0.01) | 0.000(0.08) |
| P | (0.24) | (0.02) | 0.001(0.13) | (0.13) | 0.000(0.01) | 0.001(0.08) |
| P | (0.24) | (0.02) | 0.005(0.13) | (0.13) | (0.01) | 0.002(0.08) |
| P | (0.46) | (0.08) | (0.37) | (0.21) | (0.01) | 2.010(0.13) |
| P | (0.38) | (0.39) | (0.21) | 1.970(1.01) | (0.32) | (1.16) |

| Method | $\sigma_1^2$ (= 1), n=100 | $\sigma_2^2$ (= 0.01), n=100 | $\sigma_3^2$ (= 0.5), n=100 | $\sigma_1^2$ (= 1), n=300 | $\sigma_2^2$ (= 0.01), n=300 | $\sigma_3^2$ (= 0.5), n=300 |
|---|---|---|---|---|---|---|
| MLE | (0.41) | (0.21) | (0.26) | (0.20) | (0.12) | (0.15) |
| P | (0.41) | 0.000(0.21) | (0.26) | (0.20) | (0.12) | (0.15) |
| P | (0.34) | 1.600(0.30) | (0.25) | (0.19) | 0.534(0.13) | (0.15) |
| P | (0.31) | 15.33(2.11) | 0.403(0.26) | 0.052(0.18) | 4.935(0.36) | 0.130(0.15) |
| P | (2.72) | 94.78(325.) | 1.727(2.35) | 1.938(0.50) | 15.46(1.01) | 2.129(0.39) |
| P | (18.2) | $10^4$(0.01) | 17.22(36.4) | 73.34(42.6) | $> 10^3$($> 10^3$) | 7.530(6.12) |

4.3. Simulation results for $p \ge p_0$. In this subsection, we study the properties of the PMLE when $p \ge p_0$, $p_0 = 1, 2$. We generated data from $N(0, 1)$ and $0.3N(0, 0.1^2) + 0.7N(2, 1)$, respectively, with $n = 100$, $n = 500$ and $n = 2{,}500$. In each case, we computed the MLE and the PMLE for $p = p_0, p_0 + 1, \ldots, p$ with penalty function P0. The number of replications was 500. The EM algorithm was employed to locate the (local) maxima of $pl_n(G)$. In the EM algorithm, we chose ten initial values; five were in the neighborhood of the true parameter $G_0$ and the other five were in the neighborhood of some estimates of $G_0$ without knowledge of $p_0$.

In many cases, the EM algorithm failed to converge when computing the ordinary MLE. A failure was recorded whenever one of the $\hat\sigma_j^2$, $j = 1, \ldots, p$, became very large (greater than $10^{32}$) or very small (less than …). In all cases, the non-degenerate local maximum with the largest likelihood value was taken as the final estimate. The numbers of failures (of …) are given in Table 4. When $n = 100$, $p_0 = 1$, $p = 2$, and $p = 5$, we found two cases where the EM degenerated with all 10 initial values. These were not included in our simulation results.

Table 4. Number of Degeneracies of the EM Algorithm When Computing the Ordinary MLE (rows: $n$; columns: $p$ within $p_0 = 1$ and $p_0 = 2$).

The distance defined in (3.4) is convenient for theoretical development, but not sensible for measuring the discrepancy between the estimated mixing distribution and the true mixing distribution. To improve the situation, we take a log-transformation on $\sigma^2$, and take

$$H(\hat G, G_0) = \int_{[-10,10] \times [-15,5]} |\hat G(\lambda) - G_0(\lambda)| \, d\lambda,$$

where $\lambda = (\theta, \log \sigma^2)$. This region of integration was chosen because all the PMLEs of $\theta$ and $\log \sigma^2$ were within it. The averages of $H(\hat G, G_0)$ are reported in Table 5. We first note that it is costly to estimate the mixing distribution with $p = 2$ when $p_0 = 1$. The efficiency of the MLE and the PMLE when $n = 2{,}500$ is not as good as that of the MLE with $p = 1$ and $n = 100$. Nevertheless, the mean of $H(\hat G, G_0)$ apparently decreases when $n$ increases in each case.
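The discrepancy measure $H$ described above can be approximated numerically. The sketch below is only an illustration under stated assumptions: it treats $\hat G(\lambda)$ and $G_0(\lambda)$ as bivariate distribution functions of the mixing distribution in $(\theta, \log \sigma^2)$, and it uses a midpoint Riemann sum over $[-10,10] \times [-15,5]$. The function names, the grid size, and the two perturbed "estimates" are hypothetical; only the true mixture $0.3N(0, 0.1^2) + 0.7N(2, 1)$ comes from the text.

```python
import math

def mixing_cdf(components, theta, log_var):
    """Bivariate CDF of a discrete mixing distribution.

    components: list of (weight, mean, variance) triples.
    Returns the total weight of components with mean <= theta
    and log(variance) <= log_var.
    """
    return sum(w for (w, m, v) in components
               if m <= theta and math.log(v) <= log_var)

def h_distance(est, true, theta_range=(-10.0, 10.0),
               logvar_range=(-15.0, 5.0), steps=200):
    """Midpoint-rule approximation of the integral of |G_est - G_true|."""
    t_lo, t_hi = theta_range
    s_lo, s_hi = logvar_range
    dt = (t_hi - t_lo) / steps
    ds = (s_hi - s_lo) / steps
    total = 0.0
    for i in range(steps):
        theta = t_lo + (i + 0.5) * dt
        for k in range(steps):
            s = s_lo + (k + 0.5) * ds
            total += abs(mixing_cdf(est, theta, s) - mixing_cdf(true, theta, s))
    return total * dt * ds

# True mixing distribution from the text: 0.3 N(0, 0.1^2) + 0.7 N(2, 1).
true_g = [(0.3, 0.0, 0.1 ** 2), (0.7, 2.0, 1.0)]
# Two hypothetical estimates: one close to the truth, one farther away.
close_g = [(0.3, 0.1, 0.1 ** 2), (0.7, 2.0, 1.0)]
far_g = [(0.3, 1.0, 0.1 ** 2), (0.7, 2.0, 1.0)]

h_same = h_distance(true_g, true_g)
h_close = h_distance(close_g, true_g)
h_far = h_distance(far_g, true_g)
print(h_same, h_close, h_far)
```

Because the integrand is a step function, a coarse grid is adequate for a sketch like this; a finer grid or exact rectangle bookkeeping would be preferable when the component parameters fall close to grid boundaries.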
At the same time, the rate of decrease is not substantial, which might be explained by the result in Chen (1995) that the optimal convergence rate of $\hat G$ is at most $n^{-1/4}$ when $p > p_0$.

Table 5. Average $H(\hat G, G_0)$ when $p \ge p_0$ (MLE and PMLE; rows: $n$; columns: $p$ within $p_0 = 1$ and $p_0 = 2$).

4.4. A data example. Liu, Umbach, Peddada, Li, Crockett and Weinberg (2004) analyzed microarray data of the levels of gene expression over time, presented in Bozdech, Llinas, Pulliam, Wong, Zhu and DeRisi (2003). By employing a random period model, Liu, Umbach, Peddada, Li, Crockett and Weinberg (2004) identified 2,400 cycling transcripts from 3,719 transcripts listed. There is a strong indication that the periods can be modeled by a normal mixture with $p = 2$. By applying a normal mixture model with equal variance, Liu and Chen (2005) found that there is significant evidence for $p = 2$ against $p = 1$, and that the best two-component equal-variance normal mixture model is given by $0.676N(38.2, \ldots) + 0.324N(53.2, \ldots)$. Figure 5.2 contains the histogram and the density function of the fitted model.

We can also answer the question of whether or not the equal-variance assumption can be justified by testing the hypothesis $H_0: \sigma_1 = \sigma_2$ against $H_1: \sigma_1 \ne \sigma_2$. We computed the PMLE with penalty P0 as given in Table 6. It is straightforward that, under the null hypothesis, the penalized likelihood ratio test statistic

$$R = 2\Big\{\sup_{H_1} pl_n(G) - \sup_{H_0} pl_n(G)\Big\}$$

converges in distribution to $\chi^2(1)$. Here $R = 2(\ldots) = 1.0$ and $P(\chi^2(1) > 1) \approx 0.317$. Therefore, we have no evidence against the equal-variance assumption.

Table 6. Parameter Estimates for the Data Example (columns: method, $\theta_1$, $\theta_2$, $\sigma_1^2$, $\sigma_2^2$, $\pi$, $pl_n(\hat G)$; rows: MLE, P, $\sigma_1^2 = \sigma_2^2$).

5. Concluding Remarks. In this paper, we provide a rigorous proof of the consistency of the penalized MLE, both when the putative number of mixture components $p = p_0$ and when
