Some Properties of the Gaussian Distribution
Jianxin Wu
GVU Center and College of Computing, Georgia Institute of Technology
April 2004

Contents

1 Introduction
2 Definition
  2.1 Univariate Gaussian
  2.2 Multivariate Gaussian
3 Notation and Parameterization
4 Linear Operation and Summation
  4.1 Univariate Case
  4.2 Multivariate Case
5 Geometry and Mahalanobis Distance
6 Conditioning
7 Product of Gaussians
8 Application I: Parameter Estimation
  8.1 Maximum Likelihood Estimation
  8.2 Bayesian Parameter Estimation
9 Application II: Kalman Filter
  9.1 The Model
  9.2 The Estimation
A Gaussian Integral
B Characteristic Functions
C Schur Complement and the Matrix Inversion Lemma
D Vector and Matrix Derivatives
1 Introduction

The Gaussian distribution is the most widely used probability distribution in statistical pattern recognition and machine learning, and its many convenient properties are probably the main reason for this popularity. In this short paper I try to organize the basic facts about the Gaussian distribution. (I originally planned to write a short note listing some properties of the Gaussian; somehow I decided to keep the paper self-contained, and the result is that it became rather fat.) There is no advanced theory in this paper. However, understanding these facts requires some linear algebra and multivariate analysis, which are not always covered sufficiently in undergraduate texts. This paper pools these facts together, in the hope that it will be useful for new researchers entering this area.

2 Definition

2.1 Univariate Gaussian

The probability density function of a univariate Gaussian distribution has the form

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), $$

in which $\mu$ is the expected value of $x$ and $\sigma^2$ is the variance. We assume that $\sigma > 0$.

We first have to verify that this is a valid density. It is obvious that $p(x) \geq 0$ always holds for $x \in \mathbb{R}$. From the Gaussian integral derived in Appendix A we know that $\int_{-\infty}^{\infty} \exp(-x^2/t)\,dx = \sqrt{\pi t}$ for $t > 0$. Applying this equation with $t = 2\sigma^2$ (after shifting $x$ by $\mu$), we have

$$ \int_{-\infty}^{\infty} p(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx = \frac{\sqrt{2\pi\sigma^2}}{\sqrt{2\pi}\,\sigma} = 1, $$

which means that $p(x)$ is a valid density. The density

$$ p(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) $$

is called the standard normal density. In Appendix A it is shown that the mean value and standard deviation of the standard normal distribution are 0 and 1, respectively. By a change of variables, it is then easy to show that $\int x\,p(x)\,dx = \mu$ and $\int (x-\mu)^2\,p(x)\,dx = \sigma^2$ for the general density above.
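As a quick numerical illustration (my addition, not part of the original derivation), the following Python/NumPy sketch checks the normalization, mean, and variance of the univariate density by a simple Riemann sum; the values of `mu` and `sigma` are arbitrary examples.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, exactly as defined above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

mu, sigma = 1.5, 2.0                          # arbitrary example parameters
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = gaussian_pdf(x, mu, sigma)

print(np.sum(p) * dx)                         # ~1.0, so p is a valid density
print(np.sum(x * p) * dx)                     # ~1.5, the mean mu
print(np.sum((x - mu) ** 2 * p) * dx)         # ~4.0, the variance sigma^2
```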
2.2 Multivariate Gaussian

The probability density function of a multivariate Gaussian distribution has the form

$$ p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right), $$

in which $x$ is a $d$-dimensional vector, $\mu$ is the $d$-dimensional mean vector, and $\Sigma$ is the $d \times d$ covariance matrix. We assume that $\Sigma$ is a symmetric, positive definite matrix.

We first have to verify that this is a valid probability density function. It is obvious that $p(x) \geq 0$ always holds for $x \in \mathbb{R}^d$. Next we diagonalize $\Sigma$ as $\Sigma = U^T \Lambda U$, in which $U$ is an orthogonal matrix containing the eigenvectors of $\Sigma$, $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ is a diagonal matrix containing the eigenvalues of $\Sigma$ in its diagonal entries, and $|\Sigma| = |\Lambda|$. Let us define a new random vector

$$ y = \Lambda^{-1/2} U (x - \mu). $$

If we treat this as a change of variables, the absolute value of the determinant of its Jacobian matrix is $|\Lambda|^{-1/2} = |\Sigma|^{-1/2}$, so that $dy = |\Sigma|^{-1/2}\,dx$. Now we are ready to calculate the integral

$$ \int p(x)\,dx = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) dx = \frac{1}{(2\pi)^{d/2}} \int \exp\left( -\frac{1}{2} y^T y \right) dy = \prod_{i=1}^{d} \int \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y_i^2}{2} \right) dy_i = 1, $$

in which $y_i$ is the $i$-th component of $y$, i.e. $y = (y_1, \dots, y_d)^T$. This establishes the validity of the multivariate Gaussian density.

Since $y$ is a random vector, it has a density, denoted $p_y(y)$. Using the inverse transform method with $x = \mu + U^T \Lambda^{1/2} y$, we get

$$ p_y(y) = p\left( \mu + U^T \Lambda^{1/2} y \right) \left| U^T \Lambda^{1/2} \right| = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} y^T y \right) |\Sigma|^{1/2} = \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} y^T y \right). $$

The density

$$ p_y(y) = \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} y^T y \right) $$
is called a spherical Gaussian distribution. Let $z$ be a random vector formed by any subset of the components of $y$. By marginalization it is clear that $z$ is also spherical Gaussian, $p(z) = (2\pi)^{-\dim(z)/2} \exp(-\frac{1}{2} z^T z)$, and specifically $p(y_i) = \frac{1}{\sqrt{2\pi}} \exp(-y_i^2/2)$. Using this fact, it is straightforward to show that the mean vector and covariance matrix of a spherical Gaussian are $0$ and $I$, respectively.

Using the inverse transform $x = \mu + U^T \Lambda^{1/2} y$, we can easily calculate the mean vector and covariance matrix of the density $p(x)$:

$$ E[x] = E\left[ \mu + U^T \Lambda^{1/2} y \right] = \mu + U^T \Lambda^{1/2} E[y] = \mu, $$

$$ E\left[ (x-\mu)(x-\mu)^T \right] = E\left[ \left( U^T \Lambda^{1/2} y \right)\left( U^T \Lambda^{1/2} y \right)^T \right] = U^T \Lambda^{1/2} E\left[ y y^T \right] \Lambda^{1/2} U = U^T \Lambda U = \Sigma. $$

3 Notation and Parameterization

When we have a density of the multivariate Gaussian form, it is often written as

$$ x \sim N(\mu, \Sigma) \qquad \text{or} \qquad p(x) = N(x; \mu, \Sigma). $$

In most cases we will use the mean vector $\mu$ and covariance matrix $\Sigma$ to express a Gaussian density. This is called the moment parameterization. There is another parameterization of a Gaussian density, called the canonical parameterization, in which the density is expressed as

$$ p(x) = \exp\left( \alpha + \eta^T x - \frac{1}{2} x^T \Lambda x \right), $$

in which $\alpha = -\frac{1}{2}\left( d \log 2\pi - \log|\Lambda| + \eta^T \Lambda^{-1} \eta \right)$ is a normalization constant. The parameters of the two representations are related by

$$ \Lambda = \Sigma^{-1}, \qquad \eta = \Sigma^{-1} \mu, \qquad \Sigma = \Lambda^{-1}, \qquad \mu = \Lambda^{-1} \eta. $$

Notice that there is a clash in our notation: $\Lambda$ has different meanings in the canonical parameterization and in Section 2.2. Here $\Lambda$ is a parameter of the canonical parameterization (the inverse covariance, or precision, matrix), which is not necessarily diagonal; in Section 2.2, $\Lambda$ is the diagonal matrix formed by the eigenvalues of $\Sigma$. It is straightforward to show that the moment parameterization and the canonical parameterization of the Gaussian distribution are equivalent. In some cases the canonical parameterization is more convenient to use than the moment parameterization.
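To make the two parameterizations concrete, here is a small conversion sketch in Python/NumPy (my addition; the function names and example values are mine, not from the text).

```python
import numpy as np

def moment_to_canonical(mu, Sigma):
    """Return (alpha, eta, Lam) such that p(x) = exp(alpha + eta^T x - 0.5 x^T Lam x)."""
    Lam = np.linalg.inv(Sigma)                      # precision matrix
    eta = Lam @ mu
    d = mu.shape[0]
    _, logdet_Lam = np.linalg.slogdet(Lam)
    alpha = -0.5 * (d * np.log(2 * np.pi) - logdet_Lam + eta @ mu)  # note Lam^{-1} eta = mu
    return alpha, eta, Lam

def canonical_to_moment(eta, Lam):
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
alpha, eta, Lam = moment_to_canonical(mu, Sigma)
mu2, Sigma2 = canonical_to_moment(eta, Lam)
print(np.allclose(mu, mu2), np.allclose(Sigma, Sigma2))   # True True
```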
4 Linear Operation and Summation

4.1 Univariate Case

Suppose $x_1 \sim N(\mu_1, \sigma_1^2)$ and $x_2 \sim N(\mu_2, \sigma_2^2)$ are two independent univariate Gaussian random variables. It is obvious that $a x_1 + b \sim N(a\mu_1 + b,\, a^2\sigma_1^2)$, in which $a$ and $b$ are two scalars.

Now consider the random variable $z = x_1 + x_2$. Its density is the convolution

$$ p_z(z) = \int p_1(x_1)\, p_2(z - x_1)\, dx_1 = \int \frac{1}{2\pi\sigma_1\sigma_2} \exp\left( -\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(z - x_1 - \mu_2)^2}{2\sigma_2^2} \right) dx_1. $$

Completing the square in $x_1$ inside the exponent and applying the Gaussian integral of Appendix A to the remaining integral gives

$$ p_z(z) = \frac{1}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}} \exp\left( -\frac{(z-\mu_1-\mu_2)^2}{2(\sigma_1^2+\sigma_2^2)} \right). $$

The sum of two independent univariate Gaussian random variables is again a Gaussian random variable, with the mean values and variances summed up respectively, i.e. $z \sim N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2)$. The summation rule is easily generalized to $n$ Gaussians.

4.2 Multivariate Case

Suppose $x \sim N(\mu, \Sigma)$ is a $d$-dimensional Gaussian random vector, $A$ is a $q \times d$ matrix, and $b$ is a $q$-dimensional vector. Then $z = Ax + b$ is a $q$-dimensional Gaussian random vector,

$$ z \sim N(A\mu + b,\; A \Sigma A^T). $$

This fact is proved using the characteristic function tool (see Appendix B).
The characteristic function of $z = Ax + b$ is

$$ \phi_z(t) = E\left[ e^{i t^T z} \right] = E\left[ e^{i t^T (Ax + b)} \right] = e^{i t^T b}\, E\left[ e^{i (A^T t)^T x} \right] = e^{i t^T b} \exp\left( i (A^T t)^T \mu - \frac{1}{2} (A^T t)^T \Sigma (A^T t) \right) = \exp\left( i t^T (A\mu + b) - \frac{1}{2} t^T A \Sigma A^T t \right), $$

in which the second-to-last step used the characteristic function of a Gaussian derived in Appendix B. Appendix B also states that if a characteristic function is of the form $\exp(i t^T \mu - \frac{1}{2} t^T \Sigma t)$, then the underlying density $p(x)$ is a Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$. Applying this fact, we immediately get

$$ z = Ax + b \sim N(A\mu + b,\; A \Sigma A^T). $$

Suppose $x \sim N(\mu_1, \Sigma_1)$ and $y \sim N(\mu_2, \Sigma_2)$ are two independent $d$-dimensional Gaussian random vectors, and define a new random vector $z = x + y$. We could calculate the density $p(z)$ by the same convolution method used in the univariate case; however, the calculation is complex and we would have to apply the matrix inversion lemma of Appendix C. The characteristic function simplifies the calculation. Using the convolution property from Appendix B,

$$ \phi_z(t) = \phi_x(t)\, \phi_y(t) = \exp\left( i t^T \mu_1 - \frac{1}{2} t^T \Sigma_1 t \right) \exp\left( i t^T \mu_2 - \frac{1}{2} t^T \Sigma_2 t \right) = \exp\left( i t^T (\mu_1 + \mu_2) - \frac{1}{2} t^T (\Sigma_1 + \Sigma_2) t \right), $$

which immediately gives $z \sim N(\mu_1 + \mu_2,\, \Sigma_1 + \Sigma_2)$. The sum of two multivariate Gaussian random vectors is as easy to compute as in the univariate case: sum up the mean vectors and covariance matrices. The rule is the same for summing several multivariate Gaussians.

Now that we have the linear-transformation tool, let us revisit the transformation of Section 2.2,

$$ x \sim N(\mu, \Sigma), \qquad y = \Lambda^{-1/2} U (x - \mu). $$

Using the properties of linear transformations of a Gaussian density, $y$ is indeed Gaussian (in Section 2.2 we painfully calculated $p_y(y)$ using the inverse transform method), with mean vector $\Lambda^{-1/2} U (\mu - \mu) = 0$ and covariance matrix $\Lambda^{-1/2} U \Sigma U^T \Lambda^{-1/2} = I$. This transformation is called the whitening transformation, since the transformed density has an identity covariance matrix.
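The following Monte Carlo sketch (my addition; it assumes NumPy, and all example matrices are arbitrary illustrative choices) checks both results of this section: the linear-transformation rule and the whitening transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -1.0, 0.5])
B = rng.normal(size=(3, 3))
Sigma = B @ B.T + 3 * np.eye(3)            # an arbitrary symmetric positive definite covariance
X = rng.multivariate_normal(mu, Sigma, size=200_000)

# Linear operation: z = A x + b should be N(A mu + b, A Sigma A^T).
A = np.array([[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]])
b = np.array([0.5, -2.0])
Z = X @ A.T + b
print(np.allclose(Z.mean(axis=0), A @ mu + b, atol=0.1))                  # True (up to sampling error)
print(np.allclose(np.cov(Z.T), A @ Sigma @ A.T, rtol=0.05, atol=0.2))     # True (up to sampling error)

# Whitening: y = Lambda^{-1/2} U (x - mu) has zero mean and identity covariance.
eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = U^T Lambda U with U = eigvecs.T
W = np.diag(eigvals ** -0.5) @ eigvecs.T
Y = (X - mu) @ W.T
print(np.allclose(Y.mean(axis=0), 0.0, atol=0.05))                        # True (up to sampling error)
print(np.allclose(np.cov(Y.T), np.eye(3), atol=0.05))                     # True (up to sampling error)
```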
5 Geometry and Mahalanobis Distance

[Figure 1: a bivariate Gaussian density function.]

Figure 1 shows a bivariate Gaussian density function. A Gaussian density has only one mode, located at the mean vector, and the shape of the density is determined by the covariance matrix. Figure 2 shows the equal-probability contours of a bivariate Gaussian density. All points on a given equal-probability contour have the same value of the term

$$ r^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu) = c^2. $$

$r(x, \mu)$ is called the Mahalanobis distance from $x$ to $\mu$, given the covariance matrix $\Sigma$. The equation above defines a hyperellipsoid in $d$ dimensions, which means that the equal-probability contours of a Gaussian density are hyperellipsoids. The principal axes of such a hyperellipsoid are given by the eigenvectors of $\Sigma$, and the lengths of these axes are proportional to the square roots of the eigenvalues associated with these eigenvectors.

[Figure 2: equal-probability contours of a bivariate Gaussian density.]
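A small numerical illustration of this geometry (my addition, assuming NumPy): stepping a distance of one scaled eigenvector along either principal axis lands on the same Mahalanobis contour, and therefore on the same equal-probability contour.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance r^2(x, mu) = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.0], [1.0, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
x1 = mu + np.sqrt(eigvals[0]) * eigvecs[:, 0]   # sqrt(eigenvalue) along the first principal axis
x2 = mu + np.sqrt(eigvals[1]) * eigvecs[:, 1]   # sqrt(eigenvalue) along the second principal axis

# Both points lie on the contour r^2 = 1, hence have the same density value.
print(mahalanobis_sq(x1, mu, Sigma), mahalanobis_sq(x2, mu, Sigma))   # 1.0 1.0
```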
6 Conditioning

Suppose $x_1$ and $x_2$ are two jointly Gaussian random vectors with joint density

$$ p(x_1, x_2) = \frac{1}{(2\pi)^{(d_1+d_2)/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \Sigma^{-1} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right), $$

in which $d_1$ and $d_2$ are the dimensionalities of $x_1$ and $x_2$ respectively, and

$$ \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}. $$

The matrices $\Sigma_{12}$ and $\Sigma_{21}$ are cross-covariance matrices between $x_1$ and $x_2$, satisfying $\Sigma_{12} = \Sigma_{21}^T$.

The marginal distributions $x_1 \sim N(\mu_1, \Sigma_{11})$ and $x_2 \sim N(\mu_2, \Sigma_{22})$ are easy to get from the joint distribution. We are interested in computing the conditional probability density $p(x_1 \mid x_2)$. We will need the inverse of $\Sigma$, and this task is completed using the Schur complement (see Appendix C). For notational simplicity, denote the Schur complement of $\Sigma_{11}$ by $S_{11} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, and similarly the Schur complement of $\Sigma_{22}$ by $S_{22} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.

Applying the block-inversion formula of Appendix C and noticing that $\Sigma_{12} = \Sigma_{21}^T$, the quadratic form in the exponent splits (writing $x_1 - \mu_1$ as $x_1'$ and $x_2 - \mu_2$ as $x_2'$) as

$$ \begin{bmatrix} x_1' \\ x_2' \end{bmatrix}^T \Sigma^{-1} \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \left( x_1' - \Sigma_{12}\Sigma_{22}^{-1} x_2' \right)^T S_{22}^{-1} \left( x_1' - \Sigma_{12}\Sigma_{22}^{-1} x_2' \right) + x_2'^T \Sigma_{22}^{-1} x_2'. $$

Thus we can split the joint density as

$$ p(x_1, x_2) = \frac{1}{(2\pi)^{d_1/2} |S_{22}|^{1/2}} \exp\left( -\frac{1}{2} \left( x_1' - \Sigma_{12}\Sigma_{22}^{-1} x_2' \right)^T S_{22}^{-1} \left( x_1' - \Sigma_{12}\Sigma_{22}^{-1} x_2' \right) \right) \cdot \frac{1}{(2\pi)^{d_2/2} |\Sigma_{22}|^{1/2}} \exp\left( -\frac{1}{2} x_2'^T \Sigma_{22}^{-1} x_2' \right), $$

in which we used the fact that $|\Sigma| = |\Sigma_{22}|\,|S_{22}|$ (the determinant identity of Appendix C).

Since the second factor on the right-hand side is the marginal $p(x_2)$, and $p(x_1, x_2) = p(x_1 \mid x_2)\, p(x_2)$, we get the conditional probability

$$ p(x_1 \mid x_2) = \frac{1}{(2\pi)^{d_1/2} |S_{22}|^{1/2}} \exp\left( -\frac{1}{2} \left( x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) \right)^T S_{22}^{-1} \left( x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) \right) \right), $$

or

$$ x_1 \mid x_2 \;\sim\; N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). $$
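Here is a minimal check of the conditioning formula in Python/NumPy (my addition; the function `condition` and the example numbers are mine): the conditional density computed from the formula must agree with the ratio $p(x_1, x_2) / p(x_2)$.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def condition(mu, Sigma, d1, x2):
    """Parameters of p(x1 | x2) when (x1, x2) ~ N(mu, Sigma) and x1 has dimension d1."""
    mu1, mu2 = mu[:d1], mu[d1:]
    S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
    S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]
    mu_c = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    Sigma_c = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_c, Sigma_c

mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
d1, x2 = 2, np.array([0.5])
mu_c, Sigma_c = condition(mu, Sigma, d1, x2)

# Check: log p(x1, x2) - log p(x2) should equal log p(x1 | x2) at any x1.
x1 = np.array([0.7, -0.4])
lhs = gaussian_logpdf(np.concatenate([x1, x2]), mu, Sigma) \
      - gaussian_logpdf(x2, mu[d1:], Sigma[d1:, d1:])
rhs = gaussian_logpdf(x1, mu_c, Sigma_c)
print(np.isclose(lhs, rhs))   # True
```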
7 Product of Gaussians

Suppose $p_1(x) = N(x; \mu_1, \Sigma_1)$ and $p_2(x) = N(x; \mu_2, \Sigma_2)$ are two $d$-dimensional Gaussian densities. Sometimes we want to compute the density that is proportional to their product, i.e. $p(x) = \alpha\, p_1(x)\, p_2(x)$, in which $\alpha$ is the normalization constant that makes $p(x)$ a valid density function. For this task the canonical parameterization (see Section 3) is extremely helpful. Writing the two Gaussians in canonical form,

$$ p_1(x) = \exp\left( \alpha_1 + \eta_1^T x - \frac{1}{2} x^T \Lambda_1 x \right), \qquad p_2(x) = \exp\left( \alpha_2 + \eta_2^T x - \frac{1}{2} x^T \Lambda_2 x \right), $$

the product is easy to compute:

$$ p(x) = \alpha\, p_1(x)\, p_2(x) = \exp\left( \alpha' + (\eta_1 + \eta_2)^T x - \frac{1}{2} x^T (\Lambda_1 + \Lambda_2) x \right). $$

This equation states that in the canonical parameterization, in order to compute a product of Gaussians, we just sum the parameters. The result readily extends to the product of $n$ Gaussians: if the canonical parameters of $p_i(x)$ are $\eta_i$ and $\Lambda_i$, $i = 1, \dots, n$, then $p(x) \propto \prod_{i=1}^n p_i(x)$ is also a Gaussian density, given by

$$ p(x) = \exp\left( \alpha' + \Bigl( \sum_{i=1}^n \eta_i \Bigr)^T x - \frac{1}{2} x^T \Bigl( \sum_{i=1}^n \Lambda_i \Bigr) x \right). $$

Now let us go back to the moment parameterization. Suppose we have $n$ Gaussian densities $p_i(x) = N(x; \mu_i, \Sigma_i)$, $i = 1, \dots, n$. Then $p(x) \propto \prod_{i=1}^n p_i(x)$ is Gaussian, $p(x) = N(x; \mu, \Sigma)$, where

$$ \Sigma^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} + \dots + \Sigma_n^{-1}, \qquad \Sigma^{-1}\mu = \Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2 + \dots + \Sigma_n^{-1}\mu_n. $$
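A short sketch of this rule in Python/NumPy (my addition; the helper name and example parameters are mine): convert each factor to its precision form, sum the parameters, and convert back.

```python
import numpy as np

def product_of_gaussians(mus, Sigmas):
    """Mean and covariance of the density proportional to the product of N(mu_i, Sigma_i)."""
    precisions = [np.linalg.inv(S) for S in Sigmas]
    Lam = sum(precisions)                               # precisions add up
    eta = sum(P @ m for P, m in zip(precisions, mus))   # precision-weighted means add up
    Sigma = np.linalg.inv(Lam)
    mu = Sigma @ eta
    return mu, Sigma

mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([1.0, -1.0])]
Sigmas = [np.eye(2), 2 * np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
mu, Sigma = product_of_gaussians(mus, Sigmas)
print(mu)
print(Sigma)
```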
Now we have listed some properties of the Gaussian distribution. Next let us see how these properties are applied.

8 Application I: Parameter Estimation

8.1 Maximum Likelihood Estimation

Suppose we have a $d$-dimensional multivariate Gaussian density $N(\mu, \Sigma)$ and $n$ i.i.d. (independently, identically distributed) samples $D = \{x_1, x_2, \dots, x_n\}$ drawn from this distribution. The task is to estimate the parameters $\mu$ and $\Sigma$. The log-likelihood of observing the data set $D$ given parameters $\mu$ and $\Sigma$ is

$$ \ell(\mu, \Sigma \mid D) = \log \prod_{i=1}^n p(x_i) = -\frac{nd}{2}\log 2\pi + \frac{n}{2}\log\left|\Sigma^{-1}\right| - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu). $$

Taking derivatives of the log-likelihood with respect to $\mu$ and $\Sigma^{-1}$ gives (see Appendix D)

$$ \frac{\partial \ell}{\partial \mu} = \sum_{i=1}^n \Sigma^{-1} (x_i - \mu), \qquad \frac{\partial \ell}{\partial \Sigma^{-1}} = \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T, $$

in which the first derivative uses the rule $\partial(x^T y)/\partial x = y$ from Appendix D together with the chain rule, and the second uses the derivatives of the log-determinant and of the quadratic form from Appendix D, plus the fact that $\Sigma^T = \Sigma$. (The notation in the second equation is a little confusing: on its right-hand side, the symbol $\Sigma$ appears both as a summation sign and as the covariance matrix.)

In order to find the maximum likelihood solution, we want to find the maximum of the likelihood function. Setting both derivatives to $0$ gives us the solution:
$$ \mu_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \Sigma_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^T. $$

These equations clearly state that the maximum likelihood estimates of the mean vector and covariance matrix are just the sample mean and the sample covariance matrix, respectively.

8.2 Bayesian Parameter Estimation

In Bayesian estimation we assume that the covariance matrix is known. Suppose we have a $d$-dimensional multivariate Gaussian density $N(\mu, \Sigma)$ and $n$ i.i.d. samples $D = \{x_1, \dots, x_n\}$ drawn from this distribution. We also need a prior on the parameter $\mu$; let us assume the Gaussian prior $\mu \sim N(\mu_0, \Sigma_0)$. Note that $\mu_0$, $\Sigma_0$, and $\Sigma$ are all assumed known; the only parameter to be estimated is the mean vector $\mu$.

In Bayesian estimation, instead of finding a point estimate $\hat{\mu}$ in the parameter space that maximizes the likelihood, we calculate the posterior density $p(\mu \mid D)$ and use the entire distribution of $\mu$ as our estimate of this parameter. Applying Bayes' rule,

$$ p(\mu \mid D) = \alpha\, p(D \mid \mu)\, p(\mu) = \alpha \prod_{i=1}^n p(x_i \mid \mu)\, p_0(\mu), $$

in which $\alpha$ is a normalization constant that does not depend on $\mu$. Applying the result of Section 7, we know that $p(\mu \mid D)$ is also Gaussian,

$$ p(\mu \mid D) = N(\mu; \mu_n, \Sigma_n), \qquad \Sigma_n^{-1} = n\Sigma^{-1} + \Sigma_0^{-1}, \qquad \mu_n = \Sigma_n \left( \Sigma^{-1} \sum_{i=1}^n x_i + \Sigma_0^{-1} \mu_0 \right). $$

Both $\mu_n$ and $\Sigma_n$ can be calculated from known quantities, so we have determined the posterior distribution $p(\mu \mid D)$ of $\mu$ given the data set $D$.

We chose a Gaussian family for the prior. Usually the prior distribution is chosen such that the posterior belongs to the same functional form as the prior; a prior and posterior chosen in this way are said to be conjugate. We have just seen that the Gaussian has the nice property that both the prior and the posterior are Gaussian, i.e. the Gaussian is auto-conjugate.

After $p(\mu \mid D)$ is determined, a new sample $x$ is classified by calculating the probability

$$ p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu. $$

This integral has the same form as the convolution used for summing two independent Gaussians in Section 4.1. Thus we can guess that $p(x \mid D)$ is a Gaussian again,

$$ p(x \mid D) = N(x; \mu_n, \Sigma + \Sigma_n), $$

and the guess is correct; it is easy to verify by repeating the steps of the convolution calculation in Section 4.1.
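The following sketch (my addition; it assumes NumPy, and the true parameters and prior are arbitrary example values) computes both the maximum likelihood estimates and the Bayesian posterior for the mean on simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # i.i.d. samples
n = X.shape[0]

# Maximum likelihood estimates: sample mean and (biased) sample covariance.
mu_ml = X.mean(axis=0)
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / n
print(mu_ml)
print(Sigma_ml)

# Bayesian estimate of the mean (Sigma assumed known), with prior mu ~ N(mu0, Sigma0).
mu0, Sigma0 = np.zeros(2), 10.0 * np.eye(2)
Sigma_inv = np.linalg.inv(true_Sigma)
Sigma_n = np.linalg.inv(n * Sigma_inv + np.linalg.inv(Sigma0))
mu_n = Sigma_n @ (Sigma_inv @ X.sum(axis=0) + np.linalg.inv(Sigma0) @ mu0)
print(mu_n)       # posterior mean, close to mu_ml for large n
print(Sigma_n)    # posterior covariance, which shrinks roughly like Sigma / n
```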
9 Application II: Kalman Filter

9.1 The Model

The Kalman filter addresses the problem of estimating a state vector $x$ in a discrete-time process, given a linear dynamic model

$$ x_k = A x_{k-1} + w_{k-1} $$

and a linear measurement model

$$ z_k = H x_k + v_k. $$

The process noise $w_k$ and measurement noise $v_k$ are assumed to be Gaussian:

$$ w \sim N(0, Q), \qquad v \sim N(0, R). $$

At time $k$, assuming we know the distribution of $x_{k-1}$, the task is to estimate the posterior probability of $x_k$ given the current observation $z_k$ and the previous state estimate $p(x_{k-1})$. From a broader point of view, the task can be formulated as estimating the posterior probability of $x_k$ given all the previous state estimates and all the observations up to time step $k$. Under certain Markovian assumptions, it is not hard to prove that the two problem formulations are equivalent.

In the Kalman filter setup, we assume the prior is Gaussian, i.e. at time $t = 0$, $p(x_0) = N(x; \mu_0, P_0)$. Instead of $\Sigma$, here we use $P$ to represent a covariance matrix, in order to match the notation of the Kalman filter literature.

9.2 The Estimation

I will show that, with the help of the properties of Gaussians we have obtained, it is quite easy to derive the Kalman filter equations. The Kalman filter can be separated into two (related) steps. In the first step, based on the estimate $p(x_{k-1})$ and the dynamic model, we get an estimate $p(x_k^-)$; the minus sign means that this estimate is made before we take the measurement into account. In the second step, based on $p(x_k^-)$ and the measurement model, we get the final estimate $p(x_k)$.

First, let us estimate $p(x_k^-)$. Assume that at time $k-1$ the estimate we already have is Gaussian,

$$ p(x_{k-1}) = N(x; \mu_{k-1}, P_{k-1}). $$
This assumption coincides with the form of the prior $p(x_0)$. We will show that, under this assumption, $p(x_k)$ after the Kalman filter update is again Gaussian, which makes the assumption reasonable.

Applying the linear-operation result of Section 4.2 to the dynamic model, we immediately get the estimate for $x_k^-$:

$$ x_k^- \sim N(\mu_k^-, P_k^-), \qquad \mu_k^- = A \mu_{k-1}, \qquad P_k^- = A P_{k-1} A^T + Q. $$

The estimate $p(x_k^-)$ conditioned on the observation $z_k$ gives $p(x_k)$, the estimate we want; thus the conditioning result of Section 6 can be used. In order to use it, we first build the joint covariance matrix. Since $\operatorname{Cov}(z_k) = H P_k^- H^T + R$ (applying the linear-operation result to the measurement model) and

$$ \operatorname{Cov}(z_k, x_k^-) = \operatorname{Cov}(H x_k^- + v_k,\; x_k^-) = \operatorname{Cov}(H x_k^-,\; x_k^-) = H P_k^-, $$

the joint covariance matrix of $(x_k^-, z_k)$ is

$$ \begin{bmatrix} P_k^- & P_k^- H^T \\ H P_k^- & H P_k^- H^T + R \end{bmatrix}. $$

Applying the conditioning formula of Section 6, we get

$$ p(x_k) = p(x_k^- \mid z_k) = N(\mu_k, P_k), $$
$$ P_k = P_k^- - P_k^- H^T \left( H P_k^- H^T + R \right)^{-1} H P_k^-, $$
$$ \mu_k = \mu_k^- + P_k^- H^T \left( H P_k^- H^T + R \right)^{-1} \left( z_k - H \mu_k^- \right). $$

The prediction equations and the two update equations above are the Kalman filter updating rules. The term $P_k^- H^T (H P_k^- H^T + R)^{-1}$ appears in both update equations. Defining

$$ K_k = P_k^- H^T \left( H P_k^- H^T + R \right)^{-1}, $$

the equations simplify to

$$ P_k = (I - K_k H)\, P_k^-, \qquad \mu_k = \mu_k^- + K_k \left( z_k - H \mu_k^- \right). $$

The term $K_k$ is called the Kalman gain matrix, and the term $z_k - H \mu_k^-$ is called the innovation.
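Below is a minimal sketch of the predict/update recursion just derived (my addition; it assumes NumPy, and the constant-velocity model, the function name `kalman_step`, and all numeric values are my own illustrative choices, not from the text).

```python
import numpy as np

def kalman_step(mu, P, z, A, H, Q, R):
    """One Kalman filter step: predict with the dynamic model, then update with measurement z."""
    # Predict (time update).
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update (measurement update).
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    mu_new = mu_pred + K @ (z - H @ mu_pred)       # z - H mu_pred is the innovation
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return mu_new, P_new

# Toy example: constant-velocity model observed through noisy position measurements.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])              # state: [position, velocity]
H = np.array([[1.0, 0.0]])                         # we only measure position
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])

rng = np.random.default_rng(2)
mu, P = np.zeros(2), np.eye(2)                     # prior N(mu_0, P_0)
x_true = np.array([0.0, 1.0])
for _ in range(50):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    z = H @ x_true + rng.multivariate_normal(np.zeros(1), R)
    mu, P = kalman_step(mu, P, z, A, H, Q, R)
print(x_true, mu)   # the filtered estimate should track the true state
```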
A Gaussian Integral

We compute the integral of the univariate Gaussian density in this appendix. The trick is to consider two independent univariate Gaussians at the same time and to switch to polar coordinates:

$$ \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{ \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy } = \sqrt{ \iint e^{-(x^2+y^2)/2}\,dx\,dy } = \sqrt{ \int_0^{2\pi}\!\!\int_0^{\infty} r\, e^{-r^2/2}\,dr\,d\theta } = \sqrt{ 2\pi \left[ -e^{-r^2/2} \right]_0^{\infty} } = \sqrt{2\pi}, $$

in which the extra $r$ that appears inside the integral is the determinant of the Jacobian matrix of the change to polar coordinates.

The above integral is easily extended to

$$ f(t) = \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{t} \right) dx = \sqrt{\pi t}, $$

in which we assume $t > 0$. Then we have

$$ \frac{df}{dt} = \int_{-\infty}^{\infty} \frac{x^2}{t^2} \exp\left( -\frac{x^2}{t} \right) dx \qquad \text{and} \qquad \frac{df}{dt} = \frac{d}{dt}\sqrt{\pi t} = \frac{1}{2}\sqrt{\frac{\pi}{t}}. $$

The above two equations give us

$$ \int_{-\infty}^{\infty} x^2 \exp\left( -\frac{x^2}{t} \right) dx = \frac{t}{2}\sqrt{\pi t}. $$

Applying this with $t = 2$, we have

$$ \int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \cdot \frac{2}{2}\sqrt{2\pi} = 1. $$

It is also obvious that

$$ \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = 0, $$

since the integrand is an odd function. These two results prove that the mean and standard deviation of the standard normal distribution are $0$ and $1$, respectively.
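For readers who want a quick numerical confirmation of the two integral identities above, here is a small quadrature sketch (my addition, assuming NumPy; the value of $t$ is arbitrary).

```python
import numpy as np

t = 3.0
x = np.linspace(-60.0, 60.0, 400001)
dx = x[1] - x[0]
# Both pairs below should agree to several decimal places.
print(np.sum(np.exp(-x ** 2 / t)) * dx, np.sqrt(np.pi * t))                     # ~3.070
print(np.sum(x ** 2 * np.exp(-x ** 2 / t)) * dx, (t / 2) * np.sqrt(np.pi * t))  # ~4.605
```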
B Characteristic Functions

The characteristic function of a probability density $p(x)$ is defined as its Fourier transform

$$ \phi(t) = E\left[ e^{i t^T x} \right], $$

in which $i = \sqrt{-1}$. Let us compute the characteristic function of a Gaussian density:

$$ \phi(t) = E\left[ e^{i t^T x} \right] = \int \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) + i t^T x \right) dx = \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right), $$

where the last step follows by completing the square in the exponent; the remaining integral is that of a Gaussian density and hence equals 1.

Since the characteristic function is defined as a Fourier transform, the inverse Fourier transform of $\phi(t)$ is exactly $p(x)$; i.e., a density is completely determined by its characteristic function. Therefore, when we see a characteristic function of the form $\exp(i t^T \mu - \frac{1}{2} t^T \Sigma t)$, we know that the underlying density is a Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$.

Suppose $x$ and $y$ are two independent random vectors with the same dimension, and define a new random vector $z = x + y$. Then

$$ p_z(z) = \iint_{x+y=z} p_x(x)\, p_y(y)\, dx\, dy = \int p_x(x)\, p_y(z - x)\, dx. $$

Since convolution in the original space becomes a product in the Fourier space, we have

$$ \phi_z(t) = \phi_x(t)\, \phi_y(t), $$

which means that the characteristic function of the sum of two independent random variables is just the product of the characteristic functions of the summands.
C Schur Complement and the Matrix Inversion Lemma

The Schur complement is very useful in computing the inverse of a block matrix. Suppose $M$ is a block matrix

$$ M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, $$

in which $A$ and $D$ are non-singular square matrices. We want to compute $M^{-1}$. Some algebraic manipulation gives the factorization

$$ \begin{bmatrix} I & 0 \\ -CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} I & -A^{-1}B \\ 0 & I \end{bmatrix} = \begin{bmatrix} A & 0 \\ 0 & D - CA^{-1}B \end{bmatrix}, $$

in which the term $S_A = D - CA^{-1}B$ is called the Schur complement of $A$. An equation of the form $XMY = W$ implies $M^{-1} = Y W^{-1} X$, hence

$$ M^{-1} = \begin{bmatrix} I & -A^{-1}B \\ 0 & I \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & S_A^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -CA^{-1} & I \end{bmatrix} = \begin{bmatrix} A^{-1} + A^{-1}B S_A^{-1} C A^{-1} & -A^{-1}B S_A^{-1} \\ -S_A^{-1} C A^{-1} & S_A^{-1} \end{bmatrix}. $$

Taking the determinant of both sides of the factorization gives

$$ |M| = |A|\,|S_A|. $$

We can also compute $M^{-1}$ using the Schur complement of $D$, $S_D = A - BD^{-1}C$, in the same way:

$$ M^{-1} = \begin{bmatrix} S_D^{-1} & -S_D^{-1} B D^{-1} \\ -D^{-1} C S_D^{-1} & D^{-1} + D^{-1} C S_D^{-1} B D^{-1} \end{bmatrix}, \qquad |M| = |D|\,|S_D|. $$
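A numerical check of the block-inverse and determinant formulas above (my addition; it assumes NumPy, and the test matrix is an arbitrary well-conditioned example).

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 3, 2
M = rng.normal(size=(n1 + n2, n1 + n2))
M = M @ M.T + (n1 + n2) * np.eye(n1 + n2)        # a well-conditioned test matrix

A, B = M[:n1, :n1], M[:n1, n1:]
C, D = M[n1:, :n1], M[n1:, n1:]

S_A = D - C @ np.linalg.solve(A, B)              # Schur complement of A
A_inv = np.linalg.inv(A)
S_A_inv = np.linalg.inv(S_A)

# Block inverse of M expressed with the Schur complement of A.
top = np.hstack([A_inv + A_inv @ B @ S_A_inv @ C @ A_inv, -A_inv @ B @ S_A_inv])
bottom = np.hstack([-S_A_inv @ C @ A_inv, S_A_inv])
M_inv_blocks = np.vstack([top, bottom])

print(np.allclose(M_inv_blocks, np.linalg.inv(M)))                          # True
print(np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S_A)))  # True
```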
The two block forms above are two different representations of the same matrix $M^{-1}$, which means the corresponding blocks in the two expressions must be equal, e.g. $S_D^{-1} = A^{-1} + A^{-1}B S_A^{-1} C A^{-1}$. This result is known as the matrix inversion lemma:

$$ \left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1} B \left( D - CA^{-1}B \right)^{-1} C A^{-1}. $$

The following result, which comes from equating the upper-right blocks, is also useful:

$$ A^{-1} B \left( D - CA^{-1}B \right)^{-1} = \left( A - BD^{-1}C \right)^{-1} B D^{-1}. $$

This formula and the matrix inversion lemma are useful in the derivation of the Kalman filter equations.

D Vector and Matrix Derivatives

Suppose $y$ is a scalar, $A$ is a matrix, and $x$ and $y$ are vectors. The partial derivative of a scalar $y$ with respect to $A$ is defined componentwise as

$$ \left( \frac{\partial y}{\partial A} \right)_{ij} = \frac{\partial y}{\partial a_{ij}}, $$

where $a_{ij}$ is the $(i,j)$-th component of the matrix $A$. From this definition it is easy to get the rule

$$ \frac{\partial}{\partial x}\left( x^T y \right) = \frac{\partial}{\partial x}\left( y^T x \right) = y. $$

For an $n \times n$ square matrix $A$, the determinant $M_{ij}$ of the matrix obtained by removing from $A$ the $i$-th row and $j$-th column is called a minor of $A$. The scalar $c_{ij} = (-1)^{i+j} M_{ij}$ is called a cofactor of $A$. The matrix $A_{\mathrm{cof}}$ with $c_{ij}$ in its $(i,j)$-th entry is called the cofactor matrix of $A$. Finally, the adjoint matrix of $A$ is defined as the transpose of the cofactor matrix,

$$ A_{\mathrm{adj}} = A_{\mathrm{cof}}^T. $$

There are some well-known facts about minors, determinants, and adjoints:

$$ |A| = \sum_j a_{ij} c_{ij} \;\;\text{(for any fixed row } i\text{)}, \qquad A^{-1} = \frac{1}{|A|}\, A_{\mathrm{adj}}. $$

Since $M_{ij}$ has the $i$-th row removed, it does not depend on $a_{ij}$, and neither does $c_{ij}$. Thus

$$ \frac{\partial |A|}{\partial a_{ij}} = c_{ij}, \qquad \text{i.e.} \qquad \frac{\partial |A|}{\partial A} = A_{\mathrm{cof}}, $$

which in turn shows that

$$ \frac{\partial |A|}{\partial A} = A_{\mathrm{adj}}^T = |A| \left( A^{-1} \right)^T. $$

Using the chain rule, we immediately get that for a positive definite matrix $A$,

$$ \frac{\partial}{\partial A} \log|A| = \left( A^{-1} \right)^T. $$

Applying the definition of the matrix derivative, it is also easy to show that for a square matrix $A$,

$$ \frac{\partial}{\partial A}\left( x^T A x \right) = x x^T, $$

since $x^T A x = \sum_i \sum_j a_{ij} x_i x_j$, where $x = (x_1, x_2, \dots, x_n)^T$.
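A finite-difference check of the last two derivative rules (my addition; it assumes NumPy, and the helper `num_grad` is mine): the numerical gradients should match $(A^{-1})^T$ and $x x^T$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
B = rng.normal(size=(n, n))
A = B @ B.T + np.eye(n)          # positive definite, so |A| > 0 and log|A| is well defined
x = rng.normal(size=n)
eps = 1e-6

def num_grad(f, A):
    """Finite-difference gradient of a scalar function f with respect to the matrix A."""
    G = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# d log|A| / dA = (A^{-1})^T
print(np.allclose(num_grad(lambda A: np.log(np.linalg.det(A)), A),
                  np.linalg.inv(A).T, atol=1e-5))       # True

# d (x^T A x) / dA = x x^T
print(np.allclose(num_grad(lambda A: x @ A @ x, A),
                  np.outer(x, x), atol=1e-5))           # True
```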
References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1996.
[2] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, second edition. Wiley-Interscience, New York, NY, USA, 2001.
[3] M. I. Jordan. An Introduction to Probabilistic Graphical Models, chapter 13, draft.
[4] G. Welch and G. Bishop. An Introduction to the Kalman Filter. TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill.