
CHAPTER 2: REVIEW OF LINEAR ALGEBRA AND STATISTICS

2.1 Introduction

In discrete inverse methods, matrices and linear transformations play fundamental roles. So do probability and statistics. This review chapter, then, is divided into two parts. In the first, we will begin by reviewing the basics of matrix manipulations. Then we will introduce some special types of matrices (Hermitian, orthogonal, and semiorthogonal). Finally, we will look at matrices as linear transformations that can operate on vectors of one dimension and return a vector of another dimension. In the second part, we will review some elementary probability and statistics, with emphasis on Gaussian statistics. The material in the first part will be particularly useful in later chapters when we cover eigenvalue problems and methods based on the length of vectors. The material in the second part will be very useful when we consider the nature of noise in the data and when we consider the maximum likelihood inverse.

2.2 Matrices and Linear Transformations

Recall from the first chapter that, by convention, vectors are denoted by lowercase letters in boldface (e.g., the data vector d), while matrices are denoted by uppercase letters in boldface (e.g., the matrix G) in these notes.

2.2.1 Review of Matrix Manipulations

Matrix Multiplication

If A is an N x M matrix (as in N rows by M columns) and B is an M x L matrix, we write the N x L product C of A and B as

    C = AB                                                   (2.1)

We note that matrix multiplication is associative, that is,

    (AB)C = A(BC)                                            (2.2)

but in general it is not commutative. That is, in general

    AB != BA                                                 (2.3)
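
As a quick illustration of Equation (2.3), with 2 x 2 matrices chosen purely for illustration, let

    A = | 1  2 |        B = | 1  0 |
        | 0  1 |            | 3  1 |

Then

    AB = | 7  2 |    while    BA = | 1  2 |
         | 3  1 |                  | 3  7 |

so the order of multiplication matters.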

In fact, even when AB exists, the product BA exists only if B has as many columns as A has rows; the two products are then generally of different sizes unless A and B are square. In Equation (2.1) above, the ijth entry in C is the product of the ith row of A and the jth column of B. Computationally, it is given by

    c_ij = SUM_{k=1}^{M} a_ik b_kj                           (2.4)

One way to form C using standard FORTRAN code would be

      DO 300 I = 1, N
        DO 300 J = 1, L
          C(I,J) = 0.0
          DO 300 K = 1, M
300         C(I,J) = C(I,J) + A(I,K)*B(K,J)                  (2.5)

A special case of the general rule above is the multiplication of a matrix G (N x M) and a vector m (M x 1):

       d    =    G      m                                    (1.13)
    (N x 1)   (N x M)(M x 1)

In terms of computation, the vector d is given by

    d_i = SUM_{j=1}^{M} G_ij m_j                             (2.6)

The Inverse of a Matrix

The mathematical inverse of the M x M matrix A, denoted A^-1, is defined such that

    AA^-1 = A^-1 A = I_M                                     (2.7)

where I_M is the M x M identity matrix given by

    I_M = | 1  0  ...  0 |
          | 0  1       : |
          | :      .   0 |        (M x M)                    (2.8)
          | 0  ...  0  1 |

A^-1 is the matrix which, when either pre- or postmultiplied by A, returns the identity matrix. Clearly, since only square matrices can both pre- and postmultiply each other, the mathematical inverse of a matrix only exists for square matrices. A useful theorem follows concerning the inverse of a product of matrices:

Theorem: If

       A    =    B      C      D                             (2.9)
    (N x N)   (N x N)(N x N)(N x N)

then A^-1, if it exists, is given by

    A^-1 = D^-1 C^-1 B^-1                                    (2.10)

Proof:

    A(A^-1) = BCD(D^-1 C^-1 B^-1)
            = BC(DD^-1)C^-1 B^-1
            = BC I C^-1 B^-1
            = B(CC^-1)B^-1
            = BB^-1
            = I                                              (2.11)

Similarly,

    (A^-1)A = D^-1 C^-1 B^-1 BCD = ... = I        (Q.E.D.)

The Transpose and Trace of a Matrix

The transpose of a matrix A is written as A^T and is given by

    (A^T)_ij = A_ji                                          (2.12)

That is, you interchange rows and columns.

The transpose of a product of matrices is the product of the transposes, in reverse order. That is,

    (AB)^T = B^T A^T                                         (2.13)
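
As a quick numerical check on Equation (2.13), using the same illustrative matrices A and B as in the example above,

    (AB)^T = | 7  3 |        B^T A^T = | 1  3 | | 1  0 | = | 7  3 |
             | 2  1 |                  | 0  1 | | 2  1 |   | 2  1 |

which agree, while the product taken in the original order gives

    A^T B^T = | 1  3 |
              | 2  7 |

so the reversal of order in Equation (2.13) is essential.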

Just about everything we do with real matrices A has an analog for complex matrices. In the complex case, wherever the transpose of a matrix occurs, it is replaced by the complex conjugate transpose of the matrix, denoted A^H. That is, if

    A_ij = a_ij + b_ij i                                     (2.14)

then

    (A^H)_ij = c_ij + d_ij i                                 (2.15)

where

    c_ij = a_ji                                              (2.16)

and

    d_ij = -b_ji                                             (2.17)

that is,

    (A^H)_ij = a_ji - b_ji i                                 (2.18)

Finally, the trace of A is given by

    trace(A) = SUM_{i=1}^{M} a_ii                            (2.19)

Hermitian Matrices

A matrix A is said to be Hermitian if it is equal to its complex conjugate transpose. That is, if

    A = A^H                                                  (2.20)

If A is a real matrix, this is equivalent to

    A = A^T                                                  (2.21)

This implies that A must be square. The reason that Hermitian matrices will be important is that they have only real eigenvalues. We will take advantage of this many times when we consider eigenvalue and shifted eigenvalue problems later.
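
As a small illustration (the entries are chosen only for illustration), the complex matrix

    A = |   2      1 - i |
        | 1 + i      3   |

equals its complex conjugate transpose, so it is Hermitian. Its eigenvalues follow from det(A - lambda I) = (2 - lambda)(3 - lambda) - (1 - i)(1 + i) = lambda^2 - 5 lambda + 4 = 0, giving lambda = 1 and lambda = 4, both real, as claimed above.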

2.2.2 Matrix Transformations

Linear Transformations

A matrix equation can be thought of as a linear transformation. Consider, for example, the original matrix equation:

    d = Gm                                                   (1.13)

where d is an N x 1 vector, m is an M x 1 vector, and G is an N x M matrix. The matrix G can be thought of as an operator that operates on an M-dimensional vector m and returns an N-dimensional vector d. Equation (1.13) represents an explicit, linear relationship between the data and model parameters. The operator G, in this case, is said to be linear because if m is doubled, for example, so is d. Mathematically, one says that G is a linear operator if the following is true: if d = Gm and f = Gr, then

    [d + f] = G[m + r]                                       (2.22)

Another way to look at matrix multiplication is that, in the by-now-familiar Equation (1.13),

    d = Gm                                                   (1.13)

the column vector d can be thought of as a weighted sum of the columns of G, with the weighting factors being the elements in m. That is,

    d = m_1 g_1 + m_2 g_2 + ... + m_M g_M                    (2.23)

where

    m = [m_1, m_2, ..., m_M]^T                               (2.24)

and

    g_i = [g_1i, g_2i, ..., g_Ni]^T                          (2.25)

is the ith column of G. Also, if GA = B, then the above can be used to infer that the first column of B is a weighted sum of the columns of G with the elements of the first column of A as weighting factors, and so on for the other columns of B. Each column of B is a weighted sum of the columns of G. A small numerical example of this column-weighted-sum view is given below.
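
As a quick illustration of Equation (2.23), with numbers chosen purely for illustration, let

    G = | 1  4 |        m = | 2 |
        | 2  5 |            | 3 |
        | 3  6 |

Then

    d = Gm = 2 | 1 | + 3 | 4 |  =  | 14 |
               | 2 |     | 5 |     | 19 |
               | 3 |     | 6 |     | 24 |

that is, d is the sum of the columns of G weighted by the elements of m.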

Next, consider

    d^T = [Gm]^T                                             (2.26)

or

      d^T   =   m^T     G^T                                  (2.27)
    (1 x N)   (1 x M)(M x N)

The row vector d^T is the weighted sum of the rows of G^T, with the weighting factors again being the elements in m. That is,

    d^T = m_1 g_1^T + m_2 g_2^T + ... + m_M g_M^T            (2.28)

Extending this to

    A^T G^T = B^T                                            (2.29)

we have that each row of B^T is a weighted sum of the rows of G^T, with the weighting factors being the elements of the appropriate row of A^T. In a long string of matrix multiplications such as

    ABC = D                                                  (2.30)

each column of D is a weighted sum of the columns of A, and each row of D is a weighted sum of the rows of C.

Orthogonal Transformations

An orthogonal transformation is one that leaves the length of a vector unchanged. We can only talk about the length of a vector being unchanged if the dimension of the vector is unchanged. Thus, only square matrices may represent an orthogonal transformation. Suppose L is an orthogonal transformation, where L is N x N and x, y are both N-dimensional vectors. Then, if

    Lx = y                                                   (2.31)

it follows that

    x^T x = y^T y                                            (2.32)

where Equation (2.32) represents the dot product of each vector with itself, which is equal to the length squared of the vector. If you have ever done coordinate transformations in the past, you have dealt with an orthogonal transformation. Orthogonal transformations rotate vectors but do not change their lengths.
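
As a concrete illustration (a standard example, with the vector chosen arbitrarily), the 2 x 2 rotation matrix

    L = | cos t   -sin t |
        | sin t    cos t |

satisfies L^T L = I_2 for any angle t. For t = 90 degrees, L maps x = [3, 4]^T into y = Lx = [-4, 3]^T, and indeed y^T y = 16 + 9 = 25 = x^T x, so the length of the vector is unchanged.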

Properties of orthogonal transformations. There are several properties of orthogonal transformations that we will wish to use.

First, if L is an N x N orthogonal transformation, then

    L^T L = I_N                                              (2.33)

This follows from

    y^T y = [Lx]^T [Lx] = x^T L^T L x                        (2.34)

but y^T y = x^T x by Equation (2.32). Thus x^T L^T L x = x^T x for every x, and

    L^T L = I_N        (Q.E.D.)                              (2.35)

Second, the relationship between L and its inverse is given by

    L^-1 = L^T                                               (2.36)

and

    L = [L^T]^-1                                             (2.37)

These two follow directly from Equation (2.35) above.

Third, the determinant of a matrix is unchanged if it is operated upon by an orthogonal transformation. Recall that the determinant of a 3 x 3 matrix A, for example, where A is given by

    A = | a_11  a_12  a_13 |
        | a_21  a_22  a_23 |                                 (2.38)
        | a_31  a_32  a_33 |

is given by

    det(A) = a_11 (a_22 a_33 - a_23 a_32) - a_12 (a_21 a_33 - a_23 a_31) + a_13 (a_21 a_32 - a_22 a_31)    (2.39)

Thus, if A is an M x M matrix, L is an orthogonal transformation, and

    A' = L A L^T                                             (2.40)

it follows that

    det(A) = det(A')                                         (2.41)

Fourth, the trace of a matrix is unchanged if it is operated upon by an orthogonal transformation, where trace(A) is defined as

    trace(A) = SUM_{i=1}^{M} a_ii                            (2.42)

That is, the sum of the diagonal elements of a matrix is unchanged by an orthogonal transformation. Thus,

    trace(A) = trace(A')                                     (2.43)

Semiorthogonal Transformations

Suppose that the linear operator L is not square, but N x M (N != M). Then L is said to be semiorthogonal if and only if

    L^T L = I_M,  but  L L^T != I_N,    N > M                (2.44)

or

    L L^T = I_N,  but  L^T L != I_M,    M > N                (2.45)

where I_N and I_M are the N x N and M x M identity matrices, respectively. A matrix cannot be both orthogonal and semiorthogonal. Orthogonal matrices must be square, and semiorthogonal matrices cannot be square. Furthermore, if L is a square N x N matrix and

    L^T L = I_N                                              (2.35)

then it is not possible to have

    L L^T != I_N                                             (2.46)

2.2.3 Matrices and Vector Spaces

The columns or rows of a matrix can be thought of as vectors. For example, if A is an N x M matrix, each column can be thought of as a vector in N-space because it has N entries. Conversely, each row of A can be thought of as a vector in M-space because it has M entries. We note that for the linear system of equations given by

    Gm = d                                                   (1.13)

where G is N x M, m is M x 1, and d is N x 1, the model parameter vector m lies in M-space (along with all the rows of G), while the data vector lies in N-space (along with all the columns of G). In general, we will think of the M x 1 vectors as lying in model space, while the N x 1 vectors lie in data space. A small example illustrating both the semiorthogonal case and this column/row point of view follows.
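
As a small illustration (the matrix is chosen only for illustration), consider the 3 x 2 matrix

    L = | 1  0 |
        | 0  1 |
        | 0  0 |

Here N = 3 > M = 2, and

    L^T L = I_2    but    L L^T = | 1  0  0 |  !=  I_3
                                  | 0  1  0 |
                                  | 0  0  0 |

so L is semiorthogonal, consistent with Equation (2.44). Its two columns are orthogonal unit vectors in 3-space (data space), while its three rows are vectors in 2-space (model space).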

Spanning a Space

The notion of spanning a space is important for any discussion of the uniqueness of solutions or of the ability to fit the data. We first need to introduce definitions of linear independence and vector orthogonality.

A set of M vectors v_i, i = 1, ..., M, in M-space (the set of all M-dimensional vectors) is said to be linearly independent if and only if

    a_1 v_1 + a_2 v_2 + ... + a_M v_M = 0                    (2.47)

where the a_i are constants, has only the trivial solution a_i = 0, i = 1, ..., M. This is equivalent to saying that an arbitrary vector s in M-space can be written as a linear combination of the v_i, i = 1, ..., M. That is, one can find a_i such that, for an arbitrary vector s,

    s = a_1 v_1 + a_2 v_2 + ... + a_M v_M                    (2.48)

Two vectors r and s in M-space are said to be orthogonal to each other if their dot, or inner, product with each other is zero. That is, if

    r . s = |r| |s| cos theta = 0                            (2.49)

where theta is the angle between the vectors, and |r|, |s| are the lengths of r and s, respectively. The dot product of two vectors is also given by

    r^T s = s^T r = SUM_{i=1}^{M} r_i s_i                    (2.50)

M-space is spanned by any set of M linearly independent M-dimensional vectors.

Rank of a Matrix

The number of linearly independent rows in a matrix, which is also equal to the number of linearly independent columns, is called the rank of the matrix. The rank of matrices is defined for both square and nonsquare matrices. The rank of a matrix cannot exceed the minimum of the number of rows or columns in the matrix (i.e., the rank is less than or equal to the minimum of N and M).

If an M x M matrix is an orthogonal matrix, then it has rank M. The M rows are all linearly independent, as are the M columns. In fact, not only are the rows independent for an orthogonal matrix, they are orthogonal to each other. The same is true for the columns. If a matrix is semiorthogonal, then the M columns (or N rows, if N < M) are orthogonal to each other.

We will make extensive use of matrices and linear algebra in this course, especially when we work with the generalized inverse. Next, we need to turn our attention to probability and statistics.

2.3 Probability and Statistics

2.3.1 Introduction

We need some background in probability and statistics before proceeding very far. In this review section, I will cover the material from Menke's book, using some material from other math texts to help clarify things. Basically, what we need is a way of describing the noise in data and estimated model parameters. We will need the following terms: random variable, probability distribution, mean or expected value, maximum likelihood, variance, standard deviation, standardized normal variables, covariance, correlation coefficients, Gaussian distributions, and confidence intervals.

2.3.2 Definitions, Part 1

Random Variable: A function that assigns a value to the outcome of an experiment. A random variable has well-defined properties based on some distribution. It is called random because you cannot know beforehand the exact value for the outcome of the experiment. One cannot measure directly the true properties of a random variable. One can only make measurements, also called realizations, of a random variable, and estimate its properties. The birth weight of baby goslings is a random variable, for example.

Probability Density Function: The true properties of a random variable b are specified by the probability density function P(b). The probability that a particular realization of b will fall between b and b + db is given by P(b) db. (Note that Menke uses d where I use b. His notation is bad when one needs to use integrals.) P(b) satisfies

    INT_{-inf}^{+inf} P(b) db = 1                            (2.51)

which says that the probability of b taking on some value is 1. P(b) completely describes the random variable b. It is often useful to try and find a way to summarize the properties of P(b) with a few numbers, however.

Mean or Expected Value: The mean value E(b) (also denoted <b>) is much like the mean of a set of numbers; that is, it is the balancing point of the distribution P(b) and is given by

    E(b) = INT_{-inf}^{+inf} b P(b) db                       (2.52)

Maximum Likelihood: This is the point in the probability distribution P(b) that has the highest likelihood or probability. It may or may not be close to the mean E(b) = <b>. An important point is that for Gaussian distributions, the maximum likelihood point and the mean E(b) = <b> are the same!

The graph below (after Figure 2.3, p. 23, Menke) illustrates a case where the two are different.

[Figure: a skewed probability density P(b), with the maximum likelihood point b_ML and the mean <b> marked on the b axis.]

The maximum likelihood point b_ML of the probability distribution P(b) for data b gives the most probable value of the data. In general, this value can be different from the mean datum <b>, which is at the balancing point of the distribution.

Variance: Variance is one measure of the spread, or width, of P(b) about the mean E(b). It is given by

    sigma^2 = INT_{-inf}^{+inf} (b - <b>)^2 P(b) db          (2.53)

Computationally, for L experiments in which the kth experiment gives b_k, the variance is given by

    sigma^2 = (1 / (L - 1)) SUM_{k=1}^{L} (b_k - <b>)^2      (2.54)

Standard Deviation: Standard deviation is the positive square root of the variance, given by

    sigma = + sqrt(sigma^2)                                  (2.55)

Covariance: Covariance is a measure of the correlation between errors. If the errors in two observations are uncorrelated, then the covariance is zero. We need another definition before proceeding.

Joint Density Function P(b): The joint density function P(b) is defined so that P(b) db_1 db_2 ... db_N gives the probability that b_1 is between b_1 and b_1 + db_1, that b_2 is between b_2 and b_2 + db_2, etc. If the data are independent, then

    P(b) = P(b_1) P(b_2) ... P(b_N)                          (2.56)

If the data are correlated, then P(b) will have some more complicated form. Then, the covariance between b_1 and b_2 is defined as

    cov(b_1, b_2) = INT ... INT (b_1 - <b_1>)(b_2 - <b_2>) P(b) db_1 db_2 ... db_N                   (2.57)

where each integral runs from -inf to +inf. In the event that the data are independent, this reduces to

    cov(b_1, b_2) = INT INT (b_1 - <b_1>)(b_2 - <b_2>) P(b_1) P(b_2) db_1 db_2 = 0                   (2.58)

The reason is that for any value of (b_1 - <b_1>), (b_2 - <b_2>) is as likely to be positive as negative, i.e., the sum will average to zero.

The matrix [cov b] contains all of the covariances defined using Equation (2.57) in an N x N matrix. Note also that the covariance of b_i with itself is just the variance of b_i. In practical terms, if one has an N-dimensional data vector b that has been measured L times, then the ijth term in [cov b], denoted [cov b]_ij, is defined as

    [cov b]_ij = (1 / (L - 1)) SUM_{k=1}^{L} (b_i^k - <b_i>)(b_j^k - <b_j>)                          (2.59)

where b_i^k is the value of the ith datum in b on the kth measurement of the data vector, <b_i> is the mean or average value of b_i over all L measurements, and the L - 1 term results from sampling theory.

Correlation Coefficients: This is a normalized measure of the degree of correlation of errors. It takes on values between -1 and +1, with a value of 0 implying no correlation. The correlation coefficient matrix [cor b] is defined as

    [cor b]_ij = [cov b]_ij / (sigma_i sigma_j)              (2.60)

where [cov b]_ij is the covariance matrix defined term by term as above for cov(b_1, b_2), and sigma_i, sigma_j are the standard deviations for the ith and jth observations, respectively. The diagonal terms of [cor b] are equal to 1, since each observation is perfectly correlated with itself.
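
To make Equations (2.59) and (2.60) concrete, here is a short program in the same FORTRAN style as the multiplication loop of Section 2.2.1. It is only a sketch: the L = 4 realizations of an N = 2 dimensional data vector stored in B are made-up numbers, chosen purely for illustration.

      PROGRAM COVDEM
C     Sample covariance and correlation matrices, Equations (2.59)
C     and (2.60), for L = 4 realizations of an N = 2 dimensional
C     data vector b (illustrative numbers only).
      INTEGER N, L, I, J, K
      PARAMETER (N = 2, L = 4)
      REAL B(N,L), BBAR(N), COV(N,N), COR(N,N)
      DATA B /1.0, 2.1, 2.0, 3.9, 3.0, 6.2, 4.0, 7.8/
C     Mean of each datum over the L realizations
      DO 100 I = 1, N
        BBAR(I) = 0.0
        DO 100 K = 1, L
100       BBAR(I) = BBAR(I) + B(I,K)/REAL(L)
C     Covariance matrix, Equation (2.59)
      DO 200 I = 1, N
        DO 200 J = 1, N
          COV(I,J) = 0.0
          DO 200 K = 1, L
200         COV(I,J) = COV(I,J)
     &        + (B(I,K) - BBAR(I))*(B(J,K) - BBAR(J))/REAL(L - 1)
C     Correlation coefficient matrix, Equation (2.60)
      DO 300 I = 1, N
        DO 300 J = 1, N
300       COR(I,J) = COV(I,J)/SQRT(COV(I,I)*COV(J,J))
      WRITE(*,*) 'COV = ', ((COV(I,J), J = 1, N), I = 1, N)
      WRITE(*,*) 'COR = ', ((COR(I,J), J = 1, N), I = 1, N)
      END

For these particular numbers the two data rise and fall together, so the off-diagonal correlation coefficient comes out close to +1, while the diagonal terms of [cor b] are exactly 1, as they must be.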

The figure below (after Figure 2.8, page 26, Menke) shows three different cases of degree of correlation for two observations b_1 and b_2.

[Figure: three panels (a), (b), and (c), each showing contours of P(b_1, b_2) in the (b_1, b_2) plane.]

Contour plots of P(b_1, b_2) when the data are (a) uncorrelated, (b) positively correlated, (c) negatively correlated. The dashed lines indicate the four quadrants of alternating sign used to determine correlation.

2.3.3 Some Comments on Applications to Inverse Theory

Some comments are now in order about the nature of the estimated model parameters. We will always assume that the noise in the observations can be described as random variables. Whatever inverse we create will map errors in the data into errors in the estimated model parameters. Thus, the estimated model parameters are themselves random variables. This is true even though the true model parameters may not be random variables. If the distribution of noise for the data is known, then in principle the distribution for the estimated model parameters can be found by mapping through the inverse operator. This is often very difficult, but one particular case turns out to have a rather simple form. We will see where this form comes from when we get to the subject of generalized inverses. For now, consider the following as magic. If the transformation between data b and model parameters m is of the form

    m = Mb + v                                               (2.61)

where M is any arbitrary matrix and v is any arbitrary vector, then

    <m> = M<b> + v                                           (2.62)

and

    [cov m] = M [cov b] M^T                                  (2.63)
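
As a small illustration of Equation (2.63) (the matrix M and the data covariance are chosen purely for illustration), suppose there are two data and two model parameters, v = 0, the data are uncorrelated with equal variance sigma^2, so [cov b] = sigma^2 I_2, and

    M = | 1   1 |
        | 1  -1 |

so that m_1 = b_1 + b_2 and m_2 = b_1 - b_2. Then

    [cov m] = M (sigma^2 I_2) M^T = sigma^2 M M^T = sigma^2 | 2  0 | = 2 sigma^2 I_2
                                                            | 0  2 |

That is, each estimated parameter (the sum and the difference of the two data) has variance 2 sigma^2, and the two estimates remain uncorrelated.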

2.3.4 Definitions, Part 2

Gaussian Distribution: This is a particular probability distribution given by

    P(b) = (1 / (sqrt(2 pi) sigma)) exp[ -(b - <b>)^2 / (2 sigma^2) ]                                (2.64)

The figure below (after Figure 2.10, page 29, Menke) shows the familiar bell-shaped curve. It has the following properties: Mean = E(b) = <b> and Variance = sigma^2.

[Figure: Gaussian distributions with zero mean, sigma = 1 for curve A and sigma = 2 for curve B.]

Many distributions can be approximated fairly accurately (especially away from the tails) by the Gaussian distribution. It is also very important because it is the limiting distribution for the sum of random variables. This is often just what one assumes for noise in the data.

One also needs a way to represent the joint probability introduced earlier for a set of random variables, each of which has a Gaussian distribution. The joint probability density function for a vector b of observations that all have Gaussian distributions is chosen to be [see Equation (2.10) of Menke, page 30]

    P(b) = (2 pi)^(-N/2) (det [cov b])^(-1/2) exp{ -(1/2) [b - <b>]^T [cov b]^(-1) [b - <b>] }        (2.65)

which reduces to the previous case in Equation (2.64) for N = 1 and var(b_1) = sigma^2. In statistics books, Equation (2.65) is often given as

    P(b) = (2 pi)^(-N/2) |SIGMA_b|^(-1/2) exp{ -(1/2) [b - mu_b]^T SIGMA_b^(-1) [b - mu_b] }

With this background, it makes sense (statistically, at least) to replace the original relationship

    b = Gm                                                   (1.13)

with

    <b> = Gm                                                 (2.66)

The reason is that one cannot expect that there is an m that should exactly predict any particular realization of b when b is in fact a random variable. Then the joint probability is given by

    P(b) = (2 pi)^(-N/2) (det [cov b])^(-1/2) exp{ -(1/2) [b - Gm]^T [cov b]^(-1) [b - Gm] }          (2.67)

What one then does is seek an m that maximizes the probability that the predicted data are in fact close to the observed data. This is the basis of the maximum likelihood or probabilistic approach to inverse theory.

Standardized Normal Variables: It is possible to standardize random variables by subtracting their mean and dividing by the standard deviation. If the random variable had a Gaussian (i.e., normal) distribution, then so does the standardized random variable. Now, however, the standardized normal variables have zero mean and standard deviation equal to one. Random variables can be standardized by the following transformation:

    s = (m - <m>) / sigma                                    (2.68)

where you will often see z replacing s in statistics books. We will see, when all is said and done, that most inverses represent a transformation to standardized variables, followed by a simple inverse analysis, and then a transformation back for the final solution.

Chi-Squared (Goodness of Fit) Test: A statistical test to see whether a particular observed distribution is likely to have been drawn from a population having some known form.

The application we will make of the chi-squared test is to test whether the noise in a particular problem is likely to have a Gaussian distribution. This is not the kind of question one can answer with certainty, so one must talk in terms of probability or likelihood. For example, in the chi-squared test, one typically says things like "there is only a 5% chance that this sample distribution does not follow a Gaussian distribution."

As applied to testing whether a given distribution is likely to have come from a Gaussian population, the procedure is as follows: One sets up an arbitrary number of bins and compares the number of observations that fall into each bin with the number expected from a Gaussian distribution having the same mean and variance as the observed data. One quantifies the departure between the two distributions, called the chi-squared value and denoted chi^2, as

    chi^2 = SUM_{i=1}^{k} [(# obs in bin i) - (# expected in bin i)]^2 / [# expected in bin i]        (2.69)

where the sum is over the number of bins, k.

Next, the number of degrees of freedom for the problem must be considered. For this problem, the number of degrees of freedom is equal to the number of bins minus three. The reason you subtract three is as follows: You subtract 1 because if an observation does not fall into any subset of k - 1 bins, you know it falls in the one bin left over. You are not free to put it anywhere else. The other two come from the fact that you have assumed that the mean and standard deviation of the observed data set are the mean and standard deviation for the theoretical Gaussian distribution.

With this information in hand, one uses standard chi-squared test tables from statistics books and determines whether such a departure would occur randomly more often than, say, 5% of the time. Officially, the null hypothesis is that the sample was drawn from a Gaussian distribution. If the observed value for chi^2 is greater than chi^2_alpha, called the critical chi^2 value for the alpha significance level, then the null hypothesis is rejected at the alpha significance level. Commonly, alpha = 0.05 is used for this test, although alpha = 0.01 is also used. The alpha significance level is equivalent to the 100(1 - alpha)% confidence level (i.e., alpha = 0.05 corresponds to the 95% confidence level).

Consider the following example, where the underlying Gaussian distribution from which all data samples d are drawn has a mean of 7 and a variance of 10. Seven bins are set up with edges at -4, 2, 4, 6, 8, 10, 12, and 18, respectively. Bin widths are not prescribed for the chi-squared test, but ideally are chosen so there are about an equal number of occurrences expected in each bin. Also, one rule of thumb is to include only bins having at least five expected occurrences. I have not followed the "about equal number expected in each bin" suggestion because I want to be able to visually compare a histogram with an underlying Gaussian shape. However, I have chosen wider bins at the edges in these test cases to capture more occurrences at the edges of the distribution.

Suppose our experiment with 100 observations yields a sample mean of 6.76 and a sample variance of 8.27, and 3, 13, 26, 25, 16, 14, and 3 observations, respectively, in the bins from left to right. Using standard formulas for a Gaussian distribution with a mean of 6.76 and a variance of 8.27, the number expected in each bin is 4.90, 11.98, 22.73, 27.10, 20.31, 9.56, and 3.41, respectively. The calculated chi^2, using Equation (2.69), is 4.48. For seven bins, the number of degrees of freedom for the test is 4, and chi^2_alpha = 9.49 for alpha = 0.05. Thus, in this case, the null hypothesis would be accepted. That is, we would accept that this sample was drawn from a Gaussian distribution with a mean of 6.76 and a variance of 8.27 at the alpha = 0.05 significance level (95% confidence level). The distribution is shown below, with a filled circle in each histogram bar at the number expected in that bin. It is important to note that this distribution does not look exactly like a Gaussian distribution, but still passes the chi^2 test.
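
As a concrete check on Equation (2.69), the short program below, written in the same FORTRAN style as the multiplication loop of Section 2.2.1, reproduces the chi^2 value for the example just described. It is only a sketch; the observed and expected bin counts are simply those quoted in the text.

      PROGRAM CHITST
C     Chi-squared statistic of Equation (2.69) for the worked example
C     above: observed counts in the seven bins and the counts expected
C     for a Gaussian with mean 6.76 and variance 8.27.
      INTEGER NBIN, I
      PARAMETER (NBIN = 7)
      REAL OBS(NBIN), EXPCT(NBIN), CHI2
      DATA OBS   / 3.0, 13.0, 26.0, 25.0, 16.0, 14.0, 3.0/
      DATA EXPCT / 4.90, 11.98, 22.73, 27.10, 20.31, 9.56, 3.41/
      CHI2 = 0.0
      DO 100 I = 1, NBIN
100     CHI2 = CHI2 + (OBS(I) - EXPCT(I))**2/EXPCT(I)
C     CHI2 should come out near 4.48, below the critical value 9.49
C     for 4 degrees of freedom at the 0.05 significance level, so the
C     null hypothesis is accepted.
      WRITE(*,*) 'CHI-SQUARED = ', CHI2
      END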

A simple, non-chi-square analogy may help better understand the reasoning behind the chi-square test. Consider tossing a true coin 10 times. The most likely outcome is 5 heads and 5 tails. Would you reject a null hypothesis that the coin is a true coin if you got 6 heads and 4 tails in your one experiment of tossing the coin ten times? Intuitively, you probably would not reject the null hypothesis in this case, because 6 heads and 4 tails is not that unlikely for a true coin. In order to make an informed decision, as we try to do with the chi-square test, you would need to quantify how likely, or unlikely, a particular outcome is before accepting or rejecting the null hypothesis that it is a true coin.

For a true coin, 5 heads and 5 tails has a probability of 0.246 (that is, on average, it happens 24.6% of the time), while the probability of 6 heads and 4 tails is 0.205, 7 heads and 3 tails is 0.117, and 8 heads and 2 tails is 0.044, respectively. A distribution of 7 heads and 3 tails does not look like 5 heads and 5 tails, but occurs more than 10% of the time with a true coin. Hence, by analogy, it is not too unlikely and you would probably not reject the null hypothesis that the coin is a true coin just because you tossed 7 heads and 3 tails in one experiment. Ten heads and no tails only occurs, on average, one time in 1024 experiments (or about 0.098% of the time). If you got 10 heads and 0 tails, you'd probably reject the null hypothesis that you are tossing a true coin because the outcome is very unlikely. Eight heads and two tails occurs 4.4% of the time, on average. You might also reject the null hypothesis in this case, but you would do so with less confidence, or at a lower significance level.

In both cases, however, your conclusion will be wrong occasionally just due to random variations. You accept the possibility that you will be wrong in rejecting the null hypothesis 4.4% of the time in this case, even if the coin is true. The same is true with the chi-square test. That is, at the alpha = 0.05 significance level (95% confidence level), with chi^2 greater than chi^2_alpha, you reject the null hypothesis, even though you recognize that you will reject the null hypothesis incorrectly about 5% of the time in the presence of random variations.

Note that this analogy is a simple one in the sense that it is entirely possible to actually do a chi-square test on this coin-toss example. Each time you toss the coin ten times you get one outcome: x heads and (10 - x) tails. This falls into the "x heads and (10 - x) tails" bin. If you repeat this many times you get a distribution across all bins from 0 heads and 10 tails to 10 heads and 0 tails. Then you would calculate the number expected in each bin and use Equation (2.69) to calculate a chi-square value to compare with the critical value at the alpha significance level.

Now let us return to another example of the chi-square test, where we reject the null hypothesis. Consider a case where the observed number in each of the seven bins defined above is now 2, 17, 13, 24, 26, 9, and 9, respectively, and the observed distribution has a mean of 7.28 and variance of 10.28. The expected number in each bin, for the observed mean and variance, is 4.95, 10.32, 19.16, 24.40, 21.32, 12.78, and 7.02, respectively. The calculated chi^2 is now 10.77, and the null hypothesis would be rejected at the alpha = 0.05 significance level (95% confidence level). That is, we would reject that this sample was drawn from a Gaussian distribution with a mean of 7.28 and variance of 10.28 at this significance level. The distribution is shown on the next page, again with a filled circle in each histogram bar at the number expected in that bin.

Confidence Intervals: One says, for example, with 98% confidence that the true mean of a random variable lies between two values. This is based on knowing the probability distribution for the random variable, of course, and can be very difficult, especially for complicated distributions that include nonzero correlation coefficients.

However, for Gaussian distributions, these are well known and can be found in any standard statistics book. For example, Gaussian distributions have 68% and 95% confidence intervals of approximately +-1 sigma and +-2 sigma, respectively.

T and F Tests: These two statistical tests are commonly used to determine whether the properties of two samples are consistent with the samples coming from the same population. The F test in particular can be used to test the improvement in the fit between predicted and observed data when one adds a degree of freedom in the inversion. One expects to fit the data better by adding more model parameters, so the relevant question is whether the improvement is significant. As applied to the test of improvement in fit between case 1 and case 2, where case 2 uses more model parameters to describe the same data set, the F ratio is given by

    F = [(E_1 - E_2) / (DOF_1 - DOF_2)] / (E_2 / DOF_2)      (2.70)

where E is the residual sum of squares and DOF is the number of degrees of freedom for each case. If F is large, one accepts that the second case, with more model parameters, provides a significantly better fit to the data. The calculated F is compared to published tables with DOF_1 - DOF_2 and DOF_2 degrees of freedom at a specified confidence level. (Reference: T. M. Hearn, Pn travel times in southern California, J. Geophys. Res., 89, 1843-1855, 1984.)

The next section will deal with solving inverse problems based on length measures. This will include the classic least squares approach.