Quick Tour of Basic Linear Algebra and Probability Theory

Transcription

1 Quick Tour of and Probability Theory CS246: Mining Massive Data Sets Winter 2011

2 Outline 1 2 Basic Probability Theory

3 Matrices and Vectors Matrix: A rectangular array of numbers, e.g., A R m n : a 11 a a 1n a 21 a a 2n A =.... a m1 a m2... a mn

4 Matrices and Vectors Matrix: A rectangular array of numbers, e.g., A R m n : a 11 a a 1n a 21 a a 2n A =.... a m1 a m2... a mn Vector: A matrix consisting of only one column (default) or one row, e.g., x R n x = x 1 x 2. x n

5 Matrix Multiplication If A R m n, B R n p, C = AB, then C R m p : n C ij = A ik B kj k=1

6 Matrix Multiplication If A R m n, B R n p, C = AB, then C R m p : C ij = n A ik B kj k=1 Special cases: Matrix-vector product, inner product of two vectors. e.g., with x, y R n : x T y = n x i y i R i=1

7 Properties of Matrix Multiplication Associative: (AB)C = A(BC)

8 Properties of Matrix Multiplication Associative: (AB)C = A(BC) Distributive: A(B + C) = AB + AC

9 Properties of Matrix Multiplication Associative: (AB)C = A(BC) Distributive: A(B + C) = AB + AC Non-commutative: AB BA

10 Properties of Matrix Multiplication Associative: (AB)C = A(BC) Distributive: A(B + C) = AB + AC Non-commutative: AB BA Block multiplication: If A = [A ik ], B = [B kj ], where A ik s and B kj s are matrix blocks, and the number of columns in A ik is equal to the number of rows in B kj, then C = AB = [C ij ] where C ij = k A ikb kj

11 Properties of Matrix Multiplication Associative: (AB)C = A(BC) Distributive: A(B + C) = AB + AC Non-commutative: AB BA Block multiplication: If A = [A ik ], B = [B kj ], where A ik s and B kj s are matrix blocks, and the number of columns in A ik is equal to the number of rows in B kj, then C = AB = [C ij ] where C ij = k A ikb kj Example: If x R n and A = [ a 1 a 2... a n ] R m n, B = [ b 1 b 2... b p ] R n p : A x = n x i ai i=1 AB = [A b 1 A b 2... A b p ]

12 Operators and properties Transpose: A R m n, then A T R n m : (A T ) ij = A ji

13 Operators and properties Transpose: A R m n, then A T R n m : (A T ) ij = A ji Properties: (A T ) T = A (AB) T = B T A T (A + B) T = A T + B T

14 Operators and properties Transpose: A R m n, then A T R n m : (A T ) ij = A ji Properties: (A T ) T = A (AB) T = B T A T (A + B) T = A T + B T Trace: A R n n, then: tr(a) = n i=1 A ii

15 Operators and properties Transpose: A R m n, then A T R n m : (A T ) ij = A ji Properties: (A T ) T = A (AB) T = B T A T (A + B) T = A T + B T Trace: A R n n, then: tr(a) = n i=1 A ii Properties: tr(a) = tr(a T ) tr(a + B) = tr(a) + tr(b) tr(λa) = λtr(a) If AB is a square matrix, tr(ab) = tr(ba)

16 Special types of matrices Identity matrix: I = I n R n n : { 1 i=j, I ij = 0 otherwise. A R m n : AI n = I m A = A

17 Special types of matrices Identity matrix: I = I n R n n : { 1 i=j, I ij = 0 otherwise. A R m n : AI n = I m A = A Diagonal matrix: D = diag(d 1, d 2,..., d n ): { d i j=i, D ij = 0 otherwise.

18 Special types of matrices Identity matrix: I = I n R n n : { 1 i=j, I ij = 0 otherwise. A R m n : AI n = I m A = A Diagonal matrix: D = diag(d 1, d 2,..., d n ): { d i j=i, D ij = 0 otherwise. Symmetric matrices: A R n n is symmetric if A = A T.

19 Special types of matrices Identity matrix: I = I n R n n : { 1 i=j, I ij = 0 otherwise. A R m n : AI n = I m A = A Diagonal matrix: D = diag(d 1, d 2,..., d n ): { d i j=i, D ij = 0 otherwise. Symmetric matrices: A R n n is symmetric if A = A T. Orthogonal matrices: U R n n is orthogonal if UU T = I = U T U

20 Linear Independence and Rank A set of vectors {x 1,..., x n } is linearly independent if {α 1,..., α n }: n i=1 α ix i = 0

21 Linear Independence and Rank A set of vectors {x 1,..., x n } is linearly independent if {α 1,..., α n }: n i=1 α ix i = 0 Rank: A R m n, then rank(a) is the maximum number of linearly independent columns (or equivalently, rows)

22 Linear Independence and Rank A set of vectors {x 1,..., x n } is linearly independent if {α 1,..., α n }: n i=1 α ix i = 0 Rank: A R m n, then rank(a) is the maximum number of linearly independent columns (or equivalently, rows) Properties: rank(a) min{m, n} rank(a) = rank(a T ) rank(ab) min{rank(a), rank(b)} rank(a + B) rank(a) + rank(b)

23 Matrix Inversion If A R n n, rank(a) = n, then the inverse of A, denoted A 1 is the matrix that: AA 1 = A 1 A = I Properties: (A 1 ) 1 = A (AB) 1 = B 1 A 1 (A 1 ) T = (A T ) 1

24 Range and Nullspace of a Matrix Span: span({x 1,..., x n }) = { n i=1 α ix i α i R}

25 Range and Nullspace of a Matrix Span: span({x 1,..., x n }) = { n i=1 α ix i α i R} Projection: Proj(y; {x i } 1 i n ) = argmin v span({xi } 1 i n ){ y v 2 }

26 Range and Nullspace of a Matrix Span: span({x 1,..., x n }) = { n i=1 α ix i α i R} Projection: Proj(y; {x i } 1 i n ) = argmin v span({xi } 1 i n ){ y v 2 } Range: A R m n, then R(A) = {Ax x R n } is the span of the columns of A

27 Range and Nullspace of a Matrix Span: span({x 1,..., x n }) = { n i=1 α ix i α i R} Projection: Proj(y; {x i } 1 i n ) = argmin v span({xi } 1 i n ){ y v 2 } Range: A R m n, then R(A) = {Ax x R n } is the span of the columns of A Proj(y, A) = A(A T A) 1 A T y

28 Range and Nullspace of a Matrix Span: span({x 1,..., x n }) = { n i=1 α ix i α i R} Projection: Proj(y; {x i } 1 i n ) = argmin v span({xi } 1 i n ){ y v 2 } Range: A R m n, then R(A) = {Ax x R n } is the span of the columns of A Proj(y, A) = A(A T A) 1 A T y Nullspace: null(a) = {x R n Ax = 0}

29 Determinant A R n n, a 1,..., a n the rows of A, S = { n i=1 α ia i 0 α i 1}, then det(a) is the volume of S. Properties: det(i) = 1 det(λa) = λdet(a) det(a T ) = det(a) det(ab) = det(a)det(b) det(a) 0 if and only if A is invertible. If A invertible, then det(a 1) = det(a)

30 Quadratic Forms and Positive Semidefinite Matrices A R n n, x R n, x T Ax is called a quadratic form: x T Ax = A ij x i x j 1 i,j n

31 Quadratic Forms and Positive Semidefinite Matrices A R n n, x R n, x T Ax is called a quadratic form: x T Ax = A ij x i x j 1 i,j n A is positive definite if x R n : x T Ax > 0 A is positive semidefinite if x R n : x T Ax 0 A is negative definite if x R n : x T Ax < 0 A is negative semidefinite if x R n : x T Ax 0

32 Eigenvalues and Eigenvectors A R n n, λ C is an eigenvalue of A with the corresponding eigenvector x C n (x 0) if: Ax = λx eigenvalues: the n possibly complex roots of the polynomial equation det(a λi) = 0, and denoted as λ 1,..., λ n Properties: tr(a) = n i=1 λ i det(a) = n i=1 λ i rank(a) = {1 i n λ i 0}

33 Matrix Eigendecomposition A R n n, λ 1,..., λ n the eigenvalues, and x 1,..., x n the eigenvectors. X = [x 1 x 2... x n ], Λ = diag(λ 1,..., λ n ), then AX = XΛ. A called diagonalizable if X invertible: A = XΛX 1 If A symmetric, then all eigenvalues real, and X orthogonal (hence denoted by U = [u 1 u 2... u n ]): A = UΛU T = n λ i u i ui T i=1 A special case of Signular Value Decomposition

34 Basic Probability Theory Outline 1 2 Basic Probability Theory

35 Basic Probability Theory Elements of Probability Sample Space Ω: Set of all possible outcomes Event Space F: A family of subsets of Ω Probability Measure: Function P : F R with properties: 1 P(A) 0 ( A F) 2 P(Ω) = 1 3 A i s disjoint, then P( i A i) = i P(A i)

36 Basic Probability Theory Conditional Probability and Independence For events A, B: P(A B) = P(A B) P(B) A, B independent if P(A B) = P(A) or equivalently: P(A B) = P(A)P(B)

37 Basic Probability Theory Random Variables and Distributions A random variable X is a function X : Ω R Example: Number of heads in 20 tosses of a coin Probabilities of events associated with random variables defined based on the original probability function. e.g., P(X = k) = P({ω Ω X(ω) = k}) Cumulative Distribution Function (CDF) F X : R [0, 1]: F X (x) = P(X x) Probability Mass Function (pmf): X discrete then p X (x) = P(X = x) Probability Density Function (pdf): f X (x) = df X (x)/dx

38 Basic Probability Theory Properties of Distribution Functions CDF: 0 F X (x) 1 F X monotone increasing, with lim x F X (x) = 0, lim x F X (x) = 1 pmf: 0 p X (x) 1 x p X (x) = 1 x A p X (x) = p X (A) pdf: f X (x) 0 f X (x)dx = 1 x A f X (x)dx = P(X A)

39 Basic Probability Theory Expectation and Variance Assume random variable X has pdf f X (x), and g : R R. Then E[g(X)] = g(x)f X (x)dx for discrete X, E[g(X)] = x g(x)p X (x) Properties: for any constant a R, E[a] = a E[ag(X)] = ae[g(x)] Linearity of Expectation: E[g(X) + h(x)] = E[g(X)] + E[h(X)] Var[X] = E[(X E[X]) 2 ]

40 Basic Probability Theory Some Common Random Variables X Bernoulli(p) (0 p 1): { p x=1, p X (x) = 1 p x=0. X Geometric(p) (0 p 1): p X (x) = p(1 p) x 1 X Uniform(a, b) (a < b): { 1 f X (x) = b a a x b, 0 otherwise. X Normal(µ, σ 2 ): f X (x) = 1 2πσ e 1 2σ 2 (x µ)2

41 Basic Probability Theory Multiple Random Variables and Joint Distributions X 1,..., X n random variables Joint CDF: F X1,...,X n (x 1,..., x n ) = P(X 1 x 1,..., X n x n ) Joint pdf: f X1,...,X n (x 1,..., x n ) = n F X1,...,Xn (x 1,...,x n) x 1... x n Marginalization: f X1 (x 1 ) =... f X 1,...,X n (x 1,..., x n )dx 2... dx n Conditioning: f X1 X 2,...,X n (x 1 x 2,..., x n ) = f X 1,...,Xn (x 1,...,x n) f X2,...,Xn (x 2,...,x n) Chain Rule: f (x 1,..., x n ) = f (x 1 ) n i=2 f (x i x 1,..., x i 1 ) Independence: f (x 1,..., x n ) = n i=1 f (x i). More generally, events A 1,..., A n independent if P( i S A i) = i S P(A i) ( S {1,..., n}).

42 Basic Probability Theory Random Vectors X 1,..., X n random variables. X = [X 1 X 2... X n ] T random vector. If g : R n R, then E[g(X)] = R n g(x 1,..., x n )f X1,...,X n (x 1,..., x n )dx 1... dx n if g : R n R m, g = [g 1... g m ] T, then E[g(X)] = [ E[g 1 (X)]... E[g m (X)] ] T Covariance Matrix: Σ = Cov(X) = E [ (X E[X])(X E[X]) T ] Properties of Covariance Matrix: Σ ij = Cov[X i, X j ] = E [ (X i E[X i ])(X j E[X j ]) ] Σ symmetric, positive semidefinite

43 Basic Probability Theory Multivariate Gaussian Distribution µ R n, Σ R n n symmetric, positive semidefinite X N (µ, Σ) n-dimensional Gaussian distribution: f X (x) = 1 (2π) n/2 det(σ) 1/2 exp ( 1 2 (x µ)t Σ 1 (x µ) ) E[X] = µ Cov(X) = Σ

44 Basic Probability Theory Parameter Estimation: Maximum Likelihood Parametrized distribution f X (x; θ) with parameter(s) θ unknown. IID samples x 1,..., x n observed. Goal: Estimate θ MLE: ˆθ = argmax θ {f (x 1,..., x n ; θ)}

45 Basic Probability Theory MLE Example X Gaussian(µ, σ 2 ). θ = (µ, σ 2 ) unknown. Samples x 1,..., x n. Then: f (x 1,..., x n ; µ, σ 2 1 ) = ( 2πσ 2 )n/2 exp ( n i=1 (x i µ) 2 2σ 2 ) Setting: log f µ Gives: = 0 and log f σ = 0 n i=1 ˆµ MLE = x i n, ˆσ 2 MLE = n i=1 (x i ˆµ) 2 n If not possible to find the optimal point in closed form, iterative methods such as gradient decent can be used.

46 Basic Probability Theory Some Useful Inequalities Markov s Inequality: X random variable, and a > 0. Then: P( X a) E[ X ] a Chebyshev s Inequality: If E[X] = µ, Var(X) = σ 2, k > 0, then: Pr( X µ kσ) 1 k 2 Chernoff bound: X 1,..., X n iid random variables, with E[X i ] = µ, X i {0, 1} ( 1 i n). Then: P( 1 n X i µ ɛ) 2 exp( 2nɛ 2 ) n i=1 Multiple variants of Chernoff-type bounds exist, which can be useful in different settings

47 Basic Probability Theory References 1 CS229 notes on basic linear algebra and probability theory 2 Wikipedia!