Probability Review. Rob Hall. September 9, 2010


1 Probability Review Rob Hall September 9, 2010

2 What is Probability? Probability reasons about a sample, knowing the population. The goal of statistics is to estimate the population based on a sample. Both provide invaluable tools to modern machine learning.

3 Plan Facts about sets (to get our brains in gear). Definitions and facts about probabilities. Random variables and joint distributions. Characteristics of distributions (mean, variance, entropy). Some asymptotic results (a high-level perspective). Goals: get some intuition about probability, learn how to formulate a simple proof, lay out some useful identities for use as a reference. Non-goal: supplant an entire semester-long course in probability.

4 Set Basics A set is just a collection of elements, denoted e.g. S = {s_1, s_2, s_3}, R = {r : some condition holds on r}. Intersection: the elements that are in both sets: A ∩ B = {x : x ∈ A and x ∈ B}. Union: the elements that are in either set, or both: A ∪ B = {x : x ∈ A or x ∈ B}. Complementation: all the elements that aren't in the set: A^C = {x : x ∉ A}. [Venn diagrams of A^C, A ∪ B, and A ∩ B]

5-8 Properties of Set Operations Commutativity: A ∪ B = B ∪ A. Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C. Likewise for intersection. Proof? Follows easily from the commutative and associative properties of "and" and "or" in the definitions. Distributive properties: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C). Proof? Show each side of the equality contains the other. DeMorgan's laws: see book.
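These identities are easy to sanity-check mechanically. A small sketch using Python's built-in set type; the particular sets A, B, C and the universe S are arbitrary choices for illustration, not from the slides:

```python
# Check the distributive properties on arbitrary small sets.
A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

lhs = A & (B | C)            # A ∩ (B ∪ C)
rhs = (A & B) | (A & C)      # (A ∩ B) ∪ (A ∩ C)
assert lhs == rhs

lhs2 = A | (B & C)           # A ∪ (B ∩ C)
rhs2 = (A | B) & (A | C)     # (A ∪ B) ∩ (A ∪ C)
assert lhs2 == rhs2

# One of DeMorgan's laws, taking complements relative to a universe S:
S = {1, 2, 3, 4, 5, 6}
assert S - (A | B) == (S - A) & (S - B)
```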

9-10 Disjointness and Partitions A sequence of sets A_1, A_2, ... is called pairwise disjoint or mutually exclusive if for all i ≠ j, A_i ∩ A_j = {}. If the sequence is pairwise disjoint and ∪_{i=1}^∞ A_i = S, then the sequence forms a partition of S. Partitions are useful in probability theory and in life:
B = B ∩ S = B ∩ (∪_{i=1}^∞ A_i) (def. of partition) = ∪_{i=1}^∞ (B ∩ A_i) (distributive property)
Note that the sets B ∩ A_i are also pairwise disjoint (proof?). If S is the whole space, what have we constructed?

11-14 Probability Terminology

Name                  What it is              Common symbols   What it means
Sample space          Set                     Ω, S             Possible outcomes.
Event space           Collection of subsets   F, E             The things that have probabilities.
Probability measure   Measure                 P, π             Assigns probabilities to events.
Probability space     A triple (Ω, F, P)

Remarks: we may consider the event space to be the power set of the sample space (for a discrete sample space; more later). e.g., rolling a fair die:
Ω = {1, 2, 3, 4, 5, 6}
F = 2^Ω = {{}, {1}, {2}, ..., {1, 2}, ..., {1, 2, 3}, ..., {1, 2, 3, 4, 5, 6}}
P({1}) = P({2}) = ... = P({6}) = 1/6 (i.e., a fair die)
P({1, 3, 5}) = 1/2 (i.e., half chance of an odd result)
P({1, 2, 3, 4, 5, 6}) = 1 (i.e., the result is almost surely one of the faces)
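The fair-die probability space can be written out directly; a minimal sketch in Python, where the function name P and the use of exact rational arithmetic are illustrative choices:

```python
from fractions import Fraction

# Sample space for a fair die; P is defined on arbitrary events (subsets of omega).
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event under the uniform (fair-die) measure."""
    return Fraction(len(event & omega), len(omega))

p_one = P({1})           # each face: 1/6
p_odd = P({1, 3, 5})     # odd result: 1/2
p_all = P(omega)         # the whole sample space: 1
```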

15-16 Axioms for Probability A set of conditions imposed on probability measures (due to Kolmogorov):
P(A) ≥ 0 for all A ∈ F
P(Ω) = 1
P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) where {A_i}_{i=1}^∞ ⊆ F are pairwise disjoint.
These quickly lead to:
P(A^C) = 1 - P(A) (since P(A) + P(A^C) = P(A ∪ A^C) = P(Ω) = 1)
P(A) ≤ 1 (since P(A^C) ≥ 0)
P({}) = 0 (since P(Ω) = 1)

17-22 General Unions: P(A ∪ B) Recall that A, A^C form a partition of Ω:
B = B ∩ Ω = B ∩ (A ∪ A^C) = (B ∩ A) ∪ (B ∩ A^C)
And so: P(B) = P(B ∩ A) + P(B ∩ A^C). For a general partition this is called the law of total probability.
P(A ∪ B) = P(A ∪ (B ∩ A^C)) = P(A) + P(B ∩ A^C) = P(A) + P(B) - P(B ∩ A) ≤ P(A) + P(B)
Very important difference between disjoint and non-disjoint unions. The same idea yields the so-called union bound, aka Boole's inequality.
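The inclusion-exclusion identity and the union bound can be checked by brute force on the die space; a sketch, borrowing the events A and B used in the conditional-probability example that follows:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    # Uniform measure on the fair-die sample space.
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3, 4}   # result is less than 5
B = {1, 3, 5}      # result is odd

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_union = P(A | B)
assert p_union == P(A) + P(B) - P(A & B)

# Union bound: P(A ∪ B) ≤ P(A) + P(B)
assert p_union <= P(A) + P(B)
```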

23-26 Conditional Probabilities For events A, B ∈ F with P(B) > 0, we may write the conditional probability of A given B:
P(A|B) = P(A ∩ B) / P(B)
Interpretation: the outcome is definitely in B, so treat B as the entire sample space and find the probability that the outcome is also in A.
This rapidly leads to: P(A|B)P(B) = P(A ∩ B), aka the chain rule for probabilities (why?).
When A_1, A_2, ... are a partition of Ω:
P(B) = Σ_{i=1}^∞ P(B ∩ A_i) = Σ_{i=1}^∞ P(B|A_i)P(A_i)
This is also referred to as the law of total probability.

27-34 Conditional Probability Example Suppose we throw a fair die: Ω = {1, 2, 3, 4, 5, 6}, F = 2^Ω, P({i}) = 1/6 for i = 1, ..., 6. Let A = {1, 2, 3, 4}, i.e., the result is less than 5, and B = {1, 3, 5}, i.e., the result is odd.
P(A) = 2/3
P(B) = 1/2
P(A|B) = P(A ∩ B)/P(B) = P({1, 3})/P(B) = (1/3)/(1/2) = 2/3
P(B|A) = P(A ∩ B)/P(A) = (1/3)/(2/3) = 1/2
Note that in general P(A|B) ≠ P(B|A); however, we may quantify their relationship.

35 Bayes' Rule Using the chain rule we may see: P(A|B)P(B) = P(A ∩ B) = P(B|A)P(A). Rearranging this yields Bayes' rule:
P(B|A) = P(A|B)P(B) / P(A)
Often this is written as:
P(B_i|A) = P(A|B_i)P(B_i) / Σ_i P(A|B_i)P(B_i)
where the B_i are a partition of Ω (note the denominator is just the law of total probability).
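Plugging the numbers from the preceding die example into Bayes' rule recovers P(B|A); a minimal sketch:

```python
from fractions import Fraction

# Values from the die example: A = {1,2,3,4} (less than 5), B = {1,3,5} (odd).
p_A = Fraction(2, 3)
p_B = Fraction(1, 2)
p_A_given_B = Fraction(2, 3)

# Bayes' rule: P(B|A) = P(A|B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
# Matches the direct computation P(B|A) = 1/2 from the example.
```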

36-41 Independence Two events A, B are called independent if P(A ∩ B) = P(A)P(B). When P(A) > 0 this may be written P(B|A) = P(B) (why?). e.g., rolling two dice, flipping n coins, etc.
Two events A, B are called conditionally independent given C when P(A ∩ B | C) = P(A|C)P(B|C). When P(A ∩ C) > 0 we may write P(B|A, C) = P(B|C). e.g., the weather tomorrow is independent of the weather yesterday, knowing the weather today.

42 Random Variables (caution: hand-waving) A random variable is a function X : Ω → ℝ^d. e.g., roll some dice, X = sum of the numbers. Indicators of events: X(ω) = 1_A(ω); e.g., toss a coin, X = 1 if it came up heads, 0 otherwise. Note the relationship between the set-theoretic constructions and binary RVs. Give a few monkeys a typewriter, X = fraction of overlap with the complete works of Shakespeare. Throw a dart at a board, X ∈ ℝ² are the coordinates which are hit.

43 Distributions By considering random variables, we may think of probability measures as functions on the real numbers. Then the probability measure associated with the RV is completely characterized by its cumulative distribution function (CDF): F_X(x) = P(X ≤ x). If two RVs have the same CDF we call them identically distributed. We say X ~ F_X or X ~ f_X (f_X coming soon) to indicate that X has the distribution specified by F_X (resp. f_X). [Plots of two example CDFs F_X(x)]

44-47 Discrete Distributions If X takes on only a countable number of values, then we may characterize it by a probability mass function (PMF), which describes the probability of each value: f_X(x) = P(X = x).
We have: Σ_x f_X(x) = 1 (why?). Since each ω maps to one x, and P(Ω) = 1.
e.g., general discrete PMF: f_X(x_i) = θ_i, Σ_i θ_i = 1, θ_i ≥ 0.
e.g., Bernoulli distribution: X ∈ {0, 1}, f_X(x) = θ^x (1 - θ)^(1-x). A general model of binary outcomes (coin flips etc.).
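A quick check that the Bernoulli PMF behaves as stated; θ = 0.3 is an arbitrary illustrative value:

```python
# Bernoulli PMF: f_X(x) = θ^x (1 - θ)^(1 - x) for x ∈ {0, 1}.
def bernoulli_pmf(x, theta):
    return theta**x * (1 - theta)**(1 - x)

theta = 0.3  # arbitrary choice for illustration

# The PMF sums to 1 over the support, and the mean works out to θ.
total = bernoulli_pmf(0, theta) + bernoulli_pmf(1, theta)
mean = sum(x * bernoulli_pmf(x, theta) for x in (0, 1))
```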

48-49 Discrete Distributions Rather than specifying each probability for each event, we may consider a more restrictive parametric form, which will be easier to specify and manipulate (but sometimes less general).
e.g., multinomial distribution: X ∈ ℕ^d, Σ_{i=1}^d x_i = n,
f_X(x) = (n! / (x_1! x_2! ··· x_d!)) Π_{i=1}^d θ_i^(x_i)
Sometimes used in text processing (dimensions correspond to words, n is the length of a document). What have we lost in going from a general form to a multinomial?

50-54 Continuous Distributions When the CDF is continuous we may consider its derivative f_X(x) = (d/dx) F_X(x). This is called the probability density function (PDF).
The probability of an interval (a, b) is given by P(a < X < b) = ∫_a^b f_X(x) dx. The probability of any specific point c is zero: P(X = c) = 0 (why?).
e.g., uniform distribution: f_X(x) = (1/(b - a)) 1_(a,b)(x)
e.g., Gaussian aka normal: f_X(x) = (1/(σ√(2π))) exp{-(x - µ)²/(2σ²)}
Note that both families give probabilities for every interval on the real line, yet are specified by only two numbers. [Plot of the standard normal density]
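The interval probability P(a < X < b) = ∫_a^b f_X(x) dx can be approximated numerically; a sketch using trapezoidal integration of the Gaussian density, where the step count is an arbitrary choice:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density: (1/(σ√(2π))) exp{-(x - µ)²/(2σ²)}
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def prob_interval(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """P(a < X < b) by trapezoidal integration of the PDF."""
    h = (b - a) / steps
    ys = [normal_pdf(a + i * h, mu, sigma) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# For the standard normal, the interval (-1.96, 1.96) carries about 0.95 probability.
p = prob_interval(-1.96, 1.96)
```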

55 Multiple Random Variables We may consider multiple functions of the same sample space, e.g., X(ω) = 1_A(ω), Y(ω) = 1_B(ω). [Venn diagram of A and B] May represent the joint distribution as a table: [2×2 table of P(X = x, Y = y) for x, y ∈ {0, 1}] We write the joint PMF or PDF as f_X,Y(x, y).

56 Multiple Random Variables Two random variables are called independent when the joint PDF factorizes: f_X,Y(x, y) = f_X(x)f_Y(y). When RVs are independent and identically distributed this is usually abbreviated to i.i.d. Relationship to independent events: X, Y are independent iff {ω : X(ω) ≤ x}, {ω : Y(ω) ≤ y} are independent events for all x, y.

57 Working with a Joint Distribution We have similar constructions as we did in abstract probability spaces:
Marginalizing: f_X(x) = ∫_Y f_X,Y(x, y) dy. Similar idea to the law of total probability (identical for a discrete distribution).
Conditioning: f_{X|Y}(x, y) = f_X,Y(x, y)/f_Y(y) = f_X,Y(x, y)/∫_X f_X,Y(x, y) dx. Similar to the previous definition.
[Table of a joint distribution over Old?, Blood pressure?, Heart attack?] How to compute P(heart attack | old)?
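Marginalizing and conditioning on a discrete joint table can be sketched as follows; the joint probabilities over (old, heart attack) are made up purely for illustration and are not the values from the slide:

```python
# Hypothetical joint PMF over X ∈ {0, 1} ("old?") and Y ∈ {0, 1} ("heart attack?").
# These numbers are invented for illustration only.
joint = {(0, 0): 0.56, (0, 1): 0.04, (1, 0): 0.30, (1, 1): 0.10}

# Marginalizing: f_X(x) = Σ_y f_X,Y(x, y)
f_X = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}

# Conditioning: f_{Y|X}(y|x) = f_X,Y(x, y) / f_X(x)
def cond_Y_given_X(y, x):
    return joint[(x, y)] / f_X[x]

# P(heart attack | old) = 0.10 / 0.40 under these made-up numbers.
p_attack_given_old = cond_Y_given_X(1, 1)
```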

58-59 Characteristics of Distributions We may consider the expectation (or "mean") of a distribution:
E(X) = Σ_x x f_X(x) if X is discrete; E(X) = ∫ x f_X(x) dx if X is continuous.
Expectation is linear:
E(aX + bY + c) = Σ_{x,y} (ax + by + c) f_X,Y(x, y)
= Σ_{x,y} ax f_X,Y(x, y) + Σ_{x,y} by f_X,Y(x, y) + Σ_{x,y} c f_X,Y(x, y)
= a Σ_{x,y} x f_X,Y(x, y) + b Σ_{x,y} y f_X,Y(x, y) + c Σ_{x,y} f_X,Y(x, y)
= a Σ_x x Σ_y f_X,Y(x, y) + b Σ_y y Σ_x f_X,Y(x, y) + c
= a Σ_x x f_X(x) + b Σ_y y f_Y(y) + c
= a E(X) + b E(Y) + c
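Linearity also holds exactly for sample averages, which makes it easy to check empirically; a sketch with simulated draws, where the constants a, b, c and the sample size are arbitrary choices:

```python
import random

random.seed(0)

# Empirical check of E(aX + bY + c) = aE(X) + bE(Y) + c.
# The same identity holds exactly for sample means, up to float rounding.
n = 100_000
xs = [random.randint(1, 6) for _ in range(n)]  # die rolls
ys = [random.random() for _ in range(n)]       # uniforms on (0, 1)
a, b, c = 2.0, -3.0, 5.0

def mean(v):
    return sum(v) / len(v)

lhs = mean([a * x + b * y + c for x, y in zip(xs, ys)])
rhs = a * mean(xs) + b * mean(ys) + c
```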

60-64 Characteristics of Distributions Questions:
1. E[EX] = Σ_x (EX) f_X(x) = (EX) Σ_x f_X(x) = EX
2. E(XY) = E(X)E(Y)? Not in general, although when f_X,Y = f_X f_Y:
E(XY) = Σ_{x,y} xy f_X(x) f_Y(y) = Σ_x x f_X(x) Σ_y y f_Y(y) = EX · EY

65-67 Characteristics of Distributions We may consider the variance of a distribution: Var(X) = E(X - EX)². This may give an idea of how "spread out" a distribution is.
A useful alternate form is:
E(X - EX)² = E[X² - 2X E(X) + (EX)²] = E(X²) - 2E(X)E(X) + (EX)² = E(X²) - (EX)²
Variance of a coin toss?
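The "variance of a coin toss?" question can be answered numerically; a sketch for a Bernoulli(θ) toss, checking that both forms agree and give θ(1 - θ), which is 1/4 for a fair coin:

```python
# Variance of a coin toss, X ~ Bernoulli(θ), computed from the PMF both ways.
theta = 0.5  # a fair coin
pmf = {0: 1 - theta, 1: theta}

EX = sum(x * p for x, p in pmf.items())
var_direct = sum((x - EX)**2 * p for x, p in pmf.items())      # E(X - EX)²
var_alt = sum(x**2 * p for x, p in pmf.items()) - EX**2        # E(X²) - (EX)²
# Both equal θ(1 - θ) = 0.25 for θ = 0.5.
```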

68-71 Characteristics of Distributions Variance is non-linear, but the following holds:
Var(aX) = E(aX - E(aX))² = E(aX - aEX)² = a² E(X - EX)² = a² Var(X)
Var(X + c) = E(X + c - E(X + c))² = E(X - EX + c - c)² = E(X - EX)² = Var(X)
Var(X + Y) = E(X - EX + Y - EY)² = E(X - EX)² + E(Y - EY)² + 2 E[(X - EX)(Y - EY)] = Var(X) + Var(Y) + 2 Cov(X, Y)
So when X, Y are independent we have (why?): Var(X + Y) = Var(X) + Var(Y)

72-77 Putting it all together Say we have X_1, ..., X_n i.i.d., where EX_i = µ and Var(X_i) = σ². We want to know the expectation and variance of X̄_n = (1/n) Σ_{i=1}^n X_i.
E(X̄_n) = E[(1/n) Σ_{i=1}^n X_i] = (1/n) Σ_{i=1}^n E(X_i) = (1/n) nµ = µ
Var(X̄_n) = Var((1/n) Σ_{i=1}^n X_i) = (1/n²) Σ_{i=1}^n Var(X_i) = (1/n²) nσ² = σ²/n
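A Monte Carlo sanity check of Var(X̄_n) = σ²/n for die rolls, where µ = 3.5 and σ² = 35/12; the sample size and trial count are arbitrary choices:

```python
import random

random.seed(1)

# Estimate E(X̄_n) and Var(X̄_n) for the mean of n fair-die rolls.
n, trials = 50, 20_000
means = []
for _ in range(trials):
    rolls = [random.randint(1, 6) for _ in range(n)]
    means.append(sum(rolls) / n)

m = sum(means) / trials                            # should be near µ = 3.5
var_hat = sum((x - m)**2 for x in means) / trials  # should be near σ²/n
expected_var = (35 / 12) / n                       # σ²/n ≈ 0.0583
```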

78 Entropy of a Distribution Entropy is a measure of uniformity in a distribution: H(X) = -Σ_x f_X(x) log_2 f_X(x). Imagine you had to transmit a sample from f_X, so you construct the optimal encoding scheme: entropy gives the mean depth in the code tree (= mean number of bits).
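A sketch of the entropy formula; the example distributions are illustrative, and the fair-coin case gives exactly 1 bit:

```python
import math

def entropy(pmf):
    """H(X) = -Σ_x f(x) log2 f(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

h_coin = entropy([0.5, 0.5])    # fair coin: 1 bit
h_die = entropy([1/6] * 6)      # fair die: log2(6) ≈ 2.585 bits
h_skewed = entropy([0.9, 0.1])  # less uniform, so less entropy than the fair coin
```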

79-81 Law of Large Numbers (LLN) Recall our variable X̄_n = (1/n) Σ_{i=1}^n X_i. We may wonder about its behavior as n → ∞.
We had: E(X̄_n) = µ, Var(X̄_n) = σ²/n. The distribution appears to be contracting: as n increases, the variance is going to 0.
Using Chebyshev's inequality: P(|X̄_n - µ| ≥ ε) ≤ σ²/(nε²) → 0 for any fixed ε, as n → ∞.

82-85 Law of Large Numbers (LLN) The weak law of large numbers:
lim_{n→∞} P(|X̄_n - µ| < ε) = 1
In English: choose ε and a probability that |X̄_n - µ| < ε, and I can find you an n so your probability is achieved.
The strong law of large numbers:
P(lim_{n→∞} X̄_n = µ) = 1
In English: the mean converges to the expectation almost surely as n increases.
Two different versions, each holds under different conditions, but i.i.d. and finite variance is enough for either.
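The weak law can be illustrated by simulation; a sketch estimating P(|X̄_n - µ| < ε) for fair-die rolls at two sample sizes, where ε, the trial count, and the two values of n are arbitrary choices:

```python
import random

random.seed(2)

# Estimate P(|X̄_n - µ| < ε) for the mean of n die rolls (µ = 3.5).
mu, eps, trials = 3.5, 0.25, 2_000

def frac_within(n):
    hits = 0
    for _ in range(trials):
        xbar = sum(random.randint(1, 6) for _ in range(n)) / n
        hits += abs(xbar - mu) < eps
    return hits / trials

# The fraction within ε of µ should grow toward 1 as n increases.
small, large = frac_within(10), frac_within(500)
```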

86-88 Central Limit Theorem (CLT) The distribution of X̄_n also converges weakly to a Gaussian; standardizing the sample mean,
lim_{n→∞} P(√n (X̄_n - µ)/σ ≤ x) = Φ(x)
Simulated n dice rolls and took the average, 5000 times: [Histograms of the average of n dice rolls, for n = 1, 2, 10, 75]
Two kinds of convergence went into this picture (why 5000?):
1. The true distribution converges to a Gaussian (CLT).
2. The empirical distribution converges to the true distribution (Glivenko-Cantelli).
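The simulation described above can be sketched in code, comparing the empirical CDF of the standardized mean to Φ at a couple of points; the seed, n, and trial count are arbitrary choices:

```python
import math
import random

random.seed(3)

# Standardized means of n die rolls: z = √n (X̄_n - µ)/σ, with µ = 3.5, σ² = 35/12.
mu, sigma = 3.5, math.sqrt(35 / 12)
n, trials = 75, 5_000

zs = []
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)

def ecdf(t):
    # Empirical CDF of the standardized means.
    return sum(z <= t for z in zs) / trials

# By the CLT, ecdf(0) ≈ Φ(0) = 0.5 and ecdf(1) ≈ Φ(1) ≈ 0.841.
```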

89 Asymptotics Opinion Ideas like these are crucial to machine learning: we want to minimize error on a whole population (e.g., classify text documents as well as possible), but we minimize error on a training set of size n. What happens as n → ∞? How does the complexity of the model, or the dimension of the problem, affect convergence?


THE CENTRAL LIMIT THEOREM TORONTO THE CENTRAL LIMIT THEOREM DANIEL RÜDT UNIVERSITY OF TORONTO MARCH, 2010 Contents 1 Introduction 1 2 Mathematical Background 3 3 The Central Limit Theorem 4 4 Examples 4 4.1 Roulette......................................

More information

Stat 704 Data Analysis I Probability Review

Stat 704 Data Analysis I Probability Review 1 / 30 Stat 704 Data Analysis I Probability Review Timothy Hanson Department of Statistics, University of South Carolina Course information 2 / 30 Logistics: Tuesday/Thursday 11:40am to 12:55pm in LeConte

More information

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab Monte Carlo Simulation: IEOR E4703 Fall 2004 c 2004 by Martin Haugh Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab 1 Overview of Monte Carlo Simulation 1.1 Why use simulation?

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Math/Stats 425 Introduction to Probability. 1. Uncertainty and the axioms of probability

Math/Stats 425 Introduction to Probability. 1. Uncertainty and the axioms of probability Math/Stats 425 Introduction to Probability 1. Uncertainty and the axioms of probability Processes in the real world are random if outcomes cannot be predicted with certainty. Example: coin tossing, stock

More information

Lecture Notes 1. Brief Review of Basic Probability

Lecture Notes 1. Brief Review of Basic Probability Probability Review Lecture Notes Brief Review of Basic Probability I assume you know basic probability. Chapters -3 are a review. I will assume you have read and understood Chapters -3. Here is a very

More information

Definition and Calculus of Probability

Definition and Calculus of Probability In experiments with multivariate outcome variable, knowledge of the value of one variable may help predict another. For now, the word prediction will mean update the probabilities of events regarding the

More information

Gambling and Data Compression

Gambling and Data Compression Gambling and Data Compression Gambling. Horse Race Definition The wealth relative S(X) = b(x)o(x) is the factor by which the gambler s wealth grows if horse X wins the race, where b(x) is the fraction

More information

STA 256: Statistics and Probability I

STA 256: Statistics and Probability I Al Nosedal. University of Toronto. Fall 2014 1 2 3 4 5 My momma always said: Life was like a box of chocolates. You never know what you re gonna get. Forrest Gump. Experiment, outcome, sample space, and

More information

Lecture Note 1 Set and Probability Theory. MIT 14.30 Spring 2006 Herman Bennett

Lecture Note 1 Set and Probability Theory. MIT 14.30 Spring 2006 Herman Bennett Lecture Note 1 Set and Probability Theory MIT 14.30 Spring 2006 Herman Bennett 1 Set Theory 1.1 Definitions and Theorems 1. Experiment: any action or process whose outcome is subject to uncertainty. 2.

More information

Sums of Independent Random Variables

Sums of Independent Random Variables Chapter 7 Sums of Independent Random Variables 7.1 Sums of Discrete Random Variables In this chapter we turn to the important question of determining the distribution of a sum of independent random variables

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

Elements of probability theory

Elements of probability theory 2 Elements of probability theory Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observations yield di erent outcomes that cannot be predicted

More information

Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception

More information

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html 10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

More information

Aggregate Loss Models

Aggregate Loss Models Aggregate Loss Models Chapter 9 Stat 477 - Loss Models Chapter 9 (Stat 477) Aggregate Loss Models Brian Hartman - BYU 1 / 22 Objectives Objectives Individual risk model Collective risk model Computing

More information

Probability & Probability Distributions

Probability & Probability Distributions Probability & Probability Distributions Carolyn J. Anderson EdPsych 580 Fall 2005 Probability & Probability Distributions p. 1/61 Probability & Probability Distributions Elementary Probability Theory Definitions

More information

e.g. arrival of a customer to a service station or breakdown of a component in some system.

e.g. arrival of a customer to a service station or breakdown of a component in some system. Poisson process Events occur at random instants of time at an average rate of λ events per second. e.g. arrival of a customer to a service station or breakdown of a component in some system. Let N(t) be

More information

IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem

IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem Time on my hands: Coin tosses. Problem Formulation: Suppose that I have

More information

IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS INTRODUCTION

IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS INTRODUCTION IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS INTRODUCTION 1 WHAT IS STATISTICS? Statistics is a science of collecting data, organizing and describing it and drawing conclusions from it. That is, statistics

More information

Lecture 7: Continuous Random Variables

Lecture 7: Continuous Random Variables Lecture 7: Continuous Random Variables 21 September 2005 1 Our First Continuous Random Variable The back of the lecture hall is roughly 10 meters across. Suppose it were exactly 10 meters, and consider

More information

An Introduction to Basic Statistics and Probability

An Introduction to Basic Statistics and Probability An Introduction to Basic Statistics and Probability Shenek Heyward NCSU An Introduction to Basic Statistics and Probability p. 1/4 Outline Basic probability concepts Conditional probability Discrete Random

More information

Lecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University

Lecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University Lecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University 1 Chapter 1 Probability 1.1 Basic Concepts In the study of statistics, we consider experiments

More information

University of California, Los Angeles Department of Statistics. Random variables

University of California, Los Angeles Department of Statistics. Random variables University of California, Los Angeles Department of Statistics Statistics Instructor: Nicolas Christou Random variables Discrete random variables. Continuous random variables. Discrete random variables.

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Example. A casino offers the following bets (the fairest bets in the casino!) 1 You get $0 (i.e., you can walk away)

Example. A casino offers the following bets (the fairest bets in the casino!) 1 You get $0 (i.e., you can walk away) : Three bets Math 45 Introduction to Probability Lecture 5 Kenneth Harris aharri@umich.edu Department of Mathematics University of Michigan February, 009. A casino offers the following bets (the fairest

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

More information

STAT 315: HOW TO CHOOSE A DISTRIBUTION FOR A RANDOM VARIABLE

STAT 315: HOW TO CHOOSE A DISTRIBUTION FOR A RANDOM VARIABLE STAT 315: HOW TO CHOOSE A DISTRIBUTION FOR A RANDOM VARIABLE TROY BUTLER 1. Random variables and distributions We are often presented with descriptions of problems involving some level of uncertainty about

More information

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability CS 7 Discrete Mathematics and Probability Theory Fall 29 Satish Rao, David Tse Note 8 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces

More information

Question: What is the probability that a five-card poker hand contains a flush, that is, five cards of the same suit?

Question: What is the probability that a five-card poker hand contains a flush, that is, five cards of the same suit? ECS20 Discrete Mathematics Quarter: Spring 2007 Instructor: John Steinberger Assistant: Sophie Engle (prepared by Sophie Engle) Homework 8 Hints Due Wednesday June 6 th 2007 Section 6.1 #16 What is the

More information

Section 5.1 Continuous Random Variables: Introduction

Section 5.1 Continuous Random Variables: Introduction Section 5. Continuous Random Variables: Introduction Not all random variables are discrete. For example:. Waiting times for anything (train, arrival of customer, production of mrna molecule from gene,

More information

Random Variables. Chapter 2. Random Variables 1

Random Variables. Chapter 2. Random Variables 1 Random Variables Chapter 2 Random Variables 1 Roulette and Random Variables A Roulette wheel has 38 pockets. 18 of them are red and 18 are black; these are numbered from 1 to 36. The two remaining pockets

More information

Probability: Terminology and Examples Class 2, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Probability: Terminology and Examples Class 2, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom Probability: Terminology and Examples Class 2, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Know the definitions of sample space, event and probability function. 2. Be able to

More information

Basic Probability Concepts

Basic Probability Concepts page 1 Chapter 1 Basic Probability Concepts 1.1 Sample and Event Spaces 1.1.1 Sample Space A probabilistic (or statistical) experiment has the following characteristics: (a) the set of all possible outcomes

More information

4.1 4.2 Probability Distribution for Discrete Random Variables

4.1 4.2 Probability Distribution for Discrete Random Variables 4.1 4.2 Probability Distribution for Discrete Random Variables Key concepts: discrete random variable, probability distribution, expected value, variance, and standard deviation of a discrete random variable.

More information

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 13. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 13. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 3 Random Variables: Distribution and Expectation Random Variables Question: The homeworks of 20 students are collected

More information

APPLIED MATHEMATICS ADVANCED LEVEL

APPLIED MATHEMATICS ADVANCED LEVEL APPLIED MATHEMATICS ADVANCED LEVEL INTRODUCTION This syllabus serves to examine candidates knowledge and skills in introductory mathematical and statistical methods, and their applications. For applications

More information

Lecture 4: BK inequality 27th August and 6th September, 2007

Lecture 4: BK inequality 27th August and 6th September, 2007 CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound

More information

Mathematical Expectation

Mathematical Expectation Mathematical Expectation Properties of Mathematical Expectation I The concept of mathematical expectation arose in connection with games of chance. In its simplest form, mathematical expectation is the

More information

Lecture 8. Confidence intervals and the central limit theorem

Lecture 8. Confidence intervals and the central limit theorem Lecture 8. Confidence intervals and the central limit theorem Mathematical Statistics and Discrete Mathematics November 25th, 2015 1 / 15 Central limit theorem Let X 1, X 2,... X n be a random sample of

More information

Probability Generating Functions

Probability Generating Functions page 39 Chapter 3 Probability Generating Functions 3 Preamble: Generating Functions Generating functions are widely used in mathematics, and play an important role in probability theory Consider a sequence

More information

Statistics 100A Homework 4 Solutions

Statistics 100A Homework 4 Solutions Problem 1 For a discrete random variable X, Statistics 100A Homework 4 Solutions Ryan Rosario Note that all of the problems below as you to prove the statement. We are proving the properties of epectation

More information

Hypothesis Testing for Beginners

Hypothesis Testing for Beginners Hypothesis Testing for Beginners Michele Piffer LSE August, 2011 Michele Piffer (LSE) Hypothesis Testing for Beginners August, 2011 1 / 53 One year ago a friend asked me to put down some easy-to-read notes

More information

UNIT I: RANDOM VARIABLES PART- A -TWO MARKS

UNIT I: RANDOM VARIABLES PART- A -TWO MARKS UNIT I: RANDOM VARIABLES PART- A -TWO MARKS 1. Given the probability density function of a continuous random variable X as follows f(x) = 6x (1-x) 0

More information

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as

More information

Notes on Probability Theory

Notes on Probability Theory Notes on Probability Theory Christopher King Department of Mathematics Northeastern University July 31, 2009 Abstract These notes are intended to give a solid introduction to Probability Theory with a

More information

. (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i.

. (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i. Chapter 3 Kolmogorov-Smirnov Tests There are many situations where experimenters need to know what is the distribution of the population of their interest. For example, if they want to use a parametric

More information

Martingale Ideas in Elementary Probability. Lecture course Higher Mathematics College, Independent University of Moscow Spring 1996

Martingale Ideas in Elementary Probability. Lecture course Higher Mathematics College, Independent University of Moscow Spring 1996 Martingale Ideas in Elementary Probability Lecture course Higher Mathematics College, Independent University of Moscow Spring 1996 William Faris University of Arizona Fulbright Lecturer, IUM, 1995 1996

More information

Probability for Computer Scientists

Probability for Computer Scientists Probability for Computer Scientists This material is provided for the educational use of students in CSE2400 at FIT. No further use or reproduction is permitted. Copyright G.A.Marin, 2008, All rights reserved.

More information

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 Introduction to Discrete Probability Probability theory has its origins in gambling analyzing card games, dice,

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

More information

Lecture 3: Continuous distributions, expected value & mean, variance, the normal distribution

Lecture 3: Continuous distributions, expected value & mean, variance, the normal distribution Lecture 3: Continuous distributions, expected value & mean, variance, the normal distribution 8 October 2007 In this lecture we ll learn the following: 1. how continuous probability distributions differ

More information

Chapter ML:IV. IV. Statistical Learning. Probability Basics Bayes Classification Maximum a-posteriori Hypotheses

Chapter ML:IV. IV. Statistical Learning. Probability Basics Bayes Classification Maximum a-posteriori Hypotheses Chapter ML:IV IV. Statistical Learning Probability Basics Bayes Classification Maximum a-posteriori Hypotheses ML:IV-1 Statistical Learning STEIN 2005-2015 Area Overview Mathematics Statistics...... Stochastics

More information

Non Parametric Inference

Non Parametric Inference Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable

More information

MATH 140 Lab 4: Probability and the Standard Normal Distribution

MATH 140 Lab 4: Probability and the Standard Normal Distribution MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes

More information

Statistics 100A Homework 7 Solutions

Statistics 100A Homework 7 Solutions Chapter 6 Statistics A Homework 7 Solutions Ryan Rosario. A television store owner figures that 45 percent of the customers entering his store will purchase an ordinary television set, 5 percent will purchase

More information

Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber 2011 1

Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber 2011 1 Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2011 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields

More information

Math 370, Actuarial Problemsolving Spring 2008 A.J. Hildebrand. Practice Test, 1/28/2008 (with solutions)

Math 370, Actuarial Problemsolving Spring 2008 A.J. Hildebrand. Practice Test, 1/28/2008 (with solutions) Math 370, Actuarial Problemsolving Spring 008 A.J. Hildebrand Practice Test, 1/8/008 (with solutions) About this test. This is a practice test made up of a random collection of 0 problems from past Course

More information

Bayesian Tutorial (Sheet Updated 20 March)

Bayesian Tutorial (Sheet Updated 20 March) Bayesian Tutorial (Sheet Updated 20 March) Practice Questions (for discussing in Class) Week starting 21 March 2016 1. What is the probability that the total of two dice will be greater than 8, given that

More information

Ex. 2.1 (Davide Basilio Bartolini)

Ex. 2.1 (Davide Basilio Bartolini) ECE 54: Elements of Information Theory, Fall 00 Homework Solutions Ex.. (Davide Basilio Bartolini) Text Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips

More information

Probability density function : An arbitrary continuous random variable X is similarly described by its probability density function f x = f X

Probability density function : An arbitrary continuous random variable X is similarly described by its probability density function f x = f X Week 6 notes : Continuous random variables and their probability densities WEEK 6 page 1 uniform, normal, gamma, exponential,chi-squared distributions, normal approx'n to the binomial Uniform [,1] random

More information

INTRODUCTORY SET THEORY

INTRODUCTORY SET THEORY M.Sc. program in mathematics INTRODUCTORY SET THEORY Katalin Károlyi Department of Applied Analysis, Eötvös Loránd University H-1088 Budapest, Múzeum krt. 6-8. CONTENTS 1. SETS Set, equal sets, subset,

More information

Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80)

Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80) Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80) CS 30, Winter 2016 by Prasad Jayanti 1. (10 points) Here is the famous Monty Hall Puzzle. Suppose you are on a game show, and you

More information

6.041/6.431 Spring 2008 Quiz 2 Wednesday, April 16, 7:30-9:30 PM. SOLUTIONS

6.041/6.431 Spring 2008 Quiz 2 Wednesday, April 16, 7:30-9:30 PM. SOLUTIONS 6.4/6.43 Spring 28 Quiz 2 Wednesday, April 6, 7:3-9:3 PM. SOLUTIONS Name: Recitation Instructor: TA: 6.4/6.43: Question Part Score Out of 3 all 36 2 a 4 b 5 c 5 d 8 e 5 f 6 3 a 4 b 6 c 6 d 6 e 6 Total

More information

Polarization codes and the rate of polarization

Polarization codes and the rate of polarization Polarization codes and the rate of polarization Erdal Arıkan, Emre Telatar Bilkent U., EPFL Sept 10, 2008 Channel Polarization Given a binary input DMC W, i.i.d. uniformly distributed inputs (X 1,...,

More information

Math 461 Fall 2006 Test 2 Solutions

Math 461 Fall 2006 Test 2 Solutions Math 461 Fall 2006 Test 2 Solutions Total points: 100. Do all questions. Explain all answers. No notes, books, or electronic devices. 1. [105+5 points] Assume X Exponential(λ). Justify the following two

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

Statistics 100A Homework 3 Solutions

Statistics 100A Homework 3 Solutions Chapter Statistics 00A Homework Solutions Ryan Rosario. Two balls are chosen randomly from an urn containing 8 white, black, and orange balls. Suppose that we win $ for each black ball selected and we

More information

MIMO CHANNEL CAPACITY

MIMO CHANNEL CAPACITY MIMO CHANNEL CAPACITY Ochi Laboratory Nguyen Dang Khoa (D1) 1 Contents Introduction Review of information theory Fixed MIMO channel Fading MIMO channel Summary and Conclusions 2 1. Introduction The use

More information

ECE302 Spring 2006 HW4 Solutions February 6, 2006 1

ECE302 Spring 2006 HW4 Solutions February 6, 2006 1 ECE302 Spring 2006 HW4 Solutions February 6, 2006 1 Solutions to HW4 Note: Most of these solutions were generated by R. D. Yates and D. J. Goodman, the authors of our textbook. I have added comments in

More information

Math/Stats 342: Solutions to Homework

Math/Stats 342: Solutions to Homework Math/Stats 342: Solutions to Homework Steven Miller (sjm1@williams.edu) November 17, 2011 Abstract Below are solutions / sketches of solutions to the homework problems from Math/Stats 342: Probability

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2014 Timo Koski () Mathematisk statistik 24.09.2014 1 / 75 Learning outcomes Random vectors, mean vector, covariance

More information

The Binomial Distribution

The Binomial Distribution The Binomial Distribution James H. Steiger November 10, 00 1 Topics for this Module 1. The Binomial Process. The Binomial Random Variable. The Binomial Distribution (a) Computing the Binomial pdf (b) Computing

More information

ECE302 Spring 2006 HW3 Solutions February 2, 2006 1

ECE302 Spring 2006 HW3 Solutions February 2, 2006 1 ECE302 Spring 2006 HW3 Solutions February 2, 2006 1 Solutions to HW3 Note: Most of these solutions were generated by R. D. Yates and D. J. Goodman, the authors of our textbook. I have added comments in

More information

Math 370/408, Spring 2008 Prof. A.J. Hildebrand. Actuarial Exam Practice Problem Set 2 Solutions

Math 370/408, Spring 2008 Prof. A.J. Hildebrand. Actuarial Exam Practice Problem Set 2 Solutions Math 70/408, Spring 2008 Prof. A.J. Hildebrand Actuarial Exam Practice Problem Set 2 Solutions About this problem set: These are problems from Course /P actuarial exams that I have collected over the years,

More information

Basics of information theory and information complexity

Basics of information theory and information complexity Basics of information theory and information complexity a tutorial Mark Braverman Princeton University June 1, 2013 1 Part I: Information theory Information theory, in its modern format was introduced

More information

Examples: Joint Densities and Joint Mass Functions Example 1: X and Y are jointly continuous with joint pdf

Examples: Joint Densities and Joint Mass Functions Example 1: X and Y are jointly continuous with joint pdf AMS 3 Joe Mitchell Eamples: Joint Densities and Joint Mass Functions Eample : X and Y are jointl continuous with joint pdf f(,) { c 2 + 3 if, 2, otherwise. (a). Find c. (b). Find P(X + Y ). (c). Find marginal

More information

Probabilities. Probability of a event. From Random Variables to Events. From Random Variables to Events. Probability Theory I

Probabilities. Probability of a event. From Random Variables to Events. From Random Variables to Events. Probability Theory I Victor Adamchi Danny Sleator Great Theoretical Ideas In Computer Science Probability Theory I CS 5-25 Spring 200 Lecture Feb. 6, 200 Carnegie Mellon University We will consider chance experiments with

More information