Stat Camp for the Full-time MBA Program
Daniel Solow
Lecture 4: The Normal Distribution and the Central Limit Theorem

Example 1: Dear Abby
You wrote that a woman is pregnant for 266 days. Who said so? I carried my baby for ten months and five days, and there is no doubt about it because I know the exact date my baby was conceived. My husband is in the Navy and it couldn't possibly have been any other time because I saw him only once for an hour, and I didn't see him again until the day before the baby was born. I don't drink or run around, and there is no way this baby isn't his, so please print a retraction about the 266-day carrying time because otherwise I am in a lot of trouble.
San Diego Reader

Dear Abby
Step 1: Identify an appropriate random variable.
Y = number of days of pregnancy
What are the possible values for Y? About 230 to 290?
What is the density function for Y? ???
[Figure: probability density versus days, horizontal axis marked from 255 to 275.]
Idea: Approximate the density of Y with a normal!

Dear Abby
Question: If you are going to use a normal approximation, what information do you need?
Answer: The mean and standard deviation.
Fact: According to the collective experience of generations of pediatricians, pregnancies have a mean of 266 days and a standard deviation of 16 days, so \(Y \sim N(\mu = 266, \sigma = 16)\).
Question: What are the possible values for Y? \(-\infty\) to \(+\infty\).
Question: How can the number of days of pregnancy be < 230?
Answer: Using the normal distribution, you have that \(P(Y < 230)\) = NORMDIST(230, 266, 16, TRUE) ≈ 0.01. Thus, when using the normal approximation, there is only about a 1% chance that a pregnancy lasts less than 230 days.
Models are NOT the real world but hopefully good approximations!

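The check on the lower end of the plausible range can also be reproduced outside Excel. The sketch below assumes Python's standard-library `statistics.NormalDist` as a stand-in for NORMDIST (the lecture itself uses Excel) and also looks at the upper end of the "about 230 to 290" range guessed in Step 1.

```python
from statistics import NormalDist

# Model from the slide: Y ~ N(mean 266 days, sd 16 days).
pregnancy = NormalDist(mu=266, sigma=16)

# Same quantity as NORMDIST(230, 266, 16, TRUE): P(Y < 230).
p_below_230 = pregnancy.cdf(230)

# Mass the normal model puts above the guessed upper value of 290 days.
p_above_290 = 1 - pregnancy.cdf(290)

print(f"P(Y < 230) = {p_below_230:.4f}")
print(f"P(Y > 290) = {p_above_290:.4f}")
```

Both tails are small but not zero, which is the sense in which the normal curve is an approximation rather than the real world.
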
Dear Abby
Step 2: State what you are looking for as a probability question in terms of the rv.
You want to find \(P(Y \ge 10 \text{ months and } 5 \text{ days}) = P(Y \ge 310)\).
Step 3: Use the probability distribution of the rv to answer the probability question.
\(P(Y \ge 310) = 1 - P(Y < 310)\) = 1 - NORMDIST(310, 266, 16, TRUE) = 0.00298
Was she telling the truth? Possibly, but highly unlikely.

Example 2: Problem of GoodTire
GoodTire has a new tire for which, in order to be competitive, they want to offer a warranty of 30,000 miles. Before doing so, the company wants to know what fraction of tires they can expect to be returned under the warranty.

The Problem of GoodTire
Step 1: Identify an appropriate random variable.
For GoodTire, let X = number of miles such a tire will last.
What are the possible values for X? 0 to 90,000?
What is the density function for X? ???
From statistical analysis of a random sample, GoodTire believes the mileage follows approximately a normal distribution with a mean of 40,000 miles and a standard deviation of 10,000 miles, so assume that \(X \sim N(\mu = 40000, \sigma = 10000)\) with possible values \(-\infty\) to \(+\infty\).

The Problem of GoodTire
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
GoodTire wants to know:
Fraction of tires returned = Likelihood a tire fails under the warranty = \(P\{X \le 30000\}\) = ?

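For the Dear Abby calculation (Step 3 of Example 1), the complement rule can be written out explicitly. This is a minimal sketch that assumes Python's `statistics.NormalDist` in place of Excel's NORMDIST; the variable names are illustrative only.

```python
from statistics import NormalDist

pregnancy = NormalDist(mu=266, sigma=16)   # Y ~ N(266, 16), in days
claimed_days = 310                         # "ten months and five days", as on the slide

# Complement rule: P(Y >= 310) = 1 - P(Y < 310).
p_at_least_claimed = 1 - pregnancy.cdf(claimed_days)

print(f"P(Y >= {claimed_days}) = {p_at_least_claimed:.5f}")
print(f"roughly 1 pregnancy in {1 / p_at_least_claimed:.0f}")
```

Expressing the tail probability as "1 in N" is just another way to read "possibly, but highly unlikely".
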
The Problem of GoodTire
Step 3: Use the probability distribution of the random variable to answer the probability question.
For GoodTire, you have \(P\{X \le 30000\}\) = NORMDIST(30000, 40000, 10000, TRUE) = 0.1587
[Figure: density of \(X \sim N(40000, 10000)\) with 30000 and 40000 marked on the horizontal axis.]

The Problem of GoodTire
Question: The CEO finds that a 16% return rate is too high. What warranty mileage s should they offer to get a 5% return rate?
Step 2: Probability Question: What should s be so that \(P\{X \le s\} = 0.05\)?
Step 3: s = NORMINV(0.05, 40000, 10000) ≈ 23,551.5 miles
[Figure: normal density centered at 40000 with a left-tail area of 0.05 below s = ?.]
Fact: While you cannot control the value of a rv, you can control the likelihood of certain events occurring with that rv.

Example 3: Marketing Projections
From historical data over a number of years, a firm knows that its annual sales average $25 million. For planning purposes, the CEO wants to know the likelihood that sales next year will:
Exceed $30 million.
Be within $1.5 million of the average.
The CEO is willing to issue bonuses if sales are sufficiently high. What level should be set so that bonuses are given at most 20% of the time?

Marketing Projections
Step 1: Identify an appropriate random variable.
Let Y = next year's sales in $ millions.
What are the possible values for Y? 0 to 50?
What is the density function for Y? ???
From statistical analysis over a number of years, they believe that annual sales follow approximately a normal distribution with a mean of $25 million and a standard deviation of $3 million, so assume that \(Y \sim N(\mu = 25, \sigma = 3)\).

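The two GoodTire questions run in opposite directions: one goes from a mileage to a probability (NORMDIST), the other from a probability back to a mileage (NORMINV). A small sketch, assuming Python's `statistics.NormalDist` as the Excel substitute, shows both and checks that they undo each other.

```python
from statistics import NormalDist

tire_life = NormalDist(mu=40_000, sigma=10_000)   # X ~ N(40000, 10000), in miles

# Mileage -> probability, like NORMDIST(30000, 40000, 10000, TRUE).
return_rate_30k = tire_life.cdf(30_000)

# Probability -> mileage, like NORMINV(0.05, 40000, 10000).
warranty_for_5pct = tire_life.inv_cdf(0.05)

# Round trip: the return rate at the new warranty should be 5% again.
round_trip = tire_life.cdf(warranty_for_5pct)

print(f"return rate at 30,000 miles: {return_rate_30k:.4f}")
print(f"warranty for a 5% return rate: {warranty_for_5pct:,.1f} miles")
```
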
Marketing Projections
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
You want to know:
P(sales exceed $30 million) = \(P(Y \ge 30)\).
P(sales are within $1.5 million of $25 million) = \(P(23.5 \le Y \le 26.5)\).
What should be the value of sales (s) so that P(giving a bonus) = 0.20? That is, what s gives \(P(Y \ge s) = 0.20\)?

Marketing Projections
Step 3: Use the probability distribution of the random variable to answer the probability question.
From Excel, using \(\mu = 25\) and \(\sigma = 3\):
\(P(Y \ge 30)\) = 1 - NORMDIST(30, 25, 3, TRUE) = 0.048.
\(P(23.5 \le Y \le 26.5)\) = NORMDIST(26.5, 25, 3, TRUE) - NORMDIST(23.5, 25, 3, TRUE) = 0.383.
s = NORMINV(0.8, 25, 3) = 27.525.

Example 4: DUI Test
In many states, a driver is legally drunk if the blood alcohol concentration, as determined by a breath analyzer, is 0.10% or higher. Suppose that a driver has a true blood alcohol concentration of 0.095%. With the breath analyzer test, what is the probability that the person will be (incorrectly) booked on a DUI charge?
Step 1: Identify an appropriate random variable.
Let Y = the measurement of the analyzer as a %.
Question: What are the possible values for Y? 0 to 0.3?

DUI Test
Step 1 (continued).
Question: What is the density function for Y?
Answer: We do not know, but experience indicates that Y follows approximately a normal distribution with mean equal to the person's true alcohol level and standard deviation equal to 0.004%, so \(Y \sim N(\mu, \sigma = 0.004)\), where \(\mu\) = the person's true blood alcohol level (%).

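The three Marketing Projections answers use three different patterns: an upper tail, an interval, and an upper-tail cutoff. The sketch below, which assumes Python's `statistics.NormalDist` rather than the Excel functions used in the lecture, shows one line for each pattern.

```python
from statistics import NormalDist

sales = NormalDist(mu=25, sigma=3)   # Y ~ N(25, 3), in $ millions

# Upper tail: P(Y >= 30) = 1 - P(Y < 30).
p_exceed_30 = 1 - sales.cdf(30)

# Interval: P(23.5 <= Y <= 26.5) = P(Y <= 26.5) - P(Y <= 23.5).
p_within_1_5 = sales.cdf(26.5) - sales.cdf(23.5)

# Upper-tail cutoff: P(Y >= s) = 0.20 means P(Y <= s) = 0.80.
bonus_threshold = sales.inv_cdf(1 - 0.20)

print(f"P(Y >= 30)           = {p_exceed_30:.3f}")
print(f"P(23.5 <= Y <= 26.5) = {p_within_1_5:.3f}")
print(f"bonus threshold s    = {bonus_threshold:.3f}")
```

Note that the cutoff is looked up at 0.80, not 0.20, because the inverse function works with the area to the left of s.
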
DUI Test
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
You want to know the probability that a person with \(\mu = 0.095\) will be (incorrectly) booked on a DUI charge:
P(being booked on a DUI) = \(P(Y \ge 0.10)\)

DUI Test
Step 3: Use the probability distribution of the random variable to answer the probability question.
From Excel (using \(\mu = 0.095\) and \(\sigma = 0.004\)):
\(P(Y \ge 0.10)\) = 1 - NORMDIST(0.10, 0.095, 0.004, TRUE) = 0.1056.
There is about a 10% chance that such a person will be incorrectly charged with a DUI.

An Insurance Problem
GoodHands is considering insuring employees of GoodTire. What annual premium should the company charge to be sure that there is a likelihood of no more than 1% of losing money on each customer?
This is an example of decision making under uncertainty: you have to make a decision today (how much should the annual premium be?) facing an uncertain future.
Question: Why is the future uncertain?

Solving the Insurance Problem
Step 1: Identify an appropriate random variable.
Let X = the $ claimed by a customer in one year.
What are the possible values for X? [0, 100000 (?)]
Is X continuous or discrete? Discrete.
What is the density function for X? It is unknown, so borrow one.
From statistical analysis, the annual claim for these people follows approximately a normal distribution with a mean of $2500 and a standard deviation of $1000, so:
\(X \sim N(\mu = 2500, \sigma = 1000)\) (discrete or continuous?)
Note: It can be OK to approximate a discrete rv with a continuous distribution.

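In the DUI example the mean of the reading is the driver's true level, so the calculation is naturally a function of that level. The sketch below assumes Python's `statistics.NormalDist`; the function name `prob_booked` is hypothetical and only the 0.095% case comes from the slides.

```python
from statistics import NormalDist

LEGAL_LIMIT = 0.10    # %, reading at or above this means a DUI charge
ANALYZER_SD = 0.004   # %, standard deviation of the analyzer reading

def prob_booked(true_level: float) -> float:
    """P(reading >= legal limit) when the reading is N(true_level, ANALYZER_SD)."""
    reading = NormalDist(mu=true_level, sigma=ANALYZER_SD)
    return 1 - reading.cdf(LEGAL_LIMIT)

# The case from the slides: true level 0.095%, just under the limit.
p_booked = prob_booked(0.095)
print(f"P(booked | true level 0.095%) = {p_booked:.4f}")
```

A driver whose true level sits exactly at the limit would be booked half the time under this model, since the normal curve is symmetric about its mean.
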
An Insurance Problem
Step 2: State what you are looking for in terms of a probability question pertaining to the rv.
For GoodHands, what should the premium s be so that the likelihood of losing money is no more than 1%?
Question: When do you lose money on a customer? When the claim X exceeds the premium s.
Probability Question: What should the premium s be so that \(P(X \ge s) = 0.01\)?
[Figure: density of \(X \sim N(2500, 1000)\) with a right-tail area \(P\{X \ge s\} = 0.01\) beyond s.]

An Insurance Problem
Step 3: Use the probability distribution of the random variable to answer the probability question.
s = NORMINV(0.99, 2500, 1000) = $4826.35
[Figure: density of \(X \sim N(2500, 1000)\) with a right-tail area \(P\{X \ge s\} = 0.01\) beyond s.]
Fact: While you cannot control the value of a rv (such as the claim of a person), you can control the likelihood of certain events occurring with that rv (such as the likelihood of such a claim exceeding the premium).

The Insurance Problem (cont.)
Question: GoodTire wants to insure all 100 of its employees through GoodHands. What premium should GoodHands charge per employee so that the likelihood of losing money on the average of all these claims is 1%?
Step 1: Identify appropriate random variables. For GoodHands, let
\(X_i\) = the annual $ claim of customer i (i = 1, ..., 100), with \(X_i \sim N(\mu = 2500, \sigma = 1000)\)
\(\bar{X} = (X_1 + \cdots + X_{100}) / 100\)
Question: What is the distribution of the random variable \(\bar{X}\)?
Answer: You do not know. However, because \(\bar{X}\) is the AVERAGE of other rvs, try the Central Limit Theorem.

The Central Limit Theorem
The Central Limit Theorem provides an approximate density function when the rv you are interested in is the average of n other rvs, say, \(X_1, X_2, \ldots, X_n\), that are:
(1) Independent (knowing the value of one rv tells you nothing about the values of the other rvs).
(2) Identically distributed (have the same density function with mean \(\mu\) and standard deviation \(\sigma\)).
Then, for large n,
\[ \bar{X} = \frac{X_1 + \cdots + X_n}{n} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \quad \text{(approx.)} \]

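The key quantitative claim in the CLT statement is that the average of n rvs has standard deviation \(\sigma / \sqrt{n}\). A quick simulation can make that concrete for the insurance numbers. This is only a sketch: the number of simulated groups and the random seed are arbitrary choices, not part of the lecture, and Python's standard library stands in for Excel.

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(0)  # fixed seed so reruns give the same numbers

MU, SIGMA, N = 2500, 1000, 100   # per-customer claim model and group size from the slides
TRIALS = 5_000                   # number of simulated groups (an arbitrary choice)

# Each trial: draw 100 independent claims and record their average.
averages = [
    fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(TRIALS)
]

simulated_sd = stdev(averages)   # spread of the simulated averages
clt_sd = SIGMA / N ** 0.5        # sigma / sqrt(n) from the CLT statement

# For comparison, the one-customer premium: NORMINV(0.99, 2500, 1000).
single_premium = NormalDist(MU, SIGMA).inv_cdf(0.99)

print(f"simulated sd of the average: {simulated_sd:.1f}")
print(f"CLT sd of the average:       {clt_sd:.1f}")
print(f"one-customer premium:        {single_premium:.2f}")
```

The averages vary far less than individual claims do, which is why a premium based on the average of many claims can sit much closer to the mean than the one-customer premium.
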
The Insurance Problem (cont.)
For the insurance problem, you have
\(X_i\) = annual $ claimed by person i (i = 1, ..., 100), with \(X_i \sim N(\mu = 2500, \sigma = 1000)\).
(1) Are \(X_1, X_2, \ldots, X_{100}\) independent random variables? Yes, because the amount claimed by one person has no effect on the amount claimed by another person.
(2) Are \(X_1, X_2, \ldots, X_{100}\) identically distributed? Yes, because each \(X_i\) has the same N(2500, 1000) distribution.
Therefore, by the CLT, \(\bar{X}\) is approximately Normal with
\[ \bar{X} = \frac{X_1 + \cdots + X_{100}}{100} \sim N\left(2500, \frac{1000}{\sqrt{100}}\right) = N(2500, 100). \]

An Insurance Problem
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
For GoodHands, what should the premium s be so that the probability that the average of the 100 claims exceeds s is 0.01?
Probability Question: What should s be so that
\[ P\left(\bar{X} = \frac{X_1 + \cdots + X_{100}}{100} \ge s\right) = 0.01? \]

An Insurance Problem (cont.)
Probability Question: What should the premium s be so that \(P(\bar{X} \ge s) = 0.01\)?
[Figure: density of \(\bar{X} \sim N(2500, 100)\) with a right-tail area \(P\{\bar{X} \ge s\} = 0.01\) beyond s.]
Step 3: Use the probability distribution of the random variable to answer the probability question.
s = NORMINV(0.99, 2500, 100) = $2732.63

Another Example of the CLT
In modeling the performance of a team with 5 people, consider the following five rvs:
\(P_i\) = performance contribution of person i (i = 1, ..., 5)
Possible values: [0, 1] (continuous)
Density function: U[0, 1]
\(E[P_i] = \mu = 0.5\)
\(\mathrm{STDEV}[P_i] = \sigma = 1/\sqrt{12} \approx 0.29\)
However, what is of interest is the team performance, so let

Another Example of the CLT
T = performance of the whole team = \((P_1 + P_2 + P_3 + P_4 + P_5) / 5\)
Possible values: [0, 1] (continuous)
Density function: ???
You cannot easily find the true density function, so borrow one.
Because the rv T is the average of other rvs, think of using the Central Limit Theorem to approximate the density function of T.

The Team Problem
For the team problem, you have
\(P_i\) = performance of person i (i = 1, 2, 3, 4, 5), with \(P_i \sim U[0, 1]\), mean \(\mu = 0.5\) and std. dev. \(\sigma = 0.29\).
(1) Are \(P_1, P_2, P_3, P_4, P_5\) independent random variables? Yes, assuming that the performance of a person says nothing about the performance of another person.
(2) Are \(P_1, P_2, P_3, P_4, P_5\) identically distributed? Yes, because each \(P_i\) has the same U[0, 1] distribution.
Therefore, by the CLT, T is approximately Normal with
\[ T = \frac{P_1 + P_2 + P_3 + P_4 + P_5}{5} \sim N\left(0.5, \frac{0.29}{\sqrt{5}}\right) = N(0.5, 0.13). \]

The Team Problem
Question: What is the probability that the team performance is at least 0.75?
\(P(T \ge 0.75)\) = 1 - NORMDIST(0.75, 0.5, 0.13, TRUE) = 0.027
[Figure: density of \(T \sim N(0.5, 0.13)\) with the area \(P(T \ge 0.75)\) to the right of 0.75.]

The Average of a Sample
Suppose you are going to record the numbers \(X_1, X_2, \ldots, X_n\) taken from a sample of size n from a population and then compute:
\[ \bar{X} = \frac{X_1 + \cdots + X_n}{n} \]
Is \(\bar{X}\) a rv? The answer depends on timing.
If you have already taken the sample, then \(\bar{X}\) is NOT a rv.
If you have not yet taken the sample, then \(\bar{X}\) IS a rv.
All possible values: The (finite) list of averages of every group of size n in the population.
Groups of size n: G1, G2, G3, ...
\(\bar{X}\) for the group: A1, A2, A3, ...
\(\bar{X}\) is discrete, but there is no practical way to list the possible values, so YOU CANNOT WRITE THE DENSITY FUNCTION.

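Because the team in the Team Problem has only 5 members, it can be instructive to compare the CLT answer with a direct simulation of T. The sketch below is an illustration under assumptions: the trial count and seed are arbitrary, Python's standard library replaces Excel, and the normal calculation uses the unrounded \(\sigma = 1/\sqrt{12}\) instead of the slide's rounded 0.29 and 0.13.

```python
import random
from statistics import NormalDist

random.seed(1)        # fixed seed so reruns give the same numbers

TEAM_SIZE = 5
THRESHOLD = 0.75
TRIALS = 200_000      # number of simulated teams (an arbitrary choice)

def team_performance() -> float:
    """Average of TEAM_SIZE independent U[0, 1] contributions."""
    return sum(random.random() for _ in range(TEAM_SIZE)) / TEAM_SIZE

# Fraction of simulated teams whose average is at least the threshold.
simulated = sum(team_performance() >= THRESHOLD for _ in range(TRIALS)) / TRIALS

# CLT approximation: T ~ N(0.5, sigma / sqrt(5)) with sigma = 1 / sqrt(12).
sigma_T = (1 / 12) ** 0.5 / TEAM_SIZE ** 0.5
clt_estimate = 1 - NormalDist(0.5, sigma_T).cdf(THRESHOLD)

print(f"simulated P(T >= 0.75): {simulated:.4f}")
print(f"CLT estimate:           {clt_estimate:.4f}")
```

The two numbers should be close but not identical, a reminder that the CLT gives an approximation whose quality depends on n.
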
The Average of a Sample
\[ \bar{X} = \frac{X_1 + \cdots + X_n}{n} \]
The rvs \(X_1, X_2, \ldots, X_n\) are iid from the same population with mean = \(\mu\) and std. dev. = \(\sigma\).
Solution: Because \(\bar{X}\) is the average of rvs, think of using the CLT which, if applicable, results in the following density function for \(\bar{X}\):
\[ \bar{X} \sim N(\mu, \sigma / \sqrt{n}) \]
Possible Values: \((-\infty, +\infty)\)
Now you can use the Normal Distribution to answer your probability question about \(\bar{X}\).

A Final Example of the CLT
Historical data collected at a paper mill show that 40% of sheet breaks are due to water drops, resulting from the condensation of steam. Suppose that the causes of the next 100 sheet breaks are monitored and that the sheet breaks are independent of one another.
Find the expected value and the standard deviation of the number of sheet breaks that will be caused by water drops.
What is the probability that at least 35 of the breaks will be due to water drops?

Exact Answer
Success = break due to water drops
P(success) = p = 0.4
X = number of breaks due to water drops
X is Binomial with n = 100 and p = 0.4
\(E(X) = np = (100)(0.4) = 40\)
\(SD(X) = \sqrt{np(1 - p)} = \sqrt{(100)(0.4)(0.6)} = \sqrt{24} \approx 4.9\)
From Excel:
\(P(X \ge 35) = 1 - P(X < 35) = 1 - P(X \le 34)\) = 1 - BINOMDIST(34, 100, 0.4, TRUE) = 0.8697

Normal Approx. to Binomial
For this problem, let p = P(success) = 0.4, and
\[ X_i = \begin{cases} 1, & \text{if a success on trial } i \\ 0, & \text{if a failure on trial } i \end{cases} \qquad i = 1, \ldots, 100 \]
In this problem, you are interested in the rv
X = number of successes in 100 trials = \(X_1 + X_2 + \cdots + X_{100}\)
To find \(P(X \ge 35) = P(X / 100 \ge 35 / 100)\), you need to know the probability distribution of \(\bar{X} = X / 100\), which, by the CLT, is approximately normal, so

Normal Approx. to Binomial
Each \(X_i \sim \text{Binomial}(1, p = 0.4)\), so
\(E[X_i] = \mu = p = 0.4\)
\(SD[X_i] = \sigma = \sqrt{p(1 - p)} = 0.49\)
Assuming that the \(X_i\) are independent and n = 100 is large enough (np > 5 and n(1 - p) > 5), then by the CLT, the random variable
\[ \bar{X} = \frac{X_1 + \cdots + X_{100}}{100} \sim N(\mu, \sigma / \sqrt{n}) = N(0.4, 0.049) \]

Normal Approx. to Binomial
Then, for \(X = X_1 + \cdots + X_{100}\) and \(\bar{X} = X / 100\),
\(P(X \ge 35) = P(X / 100 \ge 35 / 100) = P(\bar{X} \ge 0.35)\) = 1 - NORMDIST(0.35, 0.4, 0.049, TRUE) = 0.85.
(The exact answer was 0.87.)

Review of Basic Math

Review of Functions
A function y = f(x) describes a relationship between the two quantitative variables x and y.
\(y = f(x) = x + 2\) (a linear relationship)
\(y = f(x) = x^2 - 2x + 1\) (a nonlinear relationship)
You can represent a function visually as follows:
[Figure: graph of y = f(x) against x.]
You can also think of a function f as transforming an input x into an output y, as follows:
[Figure: input x goes into f, which produces the output f(x) = y.]
Note: A function f can have many input values, instead of just one.

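For the sheet-break example (the exact binomial answer and its normal approximation), both calculations are short enough to write out directly. The sketch below assumes Python's standard library in place of Excel's BINOMDIST and NORMDIST; `binom_cdf` is a hypothetical helper written for this illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.4   # 100 monitored sheet breaks, 40% due to water drops

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Exact: P(X >= 35) = 1 - P(X <= 34), like 1 - BINOMDIST(34, 100, 0.4, TRUE).
exact = 1 - binom_cdf(34, n, p)

# CLT: the proportion X/n is approximately N(p, sqrt(p(1-p)/n)).
sd_of_proportion = sqrt(p * (1 - p) / n)
approx = 1 - NormalDist(p, sd_of_proportion).cdf(35 / n)

print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

The gap between the two is the price of replacing a discrete distribution with a continuous one.
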
Review of Linear Equations
A linear equation, y = mx + b, provides a relationship between the two variables, x and y, in which:
b = the y-intercept = the value of y when x = 0.
m = the slope of the line = the change in y per unit of increase in x.
m > 0: as x increases, y increases.
m = 0: as x increases, y remains the same.
m < 0: as x increases, y decreases.
[Figure: the line y = mx + b with y-intercept b and a rise of m as x increases to x + 1, plus sketches of lines with m > 0, m = 0, and m < 0.]

An Example of a Line
If
y = the thousands of bushels of wheat
x = the number of inches of rain
then, for the line y = 80x + 71,
b = 71 means that there are 71,000 bushels of wheat when there is no rain.
m = 80 means that each extra inch of rain results in 80,000 more bushels of wheat.

A Different Equation for a Line
Sometimes a line is written in the form:
\(a_1 x_1 + a_2 x_2 = c\)
Assuming that \(a_2 \ne 0\), you can solve for \(x_2\):
\(x_2 = -(a_1 / a_2)\, x_1 + (c / a_2)\), which has the form y = mx + b.

How Large is Large Enough?
For symmetric but outlier-prone data, n = 15 samples should be enough to use the normal approximation.
For mild skewness, n = 30 should generally be sufficient to make the normal approximation appropriate.
For severe skewness, n should be at least 100 to use the normal approximation.
Generally speaking, the larger n is, the better the normal approximation is.

Graphing a Line
To draw the graph of the line \(a_1 x_1 + a_2 x_2 = b\):
Find two different points on the line (usually by setting \(x_1 = 0\) and finding \(x_2\), and then setting \(x_2 = 0\) and finding \(x_1\)).
Plot these two points on a graph.
Draw the straight line through those two points.

Example of Graphing a Line
The line: \(2x_1 + x_2 = 230\)
When \(x_1 = 0\), \(x_2 = 230\)
When \(x_2 = 0\), \(x_1 = 115\)
[Figure: the line \(2x_1 + x_2 = 230\) plotted with \(x_1\) on the horizontal axis and \(x_2\) on the vertical axis, both marked at 100, 200, and 300.]
Note: Any point on the line gives a value for \(x_1\) and a value for \(x_2\) that satisfies \(2x_1 + x_2 = 230\).

Solving Two Linear Equations
Objective: Solve the following two equations for \(x_1\) and \(x_2\):
\(2x_1 + x_2 = 230\) (a)
\(x_1 + 2x_2 = 250\) (b)
Solution Procedure:
Solve (a) for \(x_2\): \(x_2 = 230 - 2x_1\) (c)
Substitute \(x_2 = 230 - 2x_1\) in (b): \(x_1 + 2(230 - 2x_1) = -3x_1 + 460 = 250\) (d)
Solve (d) for \(x_1\): \(x_1 = 70\)
Substitute \(x_1 = 70\) in (c): \(x_2 = 230 - 2x_1 = 90\).

Another Approach
Objective: Solve the following for \(x_1\) and \(x_2\):
(a) \(2x_1 + x_2 = 230\)
(b) \(x_1 + 2x_2 = 250\)
Alternative Procedure:
Multiply (a) through by 2: (c) \(4x_1 + 2x_2 = 460\)
Subtract (b) from (c): (d) \(3x_1 = 210\)
Solve (d) for \(x_1\): \(x_1 = 70\)
Substitute \(x_1 = 70\) in (a) and solve for \(x_2\): \(x_2 = 230 - 2x_1 = 90\)
Note: There are computer packages for solving n linear equations in n unknowns.

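The elimination steps of the alternative procedure translate almost line for line into code. The following is a minimal sketch for two equations only, not one of the computer packages the note refers to; the function name and the tuple layout for an equation are choices made for this illustration.

```python
from fractions import Fraction

def solve_two_equations(eq_a, eq_b):
    """Solve two linear equations given as (coef_x1, coef_x2, rhs) tuples.

    Mirrors the elimination steps: scale (a) so its x2 coefficient matches
    (b), subtract (b) to eliminate x2, solve for x1, then back-substitute in (a).
    """
    a1, a2, a_rhs = map(Fraction, eq_a)
    b1, b2, b_rhs = map(Fraction, eq_b)
    if a2 == 0:
        raise ValueError("equation (a) must contain x2 for this procedure")

    scale = b2 / a2                           # multiply (a) through by this
    c1, c_rhs = a1 * scale, a_rhs * scale     # equation (c)
    d1, d_rhs = c1 - b1, c_rhs - b_rhs        # equation (d): x2 has cancelled
    if d1 == 0:
        raise ValueError("the equations do not have a unique solution")

    x1 = d_rhs / d1
    x2 = (a_rhs - a1 * x1) / a2               # back-substitute in (a)
    return x1, x2

# The system from the slides: 2*x1 + x2 = 230 and x1 + 2*x2 = 250.
x1, x2 = solve_two_equations((2, 1, 230), (1, 2, 250))
print(x1, x2)  # 70 90
```

Exact fractions are used instead of floating point so that the result matches the hand calculation digit for digit.
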
Exponentials
An exponent is the power to which a number (called the base) is raised.
Example: \(2^5\) (base = 2; exponent = 5)
Question: How much will $1000 be worth after 5 years at 6% compound interest?

            Year 1      Year 2      Year 3      Year 4      Year 5
Principal   $1,000.00   $1,060.00   $1,123.60   $1,191.02   $1,262.48
Interest       $60.00      $63.60      $67.42      $71.46      $75.75
Total       $1,060.00   $1,123.60   $1,191.02   $1,262.48   $1,338.23

Answer: Total = \(f(P, r, n) = P(1 + r)^n = 1000(1 + 0.06)^5 = 1338.23\)

Properties of Exponents
Laws of Exponents:
\(x^{a+b} = x^{b+a} = x^a x^b\) (example: \(2^{3+2} = 2^3 \cdot 2^2\))
\((x^a)^b = (x^b)^a = x^{ab}\) (example: \((2^3)^2 = 2^6\))
\(x^{-a} = 1 / x^a\) (example: \(2^{-3} = 1 / 2^3 = 1 / 8\))
\(x^0 = 1\)
Exponential Functions Increase and Decrease Rapidly:
[Figure: two plots over x from 0 to 25, y = 2^x (vertical axis up to 1,200,000) and y = 2^(-x) (vertical axis from 0 to 0.6).]

Scientific Notation
Scientific Notation: \(a \times 10^b\) (also written as a E±b) means move the decimal point of a:
b positions to the right, if b > 0.
|b| positions to the left, if b < 0.
Example: \(4.000 \times 10^3\) = 4.000 E+3 = 4000.
Example: \(4 \times 10^{-3}\) = 4 E-3 = 0.004.

Logarithms
The log base b of x [written \(\log_b(x)\)] is the power to which you must raise b to get x.
Examples: \(\log_{10}(100) = 2\), \(\log_2(32) = 5\)
Logs are only defined for positive numbers.
If the base is omitted, the default is 10.
The base e = 2.718... is used in some financial applications (such as continuous compounding), in which case \(\log_e(x)\) is written as ln(x) (the natural log of x).

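The year-by-year compound interest table and the closed-form answer \(P(1 + r)^n\) describe the same process, which a short loop makes visible. This is an illustrative sketch; the function name `compound_schedule` is hypothetical.

```python
def compound_schedule(principal: float, rate: float, years: int):
    """Return one (year, starting principal, interest, total) row per year."""
    rows = []
    balance = principal
    for year in range(1, years + 1):
        interest = balance * rate      # interest earned on the current balance
        total = balance + interest     # becomes next year's principal
        rows.append((year, balance, interest, total))
        balance = total
    return rows

rows = compound_schedule(1000.0, 0.06, 5)
for year, start, interest, total in rows:
    print(f"Year {year}: principal {start:9.2f}  interest {interest:6.2f}  total {total:9.2f}")

# The closed form from the slide gives the same final total.
closed_form = 1000 * (1 + 0.06) ** 5
print(f"P(1 + r)^n = {closed_form:.2f}")
```

Each pass through the loop multiplies the balance by (1 + r), so n passes multiply it by \((1 + r)^n\).
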
Laws of Logarithms
Logs convert products to sums, that is, \(\log_b(xy) = \log_b(x) + \log_b(y)\).
Ex: \(\log_2(64) = \log_2(4 \times 16) = \log_2(4) + \log_2(16) = 2 + 4 = 6\)
Logs convert quotients to differences, that is, \(\log_b(x / y) = \log_b(x) - \log_b(y)\).
Ex: \(\log_{10}(1000 / 100) = \log_{10}(1000) - \log_{10}(100) = 3 - 2 = 1\)
Logs bring down exponents, that is, \(\log_b(x^y) = y \log_b(x)\).
Example: \(\log_2(4^5) = 5 \log_2(4) = 5(2) = 10\)
Logs undo exponentiation, that is, \(\log_b(b^y) = y \log_b(b) = y\).
Example: \(\log_2(2^5) = 5\)
Logs in different bases differ by a constant factor: \(\log_a(x) = k \log_b(x)\), where \(k = \log_a(b)\).
Example: \(\log_2(x) = 3.322 \log_{10}(x)\)

Problem Solving with Logs
Question: How many years will it take to double an investment at i% interest compounded annually?
Answer: Let
P = the initial investment
r = interest rate as a fraction = i / 100
n = the number of years of compounding
Then, after n years, you will have \(P(1 + r)^n\).

Problem Solving with Logs
Answer (continued): Thus, you want to find n so that
\(P(1 + r)^n = 2P\), that is, \((1 + r)^n = 2\) (a)
To solve (a) for n, take the log of both sides to bring the exponent n down:
\(\log[(1 + r)^n] = \log(2)\)
\(n \log(1 + r) = \log(2)\)
\(n = \log(2) / \log(1 + r)\)
Question: Log base what? Answer: Log base 10 (but any base will work).
Example: At 6% (r = 0.06), it will take n = log(2) / log(1.06) = 0.3010 / 0.0253 = 11.9 years.

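The doubling-time formula, and the remark that any base will work, can be checked numerically. The sketch below uses Python's `math` module; the function name `years_to_double` is hypothetical.

```python
import math

def years_to_double(rate: float) -> float:
    """Solve (1 + rate)^n = 2 for n using natural logs."""
    return math.log(2) / math.log(1 + rate)

n_natural = years_to_double(0.06)

# Same calculation in base 10: the ratio of two logs does not depend on the base.
n_base10 = math.log10(2) / math.log10(1.06)

# With annual compounding the balance first reaches double at a whole year.
whole_years = math.ceil(n_natural)

print(f"n (natural log) = {n_natural:.2f}")
print(f"n (base 10)     = {n_base10:.2f}")
print(f"first whole year with at least double: {whole_years}")
```

The base drops out because changing base multiplies both the numerator and the denominator by the same constant k from the change-of-base law.
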