Review Exercises: Normal Approximation to Data
Chapter 5, FPP, p. 93-96
Dr. McGahagan

Problem 1. Test scores and the normal approximation.

Given: Mean = 50, SD = 10

A 1.25 SD interval around the mean = 50 +/- 12.5 = 37.5 to 62.5

Using the normal tables from A-1, look up a standard score of 1.25. The corresponding area from -1.25 to +1.25 is 78.87 percent.

A sorted list should make the counting easier:

Data = (29 36 37 37 39 41 44 47 47 48 49 49 50 50 52 52 53 54 56 58 59 62 64 65 72)

Eighteen of the 25 values are within the bounds; this is 18 / 25 = 0.72 or 72 percent of the data, rather less than expected.

Use EcLS by first defining the data: (bind data (list 29... 72)) as above.

(stats data) will confirm the text assertions about the mean and SD.
(density-plot data) shows that the data is roughly normal, so the normal tables will give a fair idea of what percentages to expect.
(normal-area -1.25 1.25) was the command used to generate the graph on the right.
(anorm 1.25) is a computer shortcut to getting the table value without the graph.
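The count in Problem 1 can also be checked outside EcLS. The following is a minimal Python sketch, not the EcLS commands themselves; the helper name central_area is an assumption made for illustration, and the SD is computed with the divide-by-N convention, which is the one that reproduces SD = 10 for this list.

```python
import math
from statistics import mean, pstdev

data = [29, 36, 37, 37, 39, 41, 44, 47, 47, 48, 49, 49, 50, 50,
        52, 52, 53, 54, 56, 58, 59, 62, 64, 65, 72]

def central_area(z):
    """Percent of a normal curve lying between -z and +z standard units."""
    return 100 * math.erf(z / math.sqrt(2))

m = mean(data)
sd = pstdev(data)                      # population SD (divide by N)
lo, hi = m - 1.25 * sd, m + 1.25 * sd  # the 1.25 SD interval around the mean

inside = [x for x in data if lo <= x <= hi]
observed = 100 * len(inside) / len(data)
expected = central_area(1.25)

print(f"mean = {m}, SD = {sd}")
print(f"interval = {lo} to {hi}")
print(f"observed inside = {len(inside)} of {len(data)} ({observed:.0f} percent)")
print(f"normal curve predicts {expected:.2f} percent")
```

The observed share comes out below the normal-curve figure, which is the comparison the solution makes by hand.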

Problem 2. Computer printout of standardized test scores

First 10 entries: (-6.2 3.5 1.2 -0.13 4.3 -5.1 -7.2 -11.3 1.8 6.3)

Is the data surprising? Was the standardization procedure botched?

The fact that some numbers are negative is NOT surprising: we would subtract the mean of (say) 75 from all the scores before standardizing, so scores below 75 would have a negative standard score.

The fact that the sum of these scores is negative (-12.83) is more surprising, but we have only a few values out of 100 test scores, and the sample may not be representative.

But having so many numbers which are greater than 3 in absolute value is VERY surprising if the data are even remotely normal. In a normal distribution, we would expect only about 3 observations out of a thousand to be more than 3 SDs away from the mean (above +3 or below -3), and be amazed to find ANY more than 6 SDs away from the mean. Hence it is almost certain that the computer program was in error if the scores were normally distributed.

Longer explanation: To find the percentage area under the normal curve less than a given number, we ask for the value of the cumulative distribution function up to a desired point. First set the computer to show 10 decimals (we'll need them in a bit):

> (normal-cdf -3) = .001349898
Take the reciprocal to see that 1 in 741 observations from a normal distribution will fall to the left of -3 standard units; another 1 of those 741 observations will fall to the right of +3.

> (normal-cdf -6) = 0.0000000010
Reciprocal = 1,013,594,692, so one in a billion observations falls to the left of a standardized value of -6; another one falls to the right of +6. We have FOUR observations in that category (-6.2, -7.2, -11.3 and 6.3), which stretches credibility.

> (/ 1.0 (normal-cdf -11)) = 5,233,794,723,805,674,230,000,000,000
One in that many observations (one in 5 octillion if my counting of the commas is correct) can be expected to fall 11 SDs below the mean.
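The tail areas quoted for -3, -6 and -11 standard units can be reproduced with Python's standard library. This is a sketch under the assumption that the EcLS normal-cdf command returns the area to the left of its argument; the Python function of the same name is defined here for illustration, using math.erfc so that the very small tails keep their precision.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

scores = [-6.2, 3.5, 1.2, -0.13, 4.3, -5.1, -7.2, -11.3, 1.8, 6.3]

beyond_3 = [s for s in scores if abs(s) > 3]
beyond_6 = [s for s in scores if abs(s) > 6]
print(f"sum of the ten scores = {sum(scores):.2f}")
print(f"more than 3 SDs from the mean: {len(beyond_3)} of {len(scores)}")
print(f"more than 6 SDs from the mean: {len(beyond_6)} of {len(scores)}")

for z in (-3, -6, -11):
    p = normal_cdf(z)
    print(f"z = {z:>3}: area to the left = {p:.10g}, about 1 in {1 / p:,.0f}")
```

Counting the entries directly makes the argument concrete: most of the ten printed values lie in regions where a normal curve puts almost no area.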
What if the data is not normal? The Russian mathematician Pafnuty Chebyshev showed that WHATEVER the distribution of data, the chance of being further in either direction than k standard deviations from the mean is at most 1 / k^2; that is, the chance of being more than 6 SDs away from the mean is at most 1/36 = 0.028 or about 2.8 percent; the chance of being more than 11 SDs away from the mean is at most 1/121 or about 0.8 percent. So if the data is not normal, values like these are possible, although so many of them among ten entries would still be unusual. I would want to look at a histogram of all the data before betting my house on there being a mistake, but I think that the computer program probably has a problem.

Problem 3. SAT scores -- Verbal

Part a. In 1967, verbal SAT scores were distributed normally with mean = 466 and SD 110. To find the percentage of students scoring above 600,

(1) standardize the score: Z = (600 - 466) / 110 = 134 / 110 = 1.2182, or approximately 1.2 (closest value given in the table)
(2) look up the area from -1.2 to +1.2 in table A-1. Area = 76.99
(3) find the tail area above 1.2 = 11.50

Procedure: (100 - 76.99) = 23.01 gives the TWO tail area outside the center; divide by 2 to get the one tail area, which is the percentage scoring above 600.

Part b. In 1994, verbal SAT scores had mean 423 and SD 110. The Z-score for the percentage above 600 is Z = (600 - 423) / 110 = 1.6091; look up the area for 1.6 in the tables, and you find that central area = 89.04; two-tail area = 100 - 89.04 = 10.96; one tail area = 5.48 percent.

11.5 percent of SAT verbal scores were above 600 in 1967; only 5.5 percent were above 600 in 1994.
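The two SAT verbal calculations in Problem 3 follow the same recipe, so they can be sketched in a few lines of Python. The function name upper_tail_percent is an assumption introduced for illustration; the table values 1.2 and 1.6 are the rounded standard scores used in the lookups above.

```python
import math

def upper_tail_percent(z):
    """Percent of a normal curve lying above z standard units."""
    return 100 * 0.5 * math.erfc(z / math.sqrt(2))

SD = 110
CUTOFF = 600

# (year, mean verbal score, Z value rounded for the table lookup)
for year, mean_score, table_z in ((1967, 466, 1.2), (1994, 423, 1.6)):
    z = (CUTOFF - mean_score) / SD
    print(f"{year}: Z = {z:.4f}, "
          f"tail at exact Z = {upper_tail_percent(z):.2f} percent, "
          f"tail at table Z {table_z} = {upper_tail_percent(table_z):.2f} percent")
```

Using the exact Z gives a slightly smaller tail for 1967 than the table lookup, since 1.2182 is a little above 1.2; the comparison between the two years is unaffected.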

ASSIGNED. Problem 4. SAT scores -- Math

In 1994, men's math SAT scores were distributed with a mean of 500 and SD of 120. Women's SAT scores had a mean of 460 and SD of 120. Find the percentage of (a) men and (b) women scoring above 660.

Problem 5. Fill-in referring to normal curve and histogram, heights of men (mean = 69, SD = 3).

Percent of men with heights between 66 and 72 inches is equal to the area between (a) -1 and (b) +1 under the (c) normal curve. This percentage is approximately equal to the area between (d) 66 inches and (e) 72 inches under the (f) histogram.

Problem 6. Is the curve normal?

LSAT at one law school had a mean score of 169 and SD of 9; the highest score was 178. Was the curve normal?

Note that ONLY ONE SD will get you up to the highest score; with a normal curve, we would expect about 32 percent of the data to lie outside the 1 SD range, and half of that, or about 16 percent of the scores, to be HIGHER than 178. The data is more tightly packed about the mean than would be the case with a normal distribution; the density of the histogram would be higher, and the distribution more pointed than a normal distribution. We call this excess "pointiness" leptokurtosis.

For the difference between lepto- and platy- kurtosis, you can consult this sketch by W. S. Gosset (the "Student" in the Student's t distribution).
Source: "Student", "Errors of Routine Analysis", Biometrika, v.19, No. 1/2, July 1927, p. 160.

Note that we would expect leptokurtosis if students self-select by ability when applying to law schools. Note also a slight error in the diagram: leptokurtic distributions typically have THICKER tails than the normal distribution, and the kangaroo tails look thinner than they should.
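Returning to the arithmetic in Problem 6, the share of a normal curve lying beyond one SD can be checked directly. This is a short Python sketch; the function name upper_tail_percent and the variable names are assumptions for illustration only.

```python
import math

def upper_tail_percent(z):
    """Percent of a normal curve lying above z standard units."""
    return 100 * 0.5 * math.erfc(z / math.sqrt(2))

mean_score, sd, top_score = 169, 9, 178

z_top = (top_score - mean_score) / sd       # the highest score in standard units
above_top = upper_tail_percent(z_top)       # one tail, above the highest score
outside_band = 2 * above_top                # both tails, outside mean +/- z_top SDs

print(f"highest score is {z_top:.1f} SD above the mean")
print(f"normal curve: {outside_band:.2f} percent outside that band")
print(f"normal curve: {above_top:.2f} percent above {top_score}")
```

A normal curve would put a substantial share of scores above the reported maximum, which is why the LSAT scores at this school cannot follow it.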

ASSIGNED. Problem 7. Finding percentiles. Explain your procedure for this problem carefully.

Assume the math SAT for applicants to a school had a mean of 500 and SD of 100 and followed a normal curve.

Part a. A score of 350 was at the ____ percentile of the distribution.
Part b. To be at the 75th percentile of the distribution a student would need a score of ____

Question 8. True/False

a. True. Adding 7 to each number of a list adds 7 N to the sum of the N numbers; when we divide by N, we will be left with an average that is larger by 7 than the original average.

b. False. Adding 7 to the list will make both each number and the mean larger by 7, so subtracting the mean from each number leaves the deviations, and hence the SD, unchanged.

c. True. Doubling each entry on the list doubles the average; N is no larger, but the sum of the numbers is twice what it was.

d. True. Example: x = (4 11) Average = 7.5, SD = 3.5; y = (8 22) Average = 15, SD = 7.
When we calculate the variance (mean squared deviation), each term in the numerator has the form: (square (X - Xbar)). If we double all terms, they will be:
(square (2X - 2Xbar)) = (square (2 (X - Xbar))) = 4 (square (X - Xbar))
Factor out the 4 from all terms, and we have found that the variance of the new series is 4 times that of the old. But when we calculate the SD, we take the square root of both sides, so the SD of the new series is twice that of the old.

e. True. Changing the sign of each number in a list changes the sign of the average. Example: x = (10 20), average is 15; x = (-10 -20) and the average is minus 15.

f. False. Changing the sign of each number in a list DOES NOT change the SD. Changing the sign means multiplying each number by -1; when we calculate the SD, we will be squaring the minus 1.
Example:
(bind x (rnd 10)) = (34 3 34 83 59 79 82 84 50 4)
(mean x) = 51.2
(sd x) = 30.0227
(bind y (* -1 x)) = (-34 -3 -34 -83 -59 -79 -82 -84 -50 -4)
(mean y) = -51.2
(sd y) = 30.0227
The (rnd 10) command means "generate 10 random integers between 0 and 100".

Question 9. More True and False.

a. False. Mean and median are close only if the distribution is SYMMETRICAL. Example: x = (1 2 3 4 90) Mean = 20, Median = 3.

b. False. Half of the list is not necessarily "below average"; in the previous example, 4 out of 5 numbers were below the mean of 20.

c. False. Histograms of a sample of data, however large, will follow the normal curve only if the underlying distribution is normal. Example: generate a sample of 50 non-normal variates by:
(bind x (rchisq 50 3)) [random from the chi-squared distribution with 3 degrees of freedom -- don't worry about the meaning of "degrees of freedom" yet.]
(hist x)
Repeat, substituting (bind x (rchisq 5000 3)). Is the distribution more normal?
Of course, if the data generating function were (rnorm x), the histogram would very likely look more normal if we had 5000 observations rather than 50.

d. False. One counter-example is sufficient to demonstrate falsity.
List A = (40 40 60 60) has mean = 50, SD = 10, and NO numbers between 40 and 60.
List B = (45 55 (50 + x) (50 - x))
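Several of the True/False answers in Questions 8 and 9 are claims about how the mean, median and SD respond to simple changes in a list, so they are easy to check numerically. The following Python sketch reuses the ten numbers from the (rnd 10) example; the variable names are assumptions for illustration, and the SD uses the divide-by-N convention that matches the values shown above.

```python
from statistics import mean, median, pstdev

x = [34, 3, 34, 83, 59, 79, 82, 84, 50, 4]

shifted = [v + 7 for v in x]   # Question 8 a, b: add 7 to each entry
doubled = [2 * v for v in x]   # Question 8 c, d: double each entry
negated = [-v for v in x]      # Question 8 e, f: change every sign

for label, lst in (("original", x), ("plus 7", shifted),
                   ("doubled", doubled), ("negated", negated)):
    print(f"{label:>8}: mean = {mean(lst):8.2f}, SD = {pstdev(lst):8.4f}")

# Question 9 a, b: a lopsided list pulls the mean away from the median.
lopsided = [1, 2, 3, 4, 90]
below_mean = [v for v in lopsided if v < mean(lopsided)]
print(f"mean = {mean(lopsided)}, median = {median(lopsided)}, "
      f"{len(below_mean)} of {len(lopsided)} entries below the mean")

# Question 9 d: List A has mean 50 and SD 10 with nothing strictly inside 40..60.
list_a = [40, 40, 60, 60]
strictly_inside = [v for v in list_a if 40 < v < 60]
print(f"List A: mean = {mean(list_a)}, SD = {pstdev(list_a)}, "
      f"{len(strictly_inside)} entries strictly between 40 and 60")
```

Shifting changes only the mean, doubling scales both the mean and the SD by 2, and negating flips the sign of the mean while leaving the SD alone.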

Problem 10. Percentages in a skewed distribution are not normal.

Income distribution in 1992: mean: $35,000 SD: $23,000.

Percentage of incomes above the mean is likely to be much smaller than with the normal distribution, where it would be 50 percent; go with the 40 percent figure. Only if the income distribution were LEFT SKEWED rather than right skewed would it be possible for the percentage above the mean to be greater than 50 percent.

Problem 11. Statistics at Berkeley.

Knowing more about Statistics 2 would help: is Statistics 1 a prerequisite? Or is Stat 2 an alternative designed for students who placed out of Stat 1 on an exam?

If Stat 1 were a prerequisite, we would begin with an average of 1, with a very few students who had another course or two pulling the average up to 1.1. The right-skewed distribution (i) would definitely be the one to choose.

If Stat 1 were not a prerequisite, the right-skewed distribution is still the likely choice: there is no possibility of taking a negative number of math courses, so a long left tail as in (iii) could be ruled out, and one would expect a few students who have taken a large number of courses, making symmetry unlikely.

Problem 12. Census "households" and "families"

Households include both single-person households and multiple-person households -- obviously the income of a multiple-person household stands a good chance of being higher than that of a single person. A bit less obviously, single persons are more likely to be young, and hence more likely to be getting starting salaries.

The figures from the Census Bureau, from Income, Poverty and Health Insurance Coverage in the United States, 2007 (Aug. 2008, accessible from www.census.gov/hhes/www/income/income.html; Table 1, p. 7), are:

                     Median income   90 percent CI   Number of households (000)
All households           50,233         +/- 230              116,783
Family households        62,359         +/- 322               77,873
Married couples          72,785         +/- 528               58,370
Non-family               38,910         +/- 260               38,910

The mean income of all households was 67,609 (with standard error of 236) -- Table A-1, p. 31.