STA201 Intermediate Statistics Lecture Notes. Luc Hens


15 January 2016


How to use these lecture notes

These lecture notes start by reviewing the material from STA101 (most of it covered in Freedman et al. (2007)): descriptive statistics, probability distributions, sampling distributions, and confidence intervals. Following the review, STA201 covers hypothesis tests (Freedman et al., 2007, chapters 26 and 29); the t-test for small samples (Freedman et al., 2007, chapter 26, section 6); hypothesis tests on two averages (Freedman et al., 2007, chapter 27); and the chi-square test (Freedman et al., 2007, chapter 28). STA201 then covers correlation and simple linear regression (Freedman et al., 2007, chapters 10, 11, 12). Two related subjects (multiple regression and inference for regression) that are not covered in Freedman et al. (2007) are covered in depth in the lecture notes.

Each chapter of the lecture notes ends with Questions for Review; be prepared to answer these questions in class. Work the problems at the end of the chapters in the lecture notes. The key concepts are set in boldface; you should know their definitions. You can find the lecture notes and other material on the course web site.

We'll use the open-source statistical environment R with the graphical user interface R Commander. The course web page contains a document (Getting started in STA101) that explains how to install R and R Commander on your computer, as well as R scripts and data sets used in the course. Thanks to a web interface (Rweb) you can also run R from any computer or mobile device (tablet or smartphone) with a web browser, without having R installed:

1. Make sure you are connected to the internet.
2. In your web browser, open a new tab and point your browser to Rweb.
3. Remove everything from the window at the top (data(meaudret) etc.).
4. Type R code (or paste an R script) in the window.
5. Click the Submit button.
6. Wait until Results from Rweb appears. If the script generates a graph, scroll down to see the graph.
Practice is important to learn statistics. Students who wish to work additional exercises can find hundreds of solved exercises in Kazmier (1995) (or a more recent edition). Moore et al. (2012) covers the same ground as Freedman et al. (2007) and has many exercises; the solutions to the odd-numbered exercises are in the back of the book. Older but still useful editions of both books are available in the VUB library.

Remember the following calculation rules:

Always carry the units of measurement in the calculations. For instance, when you have two measurements in dollars ($2 and $3) and you compute their average, write:

($2 + $3)/2 = $2.50

To express a fraction (say 2/5) as a percentage, multiply by 100% (not by 100):

(2/5) × 100% = 40%

The same holds for expressing decimals (say, 0.40) as a percentage:

0.40 × 100% = 40%

(STA201 was for a while taught as STA301 Methods: Statistics for Business and Economics.)

Chapter 1 Descriptive statistics

1.1 Basic concepts of statistics

Suppose you want to find out which percentage of employees in a given company has a private pension plan. The population is the set of cases about which you want to find things out. In this case, the population consists of all employees in the given company; each employee is a case. A variable is a characteristic of each case in the population. Here you are interested in the variable private pension plan. It can take two values, yes or no: it is a qualitative variable. The percentage of employees who have a private pension plan is a parameter: a numerical characteristic of the population. The monthly salary of the employees is a quantitative variable. The average monthly salary of all employees in the company is another parameter. We'll be mainly concerned with these two types of parameters: percentages (of qualitative variables) and averages (of quantitative variables).

If you conduct a survey and every employee in the company fills out the survey form, the collected data set covers all of the population, and you can find the exact value of the population parameter. In some cases collecting data for the whole population may not be possible; you may have to rely on a sample drawn from the population. A sample is a subset of the population. The sample percentage (the percentage of employees in the sample who have a private pension plan) is called a statistic. Statistical inference uses a sample to draw conclusions about the population it was drawn from. We'll see that when the sample is a simple random sample, the sample percentage (the statistic) is a good estimate of the population percentage (the parameter). Much of statistical inference deals with quantifying the degree of uncertainty that results from generalizing from sample evidence. First we will deal with descriptive statistics: ways to summarize data (from a population or a sample) in a table, a graph, or with numbers.
1.2 Summarizing data by a frequency table

How can we summarize information about a quantitative variable of a sample or a population, often consisting of thousands of measurements?

When a particular stock is traded unusually frequently on a given day, usually this indicates that something is going on. Table 1.1 shows the number of traded Apple shares for each of the first fifty trading days of 2013. A glance at the data reveals that the trade volumes differ considerably from day to day.

Table 1.1: Volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Source: nasdaq.com. Columns: Date (yyyy/mm/dd) and Volume (shares per day). (The 50 rows of dates and daily volumes are not reproduced here.)

How can we get a better idea of the typical daily volumes and the spread around the typical volumes? A good start is to rank the values from low to high. (In R Commander, type the sort() function in the script window; the name of the variable goes between the brackets.) The values vary from 10.8 to 52.1 million shares per day.

The middle value in the ordered list is called the median. Because we have an even number of values (50), there are two middle values: the values at positions 25 and 26. In that case, the convention is to take the average of the two middle values as the median:

median = (value at position 25 + value at position 26)/2

The median gives an idea of the central tendency of the data distribution: half of the days the value (the volume of traded shares) was less than the median, and the other half the value was more than the median.

We can summarize the ordered list in a frequency table. First, define class intervals that don't overlap and cover all data. You don't want too few class intervals (because that would leave out too much information), nor too many (because that wouldn't summarize the information from the data). You also want the class intervals to have boundaries that are easy, rounded numbers. The class intervals don't have to be of the same width. Let us define the first class interval as 10 million to 15 million shares (10 million included, 15 million not included), the second as 15 million to 20 million, and so on, until 50 million to 55 million. A frequency table has three columns: class interval, absolute frequency, and relative frequency (table 1.2). The absolute frequency (or count) is how many values fall in each class interval. The first class interval (10 million to 15 million) contains 13 values (verify!): the absolute frequency for this interval is 13. Find the absolute frequencies for the other class intervals.
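As a sketch of the sorting step in R (the volumes below are made-up values, not the actual data of table 1.1):

```r
# Illustrative daily volumes in millions of shares (made-up values)
volume <- c(14.9, 10.8, 52.1, 33.0, 21.5, 18.2)
sort(volume)     # ranks the values from low to high
median(volume)   # for an even count, R averages the two middle values
```

With these six values, median() averages the two middle values 18.2 and 21.5 and returns 19.85.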
The relative frequency expresses the number of values in a class interval (the absolute frequency) as a percentage of the total number of values in the data set. For the first class interval (10 million to 15 million) the relative frequency is:

(13/50) × 100% = 26%

Verify the relative frequencies for the other class intervals. Show your work. The absolute frequencies add up to the number of values in the data set, and the relative frequencies (before rounding) add up to 100%. If that is not the case, you made a mistake.
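The whole frequency table can be computed in R. This is only a sketch: the vector volume below is a hypothetical stand-in for the 50 daily volumes of table 1.1, and the breakpoints follow the class intervals defined above.

```r
# Hypothetical stand-in for the 50 daily volumes (in millions of shares)
volume <- c(10.8, 12.3, 14.9, 18.2, 21.5, 33.0, 48.1, 52.1)
breaks <- seq(10, 55, by = 5)               # class boundaries 10, 15, ..., 55
bins <- cut(volume, breaks, right = FALSE)  # left boundary included, right excluded
absolute <- table(bins)                     # absolute frequencies (counts)
relative <- 100 * absolute / sum(absolute)  # relative frequencies (%)
```

The argument right = FALSE makes the intervals include their left boundary and exclude their right boundary, matching the convention of table 1.2.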

Table 1.2: Frequency table of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Note. Class intervals include left boundaries and don't include right boundaries.

Volume (shares per day)   Absolute frequency   Relative frequency (%)
10 to 15 million          13                   26
15 to 20 million          …                    …
20 to 25 million          …                    …
25 to 30 million          …                    …
30 to 35 million          …                    …
35 to 40 million          …                    …
40 to 45 million          …                    …
45 to 50 million          …                    …
50 to 55 million          1                    2
Sum:                      50                   100

1.3 Summarizing data by a density histogram

The frequency table gives you a pretty good idea of what the most common values are, and how the values differ. One way to graph the information from a frequency table is to plot the values of the variable (in this case: the daily volumes) on the horizontal axis, and the absolute or relative frequency on the vertical axis. The heights of the bars represent the absolute or relative frequencies; the areas of the bars don't have a meaning. Such a bar chart is called a frequency histogram. For reasons that will soon be clear, it is more interesting to plot a frequency table in a bar chart where the areas of the bars represent the relative frequencies. Such a bar chart is called a density histogram. The height of each bar in a density histogram represents the density of the data in the class interval.

To construct a density histogram, we have to find the height of each bar. How do we compute the height? Remember that the area of a rectangle (such as the bars in the density histogram) is given by width times height:

area = width × height

The area of the bar is the relative frequency, the width of the bar is the width of the class interval, and the height of the bar is the density. Hence:

relative frequency = width of the interval × density

Divide both sides by the width of the interval, to obtain:

density = relative frequency (%) / width of the interval

This formula is on the formula sheet. For the class interval from 10 million to 15 million shares the relative frequency was 26% (table 1.2).
Hence the density for this interval is:

density = 26% / (15 million shares − 10 million shares)

= 26% / (5 million shares) = 5.2% per million shares

Now that you know the height of the bar over the interval from 10 to 15 million shares (5.2% per million shares), you can draw the bar. The density for the interval from 10 to 15 million shares tells us which percentage of all 50 values falls in each interval of one unit wide on the horizontal axis, assuming that the values in the interval from 10 to 15 million shares are uniformly distributed: about 5.2% of all values fall between 10 and 11 million shares, about 5.2% between 11 and 12 million shares, about 5.2% between 12 and 13 million shares, about 5.2% between 13 and 14 million shares, and about 5.2% between 14 and 15 million shares. It is as if the bar is sliced up in vertical strips of one horizontal unit (here: one million shares) wide. The density measures which percentage of all values falls in such a strip of one unit wide. Note the unit of measurement of density: percent per million shares. More generally, density is expressed in percent per unit on the horizontal axis.

Given a data set such as table 1.1, you should be able to construct a frequency table and a density histogram. The first assignment asks you to do exactly that. Figure 1.1 shows the density histogram as generated by R. A script to draw this density histogram in R Commander is posted on the course web page.

Suppose you don't have the data set or the frequency table, but just the density histogram (figure 1.1). On which percentage of trading days was the volume of traded Apple shares between 20 and 30 million? Show in the density histogram what represents your answer. On (approximately) which percentage of trading days was the volume of traded Apple shares between 24 and 27 million? Show in the density histogram what represents your answer.
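The density computation is just the relative frequency divided by the interval width; a quick check in R:

```r
relative_frequency <- 26               # %, class interval 10 to 15 million shares
width <- 15 - 10                       # interval width in millions of shares
density <- relative_frequency / width
density                                # 5.2 (% per million shares)
```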
We conclude that the area under the histogram between two values represents the percentage of observations that falls between those two values. What is the area under all of the histogram? 100%.

In a density histogram the vertical axis shows the density of the data. The areas of the bars represent percentages. The area under a density histogram over an interval is the percentage of data that fall in that interval. The total area under a density histogram is 100%. (Freedman et al., 2007, p. 41)

A density histogram reveals the shape of the data distribution. To assess the shape of the density histogram, locate the median on the horizontal axis and draw a vertical line. Is the histogram symmetric about the median, or is it skewed? Is the histogram skewed to the left (that is, with a long tail to the left) or to the right (with a long tail to the right)? Is the histogram bell-shaped? Watch the two-minute video clip (Rosling, 2015) that uses a histogram to show how the world income distribution has changed over the last two centuries. Although a density histogram is somewhat more complicated than a frequency histogram, a density histogram has several advantages:

a density histogram allows for intervals with different widths;

a bell-shaped density histogram can be approximated by the normal curve (see below);

a density histogram has an interpretation that resembles the interpretation of a probability distribution curve (see below).

[Figure 1.1: Density histogram of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Vertical axis: Density (% per million shares); horizontal axis: Daily volume (in millions).]

1.4 Summarizing data by numbers: average

We already saw that the median is a measure of the central tendency of the data distribution. Another useful measure of central tendency is the average. The formula to compute the average of a list of measurements is:

average = (sum of all measurements) / (how many measurements there are)

Here is an example. Suppose you collected the price of the same bottle of wine in five restaurants:

€2, €2, €4, €5, €7

The average price is:

average = (€2 + €2 + €4 + €5 + €7)/5 = €20/5 = €4

A disadvantage is that the average is sensitive to outliers (exceptionally low or exceptionally high values). Suppose that the list looked like this:

€2, €2, €4, €5, €22

The average of this list is:

average = (€2 + €2 + €4 + €5 + €22)/5 = €35/5 = €7

The one exceptionally expensive bottle of €22 pulled the average up quite a lot. In cases like this we often prefer to use a different measure of central tendency: the median. To find the median, first rank the values from low to high; then take the middle value. The median of the list {€2, €2, €4, €5, €22} is €4. The median of the first list {€2, €2, €4, €5, €7} is also €4. As you can see, the outlier doesn't affect the median. When a density histogram is skewed or when there are outliers, the median usually is a better measure of the central tendency. One example is the distribution of families by income (Freedman et al., 2007, figure 4 p. 36).

1.5 Summarizing data by numbers: standard deviation

We have seen how to summarize the central tendency of a data set. Another feature we would like to capture is the spread (or dispersion) of the data. One way to measure the spread is to look at how much the measurements deviate from the average. Let's go back to the prices of the same bottle of wine in five restaurants:

€2, €2, €4, €5, €7

The average price is €4. The deviation from the average measures how much a measurement is below (−) or above (+) the average:

deviation = measurement − average

The deviations are:

€2 − €4 = −€2
€2 − €4 = −€2
€4 − €4 = €0
€5 − €4 = +€1
€7 − €4 = +€3
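In R, the average, the median, and the deviations for the wine-price examples can be checked directly:

```r
prices <- c(2, 2, 4, 5, 7)       # prices in euros
mean(prices)                     # 4
median(c(2, 2, 4, 5, 22))        # 4: the outlier (22) leaves the median unchanged
prices - mean(prices)            # deviations: -2 -2 0 1 3
```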

To get an idea of the typical deviation, we could take the arithmetic mean of the deviations:

[(−€2) + (−€2) + €0 + (+€1) + (+€3)]/5 = €0

It can easily be proven that, whatever the list of measurements, the arithmetic mean of the deviations is always equal to 0: the negative deviations exactly cancel out the positive ones. Therefore statisticians use the quadratic mean of the deviations as a measure of the spread; the outcome is called the standard deviation. The standard deviation (SD) is a measure of the typical deviation of the measurements from their mean. It is computed as the quadratic mean (or root-mean-square size) of the deviations from the average. The quadratic mean is usually referred to as the root-mean-square (R-M-S) size. To obtain the standard deviation, find the deviations; then compute the quadratic mean (or root-mean-square size) of the deviations by applying the root-mean-square recipe in reverse order: first square the deviations, then find the (arithmetic) mean of the result, and finally take the (square) root. In our example:

1. Square the deviations:

(−€2)² = 4 €²
(−€2)² = 4 €²
(€0)² = 0 €²
(+€1)² = 1 €²
(+€3)² = 9 €²

By squaring we get rid of the minus signs. Note that the unit of measurement (here: €) is squared, too.

2. Next find the arithmetic mean (or average) of the results from the previous step:

mean = (4 + 4 + 0 + 1 + 9)/5 = 18/5 = 3.6

The unit (€) is still squared (€²).

3. Finally take the square root of the result from the previous step:

√(3.6 €²) ≈ €1.90

This is the standard deviation. Note that by taking the square root, the unit is euros again: the standard deviation has the same unit as the measurements. In this case, the measurements were in euros, so the standard deviation is also in euros.
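The three-step root-mean-square recipe translates line by line into R:

```r
prices <- c(2, 2, 4, 5, 7)            # euros
deviations <- prices - mean(prices)
squared <- deviations^2               # step 1: square -> 4 4 0 1 9
m <- mean(squared)                    # step 2: mean   -> 3.6
sqrt(m)                               # step 3: root   -> about 1.897, the SD
```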

Expressed as a formula, we get:

SD = √( sum of (deviations)² / number of measurements )

(The formula is on the formula sheet, so you don't have to learn it by heart.) The formula above is for the standard deviation of a population. For reasons I won't explain, a better formula for the standard deviation of a sample is:

SD⁺ = √( sum of (deviations)² / sample size ) × √( sample size / (sample size − 1) )

that is, you compute the SD with the usual formula (the quadratic mean of the deviations), which is the first factor in the equation above, and then multiply by

√( sample size / (sample size − 1) )

(you don't have to memorize this formula). Because the second factor is larger than 1, the formula gives a value larger than SD. That's why Freedman et al. (2007) use the notation SD⁺. For large samples, the difference between SD and SD⁺ is small. In what follows, we'll use the SD formula for both samples and populations, unless stated explicitly otherwise. We'll return to SD⁺ when we discuss small samples.

Remember the following rule: few measurements are more than three SDs from the average.¹ This rule holds for histograms of any shape. Measurements that are more than three SDs from the average (exceptionally small or exceptionally large measurements) are called outliers. To identify outliers, compute the standard scores of all measurements. The standard score expresses how many standard deviations a measurement is below (−) or above (+) the average:

standard score = (measurement − average) / standard deviation

Converting measurements to standard scores is called standardizing. Let us return to the daily traded volumes of Apple shares (table 1.1). The volumes of Apple shares traded on the first 50 trading days of 2013 have an average of … and a standard deviation of …. On 14 March 2013 only … Apple shares were traded. Is that volume exceptionally small?
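Note that R's built-in sd() function uses the sample formula SD⁺, not the population formula; a quick comparison:

```r
x <- c(2, 2, 4, 5, 7)
n <- length(x)
SD <- sqrt(mean((x - mean(x))^2))   # population formula: about 1.897
SDplus <- SD * sqrt(n / (n - 1))    # Freedman et al.'s SD+: about 2.121
sd(x)                               # R's sd() matches SD+, not SD
```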
Compute the standard score for that volume:

standard score = (measurement − average)/SD = −1.13

The standard score of −1.13 means that the volume of shares traded that day was 1.13 standard deviations below the average. Because the absolute value of the

¹ A more precise statement can be made. It can be proven (Chebychev's Theorem) that at least 8/9 of the measurements fall within 3 SDs of the average, that is, between [average − 3 SD, average + 3 SD]. Hence at most 1/9 of the measurements fall outside that interval. You don't have to memorize this.

standard score (after omitting the minus sign: 1.13) is smaller than 3, we don't consider this volume an outlier.

Standard scores have no units. The following example illustrates this. A list of incomes per person for most countries in the world (the Penn World Table, Heston et al. (2012)) has an average of $… and a standard deviation of $…. Income per person in Belgium is $…. The standard score for income per person in Belgium is:

($… − $…)/$… = 1.32

The units in the numerator ($) and denominator ($) cancel each other out, and hence the standard score has no units. That's why Freedman et al. (2007) refer to computing standard scores as converting a measurement to standard units. The standard score of 1.32 means that income per person in Belgium is 1.32 standard deviations above the average of all countries in the list. So is income per person in Belgium an outlier?

Shortcut formula for the SD of 0-1 lists. Computing the SD is tedious. To estimate percentages, we'll be dealing with lists that consist of just zeroes and ones (0-1 lists): for instance, we will model an employee with a private pension plan as a 1, and an employee without a private pension plan as a 0. The following shortcut formula simplifies the calculation of the SD of 0-1 lists:

SD of 0-1 list = √( (fraction of ones in the list) × (fraction of zeroes in the list) )

(This formula is on the formula sheet, so no need to memorize it. Just for your information, I posted a proof on the course home page.) Here is an example. Consider the list {0, 1, 1, 1, 0}. The average is 3/5. The deviations from the average are {−3/5, +2/5, +2/5, +2/5, −3/5}, or {−0.6, +0.4, +0.4, +0.4, −0.6}. The SD is the root-mean-square size of the deviations:

1. Square the deviations: {0.36, 0.16, 0.16, 0.16, 0.36}
2. Next find the average of the squared deviations:

(0.36 + 0.16 + 0.16 + 0.16 + 0.36)/5 = 1.2/5 = 0.24

3. Finally take the square root to obtain the SD:

SD = √0.24 ≈ 0.49

According to the shortcut rule we can compute the SD as:

√( (fraction of ones) × (fraction of zeroes) ) = √( (3/5) × (2/5) ) = √(6/25) = √0.24 ≈ 0.49

which indeed yields the same result, with far fewer calculations.
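You can verify in R that the shortcut formula agrees with the root-mean-square recipe for a 0-1 list:

```r
x <- c(0, 1, 1, 1, 0)
p <- mean(x)              # fraction of ones: 0.6
sqrt(mean((x - p)^2))     # usual formula: about 0.49
sqrt(p * (1 - p))         # shortcut formula: same result
```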

1.6 The normal curve

Many bell-shaped histograms can be approximated by a special curve called the normal curve. The function describing the normal curve is complicated:

y = (1/√(2π)) e^(−x²/2)

In practice we won't need this equation: it is programmed in all statistical software packages. The equation describes the standard normal curve, which is the only version of the normal curve we'll need. In what follows, I'll refer to the standard normal curve simply as the normal curve. Figure 1.2 illustrates the properties of the standard normal curve:

1. the curve is symmetric about 0;
2. the area under the curve is 100% (or 1);
3. the curve is always above the horizontal axis.

[Figure 1.2: The standard normal curve. Vertical axis: Density (% per standard unit); horizontal axis: Standard units (z).]

Statisticians use statistical software (on a calculator or a computer) to find areas under the normal curve. On a TI-84, you find the area under the standard normal curve using the normal cumulative density function (normalcdf). The area under the standard normal curve between −1 and 2 is:

DISTR normalcdf(−1, 2)

which yields approximately 0.8186. To express the area as a percentage, multiply by 100%:

0.8186 × 100% = 81.86%

The area under the standard normal curve to the right of 1 (that is, between 1 and infinity) is:

DISTR normalcdf(1, 10^99)

The area under the standard normal curve to the left of 2 (that is, between minus infinity and 2) is:

DISTR normalcdf(−10^99, 2)

For the exams, you have to use the TI-84 to find areas under the normal curve. On the course web page I posted an R script (area-under-normal-curve.r) that computes and plots the area under the normal curve between any two values on the horizontal axis. R Commander has a built-in function to find the area under the normal curve in the left tail or in the right tail: Distributions → Continuous distributions → Normal distribution → Normal probabilities.

1.7 Approximating a density histogram by the normal curve

These are the scores of 100 job applicants who took a selection test:

74, 82, 70, 84, 54, 60, 79, 62, 72, 66, 72, 79, 73, 73, 84, 59, 53, 65, 62, 81, 76, 67, 72, 89, 70, 72, 71, 78, 98, 58, 68, 89, 70, 62, 71, 56, 68, 68, 76, 63, 63, 71, 82, 63, 98, 76, 74, 71, 52, 80, 80, 66, 69, 67, 70, 81, 62, 63, 76, 57, 89, 60, 87, 80, 75, 71, 87, 59, 69, 65, 66, 67, 62, 87, 58, 58, 60, 54, 74, 83, 48, 77, 79, 60, 84, 86, 68, 64, 83, 65, 77, 79, 68, 75, 77, 72, 47, 77, 68, 67

(the data are posted on the course web page)

The average of the test scores is about 70, and the standard deviation is about 10 (verify using R Commander). Figure 1.3 shows the density histogram. The histogram is bell-shaped. In 1870, the Belgian statistician Adolphe Quetelet had the idea to approximate bell-shaped histograms by the normal curve (Freedman et al., 2007, p. 78).
The horizontal scale of the histogram differs from that of the standard normal curve: most test scores are between 40 and 100, while most of the area under the standard normal curve extends between −3 and +3 on the horizontal axis; and the center of the density histogram is about 70, while the center of the standard normal curve is 0. If we standardize the values, we get what we want. To obtain the standard scores, do:

standard score = (measurement − average) / standard deviation

For example, to standardize the first test score (74; in this case the variable has no units), do:

standard score = (74 − 70)/10 = 0.4

The list of standard scores is: 0.4; 1.2; 0.0; …; −0.3. Verify that you can compute the first couple of standard scores.
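Standardizing a whole list is one line in R; here only the first few test scores are shown, with the rounded average (70) and SD (10) from the text:

```r
scores <- c(74, 82, 70, 84, 54)     # first five test scores
(scores - 70) / 10                  # standard scores: 0.4 1.2 0.0 1.4 -1.6
```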

[Figure 1.3: Density histogram of 100 test scores. Vertical axis: Density (% per point); horizontal axis: Test score (points).]

Figure 1.4 shows the histogram of the standard scores. If you compare it with the histogram of the original test scores (figure 1.3) you notice that the shape of the histogram hasn't changed. Consider the original test scores. Count the number of job applicants who had a test score between 75 and 85: 25 out of the 100 job applicants had a test score between 75 and 85. So 25% of the job applicants had a test score between 75 and 85. In the histogram (figure 1.3), the percentage corresponds to the area under the histogram between 75 and 85. The standard scores of 75 and 85 are:

(75 − 70)/10 = +0.5 and (85 − 70)/10 = +1.5

In the histogram of the standard scores (figure 1.4) the percentage (25%) corresponds to the area under the histogram between +0.5 and +1.5. The area under the normal curve between +0.5 and +1.5 approximates the area under the histogram between +0.5 and +1.5. Now carefully look at figure 1.4. The normal approximation overestimates the bar over the interval between +0.5 and +1.0, and underestimates the bar over the interval between +1.0 and +1.5. The area under the normal curve between +0.5 and +1.5 is approximately:

DISTR normalcdf(0.5, 1.5) ≈ 0.2417 = 24.17%

[Figure 1.4: Density histogram of 100 test scores, standardized. Vertical axis: Density (% per standard unit); horizontal axis: Standard units.]

The normal approximation (24.17%) is quite close to the actual percentage (25%). Use your TI-84 to find the area under the normal curve between −1 and +1. Using the normal approximation, which percentage of measurements will be between ave − SD and ave + SD? Repeat for −2 and +2, and for −3 and +3. You see that the normal approximation implies the following rule, called the 68-95-99.7 rule. For a bell-shaped histogram:

approximately 68% of the measurements are within one SD of the average, that is, between ave − SD and ave + SD;

approximately 95% of the measurements are within two SDs of the average, that is, between ave − 2 SD and ave + 2 SD;

approximately 99.7% of the measurements are within three SDs of the average, that is, between ave − 3 SD and ave + 3 SD.

(The rule is not on the formula sheet; you have to know it by heart.) The normal approximation will turn out to be very useful in statistical inference (drawing conclusions about population parameters on the basis of sample evidence).
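In R, areas under the standard normal curve come from pnorm(), the counterpart of the TI-84's normalcdf; the rule above can be checked this way:

```r
pnorm(2) - pnorm(-1)     # area between -1 and 2: about 0.8186
pnorm(1) - pnorm(-1)     # about 0.68
pnorm(2) - pnorm(-2)     # about 0.95
pnorm(3) - pnorm(-3)     # about 0.997
```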

1.8 Questions for Review

1. What is the difference between a qualitative and a quantitative variable? Illustrate using examples where you consider different characteristics of the students in the class.
2. What is the difference between a parameter and a statistic?
3. What does descriptive statistics do?
4. What does statistical inference do?
5. How can you summarize the distribution of a numerical data set in a table? In a graph?
6. In a density histogram, what does the density represent? What are the units of density? Explain for a hypothetical distribution of heights (in centimeter) of people.
7. When would the median be a better measure of the central tendency of a distribution than the mean? Illustrate by giving an example.
8. What does the standard deviation measure? How is the standard deviation computed?
9. What are the properties of the normal curve?
10. What does the standard score measure? How is the standard score computed?
11. What does the 68-95-99.7% rule say?

1.9 Exercises

1. Download the data file AAPL-HistoricalQuotes.csv from the course web site and save the data file to your STA201 folder (directory). The data set contains data about Apple stock. Run R Commander and load the data set: Data → Import data → from text file, clipboard, or URL.... A window opens. For Location of Data File select Local file system. For Field Separator select Commas. For Decimal-Point Character select Period [.]. Press OK, navigate to the data file AAPL-HistoricalQuotes.csv, and double-click the file. Your data should now be loaded by R Commander. In the R Commander menu, click the View Data Set button. A new window opens, showing the data set. The variable volume is the variable from table 1.1. Now enter the following line of script in the R script window:

h <- hist(dataset$volume/10^6, right=FALSE)

and press the Submit button. This command will compute the numbers needed to make a histogram and store them in an object called h.
Next, type in the R script window:

h$breaks

and press the Submit button. The output window will display the breaks between the intervals, that is, the boundaries of the intervals used by R when it computes the frequency table. Next, type in the R script window:

h$counts

and press the Submit button. The output window will display the absolute frequencies (counts) of each interval. Next, type in the R script window:

h$density

and press the Submit button. The output window will display the densities of each interval. The densities are expressed as decimal fractions per horizontal unit; to get densities expressed as percentages per horizontal unit you have to multiply by 100%. Finally, type in the R script window:

h$counts/sum(h$counts)

and press the Submit button. The output window will display the relative frequencies for each interval; to get relative frequencies expressed as percentages you have to multiply by 100%.

2. Use the relative frequencies from table 1.2 to compute the densities for the other intervals. Add a column to show the densities. Then draw the density histogram on scale on squared paper.

3. Figure 1.1 shows that the daily traded volumes of Apple shares have a skewed distribution. The average daily volume is … shares. Find the median. Show your work. How do mean and median compare? Is that what you expected from the shape of the histogram? Explain.

4. Find the standard deviation of {1, 1, 1, 1, 0} using two methods: the usual formula (root-mean-square size of the deviations) and the shortcut formula for 0-1 lists. Do you get the same result?

5. The daily traded volumes of Apple shares (table 1.1) have an average of … and a standard deviation of …. Is … an outlier? And …? Show your work and explain.

6. Use the TI-84 to find the areas under the standard normal curve:

(a) to the right of 1.87
(b) to the left of 5.20
(c) between −1 and +1
(d) between −2 and +2
(e) between −3 and +3

For every case, make a sketch with the relevant area shaded.
Verify your answers using the R script. We'll get back to cases (c), (d), and (e) in a moment.
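The same areas can be cross-checked in R, whose pnorm function gives areas under the standard normal curve; a minimal sketch for cases (a) through (e) as printed above:

```r
# Areas under the standard normal curve; pnorm(z) is the area to the left of z.
pnorm(1.87, lower.tail = FALSE)  # (a) area to the right of 1.87
pnorm(5.20)                      # (b) area to the left of 5.20 (essentially 1)
pnorm(1) - pnorm(-1)             # (c) between -1 and +1, about 68%
pnorm(2) - pnorm(-2)             # (d) between -2 and +2, about 95%
pnorm(3) - pnorm(-3)             # (e) between -3 and +3, about 99.7%
```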

7. For the 100 given test scores, find which percentage of job applicants scored between 50 and 60. Then use the normal approximation. Is the normal approximation close?

8. For 164 adult Belgian men born in 1962 the average height is … centimeter and the SD is 8.2 centimeter (Garcia and Quintana-Domeque, 2007). Suppose that the histogram of the 164 heights follows the normal curve (heights usually do). What, approximately, is the percentage of men in this group with a height of 170 centimeter or less? What, approximately, is the percentage of men in this group with a height between 170 centimeter and 180 centimeter?

9. Of the volumes of Apple shares traded in the first 50 trading days of 2013 (p. 1.2) the average is … and the SD is … . Find the actual percentage of values between:
ave − SD and ave + SD;
ave − 2 SD and ave + 2 SD;
ave − 3 SD and ave + 3 SD.
Does the rule give a good approximation? Why (not)?


Chapter 2

Probability distributions

2.1 Chance experiments

Examples of chance experiments are: rolling a die and counting the dots; tossing a coin and observing whether you get heads or tails; or randomly drawing a card from a well-shuffled deck of cards and observing which card you get. It is convenient to think of a chance experiment in terms of the following chance model: randomly drawing one or more tickets from a box. For instance, rolling a die is modeled as randomly drawing a ticket from the box {1, 2, 3, 4, 5, 6}. In R:

box <- c(1,2,3,4,5,6)
sample(box,1)

Tossing a coin is like randomly drawing a ticket from the box {heads, tails}. In R:

box <- c("heads","tails")
sample(box,1)

2.2 Frequency interpretation of probability

Consider the following chance experiment. Roll a die and count the dots. If you get an ace (1), write down 1; if you don't get an ace (2, 3, 4, 5, or 6), write down 0. Repeat the experiment many times. After each roll, compute the relative frequency of aces up to that point. Make a graph with the number of tosses on the horizontal axis and the relative frequency on the vertical axis. Figure 2.1 shows the result of 10,000 repetitions in such an experiment. The frequency of aces tends towards 1/6 (about 16.67%, the horizontal dashed line). The frequency interpretation of probability states that the probability of an event is the percentage to which the relative frequency tends if you repeat the chance experiment over and over, independently and under the same conditions (Freedman et al., 2007, p. 222).
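The experiment of this section can be simulated in R. A small sketch (the seed is fixed so the run is reproducible) that produces the kind of graph shown in figure 2.1:

```r
# Simulate 10,000 die rolls and track the relative frequency of aces.
set.seed(1)
rolls <- sample(1:6, 10000, replace = TRUE)
is_ace <- as.integer(rolls == 1)
rel_freq <- 100 * cumsum(is_ace) / seq_along(is_ace)  # in percent
plot(rel_freq, type = "l",
     xlab = "Number of repeats", ylab = "Frequency of aces (%)")
abline(h = 100/6, lty = 2)  # the probability, about 16.67%
```

Run it a few times without the set.seed line: every run wanders differently at first, but the curve always settles near 16.67%.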

Figure 2.1: Frequency of aces in 10,000 rolls of a die

2.3 Drawing with and without replacement

Consider the following box with tickets: {1, 2, 3, 4, 5, 6}. The probability to draw an even number is 3/6:

P(draw is even) = 3/6

Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you replace the ticket, and again randomly draw a ticket from the box. This is called drawing with replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, is again 3/6. In mathematical notation:

P(2nd draw is even | 1st draw was 2) = 3/6

The vertical bar (|) is shorthand for "given that". What comes after the vertical bar (|) is called the condition. A probability with a condition is called a conditional probability. Note that in this case imposing the condition didn't affect the probability of drawing an even number: whether the first draw was 2 or not doesn't matter

for the second draw, because we replaced the ticket after the first draw. In both cases, the probability of getting an even number was the same (3/6):

P(2nd draw is even | 1st draw was 2) = P(2nd draw is even)

The two events (the first draw being 2, and getting an even number on the second draw) are said to be independent: the probability of the second event is not affected by how the first event turned out. That is because we were drawing with replacement. When drawing with replacement, the events are independent.

Now consider a different chance experiment. Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you don't replace the ticket. The box now looks like this: {1, 3, 4, 5, 6}. If we now again randomly draw a ticket from the box, this is called drawing without replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, now is 2/5:

P(2nd draw is even | 1st draw was 2) = 2/5

In this case, what happened in the first draw (as expressed by the condition "1st draw was 2") does make a difference: the probability of getting an even number differs:

P(2nd draw is even | 1st draw was 2) ≠ P(2nd draw is even)

The two events (the first draw being 2, and getting an even number on the second draw) are said to be dependent: the probability of the second event is affected by how the first event turned out. That is because we were drawing without replacement. When drawing without replacement, the events are dependent.

Think of a population as a box with tickets. A random sample is like drawing a number of tickets without replacement from this box. The number of draws is the sample size. Remember this. We'll use this box model when doing statistical inference.

2.4 The sum of draws

For the theory of statistical inference, we'll frequently use the concept of the sum of draws. Here's a simple example: roll a die twice, and add the numbers.
The chance model has the following box: {1, 2, 3, 4, 5, 6}. Draw two tickets with replacement from the box, and add the outcomes. The result is the sum of draws. The sum of draws is a brief way to say the following (Freedman et al., 2007, p. 280):

Draw tickets from a box. Add the numbers on the tickets.

As the following activity makes clear, the sum of draws is itself a random variable:

(a) Conduct the chance experiment above using an actual die or the following R script:

box <- c(1,2,3,4,5,6)
sample(box,1) + sample(box,1)

(b) Repeat the experiment a couple of times and write up the outcomes (using an actual die, or in R by re-running the line sample(box,1) + sample(box,1)). Would it be fair to say that the sum of draws is a chance variable? Explain.

2.5 Picking an appropriate chance model

We model a population as a box with tickets. Taking a random sample is like randomly drawing a number of tickets from the box, without replacement; the number of draws is the sample size. In order to use such a chance model for inference, we will use some interesting properties of the sum of draws. The trick is to set up the chance model in such a way that the chance variable of interest is the sum of draws, or is computed from the sum of draws. An example clarifies my argument.

Suppose you roll a die three times, and want to know what the sum of the outcomes is. What is the appropriate chance model? What is the chance variable? An appropriate chance model is a box with six tickets, {1, 2, 3, 4, 5, 6}, and the chance variable is the sum of three random draws with replacement from the box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 3, 2, and 6. The sum of draws (3 + 2 + 6 = 11) is obtained by adding up the outcomes.

Now suppose that we are interested in another question: how many times (out of three rolls) will we get a six? First, we need the appropriate chance model. When we roll a die, we can get two kinds of outcomes: either we get a six (we'll label this outcome as a success), or we get another number (1, 2, 3, 4, or 5: not a success). The term success is used here in a technical meaning: the outcome we are interested in.
Note that we classify the outcomes of a single roll as a success or not a success. In such a case, the appropriate chance model is a box with six tickets: one ticket "1" for the outcome 6 labelled as a success, and five tickets "0" for the outcomes 1, 2, 3, 4, or 5 labelled as not a success: {1, 0, 0, 0, 0, 0}.

Now we are interested in the number of sixes in three rolls, so we need to count the sixes. Counting the sixes is the same thing as taking the sum of three draws from the 0-1 box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 0, 0, and 1 (we classified each outcome as a success or not a success). The sum of draws (0 + 0 + 1 = 1) is the number of sixes (the number of successes). A box like this, with tickets that can only take values 0 and 1, is called a 0-1 box. Remember that when the problem is one of classifying and counting, the appropriate box is a 0-1 box.

Here's a real-world example. Suppose you are the marketing manager of a telecommunications company that doesn't cover Brussels yet. You would like to find out which percentage of households in Brussels already has a tablet. The population of interest is all households in Brussels. Think of each household in Brussels as a ticket in a box, so there are as many tickets as households. A ticket takes value 1 if the household has a tablet, and 0 if the household doesn't. Taking a random sample of households is like randomly drawing tickets without replacement from this 0-1 box. The number of households in the sample who have a tablet is the sum of draws. The percentage of households in the sample who have a tablet is:

sample percentage = (sum of draws / size of the sample) × 100%

2.6 Probability distributions

Chance experiments can be described using probability distributions. In what follows, we'll focus on the probability distribution of the sum of draws. Suppose you roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box {1, 2, 3, 4, 5, 6} and add the outcomes. The chance variable (the sum of the two draws) can take the following values: {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} (the chance variable is discrete; we won't develop the theory for continuous chance variables).
For each of these possible outcomes, we can compute the probability. There are 36 possible combinations (six values for the first draw times six values for the second draw). Each of these 36 combinations has the same probability, and as the probabilities have to add up to 1, each combination has a probability of 1/36. By applying the rules of probability, we can find the probability that the sum of draws takes the value 2, and then repeat the work to find the probability that the sum of draws takes the value 3, and so on. There are for instance two combinations that yield a sum of 3:

when the first draw is 1 and the second draw is 2 (row 1, column 2 in the table above), and when the first draw is 2 and the second draw is 1 (row 2, column 1). The probability that the sum of draws is 3 is therefore equal to:

P(sum is 3) = P[(first 1, then 2) or (first 2, then 1)]

Apply the addition rule (Freedman et al., 2007) to obtain:

P(sum is 3) = P(first 1, then 2) + P(first 2, then 1) − something

The third term ("minus something") is equal to zero because the events (first 1, then 2) and (first 2, then 1) are mutually exclusive (two events are mutually exclusive when, as one event happens, the other cannot happen at the same time). So we get:

P(sum is 3) = P(first 1, then 2) + P(first 2, then 1) − 0 = 1/36 + 1/36 = 2/36

If you do this for all other possible values of the chance variable, you get the following table:

outcome:     2     3     4     5     6     7     8     9     10    11    12
probability: 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

A table that shows all possible values for a (discrete) chance variable and the corresponding probabilities is called a probability distribution. We can graph the probability distribution as a bar chart. On the horizontal axis we put the chance variable, and we construct the bar chart in such a way that the area of a bar shows the probability (expressed as a percentage), just as in a density histogram the area of a bar showed the relative frequency (expressed as a percentage) of the data over the interval. That is why Freedman et al. (2007) call such a bar chart a probability histogram. For a discrete chance variable the convention is to center the bars on the values that the variable can take: the bar over 2 will start at 1.5 and end at 2.5; the bar over 3 will start at 2.5 and end at 3.5, and so on. The width of each bar is equal to 1. The height of each bar in a probability distribution is called probability density: the probability per unit on the horizontal axis.
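Instead of working out the addition rule case by case, the whole table can be reproduced in R by enumerating the 36 equally likely combinations; a small sketch:

```r
# Enumerate all 36 equally likely combinations of two die rolls.
sums <- outer(1:6, 1:6, "+")  # 6-by-6 table: first roll by second roll
dist <- table(sums) / 36      # probability distribution of the sum
dist                          # probabilities 1/36, 2/36, ..., 6/36, ..., 1/36
dist[["3"]]                   # 2/36, the case worked out above
sum(dist)                     # the probabilities add up to 1
```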
We find the probability densities by applying the formula for the area of a rectangle:

area = width × height

We want the area to represent the probability (expressed as a percentage) and the height to represent the probability density (expressed as percent per unit on the horizontal axis), and hence the equation becomes:

probability = (width of interval on horizontal axis) × (probability density)

Divide both sides of the equation by (width of interval on horizontal axis) to obtain:

probability density = probability / (width of interval on horizontal axis)

Because the width of each interval on the horizontal axis is one unit of the horizontal axis, this becomes:

probability density = probability per unit on the horizontal axis

which gives us the meaning of probability density. For example, the probability to get a 7 is 6/36 (about 16.67%). The probability density over the interval from 6.5 to 7.5 then is equal to:

probability density = 16.67% / 1 = 16.67% per unit on the horizontal axis

Figure 2.2 shows the corresponding bar chart representing the probability distribution. The curve traced by the bar chart of the probability distribution is called the probability density function. The probability density function has the following properties: the curve is always on or above the horizontal axis, that is, the probability density (on the vertical axis) is always 0 or positive; the area under the curve is equal to 1 (or 100%); the area under the curve between two values on the horizontal axis gives the probability.

The probability distribution has an expectation and a standard error. The following example illustrates the intuition of these concepts. Roll a die twice and add the numbers. You can do that with an actual die, or run the following R script:

box <- c(1,2,3,4,5,6)
sample(box,1) + sample(box,1)

Repeat this a couple of times, and write down the outcomes. You will get something like {6, 7, 10, 8, 10, ...}. The outcomes are random. The lowest value you can get is 2 (when you roll two aces), and the highest value is 12 (when you roll two sixes). If you repeat the experiment many times you'll notice that those extreme values occur only occasionally; values like 6, 7, or 8 occur much more frequently. The expectation is the typical value that the random variable will take; the value around which the outcomes vary. Another way to think about the expectation is as the center of the probability distribution (figure 2.2). In this case the expectation is 7 (we'll see below how to compute the expectation).
Now define the difference between the outcome of a chance experiment and the expectation as the chance error. For instance, our first outcome was 6, the expectation is 7, and hence the chance error was:

chance error = outcome − expectation = 6 − 7 = −1

(the negative value −1 means that the outcome was 1 below the expectation). If we compute the chance errors for the other outcomes, we get:

outcome   chance error   (without the minus sign)
6         −1             (1)
7          0             (0)
10         3             (3)
8          1             (1)
10         3             (3)
typical value: expectation    typical size: standard error

Figure 2.2: Probability distribution of the sum of two rolls of a die

The third column shows the chance errors without the minus sign. The standard error is the typical size of the chance errors (without the minus sign). Average and expectation are related concepts: the average is a measure of the central tendency of data (represented in a density histogram), and the expectation is a measure of the central tendency of a chance variable (represented in a probability density graph). Similarly, the standard deviation is a measure of the spread of data around the average, and the standard error is a measure of the spread of a chance variable around the expectation. In brief:

                   central tendency    spread
data               average             standard deviation (SD)
chance variable    expectation (E)     standard error (SE)

Let us now define these concepts more rigorously.
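The chance errors in the table can be reproduced in R; a small sketch using the five outcomes listed in the text:

```r
# Chance errors for the observed outcomes; the expectation is 7.
outcomes <- c(6, 7, 10, 8, 10)
expectation <- 7
errors <- outcomes - expectation  # -1  0  3  1  3
abs(errors)                       # sizes without the minus sign: 1 0 3 1 3
```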

2.7 Intermezzo: a weighted average

To define the expectation and standard error of a discrete chance variable, we need the concept of a weighted average. A weighted arithmetic average of a list of numbers is obtained by multiplying each value in the list by a weight and adding up the outcomes; each of the weights is a number between zero (included) and one (included), and the weights add up to one. Suppose the first value in the list is x1 with weight w1, the second value in the list is x2 with weight w2, ..., and the last (nth) value in the list is xn with weight wn; then the weighted average is:

(w1 × x1) + (w2 × x2) + ... + (wn × xn)

An example is the way a professor computes the students' grades for a course. Here are the weights for the graded components of a course, and the results for a student:

component                        weight (%)   result (score/20)
assignment 1
assignment 2
assignment 3
assignment 4
participation and preparedness
midterm exam
final exam

Each weight is between 0 and 1: 7.50 percent is 0.075, 10 percent is 0.10, and 30 percent is 0.30. Moreover, the sum of the weights is equal to 1:

7.50% + … = 100% = 1

The weighted average of the scores is:

(w1 × x1) + (w2 × x2) + ... + (w7 × x7) = 14.35

So this student has an overall score of 14.35/20.

2.8 Expectation (E)

Just as the average is a measure of the central tendency of a density histogram, the expectation of a chance variable is in a sense a measure of the central tendency of a probability distribution. For a discrete chance variable, the expectation is defined as the weighted average of all possible values that the chance variable can take; the weights are the probabilities. The probability distribution of the sum of two rolls of a die is:

outcome:     2     3     4     5     6     7     8     9     10    11    12
probability: 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

The expectation of the chance variable "sum of two rolls of a die" (or of two draws with replacement from a box with the tickets {1, 2, 3, 4, 5, 6}) is the weighted

average:

(1/36 × 2) + (2/36 × 3) + (3/36 × 4) + ... + (2/36 × 11) + (1/36 × 12) = 252/36 = 7

Let the operator E denote the expectation:

E(sum of two rolls of a die) = 7

2.9 Standard error (SE)

Just as the standard deviation is a measure of the spread of a density histogram, the standard error of a chance variable is in a sense a measure of the spread of a probability distribution. We defined the chance error as the difference between the outcome of a chance variable and the expectation of that chance variable. If the chance experiment is to roll a die twice and add the outcomes, we could get 2 as an outcome; in that case the chance error is 2 − 7 = −5. For the outcome 3, the chance error is 3 − 7 = −4, etc. It is useful to add the chance errors to the table of the probability distribution:

outcome:      2     3     4     5     6     7     8     9     10    11    12
probability:  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
chance error: −5    −4    −3    −2    −1    0     1     2     3     4     5

The standard error of a discrete chance variable is defined as the weighted quadratic average of the chance errors; the weights are the probabilities. (A quadratic average is the root-mean-square size.) Start from the chance errors in the example (the third line we just added to the table of the probability distribution): −5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5.

1. Square. First square the chance errors: (−5)², (−4)², (−3)², (−2)², (−1)², 0², 1², 2², 3², 4², 5². This yields: 25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25.

2. Mean. Then take the weighted average. Use the probabilities of the chance errors as the weights: (1/36 × 25) + (2/36 × 16) + ... + (1/36 × 25). Verify that this indeed yields approximately 5.83 (a spreadsheet is helpful).

3. Root. Finally take the square root: √5.83 ≈ 2.42.

The standard error of the sum of two draws from {1, 2, 3, 4, 5, 6} is approximately 2.42. You can think of this as the typical size of the chance errors.
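The computations of sections 2.8 and 2.9 can be checked in R directly from the probability distribution table; a small sketch:

```r
# Expectation and standard error of the sum of two die rolls,
# computed from the probability distribution table.
outcomes <- 2:12
probs <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1) / 36

expectation <- sum(probs * outcomes)  # weighted average: 7
errors <- outcomes - expectation      # chance errors: -5, -4, ..., 5
mean_sq <- sum(probs * errors^2)      # weighted mean of squares: about 5.83
se <- sqrt(mean_sq)                   # standard error: about 2.42
c(expectation, mean_sq, se)
```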

2.10 Expectation and SE for the sum of draws

When doing statistical inference, we'll use the sum of draws with replacement from a box with tickets. The formulas for the expectation and the standard error of discrete probability distributions from the previous sections also apply if the chance variable is a sum of draws. However, the computations can become tedious. It can be shown that the following formulas hold:

E(sum of draws) = (number of draws) × (average of box)

SE(sum of draws) = √(number of draws) × (SD of the box)

"Average of box" means: the average of the values on the tickets in the box; similarly, "SD of the box" means the SD of the values on the tickets in the box. You don't have to memorize these formulas; they are on the formula sheet. In inference, the box will represent the population, so the average of the box is the population average and the SD of the box is the population SD.

Let us apply these formulas to the example from the previous sections: roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box {1, 2, 3, 4, 5, 6} and add the outcomes. We found in the previous sections that the expectation is 7 and the SE is approximately 2.42. What if we use the formulas for the expectation and the SE of the sum of draws? To apply the formula for the expectation of the sum of draws we first need the average of the box:

average of box = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5

The expectation of the sum of two draws is:

E(sum of draws) = (number of draws) × (average of box) = 2 × 21/6 = 21/3 = 7

This is the same number we found by applying the definition of the expectation. To apply the formula for the standard error for the sum of draws, we first need the SD of the box; the SD of the box is about 1.71 (exercise: verify this).
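A sketch in R that verifies the SD of the box and applies both shortcut formulas:

```r
# Shortcut formulas for the sum of two draws from the box {1, ..., 6}.
box <- 1:6
n_draws <- 2

avg_box <- mean(box)                     # (1 + ... + 6)/6 = 3.5
sd_box <- sqrt(mean((box - avg_box)^2))  # SD of the box: about 1.71
# Note: R's built-in sd() divides by n - 1 and would give a different
# number; the SD of the box uses n in the denominator.

e_sum <- n_draws * avg_box               # 2 x 3.5 = 7
se_sum <- sqrt(n_draws) * sd_box         # sqrt(2) x 1.71, about 2.42
c(avg_box, sd_box, e_sum, se_sum)
```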
Then apply the formula for the standard error for the sum of draws:

SE(sum of draws) = √(number of draws) × (SD of the box) = √2 × 1.71 ≈ 2.42

This is the same number we found by applying the definition of the standard error.

2.11 The Central Limit Theorem

Consider again the chance experiment: roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box

and add the outcomes. The chance variable is the sum of draws. A histogram of the box (the list of numbers {1, 2, 3, 4, 5, 6}) is shown in figure 2.3. Note that the histogram is not bell-shaped at all.

Figure 2.3: Histogram of the dots on a die

We already computed and plotted the probability distribution of the sum of two draws (figure 2.2). Figure 2.4 compares the probability distribution with the normal curve. The normal curve approximates the probability distribution reasonably well. From the probability distribution table (p. 2.9) we know that the probability to get an outcome between 5 (included) and 7 (included) is:

4/36 + 5/36 + 6/36 = 15/36 ≈ 42%

In figure 2.4 the probability of 42% corresponds to the area of the bar over 5 (between 4.5 and 5.5), plus the area of the bar over 6 (between 5.5 and 6.5), plus the area of the bar over 7 (between 6.5 and 7.5). The area under the normal curve between 4.5 and 7.5 approximates the area under the blocks. We can find the area under the normal curve between 4.5 and 7.5 using statistical software. First, standardize the boundaries of the interval (4.5 and 7.5). The variable on the horizontal axis is a chance variable, not data, so we use the expectation instead of the average and the standard error instead of the standard deviation to standardize:

chance variable in standard units = (value − expectation) / SE

The left boundary (4.5) in standard units is approximately (4.5 − 7)/2.42 ≈ −1.04. The right boundary (7.5) in standard units is approximately (7.5 − 7)/2.42 ≈ 0.21. To find the area under the standard normal curve between −1.04 and 0.21 on the TI-84, use the normalcdf function:

normalcdf(-1.04,0.21)

which yields approximately 0.43 or 43%. The normal approximation (43%) is close to the actual probability (42%).

Figure 2.4: Probability distribution of the sum of two rolls of a die

The example illustrates the central limit theorem:

When drawing at random with replacement from a box, the probability distribution for the sum of draws will follow the normal curve, even if the contents of the box do not. The number of draws must be reasonably large.

When is the number of draws reasonably large? Consider a box with 99 tickets "0" and one ticket "1". The histogram of the box is very skewed (figure 2.5). Let us now investigate how the sum of 100, 400, or 900 draws from this skewed box is distributed (the calculations to find the probabilities are very tedious and are done using statistical software). The top panel in figure 2.6 shows the distribution of the sum of 100 draws; the probability distribution of the sum is skewed. The middle panel in figure 2.6 shows the distribution of the sum of 400 draws; the probability distribution of the sum is still skewed, but less so than in the case of 100 draws. The bottom panel in figure 2.6 shows the distribution of the sum of 900 draws; the probability distribution of the sum is pretty much bell-shaped. This example illustrates that the number of draws required to use the normal approximation for the sum of draws differs from case to case. When rolling a die (drawing from a box {1, 2, 3, 4, 5, 6}), two draws were sufficient. Generally,

when drawing from a box with a histogram that is not too skewed, often 30 draws will suffice. But when drawing from a very skewed box, often hundreds or even thousands of draws are needed before the normal curve is a reasonably good approximation of the probability distribution of the sum of draws.

Figure 2.5: Histogram of a box with 99 tickets 0 and one 1

Why is the central limit theorem important? When doing statistical inference, we will use a sample drawn from a population. The sample is like tickets drawn from a box (the box represents the population). The sample statistic (for instance, the sample proportion) is a chance variable: as the sample is random, so is the sample statistic. We can use the central limit theorem to approximate the probability distribution of the sample statistic by the normal curve. But the normal approximation is only good if the sample is large enough.
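The shrinking skew in figure 2.6 can be quantified. The sum of n draws (with replacement) from the 99-zeros-and-one-1 box is a binomial count with p = 1/100, and the standard skewness formula for a binomial count is (1 − 2p)/√(np(1 − p)); this formula is not part of the course material, so treat the sketch below as an aside:

```r
# Skewness of the sum of n draws (with replacement) from a box with
# 99 tickets "0" and one ticket "1": a binomial count with p = 1/100.
p <- 1/100
for (n in c(100, 400, 900)) {
  skew <- (1 - 2 * p) / sqrt(n * p * (1 - p))
  cat("n =", n, " skewness =", round(skew, 2), "\n")
}
```

The skewness shrinks toward 0 (the value for a symmetric distribution) as the number of draws grows, matching the three panels of figure 2.6.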

Figure 2.6: Probability distributions of the sum of 100, 400, and 900 draws from a box with 99 tickets 0 and one ticket 1; panels: (a) sum of 100 draws, (b) sum of 400 draws, (c) sum of 900 draws

2.12 Questions for Review

1. The chance of drawing the queen of hearts from a well-shuffled deck of cards is 1/52. Explain what this means, using the frequency interpretation of probability.
2. What is the difference between drawing with and without replacement? Use as an example drawing a ball from a fishbowl filled with white and red balls.
3. When are two events independent? Give an example, referring to a fishbowl filled with white and red balls.
4. What does the sum of draws mean?
5. Explain the difference between adding and classifying & counting.
6. What does the addition rule say?
7. When are two events mutually exclusive?

8. What is a probability distribution for a discrete chance variable? Which properties should it have?
9. What is a probability density histogram? Which properties does it have?
10. What is probability density?
11. What is a chance error?
12. What is a weighted average?
13. What is the expectation of a discrete chance variable?
14. What is the standard error of a discrete chance variable?
15. What does the Central Limit Theorem say?

2.13 Exercises

1. Conduct the experiment described in section 2.2 using an actual die (or with …). Roll the die ten times. After each roll, compute the relative frequency of aces up to that point. Complete the following table:

Table 2.1: Number of aces in rolls of a die

Repeat   Ace (1) or not (0)   Absolute frequency (*)   Relative frequency, % (*)

(*) Absolute and relative frequency of aces in this and all previous repeats

Plot the number of tosses on the horizontal axis and the relative frequency on the vertical axis.

2. Conduct the experiment described in section 2.2 using the R script roll-a-die.r on the course home page (the script simulates rolls of a die). What does the graph look like? Run the script again. Is the graph exactly the same? How does it differ? In what respect is it similar? Run the script once more. Is there a pattern?

3. You roll a die twice and add the outcomes. Find the probability to get a 10. Show your work and explain.

4. You toss a coin twice and count the number of heads. Construct a probability distribution table and a probability density histogram. What does the area under a bar in the probability density histogram show? And the height of a bar? Find the expectation, the chance errors, and the standard error. (This was an exam question in Fall 2015.)

5. Consider the following chance experiment: roll a die and count the number of dots. Formulate an appropriate chance model. What are the possible outcomes? What are the probabilities? Make a table and a bar chart of the probability distribution (in the chart, put the probability density on the vertical axis). Compute the expectation and the standard error.

6. Work parts (a) and (b) of Freedman et al. (2007, review exercise 2, p. 304).


Chapter 3

Sampling Distributions

A sample percentage is a chance variable, with a probability distribution. The probability distribution of a sample percentage is called a sampling distribution (the probability distribution of a sample average is also a sampling distribution). This chapter discusses the properties of sampling distributions. The next two chapters build on the properties of sampling distributions to estimate confidence intervals and test hypotheses for the percentage or the average of a population.

3.1 Sampling distribution of a sample percentage

In a small town there are 10,000 households. 4,600 households (46% of the total) own a tablet. The population percentage (46%) is a parameter: a numerical characteristic of the population. A market research firm doesn't know the parameter. It tries to estimate the parameter by interviewing a random sample of 100 households. The researchers count the number of households in the sample who own a tablet and compute the sample percentage:

sample percentage = (number in the sample / size of sample) × 100%

The sample percentage is a statistic: a numerical characteristic of a sample. We model the population as a box with tickets. Every household that owns a tablet is represented by a ticket "1", and every household that doesn't own a tablet is represented by a ticket "0": 5,400 tickets "0" and 4,600 tickets "1". Of course, the market research firm doesn't know how many out of the 10,000 tickets are tickets with a "1" (but we do). The random sample is like randomly drawing 100 tickets without replacement from the box. The researcher counts the number of tickets with "1" (the number of households in the sample who own a tablet). Suppose they draw

100 tickets from the box. The number of households in the sample who own a tablet is then equal to the sum of the numbers on the 100 tickets; that is, the number in the sample is the sum of draws from the 0-1 box. As the researcher computes the sample percentage:

sample percentage = (number in the sample / size of sample) × 100%

the numerator (the number of households in the sample who own a tablet) is the sum of draws from the 0-1 box. Hence the sample percentage is computed from the sum of draws. Remember this.

Will the sample percentage be equal to the percentage in the population? We can find out by simulating the experiment described above in R. First we define the box with 4,600 tickets "1" and 5,400 tickets "0":

population <- c(rep(1,4600),rep(0,5400))

This line of code generates a list (called "population") of 10,000 numbers: 4,600 times 1 and 5,400 times 0. You can check this by letting R display a table summarizing the contents of the list called "population":

table(population)

Now take a random sample of 100 households from the population:

sample(population,100,replace=FALSE)

You get a list of 100 zeros and ones. The researcher is interested in the number of households in the sample who own a tablet. That number is the sum of the draws:

sum(sample(population,100,replace=FALSE))

You get something like: 39. So this sample contained 39 households who own a tablet (and 61 who don't). If you divide the number in the sample (39) by the sample size (100) and multiply by 100%, you get the sample percentage:

sample percentage = (number in the sample / size of sample) × 100% = (39/100) × 100% = 39%

So the sample percentage (39%) is not equal to the percentage in the population (46%). That should be no surprise: the sample percentage is just an estimate of the population percentage, based on a random sample of 100 out of the 10,000 tickets.
The difference between the estimate (the sample percentage) and the parameter (the population percentage) is called the chance error:

chance error = sample percentage − population percentage

In this case the chance error is

chance error = 39% − 46% = −7%

(the minus sign indicates that the sample percentage underestimates the population percentage). Of course, because the researcher doesn't know the population percentage, she doesn't know how big the chance error she made is; all she knows is that she made a chance error.
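You can watch the chance error vary from sample to sample in R (a sketch; because the draws are random, your numbers will differ from anyone else's):

```r
# The box: 4600 tickets marked 1 (tablet owners), 5400 marked 0
population <- c(rep(1, 4600), rep(0, 5400))

# Six researchers each draw 100 tickets and compute a sample percentage
percentages <- replicate(6, {
  draws <- sample(population, 100, replace = FALSE)
  sum(draws) / 100 * 100
})
percentages        # six sample percentages, scattered around 46
percentages - 46   # the corresponding chance errors
```

Each entry of `percentages - 46` is one chance error; a minus sign means that sample underestimated the population percentage.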

Why is the estimation error called a chance error? Because the estimation error is a chance variable. That is easy to see by repeating the line

sum(sample(population,100,replace=FALSE))

a couple of times, and computing the sample percentage for every sample. You'll get something like 39, 43, 46, 37, 52, ...: the sample percentage is a chance variable. In a table:

  sample percentage   chance error   (without the minus sign)
  39                  −7             (7)
  43                  −3             (3)
  46                   0             (0)
  37                  −9             (9)
  52                  +6             (6)
  typical value:      typical size:
  expectation         standard error

So in repeated samples, the sample percentage is a chance variable. The sample percentage has a probability distribution (called the sampling distribution of the sample percentage). The expectation of the sample percentage is the typical value around which the sample percentage varies in repeated samples (take a look at the first column: do you have a hunch what the expectation of the sample percentage is?). The standard error of the sample percentage is the typical size of the chance error (after you omit the minus signs, as shown in the third column). It can be shown that the expectation of the sample percentage is the population percentage (proof omitted):

E(sample percentage) = population percentage

The sample percentage is said to be an unbiased estimator of the population percentage. This also implies that

E(chance error) = 0

To find the SE for the sample percentage, start from

sample percentage = (number in the sample / size of sample) × 100%

which can be written as:

sample percentage = (sum of draws / number of draws) × 100%

Take the standard error of both sides:

SE(sample percentage) = [SE(sum of draws) / number of draws] × 100%

From the square root law (p. 29) we know that for random draws with replacement:

SE(sum of draws) = √(number of draws) × (SD of the box)

This is still approximately true for draws without replacement, provided that the population is much larger than the sample:

SE(sum of draws) ≈ √(number of draws) × (SD of the box)

So the expression for the SE for the sample percentage becomes:

SE(sample percentage) ≈ [√(number of draws) × (SD of the box) / number of draws] × 100%

or:

SE(sample percentage) ≈ [(SD of population) / √(sample size)] × 100%

You don't have to memorize this formula. The formula is only approximately right because taking a sample is drawing without replacement. When the population is much bigger than the sample, the distinction between drawing with and without replacement becomes small (Freedman et al., 2007). In that case, the formula gives a good approximation. To find the SD of the population, use the shortcut rule for 0-1 lists:

SD of population = √[(fraction of ones) × (fraction of zeroes)]
                 = √[(4,600/10,000) × (5,400/10,000)] = √(0.46 × 0.54) ≈ 0.50

(Of course, the researcher doesn't know the fraction of ones in the population (the fraction of households in the population who own a tablet). If the sample is large, she can estimate the SD of the population by the SD of the sample. This technique is called the bootstrap. We'll get back to this when we discuss inference.)

Now we can find the standard error of the sample percentage:

SE(sample percentage) ≈ [(SD of population) / √(sample size)] × 100% ≈ (0.50/√100) × 100% ≈ 5%

In sum: if many researchers each took a random sample of 100 households and computed the sample percentage, the sample percentages would be about 46% (the expectation), give or take 5% (the standard error).

What is the shape of the sampling distribution? A computer simulation is helpful (see the R script repeats.R). Let us start from the box:

[ 5,400 tickets marked 0 | 4,600 tickets marked 1 ]

and let a researcher (who doesn't know the contents of the box) draw 100 tickets at random without replacement from the box. The researcher uses the sample to compute the sample percentage (the percentage of 1s in the sample).
We write down the result (say, 39%) and toss the tickets back in the box. Then we let another researcher draw 100 tickets at random without replacement from the box and compute the sample percentage, and so on. The computer simulation repeats this chance experiment a large number of times, so that many researchers each

draw 100 tickets at random without replacement from the box and compute the sample percentage. The result is a long list of sample percentages (39%, 43%, ...). Even for a computer this is a lot of work, so running the simulation can take a while. The program finally plots a density histogram of the sample percentages (figure 3.1).

[Figure 3.1: Density histogram of the sample percentages in many repeated samples. Horizontal axis: percentage of households in the sample who own a tablet; vertical axis: density (% per unit on the horizontal axis).]

Given the frequency interpretation of probability and the large number of repeats, the density histogram (figure 3.1) resembles the probability distribution of the sample percentage. The density histogram shows that most researchers found a sample percentage in the neighborhood of 46%: almost all come up with a sample percentage between 31% and 61%. The distribution is clearly bell-shaped. Why is that the case? Remember that each researcher computes the sample percentage as:

sample percentage = (number in the sample / size of sample) × 100% = (sum of draws / size of sample) × 100%

that is, the sample percentage is computed from the sum of draws. From the central limit theorem we know that the sum of draws follows the normal curve, if we draw with replacement and if the number of draws is reasonably large. The researchers drew without replacement, but when the size of the population is much larger than the size of the sample, the normal curve is still a reasonably good approximation (in this case the size of the population is 10,000 and the size of the sample is 100). If the sum of draws follows the normal curve, so will the sample percentage.
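The simulation behind figure 3.1 can be written in a few lines of R (a sketch; the choice of 10,000 repetitions is arbitrary, and your results will vary slightly from run to run):

```r
# The box: 4600 tickets marked 1, 5400 marked 0
population <- c(rep(1, 4600), rep(0, 5400))

# 10,000 researchers each draw 100 tickets and compute a sample percentage
percentages <- replicate(10000,
  sum(sample(population, 100, replace = FALSE)) / 100 * 100)

mean(percentages)                # close to 46: the expectation
sd(percentages)                  # close to 5: the SE from the formula
hist(percentages, freq = FALSE)  # bell-shaped, like figure 3.1
mean(percentages >= 36 & percentages <= 56)  # within 2 SEs: roughly 95%
mean(percentages >= 31 & percentages <= 61)  # within 3 SEs: roughly 99.7%
```

The simulated SD comes out slightly below the formula's 5% because the draws are made without replacement.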

The 68%-95%-99.7% rule applies (using the expectation instead of the average, and the SE instead of the SD). Most (approximately 99.7%) of the sample percentages fall within three standard errors of the expectation, that is, between

46% − 3 × 5% and 46% + 3 × 5%

or between

31% and 61%

Similarly, approximately 95% of the sample percentages fall within two standard errors of the expectation, that is, between

46% − 2 × 5% and 46% + 2 × 5%

or between

36% and 56%

To summarize the properties of the sampling distribution of the sample percentage:

sample percentage = (sum of draws / size of the sample) × 100%

1. The sample percentage is an unbiased estimator of the population percentage:

   E(sample percentage) = population percentage

2. The standard error is:

   SE(sample percentage) ≈ [(SD of population) / √(sample size)] × 100%

   This approximation is good if the population is much larger than the sample. The SD of the population (the box) can be found using the shortcut formula for 0-1 lists. (You don't have to memorize the formula for the SE.)

3. If the sample is large, the sampling distribution of the sample percentage approximately follows the normal curve (central limit theorem).

3.2 Sampling distribution of a sample average

We can now follow the same line of reasoning for the sampling distribution of a sample average. Suppose a market research firm is interested in the annual household income of the households in a small town. Let us model this population as a box of tickets, one ticket per household; on each ticket the annual income of that household is written. The average annual income of all households in the population is a parameter: a numerical characteristic of the population. The standard deviation of the population, 8,245, is another parameter. The market research firm doesn't know these parameters, and would like to estimate the population average by taking a random sample of 100 households. Suppose the sample looks like this:

(100 tickets)

The sample average is called a statistic (a numerical characteristic of a sample). The researcher computes the sample average by adding up the incomes in the sample and dividing by how many there are; that is:

sample average = sum of draws / sample size

Just like in the previous example, the sample average is a chance variable: in repeated samples, the outcome would be different. Just like in the previous example, the sample average is computed from the sum of draws, so (under certain conditions) the central limit theorem applies. It can be shown that the sample average is an unbiased estimator of the population average. The standard error for the sample average is:

SE(sample average) = SE(sum of draws) / sample size

Using the square root law for the SE of the sum of draws, we get:

SE(sample average) ≈ [√(number of draws) × (SD of the box)] / sample size

The number of draws is the same thing as the sample size, and the SD of the box is the same thing as the SD of the population, so we get:

SE(sample average) ≈ [√(sample size) × (SD of population)] / sample size

which can be written as:

SE(sample average) ≈ SD(population) / √(sample size)

The SD of the population is given as 8,245, so the SE for the sample average is:

SE(sample average) ≈ 8,245 / √100 = 824.5

This means that a researcher who tries to estimate the population average using a random sample of 100 households is typically going to be off by about 825 or so; that is the typical size of the chance error that a researcher will make. (Of course, the researcher doesn't know the SD of the population. If the sample is large, she can estimate the SD of the population by the SD of the sample. This technique is called the bootstrap. We'll get back to this when we discuss inference.)

In sum, the sampling distribution of the sample average has the following properties:

sample average = sum of draws / sample size

1. The sample average is an unbiased estimator of the population average:

   E(sample average) = population average

2. The standard error is

   SE(sample average) ≈ SD(population) / √(sample size)

   This approximation is good if the population is much larger than the sample. (You don't have to memorize the formula for the SE.)

3. If the sample is large, the sampling distribution of the sample average approximately follows the normal curve (central limit theorem).

3.3 Questions for Review

1. The sample percentage is a statistic. Explain.
2. What does the term sampling distribution mean? Explain, using the sampling distribution of a sample percentage as an example.
3. The sample percentage is a random variable. Explain.
4. You want to estimate a population percentage. Explain what the chance error is.
5. What does the term expectation of the sample percentage mean? Explain, using the concept of repeated samples.
6. What does the term standard error of the sample percentage mean? Explain, using the concept of repeated samples.
7. The sample percentage is an unbiased estimator of the population percentage. Explain.
8. Suppose you were able to take many large samples (each of the same size, say, 2,500) from the same population. For each sample, you compute the sample percentage. What would the density histogram of the sample percentages look like (central location, spread, shape)? Explain.
9. Assume that the sample is sufficiently large. What does the probability density of a sample percentage look like (central location, spread, shape)? Explain.
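Before moving on, the claims of section 3.2 about the sample average can also be checked by simulation. In the sketch below the income population is hypothetical: a normal population with mean 30,000 is an assumption made here for illustration; only the SD of 8,245 is taken from the example.

```r
# Hypothetical income population of 10,000 households:
# the mean (30,000) is an assumption; the SD (8,245) matches section 3.2
population <- rnorm(10000, mean = 30000, sd = 8245)

# 10,000 researchers each draw 100 households and compute a sample average
averages <- replicate(10000,
  mean(sample(population, 100, replace = FALSE)))

mean(averages)  # close to the population average (unbiasedness)
sd(averages)    # close to 8245 / sqrt(100) = 824.5 (the standard error)
```

A histogram of `averages` (try `hist(averages, freq = FALSE)`) is again bell-shaped, as the central limit theorem predicts.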

Chapter 4
Confidence intervals

Carefully review all of chapters 21 (skip section 5) and 23 from Freedman et al. (2007), covered in STA101. Below is a summary of the main ideas. The summary is no substitute for reviewing chapters 21 and 23.

4.1 Estimating a percentage

In a small town there are 10,000 households. A market research agency wants to find out what percentage of households own a tablet. The population percentage is an (unknown) parameter. To estimate the parameter, the market research agency interviews a random sample of 100 households. The researcher counts the number of households in the sample who say they own a tablet, and computes the sample percentage (a statistic). How reliable is the estimate?

In order to answer this question, we need an appropriate chance model. We model the population as a box of tickets, with a ticket for every household in the population. This is a case of classifying and counting: we classify a household as owning a tablet or not owning a tablet, and want to count the number of households who own a tablet. In cases of classifying and counting, a 0-1 box is appropriate. For households who own a tablet the value on the ticket is 1, and for households who don't own a tablet the value on the ticket is 0. The number of tickets of each kind is unknown:

[ ??? tickets marked 0 | ??? tickets marked 1 ]   (10,000 tickets)

The sample is like 100 random draws without replacement from this box. It will look something like:

{0, 0, 1, 0, 1, 0, 0, ...}   (100 entries)

The number of households in the sample who own a tablet is the sum of draws. The sum of draws is a chance variable: if the researcher had drawn a different sample of 100 households, the number of households in the sample who own a tablet would most likely have been different.

Suppose that the researcher interviewed 100 random households, of whom 41 say that they own a tablet. The sample percentage is:

sample percentage = (number in the sample / size of sample) × 100% = (41/100) × 100% = 41%

The sample percentage is called a point estimator of the population percentage, and the result (41%) is a point estimate. The decision maker who gave the market research agency the job of estimating the percentage would like to know how reliable the point estimate of 41% is. The sample percentage is a chance variable, subject to sampling variability: a different sample would most likely have generated a different estimate. We know from the previous chapter that if the sample is random, the sample percentage is an unbiased estimator of the population percentage:

E(sample percentage) = population percentage

Intuitively that means that if many researchers all took a random sample of 100 households and each computed a sample percentage, they would get different results, but the results would vary around the population percentage. Some researchers might come up with a sample percentage that is exactly equal to the population percentage, but about half of the researchers would come up with a sample percentage that underestimates the population percentage, and about half would come up with one that overestimates it. The difference between the sample percentage and the population percentage is called the chance error:

chance error = sample percentage − population percentage

The researcher who came up with the estimate of 41% also made a chance error. Of course, she won't be able to find out exactly how big the chance error is, because she doesn't know the population percentage. But she does know that she made a chance error. In the previous chapter we saw that the typical size of the chance error is called the standard error (SE).
We also saw that for a sample percentage, the standard error is:

SE(sample percentage) ≈ [(SD of population) / √(sample size)] × 100%

The bad news is that the researcher doesn't know the SD of the population. The good news is that statisticians have shown that, provided the sample is large, the SD of the sample is a reasonably good estimate of the SD of the population. So for large samples, we can approximate the SE for the sample percentage by:

SE(sample percentage) ≈ [(SD of sample) / √(sample size)] × 100%

(This is an example of the bootstrap technique.) To find the SD of the sample, the researcher can use the shortcut formula (p. 10):

SD of sample = √[(fraction of ones) × (fraction of zeroes)]

= √(0.41 × 0.59) ≈ 0.49

The resulting estimate for the standard error for the sample percentage is:

SE(sample percentage) ≈ [(SD of the sample) / √(sample size)] × 100% ≈ (0.49/√100) × 100% ≈ 4.9%

In sum: the sample estimate (41%) is off by 4.9% or so. It is very unlikely that the estimate is off by more than 14.7% (3 SEs).

4.2 Confidence interval for a percentage

From the previous chapter we know that, for large samples, the sampling distribution of the sample percentage approximately follows the normal curve (thanks to the central limit theorem). We also know that the sample percentage is an unbiased estimator of the population percentage:

E(sample percentage) = population percentage

Hence the probability distribution of the sample percentage looks approximately like this:

[Sketch: normal curve centered at the population percentage (pop%); horizontal axis: sample percentage.]

The distribution of the sample percentage implies that for 95% of all possible samples, the sample percentage will be in the interval from

population percentage − 2 SE to population percentage + 2 SE

(SE refers to the SE for the sample percentage). This implies that for 95% of all possible samples the chance error (= sample percentage − population percentage) will be in the interval from

−2 SE to +2 SE

Put a different way, for 95% of all possible samples the chance error (without the minus sign) will be smaller than 2 SE. Or: for 95% of all possible samples, the interval from

sample percentage − 2 SE to sample percentage + 2 SE

will cover the population percentage. This interval is called the 95%-confidence interval for the population percentage. There is a shorter notation for the 95%-confidence interval:

sample percentage ± 2 SE(sample percentage)

The term 2 SE is called the margin of error. The researcher found an estimate of 41% and a standard error of about 4.9%. The sample was reasonably large (100), so it's safe to assume that the normal curve is a good approximation of the probability distribution of the sample percentage. Hence the 95%-confidence interval for the population percentage is the interval between:

41% − 2 × 4.9% and 41% + 2 × 4.9%
41% − 9.8% and 41% + 9.8%
31.2% and 50.8%

In sum: the sample estimate (41%) is off by 4.9% or so. You can be about 95% confident that the interval from 31.2% to 50.8% covers the population percentage.

To compute a confidence interval for a population percentage with the TI-84, do STAT, TESTS, 1-PropZInt (one-proportion z interval). The z refers to the fact that we use the normal approximation (central limit theorem). The value x that you have to enter is the number of times the event occurs in the sample (41 in the example); n is the sample size (100 in the example); C-Level stands for confidence level: for a 95% confidence interval enter .95 (the default value). The procedure gives the sample proportion (p̂) and the boundaries of the confidence interval as decimal fractions; to get percentages, multiply by 100%. The confidence interval provided by the TI-84 (and by statistical software) differs somewhat from what you find using the formula above. Don't worry: our formula gives a good approximation.

4.3 Interpreting a confidence interval

Carefully read Freedman et al. (2007, chapter 21, section 3).
It's important that you understand the correct interpretation of a confidence interval. Suppose you have a fish bowl with marbles (the population): some red marbles and some blue ones. The proportion of red marbles in the population is:

population proportion = (number of red marbles / total number of marbles) × 100% = 80%

The population proportion is a parameter. Now you conduct the following experiment. You hire a researcher and tell her that you don't know the proportion of red marbles, and that you would like her to estimate the proportion of red

marbles from a random sample of 2,500 marbles. The researcher takes up the job: she takes a simple random sample of 2,500 marbles, counts the number of red marbles in the sample, and computes the sample proportion as a point estimate of the population proportion:

sample percentage = (number of red marbles in the sample / size of sample) × 100%

Because the sample is large (and hence the central limit theorem applies), she can compute a 95%-confidence interval:

sample percentage ± 2 SE(sample percentage)

The sample percentage is a chance variable: the outcome depends on the sample she took. Had she taken another sample of size 2,500, the sample percentage would most likely have been different (the population percentage would still have been 80%, of course): the chance is in the sampling variability, not in the parameter. Suppose that she finds a confidence interval like that in case (1):

Three confidence intervals (x = the parameter):
(1)  [----x----]      (covers)
(2)  x   [--------]   (does not cover)
(3)  [--------]   x   (does not cover)

Confidence interval (1) covers the population percentage (the researcher will of course not know this, because she doesn't know the population percentage). Now you hire another researcher, and you tell him the same thing: that you don't know the proportion of red marbles, and that you would like him to estimate the proportion of red marbles from a random sample of 2,500 marbles. Because he will draw a different sample, he will come up with a different point estimate, a different SE, and a different confidence interval. The confidence interval may cover the population percentage, but due to sampling variability it may not: the interval may be too far to the right (case (2)) or too far to the left (case (3)). (Freedman et al. (2007, p. 384) call confidence intervals that don't contain the parameter lemons.) Again, the researcher doesn't know whether the confidence interval he computed covers the population percentage: it may (case (1)), or it may not (cases (2) or (3)).
You can find out what happens in repeated samples if you repeat the experiment many times (say, you hire 100 researchers) and plot the resulting 95%-confidence intervals. A computer simulation posted on the course web site does exactly that (interpreting-a-confidence-interval-for-a-percentage.R). The script generates a diagram like figure 4.1 (compare to figure 1 in Freedman et al. (2007, p. 385)). Each horizontal line represents a 95%-confidence interval computed by a researcher. The vertical line shows the population percentage. Run the script a number of times (and make sure you understand what the script does). Count the number of lemons in each run. Is there a pattern? In sum: if 100 researchers each took a simple random sample of 2,500 marbles, and each computed a 95%-confidence interval, we would get 100 confidence intervals. The confidence intervals differ because of sampling variability. For about 95% of the samples, the interval sample percentage ± 2 SE(sample percentage) covers the population percentage, and for the other 5% it does not.
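A minimal version of that simulation can be sketched as follows (the bowl below holds 100,000 marbles, 80% of them red; the size of the bowl is an assumption made here, chosen so the bowl is much larger than the sample):

```r
# The bowl: 1 = red marble, 0 = blue marble; 80% red
population <- c(rep(1, 80000), rep(0, 20000))

# 100 researchers each take a sample of 2500 and compute a 95% interval
covers <- replicate(100, {
  draws <- sample(population, 2500, replace = FALSE)
  phat <- mean(draws)                         # sample proportion of reds
  se <- sqrt(phat * (1 - phat)) / sqrt(2500)  # bootstrap SE
  # does the interval phat +/- 2 SE cover the parameter (0.80)?
  phat - 2 * se <= 0.80 && 0.80 <= phat + 2 * se
})
sum(covers)  # roughly 95 of the 100 intervals cover the parameter
```

The intervals that do not cover the parameter are the lemons; their number varies from run to run, but hovers around 5.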

[Figure 4.1: Interpreting confidence intervals. The 95%-confidence interval is shown for 100 different samples (horizontal axis: percentage of reds; vertical axis: sample). The interval changes from sample to sample. For about 95% of the samples, the interval covers the population percentage, marked by a vertical line.]

4.4 Confidence interval for an average

Review all of chapter 23 in Freedman et al. (2007). This section just gives a brief summary of the main ideas, based on example 3 from Freedman et al. (2007). A marketing research agency wants to find out the average years of schooling in a small town. The population consists of all people age 25 and over in the town. The average of the population is an (unknown) parameter. The researcher interviews a random sample of 400 people age 25 and over, and lists the answers in a table. Together, the 400 interviewed people had 5,240 years of schooling. That implies a sample average of

5,240 years / 400 = 13.1 years

From the table with responses, the researcher also computes the standard deviation of the sample, which turns out to be 2.74 years. What is the standard error? What is the 95%-confidence interval?

Model the population as a box with many tickets; the researcher may not even know exactly how many. Each ticket represents a person age 25 or over who lives in the town. On each ticket the years of schooling of that person

is written. For instance, for someone who completed high school but took no higher education, the ticket says: 12 years. The box will look like this:

[ many tickets, each showing a number of years of schooling ]

Of course the researcher doesn't know the exact contents of the box. The sample is like 400 draws without replacement from the box. The researcher lists the years of schooling of the people in the sample. The list will look something like:

{16, 12, 18, 8, ...}   (400 entries)

To find the sample average, the researcher adds all the numbers (the sum is 5,240 years) and divides by how many entries there are in the sample (400):

sample average = sum of draws / sample size = 5,240 years / 400 = 13.1 years

Note that the sample average is computed using the sum of draws. From the list of responses, the researcher can also compute the standard deviation; as noted above, the standard deviation of the sample is 2.74 years. The sample average is called a point estimator of the population average, and the result (13.1 years) is a point estimate. The sample average is a chance variable: had the researcher drawn another sample of size 400, the sample average would most likely have been different. When the researcher reports to the decision maker, the decision maker would like to know how precise the point estimate is. In the previous chapter we learned that the sample average is an unbiased estimator of the population average:

E(sample average) = population average

That doesn't mean that the sample average found from this sample (13.1 years) is equal to the population average. Why not? Because the researcher made a chance error:

chance error = sample average − population average

The researcher who came up with the estimate of 13.1 years made a chance error. Of course, she won't be able to find out exactly how big the chance error is, because she doesn't know the population average. But she does know that she made a chance error.
In the previous chapter we saw that the typical size of the chance error is called the standard error (SE). We also saw that the standard error for the sample average is:

SE(sample average) ≈ SD(population) / √(sample size)

The researcher doesn't know the SD of the population, but if the sample is large the SD of the sample is a reasonably good estimate of the SD of the population:

SE(sample average) ≈ SD(sample) / √(sample size)

(This is an example of the bootstrap technique.) The researcher computed the SD from the sample and found: SD(sample) = 2.74 years. The resulting estimate for the standard error for the sample average is:

SE(sample average) ≈ SD(sample) / √(sample size) = 2.74 years / √400 ≈ 0.137 years

From the previous chapter we know that if the sample is reasonably large, the probability distribution of the sample average approximately follows the normal curve. We also know that the sample average is an unbiased estimator of the population average:

E(sample average) = population average

As a result, the probability distribution of the sample average looks like this:

[Sketch: normal curve centered at the population average; horizontal axis: sample average.]

The probability distribution of the sample average implies that for 95% of all possible samples of size 400, the sample average will be between

population average − 2 SE and population average + 2 SE

(SE refers to the SE for the sample average.) This implies that for 95% of all possible samples of size 400, the interval from

sample average − 2 SE to sample average + 2 SE

will cover the population average. This interval is the 95%-confidence interval for the population average. A shorter notation is:

sample average ± 2 SE(sample average)

The researcher in our example found a sample average of 13.1 years and a standard error for the sample average of about 0.137 years. The 95%-confidence interval for the population average is:

13.1 years ± 2 × 0.137 years

or:

13.1 years ± 0.274 years

The margin of error (with a confidence level of 95%) is 0.274 years. So the 95%-confidence interval for the population average is the interval from:

13.1 years − 0.274 years to 13.1 years + 0.274 years

or from:

12.8 years to 13.4 years

In sum: the sample estimate (13.1 years) is off by 0.137 years or so. You can be about 95% confident that the interval from 12.8 years to 13.4 years covers the population average.

To compute a confidence interval for a population average with the TI-84, do STAT, TESTS, ZInterval. The z refers to the fact that we use the normal approximation (central limit theorem). The value σ (sigma) that you have to enter is the standard deviation of the population; as you don't know the standard deviation of the population, enter the sample standard deviation instead (you are using the bootstrap, but remember that the bootstrap only works when the sample is large). In the example, the standard deviation of the sample was 2.74 years. The value x̄ (x-bar) that you have to enter is the sample average (13.1 years in the example) and n is the sample size (400 in the example). C-Level stands for confidence level; for a 95% confidence interval enter .95.

4.5 Don't confuse SE and SD

The researcher computed two numbers: she found that the SD of the sample was 2.74 years, and that the SE for the sample average was 0.137 years. These two numbers tell two different stories (Freedman et al., 2007, p. 417):

- the SD says how far schooling is from the average, for typical people;
- the SE says how far sample averages are from the population average, for typical samples.

People who confuse SE and SD often think that 95% of the people have schooling in the range 13.1 years ± 0.274 years (13.1 years ± 2 SE). That is wrong. The interval 13.1 years ± 0.274 years covers only a very small part of the range of years of schooling: the SD is about 2.74 years.
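The contrast is easy to see numerically in R (a sketch using the numbers from the example):

```r
xbar <- 13.1       # sample average, in years
sd.sample <- 2.74  # SD of the sample: spread of individual people
n <- 400
se <- sd.sample / sqrt(n)  # SE of the sample average: 2.74/20 = 0.137

c(xbar - 2 * se, xbar + 2 * se)
# 95%-confidence interval for the POPULATION AVERAGE: about 12.8 to 13.4

c(xbar - 2 * sd.sample, xbar + 2 * sd.sample)
# range of about 7.6 to 18.6, which is what would cover most individual
# PEOPLE's schooling (if the histogram is roughly bell-shaped)
```

The first interval is about 20 times narrower than the second: that factor is exactly √400, the square root of the sample size.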
The confidence interval measures something else: if many researchers each take a sample of 400 people, and each computes a 95%-confidence interval, then about 95% of the confidence intervals will cover the population average; the other 5% won't. The term confidence reminds you that the chance is in the sampling variability; the population average doesn't change.

4.6 Questions for Review

1. Why do we need to use the bootstrap when estimating the standard error for a percentage?

2. What is the margin of error (at a 95% confidence level) for the population percentage? For a population mean?
3. Suppose the decision maker requires a level of confidence higher than 95% (say, 99%). Would the margin of error be bigger, smaller, or the same as with a level of confidence of 95%? Explain.
4. Suppose the decision maker is happy with a confidence level of 95% but wants a smaller margin of error. What should you do? Explain.
5. What is the difference between the standard deviation of the sample and the standard error of the sample average? Explain.
6. A researcher computes a 95%-confidence interval for the mean. Right or wrong: 95 percent of the values in the sample fall within this interval. Explain.
7. A researcher computes a 95%-confidence interval for the mean. Explain what the meaning of the interval is, using the concept of repeated samples. Add a sketch.

4.7 Exercises

Work the following exercises from Freedman et al. (2007), chapter 21: A 1; A 2 and B 2(a); A 3; A 4; A 5; A 9; B 1; B 4; C 5; C 6; D 1; D 2. Work the following exercises from Freedman et al. (2007), chapter 23: A 1; A 2; A 3; A 4; A 5; A 6; A 7; A 8; A 9; A 10; B 1; B 2; B 3; B 4; B 5; B 6; B 7; C 1; C 2; C 3; C 4; D 1; D 2; D 3; D 4.

Chapter 5
Hypothesis tests

Read Freedman et al. (2007, Ch. 26, 29). Leave section 6 of chapter 26 for later.

Questions for Review

1. Freedman et al. (2007, Ch. 26) repeatedly use the term observed difference. Explain (the difference between what and what?).
2. What is a test statistic? Explain using an example.
3. What is the observed significance level (or P-value)?
4. When the P-value is less than 1% (so the result is highly statistically significant), what does this imply for the null hypothesis?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 26: Set A: 4, 5. Set B: 1, 2, 5. Set C: 1, 2, 4, 5. Set D: all. Set E: 1-5, 7, 10. Set F: 1-4. Review Exercises: 1-5, 7, 8. Work the following exercises from Freedman et al. (2007), chapter 29: Set A: 1, 2. Set C: 1, 4, 5, 7. Set D: 2, 5.


Chapter 6
Hypothesis tests for small samples

Read Freedman et al. (2007, Ch. 26, section 6). The spectrophotometer example used by Freedman et al. (2007, Ch. 26) is rather complicated. Consider instead the following story (which uses the same numbers):

A large chain of gas stations sells refrigerated cans of Coke. On average the chain sells about 70 per day per station. The manager notices that a competing chain has increased the price of a refrigerated can of Coke, and wonders whether as a result she is now on average selling more than before. She records the number of cans sold by five randomly selected gas stations from the chain. Four out of five of these numbers are higher than 70, and some of them by quite a bit. Can this be explained on the basis of chance variation? Or did the mean number of cans sold per gas station increase? (You can now construct the box model, formulate the null and alternative hypotheses, and compute the test statistic. Continue reading on p. 490 and replace ppm by cans.)

Questions for Review

1. When the sample is small, there is extra uncertainty. How does the test procedure take this extra uncertainty into account (two ways)?
2. What are the properties of Student's t-curve? Compare with the normal curve.
3. When should Student's curve be used?

Exercises

1. A long series of the number of refrigerated cans of Coca-Cola sold by a large chain of gas stations averages to 253 cans per station per week. Following an advertising campaign by Pepsi-Cola, the manager of the chain collects data from ten randomly selected gas stations. She finds that the number of cans of Coca-Cola in the sample averages 245 cans, and that the standard deviation is 9 cans. Did the mean fall, or is this chance variation? (This is a variant of Set F, exercise 6.)
2. (I will add more exercises later.)
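As an illustration of the small-sample recipe (this sketch is added here and is not part of the original exercises), the t-statistic for exercise 1 can be computed in a few lines of Python, following Freedman et al.'s approach: adjust the SD by the factor sqrt(n/(n-1)) to get SD+, use SD+ to form the SE for the average, and compare the t-statistic with Student's curve with n - 1 degrees of freedom.

```python
import math

# Exercise 1: long-run mean 253 cans, sample of n = 10 stations,
# sample average 245 cans, sample SD 9 cans.
n = 10
null_mean = 253.0
sample_avg = 245.0
sample_sd = 9.0

# small-sample adjustment: SD+ = SD * sqrt(n / (n - 1))
sd_plus = sample_sd * math.sqrt(n / (n - 1))

# SE for the sample average, computed from SD+
se_avg = sd_plus / math.sqrt(n)

# t-statistic, to be compared with Student's curve, n - 1 = 9 degrees of freedom
t = (sample_avg - null_mean) / se_avg

print(round(sd_plus, 2))  # 9.49
print(round(se_avg, 2))   # 3.0
print(round(t, 2))        # -2.67
```

A t of about -2.7 with 9 degrees of freedom is well beyond the usual cut-offs on Student's curve, which is the kind of comparison the chapter asks you to make by hand.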

Chapter 7
Hypothesis tests on two averages

Read Freedman et al. (2007, Ch. 27).

Questions for Review

1. What is the standard error for a difference? Explain using a box model. Give an example of a case where the formula for the SE for a difference used in the textbook does not apply.
2. What are the assumptions of the two-sample z-test for comparing two averages? Can you think of examples where you want to compare two averages but a z-test is not appropriate?
3. When should the χ²-test be used, as opposed to the z-test?
4. What are the six ingredients of a χ²-test?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 28: Set A: 1 (use the TI-84), 2 (use the TI-84), 3, 4, 7, 8. Set C: 2.
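The two-sample z-test behind these review questions can be sketched in a few lines of Python (an illustration added here, not part of the original notes; the sample sizes, averages, and SDs below are hypothetical, chosen only to show the arithmetic):

```python
import math

# Hypothetical summary statistics for two independent simple random samples
n1, avg1, sd1 = 400, 68.0, 9.0   # group 1
n2, avg2, sd2 = 625, 66.0, 8.0   # group 2

# SE for each sample average
se1 = sd1 / math.sqrt(n1)
se2 = sd2 / math.sqrt(n2)

# SE for the difference of two independent averages:
# the square root of the sum of the squared SEs
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)

# z-statistic for the null hypothesis "the two population averages are equal"
z = (avg1 - avg2) / se_diff

print(round(se_diff, 3))  # 0.552
print(round(z, 2))        # 3.62
```

Note the square-root-of-sum-of-squares step: this is exactly the SE-for-a-difference formula from the box model, and it only applies when the two samples are independent (review question 1).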


Chapter 8
Correlation

Read Freedman et al. (2007, Ch. 8, 9). Skip section 3 ("The SD line") from Freedman et al. (2007, Ch. 8).

Questions for Review

1. If you want to summarize a scatter diagram, which five numbers would you report?
2. What are the properties of the coefficient of correlation?
3. How do you compute a coefficient of correlation?
4. What are the units of a coefficient of correlation?
5. Which data manipulations will not affect the coefficient of correlation?
6. In which cases can a coefficient of correlation be misleading? Make sketches to illustrate your point.
7. What are ecological correlations? Why can they be misleading?
8. What is the connection between correlation and causation?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 8: Set A: 1, 6. Set B: 1, 2, 9. Set D: 1.
Work the following exercises from Freedman et al. (2007), chapter 9: Set C: 1, 2. Set D: 1, 2. Set E: 2, 3, 4.
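Freedman et al.'s recipe for computing r (review question 3) — convert each variable to standard units, then take the average of the products — can be sketched in Python (an illustration added here; the five data points are hypothetical, chosen only to keep the arithmetic small):

```python
import math

# Hypothetical data, not from the course data sets
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

def average(values):
    return sum(values) / len(values)

def sd(values):
    # population-style SD, as in Freedman et al.
    avg = average(values)
    return math.sqrt(average([(v - avg) ** 2 for v in values]))

def standard_units(values):
    avg, s = average(values), sd(values)
    return [(v - avg) / s for v in values]

# r = average of (x in standard units) * (y in standard units)
products = [sx * sy for sx, sy in zip(standard_units(x), standard_units(y))]
r = average(products)
print(round(r, 2))  # 0.8
```

Because standard units are free of the original units of measurement, r has no units (review question 4), and adding a constant to x or multiplying x by a positive constant leaves r unchanged (review question 5).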


Chapter 9
Line of best fit

Read Freedman et al. (2007), chapters 10 (introduction only), 11, 12.

Note

The formula sheet has the following formula for the y-intercept of the line of best fit:

y-intercept = (ave of y) − slope × (ave of x)

This formula is obtained as follows. The equation of the line of best fit is:

predicted value of y = slope × x + y-intercept

As the line of best fit passes through the point of averages (ave of x, ave of y), we know that:

ave of y = slope × (ave of x) + y-intercept

Solving this expression for the y-intercept yields:

y-intercept = (ave of y) − slope × (ave of x)

Q.E.D.

Questions for Review

1. What does the line of best fit measure?
2. On the average, what happens to y if there is an increase of one SD in x?
3. Suppose you have a scatter plot with a line of best fit. What is the error (or residual) of a given point of the scatter plot? Illustrate.
4. What does the standard error (or r.m.s. error) of regression measure?
5. How is the standard error (or r.m.s. error) of regression computed?
6. What properties do the residuals have?
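To make the note concrete, here is a small Python sketch (added here as an illustration; the summary numbers are hypothetical, not the course data) that builds the line of best fit from five summary numbers — the two averages, the two SDs, and r — using slope = r × (SD of y)/(SD of x) and the intercept formula above:

```python
# Hypothetical summary statistics for a scatter diagram
ave_x, sd_x = 170.0, 10.0   # e.g. heights in cm
ave_y, sd_y = 65.0, 12.0    # e.g. weights in kg
r = 0.60                    # coefficient of correlation

# slope = r * (SD of y) / (SD of x)
slope = r * sd_y / sd_x

# y-intercept = (ave of y) - slope * (ave of x)
intercept = ave_y - slope * ave_x

print(round(slope, 2))      # 0.72
print(round(intercept, 1))  # -57.4

# The line passes through the point of averages, as the note claims:
assert abs(slope * ave_x + intercept - ave_y) < 1e-9

# Predicted y for x = 190
print(round(slope * 190 + intercept, 1))  # 79.4
```

The assert is the whole point of the derivation: whatever the five summary numbers are, plugging the average of x into the fitted line returns the average of y.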

7. What is the difference between a homoscedastic and a heteroscedastic scatter diagram? Illustrate.
8. If you run an observational study, can the line of best fit be used to predict the results of interventions? Why (not)?
9. In what sense is the line of best fit the line that gives the best fit?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 10: Set A: 1, 2, 4.
Work the following exercises from Freedman et al. (2007), chapter 11: Set A: 3, 4, 7. Set D: 2, 3.
Work the following exercises from Freedman et al. (2007), chapter 12: Set A: 1, 2.

Additional exercise

Table 9.1 shows the heights and weights of 30 students (the file is available as students.csv on the course web site). The average height is cm and the SD is 9.63 cm. The average weight is kg and the SD is kg. The coefficient of correlation between height and weight is .
(a) Make a scatter plot (12 cm tall and 10 cm wide) on squared or graphing paper. Truncate the horizontal axis at 150 cm and let it run to 200 cm (2 cm on the page is 10 cm of a student's height). Truncate the vertical axis at 40 kg and let it run to 100 kg (2 cm on the page is 10 kg of a student's weight). In class, plot 10 points or so (you can plot the other points later).
(b) Show the point of averages, the run, the rise, and the second point of the line of best fit. With the two points, draw the line of best fit.
(c) Compute the slope and the y-intercept of the line of best fit. Report the line of best fit (use the actual variable names and pay attention to the units of measurement).
(d) Find the predicted weight for the student with a height of 190 cm. Illustrate in the scatter plot.
(e) Find the residual for the student with a height of 190 cm. Illustrate in the scatter plot.
(f) The r.m.s. error is 8.65 kg (you can verify this using the R script referred to below). Draw the line of best fit plus two r.m.s. errors and minus two r.m.s. errors (cf. Freedman et al., 2007, p. 183). Add all 30 points to the scatter plot and count which percentage of points lies within two r.m.s. errors of the line of best fit. Does the rule of thumb give a good approximation?

Use R Commander to find the descriptive statistics (average and standard deviation), the coefficient of correlation, and the equation of the line of best fit, and to make a scatter plot with the line of best fit.
To get the averages and SDs, select in the menu: Statistics > Summaries > Numerical summaries.
To get the coefficient of correlation, select: Statistics > Summaries > Correlation matrix.
To get the scatter plot with the line of best fit, select: Graphs > Scatterplot... In the Data tab, select the x-variable and the y-variable. In the Options tab, only select the Plot option Least-squares line (unselect other items that may be selected by default). In the Identify Points options, select Do not identify.
To get the coefficients of the line of best fit, select: Statistics > Fit models > Linear regression... Select the correct response variable (the dependent variable: the variable on the y-axis of the scatter plot) and explanatory variable (the independent variable: the variable on the x-axis of the scatter plot).
Compare your outcomes with the outcomes obtained using R. (Alternatively, the R script R-script-Scatter-plot-of-heights-and-weights.R on the course web site computes the line of best fit and makes the scatter plot.)

Table 9.1: Heights and weights of 30 students

Height (cm)   Weight (kg)
(the table values are in students.csv on the course web site)

Chapter 10
Multiple Regression

10.1 An example

Suppose we are interested in child mortality.¹ Child mortality differs tremendously across countries: in Sierra Leone, out of every 1000 children that were born alive, 193 die before their fifth birthday; in Iceland, only 3 do (data refer to 2010). We suspect that child mortality is related to income per person and the education of young mothers. To examine whether such a relationship exists, we collected the following data for 214 countries from World Bank (2013):

- child mortality: mortality rate, under-5 (per 1,000 live births), 2010 (indicator code: SH.DYN.MORT);
- income per person: gross national income (GNI) per capita, PPP (current international $), 2010 (indicator code: NY.GNP.PCAP.PP.CD);
- for the education of young mothers I use as a proxy: literacy rate, youth female (% of females ages 15-24), 2010 or most recent (indicator code: SE.ADT.1524.LT.FE.ZS).

The data set is posted on the course web site as STA201-multiple-regression-example.csv. In R Commander, import the data (Data > Import data > From text file, clipboard, or URL...). Inspect the data by clicking the View Data button. To obtain the descriptive statistics, do: Statistics > Summaries > Numerical summaries... The computer output looks like this:

                              mean  sd  n  NA
  GNI.per.capita.PPP
  Literacy.rate.youth.female
  Mortality.rate.Under.5

Always report the descriptive statistics (mean, standard deviation) and units of measurement of all variables. Never include raw computer output (like the one above) in a paper. Summarize the information (including units of measurement and the data sources) in a reader-friendly table (table 10.1). Round numbers to the number of decimals that is relevant to your reader. Carefully document the sources, either in the text or in a note to the table.

¹ I borrowed the example from Gujarati (2003, pp. and table 6.4 p. 185) and updated the data.

Table 10.1: Descriptive statistics

                                             mean   SD   n   NA
  Income per capita (international $, PPP)
  Youth female literacy rate (%)
  Child mortality rate (per 1000)

Note. n is the number of countries for which the data exist. NA is the number of countries for which data are not available.

If the relationship between child mortality (as the response variable) and income per person and literacy of young mothers is linear, the regression equation looks like this:

predicted child mortality = m1 × income + m2 × literacy + b

(or, more generally, for any variables y, x1, and x2: predicted value of y = m1 × x1 + m2 × x2 + b).

Regression with more than one explanatory variable is called multiple regression; m1 and m2 are slope coefficients, and b is the y-intercept. We can't (easily) plot this equation because it is no longer the equation of a straight line (it is a plane in three-dimensional space). It is however possible to find the values for the coefficients m1, m2, and b that minimize the r.m.s. error of regression. The mathematics behind the formula to estimate the coefficients is beyond the scope of this course, but any statistical computer program can compute the coefficients. In R Commander do: Statistics > Fit models > Linear regression... This is the R output for the regression of the example:

  Call:
  lm(formula = Mortality.rate.Under.5 ~ GNI.per.capita.PPP + Literacy.rate.youth.female, data = Dataset)

  Residuals:
      Min      1Q  Median      3Q     Max

  Coefficients:
                               Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)                 2.131e+02                        < 2e-16 ***
  GNI.per.capita.PPP                                                   ***
  Literacy.rate.youth.female                                   < 2e-16 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 25.89 on 132 degrees of freedom
    (79 observations deleted due to missingness)
  Multiple R-squared: , Adjusted R-squared:
  F-statistic: on 2 and 132 DF, p-value: < 2.2e-16

(The regression output of other statistical software is similar.) The first line shows the regression equation. lm stands for linear model. The variable before the tilde (~) is the dependent variable (the variable you want to predict; y in the mathematical notation). Following the tilde are the independent variables (x1 and x2 in the mathematical notation). Think of the tilde as meaning "is predicted by": y ~ x1 + x2 means y is predicted by x1 and x2.

The column Estimate in the Coefficients table gives the estimated coefficients of the regression equation. We discuss the other columns of the Coefficients table in the chapter on inference for regression. Residual standard error is the r.m.s. error of regression (calculated slightly differently than in Freedman et al. (2007), so the result may differ somewhat; if the number of cases is small, the difference may be quite large). The r.m.s. error of regression measures by how much the predicted value for y typically deviates from the actual value.

Report the equation using the actual variable names, not y, x1, and x2:

predicted child mortality = income literacy

The r.m.s. error of regression is 25.89: this is by how much the predicted value for child mortality typically deviates from the actual value. The R output reports that 79 observations were deleted due to missingness, so the regression uses only 135 (= 214 − 79) countries.

10.2 Interpretation of the slope coefficients

In a controlled experiment, the researcher controls for variables other than the treatment variable. In observational studies, however, usually the variables y, x1, and x2 all vary at the same time. One of the strengths of multiple regression is that it allows us to isolate the association between the dependent variable (y) and one of the independent variables, keeping the other independent variables in the equation constant.
The slope coefficient m1 measures by how much the predicted value of y changes if x1 increases by one unit, keeping all other independent variables in the equation constant. (In this case, there is only one other independent variable: x2.) Similarly, the slope coefficient m2 measures by how much the predicted value of y changes if x2 increases by one unit, keeping all other independent variables in the equation constant. (If you took calculus: m1 is the partial derivative of y with respect to x1, and m2 is the partial derivative of y with respect to x2.) For the numerical example, the slope coefficient of income per capita shows that a one unit (in this case: a one dollar) increase in income per capita is associated with a decrease in the predicted child mortality rate of 0.00077 units (children per 1000 live births). Note that the slope coefficients have as units of measurement: units of the response variable per unit of the independent variable. This is clear from the formula for the slope coefficient in simple regression:

slope = r × (SD of y) / (SD of x)

and is still true for the slope coefficients in multiple regression. In practice, the units of measurement of the coefficients in multiple regression are often omitted when the equation is reported.
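For readers who want to see the least-squares machinery the notes leave as a black box, here is a pure-Python sketch (an added illustration, with hypothetical data generated from a known plane, not the World Bank data) that fits y = m1 x1 + m2 x2 + b via the normal equations and checks the "keeping the other variable constant" reading of m1:

```python
def solve3(A, v):
    """Solve a 3x3 linear system A x = v by Gaussian elimination with pivoting."""
    A = [row[:] for row in A]
    v = v[:]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            for c in range(i, 3):
                A[r][c] -= f * A[i][c]
            v[r] -= f * v[i]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        x[i] = (v[i] - sum(A[i][c] * x[c] for c in range(i + 1, 3))) / A[i][i]
    return x

# Hypothetical data: y generated exactly from the plane y = -0.5*x1 - 2*x2 + 150,
# so the fit should recover the true coefficients.
data = [(x1, x2, -0.5 * x1 - 2.0 * x2 + 150.0)
        for x1 in (0, 10, 20, 30) for x2 in (0, 25, 50, 75)]

# Normal equations: X'X [m1, m2, b]' = X'y, with columns x1, x2, and a constant 1
XtX = [[0.0] * 3 for _ in range(3)]
Xty = [0.0] * 3
for x1, x2, y in data:
    row = (x1, x2, 1.0)
    for i in range(3):
        for j in range(3):
            XtX[i][j] += row[i] * row[j]
        Xty[i] += row[i] * y

m1, m2, b = solve3(XtX, Xty)
print(round(m1, 3), round(m2, 3), round(b, 3))  # -0.5 -2.0 150.0

def predict(a, c):
    return m1 * a + m2 * c + b

# m1 = change in predicted y per unit increase in x1, with x2 held constant
print(round(predict(11, 40) - predict(10, 40), 3))  # -0.5
```

In practice you would let R (or any statistics package) do this; the sketch only shows that "minimize the r.m.s. error" boils down to solving a small linear system.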

On the scale of income per capita that typically ranges from $1 950 (first quartile) to $ (third quartile), a $1 increase is not very meaningful; a $1000 increase is more relevant. So let us reinterpret the slope coefficient of income per capita: the regression predicts that a $1000 increase of income per capita is associated with a decrease by 0.77 in the predicted number of children per 1000 who die before their fifth birthday. The slope coefficient of the literacy rate of youth females shows that a 1 percentage point increase in the literacy rate of youth females is associated with a drop of the predicted child mortality rate by about 1.8 units, that is, a decrease by about 1.8 in the number of children per 1000 who die before their fifth birthday.

Be careful when drawing policy conclusions from an observational study. It is tempting to infer from the regression that a policy that increases the female youth literacy rate by 10 percentage points would cause child mortality to decrease by about 18. However,

  [T]he slope cannot be relied on to predict how y would respond if you intervene to change the value of x. If you run an observational study, the regression line only describes the data you see. The line cannot be relied on for predicting the results of interventions. (Freedman et al., 2007, p. 206)

10.3 Coefficient of determination

The following sketch shows, in a scatter plot of a simple regression, the actual value of y, the predicted value of y, and the average of y (many textbooks use the following notation: y for the actual value, ŷ for the predicted value, and ȳ for the average). [Sketch not reproduced here.]

Take a closer look at the following vertical differences:

actual value for y − ave y
predicted value for y − ave y
actual value for y − predicted value for y = residual

Note that:

actual value for y − ave y = (predicted value for y − ave y) + (actual value for y − predicted value for y)

Define the total sum of squares (TSS) as the sum of squared deviations between each actual value for y and the average of y:

TSS = sum of (actual values for y − ave y)²

The total sum of squares is a measure of the variation of y around its average. The explained sum of squares (ESS) is the sum of squared deviations between each predicted value for y and the average of y:

ESS = sum of (predicted values for y − ave y)²

The explained sum of squares is a measure of the variation of y around its average that is explained by the regression equation. The residual sum of squares (RSS) is the sum of squared residuals:

RSS = sum of (actual values for y − predicted values for y)²

When we computed the r.m.s. error of regression we already encountered the residual sum of squares as the numerator of the fraction under the square root in Freedman et al. (2007, p. 182):

RSS = (error #1)² + (error #2)² + ... + (error #n)²

It can be shown (proof omitted) that the total sum of squares is equal to the explained sum of squares plus the residual sum of squares:

TSS = ESS + RSS

Divide both sides by TSS:

1 = ESS/TSS + RSS/TSS

The term RSS/TSS measures which proportion of the variation in y around its average is left unexplained by the regression. The term ESS/TSS shows which proportion of the variation in y around its average is explained by the regression. We call this proportion the coefficient of determination (notation: R²): the coefficient of determination (R²) measures the proportion of the variation in the dependent variable that is explained by the regression equation:

R² = ESS/TSS = [sum of (predicted values for y − ave y)²] / [sum of (actual values of y − ave y)²]

The coefficient of determination is a measure of the goodness-of-fit of the estimated regression equation. You don't have to memorize the formula, but you do have to know the meaning of R². In the R computer output, the coefficient of determination is reported as Multiple R-squared.
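The identity TSS = ESS + RSS and the definition of R² can be checked numerically. The sketch below (an added illustration; the five data points are hypothetical) fits a simple regression and verifies the decomposition:

```python
# Check TSS = ESS + RSS and compute R^2 for a simple regression
# on a small hypothetical data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)

ave_x = sum(x) / n
ave_y = sum(y) / n

# least-squares slope and intercept
slope = (sum((a - ave_x) * (b - ave_y) for a, b in zip(x, y))
         / sum((a - ave_x) ** 2 for a in x))
intercept = ave_y - slope * ave_x
predicted = [slope * a + intercept for a in x]

TSS = sum((b - ave_y) ** 2 for b in y)
ESS = sum((p - ave_y) ** 2 for p in predicted)
RSS = sum((b - p) ** 2 for b, p in zip(y, predicted))

print(round(TSS, 2), round(ESS, 2), round(RSS, 2))  # 10.0 6.4 3.6
assert abs(TSS - (ESS + RSS)) < 1e-9                # the decomposition holds

R2 = ESS / TSS
print(round(R2, 2))  # 0.64, which is r^2 for r = 0.8 in this simple regression
```

The last line also illustrates the fact mentioned below: for simple regression, R² equals the square of the coefficient of correlation.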
In the example, R² is equal to about 0.66; this means that the estimated regression equation (for the 135 countries in the data set) explains about two-thirds (66%) of the variation of child mortality around its mean. That is quite a lot: the estimated regression equation fits the data quite well. It can be shown that for simple regression (with only one independent variable), R² is equal to the coefficient of correlation (r) squared: R² = r². For multiple regression, that is not the case (there are several coefficients of correlation, one for each independent variable: between y and x1, and between y and x2). That is why for multiple regression the coefficient of determination is written as capital R², not lowercase r².

10.4 Questions for Review

1. How does multiple regression differ from simple regression?
2. What is the interpretation of the coefficient of one of the variables at the right-hand side of a multiple regression equation?
3. How is a residual in a multiple regression model computed?
4. What is the total sum of squares? What is the explained sum of squares? What is the residual sum of squares?
5. How is the coefficient of determination of a regression model computed?
6. What is the meaning of the coefficient of determination? If the coefficient of determination of a regression model is equal to 0.67, what does this mean?

10.5 Exercises

1. For 14 systems analysts, their annual salaries (in $), years of experience, and years of postsecondary education were recorded (Kazmier, 1995, table 15.2 p. 275). Below is the computer output for the descriptive statistics and a multiple regression of the annual salaries on the years of experience and the years of postsecondary education.
(a) Download the data (Kazmier1995-table-15-2.csv) from the course web site. Compute the descriptive statistics and run the regression in R with R Commander. You should get the same output as shown below.

                                      mean  sd  n
  annual.salary
  years.of.experience
  years.of.postsecondary.education

  Call:
  lm(formula = annual.salary ~ years.of.experience + years.of.postsecondary.education, data = Dataset)

  Residuals:
      Min      1Q  Median      3Q     Max

  Coefficients:
                                    Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)                                                       e-09 ***
  years.of.experience                                                    **
  years.of.postsecondary.education                                        *
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 2189 on 11 degrees of freedom
  Multiple R-squared: , Adjusted R-squared:
  F-statistic: on 2 and 11 DF, p-value:

(b) Report the equation like you would in a paper.
(c) Explain what the coefficients, the residual standard error, and the value of R² mean. Pay attention to the units of measurement.
(d) Predict the annual salary of a systems analyst with four years of education and three years of experience.
(e) Would it be meaningful to use the regression equation to predict the annual salary of a systems analyst with four years of education and twenty years of experience? Why (not)?


Chapter 11
Hypothesis tests for regression coefficients

Until now we have used regression as a tool of descriptive statistics: as a method to describe relationships between variables. Under certain conditions, regression can also be a tool of inferential statistics: we can test hypotheses about regression coefficients. This chapter explains when and how.

11.1 Population regression function

Consider the following (hypothetical) example (drawn from Gujarati (2003, Chapter 2)). Suppose that during one week a population of 60 families had the weekly income and the weekly consumption expenditure shown in table 11.1. The data set is posted on the course web site as Gujarati-2003-table-2-1.csv; the R script as two-variable-regression-analysis.r. Figure 11.1 shows the scatter plot. There are ten income groups (families with incomes of $80, $100, $120, ..., and $260).

Table 11.1: Income and consumption of a population of 60 families ($)

case   weekly family income   weekly family consumption expenditure
(the table values are in Gujarati-2003-table-2-1.csv on the course web site)

Let us first focus on families with a weekly income of $80. The population has five such households. Each family with a weekly income of $80 is represented by a ticket; on the ticket, the amount of the family's consumption expenditures is written. This is the box for the sub-population of families with a weekly income of $80:

$55  $60  $65  $70  $75

Consider the following chance experiment: draw one ticket from the box. The following table shows all possible values for the chance variable and the corresponding probabilities:

value of y given x = $80:   $55   $60   $65   $70   $75
probability:                0.2   0.2   0.2   0.2   0.2

This table is the probability distribution of the consumption expenditures (y) for families with a weekly income of $80. What are the expected consumption expenditures of households with a weekly income of $80? The expectation of a chance variable is the weighted average of all possible values; the weights are the probabilities. This gives:

E(y | x = $80) = $55 × 0.2 + $60 × 0.2 + $65 × 0.2 + $70 × 0.2 + $75 × 0.2 = $65

Similarly, it can be shown that

E(y | x = $100) = $77
E(y | x = $120) = $89
E(y | x = $140) = $101
E(y | x = $160) = $113
E(y | x = $180) = $125
E(y | x = $200) = $137
E(y | x = $220) = $149
E(y | x = $240) = $161
E(y | x = $260) = $173

(Exercise 1 asks you to verify this for E(y | x = $180).) The expected values are shown as black dots in figure 11.1. Verify with the TI-84 that the points (x, E(y | x)) are on the straight line with equation:

E(y | x) = 0.6x + 17

This equation is called the population regression function. It is shown as a solid line in the scatter plot (figure 11.1). The relationship between E(y | x) and x need not be a linear one, that is, a function that yields a straight line when you plot it, but we limit our attention to those cases where the population regression function is a linear function:

E(y | x) = mx + b

(or, in the case of multiple regression, a linear equation of the form E(y | x) = m1 x1 + m2 x2 + b).

11.2 The error term

Within each income class (a vertical strip in the scatter plot), we define the error as the difference between the actual value of y and the expected value of y:

error = actual − expected

For the five households with a weekly income of $80 the errors are:

error #1 = $55 − $65 = −$10
error #2 = $60 − $65 = −$5
error #3 = $65 − $65 = $0
error #4 = $70 − $65 = $5
error #5 = $75 − $65 = $10
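The box-model arithmetic above can be checked in a few lines of Python (an added illustration; since the five tickets are equally likely, the weighted average with weights 0.2 is just the plain average):

```python
# Box for families with weekly income $80: five equally likely tickets.
tickets = [55, 60, 65, 70, 75]

# expectation = weighted average of the possible values;
# with equal weights 1/5 this reduces to the plain average
expected = sum(tickets) / len(tickets)
print(expected)  # 65.0

# errors = actual - expected, as listed above
errors = [v - expected for v in tickets]
print(errors)    # [-10.0, -5.0, 0.0, 5.0, 10.0]

# the expected values lie on the population regression function E(y|x) = 0.6x + 17
for x, e in [(80, 65), (100, 77), (180, 125), (260, 173)]:
    assert abs(0.6 * x + 17 - e) < 1e-9
```

Note that the errors within this vertical strip average out to zero; that is a general property of errors defined around the expected value.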

Figure 11.1: Weekly income and consumption expenditures of a population of 60 families. (The solid line is the population regression function; the black dots indicate the expected values E(y | x). Horizontal axis: weekly family income ($); vertical axis: weekly family consumption expenditure ($).)

Because error = y − E(y | x), we can write the values of y as:

y = E(y | x) + error

or

y = mx + b + error

The error captures:
- things (other than x) that are associated with y without our knowing them or being able to measure them;
- measurement errors in y;
- the intrinsic random nature of behavior.

One assumption of the regression model is that the error terms within a vertical strip of the scatter plot have a probability distribution that is independent from the value of x: if we plot the error terms against x, the resulting error plot should show no pattern.

11.3 Sample regression function

Now suppose that a researcher doesn't know the population that consists of the sixty cases in table 11.1. To estimate the (unknown) population regression function she draws a random sample (sample A) of 10 families from the population and records income and consumption expenditures for the families in the sample:

case   weekly family income (x)   weekly family consumption expenditures (y)
3      $80
(the sample values are in the data set on the course web site)

Verify using the LinReg function of the TI-84 that the regression line for this sample is:

predicted value of y = x

This equation is called the sample regression function (SRF) for sample A. The sample regression function for sample A is plotted as a dashed line in figure 11.2. There are many sample regression functions: a different sample of 10 families would have given a different sample regression function. For instance, if another researcher had drawn households 9, 16, 21, 23, 25, 39, 40, 43, 47, and 57 (sample B), the sample regression function would have been:

predicted value of y = x

(verify this using the LinReg function of the TI-84). The estimated slope varies from one sample to another:

sample   slope of SRF
A
B

In repeated samples, the slope of the sample regression function is a chance variable (and so is the intercept). The slope of the sample regression function has a probability distribution (the sampling distribution of the slope of the sample regression function). The expectation of the slope of the sample regression

function is the typical value around which the slope of the sample regression function varies in repeated samples (take a look at the sample estimates for the slope: do you have a hunch what the expectation is?). It can be shown that the expectation of the slope of the sample regression function (SRF) is the slope of the population regression function (PRF):

E(slope of the SRF) = slope of the PRF

That is, the slope of the sample regression function is an unbiased estimator of the slope of the population regression function. This is only the case if the independent variable (x) is not a chance variable (proof omitted).

Figure 11.2: Weekly income and consumption expenditures of a population of 60 families. Note. The solid line is the population regression function, the dashed line is the sample regression function for sample A. (Horizontal axis: weekly family income ($); vertical axis: weekly family consumption expenditure ($).)

The chance error for the slope is defined as:

chance error = slope of SRF − E(slope of SRF)

Now we can compute the chance errors that were made by each of the two researchers (of course, the researchers themselves can't compute the chance

error they made because they don't know the population regression function). For sample A, the chance error of the slope is the slope of the SRF for sample A minus the slope of the PRF; for sample B, it is the slope of the SRF for sample B minus the slope of the PRF. Here are the chance errors for samples A and B, and some other random samples:

sample   slope of SRF   chance error (without sign)
A                       (0.0186)
B                       (0.0183)
C                       (0.0287)
D                       (0.0246)
E                       (0.0963)
typical value:  expectation   standard error

The standard error (SE) of the slope of the sample regression function is the typical size of the chance error (after you omit the minus signs, as shown in the last column). The formula for the SE of the slope of the sample regression function uses information about the population. Unlike in the numerical example above, in practice we don't know the population. So how can we find the SE of the slope of the sample regression function? The answer is that, just as when we estimated the SE for a sample average, we will use the bootstrap and the sample data to find an estimate for the SE of the slope of the sample regression function. The formula is complicated and I won't report it here, but statistical software will compute an estimate of the SE based on the sample data. As in the case of the SE for an average, the SE for the slope of the sample regression function gets smaller as the sample size gets bigger: a bigger random sample tends to give a more precise estimate for the slope coefficient. The same arguments apply to the intercept and, in multiple regression, to the slope coefficients of the other independent variables.

Samples and populations

Suppose you have a data set covering all 50 states of the U.S. Some would argue that such a data set covers a population (all the states of the U.S.), not a sample. Clearly the states were not randomly selected.
Think however of the y values (one for each state) as generated by the following random process:

y = m1 x + b + error = E(y | x) + error

The first part (m1 x + b) is deterministic (determined by the population regression function). The second part (the error term) is random: in terms of a box model, the error term is obtained by randomly drawing a ticket from a box of tickets; each ticket contains a value for the error term. Consider the data to be the result of a natural experiment. As events unfold, Nature runs the experiment by drawing an error term from the box whenever x takes a certain value. So a set of observations of x and the corresponding y can be considered as a random sample, even when the observations cover all the possible subjects (such as all 50 states of the US): the chance is in the error terms, not in the cases.
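The repeated-sampling story of section 11.3 can be simulated. The sketch below (an added illustration; the population is hypothetical, built from the chapter's line E(y | x) = 0.6x + 17 and a five-ticket error box) draws many samples, computes the slope of each sample regression function, and looks at how the slopes spread around the population slope:

```python
import random
import statistics

random.seed(1)                           # make the simulation reproducible
error_box = [-10, -5, 0, 5, 10]          # tickets for the error term
incomes = list(range(80, 261, 20))       # x values $80, $100, ..., $260

def draw_sample():
    """One researcher's random sample of 10 (x, y) pairs."""
    xs = [random.choice(incomes) for _ in range(10)]
    ys = [0.6 * x + 17 + random.choice(error_box) for x in xs]
    return xs, ys

def slope(xs, ys):
    """Least-squares slope for one sample regression function."""
    ax, ay = statistics.mean(xs), statistics.mean(ys)
    return (sum((x - ax) * (y - ay) for x, y in zip(xs, ys))
            / sum((x - ax) ** 2 for x in xs))

# Many researchers, each with their own sample regression function:
slopes = [slope(*draw_sample()) for _ in range(2000)]

# The slopes vary around the population slope 0.6;
# the SD of the simulated slopes approximates the SE for the slope.
print(round(statistics.mean(slopes), 2))   # close to 0.6
print(round(statistics.stdev(slopes), 3))  # the simulated SE (roughly 0.04)
```

This is the sampling distribution of the slope made tangible: the average of the simulated slopes sits at the population slope (unbiasedness), and their spread is the standard error.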

86 82 CHAPTER 11. HYPOTHESIS TESTS FOR REGRESSION 11.4 Example: child mortality In the previous chapter, we used data for 135 countries to estimate a sample regression function relating child mortality (y) to income per capita (x 1 ) and the literacy rate of young women (x 2 ). The computer output was: Call: lm(formula = Mortality.rate.Under.5 ~ GNI.per.capita.PPP + Literacy.rate.youth.female, data = Dataset) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 2.131e e < 2e-16 *** GNI.per.capita.PPP e e *** Literacy.rate.youth.female e e < 2e-16 *** --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 132 degrees of freedom (79 observations deleted due to missingness) Multiple R-squared: ,Adjusted R-squared: F-statistic: on 2 and 132 DF, p-value: < 2.2e-16 The software reports the coefficient estimates in a table. The first column gives the name of the variable, the second the estimated regression coefficient for that variable, and the third column gives the standard error for the coefficient. The standard error is estimated using the bootstrap, so the reported standard errors are only reliable if the sample is sufficiently large. From the computer output we see that the estimated intercept is 213.1, with an SE of 12.24; the estimated slope coefficient of income per capita is , with an SE of ; and the the estimated slope coefficient of the literacy rate of young women is , with an SE of The convention is to report the sample regression equation with the standard errors in brackets on the next line, like this: predicted child mortality = income literacy (SEs:) (12.2) ( ) (0.146) In a paper, you would after you reported the equation above interpret the meaning of the coefficients (see p. 
69): the slope coefficient of income per capita shows that a $1000 increase in income per capita is associated with a decrease of 0.77 in the predicted child mortality rate (the number of children per 1000 who die before their fifth birthday); and the slope coefficient of literacy shows that a 1 percentage point increase in the literacy rate of young women is associated with a drop in the predicted child mortality rate. You would also report and interpret the coefficient of determination: the regression equation (for the 135 countries in the data set) explains about 66% of the variation of child mortality around its mean.
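The output above was produced with R's lm() function. A sketch of the call, assuming the data set is loaded in R Commander under the name Dataset with the variable names shown in the output:

```r
# Fit the multiple regression of child mortality on income per capita
# and the literacy rate of young women (variable names as in the output).
fit <- lm(Mortality.rate.Under.5 ~ GNI.per.capita.PPP + Literacy.rate.youth.female,
          data = Dataset)
summary(fit)  # coefficient table with estimates, SEs, t values, and P-values
```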

11.5 Confidence interval for a regression coefficient

It can be shown that, if the error terms in the population regression function follow the normal curve, the sampling distribution of the coefficients of the sample regression function also follows the normal curve. Let us also assume that the error terms are homoscedastic, that is, that their spread is the same in each vertical strip. With the estimate and the standard error, we can now compute a 95%-confidence interval for a population regression coefficient using the familiar formula:

coefficient of SRF ± 2 × (SE for coefficient of SRF)

A 95%-confidence interval for the population regression coefficient of income per capita is thus the estimated coefficient plus or minus two standard errors. So one can be 95% confident that the population regression coefficient of income per capita lies in that interval. The interpretation is like before: if 100 researchers each would take a random sample and compute a 95%-confidence interval, about 95 of the confidence intervals would cover the population regression coefficient; the other five wouldn't. This formula works for large samples. For a small sample, you should use a number larger than 2 in the formula above, and the confidence interval will be wider.

11.6 Hypothesis test for a regression coefficient

With the estimate and the standard error, you can also perform hypothesis tests. The test statistic is:

test statistic = (estimator − hypothetical value) / (SE for estimator)

Suppose you want to test the hypothesis that the population regression coefficient of income per capita is equal to some hypothetical value, against the two-sided alternative that the coefficient differs from that value. Plug the estimated coefficient, the hypothetical value, and the SE into the formula above to obtain the test statistic. If the errors of the population regression function follow the normal curve, the test statistic follows the normal curve.
The P-value then is the area under the normal curve to the left of minus the test statistic and to the right of the test statistic. This area is equal to about 89% (verify using the normalcdf function of the TI-84). As the P-value is large, we do not reject the null hypothesis.

Suppose we want to test the null hypothesis that the population regression coefficient of income per capita is equal to 0, against the two-sided alternative hypothesis that the population regression coefficient differs from 0. The test statistic is:

test statistic = (estimated coefficient − 0) / (SE for coefficient)

If the error terms follow the normal curve, so does the test statistic. The P-value then is the area under the normal curve to the left of −3.567 and to the right of 3.567. This area is equal to about 0.0004 (verify using the normalcdf function of the TI-84), or about 0.04%. Because the P-value is small, we reject the null hypothesis: the sample evidence supports the alternative hypothesis that the population regression coefficient of income per capita differs from 0. The coefficient is said to be statistically significant (which is short for statistically significantly different from zero).

Note that the value of the test statistic (−3.567) is shown in the t value column of the table in the R output. The P-value we found is (approximately) the value shown in the Pr(>|t|) column of the R output. The statistical software uses the Student curve to find the P-value; that's why the computer output reports the test statistic as t value rather than as z value. The degrees of freedom are equal to the sample size minus the number of coefficients (including the intercept), in this case: 135 − 3 = 132 (the degrees of freedom are reported in the computer output above). The area under the Student curve with 132 degrees of freedom to the left of −3.567 and to the right of 3.567 is about 0.0005 (or about 0.05%), as is reported in the computer output. If the sample is large, there is little difference between a t test and a z test, and it is OK to use the normal curve to find the P-value.

The codes next to the Pr(>|t|) column are a quick guide to the size of the P-value. The legend below the coefficients table gives the meaning of the symbols:

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Three asterisks (***) means the P-value is between 0 and 0.001 (0.1%); two asterisks (**) means that the P-value is between 0.001 (0.1%) and 0.01 (1%); one asterisk (*) means that the P-value is between 0.01 (1%) and 0.05 (5%); a dot (.)
means that the P-value is between 0.05 (5%) and 0.1 (10%); nothing means that the P-value is between 0.1 (10%) and 1 (100%). So if there is at least one asterisk (*), you can reject the null hypothesis that the coefficient is equal to zero at the 5% significance level.

Remember the following: the t value column gives the test statistic for the test of the hypothesis that the population regression coefficient is equal to 0; the Pr(>|t|) column gives the P-value for a two-sided test of the hypothesis that the population regression coefficient is equal to 0. If the P-value is sufficiently small, reject the null hypothesis. One or more asterisks means that you can reject the null hypothesis at the conventional significance levels.

Note that statistically significant is not the same as substantive (review Freedman et al. (2007, pp. )): a coefficient can be statistically significantly different from zero, but at the same time be so small that it is of little substantive importance. Suppose you run a regression relating total sales to advertising spending. You find that a $1000 increase in advertising spending is associated with an increase in predicted total sales of $1, and that the coefficient of advertising spending is statistically significant. From the business

context, it is clear that the effect is not substantive, even though it is statistically significant. Conversely, a coefficient can be statistically insignificant but substantive. Suppose that a rehydration set (good for a week-long treatment) costs $5. You find that a drop in the price of a rehydration set by $1 is associated with a drop in the predicted child mortality rate by 10 (per 1000 children under five years old), but that the coefficient is not statistically significant at the 5% level. Should you dismiss the relationship between the cost of a rehydration set and child mortality? Probably not, as the effect you found is substantive: in the sample, a modest drop in the price of a rehydration set is associated with a substantial drop in child mortality. Even though the coefficient was statistically insignificant, it is probably worth paying attention to the price of rehydration sets.

To avoid confusion, use the term statistically significant (rather than significant) when you talk about statistical significance; use the term substantive when you talk about the size of the coefficient. Statistics can tell you whether a coefficient is statistically significant or not, but not whether the size of a coefficient is substantive; to know whether a coefficient is substantive, you should use your judgement in the context of the problem.

11.7 Assumptions of the regression model

In the last two chapters we made a number of assumptions that were needed to make regression work. It is useful to summarize the assumptions (Kennedy, 2003):

1. the dependent variable is a linear function of the independent variable(s), plus an error term;

2. the expectation of the error term is zero (if that is not the case, the estimate of the intercept is biased);

3.
the observations on the independent variable are not random: they can be considered fixed in repeated samples (if that is not the case, the coefficient estimates are biased);

4. the error terms have the same standard error (are homoscedastic) and are not correlated with each other (if that is not the case, the estimates for the SEs may be far off and hence inference is no longer valid; the computed coefficient of determination may also be misleading; but the estimators are still unbiased);

5. the distribution of the error terms follows the normal curve. This assumption is needed to do inference (make confidence intervals, do hypothesis tests); but even if the error terms don't follow the normal curve, the estimators are still unbiased.

A final warning concerns time series data. Time series data are values measured at recurring points in time. For instance, annual data from the national income accounts on GDP and its components (consumption, investment, government purchases, and net exports) are time series. Time series data usually

are denoted with a time index (yt, xt). A time series of n observations of the variable yt is a list that looks like this:

{y1, y2, y3, ..., yt, ..., yn}

where y1 is the value of y (say, consumption) observed in the first period (say, the year 2000), y2 is the value of y observed in the second period (the year 2001), and so on. It turns out that many time series have a statistical property called non-stationarity. Amongst other things, the presence of a time trend in the data (a tendency for the values to go up or down over time) will make a series non-stationary. To spot a possible time trend, it is a good idea to plot a time series diagram of each time series. A time series diagram is a line diagram with time (t) on the horizontal axis and the time series (yt) on the vertical axis (include figure as example). If the data are non-stationary, the results of regression (and of inference based on regression) may be wrong. Be cautious if your data are time series. Two easy fixes may work: include the time variable t as one of the independent variables in the multiple regression, or use the change in y and the change in x in the regression. Still, if you suspect non-stationarity, consult someone who knows how to deal with it.

11.8 Questions for Review

1. What is a population regression function?

2. How is the error of regression defined? What does it capture?

3. What is a sample regression function?

4. Why does an estimated sample regression function differ from the population regression function?

5. What does it mean that the slope of the sample regression function is an unbiased estimator of the slope of the population regression function?

6. How is the chance error of the slope defined?

7. What does the standard error (SE) of the slope of the sample regression function measure?

8.
Given that in practice we don't know the population, how can we estimate the standard error (SE) of the slope of the sample regression function?

9. What happens to the standard error (SE) of the slope of the sample regression function if (other things equal) the sample gets bigger?

10. How do you compute a 95% confidence interval for the slope of the population regression function? Under which conditions can you apply the formula?

11. How do you interpret a 95% confidence interval for the slope of the population regression function? Give the exact probability interpretation, using the concept of repeated samples.

12. How do you compute the test statistic for a test on a coefficient from a regression?

13. Suppose that you want to test the null hypothesis that a coefficient of the population regression is equal to zero. How do you interpret the P-value for the test?

14. What does the column Estimate in computer regression output report?

15. What does the column Std. Error in computer regression output report?

16. What does the column t value in computer regression output report?

17. What does the column Pr(>|t|) in computer regression output report?

18. What is the meaning of the Residual standard error in computer regression output?

19. What is the meaning of the R-squared in computer regression output?

20. What are the assumptions underlying the multiple regression model used in this chapter?

21. What are time series data? Illustrate using an example.

22. Why should you be careful when using the multiple regression model for time series data?

11.9 Exercises

1. For the example in section 11.1, verify that E(y|x = $180) = $125. Show your work.

2. Find a 95%-confidence interval for the population regression coefficient of the literacy rate in the child mortality regression. Give the probability interpretation of a 95%-confidence interval. Which assumptions did you have to make?

3. For 14 systems analysts, their annual salaries (in $), years of experience, and years of postsecondary education were recorded (Kazmier, 1995, table 15.2, p. 275) (same regression as in the exercise of the previous chapter). Below is the computer output for the multiple regression of the annual salaries on the years of experience and the years of postsecondary education:

lm(formula = annual.salary ~ years.of.experience + years.of.postsecondary.education, data = Dataset)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                  e-09 ***
years.of.experience                                                **
years.of.postsecondary.education                                    *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2189 on 11 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 11 DF, p-value:

(a) Download the data (Kazmier-1995-table-15-2.csv) from the course web site and run the regression in R with R Commander. You should get the same output as shown above.

(b) Report the estimated regression equation (with the SEs), like you would in a paper.

(c) Explain the meaning of the SE for the coefficient of years of experience.

(d) Find a 95% confidence interval for each of the three population regression coefficients. Make explicit which assumptions you made. (Ignore the fact that the sample is small.)

(e) Explain the exact probability meaning of the 95% confidence interval for the coefficient of years of experience.

(f) Test the null hypothesis that the population regression coefficient of years of experience is equal to $1000/year. Make explicit which assumptions you made. (Ignore the fact that the sample is small.)

(g) What do the asterisks (*) in the right column of the coefficients table mean? Test the null hypothesis that the intercept of the population regression function is equal to 0. Test the null hypothesis that the population regression coefficient of years of experience is equal to 0. Test the null hypothesis that the population regression coefficient of years of postsecondary education is equal to 0. Make explicit which assumptions you made.
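For parts (d) and (f), you can pull the estimates and SEs out of the fitted model instead of retyping them from the output. A sketch, assuming the fitted model from part (a) is stored as fit, and ignoring the small-sample correction:

```r
tab <- coef(summary(fit))   # matrix with columns Estimate, Std. Error,
                            # t value, Pr(>|t|)
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)  # approximate 95% CIs (part d)
# part (f): test H0 that the coefficient of years of experience is $1000/year
(est["years.of.experience"] - 1000) / se["years.of.experience"]
```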

4. A researcher collected the prices (in $) of 30 randomly selected single-family houses, together with the living area (in square feet) and the lot size (in square feet) of each house (Kazmier, 1995, table 15.3, p. 290). Here's the computer output for the descriptive statistics:

                  mean sd n
Living.area.sq.ft
Lot.size.sq.ft
Price.USD

This is the computer output for the multiple regression of the price on living area and lot size:

lm(formula = Price.USD ~ Living.area.sq.ft + Lot.size.sq.ft, data = House.prices)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                                          *
Living.area.sq.ft
Lot.size.sq.ft
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9087 on 27 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 27 DF, p-value: 2.407e-16

(a) Report the estimated regression equation (with the SEs), like you would in a paper.

(b) Interpret the estimated intercept. Should we give much weight to this interpretation? Why (not)?

(c) What are the units of measurement of the slope coefficients? Interpret the estimated slope coefficients.

(d) Explain the meaning of the SE for the coefficient of living area.

(e) Find a 95% confidence interval for each of the three population regression coefficients. Make explicit which assumptions you made.

(f) Explain the exact probability meaning of the 95% confidence interval for the coefficient of living area.

(g) Test the null hypothesis that the population regression coefficient of living area is equal to zero. Use the normal curve to find the p-value. Make explicit which assumptions you made. Complete the columns t value and Pr(>|t|) (which I omitted for the coefficient of Living.area.sq.ft). How many asterisks should be in the last column? Explain.

(h) Interpret the r.m.s. error of regression (Residual standard error in the computer output).

(i) Interpret the R² (R-squared in the computer output).

(j) Download the data (Kazmier-1995-table-15-3.csv) from the course web site and run the regression with R Commander. You should get the same output as shown above.
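If you work in R rather than on the TI-84, the areas under the normal and Student curves used in these exercises can be found with pnorm and pt. A sketch, using the test statistic −3.567 from the child mortality example as illustration:

```r
z <- -3.567                 # test statistic from the child mortality example
2 * pnorm(-abs(z))          # two-sided P-value from the normal curve (z test)
2 * pt(-abs(z), df = 132)   # two-sided P-value from the Student curve, 132 df
```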

Chapter 12

The Chi-Square test

Read Freedman et al. (2007, Ch. 28). Skip the explanation of how to use χ²-tables (starting on p. 527 with "In principle, there is one table..." and ending with the sketch at the top of p. 528); statisticians use a statistical calculator or statistical software to find areas under the χ²-curve. Also skip section 3 ("How Fisher used the χ²-test").

Questions for Review

1. When should the χ²-test be used, as opposed to the z-test?

2. What are the six ingredients of a χ²-test?

12.1 Exercises

Work the following exercises from Freedman et al. (2007), chapter 28: Set A: 1, 2, 3, 4, 7, 8. Set C: 2. Review exercises: 7.
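In R, areas under the χ²-curve are found with pchisq. A sketch, with made-up values for the χ²-statistic and the degrees of freedom:

```r
chi2 <- 8.2                 # made-up value of the chi-square statistic
df   <- 3                   # made-up degrees of freedom
1 - pchisq(chi2, df)        # P-value: area under the curve to the right of chi2
```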


Bibliography

Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Norton, New York and London, 4th edition.

Garcia, J. and Quintana-Domeque, C. (2007). The evolution of adult height in Europe: A brief note. Economics & Human Biology, 5(2).

Gujarati, D. N. (2003). Basic Econometrics. McGraw-Hill, Boston, 4th edition.

Heston, A., Summers, R., and Aten, B. (2012). Penn World Table Version 7.1. Center for International Comparisons of Production, Income and Prices at the University of Pennsylvania, Philadelphia.

Kazmier, L. J. (1995). Schaum's Outline of Theory and Problems of Business Statistics. Schaum's Outline Series. McGraw-Hill, New York.

Kennedy, P. (2003). A Guide to Econometrics. Blackwell, Malden, MA, 6th edition.

Moore, D. S., McCabe, G. P., and Craig, B. A. (2012). Introduction to the Practice of Statistics. Freeman, New York, 7th edition.

Rosling, H. (2015). No more rich world and poor world. Don't panic: How to end poverty in 15 years. BBC Two (video).

World Bank (2013). World Bank Open Data. Consulted on 21 November 2013.


More information

AP Statistics Solutions to Packet 2

AP Statistics Solutions to Packet 2 AP Statistics Solutions to Packet 2 The Normal Distributions Density Curves and the Normal Distribution Standard Normal Calculations HW #9 1, 2, 4, 6-8 2.1 DENSITY CURVES (a) Sketch a density curve that

More information

Standard Deviation Estimator

Standard Deviation Estimator CSS.com Chapter 905 Standard Deviation Estimator Introduction Even though it is not of primary interest, an estimate of the standard deviation (SD) is needed when calculating the power or sample size of

More information

Normal distribution. ) 2 /2σ. 2π σ

Normal distribution. ) 2 /2σ. 2π σ Normal distribution The normal distribution is the most widely known and used of all distributions. Because the normal distribution approximates many natural phenomena so well, it has developed into a

More information

Week 3&4: Z tables and the Sampling Distribution of X

Week 3&4: Z tables and the Sampling Distribution of X Week 3&4: Z tables and the Sampling Distribution of X 2 / 36 The Standard Normal Distribution, or Z Distribution, is the distribution of a random variable, Z N(0, 1 2 ). The distribution of any other normal

More information

Describing, Exploring, and Comparing Data

Describing, Exploring, and Comparing Data 24 Chapter 2. Describing, Exploring, and Comparing Data Chapter 2. Describing, Exploring, and Comparing Data There are many tools used in Statistics to visualize, summarize, and describe data. This chapter

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. [email protected] www.excelmasterseries.com

More information

GeoGebra Statistics and Probability

GeoGebra Statistics and Probability GeoGebra Statistics and Probability Project Maths Development Team 2013 www.projectmaths.ie Page 1 of 24 Index Activity Topic Page 1 Introduction GeoGebra Statistics 3 2 To calculate the Sum, Mean, Count,

More information

Directions for Frequency Tables, Histograms, and Frequency Bar Charts

Directions for Frequency Tables, Histograms, and Frequency Bar Charts Directions for Frequency Tables, Histograms, and Frequency Bar Charts Frequency Distribution Quantitative Ungrouped Data Dataset: Frequency_Distributions_Graphs-Quantitative.sav 1. Open the dataset containing

More information

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median CONDENSED LESSON 2.1 Box Plots In this lesson you will create and interpret box plots for sets of data use the interquartile range (IQR) to identify potential outliers and graph them on a modified box

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information

Updates to Graphing with Excel

Updates to Graphing with Excel Updates to Graphing with Excel NCC has recently upgraded to a new version of the Microsoft Office suite of programs. As such, many of the directions in the Biology Student Handbook for how to graph with

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r), Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

Normal Probability Distribution

Normal Probability Distribution Normal Probability Distribution The Normal Distribution functions: #1: normalpdf pdf = Probability Density Function This function returns the probability of a single value of the random variable x. Use

More information

Expression. Variable Equation Polynomial Monomial Add. Area. Volume Surface Space Length Width. Probability. Chance Random Likely Possibility Odds

Expression. Variable Equation Polynomial Monomial Add. Area. Volume Surface Space Length Width. Probability. Chance Random Likely Possibility Odds Isosceles Triangle Congruent Leg Side Expression Equation Polynomial Monomial Radical Square Root Check Times Itself Function Relation One Domain Range Area Volume Surface Space Length Width Quantitative

More information

6.3 Conditional Probability and Independence

6.3 Conditional Probability and Independence 222 CHAPTER 6. PROBABILITY 6.3 Conditional Probability and Independence Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted

More information

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research

More information

SPSS Explore procedure

SPSS Explore procedure SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,

More information

$2 4 40 + ( $1) = 40

$2 4 40 + ( $1) = 40 THE EXPECTED VALUE FOR THE SUM OF THE DRAWS In the game of Keno there are 80 balls, numbered 1 through 80. On each play, the casino chooses 20 balls at random without replacement. Suppose you bet on the

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

7.7 Solving Rational Equations

7.7 Solving Rational Equations Section 7.7 Solving Rational Equations 7 7.7 Solving Rational Equations When simplifying comple fractions in the previous section, we saw that multiplying both numerator and denominator by the appropriate

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

0 Introduction to Data Analysis Using an Excel Spreadsheet

0 Introduction to Data Analysis Using an Excel Spreadsheet Experiment 0 Introduction to Data Analysis Using an Excel Spreadsheet I. Purpose The purpose of this introductory lab is to teach you a few basic things about how to use an EXCEL 2010 spreadsheet to do

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

What Does the Normal Distribution Sound Like?

What Does the Normal Distribution Sound Like? What Does the Normal Distribution Sound Like? Ananda Jayawardhana Pittsburg State University [email protected] Published: June 2013 Overview of Lesson In this activity, students conduct an investigation

More information

What Do You Think? for Instructors

What Do You Think? for Instructors Accessing course reports and analysis views What Do You Think? for Instructors Introduction As an instructor, you can use the What Do You Think? Course Evaluation System to see student course evaluation

More information

E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

More information

The Binomial Probability Distribution

The Binomial Probability Distribution The Binomial Probability Distribution MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2015 Objectives After this lesson we will be able to: determine whether a probability

More information

Content Sheet 7-1: Overview of Quality Control for Quantitative Tests

Content Sheet 7-1: Overview of Quality Control for Quantitative Tests Content Sheet 7-1: Overview of Quality Control for Quantitative Tests Role in quality management system Quality Control (QC) is a component of process control, and is a major element of the quality management

More information

Statistics 2014 Scoring Guidelines

Statistics 2014 Scoring Guidelines AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home

More information

Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

Projects Involving Statistics (& SPSS)

Projects Involving Statistics (& SPSS) Projects Involving Statistics (& SPSS) Academic Skills Advice Starting a project which involves using statistics can feel confusing as there seems to be many different things you can do (charts, graphs,

More information

Probability Distributions

Probability Distributions CHAPTER 5 Probability Distributions CHAPTER OUTLINE 5.1 Probability Distribution of a Discrete Random Variable 5.2 Mean and Standard Deviation of a Probability Distribution 5.3 The Binomial Distribution

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

1.6 The Order of Operations

1.6 The Order of Operations 1.6 The Order of Operations Contents: Operations Grouping Symbols The Order of Operations Exponents and Negative Numbers Negative Square Roots Square Root of a Negative Number Order of Operations and Negative

More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate 1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Descriptive statistics consist of methods for organizing and summarizing data. It includes the construction of graphs, charts and tables, as well various descriptive measures such

More information

AP STATISTICS REVIEW (YMS Chapters 1-8)

AP STATISTICS REVIEW (YMS Chapters 1-8) AP STATISTICS REVIEW (YMS Chapters 1-8) Exploring Data (Chapter 1) Categorical Data nominal scale, names e.g. male/female or eye color or breeds of dogs Quantitative Data rational scale (can +,,, with

More information

MATH 140 Lab 4: Probability and the Standard Normal Distribution

MATH 140 Lab 4: Probability and the Standard Normal Distribution MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes

More information

Introduction; Descriptive & Univariate Statistics

Introduction; Descriptive & Univariate Statistics Introduction; Descriptive & Univariate Statistics I. KEY COCEPTS A. Population. Definitions:. The entire set of members in a group. EXAMPLES: All U.S. citizens; all otre Dame Students. 2. All values of

More information

Northumberland Knowledge

Northumberland Knowledge Northumberland Knowledge Know Guide How to Analyse Data - November 2012 - This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about

More information

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 [email protected] 1. Descriptive Statistics Statistics

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Chapter 2: Frequency Distributions and Graphs

Chapter 2: Frequency Distributions and Graphs Chapter 2: Frequency Distributions and Graphs Learning Objectives Upon completion of Chapter 2, you will be able to: Organize the data into a table or chart (called a frequency distribution) Construct

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Chapter 1 Review 1. As part of survey of college students a researcher is interested in the variable class standing. She records a 1 if the student is a freshman, a 2 if the student

More information

Definition 8.1 Two inequalities are equivalent if they have the same solution set. Add or Subtract the same value on both sides of the inequality.

Definition 8.1 Two inequalities are equivalent if they have the same solution set. Add or Subtract the same value on both sides of the inequality. 8 Inequalities Concepts: Equivalent Inequalities Linear and Nonlinear Inequalities Absolute Value Inequalities (Sections 4.6 and 1.1) 8.1 Equivalent Inequalities Definition 8.1 Two inequalities are equivalent

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

Measures of Central Tendency and Variability: Summarizing your Data for Others

Measures of Central Tendency and Variability: Summarizing your Data for Others Measures of Central Tendency and Variability: Summarizing your Data for Others 1 I. Measures of Central Tendency: -Allow us to summarize an entire data set with a single value (the midpoint). 1. Mode :

More information

DATA INTERPRETATION AND STATISTICS

DATA INTERPRETATION AND STATISTICS PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

More information

Using SPSS, Chapter 2: Descriptive Statistics

Using SPSS, Chapter 2: Descriptive Statistics 1 Using SPSS, Chapter 2: Descriptive Statistics Chapters 2.1 & 2.2 Descriptive Statistics 2 Mean, Standard Deviation, Variance, Range, Minimum, Maximum 2 Mean, Median, Mode, Standard Deviation, Variance,

More information

Frequency Distributions

Frequency Distributions Descriptive Statistics Dr. Tom Pierce Department of Psychology Radford University Descriptive statistics comprise a collection of techniques for better understanding what the people in a group look like

More information

Using Microsoft Word. Working With Objects

Using Microsoft Word. Working With Objects Using Microsoft Word Many Word documents will require elements that were created in programs other than Word, such as the picture to the right. Nontext elements in a document are referred to as Objects

More information