STA201 Intermediate Statistics Lecture Notes. Luc Hens

Transcription

1 STA201 Intermediate Statistics Lecture Notes Luc Hens 15 January 2016

2 ii

3 How to use these lecture notes These lecture notes start by reviewing the material from STA101 (most of it covered in Freedman et al. (2007)): descriptive statistics, probability distributions, sampling distributions, and confidence intervals. Following the review, STA201 then covers the following topics: hypothesis tests (Freedman et al., 2007, chapters 26 and 29); the t-test for small samples (Freedman et al., 2007, chapter 26, section 6); hypothesis tests on two averages (Freedman et al., 2007, chapter 27), and the Chi-square test (Freedman et al., 2007, chapter 28). STA201 then covers correlation and simple linear regression (Freedman et al., 2007, chapters 10, 11, 12). Two related subjects (multiple regression and inference for regression) that are not covered in Freedman et al. (2007) are covered in-depth in the lecture notes. Each chapter from the lecture notes ends with Questions for Review; be prepared to answer these questions in class. Work the problems at the end of the chapters in the lecture notes. The key concepts are set in boldface; you should know their definitions. You can find the lectures notes and other material on the course web site: We ll use the open-source statistical environment R with the graphical user interface R Commander. The course web page contains a document (Getting started in STA101 ) that explains how to install R and R Commander on your computer, as well as R scripts and data sets used in the course. Thanks to a web interface (Rweb) you can also run R from any computer or mobile device (tablet or smartphone) with a web browser, without having R installed. Make sure you are connected to the internet. In your web browser, open a new a new tab. Point your browser to Rweb: Remove everything from the window at the top (data(meaudret) etc.). Type R code (or paste an R script) in the window. Click the Submit button. Wait until Results from Rweb appears. If the script generates a graph, scroll down to see the graph. Practice is important to learn statistics. Students who wish to work additional exercises can find hundreds of solved exercises in Kazmier (1995) (or a more recent edition). Moore et al. (2012) covers the same ground as Freedman et al. (2007) and has many exercises; the solutions to the odd-numbered exercises are in the back of the book. Older but still useful editions of both books are available in the VUB library. iii

4 iv HOW TO USE THESE LECTURE NOTES Remember the following calculation rules: Always carry the units of measurement in the calculations. For instance, when you have two measurements in dollars ($ 2 and $ 3) and you compute their average, write: $ 2 + $ 3 = $ To express a fraction (say 2/5) as a percentage, multiply by 100% (not by 100): 2 100% = 40% 5 The same holds for expressing decimals (say, 0.40) as a percentage: % = 40% (STA201 was for a while taught as STA301 Methods: Statistics for Business and Economics.)

5 Chapter 1 Descriptive statistics 1.1 Basic concepts of statistics Suppose you want to find out which percentage of employees in a given company has a private pension plan. The population is the set of cases about which you want to find things out. In this case, the population consists of all employees in the given company; each employee is a case. A variable is a characteristic of each case in the population. In this case you are interested in the variable private pension plan. It can take two values: yes or no (it s a qualitative variable). The percentage of employees who have a private pension plan is a parameter: a numerical characteristic of the population. The monthly salary of the employees is a quantitative variable. The average monthly salary of all employees in the company is another parameter. We ll be mainly concerned with these two types of parameters: percentages (of qualitative variables) and averages (of quantitative variables). If you conduct a survey and every employee in the company fills out the survey form, the collected data set covers all of the population, and you can find the exact value of the population parameter. In some cases collecting data for the population may not be possible; you may have to rely on a sample drawn from the population. A sample is a subset of the population. The sample percentage (which percentage of employees in the sample has a private pension plan) is called a statistic. Statistical inference is when you use a sample to draw conclusions about the population it was drawn from. We ll see that when the sample is a simple random sample, the sample percentage (the statistic) is a good estimate of the population percentage (the parameter). Much of statistical inference deals with quantifying the degree of uncertainty that is the result of generalizing from sample evidence. First we will deal with descriptive statistics: ways to summarize data (from a population or a sample) in a table, a graph, or with numbers. 1.2 Summarizing data by a frequency table How can we summarize information about a quantitative variable of a sample or a population, often consisting of thousands of measurements? 1

6 2 CHAPTER 1. DESCRIPTIVE STATISTICS When a particular stock is traded unusually frequently on a given day, usually this indicates something is going on. Table 1.1 shows the number of traded Apple shares for each of the first fifty trading days of A glance at the data reveals that the trade volumes differ considerable from day to day. Table 1.1: Volumes of Apple stock traded on NASDAQ on the first 50 trading days of Source: nasdaq.com Date Volume Date Volume (yyyy/mm/dd) (yyyy/mm/dd) 2013/03/ /02/ /03/ /02/ /03/ /02/ /03/ /02/ /03/ /01/ /03/ /01/ /03/ /01/ /03/ /01/ /03/ /01/ /03/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/ /02/ /01/

7 1.2. SUMMARIZING DATA BY A FREQUENCY TABLE 3 How can we get a better idea of the typical daily volumes and the spread around the typical volumes? A good start is to rank the values from low to high: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , (In R Commander, type the sort() function in the script window. The name of the variable should be between the brackets.) The values vary from 10.8 to 52.1 million shares per day. The middle value in the ordered list is called the median. Because we have an even number of values (50), there are two middle values: the values at position 25 ( ) and 26 ( ). In that case, the convention is to take the average of the two middle values as the median: median = = The median gives an idea of the central tendency of the data distribution: half of the days the value (the volume of traded shares) was less than the median, and the other half the value was more than the median. We can summarize the ordered list in a frequency table. First, define class intervals that don t overlap and cover all data. You don t want too few class intervals (because that would leave out too much information), nor too many (because that wouldn t summarize the information from the data). You also want the class intervals to have boundaries that are easy, rounded numbers. The class intervals don t have to be of the same width. Let us define the first class interval as to ( included, not included), the second as to , and so on, until to A frequency table has three columns: class interval, absolute frequency, and relative frequency (table 1.2). The absolute frequency (or count) is how many values fall in each class interval. The first class interval ( to ) contains 13 values (verify!): the absolute frequency for this interval is 13. Find the absolute frequencies for the other class intervals. The relative frequency expresses the number of values in a class interval (the absolute frequency) as a percentage of the total number of values in the data set. For the first class interval ( to ) the relative frequency is: % = 26% 50 Verify the relative frequencies for the other class intervals. Show your work. The absolute frequencies add up to the number of values in the data set, and the relative frequencies (before rounding) add up to 100%. If that is not the case, you made a mistake.

8 4 CHAPTER 1. DESCRIPTIVE STATISTICS Table 1.2: Frequency table of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of Note. Class intervals include left boundaries and don t include right boundaries. Volume Absolute Relative (shares per day) frequency frequency (%) 10 to 15 million to 20 million to 25 million to 30 million to 35 million to 40 million to 45 million to 50 million to 55 million 1 2 Sum: Summarizing data by a density histogram The frequency table gives you a pretty good idea of what the most common values are, and how the values differ. One way to graph the information from a frequency table is to plot the values of the variable (in this case: the daily volumes) on the vertical axis, and the absolute or relative frequency on the vertical axis. The heights of the bars represent the absolute or relative frequencies. The areas of the bars don t have a meaning. Such a bar chart is called a frequency histogram. For reasons that will soon be clear, it is more interesting to plot a frequency table in a bar chart where the areas of the bars represent the relative frequencies. Such a bar chart is called a density histogram. The height of each bar in a density histogram represents the density of the data in the class interval. To construct a density histogram, we have to find the height for each bar. How do we compute the height? Remember that the area of a rectangle (such as the bars in the density histogram) is given by width times height: area = width height The area of the bar is the relative frequency, the width of the bar is the width of the class interval, and the height of the bar is the density. Hence: relative frequency = width of the interval density Divide both sides by the width of the interval, to obtain: density = relative frequency (%) width of the interval This formula is on the formula sheet. For the class interval from 10 million to 15 million shares the relative frequency was 26% (table 1.2). Hence the density for this interval is: density = 26% 15 million shares 10 million shares

9 1.3. SUMMARIZING DATA BY A DENSITY HISTOGRAM 5 26% = 5 million shares = 5.2%/million shares Now that you know the height of the bar over the interval from 10 to 15 million shares (5.2% per million shares), you can draw the bar. The density for the interval from 10 to 15 million shares tells us which percentage of all 50 values falls in each interval of one unit wide on the horizontal axis, assuming that the values in interval from 10 to 15 million shares would be uniformly distributed. In the interval from 10 to 15 million shares, about 5.2% of all values falls between 10 and 11 million shares, about 5.2% of all values falls between 11 and 12 million shares, about 5.2% of all values falls between 12 and 13 million shares, about 5.2% of all values falls between 13 and 14 million shares, and about 5.2% of all values falls between 14 and 15 million shares. It is as if the bar is sliced up in vertical strips of one horizontal unit (here: one million shares) wide. The density measures which percentage of all values falls in such a strip of one unit wide. Note the unit of measurement of density: percent per million shares. More generally, density is expressed in percent per unit on the horizontal axis. Given a data set such as table 1.1, you should be able to construct a frequency table and a density histogram. The first assignment asks you to do exactly that. Figure 1.1 shows the density histogram as generated by R. A script to draw this density histogram in R Commander is posted on the course web page. Suppose you don t have the data set or the frequency table, but just the density histogram (figure 1.1). On which percentage of trading days was the volume of traded Apple shares between 20 and 30 million? Show in the density histogram what represents your answer. On (approximately) which percentage of trading days was the volume of traded Apple shares between 24 and 27 million? Show in the density histogram what represents your answer. We conclude that the area under de histogram between two values represents the percentage of observations that falls between those two values. What is the area under all of the histogram? %. In a density histogram the vertical axis shows the density of the data. The areas of the bars represent percentages. The area under a density histogram over an interval is the percentage of data that fall in that interval. The total area under a density histogram is 100%. (Freedman et al., 2007, p. 41) A density histogram reveals the shape of the data distribution. To assess the shape of the density histogram, locate the median on the horizontal axis and draw a vertical line. Is the histogram symmetric about the median, or is it skewed? Is the histogram skewed to the left (that is, with a long tail to the left) or to the right (with a long tail to the right)? Is the histogram bell-shaped? Watch this two-minute video clip (Rösling, 2015) that uses a histogram to show how the world income distribution has changed over the last two centuries: Although a density histogram is somewhat more complicated than a frequency histogram, a density histogram has several advantages:

10 6 CHAPTER 1. DESCRIPTIVE STATISTICS 6 Density (% per million shares) Daily volume (in millions) Figure 1.1: Density histogram of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of a density histogram allows for intervals with different widths; a bell-shaped density histogram can be approximated by the normal curve (see below); a density histogram has an interpretation that resembles the interpretation of a probability distribution curve (see below). 1.4 Summarizing data by numbers: average We already saw that the median is a measure of the central tendency of the data distribution. Another useful measure of central tendency is the average. The formula to compute the average of a list of measurements is: average = sum of all measurements how many measurements there are

11 1.5. SUMMARIZING DATA BY NUMBERS: STANDARD DEVIATION 7 Here is an example. Suppose you collected the price of the same bottle of wine in five restaurants: 2, 2, 4, 5, 7 The average price is: average = = 20 5 = 4 A disadvantage is that the average is sensitive to outliers (exceptionally low or exceptionally high values). Suppose that the list looked like this: The average of this list is: 2, 2, 4, 5, 22 average = = 35 = The one exceptionally expensive bottle of 22 pulled the average up quite a lot. In cases like this we often prefer to use a different measure of central tendency: the median. To find the median, first rank the values from low to high. Then take the middle value. The median of the list { 2, 2, 4, 5, 22} is 4. The median of the first list { 2, 2, 4, 5, 7} is also 4. As you can see, the outlier doesn t affect the median. When a density histogram is skewed or when there are outliers, the median usually is a better measure of the central tendency. One example is the distribution of families by income (Freedman et al., 2007, figure 4 p. 36). 1.5 Summarizing data by numbers: standard deviation We have seen how to summarize the central tendency of a data set. Another feature we would like capture is the spread (or dispersion) of the data. One way to measure the spread is to look at how much the measurements deviate from the average. Let s go back to the prices of the same bottle of wine in five restaurants: 2, 2, 4, 5, 7 The average price is: average = = 20 = The deviation from the average measures how much a measurement is below ( ) or above (+) the average: The deviations are: deviation = measurement average 2 4 = = = = = + 3

12 8 CHAPTER 1. DESCRIPTIVE STATISTICS To get an idea of the typical deviation, we could take the arithmetic mean of the deviations: ( 2) + ( 2) (+ 1) + (+ 3) 5 = 0 It can be easily proven that whatever the list of measurements the arithmetic mean of the deviations is always equal to 0: the negative deviations exactly cancel out the positive ones. Therefore statisticians use the quadratic mean of the deviations as a measure of the spread; the outcome is called the standard deviation. The standard deviation (SD) is a measure of the typical deviation of the measurements from their mean. It is computed as the quadratic mean (or rootmean-square size) of the deviations from the average. The quadratic mean is usually referred to as the root-mean-square (R-M-S) size. To obtain the standard deviation, find the deviations. The compute the quadratic mean (or root-mean-square size) of the deviations, apply the rootmean-square recipe in reverse order: first square the deviations, then find the (arithmetic) mean of the result, and finally take the (square) root. In our example: 1. Square the deviations: ( 2) 2 = 2 4 ( 2) 2 = 2 4 ( 0) 2 = 2 0 (+ 1) 2 = 2 1 (+ 3) 2 = 2 9 By squaring we get rid of the minus signs. Note that the unit of measurement (here: ) is squared, too. 2. Next find the arithmetic mean (or average) of the results from the previous step: mean = The unit ( ) is still squared ( 2 ). = = Finally take the square root of the result from the previous step: This is the standard deviation. Note that by taking the square root, the units are again: the standard deviation has the same unit as the measurements. In this case, the measurements were in euros, so the standard deviation is also in euros.

13 1.5. SUMMARIZING DATA BY NUMBERS: STANDARD DEVIATION 9 Expressed as a formula, we get: sum of (deviations) 2 SD = number of measurements (The formula is on the formula sheet, so you don t have to learn it by heart.) The formula above is for the standard deviation of a population. For reasons I won t explain, a better formula for the standard deviation of a sample is: SD + = sum of (deviations) 2 sample size sample size sample size 1 that is, you compute the SD with the usual formula (the quadratic mean of the deviations), which is the first factor in the equation above, and then multiply by sample size sample size 1 (you don t have to memorize this formula). Because the second factor is larger than 1, the formula gives a value larger than SD. That s why Freedman et al. (2007) use the notation SD +. For large samples, the difference between SD and SD + is small. In what follows, we ll use the SD formula for both samples and populations, unless stated explicitly otherwise. We ll return to SD + when we discuss small samples. Remember the following rule: few measurements are more than three SDs from the average. 1 This rule holds for histograms of any shape. Measurements that are more than three SDs from the average (exceptionally small or exceptionally large measurements) are called outliers. To identify outliers, compute the standard scores of all measurements. The standard score expresses how many standard deviations a measurement is below ( ) or above ( ) the average: standard score = measurement average standard deviation Converting measurements to standard scores is called standardizing. Let us return to the daily traded volumes of Apple shares (table 1.1). The volumes of Apple shares trade on the first 50 trading days of 2013 have an average of and a standard deviation of On 14 March 2013 only Apple shares were traded. Is that volume exceptionally small? Compute the standard score for : = De standard score of 1.13 means that the volume of shares was 1.13 standard deviations below the average. Because the absolute value of the 1 A more precise statement can be made. It can be proven (Chebychev s Theorem) that at least 8/9 of the measurements fall within 3 SDs of the average, that is, between [average 3 SD, average + 3 SD] Hence at most 1/9 of the measurements fall outside that interval. You don t have to memorize this.

14 10 CHAPTER 1. DESCRIPTIVE STATISTICS standard score (after omitting the minus sign: 1.13) is smaller than 3, we don t consider as an outlier. Standard scores have no units. The following example illustrates this. A list of incomes per person for most countries in the world (the Penn World Table, Heston et al. (2012)) has an average of $ and a standard deviation of $ Income per person in Belgium is $ De standard score for income per person in Belgium is: $ $ $ = $ $ The units in the numerator ($) and denominator ($) cancel each other out, and hence the standard score has no units. That s why Freedman et al. (2007) refer to computing standard scores as converting a measurement to standard units. The standard score of 1.32 means that income per person in Belgium is 1.32 standard deviations above the average of all countries in the list. So is income per person in Belgium an outlier? Shortcut formula for the SD of 0-1 lists. Computing the SD is tedious. To estimate percentages, we ll be dealing with lists that consist of just zeroes and ones (0-1 lists): for instance, we will model an employee with a private pension plan as a 1, and an employee without a private pension plan as a 0. The following shortcut formula simplifies the calculation of the SD of 0-1 lists: the standard deviation of a list that consist of just zeroes and ones can be computed as: SD of 0-1 list = ( fraction of ones in the list ) ( fraction of zeroes in the list (This formula is on the formula sheet, so no need to memorize. Just for your information, I posted a proof on the course home page.) Here is an example. Consider the list {0, 1, 1, 1, 0}. The average is 3/5. The deviations from the average are: { 3/5, 2/5, 2/5, 2/5, 3/5}, or { 0.6, 0.4, 0.4, 0.4, 0.6}. The SD is the root-mean-square size of the deviations: 1. Square the deviations: {0.36, 0.16, 0.16, 0.16, 0.36} 2. Next find the average of the squared deviations: Finally take the square root to obtain the SD: SD = = According to the shortcut rule we can compute the SD as: ( ) ( ) fraction of fraction of ones zeroes = 0.24 ) which yields: = 25 = which indeed yields the same result, with far fewer calculations.

15 1.6. THE NORMAL CURVE The normal curve Many bell-shaped histograms can be approximated by a special curve called the normal curve. The function describing the normal curve is complicated: y = 1 2π e x2 /2 In practice we won t need this equation: it is programmed in all statistical software packages. The equation describes the standard normal curve, which is the only version of the normal curve we ll need. In what follows, I ll refer to the standard normal curve simply as the normal curve. Figure 1.2 illustrates the properties of the standard normal curve: 1. the curve is symmetric about 0; 2. the area under the curve is 100% (or 1); 3. the curve is always above the horizontal axis Density (% per standard unit) Standard units (z) Figure 1.2: The standard normal curve Statisticians use statistical software (on a calculator or a computer) to find areas under the normal curve. On a TI-84, you find the area under the standard normal curve using the normal cumulative density function (normalcdf). The area under the standard normal curve between 1 and 2 is:

16 12 CHAPTER 1. DESCRIPTIVE STATISTICS DISTR normalcdf( 1,2) which yields approximately To express the area as a percentage, multiply by 100%: % = 81.86% The area under the standard normal curve to the right of 1 (that is, between 1 and infinity) is: DISTR normalcdf( 1,10 99 ) The area under the standard normal curve to the left of 2 (that is, between minus infinity and 2) is: DISTR normalcdf( 10 99, 2) For the exams, you have to use the TI-84 to find areas under the normal curve. On the course web page I posted an R script (area-under-normal-curve.r) that computes and plots the area under the normal curve between any two values on the horizontal axis. R Commander has a built-in function to find the area under the normal curve in the left tail or in the right tail: Distributions Continuous distributions Normal distribution Normal probabilities Approximating a density histogram by the normal curve These are scores of 100 job applicants who took a selection test: 74, 82, 70, 84, 54, 60, 79, 62, 72, 66, 72, 79, 73, 73, 84, 59, 53, 65, 62, 81, 76, 67, 72, 89, 70, 72, 71, 78, 98, 58, 68, 89, 70, 62, 71, 56, 68, 68, 76, 63, 63, 71, 82, 63, 98, 76, 74, 71, 52, 80, 80, 66, 69, 67, 70, 81, 62, 63, 76, 57, 89, 60, 87, 80, 75, 71, 87, 59, 69, 65, 66, 67, 62, 87, 58, 58, 60, 54, 74, 83, 48, 77, 79, 60, 84, 86, 68, 64, 83, 65, 77, 79, 68, 75, 77, 72, 47, 77, 68, 67 (the data are posted on the course web page) The average of the test scores is about 70, and the standard deviation is about 10 (verify using R Commander). Figure 1.3 shows the density histogram. The histogram is bell-shaped. In 1870, the Belgian statistician Adolphe Quetelet had the idea to approximate bell-shaped histograms by the normal curve (Freedman et al., 2007, p. 78). The horizontal scale of the histogram differs from that of the standard normal curve: most test scores are between 40 and 100, while most of the standard area under the normal curve extends between 3 and +3 on the horizontal axis; and the center of the density histogram is about 70, while the center of the standard normal curve is 0. If we standardize the values, we get what we want. To obtain the standard scores, do: standard score = measurement average standard deviation For example, to standardize the first test score (74; in this case the variable has no units), do: standard score = = The list of standard scores is: 0.4; 1.2; 0.0;... ; 0.3. Verify that you can compute the first couple of standard scores.

17 1.7. APPROXIMATING A DENSITY HISTOGRAM BY THE NORMAL CURVE13 3 Density (% per point) Test score (points) Figure 1.3: Density histogram of 100 test scores Figure 1.4 shows the histogram of the standard scores. If you compare with the histogram of the original test scores (figure 1.3) you notice that the shape of the histogram hasn t changed. Consider the original test scores. Count the number of job applicants who had a test score between 75 and 85: 25 out of the 100 job applicants had a test score between 75 and 85. So 25% of the job applicants had a test score between 75 and 85. In the histogram (figure 1.3), the percentage corresponds to the area under the histogram between 75 and 85. The standard scores of 75 and 85 are: en = +0.5 = +1.5 In the histogram of the standard scores (figure 1.4) the percentage (25%) corresponds to the area under the histogram between +0.5 and The area under the normal curve between +0.5 and +1.5 approximates the area under the histogram between +0.5 and Now carefully look at figure 1.4. The normal approximation overestimates the bar over the interval between +0.5 and +1.0, and underestimates the bar over the interval between +1.0 and The area under the normal curve between +0.5 and +1.5 is approximately: DISTR normalcdf(0.5,1.5) %

18 14 CHAPTER 1. DESCRIPTIVE STATISTICS 30 Density (% per standard unit) Standard units Figure 1.4: Density histogram of 100 test scores, standardized The normal approximation (24.17%) is quite close to the actual percentage (25%). Use your TI-84 to find the areas under the normal curve between 1 and +1. Using the normal approximation, which percentage of measurements will be between ave SD and ave + SD? Repeat for 2 and +2 and 3 and +3. You see that the normal approximation implies the following rule, called the rule. For a bell-shaped histogram: approximately 68% of the measurements are within one SD of the average, that is, between ave SD and ave + SD; approximately 95% of the measurements are within two SDs of the average, that is, between ave 2 SD and ave + 2 SD; approximately 99.7% of the measurements are within three SDs of the average, that is, between ave 3 SD and ave + 3 SD; (The rule is not on the formula sheet; you have to know it by heart.) The normal approximation will turn out to be very useful in statistical inference (drawing conclusions about population parameters on basis of sample evidence).

19 1.8. QUESTIONS FOR REVIEW Questions for Review 1. What is the difference between a qualitative and a quantitative variable? Illustrate using examples where you consider different characteristics of the students in the class. 2. What is the difference between a parameter and a statistic? 3. What does descriptive statistics do? 4. What does statistical inference do? 5. How can you summarize the distribution of a numerical data set in a table? In a graph? 6. In a density histogram, what does the density represent? What are the units of density? Explain for a hypothetical distribution of heights (in centimeter) of people. 7. When would the median be a better measure of the central tendency of a distribution than the mean? Illustrate by giving an example. 8. What does the standard deviation measure? How is the standard deviation computed? 9. What are the properties of the normal curve? 10. What does the standard score measure? How is the standard score computed? 11. What does the % rule say? 1.9 Exercises 1. Download the data file AAPL-HistoricalQuotes.csv from the course web site: and save the data file to your STA201 folder (directory). The data set contains data about Apple stock. Run R Commander and load the data set: Data Import data from textfile, clipboard, or URL.... A window opens. For Location of Data File select Local file system. For Field Separator select Commas. For Decimal-Point Character select Period [.]. Press OK, navigate to the data file AAPL-HistoricalQuotes.csv, abd double-click the file. Your data should now be loaded by R Commander. In the R Commander menu, click the View Data Set button. A new window opens, showing the data set. The variable volume is the variable from table 1.1. Now enter the following line of script in the R script window: h <- hist(dataset$volume/ ,right=false) and press the Submit button. This command will compute the numbers needed to make a histogram and store then in an object called h. Next, type in the R script window:

20 16 CHAPTER 1. DESCRIPTIVE STATISTICS h$breaks and press the Submit button. The output window will display the breaks between the intervals, that is, the boundaries of the intervals used by R when it computes the frequency table. Next, type in the R script window: h$counts and press the Submit button. The output window will display the absolute frequencies (counts) of each interval. Next, type in the R script window: h$density and press the Submit button. The output window will display the densities of each interval. The densities are expressed as decimal fractions per horizontal unit; to get densities expressed as percentages per horizontal unit you have to multiply by 100%. Finally, type in the R script window: h$counts/sum(h$counts) and press the Submit button. The output window will display the relative frequencies for each interval; to get relative frequencies expressed as percentages you have to multiply by 100%. 2. Use the relative frequencies from table 1.2 to compute the densities for the other intervals. Add a column to show the densities. Then draw the density histogram on scale on squared paper. 3. Figure 1.1 shows that the daily traded volumes of Apple shares have a skewed distribution. The average daily volume is shares. Find the median. Show your work. How do mean and median compare? Is that what you expected from the shape of the histogram? Explain. 4. Find the standard deviation of {1, 1, 1, 1, 0} using two methods: the usual formula (root-mean-square size of the deviations) and the shortcut formula for 0-1 lists. Do you get the same result? 5. The daily traded volumes of Apple shares (table 1.1) have an average of and a standard deviation of Is an outlier? And ? Show your work and explain. 6. Use the TI-84 to find the areas under the standard normal curve: (a) to the right of 1.87 (b) to the left of 5.20 (c) between 1 and +1 (d) between 2 and +2 (e) between 3 and +3 Make for every case a sketch, with the relevant area shaded. Verify your answers using the R script. We ll get back to cases (c), (d), and (e) in a moment.

21 1.9. EXERCISES For the 100 given test scores, find which percentage of job applicants scored between 50 and 60. Then use the normal approximation. Is the normal approximation close? 8. For 164 adult Belgian men born in 1962 the average height is centimeter and the SD is 8.2 centimeter (Garcia and Quintana-Domeque, 2007). Suppose that the histogram of the 164 heights follows the normal curve (heights usually do). What is, approximately, the percentage of men in this group with a height of 170 centimeter or less? What is, approximately, the percentage of men in this group with a height of between 170 centimeter and 180 centimeter? 9. Of the volumes of Apple shares traded in the first 50 trading days of 2013 (p. 1.2) the average is and the SD is Find the actual percentage of values between: ave SD and ave + SD; ave 2 SD and ave + 2 SD; ave 3 SD and ave + 3 SD; Does the rule give a good approximation? Why (not)?

22 18 CHAPTER 1. DESCRIPTIVE STATISTICS

23 Chapter 2 Probability distributions 2.1 Chance experiments Examples of chance experiments are: rolling a die and counting the dots; tossing a coin and observing whether you get heads or tails; or randomly drawing a card from a well-shuffled deck of cards and observing which card you get. It is convenient to think of a chance experiment in terms of the following chance model: randomly drawing one or more tickets from a box. For instance, rolling a die is modeled as randomly drawing a ticket from the box: In R: box <- c(1,2,3,4,5,6) sample(box,1) In R: Tossing a coin is like randomly drawing a ticket from the box: box <- c("heads","tails") sample(box,1) heads tails 2.2 Frequency interpretation of probability Consider the following chance experiment. Roll a die and count the dots. If you get an ace (1), write down 1; if you don t get an ace (2, 3, 4, 5, or 6), write down 0. Repeat the experiment many times. After each roll, compute the relative frequency of aces up to that point. Make a graph with the number of tosses on the horizontal axis and the relative frequency on the vertical axis. Figure 2.1 shows the result of repetitions in such an experiment. The frequency of aces tends towards 1/6 ( %, the horizontal dashed line). The frequency interpretation of probability states that the probability of an event is the percentage to which the relative frequency tends if you repeat the chance experiment over and over, independently and under the same conditions (Freedman et al., 2007, p. 222). 19

24 20 CHAPTER 2. PROBABILITY DISTRIBUTIONS Frequency of aces (%) Number of repeats Figure 2.1: Frequency of aces in 10,000 rolls of a die 2.3 Drawing with and without replacement Consider the following box with tickets: The probability to draw an even number is 3/6: P ( 2nd draw is even ) = 3 6 Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you replace the ticket, and again randomly draw a ticket from the box. This is called drawing with replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, is again 3/6. In mathematical notation: P ( 2nd draw is even 1st draw was 2 ) = 3 6 The vertical bar ( ) is shorthand for given that. What comes after the vertical bar ( ) is called the condition. A probability with a condition is called a conditional probability. Note that in this case imposing the condition didn t affect the probability of drawing an even number: whether the first draw was 2 or not doesn t matter

25 2.4. THE SUM OF DRAWS 21 for the second draw, because we replaced the ticket after the first draw. In both cases, the probability of getting an even number was the same (3/6): P ( 2nd draw is even 1st draw was 2 ) = P ( 2nd draw is even ) The two events (getting an even number on the second draw, and getting an even number on the second draw) are said to be independent: the probability of the second event is not affected by how the first event turned out. That is because we were drawing with replacement. When drawing with replacement, the events are independent. Now consider a different chance experiment. Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you don t replace the ticket. The box now looks like this: If we now again randomly draw a ticket from the box, this is called drawing without replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2 now is : P ( 2nd draw is even 1st draw was 2 ) = 2 5 In this case, what happened in the first draw (as expressed by the condition 1st draw was 2 ) does make a difference: the probability of getting an even number differs: P ( 2nd draw is even 1st draw was 2 ) P ( 2nd draw is even ) The two events (getting an even number on the second draw, and getting an even number on the second draw) are said to be dependent: the probability of the second event is affected by how the first event turned out. That is because we were drawing without replacement. When drawing without replacement, the events are dependent. Think of a population as a box with tickets. A random sample is like drawing a number of tickets without replacement from this box. The number of draws is the sample size. Remember this. We ll use this box model when doing statistical inference. 2.4 The sum of draws For the theory of statistical inference, we ll frequently use the concept of the sum of draws. Here s a simple example: roll a die twice, and add the numbers. The chance model has the following box: Draw two tickets with replacement from the box, and add the outcomes. The result is the sum of draws. The sum of draws is a brief way to say the following (Freedman et al., 2007, p. 280):

26 22 CHAPTER 2. PROBABILITY DISTRIBUTIONS Draw tickets from a box. Add the numbers on the tickets. As the following activity makes clear, the sum of draws is itself a random variable: (a) Conduct the chance experiment above using an actual die or the following R script: box <- c(1,2,3,4,5,6) sample(box,1) + sample(box,1) (b) Repeat the experiment a couple of times and write up the outcomes (using an actual die, or in R by running the line sample(box,1)+sample(box,1). Would it be fair to say that the sum of draws is a chance variable? Explain. 2.5 Picking an appropriate chance model We model a population as a box with tickets. Taking a random sample is like randomly drawing a number of tickets from the box, without replacement; the number of draws is the sample size. In order to use such a chance model for inference, we will use some interesting properties of the sum of draws. The trick is to set up the chance model in such a way that the chance variable of interest is the sum of draws, or is computed from the sum of draws. An example clarifies my argument. Suppose you roll a die three times, and want to know what the sum of the outcomes is. What is the appropriate chance model? What is the chance variable? An appropriate chance model is a box with six tickets: and the chance variable is the sum of three random draws with replacement from the box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 3, 2, and 6. The sum of draws ( = 11) is obtained by adding up the outcomes. Now suppose that we are interested in another question: how many times (out of three rolls) will we get a six? First, we need the appropriate chance model. When we roll a die, when can get two kinds of outcomes: either we get a six (we ll label this outcome as a success), or we get another number (1, 2, 3, 4, 5: not a success). The term success is used here in a technical meaning: the outcome we are interested in. Note that we classify the outcomes of a single roll as a success or not a success. In such a case, the appropriate chance model is a box with six tickets: one ticket 1 for the outcome 6 labelled as a success, and five tickets 0 for the outcomes 1, 2, 3, 4, or 5 labelled as not a success:

27 2.6. PROBABILITY DISTRIBUTIONS 23 Now we are interested in the number of sixes in three rolls, so we need to count the sixes. Counting the sixes is the same thing as taking the sum of three draws from the 0-1 box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 0, 0, and 1 (we classified each outcome as a success or not a success). The sum of draws ( = 1) is the number of sixes (the number of successes). A box like this, with tickets that can only take values 0 and 1, is called a 0-1 box. Remember that when the problem is one of classifying and counting, the appropriate box is a 0-1 box. Here s a real-world example. Suppose you are the marketing manager of a telecommunications company that doesn t cover Brussels yet. You would like to find out which percentage of households in Brussels already has a tablet. The population of interest is all households in Brussels. Think of each household in Brussels as a ticket in a box, so there are as many tickets as households. A ticket takes value 1 if the household has a tablet, and 0 if the household doesn t. Taking a random sample of households is like randomly drawing tickets without replacement from this 0-1 box. The number of households in the sample who have a tablet is the sum of draws. The percentage of households in the sample who have a tablet is: sample percentage = 2.6 Probability distributions sum of draws size of the sample 100% Chance experiments can be described using probability distributions. In what follows, we ll focus on the probability distribution of the sum of draws. Suppose you roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box and add the outcomes. The chance variable (the sum of the two draws) can take the following values: {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} (the chance variable is discrete; we won t develop the theory for continuous chance variables). For each of these possible outcomes, we can compute the probability. There are 36 possible combinations: Each of these 36 combinations has the same probability, and as the probabilities have to add up to 1, each combination has a probability of 1/36. By applying the rules or probability, we can find the probability that the sum of draws takes the value 2, and then repeat the work to find the probability that the sum of draws takes the value 3, and so on. There are for instance two combinations that yield a sum of 3:

28 24 CHAPTER 2. PROBABILITY DISTRIBUTIONS when the first draw is 1 and the second draw is 2 (row 1, column 2 in the table above) when the first draw is 2 and the second draw is 1 (row 2, column 1) The probability that the sum of draws is 3 is therefore equal to: P (sum is 3) = P [(first 1, than 2) or (first 2, then 1)] Apply the addition rule (Freedman et al., 2007, pp ) to obtain: P (sum is 3) = P (first 1, than 2) + P (first 2, then 1) something The third term ( minus something ) is equal to zero because the events (first 1, than 2) and (first 2, then 1) are mutually exclusive (two events are mutually exclusive when as one event happens, the other cannot happen at the same time). So we get: P (sum is 3) = P (first 1, than 2) + P (first 2, then 1) 0 = = 2 36 If you do this for all other possible values of the chance variable, you get the following table: outcome probability A table that shows all possible values for a (discrete) chance variable and the corresponding probabilities is called a probability distribution. We can graph the probability distribution as a bar chart. On the horizontal axis we put the chance variable, and we construct the bar chart in such a way that the area of a bar shows the probability (expressed as a percentage), just as in a density histogram the area of a bar showed the relative frequency (expressed as a percentage) of the data over the interval. That is why Freedman et al. (2007, pp ) call such a bar chart a probability histogram. For a discrete chance variable the convention is to center the bars on the values that the variable can take: the bar over 2 will start at 1.5 and end at 2.5; the bar over 3 will start at 2.5 and end at 3.5, and so on. The width of each bar is equal to 1. The height of each bar in a probability distribution is called probability density: the probability per unit on the horizontal axis. We find the probability densities by applying the formula for the area of a rectangle: area = width height We want the area to represent the probability (expressed as a percentage) and the height to represent the probability density (expressed as percent per unit on the horizontal axis), and hence the equation becomes: probability = width of interval on horizontal axis probability density Divide both sides of the equation by (width of interval on horizontal axis) to obtain: probability density = probability width of interval on horizontal axis

29 2.6. PROBABILITY DISTRIBUTIONS 25 Because the width of each interval on horizontal axis is one unit of the horizontal axis, this becomes: probability density = probability per unit on the horizontal axis which gives us the meaning of probability density. For example, the probability to get a 7 is 6/36 (= %). The probability density over the interval from 6.5 to 7.5 then is equal to: probability density = % = % per per unit on the horizontal axis Figure 2.2 shows the corresponding bar chart representing the probability distribution. The curve traced by the bar chart of the probability distribution is called the probability density function. The probability density function has the following properties: the curve is always on or above the horizontal axis, that is, the probability density (on the vertical axis) is always 0 or positive; de area under the curve is equal to 1 (or 100%); the area under the curve between two values on the horizontal axis gives the probability. The probability distribution has an expectation and a standard error. The following example illustrates the intuition of these concepts. Roll a die twice and add the numbers. You can do that with an actual die, or run the following R script: box <- c(1,2,3,4,5,6) sample(box,1) + sample(box,1) Repeat this a couple of times, and write down the outcomes. You will get something like {6, 7, 10, 8, 10,... }. The outcomes are random. The lowest value you can get is 2 (when you roll two aces), and the highest value is 12 (when you roll two sixes). If you repeat the experiment many times you ll notice that those extreme values occur only occasionally; values like 6, 7, or 8 occur much more frequently. The expectation is the typical value that the random variable will take; the value around which the outcomes vary. Another way to think about the expectation is as the center of the probability distribution (figure 2.2). In this case the expectation is 7 (we ll see below how to compute the expectation). Now define the difference between the outcome of a chance experiment and the expectation as the chance error. For instance, our first outcome was 6, the expectation is 7, and hence the chance error was: chance error = outcome expectation = 6 7 = 1 (the negative value of 1 means that the outcome was 1 below the expectation). If we compute the chance errors for the other outcomes, we get: