STA201 Intermediate Statistics Lecture Notes. Luc Hens


STA201 Intermediate Statistics Lecture Notes. Luc Hens. 15 January 2016


How to use these lecture notes

These lecture notes start by reviewing the material from STA101 (most of it covered in Freedman et al. (2007)): descriptive statistics, probability distributions, sampling distributions, and confidence intervals. Following the review, STA201 covers the following topics: hypothesis tests (Freedman et al., 2007, chapters 26 and 29); the t-test for small samples (Freedman et al., 2007, chapter 26, section 6); hypothesis tests on two averages (Freedman et al., 2007, chapter 27); and the chi-square test (Freedman et al., 2007, chapter 28). STA201 then covers correlation and simple linear regression (Freedman et al., 2007, chapters 10, 11, 12). Two related subjects that are not covered in Freedman et al. (2007), multiple regression and inference for regression, are covered in depth in the lecture notes.

Each chapter of the lecture notes ends with Questions for Review; be prepared to answer these questions in class. Work the problems at the end of the chapters in the lecture notes. The key concepts are set in boldface; you should know their definitions. You can find the lecture notes and other material on the course web site.

We'll use the open-source statistical environment R with the graphical user interface R Commander. The course web page contains a document (Getting started in STA101) that explains how to install R and R Commander on your computer, as well as R scripts and data sets used in the course. Thanks to a web interface (Rweb) you can also run R from any computer or mobile device (tablet or smartphone) with a web browser, without having R installed:

1. Make sure you are connected to the internet.
2. In your web browser, open a new tab.
3. Point your browser to Rweb.
4. Remove everything from the window at the top (data(meaudret) etc.).
5. Type R code (or paste an R script) in the window.
6. Click the Submit button.
7. Wait until Results from Rweb appears. If the script generates a graph, scroll down to see the graph.

Practice is important to learn statistics. Students who wish to work additional exercises can find hundreds of solved exercises in Kazmier (1995) (or a more recent edition). Moore et al. (2012) covers the same ground as Freedman et al. (2007) and has many exercises; the solutions to the odd-numbered exercises are in the back of the book. Older but still useful editions of both books are available in the VUB library.

Remember the following calculation rules:

Always carry the units of measurement in the calculations. For instance, when you have two measurements in dollars ($ 2 and $ 3) and you compute their average, write:

\[ \frac{\$\,2 + \$\,3}{2} = \$\,2.50 \]

To express a fraction (say 2/5) as a percentage, multiply by 100% (not by 100):

\[ \frac{2}{5} \times 100\% = 40\% \]

The same holds for expressing decimals (say, 0.40) as a percentage:

\[ 0.40 \times 100\% = 40\% \]

(STA201 was for a while taught as STA301 Methods: Statistics for Business and Economics.)

Chapter 1 Descriptive statistics

1.1 Basic concepts of statistics

Suppose you want to find out which percentage of employees in a given company has a private pension plan. The population is the set of cases about which you want to find things out. In this case, the population consists of all employees in the given company; each employee is a case. A variable is a characteristic of each case in the population. In this case you are interested in the variable private pension plan. It can take two values: yes or no (it's a qualitative variable). The percentage of employees who have a private pension plan is a parameter: a numerical characteristic of the population. The monthly salary of the employees is a quantitative variable. The average monthly salary of all employees in the company is another parameter. We'll be mainly concerned with these two types of parameters: percentages (of qualitative variables) and averages (of quantitative variables).

If you conduct a survey and every employee in the company fills out the survey form, the collected data set covers all of the population, and you can find the exact value of the population parameter. In some cases collecting data for the population may not be possible; you may have to rely on a sample drawn from the population. A sample is a subset of the population. The sample percentage (which percentage of employees in the sample has a private pension plan) is called a statistic. Statistical inference is when you use a sample to draw conclusions about the population it was drawn from. We'll see that when the sample is a simple random sample, the sample percentage (the statistic) is a good estimate of the population percentage (the parameter). Much of statistical inference deals with quantifying the degree of uncertainty that is the result of generalizing from sample evidence.

First we will deal with descriptive statistics: ways to summarize data (from a population or a sample) in a table, a graph, or with numbers.

1.2 Summarizing data by a frequency table

How can we summarize information about a quantitative variable of a sample or a population, often consisting of thousands of measurements?

When a particular stock is traded unusually frequently on a given day, this usually indicates that something is going on. Table 1.1 shows the number of traded Apple shares for each of the first fifty trading days of 2013. A glance at the data reveals that the trade volumes differ considerably from day to day.

Table 1.1: Volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Source: nasdaq.com. (Columns: Date (yyyy/mm/dd) and Volume; the individual entries are illegible in this copy.)

How can we get a better idea of the typical daily volumes and the spread around the typical volumes? A good start is to rank the values from low to high. (In R Commander, type the sort() function in the script window. The name of the variable should be between the brackets.) The values vary from 10.8 to 52.1 million shares per day.

The middle value in the ordered list is called the median. Because we have an even number of values (50), there are two middle values: the values at positions 25 and 26. In that case, the convention is to take the average of the two middle values as the median:

\[ \text{median} = \frac{\text{value at position 25} + \text{value at position 26}}{2} \]

The median gives an idea of the central tendency of the data distribution: on half of the days the value (the volume of traded shares) was less than the median, and on the other half the value was more than the median.

We can summarize the ordered list in a frequency table. First, define class intervals that don't overlap and cover all data. You don't want too few class intervals (because that would leave out too much information), nor too many (because that wouldn't summarize the information from the data). You also want the class intervals to have boundaries that are easy, rounded numbers. The class intervals don't have to be of the same width. Let us define the first class interval as 10 million to 15 million (10 million included, 15 million not included), the second as 15 million to 20 million, and so on, until 50 million to 55 million.

A frequency table has three columns: class interval, absolute frequency, and relative frequency (table 1.2). The absolute frequency (or count) is how many values fall in each class interval. The first class interval (10 million to 15 million) contains 13 values (verify!): the absolute frequency for this interval is 13. Find the absolute frequencies for the other class intervals.

The relative frequency expresses the number of values in a class interval (the absolute frequency) as a percentage of the total number of values in the data set. For the first class interval (10 million to 15 million) the relative frequency is:

\[ \frac{13}{50} \times 100\% = 26\% \]

Verify the relative frequencies for the other class intervals. Show your work. The absolute frequencies add up to the number of values in the data set, and the relative frequencies (before rounding) add up to 100%. If that is not the case, you made a mistake.
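The steps above (sorting, finding the median, counting values per class interval, and converting counts to percentages) can be sketched in R. This is only an illustration: the vector `volume` below holds ten made-up values in millions of shares, not the data from table 1.1, and the object names are assumptions chosen for this example.

```r
# Made-up daily volumes in millions of shares (NOT the data from table 1.1)
volume <- c(10.8, 12.3, 14.9, 15.0, 17.2, 21.6, 24.4, 30.1, 38.7, 52.1)

# Rank the values from low to high and find the median
sorted <- sort(volume)
med <- median(volume)   # average of the two middle values when n is even

# Class intervals of 5 million wide; right = FALSE includes the left
# boundary and excludes the right boundary, as in the text
breaks <- seq(10, 55, by = 5)
classes <- cut(volume, breaks = breaks, right = FALSE)

abs_freq <- table(classes)                    # absolute frequencies (counts)
rel_freq <- 100 * abs_freq / length(volume)   # relative frequencies in %

freq_table <- data.frame(
  interval = names(abs_freq),
  absolute = as.vector(abs_freq),
  relative = as.vector(rel_freq)
)
print(freq_table)

# Checks from the text: counts add up to n, percentages add up to 100
sum(freq_table$absolute)
sum(freq_table$relative)
```

With the real data set loaded in R Commander, the same calls would be applied to the volume variable of that data set instead of the made-up vector.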

Table 1.2: Frequency table of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Note. Class intervals include left boundaries and don't include right boundaries.

    Volume              Absolute     Relative
    (shares per day)    frequency    frequency (%)
    10 to 15 million    13           26
    15 to 20 million    …            …
    20 to 25 million    …            …
    25 to 30 million    …            …
    30 to 35 million    …            …
    35 to 40 million    …            …
    40 to 45 million    …            …
    45 to 50 million    …            …
    50 to 55 million    1            2
    Sum:                50           100

1.3 Summarizing data by a density histogram

The frequency table gives you a pretty good idea of what the most common values are, and how the values differ. One way to graph the information from a frequency table is to plot the values of the variable (in this case: the daily volumes) on the horizontal axis, and the absolute or relative frequency on the vertical axis. The heights of the bars represent the absolute or relative frequencies. The areas of the bars don't have a meaning. Such a bar chart is called a frequency histogram.

For reasons that will soon be clear, it is more interesting to plot a frequency table in a bar chart where the areas of the bars represent the relative frequencies. Such a bar chart is called a density histogram. The height of each bar in a density histogram represents the density of the data in the class interval. To construct a density histogram, we have to find the height for each bar. How do we compute the height? Remember that the area of a rectangle (such as the bars in the density histogram) is given by width times height:

\[ \text{area} = \text{width} \times \text{height} \]

The area of the bar is the relative frequency, the width of the bar is the width of the class interval, and the height of the bar is the density. Hence:

\[ \text{relative frequency} = \text{width of the interval} \times \text{density} \]

Divide both sides by the width of the interval, to obtain:

\[ \text{density} = \frac{\text{relative frequency (\%)}}{\text{width of the interval}} \]

This formula is on the formula sheet. For the class interval from 10 million to 15 million shares the relative frequency was 26% (table 1.2).
Hence the density for this interval is:

\[ \text{density} = \frac{26\%}{15 \text{ million shares} - 10 \text{ million shares}} = \frac{26\%}{5 \text{ million shares}} = 5.2\% \text{ per million shares} \]

Now that you know the height of the bar over the interval from 10 to 15 million shares (5.2% per million shares), you can draw the bar. The density for the interval from 10 to 15 million shares tells us which percentage of all 50 values falls in each interval of one unit wide on the horizontal axis, assuming that the values in the interval from 10 to 15 million shares are uniformly distributed. In the interval from 10 to 15 million shares, about 5.2% of all values falls between 10 and 11 million shares, about 5.2% of all values falls between 11 and 12 million shares, about 5.2% of all values falls between 12 and 13 million shares, about 5.2% of all values falls between 13 and 14 million shares, and about 5.2% of all values falls between 14 and 15 million shares. It is as if the bar is sliced up in vertical strips of one horizontal unit (here: one million shares) wide. The density measures which percentage of all values falls in such a strip of one unit wide. Note the unit of measurement of density: percent per million shares. More generally, density is expressed in percent per unit on the horizontal axis.

Given a data set such as table 1.1, you should be able to construct a frequency table and a density histogram. The first assignment asks you to do exactly that. Figure 1.1 shows the density histogram as generated by R. A script to draw this density histogram in R Commander is posted on the course web page.

Suppose you don't have the data set or the frequency table, but just the density histogram (figure 1.1). On which percentage of trading days was the volume of traded Apple shares between 20 and 30 million? Show in the density histogram what represents your answer. On (approximately) which percentage of trading days was the volume of traded Apple shares between 24 and 27 million? Show in the density histogram what represents your answer.
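The density calculation for the interval from 10 to 15 million shares can be sketched in R as follows. The first part uses only the numbers from the worked example; the second part uses a small made-up vector `x` (an assumption for illustration, not the Apple data) to show that the densities returned by hist() are areas-per-unit in the same sense.

```r
# Worked example from the text
rel_freq <- 26    # relative frequency of the interval, in %
lower <- 10       # left boundary, in million shares
upper <- 15       # right boundary, in million shares

width <- upper - lower
density <- rel_freq / width   # height of the bar, in % per million shares
density

# Going back from height to area: width times density is the relative frequency
width * density

# Same idea with hist() on made-up data (plot = FALSE only computes the numbers)
x <- c(11, 12, 13, 16, 22)
h <- hist(x, breaks = c(10, 15, 20, 25), right = FALSE, plot = FALSE)
dens_pct <- 100 * h$density               # densities in % per horizontal unit
areas_pct <- dens_pct * diff(h$breaks)    # bar areas = relative frequencies in %
sum(areas_pct)                            # total area under the histogram
```

Note that hist() reports densities as decimal fractions per horizontal unit, which is why the sketch multiplies by 100 to obtain percentages.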
We conclude that the area under the histogram between two values represents the percentage of observations that falls between those two values. What is the area under all of the histogram? 100%.

In a density histogram the vertical axis shows the density of the data. The areas of the bars represent percentages. The area under a density histogram over an interval is the percentage of data that fall in that interval. The total area under a density histogram is 100%. (Freedman et al., 2007, p. 41)

A density histogram reveals the shape of the data distribution. To assess the shape of the density histogram, locate the median on the horizontal axis and draw a vertical line. Is the histogram symmetric about the median, or is it skewed? Is the histogram skewed to the left (that is, with a long tail to the left) or to the right (with a long tail to the right)? Is the histogram bell-shaped?

Watch the two-minute video clip (Rösling, 2015) that uses a histogram to show how the world income distribution has changed over the last two centuries.

Although a density histogram is somewhat more complicated than a frequency histogram, a density histogram has several advantages:

a density histogram allows for intervals with different widths;
a bell-shaped density histogram can be approximated by the normal curve (see below);
a density histogram has an interpretation that resembles the interpretation of a probability distribution curve (see below).

Figure 1.1: Density histogram of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. (Horizontal axis: daily volume, in millions; vertical axis: density, in % per million shares.)

1.4 Summarizing data by numbers: average

We already saw that the median is a measure of the central tendency of the data distribution. Another useful measure of central tendency is the average. The formula to compute the average of a list of measurements is:

\[ \text{average} = \frac{\text{sum of all measurements}}{\text{how many measurements there are}} \]

Here is an example. Suppose you collected the price of the same bottle of wine in five restaurants:

€ 2, € 2, € 4, € 5, € 7

The average price is:

\[ \text{average} = \frac{\text{€ 2} + \text{€ 2} + \text{€ 4} + \text{€ 5} + \text{€ 7}}{5} = \frac{\text{€ 20}}{5} = \text{€ 4} \]

A disadvantage is that the average is sensitive to outliers (exceptionally low or exceptionally high values). Suppose that the list looked like this:

€ 2, € 2, € 4, € 5, € 22

The average of this list is:

\[ \text{average} = \frac{\text{€ 2} + \text{€ 2} + \text{€ 4} + \text{€ 5} + \text{€ 22}}{5} = \frac{\text{€ 35}}{5} = \text{€ 7} \]

The one exceptionally expensive bottle of € 22 pulled the average up quite a lot. In cases like this we often prefer to use a different measure of central tendency: the median. To find the median, first rank the values from low to high. Then take the middle value. The median of the list {€ 2, € 2, € 4, € 5, € 22} is € 4. The median of the first list {€ 2, € 2, € 4, € 5, € 7} is also € 4. As you can see, the outlier doesn't affect the median. When a density histogram is skewed or when there are outliers, the median usually is a better measure of the central tendency. One example is the distribution of families by income (Freedman et al., 2007, figure 4 p. 36).

1.5 Summarizing data by numbers: standard deviation

We have seen how to summarize the central tendency of a data set. Another feature we would like to capture is the spread (or dispersion) of the data. One way to measure the spread is to look at how much the measurements deviate from the average. Let's go back to the prices of the same bottle of wine in five restaurants:

€ 2, € 2, € 4, € 5, € 7

The average price is:

\[ \text{average} = \frac{\text{€ 20}}{5} = \text{€ 4} \]

The deviation from the average measures how much a measurement is below (−) or above (+) the average:

\[ \text{deviation} = \text{measurement} - \text{average} \]

The deviations are:

\[ \begin{aligned} \text{€ 2} - \text{€ 4} &= -\text{€ 2} \\ \text{€ 2} - \text{€ 4} &= -\text{€ 2} \\ \text{€ 4} - \text{€ 4} &= \text{€ 0} \\ \text{€ 5} - \text{€ 4} &= +\text{€ 1} \\ \text{€ 7} - \text{€ 4} &= +\text{€ 3} \end{aligned} \]
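The wine-price example can be checked with a few lines of R. The sketch below uses the two lists from the text (prices in euros); the object names are assumptions chosen for this illustration.

```r
# Prices of the same bottle of wine in five restaurants (in euros)
prices <- c(2, 2, 4, 5, 7)
prices_outlier <- c(2, 2, 4, 5, 22)   # same list with one exceptionally high price

# The average reacts strongly to the outlier ...
avg <- mean(prices)
avg_outlier <- mean(prices_outlier)

# ... while the median does not
med <- median(prices)
med_outlier <- median(prices_outlier)

# Deviations from the average for the first list
deviations <- prices - avg

print(c(avg = avg, avg_outlier = avg_outlier))
print(c(med = med, med_outlier = med_outlier))
print(deviations)
```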

To get an idea of the typical deviation, we could take the arithmetic mean of the deviations:

\[ \frac{(-\text{€ 2}) + (-\text{€ 2}) + (\text{€ 0}) + (+\text{€ 1}) + (+\text{€ 3})}{5} = \text{€ 0} \]

It can be easily proven that whatever the list of measurements the arithmetic mean of the deviations is always equal to 0: the negative deviations exactly cancel out the positive ones. Therefore statisticians use the quadratic mean of the deviations as a measure of the spread; the outcome is called the standard deviation.

The standard deviation (SD) is a measure of the typical deviation of the measurements from their mean. It is computed as the quadratic mean (or root-mean-square size) of the deviations from the average.

The quadratic mean is usually referred to as the root-mean-square (R-M-S) size. To obtain the standard deviation, first find the deviations. Then, to compute the quadratic mean (or root-mean-square size) of the deviations, apply the root-mean-square recipe in reverse order: first square the deviations, then find the (arithmetic) mean of the result, and finally take the (square) root. In our example:

1. Square the deviations:

\[ \begin{aligned} (-\text{€ 2})^2 &= 4\ \text{€}^2 \\ (-\text{€ 2})^2 &= 4\ \text{€}^2 \\ (\text{€ 0})^2 &= 0\ \text{€}^2 \\ (+\text{€ 1})^2 &= 1\ \text{€}^2 \\ (+\text{€ 3})^2 &= 9\ \text{€}^2 \end{aligned} \]

By squaring we get rid of the minus signs. Note that the unit of measurement (here: €) is squared, too.

2. Next find the arithmetic mean (or average) of the results from the previous step:

\[ \text{mean} = \frac{4\ \text{€}^2 + 4\ \text{€}^2 + 0\ \text{€}^2 + 1\ \text{€}^2 + 9\ \text{€}^2}{5} = \frac{18\ \text{€}^2}{5} = 3.6\ \text{€}^2 \]

The unit (€) is still squared (€²).

3. Finally take the square root of the result from the previous step:

\[ \sqrt{3.6\ \text{€}^2} \approx \text{€ 1.90} \]

This is the standard deviation. Note that by taking the square root, the units are again €: the standard deviation has the same unit as the measurements. In this case, the measurements were in euros, so the standard deviation is also in euros.
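The root-mean-square recipe can be followed step by step in R. The sketch below applies it to the wine prices from the text; the object names are assumptions chosen for this illustration.

```r
# Prices of the same bottle of wine in five restaurants (in euros)
prices <- c(2, 2, 4, 5, 7)

# Deviations from the average; their arithmetic mean is always 0
deviations <- prices - mean(prices)
mean_dev <- mean(deviations)

# Root-mean-square recipe, applied in reverse order:
squared <- deviations^2     # 1. square the deviations
mean_sq <- mean(squared)    # 2. take the mean of the squares
sd_rms <- sqrt(mean_sq)     # 3. take the square root

print(mean_dev)
print(mean_sq)
print(sd_rms)
```

Be aware that R's built-in sd() function divides the sum of squared deviations by the number of measurements minus one rather than by the number of measurements, so sd(prices) returns a somewhat larger value than the recipe above.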

Expressed as a formula, we get:

\[ \text{SD} = \sqrt{\frac{\text{sum of (deviations)}^2}{\text{number of measurements}}} \]

(The formula is on the formula sheet, so you don't have to learn it by heart.) The formula above is for the standard deviation of a population. For reasons I won't explain, a better formula for the standard deviation of a sample is:

\[ \text{SD}^+ = \sqrt{\frac{\text{sum of (deviations)}^2}{\text{sample size}}} \times \sqrt{\frac{\text{sample size}}{\text{sample size} - 1}} \]

that is, you compute the SD with the usual formula (the quadratic mean of the deviations), which is the first factor in the equation above, and then multiply by

\[ \sqrt{\frac{\text{sample size}}{\text{sample size} - 1}} \]

(you don't have to memorize this formula). Because the second factor is larger than 1, the formula gives a value larger than SD. That's why Freedman et al. (2007) use the notation SD+. For large samples, the difference between SD and SD+ is small. In what follows, we'll use the SD formula for both samples and populations, unless stated explicitly otherwise. We'll return to SD+ when we discuss small samples.

Remember the following rule: few measurements are more than three SDs from the average.[1] This rule holds for histograms of any shape. Measurements that are more than three SDs from the average (exceptionally small or exceptionally large measurements) are called outliers. To identify outliers, compute the standard scores of all measurements. The standard score expresses how many standard deviations a measurement is below (−) or above (+) the average:

\[ \text{standard score} = \frac{\text{measurement} - \text{average}}{\text{standard deviation}} \]

Converting measurements to standard scores is called standardizing.

Let us return to the daily traded volumes of Apple shares (table 1.1). The volumes of Apple shares traded on the first 50 trading days of 2013 have an average of … and a standard deviation of …. On 14 March 2013 only … Apple shares were traded. Is that volume exceptionally small?
Compute the standard score for that volume:

\[ \text{standard score} = \frac{\text{volume on 14 March 2013} - \text{average}}{\text{standard deviation}} = -1.13 \]

The standard score of −1.13 means that the volume of shares was 1.13 standard deviations below the average. Because the absolute value of the standard score (after omitting the minus sign: 1.13) is smaller than 3, we don't consider this volume an outlier.

[1] A more precise statement can be made. It can be proven (Chebychev's Theorem) that at least 8/9 of the measurements fall within 3 SDs of the average, that is, in the interval [average − 3 SD, average + 3 SD]. Hence at most 1/9 of the measurements fall outside that interval. You don't have to memorize this.

Standard scores have no units. The following example illustrates this. A list of incomes per person for most countries in the world (the Penn World Table, Heston et al. (2012)) has an average of $ … and a standard deviation of $ …. Income per person in Belgium is $ …. The standard score for income per person in Belgium is:

\[ \text{standard score} = \frac{\text{income per person in Belgium (\$)} - \text{average (\$)}}{\text{standard deviation (\$)}} = 1.32 \]

The units in the numerator ($) and denominator ($) cancel each other out, and hence the standard score has no units. That's why Freedman et al. (2007) refer to computing standard scores as converting a measurement to standard units. The standard score of 1.32 means that income per person in Belgium is 1.32 standard deviations above the average of all countries in the list. So is income per person in Belgium an outlier?

Shortcut formula for the SD of 0-1 lists. Computing the SD is tedious. To estimate percentages, we'll be dealing with lists that consist of just zeroes and ones (0-1 lists): for instance, we will model an employee with a private pension plan as a 1, and an employee without a private pension plan as a 0. The following shortcut formula simplifies the calculation of the SD of 0-1 lists: the standard deviation of a list that consists of just zeroes and ones can be computed as:

\[ \text{SD of 0-1 list} = \sqrt{\left(\text{fraction of ones in the list}\right) \times \left(\text{fraction of zeroes in the list}\right)} \]

(This formula is on the formula sheet, so no need to memorize. Just for your information, I posted a proof on the course home page.)

Here is an example. Consider the list {0, 1, 1, 1, 0}. The average is 3/5. The deviations from the average are: {−3/5, 2/5, 2/5, 2/5, −3/5}, or {−0.6, 0.4, 0.4, 0.4, −0.6}. The SD is the root-mean-square size of the deviations:

1. Square the deviations: {0.36, 0.16, 0.16, 0.16, 0.36}

2. Next find the average of the squared deviations:

\[ \frac{0.36 + 0.16 + 0.16 + 0.16 + 0.36}{5} = \frac{1.20}{5} = 0.24 \]

3. Finally take the square root to obtain the SD:

\[ \text{SD} = \sqrt{0.24} \approx 0.49 \]

According to the shortcut rule we can compute the SD as:

\[ \sqrt{\left(\text{fraction of ones}\right) \times \left(\text{fraction of zeroes}\right)} = \sqrt{\frac{3}{5} \times \frac{2}{5}} = \sqrt{\frac{6}{25}} = \sqrt{0.24} \approx 0.49 \]

which indeed yields the same result, with far fewer calculations.
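The comparison between the two methods can be sketched in R for the list {0, 1, 1, 1, 0} from the text; the object names are assumptions chosen for this illustration.

```r
# The 0-1 list from the example
x <- c(0, 1, 1, 1, 0)

# Usual method: root-mean-square size of the deviations from the average
sd_rms <- sqrt(mean((x - mean(x))^2))

# Shortcut: square root of (fraction of ones) times (fraction of zeroes)
frac_ones <- mean(x == 1)
frac_zeroes <- mean(x == 0)
sd_shortcut <- sqrt(frac_ones * frac_zeroes)

print(c(sd_rms = sd_rms, sd_shortcut = sd_shortcut))
```

Both values should agree up to floating-point rounding; the shortcut only needs the two fractions, which is what makes it convenient for long 0-1 lists.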

1.6 The normal curve

Many bell-shaped histograms can be approximated by a special curve called the normal curve. The function describing the normal curve is complicated:

\[ y = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \]

In practice we won't need this equation: it is programmed in all statistical software packages. The equation describes the standard normal curve, which is the only version of the normal curve we'll need. In what follows, I'll refer to the standard normal curve simply as the normal curve. Figure 1.2 illustrates the properties of the standard normal curve:

1. the curve is symmetric about 0;
2. the area under the curve is 100% (or 1);
3. the curve is always above the horizontal axis.

Figure 1.2: The standard normal curve. (Horizontal axis: standard units (z); vertical axis: density, in % per standard unit.)

Statisticians use statistical software (on a calculator or a computer) to find areas under the normal curve. On a TI-84, you find the area under the standard normal curve using the normal cumulative distribution function (normalcdf). The area under the standard normal curve between −1 and 2 is:

    DISTR normalcdf(-1,2)

which yields approximately 0.8186. To express the area as a percentage, multiply by 100%:

\[ 0.8186 \times 100\% = 81.86\% \]

The area under the standard normal curve to the right of 1 (that is, between 1 and infinity) is:

    DISTR normalcdf(1,10^99)

The area under the standard normal curve to the left of 2 (that is, between minus infinity and 2) is:

    DISTR normalcdf(-10^99,2)

For the exams, you have to use the TI-84 to find areas under the normal curve. On the course web page I posted an R script (area-under-normal-curve.r) that computes and plots the area under the normal curve between any two values on the horizontal axis. R Commander has a built-in function to find the area under the normal curve in the left tail or in the right tail: Distributions → Continuous distributions → Normal distribution → Normal probabilities.

1.7 Approximating a density histogram by the normal curve

These are scores of 100 job applicants who took a selection test:

74, 82, 70, 84, 54, 60, 79, 62, 72, 66, 72, 79, 73, 73, 84, 59, 53, 65, 62, 81, 76, 67, 72, 89, 70, 72, 71, 78, 98, 58, 68, 89, 70, 62, 71, 56, 68, 68, 76, 63, 63, 71, 82, 63, 98, 76, 74, 71, 52, 80, 80, 66, 69, 67, 70, 81, 62, 63, 76, 57, 89, 60, 87, 80, 75, 71, 87, 59, 69, 65, 66, 67, 62, 87, 58, 58, 60, 54, 74, 83, 48, 77, 79, 60, 84, 86, 68, 64, 83, 65, 77, 79, 68, 75, 77, 72, 47, 77, 68, 67

(the data are posted on the course web page). The average of the test scores is about 70, and the standard deviation is about 10 (verify using R Commander). Figure 1.3 shows the density histogram. The histogram is bell-shaped. In 1870, the Belgian statistician Adolphe Quetelet had the idea to approximate bell-shaped histograms by the normal curve (Freedman et al., 2007, p. 78).
The horizontal scale of the histogram differs from that of the standard normal curve: most test scores are between 40 and 100, while most of the area under the standard normal curve lies between −3 and +3 on the horizontal axis; and the center of the density histogram is about 70, while the center of the standard normal curve is 0. If we standardize the values, we get what we want. To obtain the standard scores, do:

\[ \text{standard score} = \frac{\text{measurement} - \text{average}}{\text{standard deviation}} \]

For example, to standardize the first test score (74; in this case the variable has no units), do:

\[ \text{standard score} = \frac{74 - 70}{10} = 0.4 \]

The list of standard scores is: 0.4; 1.2; 0.0; ... ; −0.3. Verify that you can compute the first couple of standard scores.
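Standardizing can be sketched in R as below. To keep the example short it uses only the first three test scores and the last one, together with the rounded average (70) and SD (10) quoted in the text; the object names are assumptions chosen for this illustration.

```r
# First three and last test score from the list of 100 scores
scores <- c(74, 82, 70, 67)

# Rounded average and standard deviation of all 100 scores, as given in the text
ave <- 70
sd_scores <- 10

# Standard score = (measurement - average) / standard deviation
z <- (scores - ave) / sd_scores
print(z)
```

With all 100 scores in a vector, the same expression standardizes the whole list at once, because R applies the arithmetic element by element.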

Figure 1.3: Density histogram of 100 test scores. (Horizontal axis: test score, in points; vertical axis: density, in % per point.)

Figure 1.4 shows the histogram of the standard scores. If you compare with the histogram of the original test scores (figure 1.3) you notice that the shape of the histogram hasn't changed.

Consider the original test scores. Count the number of job applicants who had a test score between 75 and 85: 25 out of the 100 job applicants had a test score between 75 and 85. So 25% of the job applicants had a test score between 75 and 85. In the histogram (figure 1.3), the percentage corresponds to the area under the histogram between 75 and 85. The standard scores of 75 and 85 are:

\[ \frac{75 - 70}{10} = +0.5 \quad \text{and} \quad \frac{85 - 70}{10} = +1.5 \]

In the histogram of the standard scores (figure 1.4) the percentage (25%) corresponds to the area under the histogram between +0.5 and +1.5. The area under the normal curve between +0.5 and +1.5 approximates the area under the histogram between +0.5 and +1.5. Now carefully look at figure 1.4. The normal approximation overestimates the bar over the interval between +0.5 and +1.0, and underestimates the bar over the interval between +1.0 and +1.5. The area under the normal curve between +0.5 and +1.5 is approximately:

    DISTR normalcdf(0.5,1.5)

which yields approximately 0.2417, or 24.17%.
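For readers working in R rather than on the TI-84, the same areas can be obtained with pnorm(), which returns the area under the standard normal curve to the left of a value. The helper `area_between()` below is a hypothetical name introduced for this sketch; it is not the script posted on the course web page.

```r
# Area under the standard normal curve between a and b:
# area to the left of b minus area to the left of a
area_between <- function(a, b) {
  pnorm(b) - pnorm(a)
}

# Normal approximation for test scores between 75 and 85 (standard units 0.5 to 1.5)
approx_75_85 <- area_between(0.5, 1.5)

# The earlier TI-84 example: area between -1 and 2
area_m1_2 <- area_between(-1, 2)

# Tail areas: to the left of 2, and to the right of 1
left_of_2 <- pnorm(2)
right_of_1 <- pnorm(1, lower.tail = FALSE)

print(round(100 * c(approx_75_85, area_m1_2, left_of_2, right_of_1), 2))
```

Using lower.tail = FALSE avoids having to type a very large number for "infinity", as is needed with normalcdf on the calculator.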

Figure 1.4: Density histogram of 100 test scores, standardized. (Horizontal axis: standard units; vertical axis: density, in % per standard unit.)

The normal approximation (24.17%) is quite close to the actual percentage (25%).

Use your TI-84 to find the area under the normal curve between −1 and +1. Using the normal approximation, which percentage of measurements will be between ave − SD and ave + SD? Repeat for −2 and +2 and for −3 and +3. You see that the normal approximation implies the following rule, called the 68-95-99.7% rule. For a bell-shaped histogram:

approximately 68% of the measurements are within one SD of the average, that is, between ave − SD and ave + SD;
approximately 95% of the measurements are within two SDs of the average, that is, between ave − 2 SD and ave + 2 SD;
approximately 99.7% of the measurements are within three SDs of the average, that is, between ave − 3 SD and ave + 3 SD.

(The rule is not on the formula sheet; you have to know it by heart.) The normal approximation will turn out to be very useful in statistical inference (drawing conclusions about population parameters on the basis of sample evidence).
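The three percentages in the rule can be checked against the standard normal curve with a short R sketch (the object names are assumptions chosen for this illustration):

```r
# Number of SDs on either side of the average
k <- 1:3

# Area under the standard normal curve between -k and +k
within_k_sd <- pnorm(k) - pnorm(-k)

# Express as percentages, rounded to two decimals
pct <- round(100 * within_k_sd, 2)
names(pct) <- paste("within", k, "SD")
print(pct)
```

The rule quoted in the text rounds these areas to 68%, 95%, and 99.7%.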

1.8 Questions for Review

1. What is the difference between a qualitative and a quantitative variable? Illustrate using examples where you consider different characteristics of the students in the class.
2. What is the difference between a parameter and a statistic?
3. What does descriptive statistics do?
4. What does statistical inference do?
5. How can you summarize the distribution of a numerical data set in a table? In a graph?
6. In a density histogram, what does the density represent? What are the units of density? Explain for a hypothetical distribution of heights (in centimeter) of people.
7. When would the median be a better measure of the central tendency of a distribution than the mean? Illustrate by giving an example.
8. What does the standard deviation measure? How is the standard deviation computed?
9. What are the properties of the normal curve?
10. What does the standard score measure? How is the standard score computed?
11. What does the 68-95-99.7% rule say?

1.9 Exercises

1. Download the data file AAPL-HistoricalQuotes.csv from the course web site and save the data file to your STA201 folder (directory). The data set contains data about Apple stock. Run R Commander and load the data set: Data → Import data → from text file, clipboard, or URL.... A window opens. For Location of Data File select Local file system. For Field Separator select Commas. For Decimal-Point Character select Period [.]. Press OK, navigate to the data file AAPL-HistoricalQuotes.csv, and double-click the file. Your data should now be loaded by R Commander. In the R Commander menu, click the View Data Set button. A new window opens, showing the data set. The variable volume is the variable from table 1.1. Now enter the following line of script in the R script window:

    h <- hist(dataset$volume/1000000, right=FALSE)

and press the Submit button. This command will compute the numbers needed to make a histogram and store them in an object called h.
Next, type in the R script window:

    h$breaks

and press the Submit button. The output window will display the breaks between the intervals, that is, the boundaries of the intervals used by R when it computes the frequency table. Next, type in the R script window:

    h$counts

and press the Submit button. The output window will display the absolute frequencies (counts) of each interval. Next, type in the R script window:

    h$density

and press the Submit button. The output window will display the densities of each interval. The densities are expressed as decimal fractions per horizontal unit; to get densities expressed as percentages per horizontal unit you have to multiply by 100%. Finally, type in the R script window:

    h$counts/sum(h$counts)

and press the Submit button. The output window will display the relative frequencies for each interval; to get relative frequencies expressed as percentages you have to multiply by 100%.

2. Use the relative frequencies from table 1.2 to compute the densities for the other intervals. Add a column to show the densities. Then draw the density histogram to scale on squared paper.

3. Figure 1.1 shows that the daily traded volumes of Apple shares have a skewed distribution. The average daily volume is … shares. Find the median. Show your work. How do mean and median compare? Is that what you expected from the shape of the histogram? Explain.

4. Find the standard deviation of {1, 1, 1, 1, 0} using two methods: the usual formula (root-mean-square size of the deviations) and the shortcut formula for 0-1 lists. Do you get the same result?

5. The daily traded volumes of Apple shares (table 1.1) have an average of … and a standard deviation of …. Is … an outlier? And …? Show your work and explain.

6. Use the TI-84 to find the areas under the standard normal curve:
(a) to the right of 1.87
(b) to the left of 5.20
(c) between −1 and +1
(d) between −2 and +2
(e) between −3 and +3
Make for every case a sketch, with the relevant area shaded.
Verify your answers using the R script. We ll get back to cases (c), (d), and (e) in a moment.

7. For the 100 given test scores, find which percentage of job applicants scored between 50 and 60. Then use the normal approximation. Is the normal approximation close?

8. For 164 adult Belgian men born in 1962 the average height is centimeters and the SD is 8.2 centimeters (Garcia and Quintana-Domeque, 2007). Suppose that the histogram of the 164 heights follows the normal curve (heights usually do). What is, approximately, the percentage of men in this group with a height of 170 centimeters or less? What is, approximately, the percentage of men in this group with a height between 170 centimeters and 180 centimeters?

9. Of the volumes of Apple shares traded in the first 50 trading days of 2013 (p. 1.2) the average is and the SD is Find the actual percentage of values between:
ave − SD and ave + SD;
ave − 2 SD and ave + 2 SD;
ave − 3 SD and ave + 3 SD.
Does the rule give a good approximation? Why (not)?


Chapter 2 Probability distributions

2.1 Chance experiments

Examples of chance experiments are: rolling a die and counting the dots; tossing a coin and observing whether you get heads or tails; or randomly drawing a card from a well-shuffled deck of cards and observing which card you get. It is convenient to think of a chance experiment in terms of the following chance model: randomly drawing one or more tickets from a box. For instance, rolling a die is modeled as randomly drawing a ticket from the box:

    | 1 | 2 | 3 | 4 | 5 | 6 |

In R:

    box <- c(1,2,3,4,5,6)
    sample(box,1)

Tossing a coin is like randomly drawing a ticket from the box:

    | heads | tails |

In R:

    box <- c("heads","tails")
    sample(box,1)

2.2 Frequency interpretation of probability

Consider the following chance experiment. Roll a die and count the dots. If you get an ace (1), write down 1; if you don't get an ace (2, 3, 4, 5, or 6), write down 0. Repeat the experiment many times. After each roll, compute the relative frequency of aces up to that point. Make a graph with the number of rolls on the horizontal axis and the relative frequency on the vertical axis. Figure 2.1 shows the result of 10,000 repetitions of such an experiment. The frequency of aces tends towards 1/6 (about 16.67%, the horizontal dashed line). The frequency interpretation of probability states that the probability of an event is the percentage to which the relative frequency tends if you repeat the chance experiment over and over, independently and under the same conditions (Freedman et al., 2007, p. 222).
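The experiment described in section 2.2 can be imitated in R. The following is only a sketch, not the script used for figure 2.1: the seed, the number of rolls, and the object names (rolls, ace, rel_freq) are assumptions chosen for illustration.

```r
# Simulate rolling a die many times and track the relative frequency of aces.
set.seed(1)                        # arbitrary seed, only to make the run reproducible
box <- c(1, 2, 3, 4, 5, 6)
n <- 10000                         # number of rolls (assumed, matching the figure caption)
rolls <- sample(box, n, replace = TRUE)

ace <- as.numeric(rolls == 1)      # write down 1 for an ace, 0 otherwise
rel_freq <- 100 * cumsum(ace) / seq_len(n)   # relative frequency (%) after each roll

# Number of rolls on the horizontal axis, relative frequency on the vertical axis.
plot(seq_len(n), rel_freq, type = "l",
     xlab = "Number of repeats", ylab = "Frequency of aces (%)")
abline(h = 100 / 6, lty = 2)       # dashed line at 1/6, expressed as a percentage

rel_freq[n]                        # relative frequency after the last roll
```

With a different seed the path of the line changes, but after many rolls it should settle close to the dashed line.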

Figure 2.1: Frequency of aces in 10,000 rolls of a die (vertical axis: frequency of aces, in %; horizontal axis: number of repeats)

2.3 Drawing with and without replacement

Consider the following box with tickets:

    | 1 | 2 | 3 | 4 | 5 | 6 |

The probability to draw an even number is 3/6:

    P(2nd draw is even) = 3/6

Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you replace the ticket, and again randomly draw a ticket from the box. This is called drawing with replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, is again 3/6. In mathematical notation:

    P(2nd draw is even | 1st draw was 2) = 3/6

The vertical bar (|) is shorthand for "given that". What comes after the vertical bar is called the condition. A probability with a condition is called a conditional probability. Note that in this case imposing the condition didn't affect the probability of drawing an even number: whether the first draw was 2 or not doesn't matter for the second draw, because we replaced the ticket after the first draw. In both cases, the probability of getting an even number was the same (3/6):

    P(2nd draw is even | 1st draw was 2) = P(2nd draw is even)

The two events (getting 2 on the first draw, and getting an even number on the second draw) are said to be independent: the probability of the second event is not affected by how the first event turned out. That is because we were drawing with replacement. When drawing with replacement, the events are independent.

Now consider a different chance experiment. Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you don't replace the ticket. The box now looks like this:

    | 1 | 3 | 4 | 5 | 6 |

If we now again randomly draw a ticket from the box, this is called drawing without replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, now is 2/5:

    P(2nd draw is even | 1st draw was 2) = 2/5

In this case, what happened in the first draw (as expressed by the condition "1st draw was 2") does make a difference: the probability of getting an even number differs:

    P(2nd draw is even | 1st draw was 2) ≠ P(2nd draw is even)

The two events (getting 2 on the first draw, and getting an even number on the second draw) are said to be dependent: the probability of the second event is affected by how the first event turned out. That is because we were drawing without replacement. When drawing without replacement, the events are dependent.

Think of a population as a box with tickets. A random sample is like drawing a number of tickets without replacement from this box. The number of draws is the sample size. Remember this. We'll use this box model when doing statistical inference.

2.4 The sum of draws

For the theory of statistical inference, we'll frequently use the concept of the sum of draws. Here's a simple example: roll a die twice, and add the numbers.
The chance model has the following box:

    | 1 | 2 | 3 | 4 | 5 | 6 |

Draw two tickets with replacement from the box, and add the outcomes. The result is the sum of draws. The sum of draws is a brief way to say the following (Freedman et al., 2007, p. 280):

Draw tickets from a box.
Add the numbers on the tickets.

As the following activity makes clear, the sum of draws is itself a random variable:

(a) Conduct the chance experiment above using an actual die or the following R script:

    box <- c(1,2,3,4,5,6)
    sample(box,1) + sample(box,1)

(b) Repeat the experiment a couple of times and write down the outcomes (using an actual die, or in R by running the line sample(box,1)+sample(box,1)). Would it be fair to say that the sum of draws is a chance variable? Explain.

2.5 Picking an appropriate chance model

We model a population as a box with tickets. Taking a random sample is like randomly drawing a number of tickets from the box, without replacement; the number of draws is the sample size. In order to use such a chance model for inference, we will use some interesting properties of the sum of draws. The trick is to set up the chance model in such a way that the chance variable of interest is the sum of draws, or is computed from the sum of draws. An example clarifies my argument.

Suppose you roll a die three times, and want to know what the sum of the outcomes is. What is the appropriate chance model? What is the chance variable? An appropriate chance model is a box with six tickets:

    | 1 | 2 | 3 | 4 | 5 | 6 |

and the chance variable is the sum of three random draws with replacement from the box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 3, 2, and 6. The sum of draws (3 + 2 + 6 = 11) is obtained by adding up the outcomes.

Now suppose that we are interested in another question: how many times (out of three rolls) will we get a six? First, we need the appropriate chance model. When we roll a die, we can get two kinds of outcomes: either we get a six (we'll label this outcome as a success), or we get another number (1, 2, 3, 4, 5: not a success). The term success is used here in a technical meaning: the outcome we are interested in.
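The three-roll example can be written in R with a single call to sample(). This is a sketch rather than a script from the course materials; the names draws and sums are illustrative. Note that sample() draws without replacement unless replace = TRUE is given, so the argument matters here.

```r
# Sum of three draws with replacement from the box 1..6 (three rolls of a die).
box <- c(1, 2, 3, 4, 5, 6)

draws <- sample(box, 3, replace = TRUE)   # three tickets, each put back before the next draw
draws                                     # the three outcomes
sum(draws)                                # the sum of draws

# Repeating the experiment shows that the sum of draws varies from run to run.
sums <- replicate(10, sum(sample(box, 3, replace = TRUE)))
sums
```

Each run gives different numbers, which is the sense in which the sum of draws is a chance variable.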
Note that we classify the outcomes of a single roll as a success or not a success. In such a case, the appropriate chance model is a box with six tickets: one ticket 1 for the outcome 6 labelled as a success, and five tickets 0 for the outcomes 1, 2, 3, 4, or 5 labelled as not a success:

    | 0 | 0 | 0 | 0 | 0 | 1 |
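In R, a box of this kind could be set up as follows. The sketch assumes nothing beyond base R; the names box01, rolls, and tickets are made up for illustration. It also shows that classifying actual rolls as 1 (six) or 0 (not a six) and adding the results is the same as counting the sixes.

```r
# A box with one ticket 1 (success: a six) and five tickets 0 (not a success).
box01 <- c(0, 0, 0, 0, 0, 1)

# Three rolls of a die correspond to three draws with replacement from this box.
draws <- sample(box01, 3, replace = TRUE)
number_of_sixes <- sum(draws)        # sum of draws = number of successes
number_of_sixes

# Classifying and counting actual rolls gives the same kind of sum.
rolls <- c(3, 2, 6)                  # example rolls
tickets <- as.numeric(rolls == 6)    # 0, 0, 1
sum(tickets)                         # 1: one six in three rolls
```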

Now we are interested in the number of sixes in three rolls, so we need to count the sixes. Counting the sixes is the same thing as taking the sum of three draws from the 0-1 box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 0, 0, and 1 (we classified each outcome as a success or not a success). The sum of draws (0 + 0 + 1 = 1) is the number of sixes (the number of successes). A box like this, with tickets that can only take values 0 and 1, is called a 0-1 box. Remember that when the problem is one of classifying and counting, the appropriate box is a 0-1 box.

Here's a real-world example. Suppose you are the marketing manager of a telecommunications company that doesn't cover Brussels yet. You would like to find out which percentage of households in Brussels already has a tablet. The population of interest is all households in Brussels. Think of each household in Brussels as a ticket in a box, so there are as many tickets as households. A ticket takes value 1 if the household has a tablet, and 0 if the household doesn't. Taking a random sample of households is like randomly drawing tickets without replacement from this 0-1 box. The number of households in the sample that have a tablet is the sum of draws. The percentage of households in the sample that have a tablet is:

    sample percentage = (sum of draws / size of the sample) × 100%

2.6 Probability distributions

Chance experiments can be described using probability distributions. In what follows, we'll focus on the probability distribution of the sum of draws. Suppose you roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box

    | 1 | 2 | 3 | 4 | 5 | 6 |

and add the outcomes. The chance variable (the sum of the two draws) can take the following values: {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} (the chance variable is discrete; we won't develop the theory for continuous chance variables).
For each of these possible outcomes, we can compute the probability. There are 36 possible combinations (rows: first draw; columns: second draw):

    (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
    (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
    (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
    (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
    (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
    (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Each of these 36 combinations has the same probability, and as the probabilities have to add up to 1, each combination has a probability of 1/36. By applying the rules of probability, we can find the probability that the sum of draws takes the value 2, and then repeat the work to find the probability that the sum of draws takes the value 3, and so on. There are for instance two combinations that yield a sum of 3:

when the first draw is 1 and the second draw is 2 (row 1, column 2 in the table above)
when the first draw is 2 and the second draw is 1 (row 2, column 1)

The probability that the sum of draws is 3 is therefore equal to:

    P(sum is 3) = P[(first 1, then 2) or (first 2, then 1)]

Apply the addition rule (Freedman et al., 2007) to obtain:

    P(sum is 3) = P(first 1, then 2) + P(first 2, then 1) − something

The third term ("minus something") is equal to zero because the events (first 1, then 2) and (first 2, then 1) are mutually exclusive (two events are mutually exclusive when, as one event happens, the other cannot happen at the same time). So we get:

    P(sum is 3) = P(first 1, then 2) + P(first 2, then 1) − 0 = 1/36 + 1/36 = 2/36

If you do this for all other possible values of the chance variable, you get the following table:

    outcome   probability
    2         1/36
    3         2/36
    4         3/36
    5         4/36
    6         5/36
    7         6/36
    8         5/36
    9         4/36
    10        3/36
    11        2/36
    12        1/36

A table that shows all possible values for a (discrete) chance variable and the corresponding probabilities is called a probability distribution. We can graph the probability distribution as a bar chart. On the horizontal axis we put the chance variable, and we construct the bar chart in such a way that the area of a bar shows the probability (expressed as a percentage), just as in a density histogram the area of a bar showed the relative frequency (expressed as a percentage) of the data over the interval. That is why Freedman et al. (2007) call such a bar chart a probability histogram. For a discrete chance variable the convention is to center the bars on the values that the variable can take: the bar over 2 will start at 1.5 and end at 2.5; the bar over 3 will start at 2.5 and end at 3.5, and so on. The width of each bar is equal to 1. The height of each bar in a probability distribution is called probability density: the probability per unit on the horizontal axis.
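The 36 combinations and the resulting probability distribution can also be generated in R rather than worked out by hand. The following is a sketch and not part of the course scripts: outer(), table(), and barplot() are base R functions, while the object names sums, counts, and prob are chosen here for illustration.

```r
# Enumerate all 36 equally likely combinations of two draws and tabulate their sums.
box <- c(1, 2, 3, 4, 5, 6)

sums <- outer(box, box, "+")                        # 6 x 6 table of sums
dimnames(sums) <- list(first = box, second = box)   # rows: first draw, columns: second draw
print(sums)

counts <- table(sums)        # how many combinations yield each sum
prob <- counts / 36          # each combination has probability 1/36
print(prob)

# Bars of width 1 centered on the possible values; heights in percent per unit.
barplot(100 * prob, space = 0,
        xlab = "Sum of two draws",
        ylab = "Probability density (% per unit)")
```

Because every bar has width 1, the height of a bar (the density) and its area (the probability) are the same number here.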
We find the probability densities by applying the formula for the area of a rectangle:

    area = width × height

We want the area to represent the probability (expressed as a percentage) and the height to represent the probability density (expressed as percent per unit on the horizontal axis), and hence the equation becomes:

    probability = width of interval on horizontal axis × probability density

Divide both sides of the equation by (width of interval on horizontal axis) to obtain:

    probability density = probability / width of interval on horizontal axis

Because the width of each interval on the horizontal axis is one unit of the horizontal axis, this becomes:

    probability density = probability per unit on the horizontal axis

which gives us the meaning of probability density. For example, the probability to get a 7 is 6/36 (about 16.67%). The probability density over the interval from 6.5 to 7.5 then is equal to:

    probability density = 16.67% / 1 unit = 16.67% per unit on the horizontal axis

Figure 2.2 shows the corresponding bar chart representing the probability distribution. The curve traced by the bar chart of the probability distribution is called the probability density function. The probability density function has the following properties:

the curve is always on or above the horizontal axis, that is, the probability density (on the vertical axis) is always 0 or positive;
the area under the curve is equal to 1 (or 100%);
the area under the curve between two values on the horizontal axis gives the probability.

The probability distribution has an expectation and a standard error. The following example illustrates the intuition of these concepts. Roll a die twice and add the numbers. You can do that with an actual die, or run the following R script:

    box <- c(1,2,3,4,5,6)
    sample(box,1) + sample(box,1)

Repeat this a couple of times, and write down the outcomes. You will get something like {6, 7, 10, 8, 10, ...}. The outcomes are random. The lowest value you can get is 2 (when you roll two aces), and the highest value is 12 (when you roll two sixes). If you repeat the experiment many times you'll notice that those extreme values occur only occasionally; values like 6, 7, or 8 occur much more frequently. The expectation is the typical value that the random variable will take; the value around which the outcomes vary. Another way to think about the expectation is as the center of the probability distribution (figure 2.2). In this case the expectation is 7 (we'll see below how to compute the expectation).
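Repeating the experiment many times is tedious by hand but quick in R. The sketch below is an illustration under assumptions (the seed, the number of repeats, and the name outcomes are arbitrary choices, not taken from the notes); it repeats the two-line script above with replicate() and summarizes the outcomes.

```r
# Repeat "roll a die twice and add the numbers" many times.
set.seed(2)                                   # arbitrary seed for reproducibility
box <- c(1, 2, 3, 4, 5, 6)

outcomes <- replicate(10000, sample(box, 1) + sample(box, 1))

range(outcomes)                               # never below 2 or above 12
mean(outcomes)                                # should be close to the expectation of 7

# Percentage of repeats for each value: extremes are rare, 6, 7, and 8 are common.
round(100 * table(outcomes) / length(outcomes), 1)
```

The average of the simulated outcomes is not exactly 7 in any given run; it only gets close to 7 when the number of repeats is large.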
Now define the difference between the outcome of a chance experiment and the expectation as the chance error. For instance, our first outcome was 6, the expectation is 7, and hence the chance error was:

    chance error = outcome − expectation = 6 − 7 = −1

(the negative value of −1 means that the outcome was 1 below the expectation). If we compute the chance errors for the other outcomes, we get:
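As a sketch, the same subtraction can be done in R for the example outcomes listed earlier (6, 7, 10, 8, 10), using the expectation of 7. The variable names are illustrative and not taken from the notes.

```r
# Chance error = outcome - expectation, computed for each example outcome.
outcomes <- c(6, 7, 10, 8, 10)     # the example outcomes quoted in the text
expectation <- 7                   # expectation of the sum of two rolls, as stated above

chance_error <- outcomes - expectation
chance_error                       # -1, 0, 3, 1, 3
```

A positive chance error means the outcome was above the expectation, a negative one that it was below.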

6 3 The Standard Normal Distribution

6 3 The Standard Normal Distribution 290 Chapter 6 The Normal Distribution Figure 6 5 Areas Under a Normal Distribution Curve 34.13% 34.13% 2.28% 13.59% 13.59% 2.28% 3 2 1 + 1 + 2 + 3 About 68% About 95% About 99.7% 6 3 The Distribution Since

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

The Normal Distribution

The Normal Distribution Chapter 6 The Normal Distribution 6.1 The Normal Distribution 1 6.1.1 Student Learning Objectives By the end of this chapter, the student should be able to: Recognize the normal probability distribution

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple. Graphical Representations of Data, Mean, Median and Standard Deviation In this class we will consider graphical representations of the distribution of a set of data. The goal is to identify the range of

More information

Coins, Presidents, and Justices: Normal Distributions and z-scores

Coins, Presidents, and Justices: Normal Distributions and z-scores activity 17.1 Coins, Presidents, and Justices: Normal Distributions and z-scores In the first part of this activity, you will generate some data that should have an approximately normal (or bell-shaped)

More information

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175) Describing Data: Categorical and Quantitative Variables Population The Big Picture Sampling Statistical Inference Sample Exploratory Data Analysis Descriptive Statistics In order to make sense of data,

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences Introduction to Statistics for Psychology and Quantitative Methods for Human Sciences Jonathan Marchini Course Information There is website devoted to the course at http://www.stats.ox.ac.uk/ marchini/phs.html

More information

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Discrete vs. continuous random variables Examples of continuous distributions o Uniform o Exponential o Normal Recall: A random

More information

Measurement with Ratios

Measurement with Ratios Grade 6 Mathematics, Quarter 2, Unit 2.1 Measurement with Ratios Overview Number of instructional days: 15 (1 day = 45 minutes) Content to be learned Use ratio reasoning to solve real-world and mathematical

More information

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: Density Curve A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: 1. The total area under the curve must equal 1. 2. Every point on the curve

More information

Lab 11. Simulations. The Concept

Lab 11. Simulations. The Concept Lab 11 Simulations In this lab you ll learn how to create simulations to provide approximate answers to probability questions. We ll make use of a particular kind of structure, called a box model, that

More information

AMS 5 CHANCE VARIABILITY

AMS 5 CHANCE VARIABILITY AMS 5 CHANCE VARIABILITY The Law of Averages When tossing a fair coin the chances of tails and heads are the same: 50% and 50%. So if the coin is tossed a large number of times, the number of heads and

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

MEASURES OF VARIATION

MEASURES OF VARIATION NORMAL DISTRIBTIONS MEASURES OF VARIATION In statistics, it is important to measure the spread of data. A simple way to measure spread is to find the range. But statisticians want to know if the data are

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

Unit 7: Normal Curves

Unit 7: Normal Curves Unit 7: Normal Curves Summary of Video Histograms of completely unrelated data often exhibit similar shapes. To focus on the overall shape of a distribution and to avoid being distracted by the irregularities

More information

5/31/2013. 6.1 Normal Distributions. Normal Distributions. Chapter 6. Distribution. The Normal Distribution. Outline. Objectives.

5/31/2013. 6.1 Normal Distributions. Normal Distributions. Chapter 6. Distribution. The Normal Distribution. Outline. Objectives. The Normal Distribution C H 6A P T E R The Normal Distribution Outline 6 1 6 2 Applications of the Normal Distribution 6 3 The Central Limit Theorem 6 4 The Normal Approximation to the Binomial Distribution

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

AP * Statistics Review. Descriptive Statistics

AP * Statistics Review. Descriptive Statistics AP * Statistics Review Descriptive Statistics Teacher Packet Advanced Placement and AP are registered trademark of the College Entrance Examination Board. The College Board was not involved in the production

More information

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)

More information

T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Problem Solving and Data Analysis

Problem Solving and Data Analysis Chapter 20 Problem Solving and Data Analysis The Problem Solving and Data Analysis section of the SAT Math Test assesses your ability to use your math understanding and skills to solve problems set in

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

Chapter 2: Descriptive Statistics

Chapter 2: Descriptive Statistics Chapter 2: Descriptive Statistics **This chapter corresponds to chapters 2 ( Means to an End ) and 3 ( Vive la Difference ) of your book. What it is: Descriptive statistics are values that describe the

More information

Probability. Distribution. Outline

Probability. Distribution. Outline 7 The Normal Probability Distribution Outline 7.1 Properties of the Normal Distribution 7.2 The Standard Normal Distribution 7.3 Applications of the Normal Distribution 7.4 Assessing Normality 7.5 The

More information

Descriptive Statistics and Measurement Scales

Descriptive Statistics and Measurement Scales Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample

More information

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance Principles of Statistics STA-201-TE This TECEP is an introduction to descriptive and inferential statistics. Topics include: measures of central tendency, variability, correlation, regression, hypothesis

More information

REPEATED TRIALS. The probability of winning those k chosen times and losing the other times is then p k q n k.

REPEATED TRIALS. The probability of winning those k chosen times and losing the other times is then p k q n k. REPEATED TRIALS Suppose you toss a fair coin one time. Let E be the event that the coin lands heads. We know from basic counting that p(e) = 1 since n(e) = 1 and 2 n(s) = 2. Now suppose we play a game

More information

You flip a fair coin four times, what is the probability that you obtain three heads.

You flip a fair coin four times, what is the probability that you obtain three heads. Handout 4: Binomial Distribution Reading Assignment: Chapter 5 In the previous handout, we looked at continuous random variables and calculating probabilities and percentiles for those type of variables.

More information

Course Syllabus STA301 Statistics for Economics and Business (6 ECTS credits)

Course Syllabus STA301 Statistics for Economics and Business (6 ECTS credits) Course Syllabus STA301 Statistics for Economics and Business (6 ECTS credits) Instructor: Luc Hens Telephone: +32 2 629 11 92 e-mail: luc.hens@vub.ac.be Web site: http://homepages.vub.ac.be/~lmahens/ Course

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

Scatter Plots with Error Bars

Scatter Plots with Error Bars Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each

More information

Statistics Revision Sheet Question 6 of Paper 2

Statistics Revision Sheet Question 6 of Paper 2 Statistics Revision Sheet Question 6 of Paper The Statistics question is concerned mainly with the following terms. The Mean and the Median and are two ways of measuring the average. sumof values no. of

More information

Years after 2000. US Student to Teacher Ratio 0 16.048 1 15.893 2 15.900 3 15.900 4 15.800 5 15.657 6 15.540

Years after 2000. US Student to Teacher Ratio 0 16.048 1 15.893 2 15.900 3 15.900 4 15.800 5 15.657 6 15.540 To complete this technology assignment, you should already have created a scatter plot for your data on your calculator and/or in Excel. You could do this with any two columns of data, but for demonstration

More information

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS
