STATISTICS AND STANDARD DEVIATION

Transcription

1 STATISTICS AND STANDARD DEVIATION

2 Statistics and Standard Deviation STSD-A Objectives... STSD 1 STSD-B Calculating Mean... STSD STSD-C Definition of Variance and Standard Deviation... STSD 4 STSD-D Calculating Standard Deviation... STSD 5 STSD-E Coefficient of Variation... STSD 7 STSD-F Normal Distribution and z-scores... STSD 8 STSD-G Chebshev s Theorem... STSD 15 STSD-H Correlation and Scatterplots... STSD 16 STSD-I Correlation Coefficient and Regression Equation... STSD 1 STSD-J Summar... STSD 5 STSD-K Review Eercise... STSD 7 STSD-L Appendi z-score Values Table... STSD 9 STSD-Y Inde... STSD 30 STSD-Z Solutions... STSD 3 STSD-A Objectives To calculate the mean and standard deviation of lists, tables and grouped data To determine the correlation co-efficient To calculate z-scores To use normal distributions to determine proportions and values To use Chebshev s theorem To determine correlation between sets of data To construct scatterplots and lines of best fit To calculate correlation coefficient and regression equation for data sets. STSD 1

3 STSD-B Calculating Mean The mean is a measure of central tendenc. It is the value usuall described as the average. The mean is determined b summing all of the numbers and dividing the result b the number of values. The mean of a population of N values (scores) is defined as the sum of all the scores, of the population,, divided b the number of scores, N. The population mean is represented b the Greek letter μ (mu) and calculated b using μ. N Often it is not possible to obtain data from an entire population. In such cases, a sample of the population is taken. The mean of a sample of n items drawn from the population is defined in the same wa and is denoted b, pronounced bar and calculated using Eample STSD-B1 Calculate the mean of the following student test results percentages.. n 9% 66% 99% 75% 69% 51% 89% 75% 54% 45% 69% n write out formula add together all scores divide b number of scores The mean of the student test results is 71.7 % (rounded to d.p.). When calculating the mean from a frequenc distribution table, it is necessar to multipl each score b its frequenc and sum these values. This result is then divided b the sum of the frequencies. The formula for the mean calculated from a frequenc table is f f Calculations using this formula are often simplified b setting up a table as shown below. Eample STSD-B Calculate the mean number of pins knocked down from the frequenc table. Pins () Frequenc (f) f Total f 60 f 40 mean Note: It is rare for an eact number to result from a mean calculation. f f The mean number of pins knocked down was 7 pins. STSD

4 If the frequenc distrubution table has grouped data, intervals, it is necessar to use the mid-value of the interval in mean calculations. The mid-value for an interval is calculated b adding the upper and lower boundaries of the interval and dividing the result b two. mid value: upper + lower Eample STSD-B3 Calculate the mean height of students from the frequenc table. Height (cm) mid-value () Frequenc(f) f Σ f 0 Σ f 315 mean f f cm Eercise STSD-B1 Calculate the mean of the following data sets. (a) Hocke goals scored. 5, 4, 3,,, 1, 0, 0, 1,, 3 The mean height is cm. (b) Points scored in basketball games (d) Babies weights Points Scored Frequenc (f) Bab Weight (kg) Freq (f) () Total 10 Total 15 (c) Number of tping errors (e) ATM withdrawals Tping (f) Withdrawals (f) errors ($) Total Total 30 STSD 3

5 STSD-C Definition of Variance and Standard Deviation To further describe data sets, measures of spread or dispersion are used. One of the most commonl used measures is standard deviation. This value gives information on how the values of the data set are varing, or deviating, from the mean of the data set. Deviations are calculated b subtracting the mean,, from each of the sample values,, i.e. deviation. As some values are less than the mean, negative deviations will result, and for values greater than the mean positive deviations will be obtained. B simpl adding the values of the deviations from the mean, the positive and negative values will cancel to result in a value of zero. B squaring each of the deviations, the problem of positive and negative values is avoided. To calculate the standard deviation, the deviations are squared. These values are summed, divided b the appropriate number of values and then finall the square root is taken of this result, to counteract the initial squaring of the deviation. The standard deviation of a population, σ, of N data items is defined b the following formula. ( μ ) Σ σ where μ is the population mean. N For a sample of n data items the standard deviation, s, is defined b, s Σ ( ) n 1 where is the sample mean. NOTE: When calculating the sample standard deviation we divide b (n 1) not N. The reason for this is comple but it does give a more accurate measurement for the variance of a sample. Standard deviation is measured in the same units as the mean. It is usual to assume that data is from a sample, unless it is stated that a population is being used. To assist in calculations data should be set up in a table and the following headings used: μ OR ( μ ) OR ( ) Eample STSD-C1 Determine the standard deviation of the following student test results percentages. 9% 66% 99% 75% 69% 51% 89% 75% 54% 45% 69% ( ) ( 0.7) Σ 784 Σ( ) Σ n 11 ( ) Σ s n The standard deviation of the test results is approimatel 17.6%. STSD 4

6 The variance is the average of the squared deviations when the data given represents the population. The lower case Greek letter sigma squared, σ, is used to represent the population variance. ( μ ) σ where μ is the population mean, and N is the population size. N The sample variance, which is denoted b s, is defined as s ( ) n 1 where is the sample mean, and n is the sample size. As variance is measured in squared units, it is more useful to use standard deviation, the square root of variance, as a measure of dispersion. STSD-D Calculating Standard Deviation The previousl mentioned formulae for standard deviation of a population, σ and a sample standard deviation, s, ( μ ) ( ) Σ Σ σ s N n 1 can be manipulated to obtain the following formula which are easier to use for calculations. These are commonl called computational formulae. ( Σ) ( Σ) Σ Σ N n σ s N n 1 To perform calculations again it is necessar to set up a table. The table heading in this case will be: Eample STSD-D1 Determine the standard deviation of the following student test results percentages. 9% 66% 99% 75% 69% 51% 89% 75% 54% 45% 69% Σ 784 Σ ( Σ) Σ n s n NOTE: This is approimatel the same value as calculated previousl. This value will actuall be more accurate as it onl uses rounding in the final calculation step. The standard deviation of the test scores is approimatel 17.6%. STSD 5

7 When data is presented in a frequenc table the following computational formulae for populations standard deviation, σ, and sample standard deviation, s, can be used. ( Σf) ( Σf) Σf Σf Σf Σf σ s Σf Σf 1 If the data is presented in a grouped or interval manner, the mid-values are used as with the calculation of the mean. The table heading for calculations will include. f f f Eamples STSD-D Calculate the standard deviations for each of the following data sets. (a) Number of pins knocked down in ten-pin bowling matches Pins () f f f Σ f 60 Σ f 40 Σ f 384 s Σf ( Σf) Σf Σf The standard deviation of the number of pins knocked down is approimatel.41 pins. (b) Heights of students Heights f f f Σ f 0 Σ f 315 Σ f s Σf ( Σf) Σf Σf The standard deviation of the heights is approimatel 38.33cm. STSD 6

8 Eercise STSD-D1 Calculate the standard deviations for each of the following data sets. (a) Hocke goals scored. 5, 4, 3,,, 1, 0, 0, 1,, 3 (b) Points scored in basketball games. (d) Babies weights Points Scored Frequenc (f) Bab Weight (kg) Freq (f) () Total 10 Total 15 (c) Number of tping errors. (e) ATM withdrawals STSD-E Tping (f) Withdrawals (f) errors ($) Total Total 30 Co-efficient of Variation Without an understanding of the relative size of the standard deviation compared to the original data, the standard deviation is somewhat meaningless for use with the comparison of data sets. To address this problem the coefficient of variation is used. The coefficient of variation, CV, gives the standard deviation as a percentage of the mean of the data set. s CV 100% CV σ 100% μ for a sample for a population Eample STSD-E1 Calculate the coefficient of variation for the following data set. The price, in cents, of a stock over five trading das was 5, 58, 55, 57, Σ 81 Σ 1583 s.77 CV 100% 100% 4.93% 56. n ( Σ) Σ n s n The coefficient of variation for the stock prices is 4.93%. The prices have not showed a large variation over the five das of trading STSD 7

9 The coefficient of variation is often used to compare the variabilit of two data sets. It allows comparison regardless of the units of measurement used for each set of data. The larger the coefficient of variation, the more the data varies. Eample STSD-E The results of two tests are shown below. Compare the variabilit of these data sets. Test 1 (out of 15 marks): 9 s Test (out of 50 marks): 7 s 8 s s CVtest1 100% CVtest 100% 8 100%.% 100% 9.6% 9 7 The results in the second test show a great variation than those in the first test. Eercise STSD-E1 1. Calculate the coefficient of variation for each of the following data sets. (a) Stock prices: 8, 10, 9, 10, 11 (b) Test results: 10, 5, 8, 9,, 1, 5, 7, 5, 8. Compare the variation of the following data sets. (a) Data set A: 35, 38, 34, 36, 38, 35, 36, 37, 36 Data set B: 36, 0, 45, 40, 5, 46, 6, 6, 3 (b) Bo s Heights: 141.6cm s 15.1cm Girl s Heights: 143.7cm s 8.4cm STSD-F Normal Distribution and z-scores Another use of the standard deviation is to convert data to a standard score or z-score. The z-score indicates the number of standard deviations a raw score deviates from the mean of the data set and in which direction, i.e. is the value greater or less than the mean? The following formula allows a raw score,, from a data set to be converted to its equivalent standard value, z, in a new data set with a mean of zero and a standard deviation of one. A z-score can be positive or negative: μ z sample z population s σ positive z-score raw score greater than the mean negative z-score raw score less than the mean. STSD 8

10 Eamples STSD-F1 1. Given the scores 4, 7, 8, 1, 5 determine the z-score for each raw score. 5 5 n ( ) 8 64 s n 1 1 n Σ 5 Σ 155 raw score z-score meaning z standard deviations below the mean z standard deviations above the mean z standard deviations above the mean z standard deviations below the mean z at the mean. Given a data set with a mean of 10 and a standard deviation of, determine the z-score for each of the following raw scores, is 1 standard deviations below the mean. z z z 3 10 is 0 standard deviations from the mean, it is equal to the mean. 16 is 3 standard deviations above the mean. The z-scores also allow comparisons of scores from different sources with different means and/or standard deviations. Eample STSD-F Jenn obtained results of 48 in her English eam and 75 in her Histor eam. Compare her results in the different subjects considering: English eam : class mean was 45 and the standard deviation.5 Histor eam : class mean was 78 and the standard deviation z English z Histor Jenn s English result is 1.33 standard deviations above the class mean, while her Histor was 1.5 standard deviations below the class mean. STSD 9

11 It is also possible to determine a raw score for a given z-score, i.e. it is possible to find a value that is a specified number of standard deviations from a mean. The z-score formula is transformed to + s z sample μ + σ z population Eamples STSD-F3 A data set has a mean of 10 and a standard deviation of. Find a value that is: (i) 3 standard deviations above the mean z 3 + s z is three standard deviations above the mean. (ii) standard deviations below the mean z + s z is two standard deviations below the mean. Eercise STSD-F1 1. Given the scores 56, 8, 74, 69, 94 determine the z-score for each raw score.. Given a data set with a mean of 54 and a standard deviation of 3., determine the z-score for each of the following raw scores,. (i) 81 (ii) Peter obtained results of 63 in Maths and 58 in Geograph. For Maths the class mean was 58 and the standard deviation 3.4, and for Geograph the class mean was 55 and the standard deviation.3. Compared to the rest of the class did Peter do better in Maths or Geograph? 4. A data set has a mean of 54 and a standard deviation of 3.. Find a value that is: (i) standard deviations below the mean. (ii) 1.5 standard deviations above the mean. The distribution of z-scores is described b the standard normal curve. Normal distributions are smmetric about the mean, with scores more concentrated in the centre of the distribution than in the tails. Normal distributions are often described as bell shaped. mean Data collected on heights, weights, reading abilities, memor and IQ scores often are approimated b normal distributions. The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one STSD 10

12 For normall distributed data 50% of the data is below the mean and 50% above the mean. 50% 50% In a normal distribution, the distance between the mean and a given z-score corresponds to a proportion of the total area under the curve, and hence can be related to a proportion of a population. The total area under a normal distribution curve is taken as equal to 1 or 100%. The values of the proportions, written as decimals, for various z-scores are provide in a statistical table, Normal Distribution Areas (see Appendi). The values in the Normal Distribution Areas table give a proportion value for the area between the mean and the raw score greater than the mean, converted to a positive z-score. + z z As the distribution is smmetric, it is possible to calculate proportions for raw scores less than the mean. If the calculated z-score is negative, the corresponding positive value from the table is used. So values for z are the same distance from the mean whether the are a number of standard deviations more or a number of standard deviations less than the mean, and will result in the same proportion. A B z +z proportion A proportion B In a normal distribution approimatel: 68% of values lie within ± 1 s.d. of the mean 68% % of values lie within ± s.d. of the mean 95% 99% of values lie within ± 3 s.d. of the mean 99% 3 3 STSD 11

13 When using the Normal Distribution Areas table, the z-score is structured from the row and column headings of the table and the required proportion is found in the middle of the table at the intersection of the corresponding rows and columns. Eamples STSD-F4 Use the Normal Distribution Areas table to determine the proportions that correspond to the following z-scores. (i) z row,.00 column proportion % (ii) z row,.00 column proportion % (iii) z row,.0 column proportion % (iv) z. use z row,.00 column proportion % (v) z.1 use z row,.01 column proportion % z It is also possible to find a z-score for a required proportion from the Normal Distribution Areas table. It is necessar to find the proportion in the middle of the table and read back to the row and column headings to determine the z-score. Eamples STSD-F5 Use the Normal Distribution Areas table to determine the z-scores that correspond to the following proportions. (i) 48.1% row,.00 column The z-score would be either z.1 or z.1. (ii) 48.68% row,.0 column The z-score would be either z. or z.. z STSD 1

14 Values from the Normal Distribution Areas table enable the determination of proportions of the data set that lie above a raw value, below a raw value or even between two raw values within the data set. Drawing a quick sketch of the distribution curve with the position of the mean and the raw scores(s) and shading the required area can assist with the understanding of the required calculations. Eamples STSD-F6 1. The weights of chips in packets have a mean of 50g and standard deviation of 3g. (a) Find the proportion of the packets of chips with a weight between 50g and 5g z (from the table) % of the chip packets have a weight between 50g and 5g. (b) Find the proportion of the packets of chips with a weight between 47g and 50g z (from the table) % of the chip packets have a weight between 47g and 50g. (c) Find the proportion of the packets of chips with a weight greater than 55g z Area > % of the chip packets have a weight greater than 55g. (d) Find the proportion of the packets of chips with a weight less than 45g z Area < % of the chip packets have a weight less than 45g. (e) Find the proportion of the packets of chips with a weight between 45g and 5g z z Area between 45 & % of the chip packets have a weight between 45g and 5g. STSD 13

15 Eamples STSD-F6 continued 1. (f) Find the proportion of the packets of chips with a weight between 5g and 55g z z Area between 5 & % of the chip packets have a weight between 5g and 55g.. (a) If the compan selling the chips in the previous question rejects the 5% of the chip packets that are too light, at what weight should the chip packets be rejected? 5% 50g s 3g proportion below mean s z z between 1.64 and Packets should be rejected if the weigh 45g or less. (b) Between which weights do 80% of the chip packets fall? 45% ( ) 80% 40% + 40% 50g s 3g ± s z 50 ± and ( ) proportion 0.4 above and below mean z ± % of the packets weigh between 46.16g and 53.84g. 3. If there are 80 packets in a vending machine, how man packets would be epected to weight more than 55g? z Area > Number > Approimatel 4 chip packets would be epected to have a weight of greater than 55g. Eercise STSD-F 1. If scores are normall distributed with a mean of 30 and a standard deviation of 5, what percentage of the scores is (i) greater than 30? (ii) between 8 and 30? (iii) greater than 37? (iv) between 8 and 34? (v) between 6 and 8? STSD 14

16 Eercise STSD-F continued. IQ scores have a mean of 100 and a standard deviation of 16. What proportion of the population would have an IQ of: (i) greater than 13? (ii) less than 91? (iii) between 80 and 10? (iv) between 80 and 91? (v) If Shane is smarter than 75% of the population, what is his IQ score? 3. The heights of bos in a school are approimatel normall distributed with a mean of 140cm and a standard deviation of 0cm. (a) Find the probabilit that a bo selected at random would have a height of less than 175cm. (b) If there are 400 bos in the school how man would be epected to be taller than 175cm? (c) If 15% of the bos have a height that is less than the girls mean height, what is the girls mean height? 4. Charlie is a sprinter who runs 00m in an average time of.4 seconds with a standard deviation of 0.6s. Charlie s times are approimatel normall distributed. (a) To win a given race a time of 1.9s is required. What is the probabilit that Charlie can win the race? (b) The race club that the sprinter is a member of has the record time for the 00m race at 0.7s. What is the likelihood that Charlie will be able to break the record? (c) A sponsor of athletics carnivals offers $100 ever time a sprinter breaks.5s. If Charlie competes in 80 races over a ear how much sponsorship mone can he epect? STSD-G Chebshev s Theorem A Russian mathematician, P.L. Chebshev, developed a theorem that approimated the proportion of data values ling within a given number of standard deviations of the mean regardless of whether the data is normall distributed or not. Chebshev s theorem states: For an data set, at least 1 k 1 of the values lie within k standard deviations either side of the mean. ( k > 1) So for an set of data at least: % % 3 9 of the values lie within standard deviations. of the values lie with in 3 standard deviations. This theorem allows the determination of the least percentage of values that must lie between certain bounds identified b standard deviations. STSD 15

17 Eamples STSD-G1 1. The heights of adult dogs in a town have a mean of 67.3cm and a standard deviation of 3.4cm. (a) What can be concluded from Chebshev s theorem about the percentage of dogs in the town that have heights between 58.8cm and 75.8cm? z z At least 84% of the adult dogs would have heights between cm and 75.8cm. 84%. (b) What would be the range of heights that would include at least 75% of the dogs? 1 75% 1 k k k k 1 k k k 4 Chebshev s theorem suggests that 75% of the heights are within ± standard deviations of the mean. μ ± zσ 67.3 ± and % of the dog s heights would range between 60.5cm and 74.1cm. Eercise STSD-G1 1. The weights of cattle have a mean of 434kg and standard deviation of 69kg. What percentage of cattle will weigh between 330.5kg and 537.5kg?. The age of pensioners residing in a retirement village has a mean of 74 ears and standard deviation of 4.5 ears. What is the age range of pensioners that contains at least 89% of the residents? 3. It was found that for a batch of softdrink bottles, the mean content was 994ml. If 75% of the bottles contained between 898ml and 1090ml, what was the standard deviation for the softdrink batch? 4. On a test the mean is 50 marks and standard deviation 11. At most, what percentage of the results will be less than 17 and greater than 83 marks? STSD-H Correlation and Scatterplots When two different data variables, quantities, are collected from the one source it is possible to determine if a relationship eists between the variables. A simple method of determining the relationship between two variables, if it eists, is b constructing a scatterplot. A scatterplot (scatter graph or scatter diagram) is a graph that is created b plotting one variable, quantit, on the horizontal ais and the other on the vertical ais. If one variable is likel to be dependent on the other, the dependent variable should be plotted on the vertical ais and the independent variable on the -ais. The scales on the vertical and horizontal aes do not need to be the same or even use the same units. Also the aes do not need to commence at zero. STSD 16

18 Eamples STSD-H1 1. The heights and weights of 10 students are recorded below. Construct a scatterplot for this data. Student Height (cm) Weight (kg) Adam Brent Charlie David Edd Fred Gar Harr Ian John Scatterplot of Weight against Height Weight (kg) Adam ( 167,5) John ( 181, 61) Height (cm). For one week the midda temperature and the number of hot drinks sold were recorded. Construct a scatterplot for this data. Sun Mon Tues Wed Thur Fri Sat Temp ( C) Number of Drinks Scatterplot of Hot Drinks Sales Against Temperature 70 Number of Drinks Sold Midda Temperature (oc) STSD 17

19 Eercise STSD-H1 Construct scatterplots for the following data sets. 1. Student A B C D E F G H I J Test 1 ( /50) Final Eam % Da Mon Tue Wed Thur Fri Sat Sun Temperature ( C) Softdrink Sales (cans) Name Ann Lee Ma Jan Tom Wes Height (cm) Shoe Size Speed (km/hr) Fuel Econom (km/l) The points in a scatterplot often tend towards approimating a line. It is possible to summarise the points of a scatterplot b drawing a line through the plot as a whole, not necessaril through the individual points. This line is called the line of best fit. The line of best fit line need not pass through an of the original data points, but is used to represent the entire scatterplot. A line of best fit can be thought of as an average for the scatterplot, in a wa similar to a mean is the average of a list of values. To sketch a line of best fit for a scatterplot: 1. calculate the mean of the independent variable values,, and the mean of the dependent variable values, ;. plot this mean point, (, ) to the scatterplot; 3. sketch a through the mean point, that has a slope the follows the general trend of the points of scatterplot. Wherever possible the number of points below the line should equal the number of points above. Outling points need not strongl influence the line of best fit, and are often not included. The line of best fit can be used to predict values for data associated with the scatterplot. STSD 18

20 Eamples STSD-H 1. The heights and weights of 10 students are recorded below. Construct a scatterplot for this data and draw a line of best fit. Student Height (cm) Weight (kg) Adam Brent Charlie David Edd Fred Gar Harr Ian John Means Σ 1697 Σ n 10 n Weight Against Height Weight (kg) ( 169.7,56.3) Height (cm). For one week the midda temperature and the number of hot drinks sold were recorded. Construct a scatterplot for this data. Sun Mon Tues Wed Thur Fri Sat means Temp ( C) Number of Drinks Drink Sales Against Temperature Drink Sales ( 9.43, 46.86) Temperature Eercise STSD-H Draw lines of best fit on scatterplots drawn in Eercise STSD-H1. STSD 19

21 Correlation is a measure of the relationship between two measures, variables, on sets of data. Correlation can be positive or negative. A positive correlation means that as one variable increases the other variable increases, eg. height of a child and age of the child. Negative correlation implies as one variable increases the other variable decrease, eg. value of a car and age of the car. If variables have no correlation there is no relationship between the variables, i.e. one measure does not affect the other. Scatterplots enable the visual determination of whether correlation eists between variables and the tpe of correlation. no correlation perfect positive correlation positive correlation perfect negative correlation negative correlation Eercise STSD-H3 1. For each of the following scatterplots determine if the correlation is perfect positive, positive, no correlation, negative or perfect negative. A B C D E F G H. Select the scatterplot from those above that would best describe the relationship between the following variables. (i) Height at 4 ears, height at 16 ears. (ii) (iii) (iv) (v) Age of a used car, price of a used car. Temperature at 6am, temperature at 3pm. Shoe size of mother, number of children in famil. Average eam result, class size. 3. For each of the scatterplots drawn in eercise STSD- H1 state the tpe of correlation between the variables. STSD 0

22 STSD-I Correlation Coefficient and Regression Equation It is possible to quantif the correlation between variables. This is done b calculating a correlation coefficient. A correlation coefficient measures the strength of the linear relationship between variables. Correlation coefficients can range from 1 to +1. A value of 1 represents a perfect negative correlation and a value of +1 represents a perfect positive correlation. If a data set has a correlation coefficient of zero there is no correlation between the variables. perfect positive correlation r 1 positive correlation r 0.7 no correlation r 0 perfect negative correlation r 1 negative correlation r 0.7 The most widel used tpe of correlation coefficient is Pearson s, r, simple linear correlation. The value of r is determined with the formula below r Σ( X X)( Y Y) ( X X) ( Y Y) Σ Σ This formula uses the sums of deviations from the means in both the X values and Y values. However for ease of calculation the following calculation formula is often used. r nσxy ΣXΣY ( nσx ( ΣX) ) nσy ( ΣY) ( ) where, n number of data points Σ X sum of the X values Σ Y sum of the Y values Σ XY sum of the product of each set of X and Y values Σ X sum of X Σ Y sum of Y ( Σ X ) the square of the sum of the X values ( Σ Y ) the square of the sum of they values STSD 1

23 To assist in calculations data can be set up in a table and the following headings used: X Y X Y XY Eample STSD-I1 Calculate the correlation coefficient for the height weight data below. Height, X Weight, Y X Y XY Σ X 1697 Σ Y 563 Σ X 8867 Σ Y 3661 Σ XY 9638 r nσxy ΣXΣY nσx ΣX nσy ΣY ( ( ) ) ( ) ( ) ( )( ) There is a positive correlation between the height and weight of students. NOTE: r, sometimes referred to as the co-efficient of determination, represents the proportion (percentage) of the relationship between the variables that can be eplained b a linear relationship. The greater the r value, the greater the linear relationship between the variables. Eample STSD-I For the previous height/weight eample r r approimatel 78% of the relationship can be eplained b a linear correlation. This is a moderatel strong correlation. Eercise STSD-I1 Determine the correlation co-efficient for the following data sets. 1. Temperature ( C) Hot drink sales (cups) Speed (km/hr) Fuel econom (km/l) Data collected of students results for sitting Test 1 and the final eam (eercise STSD H1 question 1) Determine the correlation co-efficient for this data. X : Test 1 results (out of 15 marks) Y : Final eam result (out of 50 marks) Σ X 99, Σ Y 649, Σ X 9849, Σ Y 46387, Σ XY 137, n 10 STSD

24 The relationship between two sets of data can be represented b a linear equation called a regression equation. The regression equation gives the variation of the dependent variable for a given change in the independent variable. It is etremel important to correctl determine which variable is dependent. The regression equation can be used to construct the regression line (line of best fit) on the associated scatterplot. Because the equation is for a line, the regression equation takes on the general linear equation format, m+ c. Usuall, however, for a regression equation this is written as α + β, where α is the -intercept and β the slope of the line. ΣXY nx Y α + β where β ΣX nx α Y βx The slope of the line depends on whether the correlation is positive or negative. A regression equation can be used to predict dependent variables from independent inputs within the range of the scatterplot values. It should not be used: to predict given to predict outside the bounds of given values. The stronger the correlation between the two variables the better the prediction made b the regression equation. Again setting up a table of values, with the following headings, can assist in calculations. X Y X XY Eamples STSD-I3 1. Calculate the correlation coefficient for the height/weight data below. The weight of a student should depend on the height rather than vice versa, so height is the independent,, variable and weight the dependent,, variable. Height, X Weight, Y X XY Σ X 1697 Σ Y 563 Σ X 8867 Σ XY 9638 ΣX 1697 X ΣXY nx Y n 10 β ΣY 563 ΣX nx Y n α Y βx The regression equation is OR w 1.08h 16.7 STSD 3

25 Eamples STSD-I3 continued. Use the regression equation to predict the following: (i) The weight of a student who is 160cm tall. h 160 w 1.08h kg The student should weigh approimatel 46.1kg. (ii) The weight of a student who is 175cm tall. h 175 w 1.08h kg The student should weigh approimatel 6.3kg. (iii) The weight of a student who is 185cm tall. Can not be predicted as input is outside range of recorded heights. (iv) The height of a student who weighs 65kg. Regression equation can not be used to predict height from weight. Eercise STSD-I 1. Determine the regression equation for each the following data sets. (Use sums calculated in the previous eercise and the equations to predict the requested values.) (a) Temperature ( C) Hot drink sales (cups) (i) (ii) Predict the drink sales when the temperature is 10 C. Predict the drink sales if the temperature is 5 C. (b) Speed (km/hr) Fuel econom (km/l) (i) (ii) Predict the fuel econom for a speed of 75km/hr. Predict what speed a car would be travelling if it was getting 17.5km/l.. Data collected of students results for sitting Test 1 and the final eam (eercise STSD H1 question 1) X : Test 1 results (out of 15 marks) Y : Final eam result (out of 50 marks) Σ X 99, Σ Y 649, Σ X 9849, Σ Y 46387, Σ XY 137, n 10 Find the regression equation for this data. STSD 4

26 STSD-J Summar The mean is a measure of central tendenc. Standard deviation measures spread or dispersion of a data set. The coefficient of variation, CV, gives the standard deviation as a percentage of the mean of the data set. The z-score indicates how far, the number of standard deviations, a raw score deviates from the mean of the data set. The following formulae can be used to calculate the the given statistical measures. Statistical Measure Population Formula Sample Formula Mean Mean from frequenc table μ N n f f μ f f Standard Deviation ( Σ) Σ N σ N s Σ ( Σ) n 1 n Standard Deviation from frequenc table σ Σf ( Σ ) f Σf Σf s Σf ( Σf) Σf Σf 1 Coefficient of Variation CV σ 100% CV s 100% μ z-score μ z σ z s Raw Score from z-score μ + σ z + s z The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one. The distance between the mean and a given z-score corresponds to a proportion of the total area under the curve, and hence can be related to a proportion of a population. The total area under a normal distribution curve is taken as equal to 1 or 100%. The values in the Normal Distribution Areas table give a proportion value for the area between the mean and the raw score greater than the mean, converted to a positive z-score. + z In a normal distribution approimatel: 68% of values lie within ± 1 s.d. of the mean; 95% of values lie within ± s.d. of the mean; and 99% of values lie within ± 3 s.d. of the mean. 1 Chebshev s theorem states: for an data set, at least ( 1 k ) deviations either side of the mean. ( k > 1). of the values lie within k standard STSD 5

27 A scatterplot is a graph that is created b plotting one variable, quantit, on the horizontal ais and the other on the vertical ais. To sketch a line of best fit for a scatterplot: 1. calculate the mean of the independent variable values,, and the mean of the dependent variable values, ;. plot this mean point, (, ) to the scatterplot; 3. sketch a line through the mean point, that has a slope that follows the general trend of the points of scatterplot. Correlation is a measure of the relationship between two measures, variables, on sets of data. Correlation can be positive or negative. no correlation perfect positive correlation positive correlation perfect negative correlation negative correlation A correlation coefficient measures the strength of the linear relationship between variables. The most widel used tpe of correlation coefficient is Pearson s, r, simple linear correlation. The value of r is determined with the calculation formula r nσxy ΣXΣY ( nσx ( ΣX) ) nσy ( ΣY) ( ) r, sometimes referred to as the co-efficient of determination, represents the proportion (percentage) of the relationship between the variables that can be eplained b a linear relationship. The relationship between two sets of data can be represented b a linear equation called a regression equation. ΣXY nx Y α + β where β ΣX nx α Y βx A regression equation can be used to predict dependent variables from independent inputs within the range of the scatterplot values. It should not be used: to predict, the independent variable, given, the dependent variable. to predict outside the bounds of given values. The stronger the correlation between the two variables the better the prediction made b the regression equation. STSD 6

28 STSD-K Review Eercise 1. For each of the following data sets calculate (i) the mean (ii) the standard deviation (iii) the coefficient of variation. (a) (b) (c) Store Sales for a week $55 $698 $547 $70 $645 $451 Student Mark in a 5 Mark Test Mark Frequenc Dail Rainfall in millimetres Rainfall (mm) Frequenc (das) A soft-drink filling machine uses cans with a maimum capacit of 340ml. The machine is set to output softdrink with a mean capacit of 330ml. It has been found that due to machine error the amount outputted varies with a standard deviation of 8ml and the amount outputted is normall distributed. (a) What proportion of cans will have between 330 ml and 340ml of softdrink? (b) What percentage of cans will have between 35ml and 340ml? (c) What percentage of cans will overflow? (d) If the smallest 5% of drinks must be rejected, what is the smallest amount which will be accepted? 3 (a) If a set of data has a mean of 76 and a standard deviation of 8.8, what is the interval that should contain at least 75% of the data? (b) A data set has a mean of 87 and a standard deviation of 98. At least what percentage of values should lie been 58 and 107? (c) A set of data has a mean of 468. If 89% of the data values lie between 336 and 600, what is the standard deviation for the data set? STSD 7

29 Eercise STSD-K continued 4. (a) Match each of the correlation coefficients with a scatterplot below. (i) r 0.6 (iv) r 0.9 (ii) r 0 (v) r 1 (iii) r 0.9 (b) Which scatterplot best approimates the correlation between each of the two variables below? (a scatterplot ma be used more than once) (i) das on a good diet, weight (ii) temperature outside, temperature in a non-air conditioned car (iii) hand span, height (iv) rainfall, level of water in river (v) length of finger nails, intelligence A B C D E F G 5. The average test results for a standard eamination and corresponding class size were recorded for five schools. The results are summarised in the table below. School Class Size Test Result A 8 8% B 33 50% C 5 80% D 14 98% E 0 90% Use the class size/test result data to: (a) construct a scatterplot (b) draw a line of best fit (c) determine the correlation coefficient (d) comment on the correlation (e) determine the regression equation (f) use the regression equation to predict (i) the epected result if a school had a class size of 30 students (ii) the epected result for a class of 10 students (iii) how man student there would be in a class if the test result was 75% STSD 8

30 STSD-L Appendi Normal Distribution Areas Table 0 z z STSD 9

31 STSD-Y Inde Topic Page Chebshev's Theorem... STSD 15 Coefficient of Determination... STSD Co-efficient of Variation... STSD 7 Correlation... STSD 0 Correlation Coefficient... STSD 1 Deviation... STSD 3 Line of Best Fit... STSD 18 Mean... STSD Mean population... STSD Mean sample... STSD Measure of Central Tendenc... STSD Negative Correlation... STSD 0 No Correlation... STSD 0 Normal Distribution... STSD 11 Normal Distribution table... STSD 1, 9 Population mean... STSD Population standard deviation... STSD 3 Positive Correlation... STSD 0 Raw Score... STSD 10 Regression Equation... STSD 3 Regression Line... STSD 3 Sample mean... STSD Sample standard deviation... STSD 3 Scatterplot... STSD 16 Standard Deviation... STSD 3 Standard Deviation calculation... STSD 4 Standard Deviation population... STSD 3 Standard Deviation sample... STSD 3 Standard Normal Curve... STSD 10 Standard Normal Distribution... STSD 10 Variance... STSD 5 bar... STSD z-scores... STSD 8 STSD 30

32 STSD 31

33 SOLUTIONS Statistics and Standard Deviation STSD-Z Solutions STSD -B Calculating Means... STSD 33 STSD -D Calculating Standard Deviations... STSD 34 STSD -E Coefficient of Variation... STSD 35 STSD -F Normal Distribution and z-scores STSD F1 z-scores... STSD 36 STSD F Normal Distributions... STSD 37 STSD -G Chebshev s Theorem... STSD 40 STSD -H Correlation and Scatterplots STSD H1/H Scatterplots/Lines of Best Fit... STSD 41 STSD H3 Correlation... STSD 43 STSD -I Correlation Coefficient and Regression Equation STSD I1 Correlation Coefficient... STSD 44 STSD I Regression Equation... STSD 45 STSD -K Review Eercise... STSD 46 STSD 3

34 SOLUTIONS Statistics and Standard Deviation STSD B1 Calculating Means (a) mean n The mean number of hocke goals scored is approimatel.09 goals. (b) Points Scored () Frequenc (f) f Total Σ f 10 Σ f 18 f mean f (c) The mean number of points scored is 1.8 points. Tping errors () Frequenc (f) f Total Σ f 0 Σ f 1 The mean number of tping errors is 1.05 errors. f mean f (d) Bab Weight (kg) Mid-value () Frequenc (f) f Total Σ f 15 Σ f 5.1 f mean f The mean bab weight is approimatel 3.47kg (e) Withdrawals ($) Mid-value () Frequenc (f) f Total Σ f 30 Σ f 3350 f mean f 3350 The mean withdrawal was approimatel $ STSD 33

35 SOLUTIONS Statistics and Standard Deviation STSD-D1 Calculating Standard Deviations (a) Σ 3 Σ 73 ( Σ) Σ n s n The standard deviation of the number of goals scored is approimatel 1.58 goals. (b) Points () f f f Σ f 10 Σ f 18 Σ f 1658 ( Σf) Σf Σf 10 s 1.48 Σf The standard deviation of number of basketball points is approimatel 1.48 points. (c) Errors () f f f Σ f 0 Σ f 1 Σ f 37 s Σf ( Σf) Σf Σf The standard deviation of the number of tping errors is approimatel 0.89 errors. (d) Weights f f f Σ f 15 Σ f 5.1 Σ f ( Σf) Σf Σf 15 s 0.33 Σf The standard deviation of the bab weights is approimatel 0.33kg. STSD 34

36 SOLUTIONS Statistics and Standard Deviation STSD-D1 continued (e) Withdrawals f f f Σ f 30 Σ f 3350 Σ f ( Σf) Σf Σf 30 s Σf The standard deviation of the ATM withdrawals is approimatel $ STSD-E1 Coefficient of Variation 1. (a) n ( Σ) Σ n s n s CV 100% % % (b) n ( Σ) Σ n s n s CV 100%.9 100% %. (a) Data Set A: n ( Σ) Σ n s n s CV 100% % % Data Set B: n ( Σ) Σ n s n There is greater relative variation in Data set B s CV 100% % % STSD 35

37 SOLUTIONS Statistics and Standard Deviation STSD-E1 continued. (b) Bos Heights: 141.6cm s 15.1cm Girls Heights: 143.7cm s 8.4cm There is greater relative variation in bos heights. s CV 100% % % s CV 100% % % STSD-F1 1. z-scores n 5 ( ) s n n Σ 375 Σ 8933 raw score z-score meaning z standard deviations below the mean z standard deviations above the mean z standard deviations below the mean z standard deviations below the mean z standard deviations above the mean. (i) s.d. above the mean z (ii) s.d. above the mean z z Maths 1.47 z Geograph Peter s Maths result is 1.47 standard deviations above the class mean, while his geograph was 1.30 standard deviations above the class mean. So Peter did slightl better with his Maths result compared to the rest of the class. 4. (i) z + s z is two standard deviations below the mean. (ii) z s z is 1.5 standard deviations above the mean. STSD 36

38 SOLUTIONS Statistics and Standard Deviation STSD-F 1. (i) Normal Distributions 30 50% 50% of the scores are greater than 30, the mean. (ii) z % of the scores are between 8 and 30. (iii) z % of the scores are greater than 37. Area > (iv) z8 0.4 z % of the score are between 8 and Area between 8 & (v) z6 0.8 z % of the score are between 6 and 8. Area between 6 & (i) Area > z % of the population have IQs greater than 13. STSD 37

39 SOLUTIONS Statistics and Standard Deviation STSD-F. (ii) continued % z Area < % of the population have IQs less than 91. (iii) z z Area between 80 & % of the population have IQs between 80 and 10. (iv) z z Area between 80 & % of the population have IQs between 80 and 91. (v) 75% 50% + 5% ( ) proportion 0.5 above mean + s z z between 0.67 and Shane would have an IQ of over 110. STSD 38

40 SOLUTIONS Statistics and Standard Deviation STSD-F 3. (a) continued % z Area < % of the bos have a height of less than 175cm. (b) Area > Bos > z Approimatel 16 bos have a height of greater than 175cm. (c) 15% 35% 140 proportion 0.35 ( below mean) s z z between 1.03 and The girls mean height is approimatel 119.3cm. 4. (a) (b) z Area < Charlie has approimatel 0% probabilit of winning the race (c) z Area < Charlie has approimatel 0.3% probabilit of breaking the record % Area < z Charlie epect $4540 in prize mone Prize $100 $4540 STSD 39

41 SOLUTIONS Statistics and Standard Deviation STSD-G1 1.. z Chebshev s Theorem z % 1 89% 1 k k k 1 k k ± 3 At least 56% of the cattle would have weights between kg and kg. z s z z 3 s z At least 89% of the pensioners would have ages between 60.5 ears and 87.5 ears % 1 k k k 1 k k ± z + s z z s z s OR s s s s 48 s 48 The standard deviation is 48ml z17 3 z At least 89% of the test result would be between 17 and 83, therefore 100% 89% 11% of the results will be less than and greater than 83 marks 89% STSD 40

42 SOLUTIONS Statistics and Standard Deviation STSD-H1/H 1. Scatterplots/Lines of Best Fit Student A B C D E F G H I J Test 1 ( /50) Final Eam % mean test1 9.9 mean test Scatterplot of Final Eam against Test 1 Final Eam Result (%) ( 9.9, 64.9) Test 1 Results (/50). Da Mon Tue Wed Thur Fri Sat Sun Temperature ( C) Softdrink Sales (cans) mean temp mean sales Scatterplot of Drink Sales against Temperature Softdrink Sales (cans) ( 30.43,7.14) Temperature (oc) STSD 41

43 SOLUTIONS Statistics and Standard Deviation STSD-H1/H continued 3. Name Ann Lee Ma Jan Tom Wes Height (cm) Shoe Size mean height mean shoesize Shoe Size against Height Shoe Size 9 8 ( ,9.5) Height (cm) 4. Speed (km/hr) Fuel Econom (km/l) mean speed mean econom Fuel Econom against Speed 0 18 Fuel Econom (km/l) ( 83.75,16.09) Speed (km/hr) STSD 4

44 SOLUTIONS Statistics and Standard Deviation STSD-H3 Correlation 1. A perfect positive B- negative C - positive D - positive E - negative F perfect negative G no correlation H perfect positive. (i) Height at 4ears, height at 16 ears. C (ii) Age of a used car, price of a used car. E (iii) Temperature at 6am, temperature at 3pm. C (iv) Shoe size of mother, number of children in famil. G (v) Average eam result, class size. E positive correlation the better a student did in Test 1 the better the did for the final eamination 3. positive correlation the taller a person the larger their shoe size. Shoe Size against Height Scatterplot of Final Eam against Test Final Eam (%) Shoe Size Test 1 Results (/50) Height (cm). positive correlation the hotter the temperature the more softdrinks sold. Scatterplot of Drink Sales against Temperature 4. negative correlation the faster a car the less economical it is. Fuel Econom against Speed Softdrink Sales (cans) Fuel Econom (km/l) Temperature (oc) Speed (km/hr) STSD 43

45 SOLUTIONS Statistics and Standard Deviation STSD-I1 Correlation Coefficient 1. Temperature, X Hot Drink Sales Y X Y XY Σ X 66 Σ Y 38 Σ X 69 Σ Y 1666 Σ XY 845 r ( )( ) There is a strong negative correlation, , between the sales of hot drinks and the temperature. r Speed, X Econom, Y X Y XY Σ X 670 Σ Y 18.7 Σ X 6350 Σ Y Σ XY r ( )( ) There is a strong negative correlation, 0.99, between the speed and the fuel econom. r r ( )( ) There is a positive correlation, 0.93, between the result in test 1 and the result in final eam. r STSD 44

46 SOLUTIONS Statistics and Standard Deviation STSD-I Regression Equation 1. (a) Temperature, X Hot Drink Sales, Y X XY Σ X 66 Σ Y 38 Σ X 69 Σ XY 845 X Y ΣX n ΣY n ΣXY nx Y α Y βx β ΣX nx ( 3.57) The regression equation is (i) It would be epected that 45 hot drinks would be sold at 10 C. (ii) Can not be predicted as input is outside range of recorded temperatures. (b) Speed, X Econom, Y X XY Σ X 670 Σ Y 18.7 Σ X 6350 Σ XY X Y ΣX n ΣY n ΣXY nx Y β ΣX nx The regression equation is α Y βx ( 0.093) (i) It would be epected that at 75km/hr the econom would be appro. 16.9km/l. (ii) Regression equation can not be used to predict speed from econom.. X ΣX n Y ΣY n ΣXY nx Y β ΣX nx ( ) α Y βx The regression equation is STSD 45

47 SOLUTIONS Statistics and Standard Deviation STSD-K Review Eercise 1. (a) Store Sales for a week (b) Sales () Σ 3613 Σ 8943 CV n $60.17 s 100% % % ( Σ) Σ n s n (i) the mean sales is approimatel $60.17 (ii) the standard deviation of the sales is approimatel $103.6 (iii) the coefficient of variation is approimatel 17.15% Student Mark in a 5 Mark Test Mark ( ) Frequenc, ( ) f f Σ f 30 Σ f 97 Σ f 361 f (c) f f s Σf ( Σf) Σf Σf CV s 100% % % (i) the mean mark is approimatel 3.3 (ii) the standard deviation of the marks is approimatel 1.8 (iii) the coefficient of variation is approimatel 39.6% Dail Rainfall in millimetres Rainfall f f Σ f 1 Σ f 47 Σ f 3779 f f ( Σf) Σf s f Σf s CV 100% 47 Σf % % 6.61 (i) the mean rainfall was approimatel 11.76mm (ii) the standard deviation of the rainfall was approimatel 6.61mm (iii) the coefficient of variation is approimatel 56.% STSD 46

48 SOLUTIONS Statistics and Standard Deviation STSD-K. (a) continued z A proportion of of the cans between 330 ml and 340ml of softdrink. (b) (c) z ( between 0.6 & 0.63) z % of the cans between 35 ml and 340ml of softdrink z Area > % of the cans will overflow. Area between 35 ml & 340ml (d) 5% % ( ) proportion 0.45 below mean z s z between 1.64 and The smallest amount of softdrink that would be accepted would be ml. 3. (a) (b) 1 75% 1 k k k 1 k 4 k ± 0.5 z z z + s z z s z At least 75% of values would lie between 18.4 and % At least 84% of the values should lie been 58 and 107. STSD 47

49 SOLUTIONS Statistics and Standard Deviation STSD-K continued 3. (c) 1 z + 3 z 3 89% 1 k s z s z s 3 OR s k s 3s s 44 s 44 k k 9 k ± The standard deviation is (a) (i) r 0.6 C (iv) r 0.9 D (ii) r 0 G (v) r 1 F (iii) r 0.9 E (b) (i) Das on a good diet, weight B (ii) temperature outside, temperature in a non-airconditioned car D (iii) hand span, height C (iv) rainfall, level of water in river C (v) length of finger nails, intelligence G 5. School Class Size, X Test Result, Y X Y XY A B C D E Σ X 10 Σ Y 400 Σ X 3094 Σ Y 3338 Σ XY 9118 (a)/(b) X Y 80 5 Scatterplot of Test Result against Class size Test result (%) ( 4,80) Class Size STSD 48

50 SOLUTIONS Statistics and Standard Deviation STSD-K 5. (c) (d) (e) r continued ( )( ) There is a negative correlation between the class size and test result. The smaller the class the better the test result. r ΣXY nx Y α Y βx β ΣX nx (.5) The regression equation is (f) (i) (ii) (iii) It would be epected that the test result would be 66.5% when the class size was 30 students. Can not be predicted as input is outside range of recorded class sizes. Regression equation can not be used to predict class size from test result. STSD 49