An approach to Descriptive Statistics through real situations


MaMaEuSch
Management Mathematics for European Schools

Paula Lagares Barreiro (1), Federico Perea Rojas-Marcos (1), Justo Puerto Albandoz (1)

MaMaEuSch (2): Management Mathematics for European Schools, CP DE - COMENIUS - C21

(1) University of Seville
(2) This project has been carried out with the partial support of the European Community in the framework of the Sokrates programme. The content does not necessarily reflect the position of the European Community, nor does it involve any responsibility on the part of the European Community.

Contents

1 One-dimensional Descriptive Statistics
   Objectives
   The example: an opinion poll
   Population and samples
   Types of statistical variables: quantitative (discrete and continuous) and qualitative
   Frequency tables: absolute, relative and percentage frequencies
   Graphical methods
      Bar graph
      Histogram
      Frequency polygon
      Pie Chart
      Pictogram
      Stem and leaf plot
      Some remarks
   Measures of central tendency: mean, median, mode, quantiles
   Measures of variability: Range, variance, standard deviation
   Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores
      Tchebicheff's theorem
      Pearson's coefficient of variation
      Z-scores
   Analysis of the opinion poll
   Conclusions

Two-dimensional Descriptive Statistics
   Objectives
   The example: an opinion poll
   Introduction and simple tables
   Frequency tables, marginal distributions and conditional distributions
   Scatterplots
   Functional dependence and statistical dependence
   Covariance
   Linear correlation
   3.9 Regression lines

Chapter 1
One-dimensional Descriptive Statistics

We are going to study an opinion poll. You will fill in a questionnaire, so that we can see what you think about a lot of topics, and we will study some characteristics such as height, number of brothers/sisters, etc. We will check whether your opinions coincide with those of the rest of your friends, and also whether there are many people in your classroom with characteristics similar to yours. For instance, how many of your classmates are taller than you? And how many of them have the same number of brothers/sisters as you? Before continuing, we set out the main objectives that we want to achieve in this chapter.

1.1 Objectives

- To distinguish the different types of statistics.
- To determine which type of statistical process we shall use, depending on the type of data that we are studying.
- To get to know the concepts of central tendency and variability of a set of data.
- To determine the parameters of a statistical distribution.
- To study the coefficient of variation.
- To motivate through information given in examples and exercises about social, ecological and economic topics, etc.

1.2 The example: an opinion poll

From now on, we will work with an opinion poll. We want to know some things about the students in the same class as you. We will ask you for some personal data, and then you will give us some information and opinions about many topics, such as sports, food, etc. Our poll will be anonymous, so that each one of you can feel free to answer without worrying about who reads those opinions later. With these data, we will pose some interesting questions about ourselves as a group, which we can maybe use as a guide to answer other questions about a wider group of people. For instance:

- Which is the most frequent height in your class?
- Can you consider your weekly pay normal compared with those of your classmates?
- How many of you practice sports often?
- How many have breakfast before coming to the high school?
- What kind of food do you have more often: fruit, milk, coffee, fish...?

We will see that, by analyzing the answers we get in the poll, you will be able to answer all the questions we have posed. By the end of this chapter we will have all the answers. But first of all, we are going to present the concepts that you will need.

1.3 Population and samples

Before answering all those questions, we have to clarify some things. Who do we want to get information about? We have already said that we want to know things about the students of your level, so our population will not be only the students of this class, but all the students of your level. But it would take too long to ask all those students, so we have decided to take a representative group of all the classrooms of your level, which is your class in this case. So you are the sample. Furthermore, each member of the population is called a data point. Let us make some comments about what we have just said.

First of all, we may want to study some characteristic of animals, plants or things, for instance the battery life of mobile phones; in this case the population is not made up of people but of the different types of mobile phones. Moreover, there are situations in which the use of sampling is even more justified than in our case, for different reasons:

- If we want to know how all Spanish people vote, we cannot ask all the inhabitants older than 18, because they are millions of people and that means a lot of time and money.
- To study, for example, the average life of light bulbs, we cannot test all of them, because each test means that a bulb burns out. This is an example of those situations in which sampling means destroying a data point.

Therefore, sampling is justified in many situations for reasons of time, money or destruction of the data points.

Exercise. The University studies demand poll in Andalusia was carried out in 2001 to find out what the high school students wanted to study and why. To that end, data from 8500 students from all over Andalusia were collected. Could you say which are the sample and the population in this example? What are the reasons for choosing a sample in this example?

1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative

In order to answer many of our questions in the right way, the first thing we have to do is decide which kind of method we want to apply to our data. Notice that not all the data we can collect are of the same kind. For instance, we can think about the answers to three questions of our poll:

1. The answer to the question sex (male or female).
2. The answer to the question number of brothers/sisters.
3. The answer to the question height.

The first thing we notice is that the answer to the first question is not numerical, whereas the answers to questions two and three are numerical. The characteristic corresponding to the answer to the first question is called qualitative, whereas the ones related to the answers to questions two and three are called quantitative. It is easy to see that quantitative variables allow operations that we cannot do with qualitative characteristics. We call the different possibilities of a qualitative variable categories, and those of a quantitative variable values.

Let us see now what the differences are between variables 2 and 3, because this is a little more complicated. The variable number of brothers/sisters takes numerical values that we can call isolated, 0, 1, 2, 3, ..., but it cannot take any value between two of those; for instance, it cannot have the value 3.5. This does not happen with the variable height. In fact, height can have any value between certain limits: we can measure height as precisely as we want. We can say that height can take any value in an interval. So the variable in case 2 is called discrete and the variable in case 3 is called continuous.

Exercise. Decide whether these variables are qualitative or quantitative and, if they are quantitative, whether they are discrete or continuous:

1. Number of babies born in a day.
2. Blood group of a person.
3. Time needed to solve a problem.
4. Number of questions in an exam.
5. Temperature of a person.
6. Political party voted for in the last elections.
7. Number of goals scored by a player in a season.

1.5 Frequency tables: absolute, relative and percentage frequencies

It is now time to start processing the data we have collected with our poll. The data that we have about the number of brothers/sisters are:

[data not preserved]

Meanwhile, for the weights we have:

[data not preserved]

We can pose a lot of questions: how many of my classmates have the same number of brothers/sisters as I have? How many of them have more than me? And fewer than me? How many of my classmates weigh more than me? And less than me? To answer these questions, we have to count how many times each answer appears. Let us start by counting the answers about the number of brothers/sisters. This is what we have:

N. bro/sis   count
0            6
1            13
2            7
3            3
4            1

So we now know that there are 13 people who have 1 brother/sister. This number is called the absolute frequency and we denote it by $n_i$. And how many people have at most 1 brother/sister? In our case, these are the people who have 0 or 1 brother/sister, that is, $6 + 13 = 19$. This number is called the cumulative absolute frequency and we denote it by $N_i$. We can now write the table of absolute and cumulative absolute frequencies:

N. bro/sis   absolute fr.   cum. absolute fr.
0            6              6
1            13             6 + 13 = 19
2            7              19 + 7 = 26
3            3              26 + 3 = 29
4            1              29 + 1 = 30

It is important to put the values of the characteristic in order from the smallest to the biggest if we want to calculate the cumulative frequencies in the right way.

We are now going to define other kinds of frequencies, because it is interesting to know which proportion of the total a particular value represents: that is the way we can compare it with other populations. For instance, in our case there are 6 students who have 0 brothers/sisters, but we have also asked a group of 50 people, and we know that 9 of them have 0 brothers/sisters. In which of the two groups is there a bigger proportion of people with no brothers/sisters? It is easy to see that the proportions are

$$\frac{6}{30} = 0.2 \quad \text{and} \quad \frac{9}{50} = 0.18$$

So the proportion is bigger in our group of 30 people. This proportion is called the relative frequency and we denote it by $f_i$. If we express it as a percentage (multiplying by 100) we get the percentage frequency, which in our case is 20% and 18% respectively. We denote these frequencies by $p_i$. We now add all these frequencies to our table and we get:

Bro/sis   absolute fr.   relative fr.     percentage fr.   cum. abs. fr.   cum. rel. fr.
0         6              6/30 = 0.2       20%              6               0.2
1         13             13/30 = 0.4333   43.33%           19              19/30 = 0.6333
2         7              7/30 = 0.2333    23.33%           26              26/30 = 0.8667
3         3              3/30 = 0.1       10%              29              29/30 = 0.9667
4         1              1/30 = 0.0333    3.33%            30              1

Let us now analyze the weight data. We count the different values:

[tally not preserved]

As you can see, most of the values have frequency 1 and our variable takes 25 different values. Those are too many different values to represent in a table (even more so if we only have 30 data). How can we get a table that better represents the distribution of the data? It seems logical to group similar data in intervals. There is a complete theory about how to group data in the right way. These are the main points we want to stress:

- The number of classes should be neither too high (around 6 to 8 is the maximum number we usually work with) nor too low (it makes no sense to group in 2 or 3 classes, because we lose a lot of information).
- Except perhaps for the extreme classes, all the intervals should have the same width, because otherwise the information can be misinterpreted.

Can you imagine which intervals we are looking for? You can start by thinking about the number of classes you want to have. Let us note that between the highest value (82) and the lowest value (46) there is a difference of 36 kg. If, for instance, we want to group in 6 classes, the width of each interval should be $36 / 6 = 6$. So we obtain the following intervals:

[46,52], (52,58], (58,64], (64,70], (70,76], (76,82].

Now we have a possible classification though, of course, there are many more. In some analyses you may find that the first interval is of the kind "smaller than 52" and the last one "greater than 76". For the purposes of calculation, this kind of interval is considered to have the same size as the others. Once we have decided how to group the data, we can calculate the frequencies:

Weight     absolute fr.   relative fr.   percentage fr.   cum. abs. fr.   cum. rel. fr.
[46,52]    [not preserved]
(52,58]    [not preserved]
(58,64]    [not preserved]
(64,70]    [not preserved]
(70,76]    [not preserved]
(76,82]    [not preserved]                                30              1

Moreover, when we work with grouped data we need to choose a representative of each interval. We call it the class mark, and it is the midpoint of the interval (the lower limit of the interval plus the upper limit, divided by 2).

Exercise. Make the frequency table for the answers to question 1.3 and for the answers to the question height, deciding beforehand whether it is necessary to group the data in intervals or not.

1.6 Graphical methods

Once we have the frequency tables, imagine that your teacher asks you to present the conclusions you have obtained to the rest of the students. You can present your frequency tables and talk about the main conclusions, but is there any way of presenting the data so that the main conclusions can be seen more simply? As you can suppose, the answer to this question is yes.
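Since every graph in this section is drawn from a frequency table, it may help to see how the table for the number of brothers/sisters could be computed. The sketch below is only an illustration: the function name `frequency_table` is ours, and the counts for 2, 3 and 4 brothers/sisters (7, 3 and 1) are an assumption inferred from the totals and the mean quoted in the text, since the individual answers are not listed.

```python
from fractions import Fraction


def frequency_table(counts):
    """Build one row per value from a {value: absolute frequency} mapping.

    Values are processed from smallest to biggest so that the
    cumulative columns make sense.
    """
    n = sum(counts.values())
    rows = []
    cumulative = 0
    for value in sorted(counts):
        n_i = counts[value]
        cumulative += n_i
        f_i = Fraction(n_i, n)  # relative frequency as an exact fraction
        rows.append({
            "value": value,
            "abs": n_i,                  # absolute frequency n_i
            "rel": float(f_i),           # relative frequency f_i
            "pct": float(100 * f_i),     # percentage frequency p_i
            "cum_abs": cumulative,       # cumulative absolute frequency N_i
            "cum_rel": cumulative / n,   # cumulative relative frequency
        })
    return rows


# 6 and 13 are stated in the text; 7, 3 and 1 are inferred (assumption).
siblings = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}
table = frequency_table(siblings)

for row in table:
    print(f"{row['value']:>5} {row['abs']:>4} {row['rel']:>7.4f} "
          f"{row['pct']:>7.2f}% {row['cum_abs']:>4} {row['cum_rel']:>7.4f}")
```

Working with `Fraction` for the relative frequency is just a design choice that avoids rounding surprises before the final conversion to a decimal.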
Maybe you have seen in books, or mainly in the media, that data are usually presented graphically, which makes them more attractive to people and also easier to analyze. In this section we want to show the different types of graphs, and we are going to stress how important it is to choose the right type of graph for the data we are working with. Now that we have the frequency tables for the variables weight and number of brothers/sisters, we are going to use them to introduce the different graphs.

1.6.1 Bar graph

The first kind of graph we are going to study is the bar graph. This graph is used for qualitative variables and for discrete variables not grouped in intervals. We already know that our data about the number of brothers/sisters form a discrete variable, so let us see how to build a bar graph using those data. On the OX axis we place the categories if we have a qualitative variable, or the values if we have a discrete variable; in our example, those values are 0, 1, 2, 3 and 4. Over each one of these values we place a rectangle or bar, all with the same base and with a height proportional to the corresponding frequency. In our case, we get a graph like this:

Figure 1.1: brothers/sisters (vertical bars)

Sometimes this graph is also presented with horizontal bars, like this:

Figure 1.2: brothers/sisters (horizontal bars)

1.6.2 Histogram

A histogram is a graph very similar to the bar graph, but it is used for variables grouped in intervals. We are going to build a histogram for the variable weight. As before, we represent the intervals on the OX axis, and over each of them we place a rectangle whose base has the same width as the interval and whose height is such that the area of the rectangle is proportional to the frequency of the interval. In this kind of graph the areas of the rectangles are very important, because a bar no longer corresponds to a single point: its width represents our interval. So, if our intervals have the same width, the height should be the frequency; if not, we have to modify the height in order to keep the area proportional to the frequency. Our histogram for the variable weight, which we have already grouped, is:

Figure 1.3: weight (histogram)

We can also represent it with horizontal rectangles:

Figure 1.4: weight (histogram)

Surely you have seen a population pyramid in the media at some time. Notice that a population pyramid is in fact two horizontal histograms (one for women and another for men) in which we represent the number of inhabitants grouped by age.

1.6.3 Frequency polygon

The next type of graph that we are going to define is the frequency polygon. This graph is used when we have quantitative variables, discrete or continuous. In order to draw it, we start from the histogram or from the bar graph, depending on whether the variable is grouped or not.

We have to join with a line the midpoints of the top sides of the bars in the bar graph or the histogram. In our two examples, for the number of brothers/sisters we get the following graph:

Figure 1.5: brothers/sisters (frequency polygon)

The case of the weight is a little different. In this situation, the area under the line represents the data we have, as in the histogram, because we are talking about the whole width of the interval. The graph looks like this:

Figure 1.6: weight (frequency polygon)

All the graphs that we have seen so far can also be drawn for relative frequencies and for cumulative frequencies.

1.6.4 Pie Chart

The next type of graph that we are going to present is a well-known one, the pie chart. In a pie chart, we assign to each category or value a sector of a circle in such a way that its area is proportional to the frequency. This graph is usually used for qualitative variables and for discrete variables that are not grouped.
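Because all the sectors share the same radius, making the area proportional to the frequency is the same as making the central angle proportional to it. A minimal sketch of that calculation follows; the helper name `sector_angles` is ours, and the counts for 2, 3 and 4 brothers/sisters are an assumption inferred from the totals and the mean quoted in the text.

```python
def sector_angles(counts):
    """Return {value: central angle in degrees}, proportional to frequency."""
    n = sum(counts.values())
    return {value: 360 * n_i / n for value, n_i in counts.items()}


# 6 and 13 are stated in the text; 7, 3 and 1 are inferred (assumption).
siblings = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}
angles = sector_angles(siblings)

for value, angle in angles.items():
    print(f"{value} brothers/sisters: {angle:.0f} degrees")
```

The angles always add up to 360 degrees, which is a quick check that no category has been left out.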

Figure 1.7: brothers/sisters (pie chart)

1.6.5 Pictogram

Pictograms are a kind of graph that is very frequent in the media. They are graphs in which a picture related to the variable is used to represent the frequencies. But we have to stress something again: the size of the picture (and not only its height) has to be proportional to the frequency that we want to represent. It is usual to write the frequency beside the picture as well, to avoid mistakes.

1.6.6 Stem and leaf plot

There is a representation that lies between a graph and a data count: the stem and leaf plot. We are going to see how to make it through the example of the weight. We recall that the data we had are:

[data not preserved]

In a stem and leaf plot, the first thing we have to do is write in a column the different tens figures that we can find in the data. In our example, as our values range between 46 and 82, we have to write 4, 5, 6, 7 and 8 in the following way:

4
5
6
7
8

Next, we take the first observation, 52, and we place its units figure beside the corresponding tens figure, that is:

4
5 2
6
7
8

We keep placing the units figures beside the tens figures for the rest of the data. What we get is something like this:

[display not preserved]

You can notice that we have something similar (but not equal) to a bar graph or a histogram. Obviously we could have made it vertically, and we would have something like this:

[display not preserved]

That looks like a histogram or a bar graph, though it is not one. But the stem and leaf plot can be taken as an approximation to the distribution of the data. In fact, we have only divided the data in tens (from 40 to 49, from 50 to 59, ...), but we could divide them in groups of 5 (from 40 to 44, from 45 to 49, from 50 to 54, ...) just by writing each tens figure twice: beside the first one we place the units figures between 0 and 4, and beside the second one the units figures between 5 and 9. In our example, for the horizontal case, we would have:

[display not preserved]

1.6.7 Some remarks

Imagine that you see the two following graphs showing the profits of two companies. Which one would you choose to be your company?

Figure 1.8: benefits (company 1 and company 2)

Most of you would probably choose company 2, because it surely looks better than company 1, but in fact the data in the two graphs are the same. We have only changed the scale of the OY axis. We will make some remarks before starting the next section. Graphs are a very useful tool and they make it easier to draw conclusions from our data, but they have to be drawn in the right way in order to avoid mistakes. It is very important to keep the proportions among the pictures we draw and to make sure that the axis scales are kept proportional too, because small changes in the scales make big differences in appearance and the graph can be misunderstood.

1.7 Measures of central tendency: mean, median, mode, quantiles

Let us suppose now that we are planning a trip with the whole class and we want to raise some money, so we have decided to sell t-shirts, but we do not know what the appropriate price is. The only thing we know is that we pay 4 euros for each of them. We would like to make a profit, but we cannot set a high price because we want everybody to buy our t-shirts. We think that the weekly pay is a good reference to know what the students can afford. So we are going to use the weekly pay data that we have:

[data not preserved]

We have 30 values, but we need only one value to represent them all. Which value can we choose? A first solution might be to choose an intermediate value among all the data we have. To get it, we add up all the numbers and divide the sum by the total number of data, so we have:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_{30}}{30} = \frac{390}{30} = 13$$

Now we have the first possible price, 13 euros. The number we have just calculated is called the mean. But there are more possibilities; for instance, we can choose the most frequent value to represent our data. In our example, the most frequent value is 9, which can also be a good choice for a price. We call the most frequent value the mode.

But neither of those two numbers says anything about the number of people who can afford the t-shirt. So we have another idea. Let us sort the data we have:

[sorted data not preserved]

Now we want to find the value that leaves half of the data on each side. The values in positions 15 and 16 leave 14 values on each side; as both of them are equal to 10, we can consider that 10 is the value that leaves half of the data on each side. This number is called the median. Just as we have proposed a value that leaves 50% of the data on each side, we can look for a price that 75% of the class can afford; that is, we want to find the value that leaves 25% of the data on its left (this means that only 25% of the data are lower than that value), or any other percentage. These numbers are called quantiles.

We can now choose any of those three values, depending on what we intend in each case or on which value best represents the whole data set. Those three values are not always valid in every case, but they can help us to see where the center of the distribution is. These are the main measures of central tendency.

We are now going to define formally the concepts that we have presented. From now on we speak about variables. Let us suppose that we have observed a variable in $n$ data points and we got $k$ different values, $x_1, x_2, \dots, x_k$, with frequencies $n_1, n_2, \dots, n_k$, where $n_i$ is the absolute frequency of the value $x_i$. We denote by $N_i = \sum_{j \le i} n_j$ the cumulative absolute frequency of the value $x_i$ and by $f_i = \frac{n_i}{n}$ its relative frequency.

If the values of the variable are grouped, we suppose that we have $h$ intervals, which we denote by $(L_0, L_1], (L_1, L_2], \dots, (L_{h-1}, L_h]$, whose class marks are $c_1, c_2, \dots, c_h$. In this case, the absolute frequencies are denoted by $n_1, n_2, \dots, n_h$, the cumulative absolute frequencies by $N_1, N_2, \dots, N_h = n$ and the relative frequencies by $f_1, f_2, \dots, f_h$.

Then, for variables that are not grouped, the mean is defined as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}$$

If we have a grouped variable, we use the class marks $c_i$ instead of the values $x_i$. The main characteristics of the mean are the following:

- It is the center of gravity of the distribution and it is unique.
- When we have extreme or scarcely representative values (too big or too small), the mean may not be representative.

- It makes no sense to calculate the mean for a qualitative variable, or for grouped data if any of the intervals is not bounded.
- For grouped data, we use the class mark of each interval to calculate the mean.

Moreover, the mean has the following properties:

- If a constant is added to each value, the mean increases by that constant.
- If we multiply all the values by a constant, the mean is also multiplied by the same constant.

The mode is usually defined as the most frequent value. For a variable that is not grouped, it is the value that appears the most times. For a variable grouped in intervals of the same width, we look for the interval with the highest frequency (the modal class or modal interval) and we approximate the mode through the formula

$$Mo = L_{i-1} + \frac{n_i - n_{i-1}}{(n_i - n_{i-1}) + (n_i - n_{i+1})} \, c_i$$

where:

- $L_{i-1}$ is the lower limit of the modal interval.
- $n_i$ is the absolute frequency of the modal interval.
- $n_{i-1}$ is the absolute frequency of the interval before the modal interval.
- $n_{i+1}$ is the absolute frequency of the interval after the modal interval.
- $c_i$ is the width of the interval.

The mode has the following characteristics:

- A distribution can have more than one mode. In that case, we say that we have a bimodal, trimodal, ... distribution, depending on the number of values with the highest absolute frequency.
- The mode is usually a worse representative than the mean, except in the case of qualitative data.
- If we have intervals with different widths, we have to look for the interval with the highest frequency density (the result of dividing the absolute frequency by the width of the interval, $\frac{n_i}{c_i}$) and then we use the preceding formula.

For a variable that is not grouped, and once we have sorted our data, the median is the central value if there is an odd number of observations, and the mean of the two central values if there is an even number of data.
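For a variable that is not grouped, the three definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the data are given as a `{value: absolute frequency}` mapping, and the counts for 2, 3 and 4 brothers/sisters are inferred from the totals and the mean quoted in the text rather than copied from a data listing.

```python
def mean(counts):
    """Mean of a non-grouped variable: sum of x_i * n_i divided by n."""
    n = sum(counts.values())
    return sum(x * n_i for x, n_i in counts.items()) / n


def modes(counts):
    """All values with the highest absolute frequency (there may be several)."""
    top = max(counts.values())
    return sorted(x for x, n_i in counts.items() if n_i == top)


def median(counts):
    """Central value (odd n) or mean of the two central values (even n)."""
    # Expand the frequency table into the sorted list of observations.
    data = sorted(x for x, n_i in counts.items() for _ in range(n_i))
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


# 6 and 13 are stated in the text; 7, 3 and 1 are inferred (assumption).
siblings = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}

print("mean  :", round(mean(siblings), 4))
print("mode  :", modes(siblings))
print("median:", median(siblings))
```

Returning a list from `modes` is a deliberate choice, so that bimodal or trimodal distributions are reported in full instead of picking one value arbitrarily.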
If we have a grouped variable, we have to look for the central interval (the one that contains the central value), that is to say, the first one in which $N_i$ is bigger than $\frac{n}{2}$, and then we can apply the formula

$$Me = L_{i-1} + \frac{\frac{n}{2} - N_{i-1}}{n_i} \, c_i$$

where:

- $L_{i-1}$ is the lower limit of the central interval.
- $n_i$ is the absolute frequency of the central interval.
- $N_{i-1}$ is the cumulative absolute frequency of the interval before the central interval.
- $n$ is the number of data.
- $c_i$ is the width of the interval.

Moreover, the quantiles are position measures that generalize the concept of median. We are now going to define the centiles or percentiles, the quartiles and the deciles. We suppose that we have sorted our data.

The centiles or percentiles are the values of the variable that leave a given percentage of the data on their left side. We denote them by $P_h$ or $C_h$, where $h$ is the percentage, $h = 1, 2, \dots, 99$. If we have a grouped variable, once we have found the interval that contains the centile, we apply the formula

$$P_h = C_h = L_{i-1} + \frac{\frac{h \, n}{100} - N_{i-1}}{n_i} \, c_i$$

where the different elements have the same meaning as in the case of the median.

The quartiles are the values that, once we have sorted the data, divide them into 4 equal groups. Between each two consecutive quartiles there is 25% of the data points. We denote them by $Q_1$, $Q_2$ and $Q_3$, and they verify that $Q_1 = C_{25}$, $Q_2 = C_{50} = Me$, $Q_3 = C_{75}$.

The deciles are the values that, once we have sorted our data, divide them into 10 equal groups, in such a way that between any two consecutive deciles there is 10% of the data points. We denote them by $D_1, D_2, D_3, \dots, D_9$. They verify that $D_1 = C_{10}$, $D_2 = C_{20}$, $D_3 = C_{30}$, ..., $D_9 = C_{90}$.

Exercise. For the data on number of brothers/sisters and on weight, calculate the mean, the mode, the median and the quantiles $Q_1$, $Q_3$, $C_{30}$, $C_{74}$, $D_4$ and a further decile.

1.8 Measures of variability: Range, variance, standard deviation

Imagine that we have 3 different data sets about the weights of certain people, and we know that in the 3 cases the mean of the variable weight is 55. Does this mean that the 3 sets are equal or similar? We get the data and we find that the observations are:

Set 1: [data not preserved]
Set 2: [data not preserved]
Set 3: [data not preserved]

We can see that, though they have the same mean, the data sets are very different. Look at their stem and leaf plots:

[displays not preserved]

Then, how can we detect those differences among the data sets? It seems that the measures of central tendency do not give us enough information in every situation, so we have to look for other measures that can tell us how far the data are from the mean. This means that we need to use the concept of variability of the data.

The first thing we notice is that in the first case all the data are equal, in the second one there is a little more difference between the biggest and the smallest values, and in the third case this is even more obvious. If we calculate the difference between the biggest and the smallest value of each set, we get three numbers, the last of which is 32 for the third set (the individual calculations are not preserved). These numbers are called the range of the data. Nevertheless, though it is a very easy measure to calculate, it is not used very much, because a single very small or very big value in our data changes the range a lot, so it is not a useful measure for every situation.

How can we find a number that gives us an approximation to the distance between the data and the mean? We can calculate the distance from every data point to the mean (in absolute value) and then calculate the mean of those distances. This is what we call the mean deviation. Let us calculate the mean deviation for the second group of data. The sum of the 7 distances is 26, so we have:

$$\frac{26}{7} \approx 3.71$$

Nevertheless, we usually use a different measure of variability, which is the mean of the squared deviations of the data from the mean; in this way the biggest deviations have a greater influence. But let us present the formal definitions of all these concepts.

The range is the difference between the biggest and the smallest value of the variable, if it is not grouped. If we have a grouped variable, we calculate the difference between the upper limit of the last interval and the lower limit of the first interval. The range depends only on the biggest and the smallest elements, and not on the rest of the data.
For instance, we could have the following two data sets with the same range:

Figure 1.9: range

It is easy to see that the difference between $x_k$ and $x_1$ is the same in both situations, but the two sets are very different.

The interquartile range is the difference between the third and the first quartiles, and it gives us a zone where we can find 50% of the distribution.

The mean deviation is the mean of the deviations of the data from the mean. We call deviation from the mean the absolute value of the difference between a value of the variable and the mean, $|x_i - \bar{x}|$, so the definition of the mean deviation is

$$DM = \frac{\sum_{i=1}^{k} |x_i - \bar{x}| \, n_i}{n}$$

This measure is not used very often, because the absolute value function makes it difficult to work with. Anyway, a small mean deviation means that the data are highly concentrated around the mean. We can also define the median deviation, though it is even less usual. The definition is:

$$D = \frac{\sum_{i=1}^{k} |x_i - Me| \, n_i}{n}$$

The variance is the mean of the squared deviations of the data from the mean. We denote it by $S^2$ and its expression is

$$S^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, n_i}{n} = \frac{\sum_{i=1}^{k} x_i^2 \, n_i}{n} - \bar{x}^2$$

The variance verifies that:

- As we are taking the squares of the deviations, the bigger ones have more influence on the result.
- The units of measure of $S^2$ are not the same as those of the sample, because we have the squares of the deviations.
- The variance is never negative. It is 0 when all the values coincide with the mean.

We define the quasivariance as

$$s^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, n_i}{n - 1}$$

Its relation to the variance is $S^2 = \frac{n-1}{n} s^2$. This is a very useful measure when we work with inference. Sometimes it is also denoted by $S_c^2$.

The standard deviation is the positive square root of the variance. We denote it by $S$ and its expression is

$$S = +\sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, n_i}{n}} = +\sqrt{\frac{\sum_{i=1}^{k} x_i^2 \, n_i}{n} - \bar{x}^2} = +\sqrt{\overline{x^2} - \bar{x}^2}$$

Its main properties are:

- It is the most usual measure of variability.
- It has the same units of measure as the sample.
- The standard deviation is always positive or 0.

Moreover, the variance and the standard deviation verify:

- If we add a constant to all the data, the variance and the standard deviation stay the same.
- If we multiply all the values by a positive constant, the variance is multiplied by the square of the constant, and the standard deviation is multiplied by the constant.

1.9 Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores

1.9.1 Tchebicheff's theorem

We have already found measures that give us the center of the data and their variability, but we still need more information. Let us recall the data about the number of brothers/sisters:

Num. brothers   absolute fr.
0               6
1               13
2               7
3               3
4               1

so we have that

$$\bar{x} = 1.3333, \quad S^2 = 1.022, \quad S = 1.011$$

How many people are there around the mean? Are there many students who have 1 or 2 brothers/sisters? Let us take an interval centered at the mean, that is, $(\bar{x} - a, \bar{x} + a)$. We know that the variance and the standard deviation measure variability, so we will try to use them now. Which one would you use? We should reject the variance, because we cannot add it to the mean: they have different units of measure. Let us then take the standard deviation, $a = S$. We get the interval

$$(1.3333 - 1.011,\ 1.3333 + 1.011) = (0.3223, 2.3443)$$

Inside this interval we find the students who have 1 or 2 brothers/sisters. These are 20 of the 30 students, i.e., 66.7% of them. What happens if we use $2S$ instead of $S$? We get the interval

$$(1.3333 - 2.022,\ 1.3333 + 2.022) = (-0.6887, 3.3553)$$

Inside this interval we have 29 of the 30 students, i.e., about 97% of them. Obviously, if we calculate the interval for $3S$ we find that all the data are inside it. But the next question is: does this always happen? Are these concentrations of data always the same? Let us see another example using the weekly pay. We have that

$$\bar{x} = 13, \quad S^2 = 39.2, \quad S = 6.26$$

Then,

- $(13 - 6.26,\ 13 + 6.26) = (6.74,\ 19.26)$ contains 19 data (63%)
- $(13 - 12.52,\ 13 + 12.52) = (0.48,\ 25.52)$ contains 29 data (97%)
- $(13 - 18.78,\ 13 + 18.78) = (-5.78,\ 31.78)$ contains 30 data (100%)

As you can see, we get very similar results. This is no coincidence: there is a theorem that guarantees a minimum percentage of the data inside intervals of this kind. Precisely, the theorem states that in an interval such as $(\bar{x} - aS, \bar{x} + aS)$ we have at least $100\left(1 - \frac{1}{a^2}\right)\%$ of the data. For $a = 2$ this means at least 75% of the data, and for $a = 3$ at least about 89%; for $a = 1$ the bound is 0% and tells us nothing. This statement is known as Tchebicheff's theorem.

Pearson's coefficient of variation

We are going to work now with the height and weight data. For the weight we have that

$$\bar{x} = 60.8, \quad S^2 = 99.56, \quad S = 9.97$$

while for the heights we have

$$\bar{x} = \ldots, \quad S^2 = \ldots, \quad S = \ldots$$

In which case do we have more variability? We could think that it is the weight data, because the variance and the standard deviation are bigger, but look what happens if we calculate the same for the heights measured in centimeters:

$$\bar{x} = \ldots, \quad S^2 = \ldots, \quad S = \ldots$$

If we repeat the question now, what would you answer? In fact, we can compare neither standard deviations nor variances, because they depend on the units, just like the mean. We should find a dimensionless measure. So far, we only know that the mean and the standard deviation have the same units of measure, so how can we get a dimensionless measure from them? We can divide them, and then we get Pearson's coefficient of variation:

$$CV = \frac{S}{\bar{x}}$$

We can calculate it for our examples. For the weight we have that

$$CV = \frac{9.97}{60.8} \approx 0.164$$

and for the height

$$CV = \ldots$$

so we find more variability in the weights than in the heights.

Z-scores

We can still find more information in our data. Imagine that your height is 1.74 and you have a friend in another class whose height is the same. Within your own classes, which of you is taller? How can we compare these two data if we only know that the mean in your friend's class is … and the standard deviation is 12.53? There is a way to change these two data into comparable values. These are what we call z-scores, and each one is calculated as the difference between the value and its mean, divided by the standard deviation. With this, the new values belong to a distribution with mean 0 and standard deviation 1, and so we can compare them. In our example we have the following z-scores:

$$z_1 = \ldots, \qquad z_2 = \ldots$$

And we conclude that your friend is taller than you (each one within their own class) because their z-score is bigger. The formula for the z-score related to the datum $x_i$ is

$$z_i = \frac{x_i - \bar{x}}{S}$$
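The three tools of this section can be tried out with a few lines of Python. The following is only a minimal sketch: the function names are invented for illustration, the frequency table of brothers/sisters and the two classes used for the z-scores are hypothetical (they are not the class data), and only the weight summary (mean 60.8, S = 9.97) is taken from the text.

```python
from math import sqrt


def mean_and_std(freq):
    """Mean and standard deviation of a frequency table {value: count}.

    Population formulas, as in the text: the variance divides by n, not n - 1.
    """
    n = sum(freq.values())
    mean = sum(x * count for x, count in freq.items()) / n
    variance = sum(count * (x - mean) ** 2 for x, count in freq.items()) / n
    return mean, sqrt(variance)


def share_within(freq, a):
    """Fraction of the data lying inside (mean - a*S, mean + a*S)."""
    mean, std = mean_and_std(freq)
    n = sum(freq.values())
    inside = sum(count for x, count in freq.items() if abs(x - mean) < a * std)
    return inside / n


def chebyshev_bound(a):
    """Minimum fraction guaranteed by Tchebicheff's theorem: 1 - 1/a^2."""
    return max(0.0, 1 - 1 / a ** 2)


def coefficient_of_variation(mean, std):
    """Pearson's coefficient of variation, CV = S / mean."""
    return std / mean


def z_score(x, mean, std):
    """Standardized value z = (x - mean) / S."""
    return (x - mean) / std


# Hypothetical frequency table (number of brothers/sisters -> students).
siblings = {0: 5, 1: 12, 2: 8, 3: 4, 4: 1}

# Observed share of data in each interval next to the guaranteed minimum.
for a in (1, 2, 3):
    print(a, round(share_within(siblings, a), 3), ">=", round(chebyshev_bound(a), 3))

# Weight summary quoted in the text: mean 60.8, S = 9.97.
print(round(coefficient_of_variation(60.8, 9.97), 3))

# Hypothetical classes: same height (in cm), different class mean and spread.
print(z_score(174, 170.0, 8.0), z_score(174, 165.0, 12.0))
```

The observed shares are always at least as large as the guaranteed minimum, which is all the theorem promises; real data are usually much more concentrated than the bound requires.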

Chapter 2

Analysis of the opinion poll

We are now going to make a deeper analysis of some of the questions in the opinion poll. We have chosen 3 questions:

- 2.1 You smoke
- 2.3 You read books other than school books
- 3.1 You practice some sport outside high school

The data we have are the answers to question 2.1, to question 2.3 and to question 3.1:

[Data: answers to questions 2.1, 2.3 and 3.1]

The first thing we are going to do is to calculate the frequencies in all cases, in order to have the frequency tables for all of them. For question 2.1 we have that

[Table: answer (2.1), absolute frequency, relative frequency, percentage frequency, cumulative absolute frequency, cumulative relative frequency]

For question 2.3 we have the following frequency table

[Table: answer (2.3), absolute frequency, relative frequency, percentage frequency, cumulative absolute frequency, cumulative relative frequency]

and finally, the frequency table for question 3.1 is

[Table: answer (3.1), absolute frequency, relative frequency, percentage frequency, cumulative absolute frequency, cumulative relative frequency]

Just looking at the data in the tables, we can notice that the three are very different. We will now try to see graphically how these variables are distributed, and then we will talk about the first conclusions. As you can notice, we have three discrete variables, so we are going to use the bar graph and the pie chart. These are the graphs for question 2.1:

Figure 2.1: answers to question 2.1

Let us now represent the graphs for question 2.3, and then the ones for question 3.1:

Figure 2.2: answers to question 2.3

Figure 2.3: answers to question 3.1

We can now talk about the first conclusions. It is quite obvious that for question 2.1 the most frequent values are the extreme ones, 1 and 5; that is because there is a tendency to relate number 1 with the people who don't smoke and number 5 with the people who do smoke. Anyway, most of the data are placed in the bigger values (3, 4 and 5). On the contrary, in question 2.3 we can see that the most frequent values are the smaller ones, so we can say that reading is not a very popular hobby. The answers to the third question are a little more spread over all the values.

It is also interesting in this example to represent a bar graph with the cumulative absolute frequencies. We show you the three graphs, in which you can see that the frequencies are more gradually distributed in the third case.

Anyway, we are now going to confirm what we see by calculating the main measures of central tendency. We are going to present them in a table, to make them easier to compare:

Figure 2.4: cumulative bar graphs

[Table: mean, median and mode for each of the three questions]

This table gives us some interesting information. It is quite simple to see that, though the mean for question 2.1 is 3.6, most of the data are bigger than the mean, because both the median and the mode are 5. For question 2.3 the situation is very different: most of the data are around the smallest values, and even the mode is the smallest one. In question 3.1 the three measures coincide, so the number 3 is the best one to represent our data.

Let us now calculate the main measures of variability, and then we will try to see which variable is the most spread.

[Table: range, variance and standard deviation for each of the three questions]
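The measures collected in the two tables above can be obtained from a frequency table with Python's standard `statistics` module. This is only a sketch under stated assumptions: the helper names `expand` and `summarize` are invented here, and the counts are hypothetical. They were chosen merely to agree with what the text quotes for question 2.1 (8 answers of 1, none of 2, mean 3.6, median and mode 5); the split between answers 3 and 4 is an assumption, not the poll data.

```python
import statistics


def expand(freq):
    """Turn a frequency table {value: count} into the flat list of observations."""
    return [value for value, count in sorted(freq.items()) for _ in range(count)]


def summarize(freq):
    """Measures of central tendency and variability (population formulas)."""
    data = expand(freq)
    mean = statistics.mean(data)
    std = statistics.pstdev(data)  # divides by n, like S in the text
    return {
        "mean": mean,
        "median": statistics.median(data),
        "modes": statistics.multimode(data),  # a list: ties are all reported
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),
        "std": std,
        "cv": std / mean,
    }


# Hypothetical answers on the 1-5 scale: answer -> number of students.
answers = {1: 8, 2: 0, 3: 4, 4: 2, 5: 16}

for name, value in summarize(answers).items():
    print(name, value)
```

Note that a value with frequency 0 in the middle of the scale (here the answer 2) does not change the range, which only looks at the smallest and the biggest observed values.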

In our example, the range is not very relevant, because all the answers lie between 1 and 5. The only thing we can learn from the fact that in question 2.3 the range is 3 (smaller than the others) is that one of the extreme values has frequency 0 (in this case, value 5). But the range does not reveal every empty value: for question 2.1, for example, the frequency of value 2 is also 0. From the standard deviation we can conclude that the answers to question 2.1 are very spread. This makes sense because, if you take a look at the data, you can see that most of them are extreme values, 1 or 5. The other two variables are a bit more concentrated around the mean, especially the answers to question 2.3.

Let us now check whether the mean is representative in our variables. We shall calculate the coefficient of variation in each case. We have that

[Table: coefficient of variation for each of the three questions]

So the mean is representative for the three cases we are studying.

2.1 Conclusions

In this last section of the analysis, it is important to stress the meaning of the data we are studying. Until now, we have been talking about the statistical characteristics of the data, but we cannot forget that all those data have their own meaning.

We can notice that smoking is very popular among young people. More than half of this class say that they smoke every day, while only 8 people say that they never smoke. If we add up the frequencies of the students who smoke at least sometimes, we get 22 of you, almost three quarters of the total.

On the contrary, there is very little interest in reading: 22 of you say that you never or rarely read a book other than the ones you need for school. This is maybe one of the biggest contrasts we can get from the poll. None of you say that you read every day, though 5 people say that they usually read.

Sports are the middle ground. This is maybe because many of you can practice some sport at the weekends or when the weather is good, while the ones who practice sports very often balance the ones who almost never practice any sport.

Chapter 3

Two-dimensional Descriptive Statistics

In the previous chapter, we worked with the data we got from a poll and we obtained the first conclusions. But we want to know more than we already do, because we can get more information from those data with certain methods that we are going to study from now on. Before going on, we will state our objectives in this chapter.

3.1 Objectives

- To represent and analyze data on two variables through a scatterplot.
- To identify as a two-dimensional distribution a data set on two variables given in a table or by a scatterplot.
- To analyze the relationship between two variables through their scatterplot, establishing by intuition whether this relationship is positive or negative, whether it is functional or not, and, in that case, whether it is close to a line.
- To compare global tasks of several distributions through their scatterplots.
- To assign given scatterplots to different situations.
- To determine the relationship between the different means through the scatterplot.
- To find, in a graphical way, a line that fits the scatterplot.
- To estimate the correlation coefficient from a scatterplot.
- To analyze the degree of the relationship between two variables when the correlation coefficient is known.
- To calculate the correlation coefficient in two-dimensional distributions and the regression lines.
- To make predictions from the regression line.

3.2 The example: an opinion poll

In this chapter we will go deeper into the analysis of the opinion poll we have been working with. From the information that we already have, we will try to answer questions like: Is there any relationship between the pay you receive and the number of brothers/sisters you have? Does the sport you practice have any influence on how much you smoke or how much alcohol you drink? Can we measure these relationships precisely? Throughout this chapter we will try to answer these questions and many more. From now on we present the concepts that will be necessary to get these answers.

3.3 Introduction and simple tables

We can think about many variables that may have an influence on many others. For instance, we can think that the older you are, the bigger the pay you get. We are going to see if that is really true. So, as you already know from the previous chapter, the first thing we have to do is to organize our data. We recall that the data about ages and pays that we had are the following:

[Table: pairs of Age and Pay]

These are the pairs of data that we have. Let us start by grouping the pairs that are equal. We get the following table:

[Table: Age, Pay, Number]

The table we have just built is called a simple table and it will be the starting point for our analysis.

3.4 Frequency tables, marginal distributions and conditional distributions

Is it simple for you to draw conclusions from the previous table? Can we find any other way to represent our data? The idea is to avoid the repeated values that we can see in the column of ages and also in the column of pays. We can group our data in the following way:

[Table: frequencies of each Pay (rows) for each Age (columns)]

This table gives us a more global view of the distribution of the frequencies, and the more repeated values we have, the more useful the table is. We call it a table on two variables when we are representing two quantitative variables, and a contingency table when we have two qualitative variables.

But from these tables, can we obtain the total number of people whose pay is 12 euros? And the total number of people whose age is 17? Obviously, the answer is yes. Notice that you can add up all the frequencies appearing in the row related to value 12 of the pay, and so get the number of people whose pay is 12. In the same way, we can add up all the frequencies in the column related to value 17 of the age, and we will have the total number of people who are 17. We add these numbers to our table and we have

[Table: frequencies of each Pay (rows) for each Age (columns), with a Tot row and a Tot column]

In fact, what you have just obtained are the values of the two single variables, each one independently of the other. These values are called the marginal distributions of the variables. To obtain the whole marginal distribution of the variable age we take the first and the last row:

[Table: Age, frequency]

We can also do this for the variable pay, taking the first and the last column.

Exercise. Can you build the similar table for the variable pay?

In general, a table on two variables is defined as follows:

$$
\begin{array}{c|cccccc|c}
X \backslash Y & y_1 & y_2 & \cdots & y_p & \cdots & y_m & \text{Tot} \\ \hline
x_1 & n_{11} & n_{12} & \cdots & n_{1p} & \cdots & n_{1m} & n_{1\cdot} \\
x_2 & n_{21} & n_{22} & \cdots & n_{2p} & \cdots & n_{2m} & n_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
x_s & n_{s1} & n_{s2} & \cdots & n_{sp} & \cdots & n_{sm} & n_{s\cdot} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
x_k & n_{k1} & n_{k2} & \cdots & n_{kp} & \cdots & n_{km} & n_{k\cdot} \\ \hline
\text{Tot} & n_{\cdot 1} & n_{\cdot 2} & \cdots & n_{\cdot p} & \cdots & n_{\cdot m} & n
\end{array}
$$

where the values or characteristics of $X$ are $x_1, x_2, \ldots, x_k$ and the ones of $Y$ are $y_1, y_2, \ldots, y_m$; $n_{ij}$ is the number of data points presenting characteristic $x_i$ for the variable $X$ and $y_j$ for the variable $Y$. Moreover, $n_{i\cdot}$ denotes the number of data points presenting the characteristic $x_i$ and $n_{\cdot j}$ the number of data points presenting the characteristic $y_j$. Finally, $n$ is the total number of elements of the population or the sample.

Once we know the marginal distributions, we can calculate the mean and the standard deviation of each of them as if they were one-dimensional variables. Their expressions are:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i\, n_{i\cdot}}{n}, \qquad \bar{y} = \frac{\sum_{j=1}^{m} y_j\, n_{\cdot j}}{n}$$

$$S_x = \sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2\, n_{i\cdot}}{n}}, \qquad S_y = \sqrt{\frac{\sum_{j=1}^{m} (y_j - \bar{y})^2\, n_{\cdot j}}{n}}$$

Exercise. What are the mean and the standard deviation of the pay and of the age?

One of your classmates has a question. He is 17 and he wants to know whether his pay is among the higher or the lower ones, so that he can ask for a raise if it is too low. To find out, he wants to compare himself with all the other students of his age, so he takes out the data of the students who are as old as he is:

[Table: Pay, frequency for Age = 17]

As this boy has a pay of 10 euros, he decides that most of his classmates have a higher pay than him, so he is going to ask for a raise.

What we have just calculated is the conditional distribution of the variable pay for a fixed value of the age, in this case 17. We have again a one-dimensional variable, for which we can calculate the measures of central tendency and of variability that we already know.

Exercise. Calculate the frequency table for the variable age for pay = 15 euros.

Exercise. Calculate the frequency table, with the marginal frequencies, for the weight and the answer to the question …

3.5 Scatterplots

As usually happens for one-dimensional variables, data are more easily analyzed if we represent them in a graph. The situation now is different, though, because we need to represent two variables, each one with its frequencies. To do that we use a graph called a scatterplot. We are now going to explain how to draw it: we represent the variable pay on the OX axis and the variable age on the OY axis. For each pair of values we either draw a point as big as its frequency, or we draw as many points as the frequency shows.
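The simple table, the marginal distributions and the conditional distributions described in this chapter can be sketched in a few lines of Python. This is an illustration under stated assumptions: the (age, pay) pairs are hypothetical, not the poll data, and the function names `marginal` and `conditional` are invented for the example.

```python
from collections import Counter

# Hypothetical (age, pay) pairs, one per student.
pairs = [(16, 10), (16, 12), (17, 10), (17, 12),
         (17, 12), (17, 15), (18, 15), (18, 15)]

# Simple table: each distinct pair with the number of times it appears.
joint = Counter(pairs)


def marginal(joint, axis):
    """Marginal distribution: axis 0 is the age, axis 1 is the pay."""
    totals = Counter()
    for key, count in joint.items():
        totals[key[axis]] += count
    return dict(sorted(totals.items()))


def conditional(joint, axis, value):
    """Distribution of the other variable when the variable `axis` equals `value`."""
    other = 1 - axis
    totals = Counter()
    for key, count in joint.items():
        if key[axis] == value:
            totals[key[other]] += count
    return dict(sorted(totals.items()))


print("marginal of age:", marginal(joint, 0))
print("marginal of pay:", marginal(joint, 1))
print("pay given age 17:", conditional(joint, 0, 17))
print("age given pay 15:", conditional(joint, 1, 15))
```

The same `joint` table also holds what a scatterplot needs: each key is the position of a point, and its count says how big to draw it, or how many points to draw there.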

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)

More information

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple. Graphical Representations of Data, Mean, Median and Standard Deviation In this class we will consider graphical representations of the distribution of a set of data. The goal is to identify the range of

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

Descriptive Statistics and Measurement Scales

Descriptive Statistics and Measurement Scales Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. MATHEMATICS: THE LEVEL DESCRIPTIONS In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. Attainment target

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

Variables. Exploratory Data Analysis

Variables. Exploratory Data Analysis Exploratory Data Analysis Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data. A common situation is for a data set to be represented as a matrix. There is

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

MEASURES OF VARIATION

MEASURES OF VARIATION NORMAL DISTRIBTIONS MEASURES OF VARIATION In statistics, it is important to measure the spread of data. A simple way to measure spread is to find the range. But statisticians want to know if the data are

More information

Measurement with Ratios

Measurement with Ratios Grade 6 Mathematics, Quarter 2, Unit 2.1 Measurement with Ratios Overview Number of instructional days: 15 (1 day = 45 minutes) Content to be learned Use ratio reasoning to solve real-world and mathematical

More information

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences Introduction to Statistics for Psychology and Quantitative Methods for Human Sciences Jonathan Marchini Course Information There is website devoted to the course at http://www.stats.ox.ac.uk/ marchini/phs.html

More information

Means, standard deviations and. and standard errors

Means, standard deviations and. and standard errors CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

MATH 103/GRACEY PRACTICE EXAM/CHAPTERS 2-3. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MATH 103/GRACEY PRACTICE EXAM/CHAPTERS 2-3. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. MATH 3/GRACEY PRACTICE EXAM/CHAPTERS 2-3 Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Provide an appropriate response. 1) The frequency distribution

More information

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance Principles of Statistics STA-201-TE This TECEP is an introduction to descriptive and inferential statistics. Topics include: measures of central tendency, variability, correlation, regression, hypothesis

More information

AP * Statistics Review. Descriptive Statistics

AP * Statistics Review. Descriptive Statistics AP * Statistics Review Descriptive Statistics Teacher Packet Advanced Placement and AP are registered trademark of the College Entrance Examination Board. The College Board was not involved in the production

More information

Statistics Revision Sheet Question 6 of Paper 2

Statistics Revision Sheet Question 6 of Paper 2 Statistics Revision Sheet Question 6 of Paper The Statistics question is concerned mainly with the following terms. The Mean and the Median and are two ways of measuring the average. sumof values no. of

More information

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared. jn2@ecs.soton.ac.uk

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared. jn2@ecs.soton.ac.uk COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared jn2@ecs.soton.ac.uk Relationships between variables So far we have looked at ways of characterizing the distribution

More information

The correlation coefficient

The correlation coefficient The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

More information

Summarizing and Displaying Categorical Data

Summarizing and Displaying Categorical Data Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency

More information

Mathematical goals. Starting points. Materials required. Time needed

Mathematical goals. Starting points. Materials required. Time needed Level S6 of challenge: B/C S6 Interpreting frequency graphs, cumulative cumulative frequency frequency graphs, graphs, box and box whisker and plots whisker plots Mathematical goals Starting points Materials

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

More information

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median CONDENSED LESSON 2.1 Box Plots In this lesson you will create and interpret box plots for sets of data use the interquartile range (IQR) to identify potential outliers and graph them on a modified box

More information

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different

More information

Week 1. Exploratory Data Analysis

Week 1. Exploratory Data Analysis Week 1 Exploratory Data Analysis Practicalities This course ST903 has students from both the MSc in Financial Mathematics and the MSc in Statistics. Two lectures and one seminar/tutorial per week. Exam

More information

Statistics 151 Practice Midterm 1 Mike Kowalski

Statistics 151 Practice Midterm 1 Mike Kowalski Statistics 151 Practice Midterm 1 Mike Kowalski Statistics 151 Practice Midterm 1 Multiple Choice (50 minutes) Instructions: 1. This is a closed book exam. 2. You may use the STAT 151 formula sheets and

More information

Lecture 1: Review and Exploratory Data Analysis (EDA)

Lecture 1: Review and Exploratory Data Analysis (EDA) Lecture 1: Review and Exploratory Data Analysis (EDA) Sandy Eckel seckel@jhsph.edu Department of Biostatistics, The Johns Hopkins University, Baltimore USA 21 April 2008 1 / 40 Course Information I Course

More information

Chapter 2: Frequency Distributions and Graphs

Chapter 2: Frequency Distributions and Graphs Chapter 2: Frequency Distributions and Graphs Learning Objectives Upon completion of Chapter 2, you will be able to: Organize the data into a table or chart (called a frequency distribution) Construct

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Chapter 1 Review 1. As part of survey of college students a researcher is interested in the variable class standing. She records a 1 if the student is a freshman, a 2 if the student

More information

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, 5-8 8-4, 8-7 1-6, 4-9

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, 5-8 8-4, 8-7 1-6, 4-9 Glencoe correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 STANDARDS 6-8 Number and Operations (NO) Standard I. Understand numbers, ways of representing numbers, relationships among numbers,

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Northumberland Knowledge

Northumberland Knowledge Northumberland Knowledge Know Guide How to Analyse Data - November 2012 - This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about

More information

Probability Distributions

Probability Distributions CHAPTER 5 Probability Distributions CHAPTER OUTLINE 5.1 Probability Distribution of a Discrete Random Variable 5.2 Mean and Standard Deviation of a Probability Distribution 5.3 The Binomial Distribution

More information

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013
