An approach to Descriptive Statistics through real situations


MaMaEuSch (2)
Management Mathematics for European Schools
CP DE - COMENIUS - C21

Paula Lagares Barreiro (1)
Federico Perea Rojas-Marcos (1)
Justo Puerto Albandoz (1)

(1) University of Seville
(2) This project has been carried out with the partial support of the European Community in the framework of the Sokrates programme. The content does not necessarily reflect the position of the European Community, nor does it involve any responsibility on the part of the European Community.

Contents

1 One-dimensional Descriptive Statistics
   Objectives
   The example: an opinion poll
   Population and samples
   Types of statistical variables: quantitative (discrete and continuous) and qualitative
   Frequency tables: absolute, relative and percentage frequencies
   Graphical methods
      Bar graph
      Histogram
      Frequency polygon
      Pie Chart
      Pictogram
      Stem and leaf plot
      Some remarks
   Measures of central tendency: mean, median, mode, quantiles
   Measures of variability: Range, variance, standard deviation
   Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores
      Tchebicheff's theorem
      Pearson's coefficient of variation
      Z-scores
   Analysis of the opinion poll
   Conclusions

Two-dimensional Descriptive Statistics
   Objectives
   The example: an opinion poll
   Introduction and simple tables
   Frequency tables, marginal distributions and conditional distributions
   Scatterplots
   Functional dependence and statistical dependence
   Covariance
   Linear correlation
   3.9 Regression lines

Chapter 1
One-dimensional Descriptive Statistics

We are going to study an opinion poll. You will fill in a questionnaire so that we can see what you think about many topics, and we will study some characteristics such as height, number of brothers/sisters, etc. We will check whether your opinions coincide with those of your friends and whether many people in your classroom have characteristics similar to yours. For instance, how many of your classmates are taller than you? And how many of them have the same number of brothers/sisters as you?

Before continuing, we state the main objectives that we want to achieve in this chapter.

1.1 Objectives

- To distinguish the different types of statistics.
- To determine which type of statistical process we shall use, depending on the type of data that we are studying.
- To get to know the concepts of central tendency and variability of a set of data.
- To determine the parameters of a statistical distribution.
- To study the coefficient of variation.
- To motivate through information given in examples and exercises about social, ecological and economic topics, etc.

1.2 The example: an opinion poll

From now on, we will work with an opinion poll. We want to know some things about the students in your class. We will ask you for some personal data, and then you will give us some information and opinions about many topics, such as sports, food, etc. Our poll will be anonymous, so that each of you can feel free to answer without worrying about who reads those opinions later.

With these data we will pose some interesting questions about ourselves as a group, which we can perhaps use as a guide to answer other questions about a wider group of people. For instance: Which is the most frequent height in your class? Can you consider your weekly pay normal compared with that of your classmates? How many of you practice sports often? How many have breakfast before coming to high school? What kind of food do you have most often: fruit, milk, coffee, fish...?

We will see that, by analyzing the answers we get in the poll, you will be able to answer all the questions we have posed. By the end of this chapter we will have all the answers. But first of all, we are going to present the concepts that you will need.

1.3 Population and samples

Before answering all those questions, we have to clarify some things. Who do we want to get information about? We have already said that we want to know things about the students of your level, so our population will not be only the students of this class, but all the students of your level. But it would take too long to ask all those students, so we have decided to take a representative group from all the classrooms of your level: in this case, your class. So you are the sample. Furthermore, each member of the population is called a data point.

Let us make some comments about what we have just said. First of all, we may want to study some characteristic of animals, plants or things, for instance the battery life of a mobile phone. In this case the population is not human: it consists of the different types of mobile phones. Moreover, there are situations in which sampling is even more justified than in our case, for different reasons:

- If we want to know how all Spanish people will vote, we cannot ask every inhabitant older than 18, because there are millions of them and that would take a lot of time and money.
- To study, for example, the average life of light bulbs, we cannot test all of them, because each test means that a bulb burns out. This is an example of a situation in which sampling means destroying a data point.

Therefore, sampling is justified in many situations for reasons of time, money or destruction of the data points.

Exercise. The "University studies demand" poll in Andalusia was carried out in 2001 to find out what high school students wanted to study and why. To that end, data from 8500 students from all over Andalusia were collected. Could you say which are the sample and the population in this example? What are the reasons for choosing a sample in this example?

1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative

In order to answer many of our questions in the right way, the first thing we have to do is decide which kind of method we want to apply to our data. Notice that not all the data we collect are of the same kind. For instance, think about the answers to three questions of our poll:

1. The answer to the question "sex" (male or female).
2. The answer to the question "number of brothers/sisters".
3. The answer to the question "height".

The first thing we notice is that the answer to the first question is not numerical, whereas the answers to questions two and three are numerical. The characteristic corresponding to the first question is called qualitative, whereas the ones related to questions two and three are called quantitative. Quantitative variables allow arithmetic operations that make no sense for qualitative characteristics. We call the different possibilities of a qualitative variable categories, and those of a quantitative variable values.

Let us now see the differences between variables 2 and 3, which are a little more subtle. The variable "number of brothers/sisters" takes numerical values that we can call isolated, 0, 1, 2, 3, ..., but it cannot take any value between two of those; for instance, it cannot have the value 3.5. This does not happen with the variable "height". In fact, height can have any value between certain limits, and we can measure it as precisely as we want. We say that height can take any value from an interval. So the variable in case 2 is called discrete and the variable in case 3 is called continuous.

Exercise. Decide whether these variables are qualitative or quantitative and, if they are quantitative, whether they are discrete or continuous:

1. Number of babies born in a day.
2. Blood group of a person.
3. Time needed to solve a problem.
4. Number of questions in an exam.
5. Temperature of a person.
6. Political party voted for in the last elections.
7. Number of goals scored by a player in a season.

1.5 Frequency tables: absolute, relative and percentage frequencies

It is now time to start processing the data we have collected with our poll. The data that we have about number of brothers/sisters are:

[...]

Meanwhile, for the weights we have:

[...]

We can pose a lot of questions: how many of my classmates have the same number of brothers/sisters as I have? How many of them have more than me? And fewer than me? How many of my classmates weigh more than me? And less than me? To answer these questions, we have to count how many times each answer appears. Let us start by counting the answers about the number of brothers/sisters. This is what we have:

[...]

So we now know that there are 13 people who have 1 brother/sister. This number is called the absolute frequency and we denote it by $n_i$. And how many people have at most 1 brother/sister? In our case these are the people who have 0 or 1 brother/sister, that is, 6 + 13 = 19. This number is called the cumulative absolute frequency and we denote it by $N_i$. We can now write the table of absolute and cumulative absolute frequencies:

N. bro/sis   absolute fr.   cum. absolute fr.
0            6              6
1            13             6 + 13 = 19
2            7              19 + 7 = 26
3            3              26 + 3 = 29
4            1              29 + 1 = 30

It is important to put the values of the characteristic in order from the smallest to the biggest if we want to calculate the cumulative frequencies in the right way.

We are now going to define other kinds of frequencies, because it is interesting to know what proportion of the total a given value represents: that is how we can compare it with other populations. For instance, in our case there are 6 students with 0 brothers/sisters. Suppose we have also asked a group of 50 people and found 9 people with 0 brothers/sisters. In which of the two groups is the proportion of people with no brothers/sisters bigger? The proportions are

$$\frac{6}{30} = 0.2 \quad\text{and}\quad \frac{9}{50} = 0.18,$$

so the proportion is bigger in our group of 30 people. This proportion is called the relative frequency and we denote it by $f_i$.
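The counting described above can be sketched in a few lines of Python. This is only an illustration: the frequencies for 2, 3 and 4 brothers/sisters (7, 3 and 1) are an assumption chosen to agree with the figures quoted in this chapter (6 students with 0, 13 with 1, 30 students in total), and the variable names are ours.

```python
from itertools import accumulate

# Values of the variable "number of brothers/sisters", sorted from smallest to biggest.
values = [0, 1, 2, 3, 4]
# Absolute frequencies n_i (the last three are an assumption, see the text above).
abs_freq = [6, 13, 7, 3, 1]

n = sum(abs_freq)                          # total number of data points
cum_abs = list(accumulate(abs_freq))       # cumulative absolute frequencies N_i
rel_freq = [n_i / n for n_i in abs_freq]   # relative frequencies f_i = n_i / n

print("value  n_i  N_i    f_i")
for x, n_i, N_i, f_i in zip(values, abs_freq, cum_abs, rel_freq):
    print(f"{x:>5} {n_i:>4} {N_i:>4} {f_i:>6.3f}")
```

The cumulative column only makes sense because `values` is sorted in increasing order; with a different order the running sums would no longer answer "how many have at most this value".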
If we express it as a percentage (multiplying by 100) we get the percentage frequency, which in our case is 20% and 18% respectively. We denote these frequencies by $p_i$. We now add all these frequencies to our table and we get:

Bro/sis   absolute fr.   relative fr.     percentage fr.   cum. abs. fr.    cum. rel. fr.
0         6              6/30 = 0.2       20%              6                6/30 = 0.2
1         13             13/30 = 0.433    43.3%            6 + 13 = 19      19/30 = 0.633
2         7              7/30 = 0.233     23.3%            19 + 7 = 26      26/30 = 0.867
3         3              3/30 = 0.1       10%              26 + 3 = 29      29/30 = 0.967
4         1              1/30 = 0.033     3.3%             29 + 1 = 30      30/30 = 1
Total     30             1                100%

Let us now analyze the weight data. We count the different values:

[...]

As you can see, most of the values have frequency 1 and our variable takes 25 different values. Those are too many different values to show in a table (even more so when we only have 30 data). How can we get a more representative table of the distribution of the data? It seems logical to group similar data in intervals. There is a complete theory about how to group data properly. These are the main points we want to stress:

- The number of classes should be neither too high (around 6 to 8 is the maximum number we usually work with) nor too low (it makes no sense to group in 2 or 3 classes, because we lose a lot of information).

- Except perhaps for the extreme classes, all the intervals should have the same width, because otherwise the information can be misinterpreted.

Can you imagine which intervals we are looking for? You can start by thinking about the number of classes you want to have. Note that between the highest value (82) and the lowest value (46) there is a difference of 36 kg. If we want to group in 6 classes, the width of each interval should be 36/6 = 6. So we obtain the following intervals: [46,52], (52,58], (58,64], (64,70], (70,76], (76,82]. This is one possible classification, though of course there are many others. In some analyses you may find that the first interval is of the kind "smaller than 52" and the last one "greater than 76". For calculations, intervals of this kind are treated as having the same width as the others.

Once we have decided how to group the data, we can calculate the frequencies:

Weight     absolute fr.   relative fr.   percentage fr.   cum. abs. fr.   cum. rel. fr.
[46,52]    [...]          [...]          [...]%           [...]           [...]
(52,58]    [...]          [...]          [...]%           [...]           [...]
(58,64]    [...]          [...]          [...]%           [...]           [...]
(64,70]    [...]          [...]          [...]%           [...]           [...]
(70,76]    [...]          [...]          [...]%           [...]           [...]
(76,82]    [...]          [...]          [...]%           30              1

Moreover, when we work with grouped data we need to choose a representative of each interval. We call it the class mark, and it is the midpoint of the interval (the lower end of the interval plus the upper end, divided by 2). For instance, the class mark of (52,58] is (52 + 58)/2 = 55.

Exercise. Make the frequency table for the answers to question 1.3 and for the answers to the question "height", deciding beforehand whether it is necessary to group the data in intervals or not.

1.6 Graphical methods

Once we have the frequency tables, imagine that your teacher asks you to present the conclusions you have obtained to the rest of the students. You can present your frequency tables and talk about the main conclusions, but is there any way of presenting the data so that the main conclusions can be seen more easily? As you may suppose, the answer to this question is yes.
Maybe you have seen in books, or more often in the media, that data are usually presented graphically, because this is more attractive to people and makes the data easier to analyze. In this section we want to show the main types of graphs, and we are going to stress how important it is to choose the right type of graph for the data we are working with. Now that we have the frequency tables for the variables "weight" and "number of brothers/sisters", we are going to use them to introduce the different graphs.

1.6.1 Bar graph

The first kind of graph we are going to study is the bar graph. This graph is used for qualitative variables and for discrete variables that are not grouped in intervals. We already know that the number of brothers/sisters is a discrete variable, so let us see how to build a bar graph using those data. On the OX axis we place the categories, if we have a qualitative variable, or the values, if we have a discrete variable; in our example, those values are 0, 1, 2, 3 and 4. Over each of these values we place a rectangle or bar, all with the same base and with a height proportional to the corresponding frequency. In our case we get a graph like this:

Figure 1.1: brothers/sisters (vertical bars)

Sometimes this graph is also presented with horizontal bars, like this:

Figure 1.2: brothers/sisters (horizontal bars)

1.6.2 Histogram

A histogram is a graph very similar to the bar graph, but it is used for variables grouped in intervals. We are going to build a histogram for the variable "weight". As before, we represent the intervals on the OX axis, and over each of them we place a rectangle whose base has the same width as the interval and whose height is such that the area of the rectangle is proportional to the frequency of the interval. In this kind of graph the areas of the rectangles are very important, because a bar does not correspond to a single point: its width represents our interval. So, if our intervals all have the same width, the height can simply be the frequency; if not, we have to modify the height in order to keep the area proportional to the frequency. Our histogram for the variable "weight", which we have already grouped, is:

Figure 1.3: weight (histogram)

We can also represent it with horizontal rectangles:

Figure 1.4: weight (histogram)

Surely you have seen a population pyramid in the media at some time. Notice that a population pyramid is in fact two horizontal histograms (one for women and another for men) in which we represent the number of inhabitants grouped by age.

1.6.3 Frequency polygon

The next type of graph that we are going to define is the frequency polygon. This graph is used for quantitative variables, discrete or continuous. In order to draw it, we start from the bar graph or the histogram, depending on whether the variable is not grouped or grouped.

We join with line segments the midpoints of the tops of the bars in the bar graph or the histogram. For the number of brothers/sisters we get the following graph:

Figure 1.5: brothers/sisters (frequency polygon)

The case of the weight is a little different. In this situation the area under the line represents the data we have, as in the histogram, because each point stands for the whole width of an interval. The graph looks like this:

Figure 1.6: weight (frequency polygon)

All the graphs that we have seen so far can also be drawn for relative frequencies and for cumulative frequencies.

1.6.4 Pie Chart

The next type of graph that we are going to present is a well-known one, the pie chart. In a pie chart we assign to each category or value a sector of a circle whose area is proportional to the frequency. This graph is usually used for qualitative variables and for discrete variables that are not grouped.
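Since the area of a sector is proportional to its central angle, drawing a pie chart comes down to splitting the 360 degrees of the circle in proportion to the frequencies. The short Python sketch below illustrates this; the helper name `sector_angles` is ours, and the frequencies for the number of brothers/sisters are the same assumed ones as before (only 6, 13 and the total 30 are quoted explicitly in the text).

```python
def sector_angles(frequencies):
    """Return the central angle (in degrees) of the sector for each frequency."""
    total = sum(frequencies)
    return [360 * f / total for f in frequencies]

# Assumed absolute frequencies for 0, 1, 2, 3 and 4 brothers/sisters.
abs_freq = [6, 13, 7, 3, 1]
angles = sector_angles(abs_freq)

for value, angle in zip(range(5), angles):
    print(f"{value} brothers/sisters: {angle:.0f} degrees")
```

The same function works unchanged with relative or percentage frequencies, because only the proportions matter.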

Figure 1.7: brothers/sisters (pie chart)

1.6.5 Pictogram

Pictograms are a kind of graph that is very frequent in the media. They are graphs in which a picture related to the variable is used to represent the frequencies. But we have to stress one thing again: the size of the picture (and not only its height) has to be proportional to the frequency that we want to represent. It is usual to write the frequency next to the picture to avoid mistakes.

1.6.6 Stem and leaf plot

There is a representation that lies between a graph and a data count: the stem and leaf plot. We are going to see how to make one using the weight data. Recall that the data we had are:

[...]

In a stem and leaf plot, the first thing we have to do is write in a column the different tens digits that we can find in the data. In our example, as our values range between 46 and 82, we have to write 4, 5, 6, 7 and 8 in the following way:

4
5
6
7
8

Next, we take the first observation, 52, and we place its units digit next to the corresponding tens digit, that is:

4
5 2
6
7
8

We keep placing the units digits next to the tens digits for the rest of the data. What we get is something like this:

[...]

You can see that we have something similar (but not equal) to a bar graph or a histogram. Obviously we could have made it vertically, and we would have something like this:

[...]

That looks like a histogram or a bar graph, though it is not one. But the stem and leaf plot can be taken as an approximation to the distribution of the data. In fact, we have only divided the data in tens (from 40 to 49, from 50 to 59, ...), but we could divide them in groups of 5 (from 40 to 44, from 45 to 49, from 50 to 54, ...) just by writing each tens digit twice: next to the first one we place the units digits between 0 and 4, and next to the second one the units digits between 5 and 9. In our example, and for the horizontal case, we would have:

[...]

1.6.7 Some remarks

Imagine that you see the following two graphs about the profits of a company. Which one would you choose to be your company?

Figure 1.8: benefits (company 1 and company 2)

Most of you may choose company 2, because it looks better than company 1, but in fact the data in the two graphs are the same. We have only changed the scale of the OY axis. We make some remarks before starting the next section.

Graphs are a very useful tool and they make it easier to draw conclusions from our data, but it is necessary to draw them in the right way in order to avoid mistakes. It is very important to keep the proportions among the pictures we draw and to make sure that the axis scales are also proportional, because small changes in the scales make big differences in appearance and the graph can be misunderstood.

1.7 Measures of central tendency: mean, median, mode, quantiles

Let us suppose now that we are planning a trip with the whole class and we want to earn some money, so we have decided to sell t-shirts, but we do not know what the appropriate price is. The only thing we know is that we pay 4 euros for each of them. We would like to make a profit, but we cannot set a high price because we want everybody to buy our t-shirts. We think that the weekly pay is a good reference for what the students can afford. So we are going to use the weekly pay data that we have:

[...]

We have 30 values, but we need only one value to represent them all. Which value can we choose? A first solution might be to choose an intermediate value among all the data we have. To get it, we add up all the numbers and divide the sum by the total number of data, so we have:

$$\bar{x} = \frac{390}{30} = 13,$$

where 390 is the sum of the 30 values. Now we have the first possible price, 13 euros. The number we have just calculated is called the mean.

But there are more possibilities. For instance, we can choose the most frequent value to represent our data. In our example the most frequent value is 9, which can also be a good choice for a price. We call the most frequent value the mode.

But neither of those two numbers says anything about the number of people who can afford the t-shirt. So we have another idea. Let us sort the data we have:

[...]

Now we want to find the value that leaves half of the data on each side. The values in positions 15 and 16 leave 14 values on each side; as both of them are equal to 10, we can consider that 10 is the value that leaves half of the data on each side. This number is called the median.

Just as we have proposed a value that leaves 50% of the data on each side, we can look for a price that 75% of the class can afford, that is, the value that leaves 25% of the data on its left (only 25% of the data are lower than that value), or any other percentage. These numbers are called quantiles.

We can now choose any of those three values, depending on what we want in each case or on which value best represents the whole data set. Those three values are not always valid for every case, but they help us to see where the center of the distribution is. These are the main measures of central tendency.

We are now going to define formally the concepts that we have presented. From now on we speak about variables. Let us suppose that we have observed a variable in $n$ data points and we got $k$ different values, $x_1, x_2, \ldots, x_k$, with frequencies $n_1, n_2, \ldots, n_k$, where $n_i$ is the absolute frequency of the value $x_i$. We denote by $N_i = \sum_{j \le i} n_j$ the cumulative absolute frequency of the value $x_i$ and by $f_i = \frac{n_i}{n}$ its relative frequency.
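The three measures introduced above can be computed directly from a frequency table written in this notation. The following Python sketch is only an illustration: it uses the brothers/sisters table instead of the weekly pay, and the frequencies for 2, 3 and 4 brothers/sisters are an assumption consistent with the totals quoted in this chapter.

```python
# Values x_i and absolute frequencies n_i (the last three frequencies are assumed).
values = [0, 1, 2, 3, 4]
abs_freq = [6, 13, 7, 3, 1]
n = sum(abs_freq)

# Mean: sum of x_i * n_i divided by n.
mean = sum(x * n_i for x, n_i in zip(values, abs_freq)) / n

# Mode: the value with the highest absolute frequency.
mode = max(zip(values, abs_freq), key=lambda pair: pair[1])[0]

# Median: expand the table into the sorted list of data and take the central
# value (odd n) or the mean of the two central values (even n).
data = [x for x, n_i in zip(values, abs_freq) for _ in range(n_i)]
mid = n // 2
median = data[mid] if n % 2 == 1 else (data[mid - 1] + data[mid]) / 2

print(f"mean = {mean:.4f}, mode = {mode}, median = {median}")
```

Here the two central positions (15 and 16) both hold the value 1, so the median coincides with the mode, while the mean is pulled slightly upwards by the few students with 3 or 4 brothers/sisters.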
If the values of the variable are grouped, we suppose that we have $h$ intervals, denoted by $(L_0, L_1], (L_1, L_2], \ldots, (L_{h-1}, L_h]$, whose class marks are $c_1, c_2, \ldots, c_h$. In this case the absolute frequencies are denoted by $n_1, n_2, \ldots, n_h$, the cumulative absolute frequencies by $N_1, N_2, \ldots, N_h = n$ and the relative frequencies by $f_1, f_2, \ldots, f_h$.

Then, for variables that are not grouped, the mean is defined as

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}.$$

If we have a grouped variable, we use the class marks $c_i$ instead of the values $x_i$.

The main characteristics of the mean are the following:

- It is the center of gravity of the distribution and it is unique.
- When we have extreme or scarcely representative values (too big or too small), the mean may not be representative.
- It makes no sense to calculate the mean for a qualitative variable, or for grouped data when any of the intervals is not bounded.
- For grouped data, we use the class mark of each interval to calculate the mean.

Moreover, the mean has the following properties:

- If a constant is added to each value, the mean increases by that constant too.
- If we multiply all the values by a constant, the mean is also multiplied by the same constant.

The mode is usually defined as the most frequent value. For a variable that is not grouped, it is the value that appears the most times. For a variable grouped in intervals of the same width, we look for the interval with the highest frequency (modal class or modal interval) and approximate the mode through the formula

$$Mo = L_{i-1} + \frac{n_i - n_{i-1}}{(n_i - n_{i-1}) + (n_i - n_{i+1})}\, c_i,$$

where:

- $L_{i-1}$ is the lower limit of the modal interval.
- $n_i$ is the absolute frequency of the modal interval.
- $n_{i-1}$ is the absolute frequency of the interval before the modal interval.
- $n_{i+1}$ is the absolute frequency of the interval after the modal interval.
- $c_i$ is the width of the interval (in this formula $c_i$ denotes the width, not the class mark).

The mode has the following characteristics:

- A distribution can have more than one mode. In that case we say that we have a bimodal, trimodal, ... distribution, depending on the number of values that reach the highest absolute frequency.
- The mode is usually less representative than the mean, except for qualitative data.
- If we have intervals of different widths, we have to look for the interval with the highest frequency density (the result of dividing the absolute frequency by the width of the interval, $n_i / c_i$) and then use the preceding formula.

The median, for a variable that is not grouped and once we have sorted our data, is the central value if there is an odd number of observations, and the mean of the two central values if there is an even number of data.
If we have a grouped variable, we have to look for the central interval (the one in which we can find the central value), that is to say, the first interval for which $N_i$ is bigger than $\frac{n}{2}$, and then we apply the formula

$$Me = L_{i-1} + \frac{\frac{n}{2} - N_{i-1}}{n_i}\, c_i,$$

where:

- $L_{i-1}$ is the lower limit of the central interval.
- $n_i$ is the absolute frequency of the central interval.
- $N_{i-1}$ is the cumulative absolute frequency of the interval before the central interval.
- $n$ is the number of data.
- $c_i$ is the width of the interval.

Moreover, the quantiles are position measures that generalize the concept of the median. We are now going to define the centiles or percentiles, the quartiles and the deciles. We suppose that we have sorted our data.

The centiles or percentiles are the values of the variable that leave a given percentage of the data on their left. We denote them by $P_h$ or $C_h$, where $h$ is the percentage, $h = 1, 2, \ldots, 99$. If we have a grouped variable, once we have found the interval that contains the centile we apply the formula

$$P_h = C_h = L_{i-1} + \frac{\frac{h\,n}{100} - N_{i-1}}{n_i}\, c_i,$$

where the different elements have the same meaning as in the case of the median.

The quartiles are the values that, once we have sorted the data, divide them into 4 equal groups, so that each group contains 25% of the data points. We denote them by $Q_1$, $Q_2$ and $Q_3$, and they satisfy $Q_1 = C_{25}$, $Q_2 = C_{50} = Me$, $Q_3 = C_{75}$.

The deciles are the values that, once we have sorted the data, divide them into 10 equal groups, so that between any two consecutive deciles there are 10% of the data points. We denote them by $D_1, D_2, D_3, \ldots, D_9$. They satisfy $D_1 = C_{10}$, $D_2 = C_{20}$, $D_3 = C_{30}$, ..., $D_9 = C_{90}$.

Exercise. For the data on number of brothers/sisters and on weight, calculate the mean, mode, median and the quantiles $Q_1$, $Q_3$, $C_{30}$, $C_{74}$, $D_4$, $D_{[...]}$.

1.8 Measures of variability: Range, variance, standard deviation

Imagine that we have 3 different data sets about the weights of certain people, and we know that in all 3 cases the mean of the variable "weight" is 55. Does this mean that the 3 sets are equal or similar?
We get the data and we find that the observations are:

Set 1: [...]
Set 2: [...]
Set 3: [...]

We can see that, though they have the same mean, the data sets are very different. Look at their stem and leaf plots:

[...]

Then, how can we detect those differences among the data sets? It seems that the measures of central tendency do not give us enough information in every situation, so we have to look for other measures that tell us how far the data are from the mean. That is, we need the concept of variability of the data.

The first thing we notice is that in the first set all the data are equal, in the second one there is a little more difference between the biggest and the smallest values, and in the third set this difference is even more obvious. Precisely, subtracting the smallest value from the biggest one in each set gives 0 for the first set, a small number for the second and 32 for the third. These numbers are called the range of the data. Although the range is very easy to calculate, it is not used very much, because a single very small or very big value in our data changes it a lot, so it is not a useful measure in every situation.

How can we find a number that gives us an idea of the distance between the data and the mean? We can calculate the distance from every data point to the mean (in absolute value) and then calculate the mean of those distances. This is what we call the mean deviation. For the second group of data, which has 7 observations, the distances add up to 26, so the mean deviation is

$$\frac{26}{7} \approx 3.71.$$

Nevertheless, we usually use a different measure of variability, the mean of the squared deviations of the data from the mean; in this way the biggest deviations have a greater influence. We now present the formal definitions of all these concepts.

The range is the difference between the biggest and the smallest value of the variable, if it is not grouped. If we have a grouped variable, we calculate the difference between the upper limit of the last interval and the lower limit of the first interval. The range depends only on the biggest and the smallest elements, and not on the rest of the data.
For instance, we could have the following two data sets with the same range:

Figure 1.9: range

It is easy to see that the difference between $x_k$ and $x_1$ is the same in both situations, but the two sets are very different.

The interquartile range is the difference between the third and the first quartiles, and it gives us a zone that contains the central 50% of the distribution.

The mean deviation is the mean of the deviations of the data from the mean. We call deviation from the mean the absolute value of the difference between a value of the variable and the mean, $|x_i - \bar{x}|$, so the definition of the mean deviation is

$$DM = \frac{\sum_{i=1}^{k} |x_i - \bar{x}|\, n_i}{n}.$$

This measure is not used very often, because the absolute value function makes it awkward to work with. Anyway, a small mean deviation means that the data are highly concentrated around the mean.

We can also define the median deviation, though it is even less usual. The definition is

$$D = \frac{\sum_{i=1}^{k} |x_i - Me|\, n_i}{n}.$$

The variance is the mean of the squared deviations of the data from the mean. We denote it by $S^2$ and its expression is

$$S^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2\, n_i}{n} = \frac{\sum_{i=1}^{k} x_i^2\, n_i}{n} - \bar{x}^2.$$

The variance has the following characteristics:

- As we are taking the square of the deviations, the bigger ones have more influence on the result.
- The units of $S^2$ are not the same as those of the sample, because we have the square of the deviations.
- The variance is never negative. It is 0 when all the values coincide with the mean.

We define the quasivariance as

$$s^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2\, n_i}{n - 1}.$$

Its relation to the variance is $S^2 = \frac{n-1}{n}\, s^2$. This is a very useful measure when we work with inference. Sometimes it is also denoted by $S_c^2$.

The standard deviation is the positive square root of the variance. We denote it by $S$ and its expression is

$$S = +\sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2\, n_i}{n}} = +\sqrt{\frac{\sum_{i=1}^{k} x_i^2\, n_i}{n} - \bar{x}^2} = +\sqrt{\overline{x^2} - \bar{x}^2}.$$

Its main properties are:

- It is the most usual measure of variability.
- It has the same units as the sample.
- The standard deviation is always positive or 0.

Moreover, the variance and the standard deviation satisfy:

- If we add a constant to all the data, the variance and the standard deviation stay the same.
- If we multiply all the values by a positive constant, the variance is multiplied by the square of the constant and the standard deviation is multiplied by the constant.

1.9 Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores

1.9.1 Tchebicheff's theorem

We have already found measures that give us the center of the data and their variability, but we still need more information. Let us recall the data about number of brothers/sisters:

Num. brothers   absolute fr.
0               6
1               13
2               7
3               3
4               1

so we have that

$$\bar{x} = 1.3333, \quad S^2 = 1.022, \quad S = 1.011.$$

How many people are there around the mean? Are there many students who have 1 or 2 brothers/sisters? Let us take an interval centered on the mean, that is, $(\bar{x} - a, \bar{x} + a)$. We know that the variance and the standard deviation measure variability, so we will try to use them now. Which one would you use? We have to reject the variance, because we cannot add it to the mean: they have different units. Let us then take the standard deviation, $a = S$. We get the interval

$$(1.3333 - 1.011,\ 1.3333 + 1.011) = (0.3223,\ 2.3443).$$

Inside this interval we find the students who have 1 or 2 brothers/sisters. These are 20 of the 30 students, i.e., about 67% of them. What happens if we use $2S$ instead of $S$? We get the interval

$$(1.3333 - 2.022,\ 1.3333 + 2.022) = (-0.6887,\ 3.3553).$$

Inside this interval we have 29 of the 30 students, i.e., about 97% of them. Obviously, if we calculate the interval for $3S$ we find that all the data are inside it.

But the next question is: does this always happen? Are these concentrations of data always the same? Let us see another example using the weekly pay. We have that

\[ \bar{x} = 13, \quad S^2 = 39.2, \quad S = 6.26 \]

Then,

$(13 - 6.26, \; 13 + 6.26) = (6.74, \; 19.26)$ contains 19 data (63%)
$(13 - 2 \cdot 6.26, \; 13 + 2 \cdot 6.26) = (0.48, \; 25.52)$ contains 29 data (97%)
$(13 - 3 \cdot 6.26, \; 13 + 3 \cdot 6.26) = (-5.78, \; 31.78)$ contains 30 data (100%)

As you can see, we get very similar results. This is because there is a theorem that guarantees a certain percentage of the data in these intervals. Precisely, the theorem states that, for $a > 1$, an interval of the form $(\bar{x} - aS, \; \bar{x} + aS)$ contains at least $100\left(1 - \frac{1}{a^2}\right)\%$ of the data. This statement is known as Tchebicheff's theorem.

Pearson's coefficient of variation

We are going to work now with the height and weight data. For the weight we have

\[ \bar{x} = 60.8, \quad S^2 = 99.56, \quad S = 9.97 \]

while for the heights we have

\[ \bar{x} = [\ldots], \quad S^2 = [\ldots], \quad S = [\ldots] \]

In which case do we have more variability? We could think that it is the weight data, because the variance and the standard deviation are bigger, but look what happens if we calculate the same for the heights measured in centimeters:

\[ \bar{x} = [\ldots], \quad S^2 = [\ldots], \quad S = [\ldots] \]

If we repeat the question now, what would you answer? In fact, we can compare neither standard deviations nor variances, because they depend on the units, just like the mean. We should find a dimensionless measure. So far, we only know that the mean and the standard deviation have the same units, so how can we get a dimensionless measure from them? We can divide one by the other, and then we get Pearson's coefficient of variation,

\[ CV = \frac{S}{\bar{x}} \]

We can calculate it for our examples. For the weight we have that

\[ CV = \frac{9.97}{60.8} \approx 0.164, \]

and for the height

\[ CV = [\ldots] \]

so we find more variability in the weights than in the heights.

Z-scores

We can still find more information in our data. Imagine that your height is 1.74 and you have a friend in another class whose height is the same. Within each class, which of you is taller? How can we compare these two data if we only know that the mean in your friend's class is [...] and the standard deviation is 12.53?

There is a way to change these two data into comparable values. These are what we call z-scores: each one is calculated as the difference between the value and its mean, divided by the standard deviation. With this, the two new values belong to distributions with mean 0 and standard deviation 1, and so we can compare them. In our example we have the following z-scores,

\[ z_1 = [\ldots], \qquad z_2 = [\ldots] \]

and we conclude that your friend is taller than you (each one within his or her own class), because his or her z-score is bigger.

The formula for the z-score related to the datum $x_i$ is

\[ z_i = \frac{x_i - \bar{x}}{S} \]
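The measures used in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frequency table `siblings` is hypothetical (it is not the poll data), and the function names `summary`, `share_within` and `z_score` are introduced here for the example. The sketch computes the mean, variance, quasivariance and standard deviation from a frequency table, compares the share of data inside $(\bar{x} - aS, \bar{x} + aS)$ with the bound $1 - 1/a^2$ from Tchebicheff's theorem, and evaluates the coefficient of variation and a z-score.

```python
from math import sqrt


def summary(freq):
    """freq maps each value x_i to its absolute frequency n_i."""
    n = sum(freq.values())
    mean = sum(x * k for x, k in freq.items()) / n
    # Variance: mean of the squared deviations from the mean.
    variance = sum((x - mean) ** 2 * k for x, k in freq.items()) / n
    # Quasivariance divides by n - 1 instead of n.
    quasivariance = variance * n / (n - 1)
    return n, mean, variance, quasivariance, sqrt(variance)


def share_within(freq, mean, std, a):
    """Fraction of the data strictly inside (mean - a*std, mean + a*std)."""
    n = sum(freq.values())
    inside = sum(k for x, k in freq.items() if abs(x - mean) < a * std)
    return inside / n


def z_score(x, mean, std):
    """Distance from the mean measured in standard deviations."""
    return (x - mean) / std


# Hypothetical frequency table (NOT the poll data): value -> absolute frequency.
siblings = {0: 6, 1: 12, 2: 8, 3: 3, 4: 1}

n, mean, var, quasi, std = summary(siblings)
cv = std / mean  # Pearson's coefficient of variation

print("n =", n, "mean =", round(mean, 4), "S^2 =", round(var, 4), "S =", round(std, 4))
print("quasivariance =", round(quasi, 4), "CV =", round(cv, 4))

for a in (2, 3):
    bound = 1 - 1 / a ** 2  # lower bound given by Tchebicheff's theorem
    share = share_within(siblings, mean, std, a)
    print("a =", a, "share inside =", round(share, 4), "bound =", round(bound, 4))

print("z-score of the value 3:", round(z_score(3, mean, std), 4))
```

The bound from the theorem is deliberately loose: it holds for any data set, so the observed share is usually well above it.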

Chapter 2 Analysis of the opinion poll

We are now going to make a deeper analysis of some of the questions in the opinion poll. We have chosen 3 questions:

2.1 You smoke
2.3 You read books other than school books
3.1 You practice some sport outside the high school

The data we have from question 2.1 are [...], from question 2.3 we have [...], and from question 3.1 [...].

The first thing we are going to do is to calculate the frequencies in all cases, in order to have the frequency tables for all of them. For question 2.1 we have that

Answer (2.1)   abs. fr.   rel. fr.   perc. fr.   cum. abs. fr.   cum. rel. fr.
[...]
[...]          [...]      [...]      [...] %     30              1

For question 2.3 we have the following frequency table

Answer (2.3)   abs. fr.   rel. fr.   perc. fr.   cum. abs. fr.   cum. rel. fr.
[...]
[...]          [...]      [...]      [...] %     30              1

and finally, the frequency table for question 3.1 is

Answer (3.1)   abs. fr.   rel. fr.   perc. fr.   cum. abs. fr.   cum. rel. fr.
[...]
[...]          [...]      [...]      [...] %     30              1

Just by looking at the data in the tables, we can notice that the three are very different. We will now try to see graphically how these variables are distributed, and then we will talk about the first conclusions. As you can notice, we have three discrete variables, so we are going to use the bar graph and the pie chart. These are the graphs for question 2.1:

Figure 2.1: answers to question 2.1

Let us now represent the graphs for question 2.3, and after them the ones for question 3.1:

Figure 2.2: answers to question 2.3

Figure 2.3: answers to question 3.1

We can now talk about the first conclusions. It is quite obvious that for question 2.1 the most frequent values are the extreme ones, 1 and 5. This is because there is a tendency to associate number 1 with the people that don't smoke and number 5 with the people that do smoke. In any case, most of the data are placed in the bigger values (3, 4 and 5). On the contrary, in question 2.3 we can see that the most frequent values are the smaller ones, so we can say that reading is not a very popular hobby. The answers to the third question are a little more spread over all the values.

It is also interesting in this example to represent a bar graph with the cumulative absolute frequencies. We show you the three graphs, in which you can see that the frequencies are more gradually distributed in the third case.

In any case, we are now going to confirm what we see by calculating the main measures of central tendency. We are going to present them in a table, in order to make it easier to compare them:

Figure 2.4: cumulative bar graphs

         Mean    Median   Mode
Q 2.1    3.6     5        5
Q 2.3    [...]   [...]    [...]
Q 3.1    3       3        3

This table gives us some interesting information. It is quite simple to see that, though the mean for question 2.1 is 3.6, most of the data are bigger than the mean, because both the median and the mode are 5. For question 2.3 the situation is very different: we can see that most of the data are around the smallest values, and even the mode is the smallest one. In question 3.1 we can notice that the 3 values coincide, so number 3 is the best one to represent our data.

Let us now calculate the main measures of variability, and then we will try to see which variable is more spread.

         Range   Variance   Standard deviation
Q 2.1    4       [...]      [...]
Q 2.3    3       [...]      [...]
Q 3.1    4       [...]      [...]

In our example, the range is not very relevant, because all the answers lie between 1 and 5. The only thing we can learn from the fact that in question 2.3 the range is 3 (smaller than the others) is that one of the extreme values has frequency 0 (in this case, value 5). But the range does not detect other empty values: for example, in question 2.1 the frequency of value 2 is also 0.

From the standard deviation we can conclude that the answers to question 2.1 are very spread. This is true because, if you take a look at the data, you can find that most of them are extreme values, 1 or 5. The other two variables are a bit more concentrated around the mean, especially the answers to question 2.3.

Let us now check whether the mean is representative in our variables. We shall calculate the coefficient of variation in each case. We have that

         Coefficient of variation
Q 2.1    [...]
Q 2.3    [...]
Q 3.1    [...]

So the mean is representative for the three cases we are studying.

2.1 Conclusions

In this last section of the analysis, it is important to stress the meaning of the data we are studying. Until now, we have been talking about the statistical characteristics of the data, but we cannot forget that all those data have their own meaning.

We can notice that smoking is very popular among young people. More than half of this class say that they smoke every day, and only 8 people say that they never smoke. If we add up the frequencies of the students that smoke at least sometimes, we get 22 of you, almost three quarters of the total.

On the contrary, there is very little interest in reading. 22 of you say that you never or rarely read a book other than the ones you need for school. This is maybe one of the biggest contrasts we can get from the poll. None of you say that you read every day, though there are 5 people that say they read usually.

Sports are the middle ground.
This may be because many of you practice some sport at the weekends or when the weather is good, while those who practice sports very often balance out those who almost never practice any sport.
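The steps followed in this chapter (frequency table, median and mode for answers on a 1 to 5 scale) can be reproduced with a short Python sketch. The list `answers` below is hypothetical and is not the poll data; the function names `frequency_table`, `median` and `modes` are introduced only for this illustration.

```python
from collections import Counter


def frequency_table(answers, values=range(1, 6)):
    """Rows: (value, abs fr, rel fr, perc fr, cum abs fr, cum rel fr)."""
    counts = Counter(answers)
    n = len(answers)
    rows = []
    cumulative = 0
    for value in values:
        k = counts.get(value, 0)  # a value nobody chose has frequency 0
        cumulative += k
        rows.append((value, k, k / n, 100 * k / n, cumulative, cumulative / n))
    return rows


def median(answers):
    """Middle value of the sorted data (mean of the two middle ones if n is even)."""
    data = sorted(answers)
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2


def modes(answers):
    """All values that reach the highest absolute frequency."""
    counts = Counter(answers)
    top = max(counts.values())
    return sorted(v for v, k in counts.items() if k == top)


# Hypothetical answers on the 1-5 scale (NOT the poll data).
answers = [1, 1, 2, 3, 3, 3, 4, 5, 5, 3]

table = frequency_table(answers)
for row in table:
    print(row)
print("median:", median(answers), "modes:", modes(answers))
```

In the last row of such a table, the cumulative absolute frequency equals the number of answers and the cumulative relative frequency equals 1, as in the tables of this chapter.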

Chapter 3 Two-dimensional Descriptive Statistics

In the previous chapter, we worked with the data we got from a poll and we obtained the first conclusions. But we want to know more, because we can extract more information from those data with certain methods that we are going to study from now on. Before going on, we will state our objectives in this chapter.

3.1 Objectives

To represent and analyze data on two variables through a scatterplot.
To identify as a two-dimensional distribution a data set on two variables given in a table or by a scatterplot.
To analyze the relationship between two variables through their scatterplot, establishing by intuition whether this relationship is positive or negative, whether it is functional or not and, in that case, whether it approximates a line.
To compare global tasks of several distributions through their scatterplots.
To assign given scatterplots to different situations.
To determine the relationship between the different means through the scatterplot.
To find, in a graphical way, a line that fits the scatterplot.
To estimate the correlation coefficient from a scatterplot.
To analyze the degree of the relationship between two variables when the correlation coefficient is known.

To calculate the correlation coefficient in two-dimensional distributions and the regression lines.
To make predictions from the regression line.

3.2 The example: an opinion poll

In this chapter we will go deeper into the analysis of the opinion poll we have been working with. From the information that we already have, we will try to answer questions like: Is there any relationship between the pay you receive and the number of brothers/sisters you have? Does the sport you practice have any influence on how much you smoke or how much alcohol you drink? Can we measure these relationships precisely? Throughout this chapter we will try to answer these questions and many more. From now on we present the concepts that will be necessary to get these answers.

3.3 Introduction and simple tables

We can think about many variables that can have an influence on many others. For instance, we can think that the older you are, the bigger the pay you get. We are going to see if that is really true. So, as you already know from the previous chapter, the first thing we have to do is to organize our data. We recall that the data about ages and pays that we had are the following:

Age   Pay   Age   Pay
[...]

These are the pairs of data that we have. Let us start by grouping the pairs that are equal. We get the following table

Age   Pay   Number
[...]

The table we have just built is called a simple table, and it will be the starting point for our analysis.

3.4 Frequency tables, marginal distributions and conditional distributions

Is it simple for you to draw conclusions from the previous table? Can we find any other way to represent our data? The idea is to avoid the repeated values that we can see in the column of ages and also in the column of pays. We can group our data in the following way

Pay \ Age
[...]

This table allows us to have a more global view of the distribution of the frequencies, and the more different values we have, the more useful the table is. We call it a table on two variables when we are representing two quantitative variables, and a contingency table when we have two qualitative variables.

But from these tables, can we obtain the total number of people whose pay is 12 euros? And the total number of people whose age is 17? Obviously, the answer is yes. Notice that you can add up all the frequencies appearing in the row related to value 12 of the pay, and so we get the number of people whose pay is 12. In the same way, we can add up all the frequencies in the column related to value 17 of the age, and we will have the total number of people that are 17. We add these numbers to our table and we have

Pay \ Age   [...]   Tot
[...]
Tot         [...]

In fact, what you have just obtained are the values of each of the two variables independently of the other. These values are called the marginal distributions of the variables. To obtain the whole marginal distribution of the variable age we take the first and the last row,

Age         [...]
frequency   [...]

We can also do this for the variable pay, taking the first and the last column.

Exercise Can you build the similar table for the variable pay?

In general, a table on two variables is defined as follows:

\[
\begin{array}{c|cccccc|c}
X \backslash Y & y_1 & y_2 & \cdots & y_p & \cdots & y_m & \text{Tot} \\ \hline
x_1 & n_{11} & n_{12} & \cdots & n_{1p} & \cdots & n_{1m} & n_{1\cdot} \\
x_2 & n_{21} & n_{22} & \cdots & n_{2p} & \cdots & n_{2m} & n_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
x_s & n_{s1} & n_{s2} & \cdots & n_{sp} & \cdots & n_{sm} & n_{s\cdot} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
x_k & n_{k1} & n_{k2} & \cdots & n_{kp} & \cdots & n_{km} & n_{k\cdot} \\ \hline
\text{Tot} & n_{\cdot 1} & n_{\cdot 2} & \cdots & n_{\cdot p} & \cdots & n_{\cdot m} & n
\end{array}
\]

where the values or characteristics of $X$ are $x_1, x_2, \ldots, x_k$ and the ones of $Y$ are $y_1, y_2, \ldots, y_m$; $n_{ij}$ is the number of data points presenting characteristic $x_i$ for the variable $X$ and $y_j$ for the variable $Y$. Moreover, $n_{i\cdot}$ denotes the number of data points presenting the characteristic $x_i$ and $n_{\cdot j}$ the number of data points presenting the characteristic $y_j$; $n$ is the total number of elements of the population or the sample.

Once we know the marginal distributions, we can calculate the mean and the standard deviation of each of them as if they were one-dimensional variables. Their expressions are:

\[ \bar{x} = \frac{\sum_{i=1}^{k} x_i \, n_{i\cdot}}{n}, \qquad \bar{y} = \frac{\sum_{j=1}^{m} y_j \, n_{\cdot j}}{n} \]

\[ S_x = \sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, n_{i\cdot}}{n}}, \qquad S_y = \sqrt{\frac{\sum_{j=1}^{m} (y_j - \bar{y})^2 \, n_{\cdot j}}{n}} \]

Exercise What are the mean and the standard deviation of the pay and of the age?

One of your classmates has a question. He is 17 and he wants to know whether his pay is among the higher or the lower ones, so that he can ask for a raise if it is too low. To find out, he wants to compare himself with all the other students of his age, so he takes out the data of those students having his age:

Pay        [...]
Age = 17   [...]

As this boy has a pay of 10 euros, he decides that most of his classmates have a higher pay than him, so he is going to ask for a raise.

What we have just calculated is the conditional distribution of the variable pay for a fixed value of the age, in this case 17. We again have a one-dimensional variable, for which we can calculate the measures of central tendency and of variability that we already know.

Exercise Calculate the frequency table for the variable age for pay = 15 euros.

Exercise Calculate the frequency table, with the marginal frequencies, for the weight and the answer to question [...].

3.5 Scatterplots

As usually happens for one-dimensional variables, data are more easily analyzed if we represent them in a graph. The situation now is different, though, because we need to represent two variables, each one with its frequencies. To do that we use a graph called a scatterplot.
We are now going to explain how to draw it: we represent the variable pay on the OX axis and the variable age on the OY axis. For each pair of values, we either draw a single point whose size is proportional to its frequency, or we draw as many points as the frequency indicates.