CHAPTER 11. STATISTICAL ANALYSIS IN SCIENTIFIC RESEARCHES

11.1 Basics of Statistics

Statistics is a branch of mathematics that deals with the collection, organization, and analysis of numerical data, and with problems such as experiment design and decision making.

The term statistics comes from the Italian word statista (meaning "statesman"), but the term actually derived from statista was Statistik, first used by Gottfried Achenwall, a professor at Marburg and Göttingen. The word statistics itself was introduced in England by E.A.W. Zimmerman. Even before the eighteenth century, however, people were able to record and use data. Statistics became popular with Sir John Sinclair's work Statistical Account of Scotland.

Statistics offers various techniques that can be applied in every branch of public and private enterprise. Statisticians generally divide the field into two main parts: descriptive statistics and inferential statistics. In short, descriptive statistics involves no generalization from sample to population (see footnote 1). We describe the data with tables, charts, or graphs, and the description does not extend to any other data or population.

In inferential statistics, on the other hand, we generalize from sample to population, so the conclusions go beyond the data at hand. Such a generalization may not be true or valid, and because it rests on estimation the statistician should specify how likely it is to be true. Inferential statistics is also called statistical inference. Statistical inference can also be applied in decision theory, which is a branch of statistics, because the two are closely related: decisions are made under conditions of uncertainty.
So statistical inference is very effective in decision making.

Arranging Data: Data Array, Frequency Distributions, and Cross-Tabulations

Data are collections of any number of related observations. For example, we can collect information about the students of Eastern Mediterranean University (EMU) in the Turkish Republic of Northern Cyprus (TRNC) and divide them into categories such as nationality, gender, and age group. A collection of data is called a data set, and a single observation in the data is called a data point. People can gather data from past records or by observation, and they can use data on the past to make decisions about the future. Data therefore play a very important role in decision making.

Most of the time it is not possible to gather data for the whole population, so statisticians gather data from a sample and use this information to make inferences about the population that the sample represents. A population is a whole, whereas a sample is only a fraction of the population. (Footnote 1: The concepts of sample and population will be discussed later.)

Assume that there are currently 10,300 students in EMU, and we want to evaluate the expectations and opinions of EMU students about the University. It would be very hard to survey all the students, so we select a fraction of them. If we decide to take 15% of the total, we select 0.15 × 10,300 = 1,545 students; this number is called the sample size. The total number of students (10,300) is called the population size.

A sample can be selected randomly or non-randomly. When data are selected randomly, all observations have an equal chance of being included regardless of their characteristics. When data are not selected randomly, the selection is biased with respect to some characteristic of the observations.

To use data efficiently for any purpose, we need to arrange them, and the arrangement can take various forms. Data that have not yet been arranged and analyzed are called raw data; they are still unprocessed by statistical methods.

Data Array

The first form of arranging data is the data array. It is one of the simplest ways to present data: it arranges the observations in ascending or descending order.

Table 11.1 Grades of Students (columns: Raw Data; Data Array, Ascending and Descending)

With a data array we can immediately see the lowest and highest values, divide the data into sections, and see whether a value appears more than once. With large quantities of data, however, a data array is not very helpful, and we need another method.

Frequency Distributions

The second form of arranging data is the frequency distribution, one of the best-known ways of presenting data in statistics. It divides the data into classes with lower and upper limits and shows the number of observations that fall into each class.
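The two arrangements described so far can be sketched in a few lines of Python. The grades below are hypothetical (they are not the values of Table 11.1), and the class width of 10 is an assumption made only for illustration.

```python
from collections import Counter

# Hypothetical raw grades (NOT the values of Table 11.1).
raw_grades = [72, 55, 91, 64, 55, 83, 47, 78, 64, 55]

# Data array: the same observations in ascending and descending order.
ascending = sorted(raw_grades)
descending = sorted(raw_grades, reverse=True)

# Frequency distribution with assumed classes of width 10:
# 40-49, 50-59, 60-69, ... Each grade is mapped to its class lower limit.
class_width = 10
frequency = Counter((g // class_width) * class_width for g in raw_grades)

total = len(raw_grades)
for lower in sorted(frequency):
    count = frequency[lower]
    share = count / total  # frequency expressed as a fraction of the total
    print(f"{lower}-{lower + class_width - 1}: {count} ({share:.2f})")
```

Every grade falls into exactly one class, so the class counts add up to the number of observations.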
We can also express the frequency of each class as a fraction or percentage of the total number of observations, which gives a relative frequency distribution. Table 11.2 shows a frequency distribution together with its relative frequency distribution.

Table 11.2 Frequency Distribution of Student Grades (columns: Class, Frequency, Relative Frequency; last row: Total)

As you will notice from the table, the relative frequencies of all classes sum to 1.00, or 100%. The sum can never exceed this value, because each relative frequency is the frequency of a class divided by the total.

The classes in a frequency distribution are all-inclusive: all the data fit into one category or another. The classes are also mutually exclusive: no data point can fall into more than one category.

Frequency distributions can be qualitative or quantitative, open-ended or closed-ended, and discrete or continuous. We can classify data according to quantitative characteristics such as age group, salary, or income level, or according to qualitative characteristics such as sex, occupation, or nationality. We can also arrange the data with open-ended or closed-ended classes. In an open-ended scheme the last class of the frequency distribution is left without a limit. Lastly, the classes can be discrete or continuous. Discrete data consist of separate values that do not progress from one class to another without a break (e.g. 1, 2, 5, 10, 100). Continuous data consist of values that do progress from one class to another without a break (e.g. 1.1, 1.2, 22.5). Various types of frequency distributions are shown below:

Table 11.3 Types of Frequency Distribution Tables
(a) Quantitative and Discrete Data with Open-Ended Class (columns: Income level ($), Frequency, Relative Frequency; last row: TOTAL)
(b) Qualitative Data (columns: Gender, Frequency, Relative Frequency; rows: Male, Female, TOTAL)

(c) Continuous Data with Closed-Ended Classes (columns: Student GPAs, Frequency, Relative Frequency; last row: TOTAL)

Cross-Tabulations

The third form of arranging data is the cross-tabulation, a two-way table that presents data on two separate characteristics along its row and column dimensions. Consider Table 11.4 (a), the distribution of income level with respect to gender: gender is on the row dimension and income level on the column dimension. Table 11.4 (b) shows the same two-way distribution both in absolute numbers and in relative frequencies (percentages).

Table 11.4 (a) Cross-Tabulation of Income Level with Respect to Gender (rows: Male, Female, Column Total; columns: Income Level ($) classes, Row Total)

Table 11.4 (b) Cross-Tabulation of Income Level with Respect to Gender, in absolute numbers and percentages (rows: Male, Female, Column Total; columns: Income Level ($) classes, Row Total)

Interpreting these two-way tables is essential in statistics, especially in scientific research and in decision making. In Table 11.4 (b), for example, the sample size is 50, and 35% of the males in the sample have an income level between 0 and $500, which corresponds to 7 of the 20 males.
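Each percentage read from a cross-tabulation is a cell count divided by a row total, a column total, or the sample size. The sketch below recomputes the three readings from the four counts quoted in the text for Table 11.4 (b); the variable names are illustrative and not part of the original table.

```python
# Counts quoted in the text for Table 11.4 (b).
sample_size = 50        # n, the grand total
males = 20              # row total for males
low_income = 15         # column total for income between 0 and $500
males_low_income = 7    # cell: males with income between 0 and $500

# Row percentage: share of males who fall in the 0-$500 class.
row_pct = 100 * males_low_income / males
# Column percentage: share of the 0-$500 class that is male.
col_pct = 100 * males_low_income / low_income
# Total percentage: share of the whole sample in this cell.
total_pct = 100 * males_low_income / sample_size

print(f"row: {row_pct:.1f}%  column: {col_pct:.1f}%  total: {total_pct:.1f}%")
```

The same cell of 7 persons therefore yields three different percentages depending on which total serves as the base.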

Of the persons with an income level between 0 and $500, 46.7 percent are male: 7 of the 15 persons in that income class. Lastly, males with an income level between 0 and $500 make up 14% of the total sample of 50. The 20 males constitute 40% of the sample (n = 50), and the 15 persons with an income level between 0 and $500 constitute 30% of the sample.

For large data sets it is hard and time consuming to organize data into frequency distributions or cross-tabulations by hand. Nowadays computer packages, especially SPSS (Statistical Package for Social Sciences), make it very easy to create these types of tables. We will study these subjects in the following chapters.

Using Graphs to Describe Distributions

We can represent the distribution of data (especially a frequency distribution) in various forms of graphs. Graphs of distributions usually have two dimensions, X and Y. The X-axis shows the values or characteristics of the variable, and the Y-axis shows their frequency in absolute or relative terms. Graphs with relative frequencies are more useful because they attract more attention from the reader and are easier to understand and to base decisions on. Nowadays there are advanced computer packages that are effective for drawing graphs; we will discuss them in later chapters. Figure 11.1 includes a few examples of the types of graphs available in Microsoft Excel 97 for Windows.

Figure 11.1 Types of Graphs: (a) Column Bar Graph, (b) Line Graph, (c) Pie Charts, (d) XY (Scatter) Graphs

11.4 Measures of Central Tendency and Dispersion

After data have been collected and tabulated, analysis begins with the calculation of single numbers that summarize or represent all the data; these are called summary statistics. We use summary statistics to describe the characteristics of a data set. Nowadays almost every statistical package provides summary statistics in its output. Two kinds of summary statistics are especially important for decision-makers: central tendency and dispersion. Before we get into the details of these two concepts, let's briefly define them.

Central Tendency

Data often cluster around a central point, and a number that locates this point is called a measure of central tendency. It refers to the central or middle point of a distribution. Measures of central tendency are also called measures of location. The concept can be shown in a graph:

Figure 11.2 Central Tendency for Three Types of Distribution (Curve A, Curve B, Curve C)

The figure shows that the central locations of curves A and C are equal, and that the central location of curve B lies to the right of those of curves A and C.

Dispersion

Dispersion refers to the spread of the data in a distribution. Notice in Figure 11.2 that Curve B has the widest spread, or dispersion, of the three, and that Curve C has a wider spread than Curve A.

Besides central tendency and dispersion, an investigator may benefit from two other measures of a data set: skewness and kurtosis.

Skewness

The curve of any distribution may be either symmetrical or skewed. In a symmetrical curve, the vertical line drawn from the peak of the curve to the horizontal axis divides the area into two equal parts. For example, we know that the total of a relative frequency distribution is equal to 1.00, or 100%. In a symmetrical curve, 50% of the data lie on the left-hand side of this line and the other 50% on the right-hand side.

Figure 11.3 Symmetrical Curve (50% of the area on each side of the peak)

Curves A and B in Figure 11.4, on the other hand, are skewed. Their frequency distributions are concentrated at either the low end or the high end of the measuring scale on the horizontal axis. Curve A is said to be positively skewed, and curve B negatively skewed.

Figure 11.4 Positively and Negatively Skewed Curves (Curve A: skewed to the right; Curve B: skewed to the left)

Kurtosis

Kurtosis is the peakedness of a distribution. Notice in Figure 11.5 that the two curves have the same central location and dispersion, and both are symmetrical, but Curve A is more peaked than Curve B.

Figure 11.5 Measure of Degree of Kurtosis (Curve A, Curve B)

Measures of Central Tendency

In statistics, the arithmetic mean, weighted mean, geometric mean, median, and mode are referred to as the measures of central tendency. We first consider the arithmetic mean.

The Arithmetic Mean

The arithmetic mean is the simple average of a data set. We can calculate the average age in a class, the average monthly expenditure of students in EMU, the average number of tourists coming to TRNC each year, and so on. The arithmetic mean of a population is represented by the symbol $\mu$ and that of a sample by $\bar{x}$. The formulas are:

Population: $\mu = \frac{\sum X}{N}$, where N represents the population size
Sample: $\bar{x} = \frac{\sum x}{n}$, where n represents the sample size

Table 11.4 provides the ages of students in a class. We assume that the data represent a sample drawn from the whole university.

Table 11.4 Ages of Students in a Class (columns: ID, Name, Age; names: Ali, Veli, Ayla, George, Mohammed, Asher, Samah, Ayse, Mahmut, John; the last age listed is 28)

Now, let's calculate the arithmetic mean for this ungrouped data. Applying the sample formula gives $\bar{x} = \frac{\sum x}{n} = 25$, so the arithmetic mean of the ages in the class is 25.

But what if the data are grouped? In grouped data we do not know the separate value of each observation, so we can only estimate the mean. In ungrouped data we know every observation, so the mean we calculate is the actual value.
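The sample formula can be checked with a short Python sketch. The ten ages below are hypothetical: they were chosen so that the mean equals 25 as in the text, and they are not the original values of the table.

```python
from statistics import mean

# Hypothetical ages for ten students (NOT the original table values),
# chosen so that the sample mean comes out at 25.
ages = [22, 24, 23, 26, 25, 27, 24, 26, 25, 28]

n = len(ages)                 # sample size
x_bar = sum(ages) / n         # x-bar = (sum of x) / n

print(x_bar)
# The standard library gives the same result.
print(mean(ages))
```

Because every observation is known, the result is the actual mean of this sample, not an estimate.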

To calculate the arithmetic mean for grouped data we use the following formula:

$\bar{x} = \frac{\sum (f \times x)}{n}$

where
$\bar{x}$ = sample mean
$\sum$ = summation
f = number of observations in each class
x = midpoint of each class
n = sample size

Let's look at the following frequency distribution of student GPAs, which is an example of grouped data.

Table 11.5 Frequency Distribution of Student GPAs (columns: Student GPAs, Frequency, Relative Frequency; last row: TOTAL)

The first step in calculating the arithmetic mean is to find the midpoint (x) of each class. To find a midpoint, we add the lower limit of the class to the lower limit of the following class and divide the sum by two. For the first class, this calculation gives 1.5. We continue in the same way until we reach the last class interval. Then we multiply each midpoint by the corresponding absolute frequency and add the products up. Lastly, we divide this sum by the total number of observations. The exercise is shown in Table 11.6.

Table 11.6 Arithmetic Mean for Student GPAs (columns: Student GPAs, Frequency, Midpoint (x), f × x; TOTAL: frequency 500, f × x 1,300)

$\bar{x} = \frac{\sum (f \times x)}{n} = \frac{1,300}{500} = 2.6$

So the estimated mean of the student GPAs from the grouped data is 2.6. A useful practice is to round the midpoints (for example, to whole cents for money data) to make the calculation easier.
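The grouped-mean steps can be sketched as follows. The class midpoints and frequencies are assumptions: the original table values did not survive, so the three classes below were chosen only because they agree with the totals quoted in the text (n = 500 and a sum of f × x equal to 1,300).

```python
# Assumed classes (midpoint, frequency). These are NOT recovered from
# Table 11.6; they are one choice consistent with its stated totals.
classes = [
    (1.5, 100),
    (2.5, 250),
    (3.5, 150),
]

n = sum(f for _, f in classes)              # total number of observations
sum_fx = sum(x * f for x, f in classes)     # sum of (f * x) over all classes
grouped_mean = sum_fx / n                   # x-bar = sum(f * x) / n

print(n, sum_fx, grouped_mean)
```

The result is an estimate, because each observation is represented by the midpoint of its class rather than by its own value.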

Today, statistical packages produce these frequency distributions ready-made, and computers calculate the arithmetic mean from the original data, so the grouped-data formula is unnecessary in that situation.

The arithmetic mean is the best known and most frequently used measure of central tendency. One of its most important uses is that it makes different data sets easy to compare. The arithmetic mean has two important disadvantages. Firstly, it is affected by extreme values. Secondly, it cannot be calculated for grouped data with an open-ended class, because such a class has no midpoint.

The Median

In its simplest meaning, the median divides the distribution into two equal parts. It is a single number that represents the most central, or middlemost, item in the data. Half of the data lie below this number and the other half lie above it.

To calculate the median for ungrouped data, we first array the data in ascending or descending order. If the number of observations is odd, the median is the most central item. Consider the following simple data in Table 11.7:

Table 11.7 Graduated Students in Each Year (columns: Year, No of Students)

Firstly, let's array the data in ascending order: 10, 13, 14, 15, 17. The most central item of this odd-numbered data is 14, which is therefore the median of the data set.

Another way of finding the median is to use a position formula: the median is the $\left(\frac{n+1}{2}\right)$th item in the data array, where n represents the number of items in the data. Applying this formula to the data above, the median is the $\frac{5+1}{2} = 3$rd item, which is 14. The formula is used most often for even-numbered data, where it points to the two middle items whose average is taken.
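The position rule can be sketched in Python as below. The function name `median_by_position` is illustrative; the five values are the ones arrayed in the text, while the six-value list adds a hypothetical sixth observation of 20 to show the even-numbered case.

```python
import math

def median_by_position(values):
    """Median as the ((n + 1) / 2)-th item of the ordered data.

    For an even n the position ends in .5, and the two neighbouring
    items are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                       # 1-based position
    lower = ordered[math.floor(position) - 1]    # item at or just below
    upper = ordered[math.ceil(position) - 1]     # item at or just above
    return (lower + upper) / 2

# Values arrayed in the text: the 3rd of 5 items is the median.
print(median_by_position([17, 13, 10, 15, 14]))

# Even case with a hypothetical sixth value (20): position 3.5,
# so the 3rd and 4th items are averaged.
print(median_by_position([10, 13, 14, 15, 17, 20]))
```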

To calculate the median of even-numbered data, we take the average of the two middlemost items, since there is no single most central item in the data set. The position formula above tells us which two items these are.

Now let's extend Table 11.7 to 1996 and calculate the median for the data. The number of observations is now 6.

Table 11.7 Graduated Students in Each Year, extended to 1996 (columns: Year, No of Students)

Again we sort the data in ascending order: 10, 13, 14, 15, 17, followed by the added value. From the formula, the median is the $\frac{6+1}{2} = 3.5$th item, which lies between 14 and 15. The average of 14 and 15 is $\frac{14+15}{2} = 14.5$, and that is the median of this data set. So the median number of graduated students over the period is 14.5.

For grouped data we have to find an estimated value for the median, which falls into one of the class intervals, because we do not know all the observations; we are given only the frequency distribution with its class intervals. The formula for the median of grouped data is:

$\tilde{m} = \left( \frac{(n+1)/2 - (F+1)}{f} \right) w + L$

where
$\tilde{m}$ = the median estimated for the sample distribution
L = the lower limit of the class interval containing the median
F = the cumulative sum of the frequencies up to, but not including, the median class
f = the frequency of the class interval containing the median
w = the width of the class interval containing the median
n = total number of observations in the data

When we work with the population, $\tilde{m}$ is replaced by Md and n by N.
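The grouped-median formula translates directly into a small function. In the sketch below, n = 500 and F = 100 are quoted in the text, and f = 250 follows from the cumulative frequency of 350 that it mentions; the lower limit L = 2.0 and width w = 1.0 are assumptions, since the class limits of the original table did not survive.

```python
def grouped_median(n, L, F, f, w):
    """Estimated median: ((n + 1) / 2 - (F + 1)) / f * w + L."""
    return ((n + 1) / 2 - (F + 1)) / f * w + L

n = 500      # total observations (from the text)
F = 100      # cumulative frequency before the median class (from the text)
f = 250      # frequency of the median class (350 - 100)
L = 2.0      # ASSUMED lower limit of the median class
w = 1.0      # ASSUMED class width

estimate = grouped_median(n, L, F, f, w)
print(round(estimate, 3))
```

Under these assumptions the estimate lies inside the median class, as it must, since the formula interpolates between the lower and upper limits of that class.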

Let's consider Table 11.5 from the previous examples and find the median for these data:

Table 11.8 Finding the Median for Student GPAs (columns: Student GPAs, Frequency; median class marked; TOTAL 500)

The first step is to find the class interval that contains the median. The median is the $\frac{500+1}{2} = 250.5$th item in the data. Secondly, we have to find the class interval in which the 250.5th item lies. To do that, we add the frequencies together from the first class onward until the cumulative sum reaches 250.5, and then we stop. In these data the cumulative sum reaches 350 at the median class, so the 250.5th item has already been passed and the median falls into that class. The cumulative frequency before the median class is F = 100, so the frequency of the median class is f = 350 − 100 = 250. Putting these values into the formula, together with the lower limit L and the width w of the median class, gives:

$\tilde{m} = \left( \frac{(500+1)/2 - (100+1)}{250} \right) w + L$

The result is an estimated sample median, since the data are grouped.

Unlike the mean, the median is not affected by extreme values in the data. It can be calculated even for open-ended grouped data, unless the median falls into the open-ended class.

The Mode

The mode is the value or observation that occurs most frequently in the data. If two or more distinct observations occur with equal frequencies, but none with greater frequency, the set of observations may be said not to have a mode, or to be bimodal, with modes at the two most frequent observations, or trimodal, with modes at the three most frequent observations. When a single value is repeated most often, the distribution is unimodal.

To find the mode of ungrouped data, we again array the data in ascending or descending order. Consider the following ungrouped data, which represent the final exam marks of 35 students in a class.

Table 11.9 Student Marks in Final Exam, Arrayed in Ascending Order

The most frequently repeated mark is clearly 23: it is repeated 3 times, so the mode of this ungrouped data is 23 and the distribution is unimodal. As we can observe from the data, 99 is repeated 2 times.

Now, let's consider the following table of student marks:

Table: Student Marks in Final Exam, Arrayed in Ascending Order (modified observations)

This time we changed the observations, and two observations are repeated most often: 23 and 83, each 3 times. The modes of this data set are 23 and 83, and the distribution is called bimodal.

Lastly, if three observations are repeated most often, the distribution is trimodal. Let's make one more change to the previous table:

Table 11.9 Student Marks in Final Exam, Arrayed in Ascending Order (modified again)

This time three observations are repeated most often: 23, 83, and 90, each again 3 times. However, the generally accepted rule is that a distribution in which two or more observations are repeated most often is simply called bimodal.

For grouped data, we assume that the mode is located in the class interval with the highest frequency, which is called the modal class. To find the mode from grouped data, we use the following formula:

$M_0 = L_{M_0} + \frac{f_m - f_b}{(f_m - f_b) + (f_m - f_a)} \, w$

where
$M_0$ = the mode of the frequency distribution or grouped data
$L_{M_0}$ = lower limit of the modal class
$f_m$ = frequency of the modal class
$f_b$ = frequency of the class interval below the modal class
$f_a$ = frequency of the class interval above the modal class
w = the width of the modal class
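Both ways of finding the mode can be sketched in Python. The marks list is hypothetical (the original tables did not survive), and the grouped example uses assumed class values: a modal class starting at 2.0 with width 1.0 and frequencies 100, 250, and 150 for the classes below, at, and above the mode. The function names are illustrative.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def grouped_mode(L, f_m, f_b, f_a, w):
    """M0 = L + (f_m - f_b) / ((f_m - f_b) + (f_m - f_a)) * w."""
    d_below = f_m - f_b    # excess of the modal class over the class below
    d_above = f_m - f_a    # excess of the modal class over the class above
    return L + d_below / (d_below + d_above) * w

# Hypothetical marks: 23 occurs three times, 83 and 99 twice.
marks = [23, 23, 23, 45, 60, 83, 83, 99, 99]
print(modes(marks))            # one mode: unimodal
print(modes(marks + [83]))     # 23 and 83 now tie: bimodal

# ASSUMED grouped values, chosen for illustration only.
print(grouped_mode(L=2.0, f_m=250, f_b=100, f_a=150, w=1.0))
```

The grouped formula pulls the estimate toward the neighbouring class with the larger frequency, so the mode is not simply the midpoint of the modal class.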

Let's apply this formula to find the mode of the following frequency distribution of student GPAs:

Table 11.8 Finding the Mode for Student GPAs (columns: Student GPAs, Frequency; modal class marked; TOTAL 500)

As the table shows, the modal class of this frequency distribution is the class with the highest frequency. Putting the values into the formula gives:

$M_0 = 2.60$

So the mode of this frequency distribution is 2.60. Since the data are grouped and we do not know every observation, 2.60 is an estimate of the mode.

Like the median, and unlike the mean, the mode is not affected by extreme values in the data, and we can use it even with open-ended class intervals.

Comparison of the Mean, the Median, and the Mode

Among these three measures of central tendency, the mean is the most popular and the most widely used. The mean and the median are generally preferred to the mode, because data often contain no mode at all: no value may occur more than once. How often each of the three measures is used, however, depends on the conditions and the area of research in which it is applied.

We can also compare these measures of central tendency in terms of the shape of the distribution. When a distribution is symmetrical, the mean, the median, and the mode are equal to each other. Figure 11.6 shows this relationship:

Figure 11.6 Mean, Median, and Mode in a symmetrical distribution (all three at the same point)

In this case there is no reason to prefer one measure, since they are equal to each other. But what if the distribution is skewed? Figure 11.7 shows the position of the three measures of central tendency when the distribution is skewed to the right and to the left:

Figure 11.7 Mean, Median and Mode in skewed distributions (Curve A, skewed to the right: mode, median, mean from left to right; Curve B, skewed to the left: mean, median, mode from left to right)

When the distribution is skewed, the median is the preferable measure of central tendency, because it lies between the mean and the mode in both positively and negatively skewed distributions.

Measures of Dispersion

When we compare two or more distributions using only the measures of central tendency, we may not be satisfied. We need to know more about these distributions; for example, knowing the means of two data sets may not be enough to compare them. Variability, or dispersion, gives us this additional information. We may need to know which distribution is more consistent than the other, and a measure of dispersion helps here because it measures the spread of the observations around their mean. As the dispersion of the data decreases, the consistency and reliability of the data increase, and the central location (mean, median, or mode) becomes more representative of the data as a whole.

The concept of dispersion also plays an important role in business life. For example, a financial manager may be concerned with the earnings of firms. Widely dispersed earnings indicate a higher risk, because earnings that vary widely around the mean are inconsistent.

Figure 11.8 shows the spread of three curves having the same mean. Although they have the same central location, curve A has the least spread and curve C the widest. So the distribution of Curve A is said to be more consistent and reliable than those of B and C.
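A small sketch can make the point concrete: two data sets may share a mean and still differ greatly in spread. The two earnings series below are hypothetical, and the spread is measured here with the range and the sample standard deviation from Python's standard library, both of which this chapter defines later.

```python
from statistics import mean, stdev

# Hypothetical yearly earnings for two firms with the same mean.
steady_firm = [98, 100, 102, 100, 100]
volatile_firm = [60, 140, 100, 80, 120]

for name, earnings in [("steady", steady_firm), ("volatile", volatile_firm)]:
    spread_range = max(earnings) - min(earnings)   # highest minus lowest
    spread_sd = stdev(earnings)                    # sample standard deviation
    print(name, mean(earnings), spread_range, round(spread_sd, 2))
```

The means are identical, so only the dispersion measures reveal that the second firm's earnings are far less consistent.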

Figure 11.8 Measure of Dispersion for three curves having the same mean (Curve A, Curve B, Curve C; common mean of A, B, C)

Range, Interfractile Range and Interquartile Range

These are the first measures of dispersion we consider, and all of them are distance measures. The range is the difference between the highest and the lowest values in a data set:

Range = Highest value − Lowest value

The interfractile range is the difference between two fractiles. Four kinds of fractiles are in common use:

Thirds = divide the data into 3 equal parts
Quartiles = divide the data into 4 equal parts
Deciles = divide the data into 10 equal parts
Percentiles = divide the data into 100 equal parts

Let's consider the following data on student grades. As a first example, let's divide the data into thirds and find the interfractile range between the 1/3 and 2/3 fractiles. How do we organize the data in three equal parts? We first order the data from the lowest to the highest and then specify the dimensions of the layout. Since the sample size, n, is 48, we divide 48 by 3 and get 48/3 = 16, which means that we will have 3 rows and 16 columns, a 3 × 16 format. Let's create it now:

So the 1/3 fractile is 43, the 2/3 fractile is 52, and the 3/3 fractile is 72. The interfractile range between the 1/3 and 2/3 fractiles is then 52 − 43 = 9.

As a second example, what is the interfractile range between the 30th and 70th percentiles?

30th percentile: 30% of 48 = 14.4, so the 14th element in the data
70th percentile: 70% of 48 = 33.6, so the 34th element in the data
14th element in the data = 42
34th element in the data = 55

So the interfractile range is 55 − 42 = 13.

As a third example, let's find the interquartile range, which is the difference between the first and third quartiles. Quartiles divide the data into 4 equal parts: 48/4 = 12, so the format is 4 × 12. So:

1st quartile = 1/4 fractile = 41
2nd quartile = 2/4 fractile = 49
3rd quartile = 3/4 fractile = 57
4th quartile = 4/4 fractile = 72

Interquartile Range = Q3 − Q1 = 57 − 41 = 16

If we want the range between the 1/4 and 2/4 fractiles instead, it is 49 − 41 = 8. The 30th and 70th percentiles are again 42 and 55, as in the previous example, and their range is 55 − 42 = 13. No matter whether we arrange the data into 4, 3, or any other number of equal parts, the percentiles keep the same values.
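The positional rule used in these examples (take the element whose position is the fraction times n, rounded to a whole number) can be sketched as below. The 48 grades are hypothetical, since the original data did not survive, and the helper name `fractile` is illustrative. Note that Python's `round` resolves exact .5 ties to the even neighbour, which is an implementation detail rather than part of the textbook rule.

```python
def fractile(ordered, fraction):
    """Element at 1-based position round(fraction * n) of ordered data."""
    position = max(1, round(fraction * len(ordered)))
    return ordered[position - 1]

# Hypothetical ordered grades: 48 values from 25 to 72 (NOT the original data).
grades = list(range(25, 73))

p30 = fractile(grades, 0.30)    # 30% of 48 = 14.4 -> 14th element
p70 = fractile(grades, 0.70)    # 70% of 48 = 33.6 -> 34th element
q1 = fractile(grades, 0.25)     # 12th element
q3 = fractile(grades, 0.75)     # 36th element

print("range:", grades[-1] - grades[0])
print("30th-70th interfractile range:", p70 - p30)
print("interquartile range:", q3 - q1)
```

Because each fractile is read from the same ordered list, a given percentile has the same value however the data are laid out in rows and columns.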

Variance and Standard Deviation

Variance and, especially, standard deviation are the most commonly used statistical measures of dispersion. They express the average distance of an observation from the mean of the data. We might try to measure the average distance (or deviation) of the observations from the mean with the following formula:

Average deviation = $\frac{\sum (X_i - \mu)}{N}$

where
$X_i$ = observations in the population
$\mu$ = population mean
N = population size

But when we use this formula, we find that the sum of the deviations is equal to zero, so the average deviation is also zero. To avoid this problem, we square each deviation before averaging; the average of the squared deviations is the variance.

The standard deviation is the square root of the variance. It is more widely used than the variance in statistical analyses, because the variance does not express the average dispersion in the original units; it expresses it in squared units. Taking the square root of the variance brings us back to the original units of measurement. The variance and the standard deviation of a population are given by:

Variance: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$

Standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$

Most of the time, however, it is not possible to know all the observations in the population, so we adapt the population formula to sample data. To calculate the standard deviation of a given sample, we use:

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

where
$x_i$ = each sample unit in the distribution
$\bar{x}$ = sample mean
n − 1 = sample size minus 1

What is the reason for using n − 1? It can be shown that if we select many different samples from a population, compute the variance of each sample with n as the denominator, and average the results, this average tends not to equal the population variance. To prevent this difference, we use n − 1 as the denominator.

Now let's calculate the standard deviation of student CGPAs for a randomly selected sample of 15 students.

Table 11.9 Calculating Variance and Standard Deviation for Ungrouped Sample Data of Student CGPAs (columns: CGPA, $x - \bar{x}$, $(x - \bar{x})^2$; $\bar{x}$ = 3.21; sum of deviations = 0.00)

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{15 - 1}} \approx 0.42$ = sample standard deviation

The standard deviation (s) of this sample of student CGPAs is approximately 0.42, which shows that each observation in the sample deviates from the mean ($\bar{x}$ = 3.21) by 0.42 on average, either downwards or upwards. The variance of this sample ($s^2$) is the square of the standard deviation. As you will observe from Table 11.9, the sum of the deviations of the observations from the mean is equal to zero, which is why we square each deviation before adding them up.

Calculating Variance and Standard Deviation by Using Grouped Data

Up to this point we have discussed the variance and the standard deviation for ungrouped data, that is, unprocessed raw data. But what if the data are grouped? Then we need a different formula. Since the standard deviation ($\sigma$) is the square root of the variance ($\sigma^2$), we will work only with the standard deviation. The formulas for grouped data are:

$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum f (x_i - \mu)^2}{N}}$ for population

$s = \sqrt{s^2} = \sqrt{\frac{\sum f (x_i - \bar{x})^2}{n-1}}$ for sample
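The sample formulas can be sketched in Python for both ungrouped and grouped data. The 15 CGPAs below are hypothetical (they are not the values of Table 11.9), and the function names are illustrative. The sketch also shows why the deviations are squared: their plain sum is zero up to floating-point error.

```python
import math
from statistics import stdev

def sample_std(values):
    """s = sqrt(sum((x - x_bar)^2) / (n - 1)) for ungrouped data."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def grouped_sample_std(midpoints, frequencies):
    """s = sqrt(sum(f * (x - x_bar)^2) / (n - 1)) using class midpoints."""
    n = sum(frequencies)
    x_bar = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    squared = sum(f * (x - x_bar) ** 2 for x, f in zip(midpoints, frequencies))
    return math.sqrt(squared / (n - 1))

# Hypothetical CGPAs for a sample of 15 students (NOT the original data).
cgpas = [2.4, 2.7, 2.9, 3.0, 3.0, 3.1, 3.2, 3.2,
         3.3, 3.4, 3.5, 3.6, 3.6, 3.8, 3.9]

x_bar = sum(cgpas) / len(cgpas)
deviations = [x - x_bar for x in cgpas]

print("sum of deviations:", round(sum(deviations), 10))
print("sample standard deviation:", round(sample_std(cgpas), 4))
print("sample variance:", round(sample_std(cgpas) ** 2, 4))
```

For ungrouped data the result agrees with `statistics.stdev`, which also divides by n − 1; the grouped version is only an approximation because midpoints stand in for the real observations.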

In these formulas, $x_i$ represents the midpoint of each class interval and f the frequency of each class.

Table 11.6 Standard Deviation for Student GPAs (columns: Student GPAs, Frequency, Midpoint (x), $x_i - \bar{x}$, $(x_i - \bar{x})^2$, $f (x_i - \bar{x})^2$; total frequency = 15; total of the last column = 2.40; $\bar{x}$ = 3.30)

$\bar{x} = \frac{\sum (f \times x)}{n} = \frac{(0 \times 1.5) + (3 \times 2.5) + \dots}{15} = \frac{49.5}{15} = 3.30$

$s = \sqrt{\frac{2.40}{15 - 1}} = \sqrt{0.17} \approx 0.41$

Now it is time to mention an important point. Since we do not know every single observation in grouped data, we use the midpoint of each class as an approximation of the real observations. We multiply the squared deviation of each midpoint from the mean by the corresponding frequency, add the products up, and divide by N (for a population) or n − 1 (for a sample). So the standard deviation or variance computed from grouped data is an approximate, estimated value. In ungrouped data we know every single observation, and whatever we calculate is the actual value.

A Relative Measure of Dispersion: The Coefficient of Variation

The standard deviation and the variance are absolute measures of dispersion. The coefficient of variation (CV), on the other hand, is a relative measure of dispersion that expresses the standard deviation as a percentage of the mean. Using CV, we can easily compare the dispersions of two sets of data in percentage terms. The formulas for calculating CV are:

$CV = \frac{\sigma}{\mu} (100)$ for population

$CV = \frac{s}{\bar{x}} (100)$ for sample

Let's consider the following example to understand the use of CV better. Suppose that over a given period the common stock of Sabanci Inc. sold at an average of $50,000 per share with a standard deviation of $5,000, while the common stock of Koc Inc. sold at an average of $60,000 per share with a standard deviation of $5,800. The CV for each firm is:

$$CV_{Sabanci} = \frac{s}{\bar{x}}(100) = \frac{5000}{50000}(100) = 10\%$$

$$CV_{Koc} = \frac{s}{\bar{x}}(100) = \frac{5800}{60000}(100) \approx 9.67\%$$

On the basis of these results, although Sabanci Inc. has less absolute variation in its common stocks (standard deviation s = $5,000) than Koc Inc., it has more relative variation than Koc Inc. This is because of the difference in their means.

11.5 Statistical Inference: Estimation and Hypothesis Tests

Statistical inference, estimation and hypothesis testing are three important concepts in statistics, and they are closely related to each other. The definition of statistical inference was given at the opening of this chapter (Section 11.1). It deals with uncertainty by using probability concepts in decision making. It is based on estimation and covers both estimation and hypothesis testing. In this part, we will start with estimation.

Estimation

When we deal with uncertainty, we have to make estimates. In statistics, we use sample statistics to estimate population parameters. Generally, there are two types of estimates in statistics: a point estimate and an interval estimate. A point estimate is a single value, a sample statistic, that is used to estimate an unknown population parameter; because it is a single number, it gives no indication of how far it may be from the true value. An interval estimate, on the other hand, is a range of values within which the population parameter is expected to fall.

The sample statistics used to estimate the population parameters are called estimators. For example, $\bar{x}$ is the sample mean and the estimator of the population mean, µ; s is the sample standard deviation and the estimator of the population standard deviation, σ. The observed values of the estimators are called estimates. For example, if $\bar{x} = 23$, then $\bar{x}$ is the estimator and 23 is the estimate of the true population mean.
An Alternative Way For Hypothesis Tests: Using Prob Values (p-values)

Recall that α is a predetermined value for the significance level, which is the probability of rejecting a true null hypothesis, called a type I error. Selecting the level of α is up to the researcher. The generally accepted rule for choosing α rests on the trade-off between α and β (the probability of a type II error).

If the cost of making a type I error is relatively high for the researcher, he or she will want to avoid that error and will select a low level of α. On the other hand, if the cost of making a type II error is relatively high, he or she will want to avoid that error instead and will select a high level of α.

The prob value (p-value), in contrast, is not chosen in advance. It is the probability, computed under the assumption that the null hypothesis is true, of obtaining a sample result at least as extreme as the one actually observed. It is found from the z-table by using the z formula. Let's consider the following example:

$H_0: \mu = 15$
$H_a: \mu \neq 15$
and $\sigma = 2.1$, $n = 20$, $\bar{x} = 13.6$

In this simple example, we are given a two-tailed hypothesis about whether the mean of a population is equal to 15 or not. We are also provided with the population standard deviation, the sample size selected for the test, and the sample mean. Because the alternative hypothesis allows both µ > 15 and µ < 15, the prob value is the sum of the probabilities in both tails beyond the observed z score. Let's find the prob value now.

First, we have to find the standard error of the mean:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.1}{\sqrt{20}} = 0.47$$

The next step is to find the z score for $\bar{x}$:

$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{13.6 - 15}{0.47} = -2.98$$

Figure 11.9 Prob Values in normal curve

In this example, the area in one tail beyond z = -2.98 is 0.0014, so the p-value for the test is 2(0.0014) = 0.0028. In other words, if the null hypothesis were true, a sample mean this far from 15 would occur only about 0.28% of the time.
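The calculation for this example can be checked with a short Python sketch. The function name is an assumption made for illustration, and the tail area is computed from the standard normal distribution with `math.erfc` instead of being read from a printed z-table, so the result may differ slightly from table-based figures.

```python
import math

def two_tailed_p_value(sample_mean, mu0, sigma, n):
    """Return the z score and two-tailed p-value for a z test of H0: mu = mu0."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Values from the example: H0: mu = 15, sigma = 2.1, n = 20, sample mean = 13.6
z, p = two_tailed_p_value(13.6, 15, 2.1, 20)

alpha = 0.05  # significance level chosen by the researcher
decision = "reject H0" if p < alpha else "do not reject H0"
print(round(z, 2), round(p, 4), decision)
```

Comparing `p` with `alpha` gives the same decision as comparing the z score with the critical values for that α.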

Now let's continue to test our hypothesis. Let's select a significance level of α = 0.05. Figure 11.10 shows how α and p-values are used together to test the hypothesis.

Figure 11.10 Use of Prob Values in Testing Hypothesis (regions along the z axis: Reject H_0, Accept H_0, Reject H_0, separated by the z critical values)

As you will see from Figure 11.10, the z score corresponding to our p-value falls outside the region bounded by the z critical values, so we reject H_0 and accept H_a: the true mean of the population is not equal to 15.

It is possible to derive one more conclusion from the above discussion. The p-value for the above example is lower than α = 0.05. So when
p-value > α, we accept H_0;
p-value < α, we reject H_0 and accept H_a.
This conclusion holds not only for two-tailed tests but also for one-tailed tests.

Chi-Square And Analysis Of Variance (ANOVA) Tests

Chi-square and ANOVA tests are two statistical techniques used in hypothesis testing. Usually, we use chi-square as a test of independence between two or more variables and as a test of goodness of fit to a particular probability distribution, and ANOVA as a test of difference among two or more population means. Let's consider these tests in more detail.

Chi-Square Test for Independence

Two-way tables (cross-tabulations) play an important role in setting up and evaluating a chi-square test. When we conduct a chi-square hypothesis test on the computer, especially in SPSS, we give the command and SPSS provides the chi-square statistic, the df, and the significance level along with the table.

To carry out a chi-square test we first need to find the computed value of the chi-square statistic (χ²). The formula for χ² is:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

where
$\chi^2$ : chi-square statistic
$f_0$ : observed frequency in the distribution
$f_e$ : expected frequency in the distribution

But how do we find $f_e$? The following formula is used to calculate $f_e$:

$$f_e = \frac{rt \times ct}{n}$$

where
rt : row total of the corresponding frequency cell
ct : column total of the corresponding frequency cell
n : total number of observations (sample size)

Secondly, we need to determine a significance level 3 for the hypothesis test. This might be 0.05 or 0.10; the level is up to the researcher. Lastly, we need to find the table value of the chi-square statistic. To do that we find the degrees of freedom (df) by using the following formula:

$$df = (r - 1)(c - 1)$$

where
df : degrees of freedom
r : number of rows in the table
c : number of columns in the table

We can then find the table value of the chi-square statistic from the chi-square distribution table by looking up df and the significance level. If the null hypothesis is true, the sampling distribution of χ² can be approximated by a continuous curve, which is known as the chi-square distribution. There is a different chi-square distribution for each value of df. The degrees of freedom increase as the number of rows and/or columns in the table increases. As df increases, the chi-square distribution becomes more symmetrical; with small df, it is skewed to the right, as you can observe from Figure 11.11.

3 Remember that the significance level shows the level of error accepted in a hypothesis test.

Figure 11.11 Representing chi-square distribution with different levels of degrees of freedom (curves for several values of df, including 5 df and 10 df, plotted against χ²)

Carrying Out A Hypothesis Test by Using Chi-Square

Figure 11.12 is a representative graph of the chi-square distribution used in hypothesis testing. The shaded area in the right tail shows the significance level, which is the level of error accepted and, at the same time, the probability of rejecting a true null hypothesis. The left-hand side contains the confidence level for the null hypothesis and shows the probability of accepting a true null hypothesis.

Figure 11.12 Representative graph of χ² distribution for hypothesis test (C.L. = 0.90, α = 0.10; the table value for χ² separates the acceptance region from the rejection region)

The intersection point of the acceptance and rejection areas corresponds to the table value of the chi-square statistic. If the computed value of the chi-square statistic falls in the acceptance region (that is, if the computed value is less than the table value), the null hypothesis will be accepted; otherwise it will be rejected and the alternative hypothesis will be accepted.

To understand this better, let's try to solve a problem on the chi-square test of independence. Given the aim of this book, we will mostly work with computer-based outputs in these types of problems. However, the reader can refer to any statistics book to see the theoretical computation of the formulas.

Table 11.4 shows the evaluation of the teaching ability of lecturers by faculty. The frequency in bold characters in each cell of Table 11.4 is the expected frequency for the corresponding observed frequency. Recall from the $f_e$ formula that rt in Table 11.4 is equal to 10 for the 1st, 2nd and 3rd rows, and 20 for the 4th row; and ct is equal to 4 for the 1st column, 18 for the 2nd column, 23 for the 3rd column, 5 for the 4th column and 0 for the 5th column. Each row or column total gives the share of that row or column category in the total number of observations. For example, the rt for the 1st row is 10; it shows the total number of B&E students out of n (= 50). Its proportion out of n is 0.20 (20%), which is 10 / 50. The ct for the 2nd column is 18; it shows the total number of students who rated the teaching ability of lecturers as High. Its proportion out of n is 0.36 (36%), which is 18 / 50. Combining the row and column proportions gives the expected frequency, $f_e$, for each cell, which is $\frac{rt \times ct}{n}$.

Now let's continue with our exercise.
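As a small illustration of the $f_e$ and df formulas, the Python sketch below builds the full grid of expected frequencies from the row and column totals quoted above (taking 18 as the 2nd column total, so that both sets of totals sum to n = 50). The function name is an assumption made for this example.

```python
def expected_frequencies(row_totals, col_totals):
    """Return the grid of expected frequencies f_e = (rt * ct) / n."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row totals and column totals must sum to the same n")
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Row and column totals quoted in the text for the faculty-by-ability table
row_totals = [10, 10, 10, 20]
col_totals = [4, 18, 23, 5, 0]

expected = expected_frequencies(row_totals, col_totals)

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Expected frequency for the 1st row, 2nd column: 10 * 18 / 50
print(expected[0][1], df)
```

Note that a column total of 0 gives an expected frequency of 0 for every cell in that column, so those cells cannot enter the χ² sum as written, because the formula divides by $f_e$.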

Table 11.4 Evaluation of teaching ability of lecturers by faculty
Rows (Faculty): B&E, A&S, ENG, OTHER, Column Total
Columns (Ability): Very High, High, Medium, Poor, Very Poor, Row Total
Reported with the table: computed value for Pearson's Chi-square Statistic (χ²), df, Significance Level

The null and the alternative hypotheses for the chi-square test in this exercise are as follows:

$H_0$: Teaching ability of lecturers is independent of faculty
$H_a$: Teaching ability of lecturers depends on faculty

In a chi-square test, the null hypothesis specifies independence, and the alternative hypothesis specifies dependence. The computed value of the chi-square statistic reported with the table is called Pearson's Chi-square Statistic. The degrees of freedom are (4-1)(5-1) = 12. The output also reports the significance level of the statistic. Let's test the hypothesis at the 0.01 level of significance (α = 0.01). Once we have this information, the next step is to find the table value for χ².

Appendix table .. provides the chi-square distribution table for different levels of α and df. The table value for our exercise is:

$$\chi^2_{0.01,\,12} = 26.217$$

Now let's represent these data in a graph:

Figure 11.13 Hypothesis test for the evaluation of teaching ability of lecturers by faculty (α = 0.01; acceptance region to the left of the table value of χ², rejection region to the right)

As you see in Figure 11.13, the computed value of χ² is less than the table value and falls within the acceptance region. So we accept our null hypothesis that the teaching ability of lecturers is independent of faculty, according to these data of n = 50 observations. Equivalently, a computed value below the table value means that the p-value for Pearson's Chi-Square Statistic is greater than α = 0.01, so again we would accept our null hypothesis.

Analysis of Variance (ANOVA) Test for Difference

Analysis of variance (ANOVA) is used to test for differences among two or more sample means. ANOVA does this by comparing two different estimates of the variance, σ², of the same population: the first estimate is based on the variation among the sample means, and the second on the variation within the samples. If the null hypothesis is true, the two estimates should be about equal. ANOVA is summarized by the F-ratio, which compares the two estimates of the variance. The formula for the computed F ratio is:

$$F = \frac{\text{estimate of the variance among the sample means}}{\text{estimate of the variance within the samples}}$$

It is possible to write this as a formula in the following way:

$$F = \frac{\hat{\sigma}^2_{\text{among}}}{\hat{\sigma}^2_{\text{within}}} = \frac{\dfrac{\sum n_j (\bar{x}_j - \bar{\bar{x}})^2}{k - 1}}{\sum \left( \dfrac{n_j - 1}{n_T - k} \right) s_j^2}$$

where
$n_j$ = size of the j-th sample
$\bar{x}_j$ = sample mean of the j-th sample
$\bar{\bar{x}}$ = mean (average) of the sample means (grand mean)
$k$ = number of samples
$s_j^2$ = variance of the j-th sample
$n_T$ = total of the sample sizes ($\sum n_j$)

Let's consider an example of the test for difference. Suppose we want to test whether there is a significant difference between the salaries of males and females in a questionnaire study for a corporation. The sample size selected is 475. Salary and gender are categorized in the questionnaire form as follows:

SAL: 1. under $50,000; 2. $50,000 - $100,000; 3. over $100,000
GENDER: 1. Male; 2. Female

We can state our hypotheses as:

$H_0: \mu_{male} = \mu_{female}$ (Salaries of employees do not differ by gender)
$H_a: \mu_{male} \neq \mu_{female}$ (Salaries of employees differ by gender)
$\alpha = 0.01$

Below is the SPSS output for the ANOVA test on the employee data.

ANOVA (SAL)
Rows: Between Groups, Within Groups, Total
Columns: Sum of Squares, df, Mean Square, F, Sig.

In order to test our null hypothesis we have to compare the computed F value with the F-table value. We find the F-table value by using the degrees of freedom (df). In an ANOVA test, there are two degrees of freedom:

df in the numerator of the F-ratio = $(k - 1) = 2 - 1 = 1$
df in the denominator of the F-ratio = $\sum (n_j - 1) = n_T - k = 475 - 2 = 473$

where
$k$ = number of samples
$n_j$ = size of the j-th sample
$n_T$ = total sample size

Then the F-table value would be:

$F_{(\alpha = 0.01)} = 6.63$ (approximately)

Now let's test our hypothesis at α = 0.01:

Figure: Hypothesis test for the difference of salaries among males and females (α = 0.01; acceptance region to the left of the F-table value 6.63, rejection area to the right; F-computed > F-table)

Since the computed F value is greater than the F-table value and falls within the rejection area, we reject our null hypothesis and accept the alternative hypothesis that, in the corporation, the salaries of employees differ between males and females. Alternatively, the p-value is less than α = 0.01, and again we would reject the null hypothesis.

The shape of the F-distribution at different levels of degrees of freedom is shown in the accompanying figure. The first number in each parenthesis shows the df in the numerator of the F-ratio formula, and the second number shows the df in the denominator.
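To make the F-ratio and its two degrees of freedom concrete, here is a minimal Python sketch. The function name and the three small samples are hypothetical and are not the employee data analysed in SPSS. The grand mean is taken as the mean of all observations, which equals the mean of the sample means when the samples are of equal size.

```python
def f_ratio(samples):
    """Return (F, df_numerator, df_denominator) for a one-way ANOVA."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    n_total = sum(sizes)
    means = [sum(s) / len(s) for s in samples]
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Estimate of the variance among the sample means
    among = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)

    # Sample variance of each group, using n_j - 1 in the denominator
    variances = [
        sum((x - m) ** 2 for x in s) / (len(s) - 1)
        for s, m in zip(samples, means)
    ]
    # Estimate of the variance within the samples (weighted by n_j - 1)
    within = sum((n - 1) / (n_total - k) * v for n, v in zip(sizes, variances))

    return among / within, k - 1, n_total - k

# Hypothetical data: three samples of three observations each
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
F, df_num, df_den = f_ratio(samples)
print(round(F, 2), df_num, df_den)
```

The computed F would then be compared with the F-table value for (df_num, df_den) at the chosen α, exactly as in the salary example.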