CHAPTER 11. STATISTICAL ANALYSIS IN SCIENTIFIC RESEARCHES


11.1 Basics of Statistics

Statistics is a branch of mathematics that deals with the collection, organization, and analysis of numerical data and with such problems as experiment design and decision making.

The term statistics originates from the Italian word statista (meaning "statesman"), but the term actually derived from statista was Statistik, which was first used by Gottfried Achenwall ( ). He was a professor at Marburg and Göttingen. The word statistics itself was introduced in England by E.A.W. Zimmerman. People were, however, able to record and use data well before the eighteenth century. Statistics became popular with Sir John Sinclair's work Statistical Account of Scotland, which covers the period ( ).

Statistics offers various techniques that can be applied in every branch of public and private enterprise. Statisticians generally divide the field into two main parts: Descriptive Statistics and Inferential Statistics. In descriptive statistics there is no generalization from sample to population [1]. We describe the data with tables, charts, or graphs, and make no claim about other data or about a population. In inferential statistics, by contrast, we generalize from a sample to a population, so the conclusions go beyond the data at hand. Such a generalization may not be true or valid, so the statistician should specify how likely it is to be true, because it rests on estimation. Inferential statistics is also called Statistical Inference. Statistical inference can also be applied in decision theory, which is a branch of statistics, because the two are closely related: decisions are made under conditions of uncertainty.
So statistical inference is very effective in decision making.

11.2 Arranging Data: Data Array, Frequency Distributions, and Cross-Tabulations

Data are collections of any number of related observations. For example, we can collect information about the number of students at Eastern Mediterranean University (EMU) in the Turkish Republic of Northern Cyprus (TRNC) and divide them into categories such as nationality, gender, and age group. A collection of data is called a data set, and a single observation is called a data point. People can gather data from past records or by observation, and they can use data on the past to make decisions about the future. Data therefore play a very important role in decision making.

Most of the time it is not possible to gather data for the whole population. So statisticians gather data from a sample and use this information to make inferences about the population that the sample represents. A population is a whole, whereas a sample is only a fraction of the population.

[1] The concepts of sample and population will be discussed later.

Assume that there are currently

10,300 students in EMU, and we want to evaluate the expectations and findings of EMU students toward the University. It would be very hard to survey all the students, so we select a fraction of them. If we take 15% of the total, the number of students selected is 1,545; this number is called the Sample Size. The total number of students (10,300) is called the Population Size.

Data for a sample or a population can be collected randomly or non-randomly. When data are selected randomly, every observation has an equal chance of being included, regardless of its characteristics. When data are not selected randomly, the selection is biased with respect to some characteristic of the observations.

To use data efficiently for any purpose, we need to arrange them. This arrangement can take various forms. Data that have not yet been arranged and analyzed are called raw data; they are still unprocessed by statistical methods.

Data Array

The first way of arranging data is the Data Array. It is one of the simplest ways to present data: it arranges the values in ascending or descending order.

Table 11.1 Grades of Students (columns: Raw Data; Data Array, Ascending; Data Array, Descending)

With a data array we can immediately see the lowest and highest values, divide the data into sections, and see whether a value appears more than once. But for large quantities of data the data array is not very helpful, and we need to arrange the data by another method.

Frequency Distributions

The second way of arranging data is the frequency distribution, one of the best-known ways of presenting data in statistics. It divides the data into classes with lower and upper limits and shows the number of observations that fall into each class.
We can also express the frequency of each class as a fraction or percentage of the total number of observations, which is called a relative

frequency distribution. Table 11.2 shows a frequency distribution together with the corresponding relative frequency distribution.

Table 11.2 Frequency Distribution of Student Grades (columns: Class, Frequency, Relative Frequency; last row: Total)

As the table shows, the relative frequencies of all the classes sum to 1.00, or 100%. The sum can never exceed 1.00, because each relative frequency is the frequency of a class divided by the total. The classes in a frequency distribution are all-inclusive: all the data fit into one category or another. The classes are also mutually exclusive [2].

Frequency distributions can be qualitative or quantitative, open-ended or closed-ended, and discrete or continuous. We can classify data according to quantitative characteristics such as age group, salary, or income level, or according to qualitative characteristics such as sex, occupation, or nationality. We can also arrange data in open-ended or closed-ended classes. With open-ended classes the classification scheme is unbounded: the last class in the frequency distribution is open-ended. Lastly, the classes can be discrete or continuous. Discrete data consist of separate values that do not progress from one class to another without a break (e.g. 1, 2, 5, 10, 100). Continuous data consist of values that do progress from one class to another without a break (e.g. 1.1, 1.2, 22.5). Various types of frequency distributions are shown below.

Table 11.3 Types of Frequency Distribution Tables
(a) Quantitative and Discrete Data with Open-Ended Class (columns: Income level ($), Frequency, Relative Frequency; last row: TOTAL)
(b) Qualitative Data (columns: Gender, Frequency, Relative Frequency; rows: Male, Female, TOTAL)

[2] No data point can fall into more than one category.

(c) Continuous Data with Closed-Ended Classes (columns: Student GPAs, Frequency, Relative Frequency; last row: TOTAL)

Cross-Tabulations

The third way of arranging data is the Cross-Tabulation, a two-way table that presents data on two separate characteristics along its row and column dimensions. Consider Table 11.4 (a), which shows the distribution of income level with respect to gender: gender is on the row dimension and income level is on the column dimension. Table 11.4 (b) shows the same two-way distribution both in absolute numbers and in relative frequencies (percentages).

Table 11.4 (a) Cross-Tabulation of Income Level with Respect to Gender (rows: Male, Female, Column Total; columns: Income Level ($) classes, Row Total)

Table 11.4 (b) Cross-Tabulation of Income Level with Respect to Gender (rows: Male, Female, Column Total; columns: Income Level ($) classes, Row Total)

Interpreting these two-way tables is essential in statistics, especially in scientific research and in decision making. In Table 11.4 (b), for example, the sample size is 50. Of the males in this sample, 35% have an income level between 0 and $500; this corresponds to 7 persons out of the total of 20 males.

Likewise, 46.7 percent of the persons with an income level between 0 and $500 are male, which corresponds to 7 of the 15 persons in that income class. Lastly, males with an income level between 0 and $500 make up 14% of the total sample of 50. The 20 males constitute 40% of the sample (n = 50), and the 15 persons with an income level between 0 and $500 constitute 30% of the sample.

For large amounts of data it is very hard and time consuming to organize data into frequency distributions or cross-tabulations by hand. Nowadays computer packages, especially SPSS (Statistical Package for Social Sciences), make it very easy to create these types of tables. We will study these subjects in the following chapters.

11.3 Using Graphs to Describe Distributions

We can represent the distribution of data (especially a frequency distribution) in various forms of graphs. Graphs of distributions usually have two dimensions, X and Y. The X-axis shows the values or characteristics of the variables, and the Y-axis shows their frequencies in absolute or relative terms. Graphs of relative frequencies are often more useful, because they attract the reader's attention and are easier to understand and to use in decision making. Nowadays advanced computer packages are effective for drawing graphs. We will discuss these subjects in later chapters. Figure 11.1 gives a few examples of the types of graphs available in Microsoft Excel 97 for Windows.

Figure 11.1 Types of Graphs: (a) Column (Bar) Graph, (b) Line Graph, (c) Pie Chart, (d) XY (Scatter) Graph
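The relative frequencies and cross-tabulation percentages discussed in this part of the chapter can be checked with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, and the only figures taken from the text are the 7 males in the 0 to $500 class, the 20 males, the 15 persons in that class, and the sample size of 50 (the 30 females follow from 50 minus 20).

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total, so the shares sum to 1.00."""
    total = sum(frequencies.values())
    return {label: count / total for label, count in frequencies.items()}


def cell_percentages(cell, row_total, column_total, grand_total):
    """Return the row, column, and overall percentages for one cross-tab cell."""
    return {
        "row_pct": 100 * cell / row_total,        # share within the row (gender)
        "column_pct": 100 * cell / column_total,  # share within the column (income class)
        "total_pct": 100 * cell / grand_total,    # share of the whole sample
    }


# Figures quoted in the text for males with an income level between 0 and $500.
males_low_income = cell_percentages(cell=7, row_total=20, column_total=15, grand_total=50)

# Gender split of the sample: 20 males (from the text) and 50 - 20 = 30 females.
gender_shares = relative_frequencies({"Male": 20, "Female": 30})

if __name__ == "__main__":
    print(males_low_income)
    print(gender_shares)
```

The three percentages answer three different questions about the same cell, which is why a cross-tabulation is usually read by row, by column, and against the grand total.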

11.4 Measures of Central Tendency and Dispersion

After data have been collected and tabulated, analysis begins with the calculation of single numbers that summarize or represent all the data; these are called summary statistics. We use summary statistics to describe the characteristics of a data set. Nowadays almost every statistical package provides summary statistics in its output. Two kinds of summary statistics are important for decision-makers: Central Tendency and Dispersion. Before going into the details of these two concepts, let us briefly define them.

Central Tendency

Data often cluster around a central point; a number that locates this point is called a measure of central tendency. It refers to the central or middle point of a distribution. Measures of Central Tendency are also called Measures of Location. The concept can be shown in a graph:

Figure 11.2 Central Tendency for Three Types of Distribution (Curve A, Curve B, Curve C)

The figure shows that curves A and C have the same central location, while the central location of curve B lies to the right of those of curves A and C.

Dispersion

Dispersion refers to the spread of the data in a distribution. Notice in Figure 11.2 that Curve B has a wider spread, or dispersion, than A and C, and that Curve C has a wider spread than Curve A.

Besides Central Tendency and Dispersion, an investigator may benefit from two other measures of a data set: skewness and kurtosis.

Skewness

The curve of a distribution may be either symmetrical or skewed. In a symmetrical curve, the vertical line drawn from the peak of the curve to the horizontal axis divides the area into two equal parts. For example, we know that the total of a relative frequency distribution is equal to 1.00, so in a symmetrical curve 50% of the data lie on the left-hand side and the other 50% lie on the right-hand side.

Figure 11.3 Symmetrical Curve (50% of the area on each side of the peak)

Curves A and B in Figure 11.4, on the other hand, are skewed curves. Their frequency distributions are concentrated at either the low end or the high end of the measuring scale on the horizontal axis. Curve A is said to be Positively Skewed, and curve B is said to be Negatively Skewed.

Figure 11.4 Positively and Negatively Skewed Curves (Curve A: skewed to the right; Curve B: skewed to the left)

Kurtosis

Kurtosis is the peakedness of a distribution. Notice in Figure 11.5 that the two curves have the same central location and dispersion, and both are symmetrical, but Curve A is said to be more peaked than Curve B.

Figure 11.5 Measure of Degree of Kurtosis (Curve A, Curve B)

Measures of Central Tendency

In statistics, the arithmetic mean, weighted mean, geometric mean, median, and mode are referred to as the measures of central tendency. We will first consider the arithmetic mean.

The Arithmetic Mean

The arithmetic mean is the simple average of a data set. We can calculate the average age in a class, the average monthly expenditure of students at EMU, the average number of tourists coming to TRNC each year, and so on. The arithmetic mean of a population is represented by the symbol \(\mu\) and that of a sample by \(\bar{x}\). The formulas for \(\mu\) and \(\bar{x}\) are:

Population: \(\mu = \frac{\sum X}{N}\), where N represents the population size
Sample: \(\bar{x} = \frac{\sum x}{n}\), where n represents the sample size

Table 11.4 provides the ages of students in a class. We will assume that the data represent a sample drawn from the whole university.

Table 11.4 Ages of Students in a Class (columns: ID, Name, Age; students: Ali, Veli, Ayla, George, Mohammed, Asher, Samah, Ayse, Mahmut, John (age 28))

Now let us calculate the arithmetic mean for these ungrouped data:

\[ \bar{x} = \frac{\sum x}{n} = 25 \]

So the arithmetic mean of the ages in the class is \(\bar{x} = 25\).

But what if the data are grouped? In grouped data we do not know the separate value of each observation, so we can only estimate the mean. In ungrouped data, since we know all the observations, the mean we calculate is the actual value.

To calculate the arithmetic mean for grouped data we use the following formula:

\[ \bar{x} = \frac{\sum (f \times x)}{n} \]

where
\(\bar{x}\) = sample mean
\(\sum\) = summation
f = number of observations in each class
x = midpoint of each class
n = sample size

Let us look at the following frequency distribution of student GPAs, which is an example of grouped data.

Table 11.5 Frequency Distribution of Student GPAs (columns: Student GPAs, Frequency, Relative Frequency; last row: TOTAL)

The first step in calculating the arithmetic mean is to find the midpoint (x) of each class. To find a midpoint, we add the lower limit of the class to the lower limit of the following class and divide by two. For the first class, for example, this gives a midpoint of 1.5. We continue in this way until we reach the last class interval. Then we multiply each midpoint by the corresponding absolute frequency and add the products up. Lastly, we divide this sum by the total number of observations in the data. The calculation is shown in Table 11.6.

Table 11.6 Arithmetic Mean for Student GPAs (columns: Student GPAs, Frequency, Midpoint (x), f × x; TOTAL row: 500 and 1,300)

\[ \bar{x} = \frac{\sum (f \times x)}{n} = \frac{1{,}300}{500} = 2.6 \]

So our estimated mean for the student GPAs from the grouped data is 2.6. A useful practice is to round the midpoints (to whole cents, for instance, when the data are amounts of money), which makes the calculation easier.
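The grouped-mean calculation can be sketched in Python as follows. The individual class frequencies of the GPA table are not reproduced here, so the three classes below (midpoints 1.5, 2.5, 3.5 with frequencies 100, 250, 150) are an assumption, chosen only because they reproduce the totals quoted in the text (n = 500 and a sum of f × x equal to 1,300). The function name is hypothetical.

```python
def grouped_mean(midpoints, frequencies):
    """Estimate the mean of grouped data as sum(f * x) / n."""
    n = sum(frequencies)
    weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))
    return weighted_sum / n


# Assumed class midpoints and frequencies (not taken from the original table);
# they give n = 500 and sum(f * x) = 1,300, the totals quoted in the text.
midpoints = [1.5, 2.5, 3.5]
frequencies = [100, 250, 150]

estimated_mean = grouped_mean(midpoints, frequencies)

if __name__ == "__main__":
    print(f"Estimated mean GPA: {estimated_mean:.2f}")
```

Because every observation in a class is represented by the class midpoint, the result is an estimate of the mean rather than its exact value.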

Today, statistical packages produce these frequency distributions ready-made, and computers calculate the arithmetic mean from the original data, so in that situation the grouped-data formula is unnecessary.

The arithmetic mean is the best known and most frequently used measure of central tendency. One of its most important uses is that it makes comparisons between different data sets easy. The arithmetic mean has two important disadvantages. First, it is affected by extreme values. Second, it cannot be calculated for grouped data with open-ended classes, because an open-ended class has no midpoint.

The Median

In its simplest sense, the median divides the distribution into two equal parts. It is a single number that represents the most central, or middlemost, item in the data. Half of the data lie below this number and the other half lie above it.

To calculate the median for ungrouped data, we first array the data in ascending or descending order. If the number of observations is odd, the median is the most central item in the data. Consider the following simple data in Table 11.7:

Table 11.7 Graduated Students in Each Year (columns: Year, No. of Students)

First, let us array the data in ascending order: 10, 13, 14, 15, 17

The most central item of this odd-numbered data set is 14, which is therefore the median. Another way of finding the median is to use the following formula:

Median = the \(\left(\frac{n+1}{2}\right)\)th item in the data array, where n represents the number of items in the data.

Applying this formula to the data above, the median is the \(\frac{5+1}{2} = 3\)rd item, which is 14. However, this formula is most often used for even-numbered data, where it leads to the average of the two middle items.
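The (n + 1)/2 position rule can be written as a short Python function. This is a minimal sketch with a hypothetical function name: the first data set is the one from the text, while the six-value data set uses 20 as an assumed sixth value purely for illustration.

```python
import math


def median_by_position(values):
    """Median as the ((n + 1) / 2)th item of the sorted data.

    When the position falls halfway between two items (even n),
    the two neighbouring items are averaged.
    """
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2          # 1-based position, may end in .5
    lower = ordered[math.floor(position) - 1]  # item at or just below the position
    upper = ordered[math.ceil(position) - 1]   # item at or just above the position
    return (lower + upper) / 2


# Data from the text: five yearly counts of graduated students.
odd_median = median_by_position([17, 10, 14, 13, 15])

# Six observations; the value 20 is an assumption added for illustration.
even_median = median_by_position([10, 13, 14, 15, 17, 20])

if __name__ == "__main__":
    print(odd_median, even_median)
```

For an odd number of items the two neighbours coincide, so the same code covers both cases.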

To calculate the median of even-numbered data, we need to take the average of the two middlemost items, since there is no single most central item in the data set. So we should use the above formula. Now let us extend Table 11.7 to 1996 and calculate the median for the data. The number of observations will now be 6.

Table 11.7 Graduated Students in Each Year (columns: Year, No. of Students)

Again we sort the data in ascending order; the first five values are 10, 13, 14, 15, 17.

From the formula, the median is the \(\frac{6+1}{2} = 3.5\)th item in the data, which lies between 14 and 15. The average of 14 and 15 is \(\frac{14+15}{2} = 14.5\), and that is the median of this data set. So the median number of graduated students for the period up to 1996 is 14.5.

For grouped data, we have to find an estimated value for the median, which falls into one of the class intervals, because we do not know all the observations in the data; we are given only the frequency distribution with class intervals. The formula for the median of grouped data is given below:

\[ \tilde{m} = L + \frac{(n+1)/2 - (F+1)}{f}\, w \]

where
\(\tilde{m}\) = the median assumed for the sample distribution
L = the lower limit of the class interval containing the median
F = the cumulative sum of the frequencies up to, but not including, the median class
f = the frequency of the class interval containing the median
w = the width of the class interval containing the median
n = total number of observations in the data

When we work with a population, \(\tilde{m}\) is replaced by Md and n by N.
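A direct transcription of the grouped-median formula into Python might look like the sketch below. The function and parameter names are hypothetical. In the example call, n = 500 and F = 100 come from the student GPA example in this chapter, and f = 250 follows from the cumulative total of 350 quoted there; the lower limit L = 2.0 and the width w = 1.0 are assumptions made only so that the example runs.

```python
def grouped_median(lower_limit, cum_freq_before, class_freq, width, n):
    """Estimate the median of grouped data.

    m = L + (((n + 1) / 2 - (F + 1)) / f) * w
    where L, F, f, w and n are defined as in the formula above.
    """
    if class_freq <= 0:
        raise ValueError("the median class must contain at least one observation")
    steps_into_class = (n + 1) / 2 - (cum_freq_before + 1)
    return lower_limit + (steps_into_class / class_freq) * width


# n, F and f follow from the chapter's GPA example; L and w are assumed values.
estimated_median = grouped_median(
    lower_limit=2.0,      # assumption
    cum_freq_before=100,  # F
    class_freq=250,       # f = 350 - 100
    width=1.0,            # assumption
    n=500,
)

if __name__ == "__main__":
    print(f"Estimated median: {estimated_median:.3f}")
```

The formula interpolates inside the median class, treating the observations in that class as evenly spaced across its width.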

Let us return to Table 11.5 from the previous examples and find the median for these data:

Table 11.8 Finding the Median for Student GPAs (columns: Student GPAs, Frequency; the median class is marked; TOTAL: 500)

The first step is to find the class interval that contains the median. The median is the \(\frac{500+1}{2} = 250.5\)th item in the data. Second, we have to find the class interval that contains the 250.5th item. To do that, we add the frequencies together from the first class onward until the cumulative sum reaches 250.5, and then we stop. In these data the cumulative frequency is 100 before the median class and 350 once the median class is included, so the 250.5th item falls into that class. Now we put the values into the formula:

\[ \tilde{m} = L + \frac{(500+1)/2 - (100+1)}{250}\, w \]

This gives the median value for the GPAs of the students. It is an estimated sample median, since the data are grouped.

Unlike the mean, the median is not affected by extreme values in the data. It can be calculated even for open-ended grouped data, unless the median falls into the open-ended class.

The Mode

The mode is the value or observation that occurs most frequently in the data. If two or more distinct observations occur with equal frequencies, but none with greater frequency, the set of observations may be said either to have no mode or to be bimodal, with modes at the two most frequent observations, or trimodal, with modes at the three most frequent observations. When a single value is repeated most often, the distribution is unimodal.

To find the mode of ungrouped data, we again array the data in ascending or descending order. Consider the following ungrouped data, which represent the final exam marks of 35 students in a class.

Table 11.9 Student Marks in Final Exam, Arrayed in Ascending Order

The most frequently repeated observation, or student mark, is clearly 23: it is repeated 3 times, so the mode of these ungrouped data is 23 and the distribution is unimodal. We can also observe that 99 is repeated 2 times. Now consider the following table of student marks:

Table: Student Marks in Final Exam, Arrayed in Ascending Order

This time we have changed the observations, and two observations are repeated most often, 23 and 83, 3 times each. The modes of these data are 23 and 83, and the distribution is called bimodal. Lastly, if three observations are repeated most often, the distribution is trimodal. Let us make one more change to the previous table:

Table 11.9 Student Marks in Final Exam, Arrayed in Ascending Order

This time three observations are repeated most often: 23, 83, and 90, again 3 times each. However, a generally accepted rule is that a distribution in which two or more observations are repeated most often is simply called bimodal.

For grouped data, we assume that the mode is located in the class interval with the highest frequency. This class interval is called the modal class. To find the mode from grouped data, we use the following formula:

\[ M_o = L_{M_o} + \frac{f_m - f_b}{(f_m - f_b) + (f_m - f_a)}\, w \]

where
\(M_o\) = the mode of the frequency distribution or grouped data
\(L_{M_o}\) = lower limit of the modal class
\(f_m\) = frequency of the modal class
\(f_b\) = frequency of the class interval below the modal class
\(f_a\) = frequency of the class interval above the modal class
w = the width of the modal class
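The grouped-mode formula can be sketched in Python in the same way. The function name is hypothetical, and the values in the example call (lower limit 2.0, frequencies 250, 100, and 150, width 1.0) are assumptions rather than figures taken from the original table; they were chosen to be consistent with the total of 500 observations used in the chapter's GPA example.

```python
def grouped_mode(lower_limit, f_modal, f_below, f_above, width):
    """Estimate the mode of grouped data.

    Mo = L + ((fm - fb) / ((fm - fb) + (fm - fa))) * w
    """
    d_below = f_modal - f_below  # excess of the modal class over the class below
    d_above = f_modal - f_above  # excess of the modal class over the class above
    if d_below + d_above == 0:
        raise ValueError("neighbouring classes are as frequent as the modal class")
    return lower_limit + (d_below / (d_below + d_above)) * width


# Assumed class values (not taken from the original table).
estimated_mode = grouped_mode(
    lower_limit=2.0, f_modal=250, f_below=100, f_above=150, width=1.0
)

if __name__ == "__main__":
    print(f"Estimated mode: {estimated_mode:.2f}")
```

The formula pulls the mode toward whichever neighbouring class is more frequent: here the class above has the higher frequency, so the estimate lies above the middle of the modal class.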

Let us apply this formula to find the mode of the following frequency distribution of student GPAs:

Table 11.8 Finding the Mode for Student GPAs (columns: Student GPAs, Frequency; the modal class is marked; TOTAL: 500)

As we can see from the table, the modal class of this frequency distribution is the class with the highest frequency. Putting the values into the formula gives:

\[ M_o = 2.60 \]

So the mode of this frequency distribution is 2.60. Since the data are grouped and we do not know every observation, 2.60 is an estimate of the mode. Like the median, and unlike the mean, the mode is not affected by extreme values in the data, and we can use it even with open-ended class intervals.

Comparison of the Mean, the Median, and the Mode

Among these three measures of central tendency, the mean is the most popular and the most widely used. The mean and the median are generally preferable to the mode. Often the data do not contain a mode at all, because no value occurs more than once. But how often each of these three measures is used depends on the conditions and the area of research in which they are applied.

We can also compare these measures of central tendency with respect to the shape of the distribution. When a distribution is symmetrical and has a single peak, the mean, the median, and the mode are equal to each other. Figure 11.6 shows this relationship:

Figure 11.6 Mean, Median, and Mode in a Symmetrical Distribution (mean, median, and mode coincide)

In this case there is no reason to prefer one measure, since they are all equal. But what if the distribution is skewed? Figure 11.7 shows the positions of the three measures of central tendency when the distribution is skewed to the right and to the left:

Figure 11.7 Mean, Median, and Mode in Skewed Distributions (Curve A, skewed to the right: mode, median, mean from left to right; Curve B, skewed to the left: mean, median, mode from left to right)

When the distribution is skewed, the median is the preferable measure of central tendency, because it lies between the mean and the mode in both positively and negatively skewed distributions.

Measures of Dispersion

When we compare two or more distributions using only the measures of central tendency, we may not be satisfied. We need to know more about these distributions; knowing the means of the data sets, for example, may not be enough to compare them. The variability, or dispersion, is a useful measure for getting more information about them. We may need to know which distribution is more consistent than the other, and a measure of dispersion helps here, since it measures the spread of the observations around their mean. As the dispersion of the data decreases, the consistency and reliability of the data increase, and the central location (mean, median, or mode) becomes more representative of the data as a whole.

The concept of dispersion also plays an important role in business life. For example, a financial manager may be concerned with the earnings of firms. Widely dispersed earnings indicate a higher risk for the financial manager, because earnings that vary widely around the mean are inconsistent. Figure 11.8 shows the spread of three curves having the same mean.
Although they have the same central location, curve A has the least spread compared to B and C, and curve C has the widest spread in the graph. So the distribution of Curve A is said to be more consistent and reliable than those of B and C.

Figure 11.8 Measure of Dispersion for Three Curves Having the Same Mean (Curve A, Curve B, Curve C; common mean of A, B, C)

Range, Interfractile Range and Interquartile Range

These are the first measures of dispersion we consider; all of them are distance measures. The range is the difference between the highest and the lowest values in a data set:

Range = Highest value - Lowest value

The interfractile range is the difference between two fractiles. Four kinds of fractiles are in general use:

Thirds = divide the data into 3 equal parts
Quartiles = divide the data into 4 equal parts
Deciles = divide the data into 10 equal parts
Percentiles = divide the data into 100 equal parts

Let us consider the following data on the grades of 48 students.

As a first example, let us divide the data into thirds and find the interfractile range between the 1/3 and 2/3 fractiles. How do we organize the data into three equal parts? We first order the data from the lowest to the highest and then specify the dimensions of the arrangement. Since the sample size, n, is 48, we divide 48 by 3 and get 48/3 = 16, which means that we will have 3 rows and 16 columns, that is, a 3 × 16 format. Let us create it now:

So the 1/3 fractile is 43, the 2/3 fractile is 52, and the 3/3 fractile is 72. The interfractile range between the 1/3 and 2/3 fractiles is then 52 - 43 = 9.

As a second example, what is the interfractile range between the 30th and 70th percentiles?

30th percentile: 30% of 48 = 14.4, so we take the 14th element in the data
70th percentile: 70% of 48 = 33.6, so we take the 34th element in the data
14th element in the data = 42
34th element in the data = 55

So the interfractile range is 55 - 42 = 13.

As a third example, let us find the interquartile range, which is the difference between the first and third quartiles; quartiles divide the data into 4 equal parts. 48/4 = 12, so the format will be 4 × 12. So:

1st quartile = 1/4 fractile = 41
2nd quartile = 2/4 fractile = 49
3rd quartile = 3/4 fractile = 57
4th quartile = 4/4 fractile = 72

And:

Interquartile Range = Q3 - Q1 = 57 - 41 = 16

But if we want the range between the 1/4 and 2/4 fractiles, it is 49 - 41 = 8. The 30th and 70th percentiles are again the same values as in the previous example, 42 and 55, and the range between them is 55 - 42 = 13. No matter whether we arrange the data into 4, 3, or any other number of equal parts, the percentiles keep the same values.
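The positional rules used in these three examples can be sketched in Python. The 48 student grades are not reproduced here, so the data below are simply the numbers 1 to 48, an assumed stand-in that makes each value equal to its position in the ordered data. The function names are hypothetical.

```python
def fractile_values(values, parts):
    """Split the ordered data into equal parts and return the last value of each part."""
    ordered = sorted(values)
    n = len(ordered)
    if n % parts != 0:
        raise ValueError("this sketch assumes n is divisible by the number of parts")
    size = n // parts
    return [ordered[k * size - 1] for k in range(1, parts + 1)]


def percentile_value(values, percent):
    """Return the element at position percent% of n, rounded to the nearest position."""
    ordered = sorted(values)
    # Python rounds exact halves to the nearest even integer; no such case arises here.
    position = round(percent / 100 * len(ordered))  # 1-based position
    position = min(max(position, 1), len(ordered))  # keep the position inside the data
    return ordered[position - 1]


# Assumed stand-in data: 48 values whose value equals their position.
data = list(range(1, 49))

thirds = fractile_values(data, 3)
quartiles = fractile_values(data, 4)
interquartile_range = quartiles[2] - quartiles[0]
p30, p70 = percentile_value(data, 30), percentile_value(data, 70)

if __name__ == "__main__":
    print(thirds, quartiles, interquartile_range, p30, p70)
```

With the stand-in data the output shows positions rather than grades: the 30th and 70th percentiles land on the 14th and 34th elements, as in the text.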

Variance and Standard Deviation

Variance and, especially, standard deviation are the most commonly used statistical measures of dispersion. They describe the average distance of an observation from the mean of the data. We could try to express the average distance (or deviation) of the observations from the mean by the following formula:

\[ \text{Average deviation} = \frac{\sum (X_i - \mu)}{N} \]

where
\(X_i\) = observations in the population
\(\mu\) = population mean
N = population size

But when we use this formula, the sum of the deviations is always equal to zero, and as a result the average deviation is also zero. To avoid this problem, we square each deviation; the average of the squared deviations is the variance. The standard deviation is the square root of the variance. It is more useful than the variance in statistical analyses, because the variance expresses the average dispersion in squared units rather than in the original units. Taking the square root of the variance converts it into the standard deviation, which measures the average dispersion in the original units of measurement. The variance and the standard deviation of a population are given by the following formulas:

Variance: \(\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}\)
Standard Deviation: \(\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}\)

However, most of the time it is not possible to know all the observations in the population, so we adapt the population formula to sample data. To calculate the standard deviation of a given sample, we use:

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]

where
\(x_i\) = each sample unit in the distribution
\(\bar{x}\) = sample mean
n - 1 = sample size minus 1

What is the reason for using n - 1? It can be shown that if we select many different samples from a population, find the standard deviation of each sample using n as the denominator, and take

the average of them, then this average will not tend to equal the population standard deviation. To correct for this difference, we use n - 1 as the denominator.

Now let us calculate the standard deviation of student CGPAs for a randomly selected sample of 15 students.

Table 11.9 Calculating Variance and Standard Deviation for Ungrouped Sample Data of Student CGPAs (columns: CGPA, \(x - \bar{x}\), \((x - \bar{x})^2\); \(\bar{x} = 3.21\); sum of deviations = 0.00)

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{15 - 1}} \approx 0.42 \quad \text{(sample standard deviation)} \]

The standard deviation (s) of this sample of student CGPAs is approximately 0.42, which shows that each observation in the sample deviates from the mean (\(\bar{x} = 3.21\)) by 0.42 on average, upwards or downwards. The variance of the sample is \(s^2\), the square of this value. As Table 11.9 shows, the sum of the deviations of the observations from the mean is equal to zero; that is why we square each deviation before adding them up.

Calculating Variance and Standard Deviation by Using Grouped Data

Up to this point, we have discussed the variance and the standard deviation for ungrouped data, that is, unprocessed raw data. But what if the data are grouped? Then we need different formulas to find the variance and the standard deviation. Since the standard deviation (\(\sigma\)) is the square root of the variance (\(\sigma^2\)), we will work only with the standard deviation. The formulas for the standard deviation of grouped data are:

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum f (x_i - \mu)^2}{N}} \quad \text{for a population} \]

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum f (x_i - \bar{x})^2}{n - 1}} \quad \text{for a sample} \]

In each formula, \(x_i\) represents the midpoint of a class interval and f the frequency of that class.

Table 11.6 Student GPAs (columns: Student GPAs, Frequency, Midpoint (x), \(x_i - \bar{x}\), \((x_i - \bar{x})^2\), \(f (x_i - \bar{x})^2\); total frequency = 15; total of \(f (x_i - \bar{x})^2\) = 2.40; \(\bar{x} = 3.30\))

\[ \bar{x} = \frac{\sum f x}{n} = \frac{(0 \times 1.5) + (3 \times 2.5) + \dots}{15} = \frac{49.5}{15} = 3.30 \]

\[ s = \sqrt{\frac{2.40}{15 - 1}} = \sqrt{0.17} \approx 0.41 \]

Now it is time to mention an important point. Since we do not know every single observation in grouped data, we use the midpoint of each class as an approximation of the real observations. We multiply the squared deviation of each midpoint from the mean by the corresponding frequency, add the products up, and divide by N (for a population) or n - 1 (for a sample). So the standard deviation or variance computed from grouped data is an approximate, estimated value. In ungrouped data, however, we know every single observation, and whatever we calculate is an actual value.

A Relative Measure of Dispersion: The Coefficient of Variation

The standard deviation and the variance are absolute measures of dispersion. The coefficient of variation (CV), on the other hand, is a relative measure of dispersion that expresses the standard deviation as a percentage of the mean. Using the CV, we can easily compare the dispersions of two sets of data in percentage terms. The formulas for the CV are:

\[ CV = \frac{\sigma}{\mu}(100) \quad \text{for a population} \]

\[ CV = \frac{s}{\bar{x}}(100) \quad \text{for a sample} \]

Consider the following example to understand the use of the CV better. Suppose that the common stock of Sabanci Inc. sold at an average of $50,000 per share with a standard deviation of $5,000 over a given period, and that the common stock of Koc Inc. sold at an average of $60,000 per share with a standard deviation of $5,800 over a given period. The CVs of the two firms are:

\[ CV_{Sabanci} = \frac{s}{\bar{x}}(100) = \frac{5000}{50000}(100) = 10\% \]

\[ CV_{Koc} = \frac{s}{\bar{x}}(100) = \frac{5800}{60000}(100) \approx 9.67\% \]

On the basis of these results, although Sabanci Inc. has less absolute variation in its common stock price than Koc Inc. (a standard deviation of $5,000 against $5,800), it has more relative variation. This is because of the difference in their means.

11.5 Statistical Inference: Estimation and Hypothesis Tests

Statistical inference, estimation and hypothesis testing are three important and closely related concepts in statistics. Statistical inference was defined in the opening section of this chapter (11.1). It deals with uncertainty by using probability concepts in decision making, and it covers both estimation and hypothesis testing. In this part we start with estimation.

Estimation

When we deal with uncertainty, we have to estimate. In statistics, we use sample statistics to estimate population parameters. There are two types of estimates: a point estimate and an interval estimate. A point estimate is a single value of a sample statistic used to estimate an unknown population parameter; because it is a single number, it gives no indication of how precise the estimate is. An interval estimate, on the other hand, is a range of values within which the population parameter is expected to fall.

The sample statistics used to estimate population parameters are called estimators. For example, the sample mean \(\bar{x}\) is the estimator of the population mean µ, and the sample standard deviation s is the estimator of the population standard deviation σ. The observed values of the estimators are called estimates. For example, if \(\bar{x} = 23\), then \(\bar{x}\) is the estimator and 23 is the estimate of the true population mean.
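As a small illustration of point estimates and of the coefficient of variation, the sketch below computes \(\bar{x}\), s and CV in Python. The function names are hypothetical, and the grouped GPA sample is an assumption: midpoints 1.5, 2.5 and 3.5 with frequencies 0, 3 and 12, chosen because they reproduce the totals quoted for the grouped example (n = 15, Σf·x = 49.5, Σf(x − x̄)² = 2.40). The stock figures are the ones given for Sabanci Inc. and Koc Inc.

```python
import math


def grouped_mean(midpoints, freqs):
    """Point estimate of the mean from class midpoints and frequencies."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(midpoints, freqs)) / n


def grouped_sample_sd(midpoints, freqs):
    """Sample standard deviation from grouped data (n - 1 denominator)."""
    n = sum(freqs)
    mean = grouped_mean(midpoints, freqs)
    squared_dev = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    return math.sqrt(squared_dev / (n - 1))


def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Assumed grouped GPA sample (midpoints and frequencies, see the note above).
midpoints = [1.5, 2.5, 3.5]
freqs = [0, 3, 12]

x_bar = grouped_mean(midpoints, freqs)   # estimate of the population mean
s = grouped_sample_sd(midpoints, freqs)  # estimate of the population sd

# Stock example: standard deviation and mean price for each firm.
cv_sabanci = coefficient_of_variation(5000, 50000)
cv_koc = coefficient_of_variation(5800, 60000)

print(f"x_bar = {x_bar:.2f}, s = {s:.2f}")
print(f"CV Sabanci = {cv_sabanci:.2f}%, CV Koc = {cv_koc:.2f}%")
```

Comparing the two CV values rather than the two standard deviations is what makes the dispersions comparable when the means differ.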
An Alternative Way For Hypothesis Tests: Using Prob Values (p-values)

Recall that α is a predetermined significance level: the probability of rejecting a true null hypothesis, which is called a type I error. The choice of α is up to the researcher. The generally accepted guideline is to weigh the trade-off between α and β (the probability of a type II error).

If the cost of making a type I error is relatively high for the researcher, he or she will want to avoid that error and will select a low level of α. On the other hand, if the cost of making a type II error is relatively high, he or she will want to avoid that error instead and will select a high level of α.

The prob value (p-value), by contrast, is not chosen in advance. It is the probability, computed on the assumption that the null hypothesis is true, of obtaining a sample result at least as extreme as the one actually observed. For a test on a mean with known σ, it is found from the z-table after computing the z score. Let's consider the following example:

\(H_0: \mu = 15\)
\(H_a: \mu \neq 15\)
σ = 2.1, n = 20, \(\bar{x} = 13.6\)

In this simple example, we are given a two-tailed hypothesis about whether the population mean is equal to 15. We are also given the population standard deviation, the sample size selected for the test, and the sample mean. Because the alternative hypothesis allows both µ > 15 and µ < 15, the prob value is the sum of the probabilities in both tails beyond the observed z score. Let's find the prob value now. First, we find the standard error of the mean:

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.1}{\sqrt{20}} = 0.47 \]

The next step is to find the z score for \(\bar{x}\):

\[ z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{13.6 - 15}{0.47} = -2.98 \]

Figure 11.9 Prob values in the normal curve

The area in each tail beyond |z| = 2.98 is 0.0014, so the p-value for the test is 2(0.0014) = 0.0028. In other words, if the null hypothesis were true, a sample mean this far from 15 would occur only about 0.28% of the time.
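The two-tailed calculation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the normal tail area is obtained from `math.erfc` instead of a z-table.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def z_test_two_tailed(sample_mean, mu0, sigma, n):
    """Return (standard error, z score, two-tailed p-value) for a z-test."""
    se = sigma / math.sqrt(n)            # standard error of the mean
    z = (sample_mean - mu0) / se         # standardized sample mean
    p_value = 2 * normal_cdf(-abs(z))    # area in both tails beyond |z|
    return se, z, p_value


# Values from the example: H0: mu = 15, sigma = 2.1, n = 20, x_bar = 13.6
se, z, p_value = z_test_two_tailed(sample_mean=13.6, mu0=15, sigma=2.1, n=20)
print(f"standard error = {se:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

Because the script keeps the unrounded standard error and z score, its p-value can differ from a z-table lookup in the last digit.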

Now let's continue to test our hypothesis. Let's select a significance level of α = 0.05. Figure 11.10 shows how α and p-values are used together to test the hypothesis.

Figure 11.10 Use of prob values in testing a hypothesis (normal curve with "Reject H_0" regions in both tails beyond the z critical values and an "Accept H_0" region between them)

As Figure 11.10 shows, the computed z score falls outside the z critical values, so we reject H_0 and accept H_a: the true population mean is not equal to 15. One more conclusion can be drawn from this discussion. The p-value for the above example is lower than α = 0.05. In general:

if p-value > α, we accept H_0;
if p-value < α, we reject H_0 and accept H_a.

This rule holds not only for two-tailed tests but also for one-tailed tests.

Chi-Square and Analysis of Variance (ANOVA) Tests

Chi-square and ANOVA tests are two statistical techniques used in hypothesis testing. Usually, we use chi-square as a test of independence between two or more variables and as a test of the goodness of fit of a particular probability distribution, and ANOVA as a test of difference among two or more population means. Let's consider these tests in more detail.

Chi-Square Test for Independence

Two-way tables (cross-tabulations) play an important role in the chi-square test. When we run a chi-square test in statistical software such as SPSS, the output reports the chi-square statistic, the degrees of freedom (df), and the significance level together with the table. To carry out a chi-square test, we first need the computed value of the chi-square statistic (χ²):

\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \]

where
χ² : chi-square statistic
f_o : observed frequency in the distribution
f_e : expected frequency in the distribution

But how do we find f_e? The following formula is used:

\[ f_e = \frac{rt \times ct}{n} \]

where
rt : row total of the corresponding frequency cell
ct : column total of the corresponding frequency cell
n : total number of observations (sample size)

Second, we need to choose a significance level³ for the hypothesis test, such as 0.05 or 0.10; the choice is up to the researcher. Lastly, we need the table value of the chi-square statistic. To find it, we first compute the degrees of freedom (df):

df = (r - 1)(c - 1)

where
df : degrees of freedom
r : number of rows in the table
c : number of columns in the table

We can then read the table value of the chi-square statistic from the chi-square distribution table by using df and the significance level. If the null hypothesis is true, the sampling distribution of χ² can be approximated by a continuous curve known as the chi-square distribution. There is a different chi-square distribution for each value of df. The degrees of freedom increase as the number of rows and/or columns in the table increases. With small df the distribution is skewed to the right; as df increases it becomes more symmetrical, as Figure 11.11 shows.

³ Remember that the significance level shows the level of error accepted in the hypothesis test.

Figure 11.11 Chi-square distributions with different levels of degrees of freedom (curves for several df values, including 5 df and 10 df)

Carrying Out a Hypothesis Test by Using Chi-Square

Figure 11.12 is a representative graph of the chi-square distribution used in hypothesis testing. The shaded area in the right tail is the significance level, that is, the probability of rejecting a true null hypothesis. The area to its left is the confidence level, that is, the probability of accepting a true null hypothesis.

Figure 11.12 Representative graph of the χ² distribution for a hypothesis test (C.L. = 0.90 acceptance region on the left, α = 0.10 rejection region in the right tail, separated at the table value of χ²)

The point where the acceptance and rejection regions meet corresponds to the table value of the chi-square statistic. If the computed value of the chi-square statistic falls in the acceptance region (that is, if the computed value is less than the table value), the null hypothesis is accepted; otherwise it is rejected and the alternative hypothesis is accepted.

To understand this better, let's work through a problem on the chi-square test of independence. Given the aim of this book, we will mostly work with computer output in these types of problems; the reader can refer to any statistics textbook for the hand computation of the formulas.

Table 11.4 shows the evaluation of the teaching ability of lecturers by faculty. The frequency in bold in each cell of Table 11.4 is the expected frequency for that cell. Recall from the f_e formula that rt in Table 11.4 is equal to 10 for the 1st, 2nd and 3rd rows and 20 for the 4th row, and ct is equal to 4 for the 1st column, 18 for the 2nd column, 23 for the 3rd column, 5 for the 4th column and 0 for the 5th column. Each row or column total, divided by n, gives the proportion of that row or column category in the total number of observations. For example, the rt for the 1st row is 10: the total number of B&E students out of n = 50, a proportion of 10 / 50 = 0.20 (20%). The ct for the 2nd column is 18: the total number of students who rated the teaching ability of lecturers as High, a proportion of 18 / 50 = 0.36 (36%). Multiplying the row proportion by the column proportion and then by n gives the expected frequency for each cell, f_e = (rt × ct) / n. Now let's continue with our exercise.
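The expected-frequency and chi-square formulas above can be sketched in a few lines of Python. The function names are hypothetical. The row and column totals are the ones quoted for Table 11.4, taking 18 as the second column total so that the columns sum to n = 50. Since the observed cell counts of Table 11.4 are not reproduced here, the statistic itself is demonstrated on a small made-up 2 × 3 table.

```python
def expected_frequencies(row_totals, col_totals):
    """Expected frequency f_e = (rt * ct) / n for every cell."""
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(observed):
    """Return (chi-square statistic, df) for a table of observed counts.

    Assumes every expected frequency is positive; rows or columns whose
    total is zero would have to be dropped before calling this function.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    expected = expected_frequencies(row_totals, col_totals)
    stat = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Marginal totals quoted in the text for Table 11.4 (n = 50).
row_totals = [10, 10, 10, 20]
col_totals = [4, 18, 23, 5, 0]
expected = expected_frequencies(row_totals, col_totals)
print("expected frequencies, first row:", expected[0])

# Hypothetical observed counts, used only to illustrate the statistic.
observed = [[10, 20, 30],
            [20, 20, 20]]
stat, df = chi_square(observed)
print(f"chi-square = {stat:.3f} with df = {df}")
```

The computed statistic would then be compared with the table value for the chosen α and df, exactly as described above.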

Table 11.4 Evaluation of teaching ability of lecturers by faculty
Rows (Faculty): B&E, A&S, ENG, OTHER, Column Total. Columns (Ability): Very High, High, Medium, Poor, Very Poor, Row Total. Reported below the table: computed value for Pearson's chi-square statistic (χ²), df, and significance level.

The null and the alternative hypotheses for the chi-square test of this exercise are:

H_0: Teaching ability of lecturers is independent of faculty
H_a: Teaching ability of lecturers depends on faculty

In a chi-square test, the null hypothesis specifies independence and the alternative hypothesis specifies dependence. The computed value of the chi-square statistic is called Pearson's chi-square statistic; it is reported with Table 11.4 along with its significance level. The degrees of freedom are (4 - 1)(5 - 1) = 12. Let's test the hypothesis at the 0.01 level of significance (α = 0.01). Once we have this information, the next step is to find the table value for χ².

The chi-square distribution table in the appendix gives table values for different levels of α and df. The table value for our exercise is:

\[ \chi^2_{0.01,\,12} = 26.217 \]

Now let's represent these data in a graph:

Figure 11.13 Hypothesis test for the evaluation of teaching ability of lecturers by faculty (α = 0.01; acceptance region to the left of the table value, rejection region in the right tail)

As Figure 11.13 shows, the computed value of χ² is less than the table value and falls within the acceptance region. So we accept the null hypothesis that the teaching ability of lecturers is independent of faculty, based on these n = 50 observations. The p-value rule leads to the same decision: a computed χ² below the table value means that the p-value of Pearson's chi-square statistic is greater than α = 0.01, so again we accept the null hypothesis.

Analysis of Variance (ANOVA) Test for Difference

Analysis of variance (ANOVA) is used to test for differences among two or more population means by using the sample means. It does so by comparing two different estimates of the variance, σ², of the same population: the first estimate is based on the variation among the sample means, and the second on the variation within the samples. If the null hypothesis is true, the two estimates should be approximately equal. ANOVA is summarized by the F-ratio, which compares the two estimates of variance. The computed F-ratio is:

\[ F = \frac{\text{estimate of the variance among the sample means}}{\text{estimate of the variance within the samples}} \]

It is possible to write this as a formula:

\[ F = \frac{\hat{\sigma}^2_{\text{among}}}{\hat{\sigma}^2_{\text{within}}} = \frac{\sum n_j (\bar{x}_j - \bar{\bar{x}})^2 \,/\, (k - 1)}{\sum (n_j - 1)\, s_j^2 \,/\, (n_T - k)} \]

where
\(n_j\) = size of the j-th sample
\(\bar{x}_j\) = sample mean of the j-th sample
\(\bar{\bar{x}}\) = mean (average) of the sample means (grand mean)
\(k\) = number of samples
\(s_j^2\) = variance of the j-th sample
\(n_T\) = total of the sample sizes (\(\sum n_j\))

Let's consider an example of the test for difference. Suppose we want to test whether there is a significant difference between the salaries of males and females in a questionnaire study for a corporation. The sample size selected is 475. Salary and gender are categorized in the questionnaire form as:

SAL: 1. below $50,000; 2. $50,000 - $100,000; 3. above $100,000
GENDER: 1. Male; 2. Female

We can state our hypotheses as:

H_0: µ_male = µ_female (salaries of employees do not differ by gender)
H_a: µ_male ≠ µ_female (salaries of employees differ by gender)
α = 0.01

Below is the SPSS output for the ANOVA test on the employee data.

ANOVA (SAL). Rows: Between Groups, Within Groups, Total. Columns: Sum of Squares, df, Mean Square, F, Sig.

To test our null hypothesis, we compare the computed F value with the F-table value. The F-table value depends on the degrees of freedom (df), and in an ANOVA test there are two of them:

df in the numerator of the F-ratio = (k - 1) = 2 - 1 = 1
df in the denominator of the F-ratio = Σ(n_j - 1) = n_T - k = 475 - 2 = 473

where
k = number of samples
n_j = size of the j-th sample
n_T = total sample size

The F-table value is then:

F (α = 0.01) = 6.63 (approximately)

Now let's test our hypothesis at α = 0.01:

Figure: Hypothesis test for the difference of salaries among males and females (α = 0.01; acceptance region to the left of 6.63, rejection area to the right; F-computed > F-table)

Since the computed F value falls within the rejection area, we reject the null hypothesis and accept the alternative hypothesis that salaries in the corporation differ between males and females. Alternatively, the p-value is less than α = 0.01, so again we reject the null hypothesis.

The shape of the F-distribution depends on the degrees of freedom, as the corresponding figure shows. The first number in each parenthesis is the df in the numerator of the F-ratio formula, and the second number is the df in the denominator.
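The F-ratio and its two degrees of freedom can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the two small salary samples (in thousands of dollars) are made up and are not the corporation's data, and the grand mean is taken as the mean of all observations, which equals the mean of the sample means when the samples have equal sizes.

```python
def f_ratio(samples):
    """Return (F, df numerator, df denominator) for a one-way ANOVA."""
    k = len(samples)                          # number of samples
    sizes = [len(s) for s in samples]         # n_j
    n_total = sum(sizes)                      # n_T
    means = [sum(s) / len(s) for s in samples]
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Sample variances s_j^2 with the n_j - 1 denominator.
    variances = [
        sum((x - m) ** 2 for x in s) / (len(s) - 1)
        for s, m in zip(samples, means)
    ]

    # Estimate of the variance among the sample means.
    among = sum(n * (m - grand_mean) ** 2
                for n, m in zip(sizes, means)) / (k - 1)
    # Estimate of the variance within the samples.
    within = sum((n - 1) * v
                 for n, v in zip(sizes, variances)) / (n_total - k)

    return among / within, k - 1, n_total - k


# Hypothetical salaries in thousands of dollars, for illustration only.
male = [40, 50, 60]
female = [10, 20, 30]

F, df_num, df_den = f_ratio([male, female])
print(f"F = {F:.2f} with df = ({df_num}, {df_den})")
```

The computed F would then be compared with the F-table value for the chosen α and the pair (df numerator, df denominator), as in the salary example above.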


More information

Midterm Review Problems

Midterm Review Problems Midterm Review Problems October 19, 2013 1. Consider the following research title: Cooperation among nursery school children under two types of instruction. In this study, what is the independent variable?

More information

CHAPTER THREE. Key Concepts

CHAPTER THREE. Key Concepts CHAPTER THREE Key Concepts interval, ordinal, and nominal scale quantitative, qualitative continuous data, categorical or discrete data table, frequency distribution histogram, bar graph, frequency polygon,

More information

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013 Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives

More information

TECHNIQUES OF DATA PRESENTATION, INTERPRETATION AND ANALYSIS

TECHNIQUES OF DATA PRESENTATION, INTERPRETATION AND ANALYSIS TECHNIQUES OF DATA PRESENTATION, INTERPRETATION AND ANALYSIS BY DR. (MRS) A.T. ALABI DEPARTMENT OF EDUCATIONAL MANAGEMENT, UNIVERSITY OF ILORIN, ILORIN. Introduction In the management of educational institutions

More information

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

DATA INTERPRETATION AND STATISTICS

DATA INTERPRETATION AND STATISTICS PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

More information

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 seema@iasri.res.in 1. Descriptive Statistics Statistics

More information

Week 1. Exploratory Data Analysis

Week 1. Exploratory Data Analysis Week 1 Exploratory Data Analysis Practicalities This course ST903 has students from both the MSc in Financial Mathematics and the MSc in Statistics. Two lectures and one seminar/tutorial per week. Exam

More information

Chapter 2: Frequency Distributions and Graphs

Chapter 2: Frequency Distributions and Graphs Chapter 2: Frequency Distributions and Graphs Learning Objectives Upon completion of Chapter 2, you will be able to: Organize the data into a table or chart (called a frequency distribution) Construct

More information

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple. Graphical Representations of Data, Mean, Median and Standard Deviation In this class we will consider graphical representations of the distribution of a set of data. The goal is to identify the range of

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

A Picture Really Is Worth a Thousand Words

A Picture Really Is Worth a Thousand Words 4 A Picture Really Is Worth a Thousand Words Difficulty Scale (pretty easy, but not a cinch) What you ll learn about in this chapter Why a picture is really worth a thousand words How to create a histogram

More information

Week 4: Standard Error and Confidence Intervals

Week 4: Standard Error and Confidence Intervals Health Sciences M.Sc. Programme Applied Biostatistics Week 4: Standard Error and Confidence Intervals Sampling Most research data come from subjects we think of as samples drawn from a larger population.

More information

Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

03 The full syllabus. 03 The full syllabus continued. For more information visit www.cimaglobal.com PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS

03 The full syllabus. 03 The full syllabus continued. For more information visit www.cimaglobal.com PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS 0 The full syllabus 0 The full syllabus continued PAPER C0 FUNDAMENTALS OF BUSINESS MATHEMATICS Syllabus overview This paper primarily deals with the tools and techniques to understand the mathematics

More information

Introduction; Descriptive & Univariate Statistics

Introduction; Descriptive & Univariate Statistics Introduction; Descriptive & Univariate Statistics I. KEY COCEPTS A. Population. Definitions:. The entire set of members in a group. EXAMPLES: All U.S. citizens; all otre Dame Students. 2. All values of

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

Lecture Notes Module 1

Lecture Notes Module 1 Lecture Notes Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific

More information

Statistical tests for SPSS

Statistical tests for SPSS Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly

More information

AP * Statistics Review. Descriptive Statistics

AP * Statistics Review. Descriptive Statistics AP * Statistics Review Descriptive Statistics Teacher Packet Advanced Placement and AP are registered trademark of the College Entrance Examination Board. The College Board was not involved in the production

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

Chapter 7 Section 7.1: Inference for the Mean of a Population

Chapter 7 Section 7.1: Inference for the Mean of a Population Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used

More information

Chapter 2 Statistical Foundations: Descriptive Statistics

Chapter 2 Statistical Foundations: Descriptive Statistics Chapter 2 Statistical Foundations: Descriptive Statistics 20 Chapter 2 Statistical Foundations: Descriptive Statistics Presented in this chapter is a discussion of the types of data and the use of frequency

More information

DESCRIPTIVE STATISTICS & DATA PRESENTATION*

DESCRIPTIVE STATISTICS & DATA PRESENTATION* Level 1 Level 2 Level 3 Level 4 0 0 0 0 evel 1 evel 2 evel 3 Level 4 DESCRIPTIVE STATISTICS & DATA PRESENTATION* Created for Psychology 41, Research Methods by Barbara Sommer, PhD Psychology Department

More information

Lecture 2. Summarizing the Sample

Lecture 2. Summarizing the Sample Lecture 2 Summarizing the Sample WARNING: Today s lecture may bore some of you It s (sort of) not my fault I m required to teach you about what we re going to cover today. I ll try to make it as exciting

More information

Exploratory Data Analysis. Psychology 3256

Exploratory Data Analysis. Psychology 3256 Exploratory Data Analysis Psychology 3256 1 Introduction If you are going to find out anything about a data set you must first understand the data Basically getting a feel for you numbers Easier to find

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Statistics Revision Sheet Question 6 of Paper 2

Statistics Revision Sheet Question 6 of Paper 2 Statistics Revision Sheet Question 6 of Paper The Statistics question is concerned mainly with the following terms. The Mean and the Median and are two ways of measuring the average. sumof values no. of

More information

Linear Models in STATA and ANOVA

Linear Models in STATA and ANOVA Session 4 Linear Models in STATA and ANOVA Page Strengths of Linear Relationships 4-2 A Note on Non-Linear Relationships 4-4 Multiple Linear Regression 4-5 Removal of Variables 4-8 Independent Samples

More information

Interpreting Data in Normal Distributions

Interpreting Data in Normal Distributions Interpreting Data in Normal Distributions This curve is kind of a big deal. It shows the distribution of a set of test scores, the results of rolling a die a million times, the heights of people on Earth,

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

6 3 The Standard Normal Distribution

6 3 The Standard Normal Distribution 290 Chapter 6 The Normal Distribution Figure 6 5 Areas Under a Normal Distribution Curve 34.13% 34.13% 2.28% 13.59% 13.59% 2.28% 3 2 1 + 1 + 2 + 3 About 68% About 95% About 99.7% 6 3 The Distribution Since

More information

Correlation key concepts:

Correlation key concepts: CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

Is it statistically significant? The chi-square test

Is it statistically significant? The chi-square test UAS Conference Series 2013/14 Is it statistically significant? The chi-square test Dr Gosia Turner Student Data Management and Analysis 14 September 2010 Page 1 Why chi-square? Tests whether two categorical

More information

2 Describing, Exploring, and

2 Describing, Exploring, and 2 Describing, Exploring, and Comparing Data This chapter introduces the graphical plotting and summary statistics capabilities of the TI- 83 Plus. First row keys like \ R (67$73/276 are used to obtain

More information

Projects Involving Statistics (& SPSS)

Projects Involving Statistics (& SPSS) Projects Involving Statistics (& SPSS) Academic Skills Advice Starting a project which involves using statistics can feel confusing as there seems to be many different things you can do (charts, graphs,

More information

Measurement & Data Analysis. On the importance of math & measurement. Steps Involved in Doing Scientific Research. Measurement

Measurement & Data Analysis. On the importance of math & measurement. Steps Involved in Doing Scientific Research. Measurement Measurement & Data Analysis Overview of Measurement. Variability & Measurement Error.. Descriptive vs. Inferential Statistics. Descriptive Statistics. Distributions. Standardized Scores. Graphing Data.

More information

2. Filling Data Gaps, Data validation & Descriptive Statistics

2. Filling Data Gaps, Data validation & Descriptive Statistics 2. Filling Data Gaps, Data validation & Descriptive Statistics Dr. Prasad Modak Background Data collected from field may suffer from these problems Data may contain gaps ( = no readings during this period)

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information