Descriptive and Inferential Statistics

General Sir John Kotelawala Defence University
Workshop on Descriptive and Inferential Statistics
Faculty of Research and Development
14th May 2013

1. Introduction to Statistics

1.1 What is Statistics?

In common usage, `statistics' refers to numerical information. (Here, `statistics' is the plural of `statistic', which means one piece of numerical information.) For example:

Percentage of male nurses in Sri Lanka: 5%
Birth rate: births/1,000 population
Death rate: 5.92 deaths/1,000 population
Infant mortality rate: 9.7 deaths/1,000 live births
Life expectancy at birth: male: years, female: years
GDP (value of all final goods and services produced in a year): $106.5 billion
Unemployment rate (the percent of the labor force that is without jobs): 5.8%
Inflation rate (the annual percent change in consumer prices compared with the previous year's consumer prices): 5.9% (2010 est.)

In the more specific sense, `statistics' refers to a field of study. It has been defined in several ways. For example:

Statistics is the study of the collection, organization, analysis, and interpretation of data.
Statistics is the mathematical science involved in the application of quantitative principles to the collection, analysis, and presentation of numerical data.
Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions.

Data and Information

These words are often used interchangeably. However, there are some differences. Data are the numbers, characters, symbols, images etc. collected in raw form for analysis, whereas information is processed data.

Data are unprocessed facts and figures without any added interpretation or analysis.

Information is data that has been interpreted so that it has meaning for the user.
Knowledge is a combination of information, experience and insight that may benefit the individual or the organization.

1.3 Distinguishing between Variables and Data

A variable is some characteristic which has different `values' or categories for different units (items/subjects/individuals).

Examples of variables on which data are collected at a prenatal clinic: Gender, Ethnicity, Age, Body temperature, Pulse rate, Blood pressure, Fasting blood sugar level, Urine pH value, Income group, Number of children.

We collect data on variables. Data are raw numbers or facts that must be processed (analyzed) to get useful information. We get information by processing data.

Variable: Age (in years) of patients
Data: 31, 42, 34, 33, 41, 45, 35, 39, 28, 41
Information: the mean age is 36.9 years; the percentage of patients above 40 years of age is 40%.

1.4 Population and sample

Statistics is used for drawing conclusions about a group of units (individuals/items/subjects). Such a group of interest is called a population. In research, the `population' is the group of units to which one wishes to generalize the conclusions. The populations of interest are usually large.

Even though decisions have to be made about the population of interest, it is often impossible or very difficult to collect data from the whole population, because of practical constraints on the available money, time and labour, or because of the nature of the population. Therefore, data are often collected from only a subset of the population. Such a subset is called a sample.
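The age example in section 1.3 can be reproduced in a few lines. This is a minimal sketch in Python (the workshop itself uses SPSS); the variable names are illustrative and not part of the original handout.

```python
from statistics import mean

# Raw data from section 1.3: ages (in years) of ten patients.
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Information 1: the mean age.
mean_age = mean(ages)

# Information 2: the percentage of patients older than 40 years.
pct_above_40 = 100 * sum(1 for a in ages if a > 40) / len(ages)

print(f"Mean age: {mean_age:.1f} years")
print(f"Patients above 40 years: {pct_above_40:.0f}%")
```

The list `ages` is the data; the two printed summaries are the information obtained by processing it.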

1.5 Descriptive Statistics and Inferential Statistics

Descriptive Statistics is the branch of Statistics that includes methods of organizing, summarizing and presenting data in an informative way. Commonly used methods are frequency tables, graphs, and summary measures.

Inferential Statistics is the branch of Statistics that includes methods used to make decisions, estimates, predictions, or generalizations about a population, based on a sample. This includes point estimation, interval estimation, tests of hypotheses, regression analysis, time series analysis, multivariate analysis, etc.

1.6 Classification of Variables

Why do we need to know about types of variables?

You need to know in order to evaluate the appropriateness of the statistical techniques used, and consequently whether the conclusions derived from them are valid. In other words, you cannot tell whether the results of a particular medical research study are credible unless you know what types of variables or measures have been used in obtaining the data.

Qualitative Variables
The characteristic is a quality. The data are categories. They cannot be given numerical values. However, they may be given numerical labels.
Examples: Gender of patient, Ethnicity, Income group

Quantitative Variables
The characteristic is a quantity. The data are numbers. They are obtained by counting or by measuring with some scale.
Examples: Age, Body temperature, Pulse rate, Blood pressure, Fasting blood sugar level, Urine pH value, Number of children

Discrete Variables
Quantitative. Usually, the data are counts. There are impossible values between any two possible values.
Examples: Pulse rate, Number of children

Continuous Variables
Quantitative. Usually, the data are obtained by measuring with a scale. There are no impossible values between any two possible values: any value between two possible values is also a possible value.
Examples: Age, Fasting blood sugar level, Body temperature, Urine pH value

1.6.5 Scales of measurement

Nominal Variables
Qualitative. There is no order or ranking among the categories.
Examples: Gender, Ethnicity

Ordinal Variables
Qualitative. The categories can be ordered or ranked.
Examples: Income group

Interval Variables
Quantitative. Data can be ordered or ranked. There is no absolute zero; zero is only an arbitrary point with which other values can be compared. The difference between two numbers is a meaningful numerical value. They are called interval variables because the intervals between the numbers represent something real, which is not the case with ordinal variables. The ratio of two numbers is not a meaningful numerical value.
Examples: Temperature

Ratio Variables
Possess all the characteristics of an interval variable. There exists an absolute (true) zero. The ratio between different measurements is meaningful.
Examples: Age, Pulse rate, Fasting blood sugar level, Number of children

2. Data Analysis with SPSS

Running SPSS for Windows

Method 01
Click on the Start button at the lower left of your screen and, among the programs listed, find SPSS for Windows and select SPSS 16.0 for Windows.

Method 02
If there is an SPSS shortcut on the desktop, simply put the cursor on it and double-click the left mouse button.

Shown below is an image of the screen you will see when SPSS is ready.

Figure 01: SPSS start-up screen (menu bar, toolbar and start-up dialog box)

You could select any one of the options in the start-up dialog box and click OK, or you could simply click Cancel. If you click Cancel, you can either enter new data in the blank Data Editor or open an existing file using the File menu, as explained later.

2.2 Different Types of Windows in SPSS

The Data Editor

As shown in Figure 01, you will first see the start-up dialog box listing several options; behind it is the Data Editor. The Data Editor is a worksheet used for entering and editing data. It has two panes: Variable View and Data View. The other window types are the Output Viewer, the Syntax Editor and the Script window.

Naming and defining variables

When preparing a new dataset in SPSS, the following attributes must be set in the Variable View. Move your cursor to the bottom of the Data Editor, where you will see a tab labeled Variable View. Click on that tab. A different grid appears, with these column headings:

For each variable we create, we need to specify all or most of the attributes described by these column headings.

Name
Should be a single word. Spaces and special characters (!, ?, *, ) are not allowed. Each variable name must be unique; duplication is not allowed. The underscore character is frequently used where a space is desired in a name.

Type
Click within the Type column, and a small gray button marked with three dots will appear; click on it and you will see this dialog box. Numeric is the default type. (Numeric and string types are the ones used for most variables.) (For a full description of each of the variable types, click on the Help button.)

Width & Decimals
Applicable to numeric variables.

Label
This is an optional attribute which can be used for entering a more detailed name.

Values
This option allows the user to configure the coding structure for categorical variables. (In the Values column, click on the word None and then click the gray

box with three dots. This opens the Value Labels dialog box.) (E.g. type 1 in the Value box and type male in the Label box. Click Add. Then type 0 in the Value box and female in the Label box. Click Add and then click OK.)

Missing
The user can assign codes to represent missing observations.

Measure
The scale of measurement applicable to the variable. Both interval and ratio scales are referred to as the Scale type.

Entering Data

The Data View pane of the Data Editor window is used to enter the data. Displayed initially is an empty spreadsheet with the variable names you have defined appearing as the column headings.

Saving a Data File

On the File menu, choose Save As. In the Save in box, select the destination directory you have chosen (in our example, we are saving it to the Desktop). Then give a suitable file name and click Save.

2.2.2 Output Viewer

Displays outputs and errors. The extension of the saved file is .spv.

2.3 Reading data to the SPSS

Data can be entered directly or imported from a number of different sources. The process for reading data stored in SPSS-format data files and in spreadsheet applications such as Microsoft Excel will be covered in the classroom session.

SPSS-format data files are organized by cases (rows) and variables (columns).

3. Descriptive Analysis of Data

Descriptive statistics consists of organizing and summarizing the information collected. It describes the information collected through numerical measurements, charts, graphs and tables. The main purpose of descriptive statistics is to provide an overview of the information collected.

3.1 Organizing Qualitative Data

Recall that qualitative data provide non-numerical measures that categorize or classify an individual. When qualitative data are collected, we are often interested in determining the number of individuals that occur within each category.

Tabular Data Summaries

A frequency table (frequency distribution) is a listing of the values a variable takes in a data set, along with how often (frequency) each value occurs.

Definition 3.1: The frequency is the number of observations in the data set that fall into a particular class.

Definition 3.2: The relative frequency is the class frequency divided by the total number of observations in the data set; that is,
Relative frequency = Frequency / Total number of observations

Definition 3.3: The percentage is the relative frequency multiplied by 100; that is,
Percentage = Relative frequency * 100

A comparison of relative frequencies is usually more useful than a comparison of absolute frequencies.

One-way frequency tables (Simple frequency table)
Analyze > Descriptive Statistics > Frequencies (select the variable and click OK)
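Definitions 3.1 to 3.3 can be sketched outside SPSS as follows. This is only an illustration in Python: the activity data and the helper name `frequency_table` are hypothetical and do not come from the workshop data set.

```python
from collections import Counter

# Hypothetical activity levels for 10 respondents (illustrative data only).
activity = ["none", "slight", "moderate", "moderate", "a lot",
            "slight", "moderate", "a lot", "moderate", "none"]


def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)   # Definition 3.1: frequency per class
    n = len(values)            # total number of observations
    return {cat: (f, f / n, 100 * f / n) for cat, f in counts.items()}


table = frequency_table(activity)
for cat, (freq, rel, pct) in table.items():
    print(f"{cat:<10}{freq:>4}{rel:>8.2f}{pct:>8.1f}%")
```

Each row holds the frequency, the relative frequency (frequency divided by the number of observations) and the percentage (relative frequency times 100).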

Table 01: Composition of the sample by activity

Note: The Valid Percent column takes missing values into account. For instance, if there were one missing value in this data set, the valid number of cases would be 91. In that case, the valid percentage of the slight category would be 11%. Note that Percent and Valid Percent will both always total 100%.

The Cumulative Percent is the cumulative percentage of the cases for a category and all categories listed above it in the table. The cumulative percentages are not meaningful, of course, unless the scale has ordinal properties.

3.2 Cross classification tables

Cross classification tables (contingency tables/two-way tables) display the relationship between two or more categorical (nominal or ordinal) variables.

Analyze > Descriptive Statistics > Crosstabs

Note: The Crosstabs command does not present percentages by default. You can add Row, Column and Total percentages as appropriate using the Cells option in the Crosstabs command window.

Table 02: Composition of the sample by smoke and gender
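To show what the Row percentages option adds to a two-way table, here is a small Python sketch. The records below are hypothetical (they are not the workshop data behind Table 02), and the helper `row_percentages` is an illustrative name.

```python
from collections import Counter

# Hypothetical (gender, smoker) records; illustrative data only.
records = [("male", "yes"), ("male", "no"), ("female", "no"),
           ("female", "no"), ("male", "yes"), ("female", "yes"),
           ("male", "no"), ("female", "no")]

counts = Counter(records)                     # cell counts of the two-way table
genders = sorted({g for g, _ in records})     # row categories
smoke_levels = sorted({s for _, s in records})  # column categories


def row_percentages(gender):
    """Percentage of each smoking category within one gender (row %)."""
    row_total = sum(counts[(gender, s)] for s in smoke_levels)
    return {s: 100 * counts[(gender, s)] / row_total for s in smoke_levels}


for g in genders:
    pct = row_percentages(g)
    cells = "  ".join(f"{s}: {counts[(g, s)]} ({pct[s]:.0f}%)" for s in smoke_levels)
    print(f"{g:<8}{cells}")
```

Column and total percentages follow the same pattern, dividing each cell count by the column total or by the overall number of cases.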

3.3 Graphical Presentation for Categorical Data

The most effective way to present information is by means of visual display. Graphs are frequently used in statistical analyses both as a means of uncovering patterns in a set of data and as a means of conveying the important information from a survey in a concise and accurate fashion.

Bar Charts

Simple Bar Chart
Graphs > Legacy Dialogs > Bar
Choose the options Simple and Summaries for groups of cases.
Choose the relevant variable as the category axis.

Cluster Bar Chart
Graphs > Legacy Dialogs > Bar
Choose the options Cluster and Summaries for groups of cases.

Component Bar Chart (Sub-divided bar diagram)
These diagrams show the total of the values and its break-up into parts. The bar is subdivided into parts in proportion to the values given in the data and may be drawn on absolute figures or percentages. Each component occupies a part of the bar proportional to its share in the total. To distinguish the components from one another, different colors or shades may be used. When a sub-divided bar diagram is drawn on a percentage basis, it is called a percentage bar diagram. The components should be kept in the same order in each bar.

Pie Chart
SPSS Command: Graphs > Legacy Dialogs > Pie > Define

3.2 Organizing Quantitative Data

Grouped frequency tables

In order to construct a grouped frequency distribution, the numerical variable should be classified first. We can use the Recode option in SPSS to perform this classification. Once the variable has been classified into a different variable, a frequency table can be prepared to present the grouped frequency distribution.

SPSS command for Recode (into different variables):
Transform > Recode into Different Variables
or
Transform > Visual Binning

Graphical Presentation of Numerical Data

When presenting and analyzing the behavior of a numerical variable, different graphical options such as the histogram, dot plot and box plot can be used.

SPSS commands
Histogram: Graphs > Legacy Dialogs > Histogram
Dot plot: Graphs > Legacy Dialogs > Scatter/Dot > Simple Dot > Define
Box plot: Graphs > Legacy Dialogs > Boxplot > Simple > Define

3.3 Summary measures

SPSS Command
Analyze > Descriptive Statistics > Frequencies > Statistics
Analyze > Descriptive Statistics > Descriptives
Analyze > Descriptive Statistics > Explore

Central Tendency

Mean: x̄ = (x1 + x2 + ... + xn)/n
Median: The value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median.
Mode: The most frequent observation of the variable in the data set.

Measures of Dispersion

Range: The difference between the largest data value and the smallest data value.
Sample variance: s² = Σ(xi − x̄)² / (n − 1)
Sample standard deviation: s = √(s²)
Inter-quartile range: Measures the spread of the data around the median. The range of the middle 50% of the data is called the inter-quartile range.

Quartiles: The quartiles of a set of values are the three points that divide the data set into four groups, each representing a fourth of the population being sampled.

Measures of skewness: Skewness is the characteristic that describes the lack of symmetry.

Kurtosis: The degree of peakedness of a distribution, usually taken relative to a normal distribution.
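As a rough cross-check on the measures listed above, the following Python sketch computes them for the patient ages from section 1.3 using only the standard library. Quartiles have several competing definitions; the `statistics.quantiles` default ("exclusive") method is used here as an assumption, and SPSS may report slightly different values.

```python
import statistics as st

# Ages (in years) of ten patients, taken from section 1.3.
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Central tendency
mean_age = st.mean(ages)
median_age = st.median(ages)
mode_age = st.mode(ages)

# Dispersion
data_range = max(ages) - min(ages)
sample_var = st.variance(ages)   # sample variance, divisor n - 1
sample_sd = st.stdev(ages)       # square root of the sample variance

# Quartiles (default "exclusive" method, an assumption) and inter-quartile range
q1, q2, q3 = st.quantiles(ages, n=4)
iqr = q3 - q1

print(f"mean={mean_age:.1f} median={median_age} mode={mode_age}")
print(f"range={data_range} variance={sample_var:.2f} sd={sample_sd:.2f}")
print(f"Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")
```

Skewness and kurtosis are not part of the standard library and are left out of this sketch.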

3.4 Scatter Plot

When you analyze bivariate data, it is best to start with a suitable graph. In a quantitative bivariate data set, we have an (x; y) pair for each sampling unit, where x denotes the independent variable and y denotes the dependent variable. Each (x; y) pair can be considered as a point on the Cartesian plane. A scatter plot is a plot of all the (x; y) pairs in the data set.

The purpose of a scatter plot is to illustrate diagrammatically any relationship between two quantitative variables. If the variables are related, what kind of relationship is it, linear or nonlinear? If the relationship is linear, the scatter plot will show whether it is negative or positive.

SPSS Command: Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter > Define

3.5 Correlation

The correlation coefficient, r, lies between -1 and +1.
When r = 1, it signifies a perfect positive linear relationship.
When r = -1, it signifies a perfect negative linear relationship.
The further away r is from 0, the stronger the correlation. Figure 6.5 shows some examples.

SPSS Command: Analyze > Correlate > Bivariate
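The handout does not give a formula for r. Assuming it refers to the Pearson product-moment coefficient, a minimal Python sketch looks like this; the function name `pearson_r` and the small data sets are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]      # exact increasing line
y_down = [10 - 3 * v for v in x]   # exact decreasing line
y_noisy = [2, 1, 4, 3, 5]          # positive but imperfect relationship

print(pearson_r(x, y_up), pearson_r(x, y_down), pearson_r(x, y_noisy))
```

The two exact lines give values at the ends of the range, while the noisy data give a value strictly between 0 and 1.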

4. Fundamentals of Statistical Inference

The need for making educated guesses and drawing conclusions about some group of units of interest arises in almost every field. Such a group of interest is called a population. In research, the population is the group of units to which you wish to generalize your conclusions.

Even though decisions have to be made about the population of interest, it is often impossible or very difficult to collect data from the whole population, because of practical constraints on the available money, time and labour, or because of the nature of the population. Therefore, data are often collected from only a subset of the population. Such a subset is called a sample.

The process of making educated guesses and drawing conclusions about a population, using a sample from that population, is called statistical inference. Usually this involves collecting suitable data, analyzing the data using suitable statistical techniques, measuring the uncertainty of the results and drawing conclusions.

Statistical inference problems usually involve one or more unknown constants related to the population of interest. Such unknown constants are called parameters. Examples of parameters are:
the total of the values of a variable X for the units of a finite population (the population total);
the mean of the values of X for the units of a finite population (the population mean);
the proportion of units with some specified characteristic (the population proportion);
the mean of some random variable (the expected value).
In addition, we come across parameters in various models, such as regression models and probability distributions.

Statistical inference problems often involve estimation of parameters and tests of hypotheses concerning parameters. Estimation can take the form of point estimation and/or interval estimation.

4.1 Point Estimation

Point estimation involves using the sample data to calculate a single number to estimate the parameter of interest. For instance, we might use the sample mean x̄ to estimate the population mean μ. The problem is that two different samples are very likely to result in different sample means, and thus there is some degree of uncertainty involved. A point estimate does not provide any information about the inherent variability of the estimator; we do not know how close x̄ is to μ in any given situation. However, x̄ is more likely to be near the true population mean if the sample on which it is based is large.

4.2 Interval Estimation

This method is often preferred. The technique provides a range of reasonable values that is intended to contain the parameter of interest; this range of values is called a confidence interval. In interval estimation we derive an interval so that we can say that the parameter lies within the interval with a given level of confidence.

4.3 Terminology and Notation

Estimate
An approximate value for a parameter, determined using a sample of data, is called a point estimate or, in short, an estimate.

Estimator
We obtain an estimate by substituting the sample of data into a formula. Such a formula is called an estimator. An estimator is a function of the data.

Notation
We usually use Greek letters to denote parameters. For example, the population mean, population standard deviation and population proportion are usually denoted by µ, σ and θ respectively.

Example: Suppose that we are interested in estimating the mean µ and the variance σ² of a population. Let X1, X2, ..., X5 be 5 random observations from this population. Let {3, 5, 2, 1, 2} be one observed sample from this population and {4, 1, 3, 2, 1} be another observed sample from this population. Table 01 illustrates the terms parameter, estimator and estimate.

Table 01
Parameter | Estimator | Estimate 01 (using {3, 5, 2, 1, 2}) | Estimate 02 (using {4, 1, 3, 2, 1})
µ |
σ² |

4.4 Point Estimation of Population Mean

Suppose X is a variable defined on the units of a large population and we are interested in the population mean μ. Suppose we have selected a random sample of n units and we have observed X on those units. Let x1, x2, ..., xn be the observed values of X. Then x̄ = (x1 + x2 + ... + xn)/n can be used as an approximate value for the population mean. Therefore, we say that x̄ is an estimate for μ. It is a point estimate.

In order to estimate the population mean using the sample mean, one of the following options can be used. These were introduced in the previous section.
Analyze > Descriptive Statistics > Frequencies > Statistics
Analyze > Descriptive Statistics > Descriptives
Analyze > Descriptive Statistics > Explore

Bound on the error of x̄ and confidence intervals

Usually an estimate is not exactly equal to the parameter. The difference between the actual value of the parameter and the estimate is called the error of the estimate. Since we do not know the actual value of the parameter, we cannot know the exact error in our estimate. However, we can place a bound on the error with a known level of confidence. For example,

using statistical theory, we may be able to make a statement like `we are 95% confident that the error of the estimate is less than 75'. This is equivalent to saying that we are 95% confident that |x̄ − μ| < 75. This is equivalent to saying that we are 95% confident that x̄ − 75 < μ < x̄ + 75. This means we are 95% confident that μ is in the interval (x̄ − 75, x̄ + 75). Such an interval is called a 95% confidence interval.
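To make the distinction between an estimator and an estimate from section 4.3 concrete, the sketch below applies the same two formulas to the two observed samples from the example. It is written in Python as an illustration; the use of the n − 1 divisor for the variance estimator is an assumption, since the handout's Table 01 entries did not survive.

```python
from statistics import mean, variance


def estimator_mean(sample):
    """Estimator of the population mean: the sample mean."""
    return mean(sample)


def estimator_variance(sample):
    """Estimator of the population variance (n - 1 divisor, an assumption)."""
    return variance(sample)


# The two observed samples from the example in section 4.3.
sample_1 = [3, 5, 2, 1, 2]
sample_2 = [4, 1, 3, 2, 1]

estimates = {
    "sample 1": (estimator_mean(sample_1), estimator_variance(sample_1)),
    "sample 2": (estimator_mean(sample_2), estimator_variance(sample_2)),
}

for name, (m, v) in estimates.items():
    print(f"{name}: estimate of mean = {m:.1f}, estimate of variance = {v:.1f}")
```

The functions are the estimators; the numbers they return for a particular sample are the estimates, and they differ from sample to sample.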

Computing an Appropriate Confidence Interval for a Population Mean

Is n ≥ 30?
  Yes: Is the value of σ known?
    Yes: Use x̄ ± z(α/2) σ/√n
    No: Use the sample standard deviation s to estimate σ and use x̄ ± z(α/2) s/√n or, more correctly, use x̄ ± t(α/2) s/√n. Since n is large, there is little difference between these intervals.
  No: Is the population normal?
    Yes: Is the value of σ known?
      Yes: Use x̄ ± z(α/2) σ/√n
      No: Use x̄ ± t(α/2) s/√n
    No: Use a nonparametric technique, or increase the sample size to at least 30 to develop a confidence interval.

Small sample from a normal population

Example 1
A researcher wishes to estimate the average number of heart beats per minute for a certain population. In one such study the following data were obtained from 16 individuals.

77, 92, 93, 77, 98, 81, 76, 71, 100, 87, 88, 86, 97, 95, 81, 96

It is known from past research that the number of heart beats per minute among humans is normally distributed. Find a 90% confidence interval for the mean.

SPSS Command for the interval estimation of a population mean:
Analyze > Descriptive Statistics > Explore

Note: Use Statistics in the Explore command to set the confidence level if it needs to be changed. The default confidence level is 95%.
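For readers who want to check the SPSS output of Example 1 by hand, the following Python sketch builds the t-based 90% interval. The critical value 1.753 (upper 5% point of the t distribution with 15 degrees of freedom) is taken from a standard t table, and the helper name `t_interval` is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Heart beats per minute for the 16 individuals in Example 1.
heart_beats = [77, 92, 93, 77, 98, 81, 76, 71, 100, 87, 88, 86, 97, 95, 81, 96]

# Upper 5% point of the t distribution with n - 1 = 15 degrees of freedom,
# read from a standard t table (gives a two-sided 90% interval).
T_CRIT_90_DF15 = 1.753


def t_interval(sample, t_crit):
    """Return (lower, upper) for the interval mean +/- t * s / sqrt(n)."""
    n = len(sample)
    centre = mean(sample)
    margin = t_crit * stdev(sample) / sqrt(n)
    return centre - margin, centre + margin


lower, upper = t_interval(heart_beats, T_CRIT_90_DF15)
print(f"Sample mean: {mean(heart_beats):.2f}")
print(f"90% confidence interval: ({lower:.2f}, {upper:.2f})")
```

For a different confidence level or sample size, only the tabulated critical value passed to `t_interval` needs to change.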

Interpretation: We are 90% confident that the mean heart beat level for the population is between ( , ).

Interpretation
What do we mean by saying that we are 90% confident that the mean heart beat level for the population is between ( , )?

Example 02
As reported by the US National Center for Health Statistics, the mean serum high density lipoprotein (HDL) cholesterol of female years old is μ = 53. Dr. Paul wants to estimate the mean serum HDL cholesterol of his years old female patients. He randomly selects 15 of his year old patients and obtains the data as shown.

65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54

a) Use the data to compute a point estimate for the population mean serum HDL cholesterol in patients.
b) Construct a 95% confidence interval for the mean serum HDL cholesterol for the patients. Interpret the result.

Note: In this problem it is not given that the population is normally distributed. Since the sample size is small, we must verify that serum HDL cholesterol is normally distributed. If a population cannot be assumed normal, we must use large-sample or nonparametric techniques. However, if we can assume that the parent population is normal, then small samples can be handled using the t distribution.

Assessing normality

The assumption of normality is a prerequisite for many inferential statistical techniques. There are a number of different ways to explore this assumption graphically:
Histogram
Stem-and-leaf plot
Boxplot
Normal probability plot

Furthermore, a number of statistics are available to test normality:
Kolmogorov-Smirnov statistic, with a Lilliefors significance level, and the Shapiro-Wilk statistic
Skewness
Kurtosis

Normal probability plots

1. Select the Analyze menu.
2. Click on Descriptive Statistics and then Explore to open the Explore dialogue box.
3. Select the variable you require (i.e. HDL) and click on the button to move this

variable into the Dependent List: box.
4. Click on the Plots command pushbutton to obtain the Explore: Plots sub-dialogue box.
5. Click on the Normality plots with tests check box, and ensure that the Factor levels together radio button is selected in the Boxplots display.
6. Click on Continue.
7. In the Display box, ensure that Both is activated.
8. Click on the Options command pushbutton to open the Explore: Options sub-dialogue box.
9. In the Missing Values box, click on the Exclude cases pairwise radio button. If this option is not selected then, by default, any case with missing data will be excluded from the analysis. That is, plots and statistics will be generated only for cases with complete data.
10. Click on Continue and then OK.

Normal Probability Plot

In a normal probability plot, each observed value is paired with its expected value from the normal distribution. If the sample is from a normal distribution, then the cases fall more or less in a straight line.
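The pairing behind a normal probability plot can be sketched as follows, using the HDL data from Example 02. This is a Python illustration, not the SPSS procedure: the plotting position (i − 0.5)/n is an assumption, and statistical packages may use a different formula. The helper name `normal_plot_pairs` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

# Serum HDL cholesterol for the 15 patients in Example 02.
hdl = [65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54]


def normal_plot_pairs(sample):
    """Pair each ordered observation with its expected normal value.

    The plotting position (i - 0.5) / n is an assumption made for this
    sketch; other conventions exist.
    """
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    ordered = sorted(sample)
    return [(x, fitted.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]


pairs = normal_plot_pairs(hdl)
for observed, expected in pairs:
    print(f"{observed:>4}  {expected:7.2f}")
```

Plotting the observed values against the expected values gives the normal probability plot; points close to a straight line are consistent with a normal population.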

Kolmogorov-Smirnov and Shapiro-Wilk statistics

The Kolmogorov-Smirnov statistic, with a Lilliefors significance level for testing normality, is produced with the normal probability and detrended probability plots. If the significance level is greater than 0.05, then normality is assumed.

Since the conditions are satisfied, we can proceed with the t-based confidence intervals.

Large sample from a normal distribution (σ unknown)

Example 03
A researcher is interested in obtaining an estimate of the average level of some enzyme in a certain human population. He has taken a sample of 35 individuals and determined the level of the enzyme in each individual. It is known from past research that the level of this enzyme among humans is normally distributed. The following are the values:

20, 11, 32, 25, 6, 23, 19, 24, 15, 31, 19, 23, 21, 27, 17, 20, 23, 23, 22, 13, 15, 28, 27, 18, 11, 32, 23, 28, 14, 23, 21, 25, 19, 29, 17

Construct a 95% confidence interval for the population mean and interpret the result.

Large sample from a non-normal distribution, or we do not know whether the data are normally distributed (σ unknown)

Example 04 (Pulse data set)
1. Construct a 95% confidence interval for the mean pulse rate of all males.
2. Construct a 95% confidence interval for the mean pulse rate of all females.

3. Compare the preceding results. Can we conclude that the population means for males and females are different? Why or why not?

Note: We said that if we do not know σ (which is almost always the case) and the sample size n is large (say at least 30), then we can estimate σ by s in the z-based confidence interval (x̄ ± z(α/2) s/√n). It can be argued, however, that because the t-based confidence interval (x̄ ± t(α/2) s/√n) is a statistically correct interval that does not require that we know σ, it is best, if we do not know σ, to use this interval for any sample size, even for a large sample.

Most common t tables give t points for degrees of freedom from 1 to 30, so we would need a more complete t table or a computer software package to use the t-based confidence interval for a sample whose size n exceeds 31. For large samples (n > 30), the traditional by-hand approach is to invoke the Central Limit Theorem, to estimate σ using the sample standard deviation (s) and to construct an interval using the normal distribution, but this is just a practical approach from pre-computing days. With software like SPSS, the default presumption is that we do not know σ, and so the Explore command automatically uses the sample standard deviation and builds an interval using the value of the t distribution rather than the normal.

However, because these intervals do not differ by much when n is at least 30, it is reasonable, if n is at least 30, to use the large-sample, z-based interval as an approximation to the t-based interval. In practice, the values of the normal and t distributions become very close when n exceeds
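The note comparing the z-based and t-based intervals can be checked numerically with the enzyme data of Example 03 (n = 35). In this Python sketch the z point comes from the standard library, while the t point 2.032 (upper 2.5% point, 34 degrees of freedom) is read from a t table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Enzyme levels for the 35 individuals in Example 03.
enzyme = [20, 11, 32, 25, 6, 23, 19, 24, 15, 31, 19, 23, 21, 27, 17, 20, 23, 23,
          22, 13, 15, 28, 27, 18, 11, 32, 23, 28, 14, 23, 21, 25, 19, 29, 17]

n = len(enzyme)
x_bar = mean(enzyme)
std_error = stdev(enzyme) / sqrt(n)   # s / sqrt(n)

z_crit = NormalDist().inv_cdf(0.975)  # upper 2.5% point of the standard normal
T_CRIT_95_DF34 = 2.032                # upper 2.5% point of t with 34 df (t table)

z_margin = z_crit * std_error
t_margin = T_CRIT_95_DF34 * std_error

print(f"z-based 95% interval: ({x_bar - z_margin:.2f}, {x_bar + z_margin:.2f})")
print(f"t-based 95% interval: ({x_bar - t_margin:.2f}, {x_bar + t_margin:.2f})")
```

The t-based interval is slightly wider, but for a sample of this size the two intervals are close.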

5. Hypothesis testing

5.1 Introduction

Sometimes, the objective of an investigation is not to estimate a parameter, but instead to decide which of two contradictory statements about the parameter is correct. This is called hypothesis testing. Hypothesis testing typically begins with some theory, claim or assertion about a particular parameter or several parameters.

In any hypothesis testing problem, there are two contradictory hypotheses under consideration. One is called the null hypothesis; the other is called the alternative hypothesis. The validity of a hypothesis is tested by analyzing the sample. The procedure which enables us to decide whether a certain hypothesis is true or not is called a test of hypothesis.

5.2 Terminology and Notation

Hypothesis: A hypothesis is a statement or claim regarding a characteristic of one or more populations.

Test of Hypothesis: The testing of a hypothesis is a procedure, based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.

Hypothesis testing is based upon two types of hypotheses. The null hypothesis, denoted by H0, is a statement to be tested. The null hypothesis is assumed true until evidence indicates otherwise. The alternative hypothesis, denoted by H1, is a claim to be tested. We are trying to find evidence for the alternative hypothesis.

Table (two-tailed, left-tailed and right-tailed cases)

Computation of Test Statistics

A function of the sample observations (i.e. a statistic) whose computed value determines the final decision regarding acceptance or rejection of H0 is called a test statistic. The appropriate test statistic has to be chosen very carefully, and knowledge of its sampling distribution under H0 (i.e. when the null hypothesis is true) is essential in framing the decision rule. If the value of the test statistic falls in the critical region, the null hypothesis is rejected.

Types of Errors in Hypothesis Testing - Type I and Type II Errors

As stated earlier, we use sample data to determine whether to reject or not reject the null hypothesis. Because this decision is based upon incomplete (i.e. sample) information, there is always the possibility of making an incorrect decision. In fact, there are four possible outcomes from hypothesis testing.

Table 5.2: Four Outcomes from Hypothesis Testing
Conclusion | Reality: H0 is True | Reality: H1 is True
Do not reject H0 | Correct conclusion | Type II error
Reject H0 | Type I error | Correct conclusion

The Level of Significance

The level of significance is the maximum probability of making a Type I error, and it is denoted by α:
α = P(Type I error) = P(rejecting H0 when H0 is true)
The probability of making a Type I error is chosen by the researcher before the sample data are collected. Traditionally, 0.01, 0.05 or 0.1 is taken as α.

Critical Region or Rejection Region

The rejection region or critical region is the region of the standard normal curve corresponding to a predetermined level of significance α. The region under the normal curve which is not covered by the rejection region is known as the acceptance region. Thus the

34 statistic which leads to rejection of null hypothesis H0 gives us the region known as Rejection region or Critical region. The value of the test statistic compute to test the null hypothesis H0 is known as the Critical Value. The Critical value separates the rejection region from the acceptance region. Two - Tailed Left - Tailed Right - Tailed Table 5.3 Methods for making conclusion Method 01: Compare the critical value with the test statistic: Two Tailed Left Tailed Right tailed Table
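Method 01 can be sketched in a few lines of Python. This is only an illustration under the assumption that the test statistic is standard normal under H0; the function names `z_critical_values` and `reject_h0` are made up for this sketch and are not part of SPSS or of the handout.

```python
from statistics import NormalDist


def z_critical_values(alpha, tail):
    """Return the critical value(s) of a Z test at significance level alpha.

    tail is 'two', 'left' or 'right' (labels assumed for this sketch).
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "two":
        c = std_normal.inv_cdf(1 - alpha / 2)  # area alpha/2 in each tail
        return (-c, c)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)  # area alpha in the left tail
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)  # area alpha in the right tail
    raise ValueError("tail must be 'two', 'left' or 'right'")


def reject_h0(z, alpha, tail):
    """Method 01: reject H0 when the test statistic falls in the critical region."""
    crit = z_critical_values(alpha, tail)
    if tail == "two":
        return z < crit[0] or z > crit[1]
    if tail == "left":
        return z < crit[0]
    return z > crit[0]


# Hypothetical computed test statistic, for illustration only
print(z_critical_values(0.05, "two"))
print(reject_h0(2.10, 0.05, "two"))
```

For a t test the same comparison applies, with the critical values taken from the t distribution instead of the standard normal.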

Method 02: Compare the p-value with the significance level.

Table 5.5: Decision rule based on the p-value

Two-Tailed, Left-Tailed and Right-Tailed: reject H0 if p-value < α; otherwise do not reject H0.

Power

The probability of rejecting a false null hypothesis is called the power of the test. The probability of committing a Type II error is denoted by β.

Power = 1 − β

5.3 Formulating a hypothesis

It would be ideal if a test could be derived such that the probabilities of both errors are minimized simultaneously. However, this may not be possible with the available data. Instead, we consider tests for which the probability of one error is controlled. Conventionally, the Type I error is controlled.

Usually, one of the two errors is more serious than the other. In such situations it is reasonable to control the probability of the more serious error. To achieve this, the hypotheses are constructed so that the more serious error is the Type I error.

An alternative way is to take the initially favored claim as the null hypothesis. The initially favored claim will not be rejected in favor of the alternative unless the sample evidence contradicts it and provides strong support for the alternative assertion.

If one of the hypotheses is an equality and the other is an inequality, then the equality hypothesis is taken to be the null hypothesis.
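Method 02 (comparing the p-value with α) can be illustrated in the same way. The sketch below assumes a Z test statistic that is standard normal under H0; `z_p_value` and `decide` are hypothetical helper names introduced only for this example.

```python
from statistics import NormalDist


def z_p_value(z, tail):
    """p-value of a Z test for a 'two', 'left' or 'right' tailed alternative."""
    cdf = NormalDist().cdf  # standard normal cumulative distribution function
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # both tails beyond |z|
    if tail == "left":
        return cdf(z)  # area to the left of z
    if tail == "right":
        return 1 - cdf(z)  # area to the right of z
    raise ValueError("tail must be 'two', 'left' or 'right'")


def decide(p_value, alpha):
    """Method 02: reject H0 when the p-value is below the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"


# Hypothetical test statistic, for illustration only
p = z_p_value(1.50, "right")
print(p, decide(p, 0.05))
```

Both methods lead to the same decision for the same data and the same α; the p-value simply reports how far into the tail the observed statistic lies.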

5.4 Steps in test of hypothesis

1. Set up the Null Hypothesis H0 and the Alternative Hypothesis H1.
2. State the appropriate test statistic and its sampling distribution when the null hypothesis is true.
3. Select the level of significance α of the test, if it is not specified in the given problem.
4. Find the critical region of the test at the chosen level of significance.
5. Compute the value of the test statistic on the basis of the sample data, under the null hypothesis.
6. If the computed value of the test statistic lies in the critical region, reject H0; otherwise do not reject H0.
7. Write the conclusion in plain non-technical language.

5.5 One Sample Hypothesis Tests about Population Mean

Selecting an Appropriate Test Statistic to Test a Hypothesis about a Population Mean

Is n ≥ 30?

  Yes: Is the value of σ known?
    Yes: Use $Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
    No: Use the sample standard deviation s to estimate σ and use $Z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ or, more correctly, use $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$. Since n is large, there is little difference between these tests.

  No: Is the population Normal?
    Yes: Is the value of σ known?
      Yes: Use $Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
      No: Use $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
    No: Use a nonparametric technique, or increase the sample size to at least 30 to conduct a parametric hypothesis test.
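The decision flow above can be written as a small function. This is a sketch only: `choose_test` and the returned labels are invented for illustration, and the threshold of 30 is the rule of thumb used in this handout.

```python
def choose_test(n, sigma_known, population_normal):
    """Suggest a one-sample test about a mean, following the flow chart above.

    n                 -- sample size
    sigma_known       -- True if the population standard deviation is known
    population_normal -- True if the population can be assumed Normal
    """
    if n >= 30:
        # Large sample: the sampling distribution of the mean is roughly normal
        if sigma_known:
            return "Z test with sigma"
        return "t test with s (Z with s gives almost the same result)"
    # Small sample: normality of the population matters
    if not population_normal:
        return "nonparametric test, or increase n to at least 30"
    if sigma_known:
        return "Z test with sigma"
    return "t test with s"


print(choose_test(14, sigma_known=False, population_normal=True))
print(choose_test(35, sigma_known=False, population_normal=False))
```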

5.5.1 A small sample two sided hypothesis

Example 5.1 File: ph.sav

An engineer wants to measure the bias in a pH meter. She uses the meter to measure the pH in 14 neutral substances (pH = 7) and obtains the data stored in ph.sav. Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated at the α = 0.05 level of significance?

Approach: In this case the sample is small (fewer than 30 observations), so we cannot rely on the Central Limit Theorem. With a small sample, we should only use the t test if we can reasonably assume that the parent population is normally distributed. Therefore, before proceeding to the test, we must verify that pH is normally distributed.

Hypothesis to be tested
H0: Data are normally distributed.
H1: Data are not normally distributed.

Analyze > Descriptive Statistics > Explore

According to the Kolmogorov-Smirnov test, the p-value is 0.2 > 0.05. Hence we do not reject H0 at the 0.05 level of significance, and there is no evidence against the assumption that the data are normally distributed. Since the conditions are satisfied, we can proceed with the t test.

Hypothesis to be tested: ........

To conduct a one-sample t-test
1. Select the Analyze menu.
2. Click on Compare Means and then One-Sample T Test to open the One-Sample T Test dialogue box.
3. Select the variable you require (i.e. ph) and click on the arrow button to move the variable into the Test Variable(s): box.
4. In the Test Value: box type the hypothesized mean (i.e. 7).

5. Click on OK.

Calculated value of the test statistic: ........
P-value: ........

Note: In SPSS a column labeled Sig. (usually two-tailed Sig.) displays the p-value of a particular hypothesis test.

Decision: ........
Conclusion: ........
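To see what SPSS computes in the t column, the one-sample t statistic can be reproduced by hand. The sketch below uses made-up pH readings (the contents of ph.sav are not reproduced in this handout), and `one_sample_t` is a hypothetical helper name. The critical value 2.160 is the tabulated value of t for 13 degrees of freedom with 0.025 in each tail.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """Return (t, degrees of freedom) for H0: mu = mu0."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)  # s / sqrt(n), s = sample std deviation
    t = (mean(data) - mu0) / standard_error
    return t, n - 1


# Hypothetical readings, NOT the data in ph.sav
readings = [7.01, 7.04, 6.97, 7.00, 6.99, 6.97, 7.04,
            7.04, 7.01, 7.00, 6.99, 7.04, 7.07, 6.97]

t, df = one_sample_t(readings, mu0=7.0)
t_crit = 2.160  # t table value for df = 13, two-tailed alpha = 0.05
decision = "reject H0" if abs(t) > t_crit else "do not reject H0"
print(t, df, decision)
```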

Note: Performing One-tail Tests using One-Sample T Test Procedure

The One-Sample T Test procedure in SPSS is designed to test a two-tail hypothesis. However, a researcher may need to test a one-tail (left-tail or right-tail) hypothesis. In this situation the p-value for the corresponding test has to be computed using the following criteria.

1. For left-tail tests (i.e. H1: μ < μ0)
If the sample mean is less than μ0 (i.e. t < 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

2. For right-tail tests (i.e. H1: μ > μ0)
If the sample mean is greater than μ0 (i.e. t > 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

Example 5.2

In a study conducted by the U.S. Department of Agriculture, it was found that the mean daily caffeine intake of 20-29 year old females in 2010 was .... milligrams. A nutritionist claims that the mean daily caffeine intake has increased since then. She obtains a simple random sample of 35 females between 20 and 29 years of age and determines their daily caffeine intakes. The results are presented in caffine.sav. Test the nutritionist's claim at the α = 0.05 level of significance.

Approach: The dataset represents a large sample (n = 35), so we can rely on the Central Limit Theorem to assert that the sampling distribution of the sample mean is approximately normal.

Hypothesis: ........
P-value: ........
Decision: ........
Conclusion: ........
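The conversion rule in the note on one-tail tests can be expressed as a short function. This is a sketch: `one_tail_p_value` is a hypothetical name, `sig` stands for the two-tailed Sig. value reported by SPSS, and the numbers in the usage lines are invented.

```python
def one_tail_p_value(sig, t, tail):
    """Convert a two-tailed SPSS Sig. value into a one-tail p-value.

    sig  -- two-tailed p-value reported by SPSS
    t    -- computed t statistic (its sign shows the direction of the effect)
    tail -- 'left' for H1: mu < mu0, 'right' for H1: mu > mu0
    """
    half = sig / 2
    if tail == "left":
        return half if t < 0 else 1 - half
    if tail == "right":
        return half if t > 0 else 1 - half
    raise ValueError("tail must be 'left' or 'right'")


# Invented output values: t = 1.8 with Sig. = 0.08
print(one_tail_p_value(0.08, 1.8, "right"))  # effect in the claimed direction
print(one_tail_p_value(0.08, 1.8, "left"))   # effect against the claimed direction
```

The same rule applies to the paired-samples and independent-samples procedures, since SPSS reports a two-tailed Sig. value in each case.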

Nonparametric Binomial Test for the One-Sample Test procedure

The Binomial Test procedure compares an observed proportion of cases to the proportion expected under a binomial distribution with a specified probability parameter. The observed proportion is defined either by the number of cases having the first value of a dichotomous variable (a variable that has two possible values) or by the number of cases at or below a given cut point on a scale (quantitative) variable.

Hypothesis (to be tested on a quantitative variable)
H0: median = m0 vs H1: median ≠ m0

SPSS command
Analyze > Nonparametric > Binomial Test

Note: Set the cut point to the hypothesized median value.
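The idea behind the binomial test of a median can be sketched as follows. Under H0, each observation falls at or below m0 with probability 0.5, so the count of such observations follows a Binomial(n, 0.5) distribution. The function name `binomial_median_test` and the data are hypothetical, and the exact two-sided p-value computed here (twice the smaller tail, capped at 1) is one common convention that may differ slightly from what SPSS reports.

```python
from math import comb


def binomial_median_test(data, m0):
    """Exact two-sided binomial test of H0: median = m0.

    Returns (k, n, p_value), where k counts observations at or below m0.
    """
    n = len(data)
    k = sum(1 for x in data if x <= m0)  # cases at or below the cut point
    total = 2 ** n  # number of equally likely outcomes under p = 0.5
    lower = sum(comb(n, i) for i in range(0, k + 1)) / total  # P(X <= k)
    upper = sum(comb(n, i) for i in range(k, n + 1)) / total  # P(X >= k)
    p_value = min(1.0, 2 * min(lower, upper))
    return k, n, p_value


# Hypothetical measurements, for illustration only
sample = [6.1, 7.3, 8.0, 9.4, 10.2]
print(binomial_median_test(sample, m0=5.0))
```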

6. Inferences on Two Samples

In the preceding chapter, we used a statistical test of hypothesis to compare the unknown mean or proportion of a single population to some fixed known value. In practical applications, however, it is far more common to compare the means of two different populations, where both parameters are unknown.

In order to perform inference on the difference of two population means, we must first determine whether the data come from independent or dependent samples. Samples are independent when the individuals selected for one sample do not dictate which individuals are to be in the second sample. Samples are dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample.

6.1 Testing hypotheses concerning two population means μ1 and μ2: Dependent Samples

Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ be a random sample of paired observations. Suppose that the x's are identically distributed with population mean $\mu_1$ and population variance $\sigma_1^2$. Also suppose that the y's are identically distributed with population mean $\mu_2$ and population variance $\sigma_2^2$. Let $\mu_d$ be a known constant. Consider the following hypotheses:

Two-Tailed: H0: $\mu_1 - \mu_2 = \mu_d$ vs H1: $\mu_1 - \mu_2 \neq \mu_d$
Left-Tailed: H0: $\mu_1 - \mu_2 = \mu_d$ vs H1: $\mu_1 - \mu_2 < \mu_d$
Right-Tailed: H0: $\mu_1 - \mu_2 = \mu_d$ vs H1: $\mu_1 - \mu_2 > \mu_d$

Rather than consider the two sets of observations to be distinct samples, we focus on the difference in measurements within each pair. Suppose that the observations in our two groups are as follows:

Sample 01 | Sample 02 | Differences within each pair
$x_{11}$  | $x_{12}$  | $d_1 = x_{11} - x_{12}$
$x_{21}$  | $x_{22}$  | $d_2 = x_{21} - x_{22}$
$x_{31}$  | $x_{32}$  | $d_3 = x_{31} - x_{32}$
...       | ...       | ...
$x_{n1}$  | $x_{n2}$  | $d_n = x_{n1} - x_{n2}$

$\bar{d} = \dfrac{1}{n}\sum_{i=1}^{n} d_i$, $\qquad s_d^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$

If the differences are normally distributed or the sample size n is large, the test statistic is

$U = \dfrac{\bar{d} - \mu_d}{s_d/\sqrt{n}}$

Compare the critical value with the test statistic, using the guideline below.

Two-Tailed: if $U < -t_{\alpha/2,\,n-1}$ or $U > t_{\alpha/2,\,n-1}$, reject the null hypothesis.
Left-Tailed: if $U < -t_{\alpha,\,n-1}$, reject the null hypothesis.
Right-Tailed: if $U > t_{\alpha,\,n-1}$, reject the null hypothesis.

Confidence Interval for Matched Pairs Data

We can also create a confidence interval for the mean difference, using the sample mean difference $\bar{d}$, the sample standard deviation of the differences $s_d$, the sample size n and $t_{\alpha/2,\,n-1}$. Remember, a confidence interval about a population mean has the following form:

Point estimate ± Margin of error

The confidence interval about the mean difference $\mu_1 - \mu_2$ is built on this form from $\bar{d}$, $s_d$, n and $t_{\alpha/2,\,n-1}$.
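The paired test statistic and the matching confidence interval can be computed directly from the differences. This is a sketch with hypothetical function names (`paired_t`, `paired_ci`) and invented before/after values; the critical value must be looked up in a t table for n − 1 degrees of freedom and passed in, since it is not computed here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(sample1, sample2, mu_d=0.0):
    """Return (U, df, d_bar, s_d) for paired samples under H0: mu1 - mu2 = mu_d."""
    diffs = [a - b for a, b in zip(sample1, sample2)]  # d_i = x_i1 - x_i2
    n = len(diffs)
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation of the differences
    u = (d_bar - mu_d) / (s_d / sqrt(n))
    return u, n - 1, d_bar, s_d


def paired_ci(d_bar, s_d, n, t_crit):
    """Point estimate +/- margin of error for the mean difference."""
    margin = t_crit * s_d / sqrt(n)
    return d_bar - margin, d_bar + margin


# Invented measurements, for illustration only
before = [210, 235, 208, 190, 172, 244]
after = [190, 225, 200, 188, 173, 228]

u, df, d_bar, s_d = paired_t(before, after)
t_crit = 4.032  # t table value for df = 5 with 0.005 in each tail (99% interval)
print(u, df)
print(paired_ci(d_bar, s_d, len(before), t_crit))
```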

The (1−α)100% confidence interval for the mean difference is given by

$\bar{d} \pm t_{\alpha/2,\,n-1}\,\dfrac{s_d}{\sqrt{n}}$

SPSS Command

Command for Paired-Samples T test
Analyze > Compare Means > Paired Samples T Test

Example 6.1

A dietitian hopes to reduce a person's cholesterol level by using a special diet supplemented with a combination of vitamin pills. Six (6) subjects were pre-tested and then placed on the diet for two weeks. Their cholesterol levels were checked after the two week period. The results are shown below. Cholesterol levels are measured in milligrams per deciliter.

2.1 Test the claim that the cholesterol level before the special diet is greater than the cholesterol level after the special diet at the α = 0.01 level of significance.
2.2 Construct a 99% confidence interval for the difference in mean cholesterol levels. Assume that the cholesterol levels are normally distributed both before and after.

Subject | Before | After

Example 6.2

A physician is evaluating a new diet for patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights are measured before and after the study, and the physician wants to know whether the weights have changed. Test whether there is a statistically significant difference between the pre-diet and post-diet weights of these patients. Use the 5% level of significance.

Step 01: Calculating differences

Transform > Compute Variable

Step 02: Because the sample size is small, we must verify that the difference data are normally distributed.
Analyze > Descriptive Statistics > Explore
Note: Use Plots in the Explore command and select Normality plots with tests.

Step 03: Command for Paired-Samples T test
Analyze > Compare Means > Paired Samples T Test

6.4 Performing One tail Tests using Paired Samples T Test procedure

The Paired Samples T Test procedure in SPSS is designed to test a two-tail hypothesis. However, a researcher may need to test a one-tail (left-tail or right-tail) hypothesis. In this situation the p-value for the corresponding test has to be computed using the following criteria.

1. For left-tail tests (i.e. H1: μ1 − μ2 < 0)
If the sample mean of differences is less than 0 (i.e. t < 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

2. For right-tail tests (i.e. H1: μ1 − μ2 > 0)
If the sample mean of differences is greater than 0 (i.e. t > 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

Example: If a researcher wants to find out whether post-diet weights have increased significantly, determine the p-value and state your findings at the 5% level of significance.

6.5 Nonparametric Wilcoxon Test for Two Related Samples

Hypothesis
H0: median difference = 0 vs H1: median difference ≠ 0

SPSS command
Analyze > Nonparametric > 2 Related Samples
Note: Ensure that Wilcoxon is checked in the Test Type dialog box.

6.6 Testing hypotheses concerning two population means μ1 and μ2: Independent samples

Let $x_1, x_2, x_3, \ldots, x_m$ be a random sample of observations from a certain population with population mean $\mu_1$ and population variance $\sigma_1^2$. Also let $y_1, y_2, \ldots, y_n$ be a random sample of observations from a certain population with population mean $\mu_2$ and population variance $\sigma_2^2$. Further suppose that the two samples are independent. Let $\mu_d$ be a known constant. Consider the following hypotheses:

Two-Tailed: H0: $\mu_1 - \mu_2 = \mu_d$ vs H1: $\mu_1 - \mu_2 \neq \mu_d$
Left-Tailed: H0: $\mu_1 - \mu_2 = \mu_d$ vs H1: $\mu_1 - \mu_2 < \mu_d$
Right-Tailed: H0: $\mu_1 - \mu_2 = \mu_d$ vs H1: $\mu_1 - \mu_2 > \mu_d$

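Case 01: Data from normal distributions, both variances are known

The test statistic is

$U = \dfrac{\bar{x} - \bar{y} - \mu_d}{\sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}}$

Compare the critical value with the test statistic, using the guideline below.

Two-Tailed: if $U < -z_{\alpha/2}$ or $U > z_{\alpha/2}$, reject the null hypothesis.
Left-Tailed: if $U < -z_{\alpha}$, reject the null hypothesis.
Right-Tailed: if $U > z_{\alpha}$, reject the null hypothesis.

Case 02: Data from two normal distributions with unequal variances (both variances are unknown, m and n are small)

The test statistic is

$U = \dfrac{\bar{x} - \bar{y} - \mu_d}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}}$

where $s_1^2$ and $s_2^2$ are the sample variances of the x's and the y's. Compare the critical value with the test statistic, using the guideline below.

Two-Tailed: if $U_{cal} < -t_{\alpha/2,\,\nu}$ or $U_{cal} > t_{\alpha/2,\,\nu}$, reject the null hypothesis.
Left-Tailed: if $U_{cal} < -t_{\alpha,\,\nu}$, reject the null hypothesis.
Right-Tailed: if $U_{cal} > t_{\alpha,\,\nu}$, reject the null hypothesis.

Where

$\nu = \dfrac{\left(\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}\right)^2}{\dfrac{(s_1^2/m)^2}{m-1} + \dfrac{(s_2^2/n)^2}{n-1}}$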
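For Case 02, the test statistic and the approximate degrees of freedom ν can be computed as in the sketch below. The function name `unequal_variance_t` and the data are hypothetical; ν is generally not a whole number, and the critical value for it would be taken from a t table or from software.

```python
from math import sqrt
from statistics import mean, variance


def unequal_variance_t(x, y, mu_d=0.0):
    """Return (U, nu) for two independent samples with unequal variances."""
    m, n = len(x), len(y)
    vx = variance(x) / m  # s1^2 / m, using the sample variance
    vy = variance(y) / n  # s2^2 / n
    u = (mean(x) - mean(y) - mu_d) / sqrt(vx + vy)
    nu = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return u, nu


# Invented samples, for illustration only
group_x = [1, 2, 3, 4, 5]
group_y = [2, 4, 6, 8, 10]
print(unequal_variance_t(group_x, group_y))
```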

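(1−α)100% Confidence Interval about the Difference of Two Means

$(\bar{x} - \bar{y}) \pm t_{\alpha/2,\,\nu}\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}$

Case 03: Data normal, both variances are unknown, but it is known that they are equal

$\sigma_1^2 = \sigma_2^2 = \sigma^2$

$s_1^2 = \dfrac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2$, $\qquad s_2^2 = \dfrac{1}{n-1}\sum_{j=1}^{n}(y_j - \bar{y})^2$

Also let

$s_p^2 = \dfrac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2}$

The test statistic is

$U = \dfrac{\bar{x} - \bar{y} - \mu_d}{s_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$

Compare the critical value with the test statistic, using the guideline below.

Two-Tailed: if $U_{cal} < -t_{\alpha/2,\,m+n-2}$ or $U_{cal} > t_{\alpha/2,\,m+n-2}$, reject the null hypothesis.
Left-Tailed: if $U_{cal} < -t_{\alpha,\,m+n-2}$, reject the null hypothesis.
Right-Tailed: if $U_{cal} > t_{\alpha,\,m+n-2}$, reject the null hypothesis.

(1−α)100% Confidence Interval about the Difference of Two Means

$(\bar{x} - \bar{y}) \pm t_{\alpha/2,\,m+n-2}\; s_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}$

SPSS Command for the Independent-Samples T test
Analyze > Compare Means > Independent Samples T Test
Note: In the Define Groups option, enter the codes of the groups to be compared.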
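The pooled-variance statistic of Case 03 can be sketched in the same style. `pooled_t` is a hypothetical name and the data are invented; the function returns the statistic together with its m + n − 2 degrees of freedom, and the critical value is again left to a t table.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y, mu_d=0.0):
    """Return (U, df) for two independent samples assuming equal variances."""
    m, n = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    u = (mean(x) - mean(y) - mu_d) / sqrt(sp2 * (1 / m + 1 / n))
    return u, m + n - 2


# Invented samples, for illustration only
print(pooled_t([11.8, 8.2, 7.1, 13.0], [12.1, 8.3, 3.8, 7.2]))
```

When the two sample variances are very different, the Case 02 statistic is usually the safer choice; SPSS reports both in the Independent Samples T Test output.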

6.6.1 Performing One tail Tests using Independent Samples T Test procedure

The Independent Samples T Test procedure in SPSS is designed to test a two-tail hypothesis. However, a researcher may need to test a one-tail (left-tail or right-tail) hypothesis. In this situation the p-value for the corresponding test has to be computed using the following criteria.

1. For left-tail tests (i.e. H1: μ1 < μ2)
If the difference of the sample means is less than 0 (i.e. t < 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

2. For right-tail tests (i.e. H1: μ1 > μ2)
If the difference of the sample means is greater than 0 (i.e. t > 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

6.7 The Nonparametric Mann-Whitney U Test for Two Independent Samples

What should you do if the t test assumptions are markedly violated (e.g. if the response variable is not normal)? One answer is to run the appropriate nonparametric test, which in this case is called the Mann-Whitney (M-W) U test.

Hypothesis
H0: median1 = median2 vs H1: median1 ≠ median2

SPSS command
Analyze > Nonparametric > 2 Independent Samples
Note: Ensure that Mann-Whitney U test is checked. In the Define Groups option, enter the codes of the groups to be compared.
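The Mann-Whitney test works on ranks rather than on the raw values. The sketch below shows how the U statistics are obtained from the rank sum of the first sample, with tied values sharing their average rank. The helper names `average_ranks` and `mann_whitney_u` and the data are hypothetical, and the p-value (which SPSS reports) is not computed here.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U1, U2), the U statistics of samples x and y."""
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:len(x)])  # rank sum of the first sample
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return u1, u2


# Invented scores, for illustration only
print(mann_whitney_u([18.1, 6.0, 10.8, 11.0], [16.6, 13.9, 11.3, 26.5, 17.4]))
```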

Example 6.3:

The purpose of a study by Eidelman et al. was to investigate the nature of lung destruction in cigarette smokers before the development of marked emphysema. Three lung destructive index measurements were made on the lungs of lifelong nonsmokers and smokers who died suddenly outside the hospital of nonrespiratory causes. A larger score indicates greater lung damage. For one of the indexes, the scores yielded by the lungs of a sample of nine nonsmokers and a sample of 16 smokers are shown in Table 02. We wish to know if we may conclude, on the basis of these data, that smokers, in general, have greater lung damage as measured by this destructive index than do nonsmokers.

Nonsmokers | Smokers

Example 6.4:

Researchers wished to know if they could conclude that two populations of infants differ with respect to the mean age at which they walked alone. The following data (age in months) were collected:

Sample from population A: 9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75
Sample from population B: 12.5, 9.5, 13.5, 13.75, 12.0, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0,

7. Comparison of Multiple Groups

In the preceding chapter, we covered techniques for determining whether a difference exists between the means of two independent populations. It is not unusual, however, to encounter situations in which we wish to test for differences among three or more independent means rather than just two. The extension of the two sample t test to three or more samples is known as the Analysis of Variance, or ANOVA for short.

Definition: Analysis of Variance (ANOVA) is an inferential method that is used to test the equality of three or more population means.

7.1 One-Way Analysis of Variance

This is the simplest type of analysis of variance. The one-way analysis of variance is a form of design and subsequent analysis used when the data can be classified into k categories or levels of a single factor, and the equality of the k class means in the population is to be investigated.

For example, five fertilizers are each applied to four plots of wheat and the yield of wheat on each plot is recorded. We may be interested in finding out whether the effects of these fertilizers on the yield are significantly different or, in other words, whether the samples have come from the same normal population. The answer to this problem is provided by the technique of analysis of variance. The basic purpose of the analysis of variance is to test the homogeneity of several means. In order to perform an ANOVA test, certain requirements must be satisfied.

7.2 Requirements of ANOVA Test

1. Independent random samples have been taken from each population.
2. The populations are normally distributed.
3. The population variances are all equal.

7.3 The Hypothesis test of Analysis of Variance

H0: $\mu_1 = \mu_2 = \cdots = \mu_k$
H1: At least one of the population means differs from the others

7.4 Decomposition of Total Sum of Squares

The name analysis of variance is derived from a partitioning of total variability into its component parts. Let $y_{ij}$ be the j-th observation of the i-th factor level. The data collected under the factor levels can be represented as follows.

Group (Factor Level / Treatment) | 1 | 2 | 3 | ... | k
Number of observations | $n_1$ | $n_2$ | $n_3$ | ... | $n_k$
Mean | $\bar{y}_1$ | $\bar{y}_2$ | $\bar{y}_3$ | ... | $\bar{y}_k$
Variance | $s_1^2$ | $s_2^2$ | $s_3^2$ | ... | $s_k^2$

Grand mean: $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \dfrac{1}{n}\sum_{i=1}^{k} n_i\,\bar{y}_i$, where $n = n_1 + n_2 + \cdots + n_k$.

The total variation present in the data is measured by the sum of squares of the deviations of all observations from the grand mean. Thus

Total Sum of Squares (SSTo) $= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$

The total variation in the observations can be split into the following two components.

1. The variation between the classes, or the variation due to different bases of classification, commonly known as treatments.
2. The variation within the classes, i.e. the inherent variation of the random variable within the observations of a class. This variation is due to chance causes which are beyond human control.

The sum of squares due to differences in the treatment means is called the treatment sum of squares or between sum of squares and is given by the expression

Sum of squares of the differences between treatments, or Treatment Sum of Squares (SSTr) $= \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2$

The sum of squares due to inherent variability in the experimental material is called the sum of squares of the differences within the treatments.

Sum of squares of differences within the treatments (SSE) $= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

It can be shown that

$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

Total sum of squares (SSTo) = Sum of squares between treatments (SSTr) + Sum of squares within treatments (SSE)

7.5 The Mean Squares

In finding the average squared deviations due to treatment and to error, we divide each sum of squares by its degrees of freedom. We call the two resulting averages mean square treatment (MSTr) and mean square error (MSE), respectively.

The number of degrees of freedom associated with SSTr = k − 1, so MSTr $= \dfrac{SSTr}{k-1}$

The number of degrees of freedom associated with SSE = n − k, so MSE $= \dfrac{SSE}{n-k}$

The Expected Values of the Statistics MSTr and MSE under the null hypothesis

$E(MSE) = \sigma^2$ ........ (1)

$E(MSTr) = \sigma^2 + \dfrac{\sum_{i=1}^{k} n_i\,(\mu_i - \mu)^2}{k-1}$ ........ (2)

where $\mu_i$ is the mean of population i and $\mu$ is the combined mean of all k populations.

When the null hypothesis of ANOVA is true and all k population means are equal, MSTr and MSE are two independent, unbiased estimators of the common population variance $\sigma^2$. If, on the other hand, the null hypothesis is not true and differences do exist among the k population means, then MSTr will tend to be larger than MSE. This happens because, when not all population means are equal, the second term in equation (2) is a positive number.

7.6 The test statistic in analysis of variance

Under the assumptions of ANOVA, the ratio MSTr/MSE has an F distribution with k − 1 degrees of freedom for the numerator and n − k degrees of freedom for the denominator when the null hypothesis is true.

Decision rule

If $F_{cal} > F_{\alpha,\,k-1,\,n-k}$, reject H0.

Alternatively, p-value $= \Pr(F > F_{cal})$ under the $F_{k-1,\,n-k}$ distribution. Thus reject H0 if p-value < α (level of significance).

ANOVA Table

Source of variation | Sum of Squares | Degrees of freedom | Mean Squares | F test statistic | p-value
Treatment | SSTr | k − 1 | MSTr | F = MSTr/MSE | $\Pr(F > F_{cal})$
Error | SSE | n − k | MSE | |
Total | SSTo | n − 1 | | |
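The entries of the ANOVA table can be computed step by step, which also makes the decomposition SSTo = SSTr + SSE easy to check numerically. The sketch below uses invented data and a hypothetical function name `one_way_anova`; the p-value is left to SPSS or an F table.

```python
from statistics import mean


def one_way_anova(groups):
    """Return the sums of squares, mean squares and F ratio for k groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # Between-treatment variation: group means around the grand mean
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-treatment variation: observations around their own group mean
    sse = sum((y - mean(g)) ** 2 for g in groups for y in g)
    # Total variation: observations around the grand mean
    ssto = sum((y - grand_mean) ** 2 for g in groups for y in g)

    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return {"SSTr": sstr, "SSE": sse, "SSTo": ssto,
            "df_treatment": k - 1, "df_error": n - k,
            "MSTr": mstr, "MSE": mse, "F": mstr / mse}


# Invented observations for three treatments, for illustration only
table = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
for name, value in table.items():
    print(name, value)
```

A large F ratio indicates that the variation between the group means is large compared with the variation inside the groups, which is evidence against H0.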

Example 7.1

A family doctor claims that the mean HDL cholesterol levels of males in the age groups .... years old, .... years old and .... years old are equal. He obtains a simple random sample of 12 individuals from each group and determines their HDL cholesterol level. The results are presented in Table 7.1.

Table 7.1
.... years old | .... years old | .... years old

Approach: We must verify the requirements.
1. As stated in the problem, the data were collected using a random sampling method.
2. None of the subjects selected are related in any way, so the samples are independent.
3. The normality test suggests that the sample data come from populations that are normally distributed.
Because all requirements are satisfied, we can perform a one-way ANOVA.

Hypothesis: ........

Decision: ........
Conclusion: ........

Example 7.2

An experimenter wished to study the effect of 5 fertilizers on the yield of a crop. He divided the field into 45 plots and assigned each fertilizer at random to 9 plots. The data in table 4 represent the number of pods on soyabean plants for the various plot types.

Fertilizer | Pods
A |
B |
C |
D |
E |

Test at the 5% level to see whether the fertilizers differed significantly.

Part 01:
Hypothesis: ........
Decision: ........
Conclusion: ........

Part 02: Where are the differences?

After performing a one-factor independent measures ANOVA and finding that the results are significant, we know that the means are not all the same. This relatively simple conclusion, however, actually raises more questions. Is one particular mean different from another? Are all five means different? Post hoc tests provide answers to these questions whenever we have a significant ANOVA result. There are many different kinds of post hoc tests that examine which means are different from each other. One commonly used procedure is Tukey's Honestly Significant Difference Test.

SPSS Command
Analyze > Compare Means > One Way ANOVA
The variables are still selected, as earlier. Click on Post Hoc and select only Tukey.

........

The Nonparametric Kruskal-Wallis Test

SPSS Command
Analyze > Nonparametric > K Independent Samples
Ensure that Kruskal-Wallis H is checked. In the Define Groups option, enter the codes of the groups to be compared.

Example 7.3

To compare the effectiveness of three types of weight reduction diets, a homogeneous group of 22 women was divided into three sub-groups and each sub-group followed one of these diet plans for a period of two months. The weight reductions, in kg, were noted as given below.

Diet plan | Weight reduction
I |
II |
III |

Test whether the effectiveness of the three weight reducing diet plans is the same at the 5% level of significance.

Hypothesis: ........
Decision: ........
Conclusion: ........

Exercise:

Inference about Two Means: Dependent Samples

1. A dietitian hopes to reduce a person's cholesterol level by using a special diet supplemented with a combination of vitamin pills. 16 subjects were pre-tested and then placed on the diet for two weeks. Their cholesterol levels were checked after the two week period. The results are shown in table 01. Cholesterol levels are measured in milligrams per deciliter.

I. Test the claim that the cholesterol level before the special diet is greater than the cholesterol level after the special diet at the α = 0.05 level of significance.
II. Construct a 95% confidence interval for the difference in mean cholesterol level.

Table 01
Subject | Before | After

Step 01: Set up the null hypothesis and alternative hypothesis

Step 02: Compute the difference between the before and after cholesterol levels for each individual.

Step 03: Before proceeding with the test of hypothesis, we must verify that the difference data are normally distributed, because the sample size is small. We will construct a normal Q-Q plot and run a normality test to verify the assumption.

Hypothesis: ........
Decision: ........
Conclusion: ........

Step 04: Now we can proceed with the hypothesis test.

Decision: ........
Conclusion: ........

Part 02: 95% confidence interval

Interpret the result.

Nonparametric Wilcoxon Test for Two Related Samples

2. Suppose that you are interested in examining the effects of the transition from fetal to postnatal circulation among premature infants. For each of 14 healthy newborns, respiratory rate is measured at two different times: once when the infant is less than 15 days old, and again when he or she is more than 25 days old.

Subject | Respiratory Rate (breaths/minute): Time 1 | Time 2

I. At the α = 0.1 level of significance, test the null hypothesis that the median difference in respiratory rates for the two times is equal to 0.

Hypothesis: ........
Decision: ........
Conclusion: ........

II. Do you feel that it would have been appropriate to use the paired t test to evaluate these data? Why or why not?

Inference about Two Means: Independent Samples

Case 01: Data from two normal distributions with unequal variances, and both variances are unknown.

3. A physical therapist wanted to know whether the mean step pulse of men was less than the mean step pulse of women. She randomly selected 51 men and 70 women to participate in the study. Each subject was required to step up and down onto a six-inch platform for three minutes. The pulse of each subject (in beats per minute) was then recorded. Data: pulse.sav

State the null and alternative hypotheses:
Identify the p-value and state the researcher's conclusion if the level of significance was α = ....
What is the 95% confidence interval for the mean difference in pulse rates of men versus women? Interpret this interval.

Case 02: Data normal, both variances are unknown, but it is known that they are equal.

4. A researcher wanted to determine whether carpeted rooms contained more bacteria than uncarpeted rooms. To determine the amount of bacteria in a room, the researcher pumped the air from the room over a Petri dish at the rate of one cubic foot per minute for eight carpeted rooms and eight uncarpeted rooms. Colonies of bacteria were allowed to form in the 16 Petri dishes. The results are presented in the table below.

Test the claim that carpeted rooms have more bacteria than uncarpeted rooms at the α = 0.05 level of significance.

Carpeted Rooms (Bacteria/cubic foot) | Uncarpeted Rooms (Bacteria/cubic foot)

Hypothesis: ........
Decision: ........
Conclusion: ........

The Nonparametric Mann-Whitney U Test for Two Independent Samples

5. When a person is exposed to an infection, the person typically develops antibodies. The extent to which the antibodies respond can be measured by looking at a person's titer, which is a measure of the number of antibodies present. The higher the titer, the more antibodies are present. The data in table 02 represent the titers of 11 healthy people exposed to the tularemia virus in Vermont.

Table 02
Ill | Healthy

Test the claim that the level of titer in the ill group is greater than the level of titer in the healthy group, at the α = 0.1 level of significance.

Approach: ........
Hypothesis: ........

Decision: ........
Conclusion: ........

Mann-Whitney Using Qualitative Data

6. The Mann-Whitney test can be performed on qualitative data if the data can be ranked. For example, a letter grade received in a class is qualitative data that can be ranked: an A ranks higher than a B. Suppose a department chair wants to discover whether there was a difference in the grades of students learning a computer program based on the style of the teaching methods. The chair randomly selects 15 students from Professor A's class and 15 students from Professor B's class and obtains the data below. Test whether the grades awarded in each class are equivalent.

Professor A | Professor B
C D F C C B B A A C B B D C A B B C D A A C B D C B C F C B

Hypothesis: ........
Decision: ........
Conclusion: ........