Introduction to Statistics with GraphPad Prism (5.01) Version 1.1


Babraham Bioinformatics: Introduction to Statistics with GraphPad Prism (5.01) Version 1.1

Licence

This manual is © Anne Segonds-Pichon.

This manual is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0 licence. This means that you are free:
- to copy, distribute, display, and perform the work
- to make derivative works

Under the following conditions:
- Attribution. You must give the original author credit.
- Non-Commercial. You may not use this work for commercial purposes.
- Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.

Please note that:
- For any reuse or distribution, you must make clear to others the licence terms of this work.
- Any of these conditions can be waived if you get permission from the copyright holder.
- Nothing in this licence impairs or restricts the author's moral rights.

Full details of this licence can be found at

Table of contents

Introduction
Chapter 1: Basic structure of a GraphPad Prism project
Chapter 2: Qualitative data
  Example
  The χ² test
  The null hypothesis and the error types
Chapter 3: Quantitative data
  Descriptive stats
    The mean
    The median
    The variance
    The Standard Deviation (SD)
    Standard Deviation vs. Standard Error
    Confidence interval
  Assumptions of parametric data
  How can you check that your data are parametric/normal?
  Example
  Quantitative data representation
  The t-test
    Independent t-test
    Paired t-test
    Example
  Comparison of more than 2 means: Analysis of variance
    A bit of theory
    Example
  Correlation
    Example
    Correlation coefficient

Introduction

Prism is the officially supported graphical package at Babraham. It is a straightforward package with a friendly environment. There is a lot of easy-to-access documentation and the tutorials are very good.

Graphical representation of data is pivotal when one wants to present scientific results, in particular in publications. GraphPad allows you to build top-quality graphs, much better than Excel for example, and in a much more intuitive way.

In this manual, however, we are going to focus on the statistical menu of GraphPad. The data analysis approach is a bit friendlier than with SPSS (the statistical package officially supported by the institute): SPSS does not hold your hand all the way through your analysis, whereas GraphPad does. On the downside, you cannot run as many different analyses with GraphPad as you can with SPSS. If you need to run, say, a 3-way ANOVA, then you will need to use SPSS.

So the 2 packages work quite differently but, whichever one you choose, you need some basic statistical knowledge, if only to design your experiments correctly, so there is no way out of it!

And don't forget: you use stats to present your data in a comprehensible way and to make your point; this is just a tool, so don't hate it, use it!

"To consult a statistician after an experiment is finished is often merely to ask him to conduct a postmortem examination. He can perhaps say what the experiment died of." R.A. Fisher, 1938.

Chapter 1: Basic structure of a GraphPad Prism project

Click on the GraphPad Prism icon and the window below will appear.

[Screenshot: the GraphPad Prism start-up window, annotated with "You choose a table", "You choose a graph" and "The type of analysis you can run".]

Before you do anything with GraphPad, you need to have in mind the type of graph/analysis you want to do, as this will determine the type of table you are going to choose.

Then there are 2 scenarios. Either you enter your data directly into GraphPad, in which case, depending on the type of table you choose, you may need to know exactly how many data points you are going to deal with. Or your data are already in Excel, in which case it seems that you cannot import from its latest version, and even with the previous one it is not easy and it does not work on a Mac. So whenever possible, as the Prism Help suggests, transfer data from Excel using copy and paste.

As mentioned previously, unlike with other software, you need to choose a type of table before doing anything else, and that choice depends on the type of graph/analysis you want to do. Unlike in Excel, for instance, the worksheets do not all have the same structure. You can choose from 5 different types:
- XY table, in which each point is defined by both an X and a Y value, though for one X you can have several Y values, such as replicates, which will be used to calculate error bars. Replicates are in side-by-side subcolumns. This type of table allows you to run linear regression and correlation, and to calculate the area under the curve.
- Column table, in which each column defines a treatment group. From this type of table, you can run a t-test and a one-way ANOVA, or one of the non-parametric equivalent tests.
- Grouped table, in which you can have 2 grouping variables, which allows you to run 2-way ANOVAs.
- Contingency table, in which you can enter categorical data suitable for Fisher's exact test or Chi-square.
- Survival table, for survival analysis!

In this manual we will cover only XY, column and contingency tables.

Whatever the type of table you have chosen, each Project contains the same 5 folders:
- Data Tables, which holds the worksheets containing the data,
- Info, in which you can enter information about the technical aspects of the experiment, like the protocol or who the experimenter was,
- Results, which holds the outputs of the statistical analyses,
- Graphs, which holds the graphs! They are automatically generated from your data but you can make them pretty afterwards,
- Layouts, in which you can present your graphs and analyses.

Chapter 2: Qualitative data

Let's talk about the important stuff: your data. The first thing you need in order to do good stats is to know your data inside out. They are generally organised into variables, which can be divided into 2 categories: qualitative and quantitative.

Qualitative data are non-numerical data and the values taken are usually names, hence they are also called nominal data (e.g. variable sex: male or female). The values can be numbers without being numerical (e.g. an experiment number is a numerical label but not a unit of measurement). A qualitative variable with an intrinsic order in its categories is ordinal. Finally, there is the particular case of a qualitative variable with only 2 categories: it is then said to be binary or dichotomous (e.g. alive/dead or male/female).

We are going to use an example to go through the analysis and the plotting of categorical data.

Example (File: cats and dogs.xlsx)

A researcher is interested in whether animals could be trained to line dance. He takes some cats and dogs (animal) and tries to train them to dance by giving them either food or affection as a reward (training) for dance-like behaviour. At the end of the week a note is made of which animals could line dance and which could not (dance). All the variables are dummy variables (categorical).

The pivotal (!) question is: is there an effect of training on the ability of dogs and cats to learn to line dance?

It is quite intuitive that, after having run such an experiment, you are going to end up with contingency tables showing the number of animals that danced or not according to the type of training they received. Those contingency tables are presented below.

Cat (Count)
                        Type of training
Did they dance?      Food    Affection    Total
Yes                    26            6       32
No                      6           30       36
Total                  32           36       68

[Contingency table for the dogs: same layout as for the cats.]

The first thing to do is enter the data into GraphPad.
While with some software it is OK, or even easier, to prepare your data in Excel and then import them, it is not such a good idea with GraphPad because, as we said before, the structure of the worksheets varies with the type of graph you want to do. So, first, you need to open a New Project, which means that you have to choose among the different types of tables mentioned earlier. In our case we want to build a contingency table, so we choose Contingency and we click on OK. The next step is to name the columns and the rows and then enter the data.

When you want to insert another sheet you have 2 choices. If the second sheet has the same structure and variable names as the first one, you can right-click on the first sheet name (here Dog) and choose Duplicate Current Sheet; all you then have to do is change the values. If the second sheet has a different structure, you click on New>New data table in the Sheet Menu.

The first thing you want to do is look at a graphical representation of the data. GraphPad will have done it for you and if you go into Graphs you will see the results. You can change pretty much everything on a graph in GraphPad and it is very easy to make it look like this, for instance:

[Figure: two bar charts, Cat and Dog, showing Counts (y-axis) of Dance Yes and Dance No for each type of training, Food and Affection (x-axis).]

I will not go into much detail in this manual about all the graphical possibilities of GraphPad because that is not its purpose, but it is very intuitive and basically, once you have entered the data in the correct way, you are OK. After that, all you have to do is click on the bit you want to change and, usually, a window will pop up.

To analyse such data you need to use a Fisher's exact test or a χ² test. Both tests will give you the same-ish p-value for big samples, but for small samples the difference can be a bit more important and the p-value given by Fisher's exact test is more accurate. Having said that, the calculation of the Fisher's exact test is quite complex whereas the one for χ² is quite easy, so only the calculation of the latter is going to be presented here. Also, the Fisher's test is often only available for 2x2 tables, as in GraphPad for example, so in a way the χ² is more general.

For both tests, the idea is the same: how different are the observed data from what you would have expected to see by chance, i.e. if there were no association between the 2 variables? Or, looking at the table, you may also ask: knowing that 32 of the 68 cats did dance and that 36 of the 68 received affection, what is the probability that those 32 dancers would be so unevenly distributed between the 2 types of reward?

A bit of theory: the χ² test

It could be either:
- a one-way χ² test, which is basically a test that compares the observed frequency of a variable in a single group with what would be expected by chance,
- a two-way χ² test, the most widely used, in which the observed frequencies for two or more groups are compared with the frequencies expected by chance. In other words, in this case, the χ² tells you whether or not there is an association between 2 categorical variables.

An important thing to know about the χ², and about the Fisher's exact test for that matter, is that it does not tell you anything about causality; it simply tests for an association between 2 variables, and it is your knowledge of the biological system you are studying which will help you to interpret the result. Hence, you generally have an idea of which variable is acting on the other.

The χ² value is calculated using the formula below:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

The observed frequencies are the ones you measured, the values that are in your table. The expected ones are calculated this way:

Expected frequency = (row total) * (column total) / grand total

So, for the cats, for example, the expected frequency of cats that would line dance after having received food as a reward is based on:
- probability of line dancing: 32/68
- probability of receiving food: 32/68

If there were no association, the expected count would be 68 * (32/68) * (32/68), so the expected frequency is: (32 * 32)/68 = 15.1
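The calculation above can be sketched in a few lines of Python. This is only an illustration, not what GraphPad does internally. The observed counts are an assumption: they are inferred from the totals quoted in the text (32 of 68 cats danced, 32 received food) and from the two cells with an observed count of 6.

```python
# Observed counts for the cats (assumption: inferred from the totals in the text).
observed = {
    ("yes", "food"): 26, ("yes", "affection"): 6,
    ("no", "food"): 6,   ("no", "affection"): 30,
}
rows = ["yes", "no"]             # did they dance?
cols = ["food", "affection"]     # type of training

row_total = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_total = {c: sum(observed[(r, c)] for r in rows) for c in cols}
grand_total = sum(observed.values())

# Expected frequency = (row total * column total) / grand total
expected = {
    (r, c): row_total[r] * col_total[c] / grand_total
    for r in rows for c in cols
}

# Chi-square = sum over all cells of (observed - expected)^2 / expected
chi2 = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell]
    for cell in observed
)

for cell in observed:
    print(cell, "observed:", observed[cell], "expected:", round(expected[cell], 1))
print("chi-square =", round(chi2, 1))
```

Working with unrounded expected counts avoids the small rounding drift you get when you plug 15.1, 16.9 and 19.1 into the formula by hand.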

Intuitively, one can see that we are kind of averaging things here: we try to find out the values we should have got by chance. If you work out the values for all the cells, you get:

Did they dance? * Type of Training * Animal Crosstabulation

Animal: Cat
                                     Type of Training
Did they dance?                  Food as Reward   Affection as Reward   Total
Yes      Count                         26                  6             32
         Expected Count                15.1               16.9           32.0
No       Count                          6                 30             36
         Expected Count                16.9               19.1           36.0
Total    Count                         32                 36             68
         Expected Count                32.0               36.0           68.0

[Animal: Dog, same layout as for the cats.]

So for the cats, the χ² value is:

$$\chi^2 = \frac{(26-15.1)^2}{15.1} + \frac{(6-16.9)^2}{16.9} + \frac{(6-16.9)^2}{16.9} + \frac{(30-19.1)^2}{19.1} = 28.4$$

Let's do it with GraphPad. To calculate either of the tests, you click on =Analyze in the tool bar menu, then the window below will appear.

GraphPad will offer you by default the types of analysis which go with the type of data you have entered. So to the question "Which analysis for Contingency tables?", the answer is Chi-square and Fisher's exact test.

If you are happy with it, and after having checked that the data sets to be analysed are the ones you want, you can click on OK. The complete analysis will then appear in the Results section. Below are presented the results for the χ² and the Fisher's exact test.

Let's start with the χ². There is only one assumption that you have to be careful about when you run it: with 2x2 contingency tables you should not have cells with an expected count below 5, because if that is the case the test is likely not to be accurate (for larger tables, all expected counts should be greater than 1 and no more than 20% of expected counts should be less than 5). If you have a high proportion of cells with a small expected count, then you should use a Fisher's exact test. However, as I said before, many software packages, including GraphPad, only offer the calculation of the Fisher's exact test for 2x2 tables. So when you have more than 2 categories and a small sample, you are in trouble. You have 2 solutions to the problem: either you collect more data or you group the categories to boost the proportions.

If you remember the χ² formula, the calculation gives you an estimate of the difference between your data and what you would have obtained if there were no association between your variables. Clearly, the bigger the value of the χ², the bigger the difference between observed and expected frequencies and the more likely the difference is to be significant.

As you can see here, the p-values vary slightly between the 2 tests, though the conclusion remains the same: there is no evidence that the type of reward has any effect on the ability of dogs to line dance (p=0.9). Though the samples are not very big here, the assumptions for the χ² are met, so you can choose either test.

As for the cats, you are more than 99% confident when you say that cats are more likely to line dance when they receive food as a reward than when they receive affection.
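To make the comparison between the 2 tests more concrete, here is a minimal Python sketch that checks the expected-count assumption and computes both p-values for a 2x2 table. Several things here are assumptions rather than GraphPad's implementation: the cat counts are inferred from the totals quoted earlier, the χ² is computed without continuity correction (to match the hand calculation), and the two-sided Fisher p-value uses one common definition (summing the probabilities of all tables that are no more likely than the observed one).

```python
from math import comb, erfc, sqrt

# Rows: danced yes / no. Columns: food / affection.
# Assumption: cat counts inferred from the totals given in the text.
table = [[26, 6], [6, 30]]

def expected_counts(t):
    """Expected count for each cell: row total * column total / grand total."""
    row = [sum(r) for r in t]
    col = [sum(c) for c in zip(*t)]
    n = sum(row)
    return [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

def chi2_test_2x2(t):
    """Chi-square statistic (no continuity correction) and its p-value."""
    exp = expected_counts(t)
    chi2 = sum((t[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

def fisher_exact_two_sided(t):
    """Two-sided Fisher's exact p-value for a 2x2 table."""
    (a, b), (c, d) = t
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lowest, highest = max(0, c1 - r2), min(r1, c1)
    # Small tolerance so that ties with the observed table are included.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Assumption check for a 2x2 table: no expected count below 5.
small_cells = [e for row in expected_counts(table) for e in row if e < 5]
chi2, p_chi2 = chi2_test_2x2(table)
p_fisher = fisher_exact_two_sided(table)

print("cells with expected count < 5:", len(small_cells))
print("chi-square:", round(chi2, 2), "p =", p_chi2)
print("Fisher's exact test: p =", p_fisher)
```

For a table this far from what chance would give, both p-values are tiny; the 2 tests tend to disagree mostly when some expected counts are small.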

A bit of theory: the null hypothesis and the error types

The null hypothesis (H0) corresponds to the absence of effect (e.g. the animals rewarded by food are as likely to line dance as the ones rewarded by affection) and the aim of a statistical test is to reject or not to reject H0. Traditionally, a test or a difference is said to be significant if the probability of a type I error is α ≤ 0.05 (max α = 1). It means that the level of uncertainty usually accepted for a test is 5%. It also means that there is a probability of 5% that you may be wrong when you say that your 2 means are different, for instance, or you can say that when you see an effect you want to be at least 95% sure that something is significantly happening.

                        True state of H0
Statistical decision    H0 True                          H0 False
Reject H0               Type I error (False Positive)    Correct (True Positive)
Do not reject H0        Correct (True Negative)          Type II error (False Negative)

Tip: if your p-value is between 5% and 10% (0.05 and 0.10), I would not dismiss the result too fast if I were you. It is often worth putting it into perspective and asking yourself a few questions like:
- what does the literature say about what I am looking at?
- what if I had a bigger sample?
- have I run other tests on similar data and were they significant or not?

The interpretation of a borderline result can be difficult, as it could be important in the whole picture.

The specificity and the sensitivity of a test are closely related to Type I and Type II errors.

Specificity = Number of True Negatives / (Number of False Positives + Number of True Negatives)
A test with a high specificity has a low type I error rate.

Sensitivity = Number of True Positives / (Number of False Negatives + Number of True Positives)
A test with a high sensitivity has a low type II error rate.
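The link between the two formulas above and the error rates can be sketched as follows. The counts are entirely hypothetical and only serve to illustrate the arithmetic.

```python
def specificity(true_negatives, false_positives):
    """Proportion of true H0 cases that are correctly not rejected."""
    return true_negatives / (false_positives + true_negatives)

def sensitivity(true_positives, false_negatives):
    """Proportion of false H0 cases that are correctly rejected."""
    return true_positives / (false_negatives + true_positives)

# Hypothetical outcome of 200 tests: 100 where H0 is true, 100 where it is false.
tn, fp = 90, 10   # H0 true: 90 correctly not rejected, 10 false positives
tp, fn = 80, 20   # H0 false: 80 correctly rejected, 20 false negatives

spec = specificity(tn, fp)
sens = sensitivity(tp, fn)
type_1_rate = 1 - spec   # false positive rate
type_2_rate = 1 - sens   # false negative rate

print("specificity:", spec, "type I error rate:", round(type_1_rate, 2))
print("sensitivity:", sens, "type II error rate:", round(type_2_rate, 2))
```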

Chapter 3: Quantitative data

When it comes to quantitative data, more tests are available but assumptions must be met before applying them. There are 2 types of statistical tests: parametric and non-parametric. Parametric tests have 4 assumptions that must be met for the test to be accurate. Non-parametric tests are designed to be used with nominal or ordinal data (e.g. χ² test) and they make few or no assumptions about population parameters like normality (e.g. Mann-Whitney test).

3-1 A bit of theory: descriptive stats

The mean (or average)

µ = average of all values in a column

It can be considered as a model because it summarises the data.
- Example: number of friends of each member of a group of 5 lecturers: 1, 2, 3, 3 and 4
  Mean: (1 + 2 + 3 + 3 + 4)/5 = 2.6 friends per lecturer: clearly a hypothetical value!

But if the values were 0, 0, 1, 5 and 7, the mean would also be 2.6 but clearly it would not give an accurate picture of the data. So, how can you know that it is an accurate model? You look at the difference between the real data and your model. To do so, you calculate the difference between each real value and the model, and you sum these differences to get the total error (or sum of differences).

$\sum (x_i - \mu) = (-1.6) + (-0.6) + (0.4) + (0.4) + (1.4) = 0$

And you get no errors! Of course: positive and negative differences cancel each other out. So to avoid the problem of the direction of the error, you can square the differences and, instead of the sum of errors, you get the Sum of Squared errors (SS).
- In our example: $SS = (-1.6)^2 + (-0.6)^2 + (0.4)^2 + (0.4)^2 + (1.4)^2 = 5.20$

The median

The median is the value exactly in the middle of an ordered set of numbers.
- Example 1: , Median = 68
- Example 2: , Median = 60

The variance

The SS gives a good measure of the accuracy of the model but it is dependent upon the amount of data: the more data, the higher the SS. The solution is to divide the SS by the number of observations (N). As we are interested in measuring the error in the sample to estimate the one in the population, we divide the SS by N-1 instead of N and we get the variance: $S^2 = SS/(N-1)$
- In our example: Variance $S^2$ = 5.20 / 4 = 1.3
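The steps above (mean, sum of differences, sum of squares, variance) can be checked with a short Python sketch. It uses the 2 sets of values from the text; the variable names are just illustrative.

```python
from statistics import mean, variance

def describe(values):
    """Return the mean, the sum of differences, the SS and the variance."""
    m = mean(values)
    deviations = [x - m for x in values]
    total_error = sum(deviations)            # always (almost) zero
    ss = sum(d ** 2 for d in deviations)     # sum of squared errors
    var = ss / (len(values) - 1)             # divide by N-1, not N
    return m, total_error, ss, var

friends = [1, 2, 3, 3, 4]
m, total_error, ss, var = describe(friends)
print("mean:", m, "SS:", round(ss, 2), "variance:", round(var, 2))

# Same mean, but the values are much more spread out around it.
other = [0, 0, 1, 5, 7]
m2, _, ss2, var2 = describe(other)
print("mean:", m2, "SS:", round(ss2, 2), "variance:", round(var2, 2))

# The standard library uses the same N-1 definition.
assert abs(var - variance(friends)) < 1e-12
```

The second data set has the same mean but a much larger SS and variance, which is exactly why the mean alone is not enough to describe the data.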

Why N-1 instead of N?

If we take a sample of 4 scores in a population, they are free to vary, but if we use this sample to calculate the variance, we have to use the mean of the sample as an estimate of the mean of the population. To do that we have to hold one parameter constant.
- Example: the mean of the sample is 10. We assume that the mean of the population from which the sample has been collected is also 10. If we want to calculate the variance, we must keep this value constant, which means that the 4 scores cannot vary freely.
- If the values are 9, 8, 11 and 12 (mean = 10) and if we change 3 of these values to 7, 15 and 8, then the final value must be 10 to keep the mean constant.
- If we hold 1 parameter constant, we have to use N-1 instead of N.
- This is the idea behind the degrees of freedom: one less than the sample size.

The Standard Deviation (SD)

The problem with the variance is that it is measured in squared units, which is not very nice to manipulate. So for more convenience, the square root of the variance is taken to obtain a measure in the same unit as the original measure: the standard deviation.
- $S.D. = \sqrt{SS/(N-1)} = \sqrt{S^2}$; in our example: $S.D. = \sqrt{1.3} = 1.14$
- So you would present your mean as follows: µ = 2.6 +/- 1.14 friends

The standard deviation is a measure of how well the mean represents the data, or how much your data are scattered around the mean:
- small S.D.: data close to the mean: the mean is a good fit of the data (graph on the left)
- large S.D.: data distant from the mean: the mean is not an accurate representation (graph on the right)

Standard Deviation vs. Standard Error

Many scientists are confused about the difference between the standard deviation (S.D.) and the standard error of the mean ($S.E.M. = S.D. / \sqrt{N}$).
- The S.D. (graph on the left) quantifies the scatter of the data, and increasing the size of the sample does not increase the scatter (above a certain threshold).
- The S.E.M. (graph on the right) quantifies how accurately you know the true population mean; it is a measure of how much you expect sample means to vary. So the S.E.M. gets smaller as your samples get larger: the mean of a large sample is likely to be closer to the true mean than is the mean of a small sample.
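As a quick illustration of the difference between the 2 measures, the sketch below computes the S.D. and the S.E.M. for the lecturers' example, then shows what the S.E.M. would become if the same scatter had been observed in a larger sample (the sample size of 50 is purely hypothetical).

```python
from math import sqrt
from statistics import stdev

friends = [1, 2, 3, 3, 4]
n = len(friends)

sd = stdev(friends)       # square root of the variance (N-1 in the denominator)
sem = sd / sqrt(n)        # standard error of the mean

# Hypothetical: same scatter (same S.D.) but 50 lecturers instead of 5.
sem_if_n_were_50 = sd / sqrt(50)

print("S.D.  =", round(sd, 2))
print("S.E.M. with n = 5  :", round(sem, 2))
print("S.E.M. with n = 50 :", round(sem_if_n_were_50, 2))
```

The S.D. describes the data and stays roughly the same whatever the sample size, whereas the S.E.M. shrinks as the sample grows.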

A big S.E.M. means that there is a lot of variability between the means of different samples and that your sample might not be representative of the population. A small S.E.M. means that most sample means are similar to the population mean and so your sample is likely to be an accurate representation of the population.

Which one to choose?
- If the scatter is caused by biological variability, it is important to show the variation. So it is more appropriate to report the S.D. rather than the S.E.M. Even better, you can show all the data points in a graph, or perhaps report the largest and smallest values.
- If you are using an in vitro system with no biological variability, the scatter can only result from experimental imprecision (no biological meaning). It is more sensible then to report the S.E.M., since the S.D. is less useful here. The S.E.M. gives your readers a sense of how well you have determined the mean.

Choosing between SD and SEM also depends on what you want to show. If you just want to present your data for a descriptive purpose, then you go for the SD or the SEM. If you want the reader to be able to infer an idea of significance, then you should go for the SEM or the Confidence Interval (see below). We will go into a bit more detail later.

Confidence interval

The confidence interval quantifies the uncertainty in measurement. The mean you calculate from your sample of data points depends on which values you happened to sample. Therefore, the mean you calculate is unlikely to equal the true population mean exactly. The size of the likely discrepancy depends on the variability of the values (expressed as the S.D. or the S.E.M.) and the sample size. If you combine those together, you can calculate a 95% confidence interval (95% CI), which is a range of values. If the population is normal (or nearly so), you can be 95% sure that this interval contains the true population mean.

In a normal distribution, 95% of values lie within +/- 1.96 standard deviations of the mean; applied to the distribution of sample means, this gives mean +/- 1.96*SE.
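Here is a minimal sketch of that calculation, using the +/- 1.96*SE rule quoted above on the lecturers' example. It is only an illustration: with a sample as small as 5, a multiplier taken from the t distribution (which is larger than 1.96) would normally be used, so treat the 1.96 here as an assumption of the sketch.

```python
from math import sqrt
from statistics import mean, stdev

def ci95_normal(values):
    """95% CI of the mean using the normal approximation (mean +/- 1.96*SE)."""
    m = mean(values)
    sem = stdev(values) / sqrt(len(values))
    margin = 1.96 * sem
    return m - margin, m + margin

friends = [1, 2, 3, 3, 4]
low, high = ci95_normal(friends)
print("mean:", mean(friends))
print("95% CI: from", round(low, 2), "to", round(high, 2))
```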

One other way to look at error bars:

Error bars                                  Type          Description
Standard deviation (SD)                     Descriptive   Typical or average difference between the data points and their mean.
Standard error (SEM)                        Inferential   A measure of how variable the mean will be, if you repeat the whole study many times.
Confidence interval (CI), usually 95% CI    Inferential   A range of values you can be 95% confident contains the true mean.

From Geoff Cumming et al.

If you want to compare experimental results, it could be more appropriate to show inferential error bars such as SE or CI rather than SD. However, if n is very small (for example n=3), rather than showing error bars and statistics, it is better to simply plot the individual data points.

You can estimate statistical significance using the overlap rule for SE bars. In the same way, you can estimate statistical significance using the overlap rule for 95% CI bars.

3-2 A bit of theory: Assumptions of parametric data

When you are dealing with quantitative data, the first thing you should look at is how they are distributed, what they look like. The distribution of your data will tell you if there is something wrong in the way you collected them or entered them, and it will also tell you what kind of test you can apply to make them say something.

T-test, analysis of variance and correlation tests belong to the family of parametric tests, and to be able to use them your data must comply with 4 assumptions.

1) The data have to be normally distributed (normal shape, bell shape, Gaussian shape).

Example of normally distributed data:

There are 2 main types of departure from normality:
- Skewness: lack of symmetry of a distribution
- Kurtosis: measure of the degree of peakedness in the distribution

The two distributions below have the same variance and approximately the same skew, but differ markedly in kurtosis.

2) Homogeneity of variance: the variance should not change systematically throughout the data.

3) Interval data: the distance between points of the scale should be equal at all parts along the scale.

4) Independence: data from different subjects are independent, so that values corresponding to one subject do not influence the values corresponding to another subject. There are specific designs for repeated measures experiments.

How can you check that your data are parametric/normal?

GraphPad can test the normality of the distribution of your sample(s). To do so, you go: =Analyze>Column Analyses>Column statistics. You are given the choice between 3 tests for normality: D'Agostino and Pearson, Kolmogorov-Smirnov and Shapiro-Wilk. These tests require n>=7, and the D'Agostino and Pearson test is the one to go for. As GraphPad puts it: "It first computes the skewness and kurtosis to quantify how far from Gaussian the distribution is in terms of asymmetry and shape. It then calculates how far each of these values differs from the value expected with a Gaussian distribution, and computes a single p-value from the sum of these discrepancies." The Kolmogorov-Smirnov test is not recommended, and the Shapiro-Wilk test is only accurate when no two values are the same.

Let's try it through an example.

Example (File: coyote.xlsx)

In this case, the normality test tells us that our data are normally distributed. Actually, the test does not tell you that your data are normally distributed; it tells you that they are not significantly different from normality (p= and p=0.7757).
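To give a feel for the 2 quantities that the D'Agostino and Pearson test starts from, the sketch below computes the skewness and the excess kurtosis of a sample. It does not implement the test itself (no p-value is computed), the moment-based formulas are one common definition among several, and the body lengths are hypothetical values, not the coyote data.

```python
def skewness_and_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both 0 for a Gaussian)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

# Hypothetical body lengths (cm); n = 7, the minimum required by the tests.
symmetric = [88, 90, 91, 92, 93, 94, 96]
right_skewed = [88, 89, 90, 90, 91, 92, 104]

skew_sym, kurt_sym = skewness_and_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_kurtosis(right_skewed)

print("symmetric sample   : skewness =", round(skew_sym, 2))
print("right-skewed sample: skewness =", round(skew_right, 2))
```

A single large value is enough to push the skewness well above 0, which is one reason why plotting the data is a useful complement to the test.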

However, the best way to get a really good idea of what is going on is to plot your data. When it comes to normality, there are 2 ways to plot your data: the histogram and the box plot. We are going to do both with GraphPad.

Let's start with the histogram. To draw such a graph with GraphPad, you first need to calculate the frequency distribution. To do so, you go: =Analyze>Column Analyses>Frequency distribution. GraphPad will automatically draw a histogram from the frequency distribution. The slightly delicate thing here is to determine the size of the bin: too small, and the distribution may look anything but normal; too big, and you will not see a thing. The best way is to try 2 or 3 bin sizes and see how it goes.

Something else to be careful about: by default GraphPad will plot the counts (in Tabulate>Number of Data Points). It is OK when you plot just one group or one data set, but when you want to plot several (or just 2, like here) and the groups are not of the same size, then you should plot percentages (in Tabulate>Relative frequencies as percent) if you want to be able to compare them graphically.

[Figure: three histograms of Coyote: Freq. dist., Female and Male, drawn with different bin sizes. Y-axis: Percentage; x-axis: Bin Center.]

As you can see, depending on the choice of the bin size, the histograms look quite different. And even though they don't look too normal, the data still passed the test. That is why I don't like histograms that much, especially with data sets that are not very big.
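The idea of a frequency distribution expressed as percentages, and the effect of the bin size, can be sketched like this. The function and the body lengths are hypothetical illustrations; the way GraphPad chooses where the first bin starts may differ.

```python
def frequency_distribution(values, bin_width, start):
    """Relative frequencies (as percent) keyed by bin center."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)         # which bin the value falls in
        center = start + (index + 0.5) * bin_width    # middle of that bin
        counts[center] = counts.get(center, 0) + 1
    total = len(values)
    return {center: 100 * k / total for center, k in sorted(counts.items())}

# Hypothetical body lengths (cm) for 2 groups of different sizes.
females = [85, 88, 90, 91, 92, 93, 95, 97]
males = [86, 90, 92, 94, 95, 99]

wide_bins = frequency_distribution(females, bin_width=5, start=85)
narrow_bins = frequency_distribution(females, bin_width=2, start=85)
male_bins = frequency_distribution(males, bin_width=5, start=85)

print("females, bin width 5:", wide_bins)
print("females, bin width 2:", narrow_bins)
print("males,   bin width 5:", male_bins)
```

Because each group is expressed as a percentage of its own size, the 2 distributions can be compared even though one group has 8 values and the other 6.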

My preference goes to the box plot, as it tells you in one go everything you need to know and you don't need to play with the bin size! To draw a box plot, you choose it from the gallery of graphs in Column and you choose Tukey for Whiskers. Tukey was the guy who invented the box plot, and this particular representation allows you to identify outliers (which we will talk about later).

It is very important that you know how a box plot is built. It is rather simple and it will allow you to get a pretty good idea about the distribution of your data at a glance. Below you can see the relationship between box plot and histogram. If your distribution is normal-ish, then the box plot should be symmetrical.

Regarding the outliers, there is no really right or wrong attitude. If there is a technical issue or an experimental problem, you should remove the outlier of course, but if there is nothing obvious, it is up to you. I would always recommend keeping outliers if you can; you can run the analysis with and without them, for instance, and see what effect that has on the p-value. If the outcome is still consistent with your hypothesis, then you should keep them. If not, then it is between you and your conscience!
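As a rough sketch of how a Tukey box plot is built: the box spans the first and third quartiles, and any point further than 1.5 times the interquartile range from the box is flagged as an outlier. The data are hypothetical, and the quartile method used here (`method="inclusive"` from Python's `statistics.quantiles`) is an assumption; GraphPad may compute quartiles slightly differently.

```python
from statistics import quantiles

def tukey_summary(values):
    """Quartiles, Tukey fences, whisker ends and outliers for a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whiskers": (min(inside), max(inside)),   # most extreme non-outlier points
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical body lengths (cm), with one suspiciously large value.
lengths = [88, 90, 91, 92, 93, 94, 96, 110]
summary = tukey_summary(lengths)
print(summary)
```

Running the analysis with and without the flagged value, as suggested above, is then just a matter of filtering the list.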

Finally, you can check the second assumption (homogeneity of variances). In GraphPad the second assumption is tested by default: when you ask for a t-test, GraphPad will calculate an F test to tell you whether the variances were different or not.

Don't be too quick to switch to using the nonparametric Kruskal-Wallis ANOVA (or the Mann-Whitney test when comparing two groups). While nonparametric tests do not assume Gaussian distributions, the Kruskal-Wallis and Mann-Whitney tests do assume that the shape of the data distribution is the same in each group. So if your groups have very different standard deviations and so are not appropriate for one-way ANOVA, they should not be analyzed by the Kruskal-Wallis or Mann-Whitney tests either. However, ANOVA and t-tests are rather robust, especially when the samples are not too small, so you can get away with small departures from normality and small differences in variances. Often the best approach is to transform the data: transforming to logarithms or reciprocals often does the trick, restoring equal variance.

Going back to the box plots, the symmetry tells you about the distribution of the data and if both boxes (like in our case) are of the same size-ish, then you know that the variances are about the same.

Quantitative data representation

Let's go back to our coyotes. What you want from your graph is to see if there is a difference between males and females and, possibly, to have an idea of the significance of the difference. The best way to do it is to plot the error bars as confidence intervals (CI).

[Figure: mean Length (cm) of Male and Female coyotes with 95% CI error bars.]

There is about 40% of overlap between the error bars. Significance can still be reached with up to 50% of overlap, depending on sample size and variability. This is a very informative graph, as you can spot the 2 means together with the confidence intervals.
We saw before that the 95% CI of the mean gives you the boundaries between which you are 95% sure to find the true population mean. When you want to compare 2 or more groups visually, it is always better to use the CI than the SD or, to some extent, the SEM. It gives you a good idea of the dispersion of your sample and, as we saw before, it easily allows you to have an idea, before doing any stats, of the likelihood of a significant difference between your groups. Since your true group means have a 95% chance of lying within their respective CI, such a big overlap between the CIs tells you that the difference is probably not significant.

In our particular example, from the graph we can say that the average body length of female coyotes, for instance, is a little bit more than 92 cm and that 95 out of 100 samples from the same population would have means between about 90 and 94 cm. We can also say that, despite the fact that the females appear smaller than the males, this difference is probably not significant, as the error bars overlap a lot.
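Coming back to the homogeneity of variance check mentioned earlier, the sketch below shows the ratio of 2 variances (the quantity an F test is built on) before and after a log transformation. The data are hypothetical and deliberately extreme, chosen so that the spread grows with the mean, and no p-value is computed here.

```python
from math import log10
from statistics import variance

def variance_ratio(a, b):
    """Larger variance divided by smaller variance (1 means equal variances)."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)

# Hypothetical measurements: group B is group A multiplied by 10,
# so its standard deviation is 10 times larger.
group_a = [1, 2, 4, 8]
group_b = [10, 20, 40, 80]

ratio_raw = variance_ratio(group_a, group_b)
ratio_log = variance_ratio([log10(x) for x in group_a],
                           [log10(x) for x in group_b])

print("variance ratio on raw data       :", round(ratio_raw, 2))
print("variance ratio after log10       :", round(ratio_log, 2))
```

On the log scale a multiplicative difference becomes a simple shift, so the 2 groups end up with the same variance; real data will rarely behave this neatly, but the principle is the same.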

To check that, we are going to run a t-test.

3-3 A bit of theory: the t-test

The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups.

The figure above shows the distributions for the treated (blue) and control (green) groups in a study. Actually, the figure shows the idealized distribution. The figure indicates where the control and treatment group means are located. The question the t-test addresses is whether the means are statistically different.

What does it mean to say that the averages for two groups are statistically different? Consider the three situations shown in the figure below. The first thing to notice about the three situations is that the difference between the means is the same in all three. But you should also notice that the three situations don't look the same: they tell very different stories. The top example shows a case with moderate variability of scores within each group. The second situation shows the high-variability case. The third shows the case with low variability. Clearly, we would conclude that the two groups appear most different or distinct in the bottom, low-variability case. Why? Because there is relatively little overlap between the two bell-shaped curves. In the high-variability case, the group difference appears least striking because the two bell-shaped distributions overlap so much.

This leads us to a very important conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The t-test does just this.

The formula for the t-test is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. Figure 3 shows the formula for the t-test and how the numerator and denominator are related to the distributions. The t-value will be positive if the first mean is larger than the second and negative if it is smaller.

To run a t-test in GraphPad, you go to Analysis > Column analyses > t-tests and then you have to choose between 2 types of t-tests: unpaired and paired. The choice between the 2 is very intuitive. If you measure a variable in 2 different populations, you choose the independent (unpaired) t-test as the 2 populations are independent from each other. If you measure a variable 2 times in the same population, you go for the paired t-test.

So say you want to compare the weights of 2 breeds of sheep. To do so, you take a sample of each breed (the 2 samples have to be comparable) and you weigh each animal. You then run an independent-samples t-test on your data to find out if the difference is significant.

You may also want to compare 2 types of sheep food (A and B): to do so, you define 2 samples of sheep comparable in every other way and you weigh them at day 1 and, say, at day 30. This time you apply a paired-samples t-test as you are interested in each individual difference in weight between day 1 and day 30.

Independent t-test

Let's go back to our coyotes. You go to Analysis > Column analyses > t-tests. The default setting here is good as you want to run an unpaired t-test.
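The ratio described above (difference between the means over a measure of variability) can be sketched in a few lines. This is only an illustration of the classic pooled-variance unpaired t statistic, assuming equal variances; the function name and the data are hypothetical, and it does not compute the p-value that GraphPad reports.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(a, b):
    """Return (t, df) for an unpaired t-test with pooled variance."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    # Standard error of the difference between the two means
    se_diff = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se_diff, df


# Same difference between the means (2 units), different variability
low_var_a = [9.8, 10.0, 10.2, 9.9, 10.1]
low_var_b = [11.8, 12.0, 12.2, 11.9, 12.1]
high_var_a = [6.0, 10.0, 14.0, 8.0, 12.0]
high_var_b = [8.0, 12.0, 16.0, 10.0, 14.0]

t_low, df_low = unpaired_t(low_var_a, low_var_b)
t_high, df_high = unpaired_t(high_var_a, high_var_b)
print(f"Low variability:  t = {t_low:.2f} (df = {df_low})")
print(f"High variability: t = {t_high:.2f} (df = {df_high})")
```

With the same difference between the means, the low-variability pair gives a much larger t in absolute value, which is the point made by the three bell-curve situations. The sign is negative here simply because the first mean is the smaller one.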

Though the males are bigger than the females, the difference between the 2 genders does not reach significance (p=0.1045). The variances of the 2 groups are not significantly different (p=0.8870), hence the second assumption for a parametric test is met.

Paired t-test

Now let's try a paired t-test. As we mentioned before, the idea behind the paired t-test is to look at a difference between 2 paired individuals or 2 measures for the same individual. For the test to be significant, the mean difference must be different from 0.

Example (File: height husband wife.xlsx)

[Figure: height (cm) of husbands and wives, means with error bars]

From the graph above, we can conclude that although husbands are taller than wives, this difference does not seem significant. Before running the paired t-test to get a p-value, we are going to check that the assumptions for parametric stats are met. The box plots below seem to indicate that there is no significant departure from normality, and this is confirmed by the D'Agostino & Pearson test.

[Figure: box plots of husband and wife height]

Husbands are significantly taller than wives (p<0.0001). The confidence interval of the mean difference does not include 0, hence the significance.

The paired t-test turns out to be highly significant (see Table above). So, how come the graph and the test tell us different things? The problem is that we don't really want to compare the mean size of the wives to the mean size of the husbands; we want to look at the difference pair-wise. In other words, we want to know if, on average, a given wife is taller or smaller than her husband. So we are interested in the mean difference between husband and wife.

Unfortunately, one of the down sides of GraphPad is that you cannot manipulate the data: for instance, there is no equivalent of Excel's Function, thanks to which one can apply formulas to join several values. In our case, we want to calculate and plot the difference in size between a husband and his wife. So, no choice, we have to do it in Excel and then copy and paste the result back into GraphPad after having created a new data table.

The graph representing the difference is displayed below, and one can see that the confidence interval does not include 0, meaning that the difference is likely to be significantly different from 0, which we already know from the paired t-test.
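The pair-wise logic can be sketched as follows: compute one difference per couple, then test whether the mean of those differences is different from 0. The heights below are made up for illustration (they are not the values in the data file) and the function name is hypothetical; only the t statistic is computed, not the p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, df, differences) for a paired t-test.

    This is a one-sample t-test on the pair-wise differences,
    testing whether their mean differs from 0.
    """
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1, diffs


# Hypothetical heights in cm, one husband and one wife per couple
husbands = [180.0, 175.0, 170.0, 185.0]
wives = [170.0, 168.0, 160.0, 178.0]

t, df, diffs = paired_t(husbands, wives)
print("Differences:", diffs)
print(f"Mean difference = {mean(diffs):.1f} cm, t = {t:.2f}, df = {df}")
```

The two groups overlap a lot when plotted side by side, yet every single difference is positive, which is why the paired analysis can be significant when the group-level graph suggests otherwise.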

[Figure: husband-wife height difference with its confidence interval]

Now try to run a One Sample t-test, which you will find under Column Analysis > Column Statistics. You get the same values as for the paired t-test.

You will have noticed that GraphPad does not run a test for the equality of variances in the paired t-test; this is because it is actually looking at only one sample: the difference between the husbands and the wives.

3-4 Comparison of more than 2 means: Analysis of variance

A bit of theory

When we want to compare more than 2 means (i.e. more than 2 groups), we cannot run several t-tests because it increases the familywise error rate, which is the error rate across tests conducted on the same experimental data.

Example: if you want to compare 3 groups (1, 2 and 3) and you carry out 3 t-tests (groups 1-2, 1-3 and 2-3), each with an arbitrary 5% level of significance, the probability of not making a type I error in any one test is 95% (= 1 - 0.05). Treating the 3 tests as independent, you can multiply the probabilities, so the overall probability of no type I errors is: 0.95 * 0.95 * 0.95 = 0.857. This means that the probability of making at least one type I error (saying that there is a difference whereas there is not) is 1 - 0.857 = 0.143 or 14.3%. So the probability has increased from 5% to 14.3%. If you compare 5 groups instead of 3, there are 10 pairwise comparisons and the familywise error rate is 40% ($= 1 - 0.95^n$, with $n = 10$).

To overcome the problem of multiple comparisons, you need to run an analysis of variance (ANOVA), which is an extension of the 2-group comparison of a t-test but with a slightly different logic. If you want to compare 5 means, for example, you can compare each mean with another, which gives you 10 possible 2-group comparisons, which is quite complicated! So the logic of the t-test cannot be directly transferred to the analysis of variance. Instead the ANOVA compares variances: if the variance amongst the 5 means is greater than the random error variance (due to individual variability for instance), then the means must be more spread out than we would expect by chance.

The statistic for ANOVA is the F ratio:

$$F = \frac{\text{variance among sample means}}{\text{variance within samples (random, individual variability)}}$$

also:

$$F = \frac{\text{variation explained by the model (systematic)}}{\text{variation explained by unsystematic factors}}$$

If the variance amongst sample means is greater than the error variance, then F>1. In an ANOVA, you test whether F is significantly higher than 1 or not.

Imagine you have a dataset of 78 data points, and you make the hypothesis that these points in fact belong to 5 different groups (this is your hypothetical model). So you arrange your data into 5 groups and you run an ANOVA. You get the table below.

| Source of variation | Sum of Squares | df | Mean Square | F | p-value |
| Between Groups | 2.665 | 4 | 0.6663 | 8.423 | <0.0001 |
| Within Groups | 5.775 | 73 | 0.0791 | | |
| Total | 8.440 | 77 | | | |

Typical example of an analysis of variance table

Let's go through the figures in the table. First the bottom row of the table:

$$\text{Total sum of squares} = \sum (x_i - \text{Grand mean})^2$$

In our case, Total SS = 8.440. If you were to plot your data to represent the total SS, you would produce the graph below. So the total SS is the sum of the squared differences between each data point and the grand mean.

This is a quantification of the overall variability in your data. The next step is to partition this variability: how much variability is there between groups (explained by the model) and how much within groups (random/individual variability)?

According to your hypothesis, your data can be split into 5 groups because, for instance, the data come from 5 cell types, like in the graph below. So you work out the mean for each cell type and you work out the squared differences between each of the means and the grand mean, weighted by group size:

$$\text{Between groups SS} = \sum n_i (\text{Mean}_i - \text{Grand mean})^2$$

In our example (second row of the table): Between groups SS = 2.665 and, since we have 5 groups, there are 5 - 1 = 4 df, so the mean SS = 2.665/4 = 0.6663. If you remember the formula of the variance (= SS / (N-1), with df = N-1), you can see that this value quantifies the variability between the group means: it is the between-group variance.

[Figure: two panels, between-group variability and within-group variability]

There is one row left in the table, the within-groups variability. It is the variability within each of the five groups, so it corresponds to the difference between each data point and its respective group mean:

$$\text{Within groups SS} = \sum (x_i - \text{Mean}_i)^2$$

which in our case is equal to 5.775. This value can also be obtained by doing 8.440 - 2.665 = 5.775, which is logical since it is the amount of variability left from the total variability after the variability explained by your model has been removed.

In our example, each group contributes (n - 1) df, so with 78 data points in 5 groups df = 78 - 5 = 73. So the mean within-groups SS = 5.775/73 = 0.0791. This quantifies the remaining variability, the one not explained by the model: the individual variability between each value and the mean of the group to which it belongs according to your hypothesis.

At this point, you can see that the amount of variability explained by your model (0.6663) is far higher than the remaining one (0.0791).
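The partition of the total sum of squares described above can be sketched as follows. The three small groups are made-up numbers chosen so that the arithmetic is easy to follow by hand (they are not the protein expression data), and the function name is hypothetical. The sketch stops at the F ratio and does not compute the p-value.

```python
from statistics import mean


def one_way_anova(groups):
    """Partition the total sum of squares and return the ANOVA table values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    # Total SS: each data point against the grand mean
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Between-groups SS: each group mean against the grand mean, weighted by n_i
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups SS: each data point against its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_total": ss_total,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical data: 3 groups of 3 values each
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in table.items():
    print(f"{name:>10}: {value:.3f}")
```

Note that the between and within sums of squares add up to the total, which is why the within-groups value can be obtained by subtraction.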

So, you can work out the F ratio: F = 0.6663 / 0.0791 = 8.423.

The level of significance of the test is calculated by taking into account the F ratio and the number of df (degrees of freedom) for the numerator and the denominator. In our example, p<0.0001, so the test is highly significant and you are more than 99% confident when you say that there is a difference between the group means.

Let's do it in more detail. We want to find out if there is a significant difference in terms of protein expression between 5 cell types.

Example (File: protein expression.xlsx):

[Figure: box plots of protein expression for cell groups A, B, C, D and E]

First we need to see whether the data meet the assumptions for a parametric approach. Well, it does not look good: 2 out of 5 groups (C and D) show a significant departure from normality (see Table below). As for the homogeneity of variance, even before testing it, a look at the box plots (see Graph above) tells us that there is no way the second assumption is met. The data from groups C and D are quite skewed, and a look at the raw data shows more than a 10-fold jump between values of the same group (e.g. in group A, value line 4 is 0.17 and value line 10 is 2.09).

A good idea would be to log-transform the data so that the spread is more balanced, and to check the assumptions again. To do so, you go to Analyse > Transform > Transform and you choose Y=Log(Y); you then re-run the analysis.
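Why a log transformation helps can be seen on a toy example. The sketch below assumes a base-10 logarithm for Y=Log(Y) and uses two made-up groups in which the spread grows with the mean, a common pattern with expression data; the numbers are not taken from the data file.

```python
from math import log10
from statistics import stdev

# Hypothetical groups: same relative spread, very different absolute spread
group_low = [1.0, 2.0, 4.0]
group_high = [10.0, 20.0, 40.0]

# Base-10 log transform of every value
log_low = [log10(y) for y in group_low]
log_high = [log10(y) for y in group_high]

print(f"Raw SDs: {stdev(group_low):.3f} vs {stdev(group_high):.3f}")
print(f"Log SDs: {stdev(log_low):.3f} vs {stdev(log_high):.3f}")
```

On the raw scale the standard deviations differ 10-fold; after the transformation they are equal, because multiplying all values by 10 only shifts the logs by 1. Real data will rarely behave this neatly, so the assumptions still need to be checked again after transforming.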

OK, the situation is getting better: the first assumption is met and, from what we see when we plot the transformed data (box plots and scatter plots below), the homogeneity of variance has improved a great deal.

[Figure: box plots of log-transformed protein expression for cell groups A, B, C, D and E]

[Figure: scatter plots of log-transformed protein expression for cell groups A, B, C, D and E]

Now that we have sorted out the data, we can run the ANOVA: to do so, you go to Analyze > One-way ANOVA. The next thing you need to do is to choose a post-hoc test. These post-hoc tests should only be used when the ANOVA finds a significant effect. GraphPad is not very powerful when it comes to post-hoc tests as it offers only 2 tests: the Bonferroni test, which is quite conservative so you should only choose it when you are comparing no more than 5 groups, and the Tukey test, which is more liberal.

[Results table, annotated: overall difference between the groups; homogeneity of variance]
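The reason post-hoc tests are stricter than plain t-tests goes back to the familywise error rate. As a rough sketch, assuming the comparisons are independent and using hypothetical function names, the error rate and the Bonferroni-adjusted threshold can be computed like this:

```python
from math import comb


def familywise_error_rate(n_comparisons, alpha=0.05):
    """Probability of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_alpha(n_comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the familywise rate at about alpha."""
    return alpha / n_comparisons


for n_groups in (3, 5):
    n_pairs = comb(n_groups, 2)  # number of possible pairwise comparisons
    print(
        f"{n_groups} groups: {n_pairs} comparisons, "
        f"familywise error rate {familywise_error_rate(n_pairs):.1%}, "
        f"Bonferroni threshold {bonferroni_alpha(n_pairs):.4f}"
    )
```

With 5 groups the Bonferroni threshold drops to 0.005 per comparison, which illustrates why the test becomes very conservative as the number of groups grows.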

There is an overall significant difference between the means but, even if you have an indication from the graph, you cannot tell which mean is significantly different from which. This is because the ANOVA is an omnibus test: it tells you that there is (or is not) an overall difference between your means, but not exactly which means are significantly different from which other ones. This is why you apply post-hoc tests. Post-hoc tests can be thought of as t-tests but with a more stringent approach: a lower significance threshold to correct for the familywise error rate.

From the table above you can find out which pairwise comparisons reach significance and which do not. One of the problems with GraphPad is that for post-hoc tests it does not report the exact p-values, which journals ask for more and more often. And even for you, it is important to know the exact p-values: for example, A vs. D is significant but it must be only just, judging by the 95% CI, as 0 lies very close to one end of it. Same thing for A vs. B: this time the test does not reach significance but again it must be quite close, judging again by the CI.

You can report the significance as in the graph below.

[Figure: Log(Protein Expression) for cell groups A, B, C, D and E, with significance marked by * and **]

3-5 Correlation

If you want to find out about the relationship between 2 variables, you can run a correlation.

Example (File: roe deer.xlsx)

When you want to plot data from 2 quantitative variables between which you suspect (hope?) that there is a relationship, the best choice for a first look at your data is the scatter plot. So in GraphPad, you choose an XY table. In our case we want to know if there is a relationship between the body mass and the parasite burden.

[Figure: scatter plot of roe deer body mass against parasite burden, males and females]

You have to choose between the x- and the y-axis for your 2 variables. It is usually considered that x predicts y (y=f(x)), so when looking at the relationship between 2 variables, you must have an idea of which one is likely to predict the other one. In our particular case, we want to know how an increase in parasite burden affects the body mass of the host.

By looking at the graph, one can think that something is happening here. Now, if you want to know if the relationship between your 2 variables is significant, you need to run a correlation test.

A bit of theory: Correlation coefficient

A correlation is a measure of a linear relationship (one that can be expressed as a straight-line graph) between variables. The simplest way to find out whether 2 variables are associated is to look at whether they covary. To do so, you combine the deviations of one variable from its mean with the deviations of the other: for each data point you multiply the two deviations, and you average these products to get the covariance. A positive covariance indicates that as one variable deviates from the mean, the other one deviates in the same direction; in other words, if one variable goes up the other one goes up as well.

The problem with the covariance is that its value depends upon the scale of measurement used, so you won't be able to compare covariances between datasets unless both are measured in the same units. To standardise the covariance, it is divided by the product of the SDs of the 2 variables. It gives you the most widely used correlation coefficient, the Pearson product-moment correlation coefficient r:

$$r = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y}$$

Of course, you don't need to remember that formula but it is important that you understand what the correlation coefficient does: it measures the magnitude and the direction of the relationship between two variables. It is designed to range in value between -1.0 and +1.0.
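The step from covariance to correlation coefficient can be sketched as follows. The numbers are made up for illustration (they are not the roe deer data) and the function names are hypothetical. The sketch shows that the covariance changes when the units change, whereas r does not.

```python
from statistics import mean, stdev


def covariance(x, y):
    """Sample covariance: averaged product of paired deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Covariance standardised by the product of the two standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))


# Hypothetical measurements: body mass falls as parasite burden rises
burden = [1.0, 2.0, 3.0, 4.0, 5.0]
mass_kg = [26.0, 25.0, 23.5, 22.0, 21.5]
mass_g = [m * 1000 for m in mass_kg]  # same data expressed in grams

print(f"Covariance (kg): {covariance(burden, mass_kg):.3f}")
print(f"Covariance (g):  {covariance(burden, mass_g):.3f}")
print(f"r (kg): {pearson_r(burden, mass_kg):.3f}")
print(f"r (g):  {pearson_r(burden, mass_g):.3f}")
```

The coefficient is negative here because one variable goes down as the other goes up; a perfectly decreasing straight line would give r = -1 and a perfectly increasing one r = +1.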


More information

Statistics in Medicine Research Lecture Series CSMC Fall 2014

Statistics in Medicine Research Lecture Series CSMC Fall 2014 Catherine Bresee, MS Senior Biostatistician Biostatistics & Bioinformatics Research Institute Statistics in Medicine Research Lecture Series CSMC Fall 2014 Overview Review concept of statistical power

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

SPSS Tests for Versions 9 to 13

SPSS Tests for Versions 9 to 13 SPSS Tests for Versions 9 to 13 Chapter 2 Descriptive Statistic (including median) Choose Analyze Descriptive statistics Frequencies... Click on variable(s) then press to move to into Variable(s): list

More information

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 seema@iasri.res.in 1. Descriptive Statistics Statistics

More information

Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

More information

Outline. Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test

Outline. Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test The t-test Outline Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test - Dependent (related) groups t-test - Independent (unrelated) groups t-test Comparing means Correlation

More information

Introduction to StatsDirect, 11/05/2012 1

Introduction to StatsDirect, 11/05/2012 1 INTRODUCTION TO STATSDIRECT PART 1... 2 INTRODUCTION... 2 Why Use StatsDirect... 2 ACCESSING STATSDIRECT FOR WINDOWS XP... 4 DATA ENTRY... 5 Missing Data... 6 Opening an Excel Workbook... 6 Moving around

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Introduction Course in SPSS - Evening 1

Introduction Course in SPSS - Evening 1 ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/

More information

An analysis method for a quantitative outcome and two categorical explanatory variables.

An analysis method for a quantitative outcome and two categorical explanatory variables. Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

More information

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................

More information

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences. 1 Commands in JMP and Statcrunch Below are a set of commands in JMP and Statcrunch which facilitate a basic statistical analysis. The first part concerns commands in JMP, the second part is for analysis

More information

ANOVA ANOVA. Two-Way ANOVA. One-Way ANOVA. When to use ANOVA ANOVA. Analysis of Variance. Chapter 16. A procedure for comparing more than two groups

ANOVA ANOVA. Two-Way ANOVA. One-Way ANOVA. When to use ANOVA ANOVA. Analysis of Variance. Chapter 16. A procedure for comparing more than two groups ANOVA ANOVA Analysis of Variance Chapter 6 A procedure for comparing more than two groups independent variable: smoking status non-smoking one pack a day > two packs a day dependent variable: number of

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the

More information

Experimental Designs (revisited)

Experimental Designs (revisited) Introduction to ANOVA Copyright 2000, 2011, J. Toby Mordkoff Probably, the best way to start thinking about ANOVA is in terms of factors with levels. (I say this because this is how they are described

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Northumberland Knowledge

Northumberland Knowledge Northumberland Knowledge Know Guide How to Analyse Data - November 2012 - This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about

More information

Introduction to Quantitative Methods

Introduction to Quantitative Methods Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

More information

0 Introduction to Data Analysis Using an Excel Spreadsheet

0 Introduction to Data Analysis Using an Excel Spreadsheet Experiment 0 Introduction to Data Analysis Using an Excel Spreadsheet I. Purpose The purpose of this introductory lab is to teach you a few basic things about how to use an EXCEL 2010 spreadsheet to do

More information

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.

COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. 277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

Introduction to Statistics and Quantitative Research Methods

Introduction to Statistics and Quantitative Research Methods Introduction to Statistics and Quantitative Research Methods Purpose of Presentation To aid in the understanding of basic statistics, including terminology, common terms, and common statistical methods.

More information

Using Excel for descriptive statistics

Using Excel for descriptive statistics FACT SHEET Using Excel for descriptive statistics Introduction Biologists no longer routinely plot graphs by hand or rely on calculators to carry out difficult and tedious statistical calculations. These

More information

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Features Of The Chi-Square Statistic The chi-square test is non-parametric. That is, it makes no assumptions about the distribution

More information

Mathematics within the Psychology Curriculum

Mathematics within the Psychology Curriculum Mathematics within the Psychology Curriculum Statistical Theory and Data Handling Statistical theory and data handling as studied on the GCSE Mathematics syllabus You may have learnt about statistics and

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

UNIVERSITY OF NAIROBI

UNIVERSITY OF NAIROBI UNIVERSITY OF NAIROBI MASTERS IN PROJECT PLANNING AND MANAGEMENT NAME: SARU CAROLYNN ELIZABETH REGISTRATION NO: L50/61646/2013 COURSE CODE: LDP 603 COURSE TITLE: RESEARCH METHODS LECTURER: GAKUU CHRISTOPHER

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE Perhaps Microsoft has taken pains to hide some of the most powerful tools in Excel. These add-ins tools work on top of Excel, extending its power and abilities

More information

Statistics. Measurement. Scales of Measurement 7/18/2012

Statistics. Measurement. Scales of Measurement 7/18/2012 Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does

More information

Exploratory Data Analysis. Psychology 3256

Exploratory Data Analysis. Psychology 3256 Exploratory Data Analysis Psychology 3256 1 Introduction If you are going to find out anything about a data set you must first understand the data Basically getting a feel for you numbers Easier to find

More information

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics Analysis of Data Claudia J. Stanny PSY 67 Research Design Organizing Data Files in SPSS All data for one subject entered on the same line Identification data Between-subjects manipulations: variable to

More information

Data Analysis in SPSS. February 21, 2004. If you wish to cite the contents of this document, the APA reference for them would be

Data Analysis in SPSS. February 21, 2004. If you wish to cite the contents of this document, the APA reference for them would be Data Analysis in SPSS Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Heather Claypool Department of Psychology Miami University

More information

An introduction to IBM SPSS Statistics

An introduction to IBM SPSS Statistics An introduction to IBM SPSS Statistics Contents 1 Introduction... 1 2 Entering your data... 2 3 Preparing your data for analysis... 10 4 Exploring your data: univariate analysis... 14 5 Generating descriptive

More information

The Statistics Tutor s Quick Guide to

The Statistics Tutor s Quick Guide to statstutor community project encouraging academics to share statistics support resources All stcp resources are released under a Creative Commons licence The Statistics Tutor s Quick Guide to Stcp-marshallowen-7

More information

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: What do the data look like? Data Analysis Plan The appropriate methods of data analysis are determined by your data types and variables of interest, the actual distribution of the variables, and the number of cases. Different analyses

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information