Analyzing Research Data Using Excel Fraser Health Authority, 2012 The Fraser Health Authority ( FH ) authorizes the use, reproduction and/or modification of this publication for purposes other than commercial redistribution. In consideration for this authorization, the user agrees that any unmodified reproduction of this publication shall retain all copyright and proprietary notices. If the user modifies the content of this publication, all FH copyright notices shall be removed, however FH shall be acknowledged as the author of the source publication. Reproduction or storage of this publication in any form by any means for the purpose of commercial redistribution is strictly prohibited. This publication is intended to provide general information only, and should not be relied on as providing specific healthcare, legal or other professional advice. The Fraser Health Authority, and every person involved in the creation of this publication, disclaims any warranty, express or implied, as to its accuracy, completeness or currency, and disclaims all liability in respect of any actions, including the results of any actions, taken or not taken in reliance on the information contained herein.
http://research.fraserhealth.ca/
Objectives: To review key concepts and elements of quantitative research. To explore the application of Excel in conducting a research project: creating data files, creating a data dictionary, and linking the research question to the appropriate analysis. To apply statistics to analytical data.
Workshop Outline
09:00-09:15 Review of Quantitative Research
09:15-09:30 Measurement
09:30-10:30 Excel 101: Excel as database
10:30-10:45 Break
10:45-11:15 Using Excel to Clean/Explore Data
11:15-12:00 Using Excel to Analyze Data
Excel Pop Quiz. How many columns? 2003: 256; 2007 & 2010: 16,384. How many rows? 2003: 65,536; 2007 & 2010: 1,048,576. True or false: You can conduct statistical analyses in Excel.
Quick Review of Quantitative Research
Framework for Quantitative Research. Conduct literature review. Develop rationale: Why do you want to do this research? What do others say? What are the knowledge gaps? Formulate research question. Generate objective(s) and/or hypothesis. PICO Method: P = population / patient; I = intervention; C = comparison; O = outcome. Hypothesis: (usually) a statement of anticipated results. Objective: an action statement. Apply methods and conduct the study: Measurement, Study Design, Analysis.
Research Question
Measurement: Thinking in Numbers
From this:
ID | Gender | Age | Disease Outcome
1 | Male | 59 | Y
2 | Female | 52 | Y
3 | Male | 53 | N
4 | Female | 60 | N
To this:
ID | Gender | Age | Disease Outcome
1 | 1 | 59 | 1
2 | 2 | 52 | 1
3 | 1 | 53 | 0
4 | 2 | 60 | 0
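The recoding shown on this slide can also be sketched outside Excel. The Python below is only an illustration: the rows are copied from the slide, the code scheme (Male = 1, Female = 2, Y = 1, N = 0) is read off the "To this" table, and the names `GENDER_CODES`, `OUTCOME_CODES` and `recode` are hypothetical.

```python
# Rows copied from the "From this" table on the slide.
raw_rows = [
    {"ID": 1, "Gender": "Male", "Age": 59, "Disease_Outcome": "Y"},
    {"ID": 2, "Gender": "Female", "Age": 52, "Disease_Outcome": "Y"},
    {"ID": 3, "Gender": "Male", "Age": 53, "Disease_Outcome": "N"},
    {"ID": 4, "Gender": "Female", "Age": 60, "Disease_Outcome": "N"},
]

# Code scheme read off the "To this" table (names are illustrative).
GENDER_CODES = {"Male": 1, "Female": 2}
OUTCOME_CODES = {"Y": 1, "N": 0}


def recode(row):
    """Return a copy of the row with text categories replaced by number codes."""
    return {
        "ID": row["ID"],
        "Gender": GENDER_CODES[row["Gender"]],
        "Age": row["Age"],
        "Disease_Outcome": OUTCOME_CODES[row["Disease_Outcome"]],
    }


coded_rows = [recode(row) for row in raw_rows]
for row in coded_rows:
    print(row)
```

A lookup table per variable keeps the coding explicit; an unexpected text value raises a KeyError instead of being coded silently.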
Types of Variables. Independent variable (IV): influences your outcome measure; active (intervention) or attribute (characteristic). Dependent variable (DV): influenced by the IV(s); usually represents the outcome studied. Confounders: an alternative explanation for an association between an exposure (IV) and an outcome (DV); not a focus of the study; independently associated with the outcome; associated with the exposure under study.
Level of Measurement. Nominal (example: gender): data categorized into mutually exclusive and unordered groups; number codes can be assigned (male=1; female=2) but calculations on them would be meaningless. Ordinal (example: SES level: low, middle and high income): data classified/categorized with implied order; distance between data not always equal; can't measure the magnitude or quantify the difference between data: how much lower is middle income than high income?
Level of Measurement. Interval: attributes measured on interval scales; equal distance between each interval; distance between scale numbers has meaning; arbitrary zero point (e.g., temperature). Ratio: similar to interval scale, but has a true zero point; clear definition of 0: there is none of the variable. Example: weight, salary ($0=$0). Can make assumptions about the ratio of two measurements: 6 grams is twice as much as 3 grams.
Level of Measurement & Acceptable Statistical Operations
Operation | Nominal | Ordinal | Interval | Ratio
Frequency distribution | Yes | Yes | Yes | Yes
Mode | Yes | Yes | Yes | Yes
Median & percentile | No | Yes | Yes | Yes
Mean & standard deviation | No | No | Yes | Yes
Excel 101
Objectives: How to organize data in an MS Excel spreadsheet. How to define variables. How to code in preparation for analysis.
Terminology. Data: information that you collect. Dataset: collection of data, usually presented in tabular form; columns represent variables; rows represent members of the dataset. Spreadsheet: computer application that facilitates use of datasets (entering data, analyses, sharing). MS Excel is a spreadsheet program.
Using Excel for a Research Study. To capture data: facilitate data collection, minimize entry errors. To clean/explore/describe data: starting point for analyses is cleaning raw data; preliminary descriptive statistics. To analyze data: using program add-ins.
Stages in preparing data for analysis: Collect data. Create data file: enter & clean data. Explore data. Analyze data. Interpret results.
Stages in preparing data for analysis: Good practice. Design your spreadsheet keeping your statistical analyses in mind. Use logic checks to clean data. Create a data dictionary. Consult with an analyst.
Creating data file using spreadsheet. Each variable (e.g., ID #) is represented by a column. Each participant is represented by a row. All the information for a single case is entered across one row only. The data in each column summarizes information on a particular variable.
Creating Your Spreadsheet: Defining variables. Use descriptive, unique names for variables. Use underscore (_) instead of space. Be consistent in naming, especially with array variables (variables that capture a pattern). Example: measuring blood pressure at regular intervals (e.g., Q30 min for 12 hours): BP_30min, BP_1hr, BP_final; BP_time1, BP_time2, BP_time3.
Creating Your Spreadsheet. Each column captures only a single piece of information.
Before:
ID | Intervention_Used
1 | Yes surgery
2 | Yes medication
3 | No
After:
ID | Intervention_Used | Intervention_Type
1 | Yes | Surgery
2 | Yes | Medication
3 | No | Not applicable
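If data were already entered with two pieces of information in one cell, the split can be scripted. This is a rough sketch only: it assumes every cell is non-empty and starts with Yes or No, optionally followed by the intervention type, as in the slide's example. The function name `split_intervention` is hypothetical.

```python
def split_intervention(value):
    """Split a combined cell such as 'Yes surgery' into (used, type).

    Assumes a non-empty cell whose first word is the Yes/No answer; any
    remaining text is treated as the intervention type.
    """
    parts = value.split(maxsplit=1)
    used = parts[0]
    kind = parts[1].capitalize() if len(parts) > 1 else "Not applicable"
    return used, kind


combined = ["Yes surgery", "Yes medication", "No"]  # values from the slide
for cell in combined:
    print(split_intervention(cell))
```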
Creating Your Spreadsheet. Use tools to facilitate entry: add notes for data entry (Insert > Comments).
Creating Your Spreadsheet. Use tools to facilitate entry: colour-code columns (Fill Color icon).
Creating Your Spreadsheet. For numeric variables, use Format to force entry into a specific form: date, number of decimals. Highlight the entire column to ensure a consistent format is applied.
Data Dictionaries (also known as code books). A centralized repository of information about data, such as meaning, relationships to other data, origin, usage, and format. Develop the data dictionary before collecting data. Components: variable name; variable label; type (nominal or interval); values/coding for each variable. Always assign values for missing or non-applicable cases.
Sample Data Dictionary: Data Dictionary for Research Consults Database
Item | Variable | Label | Type | Coding/entry instruction
1 | ID_no | ID number | Numeric | Enter unique number for each record/patient (1.100)
2 | Gender | Gender of participant | String | Enter corresponding number: 0 = Female, 1 = Male, 999 = missing
3 | Date_T1 | Date at baseline | Date | Specify format: mm/dd/yy
4 | Education | Highest level of education | Numeric | Enter corresponding number: 1 = Grade 10, 2 = Grade 12, 3 = College Diploma, 4 = University
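A data dictionary can double as an entry check. The sketch below is illustrative only: it stores the Gender and Education codes from the sample dictionary in a Python structure and flags values that are not allowed codes. The names `DATA_DICTIONARY` and `invalid_fields` are hypothetical and not part of the workshop materials.

```python
# Coded variables from the sample data dictionary on the slide.
DATA_DICTIONARY = {
    "Gender": {
        "label": "Gender of participant",
        "codes": {0: "Female", 1: "Male", 999: "Missing"},
    },
    "Education": {
        "label": "Highest level of education",
        "codes": {1: "Grade 10", 2: "Grade 12", 3: "College Diploma", 4: "University"},
    },
}


def invalid_fields(record):
    """Return the names of coded fields whose value is not an allowed code.

    A field that is absent from the record is also reported, which mirrors
    the advice to never leave a cell blank.
    """
    return [
        name
        for name, spec in DATA_DICTIONARY.items()
        if record.get(name) not in spec["codes"]
    ]


print(invalid_fields({"Gender": 1, "Education": 4}))  # every code is allowed
print(invalid_fields({"Gender": 2, "Education": 3}))  # 2 is not a Gender code
```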
Data Structures. Keep your analyses in mind when designing the dataset: statistical packages require that data be entered in a specific way in order to run analytical steps.
Using Excel to Clean / Explore Data
Data Cleaning & Data Exploration. Data at this stage is a raw dataset. Clean: fix any entry errors and duplicate entries; convert text variables into numeric variables. Explore: look for any outliers. Excel tools to facilitate cleaning and exploring: Filter, Sort.
Data Cleaning & Data Exploration. Sources of data errors: missing values (never leave a cell blank; assign a meaningful number for missing values; consider coding for cases with non-applicable responses); typing errors on data entry (e.g., age = 121); measurement errors (e.g., height). Identify errors with descriptive statistics for each variable: minimum and maximum values; means, medians and SD.
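The error check described above (minimum, maximum, mean, median and SD per variable) can be sketched in a few lines. The ages below are hypothetical, with 121 standing in for the typing error mentioned on the slide, and the plausible range of 18 to 110 is an assumption chosen for illustration.

```python
import statistics

# Hypothetical ages; 121 mimics the slide's typing-error example.
ages = [59, 52, 53, 60, 121]


def summarize(values):
    """Descriptive statistics that help reveal implausible entries."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


def out_of_range(values, low, high):
    """Return (position, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


print(summarize(ages))
print(out_of_range(ages, 18, 110))  # assumed plausible age range
```

The maximum alone exposes the bad entry here; the gap between mean and median is a second hint that one value is pulling the average up.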
Our Dataset for the Remainder of the Workshop
Gout and AMI Study
Gout and AMI 2 Study. Research question: What clinical factors are associated with the development of AMI among elderly women with gout? Objectives: To compare women with gout and no gout on clinical factors including age, BMI, uric acid level. To evaluate the correlation between clinical factors among elderly women with gout and without gout.
Gout and AMI 2 Dataset
Gout and AMI 2 Study. Data dictionary (handout). 13 variables: What are the continuous variables? What are the categorical variables? N = 200 subjects.
Excel Tool: Filter. Place cursor over the block by A1. Data > Filter > AutoFilter (2003). Data > Sort & Filter > Filter (2007).
Exercise 1. You are the analyst for the Gout and AMI 2 study. A data dictionary was not implemented (GASP!) and there are entry errors. Clean the dataset by locating and correcting entry errors using tools in Excel. File: Sample_worksheet_Exercise1.xls
Summary. Important data considerations while conducting the study: designing the dataset; collecting the data; entering the data; cleaning/exploring raw data; applying statistics to analytical data.
Using Excel to Analyze Data
Using Excel to Analyze Data. Analysis ToolPak: an add-in to be installed in Excel; a supplemental program that adds custom commands. Descriptive statistics. Analytical statistics: t-tests, correlation.
Getting Started. 1. Click Tools > Add-Ins. 2. Select Analysis ToolPak.
Descriptive Statistics. Statistics used to describe characteristics of the study population/sample; not used to infer the properties of the population from which the sample was drawn. For continuous variables: measures of central tendency (mean, median, mode); measures of variability (standard deviation); shape of distribution (skewness, kurtosis).
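For readers who want to see what sits behind these measures, here is a minimal sketch using Python's standard library. The BMI values are hypothetical. The skewness function uses the bias-adjusted sample formula, which is assumed here to correspond to what Excel reports; treat it as an illustration rather than a reproduction of the ToolPak output.

```python
import statistics


def sample_skewness(values):
    """Bias-adjusted sample skewness: n/((n-1)(n-2)) * sum(((x-mean)/sd)^3).

    Needs at least 3 values and a non-zero standard deviation.
    """
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)


def describe(values):
    """Central tendency, variability and shape for one continuous variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "skewness": sample_skewness(values),
    }


bmi = [22.0, 24.5, 24.5, 27.0, 31.0]  # hypothetical BMI values
summary = describe(bmi)
print(summary)
```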
Descriptive Statistics. 1. Click Tools > Data Analysis. 2. Select Descriptive Statistics.
Example: Age. Input range: select columns $A:$E. Grouped by: columns. Check Labels in first row. Output options: New worksheet. Check Summary statistics.
Output for Descriptive Statistics: Age
Exercise 2. Data file = sample_worksheet_masterfile.xls. Using Descriptive Statistics in Analysis ToolPak, obtain descriptive statistics for: Uric Acid, BMI.
Descriptive Statistics by Group. Use Pivot Tables: useful to obtain descriptive statistics by group, for example, if you wanted to know the average and standard deviation of BMI for men and women.
Pivot Tables. Click Data > Pivot Table and Pivot Chart.
Example: Age by Gout Diagnosis. Step 1: Where is the data? What kind of report?
Example: Age by Gout Diagnosis. Step 2: Where is the data you want to use? Drag pointer over the entire dataset.
Example: Age by Gout Diagnosis. Step 3: Where do you want to put the Pivot Table? Layout: this is where you tell Excel which groups you want to output your results by and for what variables.
Example: Age by Gout Diagnosis. Step 3, Layout: Row represents your grouping variable (Gout_Dx). Column represents the variable you want to output according to groups. You may need to drag the variable several times, once for each parameter needed.
Example: Age by Gout Diagnosis. Output shown for two groups: No gout, Gout.
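The grouping a Pivot Table performs here can be sketched as follows. The records are hypothetical, the column names `Gout_Dx` and `Age` follow the example above, and the 0/1 coding of `Gout_Dx` is an assumption. The function name `describe_by_group` is illustrative.

```python
from collections import defaultdict
import statistics


def describe_by_group(rows, group_key, value_key):
    """Mean and sample SD of value_key within each level of group_key.

    Each group needs at least 2 values, otherwise the SD is undefined.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        level: {
            "n": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),
        }
        for level, values in groups.items()
    }


# Hypothetical records; Gout_Dx assumed coded 0 = no gout, 1 = gout.
records = [
    {"Gout_Dx": 0, "Age": 50},
    {"Gout_Dx": 0, "Age": 60},
    {"Gout_Dx": 1, "Age": 55},
    {"Gout_Dx": 1, "Age": 65},
    {"Gout_Dx": 1, "Age": 75},
]
by_group = describe_by_group(records, "Gout_Dx", "Age")
print(by_group)
```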
Exercise 3. Using Pivot Tables and Charts, obtain the mean and standard deviation for uric acid according to gout diagnosis.
Analytic Statistics. Statistical procedures used to draw conclusions about a population from sample data. Compare groups: t-tests. Evaluate correlation: correlation coefficients. Evaluate association: regression models.
Analytic Statistics: Considerations. Research question: describe, compare or predict? Levels of data measurement: nominal, ordinal or interval? Are you comparing the same or different subjects? Number of experimental groups? Is your data normally distributed? What are the assumptions of the statistical test you would like to use?
Check Data Assumptions. What are the assumptions of the statistical test you would like to use? Some common assumptions are: the DV and IV need to be measured on a certain level (e.g., continuous); the population is normally distributed (not skewed).
Selecting Appropriate Statistical Test. Statistical decision tree (handout): 1. Research goal. 2. Identify IV and DV. 3. Describe level of the data. 4. Identify the number of groups and whether they are paired. 5. Check data assumptions.
Type of Dependent Variable Data
Goal | Continuous (normal) | Ordinal (non-normal) | Categorical
Describe one group | Mean, SD | Median, interquartile range | Proportion
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test (chi-square for large samples)
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran's Q
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients
Predict value from another measured variable | Simple linear regression or nonlinear regression | Nonparametric regression | Simple logistic regression
Predict value from several measured or binomial variables | Multiple linear regression or multiple nonlinear regression | - | Multiple logistic regression
Exercise 4 (handout). A pilot experiment designed to test the effectiveness of a new therapy for pain management in patients with chronic pain, conducted over a one-year period. What is the goal? What is the IV? What is the DV? How many groups? Paired/matched or independent? What is the level of measurement?
Comparing Groups: Mean differences. Independent Samples T-Test: comparison of the means of 2 non-paired groups, e.g., differences in pain levels between 2 groups (standard care and new intervention). Paired Samples T-Test: comparison of the means of 2 paired measures, e.g., differences in pain levels within groups; pre and post measurement, or repeated measurement under different conditions.
Comparing Groups. Sort the dataset by the variable you are comparing. Click Tools > Data Analysis. Three options for t-tests.
Is there a difference in age between gout and non-gout patients? Highlight age data for the first group (no gout). Highlight age data for the second group (gout). Output to new worksheet.
Output: No gout, Gout.
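To make the independent samples comparison concrete, the sketch below computes the t statistic and degrees of freedom for two unpaired groups under the equal-variances assumption (pooled variance). The input values are hypothetical, the function name `pooled_t_test` is illustrative, and no p-value is computed because the Python standard library has no t distribution.

```python
import math
import statistics


def pooled_t_test(group_a, group_b):
    """t statistic and degrees of freedom for two independent groups.

    Assumes equal variances in the two groups (pooled-variance t-test).
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df


# Hypothetical values for two unpaired groups.
no_gout = [1, 2, 3]
gout = [2, 3, 4]
t_stat, df = pooled_t_test(no_gout, gout)
print(t_stat, df)
```

The sign of t depends only on which group is entered first; the magnitude is what is compared against the critical value.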
Paired Samples T-Test Procedure. Tools > Data Analysis > t-Test: Paired Two Sample for Means. Input 1 Range: DV at time 1. Input 2 Range: DV at time 2. Output Options.
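The paired test works on the within-subject differences. The following is a minimal sketch with hypothetical pain scores, where the two lists must be in the same subject order. The function name `paired_t_test` is illustrative and, as before, no p-value is computed.

```python
import math
import statistics


def paired_t_test(time1, time2):
    """t statistic and degrees of freedom for paired measurements.

    time1[i] and time2[i] must belong to the same subject.
    """
    if len(time1) != len(time2):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(time1, time2)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    t = statistics.mean(differences) / standard_error
    return t, n - 1


# Hypothetical pain scores for four subjects.
pain_before = [5, 6, 7, 8]
pain_after = [4, 4, 6, 6]
t_paired, df_paired = paired_t_test(pain_before, pain_after)
print(t_paired, df_paired)
```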
Exercise 5. Using the t-test tools in Analysis ToolPak, compare patients with no gout versus gout according to: Uric Acid levels, BMI.
Exercise 6. Use the paired t-test in Analysis ToolPak to answer the following question: Is there a difference in mean pain before and after surgery?
Associate: Correlation. Allows an examination of the association between variables. Range: -1 to +1. Gives information about the strength of the association and its direction (positive or negative). A correlation coefficient of 0 = no linear relationship. A correlation coefficient of +1 = perfect positive relationship. A correlation coefficient of -1 = perfect negative relationship.
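As an illustration of how the coefficient behaves at its extremes, the sketch below computes a Pearson correlation by hand on hypothetical values. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values chosen to show the two extremes.
print(pearson_r([1, 2, 3], [2, 4, 6]))  # y rises with x
print(pearson_r([1, 2, 3], [6, 4, 2]))  # y falls as x rises
```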
Evaluating Correlation. Continuous variables. Columns: side-by-side. Click Tools > Data Analysis. Click Correlation.
What is the Correlation Between Age and Uric Acid Level? Highlight 2 columns. Grouped by: columns. Labels in first row. Output to new worksheet.
Output
Exercise 7. Using Correlation in Analysis ToolPak, evaluate the correlation between: Uric Acid and BMI.
Thank You!