Data Management and Analysis for Successful Clinical Research
Lily Wang, PhD
Department of Biostatistics
Vanderbilt University

Goals of This Presentation
- Provide an overview of the data management and analysis aspects of clinical research
- Minimize errors in datasets
- Ensure statistical software packages will recognize data correctly
- Facilitate efficient data analysis for projects

An Overview of the Process
1. Write the protocol: consult mentors and colleagues, and visit us to finalize specific aims, testable hypotheses and study design
2. Create a Data Dictionary
3. Create a Patient Directory
4. Prepare datasets for statistical analysis

An Overview of the Process (continued)
5. The statisticians will assist with statistical tests
6. Review results, start thinking about writing the paper
7. Additional tables and figures
8. Write the paper/abstract

Timeline
- For abstracts, please send us datasets at least 4 weeks in advance
- Please contact us even if you don't have the dataset ready, so we can schedule other projects and leave room for yours

1. Writing the Proposal
- Background
  - Why this research is important
  - Be concise
- Specific Aims, Testable Hypothesis
  - Be focused, clearly conceptualized, and feasible
  - The most important section of the proposal
  - Consult mentors and colleagues, and visit us

1. Writing the Proposal
- Methods/Experimental Design
  - Participants
  - Inclusion/Exclusion Criteria
  - Recruiting Process
  - How the measurements will be made

1. Writing the Proposal
- Challenges/Potential Problems
  - Loss to follow-up
  - Bias: confounding variables and other sources
- Human Subjects Protection Plan
  - Informed consent
  - Adverse events
  - Privacy, confidentiality issues

Bias
- Definition: any systematic error in the design, conduct or analysis of a study that results in a mistaken estimate of an exposure's effect on the risk of disease

Confounding: definition
In a study of whether factor A is a cause of disease B, a third factor, factor X, is a confounder if
- Factor X is a known risk factor for disease B
- Factor X is associated with factor A, but is not a result of factor A

Confounding: an example (coffee drinking and pancreatic cancer)
If an association is observed between coffee drinking and pancreatic cancer, then either
- Coffee drinking causes pancreatic cancer, or
- Smoking is a risk factor for pancreatic cancer and smoking is associated with coffee drinking, so smoking confounds the observed association

1. Writing the Proposal
- Confounding: ways to deal with it
  - In the design phase: match cases to controls on confounding variables
  - In the analysis phase: stratification, adjustment
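
A small sketch may help show what stratification does in the coffee and smoking example. The counts below are hypothetical, invented purely for illustration and not taken from any study. They are chosen so that coffee drinking has no effect within either smoking stratum, yet the pooled (crude) table suggests an association. The Mantel-Haenszel estimate is used here as one common way to combine strata; the slides do not name a specific method.

```python
# Hypothetical counts, invented for illustration only.
# Each stratum is a 2x2 table:
#   (cases among coffee drinkers, non-cases among coffee drinkers,
#    cases among non-drinkers,    non-cases among non-drinkers)
strata = {
    "smokers":     (80, 320, 20, 80),
    "non_smokers": (5, 95, 20, 380),
}


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a*d) / (b*c)."""
    return (a * d) / (b * c)


def crude_odds_ratio(tables):
    """Pool all strata into one table, ignoring the confounder."""
    a, b, c, d = (sum(cell) for cell in zip(*tables))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel summary odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


tables = list(strata.values())
crude = crude_odds_ratio(tables)
adjusted = mantel_haenszel_odds_ratio(tables)
stratum_ors = {name: odds_ratio(*table) for name, table in strata.items()}

print(f"crude OR: {crude:.2f}")
for name, value in stratum_ors.items():
    print(f"OR among {name}: {value:.2f}")
print(f"Mantel-Haenszel adjusted OR: {adjusted:.2f}")
```

With these made-up counts the crude odds ratio is well above 1, while the odds ratio within each smoking stratum, and the adjusted summary, equal 1. The apparent coffee effect is entirely due to smoking.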

1. Writing the Proposal
- Statistical Analysis (provided by the statisticians)
  - Sample size/power calculations
  - Analysis Plan
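
To give a rough sense of what a sample size calculation involves, here is a minimal sketch using the standard normal-approximation formula for a two-sided comparison of two means with equal group sizes, n per group = 2(z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. The function name `n_per_group` and the input values are assumptions made for illustration; the calculation for a real study depends on its design and is done with the statisticians.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes).

    delta: smallest difference in means worth detecting
    sigma: assumed common standard deviation
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)


# Hypothetical inputs: detect a 10 mmHg difference in systolic blood
# pressure, assuming a standard deviation of 15 mmHg.
print(n_per_group(delta=10, sigma=15))
```

Halving the detectable difference roughly quadruples the required sample size, which is one reason the specific aims need to be settled before this step.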

1. Writing the Proposal
- A good example: Dr. Malow's template

2. Create a Data Dictionary
Name     Description                   Units  Type        Values (permissible ranges)
group    treatment group                      discrete    1 = placebo, 2 = trt
age      age in years                  year   continuous  10-79
bp_sys   systolic blood pressure       mmHg   continuous  100-160
bp_dias  diastolic blood pressure      mmHg   continuous  80-150
date0    date for baseline assessment         date        mm/dd/yyyy
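
A data dictionary like this one can also be used to check entries automatically. The sketch below is one possible way to do that, not a prescribed tool: the dictionary layout (keys `type`, `values`, `range`, `format`), the function name `check_record` and the two sample records are all assumptions made for illustration.

```python
from datetime import datetime

# The data dictionary expressed as a Python dict.
# The layout of each rule is an assumption for this sketch.
DATA_DICTIONARY = {
    "group":   {"type": "discrete", "values": {1: "placebo", 2: "trt"}},
    "age":     {"type": "continuous", "range": (10, 79)},
    "bp_sys":  {"type": "continuous", "range": (100, 160)},
    "bp_dias": {"type": "continuous", "range": (80, 150)},
    "date0":   {"type": "date", "format": "%m/%d/%Y"},
}


def check_record(record):
    """Return a list of problems found in one subject's record."""
    problems = []
    for name, rule in DATA_DICTIONARY.items():
        value = record.get(name)
        if value is None or value == "":
            continue  # blank means missing; not treated as an error here
        if rule["type"] == "discrete":
            if value not in rule["values"]:
                problems.append(f"{name}={value!r} is not a permitted code")
        elif rule["type"] == "continuous":
            low, high = rule["range"]
            if not low <= value <= high:
                problems.append(f"{name}={value!r} is outside {low}-{high}")
        elif rule["type"] == "date":
            try:
                datetime.strptime(value, rule["format"])
            except ValueError:
                problems.append(f"{name}={value!r} is not a valid mm/dd/yyyy date")
    return problems


# Hypothetical records for illustration.
ok = {"group": 1, "age": 65, "bp_sys": 140, "bp_dias": 90, "date0": "2/5/1999"}
bad = {"group": 3, "age": 0.5, "bp_sys": 190, "bp_dias": 120, "date0": "2/30/1999"}

print(check_record(ok))
for problem in check_record(bad):
    print(problem)
```

A flagged value is not necessarily wrong; it is a prompt to check the entry against the source record or to revisit the permissible range.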

3. Create a Patient Directory
ID  FirstName  LastName  Address  Phone  ...
1   John       Smith
2   Mary       Ann
3   Joe        Kim
- Include any other information you would like to record for reference
- Keep this file to yourself, and don't send it to us

4. Prepare datasets for Statistical Analysis
A good example
ID  group  age  sex  ht  wt   bp_sys  bp_dias  stage  race  date0       complic
1   1      25   1    61  350  120     80       3      3.0   1/15/1999   0
2   1      65   2    68  161  140     90       2      1.0   2/5/1999    1
3   1      25   1    47  150  160     110      4      2.0   1/15/1998   1
4   1      31   1    66  161  140     105      2      2.0   4/1/1999    0
5   1      42   2    72  177  130     70       2      1.0   2/15/1999   0
6   1      45   2    67  160  120     80       1      2.0   3/6/1999    0
7   1      44   1    72  145  120     80       1      1.0   2/28/1999   0
8   1      55   1    72  161  120     95       4      2.0   6/15/2000   1
9   1      0.5  2    66  174  160     110      3      4.0   12/14/2000  1
10  1      21   2    60  155  190     120      2      2.0   11/14/2000  0

4. Prepare datasets for Statistical Analysis
- First: strip off any confidential information (name, address, phone number)
- Rows: one per subject (sample, observation)
- Columns: one per measurement (variable)
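
The first step can be sketched in a few lines. The set of confidential column names and the sample rows below are assumptions for illustration; the real list should match whatever the patient directory actually holds.

```python
# Column names treated as confidential. This list is an assumption and
# should be adapted to the actual patient directory.
CONFIDENTIAL = {"FirstName", "LastName", "Address", "Phone"}


def deidentify(rows):
    """Drop confidential columns from each row (one row per subject)."""
    return [
        {name: value for name, value in row.items() if name not in CONFIDENTIAL}
        for row in rows
    ]


# Hypothetical rows: each dict is one subject, each key is one variable.
raw = [
    {"ID": 1, "FirstName": "John", "LastName": "Smith", "group": 1, "age": 25},
    {"ID": 2, "FirstName": "Mary", "LastName": "Ann", "group": 1, "age": 65},
]

analysis = deidentify(raw)
print(analysis)
```

The ID column is kept, so the analysis dataset can still be linked back to the patient directory by whoever holds that file.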

4. Preparing datasets
Variable Names (column labels)
- No special characters (<, etc.) except _
- Start with letters, not numbers
- Fewer than 8 characters
- Should be unique
- No spaces
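
These naming rules are simple enough to check mechanically before sending a dataset. The following is a minimal sketch; the function name `check_variable_names` and the wording of the messages are assumptions.

```python
import re

# A letter first, then only letters, digits or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_variable_names(names, max_length=7):
    """Return {name: [problems]} for column names that break the rules.

    max_length=7 encodes "fewer than 8 characters".
    """
    problems = {}
    for name in names:
        issues = []
        if not NAME_PATTERN.fullmatch(name):
            issues.append("must start with a letter and use only letters, digits or _")
        if len(name) > max_length:
            issues.append(f"longer than {max_length} characters")
        if names.count(name) > 1:
            issues.append("not unique")
        if issues:
            problems[name] = issues
    return problems


good = ["ID", "group", "age", "sex", "ht", "wt", "bp_sys", "bp_dias",
        "stage", "race", "date0", "complic"]
bad = ["24hrhct", "blood pressure", "age", "age"]

print(check_variable_names(good))
print(check_variable_names(bad))
```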

4. Preparing datasets
Data Values
- Be consistent: "M" is not the same as "m"; use one date format and one upper/lower case convention
- No spaces
- No embedded formulas: use Paste Special, then paste values
- Missing data: leave the cell blank, unless there are different reasons for missing data, in which case code each reason as a different value
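
When a sheet has already been filled in inconsistently, entries can often be mapped to one convention after the fact. The sketch below shows one possible approach; the target codes (1 = male, 2 = female), the accepted spellings and the list of accepted date formats are assumptions, and should follow the study's own data dictionary.

```python
from datetime import datetime

# Free-text entries mapped to one consistent code. The target codes
# (1 = male, 2 = female) are an assumption for this sketch.
SEX_CODES = {"m": 1, "male": 1, "f": 2, "fem": 2, "female": 2}


def clean_sex(value):
    """Map entries such as 'M', 'm', 'Male' to one code; blank if unrecognized."""
    return SEX_CODES.get(value.strip().lower(), "")


def clean_date(value, formats=("%m/%d/%Y", "%m/%d/%y", "%m-%d-%y")):
    """Rewrite a date as mm/dd/yyyy; return blank if it cannot be parsed."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return ""


for entry in ["Male", "m", "female", "unknown"]:
    print(repr(entry), "->", repr(clean_sex(entry)))
for entry in ["1/15/99", "2/05/1999", "6-15-00", "2/30/99", "last fall"]:
    print(repr(entry), "->", repr(clean_date(entry)))
```

Entries that come back blank here, such as an impossible date, should be checked against the source record, not silently accepted as missing.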

4. Preparing datasets
- Only 1 variable in each column; use separate columns for values that are not mutually exclusive
- Derived variables: the statisticians can compute those
- Keep all information as continuous variables; information lost by grouping values into categories can't be recovered
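
A common violation of the one-variable-per-column rule is blood pressure entered as a single "120/80" cell. As a minimal sketch, such a cell can be split into two numeric columns; the function name `split_blood_pressure` is an assumption, and the column names `bp_sys` and `bp_dias` follow the data dictionary.

```python
import re

# Two whole numbers separated by a slash, e.g. "120/80".
BP_PATTERN = re.compile(r"(\d+)\s*/\s*(\d+)")


def split_blood_pressure(value):
    """Split an entry like '120/80' into (bp_sys, bp_dias).

    Returns two blanks when the entry does not have that form.
    """
    match = BP_PATTERN.fullmatch(value.strip())
    if match is None:
        return "", ""
    return int(match.group(1)), int(match.group(2))


for entry in ["120/80", "140 / 90", ">160/110", "normal"]:
    print(repr(entry), "->", split_blood_pressure(entry))
```

Entries such as ">160/110" or "normal" do not contain a recoverable measurement, which is why recording the actual numbers at data entry matters.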

4. Preparing datasets
- It's OK to have separate data sheets for demographic info and clinical measurements, as long as there is a unique identifier (ID) that links all data sheets
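
The unique ID is what allows separate sheets to be combined later. The sketch below joins two sheets, each represented as a list of dicts with one dict per subject; the function name `merge_by_id` and the sample values are hypothetical.

```python
def merge_by_id(demographics, clinical, key="ID"):
    """Join two sheets (lists of dicts) on a shared unique identifier."""
    clinical_by_id = {}
    for row in clinical:
        if row[key] in clinical_by_id:
            raise ValueError(f"duplicate {key} in clinical sheet: {row[key]}")
        clinical_by_id[row[key]] = row

    merged = []
    for row in demographics:
        # Subjects with no clinical row keep only their demographic columns.
        merged.append({**row, **clinical_by_id.get(row[key], {})})
    return merged


# Hypothetical sheets for illustration.
demographics = [
    {"ID": 1, "age": 25, "sex": 1},
    {"ID": 2, "age": 65, "sex": 2},
]
clinical = [
    {"ID": 2, "bp_sys": 140, "bp_dias": 90},
    {"ID": 1, "bp_sys": 120, "bp_dias": 80},
]

merged = merge_by_id(demographics, clinical)
for row in merged:
    print(row)
```

The rows are matched by ID and not by position, so the two sheets do not need to be sorted the same way. A duplicated ID raises an error because the link would otherwise be ambiguous.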

4. Preparing Datasets
If you are in a hurry
- Record data in a file and call it Raw_xxx.xls
- Later transform it into the desired format
- It's OK to format only the variables needed for analysis and send only these to the statisticians
Good idea: visit us after you've entered the first 5 patients and completed the data dictionary

What's wrong with this data sheet?
Comparison of Drug A and Drug B
Columns: Age of Patient, Patient Gender, Height (inches), Weight (pound), 24hrhct, blood pressure, tumor stage, Race, Date enrolled, complications
Drug A
1 25 Male 61" >350 38% 120/80 2-3 Hipanic 1/15/99 no
2 65+ female 5'8" 161 32 140/90 II White 2/05/1999 yes
3? Male 120cm 12 >160/110 IV Black Jan 98 yes, pneumonia
4 31 m 5'6" obse 40 140 sys 105 dias? ican-americ?
5 42 f >6 ft normal 39 missing =>2 W Feb 99
6 45 f 5.7 160 29 80/120 NA B last fall n
7 unknown? 6 145 35 normal 1 W 2/30/99 n
8 55 m 72 161.45 12/39 120/95 4 ican-americ 6-15-00 y
9 6 months f 66 174 38 160/110 3 Asian 14/12/00 y
10 21 f 5'
Drug B
1 55 m 61 145 normal 120/80 120/90 IV ative Americ 6/20/ 3
2 45 f 4"11 166? 135/95 2b none 7/14/99 n
3 32 male 5'13" 171 38 140/80 not staged NA 8/30/99 n
4 44 na 65? 40 120/80 2? 09/01/00 n
5 66 fem 71 0 41 140/90 4 w Sep 14th y, sepsis
6 71 unknown 172 199 38 >160/110 3 b unknown y, died
7 45 m? 204 32 140 sys 105 dias 1 b 12/25/00 n
8 34 m NA 145 36 130 3 w July 97 n
9 13 m 66 161 39 166/115 2a w 06/06/99 n
10 66 m 68 176 41 1120/80 3 w 01/21/58 n
Average 45 65 155 38

Acknowledgement
- Guideline for data collection and data entry: http://biostat.mc.vanderbilt.edu/wiki/main/theresascott
- 10 Data Entry Commandments, Spreadsheet from Heaven/Hell: http://biostat.mc.vanderbilt.edu/wiki/main/danielbyrne