Cohort Analysis for Genetic Epidemiology (C.A.G.E.) User Reference Manual




CAGE is a UNIX-based program that calculates standardized cancer incidence ratios (observed / expected) with 95% confidence intervals, assuming that the observed number of malignancies follows a Poisson distribution. Example applications include familial aggregation using pedigree data and second-onset cancer within a cohort of cancer patients.

The CAGE program requires three input files:

1) Registry data file: Connecticut or SEER data in R1 format.
   Note: This file is provided to the user.
2) Cohort data file: Your input data file, which contains the members of the population (.dat file).
3) Data input file: A Cohort Analysis Language (CAL) program file telling CAGE how to manipulate the data and what analysis to perform (.inp file).
   Note: You need to create this file to run your analysis.

To run CAGE, use the following command line in UNIX:

    % cage registry-data-file-name data-input-file-name > output-file-name

Example:

    % cage conn9.r1 test.inp > test.out

Registry Data File

The newly updated Connecticut Tumor Registry data file (conn9.r1_ctr) and the SEER data files (conn9.r1_seerwhite and conn9.r1_seerblack) are all in the ICD-9 R1 format of the Connecticut Tumor Registry file. The conn9.r1_seerwhite data is for the SEER white population and the conn9.r1_seerblack data is for the SEER black population.

Cohort Data File

The cohort data file is an ASCII file of rows and columns of data. Each row is made up of columns that represent parameters describing one member of the population being studied. Data columns may be of type string (alphanumeric), floating point number, or integer. Columns can be delimited by spaces or tabs. A missing value is coded as -1.

Certain information must be included for each member. The required columns are:

1) Sex
2) Age
3) Birth year
4) Affection status

Do not put column labels in the first row of the file. Each column must be of numeric type.

CAGE gives the error message "Bus Error (core dumped)" if the cohort data file exceeds 10000 rows (observations). If you have a large dataset, split it into two or more files and make sure each file contains no more than 10000 rows. The UNIX commands you can use to split your data are:

    % head [-count] [file name] > [new file name]
    % tail [-count] [file name] > [new file name]

For example, if you have a large cohort data file BT.dat that contains 15678 observations, split it into two files:

    % head -10000 BT.dat > first10kbt.dat

where first10kbt.dat contains the first 10000 observations of BT.dat, and

    % tail -5678 BT.dat > last5678bt.dat

where last5678bt.dat contains the last 5678 observations of BT.dat.

You need to run CAGE on each of the two files separately and then run an S-PLUS program to combine the two results. Please consult Carol J. Etzel or Mei Liu about how to run the S-PLUS program.
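As an alternative to head and tail, the same split can be sketched in Python. This is a minimal sketch and not part of CAGE: the function name split_cohort_file and the _partN naming of the pieces are assumptions made for illustration, and the 10000-row limit is the one stated above.

```python
from pathlib import Path

MAX_ROWS = 10000  # row limit stated in the manual


def split_cohort_file(path, max_rows=MAX_ROWS):
    """Split a cohort data file into consecutive pieces of at most max_rows rows.

    Pieces are written next to the original as <stem>_part1<suffix>,
    <stem>_part2<suffix>, ... (this naming scheme is an assumption, not a
    CAGE convention). Returns the list of piece paths in order.
    """
    src = Path(path)
    with src.open() as handle:
        rows = handle.readlines()

    pieces = []
    for start in range(0, len(rows), max_rows):
        piece = src.with_name(f"{src.stem}_part{len(pieces) + 1}{src.suffix}")
        # Rows are copied unchanged, so the column layout is preserved.
        piece.write_text("".join(rows[start:start + max_rows]))
        pieces.append(piece)
    return pieces
```

For the BT.dat example above (15678 observations), this would write BT_part1.dat with 10000 rows and BT_part2.dat with 5678 rows. Each piece is then run through CAGE separately, as described above.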

Data Input File

The input file, written in CAL, describes and manipulates the cohort data and performs analyses of that data. The input file has five parts. Here is an example:

Part I: Specify the cohort data file

    data file is test.dat

Part II: Label variables

    labels (REL, BYR, SEX, PBrace, PBsex, PBYOB, cancer, AGE) ;

Part III: Identify the Sex, Age, Birth year, and Affection status columns

    sex is col SEX where (male=1, female=2) ;
    age is col AGE;
    birth year is col BYR ;
    affected_status is col cancer where (affected == 1, unaffected == 2) ;

Part IV: Define analysis groups using group-by-statement

    group FDR by REL where value == 1 || REL where value == 2 || REL where value == 3 || REL where value == 12;
    group Fathers by REL where value ==2;
    group Mothers by REL where value ==3;
    group Brothers by REL where value ==1 && SEX where value ==1;
    group Sisters by REL where value ==1 && SEX where value ==2;

Part V: Analyze defined groups using analyze-statement

    analyze FDR for site (MLG) ;
    analyze Fathers for site (MLG) ;
    analyze Mothers for site (MLG) ;
    analyze Brothers for site (MLG) ;
    analyze Sisters for site (MLG) ;

Part I: Specify the cohort data file

Tells CAGE which cohort data file to analyze.

    data file is filename

Example:

    data file is test.dat

Part II: Label variables

Tells CAGE which variables are in the cohort data file.

    labels (variable1, variable2, Sex, variable3, Age, variable4, Birth, variable5, variable6, Affected, variable7, variable8, ...)

* variable1, variable2, variable3, and so on are the variables in your data file other than age, sex, birth year, and affected status.

Example:

    labels (Familyid, Sex, Age, Affected, Relationship, Birth)

* The variables must be separated by commas.
* The variable order must be exactly the same as the column order in the dataset.
* The lower-case words sex, age, affected status, and birth year are CAL keywords, so they cannot be used in the labels statement. Use capitalized names such as Sex and Age instead; CAL is case-sensitive.

Part III: Identify the Sex, Age, Birth year, and Affection status columns

Tells CAGE where to find the age, sex, birth year, and affected status information.

    age is col column-name
    sex is col column-name where (male = value, female = value)
    birth year is col column-name
    affected_status is col column-name where (affected relational-operator value, unaffected relational-operator value)

* relational-operator is any of == < > <= >=
* Note that the sex variable uses = rather than ==.

Example:

    age is col Age
    sex is col Sex where (male = 1, female = 2)
    birth year is col Birth
    affected_status is col Affected where (affected == 1, unaffected == 0)
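Parts I to III are mechanical enough to generate from a short script, which also makes it easy to check the label rules listed above. The following is a sketch only: build_cal_header is a hypothetical helper, the RESERVED set is an assumed reading of the keyword rule (the manual names sex, age, affected status, and birth year), and the trailing semicolons follow the complete example input file shown earlier.

```python
# Lowercase words treated as CAL keywords in this sketch. The exact keyword
# list is an assumption based on the manual's wording.
RESERVED = {"sex", "age", "birth", "year", "affected_status"}


def build_cal_header(data_file, labels, sex_col, age_col, birth_col,
                     affected_col, male=1, female=2, affected=1, unaffected=0):
    """Return Parts I-III of a CAL input file as text.

    labels must list every column in the same order as the cohort data file.
    """
    clashes = [name for name in labels if name in RESERVED]
    if clashes:
        raise ValueError(f"labels clash with CAL keywords: {clashes}")
    for col in (sex_col, age_col, birth_col, affected_col):
        if col not in labels:
            raise ValueError(f"column {col!r} is not in the labels list")

    lines = [
        # Part I: cohort data file
        f"data file is {data_file}",
        # Part II: labels, comma separated, in dataset order
        f"labels ({', '.join(labels)}) ;",
        # Part III: required columns (sex uses =, affected status uses ==)
        f"sex is col {sex_col} where (male = {male}, female = {female}) ;",
        f"age is col {age_col} ;",
        f"birth year is col {birth_col} ;",
        f"affected_status is col {affected_col} "
        f"where (affected == {affected}, unaffected == {unaffected}) ;",
    ]
    return "\n".join(lines) + "\n"
```

Called with the labels from the example above (Familyid, Sex, Age, Affected, Relationship, Birth), the function returns the six statements of Parts I to III; passing a lower-case label such as sex raises an error instead of producing a file that CAL would reject.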

Part IV: Define analysis groups using group-by-statement

The group-by-statement tells CAGE to define a subgroup whose members share the stated logical conditions.

If the subgroup is defined by only one column variable:

    group identifier by column-name where value relational-op constant

* relational-op is any of == < > <= >=

Examples:

    group females by Sex where value == 2;
    group whites by Race where value == 1;

If the subgroup is defined by more than one column variable:

    group identifier by column where value relational-op constant logical-op column where value relational-op constant logical-op column where value relational-op constant

* logical-op is || (logical or) or && (logical and)

Example:

    group PBMale by REL where value == 1 || REL where value == 2 || REL where value == 3 || REL where value == 12 && PBsex where value == 1;

Part V: Analyze defined groups using analyze-statement

The analyze-statement performs an observed/expected analysis on the defined group, using the Connecticut Tumor Registry data or SEER data.

    analyze group-name for site (site-list) print summary reset variables

* site-list is a comma-delimited list of one or more sites taken from the PREFERENT CAUSE LIST.
* print summary tells CAGE to show the detailed analysis results.
* reset variables tells CAGE to start a new grouping context analysis.

Example:

    analyze females for site(brf,lip) print summary reset variables

    analyze PBAge20 for site (BLA) print summary reset variables

Analysis Output File

CAGE gives you the analysis results for each group you define in the input file. The output file contains the input file information, the missing value information, and the analysis results. Here is the output from the analysis for the given input file. The first three lines are the input file information; the "Missing essential parameter" lines list observations with missing values, which are not analyzed by CAGE; the tabular lines are the analysis results for each group.

    CONNECTICUT TUMOR FILE IS conn9.r1_ctr
    CAL FILE IS test.inp
    POPULATION DATA FILE IS test.dat

    Missing essential parameter on member 680 of group FDR: age -1, sex 2, birthyear 1917
    Missing essential parameter on member 1170 of group FDR: age -1, sex 2, birthyear 1903
    Missing essential parameter on member 1296 of group FDR: age -1, sex 1, birthyear 1933
    Missing essential parameter on member 2254 of group FDR: age -1, sex 2, birthyear 1985
    Missing essential parameter on member 2358 of group FDR: age -1, sex 2, birthyear 1957
    Missing essential parameter on member 2359 of group FDR: age -1, sex 1, birthyear 1958
    Missing essential parameter on member 2944 of group FDR: age -1, sex 1, birthyear 1951
    Missing essential parameter on member 3122 of group FDR: age 39, sex -1, birthyear 1952
    Missing essential parameter on member 3288 of group FDR: age 70, sex -1, birthyear 1921
    Missing essential parameter on member 3328 of group FDR: age -1, sex 2, birthyear 1939
    Missing essential parameter on member 3329 of group FDR: age -1, sex 1, birthyear 1910

    Group     Site  Total Pop  Tot Affected  Sum of Exp   O/E   L/E   U/E  Person Years
    FDR       MLG,       3518           256      264.01  0.97  0.85  1.10     146653.00
    TOTALS:              3518           256      264.01  0.97  0.85  1.10     146653.00

    Missing essential parameter on member 475 of group Fathers: age -1, sex 1, birthyear 1951
    Missing essential parameter on member 529 of group Fathers: age 70, sex -1, birthyear 1921
    Missing essential parameter on member 534 of group Fathers: age -1, sex 1, birthyear 1910

    Fathers   MLG,        566            90      100.39  0.90  0.72  1.10     181428.00
    TOTALS:               566            90      100.39  0.90  0.72  1.10     181428.00

    Missing essential parameter on member 104 of group Mothers: age -1, sex 2, birthyear 1917
    Missing essential parameter on member 184 of group Mothers: age -1, sex 2, birthyear 1903

    Mothers   MLG,        574            94       99.80  0.94  0.76  1.15     217060.00
    TOTALS:               574            94       99.80  0.94  0.76  1.15     217060.00

    Missing essential parameter on member 270 of group Brothers: age -1, sex 1, birthyear 1933

    Brothers  MLG,        717            20       27.21  0.74  0.45  1.14     245874.00
    TOTALS:               717            20       27.21  0.74  0.45  1.14     245874.00

    Missing essential parameter on member 638 of group Sisters: age -1, sex 2, birthyear 1939

    Sisters   MLG,        668            45       30.60  1.47  1.07  1.97     272495.00
    TOTALS:               668            45       30.60  1.47  1.07  1.97     272495.00

The analysis results contain the group name, the site name, the total population in your cohort dataset, the total number of affected cases in your cohort dataset, the expected number of cases, the ratio of observed to expected numbers of cases (SIR) with likelihood-based 95% confidence intervals (CI) from Poisson models, and the person-years. The expected number of cases is obtained by multiplying the age- and gender-specific cancer incidence rates in the Connecticut or SEER database by the corresponding person-years of your cohort data.

Calculation of Person-years in CAGE

CAGE calculates the person-years for each group by adding the person-years for that specific group to the person-years for the preceding analysis groups. Therefore, to get the person-years for a specific group only, you need to subtract the person-years reported for the preceding group from the person-years reported for that group. Below is an example. The person-years from CAGE are taken from the output file example, and the actual person-years were recalculated for each analysis group.

    Group      Person-years from CAGE   Actual person-years
    FDR        146653                   146653
    Father     181428                   34775
    Mother     217060                   35632
    Brother    245874                   28814
    Sister     272495                   26621

Interpretation of the Results

Using the above output file as an example: a significantly increased SIR was observed for MLG in the Sister group (SIR = 1.47, 95% CI = (1.07-1.97)). Decreased SIRs were observed for the FDR, Father, Mother, and Brother groups; however, these results are not statistically significant because all of their 95% CIs include 1.

How to analyze second cancer within a cohort of cancer patients?

To calculate the risk of a second cancer within a cohort of cancer patients, you need to know the age at onset of both the first and the second cancer. You will also need an accompanying S-PLUS/R function to calculate the final SIRs. This function is available upon request.

The following are the steps you need to complete to obtain an SIR for a second cancer:

First, run CAGE to get the expected number of cancers (E1) and the total number of person-years (PY1) from birth to the age at onset of the first cancer within the cohort.

Second, run CAGE to get the expected number of cancers (E2) and the total number of person-years (PY2) from birth to the age at onset of the second cancer within the cohort.

Third, calculate the corrected expected number of cancers (EC) from the age at onset of the first cancer to the second cancer: EC = E2 - E1. Calculate the corrected person-years (PYC) from the age at onset of the first cancer to the second cancer: PYC = PY2 - PY1.

Fourth, run the S-PLUS program using EC, PYC, and the observed number of second cancers within the cohort to get the SIR and 95% confidence interval for the second cancer.

[Figure: timeline from birth to the first cancer (E1) and from birth to the second cancer (E2), with EC = E2 - E1.]

The adjusted person-years for the second onset can be obtained by subtracting the person-years of the first cancer from the person-years of the second cancer.
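The arithmetic described above (recovering per-group person-years from the cumulative values CAGE prints, then forming EC, PYC, and an SIR for a second cancer) can be sketched in Python. This is an illustration under stated assumptions, not the S-PLUS/R function the authors provide: all function names are hypothetical, and the confidence limits use the exact Poisson interval found by bisection, which is a swapped-in method and not necessarily the one CAGE or the authors' function uses.

```python
import math


def actual_person_years(cumulative):
    """Turn the running person-year totals CAGE prints into per-group values."""
    previous = 0
    result = []
    for value in cumulative:
        result.append(value - previous)
        previous = value
    return result


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), mu > 0, with terms computed in log space."""
    if k < 0:
        return 0.0
    total = sum(math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))
                for i in range(k + 1))
    return min(total, 1.0)


def _bisect_increasing(func, lo, hi, iterations=200):
    """Root of an increasing function on [lo, hi], found by bisection."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_limits(observed, alpha=0.05):
    """Exact two-sided limits for a Poisson mean (assumed method, see lead-in)."""
    hi = observed + 10 * math.sqrt(observed + 1) + 10  # generous search bound
    lower = 0.0
    if observed > 0:
        # Lower limit: P(X >= observed | mu) = alpha / 2
        lower = _bisect_increasing(
            lambda mu: (1 - poisson_cdf(observed - 1, mu)) - alpha / 2, 0.0, hi)
    # Upper limit: P(X <= observed | mu) = alpha / 2
    upper = _bisect_increasing(
        lambda mu: alpha / 2 - poisson_cdf(observed, mu), 0.0, hi)
    return lower, upper


def sir_with_ci(observed, expected, alpha=0.05):
    """Return (SIR, lower limit, upper limit) for observed / expected."""
    lower, upper = exact_poisson_limits(observed, alpha)
    return observed / expected, lower / expected, upper / expected


def second_cancer_sir(observed, e1, e2, py1, py2):
    """Steps three and four: EC = E2 - E1, PYC = PY2 - PY1, then the SIR."""
    ec = e2 - e1
    pyc = py2 - py1
    sir, lower, upper = sir_with_ci(observed, ec)
    return {"EC": ec, "PYC": pyc, "SIR": sir, "lower": lower, "upper": upper}
```

Applied to the cumulative person-years in the table above, actual_person_years reproduces the "Actual person-years" column. For the Sisters row (45 observed, 30.60 expected), sir_with_ci gives an SIR of about 1.47 with limits close to the 1.07 and 1.97 shown in the output. E1, E2, PY1, and PY2 for second_cancer_sir come from the two CAGE runs described in the first and second steps.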