Cohort Analysis for Genetic Epidemiology (C.A.G.E.) User Reference Manual

CAGE is a UNIX-based program that calculates standardized cancer incidence ratios (observed / expected) with 95% confidence intervals, assuming that the observed number of malignancies follows a Poisson distribution. Example applications include familial aggregation using pedigree data and second-onset cancer within a cohort of cancer patients.

The CAGE program requires three input files:

1) Registry data file: Connecticut or SEER data in R1 format.
   Note: This file is provided to the user.
2) Cohort data file: your input data file, which contains the members of the population (.dat file).
3) Data input file: a Cohort Analysis Language (CAL) program file telling CAGE how to manipulate the data and what analysis to perform (.inp file).
   Note: You need to create this file to run your analysis.

To run CAGE, use the following command line in UNIX:

    % cage [registry data file name] [data input file name] > [output file name]

For example:

    % cage conn9.r1 test.inp > test.out

Registry Data File

The newly updated Connecticut Tumor Registry data file (conn9.r1_ctr) and the SEER data files (conn9.r1_seerwhite and conn9.r1_seerblack) are all in ICD-9 R1 format. The conn9.r1_seerwhite file covers the SEER white population and the conn9.r1_seerblack file covers the SEER black population.

Cohort Data File

The cohort data file is an ASCII file of rows and columns of data. Each row is made up of columns that represent parameters describing one member of the population being studied. Data columns may be of type string (alphanumeric), floating point number or integer. Columns can be delimited by spaces or tabs. A missing value is coded as -1.

Certain information must be included for each member. The required columns are:

1) Sex
2) Age

3) Birth year
4) Affection status

Do not label the columns in the first row of the file. Each of these required columns must be numeric.

CAGE fails with the error message "Bus Error (core dumped)" if the cohort data file exceeds 10000 rows (observations). If you have a large dataset, split it into two or more files, each containing no more than 10000 rows. The UNIX commands you can use to split your data are:

    % head -n [count] [file name] > [new file name]
    % tail -n [count] [file name] > [new file name]

For example, if you have a large cohort data file BT.dat that contains 15678 observations, split it into two files:

    % head -n 10000 BT.dat > first10kbt.dat
    % tail -n 5678 BT.dat > last5678bt.dat

Here first10kbt.dat contains the first 10000 observations of BT.dat and last5678bt.dat contains the last 5678 observations. Run CAGE on each of the two files separately, and then run an S-plus program to combine the two results. Please consult Carol J. Etzel or Mei Liu about how to run the S-plus program.
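For cohorts that need more than two pieces, the head/tail approach above can be generalized. The following is a minimal Python sketch, not part of CAGE: the function name `split_cohort_file` and the `_partN` output naming are assumptions made for illustration, and the 10000-row limit is taken from the text above.

```python
from pathlib import Path

MAX_ROWS = 10000  # per-file row limit stated in this manual


def split_cohort_file(path, max_rows=MAX_ROWS):
    """Write consecutive chunks of at most max_rows lines next to the source file.

    Returns the list of new paths, named <stem>_part1<suffix>, <stem>_part2<suffix>, ...
    (a hypothetical naming scheme, not something CAGE requires).
    """
    src = Path(path)
    with src.open() as handle:
        lines = handle.readlines()

    out_paths = []
    for start in range(0, len(lines), max_rows):
        part = src.with_name(f"{src.stem}_part{len(out_paths) + 1}{src.suffix}")
        # Each chunk keeps the original row order and line endings.
        part.write_text("".join(lines[start:start + max_rows]))
        out_paths.append(part)
    return out_paths
```

Calling `split_cohort_file("BT.dat")` on the 15678-row example would write BT_part1.dat (10000 rows) and BT_part2.dat (5678 rows). Each part still has to be run through CAGE separately and the results combined afterwards.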

Data Input File

The input file, written in CAL, describes and manipulates the cohort data and performs analyses of that data. The input file has five parts. Here is an example:

Part I: Specify the cohort data file

    data file is test.dat

Part II: Label variables

    labels (REL, BYR, SEX, PBrace, PBsex, PBYOB, cancer, AGE) ;

Part III: Identify the required fields (Sex, Age, Birth year, Affection status columns)

    sex is col SEX where (male=1, female=2) ;
    age is col AGE ;
    birth year is col BYR ;
    affected_status is col cancer where (affected == 1, unaffected == 2) ;

Part IV: Define analysis groups using group-by-statement

    group FDR by REL where value == 1 || REL where value == 2 || REL where value == 3 || REL where value == 12;
    group Fathers by REL where value == 2;
    group Mothers by REL where value == 3;
    group Brothers by REL where value == 1 && SEX where value == 1;
    group Sisters by REL where value == 1 && SEX where value == 2;

Part V: Analyze defined groups using analyze-statement

    analyze FDR for site (MLG) ;
    analyze Fathers for site (MLG) ;
    analyze Mothers for site (MLG) ;
    analyze Brothers for site (MLG) ;
    analyze Sisters for site (MLG) ;
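One way to read the five group statements in the example input file is as row filters on the cohort data. The Python sketch below is an illustration only and is not CAL: it assumes the FDR statement is a logical OR over the four REL codes, and the names `GROUPS` and `group_members` are hypothetical.

```python
# Each group is a predicate over one cohort row, keyed by the labels
# used in the example input file (REL and SEX).
GROUPS = {
    "FDR": lambda row: row["REL"] in (1, 2, 3, 12),  # assumed OR of the four codes
    "Fathers": lambda row: row["REL"] == 2,
    "Mothers": lambda row: row["REL"] == 3,
    "Brothers": lambda row: row["REL"] == 1 and row["SEX"] == 1,
    "Sisters": lambda row: row["REL"] == 1 and row["SEX"] == 2,
}


def group_members(rows, name):
    """Return the rows that satisfy the named group's condition."""
    keep = GROUPS[name]
    return [row for row in rows if keep(row)]
```

Under this reading a row can belong to more than one group: a member with REL 2 is counted in both FDR and Fathers.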

Part I: Specify the cohort data file

Tells CAGE which cohort data file to analyze.

    data file is filename

Example:

    data file is test.dat

Part II: Label variables

Tells CAGE which variables are in the cohort data file.

    labels (variable1, variable2, Sex, variable3, Age, variable4, Birth, variable5, variable6, Affected, variable7, variable8)

* variable1, variable2, variable3 and so on are the variables in your data file other than age, sex, birth year and affected status.

Example:

    labels (Familyid, Sex, Age, Affected, Relationship, Birth)

* The variables must be separated by commas.
* The variables must be listed in exactly the same order as they appear in the dataset.
* The lower-case words sex, age, affected status and birth year are CAL keywords, so they cannot be used in the labels statement. Because CAL is case-sensitive, use capitalized names such as Sex and Age instead.

Part III: Identify Sex, Age, Birth year, Affection status columns

Tells CAGE where to find the age, sex, birth year and affected status information.

    age is col column-name
    sex is col column-name where (male = value, female = value)
    birth year is col column-name
    affected_status is col column-name where (affected relational-operator value, unaffected relational-operator value)

* relational-operator is any of == < > <= >=
* Note that the sex statement uses = rather than ==.

Example:

    age is col Age
    sex is col Sex where (male = 1, female = 2)
    birth year is col Birth
    affected_status is col Affected where (affected == 1, unaffected == 0)
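Because the order of the labels must match the column order of the cohort data file, it can help to check the file before running CAGE. The sketch below is a hypothetical pre-check, not a CAGE feature. It assumes whitespace-delimited rows and the literal missing-value code -1, and it reports rows where a required column is missing or the column count does not match the labels.

```python
MISSING = "-1"  # missing-value code used in the cohort data file


def find_incomplete_rows(lines, labels, required):
    """Return (row_number, problems) for rows that CAGE would likely reject.

    lines    -- iterable of text rows from the cohort data file
    labels   -- column names in file order, as in the labels statement
    required -- names of the columns that must not be missing
    """
    problems = []
    for row_number, line in enumerate(lines, start=1):
        fields = line.split()  # columns are delimited by spaces or tabs
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(labels):
            problems.append((row_number, ["<wrong column count>"]))
            continue
        row = dict(zip(labels, fields))
        missing = [name for name in required if row[name] == MISSING]
        if missing:
            problems.append((row_number, missing))
    return problems


# Made-up rows in the order (Familyid, Sex, Age, Affected, Relationship, Birth).
labels = ["Familyid", "Sex", "Age", "Affected", "Relationship", "Birth"]
rows = ["101 2 54 0 1 1940", "101 1 -1 0 2 1910", "102 -1 70 1 3 1921"]
print(find_incomplete_rows(rows, labels, ["Sex", "Age", "Birth"]))
# prints [(2, ['Age']), (3, ['Sex'])]
```

The three sample rows are invented for illustration and do not come from test.dat.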

Part IV: Define analysis groups using group-by-statement

The group-by-statement tells CAGE to define a subgroup whose members share the stated logical conditions.

If the subgroup is defined by only one column variable:

    group identifier by column-name where value relational-op constant

* relational-op is any of == < > <= >=

Examples:

    group females by Sex where value == 2;
    group whites by Race where value == 1;

If the subgroup is defined by more than one column variable:

    group identifier by column-name where value relational-op constant logical-op column-name where value relational-op constant logical-op column-name where value relational-op constant

* logical-op is || (logical or) or && (logical and)

Example:

    group PBMale by REL where value == 1 || REL where value == 2 || REL where value == 3 || REL where value == 12 && PBsex where value == 1;

Part V: Analyze defined groups using analyze-statement

The analyze-statement performs an observed/expected analysis using the Connecticut Tumor Registry data or SEER data. The analysis is performed on the defined group.

    analyze group-name for site (site-list) print summary reset variables

* site-list is a comma-delimited list of one or more sites taken from the PREFERENT CAUSE LIST.
* print summary tells CAGE to show the detailed analysis results.
* reset variables tells CAGE to start a new grouping context analysis.

Examples:

    analyze females for site(brf,lip) print summary reset variables

    analyze PBAge20 for site (BLA) print summary reset variables

Analysis Output File

CAGE gives you the analysis results for each group you define in the input file. The output file contains the input file information, the missing value information and the analysis results. Here is the output from the analysis for the example input file. The first three lines give the input file information. The "Missing essential parameter" lines list the observations with missing values, which are not analyzed by CAGE. The table rows are the analysis results.

    CONNECTICUT TUMOR FILE IS conn9.r1_ctr
    CAL FILE IS test.inp
    POPULATION DATA FILE IS test.dat

    Missing essential parameter on member 680 of group FDR: age -1, sex 2, birthyear 1917
    Missing essential parameter on member 1170 of group FDR: age -1, sex 2, birthyear 1903
    Missing essential parameter on member 1296 of group FDR: age -1, sex 1, birthyear 1933
    Missing essential parameter on member 2254 of group FDR: age -1, sex 2, birthyear 1985
    Missing essential parameter on member 2358 of group FDR: age -1, sex 2, birthyear 1957
    Missing essential parameter on member 2359 of group FDR: age -1, sex 1, birthyear 1958
    Missing essential parameter on member 2944 of group FDR: age -1, sex 1, birthyear 1951
    Missing essential parameter on member 3122 of group FDR: age 39, sex -1, birthyear 1952
    Missing essential parameter on member 3288 of group FDR: age 70, sex -1, birthyear 1921
    Missing essential parameter on member 3328 of group FDR: age -1, sex 2, birthyear 1939
    Missing essential parameter on member 3329 of group FDR: age -1, sex 1, birthyear 1910
    Group     Site  Total Pop  Tot Affected  Sum of Exp  O/E   L/E   U/E   Person Years
    FDR       MLG,  3518       256           264.01      0.97  0.85  1.10  146653.00
    TOTALS:         3518       256           264.01      0.97  0.85  1.10  146653.00
    FDR
    Missing essential parameter on member 475 of group Fathers: age -1, sex 1, birthyear 1951
    Missing essential parameter on member 529 of group Fathers: age 70, sex -1, birthyear 1921
    Missing essential parameter on member 534 of group Fathers: age -1, sex 1, birthyear 1910
    Fathers   MLG,  566        90            100.39      0.90  0.72  1.10  181428.00
    TOTALS:         566        90            100.39      0.90  0.72  1.10  181428.00
    Fathers
    Missing essential parameter on member 104 of group Mothers: age -1, sex 2, birthyear 1917
    Missing essential parameter on member 184 of group Mothers: age -1, sex 2, birthyear 1903
    Mothers   MLG,  574        94            99.80       0.94  0.76  1.15  217060.00
    TOTALS:         574        94            99.80       0.94  0.76  1.15  217060.00
    Mothers
    Missing essential parameter on member 270 of group Brothers: age -1, sex 1, birthyear 1933
    Brothers  MLG,  717        20            27.21       0.74  0.45  1.14  245874.00
    TOTALS:         717        20            27.21       0.74  0.45  1.14  245874.00

    Brothers
    Missing essential parameter on member 638 of group Sisters: age -1, sex 2, birthyear 1939
    Sisters   MLG,  668        45            30.60       1.47  1.07  1.97  272495.00
    TOTALS:         668        45            30.60       1.47  1.07  1.97  272495.00
    Sisters

The analysis results contain the group name, the site name, the total population in your cohort dataset, the total affected cases in your cohort dataset, the expected number of cases (obtained by multiplying the age- and gender-specific cancer incidence rates in the Connecticut or SEER database by the corresponding person-years of your cohort data), the ratio of observed to expected numbers of cases (SIR) with likelihood-based 95% confidence intervals (CI) from Poisson models, and the person-years.

Calculation of Person-years in CAGE

CAGE reports person-years cumulatively: the value shown for each group is the person-years for that group plus the person-years for all preceding analysis groups. To get the person-years for one group alone, subtract the value reported for the previous group from the value reported for that group. Below is an example. The person-years are taken from the output file example, and the actual person-years were recalculated for each analysis group.

    Group     Person-years from CAGE   Actual Person-years
    FDR       146653                   146653
    Father    181428                   34775
    Mother    217060                   35632
    Brother   245874                   28814
    Sister    272495                   26621

Interpretation of the Results

Using the above output file as an example: a significantly increased SIR was observed for MLG for the Sister group (SIR = 1.47, 95% CI = 1.07-1.97). Decreased SIRs were observed for the FDR, Father, Mother and Brother groups; however, these results are not statistically significant because all of their 95% CIs include 1.

How to analyze second cancer within a cohort of cancer patients?

To calculate the risk of a second cancer within a cohort of cancer patients, you need to know the age at onset of both the first and the second cancer. You will also need an accompanying S-plus/R function to calculate the final SIRs. This function is available upon request.

The steps to obtain an SIR for a second cancer are:

First, run CAGE to get the expected number of cancers (E1) and the total number of person-years (PY1) from birth to the age at onset of the first cancer within the cohort.

Second, run CAGE to get the expected number of cancers (E2) and the total number of person-years (PY2) from birth to the age at onset of the second cancer within the cohort.

Third, calculate the corrected expected number of cancers (EC) from the age at onset of the first cancer to the second cancer:

    EC = E2 - E1

and the corrected person-years (PYC) from the age at onset of the first cancer to the second cancer:

    PYC = PY2 - PY1

Fourth, run the S-plus program using EC, PYC and the observed number of second cancers within the cohort to get the SIR and 95% confidence interval for the second cancer.

The adjusted person-years for the second onset are obtained by subtracting the person-years of the first cancer from the person-years of the second cancer.

[Figure: timeline running from Birth to First Cancer to Second Cancer. E1 covers birth to the first cancer, E2 covers birth to the second cancer, and EC = E2 - E1.]
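The subtraction steps above, and the per-group person-years correction described earlier, are simple enough to sketch in Python. This is not the S-plus/R function mentioned in this manual, which is not reproduced here. All function names are hypothetical, and the interval uses exact (Garwood) Poisson limits as an assumption, which may differ slightly from the likelihood-based intervals CAGE reports.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    if k < 0:
        return 0.0
    if mean <= 0:
        return 1.0
    return sum(math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
               for i in range(k + 1))


def _bisect_decreasing(func, lo, hi, iterations=200):
    """Root of a decreasing function that is positive near lo and negative at hi."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_poisson_limits(observed, alpha=0.05):
    """Exact two-sided limits for the mean of a Poisson count (assumed method)."""
    top = observed + 10.0 * math.sqrt(observed + 1.0) + 10.0  # generous search bound
    lower = 0.0
    if observed > 0:
        # Lower limit: P(X >= observed | mean) = alpha / 2
        lower = _bisect_decreasing(
            lambda m: poisson_cdf(observed - 1, m) - (1 - alpha / 2), 0.0, top)
    # Upper limit: P(X <= observed | mean) = alpha / 2
    upper = _bisect_decreasing(
        lambda m: poisson_cdf(observed, m) - alpha / 2, 0.0, top)
    return lower, upper


def sir_with_ci(observed, expected, alpha=0.05):
    """Return (SIR, lower limit, upper limit) for an observed/expected pair."""
    lower, upper = exact_poisson_limits(observed, alpha)
    return observed / expected, lower / expected, upper / expected


def second_cancer_inputs(e1, py1, e2, py2):
    """Third step above: EC = E2 - E1 and PYC = PY2 - PY1."""
    return e2 - e1, py2 - py1


def per_group_person_years(cumulative):
    """Undo the running total in CAGE's Person Years column."""
    previous = 0.0
    result = []
    for value in cumulative:
        result.append(value - previous)
        previous = value
    return result
```

For the Sisters row of the example output (45 observed, 30.60 expected), `sir_with_ci(45, 30.60)` should give values close to the reported 1.47, 1.07 and 1.97. For a second-cancer analysis, pass the observed number of second cancers and the EC returned by `second_cancer_inputs` to `sir_with_ci`.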