Statistical analysis of genome-wide association (GWAS) data

Similar documents

Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the

The Functional but not Nonfunctional LILRA3 Contributes to Sex Bias in Susceptibility and Severity of ACPA-Positive Rheumatoid Arthritis

GAW 15 Problem 3: Simulated Rheumatoid Arthritis Data Full Model and Simulation Parameters

SNPbrowser Software v3.5

NGS and complex genetics

Factors for success in big data science

Combining Data from Different Genotyping Platforms. Gonçalo Abecasis Center for Statistical Genetics University of Michigan

Multiple Sclerosis: a NZ Perspective. Dr Deborah Mason Neurologist Christchurch Hospital

Heritability: Twin Studies. Twin studies are often used to assess genetic effects on variation in a trait

Genetics of Rheumatoid Arthritis Markey Lecture Series

Tutorial on gplink. PLINK tutorial, December 2006; Shaun Purcell,

Online Supplement to Polygenic Influence on Educational Attainment. Genotyping was conducted with the Illumina HumanOmni1-Quad v1 platform using

Risk Factors for Alcoholism among Taiwanese Aborigines

Logistic Regression (1/24/13)

Differential privacy in health care analytics and medical research An interactive tutorial

Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource

Validation and Replication

A Multi-locus Genetic Risk Score for Abdominal Aortic Aneurysm

School of Nursing. Presented by Yvette Conley, PhD

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects

Journal of Statistical Software

Epigenetic variation and complex disease risk

Big Data for Population Health and Personalised Medicine through EMR Linkages

B I O I N F O R M A T I C S

SNP Essentials The same SNP story

Biomedical Big Data and Precision Medicine

GWAS Data Cleaning. GENEVA Coordinating Center Department of Biostatistics University of Washington. January 13, 2016.

SAP HANA Enabling Genome Analysis

SNP Data Integration and Analysis for Drug- Response Biomarker Discovery

STATISTICS 8, FINAL EXAM. Last six digits of Student ID#: Circle your Discussion Section:

Haplotype analysis of case-control data

Technical Issues in Aggregating and Analyzing Data from Heterogeneous EHR Systems

The Developing Person Through the Life Span 8e by Kathleen Stassen Berger

Hardy-Weinberg Equilibrium Problems

A Genetic Analysis of Rheumatoid Arthritis

UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory. April, 2015

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Introduction to Statistics and Quantitative Research Methods

A trait is a variation of a particular character (e.g. color, height). Traits are passed from parents to offspring through genes.

Autoimmunity and immunemediated. FOCiS. Lecture outline

Population Genetics and Multifactorial Inheritance 2002

A Primer of Genome Science THIRD

Guide to Biostatistics

Figure 14.2 Overview of Innate and Adaptive Immunity

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Mutations: 2 general ways to alter DNA. Mutations. What is a mutation? Mutations are rare. Changes in a single DNA base. Change a single DNA base

Incorporating Research Into Sight (IRIS) Essentia Rural Health Institute Marshfield Clinic Penn State University

HLA data analysis in anthropology: basic theory and practice

Using Family History to Improve Your Health Web Quest Abstract

Use of the Chi-Square Statistic. Marie Diener-West, PhD Johns Hopkins University

CASSI: Genome-Wide Interaction Analysis Software

Genomes and SNPs in Malaria and Sickle Cell Anemia

GENETIC STUDIES OF AUTOIMMUNE DISEASES. Benedicte Alexandre Lie Institute of Immunology Rikshospitalet University Hospital

ASSIsT: An Automatic SNP ScorIng Tool for in and out-breeding species Reference Manual

Insulin Receptor Substrate 1 (IRS1) Gene Variation Modifies Insulin Resistance Response to Weight-loss Diets in A Two-year Randomized Trial

DNA-Analytik III. Genetische Variabilität

GENETIC DATA ANALYSIS

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

11. Analysis of Case-control Studies Logistic Regression

Single Nucleotide Polymorphisms (SNPs)

DNA Copy Number and Loss of Heterozygosity Analysis Algorithms

Statistical Analysis for Genetic Epidemiology (S.A.G.E.) Version 6.2 Graphical User Interface (GUI) Manual

Basics of Marker Assisted Selection

Alpha-1 Antitrypsin Deficiency A future NBS candidate?

Investigating the genetic basis for intelligence

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Information leaflet. Centrum voor Medische Genetica. Version 1/ Design by Ben Caljon, UZ Brussel. Universitair Ziekenhuis Brussel

Replacing TaqMan SNP Genotyping Assays that Fail Applied Biosystems Manufacturing Quality Control. Begin

Core Facility Genomics

How To Find Rare Variants In The Human Genome

Heredity - Patterns of Inheritance

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

(1-p) 2. p(1-p) From the table, frequency of DpyUnc = ¼ (p^2) = #DpyUnc = p^2 = ¼(1-p)^2 + ½(1-p)p + ¼(p^2) #Dpy + #DpyUnc

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Large Gene Interaction Analytics at University at Buffalo, SUNY

GEMMA User Manual. Xiang Zhou. May 18, 2016

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

European Educational Programme in Epidemiology

GENETICS AND INSURANCE: QUANTIFYING THE IMPACT OF GENETIC INFORMATION

The Human Genome Project. From genome to health From human genome to other genomes and to gene function Structural Genomics initiative

Natalia Taborda Vanegas. Doc. Sci. Student Immunovirology Group Universidad de Antioquia

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.

CNV Univariate Analysis Tutorial

Transcription:

Statistical analysis of genome-wide association (GWAS) data Jim Stankovich Menzies Research Institute University of Tasmania J.Stankovich@utas.edu.au

Outline Introduction Confounding variables and linkage disequilibrium Statistical methods to test for association in case-control GWA studies Allele counting chi-square test Logistic regression Multiple testing and power Example: GWAS for multiple sclerosis (MS) Data cleaning / quality control Results

GWA studies have been very successful since 2007 Prior to the advent of GWA studies, there was very little success in identifying genetic risk factors for complex multifactorial diseases GWA studies have identified over 200 separate associations with various complex diseases in the past two years Human Genetic Variation hailed as Breakthrough of the Year by Science magazine in 2007 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Human genome project The SNP consortium SNP genotyping arrays The International HapMap Project GWA studies

This talk: case-control GWA studies Obtain DNA from people with disease of interest (cases) and unaffected controls Run each DNA sample on a SNP chip to measure genotypes at 300,000-1,000,000 SNPs in cases and controls Identify SNPs where one allele is significantly more common in cases than controls The SNP is associated with disease SNP: rs12425791 stroke

This talk: case-control GWA studies Obtain DNA from people with disease of interest (cases) and unaffected controls Run each DNA sample on a SNP chip to measure genotypes at 300,000-1,000,000 SNPs in cases and controls Identify SNPs where one allele is significantly more common in cases than controls The SNP is associated with disease Alternative strategy (Peter Visscherʼs talk): test for association between SNPs and a quantitative trait that underlies the disease (endophenotype) SNP: rs12425791 blood pressure stroke

Association does not imply causation Suppose that genotypes at a particular SNP are significantly associated with disease This may be because the SNP is associated with some other factor (a confounder), which is associated with disease but is not in the same causal pathway SNP near lactase gene multiple sclerosis (MS)

Association does not imply causation Suppose that genotypes at a particular SNP are significantly associated with disease This may be because the SNP is associated with some other factor (a confounder), which is associated with disease but is not in the same causal pathway Northern European ancestry SNP near lactase gene multiple sclerosis (MS)

Association does not imply causation Suppose that genotypes at a particular SNP are significantly associated with disease This may be because the SNP is associated with some other factor (a confounder), which is associated with disease but is not in the same causal pathway Possible confounders of genetic associations: Ethnic ancestry Genotyping batch, genotyping centre DNA quality Environmental exposures in the same causal pathway Nicotine receptors --> smoking --> lung cancer Hung et al, Nature 452: 633 (2008) + other articles in same issue Alcohol dehydrogenase genes --> alcohol consumption --> throat cancer Hashibe et al, Nature Genetics 40: 707 (2008)

Helpful confounding: linkage disequilibrium Linkage disequilibrium (LD) is the non-independence of alleles at nearby markers in a population because of a lack of recombinations between the markers 50,000 years ago: Today: ~50kb Haplotype block

Direct and indirect association testing Hirschhorn and Daly: Nature Reviews Genetics 6: 95 (2005) Functional SNP is genotyped and an association is found Functional SNP (blue) is not genotyped, but a number of other SNPs (red), in LD with the functional SNP, are genotyped, and an association is found for these SNPs

LD is helpful, because not all SNPs have to be genotyped Peʼer et al: Nature Genetics 38: 663 (2006)

Allele counting to test for association between SNP genotype and case / control status GG GT TT Total Cases r 0 r 1 r 2 R Controls s 0 s 1 s 2 S Total n 0 n 1 n 2 N Observed allele counts G T Total Cases 2r 0 +r 1 r 1 +2r 2 2R Controls 2s 0 +s 1 s 1 +2s 2 2S Total 2n 0 +n 1 n 1 +2n 2 2N

Allele counting to test for association between SNP genotype and case / control status GG GT TT Total Cases r 0 r 1 r 2 R Controls s 0 s 1 s 2 S Total n 0 n 1 n 2 N Observed allele counts G T Total Cases 2r 0 +r 1 r 1 +2r 2 2R Controls 2s 0 +s 1 s 1 +2s 2 2S Total 2n 0 +n 1 n 1 +2n 2 2N Expected allele counts G T 2R(2n 0 +n 1 )/(2N) 2R(n 1 +2n 2 )/(2N) 2S(2n 0 +n 1 )/(2N) 2S(n 1 +2n 2 )/(2N)

Allele counting to test for association between SNP genotype and case / control status GG GT TT Total Cases r 0 r 1 r 2 R Controls s 0 s 1 s 2 S Total n 0 n 1 n 2 N Observed allele counts G T Total Cases 2r 0 +r 1 r 1 +2r 2 2R Controls 2s 0 +s 1 s 1 +2s 2 2S Total 2n 0 +n 1 n 1 +2n 2 2N Expected allele counts G T 2R(2n 0 +n 1 )/(2N) 2R(n 1 +2n 2 )/(2N) 2S(2n 0 +n 1 )/(2N) 2S(n 1 +2n 2 )/(2N) Chi-square test for independence of rows and columns (null hypothesis): (Obs Exp) 2 Exp ~ χ 2 with 1 df PLINK --assoc option Other options (e.g. dominant/recessive models) --model

The odds ratio: a measure of effect size Odds of an event occurring = Pr(event occurs) / Pr(event doesnʼt occur) = Pr(event occurs) / [1 - Pr(event occurs)] Allele counts G T Cases a b Controls c d Consider all the G alleles in the sample, and pick one at random. The odds that the G allele occurs in a case: a/c Consider all the T alleles in the sample, and pick one at random. The odds that a T allele occurs in a case: b/d odds ratio = odds that G allele occurs in a case = a/c = a d odds that T allele occurs in a case b/d b c

Interpretation of the odds ratio G T Cases a b Controls c d odds ratio (OR) = odds that G allele occurs in a case = a d odds that T allele occurs in a case b c OR = increase in odds of being a case for each additional G allele OR = 1: no association between genotype and disease OR > 1: G allele increases risk of disease OR < 1: T allele increases risk of disease If the disease is rare (e.g. ~0.1% for MS), the odds ratio is roughly equal to the genotype relative risk (GRR): the increase in risk of disease conferred by each additional G allele e.g. if OR = 1.2, Pr(MS TT) = 0.1% Pr(MS GT) = 0.12% Pr(MS GG) = 0.144%

Logistic regression: more flexible analysis for GWA studies Similar to linear regression, used for binary outcomes instead of continuous outcomes Let Y i be the phenotype for individual i Y i = 0 for controls Y i = 1 for cases Let X i be the genotype of individual i at a particular SNP TT X i = 0 GT X i = 1 GG X i = 2

Logistic regression: more flexible analysis for GWA studies Similar to linear regression, used for binary outcomes instead of continuous outcomes Let Y i be the phenotype for individual i Y i = 0 for controls Y i = 1 for cases Let X i be the genotype of individual i at a particular SNP TT X i = 0 GT X i = 1 GG X i = 2 Basic logistic regression model Let p i = E(Y i X i ), expected value of pheno given geno Define logit(p i ) = log e [p i /(1- p i ) ]

Logistic regression: more flexible analysis for GWA studies Similar to linear regression, used for binary outcomes instead of continuous outcomes Let Y i be the phenotype for individual i Y i = 0 for controls Y i = 1 for cases Let X i be the genotype of individual i at a particular SNP TT X i = 0 GT X i = 1 GG X i = 2 Basic logistic regression model Let p i = E(Y i X i ), expected value of pheno given geno Define logit(p i ) = log e [p i /(1- p i ) ] logit(p i ) ~ β 0 + β 1 X i

Logistic regression: more flexible analysis for GWA studies Similar to linear regression, used for binary outcomes instead of continuous outcomes Let Y i be the phenotype for individual i Y i = 0 for controls Y i = 1 for cases Let X i be the genotype of individual i at a particular SNP TT X i = 0 GT X i = 1 GG X i = 2 Basic logistic regression model Let p i = E(Y i X i ), expected value of pheno given geno Define logit(p i ) = log e [p i /(1- p i ) ] logit(p i ) ~ β 0 + β 1 X i Test whether β 1 differs significantly from zero: roughly equivalent to allele counting chi-square test Estimate of odds ratio: exp(β 1 )

Logistic regression: more flexible analysis for GWA studies Similar to linear regression, used for binary outcomes instead of continuous outcomes Let Y i be the phenotype for individual i Y i = 0 for controls Y i = 1 for cases Let X i be the genotype of individual i at a particular SNP TT X i = 0 GT X i = 1 GG X i = 2 Add extra terms to adjust for potential confounders: e.g. ethnicity, genotyping batch, genotypes at other SNPs Let p i = E(Y i X i,c i, D i, ) logit(p i ) ~ β 0 + β 1 X i + β 2 C i + β 3 D i + PLINK --logistic

Multiple testing Suppose you test 500,000 SNPs for association with disease Expect around 500,000 x 0.05 = 25,000 to have p-value less than 0.05 More appropriate significance threshold p = 0.05 / 500,000 = 10-7 genome-wide significance In our MS GWAS we considered SNPs for follow-up if they had p- values less than 0.001 To detect a smaller p-value need a larger study

The power to detect an association Suppose the G allele of a SNP has frequency 0.2. If each additional G allele increases odds of disease by 1.2, and 1618 cases and 3413 controls are genotyped, what is the power (chance) of detecting an association with significance p<0.001? Null hypothesis: true OR=1 Observed OR from this distribution p=0.001 1 1.2 Odds ratio (OR)

The power to detect an association Suppose the G allele of a SNP has frequency 0.2. If each additional G allele increases odds of disease by 1.2, and 1618 cases and 3413 controls are genotyped, what is the power (chance) of detecting an association with significance p<0.001? Null hypothesis: true OR=1 Observed OR from this distribution Alternative hypothesis: true OR=1.2 Observed OR from this distribution 1 p=0.001 1.2 Odds ratio (OR)

The power to detect an association Suppose the G allele of a SNP has frequency 0.2. If each additional G allele increases odds of disease by 1.2, and 1618 cases and 3413 controls are genotyped, what is the power (chance) of detecting an association with significance p<0.001? Null hypothesis: true OR=1 Observed OR from this distribution Alternative hypothesis: true OR=1.2 Observed OR from this distribution Power = blue shaded area = 59% 1 p=0.001 1.2 Odds ratio (OR)

Effect of increasing sample size Observed OR tends to be closer to true OR (narrower distributions) Null and alternative distributions become more separate Power increases Null hypothesis: true OR=1 Observed OR from this distribution Alternative hypothesis: true OR=1.2 Observed OR from this distribution Power is now greater p=0.001 1 1.2 Odds ratio (OR)

Multiple sclerosis - degradation of myelin sheath around nerve fibres www.msif.org/en/about_ms/ demyelination.html

Multiple sclerosis neurodegenerative disease autoimmune attack on myelin sheaths around nerve cells more females affected than males (3:1) average age-at-onset ~30 ~16,000 people with MS in Australia ($2 billion p.a.) no cure

Epstein-Barr virus Risk factors Exposure to infant siblings (Ponsonby et al, JAMA, 2005) Latitude gradient, childhood sun exposure (van der Mei et al, Lancet, 2003) Only genetic risk factor known before 2007 (first GWAS): HLA-DRB1*1501 discovered in 1972 (60% MS and 30% controls) IL7R IL2RA CLEC16A CD226 KIF1B TYK2 CD58 EVI5/RPL5 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Human genome project The SNP consortium SNP genotyping arrays The International HapMap Project GWA studies

Australian and New Zealand MS GWAS Assemble collection of DNA samples (all states + NZ) Genotype 1952 MS cases from around Australia and New Zealand with Illumina 370CNV BeadChips (Patrick Danoy, Matt Brown, Diamantina Institute, UQ) Analyse GWAS data Quality control (Devindri Perera, Menzies) Impute genotypes at millions of other SNPs (Sharon Browning, Univ of Auckland) Compare case genotypes with >3500 controls from the UK and US (publicly available data) Replication genotyping (Justin Rubioʼs lab, Howard Florey Institute, Univ of Melbourne)

Quality control - MS samples (PLINK) Start with 1952 samples Exclusions Samples with >2% of SNPs not called 70 --mind

Genotype call rate No call rate Oragene saliva DNA Samples

Quality control - MS samples (PLINK) Start with 1952 samples Exclusions Samples with >2% of SNPs not called 70 Suspect batch of samples 128 Uncertain phenotype 10 Duplicates / relatives 88 Ancestry outliers 35 --mind --genome

Quality control - ethnicity N European African Principal components analysis: EIGENSTRAT Price et al (2006). Nat Genet 38: 904 Japanese Chinese Use an independent set of ~77,000 SNPs --indep-pairwise 178 outliers removed: - 35 MS - 143 controls

Quality control - MS samples Start with 1952 samples Exclusions Samples with >2% of SNPs not called 70 Suspect batch of samples 128 Uncertain phenotype 10 Duplicates / relatives 88 Ancestry outliers 35 Sex discrepancies 3 Leaves 1618 samples --mind --genome --check-sex Quality control - SNPs Start with 310,504 SNPs in both case and control datasets Exclude SNPs Not called in >5% of samples In Hardy-Weinberg disequilibrium Where one allele has frequency < 1% Leaves 302,098 SNPs --geno --hwe --maf

Choice of 5% no-call threshold We originally planned to use a 10% threshold, but lots of SNPs with no call rate 5-10% showed deviations from Hardy-Weinberg equilibrium Closer look at SNPs with call rates between 5% and 10% suggested that they were unreliable

GWAS - results Total sample = 1618 MS cases + 3413 controls HLA P=7 x 10-84 P=10-7 # 50 # 500 P=0.001

Extra QC for associated SNPs: cluster plots UK controls ANZ cases both

The replication phase Selected 100 SNPs for replication genotyping 2,256 ANZ MS cases + 2,310 ANZ controls Two chromosome regions on chr 12 and chr 20 showed (almost) genome-wide significant (p<5 x 10-7 ) association with MS after combining GWAS and replication data SNPs in 13/53 other regions with replication p-values < 0.1: more than expected by chance (p=0.002)

Chromosome 12 association: the downside of LD P=4.1 x 10-6 P=0.00001 P=0.0001 rs703842 GWAS P = 4.1 x 10-6 replication P = 1.4 x 10-6 GWAS + rep P = 5.4 x 10-11 Allele frequency 0.33 Odds ratio 0.81 (protective)

Chromosome 12 association: the downside of LD P=4.1 x 10-6 P=0.00001 P=0.0001 KIF5A SNP associated with rheumatoid arthritis and type 1 diabetes

Chromosome 12 association: the downside of LD P=4.1 x 10-6 P=0.00001 P=0.0001 Logistic regression with both SNPs in the same model: rs10876994 p = 0.004 rs703842 p = 0.08

Chromosome 12 association: the downside of LD P=4.1 x 10-6 P=0.00001 P=0.0001 CYP27B1: most likely candidate??

CYP27B1 Cytochrome p450 gene family (drug metabolizing) Encodes 25-hydroxyvitamin D-1 alpha hydroxylase (1α OHase) Converts 25(OH)D 3 to bioactive 1,25(OH) 2 D 3 1,25(OH) 2 D 3 regulates calcium metabolism and the immune system via vitamin D receptor (VDR) 25(OH)D 3 Liver 7 dehydrocholesterol UVB Vit D Diet HLA-DRB1*1501 Adorini and Penna (2008) Nat Clin Prac Rheum 4: 404-12

The chromosome 20 association P=0.0001 rs6074022 GWAS P = 2.5 x 10-5 replication P = 4.6 x 10-4 GWAS + rep P = 1.3 x 10-7 Allele frequency 0.25 Odds ratio 1.20 (increased risk)

CD40 Member of TNF receptor superfamily: regulates many cell- and antibody-mediated immune responses SNPs in CD40 are associated with risk of rheumatoid arthritis and Gravesʼ disease Functional SNP rs1883832c>t, 1 base pair upstream of the ATG translation initiation codon Allelic heterogeneity T allele CD40 protein MS -1 ATG C allele CD40 protein RA GD www.hapmap.org

Another use of logistic regression: test for gene-gene interaction MS risk alleles Chr 12 = rs703842a Chr 20 = rs6074022g Odds ratio Chr20 risk Chr12 risk Modest evidence that each risk allele has a bigger effect in the presence of the other risk allele (p = 0.03)

Summary Case-control GWA studies have been very successful in the past couple of years Linkage disequilibrium means that most, but not all, common human genetic variation is captured by genotyping a few hundred thousand SNPs Small effect sizes (e.g. OR 1.2) mean that GWA studies need to be large, with thousands of cases and controls --> big collaborations Methods of statistical analysis are fairly straightforward, but care is required to clean data The ultimate test of any association: replication in an independent population

Acknowledgments - MS GWAS Hobart: Melbourne: Newcastle: Devindri Perera Bruce Taylor Karen Drysdale Preethi Guru Brendan McMorran Simon Foote Justin Rubio Melanie Bahlo Helmut Butzkueven Vicky Perreau Laura Johnson Judith Field Cathy Jensen Ella Wilkins Caron Chapman Mark Marriott Niall Tubridy Trevor Kilpatrick Jeanette Lechner-Scott Rodney Scott Pablo Moscato Mathew Cox Sydney: Gold Coast: Brisbane: Perth: Graeme Stewart David Booth Robert Heard Jim Wiley Simon Broadley Lyn Griffiths Lotfi Tajouri Michael Pender Matthew Brown Patrick Danoy Johanna Hadler Karen Pryce Peter Csurshes Judith Greer Bill Carroll Alan Kermode Adelaide: New Zealand: Mark Slee Sharon Browning Brian Browning Deborah Mason Ernie Willoughby Glynnis Clarke Ruth McCallum Marilyn Merriman Tony Merriman The Australian and NZ MS Genetics Consortium (2009). Nat Genet 41: 824 Funding: MS Research Australia, John T Reid Charitable Trusts, Trish MS Research Foundation, Australian Research Council