Combining Data from Different Genotyping Platforms. Gonçalo Abecasis, Center for Statistical Genetics, University of Michigan



The Challenge
- Detecting small effects requires very large sample sizes
- Combined analysis of data from different studies is one way to increase sample size
- But these studies may rely on different platforms that have little direct overlap
- For example, the Illumina 317K chip and the Affymetrix 500K chip have only ~51,000 SNPs in common

My Talk Today
- In silico genotyping: inferring unobserved genotypes
- Estimate genotypes for relatives of individuals in genome-wide association scan
  - Intuition for how in silico genotyping works
- Estimate genotypes for untyped markers, by combining study sample with HapMap
  - Facilitate comparisons across studies
- Evaluating quality of the inferred genotypes

In Silico Genotyping For Family Samples
- Family members will share large segments of chromosomes
- If we genotype many related individuals, we will effectively be genotyping a few chromosomes many times
- In fact, we can:
  - genotype a few markers on all individuals
  - use a high-density panel to genotype a few individuals
  - infer shared segments and then estimate the missing genotypes
- If relatives have no genotype data, we can still estimate a probability for each of their genotypes

Genotype Inference, Part 1: Observed Genotype Data
[Figure: pedigree diagram showing the observed genotypes (for example A/A, A/G, G/T, C/G) of the typed family members; the pedigree layout is not recoverable from the transcript.]

Genotype Inference, Part 2: Inferring Allele Sharing
[Figure: the same pedigree with each individual's genotypes resolved into haplotypes to show which alleles are shared among relatives; untyped markers are shown as dots. The pedigree layout is not recoverable from the transcript.]

Genotype Inference, Part 3: Imputing Missing Genotypes
[Figure: the same pedigree with the missing genotypes filled in from the inferred allele sharing; the pedigree layout is not recoverable from the transcript.]

Our Approach
- Consider the full set of observed genotypes G
- Evaluate the pedigree likelihood L for each possible value of each missing genotype g_ij
- Posterior probability for each missing genotype:
  $P(g_{ij} = x \mid G) = \dfrac{L(G, g_{ij} = x)}{L(G)}$
- Implemented using both the Elston-Stewart (1971) and Lander-Green (1987) algorithms
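The posterior on this slide is a ratio of two pedigree likelihoods. The sketch below shows only that final normalisation step in Python. It assumes the joint likelihoods L(G, g = x) have already been computed by some pedigree routine (not shown), and it uses the fact that L(G) is the sum of L(G, g = x) over all possible genotypes x. The function name and the numeric values are hypothetical.

```python
from typing import Dict


def genotype_posterior(joint_likelihood: Dict[int, float]) -> Dict[int, float]:
    """Return P(g = x | G) for every candidate genotype x.

    joint_likelihood[x] holds L(G, g = x). Summing over all x gives L(G),
    so each posterior is L(G, g = x) / L(G).
    """
    total = sum(joint_likelihood.values())  # this is L(G)
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive value")
    return {x: lik / total for x, lik in joint_likelihood.items()}


# Hypothetical likelihoods for genotypes coded as 0, 1 or 2 copies of an allele.
example = genotype_posterior({0: 1.0e-12, 1: 6.0e-12, 2: 3.0e-12})
```

With these made-up inputs the heterozygous genotype receives most of the posterior mass, because its joint likelihood is the largest of the three.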

Model With Inferred Genotypes
- Replace genotype score g with its expected value:
  $E(y) = \mu + \beta_g \bar{g} + \beta_c c + \ldots$
  where, for individual i, $\bar{g}_i = 2 P(g_i = 2 \mid G) + P(g_i = 1 \mid G)$
- Association test can then be implemented as a score test or as a likelihood ratio test
- Alternatives would be to:
  (a) impute genotypes with large posterior probabilities; or
  (b) integrate joint distribution of unobserved genotypes in family
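As an illustration of the expected-genotype model, the sketch below computes the expected genotype score from posterior probabilities and fits the linear model by ordinary least squares. This is a simplification: it treats individuals as unrelated and does not implement the score or likelihood ratio tests mentioned on the slide. All data values and function names are hypothetical, and the phenotype is generated without noise so the fit recovers the coefficients exactly.

```python
import numpy as np


def expected_genotype(p_one_copy, p_two_copies):
    """Expected genotype score: 2 * P(g = 2 | G) + P(g = 1 | G)."""
    return 2.0 * np.asarray(p_two_copies) + np.asarray(p_one_copy)


def fit_dosage_model(y, dosage, covariate):
    """Least-squares fit of E(y) = mu + beta_g * dosage + beta_c * covariate."""
    design = np.column_stack([np.ones(len(y)), dosage, covariate])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients  # (mu, beta_g, beta_c)


# Hypothetical posterior probabilities for six individuals.
p1 = np.array([0.10, 0.80, 0.30, 0.05, 0.50, 0.00])  # P(g = 1 | G)
p2 = np.array([0.05, 0.10, 0.70, 0.00, 0.25, 1.00])  # P(g = 2 | G)
g_bar = expected_genotype(p1, p2)

# Hypothetical covariate and a noise-free phenotype built from known coefficients.
covariate = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
y = 1.0 + 0.5 * g_bar + 0.2 * covariate

mu, beta_g, beta_c = fit_dosage_model(y, g_bar, covariate)
```

Using the expected score rather than a hard genotype call keeps the uncertainty of the inference in the regressor instead of discarding it.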

Sardinia
- 6,148 Sardinians from 4 towns in Ogliastra
- Measured 98 aging-related quantitative traits
- Genotyping:
  - Affymetrix 10K chip in 4,500 individuals (done)
  - Affymetrix 500K chip in 1,500 individuals (ongoing)
- Large pedigrees, computationally challenging
- Preliminary results

Preliminary Results from Sardinia: Red Blood Cell Hemoglobin Levels
[Figure: results for red blood cell hemoglobin levels, with the β-globin and α-globin loci labeled.]

Preliminary Results from Sardinia: QT interval, Chromosome 1
[Figure: two panels, before imputation and after imputation, each showing -log(p-value) against position (in Mb) along chromosome 1.]

Preliminary Results from Sardinia: QT interval, Chromosome 1
[Figure: two panels, before imputation and after imputation, each showing -log(p-value) against position (in Mb) along chromosome 1.]
- Imputation increases signal at NOS1AP: from ~0.005 (top 3000 SNPs) to ~10^-4 (top 25 SNPs)

In Silico Genotyping For Case Control Samples
- In families, we expected relatively long stretches of shared chromosome
- In unrelated individuals, these stretches will typically be much shorter
- The plan is still to identify stretches of shared chromosome between individuals
- We then infer intervening genotypes by contrasting study samples with densely typed HapMap samples

Observed Genotypes

Observed Genotypes (Study Sample)
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes (HapMap)
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C

Identify Match Among Reference

Observed Genotypes
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C

Phase Chromosome, Impute Missing Genotypes

Observed Genotypes (upper case: observed alleles; lower case: imputed alleles)
c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t t t c A t g g

Reference Haplotypes
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C
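To make the "match, then fill in" idea concrete, here is a minimal Python sketch that imputes one sparsely typed chromosome as a mosaic of the ten reference haplotypes from the slide. It is not the method used in the talk: it assumes the two observed chromosomes are already phased, and it uses a Viterbi best path in place of the sampling scheme. The switch and error parameters and the function name are hypothetical.

```python
import math

# The ten reference haplotypes shown on the slide (21 markers each).
REFERENCE = [
    "CGAGATCTCCTTCTTCTGTGC",
    "CGAGATCTCCCGACCTCATGG",
    "CCAAGCTCTTTTCTTCTGTGC",
    "CGAAGCTCTTTTCTTCTGTGC",
    "CGAGACTCTCCGACCTTATGC",
    "TGGGATCTCCCGACCTCATGG",
    "CGAGATCTCCCGACCTTGTGC",
    "CGAGACTCTTTTCTTTTGTAC",
    "CGAGACTCTCCGACCTCGTGC",
    "CGAAGCTCTTTTCTTCTGTGC",
]


def impute_haplotype(observed, reference, switch=0.1, error=0.01):
    """Fill '.' positions using the most likely mosaic of reference haplotypes.

    switch and error are assumed (hypothetical) per-interval switch and
    per-site mismatch probabilities.
    """
    n_ref, n_sites = len(reference), len(observed)
    log_stay = math.log(1.0 - switch + switch / n_ref)
    log_jump = math.log(switch / n_ref)

    def log_emit(h, j):
        if observed[j] == ".":
            return 0.0  # untyped marker carries no information
        match = reference[h][j] == observed[j]
        return math.log(1.0 - error) if match else math.log(error)

    score = [log_emit(h, 0) - math.log(n_ref) for h in range(n_ref)]
    back = []
    for j in range(1, n_sites):
        best_prev = max(range(n_ref), key=lambda h: score[h])
        new_score, pointers = [], []
        for h in range(n_ref):
            stay = score[h] + log_stay
            jump = score[best_prev] + log_jump
            source = h if stay >= jump else best_prev
            new_score.append(max(stay, jump) + log_emit(h, j))
            pointers.append(source)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards, then restore left-to-right order.
    state = max(range(n_ref), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()

    # Keep observed alleles; copy the template allele where nothing was typed.
    return "".join(
        observed[j] if observed[j] != "." else reference[path[j]][j]
        for j in range(n_sites)
    )


first = impute_haplotype("....A.......A....A...", REFERENCE)
second = impute_haplotype("....G.......C....A...", REFERENCE)
```

No single reference haplotype carries G, C and A at the three typed markers of the second chromosome, so its best path has to switch templates at least once or accept a mismatch. Ties between equally good templates are broken by list order, so the imputed alleles at untyped markers may differ from those shown on the slide.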

Implementation
- Markov model is used to model each haplotype, conditional on all others
- Gibbs sampler is used to estimate parameters and update haplotypes
  - Each individual is updated conditional on all others
  - In parallel to updating haplotypes, estimate error rates and crossover probabilities
- In theory, this should be very close to the Li and Stephens (2003) model

Output of Imputation Runs

Iteration 1       g   A   c   c   t   c   A   t   g   g
                  t   C   t   t   t   c   A   t   g   g
Iteration 2       g   A   c   c   t   c   A   t   g   g
                  t   C   c   c   t   c   A   t   g   c
Iteration 3       g   A   c   c   t   c   A   t   g   g
                  t   C   c   c   t   c   A   t   g   c
Iteration 4       g   A   c   c   t   c   A   t   g   g
                  t   C   c   c   t   c   A   t   g   c

"Best Call"       g   A   c   c   t   c   A   t   g   g
                  t   C   c   c   t   c   A   t   g   c
Quality Score     1   1   3/4 3/4 1   1   1   1   1   3/4
Reference Allele  g   A   c   c   t   c   A   t   g   g
Dosage            1   1   7/4 7/4 2   2   2   2   2   5/4
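The summary rows on this slide can be reproduced from the four sampled haplotype pairs. The Python sketch below does so under two assumptions that fit the numbers shown but are not stated in the source: the quality score is the fraction of iterations that agree with the most frequent genotype, and the dosage is the average number of copies of the reference allele per iteration. The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Haplotype pairs sampled in the four iterations shown on the slide.
ITERATIONS = [
    ("gAcctcAtgg", "tCtttcAtgg"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
]
REFERENCE_ALLELES = "gAcctcAtgg"


def summarize(iterations, reference_alleles):
    """Return per-marker best-call genotype, quality score and dosage."""
    n_iter = len(iterations)
    best_call, quality, dosage = [], [], []
    for j, ref in enumerate(reference_alleles):
        # Unordered genotype at marker j in every iteration (case ignored).
        genotypes = [
            tuple(sorted((h1[j].lower(), h2[j].lower())))
            for h1, h2 in iterations
        ]
        call, count = Counter(genotypes).most_common(1)[0]
        best_call.append(call)
        quality.append(Fraction(count, n_iter))
        copies = sum(g.count(ref.lower()) for g in genotypes)
        dosage.append(Fraction(copies, n_iter))
    return best_call, quality, dosage


best_call, quality, dosage = summarize(ITERATIONS, REFERENCE_ALLELES)
```

Unlike the slide, this sketch reports the best call as an unordered genotype per marker rather than as a pair of phased haplotypes. Exact fractions are used so the results can be compared directly with the 3/4, 7/4 and 5/4 entries on the slide.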

Assessing the Approach: AMD Case Control Study
- Used 11 tag SNPs to predict 84 SNPs in CFH
- Predicted genotypes differ from original ~1.8% of the time
  - ~2.5% for PHASE
  - ~3.2% for fastPHASE
- Calculation took ~3 minutes
  - ~21 min for fastPHASE
  - ~1 day for PHASE
[Figure: Comparison of Test Statistics, Truth vs. Imputed. Chi-square test statistic for disease-marker association; axes are Experimental Data and Imputed Data, each from 0 to 200.]

FUSION Example
Finland United States Investigation of NIDDM Genetics
- Genome-wide association scan in 1200 type II diabetes cases and 1200 controls
- Imputed 2.5M SNPs for all individuals
  - ~1 week, 50 CPUs
- Genotyping carried out using the Illumina 317K chip
- To start, I will focus on 127 SNPs around TCF7L2
- There are 984 HapMap SNPs in the same interval

FUSION: TCF7L2
[Figure: Association Around TCF7L2. Chi-squared statistic (0 to 20) against Chr 10 position (114.3 to 115.3 Mb), with separate series for imputed data and observed data.]

FUSION: TCF7L2
[Figure: Association Around TCF7L2. Chi-squared statistic (0 to 20) against Chr 10 position (114.3 to 115.3 Mb), with rs7903146, rs12255372 and rs12243326 labeled.]

Imputed Data Includes Quality Estimates
- FUSION TCF7L2 region
- Estimated error rate, at each marker, based on similarity between haplotypes estimated at each iteration
- Overall average is just under 3.0%
- Crossover rates are averaged over Gibbs sampler iterations

More Thorough Assessment
- Prior to genome-wide association scan, FUSION examined a 20 Mb region on chromosome 14
  - A candidate region that shows evidence for linkage
- The original genotype data
  - 1190 individuals
  - 521 markers not on Illumina HumanHap300 chip
- The imputed genotype data
  - ~17,000 genotypes using ~2,000 GWA markers
  - ~1.5 days on one CPU

Do the imputed alleles match?
- 1.5% of alleles mismatch original
- 3.0% of genotypes mismatch original
- Errors are concentrated on a few markers
  - 14.82% error for 1% of SNPs with lowest quality scores
  - 11.09% error for next 1% of SNPs (1st to 2nd percentile)
  - 5.86% error for next 1% of SNPs (2nd to 3rd percentile)
  - 1.11% error for top 95% of SNPs

Predicted and Actual Error Rates
[Figure: two panels plotted against map coordinate (55 to 70). Top panel shows actual error rate (imputed vs. actual genotypes); bottom panel shows estimated error rate.]

Does Coverage Improve?
1. R² in FUSION with Best Tag in HapMap
2. R² in HapMap with Best Tag in HapMap
3. R² with (Best-Guess) Imputed Genotypes
4. Squared Allele Dosage Correlation
   e.g., Imputed genotypes over 5 rounds: 1/1 1/1 1/2 1/1 1/1
   (Best-Guess) imputed genotype: 1/1
   Dosage for allele 1: 1.8
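The dosage in the example above is the average count of allele 1 across rounds, (2 + 2 + 1 + 2 + 2) / 5 = 1.8. The Python sketch below reproduces that number and shows one way a squared allele dosage correlation could be computed against true genotypes. The function names and the six-individual data set are hypothetical, and the use of a Pearson correlation is an assumption.

```python
import numpy as np


def allele_dosage(sampled_genotypes, allele="1"):
    """Average number of copies of `allele` across sampled genotypes."""
    counts = [g.split("/").count(allele) for g in sampled_genotypes]
    return sum(counts) / len(counts)


def squared_correlation(x, y):
    """Squared Pearson correlation between two numeric vectors."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r * r)


# The five rounds from the slide.
dosage = allele_dosage(["1/1", "1/1", "1/2", "1/1", "1/1"])

# Hypothetical true allele counts and imputed dosages for six individuals.
true_counts = np.array([2, 1, 0, 2, 1, 0])
imputed_dosage = np.array([1.8, 1.2, 0.2, 2.0, 0.8, 0.0])
r2 = squared_correlation(true_counts, imputed_dosage)
```

A best-guess call would record this individual as 1/1 (two copies), whereas the dosage of 1.8 retains the one round in which a heterozygote was sampled.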

Coverage Comparison (R²), 521 chromosome 14 SNPs

Measure                                  Mean  Median  STD  Coverage
R² in FUSION with Best Tag in HapMap     .78   .89     .26  62%
R² in HapMap with Best Tag in HapMap     .81   .90     .24  66%
R² in FUSION with Imputed Genotypes      .90   .96     .16  83%
R² in FUSION with Imputed Allele Dose    .91   .96     .14  87%

[Figure: one histogram per measure, R² from 0.0 to 1.0 on the horizontal axis and counts from 0 to 300 on the vertical axis.]

Can we recover original test statistics?
- Chi-squared test statistic in original data
- Chi-squared test statistic for best tag
- Chi-squared test statistic for best 2-SNP tag
- Chi-squared test statistic for imputed alleles
- Chi-squared test statistic for allele doses
- Compare each of these 4 to original statistic
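The source does not say which chi-squared test was used. As one plausible illustration, the Python sketch below computes a 1-df allelic (2 x 2) Pearson chi-squared statistic, and shows how allele doses could be turned into the fractional allele counts such a test would take. Feeding fractional counts into a Pearson statistic is a heuristic, and all names and numbers here are hypothetical.

```python
def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-squared statistic for a 2 x 2 table of allele counts."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    return numerator / denominator


def allele_counts_from_dosage(dosages):
    """Turn per-individual dosages (0 to 2) into counts of both alleles."""
    copies_a = sum(dosages)
    return copies_a, 2 * len(dosages) - copies_a


# Hypothetical allele counts: 30 vs 70 in cases, 20 vs 80 in controls.
statistic = allelic_chi_square(30, 70, 20, 80)

# Hypothetical dosages for three cases and three controls.
case_a, case_b = allele_counts_from_dosage([1.8, 1.0, 2.0])
control_a, control_b = allele_counts_from_dosage([0.2, 1.0, 0.5])
dosage_statistic = allelic_chi_square(case_a, case_b, control_a, control_b)
```

Running the same function on counts from the original genotypes, from best-guess calls and from doses gives statistics that can be compared marker by marker, which is the comparison this slide sets up.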

Test Statistic Comparison
[Figure: four panels, each comparing a test statistic with the original: test using best tag, test using 2-SNP tags, test using imputed genotypes, and test using allele doses.]

Can we do even better?
- Ask a better statistician?
  - Jonathan Marchini / Peter Donnelly
  - Matthew Stephens
  - Mark Daly / Paul de Bakker
  - Many more?

Can we do even better?
- Ask a better statistician?
- Collect more data?
  - Genotype study samples on two platforms
    - 60 individuals in overlap, 1.78% error rate per allele
    - 100 individuals in overlap, 1.03% error rate
    - 200 individuals in overlap, 0.78% error rate
    - 500 individuals in overlap, 0.41% error rate
- Maybe we could use a larger HapMap?

Summary
- It is possible to combine data across studies that rely on different platforms
  - Will add value to genome-wide scans
- My (currently) favorite way is to impute missing genotypes
- A lot of interesting statistical and computational problems

Acknowledgements
- Yun Li, Paul Scheet, Jun Ding, Weimin Chen, Serena Sanna
- FUSION Investigators, led by: Karen Mohlke, Mike Boehnke, Francis Collins, Jaakko Tuomilehto, Richard Bergman
- Sardinia Investigators, led by: David Schlessinger, Manuela Uda, Antonio Cao, Edward Lakatta, Paul Costa
- goncalo@umich.edu

Comparison With PHASE
[Figure: three panels comparing Mach 1.0 and PHASE against number of rounds (1, 30, 100, 150, 500, 10000): average number of errors when missing genotypes inferred; average number of flips needed to transform into true haplotypes; average number of exactly correct haplotypes.]
- Computation time for this dataset: Mach 1.0, ~3 sec per round; PHASE, ~10h in total
- Simulations follow model of Schaffner et al (2005), Marchini et al (2006)

Mathematical Model
- Markov model, where each haplotype is a mosaic of other known haplotypes
- The probability of a particular arrangement depends on number of change-over points:
  $\Pr(S = s) = \Pr(S_1 = s_1, \ldots, S_L = s_L) = \Pr(S_1 = s_1) \prod_{j=1}^{L-1} \Pr(S_{j+1} = s_{j+1} \mid S_j = s_j)$
- For a specific arrangement of the mosaic, calculate probability of observed alleles:
  $\Pr(A = a \mid S = s) = \Pr(A_1 = a_1, \ldots, A_L = a_L \mid S_1 = s_1, \ldots, S_L = s_L) = \prod_{j=1}^{L} \Pr(A_j = a_j \mid E_j(s_j))$
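The two products above can be combined into the probability of the observed alleles by summing over every mosaic arrangement, which the forward recursion does without enumerating them. The Python sketch below checks the recursion against brute-force enumeration on a tiny example. It rests on assumptions the slide does not spell out: a uniform starting distribution, a constant switch probability between adjacent markers, a symmetric mismatch probability, and E_j(s_j) read as the allele carried by template s_j at marker j. All names and values are hypothetical.

```python
from itertools import product


def mosaic_probability(path, n_ref, switch):
    """Pr(S = s): uniform start, then a stay-or-switch step per interval."""
    p = 1.0 / n_ref
    for prev, cur in zip(path, path[1:]):
        stay = (1.0 - switch) + switch / n_ref
        p *= stay if prev == cur else switch / n_ref
    return p


def emission_probability(alleles, path, reference, error):
    """Pr(A = a | S = s): product over markers of match or mismatch terms."""
    p = 1.0
    for j, (allele, s) in enumerate(zip(alleles, path)):
        p *= (1.0 - error) if reference[s][j] == allele else error
    return p


def forward_likelihood(alleles, reference, switch, error):
    """Pr(A = a), summed over all mosaics with the forward recursion."""
    n_ref = len(reference)

    def emit(s, j):
        return (1.0 - error) if reference[s][j] == alleles[j] else error

    alpha = [emit(s, 0) / n_ref for s in range(n_ref)]
    for j in range(1, len(alleles)):
        total = sum(alpha)
        alpha = [
            ((1.0 - switch) * alpha[s] + switch * total / n_ref) * emit(s, j)
            for s in range(n_ref)
        ]
    return sum(alpha)


# Hypothetical toy data: three templates, four biallelic markers.
TEMPLATES = ["0011", "0101", "1110"]
ALLELES = "0111"
SWITCH, ERROR = 0.2, 0.05

brute_force = sum(
    mosaic_probability(path, len(TEMPLATES), SWITCH)
    * emission_probability(ALLELES, path, TEMPLATES, ERROR)
    for path in product(range(len(TEMPLATES)), repeat=len(ALLELES))
)
forward = forward_likelihood(ALLELES, TEMPLATES, SWITCH, ERROR)
```

The enumeration visits 3^4 = 81 arrangements here, while the recursion needs work proportional to the number of templates times the number of markers, which is what makes the model usable at realistic sizes.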