Genome Informatics & Cloud Computing Jeffrey Reid Festival of Genomics -- Nov 4, 2015
The internet isn't just for cat pictures anymore (but maybe it should be)!
Outline
- Cloud-based approaches to large-scale genome informatics challenges
- Production analysis at the 1600+ exome per week rate
- Enabling integration of clinical and genomic data
- Cat pictures
What is the RGC?
- Launched in January 2014, including a partnership with the Geisinger Health System
- Goal: build a comprehensive genotype-phenotype resource combining genomic and EHR data from >250K people to aid drug development and enable genomic medicine
"Scientifically and medically, it's pretty exciting," said Dr. Leslie G. Biesecker, chief of the genetic disease research branch at the government's National Human Genome Research Institute, who is familiar with the project. "As far as I'm aware, it's the largest clinical sequencing undertaking in this country so far by a long shot." He added that the move of sequencing into general health care is going to change medicine.
RGC Collaborations and Projects
- Families
- General Population
- Founder Populations
- DRIFT Consortium
Regeneron Genetics Center - One of the Most Productive Exome Sequencing Facilities in the World
[Figure: exome production over time, monthly ticks from Jul through the following Sep; vertical axis 0 to 9000]
- Initiated production on July 1, 2014
- Fully automated exome sample preparation on September 1
- Averaged 1,000 exomes per week through the last third of 2014
- Averaged >1200 exomes per week through 1st quarter of 2015
- 2nd & 3rd quarters of 2015 averaged >1750 exomes per week
- All data processing in the cloud
Innovative technologies & automation enable ultra high-throughput sequencing & analysis
- Automated Biobank (1.4M Samples)
- Library Prep Automation (>200K Samples/Yr)
- Illumina Fleet (>80K Exomes/Yr)
- Cloud Informatics (>100K Exomes/Yr)
- QC
- >85K exomes sequenced; on pace for ~100K individuals by end of year
- >1750 exomes/week
- First 100% cloud-based genome center with fully automated analysis pipelines & data sharing
100% Cloud-based
- Many efforts use the cloud for some things
Why so much cloud?
- Lack of legacy hardware
- Overburdened local IT resources
- Short-term need to ramp up to genome-center scale production fast
- Long-term need for scalability
- Security (thanks to DNAnexus)
- Enables cloud-based EHR mining and integration with genomic data
- Data delivery for a long and growing list of partners
- Supports both R&D and production analysis work, easing the evolution of analytical tools from the bench to the pipeline
A new method for WES copy number variant calling
CLAMMS: Copy number estimation using Lattice-Aligned Mixture Models
- Normalize coverage data, accounting for local GC content
- For each sample, identify a reference panel of similar samples based on sequencing QC metrics
- Use the reference panel to model the expected coverage distribution at each exon, given different copy number states (Mixture Models)
- Identify CNVs from regions where the sample's coverage is likely non-diploid over contiguous exonic regions (HMM)
Packer JS, et al. CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data. Bioinformatics. 2015 Sep 17. pii: btv547.
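The last two steps (per-exon mixture model, then an HMM over contiguous exons) can be sketched roughly as below. This is a simplified illustration, not the CLAMMS implementation: it assumes Gaussian components whose means sit on a lattice at copy number / 2 times the expected diploid coverage, a single shared standard deviation, and a fixed "stay in the same state" probability. The function names and the parameters `sd` and `stay_prob` are hypothetical.

```python
import math

COPY_STATES = (0, 1, 2, 3)  # copy-number states considered in this sketch


def emission_logprobs(coverage, diploid_mean, sd):
    """Log-likelihood of one exon's normalized coverage under each state.

    Assumption: each state is a Gaussian whose mean is fixed on a lattice,
    (copy number / 2) * expected diploid coverage for that exon.
    """
    logps = []
    for cn in COPY_STATES:
        mean = diploid_mean * cn / 2.0
        logps.append(-0.5 * math.log(2 * math.pi * sd * sd)
                     - (coverage - mean) ** 2 / (2 * sd * sd))
    return logps


def viterbi_copy_number(coverages, diploid_means, sd=0.1, stay_prob=0.99):
    """Most likely copy-number state per exon along a run of contiguous exons."""
    if not coverages:
        return []
    n = len(COPY_STATES)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    # Start by favoring the diploid state, with the same odds as a transition.
    first = emission_logprobs(coverages[0], diploid_means[0], sd)
    score = [(log_stay if cn == 2 else log_switch) + e
             for cn, e in zip(COPY_STATES, first)]

    backpointers = []
    for x, mu in zip(coverages[1:], diploid_means[1:]):
        emit = emission_logprobs(x, mu, sd)
        new_score, pointers = [], []
        for j in range(n):
            # Best previous state i for ending in state j at this exon.
            best_i = max(range(n),
                         key=lambda i: score[i] + (log_stay if i == j else log_switch))
            trans = log_stay if best_i == j else log_switch
            new_score.append(score[best_i] + trans + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [COPY_STATES[s] for s in path]


# Toy input: three middle exons at roughly half the expected coverage.
calls = viterbi_copy_number([1.0, 1.02, 0.5, 0.48, 0.52, 0.99], [1.0] * 6)
print(calls)
```

In this toy input the three half-coverage exons come out as a single-copy (heterozygous deletion) segment flanked by diploid exons. The per-exon `diploid_means` stand in for what the reference panel would supply.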
Mixture model fits coverage distribution for GSTT1
[Figure: normalized read coverage across neighboring exons of GSTT1]
de novo pedigree reconstruction via PRIMUS
PRIMUS uses pairwise comparisons to construct a pedigree from genetic data
- Estimate identity by descent (IBD) between individuals
- Predict familial relationship
- Fit them together like a Sudoku puzzle
[Figure: relationship graph of individuals A through G passed through PRIMUS to yield a pedigree; one symbol marks individuals with no genetic data. Edge legend: red = parent/child, gold = full-sibling, blue = 2nd degree, green = 3rd degree]
Staples et al. (2014) AJHG
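The "predict familial relationship" step can be illustrated with a nearest-match rule against the textbook expected IBD sharing proportions for each relationship class. This is only a sketch of the idea, not PRIMUS's actual classifier; the function name and the squared-distance rule are assumptions made for illustration.

```python
# Expected proportions of the genome shared IBD0, IBD1, IBD2 for each class.
EXPECTED_IBD = {
    "MZ twin":      (0.0,  0.0,  1.0),
    "parent/child": (0.0,  1.0,  0.0),
    "full-sibling": (0.25, 0.5,  0.25),
    "2nd degree":   (0.5,  0.5,  0.0),
    "3rd degree":   (0.75, 0.25, 0.0),
    "unrelated":    (1.0,  0.0,  0.0),
}


def predict_relationship(ibd0, ibd1, ibd2):
    """Return the class whose expected IBD proportions are closest to the estimate."""
    observed = (ibd0, ibd1, ibd2)

    def squared_distance(name):
        expected = EXPECTED_IBD[name]
        return sum((o - e) ** 2 for o, e in zip(observed, expected))

    return min(EXPECTED_IBD, key=squared_distance)


print(predict_relationship(0.02, 0.97, 0.01))  # near the parent/child expectation
print(predict_relationship(0.24, 0.51, 0.25))  # near the full-sibling expectation
```

Real estimates are noisy, and the 2nd and 3rd degree classes in particular overlap, so a production classifier needs a probabilistic model rather than a hard nearest-match rule.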
GHSf40k Families
- 42880 samples produced 19962 1st-2nd degree relationships; 19489 samples involved
- 8939 1st degree relationships produced 5062 family networks; largest = size 21
[Figure panel: IBD0 (x-axis, 0.0 to 1.0) vs. IBD1 (y-axis, 0.0 to 1.0), with clusters labeled 5104 parent-child, 3835 full-sib, ~11023 2nd degree, ~12933 3rd degree, 15 MZ twins]
[Figure panel: # of samples in each family (x-axis, 2 to 20) vs. # of families (y-axis, log scale, 1 to 500); one bar labeled 3482]
PRIMUS reconstructs a 21-person pedigree using 1st degree relationships
[Figure: pedigree with each individual's age in parentheses: A (83), B (79), C (73), D (75), E (69), F (74), G (52), H (56), I (63), J (64), K (60), L (54), M (41), N (48), O (58), P (43), Q (49), R (49), S (54), T (40), U (24), V (21); one symbol marks individuals with no genetic data]
- 21 samples connected by 1st degree relationships
- Includes ages of samples underneath patient IDs
- This is the only pedigree that fits the genetic data
Loss of function (LOF) variant in CCR5 is transmitted through the family
[Figure: the 21-person pedigree, individuals A (83) through V (21), marked as het or hom for the frameshift LOF in CCR5; one symbol marks individuals with no genetic data]
- Read stacks confirm a frameshift LOF
- Allele frequency = 9.7%
- CCR5: cell surface receptor that HIV uses to enter & infect host cells
- LOF in CCR5 gives HIV and smallpox resistance
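For a rough sense of what a 9.7% allele frequency implies for carrier counts, here is a minimal sketch that assumes Hardy-Weinberg equilibrium. That assumption is not stated on the slide, and the function name is hypothetical.

```python
def hardy_weinberg(allele_freq):
    """Expected genotype frequencies at a biallelic site, assuming Hardy-Weinberg equilibrium."""
    q = allele_freq        # frequency of the LOF allele
    p = 1.0 - q            # frequency of the reference allele
    return {"hom_ref": p * p, "het": 2 * p * q, "hom_lof": q * q}


expected = hardy_weinberg(0.097)
for genotype, freq in expected.items():
    print(f"{genotype}: {freq:.4f}")
```

Under that assumption, about 17.5% of individuals would be heterozygous and a little under 1% homozygous for the LOF allele.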
Visualize the CCR5 pedigree as an undirected graph
[Figure: graph of the pedigree. Edge legend: red = parent/child, gold = full-sibling, blue = 2nd degree]
... connect to other pedigrees with 2nd degree relationships
[Figure: the graph extended to neighboring pedigrees. Edge legend: red = parent/child, gold = full-sibling, blue = 2nd degree]
We are starting to leverage all family networks, including a 1107-person 2nd degree family network
[Figure: the 1107-person network, with individual 1st degree pedigrees and the 21-person CCR5 LOF pedigree marked. Edge legend: red = parent/child, gold = full-sibling, blue = 2nd degree]
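Treating samples as nodes and predicted relationships as undirected edges, family networks are the connected components of the graph. A minimal sketch with a union-find structure follows; the input format `(sample_a, sample_b, degree)` and the function name are assumptions for illustration, not the RGC pipeline.

```python
from collections import defaultdict


def family_networks(relationships, max_degree=1):
    """Group samples into networks linked by relationships of degree <= max_degree.

    relationships: iterable of (sample_a, sample_b, degree) tuples.
    Samples that have no qualifying relationship are not reported.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, degree in relationships:
        if degree <= max_degree:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two networks

    groups = defaultdict(set)
    for sample in list(parent):
        groups[find(sample)].add(sample)
    return sorted(groups.values(), key=len, reverse=True)


# Toy data: two 1st degree pedigrees joined by one 2nd degree edge.
pairs = [("A", "B", 1), ("B", "C", 1), ("D", "E", 1), ("C", "D", 2)]
print(family_networks(pairs, max_degree=1))
print(family_networks(pairs, max_degree=2))
```

With `max_degree=1` the toy data yields two separate pedigrees. Allowing 2nd degree edges merges them into one network, which mirrors how the larger networks on these slides arise.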
Cat-clusions
- RGC is up and running as a leader in exome sequencing
- Producing ~7000 exomes a month across a variety of projects
- Mendelian/Familial disease projects ranging from single families to hundreds of trios with shared phenotypes (CUMC, TSK, UU, UC, etc.)
- Founder populations, including studies in the Amish with UM and CSC
- Population sequencing with EHR data (DiscovEHR collaboration with GHS)
- First 100% Cloud-based Genome Center
- Essential to scaling production efforts and to secure data analysis & delivery
- Also transformative for easy transition from bench to pipeline
- Part of an exciting moment as medicine and healthcare embrace the future
Acknowledgements
Regeneron Genetics Center: Aris Baras (co-head), Alan Shuldiner (co-head), Michael Norsen, Lyndon Mitnaul, Alejandra King, Chi Onyewu, Charlene Carlino, Talita Silva, John Overton, Alex Lopez, Caitlin Forsythe, Erin Fuller, Karina Toledo, Mathew Smith, Michael Lattari, Maria Sotiropoulos-Padilla, Sarah Wolf, Thomas Schleicher, Jeffrey Reid, Chris Sprangel, Rick Ulloa, Martin Paradesi, Kia Manoochehri, Miro Georgiev, Young Hahn, Scott Jones, John Penn, Sheldon Bai, Ke Huang, Alicia Hawes, Lukas Habegger, Jeffrey Staples, Evan Maxwell, Ingrid Borecki, Colm O'Dushlaine, Cristopher Van Hout, Semanti Mukherjee, Alex Li, Omri Gottesman, Brian Cajes, Nilanjana Banerjee, Rick Dewey, Shannon Bruse, Jonathan Chung, Claudia Gonzaga-Jauregui, Kavita Praveen, Suganthi Balasubramanian, Jan Freudenberg, Julie Horowitz, Aris Economides, Ge Zhou, Liz Misir, Nehal Gosalia, Kiran Nistala, John Dronzek, Angelo Pefanis, Darshi Persaud
RGC Steering Committee: George Yancopoulos, Scott Mellis, Andrew Murphy, Robert Phillips, Neil Stahl, Aris Economides
RGC Collaborators: The Whole GHS Team, David Ledbetter, David Carey
Internet 4 All Catz & LOLz