Analyzing NGS data with clinical data: open source software for translational medicine BASEL LIFE SCIENCE WEEK NGS FORUM SEPTEMBER 24, 2015 Kees van Bochove, CEO The Hyve
Agenda 1. Introduction 2. Open Source in Translational Medicine 3. cbioportal 4. TranSMART 5. ADAM & Apache Spark 2
1. INTRODUCTION 3
The Hyve Professional support for open source software for bioinformatics and translational research software, such as transmart, cbioportal, i2b2, Galaxy, ADAM and OHDSI Core values Share Reuse Specialize Office Locations Utrecht, Netherlands Cambridge, MA, United States Services Software development Data science services Consultancy Hosting / SLAs Mission Enable pre-competitive collaboration in life science R&D by leveraging open source software Fast-growing Started in 2012 30 people by now 4
Interdisciplinary team software engineers, data scientists, project managers & staff; expertise in bioinformatics, medical informatics, software engineering, biostatistics etc. 5
2. OPEN SOURCE IN TRANSLATIONAL MEDICINE 6
http://lanyrd.com/2015/innovation-
8
The Open Source Definition 1. Free Redistribution 2. Availability of Source Code 3. Allow Derived Works 4. Integrity of The Author's Source Code 5. No Discrimination Against Persons or Groups 6. No Discrimination Against Fields of Endeavor 7. Redistribution of License 8. License Must Not Be Specific to a Product 9. License Must Not Restrict Other Software 10. License Must Be Technology-Neutral
Open Source Source code openly accessible and reusable for everyone Enables pre-competitive collaboration: both academics and industry can use and enhance it; which grows a community Transparency: verification (scientific as well as IT security) can be done by anyone, no black box
The software engineering process in an open source community is not different from a closed commercial setting But the stakeholders, contributors, business models, engagement models etc. are!
Different Non-Functional Requirements for Software Bioinformatician in academics: create a novel solution for a problem which has publication value Basic Research: new frontiers Software should demonstrate working principle Bioinformatician / IT Services in pharma/clinic: mainly applied research: Software should be well tested, maintainable, extensible, scalable etc. Need for commercial support for open source software
Open Source in Translational Medicine Clinical - ecrf & apps: Datawarehousing: Data visualisation: Imaging: Scientific compute: Biobanking: Workflow / NGS: 13 Study design:
3. CBIOPORTAL See also: http://www.pistoiaalliance.org/events/cbioportal-webinar 14
15 cbioportal study portal
Colon cancer study in TraIT 16
Event calls Gene alteration events per sample Which genes are altered in each individual tumor sample? Data type Mutations Copy number changes Methylation mrna and/or DNA mrna expression changes Alteration event calls Non-synonymous somatic mutations Homozygous deletion or amplification Epigenetic silencing Gene fusions Over- or under-expression Alteration types and thresholds can be customized for each gene
18 Visualization of events across genes and data types
19 Review cancer genomics events in clinical context
20 GenePrint Visualisation (from cbioportal) in transmart
TM2CBIO In collaboration with Netherlands Cancer Institute ETL pipeline between transmart and cbioportal TranSMART used as data warehouse, and cbioportal as a study-based analytics mart for cancer studies Going from individual data points (e.g. mrna intensity levels) in transmart to alteration events in cbioportal 21
4. TRANSMART 22
TranSMART as a product Datawarehouse bringing together scientists from clinical sciences, preclinical research and discovery around the data Combination of internal datasets and documents with public datasets and knowledge Tailored to both biologists/clinicians and bioinformaticians Dual nature: in use for translational research in both pharma and hospitals/clinic 23
In early 2014, over 50 transmart implementations International Research Initiatives IMI etriks, EMIF CTMM TraIT Pharma & Biotech Sanofi, Millennium, Pfizer, JNJ, Roche Government Aligned Institutions FDA Non-Profits 1Mind4Research, Orion Bionetworks, Critical Path Institute Hospitals / Academics U Michigan, Harvard / Boston Children's Hospital, HEGP, Johns Hopkins, St. Jude Service Providers ConvergeHEALTH, thehyve, Rancho Biosciences, BTGS, Thomson Reuters, Saama Tech, Cognizant Start Organization Type Stage 2008 Johnson & Johnson Pharma Production 2008 Recombinant by Deloitte Services Multiple 2010 Sage Bionetworks Non Profit Production 2010 Thomson Reuters Services Support 2010 U-BIOPRED Consortium Production 2011 SAFE-T Consortium Pilot 2011 University of Michigan, Comprehensive Cancer Center Academic Production 2012 APHP-HEGP Paris France Academic Production 2012 BT Cure Consortium Pilot 2012 CTMM/TraIT Consortium Dev 2012 FDA Government Dev 2012 IMI/eTRIKS Consortium Dev 2012 Merck Pharma Pilot 2012 Millennium Pharmaceuticals Pharma Production 2012 One Mind for Research (1M4R) Non Profit Production 2012 Pfizer Pharma Production 2012 Roche Pharma Evaluation 2012 Sanofi-Aventis Pharma Dev 2012 St. Jude Non Profit Dev 2012 U Michigan, Computational Medicine & Bioinformatics Academic Multiple 2013 Agios Biotech Evaluation 2013 CARPEM Cancer personalized medicine Academic Dev 2013 Harvard University / Boston Children's Hospital Academic Autism Pilot 2013 Boehringer Ingelheim Pharma Pilot 2013 Bristol Myers Squibb Pharma Evaluation 2013 BT Global Services Services Pilot 2013 Accelerated Cure Project for MS Non Profit Dev 2014 Personalized medicine and colorectal cancers (France) Academic Dev 2014 PCORI PRRN Phelan-McDermid Syndrome Data Network Academic Dev
TranSMART Open Source History February 2012: J&J releases transmart as open source on GitHub under GPL v3 December 2012: CTMM TraIT project decides to use transmart as core infrastructure component January 2013: IMI etriks starts, uses transmart as core infrastructure component February 2013: kickoff of transmart Foundation, U. Michigan publishes PostgreSQL port March 2014: IMI EMIF kickoff, transmart is used as data integration component 25
Center for Translational Molecular Medicine (CTMM) Public-private consortium Dedicated to the development of Molecular Diagnostics and Molecular Imaging technologies Focusing on the translational aspects of molecular medicine. 120 partners universities, academic medical centers, medical technology enterprises and chemical and pharmaceutical companies. Budget 300 M 22 projects / research consortia TraIT is the Translational Research IT project supporting these projects with a joint IT infrastructure 26
TraIT Consortium Growing TraIT project team 27
TraIT data workflow Hospital (IT) HIS PACS LIS Samples (IT) BIMS P s e u d o n y m i z a t i o n data domains clinical data OpenClinica imaging data NBIA + AIM biobanking CBM-NL Translational Research (IT) integrated data transmart/i2b2 datawarehouse translational analytics workbench transmart/ cohort explorer R Public Data experimental data e.g. Galaxy, Chipster e.g. PhenotypeDB, Annai Systems Galaxy
Recombina nt / Deloitte CDISC Thomson Reuters Pfizer Astra Zeneca VUmc The Hyve 70 Sanofi Johnson & Johnson University of Michigan Philips University of Luxembourg Amsterdam, June 2013: transmart Workshop Attendees from 10 Pharma companies, 11 University Medical Centers and 12 IT companies http://lanyrd.com/2013/transmart 29
130 Ann Arbor, Michigan, October 2014: Annual Meeting http://lanyrd.com/2014/transmart 30
TranSMART wins all the prizes: Best Show Award, Best Practices Award, Best Poster Award Bio IT World, Boston, April 2015 http://bit.ly/1r2n6uz 31
Contributors
The Hyve transmart 1.3 Contributions Improvements for handling GWAS data & cohort selection on SNP data Build a number of interactive advanced analytics workflows & correct statistical assumptions Imaging workflow: ETL for imaging metadata and results Prototype of a transmart 2.0 interface: new look & feel, user experience Under discussion: improved GUI for ETL? 33
TranSMART 2.0 User Interface - alpha
5. USING APACHE SPARK FOR NGS DATA ANALYSIS 35
NGS data storage & analysis Don't import BAM, Cram, VCF and BCF to a database! They are the databases! Indexed Compressed Highly specialized & optimized storage formats Whole ecosystem is build around this concept. All tools read and write these through a rich API HTS-JDK GATK MapReduce engine
Genome Analysis ToolKit (GATK) MapReduce framework for processing BAM and VCF files / databases Provides walkers that provide access pattern as a stream trough BAM and VCF files On top of these walkers there are analysis tools: Indel realignment Base Quality recalibration Unified Genotyper (=old variant caller) Haplotype Caller (= new variant caller)
ADAM http://bdgenomics.org Genomics processing engine &specialized file format built with: Apache Avro (uniform data format defintion) Apache Spark (memory-based cluster execution) Apache Parquet (Hadoop based columnar storage format) Resulting data can be accessed by Hadoop Map-Reduce, Spark, Shark, Impala, Pig, Hive etc. Support for conversion to and from BAM and VCF. MAF conversion and somatic variant calling unclear. Open source project driven mainly by UC Berkeley
ADAM Example setup in transmart
Translational Research Infrastructure User Interfaces R, Spotfire etc. Galaxy GUI TranSMART GUI TranSMART RESTful API Galaxy Sun Grid Engine (SGE cluster) Clinical API ctakes Clinical Data/i2b2 Mapping between patients in clinical db and samples in omics data Genome Analysis ToolKit (GATK) HTS - JDK BAM, CRAM, VCF, BCF ADAM/ Spark Transcriptome Analysis ToolKit (TATK) (R / Bioconductor) Transcriptome files / DB Isilon High Performance Storage Proteome Analysis ToolKit (PATK) MzML, MzIdent XNAT Hadoop HDFS / Apache Parquet (ADAM) Archiving Object Storage (e.g. Glacier) Imaging Data Repository Public Archives Sequence Read Archive (SRA) Download / ETL Gene Expression Omnibus (GEO) PRoteomics IDEntifications Database (PRIDE)