Analyzing NGS data with clinical data: open source software for translational medicine

Analyzing NGS data with clinical data: open source software for translational medicine
Basel Life Science Week, NGS Forum, September 24, 2015
Kees van Bochove, CEO, The Hyve

Agenda
1. Introduction
2. Open Source in Translational Medicine
3. cBioPortal
4. TranSMART
5. ADAM & Apache Spark

1. INTRODUCTION

The Hyve
Professional support for open source software for bioinformatics and translational research, such as tranSMART, cBioPortal, i2b2, Galaxy, ADAM and OHDSI
Core values: Share, Reuse, Specialize
Office locations: Utrecht, Netherlands; Cambridge, MA, United States
Services: software development, data science services, consultancy, hosting / SLAs
Mission: enable pre-competitive collaboration in life science R&D by leveraging open source software
Fast-growing: started in 2012, 30 people by now

Interdisciplinary team: software engineers, data scientists, project managers & staff; expertise in bioinformatics, medical informatics, software engineering, biostatistics etc.

2. OPEN SOURCE IN TRANSLATIONAL MEDICINE

http://lanyrd.com/2015/innovation-

The Open Source Definition
1. Free Redistribution
2. Availability of Source Code
3. Allow Derived Works
4. Integrity of The Author's Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
10. License Must Be Technology-Neutral

Open Source
Source code is openly accessible and reusable for everyone
Enables pre-competitive collaboration: both academia and industry can use and enhance it, which grows a community
Transparency: verification (scientific as well as IT security) can be done by anyone; no black box

The software engineering process in an open source community is no different from that in a closed commercial setting. But the stakeholders, contributors, business models, engagement models etc. are!

Different Non-Functional Requirements for Software
Bioinformatician in academia: creates a novel solution for a problem which has publication value
  Basic research: new frontiers
  Software should demonstrate a working principle
Bioinformatician / IT services in pharma/clinic: mainly applied research
  Software should be well tested, maintainable, extensible, scalable etc.
  Need for commercial support for open source software

Open Source in Translational Medicine
Tool categories: Clinical (eCRF & apps), Data warehousing, Data visualisation, Imaging, Scientific compute, Biobanking, Workflow / NGS, Study design

3. CBIOPORTAL
See also: http://www.pistoiaalliance.org/events/cbioportal-webinar

cBioPortal study portal

Colon cancer study in TraIT

Event calls
Gene alteration events per sample: which genes are altered in each individual tumor sample?
Data type | Alteration event calls
Mutations | Non-synonymous somatic mutations
Copy number changes | Homozygous deletion or amplification
Methylation | Epigenetic silencing
mRNA and/or DNA | Gene fusions
mRNA expression changes | Over- or under-expression
Alteration types and thresholds can be customized for each gene
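The idea of turning per-sample measurements into alteration event calls, with thresholds that can be overridden per gene, can be sketched in a few lines. This is an illustrative sketch only, not cBioPortal code: the `Thresholds` class, the `call_events` function, the default z-score cut-off of 2.0 and the discrete copy number coding from -2 (homozygous deletion) to 2 (amplification) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Per-gene cut-offs; the default values are illustrative assumptions."""
    mrna_z: float = 2.0             # |z-score| at or beyond this is an expression event
    call_amplification: bool = True
    call_deletion: bool = True


DEFAULT = Thresholds()


def call_events(gene, sample, overrides=None):
    """Return the alteration events for one gene in one tumor sample.

    `sample` is a dict with optional keys (hypothetical layout):
      "mutations":   {gene: [variant classification, ...]}
      "copy_number": {gene: discrete level from -2 to 2}
      "mrna_z":      {gene: expression z-score}
    """
    t = (overrides or {}).get(gene, DEFAULT)
    events = []

    # Mutations: only non-synonymous variants count as an event.
    classes = sample.get("mutations", {}).get(gene, [])
    if any(c != "synonymous" for c in classes):
        events.append("mutation")

    # Copy number: only the extreme levels are called.
    cn = sample.get("copy_number", {}).get(gene, 0)
    if cn >= 2 and t.call_amplification:
        events.append("amplification")
    if cn <= -2 and t.call_deletion:
        events.append("homozygous deletion")

    # Expression: compare the z-score with the gene-specific threshold.
    z = sample.get("mrna_z", {}).get(gene)
    if z is not None:
        if z >= t.mrna_z:
            events.append("over-expression")
        elif z <= -t.mrna_z:
            events.append("under-expression")
    return events


sample = {
    "mutations": {"TP53": ["missense"], "KRAS": ["synonymous"]},
    "copy_number": {"ERBB2": 2, "CDKN2A": -2},
    "mrna_z": {"ERBB2": 2.4, "TP53": -1.1},
}
# A stricter expression threshold for one gene only.
overrides = {"ERBB2": Thresholds(mrna_z=3.0)}
events = {g: call_events(g, sample, overrides)
          for g in ["TP53", "KRAS", "ERBB2", "CDKN2A"]}
```

With the override in place, ERBB2 is called as amplified but not over-expressed; without it, the same sample would yield both events.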

Visualization of events across genes and data types

Review cancer genomics events in clinical context

GenePrint visualisation (from cBioPortal) in tranSMART

TM2CBIO
In collaboration with the Netherlands Cancer Institute
ETL pipeline between tranSMART and cBioPortal
TranSMART is used as the data warehouse, and cBioPortal as a study-based analytics mart for cancer studies
Going from individual data points (e.g. mRNA intensity levels) in tranSMART to alteration events in cBioPortal
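One step such a pipeline needs is the move from individual data points to a per-gene, per-sample matrix that an event caller can threshold. The sketch below shows that step under stated assumptions: the `(sample, gene, intensity)` row layout, the function names and the tab-delimited output are hypothetical and are not the actual TM2CBIO implementation or the cBioPortal import format.

```python
import statistics
from collections import defaultdict


def zscore_matrix(rows):
    """Pivot (sample, gene, intensity) rows into {gene: {sample: z-score}}.

    z-scores are computed per gene across all samples in the input.
    """
    by_gene = defaultdict(dict)
    for sample_id, gene, intensity in rows:
        by_gene[gene][sample_id] = intensity

    matrix = {}
    for gene, values in by_gene.items():
        mean = statistics.mean(values.values())
        sd = statistics.pstdev(values.values())
        # A gene without variation carries no signal: report z = 0 everywhere.
        matrix[gene] = {s: (v - mean) / sd if sd else 0.0
                        for s, v in values.items()}
    return matrix


def to_tsv(matrix, samples):
    """Render the matrix as tab-delimited text, one gene per line."""
    lines = ["\t".join(["gene"] + samples)]
    for gene in sorted(matrix):
        cells = [f"{matrix[gene][s]:.2f}" if s in matrix[gene] else "NA"
                 for s in samples]
        lines.append("\t".join([gene] + cells))
    return "\n".join(lines)


# Toy data points as they might be exported from a warehouse.
rows = [
    ("S1", "EGFR", 8.0), ("S2", "EGFR", 10.0), ("S3", "EGFR", 12.0),
    ("S1", "MYC", 5.0), ("S2", "MYC", 5.0), ("S3", "MYC", 5.0),
]
matrix = zscore_matrix(rows)
tsv = to_tsv(matrix, ["S1", "S2", "S3"])
```

The z-scores here use the population standard deviation over the samples in the study; a real pipeline might instead normalise against a reference set of samples.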

4. TRANSMART

TranSMART as a product
Data warehouse bringing together scientists from clinical sciences, preclinical research and discovery around the data
Combination of internal datasets and documents with public datasets and knowledge
Tailored to both biologists/clinicians and bioinformaticians
Dual nature: in use for translational research in both pharma and hospitals/clinics

In early 2014, over 50 tranSMART implementations
International Research Initiatives: IMI eTRIKS, EMIF, CTMM TraIT
Pharma & Biotech: Sanofi, Millennium, Pfizer, JNJ, Roche
Government Aligned Institutions: FDA
Non-Profits: 1Mind4Research, Orion Bionetworks, Critical Path Institute
Hospitals / Academics: U Michigan, Harvard / Boston Children's Hospital, HEGP, Johns Hopkins, St. Jude
Service Providers: ConvergeHEALTH, The Hyve, Rancho Biosciences, BTGS, Thomson Reuters, Saama Tech, Cognizant

Start | Organization | Type | Stage
2008 | Johnson & Johnson | Pharma | Production
2008 | Recombinant by Deloitte | Services | Multiple
2010 | Sage Bionetworks | Non Profit | Production
2010 | Thomson Reuters | Services | Support
2010 | U-BIOPRED | Consortium | Production
2011 | SAFE-T | Consortium | Pilot
2011 | University of Michigan, Comprehensive Cancer Center | Academic | Production
2012 | APHP-HEGP Paris France | Academic | Production
2012 | BT Cure | Consortium | Pilot
2012 | CTMM/TraIT | Consortium | Dev
2012 | FDA | Government | Dev
2012 | IMI/eTRIKS | Consortium | Dev
2012 | Merck | Pharma | Pilot
2012 | Millennium Pharmaceuticals | Pharma | Production
2012 | One Mind for Research (1M4R) | Non Profit | Production
2012 | Pfizer | Pharma | Production
2012 | Roche | Pharma | Evaluation
2012 | Sanofi-Aventis | Pharma | Dev
2012 | St. Jude | Non Profit | Dev
2012 | U Michigan, Computational Medicine & Bioinformatics | Academic | Multiple
2013 | Agios | Biotech | Evaluation
2013 | CARPEM Cancer personalized medicine | Academic | Dev
2013 | Harvard University / Boston Children's Hospital | Academic | Autism Pilot
2013 | Boehringer Ingelheim | Pharma | Pilot
2013 | Bristol Myers Squibb | Pharma | Evaluation
2013 | BT Global Services | Services | Pilot
2013 | Accelerated Cure Project for MS | Non Profit | Dev
2014 | Personalized medicine and colorectal cancers (France) | Academic | Dev
2014 | PCORI PRRN Phelan-McDermid Syndrome Data Network | Academic | Dev

TranSMART Open Source History
February 2012: J&J releases tranSMART as open source on GitHub under GPL v3
December 2012: CTMM TraIT project decides to use tranSMART as core infrastructure component
January 2013: IMI eTRIKS starts, uses tranSMART as core infrastructure component
February 2013: kickoff of tranSMART Foundation, U. Michigan publishes PostgreSQL port
March 2014: IMI EMIF kickoff, tranSMART is used as data integration component

Center for Translational Molecular Medicine (CTMM)
Public-private consortium
Dedicated to the development of Molecular Diagnostics and Molecular Imaging technologies
Focusing on the translational aspects of molecular medicine
120 partners: universities, academic medical centers, medical technology enterprises and chemical and pharmaceutical companies
Budget: 300 M
22 projects / research consortia
TraIT is the Translational Research IT project supporting these projects with a joint IT infrastructure

TraIT Consortium
Growing TraIT project team

TraIT data workflow
Hospital (IT): HIS, PACS, LIS
Samples (IT): BIMS
Pseudonymization
Data domains:
  clinical data: OpenClinica
  imaging data: NBIA + AIM
  biobanking: CBM-NL
  experimental data
Translational Research (IT):
  integrated data: tranSMART/i2b2 data warehouse
  translational analytics workbench: tranSMART / cohort explorer, R
Public Data
Tools named for experimental data and analysis: e.g. Galaxy, Chipster; e.g. PhenotypeDB, Annai Systems; Galaxy
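The pseudonymization step between the hospital systems and the research data domains can be illustrated with a keyed hash, one common technique for deriving stable pseudonyms. The slides do not describe how TraIT implements this step, so the function name, key handling and record layout below are assumptions for illustration only.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from a patient identifier with HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    from different source systems can still be linked, while the original
    identifier cannot be recovered without the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key; in practice it would be held by a trusted party
# and never stored alongside the research data.
key = b"example-key-kept-by-a-trusted-party"

record = {"patient_id": "HOSP-000123", "age": 61, "diagnosis": "colon carcinoma"}
research_record = {**record, "patient_id": pseudonymize(record["patient_id"], key)}
```

Truncating the digest keeps identifiers short at the cost of a higher collision risk; the length of 16 hexadecimal characters is an arbitrary choice for the example.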

Amsterdam, June 2013: tranSMART Workshop
Attendees from 10 pharma companies, 11 university medical centers and 12 IT companies
http://lanyrd.com/2013/transmart
Organizations shown: Recombinant / Deloitte, CDISC, Thomson Reuters, Pfizer, AstraZeneca, VUmc, The Hyve, Sanofi, Johnson & Johnson, University of Michigan, Philips, University of Luxembourg
70

130
Ann Arbor, Michigan, October 2014: Annual Meeting
http://lanyrd.com/2014/transmart

TranSMART wins all the prizes: Best Show Award, Best Practices Award, Best Poster Award
Bio-IT World, Boston, April 2015
http://bit.ly/1r2n6uz

Contributors

The Hyve tranSMART 1.3 Contributions
Improvements for handling GWAS data & cohort selection on SNP data
Built a number of interactive advanced analytics workflows & corrected statistical assumptions
Imaging workflow: ETL for imaging metadata and results
Prototype of a tranSMART 2.0 interface: new look & feel, user experience
Under discussion: improved GUI for ETL?

TranSMART 2.0 User Interface - alpha

5. USING APACHE SPARK FOR NGS DATA ANALYSIS

NGS data storage & analysis
Don't import BAM, CRAM, VCF and BCF into a database! They are the databases!
  Indexed
  Compressed
  Highly specialized & optimized storage formats
The whole ecosystem is built around this concept. All tools read and write these formats through a rich API:
  HTS-JDK
  GATK MapReduce engine

Genome Analysis ToolKit (GATK)
MapReduce framework for processing BAM and VCF files / databases
Provides walkers that offer an access pattern as a stream through BAM and VCF files
On top of these walkers there are analysis tools:
  Indel realignment
  Base quality recalibration
  Unified Genotyper (= old variant caller)
  Haplotype Caller (= new variant caller)
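The walker idea (stream through the records, apply a map step to each one, then reduce the partial results) can be sketched in plain Python. This is a conceptual illustration, not the GATK API: the function names and the tiny in-memory VCF are made up for the example, and multi-allelic sites are not handled.

```python
from functools import reduce

# A minimal, made-up VCF held in memory so the example is self-contained.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t1000\t.\tA\tG\t50\tPASS\t.
1\t2000\t.\tAT\tA\t30\tPASS\t.
2\t1500\t.\tC\tT\t12\tLowQual\t.
"""


def records(lines):
    """Stream variant records one at a time, skipping header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt, qual, flt = line.rstrip("\n").split("\t")[:7]
        yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
               "qual": float(qual), "filter": flt}


def map_record(rec):
    """Map step: classify one passing record as a SNV or an indel."""
    if rec["filter"] != "PASS":
        return {}
    kind = "snv" if len(rec["ref"]) == 1 and len(rec["alt"]) == 1 else "indel"
    return {kind: 1}


def reduce_counts(total, partial):
    """Reduce step: merge the per-record result into the running total."""
    for key, n in partial.items():
        total[key] = total.get(key, 0) + n
    return total


counts = reduce(reduce_counts,
                map(map_record, records(VCF_TEXT.splitlines())),
                {})
```

Because the map step sees one record at a time and the reduce step only merges small partial results, the same pattern can be split over genomic regions and run in parallel.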

ADAM
http://bdgenomics.org
Genomics processing engine & specialized file format built with:
  Apache Avro (uniform data format definition)
  Apache Spark (memory-based cluster execution)
  Apache Parquet (Hadoop-based columnar storage format)
Resulting data can be accessed by Hadoop MapReduce, Spark, Shark, Impala, Pig, Hive etc.
Support for conversion to and from BAM and VCF. MAF conversion and somatic variant calling unclear.
Open source project driven mainly by UC Berkeley

ADAM Example setup in transmart

Translational Research Infrastructure
User interfaces: R, Spotfire etc.; Galaxy GUI; TranSMART GUI
APIs and compute: TranSMART RESTful API; Galaxy; Sun Grid Engine (SGE cluster)
Clinical: Clinical API; cTAKES; Clinical Data/i2b2
Mapping between patients in clinical db and samples in omics data
Genomics: Genome Analysis ToolKit (GATK); HTS-JDK; BAM, CRAM, VCF, BCF; ADAM/Spark
Transcriptomics: Transcriptome Analysis ToolKit (TATK) (R / Bioconductor); transcriptome files / DB
Proteomics: Proteome Analysis ToolKit (PATK); MzML, MzIdent
Imaging: XNAT; Imaging Data Repository
Storage: Isilon High Performance Storage; Hadoop HDFS / Apache Parquet (ADAM); Archiving Object Storage (e.g. Glacier)
Public archives (download / ETL): Sequence Read Archive (SRA); Gene Expression Omnibus (GEO); PRoteomics IDEntifications Database (PRIDE)
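The mapping between patients in the clinical database and samples in the omics data is what lets a clinically defined cohort be resolved to files on storage. A minimal sketch of that join follows; the identifiers, field names and the `link` function are hypothetical and do not reflect the tranSMART or i2b2 data model.

```python
# Clinical attributes keyed by patient identifier (toy data).
clinical = {
    "P001": {"age": 54, "stage": "II"},
    "P002": {"age": 67, "stage": "III"},
}
# Mapping table: which sample was taken from which patient.
sample_to_patient = {"S-10": "P001", "S-11": "P001", "S-20": "P002", "S-99": "P404"}
# Omics files on storage, keyed by sample identifier.
omics_files = {"S-10": "S-10.bam", "S-11": "S-11.vcf",
               "S-20": "S-20.bam", "S-99": "S-99.bam"}


def link(clinical, sample_to_patient, omics_files):
    """Join omics files to clinical records through the sample mapping.

    Returns the linked rows plus the samples that could not be linked,
    so that gaps in the mapping are reported instead of silently dropped.
    """
    linked, orphans = [], []
    for sample_id, path in sorted(omics_files.items()):
        patient_id = sample_to_patient.get(sample_id)
        if patient_id in clinical:
            linked.append({"patient": patient_id, "sample": sample_id,
                           "file": path, **clinical[patient_id]})
        else:
            orphans.append(sample_id)
    return linked, orphans


linked, orphans = link(clinical, sample_to_patient, omics_files)

# Cohort selection on a clinical attribute, resolved to files for analysis.
stage_two_files = [row["file"] for row in linked if row["stage"] == "II"]
```

Keeping the files where they are and joining only on identifiers matches the earlier point that BAM and VCF files act as the databases themselves.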