Data Sharing Initiative: International Cancer Genome Consortium

Data Sharing Initiative: International Cancer Genome Consortium. Tom Hudson, MD, President and Scientific Director, Ontario Institute for Cancer Research

Sharing BIG DATA From 17 countries

ICGC Map, March 2014: 71 projects launched

ICGC data is distributed, but coordinated by OICR and accessible through common portals.

Data Types Collected: donor clinical and demographic data; sample data; simple somatic mutations; copy number somatic mutations; structural somatic mutations; gene expression; splicing variation; miRNA expression; methylation; protein expression.
- Cancer pathways
- New biomarkers
- New targeted drugs
- New diagnostic tools
- Precision medicine

Data is standardized across projects to enable data sharing across projects.
Without a standardized data format: TXT, XML, VCF, MAF.
Without a standardized data dictionary: ENSG00000141510 = p53 = TP53??? Nonsense mutation = stop-gain???
Vs. the Data Portal: standardized data, de-identified clinical data.
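
The dictionary problem on this slide (one gene under three labels, one mutation consequence under two names) can be illustrated with a small lookup. This is a minimal sketch, not the ICGC portal's actual implementation: the alias tables and function names are hypothetical, and `stop_gained` is used as the target term on the assumption that a shared vocabulary such as the Sequence Ontology is adopted.

```python
# Hypothetical alias tables for illustration only; a real submission
# pipeline would load these from a shared, versioned data dictionary.
GENE_ALIASES = {
    "ensg00000141510": "TP53",  # Ensembl gene ID for TP53
    "p53": "TP53",
    "tp53": "TP53",
}

CONSEQUENCE_ALIASES = {
    "non-sense mutation": "stop_gained",
    "nonsense mutation": "stop_gained",
    "stop-gain": "stop_gained",
    "stop_gained": "stop_gained",
}


def normalize_gene(label):
    """Return the agreed gene symbol for a submitted label, or None if unknown."""
    return GENE_ALIASES.get(label.strip().lower())


def normalize_consequence(label):
    """Return the agreed consequence term for a submitted label, or None if unknown."""
    return CONSEQUENCE_ALIASES.get(label.strip().lower())


def normalize_record(record):
    """Return a copy of a mutation record with gene and consequence normalized."""
    cleaned = dict(record)
    cleaned["gene"] = normalize_gene(record["gene"])
    cleaned["consequence"] = normalize_consequence(record["consequence"])
    return cleaned


# Two projects describing the same event in different words.
project_a = normalize_record({"gene": "ENSG00000141510", "consequence": "Non-sense mutation"})
project_b = normalize_record({"gene": "p53", "consequence": "stop-gain"})
```

After normalization both records carry the same gene symbol and consequence term, so they can be counted together in a cross-project query. Unknown labels come back as `None` rather than being guessed.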

The New ICGC Data Portal, Oct 1st, 2013

ICGC datasets to date
[Figure: ICGC Data Portal cumulative donor count for member projects, Release 7 through Release 15. X-axis: Dec 2011 to Jan 2014. Y-axis: number of donors, 1000 to 11,000.]

Open and Controlled Data Access

Data Access Compliance Office, supported by IPAC. IPAC: International Policy interoperability and data Access Clearinghouse. Provides a one-stop screening service for policy interoperability and access authorization. Operated by P3G/McGill U.

Big Data: Data sharing is severely hindered when data is huge, except for bioinformatics giants.

Storing ICGC Data in the Cloud: cancer genome data sets, access control, algorithm development, programmer APIs, data browsers, toolkits, virtual machines.

The Whole Genome Pan-Cancer Analysis Project (PCAP)
Goals: Understand what's going on in the 95% of the cancer genome that isn't protein-coding: non-coding RNAs, regulatory elements, amplifications/deletions and other structural changes.
Resources: >2000 whole genome tumor/normal pairs from ICGC.
15 working groups, 130 research subprojects.

PCAP Analytic Issues
- Calling of cancer mutations in non-coding regions is an evolving art.
- Uniform data processing and mutation calling are required in order to avoid method-specific differences.
- Many of the PCAP subprojects require access to the raw read data.
- The data set is large: 500 TB (but final ICGC data will be ~10000 TB).
Version: 26Apr2012

Six Cloud Compute Centres
- University of Chicago Bionimbus Protected Data Cloud
- DKFZ, Heidelberg
- European Bioinformatics Institute, Hinxton UK
- Barcelona Supercomputing Center
- IMSUT+RIKEN, Tokyo
- ITRI, Seoul

Phase I: Partition Data and Call Mutations
The >2000 pairs are partitioned into six batches of 330, one per centre. Each centre produces aligned genomes and mutation calls for its batch.
- University of Chicago Bionimbus Protected Data Cloud: 330
- DKFZ, Heidelberg: 330
- European Bioinformatics Institute, Hinxton UK: 330
- Barcelona Supercomputing Center: 330
- IMSUT+RIKEN, Tokyo: 330
- ITRI, Seoul: 330
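
The even split in Phase I can be sketched as follows. The slide does not say how pairs were actually assigned to centres, so the round-robin rule here is an assumption chosen for simplicity, and the pair identifiers are hypothetical placeholders.

```python
CENTRES = [
    "University of Chicago Bionimbus Protected Data Cloud",
    "DKFZ, Heidelberg",
    "European Bioinformatics Institute, Hinxton UK",
    "Barcelona Supercomputing Center",
    "IMSUT+RIKEN, Tokyo",
    "ITRI, Seoul",
]


def partition_pairs(pair_ids, centres):
    """Assign pair IDs to centres round-robin; batch sizes differ by at most one."""
    batches = {centre: [] for centre in centres}
    for index, pair_id in enumerate(pair_ids):
        centre = centres[index % len(centres)]
        batches[centre].append(pair_id)
    return batches


# Hypothetical IDs for 6 x 330 = 1980 tumor/normal pairs.
pair_ids = [f"pair_{n:04d}" for n in range(1980)]
batches = partition_pairs(pair_ids, CENTRES)
batch_sizes = {centre: len(ids) for centre, ids in batches.items()}
```

Each pair lands at exactly one centre, which matches the slide's picture of every centre aligning and calling mutations only on its own batch. A real assignment would also have to respect the legal hosting constraints on where a data set may be stored.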

Phase II: Synchronize Alignments & Mutation Calls
Aligned reads (500 TB) and mutation calls (100 GB) are synchronized across the six centres: University of Chicago Bionimbus Protected Data Cloud; DKFZ, Heidelberg; European Bioinformatics Institute, Hinxton UK; Barcelona Supercomputing Center; IMSUT+RIKEN, Tokyo; ITRI, Seoul.
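
The two volumes on this slide differ by a factor of several thousand, which is why moving aligned reads is the expensive part of synchronization. The back-of-envelope calculation below is a sketch only: the 10 Gbit/s sustained link speed is an assumption, not a figure from the talk, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
TB = 10**12  # bytes, decimal units assumed
GB = 10**9


def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized transfer time: no protocol overhead, link fully utilized."""
    return size_bytes * 8 / link_bits_per_second


aligned_reads = 500 * TB    # from the slide
mutation_calls = 100 * GB   # from the slide
link = 10 * 10**9           # assumed sustained 10 Gbit/s

size_ratio = aligned_reads // mutation_calls
reads_days = transfer_seconds(aligned_reads, link) / 86400
calls_seconds = transfer_seconds(mutation_calls, link)

print(f"aligned reads are {size_ratio}x larger than mutation calls")
print(f"aligned reads: about {reads_days:.1f} days per full copy")
print(f"mutation calls: about {calls_seconds:.0f} seconds per full copy")
```

Under these assumptions a single full copy of the aligned reads takes days, while the mutation calls move in about a minute and a half.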

Phase III: Downstream Analysis
ICGC researchers and working groups run downstream analyses against the six centres: University of Chicago Bionimbus Protected Data Cloud; DKFZ, Heidelberg; European Bioinformatics Institute, Hinxton UK; Barcelona Supercomputing Center; IMSUT+RIKEN, Tokyo; ITRI, Seoul.

Status of PCAP
Legal
- Ethics approval obtained.
- Data usage agreements signed by all data centers.
- Memorandum of understanding executed by most centers.
Technical
- OpenStack/VMware, Vagrant, GNOS, and SeqWare installed on all data centers.
- Alignment workflows successfully executed on VMs at Chicago, Hinxton and Barcelona. Same data yields same alignments!
- First 1400 genome pairs identified; will be ready for distribution to data centers by March 1.
- Another ~1000 genome pairs are in preparation.
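
One simple way to check a claim like "same data yields same alignments" is to compare checksums of each site's output. The sketch below is an illustration under stated assumptions, not the procedure the project used: it hashes raw bytes held in memory, the site outputs are placeholder byte strings, and real alignment files can carry site-specific header lines that would need to be stripped before a byte-level comparison is meaningful.

```python
import hashlib


def digest(data):
    """SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def sites_agree(outputs_by_site):
    """True when every site produced byte-identical output."""
    digests = {digest(data) for data in outputs_by_site.values()}
    return len(digests) == 1


# Placeholder outputs standing in for alignment records from three sites.
matching = {
    "Chicago": b"read1\tchr17\t7578406\n",
    "Hinxton": b"read1\tchr17\t7578406\n",
    "Barcelona": b"read1\tchr17\t7578406\n",
}
diverging = dict(matching, Barcelona=b"read1\tchr17\t7578407\n")

all_match = sites_agree(matching)
any_mismatch = not sites_agree(diverging)
```

Comparing digests rather than files means only a short string per sample has to travel between centres.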

Challenges we've encountered
Legal
- Despite the international nature of the project, regional regulations have not gone away.
- Data sets originating in the USA can only be hosted by certain US-based institutions.
- NIH has not yet approved Phase III of the project for US-originated data sets.
- Sensitivities on the part of some European countries limit the distribution of non-US data sets to non-US-based organizations (blame Snowden & NSA disclosures).
Technical
- Adapting traditional grid-based HPCs to use cloud-based technologies has been challenging but not insurmountable.

Credits: Global Alliance for Genomics and Health