Data Sharing Initiative: International Cancer Genome Consortium. Tom Hudson, MD, President and Scientific Director, Ontario Institute for Cancer Research
Sharing BIG DATA from 17 countries
ICGC Map, March 2014: 71 projects launched
ICGC data is distributed, but coordinated by OICR and accessible through common portals
Data Types Collected
- Donor clinical and demographic data
- Sample data
- Simple Somatic Mutations
- Copy Number Somatic Mutations
- Structural Somatic Mutations
- Gene Expression
- Splicing Variation
- miRNA Expression
- Methylation
- Protein Expression
Enabling: cancer pathways, new biomarkers, new targeted drugs, new diagnostic tools, precision medicine
Data is standardized across projects to enable data sharing
Without a standardized data format: TXT, XML, VCF, MAF
Without a standardized data dictionary: ENSG00000141510 = p53 = TP53? Nonsense mutation = stop-gain?
Versus the Data Portal: standardized data, de-identified clinical data
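The dictionary problem on this slide (three labels for one gene, two names for one mutation consequence) can be illustrated with a small lookup-table sketch. This is not the ICGC data dictionary: the tables `GENE_ALIASES` and `CONSEQUENCE_ALIASES`, the helper `normalize`, and the choice of `TP53` and `stop_gained` as the canonical labels are all assumptions made for illustration.

```python
# Hypothetical alias tables, keyed by lower-cased label.
# Canonical values (TP53, stop_gained) are an assumed convention.
GENE_ALIASES = {
    "ensg00000141510": "TP53",
    "p53": "TP53",
    "tp53": "TP53",
}

CONSEQUENCE_ALIASES = {
    "non-sense mutation": "stop_gained",
    "nonsense mutation": "stop_gained",
    "stop-gain": "stop_gained",
}


def normalize(label, aliases):
    """Map a submitted label to its canonical term, or fail loudly."""
    key = label.strip().lower()
    if key not in aliases:
        # Unmapped labels are flagged rather than passed through silently.
        raise ValueError(f"unmapped label: {label!r}")
    return aliases[key]


# Three submissions naming the same gene collapse to one canonical label.
genes = {normalize(g, GENE_ALIASES) for g in ["ENSG00000141510", "p53", "TP53"]}
consequences = {
    normalize(c, CONSEQUENCE_ALIASES)
    for c in ["Non-sense mutation", "stop-gain"]
}
print(genes, consequences)
```

Raising on unmapped labels is a design choice in this sketch: it surfaces vocabulary gaps at submission time instead of letting inconsistent terms reach a shared portal.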
The New ICGC Data Portal, October 1, 2013
ICGC datasets to date. [Figure: ICGC Data Portal, cumulative donor count for member projects. x-axis: months from December 2011 to January 2014, marked with Releases 7 through 15; y-axis: number of donors, 1,000 to 11,000.]
Open and Controlled Data Access
Data Access Compliance Office, supported by IPAC. IPAC (International Policy interoperability and data Access Clearinghouse) provides a one-stop screening service for policy interoperability and access authorization, and is operated by P3G/McGill University.
Big Data: data sharing is severely hindered when data is huge, except for bioinformatics giants.
Storing ICGC Data in the Cloud: cancer genome data sets, access control, algorithm development, programmer APIs, data browsers, toolkits, virtual machines
The Whole Genome Pan-Cancer Analysis Project (PCAP)
Goals: Understand what's going on in the 95% of the cancer genome that isn't protein-coding.
- Non-coding RNAs
- Regulatory elements
- Amplifications/deletions and other structural changes
Resources:
- >2000 whole genome tumor/normal pairs from ICGC
- 15 working groups
- 130 research subprojects
PCAP Analytic Issues
- Calling of cancer mutations in non-coding regions is an evolving art.
- Uniform data processing and mutation calling are required in order to avoid method-specific differences.
- Many of the PCAP subprojects require access to the raw read data.
- The data set is large: 500 TB (but final ICGC data will be ~10,000 TB).
Version: 26Apr2012
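A back-of-envelope calculation puts the sizes quoted on this slide in perspective. It assumes exactly 2000 tumor/normal pairs (the deck says ">2000", so the per-pair figure is an upper bound on the average) and decimal units (1 TB = 1000 GB); the variable and function names are illustrative only.

```python
# Figures quoted in the slides; GENOME_PAIRS is a lower bound (">2000").
PCAP_TOTAL_TB = 500
GENOME_PAIRS = 2000
FINAL_ICGC_TB = 10_000


def average_tb_per_pair(total_tb, pairs):
    """Average storage per tumor/normal pair, in TB."""
    return total_tb / pairs


per_pair_tb = average_tb_per_pair(PCAP_TOTAL_TB, GENOME_PAIRS)
per_pair_gb = per_pair_tb * 1000          # assumes 1 TB = 1000 GB
scale_factor = FINAL_ICGC_TB / PCAP_TOTAL_TB

print(f"Average per pair: {per_pair_gb:.0f} GB")
print(f"Final ICGC data is about {scale_factor:.0f}x the PCAP data set")
```

Under these assumptions each pair averages at most about 250 GB of raw read data, and the projected final ICGC holdings are about 20 times the PCAP data set.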
Six Cloud Compute Centres
- University of Chicago Bionimbus Protected Data Cloud
- DKFZ, Heidelberg
- European Bioinformatics Institute, Hinxton UK
- Barcelona Supercomputing Center
- IMSUT+RIKEN, Tokyo
- ITRI, Seoul
Phase I: Partition Data and Call Mutations
>2000 pairs are partitioned across the six centres, 330 pairs per centre; each centre produces aligned genomes and mutation calls.
- University of Chicago Bionimbus Protected Data Cloud
- DKFZ, Heidelberg
- European Bioinformatics Institute, Hinxton UK
- Barcelona Supercomputing Center
- IMSUT+RIKEN, Tokyo
- ITRI, Seoul
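The Phase I partitioning step can be sketched as an even split of genome pairs across the six centres. The slides do not say how pairs were actually allocated, so the round-robin scheme, the city labels in `CENTRES`, the `partition_pairs` helper, and the `pair_NNNN` identifiers below are all assumptions for illustration.

```python
# Short city labels standing in for the six compute centres (assumed labels).
CENTRES = ["Chicago", "Heidelberg", "Hinxton", "Barcelona", "Tokyo", "Seoul"]


def partition_pairs(pair_ids, centres):
    """Assign pairs round-robin so centre loads differ by at most one pair."""
    assignment = {centre: [] for centre in centres}
    for index, pair_id in enumerate(pair_ids):
        centre = centres[index % len(centres)]
        assignment[centre].append(pair_id)
    return assignment


# Hypothetical identifiers; 1980 = 6 x 330, matching the per-centre figure.
pair_ids = [f"pair_{i:04d}" for i in range(1980)]
assignment = partition_pairs(pair_ids, CENTRES)

for centre, pairs in assignment.items():
    print(centre, len(pairs))
```

In practice the allocation would also have to respect the hosting restrictions on where a data set may legally reside, which a plain round-robin does not model.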
Phase II: Synchronize Alignments & Mutation Calls
Aligned reads (500 TB) and mutation calls (100 GB) are synchronized across the six centres.
- University of Chicago Bionimbus Protected Data Cloud
- DKFZ, Heidelberg
- European Bioinformatics Institute, Hinxton UK
- Barcelona Supercomputing Center
- IMSUT+RIKEN, Tokyo
- ITRI, Seoul
Phase III: Downstream Analysis
ICGC researchers and working groups carry out downstream analysis at the six centres.
- University of Chicago Bionimbus Protected Data Cloud
- DKFZ, Heidelberg
- European Bioinformatics Institute, Hinxton UK
- Barcelona Supercomputing Center
- IMSUT+RIKEN, Tokyo
- ITRI, Seoul
Status of PCAP
Legal
- Ethics approval obtained
- Data usage agreements signed by all data centers
- Memorandum of understanding executed by most centers
Technical
- OpenStack/VMware, Vagrant, GNOS, and SeqWare installed at all data centers
- Alignment workflows successfully executed on VMs at Chicago, Hinxton and Barcelona
- Same data yields same alignments!
- First 1400 genome pairs identified; will be ready for distribution to data centers by March 1.
- Another ~1000 genome pairs are in preparation.
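The "same data yields same alignments" check can be sketched as a digest comparison across centres. The slides do not describe how the comparison was actually done, so this is only an assumed approach: the helper names and the placeholder byte strings are hypothetical, and real alignment files may embed run-specific header information, in which case records rather than raw bytes would need to be compared.

```python
import hashlib
from typing import Dict


def sha256_digest(data: bytes) -> str:
    """Hex SHA-256 digest of a workflow output."""
    return hashlib.sha256(data).hexdigest()


def outputs_agree(outputs_by_centre: Dict[str, bytes]) -> bool:
    """True when every centre produced byte-identical output."""
    digests = {sha256_digest(payload) for payload in outputs_by_centre.values()}
    return len(digests) == 1


# Placeholder payloads standing in for alignment output from each centre.
matching = {
    "Chicago": b"read1\tchr17\t100\n",
    "Hinxton": b"read1\tchr17\t100\n",
    "Barcelona": b"read1\tchr17\t100\n",
}
diverging = dict(matching, Barcelona=b"read1\tchr17\t101\n")

print(outputs_agree(matching))
print(outputs_agree(diverging))
```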
Challenges we've encountered
Legal
- Despite the international nature of the project, regional regulations have not gone away.
- Data sets originating in the USA can only be hosted by certain US-based institutions.
- NIH has not yet approved phase III of the project for US-originated data sets.
- Sensitivities on the part of some European countries limit the distribution of non-US data sets to non-US-based organizations (blame Snowden & NSA disclosures).
Technical
- Adapting traditional grid-based HPCs to use cloud-based technologies has been challenging but not insurmountable.
Credits Global Alliance for Genomics and Health