Big Data, Big Challenges. DeIC Conference 2013. Michael Sullivan, M.D.
Big Data: Variety, Volume, Visualization, Velocity
Variety. Roger Ebert, 1942-2013. Roger was diagnosed with cancer of a salivary gland in 2002 and died in 2013. "What I believe is that all clear-minded people should remain two things throughout their lifetimes: curious and teachable."
Testing for Cancer Today. Diagnosis: Sublingual Adenocarcinoma (rare). Standard treatment: None. Tissue biopsies showed a rare cancer. Serial CT scans of the lung showed disease progression.
Cancer Diagnosis and Treatment in the Future Central Dogma of Molecular Biology
DNA Sequencing Results. This Circos plot shows gene expression before (T1) and after (T2) treatment.
Signaling Network. 3D RET protein structure. Disrupted signaling pathways drive tumor proliferation.
Heterogeneous Data
NIH/NCI Cancer Genomics Cloud Initiative
- The Cloud: move the compute, not the data.
- 3 pilot centers, possibly CGHub, BioNimbus, and Broad.
- Preloaded with tools and data (TCGA will have 2.5 PB).
- Assumes existing infrastructure base.
- Estimated $5 million per site per year.
Trans-NIH: BD2K and Infrastructure Plus
- Big Data to Knowledge (BD2K) has 4 parts:
  - Facilitating Broad Use of Data (catalog, metadata)
  - Analysis Methods and Software (access to HPC)
  - Enhancing Training (data science)
  - Centers of Excellence (6-8 centers)
- Infrastructure Plus: intramural (campus) upgrade
Volume (May 5, 2013)
mega = 10^6, giga = 10^9, tera = 10^12, peta = 10^15, exa = 10^18, zetta = 10^21, yotta = 10^24
googol = 10^100, googolplex = 10^googol
Growth of Storage at NCBI (2002-2013)
NCBI Outbound Data (TB/Month), February 2009 to August 2013. [Chart: total monthly outbound data; vertical axis from 0 to 1,200, labeled in trillions.]
Information Content: u = log2(n)
In silico: n = 2, u = 1; 1 byte = 8 bits
In vivo: n = 4, u = 2; 1 byte = 4 base pairs
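The relation u = log2(n) on this slide can be checked numerically. The following is a minimal Python sketch, not part of the original talk; the function names `bits_per_symbol` and `symbols_per_byte` are illustrative, and it assumes all n symbols are equally likely.

```python
import math


def bits_per_symbol(n: int) -> float:
    """Information content u = log2(n) for an alphabet of n equally likely symbols."""
    if n < 2:
        raise ValueError("alphabet size must be at least 2")
    return math.log2(n)


def symbols_per_byte(n: int) -> float:
    """Number of symbols from an n-letter alphabet that fit in one 8-bit byte."""
    return 8 / bits_per_symbol(n)


if __name__ == "__main__":
    # Binary digits (in silico) versus the four DNA bases (in vivo).
    for label, n in [("in silico (0/1)", 2), ("in vivo (A/C/G/T)", 4)]:
        print(f"{label}: n = {n}, u = {bits_per_symbol(n):.0f}, "
              f"{symbols_per_byte(n):.0f} symbols per byte")
```

With n = 4 the sketch gives u = 2 and 4 symbols per byte, which matches the slide's "1 byte = 4 base pairs".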
Storage in silico vs. in vivo
NCBI or EBI: 20 Petabytes. Human: 160 Zettabytes*. Ratio: 1 : 8,000,000
* Calculation: (6.4 Gbp / cell) × (1 Byte / 4 bp) × (10 microbe bp / 1 human bp) × (10^13 cells / person) = 160 ZB / person
GB = Gigabyte (10^9), PB = Petabyte (10^15), ZB = Zettabyte (10^21), bp = base pair, Gbp = Gigabase pair (10^9). Assume all cells are diploid.
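The footnote calculation and the 1 : 8,000,000 ratio can be reproduced with exact integer arithmetic. This is a small sketch, not from the original talk: every input value comes from the slide, while the constant and function names are illustrative.

```python
# Inputs taken from the slide; the names are illustrative.
HUMAN_BP_PER_CELL = 6_400_000_000   # 6.4 Gbp per diploid cell
BP_PER_BYTE = 4                     # 2 bits per base pair, 8 bits per byte
MICROBE_BP_PER_HUMAN_BP = 10        # microbial multiplier used on the slide
CELLS_PER_PERSON = 10**13
ARCHIVE_BYTES = 20 * 10**15         # 20 PB (NCBI or EBI)
ZB = 10**21                         # bytes per zettabyte


def in_vivo_bytes_per_person() -> int:
    """Bytes needed to store all DNA carried by one person, per the slide's assumptions."""
    bytes_per_cell = HUMAN_BP_PER_CELL // BP_PER_BYTE
    return bytes_per_cell * MICROBE_BP_PER_HUMAN_BP * CELLS_PER_PERSON


if __name__ == "__main__":
    person = in_vivo_bytes_per_person()
    print(person // ZB, "ZB per person")                    # 160 ZB per person
    print("ratio 1 :", f"{person // ARCHIVE_BYTES:,}")      # ratio 1 : 8,000,000
```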
Visualization. Physics: LHC Lead Ion Collision (Source: CERN, ALICE detector). Life Sciences: MRI Monkey Brain (Source: Van Wedeen, M.D., Martinos Center and Dept. of Radiology, Massachusetts General Hospital and Harvard University Medical School). 10/1/13, 2012 Internet2
Circos Mapping of Dog and Human Genes. Human: 23 chromosomes, 3.1 Gb. Dog: 39 chromosomes, 2.4 Gb.
Human-Dog Synteny by Dog Chromosome
Protein-Protein Interactions
The Human Connectome
Velocity
White House Champions of Open Science: David Lipman, NCBI; Atul Butte, Stanford University; David Altshuler, Broad Institute; Stephen Friend, Sage Bionetworks
Science DMZ http://fasterdata.es.net/science-dmz/science-dmz-security/
Growth of perfSONAR: ~30 Countries, ~200 Domains, ~850 Instances
NCBI-EBI Performance. Problem: 10 Mb stream (expected 500 Mb). Solution: perfSONAR at the endpoints localized the problem.
US-China 10 Gbps Link. Dr. Lin Fang, Dr. Dawei Lin. Transfer of Sample.fa (24 GB): FedEx: 2 days. Internet + FTP: 26 hours. China-US 10G Link: 30 seconds.
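The three transfer times on this slide imply very different effective throughputs for the same 24 GB file. The sketch below is not from the original talk; it assumes 1 GB = 10^9 bytes and treats each quoted time as the full end-to-end duration. The function and dictionary names are illustrative.

```python
FILE_BYTES = 24 * 10**9  # Sample.fa, 24 GB (assuming 1 GB = 10^9 bytes)

# Durations quoted on the slide, converted to seconds.
OBSERVED_SECONDS = {
    "FedEx": 2 * 24 * 3600,
    "Internet + FTP": 26 * 3600,
    "China-US 10G link": 30,
}


def effective_mbps(num_bytes: int, seconds: float) -> float:
    """Average throughput in megabits per second for a completed transfer."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return num_bytes * 8 / seconds / 1e6


def ideal_seconds(num_bytes: int, link_gbps: float) -> float:
    """Transfer time if the link ran at its full nominal rate with no overhead."""
    if link_gbps <= 0:
        raise ValueError("link rate must be positive")
    return num_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    for method, seconds in OBSERVED_SECONDS.items():
        print(f"{method}: {effective_mbps(FILE_BYTES, seconds):,.2f} Mb/s effective")
    print(f"Ideal time at 10 Gb/s: {ideal_seconds(FILE_BYTES, 10):.1f} s")
```

Under these assumptions the 30-second transfer averages about 6.4 Gb/s, against a theoretical floor of 19.2 seconds at the full 10 Gb/s, while the 26-hour FTP transfer averages roughly 2 Mb/s.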
Scalability. Galaxy: 300 million light years. Neuron: ~0.00001 meter.
NIST Big Data Reference Architecture (NIST RA M0226v5)
[Diagram labels: System Orchestrator; Data Provider; Application Provider (Collection, Curation, Analytics, Visualization, Access); Data Consumer; IT Provider with Big Data Processing Frameworks (analytic tools, etc.), Big Data Platforms (databases, etc.), and Infrastructures (VM clusters), each horizontally or vertically scalable; Physical and Virtual Resources (Networking, Computing, etc.); Security & Privacy; Management. Axes: Information Value Chain and IT Value Chain. Flows are marked DATA and SW.]
Source: NIST Big Data WG / Ref Arch Subgroup, 09/04/2013
Internet2 Network Infrastructure Topology
Innovation Platform: 100 GigE Layer 2 Connection, Science DMZ, Software Defined Networking
[Diagram labels: Internet2 innovation backbone delivered as 100G L1; High-Performance Layer 2/3 Switch/Router; SDN Control Server; Performance Node; switches and data stores for data-intensive science; R&E IP and TR-CPS IP Network (Layer 3); Your Research; Static Layer 2; Dynamic Layer 2; GENI Experiments; traditional regional and commodity providers; Traditional Campus Border Router; Traditional L3 Campus Border Security; Campus Enterprise Network. Stack labels: Traditional Services, Traditional Switch Substrate, Optical System, Dark Fiber, Innovation Services, Software Defined Networking Substrate.]
For more information, see fasterdata.es.net. www.internet2.edu
Condo of Condos
Democratization of Sequencing. 2,386 Genome Sequencers Worldwide (30 May 2013). Source: Map of High-throughput Sequencers
National Cyberinfrastructure. XSEDE: NSF-funded supercomputers and HPC resources. Internet2: 250 universities. [Diagram labels: XSEDEnet, NCGAS, Indiana University, TACC, SDSC, PSC.] Source: https://www.xsede.org/networking
NCGAS Virtual Instrument
[Diagram labels: Indiana University; TACC; SDSC; PSC; NCBI; Sequencing Center; storage of 6 PB, 5 PB, 5.5 PB, and 4 PB; NCGAS Galaxy Portal; POD Galaxy Portal; Mason; POD; D.C.; 100 Gig Internet2; 10 Gig NLR; NSF-Funded or XSEDE Allocation; Federally Funded.]
Source: Barnett, W.K., and R.D. LeDuc, Next Generation Cyberinfrastructures for Next Generation Sequencing and Genome Science, presented at 2013 AAMC GIR Conference, Vancouver, BC
Apache Hadoop Ecosystem
AWS Cluster Pricing
HPC and Cloud Convergence?
msullivan@internet2.edu OpenStack