Big Data, Big Challenges




Big Data, Big Challenges. DeIC Conference 2013. Michael Sullivan, M.D.

Big Data: Variety, Volume, Visualization, Velocity

Variety. Roger Ebert, 1942-2013. Roger was diagnosed with cancer of a salivary gland in 2002 and died in 2013. "What I believe is that all clear-minded people should remain two things throughout their lifetimes: curious and teachable."

Testing for Cancer Today. Diagnosis: sublingual adenocarcinoma (rare). Standard treatment: none. Tissue biopsies showed a rare cancer. Serial CT scans of the lung showed disease progression.

Cancer Diagnosis and Treatment in the Future: Central Dogma of Molecular Biology

DNA Sequencing Results. This Circos plot shows gene expression before (T1) and after (T2) treatment.

Signaling Network. 3D RET protein structure. Disrupted signaling pathways drive tumor proliferation.

Heterogeneous Data

NIH/NCI Cancer Genomics Cloud Initiative
- The Cloud: move the compute, not the data.
- 3 pilot centers: maybe CGHub, BioNimbus, and Broad.
- Preloaded with tools and data (TCGA will have 2.5 PB).
- Assumes existing infrastructure base.
- Estimated $5 million per site per year.
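To make "move the compute, not the data" concrete, here is a minimal sketch that estimates how long it would take to copy the 2.5 PB TCGA data set over a single network link. The link speeds and the assumption of a fully utilized link are illustrative choices, not figures from the slide; `transfer_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400
PB = 10**15  # petabyte in bytes (decimal prefix)


def transfer_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move size_bytes over a link running at its full rate."""
    seconds = size_bytes * 8 / link_bits_per_s
    return seconds / SECONDS_PER_DAY


TCGA_BYTES = 2.5 * PB  # data set size quoted on the slide

# Hypothetical link speeds, chosen only for illustration.
for gbps in (1, 10, 100):
    days = transfer_days(TCGA_BYTES, gbps * 1e9)
    print(f"{gbps:>3} Gbps: {days:7.1f} days")
```

Even under these optimistic assumptions the copy takes days to months per site, which is the motivation for running analyses where the data already lives.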

Trans-NIH: BD2K and Infrastructure Plus
- Big Data to Knowledge (BD2K) has 4 parts:
  - Facilitating Broad Use of Data (catalog, metadata)
  - Analysis Methods and Software (access to HPC)
  - Enhancing Training (data science)
  - Centers of Excellence (6-8 centers)
- Infrastructure Plus: intramural (campus) upgrade

Volume (May 5, 2013)
mega = 10^6, giga = 10^9, tera = 10^12, peta = 10^15, exa = 10^18, zetta = 10^21, yotta = 10^24, googol = 10^100, googolplex = 10^googol

Growth of Storage at NCBI (2002-2013)

[Figure: NCBI Outbound Data (TB/Month), monthly totals, 2009-2013; y-axis in trillions, 0 to 1,200]

Information Content: u = log2(n)
In silico: n = 2, u = 1; 1 byte = 8 bits.
In vivo: n = 4, u = 2; 1 byte = 4 base pairs.
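The relation u = log2(n) can be checked with a short sketch. It computes the bits per symbol for a binary and a four-letter alphabet, then packs a DNA string at 2 bits per base so that 4 base pairs fit in 1 byte. The A/C/G/T-to-code mapping and the function names are assumptions made for illustration.

```python
import math


def bits_per_symbol(alphabet_size: int) -> float:
    """Information content u = log2(n) for an alphabet of n symbols."""
    return math.log2(alphabet_size)


# Hypothetical 2-bit encoding; any fixed assignment of the 4 bases works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte holds 4 slots.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


print(bits_per_symbol(2))          # binary digit
print(bits_per_symbol(4))          # nucleotide
print(len(pack_bases("ACGTACGT")))  # bytes used for 8 bases
```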

Storage in silico vs. in vivo
NCBI or EBI: 20 petabytes. Human: 160 zettabytes*. Ratio: 1 : 8,000,000.
* Calculation: (6.4 Gbp / cell) x (1 byte / 4 bp) x (10 microbe bp / 1 human bp) x (10^13 cells / person) = 160 ZB / person.
GB = gigabyte (10^9), PB = petabyte (10^15), ZB = zettabyte (10^21), bp = base pair, Gbp = gigabase pair (10^9). Assume all cells are diploid.
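The footnote calculation can be reproduced step by step. This sketch uses only the figures given on the slide; the variable names are illustrative.

```python
# Figures taken from the slide's footnote.
BP_PER_CELL = 6.4e9            # 6.4 Gbp per diploid human cell
BP_PER_BYTE = 4                # 2 bits per base pair
MICROBE_BP_PER_HUMAN_BP = 10   # microbial DNA carried per unit of human DNA
CELLS_PER_PERSON = 1e13

ZB = 1e21  # zettabyte
PB = 1e15  # petabyte

bytes_per_person = (
    BP_PER_CELL / BP_PER_BYTE * MICROBE_BP_PER_HUMAN_BP * CELLS_PER_PERSON
)
archive_bytes = 20 * PB  # NCBI or EBI, per the slide

print(f"in vivo storage per person: {bytes_per_person / ZB:.0f} ZB")
print(f"ratio in silico : in vivo = 1 : {bytes_per_person / archive_bytes:,.0f}")
```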

Visualization
Physics: LHC lead ion collision. Source: CERN (ALICE detector).
Life Sciences: MRI of monkey brain. Source: Van Wedeen, M.D., Martinos Center and Dept. of Radiology, Massachusetts General Hospital and Harvard University Medical School.
10/1/13, 2012 Internet2

Circos Mapping of Dog and Human Genes. Human: 23 chromosomes, 3.1 Gb. Dog: 39 chromosomes, 2.4 Gb.

Human-Dog Synteny by Dog Chromosome

Protein-Protein Interactions

The Human Connectome

Velocity

White House Champions of Open Science
- David Lipman, NCBI
- Atul Butte, Stanford University
- David Altshuler, Broad Institute
- Stephen Friend, Sage Bionetworks

Science DMZ. http://fasterdata.es.net/science-dmz/science-dmz-security/

Growth of perfSONAR: ~30 countries, ~200 domains, ~850 instances

NCBI-EBI Performance. Problem: 10 Mb stream (expected 500 Mb). Solution: perfSONAR at the endpoints localized the problem.

US-China 10 Gbps Link. Dr. Lin Fang and Dr. Dawei Lin. File: Sample.fa (24 GB).
- FedEx: 2 days
- Internet + FTP: 26 hours
- China-US 10G link: 30 seconds
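The three transfer times translate into very different effective throughputs. The following sketch works them out, assuming 1 GB = 10^9 bytes (as defined elsewhere in the deck) and treating the quoted durations as end-to-end times; `effective_mbps` is a hypothetical helper name.

```python
FILE_BYTES = 24e9  # Sample.fa, 24 GB with GB = 10^9 bytes

# Durations from the slide, converted to seconds.
DURATIONS_S = {
    "FedEx": 2 * 24 * 3600,
    "Internet + FTP": 26 * 3600,
    "China-US 10G link": 30,
}


def effective_mbps(size_bytes: float, seconds: float) -> float:
    """Average throughput in megabits per second for one transfer."""
    return size_bytes * 8 / seconds / 1e6


for method, seconds in DURATIONS_S.items():
    rate = effective_mbps(FILE_BYTES, seconds)
    print(f"{method:<18} {rate:10.2f} Mbps")
```

Under these assumptions the 30-second transfer averages 6.4 Gbps, roughly two thirds of the link's nominal 10 Gbps.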

Scalability. Galaxy: 300 million light years. Neuron: ~0.00001 meter.

NIST Big Data Reference Architecture (NIST RA M0226v5; NIST Big Data WG / Ref Arch Subgroup, 09/04/2013)
[Diagram] Axes: Information Value Chain and IT Value Chain.
Roles: System Orchestrator; Data Provider; Application Provider (Collection, Curation, Analytics, Visualization, Access); IT Provider; Data Consumer.
IT Provider layers: Big Data Processing Frameworks (analytic tools, etc.); Big Data Platforms (databases, etc.); Infrastructures (VM clusters); each horizontally or vertically scalable, over Physical and Virtual Resources (networking, computing, etc.).
Cross-cutting: Security & Privacy; Management. Flows between roles are labeled DATA and SW.

Internet2 Network Infrastructure Topology

Innovation Platform: 100 GigE Layer 2 Connection, Science DMZ, Software Defined Networking
[Diagram labels] Internet2 innovation backbone delivered as 100G L1; high-performance Layer 2/3 switch/router; SDN control server; performance node; switches and data stores for data-intensive science; R&E IP and TR-CPS IP network (Layer 3); your research; static Layer 2; dynamic Layer 2; GENI experiments; traditional regional and commodity providers; traditional campus border router; traditional L3 campus border security; campus enterprise network.
Substrate: traditional services on a traditional switch substrate; innovation services on a Software Defined Networking substrate; optical system; dark fiber.
For more information, see fasterdata.es.net. www.internet2.edu

Condo of Condos

Democratization of Sequencing. 2,386 genome sequencers worldwide (30 May 2013). Source: Map of High-throughput Sequencers.

National Cyberinfrastructure. XSEDE: NSF-funded supercomputers, HPC resources. Internet2: 250 universities. Diagram labels: XSEDEnet, NCGAS, Indiana University, TACC, SDSC, PSC. Source: https://www.xsede.org/networking

NCGAS Virtual Instrument
[Diagram labels] Indiana University; TACC; SDSC; PSC; NCGAS Galaxy Portal; POD Galaxy Portal; Mason; POD; storage of 6 PB, 5 PB, 5.5 PB, and 4 PB; D.C.; 100 Gig Internet2; 10 Gig NLR; Sequencing Center; NCBI. Legend: NSF-funded or XSEDE allocation; federally funded.
Source: Barnett, W.K., and R.D. LeDuc, "Next Generation Cyberinfrastructures for Next Generation Sequencing and Genome Science," presented at 2013 AAMC GIR Conference, Vancouver, BC.

Apache Hadoop Ecosystem

AWS Cluster Pricing

HPC and Cloud Convergence?

msullivan@internet2.edu OpenStack