1 Big Data Challenges in Bioinformatics. Barcelona Supercomputing Center, Computer Science Department, Autonomic Systems and eBusiness Platforms. Jordi Torres
2 Talk outline! We talk about petabytes?
3 Deluge of Data. Data is now considered the Fourth Paradigm in Science; the first three paradigms were experimental, theoretical, and computational science. This shift is being driven by the rapid growth in data coming from improvements in scientific instruments.
4 Scientific instruments. Physics: the Large Hadron Collider produces around 15 petabytes of data per year. Astronomy: the Large Synoptic Survey Telescope is anticipated to produce around 10 petabytes per year.
5 Example: In Genome Research. Chart: cost of sequencing a human-sized genome, September 2001 through September 2013. Source: National Human Genome Research Institute (NHGRI)
6 Data Deluge: driven by changes in how big data is generated. Example: biomedicine. Image source: Big Data in biomedicine, Drug Discovery Today, Fabricio F. Costa (2013)
7 Important open issues. Transfer of data from one location to another (*): shipping external hard disks; processing the data while it is being transferred. Future? Data won't be moved! (*) Out of scope of this presentation
8 Important open issues. Security and privacy of the data from individuals (*): the same problems that appear in other areas; use advanced encryption algorithms. (*) Out of scope of this presentation
9 Important open issues. Increased need to store data (*): cloud-based computing solutions have emerged.
10 Important open issues. Increased need to store data (*): cloud-based computing solutions have emerged, but the most common cloud computing inhibitors should be tackled: security, privacy, lack of standards, data integrity, regulatory issues, data recovery, control, vendor maturity...
11 The most critical open issue: DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*). (*) Big Data definition?
12 The most critical open issue: DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*). (*) Big Data definition?
13 The most critical open issue: DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*). Image source: cetemma, Mataró. (*) Big Data definition?
14 The most critical open issue: DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*). Information alone is non-actionable knowledge. (*) Big Data definition?
15 What is the usefulness of Big Data? Performing predictions of outcomes and behaviors. (Diagram: as data is distilled into information and then knowledge, volume decreases and value increases.) Approach: machine learning works in the sense that these methods detect subtle structure in data relatively easily, without having to make strong assumptions about the parameters of distributions.
16 Data Analytics: to extract knowledge. Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, to reveal relationships and dependencies, and to perform predictions of outcomes and behaviors. (*) Wikipedia. (Out of scope of this presentation) :-(
17 Talk outline! We talk about petabytes?
18 Challenges related to us? The big data problem: in the end, it is a computing challenge.
19 Computing Challenge. Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers. Example: a de novo assembly algorithm for DNA data finds reads whose sequences overlap and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
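The overlap step described above can be illustrated with a toy sketch. This is illustrative only: real assemblers use k-mer indexes and succinct graph representations rather than this naive O(n²) all-pairs scan, and the read strings below are made up.

```python
from collections import defaultdict

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` matching a prefix of `b`
    (at least `min_len` bases), or 0 if no such overlap exists."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)   # candidate suffix position
        if start == -1:
            return 0
        if b.startswith(a[start:]):          # full suffix/prefix match?
            return len(a) - start
        start += 1

def assembly_graph(reads, min_len=3):
    """Naive overlap graph: edge a -> b weighted by overlap length."""
    graph = defaultdict(dict)
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b, min_len)
                if olen > 0:
                    graph[a][b] = olen
    return graph

reads = ["AGATTC", "ATTCGG", "TCGGAA"]       # made-up 6-base reads
g = assembly_graph(reads)
# "AGATTC" -> "ATTCGG" with weight 4 (shared "ATTC"),
# "ATTCGG" -> "TCGGAA" with weight 4 (shared "TCGG")
```

Walking the heaviest edges of such a graph yields the assembled sequence; the memory problem on the slide comes from the fact that, for billions of reads, even the edge list alone no longer fits on one machine.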
20 What does life science research do at BSC? Pipeline schema (diagram): step 0, receive the data: raw genome data (~120 GB) placed in common storage; subsequent steps run on MareNostrum and the Altix / CLL cluster, taking roughly 50-60 h on 1-10 CPUs.
21 The data deluge also appears in genomics. The DNA data deluge comes from thousands of sources: more than 2,000 sequencing instruments around the world produce more than 15 petabytes of genetic data per year. And soon, tens of thousands of sources! Image source: https://share.sandia.gov/news/resources/news_releases/images/2009/biofuel_genes.jpg
22 The total computing burden is growing. DNA sequencing is on the path to becoming an everyday tool in medicine. Computing, not sequencing, is now the slower and more costly aspect of genomics research.
23 How can we help at BSC? Something must be done now, or else we'll need to put vital research on hold while the necessary computational techniques catch up or are invented. What is clear is that it will involve both better algorithms and a renewed focus on big data approaches to managing and processing data. How? By doing outstanding research to speed up this process.
24 What is the time required to retrieve information? 1 petabyte = 1,000 × 1 terabyte.
25 What is the time required to retrieve information? Assume a disk throughput of 100 MB/sec.
26 What is the time required to retrieve information? Scanning 1 terabyte: more than 5 hours.
27 What is the time required to retrieve information? Scanning 1 petabyte: more than 5,000 hours.
28 Solution? Massive parallelism, not only in computation but also in storage. Assume enough disks working in parallel: scanning 1 TB takes about 1 second. Source: http://www.google.com/about/datacenters/gallery/images/_2000/idi_018.jpg
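The arithmetic behind these slides is easy to check. The sketch below assumes decimal units and the 100 MB/s figure from the earlier slide; the slides' somewhat larger times presumably allow for effective throughput below the raw sequential peak.

```python
TB = 10**12          # bytes, decimal units
PB = 10**15
peak = 100 * 10**6   # 100 MB/s sequential disk throughput (slide's assumption)

seconds_per_tb = TB / peak        # 10,000 s ≈ 2.8 h at the raw peak;
                                  # real scans with seeks run slower
hours_per_pb = PB / peak / 3600   # ≈ 2,778 h, i.e. months, for one disk
disks_for_one_second = TB // peak # 10,000 disks at 100 MB/s each
                                  # to scan 1 TB in about a second
```

The last line is the whole argument for massive storage parallelism: no single spindle gets faster enough to close a 1,000× gap, so the only way down from hours to seconds is to spread the bytes across thousands of disks and read them all at once.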
29 Talk outline! We talk about petabytes?
30 Research in Big Data at BSC. To support this massive data parallelism and distribution, it is necessary to redefine and improve: data processing across hundreds of thousands of servers; data management across hundreds of thousands of data devices; dealing with a new system data infrastructure.
31 How? How do companies like Google read and process data from disks in parallel? Source: http://www.google.com/about/datacenters/gallery/images/_2000/idi_018.jpg
32 GOOGLE: a new programming model to meet the challenges: MapReduce, a programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance). An ecosystem of big data processing tools has grown around it: open source, distributed, and running on commodity hardware.
33 MapReduce: some details. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. Two phases take input data to output data: a map phase (mappers) followed by a reduce phase (reducers).
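The two phases can be sketched in a single process with the canonical word-count example. The `mapreduce` driver below only simulates what a cluster framework does for real: in production, each mapper runs on its own partition of the data and the "shuffle" moves intermediate pairs across the network.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc):
    """Map phase: emit a (word, 1) pair for every word in a document."""
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce phase: sum all partial counts for one key."""
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    """Single-process simulation of a MapReduce run."""
    # Map: one call per input record (per node, in a real cluster)
    intermediate = [kv for doc in inputs for kv in map_fn(doc)]
    # Shuffle: group all intermediate values by key (framework's job)
    intermediate.sort(key=itemgetter(0))
    result = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            result[k] = v
    return result

docs = ["big data big compute", "data deluge"]
print(mapreduce(docs, map_fn, reduce_fn))
# {'big': 2, 'compute': 1, 'data': 2, 'deluge': 1}
```

Because each `map_fn` call touches only its own record and each `reduce_fn` call only its own key, the framework can restart any failed piece independently, which is where the fault tolerance mentioned on the previous slide comes from.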
34 Limitations of MapReduce as a programming model? MapReduce is great, but not everyone is a MapReduce expert ("I am a Python expert, but..."). There is also a class of algorithms that cannot be efficiently implemented with the MapReduce programming model. Different programming models deal with different challenges. Example: PyCOMPSs, built on OmpSs/COMPSs, from BSC.
35 Big Data resource management. Big Data characteristics: volume, variety, velocity. Requirements from the data store: scalability, schema-less data, relaxed consistency and capacity to digest. Relational databases are not suitable for Big Data problems: lack of horizontal scalability; complex structures to express complex relationships; hard consistency. Non-relational databases (NoSQL) are the alternative data store, relaxing consistency → eventual consistency.
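Eventual consistency can be illustrated with a toy last-write-wins replica pair. This is a sketch, not how any particular NoSQL store is implemented: production systems add quorum reads, hinted handoff and continuous background anti-entropy, and the class and key names below are invented for the example.

```python
import time

class Replica:
    """One node of a toy NoSQL store: a last-write-wins key/value map."""
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, value, ts=None):
        ts = ts if ts is not None else time.time()
        current = self.data.get(key)
        if current is None or ts > current[0]:  # keep only the newest write
            self.data[key] = (ts, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(a, b):
    """Background sync: exchange entries so both replicas converge."""
    for key, (ts, value) in list(a.data.items()):
        b.put(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.put(key, value, ts)

r1, r2 = Replica(), Replica()
r1.put("gene", "BRCA1")    # write accepted by one replica only
stale = r2.get("gene")     # None: r2 has not yet seen the write
anti_entropy(r1, r2)       # replicas converge "eventually"
fresh = r2.get("gene")     # now "BRCA1" on both replicas
```

The window between the write and the sync, where `r2` still answers with stale data, is exactly the consistency that NoSQL stores relax in exchange for horizontal scalability.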
36 General view of NoSQL storage (and replication)
37 Big Data resource management: open issues in NoSQL. Query performance depends heavily on the data model, yet these stores are designed to support many concurrent short queries. → Solutions: automatic configuration, query planning and data organization (diagram: a client's query-driven query plan selected across alternative data models). BSC Aeneas: https://github.com/cugni/aeneas
38 New system data infrastructure required. Example: the computer systems currently available at genomics research institutions are commonly designed to run general computational jobs, where traditionally the limiting resource is CPU. We also find a large common storage space shared by all nodes.
39 Example: computers in use for bioinformatics jobs. The typical mix of such computer systems and common bioinformatics applications leads to bottlenecks and underutilization problems. First approach: a Big Data rack architecture, shared-nothing (figure: an EDW appliance as a loosely coupled, shared-nothing MPP architecture).
40 Storage technology: non-volatile memory evolution. Evolution of flash adoption: flash + disk → flash as disk → flash as memory. SNIA NVM Summit, April 28. (*) HDD is about 100× cheaper than RAM, but 1,000 times slower.
41 Example: computers in use for bioinformatics jobs. Jobs are responsible for managing the input data: partitioning, organisation and merging of intermediate results, so large parts of the code are not functional but housekeeping tasks. Solutions: active storage strategies leveraging high-performance in-memory key/value databases to accelerate data-intensive tasks (diagram: dense compute fabric → active storage fabric → archival storage on disk/tape).
42 Important: remote nodes have gotten closer. Interconnects have become much faster: an InfiniBand latency of 2,000 ns is only 20× slower than RAM and 100× faster than SSD.
43 Conclusion: paradigm shift. Old compute-centric model (manycore, FPGA) → new data-centric model (massive parallelism; persistent memory: flash, phase change). Source: Heiko Joerg
44 Conclusions: how can we help? How can IT researchers help scientists like you cope with the onslaught of data? This is a crucial question, and there is no definitive answer yet. What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as data infrastructure, data management and data processing.
45 Questions & Answers. Over to you, what do you think? Thank you for your attention! - Jordi
46 Thank you to
47 More information. Updated information will be posted at