1 Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and e-Business Platforms Jordi Torres
2 Talk outline! We talk about Petabytes?
3 Deluge of Data: Data is now considered the Fourth Paradigm in Science; the first three paradigms were experimental, theoretical, and computational science. This shift is being driven by the rapid growth of data coming from improvements in scientific instruments.
4 Scientific instruments: In Physics, the Large Hadron Collider produces around 15 petabytes of data per year. In Astronomy, the Large Synoptic Survey Telescope is anticipated to produce around 10 petabytes per year.
5 Example: In Genome Research? [Chart: the cost of sequencing a human-sized genome, in dollars, from Sep 2001 to 2013.] Source: National Human Genome Research Institute (NHGRI)
6 Data Deluge: driven by changes in how big data is generated. Example: Biomedicine. Image source: Big Data in biomedicine, Drug Discovery Today, Fabricio F. Costa (2013)
7 Important open issues: Transfer of data from one location to another (*): shipping external hard disks; processing the data while it is being transferred. Future? Data won't be moved! (*) Out of scope of this presentation
8 Important open issues: Security and privacy of the data from individuals (*): the same problems that appear in other areas; use advanced encryption algorithms. (*) Out of scope of this presentation
9 Important open issues: Increased need to store data (*): Cloud-based computing solutions have emerged.
10 Important open issues: Increased need to store data (*): Cloud-based computing solutions have emerged. The most common Cloud Computing inhibitors should be tackled: Security, Privacy, Lack of Standards, Data Integrity, Regulatory, Data Recovery, Control, Vendor Maturity, ...
11 The most critical open issue: DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*). (*) Big Data definition?
14 The most critical open issue: DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*). Information on its own is not actionable knowledge. (*) Big Data definition?
15 What is the usefulness of Big Data? It performs predictions of outcomes and behaviors. [Diagram: value grows as volume shrinks along the Data → Information → Knowledge pyramid.] Approach: Machine Learning works, in the sense that these methods detect subtle structure in data relatively easily, without having to make strong assumptions about the parameters of distributions.
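A minimal sketch of that claim, assuming scikit-learn and NumPy are available (synthetic data, hypothetical example): a tree ensemble picks up nonlinear structure that a model with strong parametric assumptions (plain linear regression) misses.

# Fit the same nonlinear signal with a parametric and a non-parametric model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)   # subtle nonlinear structure

linear = LinearRegression().fit(X[:1500], y[:1500])
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[:1500], y[:1500])

print("linear R^2:", linear.score(X[1500:], y[1500:]))   # poor: wrong model assumptions
print("forest R^2:", forest.score(X[1500:], y[1500:]))   # high: structure detected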
16 Data Analytics: To extract knowledge. "Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, to reveal relationships and dependencies, and to perform predictions of outcomes and behaviors." (*) Wikipedia. (Out of scope of this presentation) :-(
17 Talk outline! We talk about Petabytes?
18 Challenges related to us? The big data problem: in the end, it is a Computing Challenge.
19 Computing Challenge: Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers. Example: A de novo assembly algorithm for DNA data finds reads whose sequences overlap and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
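To make the assembly-graph idea concrete, here is a toy sketch (hypothetical reads, naive O(n^2) comparison; real assemblers use indexes and compressed graph structures precisely because this does not scale):

# Build a tiny overlap graph: an edge a -> b means a suffix of read a
# matches a prefix of read b.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

reads = ["ATGCGT", "GCGTAC", "GTACGA", "ACGATT"]
edges = [(a, b, overlap(a, b))
         for a in reads for b in reads
         if a != b and overlap(a, b) > 0]

for a, b, w in edges:
    print(f"{a} -> {b} (overlap {w})")   # the graph a real assembler walks

For millions of reads, the graph's nodes, edges, and auxiliary indexes are what fill those terabytes of RAM.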
20 What does life science research do at BSC? [Pipeline schema: 0. Receive the data: raw genome data (120 GB) into common storage; processing on MareNostrum and the Altix / CLL cluster, 50-60 h on 1-10 CPUs.]
21 The data deluge also appears in genomics: The DNA data deluge comes from thousands of sources. More than 2,000 sequencing instruments around the world produce more than 15 petabytes of genetic data per year. And soon, tens of thousands of sources! Image source: https://share.sandia.gov/news/resources/news_releases/images/2009/biofuel_genes.jpg
22 The total computing burden is growing: DNA sequencing is on the path to becoming an everyday tool in medicine. Computing, not sequencing, is now the slower and more costly aspect of genomics research!
23 How can we help at BSC? Something must be done now, or else we'll need to put vital research on hold while the necessary computational techniques catch up or are invented. What is clear is that it will involve both better algorithms and a renewed focus on big data approaches to managing and processing data. How? By doing outstanding research to speed up this process.
24 What is the time required to retrieve information? 1 Petabyte = 1,000 × 1 Terabyte
25 What is the time required to retrieve information? Assume a scan rate of 100 MB/s.
26 What is the time required to retrieve information? At 100 MB/s, scanning 1 Terabyte takes about 3 hours.
27 What is the time required to retrieve information? Scanning 1 Petabyte: about 2,800 hours (roughly 116 days).
28 Solution? Massive parallelism, not only in computation but also in storage. Assume 10,000 disks: scanning 1 TB then takes 1 second. Source: http://www.google.com/about/datacenters/gallery/images/_2000/idi_018.jpg
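Writing out the back-of-envelope arithmetic behind slides 24-28 (assuming the stated 100 MB/s per-disk rate and decimal units; the 10,000-disk count is what a 1-second terabyte scan implies at that rate):

\[
t_{1\,\mathrm{TB}} = \frac{10^{12}\ \text{bytes}}{10^{8}\ \text{bytes/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{hours},
\qquad
t_{1\,\mathrm{PB}} = 1000 \times t_{1\,\mathrm{TB}} = 10^{7}\ \text{s} \approx 116\ \text{days}
\]

\[
\text{With } N \text{ disks in parallel: } t = \frac{t_{1\,\mathrm{TB}}}{N},
\qquad
N = 10^{4} \;\Rightarrow\; t \approx 1\ \text{s}
\]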
29 Talk outline! We talk about Petabytes?
30 Research in Big Data at BSC: To support this massive data parallelism and distribution, it is necessary to redefine and improve: Data Processing across hundreds of thousands of servers; Data Management across hundreds of thousands of data devices; and dealing with a new System Data Infrastructure.
31 How? How do companies like Google read and process data from disks in parallel? Source: http://www.google.com/about/datacenters/gallery/images/_2000/idi_018.jpg
32 GOOGLE: New programming model. To meet these challenges, Google introduced the MapReduce programming model in the early 2000s to support distributed computing (with special emphasis on fault tolerance). Around it has grown an ecosystem of big data processing tools: open source, distributed, and running on commodity hardware.
33 MapReduce: some details. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. Two phases: a Map phase and a Reduce phase. [Diagram: Input Data → Mappers → Reducers → Output Data.]
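An in-process sketch of the two phases in the diagram, using word count as the classic example (real frameworks distribute the mappers and reducers across nodes and handle the shuffle and failures for you):

from collections import defaultdict

def map_phase(split):
    # Emit (key, value) pairs from one input split.
    return [(word, 1) for word in split.split()]

def reduce_phase(key, values):
    # Combine all values observed for one key.
    return key, sum(values)

splits = ["big data big compute", "data deluge in genomics"]

groups = defaultdict(list)            # the "shuffle": group values by key
for split in splits:                  # mappers (parallel in a real cluster)
    for key, value in map_phase(split):
        groups[key].append(value)

results = [reduce_phase(k, v) for k, v in groups.items()]   # reducers
print(sorted(results))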
34 Limitations of MapReduce as a programming model? MapReduce is great, but not everyone is a MapReduce expert ("I am a Python expert, but..."). And there is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model. Different programming models deal with different challenges. Example: PyCOMPSs (OmpSs/COMPSs) from BSC. [Diagram: Input Data → Output Data.]
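For contrast, a sketch in the PyCOMPSs style: ordinary Python, with parallelism extracted by the runtime from decorated tasks rather than forced into map and reduce phases. Decorator options and module paths may differ across COMPSs versions, and this only executes under the COMPSs runtime; treat it as illustrative:

from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=int)
def gc_count(chunk):
    # Each call becomes a task the runtime may place on any node.
    return chunk.count("G") + chunk.count("C")

chunks = ["ATGCGT", "GGCCAT", "TTAACG"]    # e.g., pieces of a genome
futures = [gc_count(c) for c in chunks]    # asynchronous task spawns
counts = compss_wait_on(futures)           # synchronize on the results
print(sum(counts))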
35 Big Data resource management. Big Data characteristics: Volume, Variety, Velocity. Requirements for the data store: scalability; schema-less models; relaxed consistency and the capacity to digest incoming data. Relational databases are not suitable for Big Data problems: lack of horizontal scalability; complex structures to express complex relationships; strict consistency. Non-relational databases (NoSQL) are the alternative data store: relaxing consistency → eventual consistency.
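A toy illustration of what relaxing consistency buys: a write is acknowledged after reaching a single replica, and the others converge later. The class and function names are hypothetical; real NoSQL stores implement this with mechanisms such as vector clocks and anti-entropy repair:

import time

class Replica:
    def __init__(self):
        self.data = {}                        # key -> (timestamp, value)

    def write(self, key, value):
        self.data[key] = (time.time(), value)

def anti_entropy(replicas):
    # Propagate the newest version of every key to all replicas.
    merged = {}
    for r in replicas:
        for k, tv in r.data.items():
            if k not in merged or tv[0] > merged[k][0]:
                merged[k] = tv
    for r in replicas:
        r.data.update(merged)

replicas = [Replica() for _ in range(3)]
replicas[0].write("sample42", "GATTACA")            # fast: one replica acks
print([r.data.get("sample42") for r in replicas])   # two replicas still stale
anti_entropy(replicas)                              # background repair runs
print([r.data.get("sample42") for r in replicas])   # eventually consistent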
36 General view of NoSQL storage (and replication)
37 Big Data resource management: open issues in NoSQL. Query performance depends heavily on the data model. NoSQL stores are designed to support many concurrent short queries → Solutions: automatic configuration, query planning, and data organization. BSC Aeneas: https://github.com/cugni/aeneas [Diagram: a client's query drives the query plan across data models Model_A, Model_B, Model_C.]
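A tiny sketch of the query-driven idea: keep the same records materialized under more than one layout, so each frequent query pattern hits a model shaped for it. The structures here are hypothetical stand-ins for what a system like Aeneas would choose automatically:

reads = [("chr1", 100, "ATGC"), ("chr2", 50, "GGTA"), ("chr1", 200, "TTAG")]

# Model_A: point lookups by (chromosome, position).
by_position = {(c, p): s for c, p, s in reads}

# Model_B: range scans within one chromosome.
by_chromosome = {}
for c, p, s in reads:
    by_chromosome.setdefault(c, []).append((p, s))

print(by_position[("chr1", 100)])       # served by Model_A
print(sorted(by_chromosome["chr1"]))    # served by Model_B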
38 New System Data Infrastructure required. Example: Current computer systems available at genomics research institutions are commonly designed to run general computational jobs where, traditionally, the limiting resource is the CPU. We also find a large common storage space shared by all nodes.
39 Example: Computers in use for bioinformatics jobs. The typical mix of such computer systems and common bioinformatics applications → bottleneck and underutilization problems. First approach: a Big Data rack architecture: shared nothing. [Figure: an EDW appliance as a loosely coupled, shared-nothing MPP architecture.]
40 Storage Technology: Non-Volatile Memory evolution. Evolution of flash adoption: FLASH + DISK → FLASH AS DISK → FLASH AS MEMORY (SNIA NVM Summit, April 28). (*) HDD is 100 times cheaper than RAM, but 1,000 times slower.
41 Example: Computers in use for bioinformatics jobs. Jobs are responsible for managing the input data: partitioning, organisation, and merging of intermediate results. Large parts of the code are not functional logic but housekeeping tasks. Solutions: active storage strategies that leverage high-performance in-memory key/value databases to accelerate data-intensive tasks. [Diagram: Dense Compute Fabric → Active Storage Fabric → Archival Storage (Disk/Tape).]
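A minimal sketch of the active-storage idea, with an in-process dictionary standing in for an in-memory key/value store: ship a small function to where the data lives and move only the (much smaller) result back, instead of pulling raw data to the compute node:

class ActiveKVStore:
    def __init__(self, data):
        self.data = data                     # lives on the storage side

    def apply(self, fn):
        # Runs next to the data; only results cross the network.
        return {k: fn(v) for k, v in self.data.items()}

store = ActiveKVStore({"read1": "ATGCGT", "read2": "GGCCAT"})
gc = store.apply(lambda s: (s.count("G") + s.count("C")) / len(s))
print(gc)    # tiny result shipped instead of the raw sequences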
42 Important: Remote nodes have gotten closer. Interconnects have become much faster: an InfiniBand latency of 2,000 ns is only 20 times slower than RAM, and 100 times faster than SSD.
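Unpacking that comparison (the RAM and SSD figures below are implied by the stated ratios, not given directly on the slide):

\[
t_{\mathrm{RAM}} \approx \frac{t_{\mathrm{IB}}}{20} = \frac{2000\ \text{ns}}{20} = 100\ \text{ns},
\qquad
t_{\mathrm{SSD}} \approx 100 \times t_{\mathrm{IB}} = 100 \times 2000\ \text{ns} = 200\ \mu\text{s}
\]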
43 Conclusion: Paradigm shift. Old compute-centric model: manycore, FPGA. New data-centric model: massive parallelism; persistent memory (flash, phase change). Source: Heiko Joerg
44 Conclusions: How can we help? How can IT researchers help scientists like you cope with the onslaught of data? This is a crucial question, and there is no definitive answer yet. What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as data infrastructure, data management, and data processing.
45 Questions & Answers. Over to you: what do you think? Thank you for your attention! - Jordi
46 Thank you to
47 More information: Updated information will be posted at