Big Data Challenges in Bioinformatics

Similar documents

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

How To Handle Big Data With A Data Scientist

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Architectures for Big Data Analytics A database perspective

NoSQL Data Base Basics

CSE-E5430 Scalable Cloud Computing Lecture 2

Big Data Technologies Compared June 2014

Hadoop implementation of MapReduce computational model. Ján Vaňo

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Hadoop Cluster Applications

Big Fast Data Hadoop acceleration with Flash. June 2013

Large scale processing using Hadoop. Ján Vaňo

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

BIG DATA-AS-A-SERVICE

Big Data With Hadoop

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Open source Google-style large scale data analysis with Hadoop

bigdata Managing Scale in Ontological Systems

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

In-Memory Analytics for Big Data

Big Data: Opportunities & Challenges, Myths & Truths 資料來源 : 台大廖世偉教授課程資料

NoSQL for SQL Professionals William McKnight

Introduction to Hadoop

NextGen Infrastructure for Big DATA Analytics.

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Application Development. A Paradigm Shift

Data management challenges in todays Healthcare and Life Sciences ecosystems

Energy Efficient MapReduce

EMC BACKUP MEETS BIG DATA

HadoopTM Analytics DDN

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Oracle Big Data SQL Technical Update

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen

How To Scale Out Of A Nosql Database

Can the Elephants Handle the NoSQL Onslaught?

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Understanding the Value of In-Memory in the IT Landscape

A survey of big data architectures for handling massive data

Data-Intensive Computing with Map-Reduce and Hadoop

Testing Big data is one of the biggest

Big Data and Data Science: Behind the Buzz Words

Cray: Enabling Real-Time Discovery in Big Data

The Future of Data Management

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Chapter 7. Using Hadoop Cluster and MapReduce

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY

I/O Considerations in Big Data Analytics

Open source large scale distributed data management with Google s MapReduce and Bigtable

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Ramesh Bhashyam Teradata Fellow Teradata Corporation

Software-defined Storage Architecture for Analytics Computing

Big Data and Apache Hadoop s MapReduce

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

The Internet of Things and Big Data: Intro

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Systems, Big Data

Implement Hadoop jobs to extract business value from large and varied data sets

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Reference Architecture, Requirements, Gaps, Roles

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router

Get More Scalability and Flexibility for Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

A Brief Outline on Bigdata Hadoop

Data Centric Computing Revisited

Protecting Big Data Data Protection Solutions for the Business Data Lake

SQL Server PDW. Artur Vieira Premier Field Engineer

RevoScaleR Speed and Scalability

Hadoop: Embracing future hardware

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Intro to Map/Reduce a.k.a. Hadoop

Optimizing Storage for Better TCO in Oracle Environments. Part 1: Management INFOSTOR. Executive Brief

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

InfiniteGraph: The Distributed Graph Database

Big Data Analytics - Accelerated. stream-horizon.com

Big Data - Infrastructure Considerations

Transcription:

Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es

Talk outline! We talk about Petabyte? 2

Deluge of Data Data is now considered the Fourth Paradigm in Science the first three paradigms were experimental, theoretical and computational science. This shift is being driven by the rapid growth in data from improvements in scientific instruments

Scientific instruments Physics Large Hadron Collider produced around 15 petabytes of data in 2012. Astronomy Large Synoptic Survey Telescope it s anticipated to produce around 10 petabytes per year. 4

Example: In Genome Research? Cost of sequencing a human-sized genome $100000000.0 $10000000.0 $1000000.0 $100000.0 $10000.0 $1000.0 Sep-01 Jan-02 May-02 Sep-02 Jan-03 May-03 Sep-03 Jan-04 May-04 Sep-04 Jan-05 May-05 Sep-05 Jan-06 May-06 Sep-06 Jan-07 May-07 Sep-07 Jan-08 May-08 Sep-08 Jan-09 May-09 Sep-09 Jan-10 May-10 Sep-10 Jan-11 May-11 Sep-11 Jan-12 May-12 Sep-12 Jan-13 Source: National Human Genome Research Institute (NHGRI) http://www.genome.gov/sequencingcosts/ 5

Data Deluge: Due to the changes in big data generation Example: Biomedicine Image source: Big Data in biomedicine, Drug Discovery Today. Fabricio F. Costa (2013)

Important open issues Transfer of data from one location to another (*) shipping external hard disks processing the data while it is being transferred Future? Data won t be moved! (*) Out of scope of this presentation Source: http://footage.shutterstock.com/clip-4721783-stock-footage-animation-presents-datatransfer-between-a-computer-and-a-cloud-a-concept-of-cloud-computing.html

Important open issues Security and privacy of the data from individuals (*) The same problems that appear in other areas Use advanced encryption algorithms http://www.google.es/url? sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&docid=fsnqd0jqszy2om&tbnid=gsxemmans2soom:&ved=0cauqjrw& url=http%3a%2f%2fwww.tbase.com%2fcorporate%2fprivacy-andsecurity&ei=m 3Ur2EK4bQtAbSx4CoDA&psig=AFQjCNGD5uUaqRoyOqw687J-ATfdtBHTrA&ust=1392070808227629 (Out of scope of this presentation) Source: http://www.tbase.com/corporate/privacy-and-security 8

Important open issues Increased need to store data (*) Cloud based computing solutions have emerged Source: http://www.custodia-documental.com/wp-content/uploads/cloud-big-data.jpg

Important open issues Increased need to store data (*) Cloud based computing solutions have emerged The most common Cloud Computing inhibitors should be tackled Security Privacy Lack of Standards Data Integrity Regulatory Data Recovery Control Vendor Maturity...

The most critical open issue DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*) Source: http://www.theatlantic.com/health/archive/2012/05/big-data-can -save-health-care-0151-but-at-what-cost-to-privacy/257621/ (*) Big Data definition?

The most critical open issue DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*) (*) Big Data definition?

The most critical open issue Source: cetemma - mataró DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*) (*) Big Data definition?

The most critical open issue DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*) The informa=on is non ac=onable knowledge (*) Big Data definition?

What is the usefulness of Big Data? Performs predictions of outcomes and behaviors - Value Data InformaMon + Volume + Knowledge Approach: Machine Learning works in the sense that these methods detect subtle structure in data relatively easily without having to make strong assumptions about parameters of distributions - 15

Data Analytics: To extract knowledge Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets to reveal relationships, dependencies, and to perform predictions of outcomes and behaviors. (*) Wikipedia (Out of scope of this presentation) :- ( 16

Talk outline! We talk about Petabyte? 17

Challenges related to us? The big data problem: In the end it is a Computing Challenge 18

Computing Challenge Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers Example: A de novo assembly algorithm for DNA data finds reads whose sequences overlap and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer. 19

What does life science research do at BSC? Pipeline schema: MareNostrum 25-40 h / 50-100 cpus AlMx / CLL cluster 0. Receive the data: Raw Genome Data: 120Gb Common Storage 50-60h / 1-10 cpus

Also data deluge appears in genomics The DNA data deluge comes from thousands of sources More than 2000 sequencing instruments around the world more than 15 petabytes x year of genetic data. And soon, tens of thousands of sources!!!! Image source: https://share.sandia.gov/news/resources/ news_releases/images/2009/biofuel_genes.jpg

The total computing burden is growing DNA sequencing is on the path to becoming an everyday tool in medicine. Computing, not sequencing, is now the slower and more costly aspect of genomics research.!

How can we help at BSC? Something must be done now, or else we ll need to put vital research on hold while the necessary computational techniques catch up or are invented. What is clear is that it will involve both better algorithms and a renewed focus on such big data approaches in managing and processing data. How? Doing outstanding research to speedup this process 23

What is the time required to retrieve information? 1 Petabyte = 1000 x (1 Terabyte )

What is the time required to retrieve information? assume 100MB/sec

What is the time required to retrieve information? assume 100MB/sec scanning 1 Terabyte: more than 5 hours

What is the time required to retrieve information? scanning 1 Petabyte: more than 5.000 hours

Solution? massive parallelism not only in computation but also in storage assume 10.000 disks: scanning 1 TB takes 1 second Source: hvp://www.google.com/about/datacenters/gallery/images/_2000/idi_018.jpg

Talk outline! We talk about Petabyte? 29

Research in Big Data at BSC To support this massive data parallelism & distribution it is necessary to redefine and improve: Data Processing across hundreds of thousands of servers Data Management across hundreds of thousands of data devices Dealing with new System Data Infrastructure 30

How? How do companies like google read and process data from 10.000 disks in parallel? Source: hvp://www.google.com/about/datacenters/gallery/images/_2000/idi_018.jpg

GOOGLE: New programming model To meet the challenges: MapReduce Programming Model introduced by Google in early 2000s to support distributed computing (special emphasis in fault-tolerance) Ecosystem of big data processing tools open source, distributed, and run on commodity hardware.

MapReduce: some details The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. Input Data Two phases Map phase Reduce phase Reducers Mappers Output Data

Limitations of MapReduce as a Programming model? MapReduce is great but not every one is a MapReduce expert I am a python expert but. There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model Input Data Different programming models deal with different challenges Example: pycompss from BSC Output Data OmpSs/COMPSs 34

Big Data resource management Big Data characteristics Volume Variety Velocity Requirements from data store Scalability Scheme-less Relaxed consistency & capacity to digest Relational databases are not suitable for Big Data problems Lack of horizontal scalability Complex structures to express complex relationships Hard consistency Non-relational databases (NoSQL) are the alternative data store Relaxing consistencyà Eventual consistency 35

General view of NoSQL storage (and replication) 2 3 4 1 3 5 1 2 3 4 5 1 3 4 2 4 5 1 2 5

Big Data resource management: open issues in NoSQL Query performance depends heavily on data model Designed to support many concurrent short queries à Solutions: automatic configuration, query plan and data organization BSC- Aeneas: https://github.com/cugni/aeneas Model_A Model_B Model_C Client Query Driven Query Plan

New System Data Infrastructure required Example: Current computer systems available at genomics research institutions are commonly designed to run general computational jobs where, Traditionally the limiting resource is the use of CPU. Also, we find a large common storage space shared for all nodes.

Example: Computers in use for bioinformatic jobs Typical mix of such computer systems and common bioinformatics applications: à Bottlenecks & underutilization problems Figure 16. EDW Appliance is a loosely coupled, shared nothing, MPP architecture First approach: Big Data Rack Architecture: Shared Nothing

Storage Technology: Non Volatile Memory evolution Evolution of Flash Adoption FLASH + DISK FLASH AS DISK FLASH AS MEMORY SNIA NVM Summit April 28, 2013 4 (*) HHD 100 cheaper than RAM. But 1000 times slower 40

Example: Computers in use for bioinformatic jobs Jobs are responsible of managing the input data: partitions, organisation, merge of intermediate results Large parts of code are not functional, but housekeeping tasks Solutions: Active storage strategies for leveraging highperformance in-memory key/value databases to accelerate data intensive tasks Compute Dense Compute Fabric Active Storage Fabric Archival Storage Disk/Tape

Important: Remote Nodes Have Gotten Closer Interconnects have become much faster IB latency 2000 ns is only 20x slower that RAM and is 100x faster that SSD Source: http://www.slideshare.net/blopeur/hecatonchire-kvm-forum2012benoithudzia 42

Conclusion: Paradigm shift Old Compute-centric Model Manycore FPGA New Data-centric Model Massive Parallelism Persistent Memory Flash Phase Change Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics

Conclusions: How can we help? How can IT researchers help scientists like you cope with the onslaught of data? This is a crucial question and there is no definitive answer yet. What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as: data infrastructure, data managing and data processing.

Questions & Answers Over to you, what do you think? Thank you for your attention! - Jordi 45

Thank you to 46

More information Updated information will be posted at www.jorditorres.eu 47