Hadoopizer : a cloud environment for bioinformatics data analysis



Similar documents
Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Hadoop. Bioinformatics Big Data

High Throughput Sequencing Data Analysis using Cloud Computing

Cloud-Based Big Data Analytics in Bioinformatics

Open source large scale distributed data management with Google s MapReduce and Bigtable

New solutions for Big Data Analysis and Visualization

Open source Google-style large scale data analysis with Hadoop

Cloud-based Analytics and Map Reduce

Real Time Big Data Processing

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Hadoop & Spark Using Amazon EMR

Map Reduce & Hadoop Recommended Text:

CSE-E5430 Scalable Cloud Computing. Lecture 4

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Processing NGS Data with Hadoop-BAM and SeqPig

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

A Cost-Evaluation of MapReduce Applications in the Cloud

Hadoop Parallel Data Processing

HADOOP IN THE LIFE SCIENCES:

CSE 344 Introduction to Data Management. Section 9: AWS, Hadoop, Pig Latin TA: Yi-Shu Wei

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data on Microsoft Platform

Hadoop-BAM and SeqPig

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Accelerating Data-Intensive Genome Analysis in the Cloud

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

How To Handle Big Data With A Data Scientist

CSE-E5430 Scalable Cloud Computing Lecture 2

Improving MapReduce Performance in Heterogeneous Environments

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

GraySort on Apache Spark by Databricks

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Data processing goes big

Chapter 7. Using Hadoop Cluster and MapReduce

BIG DATA TRENDS AND TECHNOLOGIES

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Analyze Human Genome Using Big Data

Apache Flink Next-gen data analysis. Kostas

HDFS Cluster Installation Automation for TupleWare

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Big Data Challenges in Bioinformatics

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Support for data-intensive computing with CloudMan

Cloud Computing using MapReduce, Hadoop, Spark

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

COURSE CONTENT Big Data and Hadoop Training

Deep Sequencing Data Analysis

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Workshop on Hadoop with Big Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

SURFsara HPC Cloud Workshop

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

A curated Domain centric shared Docker registry linked to the Galaxy toolshed

Twister4Azure: Data Analytics in the Cloud

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Service for Data-Intensive Computations on Virtual Clusters

Amazon Elastic MapReduce. Jinesh Varia Peter Sirota Richard Cole

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Hadoop. Sunday, November 25, 12

Extending Hadoop beyond MapReduce

Apache Hama Design Document v0.6

Big Data Analytics. Lucas Rego Drumond

Analysis of ChIP-seq data in Galaxy

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

How To Use Hadoop

GC3 Use cases for the Cloud

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Basic processing of next-generation sequencing (NGS) data

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Energy Efficient MapReduce

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

-> Integration of MAPHiTS in Galaxy

BBM467 Data Intensive ApplicaAons

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Running and Enhancing your own Galaxy. Daniel Blankenberg The Galaxy Team

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

MapReduce with Apache Hadoop Analysing Big Data

Transcription:

Hadoopizer : a cloud environment for bioinformatics data analysis Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3) (1) anthony.bretaudeau@irisa.fr, INRIA/Irisa, Campus de Beaulieu, 35042, RENNES, Cedex, France (2) olivier.sallou@irisa.fr, INRIA/Irisa, Campus de Beaulieu, 35042, RENNES, Cedex, France (3) olivier.collin@irisa.fr, INRIA/Irisa, Campus de Beaulieu, 35042, RENNES, Cedex, France Journées scientifiques mésocentres et France Grilles 3 Octobre 2012

Data deluge in biology Next generation sequencing (NGS) Very cheap DNA sequencing Compute and storage grow slower than sequence production Data deluge This is the first wave Proteomic Imaging... Kahn. On the future of genomic data. Science (2011) vol. 331 (6018) pp. 728-9

Next Generation Sequencing technologies DNA Alphabet = A, T, G, C (1 letter = 1 base ) Human genome = 3 Gb Next generation sequencing (NGS) 1 run (11 days) = 6 Billion reads of 100 b Many methods De novo assembly of a genome Mapping on a genome...

NGS application: example of mapping Many algorithms (101*) Reads (short sequences) Sequencing errors, mutations, repetitions, gaps Parallelization Indexed genome Each read can be mapped individually * source: seqanswers.com wiki

Data management Production facilities Sequencing centers Genoscope, BGI, companies,... Sequencers in labs => disseminated Computing resources Disseminated too => transfers in labs, on platforms (or none) Peaks of activity => Engineering challenges

Why the cloud Extensible computing resources Amazon, Azure, private clouds Parallelization SGE or Hadoop clusters Reproducible results Shareable virtual machines

Hadoop MapReduce framework Distributed processing of large datasets Scalability 1 to thousands of nodes Fault tolerance Detect node failure and retry Hadoop-related projects Hbase, Hive, Cassandra, Pig,...

Hadoop Hadoop in a few words 1. Load input data as key/values 2. Distribute them to computing node 3. Map(): transform to new key/values pairs 4. Reduce(): combine values having the same key 5.Write to output file

Hadoop Big input file Node 1 Node 2 Node 3 Node 4 MAP Node 2 REDUCE Complete result file

The project Cloud computing and bioinformatics data: Does it work? Do we miss some tools? Will it be efficient? Private cloud setup: GenoCloud Inventory of existing solutions Development of Hadoopizer

Private cloud Private cloud for test purpose Based on OpenNebula 240 cores, 940 Gb memory, 8 Tb storage EC2 compatibility Ready to use images Sge cluster Hadoop cluster http://genocloud.genouest.org

Inventory Already existing tools: CloudAligner Mapping (specific algorithm) CloudBurst Mapping (RMAP algorithm) Contrail De novo assembler (specific algorithm) Crossbow SNPs detection (uses Bowtie and SOAPsnp) Myrna RNAseq, differential expression (bowtie, R/Bioconductor)

Inventory Less specific tools: Eoulsan Filtering, mapping, differential expression Pipelines Extendable (but not simple) CloudMan SGE cluster (with console) + Galaxy frontend

Bioinfo applications in the cloud Main limitations Fast evolution of both data & algorithms Obsolescence of algorithms (myrna, cloudburst, contrail) Evolution cost Test new algorithms Custom code/scripts (very common) Incompatible dependencies Missing Launching custom command line Splitters adapted to bioinformatics data formats Some glue to make life easier => Hadoopizer

Non parallelized treatment >a sequence AAAATGCGTCGTACCGT >another sequence TGTCGTACTGGTAC >third sequence TGTCGTACAAACGTTCGA... Big input file EXECUTION of command line on all data Complete result file

Hadoopizer: how it works SPLITTING with specific splitter EXECUTION of command line on received data >a sequence AAAATGCGTCGTACCGT >another sequence TGTCGTACTGGTAC >third sequence TGTCGTACAAACGTTCGA... Big input file Input chunks Output chunks REDUCING with specific reducer Complete result file

Hadoopizer: using it Xml example <?xml version="1.0" encoding="utf-8"?> <job> <command> bowtie -m 1 --best --strata -S ${genome} ${reads} > ${mapped} </command> <input id="genome"> <url autocomplete="true">/home/example/indexed_genome</url> </input> <input id="reads" split="true"> <url splitter="fastq">/home/example/reads.fastq</url> </input> <outputs> <url>/home/example/output_mapping/</url> <output id="mapped" reducer="sam" /> </outputs> </job> Command line example hadoop jar hadoopizer.jar -c job_config.xml -w hdfs://master_node_ip/bowtie_tmp/

Hadoopizer: using it Xml example <?xml version="1.0" encoding="utf-8"?> <job> <command> bowtie -m 1 --best --strata -S ${genome} ${reads} > ${mapped} </command> <input id="genome"> <url autocomplete="true">/home/example/indexed_genome</url> </input> <input id="reads" split="true"> <url splitter="fastq">/home/example/reads.fastq</url> </input> <outputs> <url>/home/example/output_mapping/</url> <output id="mapped" reducer="sam" /> </outputs> </job> Command line example hadoop jar hadoopizer.jar -c job_config.xml -w hdfs://master_node_ip/bowtie_tmp/

Hadoopizer: other features Software deployment hadoop jar hadoopizer.jar -b binaries.tar.gz [...] Extracted in work dir on each node Supported formats Fasta, Fastq, Sam, (Bam) Compression Input and output Multiple input+output Support of paired sequences

Mapping example: first results First results on mapping: Reference genome: 400 Mb Reads: 11 Gb With hadoopizer 343 splits on 10 nodes 55 min for map step, 3h for reduce step This is a test cloud, not optimized for performances Configuration tuning, code improvements,... Same command on 1 machine (same config) 4h30

Hadoopizer: performances Benchmarks The next step of the project Comparison with: non parallelized SGE parallelized other implementations using Hadoop Take into account the transfer time Expected results: Overhead due to I/O on computing nodes

Hadoopizer conclusion Coarse grain parallelism For embarrassingly parallel problems Works with any command line Perspectives Support more data formats (bed, wiggle, gff,...) Performance issues

Perspectives 6 months work 1.0 released on github https://github.com/genouest/hadoopizer Open position to continue the project Benchmarks Public clouds Support other formats, new features Real life applications

Thank you! www.genouest.org genocloud.genouest.org github.com/genouest/hadoopizer support@genouest.org