Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille



Similar documents
High Throughput Sequencing Data Analysis using Cloud Computing

Hadoopizer : a cloud environment for bioinformatics data analysis

Hadoop Parallel Data Processing

Apache Hadoop. Alexandru Costan

A Cost-Evaluation of MapReduce Applications in the Cloud

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

MapReduce, Hadoop and Amazon AWS

Open source Google-style large scale data analysis with Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Performance and Energy Efficiency of. Hadoop deployment models

Hadoop Architecture. Part 1

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Cloud Computing. Alex Crawford Ben Johnstone

Data Sharing Options for Scientific Workflows on Amazon EC2

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

GraySort on Apache Spark by Databricks

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Chapter 7. Using Hadoop Cluster and MapReduce

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

UBUNTU DISK IO BENCHMARK TEST RESULTS

Qsoft Inc

A Service for Data-Intensive Computations on Virtual Clusters

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

CSE-E5430 Scalable Cloud Computing. Lecture 4

Storage Architectures for Big Data in the Cloud

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

GC3 Use cases for the Cloud

Amazon Elastic Compute Cloud Getting Started Guide. My experience

Duke University

Cornell University Center for Advanced Computing

Cloud-based Analytics and Map Reduce

Introduction. Various user groups requiring Hadoop, each with its own diverse needs, include:

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Cloud Federation to Elastically Increase MapReduce Processing Resources

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

CSE-E5430 Scalable Cloud Computing Lecture 2

Extending Hadoop beyond MapReduce

How To Use Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Intro to Map/Reduce a.k.a. Hadoop

Cloud-Based Big Data Analytics in Bioinformatics

Improving MapReduce Performance in Heterogeneous Environments

Data management challenges in todays Healthcare and Life Sciences ecosystems

Open source large scale distributed data management with Google s MapReduce and Bigtable

Introduction to MapReduce and Hadoop

Analysis of ChIP-seq data in Galaxy

Early Cloud Experiences with the Kepler Scientific Workflow System

New solutions for Big Data Analysis and Visualization

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Cloud Ready for Bioinformatics?

Last time. Today. IaaS Providers. Amazon Web Services, overview

Map Reduce & Hadoop Recommended Text:

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Next generation sequencing (NGS)

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Cloud Computing Workload Benchmark Report

Research Article Hadoop-Based Distributed Sensor Node Management System

Hadoop & Spark Using Amazon EMR

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

G E N OM I C S S E RV I C ES

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

Analysis of NGS Data

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Deep Sequencing Data Analysis

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

BioHPC Web Computing Resources at CBSU

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Bringing Big Data Modelling into the Hands of Domain Experts

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Cloud computing - Architecting in the cloud


UGENE Quick Start Guide

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

Big data blue print for cloud architecture

Cloud Computing Performance. Benchmark Testing Report. Comparing ProfitBricks vs. Amazon EC2

Globus Genomics Tutorial GlobusWorld 2014

Transcription:

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille Journées SUCCES Stéphane Le Crom (UPMC IBENS) stephane.le_crom@upmc.fr Paris November 2013

The Sanger DNA sequencing method Sequencing by synthesis by Frédérick Sanger (1977, chemistry nobel price 1980). 126 capillary sequencers can sequence one human genome in 12 days. From The Scientist 2

The second generation of sequencing device One Illumina HiSeq can sequence two human genomes in 8 days. From htttp://www.illumina.com 3

Applications cover a lot of biological fields De novo sequencing; Resequencing; Functional analysis; Metagenomic; Diagnostics; Personal genomic Shendure & Aiden (2012) Nat. Biotech. 4

Data analysis levels Sequencer outputs produce To of data for each run and large raw result files. Data analysis on big genomes require calculation power and memory. http://www.geospiza.com/finchtalk/labels/next Generation Sequencing.html 5

Eoulsan: our RNA-Seq data analysis workflow Eoulsan works from raw sequencer outputs (Illumina) to the list of differentially expressed genes. Its modular and flexible analysis framework runs with various already available analysis solutions (SOAP2, Bowtie, Bowtie2, BWA, GSNAP). New function extension through external java plug-in. Distributed calculation to speed up the analysis. Jourdren (2012) Bioinformatics http:/transcriptome.ens.fr/eoulsan/ 6

We use MapReduce to increase analysis speed MapReduce is used for parallel computation and automatically handles duties, such as job scheduling, fault tolerance and distributed aggregation. Map(id_alignment, alignment) " list(id_exon, 1)" Sort(id_exon,1)" Reduce(id_exon, list(1,1...1)) " list(id_exon, count)" White (2009) O'Reilly Media Hadoop is a popular (Twiter, facebook, ebay ) open-source implementation of the MapReduce framework as the original Google implementation is not public. Hadoop is a Java framework that can be executed on any cluster. 7

Eoulsan workflow on Amazon Web Services $ eoulsan.sh -conf conf-aws.txt awsexec -d "Job name" param-aws.xml design.txt s3://sgdb-test/demo" 8

Tests for instance and mapper types Instance Memory (Go) CPU (EC2 unit) I/O performance Price USD/hour m1.large 7.5 4 high $0.44 m1.xlarge 16.0 8 high $0.88 c1.xlarge 7.0 20 high $0.88 9

Can we run Eoulsan on a grid infrastructure? Is Hadoop compatible with grid cluster? Several technical solutions need to be addressed. Task management Master and slaves on the grid cluster; Or, only slaves on the grid cluster. Grid submission Batch queue Data storage HDFS storage available on site; POSIX/NFS storage to build HDFS; HDFS cluster run locally on nodes. jobtracker tasktracker tasktracker tasktracker 10

First results Tests using a R&D grid cluster at LLR. 4 identical nodes E5520 2.27GHz 16 virtual cores; 48GB RAM; 250GB hard drive; 1Gb/s Ethernet connection; OS Scientific Linux 6.3 64bit; hadoop-1.0.4-1.x86_64 Storage Dedicated HDFS cluster; Or POSIX/NFS storage. Execution time (min) Same Mouse RNA-Seq data than the one we use on AWS. HDFS POSIX NFS 11

Conclusion The first tests are very promising: Hadoop can work on grid cluster. Eoulsan delivers quicker results than using Amazon Web Services. Storage choice as no influence on analysis performance. There is a maximum number of slaves after which the gain is minimal (5 for the task we used for the tests). Interface for downloading/uploading data from/to HDFS/NFS/HTML/Grid storage is available. The workflow heavily uses whole node scheduling. This is now a standard tool on the GRID but it may be nontrivial to effectively implement it. 12

Future work Benchmark larger datasets to compare performances between grid and AWS. Allow Eoulsan to remotely run on the grid. Final goal: make the choice of the computing infrastructure transparent for final users By creating generic VM to run Hadoop everywhere: CNRS Research Cloud; GenoCloud (BioGenOuest); Stratuslab cloud 13

Acknowledgments Laurent Jourdren Stéphane Le Crom Maria Bernard Claire Wallon Vivien Deshaies Sophie Lemoine Sandrine Perrin Andrea Sartirana Philippe Busson David Chamont Pascale Hennion Paulo Mora de Freitas 14

15

Evolution of sequencing technologies HGP 2003 13y - $2.7 10 9 Venter 2007 4y - $100 10 6 Watson 2008 4.5m - $1.5 10 6 10 10 10 9 10 8 10 7 10 6 10 5 10 4 Cost of human genome sequencing ($) 10 3 Stratton (2009) Nature 16

A ready-to-use software solution Aim: to automate the analysis of a large number of samples at once. With minimal file requirements. Data: several Fastq files (.bz2) 1 reference genome (.fasta) 1 annotation file (.gff3) Set up: 1 XML parameter file 1 design file A design file inspired from the limma R package. And one command line to launch the whole process. $ eoulsan.sh exec parameter_file.xml design_file.txt" http:/transcriptome.ens.fr/eoulsan/ 17

Tests for instance and mapper types 18

Conclusion Eoulsan automates the analysis of a large number of samples at once; Standalone version Its modular and flexible analysis framework runs with various already available analysis solutions; On several computer infrastructure types. Distributed version Final goal: make the choice of the computing infrastructure transparent for final users. By creating generic VM to run Hadoop everywhere (Private Cloud, Stratuslab) Local clusters Grids Cloud Computing http:/transcriptome.ens.fr/eoulsan/ 19