High Throughput Sequencing Data Analysis using Cloud Computing




Stéphane Le Crom (stephane.le_crom@upmc.fr)
LBD - Université Pierre et Marie Curie (UPMC)
Institut de Biologie de l'École normale supérieure (IBENS)
Montagne Sainte Geneviève Genomic Platform

An RNA-Seq data analysis workflow
A flexible analysis framework. From raw sequencer outputs (Illumina) to the list of differentially expressed genes. Based on available analysis solutions (SOAP, Bowtie, BWA). New functions can be added through external Java plug-ins. Distributed computation speeds up the analysis.

A ready-to-use software solution
Aim: to automate the analysis of a large number of samples at once, with minimal file requirements.
- Data: several FASTQ files (.bz2), 1 reference genome (.fasta), 1 annotation file (.gff3)
- Set up: 1 XML parameter file, 1 design file
The design file is inspired by the limma R package. One command line launches the whole process:
$ eoulsan.sh exec parameter_file.xml design_file.txt

How to run compute-intensive HTS analyses?
Data analysis requires large computer infrastructures. Such clusters are only profitable when their computers are continually used. Cloud computing can help for small to medium computing requirements: outsourcing data analysis over the network is economical thanks to on-demand reservation of computer resources (AWS).

MapReduce to increase analysis speed
MapReduce is used for parallel computation and automatically handles duties such as job scheduling, fault tolerance and distributed aggregation.
Map(id_alignment, alignment) → list(id_exon, 1)
Sort(id_exon, 1)
Reduce(id_exon, list(1,1...1)) → list(id_exon, count)
White (2009) O'Reilly Media
Hadoop is a popular (Twitter, Facebook, eBay) open-source implementation of the MapReduce framework, as the original Google implementation is not public. Hadoop is a Java framework that can be executed on any cluster (such as AWS).
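The Map, Sort and Reduce signatures above can be illustrated with a minimal in-process sketch in plain Python. It is not Eoulsan's or Hadoop's actual code: the exon table, the read coordinates, the overlap rule and all function names are hypothetical and chosen only to show how (id_exon, 1) pairs are emitted, sorted and summed.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical exon annotation: id_exon -> (chromosome, start, end), inclusive.
EXONS = {
    "exon_A": ("chr1", 100, 200),
    "exon_B": ("chr1", 300, 400),
}


def map_alignment(id_alignment, alignment):
    """Map: emit (id_exon, 1) for each exon the alignment overlaps (assumed rule)."""
    chrom, start, end = alignment
    return [
        (id_exon, 1)
        for id_exon, (e_chrom, e_start, e_end) in EXONS.items()
        if chrom == e_chrom and start <= e_end and end >= e_start
    ]


def reduce_exon(id_exon, ones):
    """Reduce: sum the list of 1s collected for one exon."""
    return (id_exon, sum(ones))


def count_alignments_per_exon(alignments):
    # Map phase: one call per alignment, results flattened into one list.
    pairs = [
        pair
        for id_alignment, alignment in alignments.items()
        for pair in map_alignment(id_alignment, alignment)
    ]
    # Sort phase: bring all pairs with the same id_exon next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct id_exon.
    return [
        reduce_exon(id_exon, [one for _, one in group])
        for id_exon, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical alignments: id_alignment -> (chromosome, start, end).
alignments = {
    "read_1": ("chr1", 150, 225),
    "read_2": ("chr1", 180, 255),
    "read_3": ("chr1", 350, 425),
    "read_4": ("chr2", 150, 225),
}
counts = count_alignments_per_exon(alignments)
print(counts)
```

On a real cluster the map and reduce calls would run on different machines and the framework would perform the sort between them; here everything runs sequentially in one process.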

Eoulsan workflow on AWS
$ eoulsan.sh -conf conf-aws.txt awsexec -d "Job name" param-aws.xml design.txt s3://sgdb-test/demo

Instance type selection
What kind of computer server (instance) do we need to book from Amazon Web Services for RNA-Seq data analysis?

Instance     Memory (GB)   CPU (EC2 units)   I/O performance   Price (USD/hour)
m1.small     1.7           1                 moderate          $0.11
m1.large     7.5           4                 high              $0.44
m1.xlarge    16.0          8                 high              $0.88
m2.xlarge    17.1          6.5               moderate          $0.66
m2.2xlarge   34.2          13                high              $1.35
m2.4xlarge   68.4          26                high              $2.70
c1.medium    1.7           5                 moderate          $0.22
c1.xlarge    7.0           20                high              $0.88

Test data: mouse RNA-Seq, 8 samples, 23.5 million reads each (188 million reads in total), 76-base single reads, 10 instances booked.
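To compare instance types on price, the hourly rates above can be turned into the cost of a booking. The sketch below is only illustrative: the function name, the 2.5-hour running time in the example and the rounding up to whole billed hours are assumptions, not figures or billing rules taken from the slides.

```python
import math

# Hourly prices in USD, copied from the instance table above.
PRICE_USD_PER_HOUR = {
    "m1.small": 0.11,
    "m1.large": 0.44,
    "m1.xlarge": 0.88,
    "m2.xlarge": 0.66,
    "m2.2xlarge": 1.35,
    "m2.4xlarge": 2.70,
    "c1.medium": 0.22,
    "c1.xlarge": 0.88,
}


def booking_cost(instance_type, n_instances, hours):
    """Cost in USD of booking n_instances of one type for a given duration.

    Assumption: every started hour is billed in full, so hours are rounded up.
    """
    billed_hours = math.ceil(hours)
    return PRICE_USD_PER_HOUR[instance_type] * n_instances * billed_hours


# Hypothetical run: 10 m1.large instances for 2.5 hours (billed as 3 hours).
example_cost = booking_cost("m1.large", 10, 2.5)
print(f"{example_cost:.2f} USD")
```

With such a function, the price of each configuration can be set against its measured running time to choose between a cheaper and a faster analysis.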

Tests for instance and mapper types

Cost and number of instances are proportional
The only choice to make is to favor either price or speed of the analysis. There is no suboptimal configuration.

How to deal with increasing amounts of data?
With Illumina sequencers, the throughput doubles every 10 months.

Specifications    Reads/run       Reads/channel   Reads per RNA-Seq sample   Samples/run
GAIIx 2010        200,000,000     25,000,000      45,000,000                 ~4
HiSeq 1000 2010   640,000,000     80,000,000      45,000,000                 ~14
HiSeq 1000 2011   1,500,000,000   187,500,000     45,000,000                 ~33
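The samples-per-run column follows from dividing the reads per run by the 45 million reads budgeted per RNA-Seq sample. The short sketch below reproduces that arithmetic; the figure of 8 channels per run is an assumption inferred from the table (reads per run divided by reads per channel), and the names are illustrative.

```python
READS_PER_RNASEQ_SAMPLE = 45_000_000
CHANNELS_PER_RUN = 8  # assumption: inferred from 200,000,000 / 25,000,000

# Reads per run, copied from the table above.
READS_PER_RUN = {
    "GAIIx 2010": 200_000_000,
    "HiSeq 1000 2010": 640_000_000,
    "HiSeq 1000 2011": 1_500_000_000,
}


def reads_per_channel(reads_per_run):
    """Reads available on one channel of the flow cell."""
    return reads_per_run // CHANNELS_PER_RUN


def samples_per_run(reads_per_run):
    """Whole number of RNA-Seq samples that fit in one run."""
    return reads_per_run // READS_PER_RNASEQ_SAMPLE


for name, reads in READS_PER_RUN.items():
    print(name, reads_per_channel(reads), samples_per_run(reads))
```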

Running time scales linearly with sample size
We tested increasing the number of samples from 8 to 32 (188 to 752 million reads), using the Bowtie mapper on varying numbers of m1.large instances.

Conclusion
- Eoulsan automates the analysis of a large number of samples at once.
- It simplifies the execution and configuration of a cloud computing infrastructure.
- Its modular and flexible analysis framework runs with various already available analysis solutions.
- Eoulsan handles the increase in sequencer throughput.
Future developments:
- Improve the RNA-Seq workflow (gapped alignments, spliced transcript abundance estimation, new transcript discovery).
- Add new functional genomics capabilities (ChIP-Seq, small RNA-Seq).
- Test with other cloud solutions (StratusLab, OpenStack, OpenNebula).

Eoulsan is available for download
eoulsan/
Standalone version, distributed version, local clusters, cloud computing
Laurent Jourdren, Maria Bernard, Marie-Agnès Dillies, Sophie Lemoine