High Throughput Sequencing Data Analysis using Cloud Computing
Stéphane Le Crom (stephane.le_crom@upmc.fr)
LBD - Université Pierre et Marie Curie (UPMC)
Institut de Biologie de l'École normale supérieure (IBENS)
Montagne Sainte Geneviève Genomic Platform
An RNA-Seq data analysis workflow
- A flexible analysis framework.
- From raw sequencer outputs (Illumina) to the list of differentially expressed genes.
- Based on available analysis solutions (SOAP, Bowtie, BWA).
- New functions can be added through external Java plug-ins.
- Distributed computation to speed up the analysis.
A ready-to-use software solution
Aim: to automate the analysis of a large number of samples at once, with minimal file requirements.
- Data: several FASTQ files (.bz2), 1 reference genome (.fasta), 1 annotation file (.gff3)
- Set up: 1 XML parameter file, 1 design file
The design file is inspired by the limma R package.
One command line launches the whole process:
$ eoulsan.sh exec parameter_file.xml design_file.txt
How to perform intensive HTS computations?
- Data analysis requires large computer infrastructures.
- Such clusters are only profitable when the computers are in continuous use.
- Cloud computing can help for small to medium computing requirements.
- Outsourcing data analysis over the network is economical thanks to on-demand reservation of computing resources (e.g. AWS).
MapReduce to increase analysis speed
MapReduce is used for parallel computation and automatically handles duties such as job scheduling, fault tolerance and distributed aggregation.
  Map(id_alignment, alignment) → list(id_exon, 1)
  Sort(id_exon, 1)
  Reduce(id_exon, list(1, 1, ..., 1)) → list(id_exon, count)
White (2009) O'Reilly Media
Hadoop is a popular (Twitter, Facebook, eBay) open-source implementation of the MapReduce framework; the original Google implementation is not public. Hadoop is a Java framework that can be run on any cluster (such as AWS).
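The Map / Sort / Reduce exon-counting scheme on this slide can be sketched in plain Python, without Hadoop, to show what each phase contributes. This is only an illustrative single-process sketch: the function names (`map_alignment`, `reduce_exon`, `count_reads_per_exon`), the toy alignment records, and the assumption that each alignment already lists the exons it overlaps are hypothetical and are not taken from Eoulsan or Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def map_alignment(alignment_id, alignment):
    """Map phase: emit (exon_id, 1) for every exon the alignment overlaps."""
    # Assumption: each alignment record already lists the exons it overlaps.
    return [(exon_id, 1) for exon_id in alignment["exons"]]


def reduce_exon(exon_id, ones):
    """Reduce phase: sum the 1s emitted for a single exon."""
    return (exon_id, sum(ones))


def count_reads_per_exon(alignments):
    # Map: apply the mapper to every (id, alignment) pair.
    pairs = []
    for alignment_id, alignment in alignments.items():
        pairs.extend(map_alignment(alignment_id, alignment))
    # Sort: bring identical exon keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct exon key.
    return [
        reduce_exon(exon_id, [one for _, one in group])
        for exon_id, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical toy input: four alignments over two exons.
toy_alignments = {
    "aln1": {"exons": ["exonA"]},
    "aln2": {"exons": ["exonA", "exonB"]},
    "aln3": {"exons": ["exonB"]},
    "aln4": {"exons": ["exonA"]},
}
exon_counts = count_reads_per_exon(toy_alignments)
print(exon_counts)
```

In a real Hadoop job the map and reduce calls would run on different machines and the framework would perform the sort between them; here the three phases simply run one after another in a single process.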
Eoulsan workflow on AWS
$ eoulsan.sh -conf conf-aws.txt awsexec -d "Job name" param-aws.xml design.txt s3://sgdb-test/demo
Instance type selection
What kind of server (instance) do we need to book from Amazon Web Services for RNA-Seq data analysis?

Instance     Memory (GB)  CPU (EC2 units)  I/O performance  Price (USD/hour)
m1.small         1.7           1           moderate              0.11
m1.large         7.5           4           high                  0.44
m1.xlarge       16.0           8           high                  0.88
m2.xlarge       17.1           6.5         moderate              0.66
m2.2xlarge      34.2          13           high                  1.35
m2.4xlarge      68.4          26           high                  2.70
c1.medium        1.7           5           moderate              0.22
c1.xlarge        7.0          20           high                  0.88

Test data set: mouse RNA-Seq data, 8 samples, 23.5 million reads each (188 million reads in total), 76-base single reads, 10 instances booked.
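As a rough illustration of what booking 10 instances means in terms of price, the hourly cost of the cluster can be computed directly from the prices listed on this slide. This is a minimal sketch: the names `PRICE_USD_PER_HOUR` and `hourly_cluster_cost` are hypothetical, the prices are those quoted here and not current AWS prices, and billing details such as rounding to whole hours or storage and transfer fees are not modelled.

```python
# Hourly prices in USD, copied from the instance table on this slide.
PRICE_USD_PER_HOUR = {
    "m1.small": 0.11,
    "m1.large": 0.44,
    "m1.xlarge": 0.88,
    "m2.xlarge": 0.66,
    "m2.2xlarge": 1.35,
    "m2.4xlarge": 2.70,
    "c1.medium": 0.22,
    "c1.xlarge": 0.88,
}


def hourly_cluster_cost(instance_type, n_instances=10):
    """Price in USD of running n_instances of one type for one hour."""
    return round(PRICE_USD_PER_HOUR[instance_type] * n_instances, 2)


# Hourly price of a 10-instance cluster for each instance type.
for name in PRICE_USD_PER_HOUR:
    print(f"{name:<11} {hourly_cluster_cost(name):6.2f} USD/hour")
```

Multiplying the hourly figure by the measured running time of an analysis gives its total price, which is the quantity to weigh against speed when choosing an instance type.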
Tests for instance and mapper types
Cost and number of instances are proportional
The only choice to make is whether to favor the price or the speed of the analysis. There is no suboptimal configuration.
How to deal with increasing data amounts?
With Illumina sequencers, the throughput doubles every 10 months.

Specifications     Read number    Reads/channel  Reads for RNA-Seq  Samples/run
GAIIx 2010           200,000,000     25,000,000         45,000,000          ~4
HiSeq 1000 2010      640,000,000     80,000,000         45,000,000         ~14
HiSeq 1000 2011    1,500,000,000    187,500,000         45,000,000         ~33
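The last two columns of this table follow from the read numbers by simple division, which the short sketch below reproduces. The names (`RUNS`, `reads_per_channel`, `samples_per_run`) are hypothetical, and the figure of 8 channels per run is an assumption inferred from the ratio between the read number and the reads per channel in the table, not something stated on the slide.

```python
# Reads needed for one RNA-Seq sample, as given in the table.
READS_PER_RNASEQ_SAMPLE = 45_000_000
# Assumption: 8 channels per run, inferred from read number / reads per channel.
CHANNELS_PER_RUN = 8

# Total reads per run for each sequencer specification in the table.
RUNS = {
    "GAIIx 2010": 200_000_000,
    "HiSeq 1000 2010": 640_000_000,
    "HiSeq 1000 2011": 1_500_000_000,
}


def reads_per_channel(total_reads):
    """Reads produced by a single channel of the run."""
    return total_reads // CHANNELS_PER_RUN


def samples_per_run(total_reads):
    """Number of whole RNA-Seq samples that fit in one run."""
    return total_reads // READS_PER_RNASEQ_SAMPLE


for name, total in RUNS.items():
    print(f"{name:<16} {reads_per_channel(total):>12,} reads/channel, "
          f"~{samples_per_run(total)} samples/run")
```

The number of samples that a single run can hold, and therefore the amount of data a single analysis has to process, grows in step with the sequencer throughput.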
Running time increases linearly with sample size
We tested increasing the number of samples from 8 to 32 (188 to 752 million reads) using the Bowtie mapper on varying numbers of m1.large instances.
Conclusion
- Eoulsan automates the analysis of a large number of samples at once;
- It simplifies the execution and configuration of a cloud computing infrastructure;
- Its modular and flexible analysis framework runs with various already available analysis solutions;
- Eoulsan handles the increase in sequencer throughput.
Future developments:
- Improve the RNA-Seq workflow (gapped alignments, spliced transcript abundance estimation, new transcript discovery);
- Add new functional genomics capabilities (ChIP-Seq, small RNA-Seq);
- Test with other cloud solutions (StratusLab, OpenStack, OpenNebula).
Eoulsan is available for download
eoulsan/
- Standalone version
- Distributed version
- Local clusters
- Cloud Computing
Laurent Jourdren, Maria Bernard, Marie-Agnès Dillies, Sophie Lemoine