Accelerating life sciences research


IBM Systems and Technology
Thought Leadership White Paper
June 2013

IBM Platform Symphony helps deliver improved performance for life sciences workloads using Contrail software

Contents
  Addressing the challenges of genome assembly with Contrail
  Accelerating results with IBM Platform Symphony
  The benchmark environment
  Selecting the E.coli model
  Test methodology
  Results
  Interpreting the results
  The additional benefits of Platform Symphony
  Limitations and additional work
  Conclusion
  Appendix: Shell script for benchmark testing
  Actual benchmark results captured over three successive comparative runs
  Hadoop configuration files

New approaches to genomic analysis, such as next-generation sequencing, will play key roles in advancing scientific knowledge, facilitating the development of targeted drugs and delivering personalized healthcare. To capitalize on these new approaches, life sciences organizations need computing environments that can process tremendous amounts of data rapidly. Speed of analysis is critical in life sciences since it relates directly to the rate of discovery and the cost-efficiency of employing genomic sequencing for personalized medicine on a large scale.

Contrail, a bioinformatics application, leverages the Hadoop MapReduce framework to deliver gains in performance and cost-efficiency in genome sequencing. By combining Contrail with IBM Platform Symphony, a commercial workload scheduler and grid manager, researchers can see even greater advantages. This paper presents the results of recent benchmark testing that demonstrate the advantages of using Platform Symphony in conjunction with Contrail.

Addressing the challenges of genome assembly with Contrail

Contrail is open-source software that was developed to solve key challenges associated with large-scale genome assembly. It enables de novo assembly of large genomes from short reads, bridging research in computational biology with advances in the Hadoop MapReduce framework.
The first step in analyzing a previously unsequenced organism is to assemble reads by merging similar reads into progressively longer sequences. Assemblers such as Velvet and Euler attempt to solve the assembly problem by constructing, simplifying and traversing a de Bruijn graph of the read sequences. [1] These assemblers primarily focus on correcting errors, reconstructing unambiguous regions and resolving short repeats.

While these assemblers can manage small genomes, scaling to larger, mammalian-sized genomes is challenging. The assemblers require constructing and manipulating graphs that are too large to fit in the memory of most computer systems. Larger models can require computing environments with terabytes of memory, and building those environments would be too expensive for most institutions.

Contrail addresses the memory limitation by recasting the algorithm to run on a distributed MapReduce framework, which avoids the need for massive amounts of memory on any individual system. Contrail relies on Hadoop to iteratively transform an on-disk representation of the assembly graph, allowing an in-depth analysis even for large genomes on clusters of commodity computer systems running a Linux operating system.
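To make the de Bruijn graph idea above more concrete, the following is a minimal sketch of one common formulation, in which every k-mer in a read contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name debruijn_edges and the sample read are hypothetical, and this is not Contrail's code: Contrail builds and simplifies its graph with MapReduce jobs, and its internal representation may differ.

```shell
# Print one de Bruijn edge per k-mer of a read: (k-1)-prefix -> (k-1)-suffix.
# Illustrative sketch only; names and the sample read are assumptions.
# Usage: debruijn_edges READ K
debruijn_edges() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        # Slide a window of width k across the read.
        for (i = 1; i + k - 1 <= length($0); i++) {
            kmer = substr($0, i, k)
            print substr(kmer, 1, k - 1) " -> " substr(kmer, 2)
        }
    }'
}

debruijn_edges "ACGTAC" 4
```

For the six-base read above and k=4, the sketch prints three edges (ACG -> CGT, CGT -> GTA, GTA -> TAC). Reads that overlap share nodes, which is what lets an assembler merge them into longer sequences.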

Accelerating results with IBM Platform Symphony

Platform Symphony software offers enterprise-class management of distributed compute and big data applications on a scalable, shared grid. By providing a low-latency scheduling environment for heterogeneous workloads, Platform Symphony can help accelerate application workloads and enable IT groups to use resources more efficiently.

Platform Symphony, available on its own and as a limited-use license as part of the IBM InfoSphere BigInsights software distribution, also makes it easy for organizations to run applications specifically designed for big data and achieve higher levels of performance to facilitate rapid decision making.

With Platform Symphony - Advanced Edition augmenting a supported Hadoop distribution, organizations can run their existing Hadoop MapReduce applications without modification. Platform Symphony does not replace Hadoop; it replaces only the standard batch scheduler included with the open-source Hadoop MapReduce distribution. Platform Symphony enhances Hadoop by providing a faster, low-latency MapReduce runtime layer and more reliable and flexible workload management.

In other industries, Platform Symphony has been shown to substantially accelerate Hadoop MapReduce workloads. The goal of this benchmark was to demonstrate how Platform Symphony could deliver similar advantages for a life sciences workload.

The benchmark environment

This benchmark measured the relative performance of a Contrail model with and without Platform Symphony. Relatively little performance optimization was done for either the Hadoop-only case or the Hadoop plus Platform Symphony case. Existing lab hardware was used to conduct the tests, so the hardware environment may not have been optimal, but it was sufficient for this kind of simple comparative test.
Hardware

A Hadoop MapReduce cluster comprising multiple IBM rack-mount servers (see Figure 1) was used to support the benchmark. The cluster had a single head node and seven data nodes. The head node was a 2.6 GHz IBM System x3650 M4 server with 32 GB of memory. Six of the data nodes were IBM System x iDataPlex dx360 M4 servers configured with 64 GB of memory per server and 40 Gbps InfiniBand interconnects. The seventh was an IBM iDataPlex dx360 M3 server. All nodes were connected through a 40 Gbps InfiniBand switch. The test ran IP over InfiniBand (IPoIB). A separate 1 Gb Ethernet network was used for node configuration and management.

Figure 1. IBM System x test environment for Contrail performance comparisons: a Mellanox InfiniBand switch connecting one IBM x3650 M4 server, six IBM dx360 M4 servers and one IBM dx360 M3 server.

Software

The cluster nodes all ran Red Hat Enterprise Linux 6.2. Hadoop was downloaded from apache.org and configured in accordance with instructions provided in the Platform Symphony release notes (see Figure 2). The Contrail software tested was the latest version available from apps/mediawiki/contrail-bio/index.php?title=contrail as of March. The Contrail code was installed based on instructions in the Contrail Quickstart Guide (available on the Contrail wiki).

Figure 2. The Platform Symphony management console cluster view.

For the comparative test, Platform Symphony was used in conjunction with the Hadoop software above. Tests were initially conducted with two Hadoop releases, one of them 1.1.1, but it was judged to be more valid to focus on 1.1.1 since this was the more recent version. A significant difference between the two versions is the heartbeat interval: one release employs a more aggressive 0.3-second heartbeat interval, while the other has a 3-second interval. For this reason, the release with the shorter interval generally performs better on small clusters such as the test environment, where a fast heartbeat interval is reasonable.

Selecting the E.coli model

For this comparative benchmark testing, the Ecoli.10k file included in the data directory of the Contrail distribution was chosen as the basis for the test. [2] The benchmark team treated the E.coli model provided with the Contrail distribution as a black box and ran Contrail in accordance with the provided directions.

Test methodology

To simplify the benchmark testing, and to facilitate repeated runs with different data models and parameter settings, a shell script was developed (see the Appendix) to run the benchmark. Much of the logic of the script involves parsing the output of the Contrail simulation runs, for both the Hadoop-only and the Hadoop plus Platform Symphony cases, to capture runtime details from repeated benchmark runs.
Without this kind of automation, manually gathering statistics from repeated job runs so that they could be easily compared would have been tedious.
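As a rough illustration of the kind of summary this automation produces, the sketch below computes the job count, the longest job and the average job length in a single pass. It assumes, hypothetically, that the per-job durations have already been extracted as one number of seconds per line; the function name job_stats and the sample values are illustrative and are not taken from the benchmark script.

```shell
# Summarize job lengths read from standard input, one duration (seconds) per line.
# Sketch only: the input format is an assumption, not Contrail's log format.
job_stats() {
    awk '
        NF { n++; tot += $1; if ($1 > max) max = $1 }   # skip blank lines
        END {
            if (n == 0) { print "no jobs found"; exit 1 }
            printf "Total jobs: %d\n", n
            printf "Max job length: %d sec\n", max
            printf "Avg job length: %.2f sec\n", tot / n
        }'
}

printf '%s\n' 4 18 2 | job_stats
```

With the three sample durations, the sketch reports 3 jobs, a maximum of 18 seconds and an average of 8.00 seconds. Doing the arithmetic in awk avoids a temporary file and a dependency on bc.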

The two test case configurations employed mostly the default settings. The benchmark team did, however, change three variables in the Platform Symphony application profile for the Symphony MapReduce tenant, under which the Contrail jobs ran. The application profiles were configured with these settings:

  preStartApplication="true"
  taskLowWaterMark="0.0"
  taskHighWaterMark="1.0"

These settings are known to deliver better performance for Hadoop MapReduce workloads and would likely be the same settings used by organizations deploying such an application in production. These are standard settings explained in the Platform Symphony product documentation.

The benchmark execution script:

- Sets up the environment for both Hadoop and Platform Symphony
- Cleans up the Hadoop Distributed File System (HDFS) environment to make sure there is no data from prior runs
- Copies the E.coli model files into HDFS
- Runs the identical model twice: once using the Hadoop-only environment and once using the Hadoop plus Platform Symphony environment

Following these runs, the output files contrail.out.hadoop and contrail.out.symphony generated by the script were parsed to show comparative runtime statistics. For the Platform Symphony portion of the test, the running jobs could be monitored through the Platform Symphony management console (see Figure 3).

Figure 3. View of running Contrail jobs in the Platform Symphony management console.

Results

Using standard Hadoop 1.1.1, the average duration of each Hadoop MapReduce job was found to be seconds with a total runtime of 873 seconds. Using Hadoop in conjunction with Platform Symphony accelerated the calculation of the Contrail model, reducing the average job runtime to just 4.68 seconds and compressing the total runtime to just 258 seconds, roughly a 3.4 times performance boost. The captured script output is shown below. Figure 4 shows the relative total runtimes in bar chart form.

Hadoop + Platform Symphony
  Total jobs: 53
  Maximum job length: 124 seconds
  Average job length: seconds
  Total duration: 258 seconds

Hadoop only
  Total jobs: 53
  Maximum job length: 18 seconds
  Average job length: seconds
  Total duration: 873 seconds

Figure 4. Contrail runtime on a subset of E.coli bacteria (10K reads): total runtime for 53 jobs, in seconds, without and with Platform Symphony. Using Platform Symphony with Hadoop helped significantly reduce the workload's runtime.

Interpreting the results

While not all models will show similar performance gains, the observations in this test are consistent with a social media benchmark [3] in which Platform Symphony was shown to accelerate workloads by an average of 7.3 times. Generally, for latency-sensitive applications that involve multiple short-running jobs, Platform Symphony will help improve performance because of its low-latency scheduling architecture. As a result, organizations can either complete work faster or realize cost savings by deploying a smaller cluster environment to attain performance objectives.
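The total runtimes reported above (873 seconds and 258 seconds for 53 jobs) can be turned into a speedup figure and an average wall-clock saving per job with a few lines of shell. This is a minimal sketch: the function names speedup and per_job_saving are illustrative, and the per-job figure simply spreads the difference in total duration evenly across the jobs, which is an assumption and not a measured per-job value.

```shell
# speedup BASELINE_SECONDS ACCELERATED_SECONDS
speedup() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", base / fast }'
}

# per_job_saving BASELINE_SECONDS ACCELERATED_SECONDS JOB_COUNT
# Spreads the total time saved evenly across all jobs (an assumption).
per_job_saving() {
    awk -v base="$1" -v fast="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", (base - fast) / n }'
}

echo "Speedup: $(speedup 873 258)x"
echo "Average saving per job: $(per_job_saving 873 258 53) sec"
```

With these inputs the sketch prints a speedup of 3.38x and an average saving of 11.60 seconds per job. A saving of that size across many short jobs is consistent with the point that scheduling latency matters most for workloads made up of short-running jobs.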

The additional benefits of Platform Symphony

Even though this effort focuses on comparing performance, Platform Symphony includes capabilities that can provide several additional advantages to life sciences organizations. For example:

- Proportional resource allocation: Organizations can run multiple MapReduce workloads concurrently, dynamically changing priorities and associated resource allocations in real time.
- Fast job pre-emption: Organizations can make sure critical workloads start and finish quickly while longer-running workloads continue to run in the background.
- Job recoverability: JobTracker execution is journaled so that jobs can resume where they left off in the event of failure.
- Optional IBM General Parallel File System (IBM GPFS): Organizations running both MapReduce and non-MapReduce workloads can benefit from GPFS since it is a POSIX [4] file system that can support both kinds of workload concurrently accessing file system data, without the need to copy data in and out of the file system.
- Multi-mode clusters: Organizations running Hadoop MapReduce as well as traditional non-MapReduce workloads can configure individual clusters to support both Platform LSF and Platform Symphony. Platform LSF is a powerful workload management solution for running large, batch-oriented workloads. Running both Platform LSF and Platform Symphony on the same cluster can deliver additional flexibility and increase the number of life sciences applications that can efficiently share cluster resources.

Limitations and additional work

This test involved a single model; organizations could experience different results with different models or different numbers of reads. Results may also vary with the size of the cluster. Furthermore, the disk subsystem as configured was suboptimal for both test cases. Organizations might see different results with a more optimized file system configuration.
It is debatable whether this specific Contrail test should be described as a big data workload since the actual files involved are relatively small by big data standards. The business advantage of using Hadoop MapReduce for this kind of workload, however, is undeniable. The MapReduce framework helps reduce the costs of performing de novo genome assembly, avoiding the need for costly systems with massive amounts of physical memory. Based on these tests, Platform Symphony builds on the inherent advantages associated with the use of Contrail by providing an additional incremental performance advantage.

Conclusion

As this testing demonstrates, life sciences organizations using Contrail can expect to see a significant performance advantage by using the Platform Symphony scheduler in place of the standard scheduler included with the Hadoop MapReduce distribution. In the sample model comprising 10,000 reads, Platform Symphony accelerated the calculation of the Contrail result by 3.4 times. Because InfoSphere BigInsights 2.1 incorporates the IBM Platform Symphony scheduler, life sciences organizations considering deploying Hadoop MapReduce workloads along with other existing workloads should consider BigInsights as a platform for their big data applications.

Appendix: Shell script for benchmark testing

contrail-test.sh

This is the script used to control the execution of the benchmark.

#!/bin/bash
# bash is required: the script uses arrays, [[ ]], (( )) and &>.

usage()
{
cat << EOF
usage: $0 -i <path> -o <path> [-k <int> -l <prefix>]

This script runs Contrail on the input data.

OPTIONS
   -i <path>    Path to the HDFS input directory
   -o <path>    Path to the HDFS output directory
   -k <int>     Value of K (default 25)
   -l <prefix>  Local outfile prefix (default contrail.out)
EOF
}

#
# Extract total duration from contrail output
get_duration()
{
    local dur=`grep "Duration:" $1 | awk '{ print $3; }'`
    echo $dur
}

#
# Parse Hadoop contrail output and print statistics
parse_hadoop()
{
    local outfile=$1
    local jobpattern="job_"
    local tmpfile="_lengths.tmp"
    local max=0
    local tot=0
    # Count the job lines, then collect each job's length in seconds
    local num=`grep $jobpattern $outfile | wc -l`
    grep $jobpattern $outfile | sed -E "s/(.*) ($jobpattern.*)/\2/g" | awk '{ print $2; }' > $tmpfile
    local jobs=( $( cat $tmpfile ) )
    rm -f $tmpfile

    # Track the longest job and the running total
    for i in ${jobs[@]}
    do
        if [ $i -gt $max ]
        then
            max=$i
        fi
        ((tot=$tot+$i))
    done

    avg=`echo "scale=4; $tot/$num" | bc`
    echo "Hadoop -- Total Jobs: $num"
    echo "    Max Job Length: $max sec"
    echo "    Avg Job Length: $avg sec"
}

#
# Parse Symphony contrail output and print statistics
parse_symphony()
{
    local outfile=$1
    local jobpattern="^job_"
    local tmpfile="_lengths.tmp"
    local max=0
    local tot=0
    # The published listing omits the steps that count the jobs and load
    # their lengths; the four lines below are assumed to mirror parse_hadoop.
    local num=`grep $jobpattern $outfile | wc -l`
    grep $jobpattern $outfile | awk '{ print $2; }' > $tmpfile
    local jobs=( $( cat $tmpfile ) )
    rm -f $tmpfile

    for i in ${jobs[@]}
    do
        if [ $i -gt $max ]
        then
            max=$i
        fi
        ((tot=$tot+$i))
    done

    avg=`echo "scale=4; $tot/$num" | bc`
    echo "Symphony -- Total Jobs: $num"
    echo "    Max Job Length: $max sec"
    echo "    Avg Job Length: $avg sec"
}

HDFS_INPUT=
HDFS_OUTPUT=
CONTRAIL_K=25
PREFIX=contrail.out

while getopts "i:o:k:l:" ARG
do
    case $ARG in
        i) HDFS_INPUT=$OPTARG ;;
        o) HDFS_OUTPUT=$OPTARG ;;
        k) CONTRAIL_K=$OPTARG ;;
        l) PREFIX=$OPTARG ;;
    esac
done

SYM_ASMDIR=${HDFS_OUTPUT}.symphony
SYM_OUTFILE=${PREFIX}.symphony
HADOOP_ASMDIR=${HDFS_OUTPUT}.hadoop
HADOOP_OUTFILE=${PREFIX}.hadoop

# Both HDFS paths are mandatory
if [[ -z $HDFS_INPUT ]] || [[ -z $HDFS_OUTPUT ]]
then
    usage
    exit 1
fi

if [[ -z ${HADOOP_HOME} ]]
then
    echo "HADOOP_HOME not defined."
    exit 1
fi

if [[ -z ${PMR_BINDIR} ]]
then
    echo "PMR_BINDIR not defined."
    exit 1
fi

echo "Cleaning HDFS:${HDFS_INPUT}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${HDFS_INPUT}

echo "Cleaning HDFS:${HADOOP_ASMDIR}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${HADOOP_ASMDIR}
echo "Cleaning HDFS:${SYM_ASMDIR}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${SYM_ASMDIR}

echo "Copying input files to HDFS:${HDFS_INPUT}"
${HADOOP_HOME}/bin/hadoop fs -mkdir ${HDFS_INPUT}
${HADOOP_HOME}/bin/hadoop fs -copyFromLocal Ec10k.sim[12].fq ${HDFS_INPUT}

# Set outside the commented block because the Symphony run also needs it
export CONTRAIL_JAR=contrail.jar

# Hadoop-only run (commented out in the listing as published;
# uncomment to regenerate ${HADOOP_OUTFILE})
# echo "Running contrail (K=${CONTRAIL_K}) on Hadoop"
# echo "======= Redirecting all output to ${HADOOP_OUTFILE} in the current directory"
# ${HADOOP_HOME}/bin/hadoop jar ${CONTRAIL_JAR} contrail.Contrail -asm ${HADOOP_ASMDIR} -k ${CONTRAIL_K} -reads ${HDFS_INPUT} &> ${HADOOP_OUTFILE}
# if [ $? -ne 0 ]
# then
#     echo "ERROR: Hadoop execution failed. Aborting..."
#     exit 1;
# fi

echo "Running contrail (K=${CONTRAIL_K}) on Symphony"
echo "======= Redirecting all output to ${SYM_OUTFILE} in the current directory"
$PMR_BINDIR/mrsh jar ${CONTRAIL_JAR} contrail.Contrail -asm ${SYM_ASMDIR} -k ${CONTRAIL_K} -reads ${HDFS_INPUT} &> ${SYM_OUTFILE}
if [ $? -ne 0 ]
then
    echo "ERROR: Symphony execution failed. Aborting..."
    exit 1;
fi

# Print per-environment statistics followed by the overall speedup
echo
parse_symphony ${SYM_OUTFILE}
SYMPHONY_DUR=`get_duration ${SYM_OUTFILE}`
echo "    Total Duration: ${SYMPHONY_DUR} sec"
echo
parse_hadoop ${HADOOP_OUTFILE}
HADOOP_DUR=`get_duration ${HADOOP_OUTFILE}`
echo "    Total Duration: ${HADOOP_DUR} sec"
echo
SPEEDUP=`echo "scale=4; ${HADOOP_DUR}/${SYMPHONY_DUR}" | bc`
echo "Symphony Speedup: ${SPEEDUP}x"

Actual benchmark results captured over three successive comparative runs

Note that the second and third test results were discarded because Platform Symphony performance was substantially better than the Hadoop MapReduce results, likely because of caching effects: Platform Symphony can persist services.
Symphony
  Total jobs: 53
  Maximum job length: 124 seconds
  Average job length: seconds
  Total duration: 258 seconds
Hadoop
  Total jobs: 53
  Maximum job length: 18 seconds
  Average job length: seconds
  Total duration: 873 seconds
Symphony speedup: times

Symphony
  Total jobs: 53
  Maximum job length: 4 seconds
  Average job length: seconds
  Total duration: 142 seconds

Hadoop
  Total jobs: 53
  Maximum job length: 20 seconds
  Average job length: seconds
  Total duration: 871 seconds
Symphony speedup: times

Symphony
  Total jobs: 53
  Maximum job length: 4 seconds
  Average job length: seconds
  Total duration: 142 seconds
Hadoop
  Total jobs: 53
  Maximum job length: 20 seconds
  Average job length: seconds
  Total duration: 871 seconds
Symphony speedup: times

Hadoop configuration files

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/data</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://atsplat2.private:19000/</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>atsplat2.private:19001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>15</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>15</value>
  </property>
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
</configuration>
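One way to sanity-check the slot and heap settings in mapred-site.xml is to compare the worst-case task heap demand with the memory of a node. The sketch below is illustrative only: the function name heap_demand_mb is hypothetical, and it assumes the settings applied to the data nodes with 64 GB of memory described in the hardware section. It also counts only the Java heap set by -Xmx, not the total memory each task process may use.

```shell
# Worst-case heap demand on one node, in MB, if every task slot is busy.
# Usage: heap_demand_mb MAP_SLOTS REDUCE_SLOTS HEAP_MB_PER_TASK
heap_demand_mb() {
    echo $(( ($1 + $2) * $3 ))
}

demand=$(heap_demand_mb 15 15 2048)   # slot counts and -Xmx from mapred-site.xml
node_mb=$(( 64 * 1024 ))              # assumed: a data node with 64 GB of memory
echo "Task heap demand: ${demand} MB of ${node_mb} MB"
```

With 15 map slots, 15 reduce slots and a 2048 MB heap per task, the demand comes to 61440 MB against 65536 MB on a 64 GB node, so under these assumptions the configured heaps fit with little headroom to spare.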

For more information

To learn more about Contrail, visit: contrail-bio;a=tree

For more information about IBM Platform Symphony, visit: ibm.com/platformcomputing/products/symphony

For more information about IBM InfoSphere BigInsights and other IBM big data solutions, contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/infosphere/biginsights

© Copyright IBM Corporation 2013

IBM Corporation
Systems and Technology Group
Route 100
Somers, NY

Produced in the United States of America
June 2013

IBM, the IBM logo, ibm.com, BigInsights, GPFS, iDataPlex, InfoSphere, LSF, Platform, and System x are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time.

The performance data discussed herein is presented as derived under specific operating conditions. Actual results may vary.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.
[1] While the science of genome assembly is outside of the scope of this paper, interested parties can learn more about Contrail by visiting:

[2] For details about the 10K read E.coli model included with the Contrail software distribution, visit: gitweb.cgi?p=contrail-bio/contrail-bio;a=tree. Groundbreaking work on the E.coli K-12 strain MG1655 was done at the University of Wisconsin. For more information, visit

[3] For an audited STAC Report commissioned by IBM, visit: ibm.com/systems/technicalcomputing/platformcomputing/products/symphony/highperfhadoop.html

[4] Portable Operating System Interface for UNIX. See for details.

DCW03047USEN-01