Accelerated BLAST Performance with Tera-BLAST : a comparison of FPGA versus GPU and CPU BLAST implementations

Similar documents
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

GPU-based Cloud Computing for Comparing the Structure of Protein Binding Sites

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Clustering Billions of Data Points Using GPUs

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Ordered Index Seed Algorithm for Intensive DNA Sequence Comparison

CentOS Linux 5.2 and Apache 2.2 vs. Microsoft Windows Web Server 2008 and IIS 7.0 when Serving Static and PHP Content

Pairwise Sequence Alignment

Bioinformatics Resources at a Glance

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

High Performance Computing in CST STUDIO SUITE

Linear Sequence Analysis. 3-D Structure Analysis

Accelerating CFD using OpenFOAM with GPUs

Bioinformatics Grid - Enabled Tools For Biologists.

BMC Bioinformatics. Open Access. Abstract

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:

Biological Databases and Protein Sequence Analysis

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

CD-HIT User s Guide. Last updated: April 5,

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

MetaMorph Microscopy Automation & Image Analysis Software Super-Resolution Module

Introduction to GPU hardware and to CUDA

L20: GPU Architecture and Models

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

ultra fast SOM using CUDA

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014

Recommended hardware system configurations for ANSYS users

Parallel Computing with MATLAB

Molecular Databases and Tools

Tableau Server 7.0 scalability

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

GPU-Based Network Traffic Monitoring & Analysis Tools

BIOINFORMATICS TUTORIAL

Intelligent Heuristic Construction with Active Learning

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs

GeoImaging Accelerator Pansharp Test Results

Stream Processing on GPUs Using Distributed Multimedia Middleware

GPU Renderfarm with Integrated Asset Management & Production System (AMPS)

A Tutorial in Genetic Sequence Classification Tools and Techniques

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Packet-based Network Traffic Monitoring and Analysis with GPUs

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

ClusterControl: A Web Interface for Distributing and Monitoring Bioinformatics Applications on a Linux Cluster

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

vrealize Business System Requirements Guide

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

CFD Implementation with In-Socket FPGA Accelerators

GPU-accelerated Large Scale Analytics using MapReduce Model

Recent Advances in HPC for Structural Mechanics Simulations

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

VIBE. Visual Integrated Bioinformatics Environment. Enter the Visual Age of Computational Genomics. Whitepaper

Introduction to GPU Programming Languages

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

Copyright by Parallels Holdings, Ltd. All rights reserved.

Integrated Grid Solutions. and Greenplum

Resource Scheduling Best Practice in Hybrid Clusters

Amazon EC2 XenApp Scalability Analysis

Sequence Analysis on a 216-Processor Beowulf Cluster

Virtualizing SQL Server 2008 Using EMC VNX Series and Microsoft Windows Server 2008 R2 Hyper-V. Reference Architecture

High Availability of the Polarion Server

Enterprise Deployment: Laserfiche 8 in a Virtual Environment. White Paper

Vocera Voice 4.3 and 4.4 Server Sizing Matrix

Informatica Ultra Messaging SMX Shared-Memory Transport

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Performance Analysis of Web based Applications on Single and Multi Core Servers

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Drivers to support the growing business data demand for Performance Management solutions and BI Analytics

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

RWTH GPU Cluster. Sandra Wienke November Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky

Network Traffic Monitoring and Analysis with GPUs

Clusters with GPUs under Linux and Windows HPC

How to choose a suitable computer

Trends in High-Performance Computing for Power Grid Applications

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Envox CDP 7.0 Performance Comparison of VoiceXML and Envox Scripts

Installation Guide. (Version ) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom

Several tips on how to choose a suitable computer

NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud

StruxureWare TM Center Expert. Data

OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution

Remote Web Services for Model Building

Transcription:

Technical Note Accelerated BLAST Performance with : a comparison of FPGA versus GPU and CPU BLAST implementations TimeLogic Division, Active Motif Inc, 1914 Palomar Oaks Way, Suite 150, Carlsbad, CA 92008 ABSTRACT A number of technologies have emerged for accelerating similarity search algorithms in bioinformatics, including the use of field programmable gate arrays (FPGA), graphics processing units (GPU), and clusters of standard multicore CPUs. Here we present, an FPGA-accelerated implementation of the BLAST algorithm, and compare the performance to GPU-accelerated BLAST and the industry standard on high performance computers. Our results show that, running on the TimeLogic J-series FPGA Similarity Search Engine, performs 100 s of times faster than BLAST running on generic NVIDIA Tesla M2090 GPU cards or standard Intel Xeon multi-core CPU s. INTRODUCTION The use of algorithms that perform similarity searches between nucleotide or protein sequences is commonplace within the field of bioinformatics. The Smith-Waterman algorithm (Smith & Waterman, 1981) has long been the Gold Standard for sequence alignment tools because it delivers the best local alignment for a given query/target pair. Unfortunately, dynamic programming algorithms such as Smith-Waterman are too cumbersome for high-throughput bioinformatics work due to the large size of the data sets involved. As a result, heuristic methodology has been utilized to improve algorithmic performance. One such approach, the Basic Local Alignment Search Tool (BLAST) (Altschul et al, 1997), has been enormously popular as it very successfully achieves a balance of performance and search sensitivity. However, continued improvements in DNA sequencing technology along with the corresponding growth of public database repositories have meant that even algorithms like BLAST have struggled to keep up with demand. Addressing this challenge has traditionally required larger and larger CPU clusters with significant acquisition and operational costs. To tackle this persistent problem, heuristic algorithms have been combined with hardware acceleration technologies like general-purpose graphics processing units (GPU) and field programmable gate arrays (FPGA). Here we compare BLAST running on CPUs, GPUs, and FPGAs to show that the FPGA-accelerated algorithm from TimeLogic provides the industry s fastest and most cost effective similarity search tool for high-throughput bioinformatics. MATERIALS AND METHODS Hardware and Software All tests were run on a Dell R720 server with dual Xeon E5-2690 2.90 GHz 8-core hyper-threaded CPUs (32 cores) and 64GB RAM, under the 64-bit CentOS v6 Linux operating system. CPU tests were performed with v2.2.27 (Camacho et al., 2009) with default parameters. All other software tests were run using the equivalent of these defaults: -num_descriptions 500 -num_alignments 250 -word_size 3 -threshold 11 -evalue 10 -seg no. GPU tests were performed with a single high-end NVIDIA Tesla M2090 GPU card with 512 processors at 1.3 GHz, and 6GB of global memory (retail cost ~$2500.00). Setup required the download and installation of the CUDA toolkit v4.2.9, 64-bit Red Hat driver v295.41, and the GPU Computing SDK v 4.2.9, followed by application specific software for CUDA-BLASTP v2.0 (Liu, et al., 2011) and GPU-BLASTP v1.1 (Vouzis and Sahinidis, 2011). Because CUDA-BLASTP has a limitation of accepting only a single query sequence, a perl script was written to iterate over multiple query sequences in series. For GPU-BLAST, default values were used for -gpu_blocks and -gpu_threads parameters. Version 1.0 May 2013

tests were performed with one and two TimeLogic J-series FPGA-accelerated Similarity Search Engines running DeCypher v9.0 software. Data Sets Query sequence data was taken from a subset of the Uniprot Swissprot database comprising the first 1.2 megasymbols of amino acid residues (3,266 sequences). Database sequence data was taken from a subset of the NCBI non-redundant (nr) protein database comprising the first 1 giga-symbols of amino acid residues (3,219,757 sequences). Binary BLAST databases were built following established protocols for each respective BLAST implementation. RESULTS AND DISCUSSION FPGA-Accelerated searches with one and two TimeLogic J-series accelerators installed in a 32 CPU-core server ran in 2.9 and 1.9 minutes, respectively. Comparatively, GPU-BLAST searches ran between 303.1 and 104.9 minutes, and NCBI BLAST+ between 817.7 and 77.7 minutes, depending on the number of CPU cores used in these searches (Table 1 and Figure 1). and GPU-BLAST search performance is dependent on the number of CPU cores available, and increases with increasing CPU allocation. However, as CPU cores increase, GPU performance eventually falls below CPU-only performance. When 16 CPU cores are used, the benefit of using a GPU accelerator is negligible (107.8 minutes vs 104.9 minutes), and when 32 CPU cores are used, GPU accelerated versions actually run 41% slower than CPU-only versions. Time (minutes) GPU-BLAST (1x Tesla M2090) (1x J3 accelerator) 1 817.7 303.1 n/a n/a 2 445.4 233.0 n/a n/a 4 257.2 154.3 n/a n/a 8 167.8 110.8 n/a n/a 16 107.8 104.9 n/a n/a 32 77.7 109.7 2.9 1.9 Table 1: Search times for each BLAST version search using increasing numbers of CPU cores. Figure 1: BLAST search times in minutes for (FPGA), GPU-BLAST (GPU), and NCBI-BLAST+ (CPU). Using two J-series similarity search engines, runs up to 434.2x faster than running on a single core (Table 2 and Figure 2). runs up to 41.2x faster than running on 32 processor cores (Figure 2), or 1319.7x per core equivalent (Table 3 and Figure 3). www.timelogic.com May 2013 2

Acceleration Factor vs. 1 core GPU-BLAST (1x Tesla M2090) (1x J3 accelerator) 1 1.0x 2.7x n/a n/a 2 1.8x 3.5x n/a n/a 4 3.2x 5.3x n/a n/a 8 4.9x 7.4x n/a n/a 16 7.6x 7.8x n/a n/a 32 10.5x 7.5x 283.6x 434.2x Table 2: runs up to 434.2x faster than running on a single core. runs only 10.5x faster when 32 cores vs. 1 core are used. GPU-BLAST runs slower than once more than 16 CPU cores are available. Figure 2: performs up to 434.2x faster than and up to 161.0x faster than GPU- BLAST when only 1 CPU is used. When 32 CPU cores are used, performs up to 41.2x faster than NCBI-BLAST+ and up to 58.2x faster than GPU-BLAST. Acceleration Factor as core equivalents GPU-BLAST (1x Tesla M2090) (1x J3 accelerator) 1 1.0x 2.7x n/a n/a 2 2.0x 3.8x n/a n/a 4 4.0x 6.7x n/a n/a 8 8.0x 11.1x n/a n/a 16 16.0x 16.4x n/a n/a 32 32.0x 22.7x 862.0x 1319.7x Table 3: CPU Core Equivalents are calculated by multiplying the normalized acceleration factor by the number of cores used. www.timelogic.com May 2013 3

Figure 3: supplies up to 1319.7 core equivalents of performance in a single 2U node. Types of GPU-Accelerated BLAST The CUDA-BLASTP software is published as performing up to 10x faster than when compared to a single CPU core (Liu, et al., 2011). In practice, performance is generally lower due to several significant limitations. First, optimal performance is highly data specific and dependent on query sequence length. Second, CUDA-BLASTP was designed to only accept a single query sequence. This severely limits its use in high performance computing, as the only way to analyze larger data sets is to iterate serially over multiple query sequence files. Such iteration, according to the CUDA-BLASTP author, will add significant performance overhead. Using this iterative method, which was unnecessary for all other software in this experiment, CUDA-BLASTP performed slower than standard CPUs. In this case, performance was decreased by approximately 50 percent (data not shown). CUDA-BLASTP also reported errors on approximately 5% of the search iterations: Unspecified launch failure errors and Segmentation fault (core dumped). Alternatively, the GPU-BLAST software is built upon the standard NCBI-BLAST+ code and, like the original, can handle both multiple queries and multiple CPU processors. It is reported to provide a speedup of 3-4 times when compared to running on a single CPU core, but with our data, GPU-BLAST performed only approximately 2 times faster (data shown above). When more commonly available multi-core CPUs are used, for example the 32 CPU core Dell R720 servers used for these tests, GPU-BLAST ran 41% slower than. CONCLUSION In summary, offers the best performance for the BLAST algorithm based on resource investment. Tera- BLAST exceeds CPU performance by up to 434.2x, and GPU-accelerated performance by up to 161.0x. Furthermore, provides the equivalent of up to 1319.7 CPU cores of performance in a single 2U server. In addition to superior performance, TimeLogic s DeCypher offers far more flexibility than GPU-based implementations in that it can be used for all flavors of BLAST (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX), as well as Tera-Probe, which these GPU alternatives do not support. GPU-accelerated BLAST algorithms perform only slightly better than standard CPUs, and only when small numbers of CPU cores are utilized. When more than 8 CPU cores are used, this performance gain disappears altogether. When all 32 available CPU cores were utilized, GPU performance was actually worse than running in software. As multi-core CPUs are now the minimal standard offering, GPU computing for BLAST does not appear to be a costeffective investment. When considering price for performance in high-throughput biocomputing, DeCypher is by far the best solution for accelerated high-throughput analysis. www.timelogic.com May 2013 4

REFERENCES 1. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. (1997) Gapped BLAST and PSI- BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402. (PMID: 9254694) 2. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421. (PMID: 20003500) 3. Liu, W., Schmidt, B., and Muller-Wittig, W. (2011) CUDA-BLASTP: Accelerating BLASTP on CUDA-Enabled Graphics Hardware. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(6):1678-84. (PMID: 21339531) 4. Smith, T., Waterman, M. (1981) Identification of common molecular subsequences. J Molecular Biology 147:195 197. (PMID: 7265238) 5. Vouzis, P. and Sahinidis, N. (2011) GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics, 2011, 27(2):182-8. (PMID: 21088027) www.timelogic.com May 2013 5