SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequence Variants



Dr. Xiuwen Zheng
Department of Biostatistics
University of Washington, Seattle

Introduction

Genetic data sets of thousands of gigabytes pose significant challenges in data management, even on well-equipped hardware.

The 1000 Genomes Project Phase 1 (1KG): http://www.1000genomes.org
- ~39 million variants (differences from the reference genome) in 1092 individuals
- Variant Call Format (VCF) files: genotypes + annotation, totaling ~184 GB compressed

Methods

CoreArray (C++ library)
- designed for large-scale data management of genome-wide variants
- data format (GDS) to store multiple array-oriented datasets in a single file

Two R packages
- gdsfmt: R interface to CoreArray Genomic Data Structure (GDS) files
- SeqArray: specifically designed for data management of genome-wide sequence variants from Variant Call Format (VCF) files

Methods: Advantages

1. Direct access to data without parsing VCF text files
2. Data stored in a binary, array-oriented manner
   - 2 bits are employed as a primitive type to store alleles (e.g., A, G, C, T)
   - efficient access to variants using the R language
3. Genotypic data stored in a compressed manner
   - rare variants -> highly compressed without sacrificing access efficiency
   - e.g., 1KG: 26 GB of genotypes -> 1.5 GB with the zlib algorithm (5.8%)
4. Runs in parallel
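To make the "2 bits per allele" point concrete, the following is a minimal base-R sketch of packing allele codes 0 to 3 four to a byte. The helpers pack2bit() and unpack2bit() are hypothetical names invented for this illustration; they are not part of gdsfmt or SeqArray, and the bit order chosen here is an assumption, not the actual GDS on-disk layout.

```r
# Hypothetical helpers for illustration only; not the gdsfmt/SeqArray API.

# Pack integer allele codes (each 0..3) into raw bytes, four codes per byte.
pack2bit <- function(alleles) {
  stopifnot(all(alleles >= 0L & alleles <= 3L))
  # pad with zeros so the length is a multiple of 4
  pad <- (4L - length(alleles) %% 4L) %% 4L
  codes <- matrix(c(as.integer(alleles), integer(pad)), nrow = 4L)
  # the k-th code in a byte (k = 0..3) is weighted by 4^k, i.e. shifted by 2*k bits
  weights <- c(1L, 4L, 16L, 64L)
  as.raw(colSums(codes * weights))
}

# Recover the first n allele codes from the packed bytes.
unpack2bit <- function(bytes, n) {
  b <- as.integer(bytes)
  codes <- rbind(bitwAnd(b, 3L),
                 bitwAnd(bitwShiftR(b, 2L), 3L),
                 bitwAnd(bitwShiftR(b, 4L), 3L),
                 bitwAnd(bitwShiftR(b, 6L), 3L))
  as.vector(codes)[seq_len(n)]
}

alleles <- c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L)
packed <- pack2bit(alleles)
length(packed)                         # 3 bytes hold 10 allele codes
unpack2bit(packed, length(alleles))    # round-trips to the original codes
```

Ten allele codes that would occupy 40 bytes as R integers fit in 3 bytes here, before any zlib compression is applied on top.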

Methods: Key Functions

Table 1: The key functions in the SeqArray package.

Function        Description
seqVCF2GDS      Reformats VCF files to GDS format
seqSummary      Gets the summary (# of samples, # of variants, INFO/FORMAT variables, etc.)
seqSetFilter    Sets a filter on samples or variants (defines a subset of data)
seqGetData      Gets data from a sequencing file (from a subset of data)
seqApply        Applies a user-defined function over array margins
seqParallel     Applies functions in parallel

Benchmark

Dataset: the 1000 Genomes Project Phase 1, chromosome 1
- 3,007,196 variants, 1092 individuals
- the original VCF file: 10.7 GB (compressed), genotypes + annotations
- reformatted to a single SeqArray file: 10.6 GB

Task: calculate the frequencies of reference alleles
1. R code (sequential version)
2. R code (parallel version)
3. Seamless R and C++ integration via the Rcpp package (sequential version)

Benchmark Test 1 (sequential)

    # load the R package
    library(SeqArray)

    # open the file
    genofile <- seqOpen("1kg.chr1.gds")

    # apply the user-defined function variant by variant
    system.time(
        afreq <- seqApply(genofile, "genotype",
            FUN = function(x) { mean(x == 0, na.rm = TRUE) },
            as.is = "double", margin = "by.variant")
    )

A typical x passed to the user-defined function (rows are alleles, columns are samples) looks like:

         [,1] [,2] [,3] [,4] [,5]
    [1,]    0    1    0    1    1
    [2,]    0    0    0    1    0

where 0 is the reference allele and 1 is the first alternative allele.

Runtime: ~6.8 minutes on a Linux system with two quad-core Intel processors (2.27 GHz) and 32 GB RAM.
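As a small check of what the user-defined function computes for one variant, the sketch below applies it to the example matrix from the slide using base R only. The name ref_freq and the missing-genotype variation are assumptions added for illustration.

```r
# The 2 x 5 example matrix from the slide: rows are alleles, columns are samples.
x <- matrix(c(0L, 0L,   # sample 1
              1L, 0L,   # sample 2
              0L, 0L,   # sample 3
              1L, 1L,   # sample 4
              1L, 0L),  # sample 5
            nrow = 2)

# Same body as the function passed to seqApply in Test 1
# (ref_freq is a name chosen here for illustration).
ref_freq <- function(x) mean(x == 0, na.rm = TRUE)

ref_freq(x)            # 6 of the 10 alleles are the reference allele: 0.6

# With a missing allele, na.rm = TRUE drops it from both numerator and denominator.
x_missing <- x
x_missing[1, 1] <- NA
ref_freq(x_missing)    # 5 reference alleles out of 9 observed
```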

Benchmark Test 2 (in parallel)

    # load the R package
    library(parallel)

    # create a computing cluster with 4 cores
    seqParallelSetup(4)

    # run in parallel
    system.time(
        afreq <- seqParallel(gdsfile = genofile,
            FUN = function(gdsfile) {
                seqApply(gdsfile, "genotype", as.is = "double",
                    FUN = function(x) mean(x == 0, na.rm = TRUE))
            }, split = "by.variant")
    )

The user-defined kernel function is distributed to 4 different computing nodes, and the genotypes are divided into 4 non-overlapping parts.

Runtime: ~2.1 minutes (vs 6.8 minutes in Test 1, about 3.2 times faster).
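The "divide genotypes into 4 non-overlapping parts" step can be sketched in base R on toy data. This is an assumption-laden illustration, not how seqParallel is implemented: the genotypes are simulated, and plain lapply stands in for the workers so the example runs anywhere (with a cluster object, parallel::parLapply could take its place).

```r
library(parallel)  # ships with R; used here only for splitIndices()

set.seed(1)
n_variant <- 1000L
n_sample  <- 20L

# Toy data: one 2 x n_sample allele matrix per variant (0 = reference allele).
geno <- lapply(seq_len(n_variant), function(i)
  matrix(rbinom(2L * n_sample, 1L, 0.3), nrow = 2L))

ref_freq <- function(x) mean(x == 0, na.rm = TRUE)

# Split the variant indices into 4 contiguous, non-overlapping chunks.
chunks <- splitIndices(n_variant, 4L)

# Each chunk is processed independently; lapply stands in for 4 workers.
partial <- lapply(chunks, function(idx)
  vapply(geno[idx], ref_freq, numeric(1)))

# Concatenating the per-chunk results restores the by-variant order.
afreq_split <- unlist(partial)
afreq_seq   <- vapply(geno, ref_freq, numeric(1))
stopifnot(identical(afreq_split, afreq_seq))
```

Because the chunks do not overlap and each variant is handled independently, the combined result matches the sequential one exactly; only the wall-clock time changes when the chunks run on separate cores.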

Benchmark Test 3 (C++ integration)

Dynamically define an inline C/C++ function in R:

    library(Rcpp)

    cppFunction('double RefAlleleFreq(IntegerMatrix x) {
        int nrow = x.nrow(), ncol = x.ncol();
        int cnt = 0, zero_cnt = 0, g;
        // count non-missing alleles and, among them, reference alleles
        for (int i = 0; i < nrow; i++) {
            for (int j = 0; j < ncol; j++) {
                if ((g = x(i, j)) != NA_INTEGER) {
                    cnt++;
                    if (g == 0) zero_cnt++;
                }
            }
        }
        return double(zero_cnt) / cnt;
    }')

    system.time(
        afreq <- seqApply(genofile, "genotype", RefAlleleFreq,
            as.is = "double", margin = "by.variant")
    )

Runtime: ~0.7 minutes (vs 6.8 minutes in Test 1, nearly 10 times faster).

Conclusion

SeqArray will be of great interest to R users involved in data analyses of large-scale sequencing variants, particularly those with limited experience of parallel / high-performance computing.

SeqVarTools (Bioconductor)
- variant analysis, such as allele frequency, HWE, Mendelian errors, etc.
- functions to display genotypes / annotations in a readable format

Acknowledgements

Department of Biostatistics at University of Washington, Seattle, Genetic Analysis Center:
- Dr. Stephanie M. Gogarten
- Dr. Cathy Laurie