SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants



Similar documents
SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequence Variants

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

A High-performance Computing Toolset for Big Data Analysis of Genome-Wide Variants

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Dragon Medical Enterprise Network Edition Technical Note: Requirements for DMENE Networks with virtual servers

Delivering the power of the world s most successful genomics platform

UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory. April, 2015

Fast Analytics on Big Data with H20

Hadoop-BAM and SeqPig

VariantSpark: Applying Spark-based machine learning methods to genomic information

RevoScaleR Speed and Scalability

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Very Large Enterprise Network Deployment, 25,000+ Users

Analysis Tools and Libraries for BigData

Benchmarking Hadoop & HBase on Violin

Hardware Configuration Guide

CSE-E5430 Scalable Cloud Computing Lecture 2

Global Alliance. Ewan Birney Associate Director EMBL-EBI

Step by Step Guide to Importing Genetic Data into JMP Genomics

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Hands-On Data Science with R Dealing with Big Data. Graham.Williams@togaware.com. 27th November 2014 DRAFT

Tutorial on gplink. PLINK tutorial, December 2006; Shaun Purcell,

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

SAP HANA Enabling Genome Analysis

Cisco Data Preparation

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

1 Bull, 2011 Bull Extreme Computing

IT Firm Virtualizes Databases: Trims Servers 85 Percent, Ups Performance 50 Percent

Cluster Computing at HRI

A Performance Analysis of Distributed Indexing using Terrier

CSE-E5430 Scalable Cloud Computing. Lecture 4

Very Large Enterprise Network, Deployment, Users

CHAPTER FIVE RESULT ANALYSIS

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Microsoft SQL Server 2005 on Windows Server 2003

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Main Memory Data Warehouses

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Oracle Primavera P6 Enterprise Project Portfolio Management Performance and Sizing Guide. An Oracle White Paper October 2010

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

Your Best Next Business Solution Big Data In R 24/3/2010

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Performance Guideline for syslog-ng Premium Edition 5 LTS

Fusion iomemory iodrive PCIe Application Accelerator Performance Testing

Lytec 2014 Hardware and Software Requirements. Lytec Single-User Hardware and Software Requirements

IP Video Rendering Basics

Integrated Grid Solutions. and Greenplum

Processing NGS Data with Hadoop-BAM and SeqPig

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Big Data and Parallel Work with R

Testing Database Performance with HelperCore on Multi-Core Processors

SNPbrowser Software v3.5

Cloud-Based Big Data Analytics in Bioinformatics

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Packet-based Network Traffic Monitoring and Analysis with GPUs

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

School of Nursing. Presented by Yvette Conley, PhD

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Biomedical Big Data and Precision Medicine

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Project Management Professional (PMP) Certification: PMBOK Guide Fifth Edition

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez ( ) Sameer Kumar ( )

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

A stream computing approach towards scalable NLP

Reference Architecture. EMC Global Solutions. 42 South Street Hopkinton MA

Multicore Parallel Computing with OpenMP

Comparative performance test Red Hat Enterprise Linux 5.1 and Red Hat Enterprise Linux 3 AS on Intel-based servers

Acceleration for Personalized Medicine Big Data Applications

Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks

Package hoarder. June 30, 2015

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

New Design and Layout Tips For Processing Multiple Tasks

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of

Introduction to Parallel Programming and MapReduce

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Recommended hardware system configurations for ANSYS users

INTEL PARALLEL STUDIO XE EVALUATION GUIDE

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

Vocera Voice 4.3 and 4.4 Server Sizing Matrix

ITG Software Engineering

High-Performance Processing of Large Data Sets via Memory Mapping A Case Study in R and C++

On the Cost of Mining Very Large Open Source Repositories

Transcription:

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants Xiuwen Zheng Department of Biostatistics University of Washington Seattle

Introduction Thousands of gigabyte genetic data sets provide significant challenges in data management, even on well-equipped hardware 2 The latest 1000 Genomes Project (1KG): ~39 million variants (differences from the reference genome) of 1092 individuals http://www.1000genomes.org Variant Call Format (VCF) files: genotypes + annotation, totaling ~184G in a compressed manner

Introduction R users, existing R/Bioconductor packages 3 VariantAnnotation allows for the exploration and annotation of genetic variants stored in VCF files not computationally efficient (parsing a text VCF file) GenABEL genome-wide SNP association analysis no support of multi-allelic variants

Methods CoreArray, a C/C++ library designed for large-scale data management of genome-wide variants 4 a universal data format to store multiple array-oriented data in a single file Two R packages were developed to address or reduce the associated computational burden gdsfmt a general R interface to CoreArray SeqArray specifically designed for data management of genome-wide sequencing variants

Methods Advantages 1. Direct access of data without parsing VCF text files 2. Stored in a binary and array-oriented manner 2 bits are employed as a primitive type to store alleles (e.g., A, G, C, T) efficient access of variants using the R language 3. Genotypic data stored in a compressed manner rare variants -> highly compressed without sacrificing access efficiency e.g., 1KG, 26G genotypes -> 1.5G by the zlib algorithm (5.8%!) 4. Run in parallel! 5

Methods Key Functions Table 1: The key functions in the SeqArray package. Function seqvcf2gds Description 6 Reformats VCF files seqsummary Gets the summary of a sequencing file (# of samples, # of variants, INFO/FORMAT variables, etc) seqsetfilter seqgetdata seqapply seqparallel Sets a filter to sample or variant (define a subset of data) Gets data from a sequencing file (from a subset of data) Applies a user-defined function over array margins Applies functions in parallel

Dataset: Benchmark the 1000 Genomes Project, chromosome 1 3,007,196 variants, 1092 individuals the original VCF file: 10.7G (compressed!) genotypes + annotations reformat to a single SeqArray file: 10.6G 7 Calculate the frequencies of reference alleles 1. R code (sequential version) 2. R code (parallel version) 3. Seamless R and C++ integration via the Rcpp package (sequential version)

Benchmark Test 1 (sequentially) 8 # load the R package library(seqarray) # open the file genofile <- seqopen("1kg.chr1.gds") the typical x looks like: sample allele [,1] [,2] [,3] [,4] [,5] [1,] 0 1 0 1 1 [2,] 0 0 0 1 0 # apply the user-defined function variant by variant system.time(afreq <- seqapply(genofile, "genotype", FUN = function(x) { mean(x==0, na.rm=true) }, as.is="double", margin="by.variant") ) 0 reference allele 1 the first alternative allele the user-defined function ~6.8 minutes on a Linux system with two quad-core Intel processors (2.27GHz) and 32 GB RAM

Benchmark Test 2 (in parallel) 9 # load the R package library(parallel) # create a computing cluster with 4 cores cl <- makecluster(4) the user-defined kernel function distributed to 4 different computing nodes # run in parallel system.time(afreq <- seqparallel(cl, genofile, FUN = function(gdsfile) { seqapply(gdsfile, "genotype", as.is="double", FUN = function(x) mean(x==0, na.rm=true)) }, split = "by.variant") ) ~2.1 minutes (vs 6.8m in Test 1) divide genotypes into 4 non-overlapping parts

Benchmark Test 3 (C++ Integration) library(rcpp) cppfunction('double RefAlleleFreq(IntegerMatrix x) { int nrow = x.nrow(), ncol = x.ncol(); int cnt=0, zero_cnt=0, g; }') for (int i = 0; i < nrow; i++) { for (int j = 0; j < ncol; j++) { if ((g = x(i, j))!= NA_INTEGER) { cnt ++; if (g == 0) zero_cnt ++; }}} return double(zero_cnt) / cnt; system.time( afreq <- seqapply(genofile, "genotype", RefAlleleFreq, as.is="double", margin="by.variant") ) ~0.7 minutes (significantly faster!!! vs 6.8m in Test 1) 10 dynamically define an inline C/C++ function in R

Conclusion SeqArray will be of great interest to R users involved in data analyses of large-scale sequencing variants 11 particularly those with limited experience of high-performance computing SeqVarTools (Bioconductor), coming soon variant analysis, such like allele frequency, PCA, HWE, Mendelian errors, etc functions to display genotypes / annotations in a readable format

Acknowledgements Department of Biostatistics at University of Washington Seattle 12 Stephanie M. Gogarten Cathy Laurie Bruce S. Weir