DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis

Similar documents
Creating a New Annotation Package using SQLForge

HowTo: Querying online Data

netresponse probabilistic tools for functional network analysis

Estimating batch effect in Microarray data with Principal Variance Component Analysis(PVCA) method

Using the generecommender Package

GEOsubmission. Alexandre Kuhn May 3, 2016

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

mmnet: Metagenomic analysis of microbiome metabolic network

R4RNA: A R package for RNA visualization and analysis

Package GEOquery. August 18, 2015

KEGGgraph: a graph approach to KEGG PATHWAY in R and Bioconductor

EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq

Visualising big data in R

MALDIquantForeign: Import/Export routines for MALDIquant

Analysis of Bead-summary Data using beadarray

Monocle: Differential expression and time-series analysis for single-cell RNA-Seq and qpcr experiments

Gene2Pathway Predicting Pathway Membership via Domain Signatures

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

Step by Step Guide to Importing Genetic Data into JMP Genomics

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

NaviCell Data Visualization Python API

1. Installation Instructions - Unix & Linux

Package retrosheet. April 13, 2015

Cluster software and Java TreeView

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

Backing Up TestTrack Native Project Databases

Managed File Transfer with Universal File Mover

G E N OM I C S S E RV I C ES

Report Writing for Data Science in R

Capture Pro Software FTP Server System Output

HMMcopy: A package for bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data

GAIA: Genomic Analysis of Important Aberrations

a) Install the SDK into a directory of your choice (/opt/java/jdk1.5.0_11, /opt/java/jdk1.6.0_02, or YOUR_JAVA_HOME_DIR)

Exercise with Gene Ontology - Cytoscape - BiNGO

Using Internet or Windows Explorer to Upload Your Site

Tutorial for proteome data analysis using the Perseus software platform

PMOD Installation on Linux Systems

CA Workload Automation Agent for Databases

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

RenderStorm Cloud Render (Powered by Squidnet Software): Getting started.

CD-HIT User s Guide. Last updated: April 5,

Bulk Downloader. Call Recording: Bulk Downloader

Hands-On Data Science with R Dealing with Big Data. Graham.Williams@togaware.com. 27th November 2014 DRAFT

Normalization of RNA-Seq

Bioinformatics Grid - Enabled Tools For Biologists.

VTLBackup4i. Backup your IBM i data to remote location automatically. Quick Reference and Tutorial. Version 02.00

Introduction to robust calibration and variance stabilisation with VSN

TCB No March Technical Bulletin. GS FLX and GS FLX+ Systems. Configuration of Data Backup Using backupscript.sh

Using Microsoft Expression Web to Upload Your Site

In 2014, the Research Data Purdue University

RJE Database Accessory Programs

Big Data Visualization for Genomics. Luca Vezzadini Kairos3D

WebPublish User s Manual

Storage Solutions for Bioinformatics

Capture Pro Software FTP Server Output Format

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

GeneProf and the new GeneProf Web Services

Analysis of FFPE DNA Data in CNAG 2.0 A Manual

Basic processing of next-generation sequencing (NGS) data

GDC Data Transfer Tool User s Guide. NCI Genomic Data Commons (GDC)

Cloud-Based Big Data Analytics in Bioinformatics

Introduction to R software for statistical computing

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Nipper Studio Beginner s Guide

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

TCB No September Technical Bulletin. GS FLX+ System & GS FLX System. Installation of 454 Sequencing System Software v2.

Analysis of ChIP-seq data in Galaxy

Gene Expression Analysis

Practical Differential Gene Expression. Introduction

Local Caching Servers (LCS): User Manual

IBM SPSS Statistics Version 22. Concurrent License Administrator s Guide

Verify Needed Root Certificates Exist in Java Trust Store for Datawire JavaAPI

Downloading Your Backup:

HADOOP PERFORMANCE TUNING

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

IST Amigo Project. Accounting & Billing Software Developer s Guide. Public

LAE Enterprise Server Installation Guide

Managed File Transfer

CRSP MOVEit Cloud Getting Started Guide

Package RDAVIDWebService

Integrated Rule-based Data Management System for Genome Sequencing Data

HP Data Protector Integration with Autonomy IDOL Server

NCBI resources III: GEO and ftp site. Yanbin Yin Spring 2013

Upgrade your Software

Object systems available in R. Why use classes? Information hiding. Statistics 771. R Object Systems Managing R Projects Creating R Packages

Chapter 7. Using Hadoop Cluster and MapReduce

Published. Technical Bulletin: Use and Configuration of Quanterix Database Backup Scripts 1. PURPOSE 2. REFERENCES 3.

Tivoli Enterprise Monitoring Server "HOT" backup

Richmond, VA. Richmond, VA. 2 Department of Microbiology and Immunology, Virginia Commonwealth University,

CLC Server Command Line Tools USER MANUAL

Practice Fusion API Client Installation Guide for Windows

LONI IMAGE & DATA ARCHIVE USER MANUAL

PROGRAMMING FOR BIOLOGISTS. BIOL 6297 Monday, Wednesday 10 am -12 pm

Attix5 Pro Server Edition

EBSCO MEDIA FILE TRANSFER SOFTWARE INSTALLATION INSTRUCTIONS

2. Installation Instructions - Windows (Download)

HARFORD COMMUNITY COLLEGE 401 Thomas Run Road Bel Air, MD Course Outline CIS INTRODUCTION TO UNIX

Introduction Connecting Via FTP Where do I upload my website? What to call your home page? Troubleshooting FTP...

V-locity Installation

Comparing Methods for Identifying Transcription Factor Target Genes

Transcription:

DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis Quanhu Sheng, Yu Shyr, Xi Chen Center for Quantitative Sciences, Vanderbilt University, Nashville, USA shengqh (at) gmail.com October 18, 2016 Abstract Meta-analysis has become a popular approach for high-throughput genomic data analysis because it often can significantly increase power to detect biological signals or patterns in datasets. However, when using public-available databases for meta-analysis, duplication of samples is an often encountered problem, especially for gene expression data. Not removing duplicates could lead false positive finding, misleading clustering pattern or model over-fitting issue, etc in the subsequent data analysis. We developed a Bioconductor package Dupchecker that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real data example was demonstrated to show the usage and output of the package. Researchers may not pay enough attention to checking and removing duplicated samples, and then data contamination could make the results or conclusions from meta-analysis questionable. We suggest applying DupChecker to examine all gene expression data sets before any data analysis step. In this vignette, we demonstrate the application of DupChecker as a package for checking high-throughput genomic data redundancy in meta-analysis. DupChecker can download the GEO/ArrayExpress 1

raw data files from EBI/ncbi ftp server, extract individual data files, calculate MD5 fingerprint for each data file and validate the redundancy of those data files. DupChecker version: 1.13.0 If you use DupChecker in published research, please cite: Quanhu Sheng, Yu Shyr, Xi Chen.: DupChecker: a bioconductor package for checking high-throug data redundancy in meta-analysis. BMC bioinformatics 2014, 15:323. Contents 1 Introduction 2 2 Standard workflow 3 2.1 Quick start............................ 3 2.2 GEO/ArrayExpress data download............... 3 2.3 Build file table.......................... 4 2.4 Validate file redundancy..................... 4 2.5 Real example........................... 5 3 Discussion 6 4 Session Info 6 1 Introduction Meta-analysis has become a popular approach for high-throughput genomic data analysis because it often can significantly increase power to detect biological signals or patterns in datasets. However, when using public-available databases for meta-analysis, duplication of samples is an often encountered problem, especially for gene expression data. Not removing duplicates would make study results questionable. We developed a package DupChecker that efficiently identifies duplicated samples by generating MD5 fingerprints for individual data file. 2

2 Standard workflow 2.1 Quick start Here we show the most basic steps for a validation procedure. You need to create a target directory used to store the data. In order to illustrate the procedure, here we used two very small datasets without redundant CEL files. You may also try a real example with redundancy at example 2.5. library(dupchecker) ## Create a "DupChecker" directory under temporary directory rootdir<-paste0(dirname(tempdir()), "/DupChecker") dir.create(rootdir, showwarnings = FALSE) message(paste0("downloading data to ", rootdir, "...")) geodownload(datasets=c("gse1478"), targetdir=rootdir) ## 1 GSE1478 8 arrayexpressdownload(datasets=c("e-mexp-3872"), targetdir=rootdir) ## 1 E-MEXP-3872 6 datafile<-buildfiletable(rootdir=rootdir, filepattern="cel$") result<-validatefile(datafile) if(result$hasdup){ duptable<-result$duptable write.csv(duptable, file=paste0(rootdir, "/duptable.csv")) } 2.2 GEO/ArrayExpress data download Firstly, function geodownload/arrayexpressdownload will download raw data from ncbi/ebi ftp server based on datasets user provided. Once the compressed raw data is downloaded, individual data files will be extracted from 3

compressed raw data. Some of the data files may have been compressed. Since the files compressed from same data file by different softwares will have different MD5 figureprints, we will decompress those compressed data files and validate the file redundancy using decompressed data files. datatable<-geodownload(datasets = c("gse1478"), targetdir=rootdir) datatable ## 1 GSE1478 10 datatable<-arrayexpressdownload(datasets=c("e-mexp-3872"), targetdir=rootdir) datatable ## 1 E-MEXP-3872 7 The datatable is a data frame containing dataset name and how many files in that dataset. There are two possible situations that you may want to download and decompress the data using external tools. 1) The download or decompress cost too much time in R environment; 2) The internal tool in R may download imcomplete or interrupted file for huge file which will cause the untar or gunzip command fails. 2.3 Build file table Secondly, function buildfiletable will try to find all files (default) or expected files matching filepattern parameter in the subdirectories of root directory. The result data frame contains two columns, dataset (directory name) and filename. Here, rootdir can also be an array of directories. In following example, we just focused on AffyMetrix CEL file only. datafile<-buildfiletable(rootdir=rootdir, filepattern="cel$") 2.4 Validate file redundancy The function validatefile will calculate MD5 fingerprint for each file in table and then check to see if any two files have same MD5 fingerprint. The files 4

with same fingerprint will be treated as duplication. The function will return a table contains all duplicated files and datasets. result<-validatefile(datafile) if(result$hasdup){ duptable<-result$duptable write.csv(duptable, file=paste0(rootdir, "/duptable.csv")) } 2.5 Real example There are three colon cancer related datasets (GSE14333, GSE13067 and GSE17538) containing seriously data redundancy problem. User may run following codes to do the validation. It may cost a few miniutes to a few hours based on the network bindwidth and the computer performance. library(dupchecker) rootdir<-paste0(dirname(tempdir()), "/DupChecker_RealExample") dir.create(rootdir, showwarnings = FALSE) geodownload(datasets = c("gse14333", "GSE13067", "GSE17538"), targetdir=rootdir) datafile<-buildfiletable(rootdir=rootdir, filepattern="cel$") result<-validatefile(datafile) if(result$hasdup){ duptable<-result$duptable write.csv(duptable, file=paste0(rootdir, "/duptable.csv")) } Table 1 illustrated the duplication between those three datasets. GSE13067 [64/74] means 64 of 74 CEL files in GSE13067 dataset were duplicated in other two datasets. 5

Table 1: Part of duplication table of three datasets MD5 GSE13067[64/74] GSE14333[231/290] GSE17538[167/244] 001ddd757f185561c9ff9b4e95563372 GSM358397.CEL GSM437169.CEL 00b2e2290a924fc2d67b40c097687404 GSM358503.CEL GSM437210.CEL 012ed9083b8f1b2ae828af44dbab29f0 GSM327335.CEL GSM358620.CEL 023c4e4f9ebfc09b838a22f2a7bdaa59 GSM358441.CEL GSM437117.CEL 3 Discussion We illustrated the application using gene expression data, but DupChecker package can also be applied to other types of high-throughput genomic data including next-generation sequencing data. 4 Session Info R Under development (unstable) (2016-09-29 r71410), x86_64-pc-linux-gnu Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C Running under: Ubuntu 16.04.1 LTS Base packages: base, datasets, grdevices, graphics, methods, stats, utils Other packages: DupChecker 1.13.0 Loaded via a namespace (and not attached): R.methodsS3 1.7.1, R.oo 1.20.0, R.utils 2.4.0, RCurl 1.95-4.8, bitops 1.0-6, evaluate 0.10, formatr 1.4, highr 0.6, knitr 1.14, magrittr 1.5, stringi 1.1.2, stringr 1.1.0, tools 3.4.0 6