GEOsubmission. Alexandre Kuhn (kuhnam@mail.nih.gov) May 3, 2016



Similar documents
Estimating batch effect in Microarray data with Principal Variance Component Analysis(PVCA) method

Creating a New Annotation Package using SQLForge

HowTo: Querying online Data

Using the generecommender Package

R4RNA: A R package for RNA visualization and analysis

netresponse probabilistic tools for functional network analysis

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Analysis of Bead-summary Data using beadarray

KEGGgraph: a graph approach to KEGG PATHWAY in R and Bioconductor

mmnet: Metagenomic analysis of microbiome metabolic network

Gene2Pathway Predicting Pathway Membership via Domain Signatures

Introduction to robust calibration and variance stabilisation with VSN

EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq

Visualising big data in R

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Monocle: Differential expression and time-series analysis for single-cell RNA-Seq and qpcr experiments

Practical Differential Gene Expression. Introduction

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

MultiAlign Software. Windows GUI. Console Application. MultiAlign Software Website. Test Data

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

GAIA: Genomic Analysis of Important Aberrations

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Basic Analysis of Microarray Data

MALDIquantForeign: Import/Export routines for MALDIquant

Cluster software and Java TreeView

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

HMMcopy: A package for bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data

Row Quantile Normalisation of Microarrays

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

Step-by-Step Installation Guide for MONAHRQ. Version 2.0.4

Materials and Methods. Blocking of Globin Reverse Transcription to Enhance Human Whole Blood Gene Expression Profiling

Step by Step Guide to Importing Genetic Data into JMP Genomics

Importing and Exporting With SPSS for Windows 17 TUT 117

Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6

Gene set and data preparation

SPSS INSTRUCTION CHAPTER 1

Data Availability Policies & Author Responsibility Policies Time of Evaluation: May 2014

How To Test The Bandwidth Meter For Hyperv On Windows V (Windows) On A Hyperv Server (Windows V2) On An Uniden V2 (Amd64) Or V2A (Windows 2

How to Run Spark Application

Importing from Tab-Delimited Files

Summary of Select Results from the 2012 Medical School Information Technology Survey Sponsored by the Group on Information Resources (GIR)

GSR Microarrays Project Management System

Appendix A. Editing Profiles with the Built-In Text Editor

Learn how to create web enabled (browser) forms in InfoPath 2013 and publish them in SharePoint InfoPath 2013 Web Enabled (Browser) forms

TELECOMMUNICATIONS REQUIREMENTS FOR TRANSMITTING ELECTRONIC DATA FILES ADMINISTRATIVE SERVICES OF KANSAS

Redtail CRM Integration. Users Guide Cities Digital, Inc. All rights reserved. Contents i

Acronis Backup & Recovery 10 Server for Linux. Update 5. Installation Guide

QMX ios MDM Pre-Requisites and Installation Guide

Managing ACE Software Licenses

Star System Salon Management Software. Powerful Effective Easy to Use

Using SPSS, Chapter 2: Descriptive Statistics

MONAHRQ 5.0. Quick Start Guide for Host Users. May 2014

Integrated Research Application System (IRAS)

CGS 1550 File Transfer Project Revised 3/10/2005

Scatter Plots with Error Bars

ORACLE USER PRODUCTIVITY KIT USAGE TRACKING ADMINISTRATION & REPORTING RELEASE 3.6 PART NO. E

--version Print the version number of GeneTorrent to the standard output stream. This version number should be included in all bug reports.

Bulk Upload Tool (Beta) - Quick Start Guide 1. Facebook Ads. Bulk Upload Quick Start Guide

Sisense. Product Highlights.

Tivoli Log File Agent Version Fix Pack 2. User's Guide SC

Chapter 7. Process Analysis and Diagramming

Acronis Backup & Recovery 10 Server for Linux. Installation Guide

Package GEOquery. August 18, 2015

User Manual. Affymetrix GeneChip Command Console 3.0 User Manual. P/N Rev. 5

Summary of Select Results from the 2014 Medical School Information Technology Survey Sponsored by the Group on Information Resources (GIR)

Importance of Statistics in creating high dimensional data

University of Pennsylvania Department of Electrical and Systems Engineering Digital Audio Basics

Influence of GSM and UMTS on the Blood Brain Barrier in vitro additional results

Basic Firewall Lab. Lab Objectives. Configuration

Air Resources Board File Transfer Protocol (FTP)

Quality Assessment of Exon and Gene Arrays

CPE453 Laboratory Assignment #2 The CPE453 Monitor

Package copa. R topics documented: August 9, 2016

Egnyte Single Sign-On (SSO) Configuration for Active Directory Federation Services (ADFS)

PMES Dashboard User Manual

GDC Data Transfer Tool User s Guide. NCI Genomic Data Commons (GDC)

A Short Introduction to Eviews

Nipper Studio Beginner s Guide

Cisco Unified Communications Self Care Portal User Guide, Release 10.5(1)

FIRST STEPS WITH SCILAB

LifeScope Genomic Analysis Software 2.5

Google Merchant Center

Table of Contents. Using the plug- in Pure Storage Flash Array Home Page... 11

SonicWALL GMS Custom Reports

GeneChip Expression Analysis. Data Analysis Fundamentals

LibraryWorld.com. Getting Started Guide

Importing and Exporting Databases in Oasis montaj

Managing Software and Configurations

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011

Quick Reference Guide. Online Courier: FTP. Signing On. Using FTP Pickup. To Access Online Courier.

Hadoop Tutorial. General Instructions

Table of Contents. Online backup Manager User s Guide

SQL SERVER REPORTING SERVICES 2012 (POWER VIEW)

Package PREDAsampledata

ELECTRONIC DATA PROCESSOR (EDP) QUICKSTART FOR DATA PROVIDERS

NaviCell Data Visualization Python API

Software in science - syllabus

Lab Activity H17 Glow Sticks

Network Detective Client Connector

[1] [2]

Transcription:

GEOsubmission Alexandre Kuhn (kuhnam@mail.nih.gov) May 3, 2016 1 Summary The goal of GEOsubmission is to ease the submission of microarray datasets to the GEO repository. It generates a single file (SOFT format) containing sample information and gene expression values. This file can then be uploaded to GEO in a single step. 2 Introduction The rate of microarray data deposition in public repository is low [Ochsner et al., 2008]. GEOsubmission provides a simple and quick way to create a dataset submission for deposition at GEO (http://www.ncbi.nlm.nih.gov/geo/). 3 Principle Submitting a microarray dataset to a public repository generally implies to gather information on the samples (including how they were processed) and upload this information along with the microarray data to a repository. Having both sample information and gene expression data in a single file greatly eases the submission process. GEO accepts SOFT files (http://www.ncbi.nlm.nih. gov/geo/info/soft2.html), a file format that can be used to describe samples and expression values in a single text file. GEOsubmission contains a function (microarray2soft) that generates a SOFT file. Sample information is gathered from user-provided text files, which are then bundled together with the corresponding expression values in the SOFT file. Microarray expression data is provided as a single tab-delimited text file (with rows corresponding to probes and columns to samples). In the case of Affymetrix microarrays, RMA-normalized expression can be calculated directly from the CEL files, i.e. without providing a separate text file for the expression values. Sample information is provided in two text files. The first one describes each sample in the dataset. The second one provides information on the dataset itself (named a series by GEO). microarray2soft performs consistency checks on sample and series information as well as makes sure that they match the corresponding micorraray raw data files. Once the SOFT file is created, it can be compressed together with the raw data microarray data in a zip file. This zip file can be used for a single-step deposition (named direct deposit ) at GEO. 1

4 Example usage Consider an example experiment where we assayed gene expression in neuronal cultures using Affymetrix microarrays. Let us assume that this dataset (named neuronalcultures ) contains two samples (named 1 and 2 ), corresponding to CEL files sample1.cel and sample2.cel, respectively. Let us assume further that sample and series information is contained in the files named sampleinfo.txt and seriesinfo.txt respectively (in the current directory). You can use the following code to create a SOFT file named example.soft in the current directory: library(geosubmission) sampleid<-c('1','2') seriesname<-'neuronalcultures' microarray2soft(sampleid,'sampleinfo.txt',seriesname,'seriesinfo.txt', + softname='mydata.soft') Note that this first example cannot be run because we only provide dummy, not valid CEL files with this package. The format and content of the sample and series files is given in the two corresponding example files provided with this package. Their content can be shown in R with the following commands (or by opening them; they are located in the extdata directory contained in the installation directory of GEOsubmission): datadirectory<-system.file('extdata',package='geosubmission') read.delim(file.path(datadirectory,'sampleinfo.txt')) read.delim(file.path(datadirectory,'seriesinfo.txt')) Alternatively, for instance in the case of a microarray experiment using a different platform than Affymetrix, we can provide the gene expression values (for inclusion in the SOFT file) in a separate text file. If this is the case, microarray2soft will only check that the raw microarray data files given in sampleinfo.txt actually exists and will use the separate text file as the source of expression values. In our neuronal culture example, if we wished to use the expression values in the file expressionnormalized.txt (instead of calculating normalized expression from the microarray data files), we can use the following. We first specify a directory and a file to write the generated example SOFT file out to: soft_example_fullpath<-tempfile(pattern='soft_example') soft_example_name<-basename(soft_example_fullpath) soft_example_dir<-dirname(soft_example_fullpath) and then generate and write the SOFT file with: microarray2soft(sampleid,'sampleinfo.txt',seriesname,'seriesinfo.txt', + datadir=datadirectory,writedir=soft_example_dir, + softname=soft_example_name,expressionmatrix='expressionnormalized.txt') 2

The generated SOFT file looks like this: readlines(soft_example_fullpath) [1] "^SAMPLE = 1" [2] "!Sample_title = sample1" [3] "!Sample_supplementary_file = sample1.cel" [4] "!Sample_source_name = Brain" [5] "!Sample_organism = Rattus norvegicus" [6] "!Sample_characteristics = Primary neuronal culture" [7] "!Sample_molecule = Total RNA" [8] "!Sample_extract_protocol = TRIzol (Invitrogen) followed by RNeasy column cleanup (Qia [9] "!Sample_label = Biotin" [10] "!Sample_label_protocol = Affymetrix GeneChip\xae IVT Labeling Kit, according to manuf [11] "!Sample_hyb_protocol = Affymetrix Eukaryotic Target Hybridization protocol (GeneChip\ [12] "!Sample_scan_protocol = Affymetrix\xae GeneChip\xae Scanner 3000 with GCOS software a [13] "!Sample_data_processing = Probe set summarization and normalization was performed by [14] "!Sample_description = Primary culture from cerebral cortices of rat P1 pups" [15] "!Sample_platform_id = GPL1355" [16] "#ID_REF =" [17] "#VALUE = RMA-calculated signal intensity" [18] "!Sample_table_begin" [19] "ID_REF\tVALUE" [20] "probe1\t6" [21] "probe2\t5" [22] "!Sample_table_end" [23] "^SAMPLE = 2" [24] "!Sample_title = sample2" [25] "!Sample_supplementary_file = sample2.cel" [26] "!Sample_source_name = Brain" [27] "!Sample_organism = Rattus norvegicus" [28] "!Sample_characteristics = Primary neuronal culture" [29] "!Sample_molecule = Total RNA" [30] "!Sample_extract_protocol = TRIzol (Invitrogen) followed by RNeasy column cleanup (Qia [31] "!Sample_label = Biotin" [32] "!Sample_label_protocol = Affymetrix GeneChip\xae IVT Labeling Kit, according to manuf [33] "!Sample_hyb_protocol = Affymetrix Eukaryotic Target Hybridization protocol (GeneChip\ [34] "!Sample_scan_protocol = Affymetrix\xae GeneChip\xae Scanner 3000 with GCOS software a [35] "!Sample_data_processing = Probe set summarization and normalization was performed by [36] "!Sample_description = Primary culture from cerebral cortices of rat P1 pups" [37] "!Sample_platform_id = GPL1355" [38] "#ID_REF =" [39] "#VALUE = RMA-calculated signal intensity" [40] "!Sample_table_begin" [41] "ID_REF\tVALUE" [42] "probe1\t7" [43] "probe2\t4" [44] "!Sample_table_end" [45] "^SERIES = neuronalcultures" [46] "!Series_title = Gene expression from primary neuronal cultures." [47] "!Series_summary = Gene expression from primary neuronal cultures." 3

[48] "!Series_type = Primary cell cultures" [49] "!Series_overall_design = Primary neuronal cultures (2 biological replicates)." [50] "!Series_contributor = John,Smith" [51] "!Series_contributor = William,Ford" [52] "!Series_sample_id = 1" [53] "!Series_sample_id = 2" The format of the file containing expression values is shown with the example file expressionnormalized.txt that is contained in this package. It can be output to the R console with the following command (it resides in the extdata directory of the package installation directory): read.delim(file.path(datadirectory,'expressionnormalized.txt')) The SOFT file can also be written to the standard output with: microarray2soft(c('1','2'),'sampleinfo.txt',seriesname,'seriesinfo.txt', + datadir=datadirectory,softname='',expressionmatrix='expressionnormalized.txt', + verbose=false) More detailed information on the input arguments of microarray2soft are given in the help file that can be accessed by typing?microarray2soft at the R prompt. 5 Session Information The version number of R and packages loaded for generating the vignette were: R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grdevices utils datasets methods base other attached packages: [1] GEOsubmission_1.25.0 loaded via a namespace (and not attached): [1] zlibbioc_1.19.0 BiocInstaller_1.23.0 parallel_3.3.0 [4] tools_3.3.0 affy_1.51.0 affyio_1.43.0 [7] Biobase_2.33.0 preprocesscore_1.35.0 BiocGenerics_0.19.0 4

References Scott Ochsner, David Steffen, Christian J Stoeckert, and Neil J McKenna. Much room for improvement in deposition rates of expression microarray datasets. Nat. Methods, 5(12):991, 2008. doi: 10.1038/ nmeth1208-991. URL http://www.nature.com/nmeth/journal/v5/n12/ suppinfo/nmeth1208-991_s1.html. 5