srap: Simplified RNA-Seq Analysis Pipeline

Similar documents
Tutorial for proteome data analysis using the Perseus software platform

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Step-by-Step Guide to Basic Expression Analysis and Normalization

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

Statistical issues in the analysis of microarray data

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

Frequently Asked Questions Next Generation Sequencing

Gene expression analysis. Ulf Leser and Karin Zimmermann

Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6

Hierarchical Clustering Analysis

Cluster software and Java TreeView

DeCyder Extended Data Analysis module Version 1.0

Exercise with Gene Ontology - Cytoscape - BiNGO

Package ERP. December 14, 2015

OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Scatter Plots with Error Bars

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Basic processing of next-generation sequencing (NGS) data

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

Analysis of ChIP-seq data in Galaxy

Package empiricalfdr.deseq2

User Manual. Transcriptome Analysis Console (TAC) Software. For Research Use Only. Not for use in diagnostic procedures. P/N Rev.

Gene Expression Analysis

Structural Health Monitoring Tools (SHMTools)

CDD user guide. PsN Revised

DeCyder Extended Data Analysis (EDA) Software

MultiAlign Software. Windows GUI. Console Application. MultiAlign Software Website. Test Data

JustClust User Manual

Practical Differential Gene Expression. Introduction

A Streamlined Workflow for Untargeted Metabolomics

PreciseTM Whitepaper

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

UCINET Visualization and Quantitative Analysis Tutorial

They can be obtained in HQJHQH format directly from the home page at:

mrna NGS Data Analysis Report

MICROARRAY Я US USER GUIDE

Package RDAVIDWebService

Tutorial for the WGCNA package for R: I. Network analysis of liver expression data in female mice

ALLEN Mouse Brain Atlas

CNV Univariate Analysis Tutorial

Single-Cell DNA Sequencing with the C 1. Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

PREDA S4-classes. Francesco Ferrari October 13, 2015

Principal Component Analysis

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

The Scientific Data Mining Process

Oracle Data Miner (Extension of SQL Developer 4.0)

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

SAS Analyst for Windows Tutorial

False Discovery Rates

GenomeStudio Data Analysis Software

Package tagcloud. R topics documented: July 3, 2015

MarkerView Software for Metabolomic and Biomarker Profiling Analysis

Package PRISMA. February 15, 2013

Petrel TIPS&TRICKS from SCM

SPSS Manual for Introductory Applied Statistics: A Variable Approach

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Statistical analysis of modern sequencing data quality control, modelling and interpretation

Minería de Datos ANALISIS DE UN SET DE DATOS.! Visualization Techniques! Combined Graph! Charts and Pies! Search for specific functions

Regression Clustering

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Understanding West Nile Virus Infection

Quantitative proteomics background

Analysing Questionnaires using Minitab (for SPSS queries contact -)

Exploratory data analysis for microarray data

How To Write A Clinical Trial In Sas

BioVisualization: Enhancing Clinical Data Mining

RuleBender Tutorial

Tutorial for the WGCNA package for R I. Network analysis of liver expression data in female mice

STATISTICA Formula Guide: Logistic Regression. Table of Contents

MultiExperiment Viewer Quickstart Guide

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Package bigrf. February 19, 2015

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis

Exercises on using R for Statistics and Hypothesis Testing Dr. Wenjia Wang

Big Data: Rethinking Text Visualization

Two-Way ANOVA tests. I. Definition and Applications...2. II. Two-Way ANOVA prerequisites...2. III. How to use the Two-Way ANOVA tool?...

Guide for Data Visualization and Analysis using ACSN

Package erp.easy. September 26, 2015

COC131 Data Mining - Clustering

GenomeStudio Data Analysis Software

New Technologies for Sensitive, Low-Input RNA-Seq. Clontech Laboratories, Inc.

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Delivering the power of the world s most successful genomics platform

MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R

NGS Data Analysis: An Intro to RNA-Seq

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Analysis of Illumina Gene Expression Microarray Data

NATIONAL GENETICS REFERENCE LABORATORY (Manchester)

Challenges associated with analysis and storage of NGS data

MIC - Detecting Novel Associations in Large Data Sets. by Nico Güttler, Andreas Ströhlein and Matt Huska

Transcription:

srap: Simplified RNA-Seq Analysis Pipeline Charles Warden October 17, 2016 1 Introduction This package provides a pipeline for gene expression analysis. The normalization function is specific for RNA-Seq analysis, but all other functions will work with any type of gene expression data. The output from each function is created in a separate subfolder. Please see the workflow below: srap Workflow Normalization (For RNA-Seq) Rounding (RPKM Cutoff) Log2 transformation QC (Any Data Type) PCA Dendrogram Sample Histogram Box-Plot DEG (Any Data Type) Linear Regression or ANOVA Fold-change (against reference group) Gene lists (in Excel) Heatmap Box-Plot for DEG Functional Enrichment (Any Data Type) BD-Func GO (Human, Mouse) + MSigDB (Human) Normalization will take RPKM expression values, round the RPKM values below a given cutoff, and perform a log 2 data transformation. Quality control 1

metrics (Principal Component Analysis, Hierarchical Clustering, Sample Histograms and Box-Plots, and Descriptive Statistics) can be provided for these normalized expression values. These approximately normal expression distributions are then subject to differential expression. Differentially expressed gene lists are provided in Excel files and data can be visualized using a heatmap for all differentially expressed genes or a box-plot for a specific gene. BD-Func can then be used to calculate functional enrichment (either for fold-change values between two groups or using the normalized expression values). 2 Data RPKM expression value for MiSeq samples were calculated using TopHat and Cufflinks from raw.fastq files from GSE37703. A template script for this sample preparation (run RNA Seq v2.pl) is available here. This example dataset contains two groups, each with 3 replicates. The dataset is truncated for testing purposes. > library("srap") > dir <- system.file("extdata", package="srap") > expression.table <- file.path(dir,"miseq_cufflinks_genes_truncate.txt") > sample.table <- file.path(dir,"miseq_sample_description.txt") > project.folder <- getwd() > project.name <- "MiSeq" The code for this example assumes all files are in the current working directory. However, you can specify the input and output files in any location (using the complete file path). 3 Data Normalization To normalize RNA-Seq RPKM values, run the following function > expression.mat <- RNA.norm(expression.table, project.name, project.folder) NULL RNA-Seq data is normalized by rounding RPKM (Read Per Kilobase per Million reads, [1]) values by a specified value (default=0.1), followed by a log 2 transformation. This matches the gene expression strategy described in [2]. The output is standard data frame with samples in rows and genes in columns. An Excel file containing the normalized expression values is created in the Raw Data folder. If you do not already have a table of RPKM/FPKM expression values, you can use the RNA.prepare.input() function to create such a file. Please see help(rna.prepare.input) for more details. 2

4 Quality Control Figures To create quality control figures, run the following function > RNA.qc(sample.table, expression.mat, project.name, + project.folder, plot.legend=f, + color.palette=c("green","orange")) [1] "SRR493372" "SRR493373" "SRR493374" "SRR493375" "SRR493376" "SRR493377" [1] "integer" [1] "Group: HOXA1KD" "Group: scramble" [1] "Color: green" "Color: orange" NULL The input is a matrix of normalized expression values, possibly created from the RNA.norm function. This function creates quality control figures within the QC subfolder. Quality control figures / tables include: Principal Components Analysis (figure for 1st two principal components, table for all principal components), Sample Dendrogram, Sample Histogram, Box-Plot for Sample Distribution, as well as a table of descriptive statistics for each sample (median, top/bottom quartile, maximum, and minimum). Please see the example figures displayed below: 3

RNA.qc PCA Plot PC2 (1.11%) 0.4 0.2 0.0 0.2 0.4 0.407 0.408 0.409 0.410 PC1 (97.62%) 4

RNA.qc Dendrogram SRR493374 SRR493373 SRR493372 SRR493376 SRR493375 SRR493377 150 100 50 0 5

RNA.qc Sample Histogram Density 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0 5 10 Log2(RPKM) 6

RNA.qc Sample Box-Plot 0 5 10 This step is optional - this function is not needed for downstream analysis. However, this function is likely to be useful to identifying outliers, overall quality of the data, etc. 5 Differential Expression To define differentially expressed genes, run the following function > stat.table <- RNA.deg(sample.table, expression.mat, + project.name, project.folder, box.plot=false, + ref.group=t, ref="scramble", + method="aov", color.palette=c("green","orange")) [1] 422 4 [1] 422 4 [1] 422 6 [1] "Group: HOXA1KD" "Group: scramble" 7

[1] "Color: green" "Color: orange" NULL The input is a matrix of normalized expression values, possibly created from the RNA.norm function. The function returns a table of differential expression statistics for all genes. In all cases, p-values can be calculated via linear regression of ANOVA, and false-discovery rates (FDR) are calculated by the method of [3]. It is assumed that expression values are on a log 2 scale, as described in [2]. If the primary variable (defined in the second column of the sample description table) is a factor with two groups and a specified reference, then fold-change values can also used to select differentially expressed genes (along with p-value and FDR values). The function creates lists of differentially expressed genes (as well as a table of statistics for all genes) in Excel files. A heatmap of differentially expressed genes is also displayed. If desired, the user can also create box-plots for all differentially expressed genes. Please see the example figures below: RNA.deg Heatmap 8

RNA.deg Box-Plot 6.8 6.9 7.0 7.1 7.2 7.3 7.4 HOXA1KD scramble An Excel file containing the table of differential expression statistics for all genes is created in the Raw Data folder. All other outputfiles are created in the DEG folder (DEG stands for Differenitally Expressed Genes ). 6 Functional Enrichment To identify functional categories subject to differential expression, run the following function > #data(bdfunc.enrichment.human) > #data(bdfunc.enrichment.mouse) > RNA.bdfunc.fc(stat.table, plot.flag=false, + project.name, project.folder, species="human") NULL > RNA.bdfunc.signal(expression.mat, sample.table, plot.flag=false, + project.name, project.folder, species="human") 9

NULL This is an implementation of the Bi-Directional FUNCtional enrichment (BD- Func [4]) algorithm. Briefly, p-values quantifying the difference between up- and down-regulated genes can be calculated via t-test, Mann-Whitney U test, or K-S test. If desired, false discovery rates (FDR) can be calculated using either the method of Benjamini and Hochberg [3] or the Storey q-value [5]. The input for the RNA.bdfunc.fc function is a table of differential expression statisics (like that created by the RNA.deg function). If desired, the this function can create density plots for all gene lists (see below). RNA.bdfunc.fc Density Plot Density 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Activated Genes Inhibited Genes 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Signal The input for the RNA.bdfunc.signal function is a table of normalized expression values (like that created by the RNA.norm function). If desired, the this function can create box-plots for enrichment scores across all gene lists (see below). 10

RNA.bdfunc.signal Box-Plot ANOVA p value = 0.01 Test Statistic 0.65 0.70 0.75 0.80 0.85 0.90 HOXA1KD scramble In both cases, the output files are created with in the BD-Func subfolder. The goal of BD-Func is to calculate functional enrichment by comparing lists of activated and inhibited genes for a functional category, pathway, and/or network. This package includes pre-defined enrichment lists are available for human and mouse gene symbols. The human enrichment list is based upon Gene Ontology [6] and MSigDB [7] gene lists. The mouse enrichment list is based upon Gene Ontology categories. Additional gene lists will need to be imported using the enrichment.file parameter. This step is optional - there are no other functions that depend on the results of this analysis. References [1] Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth, 5:621 628, 2008. 11

[2] Warden CD et al. Optimal Calculation of RNA-Seq Fold-Change Values. Int J Comput Bioinfo In Silico Model, 2:285-292, 2014 [3] Benjamini Y and Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B, 57:289 300, 1995. [4] Warden CD et al. BD-Func: A Streamlined Algorithm for Predicting Activation and Inhibition of Pathways. peerj, 1:e159, 2013. [5] Storey JD and Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100:9440 9445, 2003. [6] Ashburner M et al. Gene Ontology: tool for the unification of biology. Nat Genet, 25:25 29, 2000. [7] Liberzon A et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27:1739 1740, 2011. 12