Normalization of RNA-Seq
Davide Risso
Modified: April 27, Compiled: April 27,

1 Retrieving the data

Usually, an RNA-Seq data analysis from scratch starts with a set of FASTQ files, which contain information on both the quality and the sequence of the short reads. There are several tools to align the reads to the reference genome (e.g., Bowtie, TopHat, GSNAP, Stampy, ...). A common output file format is the SAM/BAM format.

You just saw how to align reads when you don't have a genome, and how to summarize them. When you do have a genome, a standard approach is to align the reads with Bowtie or TopHat, and then summarize them over regions of interest, such as genes, exons, non-coding RNAs, etc. To do this, you need your aligned reads and an annotation for your reference genome.

There are tools and packages to summarize the aligned reads into gene counts. One of them is HTSeq (HTSeq/doc/overview.html). The simple command

$ htseq-count example.sam Saccharomyces_cerevisiae.EF2.60.gtf

will produce a table of counts, i.e.,

YAL002W 1
YAL003W 19
YAL005C 8
YAL007C 2
YAL008W 2
YAL012W 9
YAL014C 1
YAL016W 3
YAL017W 2
YAL019W 1

By doing this for every sample in your study you end up with a table with m rows (genes) and n columns (samples). This is what you have in the file genelevelcounts.txt.
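As a rough illustration of how per-sample count tables can be assembled into a single genes-by-samples matrix, here is a minimal base R sketch. The helper merge_htseq_counts, the sample names, and the second sample's counts are hypothetical and only serve the example; the first sample reuses three rows from the htseq-count output shown above. The sketch assumes each file has two tab-separated columns (gene identifier, count) and that any rows whose identifier starts with "__" are summary counters rather than genes.

```r
# Hypothetical helper (not part of HTSeq or EDASeq): combine per-sample
# two-column count files into one genes x samples matrix.
merge_htseq_counts <- function(files, samples) {
  tables <- lapply(files, read.table, header = FALSE, sep = "\t",
                   col.names = c("gene", "count"), stringsAsFactors = FALSE)
  # Union of all gene identifiers seen in any sample
  genes <- sort(unique(unlist(lapply(tables, function(tab) tab$gene))))
  # Assumption: identifiers starting with "__" are summary rows, not genes
  genes <- genes[!grepl("^__", genes)]
  columns <- lapply(tables, function(tab) {
    v <- tab$count[match(genes, tab$gene)]
    v[is.na(v)] <- 0L  # a gene absent from a sample gets a count of zero
    v
  })
  counts <- do.call(cbind, columns)
  dimnames(counts) <- list(genes, samples)
  counts
}

# Toy input files: sampleA uses rows from the example output in the text,
# sampleB is made up for illustration.
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("YAL002W\t1", "YAL003W\t19", "YAL005C\t8"), f1)
writeLines(c("YAL002W\t4", "YAL005C\t2", "YAL007C\t7"), f2)

counts <- merge_htseq_counts(c(f1, f2), c("sampleA", "sampleB"))
print(counts)
```

Filling missing genes with zero is a design choice of this sketch; in practice every sample is counted against the same annotation, so all files should list the same genes.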
The first gene identifiers in this file are:

$ head genelevelcounts.txt
YAL067W-A
YAL067C
YAL066W
YAL065C
YAL064W-B
YAL064C-A

Today, we will consider an example based on the data analyzed in Risso et al. [7]. The Sherlock Lab in Stanford sequenced 10 strains of Saccharomyces cerevisiae grown in three media, namely YPD, Delft and Glycerol, each with 3-4 biological replicates. Illumina's standard Genome Analyzer pre-processing pipeline was used to yield 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie [4], considering only unique mappings and allowing up to two mismatches. The read count for a given gene is defined as the number of reads with 5'-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.

For Exploratory Data Analysis (EDA) and normalization purposes, it is useful to consider some features of the genes, such as GC-content and gene length. To obtain this information, we need the gene sequences, which can be retrieved from different sources (e.g., Ensembl, UCSC, FlyBase, ...). In the yeast community, a standard resource is the SGD website. In general, a good resource is Ensembl. In any case, you need to download the sequences of your regions of interest (e.g., protein coding genes, non-coding RNAs, ...), usually in FASTA format.

Example of FASTA format:

$ head Scer.fasta
>YAL001C
ATGGTACTGACGATTTATCCTGACGAACTCGTACAAA...
>YAL002W
ATGGAGCAAAATGGCCTTGACCACGA...

Once you have your FASTA file, it is easy to compute the length and GC-content of each gene using the ShortRead Bioconductor package. Bioconductor is an open source project based on the R statistical programming language. Open a terminal and type R. This will open an R console.

> library(ShortRead)
> filename <- "Scer.fasta"
> fa <- readFasta(filename)
> abc <- alphabetFrequency(sread(fa), baseOnly=TRUE)
> rownames(abc) <- sapply(strsplit(as.character(id(fa)), " "), function(x) x[1])
> alphabet <- abc[,1:4]
> gc <- rowSums(alphabet[,2:3])/rowSums(alphabet)  # columns 2 and 3 are C and G
> length <- width(sread(fa))
> head(gc)
YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C
> head(length)
[1]

We can create a data.frame to store this information; we will call it geneInfo.

> geneInfo <- data.frame(length=length, gc=gc)
> head(geneInfo)
        length gc
YAL001C
YAL002W
YAL003W
YAL004W
YAL005C
YAL007C

2 Exploratory Data Analysis

We will use the EDASeq [6] R package for the EDA and the normalization. This package provides a class of objects named SeqExpressionSet, useful to store gene counts along with gene and lane information. First of all, we need to read the counts into R. This is done with the read.table function:

> genelevelcounts <- read.table("genelevelcounts.txt", header=TRUE, row.names=1)
> laneinfo <- read.table("laneinfo.txt", header=TRUE, row.names=1)

We want to filter out the non-expressed genes. For simplicity, we consider only the genes expressed in all growth conditions, i.e., genes with an average read count of 10 or more.

> means <- rowMeans(genelevelcounts)
> filter <- means >= 10
> table(filter)
filter
FALSE  TRUE
> genelevelcounts <- genelevelcounts[filter,]

This leaves us with 5534 genes. Now we can store this information (gene and lane info along with gene counts) in one single object.

> library(EDASeq)
> data <- newSeqExpressionSet(exprs = as.matrix(genelevelcounts),
+                             featureData = geneInfo[rownames(genelevelcounts), ],
+                             phenoData = laneinfo)
> data
SeqExpressionSet (storageMode: lockedEnvironment)
assayData: 5534 features, 14 samples
  element names: exprs, offset
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: lib_prep conditions flow_cell lib_prep_proto
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
> head(exprs(data))
          Y1_1 Y1_2 Y2_1 Y2_2 Y7_1 Y7_2 Y4_1 Y4_2 D1 D2
YAL062W
YAL061W
YAL060W
YAL059W
YAL058W
YAL056C-A
          D7 G1 G2 G3
YAL062W
YAL061W
YAL060W
YAL059W
YAL058W
YAL056C-A
> pData(data)
     lib_prep conditions flow_cell lib_prep_proto
Y1_1       Y1        YPD     428R1      Protocol1
Y1_2       Y1        YPD     4328B      Protocol1
Y2_1       Y2        YPD     428R1      Protocol1
Y2_2       Y2        YPD     4328B      Protocol1
Y7_1       Y7        YPD     428R1      Protocol1
Y7_2       Y7        YPD     4328B      Protocol1
Y4_1       Y4        YPD     61MKN      Protocol2
Y4_2       Y4        YPD     61MKN      Protocol2
D1         D1        Del     428R1      Protocol1
D2         D2        Del     428R1      Protocol1
D7         D7        Del     428R1      Protocol1
G1         G1        Gly     6247L      Protocol2
G2         G2        Gly     62OAY      Protocol1
G3         G3        Gly     62OAY      Protocol1
> head(fData(data))
          length gc
YAL062W
YAL061W
YAL060W
YAL059W
YAL058W
YAL056C-A

We can look at some graphical summaries of the data. This will help us discover biases and artifacts in the data.

Between-lane distribution of gene-level counts. One of the main considerations when dealing with gene-level counts is the difference in count distributions between lanes. The boxplot method provides an easy way to produce boxplots of the logarithms of the gene counts in each lane.
> colors <- as.numeric(as.factor(pData(data)[, 2])) + 1  # one color per growth condition
> boxplot(data, col=colors)

[Figure: boxplots of the log gene-level counts, one box per lane (Y1_1 ... G3)]

Over-dispersion. The function meanVarPlot can be used to check whether the count data are over-dispersed (for the Poisson distribution, one would expect the points to be evenly scattered around the black line).

> meanVarPlot(data[, 1:8], log=TRUE)

[Figure: mean-variance plot; x-axis: mean, y-axis: variance]
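The idea behind such a mean-variance check can be sketched in base R. This is not how meanVarPlot is implemented; it is a minimal illustration on simulated data, and the helper dispersion_ratio as well as all simulation parameters are assumptions made for the example. Under a Poisson model the per-gene variance is close to the per-gene mean, so the variance-to-mean ratio stays near 1, whereas over-dispersed (here negative binomial) counts give ratios well above 1.

```r
set.seed(1)
n_genes <- 2000
n_lanes <- 8

# Gene-specific expected counts (arbitrary toy distribution)
mu <- rexp(n_genes, rate = 1 / 100) + 1

# Matrices are filled column by column, so row g always uses mu[g]
pois_counts <- matrix(rpois(n_genes * n_lanes, lambda = mu), nrow = n_genes)
nb_counts   <- matrix(rnbinom(n_genes * n_lanes, mu = mu, size = 5),
                      nrow = n_genes)

# Hypothetical helper: median of the per-gene variance / mean ratio
dispersion_ratio <- function(counts) {
  m <- rowMeans(counts)
  v <- apply(counts, 1, var)
  keep <- m > 0  # avoid dividing by zero for genes with no reads
  median(v[keep] / m[keep])
}

ratio_pois <- dispersion_ratio(pois_counts)
ratio_nb   <- dispersion_ratio(nb_counts)
cat("Poisson counts, median variance/mean:          ", ratio_pois, "\n")
cat("Negative binomial counts, median variance/mean:", ratio_nb, "\n")
```

Plotting log variance against log mean for each matrix gives the same picture graphically: the Poisson genes scatter around the identity line and the over-dispersed genes sit above it.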
Gene-specific effects on read counts. Several authors have reported selection biases related to sequence features such as gene length, GC-content, and mappability [2, 3, 5, 7]. Using biasPlot, one can see the dependence of gene-level counts on GC-content. The same plot could be created for gene length or mappability instead of GC-content.

> biasPlot(data[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: gene counts (log) against gc for Y1, Y2, Y7 and Y4]

3 Normalization

Following Risso et al. [7], we consider two main types of effects on gene-level counts: (1) within-lane gene-specific (and possibly lane-specific) effects, e.g., related to gene length or GC-content, and (2) effects related to between-lane distributional differences, e.g., sequencing depth. Accordingly, withinLaneNormalization and betweenLaneNormalization adjust for the first and second type of effects, respectively. We recommend normalizing for within-lane effects prior to between-lane normalization.

EDASeq implements four within-lane normalization methods, namely: loess robust local regression of read counts (log) on a gene feature such as GC-content (loess), global-scaling between feature strata using the median (median), global-scaling between feature strata using the upper-quartile (upper), and full-quantile normalization between feature strata (full). For a discussion of these methods in the context of GC-content normalization, see Risso et al. [7].

Regarding between-lane normalization, the package implements three of the methods introduced in Bullard et al. [2]: global-scaling using the median (median), global-scaling using the upper-quartile (upper), and full-quantile normalization (full).
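To make the global-scaling idea concrete, here is a minimal base R sketch of between-lane upper-quartile scaling on toy data. It is not EDASeq's implementation: the helper upper_quartile_scale is hypothetical, and rescaling every lane to the mean of the lane upper quartiles is an assumption of this sketch (other common targets are possible). The second lane is simply the first lane sequenced "three times deeper", so a correct scaling should make the two lanes identical.

```r
# Hypothetical helper: divide each lane (column) by its upper quartile,
# relative to the mean upper quartile across lanes.
upper_quartile_scale <- function(counts) {
  uq <- apply(counts, 2, quantile, probs = 0.75)
  sweep(counts, 2, uq / mean(uq), "/")
}

set.seed(2)
base_counts <- rnbinom(1000, mu = 50, size = 2)  # toy gene-level counts

# Two lanes that differ only in sequencing depth (factor of 3)
counts <- cbind(lane1 = base_counts, lane2 = 3 * base_counts)

scaled <- upper_quartile_scale(counts)

# Upper quartiles before and after scaling
print(apply(counts, 2, quantile, probs = 0.75))
print(apply(scaled, 2, quantile, probs = 0.75))
```

Median scaling follows the same pattern with probs = 0.5. Full-quantile normalization goes further and forces the whole count distribution, not just one summary statistic, to agree across lanes.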
> datawithin <- withinLaneNormalization(data, "gc", which="full")
> datanorm <- betweenLaneNormalization(datawithin, which="median")

After normalization, the GC-content bias is reduced and the gene-level counts are comparable across lanes.

> biasPlot(datanorm[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: normalized gene counts (log) against gc for Y1, Y2, Y7 and Y4]

> boxplot(datanorm, col=colors)

[Figure: boxplots of the normalized log gene-level counts, one box per lane (Y1_1 ... G3)]

Moreover, the over-dispersion is reduced in the normalized counts, even though the Poisson assumption still does not hold.
> meanVarPlot(datanorm[, 1:8], log=TRUE)

[Figure: mean-variance plot of the normalized counts; x-axis: mean, y-axis: variance]

You can write your normalized counts to a file with the write.table function:

> write.table(datanorm, file="normalizedcounts.txt", sep="\t", quote=FALSE)

4 Differential expression (DE) analysis

One of the main applications of RNA-Seq is differential expression analysis. The normalized counts (or the original counts and the offset) obtained using the EDASeq package can be supplied to packages such as edgeR [8] or DESeq [1] to find differentially expressed genes. Some authors have argued that it is better to leave the count data unchanged to preserve their sampling properties and instead use an offset for normalization purposes in the context of DE analysis [1, 3, 8]. This can be achieved easily using the argument offset in both normalization functions.

> dataoffset <- withinLaneNormalization(data, "gc",
+                                       which="full", offset=TRUE)
> dataoffset <- betweenLaneNormalization(dataoffset,
+                                        which="full", offset=TRUE)

4.1 DESeq

If one wants to use the normalized data to perform a DE analysis with DESeq, there is a simple way to transform the data into the format needed by DESeq.

> library(DESeq)
> counts <- as(datanorm, "CountDataSet")
> counts
CountDataSet (storageMode: environment)
assayData: 5534 features, 14 samples
  element names: counts
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: sizeFactor lib_prep ... lib_prep_proto (5 total)
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

5 SessionInfo

> toLatex(sessionInfo())

R version ( ), x86_64-apple-darwin
Locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
Base packages: base, datasets, graphics, grDevices, methods, stats, utils
Other packages: aroma.light, Biobase, BiocGenerics 0.2.0, Biostrings, DESeq 1.8.1, EDASeq 1.2.0, GenomicRanges 1.8.3, IRanges, lattice, latticeExtra, locfit 1.5-7, R.methodsS, R.oo 1.9.3, RColorBrewer 1.0-5, Rsamtools 1.8.1, ShortRead
Loaded via a namespace (and not attached): annotate, AnnotationDbi, bitops, BSgenome, DBI 0.2-5, genefilter, geneplotter, grid, hwriter 1.3, KernSmooth, RCurl, RSQLite, rtracklayer, splines, stats, survival, tools, XML 3.9-4, xtable 1.7-0, zlibbioc

References

[1] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R
[2] Bullard, J., Purdom, E., Hansen, K., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.
[3] Hansen, K., Irizarry, R., and Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics.
[4] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
[5] Oshlack, A. and Wakefield, M. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct, 4(1), 14.
[6] Risso, D. and Dudoit, S. (2011). EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq. R package version bioconductor.org/packages/release/bioc/html/edaseq.html.
[7] Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics, 12, 480.
[8] Robinson, M., McCarthy, D., and Smyth, G. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1),
More informationIntroduction to NGS data analysis
Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High
More information-> Integration of MAPHiTS in Galaxy
Enabling NGS Analysis with(out) the Infrastructure, 12:0512 Development of a workflow for SNPs detection in grapevine From Sets to Graphs: Towards a Realistic Enrichment Analy species: MAPHiTS -> Integration
More informationPackage hoarder. June 30, 2015
Type Package Title Information Retrieval for Genetic Datasets Version 0.1 Date 2015-06-29 Author [aut, cre], Anu Sironen [aut] Package hoarder June 30, 2015 Maintainer Depends
More informationBIOS 6660: Analysis of Biomedical Big Data Using R and Bioconductor, Fall 2015 Computer Lab: Education 2 North Room 2201DE (TTh 10:30 to 11:50 am)
BIOS 6660: Analysis of Biomedical Big Data Using R and Bioconductor, Fall 2015 Computer Lab: Education 2 North Room 2201DE (TTh 10:30 to 11:50 am) Course Instructor: Dr. Tzu L. Phang, Assistant Professor
More informationStandards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium
Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium I. Introduction: Sequence based assays of transcriptomes (RNA-seq) are in wide use because of their favorable
More informationUGENE Quick Start Guide
Quick Start Guide This document contains a quick introduction to UGENE. For more detailed information, you can find the UGENE User Manual and other special manuals in project website: http://ugene.unipro.ru.
More informationExiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays
Exiqon Array Software Manual Quick guide to data extraction from mircury LNA microrna Arrays March 2010 Table of contents Introduction Overview...................................................... 3 ImaGene
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationBenjamin Czech, Jonathan B. Preall, Jon McGinn, and Gregory J. Hannon
Molecular Cell, Volume 50 Supplemental Information A Transcriptome-wide RNAi Screen in the Drosophila Ovary Reveals Factors of the Germline pirna Pathway Benjamin Czech, Jonathan B. Preall, Jon McGinn,
More informationGeneSifter: Next Generation Data Management and Analysis for Next Generation Sequencing
for Next Generation Sequencing Dale Baskin, N. Eric Olson, Laura Lucas, Todd Smith 1 Abstract Next generation sequencing technology is rapidly changing the way laboratories and researchers approach the
More informationMICROARRAY DATA ANALYSIS TOOL USING JAVA AND R
MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R By Vasundhara Akkineni B.Tech, University of Madras, 2003 A Thesis Submitted to the Faculty of the Graduate School of the University of Louisville In Partial
More informationImportance of Statistics in creating high dimensional data
Importance of Statistics in creating high dimensional data Hemant K. Tiwari, PhD Section on Statistical Genetics Department of Biostatistics University of Alabama at Birmingham History of Genomic Data
More informationTutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
More informationHigh Throughput Sequencing Data Analysis using Cloud Computing
High Throughput Sequencing Data Analysis using Cloud Computing Stéphane Le Crom (stephane.le_crom@upmc.fr) LBD - Université Pierre et Marie Curie (UPMC) Institut de Biologie de l École normale supérieure
More informationBiological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
More informationCummeRbund: Visualization and Exploration of Cufflinks High-throughput Sequencing Data
CummeRbund: Visualization and Exploration of Cufflinks High-throughput Sequencing Data Loyal A. Goff, Cole Trapnell, David Kelley May 7, 214 Contents 1 Requirements 2 2 Introduction 3 3 CummeRbund Classes
More informationExploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
More informationMethods, tools, and pipelines for analysis of Ion PGM Sequencer mirna and gene expression data
WHITE PAPER Ion RNA-Seq Methods, tools, and pipelines for analysis of Ion PGM Sequencer mirna and gene expression data Introduction High-resolution measurements of transcriptional activity and organization
More informationLecture 11 Data storage and LIMS solutions. Stéphane LE CROM lecrom@biologie.ens.fr
Lecture 11 Data storage and LIMS solutions Stéphane LE CROM lecrom@biologie.ens.fr Various steps of a DNA microarray experiment Experimental steps Data analysis Experimental design set up Chips on catalog
More informationOptimization of sampling strata with the SamplingStrata package
Optimization of sampling strata with the SamplingStrata package Package version 1.1 Giulio Barcaroli January 12, 2016 Abstract In stratified random sampling the problem of determining the optimal size
More information5 Correlation and Data Exploration
5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both
More information8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)
Experimental Design & Intro to NGS Data Analysis Ryan Peters Field Application Specialist Partek, Incorporated Agenda Experimental Design Examples ANOVA What assays are possible? NGS Analytical Process
More informationKeeping up with DNA technologies
Keeping up with DNA technologies Mihai Pop Department of Computer Science Center for Bioinformatics and Computational Biology University of Maryland, College Park The evolution of DNA sequencing Since
More informationOn-line supplement to manuscript Galaxy for collaborative analysis of ENCODE data: Making large-scale analyses biologist-friendly
On-line supplement to manuscript Galaxy for collaborative analysis of ENCODE data: Making large-scale analyses biologist-friendly DANIEL BLANKENBERG, JAMES TAYLOR, IAN SCHENCK, JIANBIN HE, YI ZHANG, MATTHEW
More information