Introduction. Overview of Bioconductor packages for short read analysis



Similar documents
Basic processing of next-generation sequencing (NGS) data

Comparing Methods for Identifying Transcription Factor Target Genes

Normalization of RNA-Seq

Challenges associated with analysis and storage of NGS data

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Visualisation tools for next-generation sequencing

-> Integration of MAPHiTS in Galaxy

Module 1. Sequence Formats and Retrieval. Charles Steward

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

Creating a New Annotation Package using SQLForge

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

IRanges, GenomicRanges, and Biostrings

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Using Databases in R

Analysis of NGS Data

High Throughput Sequencing Data Analysis using Cloud Computing

GeneSifter: Next Generation Data Management and Analysis for Next Generation Sequencing

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

LifeScope Genomic Analysis Software 2.5

Integrating computational data analysis capabilities into analytics applications

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Analysis of ChIP-seq data in Galaxy

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Next Generation Sequencing

EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq

UGENE Quick Start Guide

PreciseTM Whitepaper

Computational Genomics. Next generation sequencing (NGS)

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

GenBank, Entrez, & FASTA

Deep Sequencing Data Analysis

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Frequently Asked Questions Next Generation Sequencing

A Primer of Genome Science THIRD

Next generation DNA sequencing technologies. theory & prac-ce

A Tutorial in Genetic Sequence Classification Tools and Techniques

Introduction to NGS data analysis

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Next generation sequencing (NGS)

Bioinformatics Grid - Enabled Tools For Biologists.

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

NGS Data Analysis: An Intro to RNA-Seq

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

Databases and mapping BWA. Samtools

Package GEOquery. August 18, 2015

Bioinformatics Unit Department of Biological Services. Get to know us

GeneProf and the new GeneProf Web Services

Gene Expression Analysis

Data formats and file conversions

CHALLENGES IN NEXT-GENERATION SEQUENCING

RNA- seq de novo ABiMS

SRA File Formats Guide

Using Galaxy for NGS Analysis. Daniel Blankenberg Postdoctoral Research Associate The Galaxy Team

NGS data analysis. Bernardo J. Clavijo

Bioinformatics Resources at a Glance

Practical Solutions for Big Data Analytics

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

BioHPC Web Computing Resources at CBSU

July 7th 2009 DNA sequencing

Practical Differential Gene Expression. Introduction

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

New solutions for Big Data Analysis and Visualization

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

Version 5.0 Release Notes

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

What s New in Pathway Studio Web 11.1

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Statistical challenges in RNA-Seq data analysis

Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome

Unipro UGENE Manual. Version

Analysis of Illumina Gene Expression Microarray Data

BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

How Sequencing Experiments Fail

Partek Methylation User Guide

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

Next Generation Sequencing

Text file One header line meta information lines One line : variant/position

MORPHEUS. Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

Research Article Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

Transcription:

Overview of Bioconductor packages for short read analysis Introduction General introduction SRAdb Pseudo code (Shortread) Short overview of some packages Quality assessment Example sequencing data in Bioconductor Analysis example Overview Bioconductor can import diverse sequence-related file types: Fasta Fastq ELAND MAQ BWA Bowtie SSOAP BAM Gff Bed wig Using Bioconductor for Sequence Data #1 1

Using Bioconductor for Sequence Data #2 Bioconductor support common and advanced sequence manipulation operations: Trimming Transformation Alignment Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches. Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb library). SRAdb SRAdb: Sequence Read Archive database A compilation of metadata from NCBI SRA and tools The largest public repository of sequencing data from the next generation of sequencing platforms including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, and others. Easy access to the metadata associated with submission, study, sample, experiment and run. All the NCBI SRA metadata are parsed into a SQLite database that can be stored and queried locally. Fulltext search makes querying metadata very flexible and powerful. Sra or sra-lite files can be downloaded for doing alignment locally. Sample workflow #1 The following psuedo-code illustrates a typical R / Bioconductor session. It shows initial exploration of 454 resequencing of a 16S RNA microbial community samples. The workflow loads the ShortRead package and its dependencies. It inputs about 250,000 reads of 200-250 bp each from a fastq file. Flexible pattern matching (note the ambiguity letter `V') removes a PCR primer artifact. The final lines plot the cumulative number of trimmed reads as a function of their (log) abundance. The result shows that most of the reads are from relatively few sequences that each occur many times. 2

Sample workflow #2 ## Load packages; also loads Biostrings, IRanges,... > library(shortread) > library(lattice) # for advanced plotting ## Input > seq <- readfastq("/path/to/file.fastq") ## Remove a PCR primer > pcrprimer <- "GGACTACCVGGGTATCTAAT" > trimmed <- trimlrpatterns(lpattern=pcrprimer, subject=sread(seq), Lfixed="subject") ## Calculate and plot cumulative reads vs. occurrences > tbl <- tables(trimmed)[[2]] > xyplot(cumsum(nreads * noccurrences) ~ noccurrences, tbl, scales=list(x=list(log=true)), type="b", pch=20, xlab="number of Occurrences", ylab="cumulative Number of Reads") Sample workflow #3 Bioconductor packages #1 Preprocessing: Quality assessment and representation: ShortRead, GenomicRanges Read remediation, trimming, primer removal, specialized manipulation: IRanges, ShortRead, Biostrings Specialized alignment tasks: Biostrings, BSgenome Domain specific analysis: ChIP-seq: chipseq, ChIPseqR, CSAR, BayesPeak Differential expression: DESeq, edger, bayseq RNA-seq: Genominator 3

Bioconductor packages #2 Annotation: Gene-centric: AnnotationDbi, org.*.db, KEGG.db, GO.db, Category, GOstats Genome coordinate: GenomicFeatures, ChIPpeakAnno Integration: Digital and microarray differential expression RNAseq and gene ontology / pathway, goseq HapMap, 1000 genomes, UCSC, Sequence Read Archive, GEO, ArrayExpress, rtracklayer, biomart, Rsamtools, GEOquery Quality Assessment > library(shortread) > dir <- # Input + "/mnt/fred/solexa/xxx/100524_hwi- EAS88_0005" > sp <- SolexaPath(dir) # Many other formats > qa <- qa(sp) # Collate statistics -- slow > rpt <- report(qa) # Create report > browseurl(rpt) # View in browser Example data > library(eatonetalchipseq) > fl <- system.file("extdata", + "GSM424494_wt_G2_orc_chip_rep1_S288C_14.mapview.txt.gz, package="eatonetalchipseq") > aln <- readaligned(fl, type = "MAQMapview") 4

AlignedRead class > aln class: AlignedRead length: 478774 reads; width: 39 cycles chromosome: S288C_14 S288C_14... S288C_14 S288C_14 position: 2 4... 784295 784295 strand: + -... + + alignquality: IntegerQuality aligndata varlabels: nmismatchbesthit mismatchquality nexactmatch24 nonemismatch24 > table(strand(aln), usena="always") + - * <NA> 64170 414604 0 0 > Accessing reads > head(sread(aln), 3) A DNAStringSet instance of length 3 width seq [1] 39 CGGCTTTCTGACCG...AAAAATGAAAATG [2] 39 GATTTATGAAAGAA...AAATGAAAATGAA [3] 39 CTTTCTGACCGAAA...AATGAAAATGAAA Alphabet by cycle Expectation: nucleotide use independent of cycle > alnp <- aln[strand(aln) == "+"] > abc <- alphabetbycycle(sread(alnp)) > class(abc) [1] "matrix" > abc[1:6,1:4] cycle alphabet [,1] [,2] [,3] [,4] A 20701 23067 21668 19920 C 15159 9523 11402 11952 G 11856 12762 11599 14220 T 16454 18818 19501 18078 M 0 0 0 0 R 0 0 0 0 5

Alphabet by cycle matplot takes a matrix and plots each column as a set of points: > tabc <- t(abc[1:4,]) > matplot(tabc, type="l", + lty=rep(1, 4)) Quality by cycle Encoded quality scores can be decoded to their numerical values and represented as a matrix. Calculating the average of the column means creates a vector of average quality scores across cycle. > m <- as(quality(alnp), "matrix") > plot(colmeans(m), type="b") Overview #1 The following packages illustrate the diversity of functionality available; all are in the release version of Bioconductor. IRanges, GenomicRanges and genomeintervals for range-based (e.g., chromosomal regions) calculation, data manipulation, and general-purpose data representation. Biostrings for alignment, pattern matching (e.g., primer removal), and data manipulation of large biological sequences or sets of sequences. ShortRead and Rsamtools for file I/O, quality assessment, and high-level, general purpose data summary. rtracklayer for import and export of tracks on the UCSC genome browser. BSgenome for accessing and manipulating curated whole-genome representations. GenomicFeatures for annotation of sequence features across common genomes, biomart for access to Biomart databases. 6

Overview #2 SRAdb for querying and retrieving data from the Sequence Read Archive. ChIP-seq and related (e.g., motif discovery, identification of high-coverage segments) activities are facilitated by packages such as CSAR, chipseq, ChIPseqR, ChIPsim, ChIPpeakAnno, rgadem, segmentseq, BayesPeak, PICS. Differential expression and RNA-seq style analysis can be accomplished with Genominator, edger, bayseq, DESeq, and DEGseq. Questions? 7