Introduction. Overview of Bioconductor packages for short read analysis

Outline

- Introduction
- General introduction
- SRAdb
- Pseudo code (ShortRead)
- Short overview of some packages
- Quality assessment
- Example sequencing data in Bioconductor
- Analysis example
- Overview

Using Bioconductor for Sequence Data #1

Bioconductor can import diverse sequence-related file types:
- FASTA
- FASTQ
- ELAND
- MAQ
- BWA
- Bowtie
- SOAP
- BAM
- GFF
- BED
- WIG

Using Bioconductor for Sequence Data #2

Bioconductor supports common and advanced sequence manipulation operations:
- Trimming
- Transformation
- Alignment

Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches.

Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb package).

SRAdb

SRAdb: Sequence Read Archive database. A compilation of metadata from NCBI SRA, together with tools.
- The SRA is the largest public repository of sequencing data from next-generation sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, and others.
- Easy access to the metadata associated with submission, study, sample, experiment and run.
- All the NCBI SRA metadata are parsed into a SQLite database that can be stored and queried locally.
- Full-text search makes querying metadata very flexible and powerful.
- SRA or SRA-lite files can be downloaded for local alignment.

Sample workflow #1

The following pseudo-code illustrates a typical R / Bioconductor session. It shows initial exploration of 454 resequencing of 16S RNA microbial community samples.
- The workflow loads the ShortRead package and its dependencies.
- It inputs about 250,000 reads of 200-250 bp each from a FASTQ file.
- Flexible pattern matching (note the ambiguity letter `V') removes a PCR primer artifact.
- The final lines plot the cumulative number of trimmed reads as a function of their (log) abundance.
- The result shows that most of the reads are from relatively few sequences that each occur many times.
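The two ideas in this workflow, matching a primer that contains an ambiguity letter and tabulating how often each trimmed read occurs, can be sketched in base R on toy data. This is only an illustration under stated assumptions: the reads are made up, and a regular expression stands in for the Biostrings pattern matching (which handles more cases, such as partial primer matches at the read edge). The names `reads`, `primer_regex` and `tbl` are hypothetical.

```r
# Toy reads (hypothetical, not from the 454 run); four start with the primer
primer <- "GGACTACCVGGGTATCTAAT"
reads <- c(
  "GGACTACCAGGGTATCTAATACGT",
  "GGACTACCCGGGTATCTAATACGT",
  "GGACTACCGGGGTATCTAATTTTT",
  "GGACTACCAGGGTATCTAATACGT",
  "ACGTACGT"
)

# The IUPAC ambiguity letter V stands for A, C or G; express it as a regex class
primer_regex <- paste0("^", gsub("V", "[ACG]", primer, fixed = TRUE))

# Remove the primer from the left end of each read where it matches
trimmed <- sub(primer_regex, "", reads)

# Occurrences of each distinct trimmed sequence
occ <- table(trimmed)

# How many distinct sequences occur once, twice, three times, ...
counts <- table(as.integer(occ))
tbl <- data.frame(
  nOccurrences = as.integer(names(counts)),
  nReads       = as.integer(counts)
)

# Cumulative number of reads accounted for, ordered by abundance
tbl$cumReads <- cumsum(tbl$nReads * tbl$nOccurrences)
print(tbl)
```

In the toy data one sequence occurs three times and two occur once, so the cumulative column reaches the total of five reads; on real data the same table is what gets plotted against (log) abundance.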

Sample workflow #2

  ## Load packages; also loads Biostrings, IRanges, ...
  > library(ShortRead)
  > library(lattice)   # for advanced plotting
  ## Input
  > seq <- readFastq("/path/to/file.fastq")
  ## Remove a PCR primer
  > pcrPrimer <- "GGACTACCVGGGTATCTAAT"
  > trimmed <- trimLRPatterns(Lpattern=pcrPrimer,
  +     subject=sread(seq), Lfixed="subject")
  ## Calculate and plot cumulative reads vs. occurrences
  > tbl <- tables(trimmed)[[2]]
  > xyplot(cumsum(nReads * nOccurrences) ~ nOccurrences, tbl,
  +     scales=list(x=list(log=TRUE)), type="b", pch=20,
  +     xlab="Number of Occurrences",
  +     ylab="Cumulative Number of Reads")

Sample workflow #3

[Figure: output of the xyplot call above, cumulative number of reads against number of occurrences (log scale).]

Bioconductor packages #1

Preprocessing:
- Quality assessment and representation: ShortRead, GenomicRanges
- Read remediation, trimming, primer removal, specialized manipulation: IRanges, ShortRead, Biostrings
- Specialized alignment tasks: Biostrings, BSgenome

Domain-specific analysis:
- ChIP-seq: chipseq, ChIPseqR, CSAR, BayesPeak
- Differential expression: DESeq, edgeR, baySeq
- RNA-seq: Genominator

Bioconductor packages #2

Annotation:
- Gene-centric: AnnotationDbi, org.*.db, KEGG.db, GO.db, Category, GOstats
- Genome coordinate: GenomicFeatures, ChIPpeakAnno

Integration:
- Digital and microarray differential expression
- RNA-seq and gene ontology / pathway: goseq
- HapMap, 1000 genomes, UCSC, Sequence Read Archive, GEO, ArrayExpress: rtracklayer, biomaRt, Rsamtools, GEOquery

Quality Assessment

  > library(ShortRead)
  > dir <-                    # Input
  +     "/mnt/fred/solexa/xxx/100524_hwi-EAS88_0005"
  > sp <- SolexaPath(dir)     # Many other formats
  > qa <- qa(sp)              # Collate statistics -- slow
  > rpt <- report(qa)         # Create report
  > browseURL(rpt)            # View in browser

Example data

  > library(EatonEtAlChIPseq)
  > fl <- system.file("extdata",
  +     "GSM424494_wt_G2_orc_chip_rep1_S288C_14.mapview.txt.gz",
  +     package="EatonEtAlChIPseq")
  > aln <- readAligned(fl, type = "MAQMapview")

AlignedRead class

  > aln
  class: AlignedRead
  length: 478774 reads; width: 39 cycles
  chromosome: S288C_14 S288C_14 ... S288C_14 S288C_14
  position: 2 4 ... 784295 784295
  strand: + - ... + +
  alignQuality: IntegerQuality
  alignData varLabels: nMismatchBestHit mismatchQuality nExactMatch24 nOneMismatch24
  > table(strand(aln), useNA="always")

       +      -      *   <NA>
   64170 414604      0      0

Accessing reads

  > head(sread(aln), 3)
    A DNAStringSet instance of length 3
      width seq
  [1]    39 CGGCTTTCTGACCG...AAAAATGAAAATG
  [2]    39 GATTTATGAAAGAA...AAATGAAAATGAA
  [3]    39 CTTTCTGACCGAAA...AATGAAAATGAAA

Alphabet by cycle

Expectation: nucleotide use is independent of cycle.

  > alnp <- aln[strand(aln) == "+"]
  > abc <- alphabetByCycle(sread(alnp))
  > class(abc)
  [1] "matrix"
  > abc[1:6,1:4]
          cycle
  alphabet  [,1]  [,2]  [,3]  [,4]
         A 20701 23067 21668 19920
         C 15159  9523 11402 11952
         G 11856 12762 11599 14220
         T 16454 18818 19501 18078
         M     0     0     0     0
         R     0     0     0     0
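To make the alphabet-by-cycle matrix concrete, here is a minimal base R sketch that builds the same kind of table by hand: rows are nucleotides, columns are cycles, and each entry counts how many reads carry that nucleotide at that cycle. The three reads and the names `letters_by_read` and `abc_toy` are hypothetical, and the sketch assumes equal-width reads restricted to A, C, G and T.

```r
# Hypothetical equal-width reads (4 cycles each)
reads <- c("ACGT", "AAGT", "ACGA")

# One row per read, one column per cycle
letters_by_read <- do.call(rbind, strsplit(reads, ""))

# For each cycle (column), count the occurrences of each nucleotide
abc_toy <- sapply(seq_len(ncol(letters_by_read)), function(cycle) {
  table(factor(letters_by_read[, cycle], levels = c("A", "C", "G", "T")))
})
print(abc_toy)

# Every column sums to the number of reads
print(colSums(abc_toy))
```

Because each read contributes exactly one letter per cycle, every column sums to the number of reads; in the slide's data the first column sums to 64170, the number of plus-strand reads.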

Alphabet by cycle

matplot takes a matrix and plots each column as a separate series (drawn here as lines):

  > tabc <- t(abc[1:4,])
  > matplot(tabc, type="l",
  +     lty=rep(1, 4))

Quality by cycle

Encoded quality scores can be decoded to their numerical values and represented as a matrix. Taking the column means gives a vector of average quality scores, one per cycle.

  > m <- as(quality(alnp), "matrix")
  > plot(colMeans(m), type="b")

Overview #1

The following packages illustrate the diversity of functionality available; all are in the release version of Bioconductor.
- IRanges, GenomicRanges and genomeIntervals for range-based (e.g., chromosomal regions) calculation, data manipulation, and general-purpose data representation.
- Biostrings for alignment, pattern matching (e.g., primer removal), and data manipulation of large biological sequences or sets of sequences.
- ShortRead and Rsamtools for file I/O, quality assessment, and high-level, general-purpose data summary.
- rtracklayer for import and export of tracks on the UCSC genome browser.
- BSgenome for accessing and manipulating curated whole-genome representations.
- GenomicFeatures for annotation of sequence features across common genomes; biomaRt for access to BioMart databases.

Overview #2

- SRAdb for querying and retrieving data from the Sequence Read Archive.
- ChIP-seq and related activities (e.g., motif discovery, identification of high-coverage segments) are facilitated by packages such as CSAR, chipseq, ChIPseqR, ChIPsim, ChIPpeakAnno, rGADEM, segmentSeq, BayesPeak, PICS.
- Differential expression and RNA-seq style analysis can be accomplished with Genominator, edgeR, baySeq, DESeq, and DEGseq.

Questions?