Basic processing of next-generation sequencing (NGS) data
Getting from raw sequence data to expression analysis!

Reminder: we are measuring expression of protein-coding genes by transcript abundance
Biology: Chromosome -> Gene -> Transcript -> Protein
What is measured: mRNA abundance (transcript sequence copies)

A typical experimental setup
Condition 1: tissue sample 1, tissue sample 2, ..., tissue sample n
Condition 2: tissue sample 1, tissue sample 2, ..., tissue sample n
Workflow: mRNA sequencing (RNA-seq) -> data processing -> expression analysis (comparisons)

Data processing overview
1. Raw sequence generation
2. Sequence filtration
3. Mapping
4. Mapping filtration
5. Computing gene expression values
6. Gene expression analysis

Sequence file formats
Output files from sequencing equipment can be MANY and HUGE data files!
A common file format is FASTQ. Each read is stored as four lines: an @ header with the read ID, the DNA sequence, a + separator line (here repeating the ID), and the quality sequence with one character per base. Example with 36 bp reads (sequence number 1, sequence number 2, ..., sequence number n):

    @OBAN:8:1:2:902#0/1
    AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT
    +OBAN:8:1:2:902#0/1
    aaava]ay]aaba_ryz``x[dn_[x_]pu_aaz_a
    @OBAN:8:1:2:1718#0/1
    TAAATATAACATTCTTTCCACNACACTTTCTAGGAC
    +OBAN:8:1:2:1718#0/1
    aaaaaaaaa`aaa]aaa[`_xd\\`\`a^[^]pfxw
    ...
    @OBAN:8:1:1114:370#0/1
    GGAAGGCAGCGAACATCTGTTCAATCTCCTCCTTGG
    +OBAN:8:1:1114:370#0/1
    a^aba\]_[_]wa[[z] M^YNX``]]^a`_XTXB

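The four-line record layout can be read with a few lines of code. The following is a minimal Python sketch, assuming well-formed records with exactly four lines each (no wrapped sequences); the function name `read_fastq` is chosen here for illustration and is not part of any tool mentioned on the slides.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# First record from the slide, held in memory instead of a file.
example = io.StringIO(
    "@OBAN:8:1:2:902#0/1\n"
    "AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT\n"
    "+OBAN:8:1:2:902#0/1\n"
    "aaava]ay]aaba_ryz``x[dn_[x_]pu_aaz_a\n"
)
records = list(read_fastq(example))
```

For a real file, pass an open file handle instead of the in-memory example. Because the function is a generator, huge files are processed one record at a time.
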
Sequence format conversions
If a specific sequence file format is required for downstream analysis/processing, conversion can be done with programs like maq or fq_all2std.pl.

    Program: maq (Mapping and Assembly with Qualities)
    Version: 0.7.1
    Contact: Heng Li <lh3@sanger.ac.uk>

    Usage:   maq <command> [options]

    Format converting:
        sol2sanger   convert Solexa FASTQ to standard/sanger FASTQ
        mapass2maq   convert mapass2's map format to maq's map format
        bfq2fastq    convert BFQ to FASTQ format

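To show what a Solexa-to-Sanger conversion involves, here is a hedged Python sketch of the quality re-encoding. It assumes the input qualities are Solexa-scaled with ASCII offset 64 and uses the standard relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1) with output offset 33. It is an illustration only, not the source code of maq or fq_all2std.pl, and `solexa_to_sanger` is a hypothetical name.

```python
import math


def solexa_to_sanger(qual):
    """Re-encode a Solexa quality string (offset 64) as Sanger/Phred (offset 33)."""
    converted = []
    for ch in qual:
        q_solexa = ord(ch) - 64
        # Solexa scores are log-odds; convert to Phred-scaled error probability.
        q_phred = 10 * math.log10(10 ** (q_solexa / 10) + 1)
        converted.append(chr(int(q_phred + 0.5) + 33))
    return "".join(converted)


high_quality = solexa_to_sanger("h")  # Solexa Q40
zero_quality = solexa_to_sanger("@")  # Solexa Q0
```

Check which encoding your sequencer or pipeline version actually produced before converting; applying this to files that are already Phred-scaled would distort the qualities.
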
Filtering on raw sequences
- Quality
- Read count per tissue/sample
- Uniqueness
- Trimming

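The filtering criteria above can be sketched in Python as follows. All thresholds (`min_q`, `min_len`, `max_n`), the quality offset of 64, and the function names are assumptions chosen for illustration; suitable values depend on the platform and the data.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Trimming: cut bases off the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=25, max_n=1):
    """Quality: keep reads that are long enough and have few uncalled (N) bases."""
    return len(seq) >= min_len and seq.count("N") <= max_n


def unique_fraction(seqs):
    """Uniqueness: fraction of distinct sequences among all reads of a sample."""
    seqs = list(seqs)
    return len(set(seqs)) / len(seqs) if seqs else 0.0


# Toy read: the last two bases have low quality ('B' = Q2 at offset 64).
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "hhhhhhhhBB")
```

The read count per sample is simply the number of reads left after these steps; samples with very few remaining reads are candidates for exclusion.
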
Mapping sequences: choice of suitable program
The choice depends on the purpose of your analysis. There are many alignment programs; here are some commonly used examples:
- Bowtie: basic mapping to a reference sequence
- Maq/BWA: mapping and SNP/indel detection
- TopHat: spliced alignments for studying exon splice junctions

Mapping sequences: choice of reference sequence database
What to map against?
- Genome reference: common if you want to do transcript assembly, study alternative splicing, and detect SNPs/indels
- Transcript reference: common if you want to do simple read mapping, count transcript copies, and analyze expression levels

Mapping sequences: common issues
Reference database not available: if you do not have a reference, you need to consider building your own.

Mapping sequences: common issues
Lack of unique mapping will typically lead to random mapping. Example of a sequence read with multiple mapping options:

    Gene A: ATCATCGGGCCATCGATTAGCTGATCGGACGCTA
    Read:              ATCGATTAGCTG
    Gene B: TTTTCCTCTTTATCGATTAGCTGGGGGT
    Read:              ATCGATTAGCTG

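The multiple-mapping situation can be checked directly on the example sequences. This is a minimal Python sketch using exact string matching only; real aligners allow mismatches and use indexes, and `exact_hits` is a hypothetical helper name.

```python
references = {
    "GeneA": "ATCATCGGGCCATCGATTAGCTGATCGGACGCTA",
    "GeneB": "TTTTCCTCTTTATCGATTAGCTGGGGGT",
}
read = "ATCGATTAGCTG"


def exact_hits(read, references):
    """Return (reference name, 0-based position) for every exact occurrence of the read."""
    hits = []
    for name, seq in references.items():
        pos = seq.find(read)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(read, pos + 1)
    return hits


hits = exact_hits(read, references)
is_multimapped = len(hits) > 1  # True here: the read fits both genes equally well
```

A read flagged this way cannot be assigned to one gene with confidence, so it is worth deciding up front whether such reads are discarded, assigned randomly, or split between genes.
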
Mapping sequences: paired-end reads

Example: map reads to reference database with Bowtie
Build the reference database index:
    bowtie-build NC_002127.fna e_coli_o157_h7
Test the build:
    bowtie -c e_coli_o157_h7 GCGTGAGCTATGAGAAAGCGCCACGCTTCC
Map/align reads to the reference (-S writes SAM output):
    bowtie -S e_coli reads/e_coli_1000.fq

Sequence Alignment/Map format (SAM): Bowtie output example
Standard alignment file format, one read per line. The first read below is mapped (flag 16: reverse strand); the other two are unmapped (flag 4):

    OBAN:8:1:3:1366#0/1 16 ENSSSCG00000004803 ENSSSCT00000005302 ACTC1 1250 255 36M * 0 0 GTCTACTTTACGTTCAGGATGACAGGTTAATGCTTC VXG^Z^`_ZVUYZ]T`ZQ\_U\W[X_^aaa`[R\\R XA:i:0 MD:Z:36 NM:i:0
    OBAN:8:1:3:285#0/1 4 * 0 0 * * 0 0 AGGTATTGGGTTTGGGGGCCTTACACACCAGGTGGA `VOW^b`RVRS`aUQMT[Z^_a_`_Y_]Y]\RNTVW XM:i:0
    OBAN:8:1:3:672#0/1 4 * 0 0 * * 0 0 TGGGTATACAGTTCATCCAGTACCCGCTCCGGCTTC a`^\y_^``]``aq]a^^_vp`[[\^[sy[yqjwub XM:i:0

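The leading SAM columns can be pulled apart with a short Python sketch. It assumes tab-delimited lines as in a real SAM file. The two example lines are abbreviated versions of the records above: the reference name is shortened to a single ID, and the sequence and quality columns are replaced by `*`; both simplifications are assumptions made for readability. `parse_sam_line` is a hypothetical helper name.

```python
def parse_sam_line(line):
    """Extract the leading SAM columns and decode two common FLAG bits."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_id": fields[0],
        "flag": flag,
        "reference": fields[2],
        "position": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "is_unmapped": bool(flag & 0x4),   # bit 4: read did not map
        "is_reverse": bool(flag & 0x10),   # bit 16: mapped to reverse strand
    }


# Abbreviated records; SEQ and QUAL are set to '*' to keep the example short.
mapped_line = "\t".join(
    ["OBAN:8:1:3:1366#0/1", "16", "ENSSSCT00000005302", "1250", "255", "36M",
     "*", "0", "0", "*", "*"]
)
unmapped_line = "\t".join(
    ["OBAN:8:1:3:285#0/1", "4", "*", "0", "0", "*", "*", "0", "0", "*", "*"]
)
mapped = parse_sam_line(mapped_line)
unmapped = parse_sam_line(unmapped_line)
```

Header lines starting with `@` would need to be skipped before calling the function on a full SAM file.
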
Filtering sequence mapping
- Low percent reads mapped
- Low number of genes covered

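Both criteria can be computed per sample from the SAM flag and reference columns. This is a minimal Python sketch; the input layout (a list of `(flag, reference)` pairs), the function name `mapping_summary`, and the toy records are assumptions made for illustration.

```python
def mapping_summary(records):
    """Summarize (flag, reference) pairs taken from SAM columns 2 and 3."""
    total = 0
    mapped = 0
    references = set()
    for flag, reference in records:
        total += 1
        if not flag & 0x4:  # bit 4 set means the read is unmapped
            mapped += 1
            references.add(reference)
    percent = 100.0 * mapped / total if total else 0.0
    return {
        "total_reads": total,
        "mapped_reads": mapped,
        "percent_mapped": percent,
        "genes_covered": len(references),
    }


# Toy sample: two mapped reads on different genes, two unmapped reads.
summary = mapping_summary([(16, "GeneA"), (4, "*"), (4, "*"), (0, "GeneB")])
```

Comparing these numbers across samples makes outliers visible; what counts as "low" is a judgment call that depends on the experiment.
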
Computing absolute expression: counting transcript copies

Mapping output (Sample 1):
    GeneA  ATCGATTAGAC
    GeneA  ATGGGCTGCAG
    GeneB  ATTTCGGCTGC
    GeneC  ATCCCTCCCTA
    GeneC  GGGCTGGCTGC
    GeneC  GCCGGCGGCAA

Count (Sample 1):
    GeneA  2
    GeneB  1
    GeneC  3

Count the copies, for example with a Perl script.

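The counting step is small enough to sketch here. The slide suggests a Perl script; the following Python version is an equivalent illustration using the Sample 1 mapping output from the slide, not the script used by the author.

```python
from collections import Counter

# (gene, read sequence) pairs as in the Sample 1 mapping output.
mapping_output = [
    ("GeneA", "ATCGATTAGAC"),
    ("GeneA", "ATGGGCTGCAG"),
    ("GeneB", "ATTTCGGCTGC"),
    ("GeneC", "ATCCCTCCCTA"),
    ("GeneC", "GGGCTGGCTGC"),
    ("GeneC", "GCCGGCGGCAA"),
]

# One count per mapped read, keyed by the gene it mapped to.
counts = Counter(gene for gene, _ in mapping_output)
```
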
Creating a gene expression matrix
Concatenate and pivot the per-sample counts:
- Column IDs are defined by tissue samples
- Row IDs are defined by gene/transcript IDs

    Gene    Sample1  Sample2  ...  SampleN
    GeneA   2        4
    GeneB   1        0
    GeneC   3        14
    ...

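The concat-and-pivot step can be sketched in plain Python. The input layout (one dictionary of gene counts per sample) and the function name `build_matrix` are assumptions; genes missing from a sample are filled with 0, as for GeneB in Sample2 on the slide.

```python
def build_matrix(per_sample_counts):
    """Pivot {sample: {gene: count}} into a header row plus one row per gene."""
    samples = list(per_sample_counts)
    genes = sorted({gene for counts in per_sample_counts.values() for gene in counts})
    header = ["Gene"] + samples
    rows = [
        [gene] + [per_sample_counts[sample].get(gene, 0) for sample in samples]
        for gene in genes
    ]
    return header, rows


per_sample_counts = {
    "Sample1": {"GeneA": 2, "GeneB": 1, "GeneC": 3},
    "Sample2": {"GeneA": 4, "GeneC": 14},  # GeneB had no reads in this sample
}
header, rows = build_matrix(per_sample_counts)
```

Writing `header` and `rows` out as tab-separated text gives a file that R or a spreadsheet can read directly.
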
Filtering genes by absolute expression
- Number of reads per gene per sample (alternatively, per gene across all samples)
- Overall tissue sample assessment table

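A read-count filter on the expression matrix might look like the following Python sketch. The thresholds `min_reads` and `min_samples` and the function name `filter_genes` are hypothetical; the slide does not state cut-off values.

```python
def filter_genes(matrix, min_reads=2, min_samples=1):
    """Keep genes with at least min_reads reads in at least min_samples samples."""
    return {
        gene: counts
        for gene, counts in matrix.items()
        if sum(count >= min_reads for count in counts) >= min_samples
    }


# Gene -> counts per sample (Sample1, Sample2), taken from the example matrix.
matrix = {"GeneA": [2, 4], "GeneB": [1, 0], "GeneC": [3, 14]}
kept = filter_genes(matrix, min_reads=2, min_samples=2)
```

For the alternative criterion (reads per gene across all samples), replace the condition with a threshold on `sum(counts)`.
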
Transformation of expression values
- Adjust for technical differences such as total read count per sample
- The choice depends on the downstream analysis tool; some tools read absolute counts and do the transformation/normalization themselves
- Relative abundance (RA)
- Log transformation of RA
- Reads per kilobase per million reads (RPKM; Mortazavi et al., 2008)

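The transformations listed above can be written out as a short Python sketch. RPKM is computed as count * 10^9 / (gene length in bp * total mapped reads). The gene length and read totals in the example are made-up numbers, and the choice of log base 2 is an assumption; the slide does not specify a base.

```python
import math


def relative_abundance(count, total_reads):
    """Fraction of a sample's reads that belong to one gene."""
    return count / total_reads


def log2_relative_abundance(count, total_reads):
    """Log2 of the relative abundance; undefined for zero counts."""
    if count == 0:
        raise ValueError("log of zero count is undefined; filter or add a pseudocount")
    return math.log2(relative_abundance(count, total_reads))


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 10 reads on a 2000 bp transcript, 1 million mapped reads.
example_rpkm = rpkm(10, 2000, 1_000_000)
# GeneA in Sample 1 of the counting example: 2 of 6 reads.
example_ra = relative_abundance(2, 6)
```

RPKM corrects for both sequencing depth and transcript length, so it allows comparing genes within a sample as well as the same gene across samples.
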
Carry on with gene expression analysis!
- Differential expression
- Clustering
- Etc.

Some general considerations
- Linux is the best environment for handling and analyzing huge data files
- Learning some of the Linux commands can be helpful (grep, sed, cut, awk)
- Learning Perl/R programming can also help with processing text data files
- Use batch files to build data processing pipelines (documentation and re-use)
- Get used to switching between various tools for processing, analysis, and visualization
- Check input/output files: you are responsible, not the software/script authors!

Resources for NGS data
- Forum for discussing NGS data analysis: http://www.seqanswers.com
- Galaxy online tools: http://main.g2.bx.psu.edu
- NCBI Short Read Archive (SRA): http://trace.ncbi.nlm.nih.gov/traces/sra
- Bioconductor packages for NGS: http://www.bioconductor.org/help/workflows/high-throughput-sequencing (ShortRead, Biostrings, edgeR, Rsamtools, biomaRt)