NGS Data Analysis: An Intro to RNA-Seq


1 NGS Data Analysis: An Intro to RNA-Seq
March 25th, 2014
GST Colloquium: March 25th, 2014

2 Workshop Design
- Basics of NGS
- Sample Prep
- RNA-Seq Analysis

3 Experimental Design
There are many types of sequencing experiments available:
- Resequencing
- Assembly
- RNA-Seq
- ChIP-Seq
- Metagenomics

4 Common experimental questions:
- Measure variation within or between species
- Generate a genome sequence
- Transcriptome characterization
- Identify protein binding sites
- Population genetics
- Differential expression studies

5 Basic Process

6 Design Considerations
- What resources do you have already? (Reference genome, curated gene models, etc.)
- Do you need biological replicates? (Depends on the experiment, but the answer is usually yes.)
- Do you need technical replicates? (Most likely not.)
- Do you need controls? (Depends on the experiment.)
- Do you need deep sequencing coverage? (Again, depends on the experiment.)
All of these questions should be answered before you start.

7 Types of reads
Single:
- Fast runs
- Cheapest overall cost
Paired:
- More data for each fragment
- More data for alignment/assembly
- Same inputs as single-end
- Best for isoform detection
Mate-Paired:
- Longer pairs than paired-end
- Allow sequencing over long repeats
- Good for detecting structural variations
- Require more input DNA than any other library

8 How many reads?
Genomic:
- Depends on the size of your genome. You want enough reads to cover your genome at depth.
RNA-Seq:
- Depends on the complexity of the transcriptional profile you're working on and on whether you need to capture rare events.
- Rule of thumb: more replicates are more important than more sequences.
Again, this decision depends entirely on the question you are trying to answer and the organism you are working in. In practice, a lane usually has more sequencing capacity than a single sample needs, so the real question is how many samples you can pool into a given lane.
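For the genomic case, average depth follows the standard relation depth = (number of reads x read length) / genome size. The sketch below rearranges that relation to estimate how many reads a sample needs and how many such samples fit in one lane. The function names, the 140 Mb genome size, and the lane yield are illustrative assumptions, not figures from this workshop.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_depth, paired=False):
    """Estimate reads (or read pairs, if paired=True) for a target average depth."""
    # A read pair contributes two reads' worth of bases to the coverage.
    bases_per_unit = read_length_bp * (2 if paired else 1)
    return math.ceil(target_depth * genome_size_bp / bases_per_unit)


def samples_per_lane(lane_yield_reads, reads_per_sample):
    """How many samples fit in one lane at the requested reads per sample."""
    return lane_yield_reads // reads_per_sample


if __name__ == "__main__":
    # Hypothetical inputs: ~140 Mb genome, 100 bp paired-end reads, 30x depth.
    pairs = reads_needed(140_000_000, 100, 30, paired=True)
    print("read pairs per sample:", pairs)
    # Hypothetical lane yield of 150 million read pairs.
    print("samples per lane:", samples_per_lane(150_000_000, pairs))
```

This relation does not carry over directly to RNA-Seq, where the required depth depends on expression levels rather than on genome size.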

9 Read length:
- Again, completely dependent on experiment and organism.
- Longer is usually better.
- But sometimes short is good enough.

10 Selecting a technology:
Based on Read / Library Type
- Illumina: Paired, Single, Mate Pair
- Ion Torrent: Paired, Single, Mate Pair
- SOLiD: Single, Mate Pair
- 454: Single, Mate Pair
- PacBio: Single
Read Length: Illumina bp; Ion Torrent bp ( bp for Paired); bp; PacBio 1000+ bp; SOLiD 75 bp

11 Selecting a technology:
Read Number (manufacturer's claims, and machine dependent): Illumina Gigabases; Ion Torrent Million reads; million reads; PacBio ?; SOLiD Gigabases

12 Sample Prep (RNA-Seq Specific)
Sample Collection and Storage:
- RNA-Later: stabilization buffer; 1 month storage time at RT. Good for field collection.
- Liquid Nitrogen: fast, cheap, and effective as long as you have constant access.
RNA extraction:
- Some sequencing centers only want total RNA so that they can verify sample quality before library prep.

13 Sample Prep (RNA-Seq Specific)
rRNA Depletion:
- Poly-A Enrichment: the poly-A tails of mRNA are used to enrich a sample (most common).
- rRNA depletion: rRNA is actively bound and removed (important if a large amount of rRNA is present).
cDNA Library:
- Non-Stranded: total RNA is used for cDNA library construction. Strand information is not preserved.
- Stranded: strand information is preserved. Crucial in organisms with overlapping genes.

14 Library Prep
It is common to have a sequencing center do this step for you, but depending on budget and experience you may want to do it yourself.
- Fragment DNA: sonication or enzyme-based methods, followed by size selection
- DNA Repair: blunting + A overhang
- Ligate Adaptors
- Attachment Site: PCR addition of the attachment site to one end
- Barcode Attachment: PCR addition of the barcode and attachment site to the other end
- Clean Up: remove unligated adapters, etc.

15 Sequencing
Send your samples off to the sequencing center. You'll get raw data back when it's done.

16 Quality Control of Raw Data
Need to measure:
- Proportion of high-quality bases called
- Distribution of called nucleotides
- Number of reads that are of high overall quality
- Distribution of read qualities at each position
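These measurements all start from the per-base Phred scores stored in each FASTQ record, where Q = -10 log10(p) and p is the probability that the base call is wrong. The sketch below decodes a quality string and computes two of the summaries listed above. It assumes the Sanger / Illumina 1.8+ encoding (ASCII offset 33); the function names and the Q30 threshold are illustrative choices.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def mean_quality(quality_string, offset=33):
    """Average Phred score across one read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


def fraction_high_quality(quality_string, threshold=30, offset=33):
    """Fraction of bases in one read at or above the threshold (assumed Q30)."""
    scores = phred_scores(quality_string, offset)
    return sum(q >= threshold for q in scores) / len(scores)


if __name__ == "__main__":
    qual = "IIII5!"  # made-up quality string for illustration
    print(phred_scores(qual))
    print(mean_quality(qual))
    print(fraction_high_quality(qual))
```

FastQC reports the same kind of summaries aggregated over all reads and over each position in the read.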

17 Trimming and Filtering Reads
It is common practice to:
- remove reads with overall poor quality
- trim the ends of reads to remove low-quality sequence
- remove low-quality nucleotides
There are compelling arguments for doing this later, but in general it is safe to do these steps before you align reads.
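As a minimal illustration of end trimming and read filtering, the sketch below removes low-quality bases from the 3' end of a read and then decides whether the remaining read is long enough to keep. It is a deliberately simple rule, not the algorithm of any particular trimming tool; the Q20 threshold, the 25 bp minimum length, and the offset-33 quality encoding are assumptions.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut bases off the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_length=25):
    """Keep a trimmed read only if it is still at least min_length bases long."""
    return len(seq) >= min_length


if __name__ == "__main__":
    # Made-up read: the last two bases have very low quality ('#' = Q2, '!' = Q0).
    seq, qual = trim_3prime("ACGTAC", "IIII#!")
    print(seq, qual, keep_read(seq, min_length=4))
```

For paired-end data, both mates are normally dropped together when one fails the filter so that the two files stay in sync.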

18 What comes next?
[Figure. Source: Eyras, Eduardo; P. Alamancos, Gael; Agirre, Eneritz (2013): Methods to Study Splicing from RNA-Seq. figshare.]

19 What comes next?
[Figure. Source: Eyras, Eduardo; P. Alamancos, Gael; Agirre, Eneritz (2013): Methods to Study Splicing from RNA-Seq. figshare.]

20 Learning Objectives
- RNA-seq data quality control (FastQC)
- Align sequence reads to a reference genome using TopHat
- Review samtools and file format conversion
- View alignments in IGV
- Analyze differential gene expression (in the R environment)

21 Analysis workflow

22 Tools
- bowtie2
- tophat2
- FastQC
- samtools
- R and required Bioconductor packages (DESeq)
- RStudio
- HTSeq
- Integrative Genomics Viewer (IGV)
- Java
Most of these required tools are already installed in my bin folder:
    /lustre/home/qjia2/bin

23 The data
The data used in this tutorial comes from this paper: Trapnell C, et al: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols 2012, 7(3). Pubmed
It was generated in silico for Drosophila melanogaster and contains 6 paired-end samples corresponding to 3 biological replicates each of 2 conditions. For more details, please click here.

File name                        Description
C1_R1_1.fq.gz, C1_R1_2.fq.gz     Simulated Condition 1, replicate 1
C1_R2_1.fq.gz, C1_R2_2.fq.gz     Simulated Condition 1, replicate 2
C1_R3_1.fq.gz, C1_R3_2.fq.gz     Simulated Condition 1, replicate 3
C2_R1_1.fq.gz, C2_R1_2.fq.gz     Simulated Condition 2, replicate 1
C2_R2_1.fq.gz, C2_R2_2.fq.gz     Simulated Condition 2, replicate 2
C2_R3_1.fq.gz, C2_R3_2.fq.gz     Simulated Condition 2, replicate 3

24 Download the reference genome and gene model annotations
You also need the reference genome and gene model annotations (GTF models), which can be downloaded from Ensembl or Illumina:
    wget ftp://ftp.ensembl.org/pub//mnt2/release-75/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa.gz
    wget ftp://ftp.ensembl.org/pub//mnt2/release-75/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.75.gtf.gz
    gunzip Drosophila_melanogaster.BDGP5.75.*
Index your reference genome:
    /lustre/home/qjia2/bin/bowtie2-build -f Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa Dme_BDGP5_75
After executing the command, the following BT2 files will be created:
    Dme_BDGP5_75.1.bt2
    Dme_BDGP5_75.2.bt2
    Dme_BDGP5_75.3.bt2
    Dme_BDGP5_75.4.bt2
    Dme_BDGP5_75.rev.1.bt2
    Dme_BDGP5_75.rev.2.bt2
For model species, you can download pre-built Bowtie and Bowtie 2 indexes from the Bowtie website.

25 Create links to the required data
The required files are stored in the following directory on Newton:
    /data/scratch/qjia2/data2012
In your working directory, you can create links to these files so that you don't need to copy them into your own folders. To create the links, type the following commands from your working directory:
    ln -s /data/scratch/qjia2/data2012/Dme_BDGP5_75.* .
    ln -s /data/scratch/qjia2/data2012/genes.gtf .
    ln -s /data/scratch/qjia2/data2012/GSM79448* .
Then, type:
    ls
You will see the linked files.

26 Assess data quality
In this workshop, we'll use FastQC to check the quality and integrity of the RNA-seq reads. FastQC provides a simple way to run quality control checks on raw sequence data coming from high-throughput sequencing pipelines. Its modular set of analyses gives a quick impression of whether your data has any problems you should be aware of before doing further analysis.
Create a directory to store the output files:
    mkdir fastqc_reports
Run FastQC:
    /lustre/home/qjia2/bin/fastqc -f fastq -o fastqc_reports *.fq.gz
Inspect the output: FastQC generates an HTML report for each input file, which you need to view in your web browser.
Example reports:
- FastQC report for a good Illumina dataset
- FastQC report for a bad Illumina dataset

27 Align RNA-seq reads to the genome using TopHat2
Create a job definition file called C1R1.sge:
    #$ -N C1R1
    #$ -q medium*
    #$ -cwd
    #$ -pe threads 8
    /home/qjia2/bin/tophat2 -G genes.gtf -o C1_R1_thout Dme_BDGP5_75 GSM794483_C1_R1_1.fq.gz GSM794483_C1_R1_2.fq.gz
Submit the job using the qsub command:
    qsub C1R1.sge
Use the qstat command to check the status of your jobs:
    qstat
Kill your job:
    qdel your_job_pid
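The same job file is needed for each of the six samples, so it can be convenient to generate them rather than edit each by hand. The sketch below builds the job text from the name of a sample's first-mate FASTQ file, reusing the directives and the tophat2 command from the C1R1.sge example. It assumes every sample follows the GSM..._C?_R?_1.fq.gz / _2.fq.gz naming pattern of that example; make_job and write_jobs are hypothetical helper names.

```python
from pathlib import Path

# Same directives and command as the C1R1.sge example, with placeholders.
JOB_TEMPLATE = """#$ -N {name}
#$ -q medium*
#$ -cwd
#$ -pe threads 8
/home/qjia2/bin/tophat2 -G genes.gtf -o {outdir} Dme_BDGP5_75 {fq1} {fq2}
"""


def make_job(fq1):
    """Return (job_name, job_text) for a first-mate file such as
    GSM794483_C1_R1_1.fq.gz (assumed naming pattern)."""
    suffix = "_1.fq.gz"
    if not fq1.endswith(suffix):
        raise ValueError("expected a file name ending in " + suffix)
    stem = fq1[: -len(suffix)]          # e.g. GSM794483_C1_R1
    label = stem.split("_", 1)[1]       # e.g. C1_R1
    name = label.replace("_", "")       # e.g. C1R1
    text = JOB_TEMPLATE.format(
        name=name, outdir=label + "_thout", fq1=fq1, fq2=stem + "_2.fq.gz"
    )
    return name, text


def write_jobs(directory="."):
    """Write one .sge file per *_1.fq.gz file found in the directory."""
    written = []
    for path in sorted(Path(directory).glob("*_1.fq.gz")):
        name, text = make_job(path.name)
        job_file = Path(directory) / (name + ".sge")
        job_file.write_text(text)
        written.append(job_file.name)
    return written


if __name__ == "__main__":
    print(write_jobs("."))
```

Each generated file can then be submitted with qsub in the same way as C1R1.sge.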

28 TopHat2 output
tophat2 produces a number of files, most of which are internal, intermediate files generated for use within the pipeline. The output files you will likely want to look at are:
- accepted_hits.bam: details the alignments for mapped reads
- align_summary.txt
- deletions.bed
- insertions.bed
- junctions.bed: contains all the splice sites detected by TopHat during the alignment
- logs/
- prep_reads.info
- unmapped.bam
The accepted_hits.bam file is used for our further analysis. This file is not human-readable, but we can use Samtools to convert it to the .sam format. Next, we'll talk about Samtools and then use IGV to look at our alignments.

29 samtools
SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
    samtools

    Program: samtools (Tools for alignments in the SAM format)
    Version: cd
    Usage:   samtools <command> [options]

    Command: view        SAM<->BAM conversion
             sort        sort alignment file
             mpileup     multi-way pileup
             depth       compute the depth
             faidx       index/extract FASTA
             tview       text alignment viewer
             index       index alignment
             idxstats    BAM index stats (r595 or later)
             fixmate     fix mate information
             flagstat    simple stats
             calmd       recalculate MD/NM tags and '=' bases
             merge       merge sorted alignments
             rmdup       remove PCR duplicates
             reheader    replace BAM header
             cat         concatenate BAMs
             bedcov      read depth per BED region
             targetcut   cut fosmid regions (for fosmid pool only)
             phase       phase heterozygotes
             bamshuf     shuffle and group alignments by name

30 File manipulation
To analyse differential expression, we need to count the reads that align to each gene. The htseq-count script needs name-sorted .sam files as input, so run the following commands to sort the alignments by read name and create the .sam file:
    samtools sort -n C1_R1_thout/accepted_hits.bam C1_R1_sn
    samtools view -o C1_R1_sn.sam C1_R1_sn.bam
In order to view the alignments in IGV, the .bam files must be sorted by position and indexed:
    samtools sort C1_R1_thout/accepted_hits.bam C1_R1_s
    samtools index C1_R1_s.bam

31 View alignments in the IGV
1. Start the IGV software
If you haven't installed it or have trouble starting it, please click here.
2. Load genome and gene annotation into IGV
Under the Main Menu, click Genomes -> Create .genome File, and the following window will appear:

32 View alignments in the IGV - cont.
3. Load mapped reads into IGV
Under the Main Menu, click on File -> Load from File. Choose C1_R1_s.bam, and wait for IGV to finish loading.
4. Navigate in IGV
For further details, see the IGV user guide.

33 Count reads in features with htseq-count
HTSeq is a Python package, so it can be used as a library. It also provides a set of stand-alone scripts that we can use from the command line. The script called htseq-count will be used to count the reads overlapping known genes. It accepts .sam files and a genome annotation file (GTF format) as inputs.
    htseq-count -s no -a 10 C1_R1_sn.sam genes.gtf > C1_R1.count
-s: whether the data is from a strand-specific assay (default: yes)
-a: skip all reads with alignment quality lower than the given minimum value (default: 10)
It outputs a table with counts for each feature.
[Example output: one row per FBgn gene ID with its read count; values not preserved.]
After running this command on the other five samples, merge the htseq-count files into one (mergedcounts.txt).
[Example merged table with columns gene_id, C1R1, C1R2, C1R3, C2R1, C2R2, C2R3; values not preserved.]
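The slide does not show how the six count files are merged; one possible way is sketched below. It reads the two-column, tab-separated htseq-count output for each sample, skips the summary counters that htseq-count appends at the end of its output (written with or without a leading "__" depending on the HTSeq version), and writes one table with a gene_id column followed by one column per sample. The helper names and the per-sample file names other than C1_R1.count are assumptions.

```python
import csv

# Summary counters written by htseq-count; newer versions prefix them with "__".
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}


def read_counts(lines):
    """Parse htseq-count output (an iterable of tab-separated lines)."""
    counts = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        gene, value = row
        if gene.startswith("__") or gene in SPECIAL:
            continue  # skip summary counters
        counts[gene] = int(value)
    return counts


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into a header and sorted rows."""
    samples = list(per_sample)
    genes = sorted(set().union(*[set(c) for c in per_sample.values()]))
    header = ["gene_id"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


def merge_files(sample_to_path, out_path):
    """Read one count file per sample and write the merged table."""
    per_sample = {}
    for sample, path in sample_to_path.items():
        with open(path, newline="") as handle:
            per_sample[sample] = read_counts(handle)
    header, rows = merge_counts(per_sample)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
```

A call such as merge_files({"C1R1": "C1_R1.count", "C1R2": "C1_R2.count"}, "mergedcounts.txt"), extended to all six samples, would produce a file that read.table can load with header = TRUE and row.names = 1.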

34 Find differentially expressed genes (DESeq)
The commands used here are also described in the DESeq vignette (PDF).
1. Start R and load the required package
    R
    library("DESeq")
2. Set your working directory
    # Make sure you are in the For_DESeq directory.
    setwd("/users/mac/documents/rna_seq/files/dataset/for_deseq")
    # You can use getwd() to check your current working directory.
    getwd()
3. Read in your count table
    CountTable = read.table("mergedcounts.txt", header = TRUE, row.names = 1)
Your table should look like this:
    head(CountTable)
    ## C1R1 C1R2 C1R3 C2R1 C2R2 C2R3
[Output: six rows of per-sample counts, one FBgn gene ID per row; values not preserved.]

35 Find differentially expressed genes (DESeq) - cont.
4. Add treatment information to the data
    condition = factor(c("C1", "C1", "C1", "C2", "C2", "C2"))
    condition
    ## [1] C1 C1 C1 C2 C2 C2
    ## Levels: C1 C2
5. Create a new CountDataSet
    cds <- newCountDataSet(CountTable, condition)
6. Estimate the size factors from the count data (normalization)
    cds <- estimateSizeFactors(cds)
To see these size factors, do this:
    sizeFactors(cds)
    ## C1R1 C1R2 C1R3 C2R1 C2R2 C2R3
[Output: one size factor per sample; values not preserved.]
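To make the normalization step less of a black box, the sketch below reimplements the median-of-ratios idea behind DESeq's size factors in plain Python: for each gene, take the geometric mean of its counts across samples, divide each sample's count by that mean, and use the median of those ratios within each sample as its size factor. This illustrates the method only and is not the package's code; the function name and the toy counts are made up.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene rows, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


if __name__ == "__main__":
    # Toy table: sample 2 was sequenced twice as deeply as sample 1.
    toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
    print(size_factors(toy))
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale, which is what counts(cds, normalized = TRUE) returns in DESeq.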

36 Find differentially expressed genes (DESeq) - cont.
Then, we can normalize the counts by the size factors using the following command:
    head(counts(cds, normalized = TRUE))
    ## C1R1 C1R2 C1R3 C2R1 C2R2 C2R3
[Output: six rows of normalized counts, one FBgn gene ID per row; values not preserved.]
7. Calculate dispersion values
    cds <- estimateDispersions(cds)
8. Inspect the estimated dispersions
    plotDispEsts(cds)

37 Find differentially expressed genes (DESeq) - cont.
9. Perform the test for differential expression
    deg = nbinomTest(cds, "C1", "C2")
10. Plot the log2 fold changes against the mean normalised counts
    plotMA(deg)

38 Find differentially expressed genes (DESeq) - cont.
11. Plot a histogram of the p values
    hist(deg$pval, breaks = 100, col = "skyblue", main = "")
12. Filter for significant genes at a 10% false discovery rate (FDR)
    degsig = deg[deg$padj < 0.1, ]
Count the number of significant genes:
    addmargins(table(deg$padj < 0.1))
    ## FALSE TRUE Sum
[Output: counts of non-significant and significant genes; values not preserved.]
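The padj column that this filter uses holds p values adjusted for multiple testing with the Benjamini-Hochberg procedure. The sketch below shows that adjustment in plain Python so the 10% FDR cut-off is easier to interpret. It is an illustration under simplifying assumptions (no missing values), and the p values in the example are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    n = len(pvalues)
    # Indices of the p values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, so each adjusted value is the
    # minimum of p * n / rank over this rank and all larger ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p values
    padj = benjamini_hochberg(pvals)
    print(padj)
    print(sum(p < 0.1 for p in padj), "genes pass a 10% FDR cut-off")
```

Selecting genes with padj < 0.1 means that about 10% of the selected genes are expected to be false positives.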

39 Find differentially expressed genes (DESeq) - cont.
13. Look at the significantly upregulated and downregulated genes
    head(degsig[order(degsig$log2FoldChange, decreasing = TRUE), ])
    ## id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
[Output: the six genes with the largest log2 fold changes; values not preserved.]
    head(degsig[order(degsig$log2FoldChange, decreasing = FALSE), ])
    ## id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
[Output: the six genes with the smallest log2 fold changes; values not preserved.]

40 Find differentially expressed genes (DESeq) - cont.
14. Save our output to a file
    write.csv(deg, file = "Result_table.csv")
    write.csv(degsig, file = "Result_table_0.1FDR.csv")
You can use a spreadsheet program such as Excel to open .csv files.

41 References
1. S. Anders, D. J. McCarthy, Y. S. Chen, M. Okoniewski, G. K. Smyth, W. Huber, M. D. Robinson, Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols 8 (2013).
2. C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, L. Pachter, Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7 (2012).
3. DESeq vignette.