Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics
Sequencing Illumina platforms Characteristics: High throughput (3 genomes). Paired end. High accuracy. Read length 2 125bp. Relatively long run time (6 days). Relatively expensive. Figure 1: HiSeq 2500. Workshop NGS, Hogeschool Leiden 1/50 Thursday, May 22, 2014
Sequencing Illumina platforms Characteristics: Moderate throughput (3 exomes). Paired end. High accuracy. Read length 2 300bp. Relatively short run time (3 days). Relatively expensive. Figure 2: MiSeq. Workshop NGS, Hogeschool Leiden 2/50 Thursday, May 22, 2014
Sequencing Illumina platforms Figure 3: Flowcell. Figure 4: Amplification. Workshop NGS, Hogeschool Leiden 3/50 Thursday, May 22, 2014
Sequencing Illumina platforms Figure 3: Flowcell. Optical system. DNA fragments are loaded on a flowcell. Fragments are amplified. Clusters of fragments give the same signal. Figure 4: Amplification. Workshop NGS, Hogeschool Leiden 3/50 Thursday, May 22, 2014
Sequencing Illumina platforms Figure 5: Image of tile (part of the flowcell). Workshop NGS, Hogeschool Leiden 4/50 Thursday, May 22, 2014
Sequencing Illumina platforms Figure 6: Base calling on Illumina systems. Workshop NGS, Hogeschool Leiden 5/50 Thursday, May 22, 2014
Sequencing Illumina platforms Figure 7: Paired end. Workshop NGS, Hogeschool Leiden 6/50 Thursday, May 22, 2014
Sequencing Illumina platforms Advantages: Mapping to non-unique regions. Structural variation. De novo assembly (scaffolding). Figure 7: Paired end. Workshop NGS, Hogeschool Leiden 6/50 Thursday, May 22, 2014
Life-Tech Sequencing Characteristics: Low throughput. Single end (for now). High accuracy. Read length ±400bp. Short run time (1 day). Cheap runs. Figure 8: Ion torrent. Workshop NGS, Hogeschool Leiden 7/50 Thursday, May 22, 2014
Life-Tech Sequencing Figure 9: Ion proton. Characteristics: Moderate throughput (1 exome). Single end (for now). High accuracy. Read length ±175bp. Short run time (1 day). Cheap runs. Workshop NGS, Hogeschool Leiden 8/50 Thursday, May 22, 2014
Life-Tech Sequencing Figure 10: Ion Torrent chip. Figure 11: Ion Proton chip. Workshop NGS, Hogeschool Leiden 9/50 Thursday, May 22, 2014
Life-Tech Sequencing Figure 12: Close up of an Ion chip. Workshop NGS, Hogeschool Leiden 10/50 Thursday, May 22, 2014
Life-Tech Sequencing Figure 14: Base calling. Figure 13: Loading a sample. Problems with mono nucleotide stretches. Workshop NGS, Hogeschool Leiden 11/50 Thursday, May 22, 2014
Sequencing Pacific Biosciences Figure 15: PacBio RS. Workshop NGS, Hogeschool Leiden 12/50 Thursday, May 22, 2014
Sequencing Pacific Biosciences Characteristics: Long reads (50% is longer than 10,000 bp). High error rate (15-20%). Low throughput (comparable with the Roche 454). Workshop NGS, Hogeschool Leiden 13/50 Thursday, May 22, 2014
Sequencing Pacific Biosciences Characteristics: Long reads (50% is longer than 10,000 bp). High error rate (15-20%). Low throughput (comparable with the Roche 454). Circular consensus sequencing. Sequence the same molecule several times. Extremely high accuracy. Acceptable read length (±250bp). Workshop NGS, Hogeschool Leiden 13/50 Thursday, May 22, 2014
Sequencing Pacific Biosciences Figure 16: Real time polymerase monitoring. Workshop NGS, Hogeschool Leiden 14/50 Thursday, May 22, 2014
Sequencing Pacific Biosciences Figure 17: Base calling on the PacBio. Workshop NGS, Hogeschool Leiden 15/50 Thursday, May 22, 2014
Data analysis Next generation sequencing data 1 @SGGPP:4:101 2 TTCGGGGGCTGGCAAATCCACTTCCGTGACACGCTACCATTCGCTGGTGGT 3 + 4 +4589,53330 0&07+03:54/2362 +.488587>@/25440++0(+ 5 @SGGPP:4:102 6 CGGTAAACCACCCTGCTGACGGAACCCTAATGCGCCTGAAAGACAGCGTTC 7 + 8 34/ 0 +.000(.55:;:99(0(+2(22(0316;185;;0;:<<>=AA59 9 @SGGPP:4:106 10 TCGTTAACGACTTTGTTCGCCACCGCAACCGCCTGTTTCGGGTCACAGGCA 11 + 12 09875;5? <;?@A4?B:BBB<AA>CCC>C>BB0. >=0488+3444:@5@< 13 @SGGPP:4:112 14 TTGATGAATATATTATTTCAGGGAATAATTATGACACCTTTAGAACGCATT 15 + 16 70<<@::5:<;==7;>>/79<:.:494.8(,,8:753/5@5??C>B???B7 Listing 1: A FastQ file. Workshop NGS, Hogeschool Leiden 16/50 Thursday, May 22, 2014
Data analysis Common data analysis techniques Reference based techniques: Resequencing. Exome sequencing. Transcriptome sequencing (RNA). Full genome sequencing. Structural variation. Copy number variation. Not reference based: De novo assembly. k-mer profiling. Workshop NGS, Hogeschool Leiden 17/50 Thursday, May 22, 2014
Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014
Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014
Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014
Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. 3. Variant calling. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014
Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. 3. Variant calling. 4. Filtering. Post-variant calling quality control. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014
Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. 3. Variant calling. 4. Filtering. Post-variant calling quality control. 5. Annotation. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014
Pre-alignment Data cleaning Depending on the sequencing platform, parts of the reads need to be removed. Remove linker sequences (Cutadapt, FASTX toolkit). Clip low quality reads at the end of the read (Sickle, Trimmomatic, FASTX toolkit). Length filtering (Fastools). http://code.google.com/p/cutadapt/ http://hannonlab.cshl.edu/fastx toolkit/ https://github.com/najoshi/sickle http://www.usadellab.org/cms/index.php?page=trimmomatic https://pypi.python.org/pypi/fastools Workshop NGS, Hogeschool Leiden 19/50 Thursday, May 22, 2014
Trimming Pre-alignment Figure 18: Quality score per position. Workshop NGS, Hogeschool Leiden 20/50 Thursday, May 22, 2014
Clipping Pre-alignment Figure 19: Sequencing linkers. Workshop NGS, Hogeschool Leiden 21/50 Thursday, May 22, 2014
Pre-alignment Quality control The FastQC toolkit can be used for quality control (both before and after the data cleaning step). GC content. GC distribution. Quality scores distribution.... http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Workshop NGS, Hogeschool Leiden 22/50 Thursday, May 22, 2014
Figure 21: Per sequence quality. Workshop NGS, Hogeschool Leiden 23/50 Thursday, May 22, 2014 Pre-alignment Example QC output Figure 20: Per base sequence content.
Alignment The best match to the reference genome Very efficient. The reference genome needs to be indexed. Finding an alignment is as easy as looking up a word in a dictionary. Figure 22: Visualisation of an alignment. Workshop NGS, Hogeschool Leiden 24/50 Thursday, May 22, 2014
Alignment Choose an aligner Not all aligners can deal with indels. Only a couple of years ago, only SNPs were considered. Bowtie. http://bowtie-bio.sourceforge.net/index.shtml http://research-pub.gene.com/gmap/ http://tophat.cbcb.umd.edu/ http://bio-bwa.sourceforge.net/ Workshop NGS, Hogeschool Leiden 25/50 Thursday, May 22, 2014
Alignment Choose an aligner Not all aligners can deal with indels. Only a couple of years ago, only SNPs were considered. Bowtie. Few aligners can work with large deletions. Spliced RNA. GMAP / GSNAP. Tophat. BWA-MEM. http://bowtie-bio.sourceforge.net/index.shtml http://research-pub.gene.com/gmap/ http://tophat.cbcb.umd.edu/ http://bio-bwa.sourceforge.net/ Workshop NGS, Hogeschool Leiden 25/50 Thursday, May 22, 2014
Alignment Choose an aligner The choice of aligner may be restricted by the sequencer. For the Ion Torrent: Tmap. Combination of three different aligners. Deals with errors in homopolymer stretches. For the PacBio: BLASR. https://github.com/iontorrent/ts/tree/master/analysis/tmap https://github.com/pacificbiosciences/blasr Workshop NGS, Hogeschool Leiden 26/50 Thursday, May 22, 2014
Variant calling Consistent deviations from the reference Figure 23: Result of an alignment. Workshop NGS, Hogeschool Leiden 27/50 Thursday, May 22, 2014
Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014
Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Complicating factors: Pooled samples. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014
Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Complicating factors: Pooled samples. RNA. Allele specific expression. RNA editing. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014
Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Complicating factors: Pooled samples. RNA. Allele specific expression. RNA editing. Strand specific sampleprep. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014
Variant calling Choice of variant caller Rules of thumb: Well known organism and experiment: Statistical model. Use a simpler variant caller otherwise. http://samtools.sourceforge.net/ https://www.broadinstitute.org/gatk/ http://varscan.sourceforge.net/ Workshop NGS, Hogeschool Leiden 29/50 Thursday, May 22, 2014
Variant calling Choice of variant caller Rules of thumb: Well known organism and experiment: Statistical model. Use a simpler variant caller otherwise. Popular variant callers: Samtools. GATK. VarScan. http://samtools.sourceforge.net/ https://www.broadinstitute.org/gatk/ http://varscan.sourceforge.net/ Workshop NGS, Hogeschool Leiden 29/50 Thursday, May 22, 2014
Variant filtering Filtering on coverage We can set some thresholds: Minimum. Maximum. Workshop NGS, Hogeschool Leiden 30/50 Thursday, May 22, 2014
Variant filtering Filtering on coverage We can set some thresholds: Minimum. Maximum. We filter for a maximum coverage because of copy number variation. Workshop NGS, Hogeschool Leiden 30/50 Thursday, May 22, 2014
Variant filtering Filtering on coverage We can set some thresholds: Minimum. Maximum. We filter for a maximum coverage because of copy number variation. A good way to calculate the maximum: Calculate the mean coverage. Only of the covered (targeted) regions. Multiply this number with a reasonable factor e.g., 2.5. Workshop NGS, Hogeschool Leiden 30/50 Thursday, May 22, 2014
Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014
Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014
Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Is it in the coding region? Is there a gain/loss of a stop codon? Does the variant result in a frameshift?... Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014
Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Is it in the coding region? Is there a gain/loss of a stop codon? Does the variant result in a frameshift?... Is it in the 5 /3 UTR of a gene?... Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014
Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Is it in the coding region? Is there a gain/loss of a stop codon? Does the variant result in a frameshift?... Is it in the 5 /3 UTR of a gene?... Is it in a regulatory region?... Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014
Full genome sequencing Copy number variation Figure 24: Coverage patterns over a whole chromosome. Workshop NGS, Hogeschool Leiden 32/50 Thursday, May 22, 2014
Full genome sequencing Copy number variation Per sample: The reference needs to be very good. Sequencability biases. Mapping biases. Workshop NGS, Hogeschool Leiden 33/50 Thursday, May 22, 2014
Full genome sequencing Copy number variation Per sample: The reference needs to be very good. Sequencability biases. Mapping biases. Within a population: Mixture of distributions. Not sensitive to aforementioned biases. Needs a lot of controls. Workshop NGS, Hogeschool Leiden 33/50 Thursday, May 22, 2014
Full genome sequencing Structural variation Figure 25: Multiple issues while mapping. Workshop NGS, Hogeschool Leiden 34/50 Thursday, May 22, 2014
Full genome sequencing Structural variation Figure 26: Discordant and split reads. http://breakdancer.sourceforge.net/ http://sourceforge.net/projects/pindel/ Workshop NGS, Hogeschool Leiden 35/50 Thursday, May 22, 2014
Full genome sequencing Structural variation Figure 27: Different types of structural variation. Workshop NGS, Hogeschool Leiden 36/50 Thursday, May 22, 2014
Assesmbly De Novo assembly Figure 28: General overview of de novo assembly. Workshop NGS, Hogeschool Leiden 37/50 Thursday, May 22, 2014
Assesmbly De Novo assembly Figure 29: Overlaps are used to reconstruct a genome. Workshop NGS, Hogeschool Leiden 38/50 Thursday, May 22, 2014
Scaffolding De Novo assembly Figure 30: Paired end or mate pair reads can be used. Workshop NGS, Hogeschool Leiden 39/50 Thursday, May 22, 2014
De Novo assembly Easier assembly with PacBio Figure 31: Correcting PacBio reads. Workshop NGS, Hogeschool Leiden 40/50 Thursday, May 22, 2014
Pipelines Pipelines Figure 32: A real-life pipeline. Workshop NGS, Hogeschool Leiden 41/50 Thursday, May 22, 2014
Pipelines Pipelines Figure 33: Scene from Modern times. Workshop NGS, Hogeschool Leiden 42/50 Thursday, May 22, 2014
Pipelines Pipelines Combining tools: The output of one tool can serve as the input for another. Not necessarily linear.... Workshop NGS, Hogeschool Leiden 43/50 Thursday, May 22, 2014
Pipelines Pipelines Combining tools: The output of one tool can serve as the input for another. Not necessarily linear.... Running various different tools: Two or three different aligners. A couple of variant callers.... Workshop NGS, Hogeschool Leiden 43/50 Thursday, May 22, 2014
Pipelines Combining tools 1 bwa aln t 8 $reference $i > $i. sai 2 bwa samse $reference $i. sai $i > $i.sam 3 samtools view bt $reference o $i.bam $i.sam Listing 2: Shell script Workshop NGS, Hogeschool Leiden 44/50 Thursday, May 22, 2014
Pipelines Combining tools 1 bwa aln t 8 $reference $i > $i. sai 2 bwa samse $reference $i. sai $i > $i.sam 3 samtools view bt $reference o $i.bam $i.sam Listing 2: Shell script 1 %. sai : %.fq 2 $ (BWA) aln t $ (THREADS) $ ( c a l l MKREF, $@) $< > $@ 3 4 %.sam: %. sai %.fq 5 $ (BWA) samse $ ( c a l l MKREF, $@) $ ˆ > $@ 6 7 %.bam: %.sam 8 $ (SAMTOOLS) view bt $ ( c a l l MKREF, $@) o $@ $< Listing 3: Makefile Workshop NGS, Hogeschool Leiden 44/50 Thursday, May 22, 2014
Graphical interfaces Galaxy Galaxy: a graphical user interface: Wrapper for command line utilities. User friendly. Point and click. http://galaxy.psu.edu/ Workshop NGS, Hogeschool Leiden 45/50 Thursday, May 22, 2014
Galaxy Graphical interfaces Galaxy: a graphical user interface: Wrapper for command line utilities. User friendly. Point and click. Workflows. Save all the steps you did in your analysis. Rerun the entire analysis on a new dataset. Share your workflow with other people.... http://galaxy.psu.edu/ Workshop NGS, Hogeschool Leiden 45/50 Thursday, May 22, 2014
Galaxy Graphical interfaces Figure 34: Galaxy main user interface Workshop NGS, Hogeschool Leiden 46/50 Thursday, May 22, 2014
Galaxy Graphical interfaces Figure 35: User friendly interface with Galaxy Workshop NGS, Hogeschool Leiden 47/50 Thursday, May 22, 2014
Graphical interfaces Workflow of a parallel pipeline Makefile blueprint all create_report.secondary Makefile.SUFFIXES.DEFAULT recursive clean MC0184ACXX_s_1.h_cov MC0184ACXX_s_1.tar.gz MC0184ACXX_s_1.final.snponly.vcf.annotation MC0184ACXX_s_1.final.indelonly.vcf.annotation MC0184ACXX_s_1.covPerTarget MC0184ACXX_s_1.hsMetrics MC0184ACXX_s_1.mtdna_mincov.bed MC0184ACXX_s_1.asomes_avgcov.bed MC0184ACXX_s_1.xysomes_avgcov.bed MC0184ACXX_s_1.final.vcf.ti-tv MC0184ACXX_s_1.final.snponly.vcf MC0184ACXX_s_1.final.indelonly.vcf MC0184ACXX_s_1.final.vcf.vdb_annotation /sureselect_whole_exome_v2.hg19.cbr.bed.interval MC0184ACXX_s_1.asomes_mincov.bed MC0184ACXX_s_1.xysomes_mincov.bed MC0184ACXX_s_1.asomes_cnvhigh.bed MC0184ACXX_s_1.xysomes_cnvhigh.bed MC0184ACXX_s_1.final.vcf MC0184ACXX_s_1.insertSizesHist.pdf MC0184ACXX_s_1.pileup MC0184ACXX_s_1.insertSizes MC0184ACXX_s_1.mincov.bed MC0184ACXX_s_1.asomes_cutoff_steepcov.bed MC0184ACXX_s_1.xysomes_cutoff_steepcov.bed MC0184ACXX_s_1.asome.vcf MC0184ACXX_s_1.xysome.vcf MC0184ACXX_s_1.mtdna.vcf MC0184ACXX_s_1.bases.aligned MC0184ACXX_s_1.wig MC0184ACXX_s_1.asome_cutoff.vcf MC0184ACXX_s_1.xysome_cutoff.vcf MC0184ACXX_s_1.mtdna_cutoff.vcf MC0184ACXX_s_1.sorted.dedup.bam.flagstat MC0184ACXX_s_1.readsPerChr MC0184ACXX_s_1.sorted.dedup.bam.bai MC0184ACXX_s_1.sorted.dedup.bam.pileup MC0184ACXX_s_1.asome.targeted.avg-cov MC0184ACXX_s_1.xysome.targeted.avg-cov MC0184ACXX_s_1.bases.loose-ontarget MC0184ACXX_s_1.loose-ontarget.bam.flagstat MC0184ACXX_s_1.bases.strong-ontarget MC0184ACXX_s_1.strong-ontarget.bam.flagstat MC0184ACXX_s_1.bcf MC0184ACXX_s_1.sorted.dedup.bam.sorted.dedup.bam H.Sapiens/hg19/reference.fa /sureselect_whole_exome_v2.hg19.cbr.bed MC0184ACXX_s_1.sorted.dedup.bam MC0184ACXX_s_1.sorted.dedup.bam.sorted.bam MC0184ACXX_s_1.sorted.bam MC0184ACXX_s_1_1_sequence.filtered_fastqc MC0184ACXX_s_1_1_sequence_fastqc MC0184ACXX_s_1.sorted.dedup.bam.parts MC0184ACXX_s_1.parts 000.MC0184ACXX_s_1_1_sequence.part 000.MC0184ACXX_s_1_2_sequence.part MC0184ACXX_s_1_2_sequence.filtered_fastqc MC0184ACXX_s_1_1_sequence.part000 MC0184ACXX_s_1_2_sequence.part000 MC0184ACXX_s_1_2_sequence_fastqc MC0184ACXX_s_1_1_sequence.sync MC0184ACXX_s_1_2_sequence.sync MC0184ACXX_s_1_1_sequence.filtered MC0184ACXX_s_1_2_sequence.filtered MC0184ACXX_s_1_1_sequence.trimmed MC0184ACXX_s_1_2_sequence.txt.recaldata MC0184ACXX_s_1_1_sequence.txt.recaldata MC0184ACXX_s_1_2_sequence.trimmed MC0184ACXX_s_1_1_sequence.txt MC0184ACXX_s_1_2_sequence.txt Figure 36: Dependency diagram. Workshop NGS, Hogeschool Leiden 48/50 Thursday, May 22, 2014
Graphical interfaces Workflow of a parallel pipeline MC0184ACXX_s_1.sorted.dedup.bam MC0184ACXX_s_1.sorted.bam MC0184ACXX_s_1_1_sequenc MC0184ACXX_s_1.sorted.dedup.bam.parts MC0184ACXX_s_1.parts 000.MC0184ACXX_s_1_1_sequence.part 000.MC0184ACXX_s_1_2_sequence.part MC0184ACXX_s_1_1_sequence.part000 MC0184ACXX_s_1_2_sequence.part000 MC0184ACXX_s_1_1_sequence.sync MC0184ACXX_s_1_2_sequence.sync MC0184ACXX_s_1_1_sequence.filtered MC0184ACXX_s_1_2_sequence.filtered MC0184ACXX_s_1_1_sequence.trimmed MC0184ACXX_s_1_2_sequence.txt.recaldata MC0184ACXX_s_1 MC0184ACXX_s_1_1_sequence.txt Figure 37: Zoomed in. Workshop NGS, Hogeschool Leiden 49/50 Thursday, May 22, 2014
Questions? Acknowledgements: Michiel van Galen Martijn Vermaat Johan den Dunnen Workshop NGS, Hogeschool Leiden 50/50 Thursday, May 22, 2014