Computational Genomics. Next generation sequencing (NGS)



Similar documents
Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Next Generation Sequencing

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

July 7th 2009 DNA sequencing

Next Generation Sequencing: Technology, Mapping, and Analysis

Introduction to next-generation sequencing data

NGS data analysis. Bernardo J. Clavijo

Next generation sequencing (NGS)

RNAseq / ChipSeq / Methylseq and personalized genomics

Expression Quantification (I)

Introduction to NGS data analysis

Genome-wide measurements of protein-dna interaction by chromatin immunoprecipitation

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Reading DNA Sequences:

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Frequently Asked Questions Next Generation Sequencing

Next generation DNA sequencing technologies. theory & prac-ce

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

How is genome sequencing done?

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

Core Facility Genomics

De Novo Assembly Using Illumina Reads

New Technologies for Sensitive, Low-Input RNA-Seq. Clontech Laboratories, Inc.

Next Generation Sequencing for DUMMIES

PreciseTM Whitepaper

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium

Analysis of ChIP-seq data in Galaxy

The Power of Next-Generation Sequencing in Your Hands On the Path towards Diagnostics

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Basic processing of next-generation sequencing (NGS) data

Challenges associated with analysis and storage of NGS data

Next Generation Sequencing data Analysis at Genoscope. Jean-Marc Aury

DNA Sequencing & The Human Genome Project

Concepts and methods in sequencing and genome assembly

Overview of Next Generation Sequencing platform technologies

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

History of DNA Sequencing & Current Applications

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

G E N OM I C S S E RV I C ES

Algorithms for Next Generation Sequencing Data Analysis

MiSeq: Imaging and Base Calling

How long is long enough?

Introduction Bioo Scientific

NGS Technologies for Genomics and Transcriptomics

DNA Sequence Analysis

Lecture 13: DNA Technology. DNA Sequencing. DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology

CHALLENGES IN NEXT-GENERATION SEQUENCING

How many of you have checked out the web site on protein-dna interactions?

PAGANTEC: OPENMP PARALLEL ERROR CORRECTION FOR NEXT-GENERATION SEQUENCING DATA

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Athanasia Pavlopoulou University of Thessaly, Lamia June 2015

Version 5.0 Release Notes

An Overview of DNA Sequencing

Keeping up with DNA technologies

Analysis of DNA methylation: bisulfite libraries and SOLiD sequencing

The NGS IT notes. George Magklaras PhD RHCE

Gene Expression Assays

GeneSifter: Next Generation Data Management and Analysis for Next Generation Sequencing

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

GeneProf and the new GeneProf Web Services

Software Getting Started Guide

How Sequencing Experiments Fail

Introduction. Overview of Bioconductor packages for short read analysis

Partek Methylation User Guide

Genomics GENterprise

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

Searching Nucleotide Databases

Next Generation Sequencing; Technologies, applications and data analysis

NEXT GENERATION SEQUENCING

FOR REFERENCE PURPOSES

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

Universidade Estadual de Maringá

EPIGENETICS DNA and Histone Model

Bioinformatics Unit Department of Biological Services. Get to know us

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Biotechnology: DNA Technology & Genomics

Deep Sequencing Data Analysis

NGS Data Analysis: An Intro to RNA-Seq

Forensic DNA Testing Terminology

Transcription and Translation of DNA

Transcription:

Computational Genomics Next generation sequencing (NGS)

Sequencing technology defies Moore s law Nature Methods 2011

Log 10 (price) Sequencing the Human Genome 2001: Human Genome Project 2.7G$, 11 years 10 8 2007: 454 1M$, 3 months 6 4 2 2001: Celera 100M$, 3 years 2008: ABI SOLiD 60K$, 2 weeks 2009: Illumina, Helicos 40-50K$ 2010: 5K$, a few days 2014: 1000$, <24 hrs 2000 2005 Year 2010 2014 4

6

Applications of next-generation sequencing Nature Biotechnology 26 (10): 1135-1145 (2008)

Roche & illumina analyzers Genome Sequencer FLX Illumina Genome Analyzer

High Parallelism is Achieved in Polony Sequencing Sanger Polony 10

Generation of Polony array: DNA Beads (454, SOLiD) DNA Beads are generated using Emulsion PCR 11

Generation of Polony array: DNA Beads (454, SOLiD) DNA Beads are placed in wells 12

Generation of Polony array: Bridge-PCR (Solexa) DNA fragments are attached to array and used as PCR templates 13

Sequencing: Fluorescently labeled Nucleotides (Solexa) Complementary strand elongation: DNA Polymerase 14

Sequencing: Fluorescently Labeled Nucleotides (ABI SOLiD) Complementary strand elongation: DNA Ligase 15

Single Molecule Sequencing: HeliScope Direct sequencing of DNA molecules: no amplification stage DNA fragments are attached to array Potential benefits: higher throughput, less errors 16

Technology Summary Read length Sequencing Technology Throughput (per run) Sanger ~800bp Sanger 400kbp 500$ 454 ~400bp Polony 500Mbp 60$ Cost (1mbp)* Solexa 75bp Polony 20Gbp 2$ SOLiD 75bp Polony 60Gbp 2$ Helicos 30-35bp Single molecule 25Gbp 1$ *Source: Shendure & Ji, Nat Biotech, 2008 17

Errors Erlich et al. Nature Methods 5: 679-682 (2008)

Genome Sequencing Goal figuring the order of nucleotides across a genome Problem Current DNA sequencing methods can handle only short stretches of DNA at once (usually between 100-200 bp s) Solution Sequence and then use computers to assemble the small pieces 19

Genome Sequencing TG..GT TC..CC AC..GC CG..CA TT..TC AC..GC GA..GC TG..AC CT..TG GT..GC AC..GC AC..GC AA..GC AT..AT TT..CC Genome Short fragments of DNA Short DNA sequences ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA ACGTGGTAATGGCGTATACACCCTTAGGCCATA ACGTGACCGGTACTGGTAACGTACA CCTACGTGACCGGTACTGGTAACGT ACGCCTACGTGACCGGTACTGGTAA CGTATACACGTGACCGGTACTGGTA ACGTACACCTACGTGACCGGTACTG GTAACGTACGCCTACGTGACCGGTA CTGGTAACGTATACCTCT... Sequenced genome 20 20

Assembly How do we use the short reads to recover the genome being sequenced?

1: Using a reference genome: Alignment of reads to a reference..actgggtcatcgtacgatcgatcgatcgatcgatcggctagctagcta....actgggtcatcgtacgatcgatagatcgatcgatcgctagctagcta.. Reference Sample

2. Assembly without a reference (de novo assembly) Paired-End sequencing (Mate pairs) Sequence two ends of a fragment of known size. Currently fragment length (insert size) can range from 200 bps 10,000 bps

Velvet Euler / De Bruijn approach. Introduced as a alternative to overlap-layoutconsensus approach in capillary sequencing. More suited for short read assembly. Based on De Bruijn graph. Implemented in Velvet 1, the mostly used short read assembly method at present. 1 Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 18: 821-829. 2008

De Bruijn graph method Break each read sequence in to overlapping fragments of size k. (k-mers) Form De Bruijn graph such that each (k-1)-mer represents a node in the graph. Edge exists between node a to b iff there exists a k-mer such that is prefix is a and suffix is b. Traverse the graph in unambiguous path to form contigs.

K = 4 De Bruijn graph

De Bruijn graph method / Velvet Elegant way of representing the problem. Very fast execution. Error correction can be handled in the graph. De Bruijn graph size can be huge. ~200GB for human genomes. Does not use pair information in initial phase, resulting in overly complicated graphs

Applications (beyond DNA)

RNASeq Sample Total RNA PolyA RNA Small RNA Library Construction Reference Mapping to Genome PUN Assembly to Contigs Analysis Digital Counts Reads per kilobase per million (RPKM) Transcript structure Secondary structure Targets or Products Sequencing Base calling & QC CSIRO. INI Meeting July 2010 - Tutorial - Applications

Depth of Sequence RNASeq Compositional Sequence count Transcript Abundance properties Majority of the data can be dominated by a small number of highly abundant transcripts Ability to observe transcripts of smaller abundance is dependent upon sequence depth

Normalization Normalization is the process in which components of experiments are made comparable before statistical analysis. It is important in sequencing as well as for microarrays. A couple issues in normalization are different sequencing depth (library size) and distributions of reads (long right tails).

Summarized Counts

Simple RPKM Normalization Proportion of reads: number of reads (n) mapping to an exon (gene) divided by the total number of reads (N), n/n. RPKM: Reads Per Kilobase of exon (gene) per Million mapped sequence reads, 10 9 n/(nl), where L is the length of the transcriptional unit in bp (Mortazavi et al., Nat. Meth., 2008).

RNASeq - Correspondence Good correspondence with : Expression Arrays Tiling Arrays qrt-pcr Range of up to 5 orders of magnitude Better detection of low abundance transcripts Greater power to detect Transcript sequence polymorphism Novel trans-splicing Paralogous genes Individual cell type expression

ChIPSeq MNase Linker Digest Remove Nucleosomes Sequence & Align

ChIP-Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

ChIP-Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis

Alignment ELAND Bowtie SOAP SeqMap

FindPeaks CHiPSeq BS-Seq SISSRs QuEST MACS CisGenome Peak detection

One sample analysis A simple way is the sliding window method Poisson background model is commonly used to estimate error rate k i ~ Poisson(λ 0 ) k i Or people use Monte Carlo simulations Both are based on the assumption that read sampling rate is a constant across the genome. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

FDR estimation based on Poisson and negative binomial model Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Two sample analysis Reason: read sample rates at the same genomic locus are correlated across different samples. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

CisGenome two sample analysis Alignment k 1i Exploration n i =k 1i + k 2i k 1i n i ~ Binom(n i, p 0 ) k 2i FDR computation Peak Detection Post Processing Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Epigenome Protein-DNA interactions [ChIPSeq] Nucleosome positioning Histone modification Transcription factor interactions Methylation [MethylSeq] Impact of NextGen Whole genome profiling Resolution Image : ClearScience

Third generation sequencing 3 rd generation PacBio RS Single Molecule Sequencing Technology (tsms) No amplification 10Gb os sequence data per 8 day run Single Molecule Real Time (SMRT) sequencing technology (PacBio RS)

Nanopore sequencing