NGS data analysis. Bernardo J. Clavijo

A brief history of DNA sequencing
  1953: double helix structure, Watson & Crick
  1977: rapid DNA sequencing, Sanger
  1977: first full (5k) genome, bacteriophage Phi X
  Late 80s: first production Sanger sequencers
  Mid 90s: DNA microarrays
  2001: draft human genome
  2004: first 454 pyrosequencing machine
  2006: first Solexa/Illumina sequencer
  2011: PacBio RS
  2014: Nanopore

Growth of sequencing
(Figure source: Science 331, 11 Feb 2011)

Next Generation Sequencing

TGAC Sequencing Platforms
  Illumina GAII x 1
  Illumina HiSeq x 3
  Illumina MiSeq x 3
  Roche 454FLX x 2
  PacBio RS x 1
  Proton x 1
  OpGen Argus x 1
  NEW: 2014 arrivals
    Oxford Nanopore MinION
    BioNano Irys

Platforms compared

| Platform | Method | Read length | Number of reads | Throughput | Run time | Accuracy | Approx. cost |
|---|---|---|---|---|---|---|---|
| Illumina HiSeq 2500 High Output | Sequencing by synthesis | Up to 100 bp PE | 1.5 billion per flowcell | 300 Gb | 11 days | 99.9% | 14,000 |
| Illumina HiSeq 2500 Rapid | Sequencing by synthesis | Up to 150 bp PE | 300 million per flowcell | 90 Gb | 40 hours | 99.9% | 4,400 |
| Illumina MiSeq | Sequencing by synthesis | Up to 250 bp PE | 15 million per flowcell | 8.5 Gb | 39 hours | 99.9% | 1,400 |
| 454 | Pyrosequencing | Up to 400 bp | 1 million per plate | 400 Mb | 10 hours | 99.9% | 6,000 |
| PacBio Standard Run | Real time sequencing | 3 kb (upper 5% > 6 kb) | 50,000 per SMRT cell | 100 Mb | 2 x 55 mins | 86% | 300 |
| PacBio Long Run | Real time sequencing | 3.5 kb (upper 5% > 10 kb) | 25,000 per SMRT cell | 60 Mb | 1 x 120 mins | 86% | 300 |
| OpGen Argus | Optical map | 150 kb to 2 Mb | ~2,000 per Map Card | 3 Gb | 120 mins | N/A | 500-1000 |
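The throughput column can be roughly cross-checked from read count and read length. The sketch below assumes that "number of reads" counts sequenced fragments, so a paired-end run yields two reads per fragment; the function name is illustrative and not part of the original slides. Quoted vendor figures may differ from this back-of-the-envelope estimate.

```python
def expected_yield_bases(num_reads: int, read_length: int, paired: bool = False) -> int:
    """Rough yield in bases: fragments x read length (x 2 for paired-end)."""
    ends = 2 if paired else 1
    return num_reads * read_length * ends


# Figures taken from the table above.
hiseq_high_output = expected_yield_bases(1_500_000_000, 100, paired=True)
hiseq_rapid = expected_yield_bases(300_000_000, 150, paired=True)
roche_454 = expected_yield_bases(1_000_000, 400)

print(hiseq_high_output / 1e9, "Gb")
print(hiseq_rapid / 1e9, "Gb")
print(roche_454 / 1e6, "Mb")
```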

The *-seq era
  Exome capture
  RAD-seq
  ChIP-seq
  RNA-seq
  Single-cell sequencing
  Basically... we are in the something-seq era

Looking for
  The whole genome sequence.
  Differences with a known genome.
  Transcripts.
  Various signals across the genome/transcriptome.
  Relative abundances (of genomes/transcripts).

OK, we have TONS of data... let's try to analyse it.

The genome assembly problem
(Figure labels: Original DNA, Fragments, Sequenced ends, Contigs, Scaffold)
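As a toy illustration of going from reads to contigs, here is a minimal greedy overlap-merge sketch. It is not the method used by any particular assembler: it assumes error-free reads from a single strand, ignores repeats and read pairs (so it does no scaffolding), and all names are illustrative.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

Reads that share no overlap of at least `min_len` bases stay as separate contigs, which is what happens in practice at coverage gaps.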

Read mapping
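A minimal sketch of the seed-and-verify idea behind many read mappers, assuming a small in-memory reference and forward-strand reads only. The function names and the choice of seeding on the first k-mer of the read are illustrative assumptions, not a description of any specific tool.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int, max_mismatches: int = 1):
    """Seed with the read's first k-mer, then verify each hit by counting mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTTGCATGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
print(map_read("CATGCCGA", reference, index, k=4))  # exact match
print(map_read("CATGCTGA", reference, index, k=4))  # one mismatch outside the seed
print(map_read("CTTGCCGA", reference, index, k=4))  # mismatch inside the seed
```

The third read is not found because its error falls inside the seed: the shortcut that makes the lookup fast is also what can make it miss.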

RNA-seq data: mapping vs assembling

... and a very much used one: just BLAST it!!!

Meta-genomes

Meta-genomes + Meta-transcriptomes?

Working with heuristics

Black box processing
(Diagram: DATA -> Processing -> RESULTS)

Heuristic processing: using shortcuts
(Diagram: DATA -> Processing -> RESULTS)

Why use heuristics?
(Diagram: DATA -> Processing -> RESULTS)
  The problem is not completely defined.
  Exhaustive methods are:
    Too limited, thus producing simple partial solutions.
    Too slow, not scaling well.
  Data varies too much and no good models are available.
  It is so much faster and easier and it works! (sometimes, anyway)

Black box processing done right
(Diagram: DATA -> Processing -> RESULTS)
  Use good data; check that it meets the pre-conditions for being processed well.
  Know (roughly) how the processing works.
  Check soundness and sanity of results.

Knowing your data

Experiment design (you create the data!)
  Know your biological question.
  Plan your data processing (from an information perspective).
  Decide on conditions and biological/technical replicates.
  Decide on technologies and coverages:
    How will the typical bias affect your experiment?
    Is the coverage enough? Will the results be significant?
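The "is the coverage enough?" question can be given a first rough answer with the usual relation coverage = reads x read length / genome size. The sketch below also uses the Poisson (Lander-Waterman style) approximation for the fraction of bases left uncovered; that model assumes uniform random sampling, which the bias discussed in these slides violates, so treat the numbers as optimistic. Function names are illustrative.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)


def fraction_uncovered(coverage: float) -> float:
    """Poisson approximation: probability that a base has depth 0."""
    return math.exp(-coverage)


# Hypothetical example: a 5 Mb bacterial genome and 100 bp reads.
print(expected_coverage(1_000_000, 100, 5_000_000))
print(reads_needed(30, 100, 5_000_000))
print(fraction_uncovered(5))
```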

Living in a biased environment

Sample and library preparation: a source of bias
  DNA/RNA extraction techniques have bias:
    And sample quality limits sequencing!
  Samples are never pure.
  PCR generates further bias.
  No chemical reaction is perfect, nor complete.
  You can learn what your typical biases are:
    Assess them.
    Take their impact into account.
    Try to get better data produced.

Do QC before performing the analysis
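A minimal sketch of the kind of summary a QC step produces, assuming plain four-line FASTQ records with Phred+33 quality encoding (the usual encoding for recent Illumina output, but check your own data). The function names are illustrative; dedicated QC tools report far more than this.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def qc_summary(handle, offset=33):
    """Read count, mean length, GC fraction and mean base quality."""
    reads = bases = gc = quality_sum = 0
    for _name, seq, qual in read_fastq(handle):
        reads += 1
        bases += len(seq)
        gc += sum(seq.count(b) for b in "GCgc")
        quality_sum += sum(ord(c) - offset for c in qual)
    if bases == 0:
        return {"reads": reads, "mean_length": 0.0, "gc_fraction": 0.0, "mean_quality": 0.0}
    return {
        "reads": reads,
        "mean_length": bases / reads,
        "gc_fraction": gc / bases,
        "mean_quality": quality_sum / bases,
    }


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\n!!!!!!\n")
summary = qc_summary(example)
print(summary)
```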

Read preparation:
  Adaptor trimming: if you have lots of adaptor sequence.
    But ESPECIALLY if you have linkers from LMP (long mate pair) libraries (check NextClip).
  Pair joining: allows higher k on overlapping reads. Might lose longer fragments.
  Quality trimming: only if your data is terrible and you are short of memory.
  Error correction: once it miscorrects, all subsequent processing is tainted.
    Your analysis should be able to cope with errors.
    PacBio reads are a special case, more about that later.
  Deduplication: hard to do right, sometimes needed, scaffolders handle it.
  Digital normalisation: for RNA-* / meta-* data, and only if you understand what it does.
  IN GENERAL: Illumina is better than it used to be. Keep it in mind.
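To make the adaptor trimming step concrete, here is a minimal sketch of 3' adaptor removal by exact matching. It is a simplification: real trimmers tolerate mismatches and use quality values, and the adaptor sequence in the example is only a commonly quoted Illumina adaptor prefix used here as an assumption. Use the sequence that matches your library kit.

```python
def trim_adaptor(read: str, adaptor: str, min_overlap: int = 3) -> str:
    """Remove a 3' adaptor: a full copy anywhere, or a partial copy at the read end."""
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]  # full adaptor found: keep only what precedes it
    # Otherwise look for the longest adaptor prefix that ends the read.
    for size in range(min(len(adaptor), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:size]):
            return read[:-size]
    return read


ADAPTOR = "AGATCGGAAGAGC"  # assumed example sequence

print(trim_adaptor("ACGTACGTAGATCGGAAGAGCTTT", ADAPTOR))  # full adaptor inside the read
print(trim_adaptor("ACGTACGTAGATC", ADAPTOR))             # partial adaptor at the end
print(trim_adaptor("ACGTACGT", ADAPTOR))                  # no adaptor
```

The `min_overlap` parameter trades sensitivity against over-trimming: a very short match at the read end is often just genomic sequence that happens to look like the start of the adaptor.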

That's all for now... now you can think about analysing your data.