NGS data analysis. Bernardo J. Clavijo

A brief history of DNA sequencing
  1953: double helix structure, Watson & Crick
  1977: rapid DNA sequencing, Sanger
  1977: first full (5k) genome, bacteriophage Phi X
  Late 80s: first production Sanger sequencers
  Mid 90s: DNA microarrays
  2001: draft human genome
  2004: first 454 pyrosequencing machine
  2006: first Solexa/Illumina sequencer
  2011: PacBio RS
  2014: Nanopore

Growth of sequencing (figure source: Science 331, 11 Feb 2011)

Next Generation Sequencing

TGAC Sequencing Platforms
  Illumina GAII x 1
  Illumina HiSeq x 3
  Illumina MiSeq x 3
  Roche 454FLX x 2
  PacBio RS x 1
  Proton x 1
  Opgen Argus x 1

TGAC Sequencing Platforms, NEW: 2014 arrivals
  Oxford Nanopore MinION

Platforms compared
PLATFORM | METHOD | READ LENGTH | NUMBER OF READS | THROUGHPUT | RUN TIME | ACCURACY | APPROX. COST
Illumina HiSeq 2500 High Output | Sequencing by synthesis | Up to 100 bp PE | 1.5 billion per flowcell | 300 Gb | 11 days | 99.9% | 14,000
Illumina HiSeq 2500 Rapid | Sequencing by synthesis | Up to 150 bp PE | 300 million per flowcell | 90 Gb | 40 hours | 99.9% | 4,400
Illumina MiSeq | Sequencing by synthesis | Up to 250 bp PE | 15 million per flowcell | 8.5 Gb | 39 hours | 99.9% | 1,400
454 | Pyrosequencing | Up to 400 bp | 1 million per plate | 400 Mb | 10 hours | 99.9% | 6,000
PacBio Standard Run | Real time sequencing | 3 kb (upper 5% >6 kb) | 50,000 per SMRT cell | 100 Mb | 2 x 55 mins | 86% | 300
PacBio Long Run | Real time sequencing | 3.5 kb (upper 5% >10 kb) | 25,000 per SMRT cell | 60 Mb | 1 x 120 mins | 86% | 300
OpGen Argus | Optical Map | 150 kb -> 2 Mb | ~2,000 per Map Card | 3 Gb | 120 mins | N/A | 500-1000
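The throughput column can be roughly cross-checked against read count and read length. The sketch below is only a sanity check, not a vendor formula: it assumes that "number of reads" counts sequenced fragments and that a paired-end (PE) run reads each fragment twice. The function name `expected_yield_bp` is made up for this illustration.

```python
def expected_yield_bp(n_fragments, read_length_bp, paired_end=False):
    """Approximate bases per run: fragments x read length (x2 if paired-end).

    Assumption: every read reaches the full read length, which real runs
    do not guarantee, so this is an upper-bound style estimate.
    """
    reads_per_fragment = 2 if paired_end else 1
    return n_fragments * read_length_bp * reads_per_fragment


# Figures taken from the comparison table above.
hiseq_high_output = expected_yield_bp(1_500_000_000, 100, paired_end=True)
hiseq_rapid = expected_yield_bp(300_000_000, 150, paired_end=True)
roche_454 = expected_yield_bp(1_000_000, 400)

print(hiseq_high_output / 1e9, "Gb")
print(hiseq_rapid / 1e9, "Gb")
print(roche_454 / 1e6, "Mb")
```

Under these assumptions the HiSeq and 454 rows reproduce the listed 300 Gb, 90 Gb and 400 Mb. The MiSeq row gives 7.5 Gb rather than the listed 8.5 Gb, a reminder that the table values are approximate.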

The *-seq era
  Exome capture
  RAD-seq
  ChIP-seq
  RNA-seq
  Single-cell sequencing
Basically... we are in the something-seq era

Looking for
  The whole genome sequence.
  Differences with a known genome.
  Transcripts.
  Various signals across the genome/transcriptome.
  Relative abundances (of genomes/transcripts).

OK, we have TONS of data... let's try to analyse it.

The genome assembly problem
  (figure labels: Original DNA, Fragments, Sequenced ends, Fragments, Contigs, Scaffold)

Read mapping

RNA-seq data: mapping vs assembling

... and a very widely used one: just BLAST it!!!

Meta-genomes

Meta-genomes + Meta-transcriptomes?

Working with heuristics

Black box processing
  (diagram: DATA -> Processing -> RESULTS)

Heuristic processing: using shortcuts
  (diagram: DATA -> Processing -> RESULTS)

Why use heuristics?
  The problem is not completely defined.
  Exhaustive methods are:
    Too limited, thus producing simple partial solutions.
    Too slow, not scaling well.
  Data varies too much and no good models are available.
  It is so much faster and easier and it works! (sometimes, anyway)
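To make the trade-off concrete, here is a toy read-placement example. It is only a sketch and not how any particular mapper works: an exhaustive scan is compared with a seed-based shortcut that assumes the first k bases of the read match the reference exactly. The sequences and all function names are invented for the illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_exhaustive(read, reference, max_mismatches):
    """Check every reference position: complete, but slow on real genomes."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        if hamming(read, reference[pos:pos + len(read)]) <= max_mismatches:
            hits.append(pos)
    return hits


def map_seeded(read, reference, index, k, max_mismatches):
    """Shortcut: only check positions where the first k bases match exactly."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read) and hamming(read, window) <= max_mismatches:
            hits.append(pos)
    return hits


reference = "ACGTTGCAAGGCTTAACCGGATCT"   # made-up toy reference
index = build_kmer_index(reference, k=4)
clean_read = "AGGCTTAACC"                # exact copy of reference[8:18]
noisy_read = "TGGCTTAACC"                # same read, error in the first base

print(map_exhaustive(clean_read, reference, 1), map_seeded(clean_read, reference, index, 4, 1))
print(map_exhaustive(noisy_read, reference, 1), map_seeded(noisy_read, reference, index, 4, 1))
```

Both approaches place the clean read. The seeded version skips almost all of the comparisons, but it loses the noisy read because the error falls inside the seed: faster and easier, and it works, sometimes.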

Black box processing done right
  (diagram: DATA -> Processing -> RESULTS)
  Use good data and check that it meets the pre-conditions for being processed well.
  Know (roughly) how the processing works.
  Check soundness and sanity of results.

Knowing your data

Experiment design (you create the data!)
  Know your biological question.
  Plan your data processing (from an information perspective).
  Decide on conditions and biological/technical replicates.
  Decide on technologies and coverages:
    How will the typical bias affect your experiment?
    Is the coverage enough? Significant results?
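A first answer to "is the coverage enough?" can come from simple arithmetic before any sequencing is ordered. The sketch below assumes idealised, uniformly random sampling of the genome (a Poisson / Lander-Waterman style model), which real libraries with their biases will not achieve, so treat the numbers as optimistic. The 3 Gb genome size is a hypothetical example, and the function names are made up.

```python
import math


def mean_coverage(total_bases, genome_size_bp):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


def fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases hit by zero reads: e^-c.

    Assumes reads land uniformly at random, which bias breaks in practice.
    """
    return math.exp(-coverage)


# Hypothetical example: 90 Gb of reads for a 3 Gb genome.
depth = mean_coverage(90e9, 3e9)
print(f"mean coverage: {depth:.0f}x")
print(f"expected uncovered fraction: {fraction_uncovered(depth):.2e}")
```

Bias makes some regions far shallower than the mean, so the coverage you plan for usually needs a margin above what this idealised estimate suggests.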

Living in a biased environment

Sample and library preparation: a source of bias
  DNA/RNA extraction techniques have bias:
    And sample quality limits sequencing!
  Samples are never pure.
  PCR generates further bias.
  No chemical reaction is perfect, nor complete.
  You can learn what your typical biases are:
    Assess them.
    Take their impact into account.
    Try to get better data produced.

Do QC before performing the analysis
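One of the simplest QC views is mean base quality per read position. The sketch below is a toy version of that idea, assuming quality strings in Phred+33 encoding (the usual encoding in current FASTQ files; older Illumina data may use an offset of 64). The function names are made up, and a dedicated QC tool will report far more than this.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for quality_string in quality_strings:
        for i, score in enumerate(phred_scores(quality_string, offset)):
            if i == len(totals):      # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy quality strings: 'I' decodes to Q40 and '5' to Q20 with offset 33.
example = ["IIII", "II5", "I5"]
print(mean_quality_per_position(example))
```

A profile that drops sharply towards the end of the reads, or that is low throughout, is worth understanding before any downstream analysis.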

Read preparation:
  Adaptor trimming: if you have lots of adaptor sequence.
    But ESPECIALLY if you have linkers from LMP (check NextClip).
  Pair joining: allows higher k on overlapping reads. Might lose longer fragments.
  Quality trimming: only if your data is terrible and you are short of memory.
  Error correction: once it miscorrects, all subsequent processing is tainted.
    Your analysis should be able to cope with errors.
    PacBio reads are a special case, more about that later.
  Deduplication: hard to do right, sometimes needed, scaffolders handle it.
  Digital normalisation: for rna-* / meta-*, and only if you understand what it does.
  IN GENERAL: Illumina is better than it used to be. Keep it in mind.
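Since digital normalisation comes with the caveat "if you understand what it does", here is a toy sketch of the core idea: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. This is a simplified illustration, not the implementation of any real tool. It uses exact counting in memory, ignores reverse complements, and the tiny values of `k` and `cutoff` are chosen only so the example is easy to follow.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalisation(reads, k=5, cutoff=2):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue                      # read shorter than k: skip it
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)     # only kept reads add to the counts
    return kept


abundant = "ACGTTGCAAG"   # made-up sequence seen five times
rare = "TTTTGGGGCC"       # made-up sequence seen once
kept = digital_normalisation([abundant] * 5 + [rare], k=5, cutoff=2)
print(len(kept), kept)
```

The abundant sequence is cut down to the cutoff while the rare one survives. That evens out coverage for transcriptome or metagenome assembly, but it also discards the abundance information that other analyses depend on.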

That's all for now... now you can think about analysing your data.