Base Quality Score Recalibration

Base Quality Score Recalibration: assigning accurate confidence scores to each sequenced base

We are here in the Best Practices workflow: Base Recalibration

PURPOSE

Real data is messy -> properly estimating the evidence is critical

Quality scores emitted by sequencing machines are biased and inaccurate
- Quality scores are critical for all downstream analysis
- Systematic biases are a major contributor to bad calls
- Example of bias: reported qualities depend on nucleotide context
[Figure: accuracy (empirical minus reported quality) by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG). Original: RMSE = 4.188; recalibrated: RMSE = 0.281.]
The BQSR method identifies bias and applies correction
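The RMSE figures quoted on the plots summarise how far reported qualities sit from empirical ones. A minimal sketch of that summary, assuming a plain unweighted RMSE over paired values (the slides do not say whether the plotted RMSE is weighted by the number of bases); the function name and sample values are hypothetical:

```python
import math

def rmse(reported, empirical):
    """Unweighted root-mean-square error between paired quality values."""
    if len(reported) != len(empirical) or not reported:
        raise ValueError("need two equal-length, non-empty sequences")
    squared = [(e - r) ** 2 for r, e in zip(reported, empirical)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-dinucleotide values, for illustration only.
reported = [30, 30, 30, 30]
empirical = [26, 34, 27, 33]
print(round(rmse(reported, empirical), 3))
```

An RMSE near zero means the reported scores track the observed error rates closely.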

PRINCIPLES

Base Recalibration phases
- Model the error modes
- Make before/after plots
- Apply recalibration and write to file

How do we identify the error modes in the data?
- Systematic errors correlate with basecall features
- Several relevant features:
  - Reported quality score
  - Position within the read (machine cycle)
  - Sequence context (sequencing chemistry effects)
- Calculate error empirically and find patterns in how error varies with basecall features
- Method is empowered by looking at an entire lane of data (works per read group)
[Figure: empirical minus reported quality by dinucleotide context; RMSE = 4.188.]
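Finding these patterns starts with counting observations and errors per feature value. A rough sketch of such a tally, assuming each base has already been reduced to a tuple of (read group, reported quality, cycle, context, mismatch flag); the tuple layout and key shapes are illustrative assumptions, not GATK's internal table format:

```python
from collections import defaultdict

def tally(observations):
    """Count [n_observed, n_errors] per covariate key.

    observations: iterable of
        (read_group, reported_q, cycle, context, is_mismatch)
    """
    table = defaultdict(lambda: [0, 0])
    for rg, q, cycle, context, is_mismatch in observations:
        keys = (
            (rg,),                       # whole read group
            (rg, q),                     # reported quality score
            (rg, q, "cycle", cycle),     # machine cycle
            (rg, q, "context", context), # sequence context
        )
        for key in keys:
            table[key][0] += 1
            table[key][1] += int(is_mismatch)
    return dict(table)

# Hypothetical observations, for illustration only.
obs = [
    ("rg1", 30, 1, "AA", False),
    ("rg1", 30, 1, "AG", True),
    ("rg1", 20, 2, "AA", False),
]
for key, (n_obs, n_err) in sorted(tally(obs).items(), key=str):
    print(key, n_obs, n_err)
```

Keeping every key scoped to a read group mirrors the point that the method works per read group.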

Covariation patterns allow us to calculate adjustment factors
For each base in each read:
- Is it in AA context? -> adjust by X points
- ...
- Is it at the 3rd position? -> adjust by Y points
- ...
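The per-base questions above can be read as lookups into tables of adjustment factors. A deliberately simplified sketch, assuming the adjustments are additive and the result is clamped to a quality range; the function name, the shift values and the bounds 1 and 60 are all hypothetical, and the slides do not state how GATK actually combines the factors:

```python
def adjust_quality(reported_q, context, cycle,
                   context_shift, cycle_shift,
                   min_q=1, max_q=60):
    """Apply additive per-context and per-cycle shifts to one base quality."""
    shift = context_shift.get(context, 0.0) + cycle_shift.get(cycle, 0.0)
    return max(min_q, min(max_q, round(reported_q + shift)))

# Hypothetical adjustment factors ("X" and "Y" in the slide).
context_shift = {"AA": -2.0}   # bases in AA context lose 2 points
cycle_shift = {3: -1.0}        # bases at the 3rd position lose 1 point

print(adjust_quality(30, "AA", 3, context_shift, cycle_shift))
print(adjust_quality(30, "CG", 10, context_shift, cycle_shift))
```

Bases whose context or cycle has no entry keep their reported score.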

How do we derive the adjustment factors used for recalibration?
- Any sequence mismatch = error, except known variants*
- Keep track of the number of observations and number of errors as a function of various error covariates (lane, original quality score, machine cycle, and sequencing context)
- Empirical error rate = (# of reference mismatches + 1) / (# of observed bases + 2), converted to a PHRED-scaled quality score
* If you don't have known variation, bootstrap (see later on)
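The ratio on this slide, followed by PHRED scaling, can be written out directly. A minimal sketch, assuming the standard PHRED definition Q = -10 * log10(error rate); the function name is hypothetical:

```python
import math

def empirical_quality(n_mismatches, n_observed):
    """PHRED-scaled quality from the smoothed mismatch rate on the slide."""
    error_rate = (n_mismatches + 1) / (n_observed + 2)
    return -10 * math.log10(error_rate)

# 9 mismatches in 998 observed bases -> error rate 10/1000.
print(round(empirical_quality(9, 998), 2))
# No observations at all -> error rate 1/2, so the estimate stays finite.
print(round(empirical_quality(0, 0), 2))
```

The +1 and +2 terms keep the estimate defined when a covariate bin has few or no observations.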

Any sequence mismatch = error, except known variants
Goal is to identify the signal within the noise
[Figure: all mismatches compared with known variation.]
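In code terms, the rule amounts to skipping known variant positions while counting mismatches. A small sketch, assuming an ungapped alignment, 0-based reference coordinates, and that known sites are excluded from both the observation and the error counts; the function and parameter names are hypothetical:

```python
def count_errors(read_bases, ref_bases, start, known_sites):
    """Return (n_observed, n_errors), ignoring positions in known_sites."""
    n_observed = 0
    n_errors = 0
    for offset, (read_base, ref_base) in enumerate(zip(read_bases, ref_bases)):
        if start + offset in known_sites:
            continue  # a mismatch here may be real variation, not an error
        n_observed += 1
        if read_base != ref_base:
            n_errors += 1
    return n_observed, n_errors

# Hypothetical read aligned at position 100; position 103 is a known variant.
print(count_errors("ACGT", "ACCA", 100, {103}))
```

Without the known-sites mask, the mismatch at position 103 would be counted as a sequencing error.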

Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls
Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!
[Figure: one column of panels per technology (SLX GA, 454, SOLiD, Complete Genomics, HiSeq) and three rows. RMSE values, original -> recalibrated, in panel order:
- Empirical quality vs. reported quality: 5.242 -> 0.196; 2.556 -> 0.213; 1.215 -> 0.756; 4.479 -> 0.235; 5.634 -> 0.135
- Accuracy (empirical minus reported quality) by machine cycle: 2.207 -> 0.186; 1.784 -> 0.136; 1.688 -> 0.213; 2.679 -> 0.182; 2.609 -> 0.089
- Accuracy (empirical minus reported quality) by dinucleotide: 2.598 -> 0.052; 2.169 -> 0.135; 1.656 -> 0.088; 3.503 -> 0.06; 2.469 -> 0.083]

Per-base indel error rate also varies by lane, sequence context and sequencing technology
Per-base indel error estimates are required for accurate indel calling, particularly on new technologies with indel-rich error models such as Pacific Biosciences.
[Figure: empirical gap open penalty (0 to 50) by context suffix (AAA to TTT), for read groups 20FUK.1 to 20FUK.8 and PacBio; panels labelled PacBio and HiSeq.]

Empirical estimates of base insertion and base deletion error rates unify SNP and indel error models
- Indels can also be recalibrated, using an additional resource with known indels
- New base insertion and deletion quals will be used by HaplotypeCaller for better indel calls
- (Extra quals for indels make the BAMs significantly bigger)
[Figure: empirical vs. reported quality score, and quality score accuracy by cycle covariate and by context covariate (AAA to TTT), each for base substitution, base insertion and base deletion; series: original, recalibrated, recalibrated BQSRv2.]

PROTOCOL

Base Recalibration steps/tools
- Model the error modes: BaseRecalibrator
- Make before/after plots: AnalyzeCovariates
- Apply recalibration and write to file: PrintReads

BQSR involves two complementary paths
BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
BaseRecalibrator (2) -> AnalyzeCovariates -> PLOTS

Base Recalibration workflow: data processing path
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table -> PrintReads -> Recalibrated BAM file

First we build the model

TOOL TIPS: BaseRecalibrator
Builds recalibration model

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R human.fasta \
        -I realigned.bam \
        -knownSites dbsnp137.vcf \
        -knownSites gold.standard.indels.vcf \
        [ -L exome_targets.intervals \ ]
        -o recal.table

Why specify -L intervals when running BaseRecalibrator on whole exome (WEx) data?
- BQSR depends on a key assumption: every mismatch is an error, except at sites in known variants
- Off-target sequence is likely to have higher error rates with different error modes
- If off-target sequence is included in recalibration, it may skew the model and mess up results
-> Use the -L argument with BaseRecalibrator to restrict recalibration to capture targets.

BaseRecalibrator produces the recalibration table

Second step: actually recalibrate the data

TOOL TIPS: PrintReads
General-use tool co-opted with the -BQSR flag and fed a recalibration report

    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R human.fasta \
        -I realigned.bam \
        -BQSR recal.table \
        -o recal.bam

- Creates a new BAM file, using the recalibration table generated previously, which has exquisitely accurate base substitution, insertion, and deletion quality scores
- Original qualities can be retained with the OQ tag (not default)

PrintReads produces the recalibrated BAM file

Great, now how do we get the plots?
Make before/after plots: AnalyzeCovariates

Let's look at the other path
BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
BaseRecalibrator (2) -> AnalyzeCovariates -> PLOTS

Returning to the overall BQSR workflow...
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table -> PrintReads -> Recalibrated BAM file

We keep the first step (actual recalibration) as the base that we'll branch off from
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table

...and substitute the plotting workflow
Original BAM file + Known sites -> BaseRecalibrator (1) -> Recalibration table (1)
Recalibration table (1) -> BaseRecalibrator (2) -> Recalibration table (2)
Recalibration tables (1) and (2) -> AnalyzeCovariates -> Plots

We already did the first step earlier, which produced the original recalibration table

So now we do a second pass with BaseRecalibrator:

TOOL TIPS: BaseRecalibrator (second pass)
Second pass evaluates what the data looks like after recalibration

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R human.fasta \
        -I realigned.bam \
        -knownSites dbsnp137.vcf \
        -knownSites gold.standard.indels.vcf \
        -BQSR recal.table \
        -o after_recal.table

...producing a second recalibration table

Finally, we use the two tables to make the plots

TOOL TIPS: AnalyzeCovariates
Makes plots based on before/after recalibration tables

    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
        -R human.fasta \
        -before recal.table \
        -after after_recal.table \
        -plots recal_plots.pdf

There is an option to keep the intermediate .csv file used for plotting, if you want to play with the plot data.

...and now we have before/after recalibration plots!

So to recap the two complementary paths:
BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
BaseRecalibrator (2) -> AnalyzeCovariates -> BEFORE/AFTER PLOTS

RESULTS

Recalibration produces a more accurate estimation of error (doesn't fix it!)
Post-recalibration quality scores should fit the empirically derived quality scores very well; no obvious systematic biases should remain
[Figure: three panels, RMSE original -> recalibrated. Empirical quality vs. reported quality: 5.634 -> 0.135. Accuracy (empirical minus reported quality) by machine cycle: 2.609 -> 0.089. Accuracy (empirical minus reported quality) by dinucleotide: 2.469 -> 0.083.]
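To see why this is a better estimate of error and not a fix, it helps to translate quality scores back into error probabilities. A short sketch using the standard PHRED relationship p = 10^(-Q/10); the function name and the example scores are hypothetical:

```python
def phred_to_error_probability(q):
    """Probability that a base call is wrong, given its PHRED quality."""
    return 10 ** (-q / 10)

# A base reported at Q30 but recalibrated to Q25 is the same base call;
# only the stated chance of it being wrong changes.
for label, q in (("reported", 30), ("recalibrated", 25)):
    print(label, q, f"{phred_to_error_probability(q):.5f}")
```

The base itself is untouched; downstream callers simply weigh it with a more realistic error probability.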

Non-humans: no known resources for recalibration?
Solution: bootstrap a set of known variants
- Call variants on realigned, unrecalibrated data
- Filter resulting variants with stringent filters
- Use variants that pass filters as known sites for BQSR
- Repeat until convergence
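The bootstrap steps above form a loop that stops once the known-sites set settles. A schematic sketch, assuming "convergence" means the filtered call set no longer changes between rounds (the slides do not define the criterion); `call_and_filter` is a hypothetical stand-in for running recalibration, variant calling and filtering, and the toy function below exists only to make the loop runnable:

```python
def bootstrap_known_sites(call_and_filter, initial_sites=frozenset(), max_rounds=10):
    """Iterate until the set of passing variant sites stops changing.

    call_and_filter(known) -> iterable of sites that pass stringent filters
    when variants are called using `known` as the known-sites resource.
    Returns (known_sites, rounds_run).
    """
    known = frozenset(initial_sites)
    for round_number in range(1, max_rounds + 1):
        updated = frozenset(call_and_filter(known))
        if updated == known:
            return known, round_number
        known = updated
    return known, max_rounds

# Toy stand-in: every round "discovers" the same single site.
def toy_call_and_filter(known):
    return set(known) | {100}

sites, rounds = bootstrap_known_sites(toy_call_and_filter)
print(sorted(sites), rounds)
```

The `max_rounds` cap is a safety assumption so the loop ends even if the call set keeps fluctuating.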

BQSR bootstrapping workflow: round #1
[Diagram elements: Original BAM file + Known sites, BaseRecalibrator, Recalibration table, HaplotypeCaller (-BQSR recal.table), VariantFiltration, Raw VCF file]

BQSR bootstrapping workflow: rounds #2 to #N
[Diagram elements: Original BAM file + Known sites, BaseRecalibrator, Recalibration table, HaplotypeCaller -BQSR recal.table, VariantFiltration, Raw VCF file]

We are here in the Best Practices workflow
Next step: Variant Discovery

Further reading
- http://www.broadinstitute.org/gatk/guide/best-practices
- http://www.broadinstitute.org/gatk/guide/article?id=44
- http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_baserecalibrator.html
- http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_printreads.html
- http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_analyzecovariates.html