Base Quality Score Recalibration

Assigning accurate confidence scores to each sequenced base

We are here in the Best Practices workflow: Base Recalibration

PURPOSE

Real data is messy -> properly estimating the evidence is critical

Quality scores emitted by sequencing machines are biased and inaccurate
- Quality scores are critical for all downstream analysis
- Systematic biases are a major contributor to bad calls
- Example of bias: qualities reported depending on nucleotide context
[Figure: Empirical - Reported Quality by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG). Original: RMSE = 4.188. Recalibrated: RMSE = 0.281.]
BQSR method identifies bias and applies correction

PRINCIPLES

Base Recalibration phases
- Model the error modes
- Make before/after plots
- Apply recalibration and write to file

How do we identify the error modes in the data?
- Systematic errors correlate with basecall features
- Several relevant features:
  - Reported quality score
  - Position within the read (machine cycle)
  - Sequence context (sequencing chemistry effects)
- Calculate error empirically and find patterns in how error varies with basecall features
- Method is empowered by looking at an entire lane of data (works per read group)
[Figure: Empirical - Reported Quality by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG), RMSE = 4.188]

Covariation patterns allow us to calculate adjustment factors
For each base in each read:
- is it in AA context? -> adjust by X points
- ...
- is it at 3rd position? -> adjust by Y points
- ...
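As a minimal sketch of the idea on this slide (not GATK's actual implementation), the per-feature adjustments can be pictured as additive shifts applied to the reported quality. The table values, the floor at 1, and the name `adjust_quality` are all hypothetical and chosen only for illustration.

```python
# Hypothetical adjustment tables; the numbers are made up for illustration.
CONTEXT_SHIFT = {"AA": -2, "CG": 1}   # points to add per dinucleotide context
CYCLE_SHIFT = {3: -1, 100: -4}        # points to add per machine cycle


def adjust_quality(reported_q, context, cycle):
    """Shift a reported quality by the context and cycle adjustments.

    Features without a table entry contribute no adjustment. The result is
    floored at 1, which is an assumption of this sketch.
    """
    shifted = reported_q + CONTEXT_SHIFT.get(context, 0) + CYCLE_SHIFT.get(cycle, 0)
    return max(1, shifted)
```

For example, with these made-up tables a base reported at Q30 in an AA context at cycle 3 would come out at Q27.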

How do we derive the adjustment factors used for recalibration?
- Any sequence mismatch = error, except known variants*!
- Keep track of number of observations and number of errors as a function of various error covariates (lane, original quality score, machine cycle, and sequencing context)
- Empirical error rate = (# of reference mismatches + 1) / (# of observed bases + 2), then converted to a PHRED-scaled quality score
* If you don't have known variation, bootstrap (see later on)
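The smoothed error rate and its PHRED scaling can be worked through in a few lines. This is a sketch under the standard PHRED definition, Q = -10 * log10(error rate); the function name `empirical_quality` is hypothetical.

```python
import math


def empirical_quality(mismatches, observations):
    """Phred-scaled empirical quality from mismatch and observation counts.

    Uses the smoothed error rate from the slide:
    (mismatches + 1) / (observations + 2).
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


# 9 mismatches in 9,998 observed bases -> error rate 10 / 10000 = 0.001
print(round(empirical_quality(9, 9998), 1))  # prints 30.0
```

The +1 and +2 terms keep the estimate defined for bins with no observations: zero mismatches in zero bases gives an error rate of 0.5, or roughly Q3, rather than a division by zero.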

Any sequence mismatch = error except known variants
Goal is to identify the signal within the noise
[Figure: all mismatches vs. known variation]
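The counting step can be sketched as follows. This is a simplified illustration, not GATK's data model: the record layout (a dict per base) is hypothetical, and all covariates are combined into one joint bin for brevity, which is an assumption of the sketch.

```python
from collections import defaultdict


def tally_errors(bases, known_sites):
    """Count observations and mismatches per covariate bin.

    `bases` is an iterable of dicts with the keys read_group, reported_q,
    cycle, context, pos (reference position) and is_mismatch. Bases at
    positions listed in `known_sites` are skipped entirely, because a
    mismatch there may be real variation rather than a sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for b in bases:
        if b["pos"] in known_sites:
            continue
        key = (b["read_group"], b["reported_q"], b["cycle"], b["context"])
        counts[key][0] += 1
        counts[key][1] += int(b["is_mismatch"])
    return dict(counts)
```

Each bin's pair of counts is the input needed to compute an empirical error rate for that combination of covariates.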

Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls
Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!
[Figure: original vs. recalibrated quality for SLX GA, 454, SOLiD, Complete Genomics and HiSeq data. Three rows of panels: Empirical Quality vs. Reported Quality; Accuracy (Empirical - Reported Quality) vs. Machine Cycle, with first-of-pair and second-of-pair reads; Accuracy (Empirical - Reported Quality) vs. Dinucleotide. Across panels, original RMSE ranges from 1.215 to 5.634 and recalibrated RMSE from 0.052 to 0.756.]

Per-base indel error rate also varies by lane, sequence context and sequencing technology
Per-base indel error estimates are required for accurate indel calling, particularly on new technologies with indel-rich error models such as Pacific Biosciences.
[Figure: empirical gap open penalty by context suffix (AAA to TTT) for read groups 20FUK.1 to 20FUK.8, comparing PacBio and HiSeq]

Empirical estimates of base insertion and base deletion error rates unify SNP and indel error models
- Indels can also be recalibrated using an additional resource with known indels
- New base insertion and deletion quals will be used by HaplotypeCaller for better indel calls
- (Extra quals for indels make the BAMs significantly bigger)
[Figure: panels for Base Substitution, Base Insertion and Base Deletion. Empirical Quality Score vs. Reported Quality Score, and Quality Score Accuracy vs. Cycle Covariate and Context Covariate (AAA to TTT). Legend: Original, Recalibrated, Recalibrated BQSRv2.]

PROTOCOL

Base Recalibration steps/tools
- Model the error modes -> BaseRecalibrator
- Make before/after plots -> AnalyzeCovariates
- Apply recalibration and write to file -> PrintReads

BQSR involves two complementary paths
- BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
- BaseRecalibrator (2) -> AnalyzeCovariates -> PLOTS

Base Recalibration workflow: data processing path
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table -> PrintReads -> Recalibrated BAM file

First we build the model

TOOL TIPS: BaseRecalibrator
Builds recalibration model

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R human.fasta \
        -I realigned.bam \
        -knownSites dbsnp137.vcf \
        -knownSites gold.standard.indels.vcf \
        [ -L exome_targets.intervals \ ]
        -o recal.table

Why specify -L intervals when running BaseRecalibrator on WEx?
- BQSR depends on a key assumption: every mismatch is an error, except at sites in known variants
- Off-target sequence is likely to have higher error rates with different error modes
- If off-target sequence is included in recalibration, it may skew the model and mess up results
-> Use the -L argument with BaseRecalibrator to restrict recalibration to capture targets.

BaseRecalibrator produces the recalibration table

Second step: actually recalibrate the data

TOOL TIPS: PrintReads
General-use tool co-opted with the -BQSR flag and fed a recalibration report

    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R human.fasta \
        -I realigned.bam \
        -BQSR recal.table \
        -o recal.bam

- Creates a new BAM file, using the table generated previously, with exquisitely accurate base substitution, insertion, and deletion quality scores
- Original qualities can be retained with the OQ tag (not default)

PrintReads produces the recalibrated BAM file

Great, now how do we get the plots?
Make before/after plots -> AnalyzeCovariates

Let's look at the other path
BaseRecalibrator (2) -> AnalyzeCovariates -> PLOTS

Returning to the overall BQSR workflow...

We keep the first step (actual recalibration) as the base that we'll branch off from
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table

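...and substitute the plotting workflow
- Original BAM file + Known sites -> BaseRecalibrator (1) -> Recalibration table (1)
- Original BAM file + Known sites + Recalibration table (1) -> BaseRecalibrator (2) -> Recalibration table (2)
- Recalibration table (1) + Recalibration table (2) -> AnalyzeCovariates -> Plots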

We already did the first step earlier: BaseRecalibrator (1)

...which produced the original recalibration table: Recalibration table (1)

So now we do a second pass with BaseRecalibrator: BaseRecalibrator (2)

TOOL TIPS: BaseRecalibrator
Second pass evaluates what the data looks like after recalibration

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R human.fasta \
        -I realigned.bam \
        -knownSites dbsnp137.vcf \
        -knownSites gold.standard.indels.vcf \
        -BQSR recal.table \
        -o after_recal.table

...producing a second recalibration table: Recalibration table (2)

Finally, we use the two tables to make the plots: Recalibration table (1) + Recalibration table (2) -> AnalyzeCovariates -> Plots

TOOL TIPS: AnalyzeCovariates
Makes plots based on before/after recalibration tables

    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
        -R human.fasta \
        -before recal.table \
        -after after_recal.table \
        -plots recal_plots.pdf

There is an option to keep the intermediate .csv file used for plotting, if you want to play with the plot data.

...and now we have before/after recalibration plots!

So to recap the two complementary paths:
- BaseRecalibrator (1) -> PrintReads -> PROCESSED DATA
- BaseRecalibrator (1) [BEFORE] + BaseRecalibrator (2) [AFTER] -> AnalyzeCovariates -> PLOTS
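To see both paths side by side, the four command lines from the tool tips slides can be assembled programmatically. This is only a sketch: it builds argument lists and does not run anything, option names are written with their leading dashes (-T, -R, -I, ...) as GATK 3 command lines expect, and the function name `bqsr_commands` and the dictionary keys are hypothetical.

```python
def bqsr_commands(ref, bam, known_sites, jar="GenomeAnalysisTK.jar"):
    """Return the four BQSR command lines from the slides as argument lists."""
    base = ["java", "-jar", jar]
    # One -knownSites option per resource file.
    known = [arg for vcf in known_sites for arg in ("-knownSites", vcf)]
    recal_in = ["-T", "BaseRecalibrator", "-R", ref, "-I", bam] + known
    return {
        # Data processing path
        "first_pass": base + recal_in + ["-o", "recal.table"],
        "apply": base + ["-T", "PrintReads", "-R", ref, "-I", bam,
                         "-BQSR", "recal.table", "-o", "recal.bam"],
        # Plotting path
        "second_pass": base + recal_in + ["-BQSR", "recal.table",
                                          "-o", "after_recal.table"],
        "plots": base + ["-T", "AnalyzeCovariates", "-R", ref,
                         "-before", "recal.table",
                         "-after", "after_recal.table",
                         "-plots", "recal_plots.pdf"],
    }


cmds = bqsr_commands("human.fasta", "realigned.bam",
                     ["dbsnp137.vcf", "gold.standard.indels.vcf"])
print(" ".join(cmds["apply"]))
```

Each list could then be handed to `subprocess.run(cmd, check=True)`. Only the first pass has to finish before the others start, since PrintReads and the second BaseRecalibrator pass both read recal.table.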

RESULTS

Recalibration produces a more accurate estimation of error (doesn't fix it!)
Post-recalibration quality scores should fit the empirically-derived quality scores very well; no obvious systematic biases should remain
[Figure: Empirical Quality vs. Reported Quality (original RMSE = 5.634, recalibrated RMSE = 0.135); Accuracy (Empirical - Reported Quality) vs. Machine Cycle (original RMSE = 2.609, recalibrated RMSE = 0.089); Accuracy (Empirical - Reported Quality) vs. Dinucleotide (original RMSE = 2.469, recalibrated RMSE = 0.083)]
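The RMSE figures quoted on the plots summarize how far reported qualities sit from empirical ones. A minimal sketch of that measure follows; whether the plotted values weight each bin by its number of bases is not stated in the slides, so this version is unweighted by assumption, and the name `rmse` is hypothetical.

```python
import math


def rmse(reported, empirical):
    """Root-mean-square error between paired reported and empirical qualities."""
    if len(reported) != len(empirical) or not reported:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - r) ** 2 for r, e in zip(reported, empirical)]
    return math.sqrt(sum(squared) / len(squared))
```

A perfectly calibrated set of bins gives 0, so a drop toward 0 after recalibration is what a well-behaved run should show.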

Non-humans: no known resources for recalibration?
Solution: bootstrap a set of known variants
- Call variants on realigned, unrecalibrated data
- Filter resulting variants with stringent filters
- Use variants that pass filters as known for BQSR
- Repeat until convergence
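The bootstrapping loop can be sketched in outline. The three callables below are hypothetical stand-ins for the real tools, and "convergence" is taken to mean that the filtered variant set is identical to the previous round's; the slides do not define the criterion, so that choice is an assumption of this sketch.

```python
def bootstrap_known_sites(call_variants, filter_variants, recalibrate, max_rounds=5):
    """Iterate call -> filter -> recalibrate until the known-site set is stable.

    Hypothetical callables:
      call_variants(recal_table) -> iterable of raw variant sites
      filter_variants(raw)       -> iterable of sites passing stringent filters
      recalibrate(known_sites)   -> recalibration table
    Returns (known_sites, recal_table, rounds_used).
    """
    recal_table = None   # round 1 calls variants on unrecalibrated data
    known = frozenset()
    for round_number in range(1, max_rounds + 1):
        raw = call_variants(recal_table)
        passed = frozenset(filter_variants(raw))
        recal_table = recalibrate(passed)
        if passed == known:
            # Same sites as the previous round: treat as converged.
            return passed, recal_table, round_number
        known = passed
    return known, recal_table, max_rounds
```

The `max_rounds` cap is a safeguard for the case where the site set keeps changing from round to round.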

BQSR bootstrapping workflow: round #1
Original BAM file -> HaplotypeCaller -> Raw VCF file -> VariantFiltration -> Known sites
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table

BQSR bootstrapping workflow: rounds #2 to #N
Original BAM file -> HaplotypeCaller (-BQSR recal.table) -> Raw VCF file -> VariantFiltration -> Known sites
Original BAM file + Known sites -> BaseRecalibrator -> Recalibration table

We are here in the Best Practices workflow
Next Step: Variant Discovery

Further reading
- http://www.broadinstitute.org/gatk/guide/best-practices
- http://www.broadinstitute.org/gatk/guide/article?id=44
- http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_baserecalibrator.html
- http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_printreads.html
- http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_analyzecovariates.html