Workflow. Reference Genome. Variant Calling. Galaxy Format Conversion Groomer. Mapping BWA GATK Preprocess

Laurence Hutchinson
8 years ago
1 Workflow
Input: Fastq, Reference Genome
Galaxy format conversion: Groomer
Quality control: FastQC
Mapping: BWA
Format conversion: Sam-to-Bam
Removing PCR duplicates: MarkDup
Preprocess: GATK Base Recalibration
Preprocess: GATK Indel Realignment
Variant calling: GATK Unified Genotyper
Mpileup
Variant calling: VarScan
VCF filtering
VCF annotation

2 Genome Analysis Toolkit

3 Outline
Introduction
Preprocessing of NGS data
Variant discovery
Why perform realignment/recalibration?
Hands-on session

4 Introduction

5 GATK
GATK: Genome Analysis ToolKit
"The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data", McKenna et al. (2010)
Developed by the development team of the Broad Institute (USA)
Used in many projects (1000 Genomes Project, The Cancer Genome Atlas...)
Originally developed for human genetics, but now generic
Written in Java
Citations (sources: GATK website*, Google Scholar)
* Nature, Science, Nature Genetics, Nature Biotechnology, New England Journal of Medicine, Cell, and Genome Research.

6 GATK

7 How is a SNP detected?

8 How is a SNP detected?

9 How is a SNP detected?

10 How is a SNP detected?
Complex Bayesian algorithms based on:
Scales: base, read, position, genotype
Criteria: Phred base quality; mapping quality; forward/reverse strand; ALT allele count; REF allele count; overall genotype association ALT / REF; read depth; SNP quality
Phred scale: 10 => P error = 1 / 10; 30 => P error = 1 / 1000
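The Phred scale quoted on this slide follows Q = -10 * log10(P error). A minimal sketch of the conversion in both directions is given here; the function names are illustrative and do not come from GATK.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability for a few common quality values.
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

The same scale is used for base qualities, mapping qualities and SNP qualities, which is why a single conversion covers all of them.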

11 How is a SNP detected?
Known sequencing biases:
GA / HiSeq: base quality
454: homopolymers
SOLiD: base quality + color-space translation
Scales: base, read, position, genotype
Criteria: Phred base quality; mapping quality; forward/reverse strand; ALT allele count; REF allele count; overall genotype association ALT / REF; read depth; SNP quality

12 Preprocessing of NGS data

13 Raw reads
Produced by the sequencers' software
A first read recalibration/correction step can be performed:
454: Pyrobayes / Pyrocleaner
SOLiD: Rsolid
Illumina: Ibis / BayesCall
Pro: error rate improved by 5 to 30%
Con: computation time

14 Raw reads
NATAAATGCTGTCATACAGACTTGTTGGTGTTGTAAGGCAGCAGACTCCTTTGAGCTTTCATCCGAGAACAATTGAGACTAAATTCCTGGTGCAAAGTCCA
+HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
BP\cceeegffgghfhiiiefgihhhii[baegegfgiiiihhiiihhfhfhighihiiifhhfihieeegaceeedcdddd`bcbcccbbcbcccccbcb
@HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
NAAGAAGGCACGAAGCAACTACTTCACTGCATGCTGCCTGTCCTTGGGCTGTTTGCTGCCTTTGGCTAACACCTTTGATTATTTCTGGCTAAGTAGATAGG
+HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
BS\ceeeegggfgiiiiiiiiiiiiiiiiiiiiiiiiiiihiiiiihhghihhihiiiiiihhiihgggfgeeedddddddbededcccc`bcbeccddcc
@HISEQ4_0105:4:1101:3251:1984#TAGCTT/1
NAGAGCTATTTATGAAAACGAGGATGACTAAAACTGCCCAGAAAAAAAACCAACCAACCACGTTTCCAGTGACTGCCACCCTTAGCAAGCAAGGTAATAAC
csfasta + Qual
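The reads on this slide are FASTQ records of four lines each: `@` header, sequence, `+` line, quality string. Here is a minimal sketch of reading such records and decoding the quality characters. The record in the example is a short hypothetical one, and the offset of 64 is an assumption suggested by the characters seen on the slide (`B`, lowercase letters); many files use an offset of 33 instead.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from consecutive 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (l.rstrip("\n") for l in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def decode_qualities(qual: str, offset: int = 64) -> List[int]:
    """Turn a quality string into Phred scores (the offset is an assumption)."""
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    # Hypothetical 5-base record, not taken from the slide.
    record = ["@read1/1", "NACGT", "+read1/1", "BP\\ce"]
    for name, seq, qual in read_fastq(record):
        print(name, seq, decode_qualities(qual))
```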

15 Mapping
Alignment of the reads against the reference genome
Any software that produces BAM files can be used, e.g. BWA, Bowtie, Gsnap, SOAP, SSAHA
One file per lane / individual / condition, or grouped files with read groups (mandatory)

http://seqanswers.com/forums/showthread.php?
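Since read groups are mandatory, it can help to see what a read group looks like in a SAM/BAM header. This sketch builds an `@RG` header line from the standard SAM tags `ID`, `SM`, `PL` and `LB`. The helper name and the default platform value are illustrative assumptions; the sample name `ind1` echoes the `RG:Z:ind1` tags shown on the next slide.

```python
def read_group_header(rg_id: str, sample: str,
                      platform: str = "ILLUMINA", library: str = "") -> str:
    """Build a tab-separated @RG header line (hypothetical helper)."""
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"PL:{platform}"]
    if library:
        fields.append(f"LB:{library}")
    return "\t".join(fields)


if __name__ == "__main__":
    print(read_group_header("ind1", "ind1"))
```

Each read then carries an `RG:Z:<ID>` tag pointing to one of these header lines, which is how several individuals can share a single BAM file.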

16 Mapping
PHOSPHORE:181:C0KD3ACXX:8:2101:3676: M = X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:2101:3676: * = RG:Z:ind1
PHOSPHORE:181:C0KD3ACXX:8:1206:13256: M = X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1206:13256: M = TCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCATCAGTATAAGGCACTCTGAAAGAAAGCAATCTAAATCCC :>DCDDDECAA>>@BFFEC@EIHE;GBHF=GFGHGGGGIIHFHGDG@GDB9IIJIIGHHGGGHIIGDIIHFHHEFGEIIJHGH?GBGIHHGGDFFDFFCC@ X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1202:6947: M = GCAGGCTTTTAAGAATATGTTCTGTTTTCAAATAGTAACCCAAAAAGGGGTGGGGGCGGGGGCAAAGTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CC@FFFFFGHGFHFGGGII>JHGGEHIJIIEHHEGHIGHIJJGGIJFGIJ@FHIIHFBDBDDBB@BC44@:@4?><8A2<2?8?<B<<2<2<<A<ABB? X0:i:1 X1:i:0 MD:Z:99 RG:Z:ind2 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:U
SAM specification:

BDDCCDDCDDDDCAC: X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 133 13 10354712 0 * = 10354712 0
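To read records like these, two pieces are useful: the FLAG column (e.g. 133 above) and the optional `TAG:TYPE:VALUE` fields. This is a minimal sketch of both, using the flag bits defined in the SAM specification. The function names are illustrative, and only the `i` (integer) type is converted; other types are kept as strings.

```python
from typing import Dict, List, Union

# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_tags(fields: List[str]) -> Dict[str, Union[int, str]]:
    """Parse optional SAM fields of the form TAG:TYPE:VALUE."""
    tags: Dict[str, Union[int, str]] = {}
    for field in fields:
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return tags


if __name__ == "__main__":
    print(decode_flag(133))
    print(parse_tags("X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 SM:i:37 XT:A:U".split()))
```

Under this decoding, flag 133 describes a paired read that is itself unmapped and is the second read of its pair, which is consistent with the `*` CIGAR of that record.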

17 Duplicate Marking/Removing
PCR duplicates (introduced during library construction)
Tools: Samtools rmdup, Picard MarkDuplicates
Steps: identification, removal
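The idea behind duplicate identification can be sketched as follows: reads that start at the same position on the same strand are treated as copies of one fragment, and only the one with the best base qualities is kept. This is a simplified illustration under that assumption, not the actual algorithm of Samtools rmdup or Picard MarkDuplicates, which also handle read pairs and clipping. The `Read` type and its fields are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int            # 5' alignment start on the reference
    reverse: bool       # True if aligned to the reverse strand
    base_qual_sum: int  # sum of base qualities, used to pick the representative


def mark_duplicates(reads: List[Read]) -> Set[str]:
    """Return the names of reads flagged as duplicates."""
    groups: Dict[Tuple[str, int, bool], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.reverse)].append(read)

    duplicates: Set[str] = set()
    for group in groups.values():
        # Keep the read with the highest quality sum, flag the others.
        best = max(group, key=lambda r: r.base_qual_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


if __name__ == "__main__":
    reads = [
        Read("a", "13", 100, False, 3000),
        Read("b", "13", 100, False, 3500),
        Read("c", "13", 100, True, 2000),
        Read("d", "13", 250, False, 3100),
    ]
    print(mark_duplicates(reads))
```

Marking keeps the flagged reads in the file so that downstream tools can ignore them, whereas removal deletes them outright.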

18 Local Realignment
Identification of the regions to realign: "The algorithm begins by first identifying regions for realignment where 1) at least one read contains an indel, 2) there exists a cluster of mismatching bases or 3) an already known indel segregates at the site" DePristo et al. (2011)
Realignment of the reads: "Next, all reads are realigned against just the best haplotype Hi and the reference (H0), and each read Rj is assigned to Hi or H0" DePristo et al. (2011)
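Criterion 1 of the quoted description (at least one read contains an indel) can be sketched from the CIGAR strings alone. The code below is a minimal illustration under that assumption; it does not cover mismatch clusters or known indels, and it is not the GATK implementation. Function names are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_intervals(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Reference intervals (1-based, inclusive) touched by indels in one read."""
    intervals = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "D":
            # A deletion spans n reference bases.
            intervals.append((ref, ref + n - 1))
            ref += n
        elif op == "I":
            # An insertion sits between reference positions ref-1 and ref.
            intervals.append((ref - 1, ref))
        elif op in "MN=X":
            ref += n
        # S, H and P do not consume reference bases.
    return intervals


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent intervals into candidate regions."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    hits = indel_intervals(100, "50M2D49M") + indel_intervals(100, "30M3I68M")
    print(merge_intervals(hits))
```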

19 Local Realignment

20 Base quality recalibration
"The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context" DePristo et al. (2011); Ewing and Green (1998); Li et al. (2004; 2009)
Raw data: Mean BQ = 32.8, Median = 36.7

21 Base quality recalibration: consequences
Raw data: Mean BQ = 32.8, Median = 36.7
Recalibrated data: Mean BQ = 28.8, Median = 28.7
Lower variability
Lower mean quality

22 Base quality recalibration DePristo et al (2011)
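The principle of recalibration can be sketched as counting, for each combination of covariates, how often a base disagrees with the reference at sites not known to be variant, then converting that empirical error rate back to the Phred scale. This is an illustrative sketch, not the GATK code: the two covariates (reported quality, machine cycle) come from the quote above, while the function names, the input layout and the +1/+2 smoothing are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def empirical_quality(mismatches: int, observations: int) -> float:
    """Phred-scaled empirical error rate (the +1/+2 smoothing is an assumption)."""
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


def build_recal_table(
    observations: Iterable[Tuple[int, int, bool]]
) -> Dict[Tuple[int, int], float]:
    """Map (reported_quality, cycle) to an empirical quality.

    Each observation is (reported_quality, cycle, is_mismatch) for one base
    at a site that is NOT a known variant.
    """
    counts: Dict[Tuple[int, int], List[int]] = defaultdict(lambda: [0, 0])
    for reported_q, cycle, is_mismatch in observations:
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    return {key: empirical_quality(m, n) for key, (m, n) in counts.items()}


if __name__ == "__main__":
    # 998 bases reported at Q36 in cycle 1, one of which mismatches the reference.
    obs = [(36, 1, False)] * 997 + [(36, 1, True)]
    print(build_recal_table(obs))
```

Excluding known variant sites is what ties this step to a catalogue of known SNPs: a true variant would otherwise be counted as a sequencing error.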

23 From raw data to analysis-ready reads
New BAM file
It can then be used with other tools for downstream analyses (Samtools mpileup, Popoolation, etc.)

24 Variant discovery

25 Single vs Multiple sample analysis
Source: "Data processing and analysis of genetic variation using next-generation sequencing", Mark DePristo, Dec. 8th, 2011

26 Unified Genotyper
GATK tool
Multiple-sample analysis
Several detection modes: SNPs, indels
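As a rough illustration of the Bayesian reasoning behind SNP genotyping, the sketch below computes diploid genotype likelihoods for one sample at one site from the observed bases and their Phred qualities. It is a textbook-style simplification under stated assumptions (independent reads, errors spread evenly over the three other bases, no priors, single sample), not the UnifiedGenotyper implementation.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

BASES = "ACGT"


def base_likelihood(observed: str, allele: str, error: float) -> float:
    """P(observed base | true allele), with errors split evenly over 3 bases."""
    return 1 - error if observed == allele else error / 3


def genotype_log_likelihoods(pileup: List[Tuple[str, int]]) -> Dict[str, float]:
    """log10 likelihood of each diploid genotype given (base, phred) pairs."""
    result: Dict[str, float] = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for base, qual in pileup:
            error = 10 ** (-qual / 10)
            # Each read is drawn from either chromosome with probability 1/2.
            p = 0.5 * base_likelihood(base, a1, error) \
                + 0.5 * base_likelihood(base, a2, error)
            total += math.log10(p)
        result[a1 + a2] = total
    return result


if __name__ == "__main__":
    pileup = [("A", 30)] * 5 + [("G", 30)] * 5
    likelihoods = genotype_log_likelihoods(pileup)
    print(max(likelihoods, key=likelihoods.get))
```

Because the base qualities enter the likelihood directly, miscalibrated qualities shift the genotype call, which is one way to see why recalibration matters before this step.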

27 VCF format

28 Why perform realignment/recalibration?

29 Comparison of SNP calling tools
SIGENAE Team, LGC - INRA
APACHE Project (Alain Vignal)
Goal: to find SNPs (Single Nucleotide Polymorphisms) that differentiate populations
Barbary duck: no reference genome (the Beijing duck genome is available)
Pictures: Beijing duck, Barbary duck
Journée Bioinfo Génotoul 29/03/2012

30 Impact of realignment / recalibration on SNP count
Tools compared: Mpileup, Mpileup -B, Mpileup -E, GATK, Popoolation
Data sets: raw data, realigned data, recalibrated data, realigned & recalibrated data
Figure annotations: Δ = 777%, Δ = 714%, Δ = 42%, Δ = 45%
More homogeneous SNP count
Higher impact of recalibration on SNP count

31 Reliable results with other species?
Δ between tools (Mpileup, Mpileup -B, Mpileup -E, GATK), by species:
Species   Raw data   Realigned/Recal data
DUCK      777%       20%
CHICKEN   234%       4%
PIG       454%       9%
BAM sets compared: raw, realigned, recalibrated, realigned & recalibrated
Not the same proportion, but a huge impact of realignment/recalibration

32 Conclusion
Variability between the SNPs called by different tools
GATK realignment/recalibration greatly helps to reduce this variability
High impact of base quality score
Reliable on various DNA data, but not on RNA data
Nature Genetics 2012:
"We recommend a recalibration of per-base quality scores as in GATK or SOAPsnp"
"Several additional steps can be taken to improve genotype calls, such as local realignments..."

33 Summary
GATK takes some getting used to
Strengths:
Fairly fast to run, thanks to possible parallelization
Allele counting
Multi-allelic positions are taken into account
Many features and options
SNPs appear to be reliable
Frequent improvements
Website
Weaknesses:
Recalibration relies on known SNPs...
Originally created for the analysis of human genomes
Many steps before running the UnifiedGenotyper
Requires a lot of disk space to follow the pipeline from end to end

34 Hands-on session: Galaxy

35 The GATK reference website
Download of software + resources (VCF)
Guide
Best Practices for analysis
Forum
Technical documentation
Etc.

36 References
Samtools: Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25 (2009).
Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18 (2008).
GATK: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491 (2011).
GATK: McKenna A.H., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. (2010).
Popoolation2: Kofler R., Pandey R.V., Schlotterer C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics (2011).
Pyrobayes: Quinlan A.R., Stewart D.A., Strömberg M.P., Marth G.T. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods (2008).
BayesCall: Kao W.-C., Stevens K., Song Y.S. BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res. (2009).
Ibis: Kircher M., Stenzel U., Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. (2009).
Pyrocleaner: Mariette J., Noirot C., Klopp C. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Research Notes (2011).
Nielsen R. et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12 (2011).
