Workflow (Galaxy)
Inputs: Fastq, Reference Genome
  Format conversion: Groomer
  Quality control: FastQC
  Mapping: BWA
  Format conversion: Sam-to-Bam
  Removing PCR duplicates: MarkDup
  GATK preprocessing: Base Recalibration
  GATK preprocessing: Indel Realignment
  Variant calling (GATK): Unified Genotyper
  Variant calling: Mpileup
  Variant calling: VarScan
  VCF filtering
  VCF annotation
Genome Analysis Toolkit
Outline
  Introduction
  NGS data preprocessing
  Variant discovery
  Why perform realignment/recalibration?
  Hands-on session
Introduction
GATK
GATK: Genome Analysis ToolKit
The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data, McKenna et al. (2010)
http://www.broadinstitute.org/gatk/about/
Developed by the development team at the Broad Institute (USA)
Used in many projects (1000 Genomes Project, The Cancer Genome Atlas...)
Originally developed for human genetics, now generic
Written in Java
Citations per year (2010, 2011, 2012, 2013):
  GATK Website*: 2, 9, 25
  Google Scholar: 28, 145, 436, 767
* Nature, Science, Nature Genetics, Nature Biotechnology, New England Journal of Medicine, Cell, and Genome Research.
How is a SNP detected?
Complex Bayesian algorithms based on criteria at four scales: base, read, position, genotype
Criteria: Phred base quality; mapping quality; forward/reverse strand; ALT allele count; REF allele count; overall ALT/REF genotype association; read depth; SNP quality
Phred scale:
  Q = 10 => P(error) = 1/10
  Q = 30 => P(error) = 1/1000
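The two Phred values on the slide follow from the standard definition Q = -10 log10(P). A minimal Python sketch of the conversion in both directions (the function names are illustrative, not part of any tool):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score into an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) into a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q = 10 and Q = 30 are the two values quoted on the slide.
    for q in (10, 20, 30, 40):
        print(f"Q = {q:2d}  ->  P(error) = {phred_to_error_prob(q):g}")
```

The same scale is used for base qualities, mapping qualities and SNP qualities, which is why a single conversion is enough for all three.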
Known sequencing biases:
  GA / Hi-Seq: base quality
  454: homopolymers
  SOLiD: base quality + color-space translation
NGS data preprocessing
Raw reads
Produced by the sequencers' software
A first read recalibration/correction step can be performed:
  454: Pyrobayes / Pyrocleaner
  SOLiD: Rsolid
  Illumina: Ibis / BayesCall
  + Error rate improved by 5 to 30%
  - Computation time
Raw reads
Fastq:
  @HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
  NATAAATGCTGTCATACAGACTTGTTGGTGTTGTAAGGCAGCAGACTCCTTTGAGCTTTCATCCGAGAACAATTGAGACTAAATTCCTGGTGCAAAGTCCA
  +HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
  BP\cceeegffgghfhiiiefgihhhii[baegegfgiiiihhiiihhfhfhighihiiifhhfihieeegaceeedcdddd`bcbcccbbcbcccccbcb
  @HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
  NAAGAAGGCACGAAGCAACTACTTCACTGCATGCTGCCTGTCCTTGGGCTGTTTGCTGCCTTTGGCTAACACCTTTGATTATTTCTGGCTAAGTAGATAGG
  +HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
  BS\ceeeegggfgiiiiiiiiiiiiiiiiiiiiiiiiiiihiiiiihhghihhihiiiiiihhiihgggfgeeedddddddbededcccc`bcbeccddcc
  @HISEQ4_0105:4:1101:3251:1984#TAGCTT/1
  NAGAGCTATTTATGAAAACGAGGATGACTAAAACTGCCCAGAAAAAAAACCAACCAACCACGTTTCCAGTGACTGCCACCCTTAGCAAGCAAGGTAATAAC
csfasta + Qual
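Each Fastq record spans four lines: name, sequence, separator, qualities. A minimal Python sketch for reading records and decoding the quality string is given here. The offset of 64 is an assumption inferred from the characters in the example (letters up to `i` would give implausible scores of 72 with an offset of 33); other files may use 33. The function names are illustrative.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality_string) from an iterable of Fastq lines.

    Assumes the simple 4-lines-per-record layout; an incomplete trailing
    record is silently ignored.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed Fastq record near {header!r}")
        yield header[1:], seq, qual


def decode_qualities(qual, offset=64):
    """Turn a quality string into Phred scores (offset 64 is an assumption)."""
    return [ord(char) - offset for char in qual]


if __name__ == "__main__":
    # First record of the slide, shortened to 8 bases for illustration.
    example = [
        "@HISEQ4_0105:4:1101:1533:1998#TAGCTT/1",
        "NATAAATG",
        "+HISEQ4_0105:4:1101:1533:1998#TAGCTT/1",
        "BP\\cceee",
    ]
    for name, seq, qual in read_fastq(example):
        print(name, seq, decode_qualities(qual))
```

Converting between quality encodings is the purpose of the Groomer step in the Galaxy workflow.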
Mapping
Alignment of reads against the reference genome
Any software producing BAM files
  E.g. BWA, Bowtie, Gsnap, SOAP, SSAHA
  http://seqanswers.com/forums/showthread.php?t=43
One file per lane / individual / condition, or grouped using read groups (mandatory)
Mapping
PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 73 13 10354712 37 101M = 10354712 0 GCCTAGTCCTTTGAGACAGGAGTAAGACAAGAACTCAGGTTAGGGACCTCAAGGACTTGCTGAAGCCCACAAAGATTAGGACAAGCTAATGGAACTCAGAC @@CFDFDFHGHHHIIJJJIJJJCFHIJIJIIFIJJJIJECFGGIGJIIJIJIIJJIGIIIIGGIJJJIGHHEFDFFFDDDCCED?BDDCCDDCDDDDCAC: X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 133 13 10354712 0 * = 10354712 0 GTTAGGGACCTTAAGGATCAATCTTGTCTGAGTTCCATTAGCTTGTCCTAATCTTTGTGGGCTTCAGCAAGTCCTTGAGGTCCCTAACCTGAGTTCTTGTC @@CFFFFFHHHHHJJJJIJJJJIIIIGIIJJJFHIJIJIIJJJJE?DGGCGHIJIJIGIIIIDGFHIIIIGHIJJF@CEH@CFF@CCEEA=CC;@ACA@C5 RG:Z:ind1
PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 99 13 10355951 60 101M = 10355989 139 TGGGAAGGCTTACTGTCTTCATGCAGGATCTGTGTGGCTCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCAT @@CFDFF?DHFHHIIHGHIJJG@HG<FHIIIIJJGGGDGIIJIIJJIGGEBD*?DDGHGGGIGHIH>GG;C>AAAC@DFD;@CECAACDCBBBB9A>>@CA X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 147 13 10355989 60 101M = 10355951 -139 TCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCATCAGTATAAGGCACTCTGAAAGAAAGCAATCTAAATCCC :>DCDDDECAA>>@BFFEC@EIHE;GBHF=GFGHGGGGIIHFHGDG@GDB9IIJIIGHHGGGHIIGDIIHFHHEFGEIIJHGH?GBGIHHGGDFFDFFCC@ X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1202:6947:20338 99 13 10358279 29 99M = 10358378 154 GCAGGCTTTTAAGAATATGTTCTGTTTTCAAATAGTAACCCAAAAAGGGGTGGGGGCGGGGGCAAAGTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CC@FFFFFGHGFHFGGGII>JHGGEHIJIIEHHEGHIGHIJJGGIJFGIJ@FHIIHFBDBDDBB@BC44@:@4?><8A2<2?8?<B<<2<2<<A<ABB? X0:i:1 X1:i:0 MD:Z:99 RG:Z:ind2 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:U
SAM specification: http://samtools.sourceforge.net/
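The second column of each SAM record is a bitwise flag. A minimal Python sketch that decodes it, using the bit values defined in the SAM specification (the short names attached to each bit are illustrative labels, not official identifiers):

```python
# Bit values from the SAM specification; the labels are informal.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the list of labels whose bit is set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    # The four flag values that appear in the records above.
    for flag in (73, 133, 99, 147):
        print(flag, decode_flag(flag))
```

For the records above, 73 and 133 describe a pair in which only the first read is mapped, while 99 and 147 describe a properly paired forward/reverse pair. The `duplicate` bit (0x400) is the one set by the duplicate-marking step.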
Duplicate Marking/Removing
PCR duplicates (library construction)
Tools: Samtools rmdup, Picard MarkDuplicates
Steps: identification, removal
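The identification step can be sketched in Python as follows. This is a deliberately simplified illustration, not the algorithm of Samtools or Picard: it assumes that reads sharing chromosome, start position and strand are duplicates, and that the copy with the highest summed base quality is the one to keep. The dictionary keys (`chrom`, `pos`, `strand`, `qual_sum`, `duplicate`) are hypothetical; real tools use more elaborate criteria.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Set read["duplicate"] for every read; keep one best read per group.

    Simplifying assumption: reads with the same (chrom, pos, strand) are
    PCR duplicates of each other.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            read["duplicate"] = read is not best
    return reads


if __name__ == "__main__":
    reads = [
        {"name": "a", "chrom": "13", "pos": 100, "strand": "+", "qual_sum": 300},
        {"name": "b", "chrom": "13", "pos": 100, "strand": "+", "qual_sum": 350},
        {"name": "c", "chrom": "13", "pos": 200, "strand": "-", "qual_sum": 280},
    ]
    for read in mark_duplicates(reads):
        print(read["name"], "duplicate" if read["duplicate"] else "kept")
```

Marking keeps the duplicate reads in the file with a flag set, whereas removing deletes them, so marking is the reversible option.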
Local Realignment
Identification of the regions to realign:
  «The algorithm begins by first identifying regions for realignment where 1) at least one read contains an indel, 2) there exists a cluster of mismatching bases or 3) an already known indel segregates at the site» DePristo et al. (2011)
Read realignment:
  «Next, all reads are realigned against just the best haplotype Hi and the reference (H0), and each read Rj is assigned to Hi or H0» DePristo et al. (2011)
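The first criterion of the quote (at least one read contains an indel) can be illustrated from the CIGAR strings of the aligned reads. The Python sketch below is an assumption-laden illustration, not GATK's implementation: it covers only criterion 1, ignores mismatch clusters and known indels, and the function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_intervals(pos, cigar):
    """Return reference intervals (start, end), 1-based inclusive, that
    surround the indels of one read aligned at position `pos`."""
    intervals = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "D":
            intervals.append((ref, ref + length - 1))
            ref += length
        elif op == "I":
            # An insertion lies between reference bases ref - 1 and ref.
            intervals.append((ref - 1, ref))
        elif op in "MN=X":
            ref += length
        # S, H and P do not consume reference bases.
    return intervals


def merge_intervals(intervals):
    """Merge overlapping or adjacent intervals into candidate regions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Hypothetical alignments (position, CIGAR) used only for illustration.
    alignments = [(100, "50M2D49M"), (100, "10M3I88M"), (120, "31M5D65M")]
    found = []
    for pos, cigar in alignments:
        found.extend(indel_intervals(pos, cigar))
    print(merge_intervals(found))
```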
Base quality recalibration
«The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context» DePristo et al. (2011); Ewing and Green (1998); Li et al. (2004; 2009)
Raw data: mean BQ = 32.8, median = 36.7
Base quality recalibration: consequences
Raw data: mean BQ = 32.8, median = 36.7
Recalibrated data: mean BQ = 28.8, median = 28.7
Lower variability
Lower mean quality
Base quality recalibration DePristo et al. (2011)
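The principle of recalibration can be sketched as: count how often bases with a given reported quality disagree with the reference at sites not known to be variant, then convert that observed error rate back to the Phred scale. The Python sketch below is a simplified illustration, not GATK's implementation: it bins on reported quality only (the quote also mentions machine cycle and sequence context), and the +1/+2 smoothing is an assumption chosen here to avoid a zero error rate.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirical Phred quality.

    `observations` is an iterable of (reported_quality, is_mismatch) pairs,
    collected at positions assumed not to be true variants.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [bases seen, mismatches]
    for quality, is_mismatch in observations:
        counts[quality][0] += 1
        counts[quality][1] += int(is_mismatch)

    table = {}
    for quality, (total, errors) in counts.items():
        error_rate = (errors + 1) / (total + 2)  # smoothed, assumption
        table[quality] = -10 * math.log10(error_rate)
    return table


if __name__ == "__main__":
    # Hypothetical counts: bases reported as Q30 that mismatch 9 times in 998.
    observations = [(30, True)] * 9 + [(30, False)] * 989
    for reported, empirical in empirical_qualities(observations).items():
        print(f"reported Q{reported} -> empirical Q{empirical:.1f}")
```

In this made-up example the bases reported at Q30 behave like Q20 bases, which is the kind of downward correction summarized by the lower mean quality after recalibration. Because true variants would be counted as errors, known SNPs have to be excluded.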
Raw data to analysis-ready reads
New BAM file
Can then be used with other tools for downstream analyses (Samtools mpileup, Popoolation, etc.)
Variant discovery
Single vs Multiple sample analysis
Data processing and analysis of genetic variation using next-generation sequencing, Mark DePristo, Dec. 8th, 2011 (http://www.broadinstitute.org/gatk/best-practices.htm)
Unified Genotyper
GATK tool
Multiple sample analysis
Several detection modes:
  SNP
  Indels
VCF format
http://www.broadinstitute.org/gatk/how-should-i-interpret-vcf-files-produced-by-the-gatk.htm
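A VCF data line is tab-separated, with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) optionally followed by per-sample columns. A minimal Python sketch for reading the fixed columns; the example line is made up for illustration and is not taken from real output, and the function name is hypothetical.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        # Keys without a value are flags (for example DB).
        info_dict[key] = value if value else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
        "samples": fields[8:],  # FORMAT and genotype columns, left unparsed
    }


if __name__ == "__main__":
    # Made-up line, for illustration only.
    example = "13\t10355951\t.\tT\tC\t50.0\tPASS\tDP=20;AF=0.5;DB"
    record = parse_vcf_line(example)
    print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```

Header lines (starting with `#`) are not handled here and would need to be skipped before calling the function.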
Why perform realignment/recalibration?
Comparison of SNP calling tools
SIGENAE Team, LGC - INRA
APACHE Project (Alain Vignal)
Goal: to find SNPs (single nucleotide polymorphisms) that differentiate populations
Barbary duck: no reference genome (the Beijing duck genome is available)
Journée Bioinfo Génotoul, 29/03/2012
Impact of realignment / recalibration on SNP count
[Figure: SNP count (0 to 80000) per tool (Mpileup, Mpileup -B, Mpileup -E, GATK, Popoolation2) on raw data, realigned data, recalibrated data, and realigned & recalibrated data. Δ values shown on the figure: 777%, 714%, 42%, 45%.]
More homogeneous SNP count
Higher impact of recalibration on SNP count
Reliable results with other species?
[Figure: SNP count per tool (Mpileup, Mpileup -B, Mpileup -E, GATK) on raw, realigned, recalibrated, and realigned & recalibrated BAMs, one panel per species.]
Δ between tools:
  Species   Raw data   Realigned/Recal data
  DUCK      777%       20%
  CHICKEN   234%       4%
  PIG       454%       9%
Not the same proportion, but a huge impact of realignment/recalibration
Conclusion
Variability between the SNPs called by different tools
GATK realignment/recalibration greatly helps to reduce this variability
High impact of base quality score
Reliable on various DNA data, but not on RNA data
Nature Genetics 2012:
  «We recommend a recalibration of per-base quality scores as in GATK or SOAPsnp»
  «Several additional steps can be taken to improve genotype calls, such as local realignments...»
Summary
GATK takes some getting used to
Strengths:
  Fairly fast to run thanks to possible parallelization
  Allele counting
  Handles multi-allelic positions
  Many features and options
  SNPs appear to be reliable
  Frequent improvements
  Website
Weaknesses:
  Recalibration based on known SNPs...
  Originally created for the analysis of human genomes
  Many steps before running the UnifiedGenotyper
  Requires a lot of disk space to run the pipeline end to end
Hands-on session: Galaxy
The GATK reference website
http://www.broadinstitute.org/gatk/index.php
  Software downloads + resources (vcf)
  Analysis guide
  Best Practices
  Forum
  Technical documentation
  Etc.
References
Samtools
  Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25, 2078-9 (2009).
  Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851-8 (2008).
GATK
  A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491 (2011).
  McKenna A.H., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. (2010).
Popoolation2
  Kofler R., Pandey R.V., Schlotterer C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics (2011).
Pyrobayes
  Quinlan A.R., Stewart D.A., Strömberg M.P., Marth G.T. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods (2008).
BayesCall
  Kao W-C., Stevens K., Song Y.S. BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res. (2009).
Ibis
  Kircher M., Stenzel U., Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. (2009).
Pyrocleaner
  Mariette J., Noirot C., Klopp C. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Research Notes (2011).
SNP calling
  Nielsen R. et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12: 443-451 (2011).