Workflow: Reference Genome, Format Conversion (Groomer), Mapping (BWA), GATK Preprocessing, Variant Calling in Galaxy




Workflow (Galaxy)
Inputs: Fastq, Reference Genome
- Format conversion: Groomer
- Quality control: FastQC
- Mapping: BWA
- Format conversion: Sam-to-Bam
- Removing PCR duplicates: MarkDup
- GATK preprocessing: Base Recalibration
- GATK preprocessing: Indel Realignment
- Variant calling (GATK): Unified Genotyper
- Mpileup
- Variant calling: VarScan
- VCF filtering
- VCF annotation

Genome Analysis Toolkit

Outline
- Introduction
- Preprocessing of NGS data
- Variant discovery
- Why realign/recalibrate?
- Practical session

Introduction

GATK
GATK: Genome Analysis ToolKit
"The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data", McKenna et al. (2010)
http://www.broadinstitute.org/gatk/about/
- Developed by the development team at the Broad Institute (USA)
- Used in many projects (1000 Genomes Project, The Cancer Genome Atlas...)
- Originally developed for human genetics, but now generic
- Written in Java
Citations:
Years: 2010, 2011, 2012, 2013
GATK Website*: 2, 9, 25
Google Scholar: 28, 145, 436, 767
* Nature, Science, Nature Genetics, Nature Biotechnology, New England Journal of Medicine, Cell, and Genome Research.


How is a SNP detected?
Complex Bayesian algorithms based on information at several scales:
- Scales: base, read, position, genotype
- Criteria: Phred base quality, mapping quality, forward/reverse strand, ALT allele count, REF allele count, overall genotype association ALT / REF, read depth, SNP quality
Phred quality scale: 10 => P(error) = 1/10; 30 => P(error) = 1/1000
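The two Phred values quoted above follow from the standard relation P(error) = 10^(-Q/10). A minimal Python sketch of the conversion in both directions (the function names are illustrative, not taken from any tool in the slides):

```python
import math


def phred_to_error_prob(quality):
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(prob):
    """Convert an error probability back into a Phred quality score."""
    return -10 * math.log10(prob)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

The same scale is used for base qualities, mapping qualities, and SNP qualities, so a single conversion covers all three.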

How is a SNP detected? Known sequencing biases:
- GA / HiSeq: base quality
- 454: homopolymers
- SOLiD: base quality + color-space translation

Preprocessing of NGS data

Raw reads
Produced by the sequencers' software.
A first read recalibration/correction step can be performed:
- 454: Pyrobayes / Pyrocleaner
- SOLiD: Rsolid
- Illumina: Ibis / BayesCall
Pro: error rate improved by 5 to 30%
Con: computation time

Raw reads: Fastq
@HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
NATAAATGCTGTCATACAGACTTGTTGGTGTTGTAAGGCAGCAGACTCCTTTGAGCTTTCATCCGAGAACAATTGAGACTAAATTCCTGGTGCAAAGTCCA
+HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
BP\cceeegffgghfhiiiefgihhhii[baegegfgiiiihhiiihhfhfhighihiiifhhfihieeegaceeedcdddd`bcbcccbbcbcccccbcb
@HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
NAAGAAGGCACGAAGCAACTACTTCACTGCATGCTGCCTGTCCTTGGGCTGTTTGCTGCCTTTGGCTAACACCTTTGATTATTTCTGGCTAAGTAGATAGG
+HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
BS\ceeeegggfgiiiiiiiiiiiiiiiiiiiiiiiiiiihiiiiihhghihhihiiiiiihhiihgggfgeeedddddddbededcccc`bcbeccddcc
@HISEQ4_0105:4:1101:3251:1984#TAGCTT/1
NAGAGCTATTTATGAAAACGAGGATGACTAAAACTGCCCAGAAAAAAAACCAACCAACCACGTTTCCAGTGACTGCCACCCTTAGCAAGCAAGGTAATAAC
csfasta + Qual
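The quality strings in the Fastq records above use characters from `B` up to `i`, which suggests (this is an inference from the characters, not something the slides state) the older Illumina Phred+64 encoding rather than Sanger Phred+33. The sketch below decodes the first five quality characters of the first record under that assumption and re-encodes them as Phred+33, which is the kind of conversion a "Groomer" step performs. The function names are illustrative.

```python
def decode_qualities(quality_string, offset=64):
    """Turn an ASCII quality string into a list of Phred scores.

    offset=64 is an assumption based on the characters seen in the example
    reads; Sanger / recent Illumina files use offset=33.
    """
    return [ord(char) - offset for char in quality_string]


def to_sanger(quality_string, offset=64):
    """Re-encode a quality string as Phred+33 (Sanger) characters."""
    return "".join(chr(q + 33) for q in decode_qualities(quality_string, offset))


if __name__ == "__main__":
    prefix = "BP\\cc"  # first five quality characters of the first record
    print(decode_qualities(prefix))
    print(to_sanger(prefix))
```

Under this encoding the leading `B` corresponds to Q2, which matches the `N` called at the first base of each read.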

Mapping
Alignment of reads against the reference genome.
Any software producing BAM files, e.g. BWA, Bowtie, Gsnap, SOAP, SSAHA
http://seqanswers.com/forums/showthread.php?t=43
One file per lane / individual / condition, or grouped using read groups (mandatory)

Mapping
PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 73 13 10354712 37 101M = 10354712 0 GCCTAGTCCTTTGAGACAGGAGTAAGACAAGAACTCAGGTTAGGGACCTCAAGGACTTGCTGAAGCCCACAAAGATTAGGACAAGCTAATGGAACTCAGAC @@CFDFDFHGHHHIIJJJIJJJCFHIJIJIIFIJJJIJECFGGIGJIIJIJIIJJIGIIIIGGIJJJIGHHEFDFFFDDDCCED?BDDCCDDCDDDDCAC: X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 133 13 10354712 0 * = 10354712 0 GTTAGGGACCTTAAGGATCAATCTTGTCTGAGTTCCATTAGCTTGTCCTAATCTTTGTGGGCTTCAGCAAGTCCTTGAGGTCCCTAACCTGAGTTCTTGTC @@CFFFFFHHHHHJJJJIJJJJIIIIGIIJJJFHIJIJIIJJJJE?DGGCGHIJIJIGIIIIDGFHIIIIGHIJJF@CEH@CFF@CCEEA=CC;@ACA@C5 RG:Z:ind1
PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 99 13 10355951 60 101M = 10355989 139 TGGGAAGGCTTACTGTCTTCATGCAGGATCTGTGTGGCTCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCAT @@CFDFF?DHFHHIIHGHIJJG@HG<FHIIIIJJGGGDGIIJIIJJIGGEBD*?DDGHGGGIGHIH>GG;C>AAAC@DFD;@CECAACDCBBBB9A>>@CA X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 147 13 10355989 60 101M = 10355951 -139 TCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCATCAGTATAAGGCACTCTGAAAGAAAGCAATCTAAATCCC :>DCDDDECAA>>@BFFEC@EIHE;GBHF=GFGHGGGGIIHFHGDG@GDB9IIJIIGHHGGGHIIGDIIHFHHEFGEIIJHGH?GBGIHHGGDFFDFFCC@ X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1202:6947:20338 99 13 10358279 29 99M = 10358378 154 GCAGGCTTTTAAGAATATGTTCTGTTTTCAAATAGTAACCCAAAAAGGGGTGGGGGCGGGGGCAAAGTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CC@FFFFFGHGFHFGGGII>JHGGEHIJIIEHHEGHIGHIJJGGIJFGIJ@FHIIHFBDBDDBB@BC44@:@4?><8A2<2?8?<B<<2<2<<A<ABB? X0:i:1 X1:i:0 MD:Z:99 RG:Z:ind2 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:U
SAM specifications: http://samtools.sourceforge.net/
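The second column of each SAM record above (73, 133, 99, 147) is a bitwise flag. As a reading aid, here is a small Python sketch that decodes it. The bit values come from the SAM specification; the short label strings are chosen here for illustration.

```python
# Bit values are defined by the SAM specification; the labels are illustrative.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


if __name__ == "__main__":
    for flag in (73, 133, 99, 147):
        print(flag, decode_flag(flag))
```

For the first pair in the example, 73 and 133 describe a first read that is mapped while its mate (the second read) is unmapped, which is consistent with the `*` CIGAR on the second line.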

Duplicate Marking/Removing
PCR duplicates (introduced during library construction)
Tools: Samtools rmdup, Picard MarkDuplicates
Operations: identification, removal
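To make the idea of duplicate identification concrete, here is a deliberately simplified Python sketch: reads that share the same chromosome, start position, and strand are treated as one duplicate set, and all but the read with the highest summed base quality are reported as duplicates. This is an illustration only, not the algorithm of Samtools rmdup or Picard MarkDuplicates (real tools also use information such as the mate's position), and the dictionary keys used for a read are hypothetical.

```python
from collections import defaultdict


def find_duplicates(reads):
    """Return the names of reads considered PCR duplicates.

    Each read is a dict with hypothetical keys:
    name, chrom, pos, reverse (bool), quals (list of Phred scores).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest total base quality, flag the rest.
        best = max(group, key=lambda r: sum(r["quals"]))
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


if __name__ == "__main__":
    reads = [
        {"name": "a", "chrom": "13", "pos": 100, "reverse": False, "quals": [30, 30, 30]},
        {"name": "b", "chrom": "13", "pos": 100, "reverse": False, "quals": [20, 20, 20]},
        {"name": "c", "chrom": "13", "pos": 250, "reverse": True, "quals": [30, 30, 30]},
    ]
    print(find_duplicates(reads))
```

Marking (setting the duplicate flag) keeps the reads in the file, whereas removal discards them. The sketch only identifies them and leaves that choice to the caller.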

Local Realignment
Identification of the regions to realign:
"The algorithm begins by first identifying regions for realignment where 1) at least one read contains an indel, 2) there exists a cluster of mismatching bases or 3) an already known indel segregates at the site" DePristo et al. (2011)
Realignment of the reads:
"Next, all reads are realigned against just the best haplotype Hi and the reference (H0), and each read Rj is assigned to Hi or H0" DePristo et al. (2011)
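The first of the three criteria quoted above (at least one read contains an indel) can be checked directly from the CIGAR string of an alignment. The following Python sketch covers that criterion only, under the assumption that alignments are available as CIGAR strings such as `101M`; it is not GATK's implementation, and the function name is illustrative.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def has_indel(cigar):
    """Return True if the CIGAR string contains an insertion (I) or deletion (D)."""
    if cigar == "*":  # unmapped read, no alignment available
        return False
    return any(op in "ID" for _length, op in CIGAR_OP.findall(cigar))


if __name__ == "__main__":
    for cigar in ("101M", "50M2D49M", "30M1I70M", "*"):
        print(cigar, has_indel(cigar))
```

All mapped reads in the SAM example shown earlier have CIGAR `101M` or `99M`, so none of them would flag a region under this criterion.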


Base quality recalibration
"The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context" DePristo et al. (2011); Ewing and Green (1998); Li et al. (2004; 2009)
Raw data: mean BQ = 32.8, median = 36.7

Base quality recalibration: consequences
Raw data: mean BQ = 32.8, median = 36.7
Recalibrated data: mean BQ = 28.8, median = 28.7
- Lower variability
- Lower mean quality
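The principle behind recalibration can be sketched as follows: group bases by their reported quality, count how often they mismatch the reference at positions not known to be variable, and replace the reported score by the empirical one. The Python sketch below shows only that core idea, with a simple +1/+2 smoothing chosen here for illustration. GATK's actual recalibration additionally conditions on covariates such as machine cycle and sequence context, and this is not its implementation.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirical Phred score.

    observations: iterable of (reported_quality, is_mismatch) pairs, assumed
    to come from positions that are not known polymorphic sites.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [bases seen, mismatches]
    for reported_quality, is_mismatch in observations:
        counts[reported_quality][0] += 1
        counts[reported_quality][1] += int(is_mismatch)

    table = {}
    for reported_quality, (total, mismatches) in counts.items():
        # Smoothed error rate so that zero mismatches does not give infinity.
        error_rate = (mismatches + 1) / (total + 2)
        table[reported_quality] = round(-10 * math.log10(error_rate))
    return table


if __name__ == "__main__":
    # 1000 bases reported as Q30, of which 9 mismatch the reference.
    observations = [(30, False)] * 991 + [(30, True)] * 9
    print(empirical_qualities(observations))
```

In this toy input, bases reported as Q30 behave like Q20 bases, which illustrates how recalibration can lower the mean quality as on the slide.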

Base quality recalibration DePristo et al (2011)

Raw data -> Analysis-ready reads
New BAM file
Can then be used with other tools for downstream analyses (Samtools mpileup, Popoolation, etc.)

Variant discovery

Single vs Multiple sample analysis
"Data processing and analysis of genetic variation using next-generation sequencing", Mark DePristo, Dec. 8th, 2011 (http://www.broadinstitute.org/gatk/best-practices.htm)

Unified Genotyper
- GATK tool
- Multiple sample analysis
- Several detection modes: SNPs, indels

VCF format
http://www.broadinstitute.org/gatk/how-should-i-interpret-vcf-files-produced-by-the-gatk.htm

Why realign/recalibrate?

Comparison of SNP calling tools
SIGENAE Team, LGC - INRA
APACHE Project (Alain Vignal)
Goal: to find SNPs (Single Nucleotide Polymorphisms) that differentiate populations
Barbary duck: no reference genome (the Beijing duck genome is available)
Journée Bioinfo Génotoul, 29/03/2012
[Pictures: Beijing duck, Barbary duck]

Impact of realignment / recalibration on SNP count
[Figure: SNP count (0 to 80000) per tool (Mpileup, Mpileup -B, Mpileup -E, GATK, Popoolation2) on raw data, realigned data, recalibrated data, and realigned & recalibrated data. Δ values shown: 777%, 714%, 42%, 45%.]
- More homogeneous SNP count
- Higher impact of recalibration on SNP count

Reliable results with other species?
[Figures: SNP count per tool (Mpileup, Mpileup -B, Mpileup -E, GATK) on raw, realigned, recalibrated, and realigned & recalibrated BAMs, for duck, chicken and pig.]
Δ between tools:
Species    Raw data    Realigned/Recal data
Duck       777%        20%
Chicken    234%        4%
Pig        454%        9%
Not the same proportion, but a huge impact of realignment/recalibration.

Conclusion
- Variability between the SNPs called by different tools
- GATK realignment/recalibration greatly helps to reduce this variability
- High impact of base quality score
- Reliable on various DNA data, but not on RNA data
Nature Genetics 2012:
"We recommend a recalibration of per-base quality scores as in GATK or SOAPsnp"
"Several additional steps can be taken to improve genotype calls, such as local realignments..."

Summary
GATK takes some getting used to.
Strengths:
- Fairly fast to run thanks to possible parallelization
- Allele counting
- Handles multi-allelic positions
- Many features and options
- SNPs appear to be reliable
- Frequent improvements
- Website
Weaknesses:
- Recalibration is based on known SNPs...
- Originally created for the analysis of human genomes
- Many steps before running the UnifiedGenotyper
- Requires a lot of disk space to follow the pipeline from end to end

Practical session: Galaxy

The GATK reference site
http://www.broadinstitute.org/gatk/index.php
- Software and resource (VCF) downloads
- Guide
- Best Practices for analysis
- Forum
- Technical documentation
- Etc.

References
Samtools:
- Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25, 2078-9 (2009).
- Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851-8 (2008).
GATK:
- A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491 (2011).
- McKenna AH, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, Depristo M. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. (2010).
Popoolation2:
- Kofler R., Pandey R. V., Schlotterer C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics (2011).
Pyrobayes:
- Quinlan AR, Stewart DA, Strömberg MP, Marth GT. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods (2008).
BayesCall:
- Kao W-C, Stevens K, Song YS. BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res (2009).
Ibis:
- Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. (2009).
Pyrocleaner:
- Mariette J, Noirot C, Klopp C. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Research Notes (2011).
SNP calling review:
- Nielsen R. et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12: 443-451 (2011).