VCF format: a text file with meta-information lines, one header line, and one line per variant/position




Variant-calling software: GATK, SAMtools mpileup, VarScan, SOAP

VCF format
Text file
Meta-information lines (prefixed with ##)
One header line (prefixed with #)
One line per variant/position

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G   A      29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T   A      3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A   G,T    67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T   .      47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC G,GTCT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
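The three kinds of lines in the example above (## meta-information, # header, data records) can be separated with a few lines of Python. This is only a minimal sketch: the function name `parse_vcf` is hypothetical, the embedded text is a shortened copy of the example, and fields are split on any whitespace for readability, whereas a real VCF file is tab-delimited.

```python
def parse_vcf(text):
    """Split VCF text into meta lines, header columns and record dicts."""
    meta, header, records = [], [], []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith("##"):        # meta-information line
            meta.append(line[2:])
        elif line.startswith("#"):       # the single header line
            header = line[1:].split()
        else:                            # one line per variant/position
            fields = line.split()        # assumption: whitespace-separated
            record = dict(zip(header[:9], fields[:9]))
            record["samples"] = dict(zip(header[9:], fields[9:]))
            records.append(record)
    return meta, header, records


EXAMPLE = """##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
"""

meta, header, records = parse_vcf(EXAMPLE)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["samples"]["NA00001"])
```

For production work a dedicated library is usually a better choice than hand-written parsing.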

Example record:
CHROM POS  ID    REF ALT QUAL FILTER INFO              FORMAT      Sample1          Sample2
20    1234 rs678 A   G   20   PASS   NS=3;DP=14;AF=0.5 GT:AD:DP:GQ 1/1:1,40:4063:99 1/1:5,4056:4063:99:97177,6681,0

Variant types (CHROM POS ID REF ALT):
SNP C/G               20 1234 . C   G
Deletion (C)          20 2    . TC  T
Insertion (A)         20 2    . TC  TCA
Complex (population)  20 2    . T   A,G
                      20 2    . TCG TG,T,TCAG
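The variant types listed above can be derived from REF and ALT alone. The sketch below trims the bases shared by both alleles and looks at what remains; the function names are hypothetical, and the labels are a simplification (real tools distinguish more cases, such as MNPs and symbolic alleles).

```python
def classify_allele(ref, alt):
    """Classify one ALT allele against REF after trimming shared bases."""
    if alt == ".":
        return "no variant"
    # Trim the shared suffix, then the shared prefix (the padding base).
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    if not ref and not alt:
        return "no change"
    if not ref:
        return "insertion"
    if not alt:
        return "deletion"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    return "complex"


def classify(ref, alt_field):
    """Classify every comma-separated allele of the ALT column."""
    return [classify_allele(ref, alt) for alt in alt_field.split(",")]


print(classify("C", "G"))            # the SNP row
print(classify("TC", "T"))           # the deletion row
print(classify("TC", "TCA"))         # the insertion row
print(classify("TCG", "TG,T,TCAG"))  # the multi-allelic row
```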

INFO column
INFO column of the example record: NS=3;DP=14;AF=0.5
Each INFO key is declared in a meta-information line:
##INFO=<ID=ID,Number=number,Type=type,Description="description">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
Example of an INFO value produced by a calling pipeline:
AC=2;AC1=2;AF=1.00;AF1=1;AN=2;BaseQRankSum=1.620;DB;DP=8140;DP4=2,1,2200,1764;Dels=0.01;FQ=-282;FS=0.000;HaplotypeScore=175.6058;MLEAC=2;MLEAF=1.00;MQ0=0;MQRankSum=-0.019;PV4=1,0.33,1,0.36;QD=23.73;RPB=5.660616e-01;ReadPosRankSum=0.513;VDB=8.297158e-26;set=Intersection
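An INFO value is a semicolon-separated list of key=value pairs, plus bare keys that act as flags (DB in the example). A minimal sketch of reading it into a dictionary follows; `parse_info` is a hypothetical helper, and values are left as strings because their types are only known from the ##INFO declarations.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue                      # empty INFO or stray separator
        key, sep, value = item.partition("=")
        if not sep:
            result[key] = True            # flag: present without a value
        elif "," in value:
            result[key] = value.split(",")  # several values, e.g. DP4
        else:
            result[key] = value
    return result


info = parse_info("AC=2;AF=1.00;DB;DP=8140;DP4=2,1,2200,1764;set=Intersection")
print(info["DP"], info["DP4"], info["DB"])
```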

FORMAT and sample columns
FORMAT and sample columns of the example record: GT:AD:DP:GQ   1/1:1,40:4063:99   0/1:5,4056:4063:99:97177,6681,0
Each FORMAT key is declared in a meta-information line:
##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

GT: genotype, encoded as allele values separated by either / (unphased) or | (phased)
0: reference allele
1: first allele in ALT
2: second allele in ALT
etc.

AD: a,b where a = reads with the ref allele and b = reads with the alt allele
Examples: 0/1:10,50   1/1:0,200   1/2:0,20,50

0/0: homozygous for the reference allele
0/1: heterozygous
1/1: homozygous for the alternate allele
1/2: double mutation (two different ALT alleles)
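The FORMAT column gives the order of the colon-separated values in each sample column, so the two can be zipped together. The sketch below does that and then labels a genotype; both function names are hypothetical, and only simple diploid genotypes are considered.

```python
import re


def parse_sample(fmt, sample):
    """Pair FORMAT keys with the sample's values, e.g. GT:AD with 0/1:10,50."""
    # zip() stops at the shorter list, which tolerates dropped trailing fields.
    return dict(zip(fmt.split(":"), sample.split(":")))


def zygosity(gt):
    """Describe a GT value such as 0/0, 0|1, 1/1 or 1/2."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        if alleles[0] == "0":
            return "homozygous reference"
        return "homozygous alternate"
    if "0" in alleles:
        return "heterozygous"
    return "heterozygous, two different ALT alleles"


call = parse_sample("GT:AD:DP:GQ", "0/1:10,50:60:99")
ref_reads, alt_reads = (int(x) for x in call["AD"].split(","))
print(zygosity(call["GT"]), ref_reads, alt_reads)
```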

Tabix
Tabix is the first generic tool that indexes position-sorted files in TAB-delimited formats
Index on 3 keys: chromosome, start, end (.tbi)
Indexes bgzip-compressed files
Direct FTP/HTTP access
Command-line tool; library in C, Java, Perl and Python
Supported formats: VCF, BED, PSL

$ bgzip myvcf.vcf
$ tabix -p vcf myvcf.vcf.gz
$ tabix myvcf.vcf.gz chr1:1234-5678
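Why position sorting matters for a region query such as chr1:1234-5678 can be shown with a plain binary search. This is an illustration of the idea only, not tabix's actual index, which is more elaborate and works on compressed blocks; the function name and the in-memory record list are assumptions.

```python
from bisect import bisect_left


def query(records, chrom, start, end):
    """Return lines with start <= pos <= end on chrom.

    records must be a list of (chrom, pos, line) tuples sorted by
    chromosome and then position.
    """
    lo = bisect_left(records, (chrom, start))      # first pos >= start
    hi = bisect_left(records, (chrom, end + 1))    # first pos > end
    return [line for _, _, line in records[lo:hi]]


records = [
    ("chr1", 100, "a"),
    ("chr1", 1234, "b"),
    ("chr1", 5000, "c"),
    ("chr1", 5678, "d"),
    ("chr1", 9000, "e"),
    ("chr2", 2000, "f"),
]
print(query(records, "chr1", 1234, 5678))
```

Because the list is sorted, only the two boundaries are searched and the matching records are read as one contiguous slice, instead of scanning the whole file.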

Workflow: Genomic alignment > Variation calling > Variation filter > Variation annotation > Interpretation
Full exome: 40,000 to 60,000 variations (quite low filtration) (100X, 95% at 15x)
Next steps: query the VCF, annotate the variants

Merge, intersect 2 VCFs

java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T CombineVariants --variant:foo input1.vcf --variant:bar input2.vcf -o output.vcf -priority foo,bar

vcf-merge A.vcf.gz B.vcf.gz | bgzip -c > C.vcf.gz

vcf-isec -o -n +2 A.vcf.gz B.vcf.gz C.vcf.gz     (present in at least 2 files)

Filtering VCF

vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003

java -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -sn SAMPLE_A -sn SAMPLE_B
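The "present in at least 2 files" idea can be sketched in Python by counting in how many files each variant occurs. This is a simplified illustration, not a reimplementation of the tools above: the function names are hypothetical, variants are matched on (CHROM, POS, REF, ALT), and the matching rules of the real tools may differ.

```python
from collections import Counter


def variant_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of one VCF."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip meta, header and blank lines
        chrom, pos, _id, ref, alt = line.split()[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def present_in_at_least(files, n):
    """Variants found in at least n of the given files, sorted by position."""
    counts = Counter()
    for lines in files:
        counts.update(variant_keys(lines))   # each file counts once per variant
    return sorted(key for key, count in counts.items() if count >= n)


a = ["#CHROM POS ID REF ALT", "20 100 . A G", "20 200 . T C"]
b = ["20 100 . A G", "20 300 . G A"]
c = ["20 200 . T C", "20 100 . A G"]
print(present_in_at_least([a, b, c], 2))
```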

VCF annotation
Annotation databases (Ensembl, UCSC): intergenic, intronic, exonic, splice site; consequence
Public databases (SNPs, frequencies): dbSNP, 1000 Genomes, EVS, OMIM, COSMIC
Tools: VEP (Ensembl), SnpEff, ANNOVAR

Tools by step
FASTQ: FASTX, FastQC
Alignment/Mapping: BWA, Bowtie
BAM: SAMtools, Picard tools, GATK
Calling: GATK, SAMtools
VCF: GATK, VCFtools, tabix
Annotation: VEP, SnpEff, ANNOVAR

Sequencing is not random: library, capture and sequencing bias
[Figure: three panels of 100 reads each, distributed over four regions: 25/25/25/25, 10/20/30/40 and 10/10/60/20]

2nd experiment with the same bias
[Figure: two panels of 100 reads each, distributed over four regions: 10/10/60/20 and 10/20/60/10]
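One reading of this slide is that a bias shared by two experiments cancels out when their read counts are compared region by region. The sketch below works under that assumption: the function name is hypothetical, each count is scaled by its experiment's total, and a control count of zero is not handled.

```python
def coverage_ratio(test, control):
    """Per-region ratio of the two experiments' read fractions."""
    test_total, control_total = sum(test), sum(control)
    return [
        (t / test_total) / (c / control_total)   # assumes c is never 0
        for t, c in zip(test, control)
    ]


first = [10, 10, 60, 20]    # first experiment, 100 reads
second = [10, 20, 60, 10]   # second experiment, same bias, 100 reads
for region, ratio in enumerate(coverage_ratio(second, first), start=1):
    print(region, round(ratio, 2))
```

Regions 1 and 3 come out at a ratio of 1 despite very different raw counts (10 versus 60), while regions 2 and 4 stand out as a doubling and a halving.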

RNAseq: differential expression of your RNA in 2 different conditions.
DNASeq: find large deletions, gene duplications, etc.

Ref  T H E R E I S A L A R G E D E L E T I O N
seq  T H E R E I S A _ _ _ _ _ D E L E T I O N
seq  T H E R E I S A _ _ _ _ _ D E L E T I O N
Cov  1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1

Ref  T H E R E I S A L A R G E D E L E T I O N
All1 T H E R E I S A _ _ _ _ _ D E L E T I O N
All2 T H E R E I S A L A R G E D E L E T I O N
Cov  2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2
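The coverage rows above suggest a simple way to locate the deletion: look for a run of positions whose coverage falls to half of the expected value or less. The following is a minimal sketch with a hypothetical function name and a hypothetical `max_fraction` threshold; real callers also have to deal with noise and the biases described earlier.

```python
def low_coverage_runs(cov, expected, max_fraction=0.5):
    """Return (start, end) half-open intervals where coverage is low."""
    runs, start = [], None
    for i, depth in enumerate(cov):
        low = depth <= expected * max_fraction
        if low and start is None:
            start = i                      # a low-coverage run begins
        elif not low and start is not None:
            runs.append((start, i))        # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(cov)))     # run reaches the end
    return runs


ref = "THEREISALARGEDELETION"
cov_both_alleles = [1] * 8 + [0] * 5 + [1] * 8   # first example
cov_one_allele = [2] * 8 + [1] * 5 + [2] * 8     # second example

for cov, expected in ((cov_both_alleles, 1), (cov_one_allele, 2)):
    for start, end in low_coverage_runs(cov, expected):
        print(start, end, ref[start:end])
```

In both examples the run covers the same five positions of the reference; only the depth of the drop differs (to zero in the first, to half in the second).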

DNA
De novo sequencing
Resequencing (sequence mutation, structural variation): whole genome (expensive), targeted resequencing (full exomes)
ChIP-seq

RNA
Differential expression
Splice variant detection
Small RNA
Structural variation (gene fusion)

Sequence comparison
Resequencing (sequence mutation, structural variation): whole genome (expensive), targeted resequencing (full exomes)
Splice variant detection
Structural variation (gene fusion)

Quantification (counting reads)
Differential gene expression
ChIP-seq
Small RNA
CNV

Select to keep
Deselect to remove

Identical variations in 2 patients
Identical variations in 2 patients and absent in a third
Heterozygous variations in 2 patients
Homozygous variations in 1 patient
Identical variations in "N" patients

Identical genes in 3 patients
Identical genes in 2 patients and never in the 3rd
Genes mutated in "N" patients

Recessive: homozygous variations in the affected children, absent in healthy brothers and sisters, heterozygous in the father and the mother
Compound: 2 heterozygous variations in the children, inherited from both parents
Dominant: identical variations in affected individuals and absent in healthy ones
De novo: variations in the children but not in the parents
Strict de novo: de novo + verification of the coverage in the parents
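Some of these inheritance filters can be expressed directly on the GT values of a family. The sketch below covers the recessive, de novo and strict de novo cases for a single site; all function names and the `min_dp` threshold of 10 are hypothetical, and "absent in healthy siblings" is read here as "no healthy sibling is homozygous for the variant".

```python
def alleles(gt):
    """Split a GT value such as 0/1 or 0|1; None if any allele is missing."""
    parts = gt.replace("|", "/").split("/")
    return None if "." in parts else parts


def is_hom_alt(gt):
    a = alleles(gt)
    return a is not None and len(set(a)) == 1 and a[0] != "0"


def is_het(gt):
    a = alleles(gt)
    return a is not None and len(set(a)) > 1


def is_hom_ref(gt):
    a = alleles(gt)
    return a is not None and set(a) == {"0"}


def is_recessive(child, father, mother, healthy_siblings=()):
    """Affected child homozygous, both parents heterozygous carriers."""
    return (is_hom_alt(child) and is_het(father) and is_het(mother)
            and not any(is_hom_alt(s) for s in healthy_siblings))


def is_de_novo(child, father, mother):
    """Child carries a variant that neither parent shows."""
    return ((is_het(child) or is_hom_alt(child))
            and is_hom_ref(father) and is_hom_ref(mother))


def is_strict_de_novo(child, father, mother, father_dp, mother_dp, min_dp=10):
    """De novo, plus enough read depth in both parents to trust 0/0."""
    return (is_de_novo(child, father, mother)
            and father_dp >= min_dp and mother_dp >= min_dp)


print(is_recessive("1/1", "0/1", "0/1", healthy_siblings=["0/1", "0/0"]))
print(is_de_novo("0/1", "0/0", "0/0"))
print(is_strict_de_novo("0/1", "0/0", "0/0", father_dp=30, mother_dp=4))
```

The coverage check matters because a parent genotyped 0/0 from only a few reads may simply have missed the alternate allele.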