Calling software: GATK, SAMtools mpileup, VarScan, SOAP
VCF format
  Text file
  Meta-information lines (##)
  One header line (#CHROM ...)
  One line per variant/position
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS      ID         REF  ALT     QUAL  FILTER  INFO                               FORMAT       NA00001         NA00002         NA00003
20      14370    rs6054257  G    A       29    PASS    NS=3;DP=14;AF=0.5;DB;H2            GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330    .          T    A       3     q10     NS=3;DP=11;AF=0.017                GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20      1110696  rs6040355  A    G,T     67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB  GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20      1230237  .          T    .       47    PASS    NS=3;DP=13;AA=T                    GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
20      1234567  microsat1  GTC  G,GTCT  50    PASS    NS=3;DP=9;AA=G                     GT:GQ:DP     0/1:35:4        0/2:17:2        1/1:40:3
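As an illustration of how one data line of this example breaks down, here is a minimal Python sketch. It assumes a tab-delimited line; the function name parse_vcf_line and the dictionary layout are hypothetical choices for this example, not part of any VCF library.

```python
def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into fixed columns and per-sample fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": None if vid == "." else vid,          # "." means missing
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": info,                                # left as a raw string here
        "samples": [],
    }
    if len(cols) > 9:
        keys = cols[8].split(":")                    # FORMAT column, e.g. GT:GQ:DP:HQ
        for sample in cols[9:]:
            # Trailing sample fields may be dropped, so zip stops at the shorter list.
            record["samples"].append(dict(zip(keys, sample.split(":"))))
    return record


line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
        "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
rec = parse_vcf_line(line)
print(rec["POS"], rec["ALT"], rec["samples"][1]["GT"])
```

For real files a dedicated parser is usually a safer choice than hand-splitting; the sketch only shows where each column ends up.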
CHROM  POS   ID     REF  ALT  QUAL  FILTER  INFO               FORMAT       Sample1           Sample2
20     1234  rs678  A    G    20    PASS    NS=3;DP=14;AF=0.5  GT:AD:DP:GQ  1/1:1,40:4063:99  1/1:5,4056:4063:99:97177,6681,0

                      CHROM  POS   ID  REF  ALT
SNP C/G               20     1234  .   C    G
Deletion (G)          20     2     .   TC   T
Insertion (A)         20     2     .   TC   TCA
Complex (population)  20     2     .   T    A,G
                      20     2     .   TCG  TG,T,TCAG
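The REF/ALT conventions above can be read programmatically. The following Python sketch is one possible way to label each ALT allele; the function name classify_allele and the label "other" are assumptions made for this example. It trims the bases shared by REF and ALT and looks at what remains.

```python
def classify_allele(ref, alt):
    """Label one REF/ALT pair as SNP, deletion, insertion or other."""
    # Trim the shared suffix, then the shared prefix (the padding bases).
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    if not ref and not alt:
        return "no change"
    if not alt:
        return "deletion"      # only reference bases remain
    if not ref:
        return "insertion"     # only alternate bases remain
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    return "other"             # e.g. multi-base substitution


# Rows taken from the slide; a multi-allelic ALT is classified allele by allele.
for ref, alts in [("C", "G"), ("TC", "T"), ("TC", "TCA"), ("T", "A,G"), ("TCG", "TG,T,TCAG")]:
    print(ref, alts, [classify_allele(ref, a) for a in alts.split(",")])
```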
CHROM  POS   ID     REF  ALT  QUAL  FILTER  INFO               FORMAT       Sample1           Sample2
20     1234  rs678  A    G    20    PASS    NS=3;DP=14;AF=0.5  GT:AD:DP:GQ  1/1:1,40:4063:99  1/1:5,4056:4063:99:97177,6681,0

##INFO=<ID=ID,Number=number,Type=type,Description="description">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">

AC=2;AC1=2;AF=1.00;AF1=1;AN=2;BaseQRankSum=1.620;DB;DP=8140;DP4=2,1,2200,1764;Dels=0.01;FQ=-282;FS=0.000;HaplotypeScore=175.6058;MLEAC=2;MLEAF=1.00;MQ0=0;MQRankSum=-0.019;PV4=1,0.33,1,0.36;QD=23.73;RPB=5.660616e-01;ReadPosRankSum=0.513;VDB=8.297158e-26;set=Intersection
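An INFO value is a semicolon-separated list of key=value pairs and bare flags. A minimal Python sketch of reading it follows; parse_info is a hypothetical helper name, and values are kept as strings because the real types are declared in the ##INFO header lines.

```python
def parse_info(info):
    """Turn an INFO string such as 'NS=3;DP=14;DB' into a dictionary."""
    fields = {}
    if info == ".":                    # "." means no INFO at all
        return fields
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # Comma-separated values (e.g. one AF per ALT allele) become lists.
            fields[key] = value.split(",") if "," in value else value
        else:
            fields[item] = True        # a Flag such as DB or H2
    return fields


info = parse_info("NS=3;DP=14;AF=0.5;DB;H2")
print(info["DP"], info["DB"])
```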
CHROM  POS   ID     REF  ALT  QUAL  FILTER  INFO               FORMAT       Sample1           Sample2
20     1234  rs678  A    G    20    PASS    NS=3;DP=14;AF=0.5  GT:AD:DP:GQ  1/1:1,40:4063:99  0/1:5,4056:4063:99:97177,6681,0

##FORMAT=<ID=ID,Number=number,Type=type,Description="description">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
GT: genotype, encoded as allele values separated by either / or |
  0: reference allele
  1: first allele in ALT
  2: second allele in ALT
  0/0  homozygous for the reference allele
  0/1  heterozygous
  1/1  homozygous
  1/2  double mutation

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
AD: a,b
  a: reads with the ref allele
  b: reads with the alt allele
  0/1:10,50
  1/1:0,200
  1/2:0,20,50
  etc.
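To make the GT and AD encodings concrete, here is a small Python sketch. The helper names zygosity and alt_read_fraction are hypothetical, and the code assumes every AD value is numeric (no "." entries).

```python
import re


def zygosity(gt):
    """Describe a GT string such as '0/1' or '1|2'."""
    alleles = re.split(r"[/|]", gt)          # "/" unphased, "|" phased
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"
    if "0" in alleles:
        return "heterozygous"
    return "heterozygous, two different ALT alleles"


def alt_read_fraction(ad):
    """Fraction of reads supporting any ALT allele, from an AD string like '10,50'."""
    depths = [int(x) for x in ad.split(",")]  # first value is the REF depth
    total = sum(depths)
    if total == 0:
        return None                           # no reads, fraction undefined
    return sum(depths[1:]) / total


for gt, ad in [("0/1", "10,50"), ("1/1", "0,200"), ("1/2", "0,20,50")]:
    print(gt, zygosity(gt), alt_read_fraction(ad))
```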
Tabix is the first generic tool that indexes position-sorted files in TAB-delimited formats.
  index on 3 keys: chromosome, start, end (.tbi)
  indexes bgzip-compressed files
  direct FTP/HTTP access
  command-line tool; library in C, Java, Perl and Python
  formats: VCF, BED, PSL
$ bgzip myvcf.vcf
$ tabix -p vcf myvcf.vcf.gz
$ tabix myvcf.vcf.gz chr1:1234-5678
Genomic Alignment -> Variation Calling -> Variation Filter -> Variation Annotation -> Interpretation
Full exome (100X, 95% at 15x): 40 000 to 60 000 variations (quite low filtration)
Query VCF
Annotate variant
Merge, Intersect 2 VCFs
$ java -Xmx2g -jar GenomeAnalysisTK.jar -R ref.fasta -T CombineVariants --variant:foo input1.vcf --variant:bar input2.vcf -o output.vcf -priority foo,bar
$ vcf-merge A.vcf.gz B.vcf.gz | bgzip -c > C.vcf.gz
$ vcf-isec -o -n +2 A.vcf.gz B.vcf.gz C.vcf.gz    (sites present in at least 2 files)
FILTERING VCF:
$ vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003
$ java -jar GenomeAnalysisTK.jar -R ref.fasta -T SelectVariants --variant input.vcf -o output.vcf -sn SAMPLE_A -sn SAMPLE_B
VCF annotation
  Databases: Ensembl, UCSC
  Location: intergenic, intronic, exonic, splice site
  Tools: VEP (Ensembl), SnpEff, ANNOVAR
  Public databases: dbSNP, 1000 Genomes, EVS, OMIM, COSMIC
  Annotations: frequencies, consequence
FASTQ: FASTX, FastQC
Alignment/Mapping: BWA, Bowtie
BAM: Samtools, Picard tools, GATK
Calling: GATK, Samtools
VCF: GATK, VCFtools, Tabix
Annotation: VEP, SnpEff, ANNOVAR
Sequencing is not random: library, capture and sequencing bias.
[Figure: three examples of 100 reads distributed over four regions: 25/25/25/25, 10/20/30/40 and 10/10/60/20]
Second experiment with the same bias
[Figure: 100 reads distributed over four regions in each experiment: 10/10/60/20 and 10/20/60/10]
RNA-seq: differential expression of your RNA between 2 different conditions.
DNA-seq: find large deletions, gene duplications, etc.

Ref   T H E R E I S A L A R G E D E L E T I O N
seq   T H E R E I S A _ _ _ _ _ D E L E T I O N
Cov   1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1

Ref   T H E R E I S A L A R G E D E L E T I O N
All1  T H E R E I S A _ _ _ _ _ D E L E T I O N
All2  T H E R E I S A L A R G E D E L E T I O N
Cov   2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2
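The coverage rows above can be recomputed from the aligned sequences. Here is a minimal Python sketch, assuming reads already aligned to the reference with "_" marking deleted positions; the function names coverage and low_runs are hypothetical.

```python
def coverage(reads):
    """Count, at each position, the aligned reads that are not a gap."""
    length = len(reads[0])
    return [sum(1 for r in reads if r[i] != "_") for i in range(length)]


def low_runs(cov, expected):
    """Return (start, end) half-open intervals where coverage is below expected."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c < expected and start is None:
            start = i                       # a low-coverage run begins
        elif c >= expected and start is not None:
            runs.append((start, i))         # the run ended at the previous position
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs


ref = "THEREISALARGEDELETION"
deleted = "THEREISA_____DELETION"

cov_one = coverage([deleted])        # single sequence carrying the deletion
cov_two = coverage([deleted, ref])   # one deleted allele, one intact allele
print(cov_one, low_runs(cov_one, 1))
print(cov_two, low_runs(cov_two, 2))
```

In both cases the low-coverage interval points back at the deleted word in the reference: coverage drops to 0 when every sequence carries the deletion, and to half the expected depth when only one of two alleles does.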
DNA
  De novo sequencing
  Resequencing (sequence mutation, structural variation)
    Whole genome (expensive)
    Target resequencing (full exomes)
  ChIP-seq
RNA
  Differential expression
  Splice variant detection
  Small RNA
  Structural variation (gene fusion)
Sequence comparison
  Resequencing (sequence mutation, structural variation)
  Whole genome (expensive)
  Target resequencing (full exomes)
  Splice variant detection
  Structural variation (gene fusion)
Quantification (counting reads)
  Differential gene expression
  ChIP-seq
  Small RNA
  CNV
Select to keep
Deselect to remove
Identical variations in 2 patients
Identical variations in 2 patients and absent from a third
Heterozygous variations in 2 patients
Homozygous variations in 1
Identical variations in "N" patients
Identical genes in 3 patients
Identical genes in 2 patients and never in the 3rd
Genes mutated in "N" patients
Recessive: homozygous variations in the affected children, absent in healthy brothers and sisters, heterozygous in the father and the mother
Compound: 2 heterozygous variations in the children, inherited from both parents
Dominant: identical variations in affected individuals and not present in healthy ones
De novo: variations in the children but not in the parents
Strict de novo: de novo + check of the coverage in the parents
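Some of these inheritance models can be expressed as simple genotype rules for a child/father/mother trio. The Python sketch below covers only the recessive, de novo and strict de novo cases for a single variant. All function names are hypothetical, and the default minimum parental depth of 10 reads is an illustrative threshold, not a value from the slides.

```python
import re


def alt_count(gt):
    """Number of non-reference alleles in a GT string, or None if missing."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return None
    return sum(1 for a in alleles if a != "0")


def classify_trio(child, father, mother):
    """Label one variant from the genotypes of a child and both parents."""
    c, f, m = alt_count(child), alt_count(father), alt_count(mother)
    if None in (c, f, m):
        return "unknown"
    if c == 0:
        return "not carried"
    if f == 0 and m == 0:
        return "de novo candidate"          # in the child, in neither parent
    if c == 2 and f == 1 and m == 1:
        return "recessive candidate"        # homozygous child, heterozygous parents
    return "other"


def is_strict_de_novo(child, father, mother, father_dp, mother_dp, min_depth=10):
    """De novo, plus enough parental coverage to trust the 0/0 calls."""
    if classify_trio(child, father, mother) != "de novo candidate":
        return False
    return father_dp >= min_depth and mother_dp >= min_depth


print(classify_trio("1/1", "0/1", "0/1"))
print(classify_trio("0/1", "0/0", "0/0"))
print(is_strict_de_novo("0/1", "0/0", "0/0", father_dp=30, mother_dp=4))
```

The coverage check matters because a parent called 0/0 from only a few reads may simply have missed the alternate allele.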