Database schema documentation for SNPdbe




Changes

02/27/12:
- seqs_containingsnps.taxid removed
- dbsnp_snp.tax_id renamed to dbsnp_snp.taxid

General information

Data in SNPdbe is organized on several levels. Protein-sequence-specific information is stored in tables starting with seqs_; information at the nsSNP level is stored in the SNPs_ tables. The central table is geno2func. It collects nsSNPs from all sources and stores prediction results and evolutionary information.

The most important fields in the schema are:
- the md5 hashsum of a sequence. It serves as the only unique sequence identifier.
- the mut_id. It designates a unique nsSNP throughout all tables in terms of sequence (through its md5), position and mutant residue.

Both implicitly serve as foreign keys and connect information between tables. In the following, primary keys are underlined and highlighted in bold.

Important: We store protein sequence positions 0-based throughout the whole schema!

seqs_sp (contains sequence information on SwissProt proteins)

Field               Type                   Description
seq                 mediumtext             wild-type protein sequence
AC                  varchar(10)            primary SwissProt accession
ID                  varchar(20)            SwissProt identification
gene                varchar(20)            gene identifier
description         mediumtext             information on the protein

seqs_refseq (contains sequence information taken from RefSeq; used to map nsSNPs from dbSNP and 1KG)

Field               Type                   Description
seq                 mediumtext             wild-type protein sequence
id                  varchar(15)            RefSeq protein identifier
ver                 tinyint(3)             RefSeq protein version
gi                  int(11)                GenBank identifier
comment             text                   information on the protein

seqs_pmd (contains sequence information taken from PMD)

Field               Type                   Description
pmd_id              char(10)               PMD identifier
nr                  tinyint(3)             PMD sub-identifier
seq                 mediumtext             wild-type protein sequence

seqs_containingsnps (stores all sequences that contain at least one nsSNP, with flags for the source of the nsSNPs)

Field               Type                   Description
seq                 mediumtext             wild-type protein sequence
in_dbsnp            enum('0','1')          '1' if sequence contains at least one SAAS from dbSNP
in_sp               enum('0','1')          '1' if sequence contains at least one SAAS from SwissProt
in_pmd              enum('0','1')          '1' if sequence contains at least one SAAS from PMD
in_1kg              enum('0','1')          '1' if sequence contains at least one SAAS from 1KG

seqs_equalities (relates two md5s if their sequences differ only by one residue or by the leading methionine; in both cases the two sequences are considered equal)

Field               Type                   Description
md5_1               char(32)               md5 hashsum of the first sequence
md5_2               char(32)               md5 hashsum of the second sequence
by_missing_m        enum('0','1')          '1' if both differ only by the leading methionine
by_one_mismatch     enum('0','1')          '1' if both differ by one residue
pos_mismatch        smallint(5)            position of that mismatch residue

md52keywords (stores gene names and descriptions for each sequence)

Field               Type                   Description
gene_symbol         mediumtext             gene symbol
gene_sp             varchar(11)            SwissProt identification
description         mediumtext             information on the protein's function
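The md5 identifier and the seqs_equalities relation described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and hashing the sequence exactly as given (plain ASCII, no normalization) is an assumption; the schema documentation does not state how SNPdbe prepares a sequence before hashing.

```python
import hashlib


def sequence_md5(seq: str) -> str:
    """Return the 32-character hex md5 digest of a protein sequence."""
    # Assumption: the sequence is hashed as given, encoded as ASCII text.
    return hashlib.md5(seq.encode("ascii")).hexdigest()


def compare_sequences(seq_1: str, seq_2: str):
    """Classify two sequences in the spirit of seqs_equalities.

    Returns (by_missing_m, by_one_mismatch, pos_mismatch);
    pos_mismatch is 0-based, or None when there is no single mismatch.
    """
    # Case 1: the sequences differ only by the leading methionine.
    if seq_1 == "M" + seq_2 or seq_2 == "M" + seq_1:
        return True, False, None
    # Case 2: same length and exactly one differing residue.
    if len(seq_1) == len(seq_2):
        diffs = [i for i, (a, b) in enumerate(zip(seq_1, seq_2)) if a != b]
        if len(diffs) == 1:
            return False, True, diffs[0]
    # Identical sequences or larger differences are not related here.
    return False, False, None
```

For example, `compare_sequences("MKTA", "MKSA")` reports a single mismatch at 0-based position 2, which matches the position convention used throughout the schema.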

geno2func (central table that stores predictions and evolutionary information for each nsSNP)

Field               Type                   Description
mut_id              int(11)                unique nsSNP identifier
md5                 char(32)               md5 hashsum of the sequence
wt                  char(1)                wild-type residue
pos                 smallint(5)            mutation position in the sequence (0-based)
mt                  char(1)                mutant residue
in_dbsnp            enum('0','1')          '1' if that mutation is derived from dbSNP
in_sp               enum('0','1')          '1' if that mutation is derived from SwissProt
in_pmd              enum('0','1')          '1' if that mutation is derived from PMD
in_1kg              enum('0','1')          '1' if that mutation is derived from 1000Genomes
SNAP_status         enum('0','1','2')      '0': SNAP failed, '1': internal use, '2': SNAP succeeded
SNAP_bin            enum('+','-')          SNAP prediction: '+' non-neutral, '-' neutral
SNAP_score          tinyint(4)             SNAP raw score from -100 (strongly neutral) to 100 (strongly non-neutral)
SNAP_ri             tinyint(1)             SNAP prediction reliability index from 0 (low reliability) to 9 (high reliability)
SNAP_acc            tinyint(3)             SNAP expected accuracy
SIFT_bin            enum('+','-')          SIFT prediction: '+' deleterious, '-' neutral
SIFT_score          float                  SIFT raw score from 0 (strongly deleterious) to 1 (strongly neutral), decision threshold 0.05
PERC_wt             tinyint(3)             relative frequency of wild-type residue in PSI-BLAST alignments
PERC_mt             tinyint(3)             relative frequency of mutant residue in PSI-BLAST alignments
PSSM_wt             tinyint(4)             PSI-BLAST's PSSM value for wild-type residue
PSSM_mt             tinyint(4)             PSI-BLAST's PSSM value for mutant residue
PSIC_wt             float                  PSIC conservation score for wild-type residue
PSIC_mt             float                  PSIC conservation score for mutant residue
pph2                enum('possibly damaging','probably damaging','benign','unknown')  PolyPhen2 prediction for a subset of dbSNP mutants

dbsnp2omim_verified (maps dbSNP mutants to disease associations in OMIM)

Field               Type                   Description
mut_id              int(11)                mut_id of the dbSNP mutation (ref to SNPs_dbSNP)
omim_id             int(11)                major OMIM entry id (ref to OMIM2disease)
var_id              varchar(5)             OMIM variant identifier (ref to OMIM2disease)
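As an illustration of how the prediction columns of geno2func might be queried, the following sketch builds a reduced copy of the table in an in-memory SQLite database and selects nsSNPs that both SNAP and SIFT flag as non-neutral. SQLite stands in for the MySQL server only to keep the example self-contained; the rows, the helper name `consensus_non_neutral`, and the default reliability cutoff of 5 are invented for illustration and do not come from SNPdbe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced geno2func: only the columns used below, with SQLite types.
conn.execute(
    """CREATE TABLE geno2func (
           mut_id INTEGER, md5 TEXT, wt TEXT, pos INTEGER, mt TEXT,
           SNAP_status TEXT, SNAP_bin TEXT, SNAP_ri INTEGER,
           SIFT_bin TEXT, SIFT_score REAL)"""
)
# Invented example rows (not real SNPdbe data).
EXAMPLE_ROWS = [
    (1, "a" * 32, "E", 5, "V", "2", "+", 7, "+", 0.01),
    (2, "a" * 32, "A", 10, "S", "2", "-", 8, "-", 0.60),
    (3, "b" * 32, "R", 0, "W", "0", None, None, "+", 0.02),
    (4, "b" * 32, "G", 41, "D", "2", "+", 2, "+", 0.03),
]
conn.executemany(
    "INSERT INTO geno2func VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    EXAMPLE_ROWS,
)

QUERY = """
    SELECT wt, pos, mt
    FROM geno2func
    WHERE SNAP_status = '2'      -- SNAP succeeded
      AND SNAP_bin = '+'         -- SNAP: non-neutral
      AND SNAP_ri >= ?           -- hypothetical reliability cutoff
      AND SIFT_bin = '+'         -- SIFT: deleterious
    ORDER BY mut_id
"""


def consensus_non_neutral(conn, min_ri=5):
    """Return labels such as 'E6V' for nsSNPs flagged by SNAP and SIFT."""
    cur = conn.execute(QUERY, (min_ri,))
    # Stored positions are 0-based; add 1 for the conventional 1-based label.
    return [f"{wt}{pos + 1}{mt}" for wt, pos, mt in cur]
```

Note the `pos + 1` conversion: because the schema stores positions 0-based, any label meant for human readers or for comparison with 1-based resources needs this shift.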

OMIM2disease (relates OMIM ids, disease and the mutation in a given protein sequence)

Field               Type                   Description
omim_id             int(11)                major OMIM entry id
var_id              varchar(5)             OMIM variant identifier
phenotype           text                   disease associated with that variant
gene                varchar(30)            gene symbol
mutation_raw        text                   for internal use only
wt                  char(1)                wild-type residue found at the disease-causing mutation
pos                 int(11)                protein sequence position (0-based)
mt                  char(1)                mutant residue

taxid2names (maps an NCBI taxid onto organism names; see ftp://ftp.ncbi.nih.gov/pub/taxonomy/)

Field               Type                   Description
taxid               int(11)                NCBI's taxid
misspelling                                This column and the columns below denote different alternatives for the organism name.
genbank_anamorph
scientific_name
synonym
blast_name
unpublished_name
genbank_synonym
equivalent_name
includes
acronym
in_part
anamorph
authority
genbank_common_name
genbank_acronym
common_name
misnomer
teleomorph
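The references between geno2func, dbsnp2omim_verified and OMIM2disease suggest a two-step join from a mut_id to its disease annotation. The sketch below shows one way this could look, again using an in-memory SQLite database with reduced tables. The join on both omim_id and var_id is an assumption based on the "ref to OMIM2disease" notes; the rows and the helper name `diseases_for_mutation` are placeholders, not SNPdbe content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced versions of the three tables, with SQLite types.
conn.executescript(
    """
    CREATE TABLE geno2func (mut_id INTEGER, wt TEXT, pos INTEGER, mt TEXT);
    CREATE TABLE dbsnp2omim_verified (mut_id INTEGER, omim_id INTEGER, var_id TEXT);
    CREATE TABLE OMIM2disease (omim_id INTEGER, var_id TEXT, phenotype TEXT, gene TEXT);
    """
)
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO geno2func VALUES (?, ?, ?, ?)",
    [(1, "E", 5, "V"), (2, "A", 10, "S")],
)
conn.execute("INSERT INTO dbsnp2omim_verified VALUES (1, 999999, '0001')")
conn.execute(
    "INSERT INTO OMIM2disease VALUES (999999, '0001', 'EXAMPLE PHENOTYPE', 'GENE1')"
)

QUERY = """
    SELECT d.gene, d.phenotype
    FROM geno2func AS g
    JOIN dbsnp2omim_verified AS v ON v.mut_id = g.mut_id
    JOIN OMIM2disease AS d
      ON d.omim_id = v.omim_id AND d.var_id = v.var_id
    WHERE g.mut_id = ?
    ORDER BY d.omim_id, d.var_id
"""


def diseases_for_mutation(conn, mut_id):
    """Return (gene, phenotype) pairs linked to a mut_id via OMIM."""
    return conn.execute(QUERY, (mut_id,)).fetchall()
```

A mutation without an entry in dbsnp2omim_verified simply yields an empty list, since the inner joins drop it.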

taxid_merged (relates identical taxids that got merged in the past)

Field               Type                   Description
taxid_old           int(11)                taxid that was deleted and merged with...
taxid_new           int(11)                ...new taxid

dbsnp_valcode (defines the validation status of a SNP in dbSNP; taken as is from the table dump provided by dbSNP)

Field               Type                   Description
code                tinyint(3)             validation status code
abbrev              varchar(50)            abbreviation of the validation status code
descrip             varchar(250)           description of the validation status
create_time         varchar(50)            datetime when the row was inserted into the table; unused
last_updated_time   varchar(50)            datetime when the row was updated; unused

dbsnp_snp (collects additional info on each dbSNP rs; taken as is from dbSNP's organism-specific SNP tables)

Field               Type                   Description
snp_id              int(11)                dbSNP rs id; ref to SNPs_dbSNP
avg_heterozygosity  float                  average heterozygosity of a SNP
het_se              float                  standard error of average heterozygosity
create_time         varchar(30)            first time this refSNP cluster is created (from dbSNP); unused
last_updated_time   varchar(30)            last time this row of SNP data is updated (from dbSNP); unused
CpG_code            tinyint(4)             currently unused
taxid                                      ... of the organism
validation_status   tinyint(4)             ref to dbsnp_valcode; definition of each validation status
exemplar_subsnp_id  int(11)                currently unused
univar_id           int(11)                currently unused
cnt_subsnp          tinyint(3)             number of ss in that rs; unused
map_property        tinyint(3)             currently unused

SNPs_SP (collects all VARIANT, MUTAGEN and CONFLICT entries from SwissProt plus annotations)

Field               Type                   Description
mut_id              int(11)                unique mutant id; ref to geno2func
md5                 varchar(32)            md5 hashsum of the protein sequence
AC                  varchar(10)            primary SwissProt accession
pos                 smallint(5)            mutant position (0-based)
mt                  char(1)                mutant residue
kind                enum('variant','MUTAGEN','CONFLICT')  kind of the amino acid change; one of VARIANT (observed variation), MUTAGEN (mutagenesis), CONFLICT (sequencing conflicts)
site_features       text                   features observed at the mutant position; currently: active sites (ACT_SITE), metal binding (METAL), PTMs (MOD_RES), general binding (BINDING), lipid (LIPID), glycosylation (CARBOHYD), disulfide bond (DISULFID), located in a domain? (DOMAIN)
functional_effect   text                   effect of the variant; MUTAGEN only
var_id              varchar(7)             SwissVar id; VARIANT only
disease             text                   disease association with the VARIANT

SNPs_dbSNP (collects all nsSNPs from dbSNP; merges all organism-specific SNPContigLocusId tables from dbSNP plus additional information)

Field               Type                   Description
mut_id              int(11)                unique mutant id; ref to geno2func
build               smallint(6)            dbSNP build for that organism
assembly            varchar(10)            version of the reference genome
verified            enum('0','1','2','3')  0: OK, 1: mutant position out of sequence, 2: reported mt found at given position, 3: no RefSeq sequence available; note that only mutants with status '0' are inserted into geno2func!
md5                 varchar(32)            sequence md5 hashsum
snp_id              int(11)                dbSNP rs id
contig_acc          varchar(32)            contig accession; unused
contig_ver          tinyint(4)             contig version; unused
asn_from            int(11)                starting position on contig (0-based); unused
asn_to              int(11)                end position on contig (0-based); unused
locus_id            int(11)                NCBI locus id; unused
locus_symbol        varchar(64)            locus name
mrna_acc            varchar(32)            accession of mRNA RefSeq; unused
mrna_ver            int(11)                version of mRNA RefSeq; unused
protein_acc         varchar(32)            protein accession
protein_ver         int(11)                protein accession version
fxn_class           int(11)                dbSNP internal variation code; only contains 42, i.e. nsSNP
reading_frame       int(11)                base position in codon; unused
allele                                     SNP allele in the orientation of the mrna_acc; unused
mt                  varchar(8)             mutant amino acid
pos                 int(11)                protein sequence position of the mutant (0-based)
ctg_id              int(11)                contig id; unused
mrna_pos            int(11)                position in mRNA; unused
mrna_start          int(11)                start position in mRNA; unused
mrna_stop           int(11)                stop position in mRNA; unused
codon               varchar(257)           dbSNP internal usage; unused
protres             varchar(8)             3-letter code for mutant amino acid
contig_gi           int(11)                unused
mrna_gi             int(11)                unused
mrna_orien          tinyint(4)             unused
cp_mrna_ver         int(11)                unused
cp_mrna_gi          int(11)                unused
vercomp             varchar(7)             unused

SNPs_PMD (collects all SAASs from PMD (ProteinMutantDatabase); also contains effects on function and disease)

Field               Type                   Description
mut_id              int(11)                unique mutant id; ref to geno2func
md5                 varchar(32)            sequence md5 hashsum
pmd_id              char(10)               PMD primary identifier
nr                  tinyint(3)             PMD sub-identifier
authors             text                   unused
journal             varchar(70)            unused
title               text                   unused
medline             varchar(10)            unused
crossref            varchar(75)            unused
uniprot_id          varchar(10)            unused
ensembl_id          varchar(20)            unused
other_ref           varchar(20)            unused
protein             text                   unused
source                                     unused
expression_sys      text                   unused
mut_pmd             varchar(20)            unused
mut_real            varchar(10)            the mutant in the form XposY
function            text                   effect on protein function
fb                  tinyint(4)             unused
structure           text                   effect on protein structure
strb                tinyint(4)             unused
stability           text                   effect on protein stability
stab                tinyint(4)             unused
expression          text                   unused
eb                  tinyint(4)             unused
transport           text                   unused
tb                  tinyint(4)             unused
maturation          text                   unused
mb                  tinyint(4)             unused
disease             text                   disease associated with that mutant
db                  tinyint(4)             unused
uni_real            varchar(20)            unused
uni_realid          varchar(20)            unused
uni_start           int(11)                unused
uni_finish          int(11)                unused
uni_loc             int(11)                unused
ens_real            varchar(20)            unused
ens_organism        int(11)                unused
ens_start           int(11)                unused
ens_finish          int(11)                unused
ens_loc             int(11)                unused
pos_real            smallint(5)            mutant position (0-based)
mt_real             char(1)                mutant residue

SNPs_1KG (collects all SAASs mapped from 1KG onto RefSeq)

Field               Type                   Description
mut_id              int(11)                unique mutant id; ref to geno2func
md5                 varchar(32)            sequence md5 hashsum
protein_acc         varchar(32)            RefSeq protein id
protein_ver         int(11)                RefSeq protein version
mrna_acc            varchar(32)            RefSeq mRNA id
exon                int(10)                number of the exon
locus_symbol        varchar(64)            locus
chromosome          int(10)                number of the chromosome
mrna_wt             char(1)                wild-type nucleotide in the mRNA
mrna_pos            int(11)                mutant position in the mRNA (0-based)
mrna_mt             char(1)                mutant nucleotide in the mRNA
chrom_wt            char(1)                wild-type nucleotide in the chromosome
chrom_pos           int(11)                mutant position in the chromosome (0-based)
chrom_mt            char(1)                mutant nucleotide in the chromosome
LDAF                float                  MLE allele frequency accounting for LD (from 1KG VCF)
AC                  int(10)                alternate allele count (from 1KG VCF)
AN                  int(10)                total allele count (from 1KG VCF)
wt                  char(1)                wild-type amino acid
pos                 int(11)                mutant position in protein sequence (0-based)
mt                  char(1)                mutant amino acid

SP_AC_map (maps the primary SwissProt accession to its alternatives)

Field               Type                   Description
first               varchar(10)            primary accession
additional          varchar(10)            alternative accession
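The status codes of the `verified` column in SNPs_dbSNP can be read as the outcome of a simple consistency check between a reported mutant and its RefSeq protein sequence. The following sketch is one possible reading, not the procedure used to build SNPdbe: the function name is hypothetical, the order of the checks is assumed, and code 2 is interpreted as "the sequence already carries the reported mutant residue at that position".

```python
from typing import Optional


def verification_status(refseq_seq: Optional[str], pos: int, mt: str) -> int:
    """Return a status code in the style of SNPs_dbSNP.verified.

    refseq_seq: wild-type RefSeq protein sequence, or None if unavailable
    pos:        0-based position of the mutant in the protein sequence
    mt:         one-letter code of the reported mutant residue
    """
    # 3: no RefSeq sequence available
    if refseq_seq is None:
        return 3
    # 1: mutant position lies outside the sequence
    if pos < 0 or pos >= len(refseq_seq):
        return 1
    # 2: the reported mutant residue is already found at that position
    if refseq_seq[pos] == mt:
        return 2
    # 0: OK; only these mutants are inserted into geno2func
    return 0
```

Under these assumptions, `verification_status("MKT", 1, "R")` yields 0, whereas `verification_status("MKT", 1, "K")` yields 2 because lysine is already the residue at 0-based position 1.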