The Open Door Workshop
Module 1: Sequence Formats and Retrieval
Charles Steward
Aims
Acquaint you with different file formats and associated annotations.
Introduce different nucleotide and protein databases.
Show how to access different genomic data from a variety of databases, using UniProt and GQuery (Entrez).
Introduce BLAST.
Databases
Nucleotide databases: DDBJ/EMBL/NCBI form the International Nucleotide Sequence Database Collaboration and store genomic, cDNA and EST sequences.
Protein database: UniProt, which comprises Swiss-Prot (manually curated) and TrEMBL (automated annotation) sequences.
Accession numbers identify such sequences. An accession number is a unique number, or combination of letters and numbers, assigned to each record in a database, e.g. AL034553.
Information is mirrored daily between DDBJ, EMBL and NCBI.
DDBJ/EMBL/GenBank: INSDC (International Nucleotide Sequence Database Collaboration).
DDBJ: DNA Data Bank of Japan.
CIB-DDBJ: Center for Information Biology and DNA Data Bank of Japan.
EBI: European Bioinformatics Institute.
ENA: European Nucleotide Archive (contains the EMBL Nucleotide Sequence Database).
EMBL: European Molecular Biology Laboratory.
NCBI: National Center for Biotechnology Information.
IAM: International Advisory Meeting.
ICM: International Collaborative Meeting.
EMBL format (see abbreviation table)
Abbreviations found in the EMBL flat file:
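The abbreviation table from this slide is not reproduced here. As a rough illustration of how the two-letter line codes are used, the sketch below groups the lines of a flat-file record by their code. The dictionary is a partial list of commonly seen codes, not the full specification, and the record (including the accession ZZ000000) is a made-up placeholder, not a real entry.

```python
from collections import defaultdict

# Partial list of common EMBL line codes (assumed for illustration only).
EMBL_LINE_CODES = {
    "ID": "identification",
    "AC": "accession number",
    "DE": "description",
    "KW": "keywords",
    "OS": "organism species",
    "OC": "organism classification",
    "FT": "feature table",
    "SQ": "sequence header",
    "XX": "spacer line",
    "//": "end of entry",
}


def group_embl_lines(record_text):
    """Group the content of each line under its two-letter line code."""
    grouped = defaultdict(list)
    for line in record_text.splitlines():
        code = line[:2]
        # Skip spacers, the terminator and sequence-data lines (blank code).
        if code in ("XX", "//") or not code.strip():
            continue
        # The line content starts after the code and its padding spaces.
        grouped[code].append(line[5:].strip())
    return dict(grouped)


# Hypothetical, heavily shortened record; not taken from any database.
EXAMPLE_RECORD = """ID   ZZ000000; SV 1; linear; genomic DNA; STD; HUM; 12 BP.
XX
AC   ZZ000000;
XX
DE   Hypothetical example entry
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                            12
//"""

fields = group_embl_lines(EXAMPLE_RECORD)
for code, values in fields.items():
    print(code, EMBL_LINE_CODES.get(code, "unknown"), values)
```

For real work an established parser (for example the one in Biopython) is a safer choice than hand-written code like this.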
NCBI format and DDBJ format
Sequence Read Archive (SRA) for next-generation sequencing submission
The INSDC now accepts sequence data produced by next-generation sequencing machines.
This screenshot is taken from the ENA hosted at EBI.
For further information go here:
http://www.ebi.ac.uk/ena/about/sra_submissions
http://www.ebi.ac.uk/ena/about/sra_format
NGS file formats
BAM format: a BAM file (.bam) is the binary version of a SAM file.
BigBed format: an indexed BED file (1 line per feature and at least 3 columns).
BigWig format: used for dense continuous data and displayed as a graph/wiggle plot.
VCF format: for variants.
See here for more information:
http://www.ensembl.org/info/website/upload/index.html
http://www.ensembl.org/info/website/upload/large.html
Next Generation Sequencing courses
Wellcome Trust Next Generation Sequencing course, 6-13 April 2014: http://www.wellcome.ac.uk/education-resources/courses-and-conferences/
EBI, Monday 14 October 2013 to Thursday 17 October 2013: http://www.ebi.ac.uk/training/course/next-generation-sequencing-workshop-0
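To make the "1 line per feature and at least 3 columns" description of BED concrete, here is a minimal sketch that reads the three mandatory columns of one tab-separated BED line. The names BedFeature and parse_bed_line are hypothetical, and the example line is invented. BED start coordinates are 0-based and end coordinates are exclusive, so the feature length is simply end minus start.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def parse_bed_line(line):
    """Parse the three mandatory columns of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    # Any optional columns after the third are ignored in this sketch.
    return BedFeature(fields[0], int(fields[1]), int(fields[2]))


# Invented example line with one optional fourth column (a feature name).
feature = parse_bed_line("chr1\t1000\t5000\texample_feature")
print(feature.chrom, feature.start, feature.end, feature.end - feature.start)
```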
NCBI's Entrez system
GQuery (Entrez) entry point: http://www.ncbi.nlm.nih.gov/books/nbk3837/
Reference Sequences
Goal: one sequence entry for each naturally occurring DNA, RNA and protein molecule.
chromosome                  NC_000000
contig                      NT_000000
mRNA                        NM_000000
predicted mRNA              XM_000000
protein                     NP_000000
predicted protein           XP_000000
non-coding RNA              NR_000000
predicted non-coding RNA    XR_000000
Key: curated, calculated.
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
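The RefSeq prefixes listed on this slide can be used to tell what kind of molecule an accession refers to. The sketch below is a simple lookup built only from those eight prefixes; the function name refseq_molecule_type is hypothetical, and accessions with any other prefix return None.

```python
# Prefix table taken from the Reference Sequences slide.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "NT_": "contig",
    "NM_": "mRNA",
    "XM_": "predicted mRNA",
    "NP_": "protein",
    "XP_": "predicted protein",
    "NR_": "non-coding RNA",
    "XR_": "predicted non-coding RNA",
}


def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq accession, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


for accession in ("NM_000000", "XP_000000", "AL034553"):
    print(accession, refseq_molecule_type(accession))
```

AL034553 is an INSDC accession rather than a RefSeq one, so the lookup returns None for it.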
CCDS
Comparison of common CDSs to form a consensus gene set.
QC by UCSC (filters out possible pseudogenes).
Build 104.0: 27,752 agreed CCDS IDs.
The CCDS set is displayed in Ensembl/Vega/UCSC/NCBI.
Non-redundant gene set agreed by all institutes.
EBI databases
EBI search: access all databases http://www.ebi.ac.uk/ena/about/browser.html
ENA sequence window
ENA data view. See here for a clone example: http://www.ebi.ac.uk/ena/data/view/bn000065
UniProt
PE (protein existence) line
Format: PE Level: Evidence;
Values:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain
http://www.expasy.org/cgi-bin/lists?pe_criteria.txt
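The PE line format above ("PE Level: Evidence;") is regular enough to read with a short pattern. The following is a minimal sketch, assuming one PE line per entry and a single-digit level; the function name parse_pe_line is hypothetical and the input line is a constructed example that follows the format shown on the slide.

```python
import re

# Matches "PE", whitespace, a one-digit level, ":", the evidence text, ";".
PE_PATTERN = re.compile(r"^PE\s+(\d):\s*(.+?);\s*$")


def parse_pe_line(line):
    """Return (level, evidence) from a UniProt-style PE line."""
    match = PE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a PE line: {line!r}")
    return int(match.group(1)), match.group(2)


level, evidence = parse_pe_line("PE   1: Evidence at protein level;")
print(level, evidence)
```

Lower levels indicate stronger evidence, so a filter such as `level <= 2` would keep only entries with protein-level or transcript-level support.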
TrEMBL entry: all information is automatically generated.
Swiss-Prot entry: manually curated entry containing more information than TrEMBL.
BLAST similarity searching
BLAST: Basic Local Alignment Search Tool.
There are many different databases available to search against, which may vary depending on which site you start from.
The most commonly used BLAST site is hosted by the NCBI: http://www.ncbi.nlm.nih.gov/blast/
BLAST output
1) The score is a measure of the similarity of the query sequence to the subject sequence. It is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and subject sequence.
2) The E-value is an estimate of how many matches with at least that score would be expected to occur by chance. It is calculated from the length of the query sequence, the size of the database and the score (or scoring system used), and so is specific to that search. Thus, two results from different databases may not be directly comparable. The take-home message: the smaller the E-value, the less likely it is that the match happened at random, and the more likely it is to be real.
For example:
E-value     Interpretation
0.000001    1 in a million searches would produce a false positive with this score
0.01        1 in 100 searches would produce a false positive with this score
1           1 match above threshold is likely to be a false positive
100         100 matches above threshold are false positives
For further details see Karlin & Altschul, PNAS 1990 87:2264-8.
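To show why E-values from different databases are not directly comparable, here is a minimal sketch using the standard Karlin-Altschul approximation E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. The function name and the example numbers are assumptions for illustration. BLAST itself applies edge corrections (effective lengths), so its reported values will differ somewhat.

```python
def expected_chance_hits(query_length, database_length, bit_score):
    """Approximate E-value: E = m * n * 2**(-S') for a bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same query and same bit score, searched against two database sizes
# (all numbers are invented for illustration).
small_db = expected_chance_hits(250, 1_000_000, 40.0)
large_db = expected_chance_hits(250, 100_000_000, 40.0)
print(f"small database: {small_db:.2e}")
print(f"large database: {large_db:.2e}")
```

The same alignment score gives an E-value 100 times larger against the database that is 100 times bigger, which is why the E-value is specific to each search.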
Worked examples
Tasks start on page 18.