Protein sequence databases Rolf Apweiler 1,, Amos Bairoch 2 and Cathy H Wu 3
|
|
|
- Christina Lily Mitchell
- 9 years ago
- Views:
Transcription
1 Protein sequence databases Rolf Apweiler 1,, Amos Bairoch 2 and Cathy H Wu 3 A variety of protein sequence databases exist, ranging from simple sequence repositories, which store data with little or no manual intervention in the creation of the records, to expertly curated universal databases that cover all species and in which the original sequence data are enhanced by the manual addition of further information in each sequence record. As the focus of researchers moves from the genome to the proteins encoded by it, these databases will play an even more important role as central comprehensive resources of protein information. Several the leading protein sequence databases are discussed here, with special emphasis on the databases now provided by the Universal Protein Knowledgebase (UniProt) consortium. Addresses 1 The EMBL Outstation The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK 2 Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland 3 Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box , Washington, DC , USA [email protected] This review comes from a themed issue on Proteomics and genomics Edited by Michael Snyder and John Yates III /$ see front matter ß 2003 Elsevier Ltd. All rights reserved. DOI /j.cbpa Abbreviations DDBJ DNA Data Bank of Japan EMBL European Molecular Biology Laboratory GO Gene Ontology NCBI National Center of Biotechnology Information NREF non-redundant reference databases PDB Protein Data Bank PIR Protein Information Resource PIR-PSD Protein Information Resource Protein Sequence Database RefSeq Reference Sequence TrEMBL Translation from EMBL UniParc UniProt Archive Introduction With the availability of over 165 completed genome sequences from both eukaryotic and prokaryotic organisms, efforts are now being focused on the identification and functional analysis of the proteins encoded by these genomes. The large-scale analysis of these proteins has started to generate huge amounts of data due to the new information provided by the genome projects and to a range of new technologies in protein science. For example, mass spectrometry approaches are being used in protein identification and in determining the nature of post-translational modifications [1]. These and other methods make it possible to quickly identify large numbers of proteins, to map their interactions, to determine their location within the cell [2 ] and to analyse their biological activities. Protein sequence databases play a vital role as a central resource for storing the data generated by these and more conventional efforts, and making them available to the scientific community. To exploit the various resources fully, it is essential to distinguish between them and to identify the types of data they contain. Universal protein databases cover proteins from all species whereas specialized data collections contain information about a particular protein family or group of proteins, or related to a specific organism. Universal protein sequence databases can be further subdivided into two categories: sequence repositories, in which data are stored with little or no manual intervention in the creation of the records; and expertly curated databases, in which the original data are enhanced by the addition of further information. In the following, we present the current status of the leading protein sequence databases. Sequence repositories Several protein sequence databases act as repositories of protein sequences. These databases add little or no additional information to the sequence records they contain and generally make no effort to provide a nonredundant collection of sequences to users. GenPept The most basic example of this type of database is the GenBank Gene Products Data Bank (GenPept), produced by the National Center of Biotechnology Information (NCBI) [3]. The entries in the database are derived from translations of the sequences contained in the nucleotide database maintained collaboratively by the DNA Data Bank of Japan (DDBJ) [4], the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database [5] and GenBank [6], and contain minimal annotation, which has been extracted primarily from the corresponding nucleotide entry. The entries lack additional annotation and the database does not contain proteins derived from amino acid sequencing. Also, each protein may be represented by multiple records and no attempt is made to group these records into a single database entry.
2 Protein sequence databases Apweiler, Bairoch and Wu 77 NCBI s Entrez Protein NCBI s Entrez Protein ( entrez/query.fcgi?db¼protein) is another example of a sequence repository. The database contains sequence data translated from the nucleotide sequences of the DDBJ/EMBL/GenBank database as well as sequences from Swiss-Prot [7], the Protein Information Resource (PIR) [8], RefSeq[9] and the Protein Data Bank (PDB) [10]. The database differs from GenPept in that many of the entries contain additional information that has been extracted from curated databases such as Swiss- Prot and PIR. As with GenPept, the sequence collection is redundant. RefSeq A more ambitious approach is taken by the Reference Sequence (RefSeq) collection produced by the NCBI ( The aim of the project is to provide a non-redundant collection of reference protein sequences. RefSeq sequences exist for a limited set of species, including approximately 1100 viruses and 150 bacteria, as well as a small number of higher organisms, such as human, mouse, rat, zebrafish, honeybee, sea urchin, cow, and several important plant species. The main features of the RefSeq collection include nonredundancy, explicitly linked nucleotide and protein sequences, updates to reflect current knowledge of sequence data and biology, data validation and format consistency, distinct accession series, and ongoing curation by NCBI staff and collaborators, with review status indicated on each record. However, the majority of the records are automatically generated with minimal manual intervention. In December 2003, the database contained entries with approximately manually reviewed entries so the database is closer to a sequence repository than to any of the curated databases discussed below. Universal curated databases Although repositories are an essential means of providing the user with sequences as quickly as possible, it is clear that, when additional information is added to a sequence, this greatly increases the value of the resource for users. The curated databases enrich the sequence data by adding additional information, which gets validated by expert biologists before being added to the databases to ensure that the data in these collections can be considered to be highly reliable. There is also a large effort invested in maintaining non-redundant datasets by compiling all reports for a given protein sequence into a single record. PIR-PSD The oldest universal curated protein sequence database is the Protein Information Resource Protein Sequence Database (PIR-PSD) ( It was established in 1984 as a successor to the original National Biomedical Research Foundation Protein Sequence Database, developed over a 20 year period by the late Margaret O Dayhoff and published as the Atlas of Protein Sequence and Structure from 1965 to 1978 [11]. It compiles comprehensive, non-redundant protein sequence data, organized by superfamily and family, and annotated with functional, structural, bibliographic and genetic data. In addition to the sequence data, the database contains the name and classification of the protein, the name of the organism in which it naturally occurs, references to the primary literature, function and general characteristics of the protein, and regions of biological interest within the sequence. The database is extensively cross-referenced with DDBJ/EMBL/GenBank nucleic acid and protein identifiers, PubMed and MEDLINE IDs, and unique identifiers from many other source databases. In October 2003, the database contained annotated and classified entries, covering the entire taxonomic range and organized into superfamilies and over families. Protein family classification is central to the organization and annotation of PIR. Automated procedures have allowed the placement of more than 99% sequences into families and more than 70% into superfamilies. The classification approach improves the sensitivity of protein identification, helps to detect and correct genome annotation errors systematically, and allows a more complete understanding of sequence function structure relationships. Swiss-Prot The leading universal curated protein sequence database is Swiss-Prot ( html), which contained as of November 2003 (release 42.6) curated sequence entries from over 8300 different species. The database is non-redundant, which means that all reports for a given protein are merged into a single entry, and is highly integrated with other databases [12]. Each entry in Swiss-Prot is thoroughly analysed and annotated by biologists to ensure that the database is of a high quality. Literature-based curation is used to extract experimental data. This experimental knowledge is supplemented by manually confirmed results from various sequence analysis programs. Annotation includes the description of properties such as the function of a protein, post-translational modifications, domains and sites, secondary and quaternary structure, similarities to other proteins, diseases associated with deficiencies in a protein, developmental stages in which the protein is expressed, in which tissues the protein is found, pathways in which the protein is involved, and sequence conflicts and variants.
3 78 Proteomics and genomics The annotation added is stored mainly in the description (DE) and gene (GN) lines, the comment (CC) lines, the feature table (FT) lines, and the keyword (KW) lines. There are over comments and sequence features in Swiss-Prot release The addition of several qualifiers in the comment and feature table lines during the annotation process allows users to distinguish between experimentally verified data, data that has been propagated from a characterized protein based on sequence similarity, and data for which no experimental evidence currently exists [13]. To provide correct gene names, use is made of authoritative gene name sources such as Genew, the database of the Human Genome Organization (HUGO) gene nomenclature committee [14], FlyBase [15] and the Mouse Genome Database (MGD) [16]. TrEMBL To produce a fully curated Swiss-Prot entry is a highly labor-intensive process and is the rate-limiting step in the growth of the database because more new sequences are submitted than can be efficiently annotated manually and integrated into the database. To address this, the TrEMBL (Translation from EMBL) database ( was introduced to make new sequences available as quickly as possible [7]. TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences in the DDBJ/EMBL/ GenBank nucleotide sequence database that are not yet included in Swiss-Prot. To ensure completeness, it also contains several protein sequences extracted from the literature or submitted directly by the user community. TrEMBL Release 25.6 of November 2003 contained entries from more than different species. TrEMBL follows the Swiss-Prot format and conventions described above as closely as possible. The production of TrEMBL starts with the translation of coding sequences in the DDBJ/EMBL/GenBank nucleotide sequence database. At this stage, all annotation in a TrEMBL entry derives from the corresponding nucleotide entry. The next steps involve redundancy removal through merging of multiple records [17] and the automated enhancement of the information content in TrEMBL [18]. The process is based on a system of standardized transfer of annotation from well-characterized proteins in Swiss-Prot to unannotated TrEMBL entries belonging to defined groups [19]. To assign entries to these groups, InterPro [20], an integrated resource of protein families, domains and functional sites, is used. This amalgamates the efforts of the member databases, which are currently PROSITE [21], PRINTS [22], Pfam[23], ProDom[24] SMART [25], TIGRFAMS [26], PIR SuperFamilies [27] and SUPER- FAMILY [28]. This process of adding accurate, highquality information to TrEMBL entries brings the standard of annotation in TrEMBL closer to that found in Swiss-Prot. UniProt: the next generation of protein sequence databases One of the most significant developments with regard to protein sequence databases is the recent decision by the National Institutes of Health to award a grant [29] to combine the Swiss-Prot, TrEMBL and PIR-PSD databases into a single resource, UniProt ( uniprot.org) [30 ]. UniProt was launched on 15 December 2003 and comprises three components: first, the UniProt Knowledgebase which will continue the work of Swiss-Prot, TrEMBL and PIR by providing an expertly curated database; second, the UniProt Archive (UniParc) into which new and updated sequences are loaded on a daily basis; third, the UniProt non-redundant reference databases (UniProt NREF), which provide non-redundant views on top of the UniProt Knowledgebase and UniParc. The UniProt Archive (UniParc) UniParc is the most comprehensive publicly accessible non-redundant protein sequence collection available. It contains publicly available protein sequences from Swiss- Prot, TrEMBL, PIR-PSD, EMBL, Ensembl [31], International Protein Index (IPI) ( PDB, RefSeq, FlyBase, WormBase [32] and the patent offices in Europe, the United States and Japan, making it the most comprehensive protein sequence database available. While a protein sequence may exist in multiple databases and more than once in a given database, Uni- Parc stores every unique sequence only once and assigns a unique UniParc identifier. Furthermore, UniParc provides cross-references to the source databases (accession numbers), sequence versions, and status (active or obsolete). A UniParc sequence version is also provided, and incremented each time the underlying sequence changes, thus making it possible to observe sequence changes in all source databases. The UniProt knowledgebase (UNIPROT) Swiss-Prot, TrEMBL and PIR-PSD have been merged to form the UniProt knowledgebase. All suitable PIR-PSD sequences that are missing from Swiss-Prot þ TrEMBL were incorporated into UniProt. Bi-directional crossreferences between Swiss-Prot þ TrEMBL and PIR- PSD were created to allow the easy tracking of the PIR-PSD entries. The transfer into UniProt of references and experimentally verified data present in PIR but missing from Swiss-Prot þ TrEMBL is ongoing. The UniProt knowledgebase consists of two parts: a section containing fully manually annotated records resulting from literature information extraction and curator-evaluated computational analysis, and a section with computationally analysed records awaiting full manual
4 Protein sequence databases Apweiler, Bairoch and Wu 79 annotation. For the sake of continuity and name recognition, the two sections are referred to as Swiss-Prot and TrEMBL. The main principles of the UniProt knowledgebase follow the established procedures used to annotate Swiss- Prot, TrEMBL and PIR, and this has already been explained in some detail. However, some additional advances have been made during the creation of UniProt. Gene Ontology annotation To provide the high level of annotation described above, the UniProt curators read a large amount of scientific literature related to each protein. This enables them to contribute to the work of the Gene Ontology (GO) consortium [33] by assigning GO terms during the annotation process as they extract information related to each of the gene ontologies (i.e. the function of a protein, what processes it is involved in and where in the cell it is located) [34 ]. Integration with other databases UniProt provides cross-references to external data collections such as the underlying DNA sequence entries in the DDBJ/EMBL/GenBank nucleotide sequence databases, 2D PAGE and 3D protein structure databases, various protein domain and family characterization databases, post-translational modification databases, species-specific data collections, variant databases and disease databases. As a result of this, UniProt acts as a central hub for biomolecular information archived in more than 50 cross-referenced databases. Isoform and feature identifiers Unique and stable feature identifiers (FTId) allow reference to a position-specific annotation item in the feature table. Currently, these are systematically attributed to FT VARIANT lines of human sequence entries, to alternative splicing events (VARSPLIC), and to certain glycosylation sites (CARBOHYD), but will ultimately be assigned to all types of FT lines. Isoform identifiers have been introduced for splice isoforms, which may differ considerably from one another, with potentially less than 50% sequence similarity between isoforms. The tool VARSPLIC [35], which is freely available, enables the recreation of all annotated splice variants from the feature table of a UniProt entry, or for the complete database. A FASTA-formatted file containing all splice variants annotated in UniProt can be downloaded for use with similarity search programs. The UniProt NREF (UniRef) databases The UniProt NREF (UniRef) databases, NREF100, NREF90 and NREF50, provide a complete coverage of sequence space while hiding redundant sequences from view. NREF100 provides a comprehensive non-redundant sequence collection clustered by sequence identity and taxonomy with source attribution. Identical sequences and subfragments from the same source organism (species) are presented as a single NREF entry with accession numbers of all the merged UniProt entries, the protein sequence, taxonomy, bibliography, links to the corresponding UniProt knowledgebase and archive records, as well as close sequence neighbors (with at least 95% sequence identity) from the same source organism. NREF90 and NREF50 are built from NREF100 to provide non-redundant sequence collections for the scientific user community to perform faster homology searches. All records from all source organisms with mutual sequence identity of >90% or >50%, respectively, are merged into a single record that links to the corresponding UniProt knowledgebase records. NREF90 and NREF50 yield a size reduction of approximately 40% and 65%, respectively. Conclusions Complete and up-to-date databases of biological knowledge are vital for information-dependent biological and biotechnological research. With the rapid accumulation of genome sequences for many organisms, attention is turning to the identification and function of proteins encoded by these genomes. The recent joining of forces by the major protein databases Swiss-Prot, TrEMBL and PIR in the UniProt consortium to handle the increasing volume and variety of protein sequences and functional information will provide a cornerstone for scientists active in modern biological research. Acknowledgements UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG Minor support for the EBI s involvement in UniProt comes from the two European Union contracts BioBabel (QLRT ) and TEMBLOR (QLRI ) and from the NIH grant 1R01HGO Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the National Science Foundation (NSF) grants DBI and ITR References and recommended reading Papers of particular interest, published within the annual period of review, have been highlighted as: of special interest of outstanding interest 1. Sickmann A, Mreyen M, Meyer HE: Mass spectrometry a key technology in proteome research. Adv Biochem Eng Biotechnol 2003, 83: Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O Shea EK: Global analysis of protein expression in yeast. Nature 2003, 425: This article describes the construction and analysis of yeast strains expressing full-length, chromosomally tagged green fluorescent protein fusion proteins, representing 75% of the yeast proteome. The authors provide localization information for 70% of previously unlocalized proteins. Analysis of this dataset will help to define the functions of proteins in the context of the cellular environment.
5 80 Proteomics and genomics 3. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology. Nucleic Acids Res 2003, 31: Miyazaki S, Sugawara H, Gojobori T, Tateno Y: DNA Data Bank of Japan in XML. Nucleic Acids Res 2003, 31: Stoesser G, Baker W, van den Broek A, Garcia-Pastor M, Kanz C, Kulikova T, Leinonen R, Lin Q, Lombard V, Lopez R et al.: The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Res 2003, 31: Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2003, 31: Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O Donovan C, Phan I et al.: The Swiss-Prot protein knowledgebase and its supplement TrEMBL in Nucleic Acids Res 2003, 31: Wu CH, Yeh LS, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z, Kourtesis P, Ledley RS, Suzek BE et al.: The Protein Information Resource. Nucleic Acids Res 2003, 31: Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence Project: update and current status. Nucleic Acids Res 2003, 31: Westbrook J, Feng Z, Chen L, Yang H, Berman HM: The Protein Data Bank and structural genomics. Nucleic Acids Res 2003, 31: Dayhoff MO: Atlas of Protein Sequence and Structure, vol5, suppl. 3. Washington, DC: National Biomedical Research Foundation; Gasteiger E, Jung E, Bairoch A: SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol 2001, 3: Junker V, Apweiler R, Bairoch A: Representation of functional information in the Swiss-Prot data bank. Bioinformatics 1999, 15: Wain HM, Lush M, Ducluzeau F, Povey S: Genew: the human gene nomenclature database. Nucleic Acids Res 2002, 30: FlyBase consortium: The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2003, 31: Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT: MGD: the Mouse Genome Database. Nucleic Acids Res 2003, 31: O Donovan C, Martin MJ, Glemet E, Codani J, Apweiler R: Removing redundancy in Swiss-Prot and TrEMBL. Bioinformatics 1999, 15: Apweiler R: Functional information in Swiss-Prot: the basis for large-scale characterisation of protein sequences. Brief Bioinform 2001, 2: Fleischmann W, Moeller S, Gateau A, Apweiler R: A novel method for automatic and reliable functional annotation. Bioinformatics 1999, 15: Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P et al.: The InterPro database, 2003 brings increased coverage and new features. Nucleic Acids Res 2003, 31: Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJA, Hofmann K, Bairoch A: The PROSITE database, its status in Nucleic Acids Res 2002, 30: Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P et al.: PRINTS and its automatic supplement, preprints. Nucleic Acids Res 2003, 31: Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL: The Pfam protein families database. Nucleic Acids Res 2002, 30: Corpet F, Servant F, Gouzy J, Kahn D: ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 2000, 28: Letunic I, Goodstadt L, Dickens NJ, Doerks T, Schultz J, Mott R, Ciccarelli F, Copley RR, Ponting CP, Bork P: Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res 2002, 30: Haft DH, Selengut JD, White O: The TIGRFAMs database of protein families. Nucleic Acids Res 2003, 31: Huang H, Barker WC, Chen Y, Wu CH: iproclass: an integrated database of protein family, function and structure information. Nucleic Acids Res 2003, 31: Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of Hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313: Butler D: NIH pledges cash for global protein database. Nature 2002, 419: Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro F, Gasteiger E, Huang H, Lopez R, Magrane M et al.: UniProt: the Universal Protein Knowledgebase. Nucleic Acids Res 2004, 32:in press. To provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information, the Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt) consortium. This paper describes the broad goals of the UniProt Consortium, and gives the basic practical information about UniParc, UniProt and UniRef, the three databases components provided by the consortium. 31. Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V et al.: Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res 2003, 31: Harris TW, Lee R, Schwarz E, Bradnam K, Lawson D, Chen W, Blasier D, Kenny E, Cunningham F, Kishore R et al.: WormBase: a cross-species database for comparative genomics. Nucleic Acids Res 2003, 31: The Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res 2001, 11: Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, Kersey P, Mulder N, Oinn T, Maslen J, Cox A, Apweiler R: The Gene Ontology Annotation (GOA) project: implementation of GO in Swiss-Prot, TrEMBL and InterPro. Genome Res 2003, 13: Gene Ontology Annotation (GOA) is a project that aims to provide assignments of GO terms to gene products in Swiss-Prot and TrEMBL. This paper explains in detail the manual and automatic annotation procedures used to assign millions of GO terms to hundreds of thousands of proteins. 35. Kersey P, Hermjakob H, Apweiler R: VARSPLIC: alternativelyspliced protein sequences derived from Swiss-Prot and TrEMBL. Bioinformatics 2000, 16:
Linear Sequence Analysis. 3-D Structure Analysis
Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical properties Molecular weight (MW), isoelectric point (pi), amino acid content, hydropathy (hydrophilic
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Scientific databases. Biological data management
Scientific databases Biological data management The term paper within the framework of the course Principles of Modern Database Systems by Aleksejs Kontijevskis PhD student The Linnaeus Centre for Bioinformatics
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
Gene2Pathway Predicting Pathway Membership via Domain Signatures
Gene2Pathway Predicting Pathway Membership via Domain Signatures Holger Fröhlich October 27, 2010 Abstract Functional characterization of genes is of great importance, e.g. in microarray studies. Valueable
BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS
BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS 1. The Technology Strategy sets out six areas where technological developments are required to push the frontiers of knowledge
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
The Pfam Protein Families Database
The Pfam Protein Families Database Robert D. Finn *1, John Tate 1, Jaina Mistry 1, Penny C. Coggill 1, Stephen John Sammut 1, Hans-Rudolf Hotz 1, Goran Ceric 2, Kristoffer Forslund 3, Sean R. Eddy 2, Erik
Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome
Module 2 Genome Viewing Using Genome Browsers to View Annotation of the Human Genome Bert Overduin, Ph.D. PANDA Coordination & Outreach EMBL - European Bioinformatics Institute Wellcome Trust Genome Campus
Sequence Information. Sequence information. Good web sites. Sequence information. Sequence. Sequence
Sequence information Multiple Pair-wise SRS Entrez Comparisons Database searches Sequence Information Orthologue clusters Sequence Organell localisation Patterns Protein families Membrane attachment Bengt
Biological Databases and Protein Sequence Analysis
Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
org.rn.eg.db December 16, 2015 org.rn.egaccnum Map Entrez Gene identifiers to GenBank Accession Numbers org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank
Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility
Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility Report of a meeting organized by the Wellcome Trust and held on 14 15 January 2003 at Fort Lauderdale,
INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B
INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE ICH HARMONISED TRIPARTITE GUIDELINE QUALITY OF BIOTECHNOLOGICAL PRODUCTS: ANALYSIS
EMBL-EBI Web Services
EMBL-EBI Web Services Rodrigo Lopez Head of the External Services Team SME Workshop Piemonte 2011 EBI is an Outstation of the European Molecular Biology Laboratory. Summary Introduction The JDispatcher
Global and Discovery Proteomics Lecture Agenda
Global and Discovery Proteomics Christine A. Jelinek, Ph.D. Johns Hopkins University School of Medicine Department of Pharmacology and Molecular Sciences Middle Atlantic Mass Spectrometry Laboratory Global
Protein Analysis and WEKA Data Mining
GenMiner: a Data Mining Tool for Protein Analysis Gerasimos Hatzidamianos 1, Sotiris Diplaris 1,2, Ioannis Athanasiadis 1,2, Pericles A. Mitkas 1,2 1 Department of Electrical and Computer Engineering Aristotle
Human Genome Organization: An Update. Genome Organization: An Update
Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
THE GENBANK SEQUENCE DATABASE
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D. Baxevanis, B.F. Francis Ouellette Copyright 2001 John Wiley & Sons, Inc. ISBNs: 0-471-38390-2 (Hardback);
Bioinformatics, Sequences and Genomes
Bioinformatics, Sequences and Genomes BL4273 Bioinformatics for Biologists Week 1 Daniel Barker, School of Biology, University of St Andrews Email [email protected] BL4273 and 4273π 4273π is a custom
Committee on WIPO Standards (CWS)
E CWS/1/5 ORIGINAL: ENGLISH DATE: OCTOBER 13, 2010 Committee on WIPO Standards (CWS) First Session Geneva, October 25 to 29, 2010 PROPOSAL FOR THE PREPARATION OF A NEW WIPO STANDARD ON THE PRESENTATION
European Medicines Agency
European Medicines Agency July 1996 CPMP/ICH/139/95 ICH Topic Q 5 B Quality of Biotechnological Products: Analysis of the Expression Construct in Cell Lines Used for Production of r-dna Derived Protein
Teaching Bioinformatics to Undergraduates
Teaching Bioinformatics to Undergraduates http://www.med.nyu.edu/rcr/asm Stuart M. Brown Research Computing, NYU School of Medicine I. What is Bioinformatics? II. Challenges of teaching bioinformatics
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
Guide for Bioinformatics Project Module 3
Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Discovering Bioinformatics
Discovering Bioinformatics Sami Khuri Natascha Khuri Alexander Picker Aidan Budd Sophie Chabanis-Davidson Julia Willingale-Theune English version ELLS European Learning Laboratory for the Life Sciences
Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr
Introduction to Databases Shifra Ben-Dor Irit Orr Lecture Outline Introduction Data and Database types Database components Data Formats Sample databases How to text search databases What units of information
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
A Primer of Genome Science THIRD
A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:
Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1
Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]
Converting GenMAPP MAPPs between species using homology
Converting GenMAPP MAPPs between species using homology 1 Introduction and Background 2 1.1 Fundamental principles of the GenMAPP Gene Database 2 1.1.1 Gene Database data types 2 1.1.2 GenMAPP System Codes
Web-Based Genomic Information Integration with Gene Ontology
Web-Based Genomic Information Integration with Gene Ontology Kai Xu 1 IMAGEN group, National ICT Australia, Sydney, Australia, [email protected] Abstract. Despite the dramatic growth of online genomic
Algorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
Sequence homology search tools on the world wide web
44 Sequence Homology Search Tools Sequence homology search tools on the world wide web Ian Holmes Berkeley Drosophila Genome Project, Berkeley, CA email: [email protected] Introduction Sequence homology
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
UCM-MACB 2.0: A COMPLUTENSE UNIVERSITY VIRTUAL HERBARIUM PROJECT
UCM-MACB 2.0: A COMPLUTENSE UNIVERSITY VIRTUAL HERBARIUM PROJECT A. Avalos, A. Barrera, J.M. Gabriel y Galán, T. Gallardo, A. Gómez, J.M. Hernández, R. Lahoz, P. López, N. Marcos, L. Martin, A. G.Moreno,
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Hidden Markov models in biological sequence analysis
Hidden Markov models in biological sequence analysis by E. Birney The vast increase of data in biology has meant that many aspects of computational science have been drawn into the field. Two areas of
Chapter 14. Modeling Experimental Design for Proteomics. Jan Eriksson and David Fenyö. Abstract. 1. Introduction
Chapter Modeling Experimental Design for Proteomics Jan Eriksson and David Fenyö Abstract The complexity of proteomes makes good experimental design essential for their successful investigation. Here,
CCR Biology - Chapter 9 Practice Test - Summer 2012
Name: Class: Date: CCR Biology - Chapter 9 Practice Test - Summer 2012 Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Genetic engineering is possible
UNIT I LESSON -1 INTRODUCTION TO BIOINFORMATICS
UNIT I LESSON -1 INTRODUCTION TO BIOINFORMATICS 1.0 Aims and Objectives 1.1 Introduction to Bioinformatics 1.2 Landmark Sequences Completed 1.3 Sequence Analysis: Sequence to Potential Function 1.4 The
SUBMITTING DNA SEQUENCES TO THE DATABASES
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D. Baxevanis, B.F. Francis Ouellette Copyright 2001 John Wiley & Sons, Inc. ISBNs: 0-471-38390-2 (Hardback);
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS
BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title: Bioinformatics
Processing Genome Data using Scalable Database Technology. My Background
Johann Christoph Freytag, Ph.D. [email protected] http://www.dbis.informatik.hu-berlin.de Stanford University, February 2004 PhD @ Harvard Univ. Visiting Scientist, Microsoft Res. (2002)
Integrating Bioinformatics, Medical Sciences and Drug Discovery
Integrating Bioinformatics, Medical Sciences and Drug Discovery M. Madan Babu Centre for Biotechnology, Anna University, Chennai - 600025 phone: 44-4332179 :: email: [email protected] Bioinformatics
Distributed Bioinformatics Computing System for DNA Sequence Analysis
Global Journal of Computer Science and Technology: A Hardware & Computation Volume 14 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.
Module 3. Genome Browsing. Using Web Browsers to View Genome Annota4on. Kers4n Howe Wellcome Trust Sanger Ins4tute zfish- [email protected].
Module 3 Genome Browsing Using Web Browsers to View Genome Annota4on Kers4n Howe Wellcome Trust Sanger Ins4tute zfish- [email protected] Introduc.on Genome browsing The Ensembl gene set Guided examples
Analysis of Illumina Gene Expression Microarray Data
Analysis of Illumina Gene Expression Microarray Data Asta Laiho, Msc. Tech. Bioinformatics research engineer The Finnish DNA Microarray Centre Turku Centre for Biotechnology, Finland The Finnish DNA Microarray
Agenda. Today s session will look at the following... Ovid Open Access What is it? The Content What are the sources behind Ovid Open Access?
Slide 1 Agenda Today s session will look at the following... Ovid Open Access What is it? The Content What are the sources behind Ovid Open Access? A Trainer s View Some hidden benefits of Ovid Open Access
Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness
Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness Melanie Dulong de Rosnay Fellow, Science Commons and Berkman Center for Internet & Society at Harvard University This article
Introduction to Genome Annotation
Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
Annex 6: Nucleotide Sequence Information System BEETLE. Biological and Ecological Evaluation towards Long-Term Effects
Annex 6: Nucleotide Sequence Information System BEETLE Biological and Ecological Evaluation towards Long-Term Effects Long-term effects of genetically modified (GM) crops on health, biodiversity and the
Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London
Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London 1. What is Discovery Net 2. Distributed Data Mining for Compute Intensive Tasks 3. Distributed
Worksheet - COMPARATIVE MAPPING 1
Worksheet - COMPARATIVE MAPPING 1 The arrangement of genes and other DNA markers is compared between species in Comparative genome mapping. As early as 1915, the geneticist J.B.S Haldane reported that
BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16
Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems
Classification and Prioritization of Biomedical Literature for the Comparative Toxicogenomics Database
Classification and Prioritization of Biomedical Literature for the Comparative Toxicogenomics Database Dina VISHNYAKOVA a,b,d,1, Emilie PASCHE a,b,d, Julien GOBEILL a,c,d, Arnaud GAUDINAT a,c,d, Christian
A leader in the development and application of information technology to prevent and treat disease.
A leader in the development and application of information technology to prevent and treat disease. About MOLECULAR HEALTH Molecular Health was founded in 2004 with the vision of changing healthcare. Today
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences WP11 Data Storage and Analysis Task 11.1 Coordination Deliverable 11.2 Community Needs of
Biological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute
European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute Justin Paschall Team Leader Genetic Variation / EGA ! European Genome-phenome
Basic Concepts of DNA, Proteins, Genes and Genomes
Basic Concepts of DNA, Proteins, Genes and Genomes Kun-Mao Chao 1,2,3 1 Graduate Institute of Biomedical Electronics and Bioinformatics 2 Department of Computer Science and Information Engineering 3 Graduate
Biochemistry Major Talk 2014-15. Welcome!!!!!!!!!!!!!!
Biochemistry Major Talk 2014-15 August 14, 2015 Department of Biochemistry The University of Hong Kong Welcome!!!!!!!!!!!!!! Introduction to Biochemistry A four-minute video: http://www.youtube.com/watch?v=tpbamzq_pue&l
REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])
820 REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf]) (See also General Regulations) BMS1 Admission to the Degree To be eligible for admission to the degree of Bachelor
