Bioinformatics Integration Framework for Metabolic Pathway Data-Mining
|
|
|
- Nelson Casey
- 10 years ago
- Views:
Transcription
1 Bioinformatics Integration Framework for Metabolic Pathway Data-Mining Tomás Arredondo V. 1, Michael Seeger P. 2, Lioubov Dombrovskaia 3, Jorge Avarias 3 A., Felipe Calderón B. 3, Diego Candel C. 3, Freddy Muñoz R. 3, Valeria Latorre R. 2, Loreine Agulló 2, Macarena Cordova H. 2, and Luis Gómez 2 1 Departamento de Electrónica [email protected] 2 Millennium Nucleus EMBA, Departamento de Química 3 Departamento de Informática, Universidad Técnica Federico Santa María Av. España 1680, Valparaíso, Chile Abstract. A vast amount of bioinformatics information is continuously being introduced to different databases around the world. Handling the various applications used to study this information present a major data management and analysis challenge to researchers. The present work investigates the problem of integrating heterogeneous applications and databases towards providing a more efficient data-mining environment for bioinformatics research. A framework is proposed and GeXpert, an application using the framework towards metabolic pathway determination is introduced. Some sample implementation results are also presented. 1 Introduction Modern biotechnology aims to provide powerful and beneficial solutions in diverse areas. Some of the applications of biotechnology include biomedicine, bioremediation, pollution detection, marker assisted selection of crops, pest management, biochemical engineering and many others [3, 14, 22, 21, 29]. Because of the great interest in biotechnology there has been a proliferation of separate and disjoint databases, data formats, algorithms and applications. Some of the many types of databases currently in use include: nucleotide sequences (e.g. Ensemble, Genbank, DDBJ, EMBL), protein sequences (e.g. SWISS-PROT, InterPro, PIR, PRF), enzyme databases (e.g. Enzymes), metabolic pathways (e.g. ERGO, KEGG: Kyoto Encyclopedia of Genes and Genomes database) and literature references (e.g. PubMed) [1, 4, 11, 6, 23]. The growth in the amount of information stored has been exponential: since the 1970s, as of April 2004 there were over 39 billion bases in Entrez NCBI (National Center of Bioinformatics databases), while the number of abstracts in PubMed has been growing by 10,000 abstracts per week since 2002 [3, 13]. The availability of this data has undoubtedly accelerated biotechnological research. However, because these databases were developed independently and are
2 managed autonomously, they are highly heterogeneous, hard to cross-reference, and ill-suited to process mixed queries. Also depending on the database being accessed the data in them is stored in a variety of formats including: a host of graphic formats, RAW sequence data, FASTA, PIR, MSF, CLUSTALW, and other text based formats including XML/HTML. Once the desired data is retrieved from the one of the database(s) it typically has to be manually manipulated and addressed to another database or application to perform a required action such as: database and homology search (e.g. BLAST, Entrez), sequence alignment and gene analysis (e.g. ClustalW, T-Coffee, Jalview, GenomeScan, Dialign, Vector NTI, Artemis) [8]. Beneficial application developments would occur more efficiently if the large amounts of biological feature data could be seamlessly integrated with data from literature, databases and applications for data-mining, visualization and analysis. In this paper we present a framework for bioinformatic literature, database, and application integration. An application based of this framework is shown that supports metabolic pathway research within a single easy to use graphical interface with assisting fuzzy logic decision support. To our knowledge, an integration framework encompassing all these areas has not been attempted before. Early results have shown that the integration framework and application could be useful to bioinformatics researchers. In Section 2, we describe current metabolic pathway research methodology. In Section 3 we describe existing integrating architectures and applications. Section 4 describes the architecture and GeXpert, an application using this framework is introduced. Finally, some conclusions are drawn and directions of future work are presented. 2 Metabolic Pathway Research For metabolic pathway reconstruction experts have been traditionally used a time intensive iterative process [30]. As part of this process genes first have to be selected as candidates for encoding an enzyme within a potential metabolic pathway within an organism. Their selection then has to be validated with literature references (e.g. non-hypothetical genes in Genbank) and using bioinformatics tools (e.g. BLAST: Basic Local Alignment Search Tool) for finding orthologous genes in various other organisms. Once a candidate gene has been determined, sequence alignment of the candidate gene with the sequence of the organism under study has to be performed in a different application for gene locations (e.g. Vector NTI, ARTEMIS). Once the genes required have been found in the organism then the metabolic pathway has to be confirmed experimentally in the laboratory. For example, using the genome sequence of Acidithiobacillus ferrooxidans diverse metabolic pathways have been determined. One major group of organisms that is currently undergoing metabolic pathways research is bacteria. Bacteria possess the highest metabolic versatility of the three domains of living organisms. This versatility stems from their expansion into different natural niches, a remarkable degree of physiological and genetic
3 adaptability and their evolutionary diversity. Microorganisms play a main role in the carbon cycle and in the removal of natural and man-made waste chemical compounds from the environment [17]. For example, Burkholderia xenovorans LB400 is a bacterium capable of degrading a wide range of PCBs [7, 29]. Because of the complex and noisy nature of the data, any selection of candidate genes as part of metabolic pathways is currently only done by human experts prior to biochemical verification. The lack of integration and standards in database, application and file formats is time consuming and forces researchers to develop ad hoc data management processes that could be prone to error. In addition, the possibility of using Softcomputing based pattern detection and analysis techniques (e.g. fuzzy logic) have not been fully explored as an aid to the researcher within such environments [1, 23]. 3 Integration Architectures The trend in the field is towards data integration. Research projects continue to generate large amounts of raw data, and this is annotated and correlated with the data in the public databases. The ability to generate new data continues to outpace the ability to verify them in the laboratory and therefore to exploit them. Validation experiments and the ultimate conversion of data to validated knowledge need expert human involvement with the data and in the laboratory, consuming time and resources. Any effort of data and system integration is an attempt towards reducing the time spent by experts unnecessarily which could be better spent in the lab [5, 9, 20]. The biologist or biochemist not only needs to be an expert in his field as well stay up to date with the latest software tools or even develop his own tools to be able to perform his research [16, 28]. One example of these types of tools is BioJava, which is an open source set of Java components such as parsers, file manipulation tools, sequence translation and proteomic components that allow extensive customization but still require the development of source code [25]. The goal should be integrated user-friendly systems that would greatly facilitate the constructive cycle of computational model building and experimental verification for the systematic analysis of an organism [5, 8, 16, 20,28, 30]. Sun et al. [30] have developed a system, IdentiCS, which combines the identification of coding sequences (CDS) with the reconstruction, comparison and visualization of metabolic networks. IdentiCS uses sequences from public databases to perform a BLAST query to a local database with the genome in question. Functional information from the CDSs is used for metabolic reconstruction. One shortcoming is that the system does not incorporate visualization or ORF (Open Reading Frames) selection and it includes only one application for metabolic sequence reconstruction (BLAST). Information systems for querying, visualization and analysis must be able to integrate data on a large scale. Visualization is one of the key ways of making the researcher work easier. For example, the spatial representation of the genes within the genome shows location of the gene, reading direction and metabolic
4 function. This information is made available by applications such as Vector NTI and Artemis. The function of the gene is interpreted through its product, normally the protein in a metabolic pathway, which description is available from several databases such as KEGG or ERGO [24]. Usually, the metabolic pathways are represented as a flowchart of several levels of abstraction, which are constructed manually. Recent advances on metabolic network visualization include virtual reality use [26], mathematical model development [27], and a modeling language [19]. These techniques synthesize the discovery of genes and are available in databases such as Expasy, but they should be transparent to the researcher by their seamless integration into the research application environment. The software engineering challenges and opportunities in integrating and visualizing the data are well documented [16, 28]. There are some applications servers that use a SOAP interface to answer queries. Linking such tools into unified working environment is non-trivial and has not been done to date [2]. One recent approach towards human centered integration is BioUse, a web portal developed by the Human Centered Software Engineering Group at Concordia University. BioUse provided an adaptable interface to NCBI, BLAST and ClustalW, which attempted to shield the novice user from unnecessary complexity. As users became increasingly familiar with the application the portal added shortcuts and personalization features for different users (e.g. pharmacologists, microbiologists) [16]. BioUse was limited in scope as a prototype and its website is no longer available. Current research is continued in the CO-DRIVE project but it is not completed yet [10]. 4 Integration Framework and Implementation This project has focused on the development of a framework using open standards towards integrating heterogeneous databases, web services and applications/tools. This framework has been applied in GeXpert, an application for bioinformatics metabolic pathway reconstruction research in bacteria. In addition this application includes the utilization of fuzzy logic for helping in the selection of the best candidate genes or sequences for specified metabolic pathways. Previously, this categorization was typically done in an ad hoc manner by manually combining various criteria such as e-value, identities, gaps, positives and score. Fuzzy logic enables an efficiency enhancement by providing an automated sifting mechanism for a very manually intensive bioinformatics procedure currently being performed by researchers. GeXpert is used to find, build and edit metabolic pathways (central or peripheral), perform protein searches in NCBI, perform nucleotide comparisons of organisms versus the sequenced one (using tblastn). The application can also perform a search of 3D models associated with a protein or enzyme (using the Cn3D viewer), generation of ORF diagrams for the sequenced genome, generation of reports relating to the advance of the project and aid in the selection of BLAST results using T-S-K (Takagi Sugeno Kang) fuzzy logic.
5 The architecture is implemented in three layers: presentation, logic and data [18]. As shown in Figure 1, the presentation layer provides for a web based as well as a desktop based interface to perform the tasks previously mentioned. Fig. 1. High Level Framework Architecture The logical layer consists of a core engine and a web server. The core engine performs bioinformatic processing functions and uses different tools as required: BLAST for protein and sequence alignment, ARTEMIS for analysis of nucleotide sequences, Cn3D for 3D visualization, and a T-S-K fuzzy logic library for candidate sequence selection. The Web Server component provides an HTTP interface into the GeXpert core engine. As seen in Figure 1, the data layer includes a communications manager and a data manager. The communication manager is charged with obtaining data (protein and nucleotide sequences, metabolic pathways and 3D models) from various databases and application sources using various protocols (application calls, TCP/IP, SOAP). The data manager implements data persistence as well as temporary cache management for all research process related objects. 4.1 Application Implementation: GeXpert GeXpert [15] is an open source implementation of the framework previously described. It relies on Eclipse [12] builder for multiplatform support. The GeXpert desktop application as implemented consists of the following:
6 Metabolic pathway editor: tasked with editing or creating metabolic pathways, shows them as directed graphs of enzymes and compounds. Protein search viewer: is charged with showing the results of protein searches with user specified parameters. Nucleotide search viewer: shows nucleotide search results given user specified protein/parameters. ORF viewer: is used to visualize surrounding nucleotide ORFs with a map of colored arrows. The colors indicate the ORF status (found, indeterminate, erroneous or ignored). The GeXpert core component implements the application logic. It consists of the following elements: Protein search component: manages the protein search requests and its results. Nucleotide search component: manages the nucleotide search requests and its results. In addition, calls the Fuzzy component with the search parameters specified in order to determine the best candidates for inclusion into the metabolic pathway. Fuzzy component: attempts to determine the quality of nucleotide search results using fuzzy criteria [1].The following normalized (0 to 1 values) criteria are used: e-value, identities, gaps. Each has five membership functions (very low, low, medium, high, very high), the number of rules used is 243 (3 5 ). ORF component: identifies the ORFs present in requested genome region. Genome component: manages the requests for genome regions and its results. GeXpert communications manager receives requests from the GeXpert core module to obtain and translates data from external sources. This module consists of the following subcomponents: BLAST component: calls the BLAST application indicating the protein to be analyzed and the organism data base to be used. BLAST parser component: translates BLAST results from XML formats into objects. Cn3D component: sends three-dimensional (3D) protein models to the Cn3D application for display. KEGG component: obtains and translates metabolic pathways received from KEGG. NCBI component: performs 3D protein model and document searches from NCBI. EBI component: performs searches on proteins and documentation from the EBI (European Bioinformatic Institute) databases. GeXpert data manager is tasked with load administrating and storage of application data: Cache component: in charge of keeping temporary search results to improve throughput. Also implements aging of cache data.
7 Application data component: performs data persistence of metabolic paths, 3D protein models, protein searches, nucleotide searches, user configuration data, project configuration for future usage. Fig. 2. Metabolic Pathway Viewer Fig. 3. Protein Search Viewer
8 Fig. 4. Nucleotide Search Viewer 4.2 Application Implementation: GeXpert User Interface and Workflow GeXpert is to be used in metabolic pathway research work using a research workflow similar to identics [30]. The workflow and some sample screenshots are given: 1. User must provide the sequenced genome of the organism to be studied. 2. The metabolic pathway of interest must be created (can be based on the pathway of a similar organism). In the example in Figure 2, the glycolisis/gluconeogenesis metabolic pathway was imported from KEGG. 3. For each metabolic pathway a key enzyme must be chosen in order to start a detailed search. Each enzyme could be composed of one or more proteins (subunits). Figure 3 shows the result of the search for the protein homoprotocatechuate 2,3-dioxygenase. 4. Perform a search for amino acid sequences (proteins) in other organisms. Translate these sequences into nucleotides (using tblastn) and perform an alignment in the genome of the organism under study. If this search does not give positives results it return to the previous step and search another protein sequence. In Figure 4, we show that the protein homoprotocatechuate 2,3-dioxygenase has been found in the organism under study (Burkholderia xenovorans LB400) in the chromosome 2 (contig 481) starting in position and ending in region Also the system shows the nucleotide sequence for this protein. 5. For a DNA sequence found, visualize the ORF map (using the ORF Viewer) and identify if there is an ORF that contains a large part of said sequence. If this is not the case go back to the previous step and chose another sequence. 6. Verify if the DNA sequence for the ORF found corresponds to the chosen sequence or a similar one (using blastx). If not choose another sequence. 7. Establish as found the ORF of the enzyme subunit and start finding in the surrounding ORFs sequences capable of coding the other subunits of the enzyme or other enzymes of the metabolic pathway.
9 8. For the genes that were not found in the surrounding ORFs repeat the entire process. 9. For the enzyme, the 3D protein model can be obtained and viewed if it exists (using CN3D as integrated into the GeXpert interface). 10. The process is concluded by the generation of reports with the genes found and with associated related documentation that supports the information about the proteins utilized in the metabolic pathway. 5 Conclusions The integrated framework approach presented in this paper is an attempt to enhance the efficiency and capability of bioinformatics researchers. GeXpert is a demonstration that the integration framework can be used to implement useful bioinformatics applications. GeXpert has so far shown to be a useful tool for our researchers; it is currently being enhanced for functionality and usability improvements. The current objective of the research group is to use GeXpert in order to discover new metabolic pathways for bacteria [29]. In addition to our short term goals, the following development items are planned for the future: using fuzzy logic in batch mode, web services and a web client, blastx for improved verification of the ORFs determined, multi-user mode to enable multiple users and groups with potentially different roles to share in a common research effort, peer to peer communication to enable the interchange of documents (archives, search results, research items) thus enabling a network of collaboration in different or shared projects, and the use of intelligent/evolutionary algorithms to enable learning based on researcher feedback into GeXpert. Acknowledgements This research was partially funded by Fundación Andes. References 1. Arredondo, T., Neelakanta, P.S., DeGroff, D.: Fuzzy Attributes of a DNA Complex: Development of a Fuzzy Inference Engine for Codon- Junk Codon Delineation. Artif. Intell. Med (2005) Barker, J., Thornton, J.: Software Engineering Challenges in bioinformatics. Proceedings of the 26th International Conference on Software Engineering, IEEE (2004) 3. Bernardi, M., Lapi, M., Leo, P., Loglisci, C.: Mining Generalized Association Rules on Biomedical Literature. In: Moonis, A. Esposito, F. (eds): Innovations in Applied Artificial Intelligence. Lect. Notes Artif. Int (2005) Brown, T.A.: Genomes. John Wiley and Sons, NY (1999) 5. Cary M.P., Bader G.D., Sander C.: Pathway information for systems biology (Review Article). FEBS Lett. 579 (2005) , 6. Claverlie, J.M.: Bioinformatics for Dummies. Wiley Publishing (2003)
10 7. Cámara, B., Herrera, C., González, M., Couve, E., Hofer, B., Seeger, M.: From PCBs to highly toxic metabolites by the biphenyl pathway. Environ. Microbiol. (6) (2004) Cohen, J.: Computer Science and Bioinformatics. Commun. ACM 48 (3) (2005) Costa, M., Collins, R., Anterola, A., Cochrane, F., Davin, L., Lewis, N.: An in silico assessment of gene function and organization of the phenylpropanoid pathway metabolic networks in Arabidopsis thaliana and limitations thereof. Phytochem. 64 (2003) CO-Drive project: Durbin, R.: Biological Sequence Analysis, Cambridge, UK (2001) 12. Eclipse project: Entrez NCBI Database: Gardner, D.,: Using genomics to help predict drug interactions. J Biomed. Inform. 37 (2004) GeXpert sourceforge page: Javahery, H., Seffah, A., Radhakrishnan, T.: Beyond Power: Making Bioinformatics Tools User-Centered. Commun. ACM (2004) 17. Jimenez, J. I., Miambres, B., Garca, J., Daz, E.: Genomic insights in the metabolism of aromatics compounds in Pseudomonas. In: Ramos, J. L. (ed): Pseudomonas, vol. 3. NY: Kluwer Academic Publishers, (2004) Larman, C.: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. Prentice Hall PTR (2004) 19. Loew, L. W., Schaff, J. C.: The Virtual Cell: a software environment for computational cell biology. Trends Biotechnol (2001) 20. Ma H., Zeng A., Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19 (2003) Magalhaes, J., Toussaint, O.: How bioinformatics can help reverse engineer human aging. Aging Res. Rev. 3 (2004) Molidor, R., Sturn, A., Maurer, M., Trajanosk, Z.: New trends in bioinformatics: from genome sequence to personalized medicine. Exp. Gerontol. 38 (2003) Neelakanta, P.S., Arredondo, T., Pandya, S., DeGroff, D.: Heuristics of AI- Based Search Engines for Massive Bioinformatic Data-Mining: An Example of Codon/Noncodon Delineation Search in a Binary DNA Sequence, Proceeding of IICAI (2003) 24. Papin, J.A., Price, N.D., Wiback, S.J., Fell, D.A., Palsson, B.O.: Metabolic Pathways in the Post-genome Era. Trends Biochem. Sci (2003) 25. Pocock, M., Down, T., Hubbard, T.: BioJava: Open Source Components for Bioinformatics. ACM SIGBIO Newsletter 20 2 (2000) Rojdestvenski, I.: VRML metabolic network visualizer. Comp. Bio. Med. 33 (2003) 27. SBML: Systems Biology Markup Language Segal, T., Barnard, R.: Let the shoemaker make the shoes - An abstraction layer is needed between bioinformatics analysis, tools, data, and equipment: An agenda for the next 5 years. First Asia-Pacific Bioinformatics Conference, Australia (2003) 29. Seeger, M., Timmis, K. N., Hofer, B.: Bacterial pathways for degradation of polychlorinated biphenyls. Mar. Chem. 58 (1997) Sun, J., Zeng, A.: IdentiCS - Identification of coding sequence and in silico reconstruction of the metabolic network directly from unannotated low-coverage bacterial genome sequence. BMC Bioinformatics 5:112 (2004)
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16
Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])
820 REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf]) (See also General Regulations) BMS1 Admission to the Degree To be eligible for admission to the degree of Bachelor
Committee on WIPO Standards (CWS)
E CWS/1/5 ORIGINAL: ENGLISH DATE: OCTOBER 13, 2010 Committee on WIPO Standards (CWS) First Session Geneva, October 25 to 29, 2010 PROPOSAL FOR THE PREPARATION OF A NEW WIPO STANDARD ON THE PRESENTATION
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
Biological Databases and Protein Sequence Analysis
Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to
Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London
Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London 1. What is Discovery Net 2. Distributed Data Mining for Compute Intensive Tasks 3. Distributed
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
Teaching Bioinformatics to Undergraduates
Teaching Bioinformatics to Undergraduates http://www.med.nyu.edu/rcr/asm Stuart M. Brown Research Computing, NYU School of Medicine I. What is Bioinformatics? II. Challenges of teaching bioinformatics
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Integrating Bioinformatics, Medical Sciences and Drug Discovery
Integrating Bioinformatics, Medical Sciences and Drug Discovery M. Madan Babu Centre for Biotechnology, Anna University, Chennai - 600025 phone: 44-4332179 :: email: [email protected] Bioinformatics
UGENE Quick Start Guide
Quick Start Guide This document contains a quick introduction to UGENE. For more detailed information, you can find the UGENE User Manual and other special manuals in project website: http://ugene.unipro.ru.
A Primer of Genome Science THIRD
A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:
Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM [email protected]
Lecture 11 Data storage and LIMS solutions Stéphane LE CROM [email protected] Various steps of a DNA microarray experiment Experimental steps Data analysis Experimental design set up Chips on catalog
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003
Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:
BIOINFORMATICS TUTORIAL
Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.
Dr Alexander Henzing
Horizon 2020 Health, Demographic Change & Wellbeing EU funding, research and collaboration opportunities for 2016/17 Innovate UK funding opportunities in omics, bridging health and life sciences Dr Alexander
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Global and Discovery Proteomics Lecture Agenda
Global and Discovery Proteomics Christine A. Jelinek, Ph.D. Johns Hopkins University School of Medicine Department of Pharmacology and Molecular Sciences Middle Atlantic Mass Spectrometry Laboratory Global
Business Intelligence and Decision Support Systems
Chapter 12 Business Intelligence and Decision Support Systems Information Technology For Management 7 th Edition Turban & Volonino Based on lecture slides by L. Beaubien, Providence College John Wiley
Application of Graph-based Data Mining to Metabolic Pathways
Application of Graph-based Data Mining to Metabolic Pathways Chang Hun You, Lawrence B. Holder, Diane J. Cook School of Electrical Engineering and Computer Science Washington State University Pullman,
Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices
overview Pipeline Pilot Enterprise Server Pipeline Pilot Enterprise Server (PPES) is a powerful client-server platform that streamlines the integration and analysis of the vast quantities of data flooding
Introduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
Processing Genome Data using Scalable Database Technology. My Background
Johann Christoph Freytag, Ph.D. [email protected] http://www.dbis.informatik.hu-berlin.de Stanford University, February 2004 PhD @ Harvard Univ. Visiting Scientist, Microsoft Res. (2002)
Introduction to Genome Annotation
Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
A Web Based Software for Synonymous Codon Usage Indices
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 3 (2013), pp. 147-152 International Research Publications House http://www. irphouse.com /ijict.htm A Web
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Module 10: Bioinformatics
Module 10: Bioinformatics 1.) Goal: To understand the general approaches for basic in silico (computer) analysis of DNA- and protein sequences. We are going to discuss sequence formatting required prior
Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives
Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives [email protected] 2015-05-21 Functional Bioinformatics, Örebro University Vad är bioinformatik och varför
Biological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
Version 5.0 Release Notes
Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com
Web-Based Genomic Information Integration with Gene Ontology
Web-Based Genomic Information Integration with Gene Ontology Kai Xu 1 IMAGEN group, National ICT Australia, Sydney, Australia, [email protected] Abstract. Despite the dramatic growth of online genomic
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks
Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks Semester II Paper II: Mathematics I 85 marks B.Sc. II Year
org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
org.rn.eg.db December 16, 2015 org.rn.egaccnum Map Entrez Gene identifiers to GenBank Accession Numbers org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank
Visualizing Networks: Cytoscape. Prat Thiru
Visualizing Networks: Cytoscape Prat Thiru Outline Introduction to Networks Network Basics Visualization Inferences Cytoscape Demo 2 Why (Biological) Networks? 3 Networks: An Integrative Approach Zvelebil,
Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova
Using the Grid for the interactive workflow management in biomedicine Andrea Schenone BIOLAB DIST University of Genova overview background requirements solution case study results background A multilevel
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
A leader in the development and application of information technology to prevent and treat disease.
A leader in the development and application of information technology to prevent and treat disease. About MOLECULAR HEALTH Molecular Health was founded in 2004 with the vision of changing healthcare. Today
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
Bioinformatics: course introduction
Bioinformatics: course introduction Filip Železný Czech Technical University in Prague Faculty of Electrical Engineering Department of Cybernetics Intelligent Data Analysis lab http://ida.felk.cvut.cz
Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome
Module 2 Genome Viewing Using Genome Browsers to View Annotation of the Human Genome Bert Overduin, Ph.D. PANDA Coordination & Outreach EMBL - European Bioinformatics Institute Wellcome Trust Genome Campus
Genome Explorer For Comparative Genome Analysis
Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence
EMBL-EBI Web Services
EMBL-EBI Web Services Rodrigo Lopez Head of the External Services Team SME Workshop Piemonte 2011 EBI is an Outstation of the European Molecular Biology Laboratory. Summary Introduction The JDispatcher
Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1
Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]
An agent-based layered middleware as tool integration
An agent-based layered middleware as tool integration Flavio Corradini Leonardo Mariani Emanuela Merelli University of L Aquila University of Milano University of Camerino ITALY ITALY ITALY Helsinki FSE/ESEC
University of Glasgow - Programme Structure Summary C1G5-5100 MSc Bioinformatics, Polyomics and Systems Biology
University of Glasgow - Programme Structure Summary C1G5-5100 MSc Bioinformatics, Polyomics and Systems Biology Programme Structure - the MSc outcome will require 180 credits total (full-time only) - 60
200630 - FBIO - Fundations of Bioinformatics
Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 2015 200 - FME - School of Mathematics and Statistics 1004 - UB - (ENG)Universitat de Barcelona MASTER'S DEGREE IN STATISTICS AND
Molecular Databases and Tools
NWeHealth, The University of Manchester Molecular Databases and Tools Afternoon Session: NCBI/EBI resources, pairwise alignment, BLAST, multiple sequence alignment and primer finding. Dr. Georgina Moulton
Rotorcraft Health Management System (RHMS)
AIAC-11 Eleventh Australian International Aerospace Congress Rotorcraft Health Management System (RHMS) Robab Safa-Bakhsh 1, Dmitry Cherkassky 2 1 The Boeing Company, Phantom Works Philadelphia Center
Master of Science in Software Engineering (MSC)
Master of Science in Software Engineering The MSc in Software Engineering provides a thorough grounding in how to apply rigorous engineering principles to deliver elegant, effective software solutions
Integration of Genetic and Familial Data into. Electronic Medical Records and Healthcare Processes
Integration of Genetic and Familial Data into Electronic Medical Records and Healthcare Processes By Thomas Kmiecik and Dale Sanders February 2, 2009 Introduction Although our health is certainly impacted
DOE Office of Biological & Environmental Research: Biofuels Strategic Plan
DOE Office of Biological & Environmental Research: Biofuels Strategic Plan I. Current Situation The vast majority of liquid transportation fuel used in the United States is derived from fossil fuels. In
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
Guide for Bioinformatics Project Module 3
Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Linear Sequence Analysis. 3-D Structure Analysis
Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical properties Molecular weight (MW), isoelectric point (pi), amino acid content, hydropathy (hydrophilic
A Framework of Model-Driven Web Application Testing
A Framework of Model-Driven Web Application Testing Nuo Li, Qin-qin Ma, Ji Wu, Mao-zhong Jin, Chao Liu Software Engineering Institute, School of Computer Science and Engineering, Beihang University, China
BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS
BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title: Bioinformatics
Vector NTI Advance 11 Quick Start Guide
Vector NTI Advance 11 Quick Start Guide Catalog no. 12605050, 12605099, 12605103 Version 11.0 December 15, 2008 12605022 Published by: Invitrogen Corporation 5791 Van Allen Way Carlsbad, CA 92008 U.S.A.
Bioinformatics Tools Tutorial Project Gene ID: KRas
Bioinformatics Tools Tutorial Project Gene ID: KRas Bednarski 2011 Original project funded by HHMI Bioinformatics Projects Introduction and Tutorial Purpose of this tutorial Illustrate the link between
Using Provenance to Improve Workflow Design
Using Provenance to Improve Workflow Design Frederico T. de Oliveira, Leonardo Murta, Claudia Werner, Marta Mattoso COPPE/ Computer Science Department Federal University of Rio de Janeiro (UFRJ) {ftoliveira,
An Automated Workflow System Geared Towards Consumer Goods and Services Companies
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 An Automated Workflow System Geared Towards Consumer Goods and Services
How To Change Medicine
P4 Medicine: Personalized, Predictive, Preventive, Participatory A Change of View that Changes Everything Leroy E. Hood Institute for Systems Biology David J. Galas Battelle Memorial Institute Version
Abdullah Mohammed Abdullah Khamis
Abdullah Mohammed Abdullah Khamis Jeddah, Saudi Arabia Email: [email protected] Mobile: +966 567243182 Tel: +966 2 6340699 (Yemeni) Research and Professional Objective To Complete my Ph.D. in Pattern
Gene Models & Bed format: What they represent.
GeneModels&Bedformat:Whattheyrepresent. Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically,
Bioinformatics, Sequences and Genomes
Bioinformatics, Sequences and Genomes BL4273 Bioinformatics for Biologists Week 1 Daniel Barker, School of Biology, University of St Andrews Email [email protected] BL4273 and 4273π 4273π is a custom
Course Syllabus For Operations Management. Management Information Systems
For Operations Management and Management Information Systems Department School Year First Year First Year First Year Second year Second year Second year Third year Third year Third year Third year Third
Masters in Information Technology
Computer - Information Technology MSc & MPhil - 2015/6 - July 2015 Masters in Information Technology Programme Requirements Taught Element, and PG Diploma in Information Technology: 120 credits: IS5101
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences WP11 Data Storage and Analysis Task 11.1 Coordination Deliverable 11.2 Community Needs of
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
MEng, BSc Applied Computer Science
School of Computing FACULTY OF ENGINEERING MEng, BSc Applied Computer Science Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give a machine instructions
Acceleration for Personalized Medicine Big Data Applications
Acceleration for Personalized Medicine Big Data Applications Zaid Al-Ars Computer Engineering (CE) Lab Delft Data Science Delft University of Technology 1" Introduction Definition & relevance Personalized
