Processing Genome Data using Scalable Database Technology

1 Johann Christoph Freytag, Ph.D.
Stanford University, February 2004

My Background
- Harvard Univ.
- Professor of
- ECRC (European Computer-Industry Research Centre), München (87-89)
- DEC's Database Technology Center, München (90-93)
- Starburst project, IBM Almaden Research Center (85-87)
- Visiting Scientist, Almaden Research Center (97/98)
- Visiting Scientist, IBM SVL (2001)
- Visiting Scientist, Microsoft Res. (2002)

For non-commercial use only.

2 What's the Meaning of Life?
- DNA: genomic transmitter
- RNA: messenger
- Protein: gene product
- Processes in question: Transcription? Translation? Replication!?

Overview
- (Biological) Motivation/Problems
- Using Database Technology
- Gene-EYe Integration-Platform
- Data Cleansing
- BLAST-Integration into GDB
- In-and-Out-the-Database: Using Workflow for dry Experiments
- Summary

3 Biological Motivation: View of Biological Areas
- Environment, Diseases, Experiments
- Pathways, Life, Evolution
- DNA: Genome
- RNA: Transcriptome
- Amino Acids: Proteome

Biological Motivation: View of Data (data sources per area)
- Environment, Diseases, Experiments: OMIM, Express, Brenda
- Pathways, Life, Evolution: Gene Ontology, Taxonomy, KEGG
- DNA, Genome: EMBL, RefSeq, LocusLink
- RNA, Transcriptome: EMBL (EST)
- Amino Acids, Proteome: SWISS-PROT, InterPro

4 Complex Relationships
- A graph depicting the relationships between 400+ biological data sources served by the EBI via SRS
- Database growth of EMBL (number of records)
- More than 400 data sources on the Web

DBIS (our) Approach
- Database model of the biological world
- Data sources: SwissProt, EMBL, ENSEMBL, ..., KABAT

5 Overview
- (Biological) Motivation/Problems
- Using Scalable Database Technology
- Gene-EYe Integration-Platform
- Data Cleansing
- BLAST-Integration into GDB
- In-and-Out-the-Database: Using Workflow Concepts for dry Experiments
- Summary

Gene-EYe Integration-Platform: Vision
- Provide mechanisms for
  - unified handling of different data sources
  - data source integration
  - change management
  - user-defined data preparation
- Provide relevant tools for
  - sequence manipulation and retrieval
  - workflow support for operation and administration

6 Gene-EYe Integration-Platform: The Big Picture
- Genome Data Warehouse Layer (GDW Schema): KNOWLEDGE. Biological Entities -> Biological Concepts (e.g. Life Cycle)
- Genome DataBase Layer (GDB Schema): CONTENT. Relational Entities -> Biological Entities (e.g. Gene)
- Genome Data Store Layer (GDS Schema): DATA. Flat File Data -> Relational Entities (e.g. EMBL)

Design GDS: From Flat File to Database
- Genome Data Store Layer (GDS Schema): Data Storage, Data Cleansing, Update/Admin
- GDS Load Tools, GDS Admin Tools
- One DDL and one scanner per source: ENSEMBL, InterPro, TAXO, SWALL, EMBL

7 The Data Import Pipeline - Revisited
- Flow: Data File -> Scanner -> Load Files (and CLOB Files) -> Loader -> Gene-EYe GDS
- Further components: Summary, DDL-Gen., DDL Script, Controller
- Specifications: Instance Spec., Load Spec., Format Spec., Content Spec.
- Phase 1: property files, Perl scripts, hand crafted
- Phase 2: GEM (1) Repository, de.hui.dbis.geneeye.* (Java), autogenerated from metadata
- (1) GEM: CWM compliant GeneEYe Metadata Repository

Modeling the Maintenance Process
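The scanner step of this pipeline (flat file in, load rows out) can be sketched as follows. This is a minimal illustration, not the Gene-EYe scanner itself: it assumes the EMBL flat-file conventions of two-letter line codes (ID, AC, DE, SQ) with `//` ending an entry, and the function name `scan_embl`, the output column names, and the sample entry are all hypothetical.

```python
def scan_embl(lines):
    """Yield one dict per EMBL-style entry (hypothetical column names)."""
    fields = {"ID": [], "AC": [], "DE": []}
    seq_parts = []
    in_sequence = False
    for raw in lines:
        line = raw.rstrip("\n")
        code = line[:2]
        if code == "//":
            # End of entry: emit a row if an ID line was seen, then reset.
            if fields["ID"]:
                yield {
                    "entry_id": fields["ID"][0].split(";")[0].split()[0],
                    "accession": fields["AC"][0].split(";")[0].strip()
                    if fields["AC"] else "",
                    "description": " ".join(fields["DE"]),
                    "sequence": "".join(seq_parts),
                }
            fields = {"ID": [], "AC": [], "DE": []}
            seq_parts = []
            in_sequence = False
        elif code == "SQ":
            # Sequence header; the following lines carry the residues.
            in_sequence = True
        elif in_sequence:
            # Keep letters only: drops blanks and the trailing position count.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif code in fields:
            # Line data starts after the code and its padding.
            fields[code].append(line[5:].strip())


# Hypothetical sample entry, for illustration only.
SAMPLE = """ID   X00001; SV 1; linear; mRNA; STD; PLN; 20 BP.
AC   X00001;
DE   Hypothetical example entry
DE   spanning two lines
SQ   Sequence 20 BP;
     acgtacgtac gtacgtacgt        20
//
"""

rows = list(scan_embl(SAMPLE.splitlines()))
```

Each yielded dict corresponds to one row of a load file; a loader would then bulk-insert these rows into the GDS table generated from the source's DDL.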

8 GDB-Layer: From Data to Biology
- Genome Database Layer (GDB Schema): Data Integration, Data Cleansing (Sem.), Queries
- GDB schema entities: Gene, Protein, Transcript, Tissue, Variant
- GDB Builder (IBM Clio?) [Data]
- GDB Mapper (IBM Clio) [Definition]: defined by and in cooperation with domain experts
- GDS sources: EMBL, SWALL, TAXO, InterPro, ENSEMBL
- Genome Data Store Layer (GDS Schema): Data Storage, Data Cleansing (Syn.), Update/Admin

Schema Mapping with Clio (with permission of Dr. Felix Naumann, IBM Almaden Research Center)
- Input: source schema, target schema, user mapping
- Output: SQL or XQuery that maps the source DB to the target DB

9 Clio Features (with permission of Dr. Felix Naumann, IBM Almaden Research Center)
- Schema Viewer: visual mapping between schema elements
- Attribute Matcher: intelligent suggestions of likely mappings
- Data Viewer: data examples for mapping queries
- Queries: SQL, XSLT, XQuery; use and adhere to source and target schema constraints

GDW: Providing Facts for Research
- Genome Data Warehouse Layer (GDW Schema): Data Mining, Ontology Mapping, Process Simulation
- Ontology
- GDW Miner, GDB Explorer
- Entities: Variant, Tissue, Transcript, Protein, Gene
- Genome Database Layer (GDB Schema): Data Integration, Data Cleansing (Sem.), Queries

10 Overview
- (Biological) Motivation/Problems
- Using Scalable Database Technology
- Gene-EYe Integration-Platform
- Data Cleansing
- BLAST-Integration into GDB
- In-and-Out-the-Database: Using Workflow Concepts for dry Experiments
- Summary

Errors in Genome Data [Müller, Naumann, Freytag, ICIQ, 2003]
- Classes of errors in genome data production:
  - Experimental errors
  - Analysis errors
  - Transformation errors
  - Propagated errors
  - Stale data
- Production steps shown in the figure: Genome; DNA Sequence Determination; DNA Feature / Gene Annotation; mRNA; Protein Sequence Determination; Protein Function Annotation (SUBUNIT, DISEASE, FUNCTION); Structure Annotation
- Error-rate labels in the figure: 0.23%, 2.58%, 5% - 30%, 5% - 40%
- Figure note: only 1.3% difference; the main differences are transcript copy numbers in the brain
- Example DNA, RNA sequence: agagattagcgcgctagatcgatatgataga gctatatcatccgagatagcagatagctcta gcacactattacacgagcagcgaccttatat
- Example protein sequence: MDDREDLVYQAKLAEQAERYDEMVESMKKVD AGMDVELTVEERNLLSVAYKNVIGARRASWY RIISSIEQKEENKGGEDKLKMIREYRQMVER
- Example function annotation: MAY ACT AS INTRACELLULAR SIGNALING COMPONENT...; BINDS DIRECTLY TO ZO-1; INVOLVED IN ACUTE LEUKEMIAS

11 Reliability-based Merging (cont.)
- Domain expert identifies reliable parts for merging
- Definition of a set of views for integration
- Current work:
  - Which are the relevant mismatch patterns?
  - How to assess their relevance & importance?
- Figure: two relations r1 and r2 with schema (ID, A1, A2, A3) are merged into one result relation, e.g. with MIN(); the cell values are only partly preserved in the transcription

Overview
- (Biological) Motivation/Problems
- Using Scalable Database Technology
- Gene-EYe Integration-Platform
- BLAST-Integration into GDB
- In-and-Out-the-Database: Using Workflow Concepts for dry Experiments
- Summary
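The merging idea on this slide can be illustrated with a small sketch. It assumes two relations keyed by ID with the same attributes, a per-attribute choice of the reliable source made by a domain expert, and a fallback resolution function such as MIN() for attributes with no declared preference. The function name, the `reliable_source` mapping, and the data values are hypothetical and only loosely echo the slide's figure.

```python
def merge_relations(r1, r2, reliable_source, resolve=min):
    """Merge two relations (dict: id -> dict: attribute -> value).

    reliable_source maps an attribute to "r1" or "r2" where an expert
    has declared one source reliable; other conflicts use `resolve`.
    """
    merged = {}
    for key in sorted(set(r1) | set(r2)):
        row1, row2 = r1.get(key), r2.get(key)
        if row1 is None or row2 is None:
            # Tuple present in only one source: take it unchanged.
            merged[key] = dict(row1 if row1 is not None else row2)
            continue
        out = {}
        for attr in row1:
            v1, v2 = row1[attr], row2[attr]
            if v1 == v2:
                out[attr] = v1              # sources agree
            elif reliable_source.get(attr) == "r1":
                out[attr] = v1              # expert trusts r1 here
            elif reliable_source.get(attr) == "r2":
                out[attr] = v2              # expert trusts r2 here
            else:
                out[attr] = resolve(v1, v2)  # e.g. MIN()
        merged[key] = out
    return merged


# Illustrative data, not taken from the slide.
r1 = {1: {"A1": "A", "A2": "A", "A3": "B"}}
r2 = {1: {"A1": "B", "A2": "B", "A3": "B"},
      2: {"A1": "C", "A2": "D", "A3": "D"}}

result = merge_relations(r1, r2, reliable_source={"A1": "r1"})
```

Here A1 is taken from r1 because it is declared reliable, A2 falls back to MIN(), and A3 needs no resolution because both sources agree.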

12 BLAST: General Introduction
- Algorithm/package for similarity search, developed by Altschul et al. (1990)
- Preprocessing: FormatDB turns the data collection file (DC-File) into indexed sequences
- Three steps:
  1. Search for word pairs (ISeq, DSeq) of length L on the data collection of sequences with a score above threshold T
  2. Expansion of each word pair until the value V of their alignment moves away from the local maximum
  3. Output of the complete alignment (High-scoring Segment Pair, HSP) if Value(Alignment) > S
- BLAST call: query sequence and index sequences in, BLAST report (features) out
- Output: powerset of alignments

BLAST UDF Implementation
- Goal: using BLAST in SQL statements
- How? BLAST UDF implemented as a table function
- Use in an SQL query:
  SELECT * FROM TABLE( BLAST(<Parameter>, <Query Sequence>, <Comparison Sequence> ))
- Each call returns a set of alignments over sequences in the database
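The three BLAST steps listed above can be sketched in a few lines. This is a deliberately simplified illustration and not the BLAST implementation: seeds are exact word matches (real BLAST also admits neighbouring words scoring above T), scoring is a plain match/mismatch scheme, extension is ungapped, and all function names and default parameters are assumptions made for this sketch.

```python
def find_seeds(query, subject, word_len):
    """Step 1 (simplified): positions of exact shared words of length word_len."""
    index = {}
    for j in range(len(subject) - word_len + 1):
        index.setdefault(subject[j:j + word_len], []).append(j)
    return [(i, j)
            for i in range(len(query) - word_len + 1)
            for j in index.get(query[i:i + word_len], [])]


def _extend_one_way(query, subject, qi, sj, step, start, match, mismatch, drop):
    """Walk in one direction; stop once the score falls `drop` below the best."""
    score = best = start
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, i, j, word_len, match=1, mismatch=-2, drop=3):
    """Step 2: ungapped extension of one seed to the right, then to the left."""
    score = word_len * match
    score, right = _extend_one_way(query, subject, i + word_len, j + word_len,
                                   +1, score, match, mismatch, drop)
    score, left = _extend_one_way(query, subject, i - 1, j - 1,
                                  -1, score, match, mismatch, drop)
    # (score, query start, query end (exclusive), subject start)
    return (score, i - left, i + word_len + right, j - left)


def find_hsps(query, subject, word_len=4, min_score=6):
    """Step 3: report segment pairs whose score exceeds min_score (S)."""
    hsps = {extend_seed(query, subject, i, j, word_len)
            for i, j in find_seeds(query, subject, word_len)}
    return {h for h in hsps if h[0] > min_score}


hits = find_hsps("ACGTACGT", "TTACGTACGTTT", word_len=4, min_score=7)
```

Several seeds extend to the same segment pair, so the result is collected in a set; in this example all high-scoring seeds collapse into one HSP covering the whole query.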

13 Structure of UDB Table Function
- Implementation: mapping of the program into the calling structure for table functions
- Communication between the different calls via the scratchpad
- Scratchpad: storage area which remains intact and unchanged between UDF calls
- Storage of data structures for the different steps, especially for the output from postprocessing: SeqAlign
- Steps of the BLAST UDF:
  - Initialize
  - For all SEQUENCES: alignment without gaps, postprocessing
  - For all ALIGNMENTS: output of the results
  - Release data structures related to sequences
  - Release global data structures
- Call types: FIRST, OPEN, FETCH, CLOSE, FINAL

Overview
- (Biological) Motivation/Problems
- Using Scalable Database Technology
- Gene-EYe Integration-Platform
- BLAST-Integration into GDB
- In-and-Out-the-Database: Using Workflow Concepts for dry Experiments
- Summary
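The calling structure described above can be simulated in plain Python to show how the scratchpad carries state between calls. This is only a sketch of the protocol, not database code: the assignment of steps to call types is an assumption read off the slide's ordering, the "alignment" is a stand-in substring search, and the names `blast_table_function` and `run_query` are hypothetical.

```python
def blast_table_function(call_type, scratchpad, query=None, sequences=None):
    """One entry point, invoked repeatedly with a call type and a scratchpad."""
    if call_type == "FIRST":
        # Initialize global data structures (assumed mapping).
        scratchpad["global"] = {"calls": 0}
    elif call_type == "OPEN":
        # For all sequences: compute alignments and keep them for FETCH.
        hits = []
        for name, seq in sequences:
            pos = seq.find(query)  # stand-in for the real alignment step
            if pos >= 0:
                hits.append((name, pos))
        scratchpad["alignments"] = hits
        scratchpad["cursor"] = 0
    elif call_type == "FETCH":
        # For all alignments: return one result row per call.
        cursor = scratchpad["cursor"]
        if cursor >= len(scratchpad["alignments"]):
            return None  # signals end of table
        scratchpad["cursor"] = cursor + 1
        scratchpad["global"]["calls"] += 1
        return scratchpad["alignments"][cursor]
    elif call_type == "CLOSE":
        # Release data structures related to sequences.
        del scratchpad["alignments"]
        del scratchpad["cursor"]
    elif call_type == "FINAL":
        # Release global data structures.
        scratchpad.clear()
    return None


def run_query(query, sequences):
    """Plays the role of the database engine driving the table function."""
    scratchpad = {}
    rows = []
    blast_table_function("FIRST", scratchpad)
    blast_table_function("OPEN", scratchpad, query, sequences)
    while True:
        row = blast_table_function("FETCH", scratchpad)
        if row is None:
            break
        rows.append(row)
    blast_table_function("CLOSE", scratchpad)
    blast_table_function("FINAL", scratchpad)
    return rows, scratchpad


rows, pad = run_query("ACGT", [("s1", "TTACGT"), ("s2", "GGGG"), ("s3", "ACGTAA")])
```

The point of the design is that the function itself holds no state: everything that must survive from OPEN to the last FETCH lives in the scratchpad, which the engine passes back on every call.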

14 The Challenge: Exon Skipping
- Gene -> Protein
- One gene with 100 exons: ~ variations (count not preserved in the transcription)
- n exons within one gene are linearly combined (splicing)
- The result is used as the pattern for protein generation

Challenge: Exon Skipping
- Do alternative fusion points yield new functional (i.e. biologically meaningful) patterns?

15 Functional Genomics: Gain of New Insight
- First Horizon: Simple Exon Skipping
- New Functionality!

Flow of Processing Steps
- Steps: generate exon sequence; find similarity (similarity search); check for biological validity
- Resources: local database (automatic); remote tool (web based); supported by local DB
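The "generate exon sequence" step for simple exon skipping can be sketched as below. The sketch assumes that a simple skip joins exon i directly to a later exon j, dropping the contiguous run in between while keeping exon order; whether the talk's pipeline used exactly this definition is not stated in the slides. The function name and the toy exon strings are hypothetical.

```python
from itertools import combinations


def simple_skip_forms(exons):
    """Yield ((i, j), sequence) for every form that fuses exon i to exon j.

    At least one exon between i and j is skipped; all other exons are kept
    in their original order.
    """
    for i, j in combinations(range(len(exons)), 2):
        if j - i >= 2:
            yield (i, j), "".join(exons[:i + 1] + exons[j:])


# Toy exons for illustration only.
exons = ["AAA", "CCC", "GGG", "TTT"]
forms = dict(simple_skip_forms(exons))
```

Under this assumption the number of candidate splice forms grows quadratically with the number of exons, (n-1)(n-2)/2 for n exons. Each generated sequence would then be passed to the similarity search and checked for biological validity.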

16 Implementation: Some Facts
- 60 days at 100% load
- One splice form per minute
- So far: ca. [...] splice forms
- First biologically meaningful results

Cooperations
- Univ. of Jena (Rolf Backofen)
- Berlin Center of Bioinformatics (BCB): Charite, FU, Max-Planck-Institut (M. Vingron)
- Industry: IBM, small companies
- Patrick Chappatte, Switzerland

17 Database Environment
- IBM p-server (sponsored by IBM): 8 CPUs, 2.3 TByte
- DB2

Summary
- Lessons learnt: highly dynamic environment
  - Data: changes frequently
  - User: changes frequently
- Provide a framework for
  - Data integration
  - Data processing
  - Data changes
  - Data dependencies
  - Metadata management
- Future work
  - Query processing
  - Include domain knowledge
  - Data cleansing
  - Set of UDFs for biological data processing
  - Visualization of data


More information

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG) roderic.guigo@crg.cat

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG) roderic.guigo@crg.cat Bioinformatique et Séquençage Haut Débit, Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG) roderic.guigo@crg.cat 1 RNA Transcription to RNA and subsequent

More information

Protein Synthesis How Genes Become Constituent Molecules

Protein Synthesis How Genes Become Constituent Molecules Protein Synthesis Protein Synthesis How Genes Become Constituent Molecules Mendel and The Idea of Gene What is a Chromosome? A chromosome is a molecule of DNA 50% 50% 1. True 2. False True False Protein

More information

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production Page 1 of 6 UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production February 05, 2010 Newsletter: BioInform BioInform - February 5, 2010 By Vivien Marx Scientists at the department

More information

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper CAST-2015 provides an opportunity for researchers, academicians, scientists and

More information

Challenges associated with analysis and storage of NGS data

Challenges associated with analysis and storage of NGS data Challenges associated with analysis and storage of NGS data Gabriella Rustici Research and training coordinator Functional Genomics Group gabry@ebi.ac.uk Next-generation sequencing Next-generation sequencing

More information

Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies

Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies Semantic Data Management Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies 1 Enterprise Information Challenge Source: Oracle customer 2 Vision of Semantically Linked Data The Network of Collaborative

More information

UGENE Quick Start Guide

UGENE Quick Start Guide Quick Start Guide This document contains a quick introduction to UGENE. For more detailed information, you can find the UGENE User Manual and other special manuals in project website: http://ugene.unipro.ru.

More information

The Steps. 1. Transcription. 2. Transferal. 3. Translation

The Steps. 1. Transcription. 2. Transferal. 3. Translation Protein Synthesis Protein synthesis is simply the "making of proteins." Although the term itself is easy to understand, the multiple steps that a cell in a plant or animal must go through are not. In order

More information

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features 1 Oracle SQL Developer 3.0: Overview and New Features Sue Harper Senior Principal Product Manager The following is intended to outline our general product direction. It is intended

More information

DataFoundry Data Warehousing and Integration for Scientific Data Management

DataFoundry Data Warehousing and Integration for Scientific Data Management UCRL-ID-127593 DataFoundry Data Warehousing and Integration for Scientific Data Management R. Musick, T. Critchlow, M. Ganesh, K. Fidelis, A. Zemla and T. Slezak U.S. Department of Energy Livermore National

More information

Kam D. Dahlquist Department of Biology. John David N. Dionisio Department of Electrical Engineering & Computer Science

Kam D. Dahlquist Department of Biology. John David N. Dionisio Department of Electrical Engineering & Computer Science http://xmlpipedb.cs.lmu.edu Kam D. Dahlquist Department of Biology John David N. Dionisio Department of Electrical Engineering & Computer Science Loyola Marymount University A Reusable, Open Source Tool

More information

A demonstration of the use of Datagrid testbed and services for the biomedical community

A demonstration of the use of Datagrid testbed and services for the biomedical community A demonstration of the use of Datagrid testbed and services for the biomedical community Biomedical applications work package V. Breton, Y Legré (CNRS/IN2P3) R. Météry (CS) Credits : C. Blanchet, T. Contamine,

More information

MDM and Data Warehousing Complement Each Other

MDM and Data Warehousing Complement Each Other Master Management MDM and Warehousing Complement Each Other Greater business value from both 2011 IBM Corporation Executive Summary Master Management (MDM) and Warehousing (DW) complement each other There

More information

Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness

Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness Melanie Dulong de Rosnay Fellow, Science Commons and Berkman Center for Internet & Society at Harvard University This article

More information

HL7 Clinical Genomics and Structured Documents Work Groups

HL7 Clinical Genomics and Structured Documents Work Groups HL7 Clinical Genomics and Structured Documents Work Groups CDA Implementation Guide: Genetic Testing Report (GTR) Amnon Shabo (Shvo), PhD shabo@il.ibm.com HL7 Clinical Genomics WG Co-chair and Modeling

More information

Basic processing of next-generation sequencing (NGS) data

Basic processing of next-generation sequencing (NGS) data Basic processing of next-generation sequencing (NGS) data Getting from raw sequence data to expression analysis! 1 Reminder: we are measuring expression of protein coding genes by transcript abundance

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

LifeScope Genomic Analysis Software 2.5

LifeScope Genomic Analysis Software 2.5 USER GUIDE LifeScope Genomic Analysis Software 2.5 Graphical User Interface DATA ANALYSIS METHODS AND INTERPRETATION Publication Part Number 4471877 Rev. A Revision Date November 2011 For Research Use

More information

iway Roadmap Michael Corcoran Sr. VP Corporate Marketing

iway Roadmap Michael Corcoran Sr. VP Corporate Marketing 16.06.2015 iway Roadmap Michael Corcoran Sr. VP Corporate Marketing iway 7 Products 1 iway 7 Products iway 7 Products 360 Viewer Remediation Sentinel Portal Golden Record Search and View Omni Patient Data

More information

An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives

An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives Chalapathy Neti, Ph.D. Associate Director, Healthcare Transformation, Shahram Ebadollahi, Ph.D. Research Staff Memeber IBM Research,

More information

DAWIS-M.D.-adata warehouse system for metabolic data

DAWIS-M.D.-adata warehouse system for metabolic data DAWIS-M.D.-adata warehouse system for metabolic data Klaus Hippe, Benjamin Kormeier, Thoralf Töpel, Sebastian Janowski and Ralf Hofestädt Bioinformatics Department Bielefeld University Universitätsstraße

More information

Translation Study Guide

Translation Study Guide Translation Study Guide This study guide is a written version of the material you have seen presented in the replication unit. In translation, the cell uses the genetic information contained in mrna to

More information

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov Search and Data Mining: Techniques Applications Anya Yarygina Boris Novikov Introduction Data mining applications Data mining system products and research prototypes Additional themes on data mining Social

More information

Optimization of ETL Work Flow in Data Warehouse

Optimization of ETL Work Flow in Data Warehouse Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu

More information

Delivering the power of the world s most successful genomics platform

Delivering the power of the world s most successful genomics platform Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE

More information

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 matthewb@ba.ars.usda.gov

More information

An agent-based layered middleware as tool integration

An agent-based layered middleware as tool integration An agent-based layered middleware as tool integration Flavio Corradini Leonardo Mariani Emanuela Merelli University of L Aquila University of Milano University of Camerino ITALY ITALY ITALY Helsinki FSE/ESEC

More information

SQL Server Training Course Content

SQL Server Training Course Content SQL Server Training Course Content SQL Server Training Objectives Installing Microsoft SQL Server Upgrading to SQL Server Management Studio Monitoring the Database Server Database and Index Maintenance

More information

Data Integration and Decision-Making For Biomarkers Discovery, Validation and Evaluation. D. POLVERARI, CTO October 06-07 2008

Data Integration and Decision-Making For Biomarkers Discovery, Validation and Evaluation. D. POLVERARI, CTO October 06-07 2008 Data Integration and Decision-Making For Biomarkers Discovery, Validation and Evaluation D. POLVERARI, CTO October 06-07 2008 Data integration definition and aims Definition : Data integration consists

More information

Biological Databases and Protein Sequence Analysis

Biological Databases and Protein Sequence Analysis Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to

More information

Doctor of Philosophy in Computer Science

Doctor of Philosophy in Computer Science Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects

More information

Human Genome Organization: An Update. Genome Organization: An Update

Human Genome Organization: An Update. Genome Organization: An Update Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion

More information

EFFECTIVE STORAGE OF XBRL DOCUMENTS

EFFECTIVE STORAGE OF XBRL DOCUMENTS EFFECTIVE STORAGE OF XBRL DOCUMENTS An Oracle & UBmatrix Whitepaper June 2007 Page 1 Introduction Today s business world requires the ability to report, validate, and analyze business information efficiently,

More information

Extraction and Visualization of Protein-Protein Interactions from PubMed

Extraction and Visualization of Protein-Protein Interactions from PubMed Extraction and Visualization of Protein-Protein Interactions from PubMed Ulf Leser Knowledge Management in Bioinformatics Humboldt-Universität Berlin Finding Relevant Knowledge Find information about Much

More information