Processing Genome Data using Scalable Database Technology. My Background



Similar documents
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Module 1. Sequence Formats and Retrieval. Charles Steward

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

GenBank, Entrez, & FASTA

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

New solutions for Big Data Analysis and Visualization

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Gene Models & Bed format: What they represent.

Bioinformatics Resources at a Glance

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

Bioinformatics Grid - Enabled Tools For Biologists.

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Frequently Asked Questions Next Generation Sequencing

Linear Sequence Analysis. 3-D Structure Analysis

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo

SAP HANA Enabling Genome Analysis

Web-Based Genomic Information Integration with Gene Ontology

Teaching Bioinformatics to Undergraduates

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London

How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)

DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

Protein Protein Interaction Networks

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

Oracle Warehouse Builder 10g

The Future of the Electronic Health Record. Gerry Higgins, Ph.D., Johns Hopkins

EMBL Identity & Access Management

An Introduction to Genomics and SAS Scientific Discovery Solutions

Introduction to Genome Annotation

CD-HIT User s Guide. Last updated: April 5,

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Genetomic Promototypes

SQL Server Administrator Introduction - 3 Days Objectives

Genomes and SNPs in Malaria and Sickle Cell Anemia

Scientific databases. Biological data management

Activity 7.21 Transcription factors

13.4 Gene Regulation and Expression

Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr

Data Integration and ETL with Oracle Warehouse Builder: Part 1

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

PeptidomicsDB: a new platform for sharing MS/MS data.

Data Management for Biobanks

OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution

To be able to describe polypeptide synthesis including transcription and splicing

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

Protein Synthesis How Genes Become Constituent Molecules

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

Challenges associated with analysis and storage of NGS data

Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies

UGENE Quick Start Guide

The Steps. 1. Transcription. 2. Transferal. 3. Translation

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features

MDM and Data Warehousing Complement Each Other

Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness

HL7 Clinical Genomics and Structured Documents Work Groups

Basic processing of next-generation sequencing (NGS) data

Introduction to Data Mining

LifeScope Genomic Analysis Software 2.5

iway Roadmap Michael Corcoran Sr. VP Corporate Marketing

An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives

Translation Study Guide

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Optimization of ETL Work Flow in Data Warehouse

Delivering the power of the world s most successful genomics platform

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison

An agent-based layered middleware as tool integration

SQL Server Training Course Content

Biological Databases and Protein Sequence Analysis

Doctor of Philosophy in Computer Science

Human Genome Organization: An Update. Genome Organization: An Update

EFFECTIVE STORAGE OF XBRL DOCUMENTS

Extraction and Visualization of Protein-Protein Interactions from PubMed

Transcription:

Johann Christoph Freytag, Ph.D. freytag@dbis.informatik.hu-berlin.de http://www.dbis.informatik.hu-berlin.de Stanford University, February 2004 PhD @ Harvard Univ. Visiting Scientist, Microsoft Res. (2002) My Background Professor of DBIS @HUB ERCR (European Computer Industry Research Centre), München (87-89) DEC s Database Technology Center, München (90-93) Starburst project, IBM Almaden Research Center (85-87) Visiting Scientist, Almaden Research Center (97/98) Visiting Scientist, IBM SVL (2001) nur zum nicht-kommerziellen Gebrauch 1

What s the Meaning of Life DNA RNA Protein Genomic Transmitter Transcription? Messenger Translation? Gene product Replication!? Overview (Biological) Motivation/Problems Using Database Technology Gene-EYe Integration-Platform Data Cleansing BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow for dry Experiments Summary nur zum nicht-kommerziellen Gebrauch 2

View of Biological Areas Environment Diseases Experiments Pathways Life Evolution DNA Genome RNA Transcriptome Amino Acids Proteome Biological Motivation View of Data Data Source Environment Diseases Experiments OMIM Express Brenda Pathways Life Evolution Gene Ontology Taxonomy KEGG EMBL DNA RefSeq Genome LocusLink RNA EMBL Transcriptome (EST) Amino Acids SWISS-PROT Proteome Interpro Biological Motivation nur zum nicht-kommerziellen Gebrauch 3

Complex Relationships A graph depicting the relationships between 400+ biological data sources served by the EBI via SRS Database Growth of EMBL (# of records) More than 400 Data Sources on the WEB Source: http://www3.ebi.ac.uk/services/dbstats/ DBIS (our) Approach SwissProt EMBL Database Model of the Biological World ESEMBLE... KABAT nur zum nicht-kommerziellen Gebrauch 4

(Biological) Motivation/Problems Using Scalable Database Technology Gene-EYe Integration-Platform Data Cleansing BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow Concepts for dry Experiments Summary Overview Gene-EYe Integration-Platform Vision Provide mechanisms for unified handling of different data sources data source integration change management user defined data preparation Provide relevant tools for sequence manipulation and retrieval work flow support for operation and administration Gen-Eye Vision nur zum nicht-kommerziellen Gebrauch 5

Gene-EYe Integration-Platform The Big Picture Genome Data Warehouse Layer (GDW Schema) KNOWLEDGE Biological Entities -> Biological Concepts (e.g. Life Cycle) Genome DataBase Layer (GDB Schema) CONTENT Relational Entities -> Biological Entities (e.g. Gene) Genome Data Store Layer (GDS Schema) DATA Flat File Data -> Relational Entities (e.g. EMBL) Design GDS: From Flat File to Database Genome Data Store Layer (GDS Schema) Data Storage Data Cleansing Update/Admin GDS Load Tools GDS Admin Tools ENSEMBL DDL InterPro DDL TAXO DDL SWALL DDL EMBL DDL ENSEMBL scanner InterPro scanner TAXO scanner SWALL scanner EMBL scanner nur zum nicht-kommerziellen Gebrauch 6

The Data Import Pipeline - Revisited Data File Scanner Load Files Loader Summary Instance Spec. CLOB Files Load Spec. Gene-EYe GDS Format Spec. Content Spec. DDL-Gen. DDL Script Controller Phase 1: Property Files Perl scripts Hand crafted Phase 2: GEM 1 Repository de.hui.dbis.geneeye.* (Java) Autogenerate from Metadata 1: CWM compliant GeneEYe Metadata Repository Modeling the Maintenance Process nur zum nicht-kommerziellen Gebrauch 7

GDB-Layer: From Data to Biology Genome Database Layer (GDB Schema) Data Integration Data Cleansing (Sem.) Queries Data GDB Builder (IBM Clio?) Schema Gene Protein Transcript Tissue Variant [Data] EMBL SWALL TAXO InterPro ENSEMBL GDB Mapper (IBM Clio) [Definition] Defined by and in cooperation w/ domain experts Genome Data Store Layer (GDS Schema) Data Storage Data Cleansing (Syn.) Update/Admin Schema Mapping with Clio with permission of Dr. Felix Naumann IBM Almaden Research Center Clio Source Schema User mapping Clio Target Schema DB SQL or XQuery DB nur zum nicht-kommerziellen Gebrauch 8

Clio Features with permission of Dr. Felix Naumann IBM Almaden Research Center Schema Viewer Visual mapping between schema elements Attribute Matcher Intelligent suggestions of likely mappings Data Viewer Data examples for mapping queries Queries SQL, XSLT, Xquery Use and adhere to source and target schema constraints GDW: Providing Facts for Research Genome Data Warehouse Layer (GDW Schema) Data Mining Ontology Mapping Process Simulation Ontology GDW Miner GDB Explorer Variant Tissue Transcript Protein Gene Variant Tissue Transcript Protein Gene Genome Database Layer (GDB Schema) Data Integration Data Cleansing (Sem.) Queries nur zum nicht-kommerziellen Gebrauch 9

(Biological) Motivation/Problems Using Scalable Database Technology Gene-EYe Integration-Platform Data Cleansing BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow Concepts for dry Experiments Summary Overview Errors in Genome Data DNA Sequence Determination Classes of errors in genome data production Genome Experimental errors DNA Feature Gene Annotation Analysis errors mrna Transformation errors Propagated errors Stale data Protein Sequence Determination [Müller, Naumann, Freytag, ICIQ, 2003] only 1.3 % difference Protein Function SUBUNIT Annotation The main difference are transcript DISEASE copy numbers in the brain DNA, RNA Sequence agagattagcgcgctagatcgatatgataga 0,23% gctatatcatccgagatagcagatagctcta gcacactattacacgagcagcgaccttatat 2,58% Structure Annotation Protein Sequence MDDREDLVYQAKLAEQAERYDEMVESMKKVD AGMDVELTVEERNLLSVAYKNVIGARRASWY RIISSIEQKEENKGGEDKLKMIREYRQMVER FUNCTION Function Annotation 5% - 30% MAY ACT AS INTRACELLULAR SIGNALING COMPONENT... BINDS DIRECTLY TO 5% ZO-1-40% INVOLVED IN ACUTE LEUKEMIAS nur zum nicht-kommerziellen Gebrauch 10

Reliability-based Merging (cont.) Domain expert identifies reliable parts for merging Definition of a set of views for integration Current work: r 1 ID A 1 A 2 A 3 1 A 1.2 1900 2 A 2.3 1900 3 B 3.1 1900 r 1 4 B 3.4 2000 mismatch patterns? 5 C 5.6 2002 How to Merge assess their r 2 ID A 1 A 2 A 3 1 B 1 2000 2 B 2 2000 3 B 3.1 2000 4 B 3.4 2000 5 D 5.6 2002 Which are the relevant relevance & importance? ID A 1 A 2 A 3 1 A 1.2 2000 2 A 2.3 2000 3 B 3.1 2000 4 B 3.4 2000 5 C 5.6 2002 e.g. MIN() (Biological) Motivation/Problems Using Scalable Database Technology Gene-EYe Integration-Platform BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow Concepts for dry Experiments Summary Overview nur zum nicht-kommerziellen Gebrauch 11

BLAST: General Introduction DC-File Algorithm/Package: Similarity Search Devloped by Altschul et al. (1990) Three Steps: FormatDB Preprocessing 1. Search for Word Pairs (Iseq, DSeq) of Length L on the Data Collection of Sequences above Threshhold T 2. Expansion of each Word Pair until the Value V of their Alignment is away from the local maximum 3. Output of complete alignment (Highscoring Segment Pair, HSP), if Value(Alignment) > S Index Sequences BLAST Report Features Query sequence BLAST Call Output: Powerset of Alignments BLAST UDF Implementation Goal: Using BLAST in SQL-statements How? BLAST-UDF implemented as Table Function Use in SQL Query SELECT * FROM TABLE( BLAST(<Parameter>, <Query Sequence>, <Comparison Sequence> )) Each call returns a set of alignments over Sequences in the Database nur zum nicht-kommerziellen Gebrauch 12

Structure of UDB Table Function Implementation: Mapping of program into calling structure for table functions Communiaction between the different calls via scratchpad scratchpad: Storage area which remains intact and unchanged between UDF calls Storage of data structures for different steps especially for output from postprocessing: SeqAlign For all Initialize SEQUENCES Alignment without gaps Postprocessing For all UDF BLAST ALIGNMENTS Output of the results Release data structures related with sequences Release global data structures FIRST OPEN FETCH CLOSE FINAL (Biological) Motivation/Problems Using Scalable Database Technology Gene-EYe Integration-Platform BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow Concepts for dry Experiments Summary Overview nur zum nicht-kommerziellen Gebrauch 13

The Challenge: Exon Skipping Gene Protein One Gen with 100 Exons 2 100 ~ 10 30 Variations n Exons within one Gene linearly combined (splicing) Used as Pattern for Protein Generation Challenge: Exon Skipping Do alternative fusion points new funktional (i.e. biologigacl meaningful) patterns? nur zum nicht-kommerziellen Gebrauch 14

Functional Genomics: Gain of New Insight First Horizon: Simple Exon Skipping 110110000111111 New Functionality! Flow of Processing Steps Generate Exon Sequence Local Database (automatic) Remote Tool (Web Based) Find Similarity Search Supported by local DB Check for Biological Validity nur zum nicht-kommerziellen Gebrauch 15

Implementation Some facts 60 days 100% load One splice form per minute So far: ca. 90.000 splice forms First biolog. meaningful results Cooperations Cooperation with Univ. of Jena (Rolf Backofen) Berlin Center of Bioinformatics (BCB) Charite, FU, Max-Planck-Institut (M. Vingron) Industry: IBM, small companies, Patrick Chappatte, Switzerland nur zum nicht-kommerziellen Gebrauch 16

Database Environment IBM p-server (sponsored by IBM) CPU CPU CPU CPU 2.3 TByte CPU CPU CPU CPU DB2 Summary Lesson learnt Highly Dynamic Environment Data: changes frequently User: changes frequently Provide a framework for Date integration Data processing Data changes Data dependencies.. Meta data management Future Work Query processing Include domain knowledge Data cleansing Set of UDFs for biological data processing Visualization of Data Summary nur zum nicht-kommerziellen Gebrauch 17