Data Warehouse Technology And The MSD Databases




Data Warehouse Technology And The MSD Databases Philip McNeil

- Data Warehouses
- The MSD Databases
- Populating & Using the Search Database

Data Warehouses

What is a Data Warehouse?
A subject-oriented, integrated, nonvolatile, and time-variant collection of data that is used primarily in organizational decision making.
(W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, 2002)

Data Warehouse: Subject-Oriented
- Organised around major subjects, such as customer, product, sales
- Focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
    - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    - When data are moved to the warehouse, they are converted into a consistent format to facilitate integration

Data Warehouse: Non-Volatile
- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
    - Does not require transaction processing, recovery, and concurrency control mechanisms
    - Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse: Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
    - Operational database: current value data
    - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly
    - But the key of operational data may or may not contain a time element

Dimensional Modelling
- A typical commercial data warehouse is based on a multidimensional data model, which views data in the form of a data cube
- A data cube (such as sales) allows data to be modelled and viewed in multiple dimensions
    - Dimension tables represent important subject areas, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
    - The fact table contains varying levels of summarised data (such as dollars_sold) and keys to each of the related dimension tables
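The fact-table and dimension-table idea can be sketched with a tiny in-memory database. This is a minimal illustration only: the table names (sales_fact, item_dim, time_dim), the key columns and the sample rows are assumptions made up for the example, and Python's sqlite3 module stands in for a real warehouse DBMS.

```python
import sqlite3

# Hypothetical star schema: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter INTEGER, year INTEGER);
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    time_key INTEGER REFERENCES time_dim(time_key),
    dollars_sold REAL);

INSERT INTO item_dim VALUES
    (1, 'laptop', 'Acme', 'computer'),
    (2, 'phone',  'Acme', 'mobile');
INSERT INTO time_dim VALUES
    (10, 5, 1, 1, 2004),
    (11, 20, 4, 2, 2004);
INSERT INTO sales_fact VALUES
    (1, 10, 1000.0), (1, 11, 500.0),
    (2, 10, 300.0),  (2, 11, 200.0);
""")

# One cell of the cube per (item type, quarter) combination.
by_type_and_quarter = conn.execute("""
    SELECT i.type, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i ON i.item_key = f.item_key
    JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY i.type, t.quarter
    ORDER BY i.type, t.quarter
""").fetchall()

# Roll-up: drop the item dimension and aggregate by quarter only.
by_quarter = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()

print(by_type_and_quarter)
print(by_quarter)
```

Viewing the same facts along a different dimension only changes the GROUP BY columns; the fact table itself is untouched.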

Data Warehouse Models
- Star schema: a fact table, containing varying levels of summarised data, in the middle, connected to a set of dimension tables representing important subject areas
- Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalised into a set of smaller dimension tables, forming a shape similar to a snowflake
- Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, this is therefore also called a galaxy schema

Star Schema

Star Schema with Sample Data

Snowflake Schema
Merchandise:   ItemID, Description, QuantityOnHand, ListPrice, Category
Transactions:  SaleID, ItemID, Quantity, SalePrice, Amount
Sale:          SaleID, SaleDate, EmployeeID, CustomerID, SalesTax
Customer:      CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID
City:          CityID, ZipCode, City, State
Dimension tables can join to other dimension tables.
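A dimension-to-dimension join can be illustrated as follows. This is a sketch under assumptions: it uses a subset of the columns listed on the slide, invented sample rows, and sqlite3 in place of the real DBMS. The query walks from the Transactions table through Sale and Customer to the City dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Subset of the slide's columns; all rows are invented sample data.
CREATE TABLE City (
    CityID INTEGER PRIMARY KEY, ZipCode TEXT, City TEXT, State TEXT);
CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY, LastName TEXT,
    CityID INTEGER REFERENCES City(CityID));
CREATE TABLE Sale (
    SaleID INTEGER PRIMARY KEY, SaleDate TEXT,
    CustomerID INTEGER REFERENCES Customer(CustomerID));
CREATE TABLE Transactions (
    SaleID INTEGER REFERENCES Sale(SaleID),
    ItemID INTEGER, Quantity INTEGER, SalePrice REAL, Amount REAL);

INSERT INTO City VALUES
    (1, '10001', 'New York', 'NY'),
    (2, '94105', 'San Francisco', 'CA');
INSERT INTO Customer VALUES (1, 'Smith', 1), (2, 'Jones', 2), (3, 'Lee', 1);
INSERT INTO Sale VALUES
    (100, '2004-01-05', 1), (101, '2004-01-06', 2), (102, '2004-01-07', 3);
INSERT INTO Transactions VALUES
    (100, 1, 2, 10.0, 20.0),
    (101, 1, 1, 10.0, 10.0),
    (102, 2, 3, 5.0, 15.0);
""")

# Customer is a dimension of Sale, and City is in turn a dimension of Customer.
amount_by_state = conn.execute("""
    SELECT c.State, SUM(t.Amount)
    FROM Transactions t
    JOIN Sale s      ON s.SaleID = t.SaleID
    JOIN Customer cu ON cu.CustomerID = s.CustomerID
    JOIN City c      ON c.CityID = cu.CityID
    GROUP BY c.State
    ORDER BY c.State
""").fetchall()

print(amount_by_state)
```

The extra hop from Customer to City is the cost of the snowflake's normalisation: one more join than a star schema that stored the state directly on the customer dimension.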

Fact Constellation
Sales Fact Table:     time_key, item_key, branch_key, location_key, unit_sold, euros_sold, avg_sales
Shipping Fact Table:  time_key, item_key, shipper_key, from_location_key, to_location_key, unit_shipped
Time:                 time_key, day, day_of_the_week, month, quarter, year
Item:                 item_key, item_name, brand, type, supplier_key
Branch:               branch_key, branch_name, branch_type
Location:             location_key, street, city, province/state, country
Shipper:              shipper_key, shipper_name, location_key, shipper_type

Why Build a Data Warehouse?
- Access to data from multiple sources, giving a comprehensive data collection
- Separate transactional and analysis systems: improved query response time (without slowing down transaction processing)
- Easy formulation of complex queries
- Access to historical data (not in operational systems)
- Improved data quality (fewer errors and missing values)

The Data Warehouse Pipeline
[Figure: data flow from sources to front-end tools]
- Sources: operational DBs, other sources
- Extract, Transform, Load, Refresh
- Data storage: Data Warehouse
- Serve
- OLAP engine: OLAP Server
- Front-end tools: analysis, query, reports, data mining
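The extract, transform and load steps can be sketched in a few lines. Everything here is hypothetical: the two sources, their field names and the gender and currency encodings are invented to show how heterogeneous records are converted into one consistent format before loading.

```python
# Map the different source encodings onto one warehouse convention.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def extract():
    """Return raw records from two hypothetical sources with different conventions."""
    source_a = [{"cust": "Smith", "sex": "M", "amount_usd": "12.50"}]
    source_b = [{"customer_name": "JONES", "gender": "female", "amount_cents": 990}]
    return source_a, source_b


def transform(source_a, source_b):
    """Convert both sources to one naming, encoding and unit convention."""
    rows = []
    for r in source_a:
        rows.append({
            "customer": r["cust"].title(),
            "gender": GENDER_CODES[r["sex"].lower()],
            "amount_usd": float(r["amount_usd"]),
        })
    for r in source_b:
        rows.append({
            "customer": r["customer_name"].title(),
            "gender": GENDER_CODES[r["gender"].lower()],
            "amount_usd": r["amount_cents"] / 100,
        })
    return rows


def load(warehouse, rows):
    """Append-only load: existing warehouse rows are never updated in place."""
    warehouse.extend(rows)


warehouse = []
load(warehouse, transform(*extract()))
print(warehouse)
```

A refresh would simply call load again with newly extracted rows, which matches the non-volatile rule: data are added, not overwritten.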

Using the Warehouse
- Ad hoc query: simple submission of SQL statements from a command-line tool or a SQL-generation tool
- Complex analytical questions: using custom-written query tools or commercial online analytical processing (OLAP) tools
- Data mining: searching data sets for hitherto unknown correlations

The MSD Databases

What is a Data Warehouse? (revisited)
From the MSD viewpoint:
A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.
-- Barry Devlin, IBM Consultant

The MSD Databases
The MSD actually consists of two separate databases:
- The deposition database is highly normalised, with thousands of relationships linking some 325 tables; it is the definitive archive for all structural data at MSD
- The search database is a simpler but larger denormalised database, which contains a large amount of additional derived data, with data items duplicated and aggregated into 170 much wider tables, making it more amenable to searching and retrieval of data
A third, intermediate database is involved in transforming the data from the deposition database to the search database and in calculating and adding the derived data

Deposition Database
The deposition database comprises:
- Common reference data, such as amino-acid connectivity, HET group structures, etc.
- Older PDB entries, loaded from legacy files
    - The schema includes strict constraints, enforcing internal consistency and performing type checking and validation against the reference data
    - A huge amount of effort has gone into cleaning up the legacy data
- New entries, loaded from recent PDB submissions
    - New entries are loaded on a weekly basis, subject to the same constraints and checks during loading as legacy data

Deposition Database Schema!

Part of Deposition Database Schema
[Figure: entity-relationship diagram. Entities shown include DEP_PDB_TAXONOMY, ETAXI, ETAXI SYNONYMS, NCBI, NCBI SYNONYMS, NTX CHANGED TAX ID, DEP_ENTITY, DEP_POLY_ENTITY, DEP_POLY_ENTITY_MASTER, DEP_POLY_ENTITY_SEGMENT, DEP_POLY_ENTITY_SEQ, DEP_NATURAL_SOURCE, DEPOSITION, DEP_SEQ_REF, DEP_SEQ_MATCH, DEP_SEQ_CORR, REF_SEQ_REF, REF_SEQ_REF_SEQ, DEP_RESIDUE and REF_CHEM_COMP]

NOT The MSD Data Warehouse
The MSD Search Database:
- Is not a true data warehouse
- Breaks several of the data warehouse cardinal rules
    - Non-volatile
    - Time-variant
- But does make use of many of the data warehousing concepts and techniques
- Closest to a fact constellation

Search Database
- Each data item occurs only once in the deposition database, so that data from a single entry are spread across many tables
- To make searching faster, the data are aggregated into fewer, larger tables in the search database
- Searching the search database requires fewer table joins, making database queries significantly faster and much less complex
- The top-level entity in a structure entry is the assembly, as determined using the Protein Quaternary Structure server (PQS)
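The effect of aggregating normalised tables into a wider one can be sketched as follows. The tables entry, chain, residue and residue_search, their columns and the placeholder entry codes are all hypothetical and far simpler than the real MSD schemas; sqlite3 is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised form: each data item is stored once, spread across tables.
CREATE TABLE entry   (entry_id INTEGER PRIMARY KEY, pdb_code TEXT);
CREATE TABLE chain   (chain_id INTEGER PRIMARY KEY, entry_id INTEGER,
                      chain_code TEXT);
CREATE TABLE residue (residue_id INTEGER PRIMARY KEY, chain_id INTEGER,
                      code_3_letter TEXT, serial INTEGER);

INSERT INTO entry   VALUES (1, '1abc'), (2, '2xyz');
INSERT INTO chain   VALUES (1, 1, 'A'), (2, 2, 'A');
INSERT INTO residue VALUES (1, 1, 'TRP', 1), (2, 1, 'GLY', 2), (3, 2, 'ALA', 1);

-- Denormalised form: the joins are paid once, when the wide table is built.
CREATE TABLE residue_search AS
    SELECT e.pdb_code, c.chain_code, r.code_3_letter, r.serial
    FROM residue r
    JOIN chain c ON c.chain_id = r.chain_id
    JOIN entry e ON e.entry_id = c.entry_id;
""")

# Which entries contain a tryptophan? Two joins on the normalised tables...
normalised = conn.execute("""
    SELECT DISTINCT e.pdb_code
    FROM residue r
    JOIN chain c ON c.chain_id = r.chain_id
    JOIN entry e ON e.entry_id = c.entry_id
    WHERE r.code_3_letter = 'TRP'
""").fetchall()

# ...and none on the wide search table.
denormalised = conn.execute("""
    SELECT DISTINCT pdb_code FROM residue_search WHERE code_3_letter = 'TRP'
""").fetchall()

print(normalised, denormalised)
```

Both queries return the same answer; the second trades duplicated storage for a simpler and cheaper query.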

Representing Macromolecular Structures (1)
[Figure: hierarchy Exp. Result > Assembly > Chains > Residues > Atoms, shown alongside the tables ENTRY, ASSEMBLY, CHAIN, ASSEMBLY DATA, MODEL, ATOM DATA, ALT, ATOM, RESIDUE]

Representing Macromolecular Structures (2)
[Figure: hierarchy ASU (observed exp data) > Biological Unit(s) (independent units) > Chains > Residues > Atoms]
Each level of the hierarchy can have associated properties, e.g.
- Bound molecules
- Domains
- Site residues
- Derived properties (e.g. asa)
- Reference information (e.g. standard geometry)

Derived Data
- During transformation from the deposition database to the search database, additional derived data are added
- Numerous processes are run on the deposition data, including:
    - characterisation of ligand binding sites
    - derivation of secondary structure information
    - mapping data onto other databases such as UniProt, Pfam, InterPro, GO, SCOP and CATH

Search Database Data Models
- Entity-relationship data model
    - as used in the deposition database, but denormalised to aid querying
    - could be generated from the deposition data model
- Dimensional model
    - typical of commercial data warehouses
    - requires a separate data model
- A hybrid of these two is required to handle the complexity of macromolecular structure data
    - very complex
    - fact tables could be dimensions for other fact tables

Part of Search Database Schema
[Figure: schema diagram with foreign-key links. Tables shown include STRAND DATA, SHEET ORDER, SHEET, SHEET HBOND, CATH_MAPPING, HAIRPIN MOTIF, HELIX, HELIX DATA, BULGE, TURN, COMPONENT, COMPONENT DATA, DEPOSITION, ASSEMBLY DATA, CHAIN ASSEMBLY, MODEL, ALT, ATOM, ATOM DATA, ETAXI, ETAXI SYNONYMS, NCBI and NCBI SYNONYMS]
MSD Search Database marts: Secondary Structure, Coordinates-Sequence, Taxonomy, CATH

Star Database Design
Fact table (residue_contact):
- Interactions
- Number of interactions
- Strongest interaction
- Geometry
- Bond type
Dimension tables:
- PDB Entry
- Assembly
- Model
- Residue
- Ligand
- Neighbour
- Chemical compound
- Secondary structure: Helix, Turn, Strand

From Snowflake to Star
[Figure labels: Chemical compound, PDB entry, Assembly, Residue, Residue Contact]

Populating & Using The Search Database

From Deposition to Search Database
Deposition database:
- Normalised relationships
- Authoritative
- Complete
Transformation
Search database:
- Denormalised, fewer relationships
- Derived information
- Subset of data

MSD Database Pipeline
[Figure: pipeline diagram]
- Deposition database (40 GB): deposit via web services, load via mmCIF
- transform
- Transformation database (200 GB)
- replicate
- Search database (200 GB): search via web services
- distribute (distribution)
- PDB files, derived data, external processes (130 GB)

Transformation
- Moving from a complex normalised model, designed to enforce integrity, to a simple, efficient, user-oriented model
- Assignment of consistent identifiers
    - In addition to PDB identifiers
- Extensive indexing
- Based on a flexible, metadata-driven mechanism, developed in-house to overcome Oracle limitations
    - Models composite entities and their dependencies
    - Allows incremental transformation

Post Transformation
- Calculation of derived/aggregated data
    - Consistent across whole archive
    - Ability to query derived data
- Adding value to the database
    - Active-site information
    - Structure, sequence alignment
    - Cross referencing: SCOP, CATH, Pfam, UniProt, InterPro
- Additional derived data and indexing required for web-based search services
    - Stored in search database, but conceptually part of search systems
    - Scientific parameterisation reflects requirements of search services

Efficiently Querying The Database
Requires using many of the tools provided by the Oracle DBMS:
- Star joins
- Star transformations
- Bitmap indexes
- Index-organized tables
- Set operations (INTERSECT, MINUS, UNION)
- Optimizer hints (e.g. LEADING, FACT, NO_MERGE)
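Of the tools listed, the set operations are the easiest to show in isolation. The sketch below is an assumption-laden illustration: the tables entry_ligand and entry_fold and their rows are invented, and it runs on sqlite3, where Oracle's MINUS is spelled EXCEPT. Star transformations, bitmap indexes and optimizer hints are Oracle-specific and are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical search tables, one row per (entry, property) pair.
CREATE TABLE entry_ligand (pdb_code TEXT, ligand TEXT);
CREATE TABLE entry_fold   (pdb_code TEXT, fold TEXT);
INSERT INTO entry_ligand VALUES ('e1', 'HEM'), ('e2', 'HEM'), ('e3', 'ATP');
INSERT INTO entry_fold   VALUES ('e1', 'alpha'), ('e2', 'beta'), ('e3', 'alpha');
""")

HEM = "SELECT pdb_code FROM entry_ligand WHERE ligand = 'HEM'"
ALPHA = "SELECT pdb_code FROM entry_fold WHERE fold = 'alpha'"


def run(set_operator):
    """Combine the two independent searches with the given set operator."""
    sql = f"{HEM} {set_operator} {ALPHA} ORDER BY pdb_code"
    return [row[0] for row in conn.execute(sql)]


both = run("INTERSECT")    # entries with HEM and an alpha fold
hem_only = run("EXCEPT")   # entries with HEM but no alpha fold (Oracle: MINUS)
either = run("UNION")      # entries matching at least one condition

print(both, hem_only, either)
```

Each sub-search touches a single table, and the set operator combines the resulting key lists, so independent search criteria can be composed without writing one large join.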

3D Spatial Searches
Search example: find the following triangle site, with each pair of atoms 6-8 Angstroms apart:
- Cbeta of Isoleucine or Leucine
- Cbeta of Arginine
- Cbeta of Tryptophan or Tyrosine or Phenylalanine

The Query

    select d1.atom_data1_id, d1.atom_data2_id
    from (select /*+ NO_MERGE INDEX(atomic_dists) */ atom_data1_id, atom_data2_id
          from atomic_dists
          where dist_id in (select id from dists
                            where code_3_letter1 in ('ILE','LEU')
                              and code_3_letter2 in ('TRP','TYR','PHE')
                              and chem_atom1_name = 'CB'
                              and chem_atom2_name = 'CB'
                              and dist in (6,7,8))) d1,
         (select /*+ NO_MERGE INDEX(atomic_dists) */ atom_data1_id, atom_data2_id
          from atomic_dists
          where dist_id in (select id from dists
                            where code_3_letter1 = 'ARG'
                              and code_3_letter2 in ('ILE','LEU')
                              and chem_atom1_name = 'CB'
                              and chem_atom2_name = 'CB'
                              and dist in (6,7,8))) d2,
         (select /*+ NO_MERGE INDEX(atomic_dists) */ atom_data1_id, atom_data2_id
          from atomic_dists
          where dist_id in (select id from dists
                            where code_3_letter1 = 'ARG'
                              and code_3_letter2 in ('TRP','TYR','PHE')
                              and chem_atom1_name = 'CB'
                              and chem_atom2_name = 'CB'
                              and dist in (6,7,8))) d3
    where d1.atom_data1_id = d2.atom_data2_id
      and d1.atom_data2_id = d3.atom_data2_id
      and d2.atom_data1_id = d3.atom_data1_id;
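The logic of the query can be tried out on a toy data set. This is a sketch under assumptions: the column layout of dists and atomic_dists is inferred from the query text rather than taken from the MSD documentation, the rows are invented, the Oracle hints are dropped, the query uses explicit joins, and the select list is extended to return the arginine atom as well. Python's sqlite3 module stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout: dists holds one row per (residue pair, atom pair, binned
-- distance); atomic_dists links concrete atom pairs to a dists row.
CREATE TABLE dists (
    id INTEGER PRIMARY KEY, code_3_letter1 TEXT, code_3_letter2 TEXT,
    chem_atom1_name TEXT, chem_atom2_name TEXT, dist INTEGER);
CREATE TABLE atomic_dists (
    dist_id INTEGER, atom_data1_id INTEGER, atom_data2_id INTEGER);

INSERT INTO dists VALUES
    (1, 'ILE', 'TRP', 'CB', 'CB', 7),
    (2, 'ARG', 'ILE', 'CB', 'CB', 6),
    (3, 'ARG', 'TRP', 'CB', 'CB', 8),
    (4, 'ARG', 'LEU', 'CB', 'CB', 7),
    (5, 'LEU', 'TRP', 'CB', 'CB', 12);
-- Invented atoms: 1 = ILE CB, 2 = TRP CB, 3 = ARG CB, 4 = LEU CB.
INSERT INTO atomic_dists VALUES
    (1, 1, 2), (2, 3, 1), (3, 3, 2), (4, 3, 4), (5, 4, 2);
""")


def pair_query(first, second):
    """Sub-select for CB-CB pairs of the given residue types, 6-8 Angstroms apart."""
    in_first = ",".join(f"'{code}'" for code in first)
    in_second = ",".join(f"'{code}'" for code in second)
    return f"""
        SELECT atom_data1_id, atom_data2_id FROM atomic_dists
        WHERE dist_id IN (
            SELECT id FROM dists
            WHERE code_3_letter1 IN ({in_first})
              AND code_3_letter2 IN ({in_second})
              AND chem_atom1_name = 'CB' AND chem_atom2_name = 'CB'
              AND dist IN (6, 7, 8))"""


aliphatic, aromatic = ["ILE", "LEU"], ["TRP", "TYR", "PHE"]
sql = f"""
    SELECT d1.atom_data1_id, d1.atom_data2_id, d2.atom_data1_id
    FROM ({pair_query(aliphatic, aromatic)}) d1
    JOIN ({pair_query(['ARG'], aliphatic)}) d2
      ON d1.atom_data1_id = d2.atom_data2_id
    JOIN ({pair_query(['ARG'], aromatic)}) d3
      ON d1.atom_data2_id = d3.atom_data2_id
     AND d2.atom_data1_id = d3.atom_data1_id
"""
triangles = conn.execute(sql).fetchall()
print(triangles)
```

Each sub-select finds one edge of the triangle, and the three join conditions force the edges to share atoms pairwise. The leucine atom in the sample data is excluded because its distance to the tryptophan falls outside the 6-8 range.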

Managing Data
- The information available in the MSD database is organised in application areas (data marts)
- Users may replicate only the data marts that they are interested in
- Some data marts are quite valuable and still small enough to be used on desktop systems, as in the demonstration
- The data marts are also loosely interrelated and can be synchronised independently

Data Marts
The search database is divided into application areas, or data marts:
- Structure Data Descriptions
- Secondary Structure
- Taxonomy
- Ligands
- Experimental details
- Citations
- Mapping to UniProt, SCOP, CATH, Pfam, InterPro, GO, IntEnz, PubMed
- Active Sites
- Structural-Sequence alignment
Each data mart can be distributed and managed separately

What is the Search Database used for?
- A target for data integration
    - eFamily
- A direct backend for web-based services
    - MSDLite, MSDPro, MSDSite etc.
- A source for data files
    - Data exported from the DB to support web-based services (indirectly a backend)
        - Atlas pages
    - XML representation of sections of the DB
        - Including eFamily
    - Coordinates in PDB format for software that requires it
- Data mining
    - MSDMine

Database Documentation http://www.ebi.ac.uk/msd-srv/docs/dbdoc/

Database Documentation (2)