Data Warehouse Technology And The MSD Databases Philip McNeil
Data Warehouses The MSD Databases Populating & using the Search Database
Data Warehouses
What is a Data Warehouse? A subject-oriented, integrated, nonvolatile, and time-variant collection of data that is used primarily in organizational decision making. (W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, 2002)
Data Warehouse: Subject-Oriented
- Organised around major subjects, such as customer, product, sales
- Focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  - When data are moved to the warehouse, they are converted into a consistent format to facilitate integration
Data Warehouse: Non-Volatile
- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse: Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  - But the key of operational data may or may not contain a time element
Dimensional Modelling
- A typical commercial data warehouse is based on a multidimensional data model, which views data in the form of a data cube
- A data cube (such as sales) allows data to be modelled and viewed in multiple dimensions
  - Dimension tables represent important subject areas, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - The fact table contains varying levels of summarised data (such as dollars_sold) and keys to each of the related dimension tables
Data Warehouse Models
- Star schema: a fact table, containing varying levels of summarised data, in the middle, connected to a set of dimension tables representing important subject areas
- Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalised into a set of smaller dimension tables, forming a shape similar to a snowflake
- Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, hence also called a galaxy schema
Star Schema
Star Schema with Sample Data
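As a rough illustration of the star schema described above, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and rolls the fact data up by quarter and item type. The column names item_name, brand, type, quarter and dollars_sold come from the slides; the table name time_dim, the key columns as laid out here, and all sample rows are assumptions made up for the example, not the data shown in the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per item and per point in time
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter INTEGER, year INTEGER);
-- Fact table: a measure plus a key to each dimension
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item(item_key),
    dollars_sold REAL
);
-- Hypothetical sample data
INSERT INTO item VALUES (1, 'laptop', 'Acme', 'computer'), (2, 'phone', 'Acme', 'handset');
INSERT INTO time_dim VALUES (1, 15, 1, 1, 2004), (2, 20, 4, 2, 2004);
INSERT INTO sales VALUES (1, 1, 1000.0), (1, 2, 300.0), (2, 1, 1200.0), (2, 2, 450.0);
""")

# Each dimension is one join away from the fact table (the "star" shape).
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales s
    JOIN time_dim t ON t.time_key = s.time_key
    JOIN item i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for quarter, item_type, total in rows:
    print(quarter, item_type, total)
```

Changing the GROUP BY columns (for example to t.year alone) views the same facts along a different dimension without touching the fact table.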
Snowflake Schema
- Merchandise: ItemID, Description, QuantityOnHand, ListPrice, Category
- Transactions: SaleID, ItemID, Quantity, SalePrice, Amount
- Sale: SaleID, SaleDate, EmployeeID, CustomerID, SalesTax
- Customer: CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID
- City: CityID, ZipCode, City, State
Dimension tables can join to other dimension tables.
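The point that dimension tables can join to other dimension tables can be sketched with the Sale, Customer and City tables from the slide. This is only an illustration in SQLite: the tables are cut down to a few of the listed columns, and every row is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- City is a dimension of Customer, which is itself a dimension of Sale
CREATE TABLE City (CityID INTEGER PRIMARY KEY, ZipCode TEXT, City TEXT, State TEXT);
CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, LastName TEXT,
                       CityID INTEGER REFERENCES City(CityID));
CREATE TABLE Sale (SaleID INTEGER PRIMARY KEY, SaleDate TEXT,
                   CustomerID INTEGER REFERENCES Customer(CustomerID), SalesTax REAL);
-- Hypothetical sample rows
INSERT INTO City VALUES (1, '10001', 'New York', 'NY'), (2, '94105', 'San Francisco', 'CA');
INSERT INTO Customer VALUES (10, 'Smith', 1), (11, 'Jones', 2), (12, 'Lee', 2);
INSERT INTO Sale VALUES (100, '2004-01-05', 10, 1.50),
                        (101, '2004-01-06', 11, 2.00),
                        (102, '2004-01-07', 12, 0.75);
""")

# Reaching State takes two joins: Sale -> Customer -> City.
tax_by_state = conn.execute("""
    SELECT c.State, SUM(s.SalesTax)
    FROM Sale s
    JOIN Customer cu ON cu.CustomerID = s.CustomerID
    JOIN City c ON c.CityID = cu.CityID
    GROUP BY c.State
    ORDER BY c.State
""").fetchall()

print(tax_by_state)
```

In a pure star schema the State column would be copied into Customer, trading some redundancy for one fewer join.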
Fact Constellation
- Sales Fact Table: time_key, item_key, branch_key, location_key, unit_sold, euros_sold, avg_sales
- Shipping Fact Table: time_key, item_key, shipper_key, from_location_key, to_location_key, unit_shipped
- Time: time_key, day, day_of_the_week, month, quarter, year
- Item: item_key, item_name, brand, type, supplier_key
- Branch: branch_key, branch_name, branch_type
- Location: location_key, street, city, province/state, country
- Shipper: shipper_key, shipper_name, location_key, shipper_type
Why Build a Data Warehouse?
- Access to data from multiple sources, giving a comprehensive data collection
- Separate transactional and analysis systems
  - Improved query response time (without slowing down transaction processing)
  - Easy formulation of complex queries
- Access to historical data (not in operational systems)
- Improved data quality (fewer errors and missing values)
The Data Warehouse Pipeline
[Figure: pipeline diagram. Operational DBs and other sources -> Extract, Transform, Load, Refresh -> Data Warehouse -> Serve -> OLAP Server -> Analysis, Query, Reports, Data mining. Tiers: Data Storage, OLAP Engine, Front-End Tools.]
Using the Warehouse
- Ad hoc query: simple submission of SQL statements from a command-line tool or a SQL-generation tool
- Complex analytical questions: using custom-written query tools or commercial online analytical processing (OLAP) tools
- Data mining: searching data sets for hitherto unknown correlations
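The extract, transform and load stages of the pipeline can be sketched in a few lines. Everything here is hypothetical: two made-up sources (rows from an operational database and a flat file) use different column names and date formats, and the transform step converts both into one consistent format before loading, as the Integrated slide describes.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical source 1: rows as they might come from an operational database
operational_rows = [{"cust": "C1", "sold_on": "2004-03-01", "amount": "19.99"}]
# Hypothetical source 2: a flat file with other column names and day/month/year dates
flat_file = "customer_id,date,total\nC2,02/03/2004,5.00\n"

def extract():
    """Yield (source_name, raw_row) pairs from every source."""
    for row in operational_rows:
        yield "db", row
    for row in csv.DictReader(io.StringIO(flat_file)):
        yield "csv", row

def transform(source, row):
    """Map each source's naming and encoding onto one warehouse format."""
    if source == "db":
        return (row["cust"], row["sold_on"], float(row["amount"]))
    day = datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat()
    return (row["customer_id"], day, float(row["total"]))

def load(conn, records):
    """Append the cleaned records to the warehouse table."""
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (customer_id TEXT, sale_date TEXT, amount REAL)")
load(conn, (transform(source, row) for source, row in extract()))

warehouse_rows = conn.execute(
    "SELECT customer_id, sale_date, amount FROM sales_fact ORDER BY customer_id"
).fetchall()
print(warehouse_rows)
```

A refresh step would rerun the same three functions on new source rows only; that bookkeeping is left out of this sketch.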
The MSD Databases
What is a Data Warehouse? (revisited)
From the MSD viewpoint: A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context. -- Barry Devlin, IBM Consultant
The MSD Databases
- The MSD actually consists of two separate databases:
  - the deposition database is highly normalised, with thousands of relationships linking some 325 tables; it is the definitive archive for all structural data at MSD
  - the search database is a simpler but larger denormalised database, which contains a large amount of additional derived data, with data items duplicated and aggregated into 170 much wider tables, making it more amenable to searching and retrieval of data
- A third, intermediate database is involved in transforming the data from the deposition database to the search database and in calculating and adding the derived data
Deposition Database
The deposition database comprises:
- common reference data, such as amino-acid connectivity, HET group structures, etc.
- older PDB entries, loaded from legacy files
  - the schema includes strict constraints, enforcing internal consistency and performing type checking and validation against the reference data
  - a huge amount of effort has gone into cleaning up the legacy data
- new entries, loaded from recent PDB submissions
  - new entries are loaded on a weekly basis, subject to the same constraints and checks during loading as legacy data
Deposition Database Schema!
Part of Deposition Database Schema
[Figure: entity-relationship diagram. Entities shown include DEP_PDB_TAXONOMY, ETAXI, ETAXI SYNONYMS, NCBI, NCBI SYNONYMS, NTX CHANGED TAX ID, DEPOSITION, DEP_ENTITY, DEP_NATURAL_SOURCE, DEP_POLY_ENTITY, DEP_POLY_ENTITY_MASTER, DEP_POLY_ENTITY_SEGMENT, DEP_POLY_ENTITY_SEQ, DEP_SEQ_REF, DEP_SEQ_MATCH, DEP_SEQ_CORR, DEP_RESIDUE, REF_SEQ_REF, REF_SEQ_REF_SEQ and REF_CHEM_COMP, linked by named relationships and foreign keys.]
NOT The MSD Data Warehouse
The MSD Search Database:
- Is not a true data warehouse
  - Breaks several of the data warehouse cardinal rules: it is neither non-volatile nor time-variant
- But does make use of many of the data warehousing concepts and techniques
  - Closest to a fact constellation
Search Database
- Each data item occurs only once in the deposition database, so that data from a single entry are spread across many tables
- To make searching faster, the data are aggregated into fewer, larger tables in the Search database
- Searching the Search database requires fewer table joins, making database queries significantly faster and much less complex
- The top-level entity in a structure entry is the assembly, as determined using the Protein Quaternary Structure server (PQS)
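The effect of aggregating data into fewer, wider tables can be shown on a toy scale. The sketch below is not the MSD schema: the tables entry, chain, residue and residue_wide, their columns (apart from code_3_letter, which appears in the query shown later in these slides) and the entry codes are all made up. It builds a small normalised model, derives one denormalised table from it, and runs the same question against both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised, deposition-style model: each item stored once
CREATE TABLE entry (entry_id INTEGER PRIMARY KEY, pdb_code TEXT);
CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, entry_id INTEGER, chain_code TEXT);
CREATE TABLE residue (residue_id INTEGER PRIMARY KEY, chain_id INTEGER, code_3_letter TEXT);
INSERT INTO entry VALUES (1, '1abc'), (2, '2xyz');   -- made-up codes
INSERT INTO chain VALUES (1, 1, 'A'), (2, 2, 'A');
INSERT INTO residue VALUES (1, 1, 'ARG'), (2, 1, 'LEU'), (3, 2, 'ARG'), (4, 2, 'ARG');

-- Denormalised, search-style table: parent columns duplicated onto every residue row
CREATE TABLE residue_wide AS
SELECT r.residue_id, r.code_3_letter, c.chain_code, e.pdb_code
FROM residue r
JOIN chain c ON c.chain_id = r.chain_id
JOIN entry e ON e.entry_id = c.entry_id;
""")

# Arginine count per entry, asked of the normalised model (two joins) ...
normalised = conn.execute("""
    SELECT e.pdb_code, COUNT(*)
    FROM residue r
    JOIN chain c ON c.chain_id = r.chain_id
    JOIN entry e ON e.entry_id = c.entry_id
    WHERE r.code_3_letter = 'ARG'
    GROUP BY e.pdb_code ORDER BY e.pdb_code
""").fetchall()

# ... and of the wide table (no joins).
denormalised = conn.execute("""
    SELECT pdb_code, COUNT(*)
    FROM residue_wide
    WHERE code_3_letter = 'ARG'
    GROUP BY pdb_code ORDER BY pdb_code
""").fetchall()

print(normalised == denormalised, denormalised)
```

The cost is that pdb_code and chain_code are now repeated on every residue row, which is why the search database is larger than the deposition database it is derived from.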
Representing Macromolecular Structures (1)
[Figure: structure hierarchy Exp. Result -> Assembly -> Chains -> Residues -> Atoms, with table labels ENTRY ASSEMBLY CHAIN ASSEMBLY DATA MODEL ATOM DATA ALT ATOM RESIDUE.]
Representing Macromolecular Structures (2)
[Figure: hierarchy from ASU (observed exp data) through Biological Unit(s) and Independent units to Chains, Residues and Atoms.]
Each level of the hierarchy can have associated properties, e.g.
- Bound molecules
- Domains
- Site residues
- Derived properties (e.g. asa)
- Reference information (e.g. standard geometry)
Derived Data
- During transformation from the deposition database to the search database, additional derived data are added
- Numerous processes are run on the deposition data, including:
  - characterisation of ligand binding sites
  - derivation of secondary structure information
  - mapping data onto other databases such as UniProt, Pfam, InterPro, GO, SCOP and CATH
Search Database Data Models
- Entity-relationship data model
  - as used in the deposition database, but denormalised to aid querying
  - could be generated from the deposition data model
- Dimensional model
  - typical of commercial data warehouses
  - requires a separate data model
- A hybrid of these two is required to handle the complexity of macromolecular structure data
  - very complex
  - fact tables could be dimensions for other fact tables
Part of Search Database Schema
[Figure: schema diagram titled MSD SEARCH DATABASE MARTS: SECONDARY STRUCTURE, COORDINATES-SEQUENCE, TAXONOMY, CATH. Tables shown include DEPOSITION, ASSEMBLY, ASSEMBLY DATA, CHAIN, COMPONENT, COMPONENT DATA, MODEL, ALT, ATOM, ATOM DATA, HELIX, HELIX DATA, SHEET, SHEET ORDER, SHEET HBOND, STRAND DATA, HAIRPIN MOTIF, BULGE, TURN, CATH_MAPPING, ETAXI, ETAXI SYNONYMS, NCBI and NCBI SYNONYMS, linked by foreign keys.]
Star Database Design
[Figure: star design with dimension tables around the fact table (residue_contact). Labels shown: Assembly, PDB Entry, Model, Interactions, Number of interactions, Strongest interaction, Geometry, Bond type, Residue, Ligand, Neighbour, Chemical compound, Secondary structure (Helix, Turn, Strand).]
From Snowflake to Star
[Figure: tables shown are Chemical compound, PDB entry, Assembly, Residue and Residue Contact.]
Populating & Using The Search Database
From Deposition to Search Database
- Deposition database: normalised, relationships, authoritative, complete
- Transformation
- Search database: denormalised, fewer relationships, derived information, subset of data
MSD Database Pipeline
[Figure: pipeline diagram. Deposit -> load via mmCIF -> Deposition database (40 GB) -> transform -> Transformation database (200 GB) -> replicate -> Search database (200 GB) -> search, distribute. Web services appear at both the deposition and search ends. Remaining labels: Distribution, PDB files, derived data, External Processes, 130 GB.]
Transformation
- Moving from a complex normalised model, designed to enforce integrity, to a simple, efficient, user-oriented model
- Assignment of consistent identifiers, in addition to PDB identifiers
- Extensive indexing
- Based on a flexible, metadata-driven mechanism, developed in-house to overcome Oracle limitations
  - Models composite entities and their dependencies
  - Allows incremental transformation
Post Transformation
- Calculation of derived/aggregated data
  - Consistent across the whole archive
  - Ability to query derived data
- Adding value to the database
  - Active-site information
  - Structure, sequence alignment
  - Cross-referencing: SCOP, CATH, Pfam, UniProt, InterPro
- Additional derived data and indexing required for web-based search services
  - Stored in the search database, but conceptually part of the search systems
  - Scientific parameterisation reflects requirements of search services
Efficiently Querying The Database
Requires using many of the tools provided by the Oracle DBMS:
- Star joins
- Star transformations
- Bitmap indexes
- Index-organized tables
- Set operations (INTERSECT, MINUS, UNION)
- Optimizer hints (e.g. LEADING, FACT, NO_MERGE)
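Of the tools listed, the set operations are the easiest to try outside Oracle. The sketch below uses SQLite, where Oracle's MINUS is spelled EXCEPT; star transformations, bitmap indexes, index-organized tables and optimizer hints are Oracle features and are not reproduced. The table residue_wide, its columns and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE residue_wide (pdb_code TEXT, code_3_letter TEXT);
-- Hypothetical entries e1..e3 and the residue types they contain
INSERT INTO residue_wide VALUES
    ('e1', 'ARG'), ('e1', 'TRP'), ('e2', 'ARG'), ('e3', 'TRP'), ('e3', 'LEU');
""")

has_arg = "SELECT pdb_code FROM residue_wide WHERE code_3_letter = 'ARG'"
has_trp = "SELECT pdb_code FROM residue_wide WHERE code_3_letter = 'TRP'"

def entry_codes(sql):
    """Run a query returning one column and give back its values, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

both = entry_codes(f"{has_arg} INTERSECT {has_trp}")   # entries with ARG and TRP
either = entry_codes(f"{has_arg} UNION {has_trp}")     # entries with ARG or TRP
arg_only = entry_codes(f"{has_arg} EXCEPT {has_trp}")  # ARG but no TRP (MINUS in Oracle)

print(both, either, arg_only)
```

Combining simple single-condition queries this way keeps each branch cheap to evaluate and avoids self-joins on a large table.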
3D Spatial Searches
Search example: find the following triangle site, in which each pair of atoms is 6-8 Angstroms apart:
- Cbeta of Isoleucine or Leucine
- Cbeta of Arginine
- Cbeta of Tryptophan or Tyrosine or Phenylalanine
The Query
select d1.atom_data1_id, d1.atom_data2_id
from
  -- d1: Ile/Leu CB to Trp/Tyr/Phe CB
  (select /*+ NO_MERGE INDEX(atomic_dists) */ atom_data1_id, atom_data2_id
   from atomic_dists
   where dist_id in (select id from dists
                     where code_3_letter1 in ('ILE','LEU')
                       and code_3_letter2 in ('TRP','TYR','PHE')
                       and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'
                       and dist in (6,7,8))) d1,
  -- d2: Arg CB to Ile/Leu CB
  (select /*+ NO_MERGE INDEX(atomic_dists) */ atom_data1_id, atom_data2_id
   from atomic_dists
   where dist_id in (select id from dists
                     where code_3_letter1 = 'ARG'
                       and code_3_letter2 in ('ILE','LEU')
                       and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'
                       and dist in (6,7,8))) d2,
  -- d3: Arg CB to Trp/Tyr/Phe CB
  (select /*+ NO_MERGE INDEX(atomic_dists) */ atom_data1_id, atom_data2_id
   from atomic_dists
   where dist_id in (select id from dists
                     where code_3_letter1 = 'ARG'
                       and code_3_letter2 in ('TRP','TYR','PHE')
                       and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'
                       and dist in (6,7,8))) d3
where d1.atom_data1_id = d2.atom_data2_id   -- same Ile/Leu atom
  and d1.atom_data2_id = d3.atom_data2_id   -- same Trp/Tyr/Phe atom
  and d2.atom_data1_id = d3.atom_data1_id;  -- same Arg atom
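To see how the three sub-queries close the triangle, the query can be run against a toy data set. The sketch below is an assumption-heavy stand-in: it uses SQLite rather than Oracle, drops the optimizer hints, and guesses the shape of dists and atomic_dists from the columns the query mentions (dists as a lookup of residue pair, atom pair and distance bin; atomic_dists as atom pairs pointing at a dists row). The atom ids and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout, inferred only from the columns used in the query
CREATE TABLE dists (id INTEGER PRIMARY KEY, code_3_letter1 TEXT, code_3_letter2 TEXT,
                    chem_atom1_name TEXT, chem_atom2_name TEXT, dist INTEGER);
CREATE TABLE atomic_dists (dist_id INTEGER, atom_data1_id INTEGER, atom_data2_id INTEGER);

-- Hypothetical atoms: 1 = LEU CB, 2 = TRP CB, 3 = ARG CB, 4 = TYR CB (too far away)
INSERT INTO dists VALUES (1, 'LEU', 'TRP', 'CB', 'CB', 7),
                         (2, 'ARG', 'LEU', 'CB', 'CB', 6),
                         (3, 'ARG', 'TRP', 'CB', 'CB', 8),
                         (4, 'ARG', 'TYR', 'CB', 'CB', 12);
INSERT INTO atomic_dists VALUES (1, 1, 2), (2, 3, 1), (3, 3, 2), (4, 3, 4);
""")

def side(first, second):
    """Sub-query for one triangle side: CB-CB pairs in the 6, 7 or 8 Angstrom bins."""
    as_list = lambda codes: ", ".join(f"'{code}'" for code in codes)
    return f"""
        (SELECT atom_data1_id, atom_data2_id FROM atomic_dists
         WHERE dist_id IN (SELECT id FROM dists
                           WHERE code_3_letter1 IN ({as_list(first)})
                             AND code_3_letter2 IN ({as_list(second)})
                             AND chem_atom1_name = 'CB' AND chem_atom2_name = 'CB'
                             AND dist IN (6, 7, 8)))"""

aliphatic, aromatic, arginine = ["ILE", "LEU"], ["TRP", "TYR", "PHE"], ["ARG"]

triangles = conn.execute(f"""
    SELECT d1.atom_data1_id, d1.atom_data2_id
    FROM {side(aliphatic, aromatic)} d1,
         {side(arginine, aliphatic)} d2,
         {side(arginine, aromatic)} d3
    WHERE d1.atom_data1_id = d2.atom_data2_id   -- same Ile/Leu atom
      AND d1.atom_data2_id = d3.atom_data2_id   -- same aromatic atom
      AND d2.atom_data1_id = d3.atom_data1_id   -- same Arg atom
""").fetchall()

print(triangles)
```

The ARG to TYR pair at 12 Angstroms falls outside the requested bins, so only the LEU, TRP, ARG triangle is reported.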
Managing Data
- The information available in the MSD database is organised in application areas (data marts)
- Users may replicate only the data marts that they are interested in
- Some data marts are quite valuable and still small enough to be used on desktop systems, as in the demonstration
- The data marts are only loosely interrelated and can be synchronised independently
Data Marts
The Search database is divided into application areas, or data marts:
- Structure Data
- Descriptions
- Secondary Structure
- Taxonomy
- Ligands
- Experimental details
- Citations
- Mapping to UniProt, SCOP, CATH, Pfam, InterPro, GO, IntEnz, PubMed
- Active Sites
- Structural-Sequence alignment
Each data mart can be distributed and managed separately
What is the Search Database used for?
- A target for data integration: efamily
- A direct backend for web-based services: MSDLite, MSDPro, MSDSite etc.
- A source for data files
  - Data exported from the DB to support web-based services (indirectly a backend): Atlas pages
  - XML representation of sections of the DB, including efamily
  - Coordinates in PDB format for software that requires it
- Data mining: MSDMine
Database Documentation http://www.ebi.ac.uk/msd-srv/docs/dbdoc/
Database Documentation (2)