Introduction to Data Warehousing and OLAP

Outline
- Part I: Introduction (OLAP vs OLTP; Data Cleaning and Integration)
- Part II: Data Models and Warehouse Design
- Part III: Index Structures for Data Warehouses
Types of Data

Operational Data (OLTP applications)
- Data that works
- Frequent updates and queries
- Normalized for efficient search and updates (minimizes update anomalies)
- Fragmented, with local relevance
- Point queries: queries accessing individual tuples

Historical Data (OLAP applications)
- Data that tells
- Very infrequent updates
- Integrated data set with global relevance
- Analytical queries that require huge amounts of aggregation
- Performance issues mainly in query response time (not in updates)
Example OLTP Queries
- What is the salary of Mr. Ali?
- What are the address and phone number of the person in charge of the Supplies department?
- How many employees received an excellent rating in the latest appraisal?

Example OLAP Queries
- How has employee attrition changed over the years across the company?
- Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
- Is it financially viable to continue our manufacturing unit in Taiwan?
A Data Warehouse
- An infrastructure to manage historical data
- Designed to support OLAP queries that make extensive use of aggregation
- Post-retrieval processing (reporting) is often just as complex as the retrieval itself, if not more so

Warehousing Data
[Figure: operational data from several OLTP units flows through a data cleaning and integration stage into the data warehouse]
Data Marts
- A data warehouse can be seen as a collection of data marts: historical data about each OLTP segment that feeds into the warehouse
- Data marts can also be seen as small warehouses supporting OLAP activities within a given segment

[Figure: OLTP databases feed the DCU and DIU (data cleaning and integration units), which populate the data warehouse; updates and feedback are back-flushed to the sources]
Dirty Data: Lack of Standardization
- Multiple encodings, locales, languages
- Spurious abbreviations: Allama Iqbal Road and A.I. Road are the same
- Semantic equivalence: Rawalpindi is the same as Pindi
- Multiple standards: 1.6 kilometres is the same as 1 mile

Dirty Data: Missing, Spurious and Duplicate Data
- Missing age field for an employee
- Spurious (incorrectly entered) sales values
- Duplication of data sets across OLTP units
- Semantic duplication (M. A. Khan appearing in another data set as Khan Muhammad Ali)
Dirty Data: Inconsistencies
- Incorrect use of codes (use of M/F in addition to 0/1 for gender)
- Codes with inconsistent or outdated meaning (travel eligibility C denoting eligibility to travel only by III class sleeper, which no longer exists)
- Inconsistent duplicate data (two data sets are found to belong to the same person but carry two different addresses)
- Inconsistent associations (sales figures provided by the marketing department do not add up to the total sales figures from the retail units)
- Semantic inconsistencies (Feb 31st)
- Referential inconsistencies (Rs. 10 lakhs of sales reported from a unit that has been closed down)
Issues in Data Cleaning
- Cannot be fully automated: GIGO (garbage in, garbage out)
- Requires considerable knowledge that is tacit and beyond the purview of the warehouse (metrics, geography, government policies, etc.)
- Complexity increases (usually geometrically) with the number of data sources
- Complexity increases with the history span taken up for cleaning

Steps in Data Cleaning (Rahm and Do [1])
1. Data analysis: analyze the data set to obtain metadata and detect dirty data
2. Definition of transformation rules: transform data from its current dirty form to the required clean form; transformation can be at the schema level or the data level
3. Rule verification: verify the transformation rules on test data sets
4. Transformation: execute the transformation rules on the data set
5. Backflow: re-populate the data sources with cleaned data
Data Analysis Techniques (Refs [1], [2])

Problem to be detected          Metadata / technique used
Illegal values                  (max, min), (mean, deviation), cardinality: flag outliers
Spelling mistakes,              Hashing, N-gram outliers; column comparison (compare value
lack of standards               sets from a given column across tables)
Duplicate and missing values    Compare cardinality with #rows, detect nulls, use rules to
                                predict incorrect or missing values

Transformation Algorithms: Hash-Merge for Duplicate Elimination
1. Hash tuples into buckets based on a given column (tuples with duplicate values are hashed to the same bucket)
2. Merge tuples within each bucket separately
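A minimal Python sketch of the hash-merge steps above. The dict-shaped rows, the choice of "name" as the hash column, and the keep-first merge rule are illustrative assumptions; a real cleaner would apply richer merge rules within each bucket.

```python
from collections import defaultdict

def hash_merge(tuples, key_column):
    # Step 1: hash tuples on the given column; duplicates on that
    # column land in the same bucket.
    buckets = defaultdict(list)
    for t in tuples:
        buckets[hash(t[key_column])].append(t)
    # Step 2: merge tuples within each bucket separately. As a stand-in
    # merge rule, keep the first tuple seen per key value.
    merged = []
    for bucket in buckets.values():
        seen = {}
        for t in bucket:
            seen.setdefault(t[key_column], t)
        merged.extend(seen.values())
    return merged

rows = [
    {"name": "M.A. Khan", "dept": "Sales"},
    {"name": "Saleem",    "dept": "R&D"},
    {"name": "M.A. Khan", "dept": "PR"},
]
print(hash_merge(rows, "name"))  # the two M.A. Khan rows collapse to one
```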
Transformation Algorithms: Hash-Merge Example

Name        Address         Dept
M.A. Khan   50, Lvl Rd.     Sales
Saleem      25, LB Rd.      R&D
M.A. Khan   50, Lvl Rd.     PR
Rahim       30, Snky Rd.    Products

[Figure: tuples hashed on the name column into hash buckets; the two M.A. Khan tuples share a hash key and land in the same bucket]

Transformation Algorithms: Sorted Neighborhood Technique for Misspelling Integration
1. Identify a set of data values within a given row as the key
2. Sort the table based on the key
3. Slide a window of n rows over the sorted table and merge data values based on rules (e.g., merge names if all other values such as age, address and dept match)
4. Make multiple passes until there are no more merges of records (a sketch of one pass follows)
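A sketch of one sorted-neighborhood pass, assuming rows are dicts and "merging" is simplified to dropping the later duplicate. The sort key, matching rule and window size are parameters supplied by the caller.

```python
def sorted_neighborhood_pass(rows, sort_key, same, window=3):
    # Steps 1-2: sort the table on the chosen key.
    rows = sorted(rows, key=sort_key)
    survivors = []
    for row in rows:
        # Step 3: compare only against the previous (window - 1) rows.
        if any(same(row, kept) for kept in survivors[-(window - 1):]):
            continue  # merged into an earlier row under the rule
        survivors.append(row)
    return survivors

rows = [
    {"name": "M.A. Khan", "address": "50, Lvl Rd.",  "dept": "Sales"},
    {"name": "Saleem",    "address": "25, LB Rd.",   "dept": "R&D"},
    {"name": "M.A. Khan", "address": "50, Lvl Rd.",  "dept": "PR"},
    {"name": "Rahim",     "address": "30, Snky Rd.", "dept": "Products"},
]
merged = sorted_neighborhood_pass(
    rows,
    sort_key=lambda r: (r["name"], r["address"]),
    same=lambda a, b: a["name"] == b["name"] and a["address"] == b["address"],
)
print(merged)  # the two M.A. Khan rows at the same address merge
```

Step 4 of the technique would wrap this pass in a loop that repeats until no row merges.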
Transformation Algorithms: Sorted Neighborhood Example

Rule: merge rows if name and address match. Window size n = 3.

Before sorting:
Name        Address         Dept
M.A. Khan   50, Lvl Rd.     Sales
Saleem      25, LB Rd.      R&D
M.A. Khan   50, Lvl Rd.     PR
Rahim       30, Snky Rd.    Products

After sorting on the key:
Name        Address         Dept
M.A. Khan   50, Lvl Rd.     Sales
M.A. Khan   50, Lvl Rd.     PR
Rahim       30, Snky Rd.    Products
Saleem      25, LB Rd.      R&D

Transformation Algorithms: Graph-Based Transitive Closure (Monge and Elkan 97, [3])
Reduces the number of passes:
1. Use the sorted neighborhood technique and sort records based on the identified keys
2. Create an undirected graph where nodes correspond to records and edges correspond to the "is a duplicate of" relationship
3. Records R1 and R2 need not be compared in any pass if they already belong to the same connected component
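A sketch of the graph bookkeeping, using a union-find structure to maintain connected components so that pairs already in one component are never compared again. The toy records and the equality-based duplicate rule are placeholders.

```python
class DisjointSet:
    """Union-find: connected components of the 'is a duplicate of' graph."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def window_pass(records, is_duplicate, ds, window=3):
    """One sliding-window pass over sorted records, skipping any pair
    already known to lie in the same connected component."""
    for i in range(len(records)):
        for j in range(i + 1, min(i + window, len(records))):
            if ds.find(i) != ds.find(j) and is_duplicate(records[i], records[j]):
                ds.union(i, j)

# Five hypothetical records that are all duplicates of one another:
records = ["r", "r", "r", "r", "r"]
ds = DisjointSet(len(records))
window_pass(records, lambda a, b: a == b, ds)
print(len({ds.find(i) for i in range(len(records))}))  # 1 component
```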
Transformation Algorithms: Transitive Closure Example

[Figure: five records R1..R5. The naive sliding window compares windows (R1, R2, R3), (R2, R3, R4) and (R3, R4, R5); with graph-based transitive closure, windows (R1, R2, R3) and (R3, R4, R5) suffice, since R1/R2 need not be compared with R4/R5 once they fall into the same connected component]

Integration
- Combining disparate data sources into a single schematic structure
- Schema integration: forming an integrated schematic structure from the disparate data sources
- Data integration: cleaning and merging data from the different sources
Schema Integration
Consider the following schemata [4]:
- Cars(serialno, model, colour, stereo, glasstint, ...)
versus
- Autos(serienNr, modelle, farbe)
- Optionen(serienNr, stereo, glastint, ...)

Schema Integration Challenges
- Naming differences
- Structural differences
- Data type differences
- Missing fields
- Semantic differences
Schema Integration: Generic Architecture of an Integrator

[Figure: a mediator/constructor sits above a set of wrappers/extractors, one per data source]

Wrapper / Extractor
- Creates a common view across all data sources
- Bridges differences in naming, type and schema structure
- Wrappers do not physically extract data from the data sources

Mediator / Constructor
- Constructs an integrated schematic structure
- Performs data integration and populates the data warehouse
Tools for Data Cleaning and Integration: dfPower
- From DataFlux Corporation (http://www.dataflux.com/)
- De-duplication engine
- Analyzes data based on values and number of occurrences
- Does not support detection of semantic duplicates based on user-specified rules
- Permits duplicates to be grouped or merged

Tools for Data Cleaning and Integration: ETI*Data Cleanse
- From Evolutionary Technologies International (http://www.evtech.com/)
- Table-driven data cleaning, matching and quality review, duplicate matching, imprecise spelling correction
- Supports metadata repositories to store schemas, transformation rules, interrelationships, etc.
Tools for Data Cleaning and Integration: SSA Name/Data Clustering Engine
- From Search Software America (http://www.searchsoftware.co.uk/)
- Addresses errors in spelling, typing, transcription, nicknames, synonyms, abbreviations, prefix/suffix variations, punctuation, casing, etc.
- Supports user-specified transformation rules
- Scalable up to 500 M records

Summary
- OLAP versus OLTP
- Characteristics of OLAP queries
- Data warehousing systems
- Data cleaning issues: dirty data, cleaning algorithms
- Integration of data and schema
Introduction to Data Warehousing and OLAP
Part II: Data Models and Warehouse Design

Example OLAP Queries (recap)
- How has employee attrition changed over the years across the company?
- Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
- Is it financially viable to continue our manufacturing unit in Taiwan?
OLAP Query Characteristics
- Aggregation and summarization over large data sets
- Clustering
- Trend detection
- Multi-dimensional projections

A Typical Warehouse
[Figure: a hypercube core surrounded by materialized views]
A Typical Warehouse
- Hypercube core: manages the atomic data elements; the global schematic structure for the entire warehouse; based on the multi-dimensional data model
- Materialized views: physical views for faster aggregate-query answering; a de-normalization of the core

The Sales Hypercube
[Figure: a cube with dimensions Product, Week and Branch]
The Sales Hypercube
- Sales is the fact
- Branch, Product and Week are dimensions

Operations on Hypercubes
- Pivoting: choosing (rotating the cube on a pivot) a set of dimensions for display
- Slicing and dicing: selecting a subset of the cube
- Roll-up: aggregating a dimension into a coarser one, e.g., rolling the week dimension up into months (see the sketch after this list)
- Drill-down: opening up an aggregated dimension to reveal details (e.g., opening up months to reveal week-by-week information)
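A toy sketch of roll-up and slicing over the sales cube, assuming cube cells are stored in a plain dict keyed by (branch, product, week) and that a month is simply four weeks. Branch and product names are made up.

```python
# Cube cells keyed by (branch, product, week) -> sales figure.
cube = {
    ("B1", "P1", 1): 100, ("B1", "P1", 2): 120,
    ("B1", "P1", 5): 90,  ("B2", "P1", 1): 80,
}

def roll_up_weeks(cube, weeks_per_month=4):
    """Roll-up: aggregate the week dimension into a coarser month dimension."""
    rolled = {}
    for (branch, product, week), sales in cube.items():
        month = (week - 1) // weeks_per_month + 1
        key = (branch, product, month)
        rolled[key] = rolled.get(key, 0) + sales
    return rolled

def slice_branch(cube, branch):
    """Slice: fix the branch dimension, keeping the remaining two."""
    return {(p, w): s for (b, p, w), s in cube.items() if b == branch}

print(roll_up_weeks(cube))       # weeks 1-4 aggregate into month 1, week 5 into month 2
print(slice_branch(cube, "B1"))  # the B1 slice of the cube
```

Drill-down is the inverse mapping, which is only possible if the finer-grained cells are still available.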
Implementation of Hypercubes
- Multi-dimensional to relational mapping (ROLAP): map hypercube queries to relational queries and maintain the data cube in a set of RDBMS tables. Example: True Relational OLAP from MicroStrategy Inc. (http://www.microstrategy.com/)
- Native multi-dimensional model (MOLAP): use a separate storage model for multi-dimensional data. Example: Arbor Essbase (http://www.arborsoft.com/)

Physical Models: Star
[Figure: a central Sales fact table (Brnch, Prod, Wk, Sales) joined to Branch, Product and Week dimension tables]
Star Schema
Features
- Central fact table
- Set of supporting dimension tables
- Denormalized data storage
Advantages
- Simple to comprehend and design
- Small amount of metadata
- Quick query responses
Limitations
- Not robust to changes (changes in a dimension table)
- Enormous amount of redundancy in dimension-table data

Physical Models: Snowflake
[Figure: the Sales fact table (Brnch, Prod, Wk, Sales) joined to Branch and Week dimension tables and to a normalized product hierarchy (Product, Division, Options, Scheme, Unit)]
Snowflake Schema
Features
- Central fact table
- Normalized dimension tables storing atomic data units
Advantages
- Little redundancy in dimension data
- Easy updates
Limitations
- Queries need more joins, hence slower responses than a star
- Large amount of metadata
- May result in too many tables
- Harder to comprehend manually

Physical Models: Constellation
[Figure: a main Sales fact table (Brnch, Prod, Wk, Sales) and an auxiliary Discounts fact table (Wk, Prod, Sch, Dist) sharing the Branch, Product, Week and Scheme dimension tables]
Constellation
- Most commonly used architecture
- Used when multiple fact tables are needed
- Usually has a main fact table and several auxiliary fact tables, which are summary tables or materialized views over the main fact table
- Helps answer frequently asked queries faster
- Costlier to update than a snowflake

Issues in Data Cubes
- Curse of high dimensionality: currently known index structures degrade to linear search when the number of dimensions becomes high
- Categorical dimensions: to run certain algorithms such as clustering, dimensions should belong to ordinal classes; categorical dimensions are difficult to index
- Ordinal changes during aggregation: certain dimensions may change their ordinal property when aggregated and should be indexed at several levels (e.g., student names are ordered lexicographically, but when aggregated into classes are ordered by graduation year)
The Time Dimension
- Mandatory in most warehouse applications
- Has several meanings and roll-up techniques depending on the application context: simple calendar-based roll-up, fiscal calendar-based roll-up, academic calendar-based roll-up
- Special dates (releases, events, etc.) need to be indexed separately
- The order of traversal of the time dimension is important

Materialized Views
- Summary tables that create physical views of the fact table
- Trade-off between faster query answering and increased complexity during updates
- When to materialize? Use the result to search space (RSS) ratio for the query, #rows returned / #rows scanned: materialize a summary if the RSS ratio is very small and the query is frequent (see the sketch below)
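The RSS thumb rule as a tiny function. The two threshold values are illustrative assumptions, not from the source; in practice they would be tuned per warehouse.

```python
def should_materialize(rows_returned, rows_scanned, queries_per_day,
                       rss_threshold=0.01, frequency_threshold=100):
    """Materialize a summary view when the result-to-search-space (RSS)
    ratio is very small and the query is asked frequently.
    Both thresholds are made-up defaults for illustration."""
    rss = rows_returned / rows_scanned
    return rss < rss_threshold and queries_per_day > frequency_threshold

# A query scanning 10M rows to return 500, asked 400 times a day:
print(should_materialize(500, 10_000_000, 400))  # True: worth materializing
```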
Revision History Table(s)
- Manage data that is revised over time; queries select the appropriate value based on the relevant version
- Required by most warehousing applications

Id                          Val       Revised
1 (turnover per employee)   110,050   01-01-2000
1                           130,045   01-06-2000
1                           140,011   01-01-2001

Designing a Data Warehouse
Enterprise model -> DW logical model (end-user + DBA) -> DW physical model (DBA + automated tools)
Enterprise to Warehouse
Some thumb rules:
- The warehouse logical model closely resembles the enterprise model
- Some transformation is usually necessary from the enterprise model to the warehouse model
- The warehouse logical model should depict the denormalized data sets implicit in the enterprise model
- Special planning is required for managing the time dimension and revision histories

OLTP to Warehouse Models
- OLTP databases are usually organized around the enterprise model
- OLTP schemas provide a good starting point for designing OLAP logical models
OLTP to Warehouse Models
Some thumb rules when converting OLTP schemas into OLAP schemas:
- Look for operational data fields and remove them (e.g., a counter-sales table containing register number and cashier Emp_id)
- Add a time element (and version elements if necessary) to data sets before populating the warehouse
- Decide on derived data and summary tables at design time itself
- Iterate between transformation rule specification, integration and schema design
- Add the commonly required summary value ALL to every domain

Introduction to Data Warehousing and OLAP
Part III: Index Structures and Query Processing
Classes of Dimensions
- Categorical: {cats, dogs, sheep, cows, bulls, buffaloes}
- Ordinal: totally ordered (integers) or partially ordered (credentials of a candidate)
- Sparse: small number of data points per value
- Dense: large number of data points per value

Multi-dimensional Indexes
- Usually built around ordinal classes
- Different kinds of indexes for sparse and dense data sets
- Performance may depend on the storage structure of the data set
Representing Multi-dimensional Data: Multi-level Sorting
- Sorts data on the different dimensions one after the other
- Simple to implement
- Searching is fast if the dominant attribute is part of the query
- Search becomes fragmented if the dominant attribute is omitted from the query

Dim 1   Dim 2   Dim 3
1       34      2
1       34      5
1       34      10
1       56      20
2       45      9
2       49      10
3       69      20
3       69      30
4       23      29
4       23      48
4       40      50

Representing Multi-dimensional Data: Space Filling Curves
- Sorted on all attributes at once
- The location of a data point is easily computable (see the sketch after this list)
- Suffers with an increase in the number of dimensions

[Figure: a space filling curve traversing a small 2-d grid]
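A sketch of one concrete space filling curve, the Z-order (Morton) curve, chosen here as an illustrative example: the curve position of a 2-d point is computed directly by interleaving the bits of its coordinates, so no per-dimension sort passes are needed.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y to get the point's position on the
    Z-order (Morton) curve; points close in space tend to be close on
    the curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x contributes the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)  # y contributes the odd bits
    return z

points = [(2, 0), (0, 1), (1, 1), (0, 0), (1, 0)]
print(sorted(points, key=lambda p: z_order(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]: the Z-shaped visit order
```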
Multi-dimensional Indexes: Ordered Index on Multiple Attributes
- Considers a composite key as a tuple of simple keys (k1, k2, ..., kn)
- Ordered index files are maintained by ordering each key in sequence

Multi-dimensional Indexes: Partitioned Hashing
- Given a composite key (k1, k2, ..., kn), partitioned hashing returns n different bucket numbers
- The hash bucket is determined by concatenating the n numbers (a sketch follows)
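A minimal sketch of partitioned hashing, assuming each simple key contributes a fixed number of address bits. The 3-bit partition size is an arbitrary choice for illustration.

```python
def partitioned_hash(composite_key, bits_per_key=3):
    """Hash each simple key separately and concatenate the n partial
    bucket numbers into a single bucket address."""
    address = 0
    for k in composite_key:
        address = (address << bits_per_key) | (hash(k) & ((1 << bits_per_key) - 1))
    return address

# Tuples sharing a composite key always share a bucket. A query fixing
# only the first of two keys needs to scan just the 2**3 = 8 buckets
# whose address begins with that key's 3 hash bits.
print(partitioned_hash(("Sales", 2024)))
```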
Multi-dimensional Indexes: Grid Files
- Partition the range of key values for each key into several buckets
- The combinations of buckets of each key form a grid
- A grid file stores the grid like any other multi-dimensional data set (a sketch follows)

Example: a (Grade, Roll No.) grid, where grades A-D index one axis and roll numbers are bucketed along the other as 1: 001-025, 2: 026-050, 3: 051-075, 4: 076-100, 5: 101-125; each grid cell points into the bucket pool
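A sketch of the grid lookup for the example above: each dimension is mapped to a cell index independently, and the pair of indices addresses a directory entry pointing into the bucket pool.

```python
import bisect

# Per-dimension partitions, following the slide's example:
grade_cells = ["A", "B", "C", "D"]      # one cell per grade value
rollno_upper = [25, 50, 75, 100, 125]   # upper bound of each roll-no cell

def grid_cell(grade, rollno):
    """Map a (grade, roll-no) key to its grid cell. The cell pair would
    index a directory whose entries point into the bucket pool."""
    i = grade_cells.index(grade)
    j = bisect.bisect_left(rollno_upper, rollno)
    return (i, j)

print(grid_cell("B", 60))  # (1, 2): grade row B, roll-no range 051-075
```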
Multi-dimensional Indexes: Bit-map Indexes
- Used on fields that are sparse, i.e., have only a small number of distinct values (e.g., gender, grade)
- A bit vector enumerates all possible values and sets the corresponding bit for each data element
- Much more compact than other index structures
- Useful for efficiently answering composite queries over multiple bit-vectored fields
- Can be integrated with tree indexes

Encoding Bit-map Indexes
Grade = {A, B, C, D, E, F}: A = 000001, B = 000010, C = 000100, D = 001000, E = 010000, F = 100000, no value = 000000
Subject = {DB, AI, PDS}: DB = 001, AI = 010, PDS = 100, no value = 000
Query: a student who has scored an A in DB and in AI (000001 && 001 && 010)
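A sketch of the record-oriented view of bitmap indexing, with hypothetical student records carrying one (grade, subject) pair each: one bit vector is built per field value, with bit i set when record i has that value, and composite conditions become bitwise AND/OR over the vectors.

```python
records = [("A", "DB"), ("B", "AI"), ("A", "AI"), ("A", "PDS")]

def bitmap(records, position, value):
    """Bit vector for one field value: bit i set iff record i matches."""
    bits = 0
    for i, rec in enumerate(records):
        if rec[position] == value:
            bits |= 1 << i
    return bits

grade_a = bitmap(records, 0, "A")
subj_db = bitmap(records, 1, "DB")
subj_ai = bitmap(records, 1, "AI")

# Records with grade A in DB or in AI: AND the grade vector with the
# OR of the two subject vectors.
hits = grade_a & (subj_db | subj_ai)
print([i for i in range(len(records)) if hits >> i & 1])  # [0, 2]
```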
Multi-dimensional Tree Indexes: KD Trees
- A binary tree structure that can store n-dimensional data points
- Each dimension is compared at the appropriate level
- Useful for point queries

KD Trees: Example
- Let data be represented as 2-dimensional points of the form (x, y) denoting (salary, age)
- Example data set: (2500, 20), (5000, 32), (4500, 28), (2000, 23), (4800, 25), (1800, 18), (6500, 27)
[Figure: the KD tree built from the example points. (2500, 20) sits at the root; (2000, 23) and (1800, 18) fall in the left subtree, while (5000, 32), (4500, 28), (4800, 25) and (6500, 27) fall in the right subtree]
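A minimal KD-tree sketch matching the example: salary is compared at even levels and age at odd levels. Class and function names are illustrative.

```python
class KDNode:
    def __init__(self, point):
        self.point = point            # (salary, age)
        self.left = self.right = None

def insert(root, point, depth=0):
    """Insert into a 2-d KD tree; the comparison axis alternates by level."""
    if root is None:
        return KDNode(point)
    axis = depth % 2
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root

def contains(root, point, depth=0):
    """Point query: follow a single root-to-leaf path."""
    if root is None:
        return False
    if root.point == point:
        return True
    axis = depth % 2
    child = root.left if point[axis] < root.point[axis] else root.right
    return contains(child, point, depth + 1)

root = None
for p in [(2500, 20), (5000, 32), (4500, 28), (2000, 23),
          (4800, 25), (1800, 18), (6500, 27)]:
    root = insert(root, p)
print(contains(root, (4800, 25)))  # True
```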
KD Trees (continued)
- Each point divides the search space along one of the dimensions
- The structure of the tree (and hence its performance) is sensitive to the order of insertion of the data points

Quad Trees
- Initially, the index contains only one bucket representing the entire space
- If the number of data points in any bucket exceeds the maximum limit, the bucket is split in two along each dimension, and the resulting buckets are added as children of the parent bucket
- When the number of dimensions is 2, splitting results in a quad
Quad Trees
[Figure: a space recursively split into quadrants, with the corresponding quad tree]

R Trees
- Manage regions: leaf nodes represent data regions and non-leaf nodes represent virtual (non-data) regions
- A node is split when it contains too many regions
- Insertion of a region starts from the root node and proceeds until the smallest accommodating region is found (possibly splitting one or more regions along the way)
- Sibling regions may overlap, but may not subsume one another
R Trees
[Figure: nested rectangles, with virtual regions enclosing data regions]

R Trees (continued)
- Suitable for range, neighborhood and nearness searches
- Tree structure and performance are sensitive to the order in which data regions are added
- Suffers from the curse of high dimensionality
Indexing Categorical Data (Ref [7])
Categorical data
- Have no ordinal relationship: cannot be compared except for equality
- Can be represented as sets in many cases
- Example categorical attributes: team members of a given project, ingredients of a given recipe, products manufactured by a unit, etc.
- Comparison operators on sets: equality, membership, superset, subset

Signatures
- Represent a set as a bitmap where each bit corresponds to an object in a larger universe of discourse (UoD)
- UoD = {set of all ingredients}; S, T are the ingredient sets of two recipes; s, t are the corresponding bitmaps of S and T
- Queries: S is a subset of T iff s AND NOT t = 0; S is a superset of T iff t AND NOT s = 0 (see the sketch after this list)
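A small sketch of signatures over a made-up ingredient universe, showing the two bit tests from the slide.

```python
UOD = ["flour", "sugar", "eggs", "milk", "butter"]  # universe of discourse

def signature(items):
    """Bitmap over the UoD: bit i is set iff UOD[i] is in the set."""
    return sum(1 << i for i, obj in enumerate(UOD) if obj in items)

s = signature({"flour", "sugar"})           # recipe S
t = signature({"flour", "sugar", "eggs"})   # recipe T

print(s & ~t == 0)  # True:  S is a subset of T   (s AND NOT t = 0)
print(t & ~s == 0)  # False: S is not a superset of T (t AND NOT s != 0)
```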
Signature Trees
- Leaf nodes contain (signature, data pointer) pairs
- Non-leaf nodes are formed by bit-wise ORing the signatures of their children
- Traverse the tree by ANDing the query signature with the node signature (a sketch follows the example tree)

Example tree:
          1111
         /    \
     1100      1011
     /  \      /  \
  1000  0100 1001  0011
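A traversal sketch over the example tree above, assuming a superset query: descend only into subtrees whose ORed signature covers every query bit. Node and record names are illustrative.

```python
class SigNode:
    def __init__(self, sig, children=None, data=None):
        self.sig = sig                  # OR of all signatures below
        self.children = children or []
        self.data = data                # data pointer at leaf nodes

def search(node, query, hits):
    """Prune any subtree whose signature cannot cover the query bits."""
    if query & node.sig != query:
        return
    if not node.children:
        hits.append(node.data)
        return
    for child in node.children:
        search(child, query, hits)

# The tree from the figure: root 1111 over subtrees 1100 and 1011.
tree = SigNode(0b1111, [
    SigNode(0b1100, [SigNode(0b1000, data="r1"), SigNode(0b0100, data="r2")]),
    SigNode(0b1011, [SigNode(0b1001, data="r3"), SigNode(0b0011, data="r4")]),
])
hits = []
search(tree, 0b1000, hits)
print(hits)  # ['r1', 'r3']: the only signatures with the queried bit set
```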
Extensible Signature Hashing
- Hash tables are constructed based on the most significant d bits of the signature
- Hash levels are extended by extending d whenever an overflow occurs

[Figure: a directory of global depth n = 3 (entries 000 through 111). One bucket of local depth d = 2 holds records whose hash values start with 00; two buckets of local depth d = 3 hold records whose hash values start with 010 and 011; one bucket of local depth d = 1 holds all records whose hash values start with 1]

Summary
- The OLAP hypercube
- Materialized views
- ROLAP and MOLAP implementations
- Star, snowflake and constellation schemas
- Time dimensions and revision tables
- Thumb rules for OLAP design
- Multi-dimensional index structures
Further Topics
Not addressed here for reasons of brevity:
- Query language constructs
- Data mining over warehouses
- Handling semi-structured data in warehouses
- Performance tuning
- Maintenance of materialized views
- Browsing and visualization

Thank You
References
1. Erhard Rahm and Hong Hai Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Vol. 23, No. 4, Dec 2000.
2. Vijay T. Raisinghani. Cleaning Methods in Data Warehousing. PhD seminar report, IIT Bombay, Dec 1999.
3. A. Monge and C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, May 1997. http://citeseer.ist.psu.edu/monge97efficient.html
4. H. Garcia-Molina, J. D. Ullman and J. Widom. Database Systems: The Complete Book. Pearson Education, 2004.
5. R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases. ICDE, 1997.
6. Oliver Guenther. Data Warehouses and Data Mining. Course notes, Humboldt University, Berlin. http://www.wiwi.hu-berlin.de/~guenther/dw/dw_ws03.html
7. S. Helmer and G. Moerkotte. A Study of Four Index Structures for Set-Valued Attributes of Low Cardinality. Reihe Informatik 2, University of Mannheim, 1999.
Conferences and Workshops
- DaWaK: Data Warehousing and Knowledge Discovery (http://www.dexa.org/)
- VLDB: Very Large Databases (http://www.vldb.org/)
- EDBT: Extending Database Technology (http://www.edbt.org/)
- DOLAP: ACM International Workshop on Data Warehousing and OLAP (http://www.cis.drexel.edu/faculty/song/dolap.html)

Some WWW Links
- DW Infocenter (http://www.dwinfocenter.org/)
- Data.com (http://www.data.com/)
- The Data Warehousing Institute (http://www.dw-institute.com/)
- KDNuggets, a comprehensive portal on knowledge discovery (http://www.kdnuggets.com/)
- Oracle Data Warehousing Tutorial (registration required) (http://www.oracle.com/technology/idevelop/online/courses/oln/how_to04.html)
- Data Warehouse: Online Resources (http://www.dci.com/news/datawarehouse/articles/1998/05/links.htm)
- Data Warehousing and OLAP bibliography (http://www.ondelette.com/olap/dwbib.html)