CSI 4352, Introduction to Data Mining Chapter 3, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 1
What is Data Warehouse? Data Warehouse ( defined in many different ways ) A decision support database that is maintained separately from the organization s operational database The support of information processing by providing a solid platform of consolidated, historical data for analysis A data warehouse is a (1) subject-oriented, (2) integrated, (3) time-variant, and (4) nonvolatile collection of data in support of management s decision-making process. W. H. Inmon Data Warehousing The process of constructing and using data warehouses Data Warehouse Subject-Oriented Organized around major subjects e.g., customers, products, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues Excluding data that are not useful in the decision support process 2
Data Warehouse Integrated Integrating multiple, heterogeneous data sources Relational databases, flat files, on-line transaction records Apply data cleaning and data integration techniques Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources Data Warehouse Time Variant The time horizon of data warehouses is significantly longer than that of operational systems Operational databases have current data values Data warehouses provide information from a historical perspective (e.g., 10-20 years) Time is a key structure in data warehouses Contain the attribute of time (explicitly or implicitly) 3
Data Warehouse Nonvolatile A physically separate storage of data transformed from operational databases Operational update of data does not occur Not require transaction processing, recovery, and concurrency control mechanisms Require only two operations, initial loading of data and access of data Data Integration Methods Methods (1) Process to provide uniform interface to multiple data sources Tradition Database Integration (2) Process to combine multiple data sources into coherent storage Data warehousing Traditional DB Integration A query-driven approach Wrappers / mediators on top of heterogeneous data sources Data Warehousing An update-driven approach Combined the heterogeneous data sources in advance Stored them in a warehouse for direct query and analysis 4
OLTP vs. OLAP OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc. OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Distinct Features (OLTP vs. OLAP) User and system orientation (customers vs. market analysts) Data contents (current, detailed vs. historical, consolidated) Database design (ER + application vs. star + subject) View (current, local vs. evolutionary, integrated) OLTP vs. OLAP Feature OLTP OLAP Characteristic operational processing information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans) 5
Why Data Warehouse? Performance Issue DBMS: tuned for OLTP e.g., access methods, indexing, concurrency control, recovery Data warehouse: tuned for OLAP e.g., complex queries, multidimensional view, consolidation Data Issue Decision support requires historical data, consolidated and summarized data, consistent data CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 6
Data Format for Warehouse Dimensions Multi-dimensional data model Data are stored in the form of a data cube Data Cube A view of multi-dimensions Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables Cuboid Each combination of dimensional spaces in a data cube 0-D cuboid, 1-D cuboid, 2-D cuboid,, n-d cuboid The Lattice of Cuboids all 0-D (apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier time,supplier item,supplier 2-D cuboids time,item,location time,location,supplier time,item,supplier item,location,supplier time, item, location, supplier 3-D cuboids 4-D (base) cuboid 7
Conceptual Modeling Key of Modeling Data Warehouses Handling dimensions & measures Examples Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Example of Star Schema time time_key day day_of_the_week month quarter year branch branch_key branch_name branch_type Measures Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales item item_key item_name brand type supplier_type location location_key street city state country 8
Example of Snowflake Schema time time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key item item_key item_name brand type supplier_key supplier supplier_key supplier_type branch branch_key branch_name branch_type Measures branch_key location_key units_sold dollars_sold avg_sales location location_key street city_key city city_key city state country Example of Fact Constellation time time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key item item_key item_name brand type supplier_type Shipping Fact Table time_key item_key shipper_key from_location branch branch_key branch_name branch_type Measures branch_key location_key units_sold dollars_sold avg_sales location location_key street city state country to_location dollars_cost units_shipped shipper shipper_key shipper_name location_key shipper_type 9
Cube Definition in DMQL Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_dimension_list>) Special Case (Shared Dimension Table) define dimension <dimension_name> as <dimension_name_first> in cube <cube_name_first> Star Schema Definition in DMQL Example define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, state, country) 10
Snowflake Schema Definition in DMQL Example define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) Fact Constellation Schema Definition in DMQL Example define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales 11
Measures Distributive If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning e.g., count(), sum(), min(), max() Algebraic If it can be computed by an algebraic function with m arguments, each of which is obtained by applying a distributive aggregate function e.g., avg(), standard_deviation() Holistic If there is no constant bound on the storage size needed to describe a subaggregate e.g., median(), mode(), rank() Concept Hierarchy Schema Hierarchy e.g., street < city < state < country e.g., day < { month < quarter ; week } < year Year Quarter Week Month Day Set-group hierarchy e.g., { (0..100] ; (100..200] } < (0..200] e.g., { (0..10] < lowprice ; { (10..100] ; (100..200] } < highprice } < allproducts allproducts (0..200] lowprice highprice (0..100] (100..200] (0..10] (10..100] (100..200] 12
Three Components of Data Cube Example Measures: data values as a function of products, locations and time Dimensions: Hierarchies: product location time locations Company Category Product Country City Office Year Quarter Week Month Day time Example of Cuboid Cells time dimension product dimension all location dimension 0-D (apex) cuboid product quarter country product, quarter product, country quarter, country product, quarter, country 1-D cuboids 2-D cuboids 3-D (base) cuboid 13
Example of Data Cube TV PC VCR sum Quarter 1Qtr 2Qtr 3Qtr 4Qtr sum Total annual sales of TV in U.S.A. U.S.A Canada Mexico Country sum OLAP Operations Roll-up (drill-up) Summarizes (aggregates) data by climbing up hierarchy or by dimension reduction Drill-down (roll-down) Reverse of roll-up by stepping down to lower-level data or introducing new dimensions Slice Selecting data on one dimension Dice Selecting data on multi-dimensions Pivot (rotate) Reorienting the cube, or transforming 3-D data to a series of 2-D spaces 14
Roll-UP & Drill-Down location country city quarter time quarter country month Slice, Dice & Pivot location country supercom pc laptop mexico canada USA Q1 Q2 Q3 Q4 quarter time country country mexico pc laptop canada USA Q1 Q2 quarter canada USA Q1 Q2 Q3 Q4 quarter 15
Starnet Query Model Shipping Method air Orders contracts Customer customerid Time Location annually region country quarterly truck Each circle is called a footprint daily city order Promotion item sales-person division Product category division Organization CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 16
Data Warehouse Design Top-Down View Allows the selection of the relevant information necessary for the data warehouse Data Source View Exposes the information being captured, stored, and managed by operational systems Data Warehouse View Consists of fact tables and dimension tables Business Query View Shows the perspectives of data in the warehouse to end-users Data Warehouse Design Process Categories by Process Direction Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) Categories by Software Engineering View Waterfall: structured, systematic analysis at each step before proceeding to the next Spiral: rapid generation of functional systems, short turn around time Typical Data Warehouse Design Process Choose business processes for modeling, e.g., orders, invoices, etc Choose the grain (atomic level of data) of the business processes Choose the dimensions that will apply to each fact table Choose the measures that will populate each fact table 17
Data Warehouse Architecture Operational DBs Metadata Monitor & Integrator OLAP Server Other sources Extract Transform Load Refresh Data Warehouse Serve Query Analysis Reports Data Marts Data Sources Data Storage OLAP Engine Front-End Tools Three Data Warehouse Models Enterprise Warehouse A global view with all the information about subjects spanning the entire organization Data Mart A subset of corporate-wide data that is of value to a specific group of users Its scope is confined to specific, selected groups Independent vs. dependent (directly from warehouse) data mart Virtual Warehouse A set of views over operational databases Only some of possible summary views may be materialized 18
Development of Data Warehouse Multi-Tier Data Warehouse Distributed Data Marts Data Mart Data Mart Enterprise Data Warehouse Model Refinement Model Refinement Define a high-level corporate data model Utilities of Back-End Tools Data Extraction Get data from multiple, heterogeneous, and external sources Data Cleaning Detect errors in the data and rectify them when possible Data Transformation Convert data from the original format to the warehouse format Loading Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Refresh Propagate the updates from data sources to the warehouse 19
OLAP Server Architecture Relational OLAP (ROLAP) Uses relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware Includes optimization of DBMS back-end, implementation of aggregation navigation logic, and additional tools and services High scalability Multidimensional OLAP (MOLAP) Sparse array-based multidimensional storage engine Fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP) Low-level: relational / high-level: array High flexibility Metadata Repository Definition of Metadata The data defining data warehouse objects Examples Description of the structure of the data warehouse, e.g., schema, view, dimensions, hierarchies, data definitions, data mart locations and contents Operational meta-data, e.g., history of migrated data, currency of data, warehouse usage statistics, error reports Algorithms used for summarization Mapping from operational environment to data warehouse Data related to system performance Business data, e.g., business terms and definitions, ownership of data, charging policies 20
CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining Data Cube Computation View as a Lattice of Cuboids How many cuboids in an n-dimensional cube? n 2 How many cuboid cells in an n-dimensional cube with L i levels? product, n ( time L i 1) i 1 Materialization of Data Cube Full materialization (all cuboids), Partial materialization (some cuboids), No materialization (only base cuboid) Selection of cuboids to materialize Based on the size, sharing, access frequency, etc. all product time location product, location time, location product, time, location 21
Cube Operation Cube Definition and Computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales Cube Definition and Computation in SQL select item, city, year, SUM (amount) from sales cube by item, city, year Internal Operations group by (item, city, year) group by (item, city), (item, year), (city, year) group by (item), (city), (year) group by () Iceberg Cube Iceberg Cube Computation Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= min_sup Motivation Only a small portion of cube cells may be above the water in a sparse cube Only calculate interesting cells data above certain threshold Avoid explosive growth of the cube 22
Indexing OLAP Data Bitmap Indexing Index on a particular column Each value in the column has a bit vector The length of bit vector is the number of records in the base table The i th bit is set if the i th row of the base table has the value Not suitable for high cardinality domains Base Table Index on Region Index on Type CusID Region Type RecID Asia Europe US RecID Retail Dealer C1 Asia Retail 1 1 0 0 1 1 0 C2 Europe Dealer 2 0 1 0 2 0 1 C3 Asia Dealer 3 1 0 0 3 0 1 C4 US Retail 4 0 0 1 4 1 0 C5 Europe Dealer 5 0 1 0 5 0 1 Join Indexing Link the value of dimensions to rows in the fact table CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 23
Data Warehouse Usage Information Processing Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical Processing Supports OLAP operations in multi-dimensional space Data Mining Supports pattern discovery from warehouse data, and presenting the mining results using visualization tools From OLAP To OLAM On-Line Analytical Mining (OLAM) ( OLAP + Data Mining ) in data warehouse Why OLAM? High quality data Data warehouse contains integrated, consistent and cleaned data Information processing infrastructure ODBC/OLE DB connections, web accessing, service facilities OLAP-based exploratory data analysis Mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions Integration and swapping of multiple data mining functions 24
OLAM System Architecture mining query User GUI API mining result Layer4 User Interface OLAM Engine OLAP Engine Layer3 OLAP/OLAM Data Cube API Data Warehouse Meta Data Layer2 MDDB Database API data cleaning data integration Layer1 Databases Data Repository Questions? References Gray, J., et al., Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery, Vol. 1 (1997) Lecture Slides are found on the Course Website, www.ecs.baylor.edu/faculty/cho/4352 25