Chapter 3, Data Warehouse and OLAP Operations

Similar documents

Data Warehousing and OLAP Technology

Data Warehouse. MIT-652 Data Mining Applications. Thimaporn Phetkaew. School of Informatics, Walailak University. MIT-652: DM 2: Data Warehouse 1

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Data W a Ware r house house and and OLAP Week 5 1

Data Mining for Knowledge Management. Data Warehouses

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

Lecture 2 Data warehousing

Overview of Data Warehousing and OLAP

Data Warehousing and Online Analytical Processing

DATA WAREHOUSING AND OLAP TECHNOLOGY

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

TIES443. Lecture 3: Data Warehousing. Lecture 3. Data Warehousing. Course webpage:

2 Data Warehouse and OLAP Technology for Data Mining What is a data warehouse? Amultidimensional data model... 6

This tutorial will help computer science graduates to understand the basic-toadvanced concepts related to data warehousing.

Building Data Cubes and Mining Them. Jelena Jovanovic

DATA WAREHOUSING - OLAP

Data Warehousing and elements of Data Mining

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data W a Ware r house house and and OLAP II Week 6 1

Learning Objectives. Definition of OLAP Data cubes OLAP operations MDX OLAP servers

Data Warehousing & OLAP

Part 22. Data Warehousing

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Week 13: Data Warehousing. Warehousing

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

CHAPTER 4 Data Warehouse Architecture

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Week 3 lecture slides

14. Data Warehousing & Data Mining

A Technical Review on On-Line Analytical Processing (OLAP)

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

Data Warehousing and OLAP

CHAPTER 3. Data Warehouses and OLAP

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

Data Warehousing and OLAP Technology for Knowledge Discovery

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

OLAP Systems and Multidimensional Expressions I

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Lection 3-4 WAREHOUSING

Data Warehousing Systems: Foundations and Architectures

Multi-dimensional index structures Part I: motivation

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

Lecture 2: Introduction to Business Intelligence. Introduction to Business Intelligence

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

Data Warehousing: Data Models and OLAP operations. By Kishore Jaladi

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Fluency With Information Technology CSE100/IMT100

IST722 Data Warehousing

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

A Critical Review of Data Warehouse

New Approach of Computing Data Cubes in Data Warehousing

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Basics of Dimensional Modeling

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

Data Warehouse: Introduction

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Module 1: Introduction to Data Warehousing and OLAP

B.Sc (Computer Science) Database Management Systems UNIT-V

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Warehouse Technology And The MSD Databases

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

Designing a Dimensional Model

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Dimensional Modeling for Data Warehouse

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

BUSINESS ANALYTICS AND DATA VISUALIZATION. ITM-761 Business Intelligence ดร. สล ล บ ญพราหมณ

Data Warehousing. Paper

The Study on Data Warehouse Design and Usage

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

Advanced Data Management Technologies

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

OLAP and Data Warehousing! Introduction!

When to consider OLAP?

CS2032 Data warehousing and Data Mining Unit II Page 1

Data Warehousing, OLAP, and Data Mining

Chapter 7 Multidimensional Data Modeling (MDDM)

IT0457 Data Warehousing. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Turkish Journal of Engineering, Science and Technology

DATA WAREHOUSE CONCEPTS DATA WAREHOUSE DEFINITIONS

UNIT-3 OLAP in Data Warehouse

Data Warehousing and Data Mining

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

M Designing and Implementing OLAP Solutions Using Microsoft SQL Server Day Course

Transcription:

CSI 4352, Introduction to Data Mining Chapter 3, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 1

What is Data Warehouse? Data Warehouse ( defined in many different ways ) A decision support database that is maintained separately from the organization s operational database The support of information processing by providing a solid platform of consolidated, historical data for analysis A data warehouse is a (1) subject-oriented, (2) integrated, (3) time-variant, and (4) nonvolatile collection of data in support of management s decision-making process. W. H. Inmon Data Warehousing The process of constructing and using data warehouses Data Warehouse Subject-Oriented Organized around major subjects e.g., customers, products, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues Excluding data that are not useful in the decision support process 2

Data Warehouse Integrated Integrating multiple, heterogeneous data sources Relational databases, flat files, on-line transaction records Apply data cleaning and data integration techniques Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources Data Warehouse Time Variant The time horizon of data warehouses is significantly longer than that of operational systems Operational databases have current data values Data warehouses provide information from a historical perspective (e.g., 10-20 years) Time is a key structure in data warehouses Contain the attribute of time (explicitly or implicitly) 3

Data Warehouse Nonvolatile A physically separate storage of data transformed from operational databases Operational update of data does not occur Not require transaction processing, recovery, and concurrency control mechanisms Require only two operations, initial loading of data and access of data Data Integration Methods Methods (1) Process to provide uniform interface to multiple data sources Tradition Database Integration (2) Process to combine multiple data sources into coherent storage Data warehousing Traditional DB Integration A query-driven approach Wrappers / mediators on top of heterogeneous data sources Data Warehousing An update-driven approach Combined the heterogeneous data sources in advance Stored them in a warehouse for direct query and analysis 4

OLTP vs. OLAP OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc. OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Distinct Features (OLTP vs. OLAP) User and system orientation (customers vs. market analysts) Data contents (current, detailed vs. historical, consolidated) Database design (ER + application vs. star + subject) View (current, local vs. evolutionary, integrated) OLTP vs. OLAP Feature OLTP OLAP Characteristic operational processing information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans) 5

Why Data Warehouse? Performance Issue DBMS: tuned for OLTP e.g., access methods, indexing, concurrency control, recovery Data warehouse: tuned for OLAP e.g., complex queries, multidimensional view, consolidation Data Issue Decision support requires historical data, consolidated and summarized data, consistent data CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 6

Data Format for Warehouse Dimensions Multi-dimensional data model Data are stored in the form of a data cube Data Cube A view of multi-dimensions Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables Cuboid Each combination of dimensional spaces in a data cube 0-D cuboid, 1-D cuboid, 2-D cuboid,, n-d cuboid The Lattice of Cuboids all 0-D (apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier time,supplier item,supplier 2-D cuboids time,item,location time,location,supplier time,item,supplier item,location,supplier time, item, location, supplier 3-D cuboids 4-D (base) cuboid 7

Conceptual Modeling Key of Modeling Data Warehouses Handling dimensions & measures Examples Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Example of Star Schema time time_key day day_of_the_week month quarter year branch branch_key branch_name branch_type Measures Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales item item_key item_name brand type supplier_type location location_key street city state country 8

Example of Snowflake Schema time time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key item item_key item_name brand type supplier_key supplier supplier_key supplier_type branch branch_key branch_name branch_type Measures branch_key location_key units_sold dollars_sold avg_sales location location_key street city_key city city_key city state country Example of Fact Constellation time time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key item item_key item_name brand type supplier_type Shipping Fact Table time_key item_key shipper_key from_location branch branch_key branch_name branch_type Measures branch_key location_key units_sold dollars_sold avg_sales location location_key street city state country to_location dollars_cost units_shipped shipper shipper_key shipper_name location_key shipper_type 9

Cube Definition in DMQL Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_dimension_list>) Special Case (Shared Dimension Table) define dimension <dimension_name> as <dimension_name_first> in cube <cube_name_first> Star Schema Definition in DMQL Example define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, state, country) 10

Snowflake Schema Definition in DMQL Example define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) Fact Constellation Schema Definition in DMQL Example define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales 11

Measures Distributive If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning e.g., count(), sum(), min(), max() Algebraic If it can be computed by an algebraic function with m arguments, each of which is obtained by applying a distributive aggregate function e.g., avg(), standard_deviation() Holistic If there is no constant bound on the storage size needed to describe a subaggregate e.g., median(), mode(), rank() Concept Hierarchy Schema Hierarchy e.g., street < city < state < country e.g., day < { month < quarter ; week } < year Year Quarter Week Month Day Set-group hierarchy e.g., { (0..100] ; (100..200] } < (0..200] e.g., { (0..10] < lowprice ; { (10..100] ; (100..200] } < highprice } < allproducts allproducts (0..200] lowprice highprice (0..100] (100..200] (0..10] (10..100] (100..200] 12

Three Components of Data Cube Example Measures: data values as a function of products, locations and time Dimensions: Hierarchies: product location time locations Company Category Product Country City Office Year Quarter Week Month Day time Example of Cuboid Cells time dimension product dimension all location dimension 0-D (apex) cuboid product quarter country product, quarter product, country quarter, country product, quarter, country 1-D cuboids 2-D cuboids 3-D (base) cuboid 13

Example of Data Cube TV PC VCR sum Quarter 1Qtr 2Qtr 3Qtr 4Qtr sum Total annual sales of TV in U.S.A. U.S.A Canada Mexico Country sum OLAP Operations Roll-up (drill-up) Summarizes (aggregates) data by climbing up hierarchy or by dimension reduction Drill-down (roll-down) Reverse of roll-up by stepping down to lower-level data or introducing new dimensions Slice Selecting data on one dimension Dice Selecting data on multi-dimensions Pivot (rotate) Reorienting the cube, or transforming 3-D data to a series of 2-D spaces 14

Roll-UP & Drill-Down location country city quarter time quarter country month Slice, Dice & Pivot location country supercom pc laptop mexico canada USA Q1 Q2 Q3 Q4 quarter time country country mexico pc laptop canada USA Q1 Q2 quarter canada USA Q1 Q2 Q3 Q4 quarter 15

Starnet Query Model Shipping Method air Orders contracts Customer customerid Time Location annually region country quarterly truck Each circle is called a footprint daily city order Promotion item sales-person division Product category division Organization CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 16

Data Warehouse Design Top-Down View Allows the selection of the relevant information necessary for the data warehouse Data Source View Exposes the information being captured, stored, and managed by operational systems Data Warehouse View Consists of fact tables and dimension tables Business Query View Shows the perspectives of data in the warehouse to end-users Data Warehouse Design Process Categories by Process Direction Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) Categories by Software Engineering View Waterfall: structured, systematic analysis at each step before proceeding to the next Spiral: rapid generation of functional systems, short turn around time Typical Data Warehouse Design Process Choose business processes for modeling, e.g., orders, invoices, etc Choose the grain (atomic level of data) of the business processes Choose the dimensions that will apply to each fact table Choose the measures that will populate each fact table 17

Data Warehouse Architecture Operational DBs Metadata Monitor & Integrator OLAP Server Other sources Extract Transform Load Refresh Data Warehouse Serve Query Analysis Reports Data Marts Data Sources Data Storage OLAP Engine Front-End Tools Three Data Warehouse Models Enterprise Warehouse A global view with all the information about subjects spanning the entire organization Data Mart A subset of corporate-wide data that is of value to a specific group of users Its scope is confined to specific, selected groups Independent vs. dependent (directly from warehouse) data mart Virtual Warehouse A set of views over operational databases Only some of possible summary views may be materialized 18

Development of Data Warehouse Multi-Tier Data Warehouse Distributed Data Marts Data Mart Data Mart Enterprise Data Warehouse Model Refinement Model Refinement Define a high-level corporate data model Utilities of Back-End Tools Data Extraction Get data from multiple, heterogeneous, and external sources Data Cleaning Detect errors in the data and rectify them when possible Data Transformation Convert data from the original format to the warehouse format Loading Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Refresh Propagate the updates from data sources to the warehouse 19

OLAP Server Architecture Relational OLAP (ROLAP) Uses relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware Includes optimization of DBMS back-end, implementation of aggregation navigation logic, and additional tools and services High scalability Multidimensional OLAP (MOLAP) Sparse array-based multidimensional storage engine Fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP) Low-level: relational / high-level: array High flexibility Metadata Repository Definition of Metadata The data defining data warehouse objects Examples Description of the structure of the data warehouse, e.g., schema, view, dimensions, hierarchies, data definitions, data mart locations and contents Operational meta-data, e.g., history of migrated data, currency of data, warehouse usage statistics, error reports Algorithms used for summarization Mapping from operational environment to data warehouse Data related to system performance Business data, e.g., business terms and definitions, ownership of data, charging policies 20

CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining Data Cube Computation View as a Lattice of Cuboids How many cuboids in an n-dimensional cube? n 2 How many cuboid cells in an n-dimensional cube with L i levels? product, n ( time L i 1) i 1 Materialization of Data Cube Full materialization (all cuboids), Partial materialization (some cuboids), No materialization (only base cuboid) Selection of cuboids to materialize Based on the size, sharing, access frequency, etc. all product time location product, location time, location product, time, location 21

Cube Operation Cube Definition and Computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales Cube Definition and Computation in SQL select item, city, year, SUM (amount) from sales cube by item, city, year Internal Operations group by (item, city, year) group by (item, city), (item, year), (city, year) group by (item), (city), (year) group by () Iceberg Cube Iceberg Cube Computation Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= min_sup Motivation Only a small portion of cube cells may be above the water in a sparse cube Only calculate interesting cells data above certain threshold Avoid explosive growth of the cube 22

Indexing OLAP Data Bitmap Indexing Index on a particular column Each value in the column has a bit vector The length of bit vector is the number of records in the base table The i th bit is set if the i th row of the base table has the value Not suitable for high cardinality domains Base Table Index on Region Index on Type CusID Region Type RecID Asia Europe US RecID Retail Dealer C1 Asia Retail 1 1 0 0 1 1 0 C2 Europe Dealer 2 0 1 0 2 0 1 C3 Asia Dealer 3 1 0 0 3 0 1 C4 US Retail 4 0 0 1 4 1 0 C5 Europe Dealer 5 0 1 0 5 0 1 Join Indexing Link the value of dimensions to rows in the fact table CSI 4352, Introduction to Data Mining Lecture 3, Data Warehouse & OLAP Operations Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining 23

Data Warehouse Usage Information Processing Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical Processing Supports OLAP operations in multi-dimensional space Data Mining Supports pattern discovery from warehouse data, and presenting the mining results using visualization tools From OLAP To OLAM On-Line Analytical Mining (OLAM) ( OLAP + Data Mining ) in data warehouse Why OLAM? High quality data Data warehouse contains integrated, consistent and cleaned data Information processing infrastructure ODBC/OLE DB connections, web accessing, service facilities OLAP-based exploratory data analysis Mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions Integration and swapping of multiple data mining functions 24

OLAM System Architecture mining query User GUI API mining result Layer4 User Interface OLAM Engine OLAP Engine Layer3 OLAP/OLAM Data Cube API Data Warehouse Meta Data Layer2 MDDB Database API data cleaning data integration Layer1 Databases Data Repository Questions? References Gray, J., et al., Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery, Vol. 1 (1997) Lecture Slides are found on the Course Website, www.ecs.baylor.edu/faculty/cho/4352 25