Lecture 2 Data warehousing



Similar documents
Chapter 3, Data Warehouse and OLAP Operations

Data Warehousing and OLAP Technology

Data Warehouse. MIT-652 Data Mining Applications. Thimaporn Phetkaew. School of Informatics, Walailak University. MIT-652: DM 2: Data Warehouse 1

Data W a Ware r house house and and OLAP Week 5 1

Data Warehousing & OLAP

Data Mining for Knowledge Management. Data Warehouses

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Overview of Data Warehousing and OLAP

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT


Data Warehousing and Online Analytical Processing

DATA WAREHOUSING AND OLAP TECHNOLOGY

TIES443. Lecture 3: Data Warehousing. Lecture 3. Data Warehousing. Course webpage:

Data Mining. Session 4 Main Theme Data Warehousing and OLAP. Dr. Jean-Claude Franchitti

2 Data Warehouse and OLAP Technology for Data Mining What is a data warehouse? Amultidimensional data model... 6

OLAP and Data Warehousing! Introduction!

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

Data Warehousing & OLAP

Data Warehousing & OLAP

This tutorial will help computer science graduates to understand the basic-toadvanced concepts related to data warehousing.

Data Warehousing & OLAP

Building Data Cubes and Mining Them. Jelena Jovanovic

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Data Warehousing and elements of Data Mining

DATA WAREHOUSING - OLAP

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Lecture 2: Introduction to Business Intelligence. Introduction to Business Intelligence

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Data Warehousing and OLAP.

1960s 1970s 1980s 1990s. Slow access to

Part 22. Data Warehousing

Learning Objectives. Definition of OLAP Data cubes OLAP operations MDX OLAP servers

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

A Technical Review on On-Line Analytical Processing (OLAP)

Data W a Ware r house house and and OLAP II Week 6 1

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

Data Warehouse: Introduction

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Multi-dimensional index structures Part I: motivation

Review. Data Warehousing. Today. Star schema. Star join indexes. Dimension hierarchies

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Data Warehousing and OLAP

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

CS54100: Database Systems

Integrate multiple, heterogeneous data sources. Data cleaning and data integration techniques are applied

Week 3 lecture slides

Data Warehousing: Data Models and OLAP operations. By Kishore Jaladi

Data Warehousing and OLAP Technology for Knowledge Discovery

Week 13: Data Warehousing. Warehousing

CHAPTER 3. Data Warehouses and OLAP

(Week 10) A04. Information System for CRM. Electronic Commerce Marketing

Lecture Data Warehouse Systems

What is OLAP - On-line analytical processing

Lection 3-4 WAREHOUSING

Turkish Journal of Engineering, Science and Technology

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

CHAPTER 4 Data Warehouse Architecture

When to consider OLAP?

14. Data Warehousing & Data Mining

New Approach of Computing Data Cubes in Data Warehousing

Fluency With Information Technology CSE100/IMT100

IST722 Data Warehousing

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

A Critical Review of Data Warehouse

OLAP Systems and Multidimensional Expressions I

Module 1: Introduction to Data Warehousing and OLAP

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

Data Warehouse design

CS2032 Data warehousing and Data Mining Unit II Page 1

Basics of Dimensional Modeling

Web Log Data Sparsity Analysis and Performance Evaluation for OLAP

BUSINESS ANALYTICS AND DATA VISUALIZATION. ITM-761 Business Intelligence ดร. สล ล บ ญพราหมณ

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Mario Guarracino. Data warehousing

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

Indexing Techniques for Data Warehouses Queries. Abstract

Monitoring Genebanks using Datamarts based in an Open Source Tool

Data Warehousing Systems: Foundations and Architectures

Data Warehousing OLAP

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Apache Kylin Introduction Dec 8,

Main Memory & Near Main Memory OLAP Databases. Wo Shun Luk Professor of Computing Science Simon Fraser University

Why Business Intelligence

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Advanced Data Management Technologies

Namrata 1, Dr. Saket Bihari Singh 2 Research scholar (PhD), Professor Computer Science, Magadh University, Gaya, Bihar

Data Warehousing and Data Mining

UNIT-3 OLAP in Data Warehouse

A Design and implementation of a data warehouse for research administration universities

Transcription:

King Saud University College of Computer & Information Sciences IS 466 Decision Support Systems Lecture 2 Data warehousing Dr. Mourad YKHLEF The slides content is derived and adopted from many references 1. What is a data warehouse? 2. A multi-dimensional data model 3. Data warehouse architecture 4. Data warehouse implementation 5. Further development of data cube technology 6. Data warehousing and data mining 2

What is Data Warehouse? A decision support database that is maintained separately from the organization s operational database Not rigorous A data warehouse is a subject-oriented, integrated, time-variant, and non volatile collection of data in support of management s decision-making process. W. H. Inmon 3 Data Warehouse - Subject-Oriented Organized around major subjects, such as product, sales. Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision process. 4

Data Warehouse - Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions (e.g., LastName and FamilyName in DB1 and DB2 have the same signification) encoding structures (e.g, Attribute User_Id is a long int in DB1 and it is a string in DB2 attribute measures (e.g, cm vs inch) When data is moved to the warehouse, it is converted. 5 Data Warehouse - Time Variant Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Operational database: current value data. Every data in the data warehouse Contains an element of time, explicitly or implicitly But the data of operational database may or may not contain time element. IS 466 Data Warehousing - Dr. Mourad Ykhlef 6

Data Warehouse - Non-Volatile A physically separate store. Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and querying (read) 7 Data Warehouse vs. Heterogeneous DBMS Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases Query driven approach A query posed to a client site, will be transformed into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set Data warehouse: update-driven Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis 8

Data Warehouse vs. Operational DBMS OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, OLAP (On-Line Analytical Processing) Major task of data warehouse system Data analysis and decision making 9 Data Warehouse vs. Operational DBMS OLTP OLAP users Any one knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, historical, summarized, multidimensional integrated, consolidated lots of scans access read/write index/hash on prim. key unit of work short, simple transaction complex query DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response 10

1. What is a data warehouse? 2. A multi-dimensional data model 3. Data warehouse architecture 4. Data warehouse implementation 5. Further development of data cube technology 6. Data warehousing and data mining 11 From tables to Data Cubes A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables such as item(item_name, type, ) time(day, week, month, quarter, year) location(location_name, country) Fact table contains measures and keys to related dimension tables 12

From tables to Data Cubes 2-D view of sales cross-tabulation (pivot table) Location = «Mekkah» item TV PC DVD SUM time Q1 670 200 500 1370 Q2 400 250 300 950 Q3 800 400 500 1700 Q4 200 500 400 1100 SUM 2070 1350 1700 5120 13 From tables to Data Cubes time item measure Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 Q3 Q3 Q3 Q3 Q4 Q4 Q4 Q4 * * * * TV PC DVD * TV PC DVD * TV PC DVD * TV PC DVD * TV PC DVD * 670 200 500 1370 400 250 300 950 800 400 500 1700 200 500 400 1100 2070 1350 1700 5120 Relational representation of pivot table The symbol * means ALL By considering the pivot tables of Mekkah, Madimah and Quds we obtain the following cube 14

From tables to Data Cubes 3-D view of sales cube TV PC DVD sum Date 1Qtr 2Qtr 3Qtr 4Qtr sum Mekkah Madinah Quds City sum 15 Cube: A lattice of Cuboids all product date city product,date product,city date, city 0-D(apex) cuboid 1-D cuboids 2-D cuboids product, date, city 3-D(base) cuboid 16

Cube: A lattice of Cuboids all time item city supplier 0-D(apex) cuboid 1-D cuboids time,item time, city time,supplier item, city item,supplier city,supplier 2-D cuboids time,item, city time, city,supplier time,item,supplier item, city,supplier time, item, city, supplier 3-D cuboids 4-D(base) cuboid 17 Cube: A lattice of Cuboids How many cuboids in an n-dimensional cube with L levels? T = n ( L i i +1) =1 If n=10 and each dimension has one level then T= (2) 10 If n=10 and each dimension has 4levels then T= (4+1) 10 18

Relational Data Modeling of Data Warehouses Modeling data warehouses: dimensions & measures 1. Star schema: A fact table in the middle connected to a set of dimension tables 2. Snowflake schema: represents dimensional hierarchy by normalizing the dimension tables (i.e., each level of a dimension represented in one table) save storage reduces the effectiveness of browsing 3. Fact constellations: Multiple fact tables share dimension tables 19 time time_key day day_of_the_week month quarter year branch branch_key branch_name branch_type Example of Star Schema Measures Sales Fact Table time_key item_key branch_key location_key units_sold currency_sold avg_sales item item_key item_name brand type supplier_type location location_key street city state_or_province country 20

Example of Snowflake Schema time time_key day day_of_the_week month quarter year branch branch_key branch_name branch_type This is not a full snowflake schema Measures Sales Fact Table time_key item_key branch_key location_key units_sold currency_sold avg_sales item item_key item_name brand type supplier_key location location_key street city_key supplier supplier_key supplier_type city city_key city state_or_province country 21 time time_key day day_of_the_week month quarter year Example of Fact Constellation Sales Fact Table time_key item_key branch_key item item_key item_name brand type supplier_type Shipping Fact Table time_key item_key shipper_key from_location branch branch_key branch_name branch_type Measures location_key units_sold currency_sold avg_sales location location_key street city province_or_state country to_location currency_cost units_shipped shipper shipper_key shipper_name location_key shipper_type 22

Defining a Star Schema in DMQL define cube sales_star [time, item, branch, location]: currency_sold = sum(sales_price), avg_sales = avg(sales_price), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) 23 Defining a Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: currency_sold = sum(sales_price), avg_sales = avg(sales_price), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) 24

Multidimensional Data Dimensions are : product, month, region Measure is sales_amount Hierarchical summarization paths Industry Region Year Category Country Quarter Product Product City Month Week Office Day Time 25 An example of Data Cube TV PC DVD sum Date 1Qtr 2Qtr 3Qtr 4Qtr sum Total annual sales of TV in Mekkah Mekkah Medinah Quds City sum 26

OLAP operations Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes. 27 OLAP operations and SQL time time_key month quarter year Simple star schema Sales Fact Table time_key item_key location_key qsales product item_key name type location location_key city state country 28

Creation of tables create materialized view product as select name, type from DB1.product create materialized view location as select city, state, country from DB2.location The time dimension can be created in the same manner 29 Data Cube TV PC DVD sum time feb200 juin2000sept2003oct2003 Total annual sales of TV in Mekkah sum Mekkah Medinah Quds location sum 30

Roll up query month quarter select quarter, year, item_key, location_key, sum(qsales) from sales S, time T where S.time_key = T.time_key group by year, quarter, item_key, location_key In order to provide three dimensions we have to concatenate quarter and year in one string. 31 Drill down query Consider that roll up query is created as follows create view salesquarter as select quarter, year, item_key, location_key, sum(q_sales) from sales S, time T where S.time_key = T.time_key group by year, quarter, item_key, location_key Drill down(quarter month) Use of salesquarter, sales (provides qsales) and time dimesnsion 32

Slice query create view sliceview as select time_key, item_key, location_key, q_sales from sales S where S.time_key = 1000 order by item_key, location_key 33 Dice query select time_key, item_key, location_key, q_sales from sliceview where sliceview.location_key = 2000 Dont forget order by 34

1. What is a data warehouse? 2. A multi-dimensional data model 3. Data warehouse architecture 4. Data warehouse implementation 5. Further development of data cube technology 6. Data warehousing and data mining 35 Three Data Warehouse Models Enterprise warehouse: collects all information about subjects (customer, products, sales, assets, personnel) that span the entire organization Requires extensive business modeling May take years to design and build Data Mart: Departmental subsets that focus on selected subjects: Marketing data mart: customer, product, sales Faster roll-out Complex integration in the long term Virtual warehouse 1. A set of views over operational databases 2. Only some of views may be materialized 36

OLAP Server Architectures Relational OLAP (ROLAP): Leaves detail values in the relational fact table Stores aggregated values in the relational database as well. Multidimensional OLAP (MOLAP) Stores both detail and aggregated within the cube. Hybrid OLAP (HOLAP): Leaves detail values in the relational fact table Stores aggregated values in the cube. 37 OLAP Server Architectures ROLAP HOLAP MOLAP Map Map Map Detail Detail Detail Aggregated values Aggregated values Aggregated values 38

1. What is a data warehouse? 2. A multi-dimensional data model 3. Data warehouse architecture 4. Data warehouse implementation 5. Further development of data cube technology 6. Data warehousing and data mining 39 Efficient Data Cube Computation Data cube can be viewed as a lattice of cuboids 1. The bottom-most cuboid is the base cuboid 2. The top-most cuboid (apex) contains only one cell 3. How many cuboids in an n-dimensional cube with L levels? T = n ( L i i +1 ) =1 Materialization of data cube Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) Selection of which cuboids to materialize 1. Based on size, sharing, access frequency, etc. 40

Cube Operation Cube definition and computation in DMQL Define cube sales[item, city, year]: sum(sales_in_currency) compute cube sales Transform it into a SQL-like language (with a new operator cube by -Gray et al. 96-) SELECT item, city, year, SUM (sales_in_currency) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys ( item, city, year), (item) (city) (year) (item, city),(item, year), (city, year), (item), (city), (year) () (item,city) (item, year) (city, year) () (item, city, year) 41 Partial materialization (choose view to answer a query) Identify all of the materialized cuboids that may potentially be used to answer a query. Pruning the above set using knowledge of dominance relationship among cuboids. Estimating the costs of using the remaining materialized cuboids and selecting the cuboid with the least cost. 42

Partial materialization (choose view to answer a query) Define cube sales[year, product, location]: sum(sales_in_currency) Dimension hierarchies used are item < brand street < city < state < country Cuboids are : cuboid 1: {item, city, year} cuboid 2: {brand, country, year} cuboid 3: {brand, state, year} cuboid 4: {item, state} where year = 2000 Query {brand, state} with year = 2000? IS 465 Data Warehousing - Dr. Mourad Ykhlef 43 Partial materialization (choose view to answer a query) Define cube sales[year, product, location]: sum(sales_in_currency) Dimension hierarchies used are item < brand street < city < state < country Cuboids are : cuboid 1: {item, city, year} cuboid 2: {brand, country, year} Query {brand, state} with year = 2000? cuboid 1 costs the most since item and city are at lower level than brand and state cuboid 2 can not be used since state < country 44

Partial materialization (choose view to answer a query) Define cube sales[year, product, location]: sum(sales_in_currency) Dimension hierarchies used are item < brand street < city < state < country Cuboids are : cuboid 3: {brand, state, year} cuboid 4: {item, state} where year = 2000 Query {brand, state} with year = 2000? cuboid 3 is better than cuboid 4 if there are few year values associated with items and several items for each brand cuboid 4 is better than cuboid 3 if efficient indices are available for cuboid 4. 45 Full materialization Multi-way Array Aggregation for Cube Computation Partition arrays into chunks (a small subcube which fits in memory). Compute aggregates in multiway by visiting cube cells in the order (1) which minimizes the # of times to visit each cell, and (2) reduces memory access and storage cost. C c3 61 62 63 64 What is the best traversing order to do multi-way aggregation? B c1 c 0 b3 b2 b1 b0 B c2 45 46 47 48 29 30 31 32 13 14 15 16 9 5 1 2 3 4 a0 a1 A a2 a3 60 44 28 56 40 24 52 36 20 46

Full materialization Multi-way Array Aggregation for Cube Computation Cuboids are : A, B, C, AB, BC, AC, ABC and Ø size(a) = 40, size(b)=400, size(c)=4000 The size of each chunk of A, B and C are 10, 100 and 1000 Computation of cuboid BC is done by computing cuboids b i c j. b 0 c 0 is fully aggregated after scanning 1,2,3,4. For BC aggregation we scan chunk 1 to 64 Is the computation of cuboids AC et AB, needs scanning of 64 chunks another time? NO 1. Scanning of a 0 b 0 c 0 leads to compute a 0 b 0 and a 0 c 0 (Multi-way aggregation) 2. So by scanning of 64 chunks one time, one can compute 3 cuboides AB, AC and BC 47 Full materialization Multi-way Array Aggregation for Cube Computation AC BC AB B C c3 61 62 63 64 c2 45 46 47 48 c1 c 0 29 30 31 32 B b3 13 14 15 16 60 44 28 b2 9 56 40 24 b1 5 52 36 20 b0 1 2 3 4 a0 a1 A a2 a3 Scanning a 0 b 0 c 0 leads to scan b 0 c 0,a 0 c 0 and a 0 b 0 of BC, AC and AB b 0 c 0 = aggregation(1-4) b 1 c 0 =aggregation(5-8) 48

Full materialization Multi-way Array Aggregation for Cube Computation AC BC AB B C c3 61 62 63 64 c2 45 46 47 48 c1 c 0 29 30 31 32 B b3 13 14 15 16 60 44 28 b2 9 56 40 24 b1 5 52 36 20 b0 1 2 3 4 a0 a1 A a2 a3 49 In order to avoid bringing 3-D chunk into memory more than once the minimum memory requirement for holding 2-D plans according to chunk ordering of 1 to 64 is 40*400 (for AB) + 40*1000 (for one row of AC) + 100 * 1000 (for one chunk of BC) = 156 000 If the chunk ordering is 1,17,33,49,5,21,37,53, the memory requirement is 400*4000(for BC) + 10*4000 (for one row of CA) + 10*100 (for one chunk of AB) = 1 641 000 Limitation of the method: computing well only for a small number of dimensions If there are a large number of dimensions, «bottom-up computation» (Beyer & Ramakrishnan, SIGMOD 99) and «iceberg cube computation» methods can be used. 50

No materialization Indexing OLAP Data Bitmap Index An effective indexing technique for attributes with low-cardinality domains. Each value in the column has a bit vector The length of the bit vector: # of records in the base table Example: attribute gender has value M and F. A table of million people needs 2 lists of million bits. 51 Bitmap Index base table Region Index Rating Index Cust Region Rating C1 N H C2 S M C3 W L C4 W H C5 S L C6 W L C7 W H RowId N S E W 1 1 0 0 0 2 0 1 0 0 3 0 0 0 1 4 0 0 0 1 5 0 1 0 0 6 0 0 0 1 7 0 0 0 1 RowId H M L 1 1 0 0 2 0 1 0 3 0 0 1 4 1 0 0 5 0 0 1 6 0 0 1 7 1 0 0 select cust from base table where region= W and rating= L 52

Bitmap Index base table Region Index Rating Index Cust Region Rating C1 N H C2 S M C3 W L C4 W H C5 S L C6 W L C7 W H RowId N S E W 1 1 0 0 0 2 0 1 0 0 3 0 0 0 1 4 0 0 0 1 5 0 1 0 0 6 0 0 0 1 7 0 0 0 1 RowId H M L 1 1 0 0 2 0 1 0 3 0 0 1 4 1 0 0 5 0 0 1 6 0 0 1 7 1 0 0 region= W rating= L 53 Bitmap Index base table Region Index Rating Index Cust Region Rating C1 N H C2 S M C3 W L C4 W H C5 S L C6 W L C7 W H RowId N S E W 1 1 0 0 0 2 0 1 0 0 3 0 0 0 1 4 0 0 0 1 5 0 1 0 0 6 0 0 0 1 7 0 0 0 1 RowId H M L 1 1 0 0 2 0 1 0 3 0 0 1 4 1 0 0 5 0 0 1 6 0 0 1 7 1 0 0 region= W and rating= L 54

Indexing OLAP Data: Join Indices Join index roughly JI(Cf, Row-id-id) where D(Cd, ) >< Cd=Cf F(Row-id, Cf, ) Traditional indices map the values to a list of record ids In data warehouse, join index relates the values of the dimensions of a star schema to rows (ids) in the fact table. Join indices can span multiple dimensions 55 Location Key City 1 Mekkah 2 Medinah 3 Quds Product Key Name 1 PC 2 TV 3 DVD Time Key Month 1 1 2 2 3 3 4 4 Join Index Sales Lkey Pkey Tkey Qnt rid4 1 1 1 5 rid5 1 2 1 7 rid6 1 3 1 4 rid7 2 1 1 8 rid8 2 2 1 3 rid9 2 3 1 5 rid10 3 1 1 20 rid11 3 2 1 10 rid12 3 3 1 30 rid13 1 1 2 10 rid14 1 2 2 9 rid15 1 3 2 7 rid16 2 1 2 5 rid17 2 2 2 10 rid18 2 3 2 8 rid19 3 1 2 20 rid20 3 2 2 50 rid21 3 3 2 30 Location-ProductJI CityK PrdK Rid 1 1 rid4 1 1 rid13 1 2 rid5 1 2 rid14 1 3 rid6 1 3 rid15 56

Location Key City 1 Mekkah 2 Medinah 3 Quds Product Key Name 1 PC 2 TV 3 DVD Time Key Month 1 1 2 2 3 3 4 4 Join Index Sales Lkey Pkey Tkey Qnt rid4 1 1 1 5 rid5 1 2 1 7 rid6 1 3 1 4 rid7 2 1 1 8 rid8 2 2 1 3 rid9 2 3 1 5 rid10 3 1 1 20 rid11 3 2 1 10 rid12 3 3 1 30 rid13 1 1 2 10 rid14 1 2 2 9 rid15 1 3 2 7 rid16 2 1 2 5 rid17 2 2 2 10 rid18 2 3 2 8 rid19 3 1 2 20 rid20 3 2 2 50 rid21 3 3 2 30 LocationJI CityK Rid 1 rid4 1 rid5 1 rid6 1 rid13 1 rid14 1 rid15 2 rid7 2 rid8 2 rid9 2 rid16 2 rid17 2 rid18 57 Efficient Processing of OLAP Queries Determine which operations should be performed on the available cuboids: transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection Determine to which materialized cuboid(s) the relevant operations should be applied. Exploring indexing structures and compressed vs. dense array structures in MOLAP 58

Data Warehouse Utilities Data extraction: get data from sources Data cleaning: detect errors in the data and rectify them when possible Data transformation: convert data from host format to warehouse format Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Refresh: propagate the updates from the data sources to the warehouse 59 1. What is a data warehouse? 2. A multi-dimensional data model 3. Data warehouse architecture 4. Data warehouse implementation 5. Further development of data cube technology 6. Data warehousing and data mining 60

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group, and the total sales among all maximum price tuples select item, region, month, max(price), sum(r.sales) from sales where year = 2000 cube by item, region, month: R such that R.price = max(price) 61 1. What is a data warehouse? 2. A multi-dimensional data model 3. Data warehouse architecture 4. Data warehouse implementation 5. Further development of data cube technology 6. Data warehousing and data mining 62

From On-Line Analytical Processing to On Line Analytical Mining (OLAM) OLAM integrates OLAP with Data mining Why OLAM? High quality of data in data warehouses DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions integration and swapping of multiple mining functions, algorithms, and tasks. 63 Mining query OLAM Engine User GUI API Data Cube API Mining result OLAP Engine Layer4 User Interface Layer3 OLAP/OLAM MDDB Meta Data Layer2 MDDB Filtering&Integration Database API Filtering Layer1 Databases Data cleaning Data integration Data Warehouse Data Repository 64

Summary A multi-dimensional model of a data warehouse Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP Efficient computation of data cubes Partial vs. full vs. no materialization Multiway array aggregation Bitmap index and join index implementations Further development of data cube technology From OLAP to OLAM (on-line analytical mining) 65 A1. Views and Warehousing A DW is just a collection of asynchronously replicated tables and periodically maintained views OLAP queries are typically aggregate queries. Analysts want fast answers to OLAP queries over very large datasets and it is natural to consider precomputing views. CUBE operator gives rise to several aggregate queries. 66

A1. Views and Warehousing (Virtual view) create view RegionalSales (category, sales, state) as select P.category, S.sales, L.state from Products P, Sales S, Location L where P.pid = S.pid and S.locid=L.locid Compute the total sales for each category by state? select R.category, R.state, sum(r.sales) from RegionalSales R group by R.category, R.state Query unfolding consists to replace the occurrence of RegionalSales in the query by view definition 67 A1. Views and Warehousing (Materialised view) Unfolding is not suitable for OLAP because it is a time consumer. View materialization is better: there is no necessity to unfold, the query is executed directly on the pre-computed result. The drawback is that we must maintain the consistency of the materialized view whenever the underlying tables are updated. For more details on data warehousing see the site http://www.ondelette.com/olap/dwbib.html 68