Review. Data Warehousing. Today. Star schema. Star join indexes. Dimension hierarchies




Data Warehousing
CPS 216 Advanced Database Systems

Review
- Data warehousing: integrating data for OLAP
  - OLAP versus OLTP
  - Warehousing versus mediation
- Warehouse maintenance
  - Warehouse data as materialized views
  - Recomputation versus incremental maintenance
  - Self-maintenance

Today
- Star, snowflake, and cube
- ROLAP and MOLAP algorithms

Star schema

Fact table Sale (big; constantly growing; stores measures, which are often aggregated in queries):

  OID  date      CID  PID  SID  qty  price
  100  11/23/01  c3   p1   s1   1    12
  102  12/12/01  c3   p2   s1   2    17
  105  12//01    c5   p1   s3   5    13

Dimension tables (small; updated infrequently):

  PID  name    cost
  p1   beer    10
  p2   diaper  16

  CID  name  address         city
  c3   Amy   100 Main St.    Durham
  c4   Ben   102 Main St.    Durham
  c5   Carl  800 Eighth St.  Durham

  SID  city
  s1   Durham
       Chapel Hill
  s3   RTP

Dimension hierarchies

  PID  brandid  typeid
  p1   b1       t1
  p2   b2       t2
  p3   b1       t4

  Brand
  brandid  name
  b1       Huggies
  b2       Budweiser

  type
  typeid  name      catid
  t1      diaper    cat7
  t2      beer      cat9
  t3      juice     cat9
  t4      pacifier  cat7

  category
  catid  name
  cat7   babies
  cat9   beverages

- Dimension hierarchies + star schema = snowflake schema

Star join indexes
- Queries frequently join the fact table with dimension tables
  - Materialize the join result to speed up queries
- For each combination of dimension attribute values, store the list of tuple IDs in the fact table
  - Brand name, store city, customer city → sales records; type, store city → sales records; etc.
- Conceptually, these are multi-attribute indexes on the join result
- One index to support each combination of selection conditions on attributes? Too many indexes!
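To make the star join index idea concrete, here is a minimal Python sketch that maps each combination of dimension attribute values to the list of fact-table tuple IDs. It uses the rows of the star-schema example; the choice of (product name, customer city) as the indexed combination and the name `build_star_join_index` are assumptions made for illustration, not something the slides prescribe.

```python
from collections import defaultdict

# Fact rows from the star-schema example, reduced to (OID, CID, PID).
sale = [
    (100, "c3", "p1"),
    (102, "c3", "p2"),
    (105, "c5", "p1"),
]
# Dimension lookups from the same example.
product_name = {"p1": "beer", "p2": "diaper"}
customer_city = {"c3": "Durham", "c4": "Durham", "c5": "Durham"}


def build_star_join_index(sale_rows, product_name, customer_city):
    """Map (product name, customer city) -> list of fact tuple IDs (OIDs).

    Hypothetical helper: it materializes the join once so that a query
    with conditions on both attributes becomes a single dictionary lookup.
    """
    index = defaultdict(list)
    for oid, cid, pid in sale_rows:
        index[(product_name[pid], customer_city[cid])].append(oid)
    return dict(index)


star_index = build_star_join_index(sale, product_name, customer_city)
print(star_index[("beer", "Durham")])
```

One such dictionary is needed per combination of attributes, which is exactly why the number of star join indexes grows out of hand.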

Bitmap join indexes (O'Neil & Quass, SIGMOD 1997)
- Bitmap and projection indexes for each dimension attribute
  - Value of the dimension attribute → tuple IDs in the fact table
- To process an arbitrary combination of selection conditions, use bitmap indexes
  - Bitmaps can be combined efficiently
- To retrieve attribute values for output, use projection indexes

Data cube
- Simplified schema: Sale(CID, PID, SID, qty)
- Points shown in the cube: (c3, p2, s1) = 2; (c3, p1, s1) = 1; (c5, p1, s1) = 3
[Figure: three-dimensional cube with points plotted against product values p1, p2 and store values s1, s3]

Completing the cube (slide 1)
- Total quantity of sales for each product in each store
    SELECT SUM(qty) FROM Sale GROUP BY PID, SID;
- Project all points onto the PID-SID plane
- New points: (ALL, p1, s1) = 4; (ALL, p1, s3) = 5; (ALL, p2, s1) = 2

Completing the cube (slide 2)
- Total quantity of sales for each product
    SELECT SUM(qty) FROM Sale GROUP BY PID;
- Further project points onto the PID axis
- New points: (ALL, p1, ALL) = 9; (ALL, p2, ALL) = 2

Completing the cube (slide 3)
- Total quantity of sales
    SELECT SUM(qty) FROM Sale;
- Further project points onto the origin
- New point: (ALL, ALL, ALL) = 11

CUBE operator (Gray et al., ICDE 1996)
- Sale(CID, PID, SID, qty)
- Proposed SQL extension:
    SELECT SUM(qty) FROM Sale GROUP BY CUBE CID, PID, SID;
- Output contains:
  - Normal groups produced by GROUP BY: (c1, p1, s1, sum), (c1, p2, s3, sum), etc.
  - Groups with one or more ALLs: (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.
- Can you write a CUBE query using only GROUP BYs?
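One way to see that a CUBE is just the union of ordinary groupings over every subset of the dimensions is the following Python sketch. The string "ALL" is an assumed placeholder for a dimension that has been aggregated away, and the fourth input row, (c5, p1, s3, 5), is an assumption borrowed from the star-schema fact table so that the totals agree with the values shown on the cube slides.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # assumed placeholder for an aggregated-away dimension

# Rows of Sale(CID, PID, SID, qty).
sale = [
    ("c3", "p2", "s1", 2),
    ("c3", "p1", "s1", 1),
    ("c5", "p1", "s1", 3),
    ("c5", "p1", "s3", 5),  # assumed row, taken from the star-schema table
]


def cube(rows, n_dims):
    """Sum the measure for every subset of the n_dims grouping columns.

    Each subset corresponds to one plain grouping; dimensions outside
    the subset are replaced by ALL in the group key.
    """
    result = defaultdict(int)
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            for row in rows:
                key = tuple(row[i] if i in kept else ALL for i in range(n_dims))
                result[key] += row[n_dims]
    return dict(result)


cube_result = cube(sale, 3)
print(cube_result[(ALL, "p1", ALL)], cube_result[(ALL, ALL, ALL)])
```

This naive version scans the input once per grouping, so it costs 2^n passes for n dimensions; the algorithms discussed later aim to avoid exactly that.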

ROLLUP operator
- Sometimes CUBE is too much
  - Consider a schema (..., state, city, street, ..., age, DOB, ...)
  - CUBE state, city, street returns meaningless groups such as (ALL, ALL, 'Main Street'): sales on any Main Street?
  - CUBE age, DOB returns useless groups such as (ALL, DOB): DOB functionally determines age!
- Proposed SQL extension:
    GROUP BY ROLLUP state, city, street;
- Output contains groups with ALLs only as a suffix
  - ('NC', 'Durham', 'Main Street'), ('NC', 'Durham', ALL), ('NC', ALL, ALL), (ALL, ALL, ALL)
  - But not (ALL, ALL, 'Main Street') or (ALL, 'Durham', ALL)

Computing GROUP BY
- ROLAP (Relational OLAP): use a standard relational engine
  - Sorting and clustering
  - Using indexes
  - Automatic summary tables
- MOLAP (Multidimensional OLAP): use a sparse multidimensional array

Sorting and clustering
- Sort (or cluster, e.g., using hashing) tuples according to the GROUP BY attributes
  - Tuples in the same group are processed together
  - Only one intermediate aggregate result needs to be kept → low memory requirement
- What if the GROUP BY attributes ≠ the sort attributes?
  - Still fine if the GROUP BY attributes form a prefix of the sort order
  - Otherwise, intermediate aggregate results need to be kept around

More on sort order
- Sort by the order in which the GROUP BY attributes appear?
  - Not necessary; e.g., GROUP BY PID, SID can be processed just as efficiently by sorting on SID, PID
- Sort by the order in which the ROLLUP attributes appear?
  - Useful; e.g., ROLLUP state, city, street can be processed efficiently by sorting on state, city, street, but not by sorting on street, city, state

Using bitmap join indexes (O'Neil & Quass, SIGMOD 1997)
- Use the bitmap join indexes on GB_1, GB_2, ..., GB_k

    For each value v_1 of GB_1 in order:
      For each value v_2 of GB_2 in order:
        ...
          For each value v_k of GB_k in order:
            Intersect bitmaps to locate tuples;
            Retrieve their measures;
            Calculate aggregate for group (v_1, v_2, ..., v_k);

- Helps if data is sorted by GB_1, GB_2, ..., GB_k
  - So measures in the same group are clustered

Automatic summary tables
- Computing aggregates is expensive
- OLAP queries perform GROUP BY all the time
- Idea: precompute and store the aggregates!
  - Automatic summary tables
  - Maintained automatically as base data changes
  - Just another index/materialized view
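Going back to the "Sorting and clustering" slide, the sketch below illustrates why sorted input needs only one intermediate aggregate at a time. It is a minimal Python illustration, not a database engine's implementation; the function name `sort_based_group_sum` and the sample rows are assumptions (the rows reuse the simplified Sale(CID, PID, SID, qty) schema).

```python
def sort_based_group_sum(rows, key_cols, measure_col):
    """Sum measure_col per group after sorting on key_cols.

    Because equal keys are adjacent after sorting, a single running
    total is enough; a group is emitted as soon as the key changes.
    """
    def key_of(row):
        return tuple(row[c] for c in key_cols)

    results = []
    current_key, running = None, 0
    for row in sorted(rows, key=key_of):
        k = key_of(row)
        if k != current_key:
            if current_key is not None:
                results.append((current_key, running))
            current_key, running = k, 0
        running += row[measure_col]
    if current_key is not None:
        results.append((current_key, running))  # flush the last group
    return results


# Assumed sample rows of Sale(CID, PID, SID, qty).
sale = [
    ("c3", "p2", "s1", 2),
    ("c3", "p1", "s1", 1),
    ("c5", "p1", "s1", 3),
    ("c5", "p1", "s3", 5),
]

by_pid_sid = sort_based_group_sum(sale, key_cols=(1, 2), measure_col=3)
by_sid_pid = sort_based_group_sum(sale, key_cols=(2, 1), measure_col=3)
print(by_pid_sid)
print(by_sid_pid)
```

Sorting on (PID, SID) or on (SID, PID) yields the same groups with the same single accumulator, which matches the remark that the attribute order does not matter for a plain GROUP BY.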

Aggregation view lattice

  GROUP BY CID, PID, SID
  GROUP BY CID, PID    GROUP BY CID, SID    GROUP BY PID, SID
  GROUP BY CID         GROUP BY PID         GROUP BY SID
  GROUP BY ∅

- A child can be computed from any parent

Selecting views to materialize (Harinarayan et al., SIGMOD 1996; Gupta & Mumick, ICDE 1999)
- Factors in deciding what to materialize
  - What is its storage cost?
  - What is its update cost?
  - Which queries can benefit from it?
  - How much can a query benefit from it?
- Example
  - GROUP BY ∅ is small, but not useful to most queries
  - GROUP BY CID, PID, SID is useful to any query, but too large to be beneficial

Interlude: TPC-D, -H, and -R (http://www.tpc.org/)
- TPC-D: standard OLAP benchmark until 1999
  - With aggressive use of precomputation techniques (materialized views, automatic summary tables), vendors were able to cheat and achieve amazing performance
- Now, TPC-D has been replaced by
  - TPC-H: ad hoc OLAP queries; cannot use materialized views
  - TPC-R: business-reporting OLAP queries; can use materialized views

From tables to arrays (Zhao et al., SIGMOD 1997)
- Chunk an n-dimensional cube into n-dimensional subcubes
  - For a dense chunk (fullness above a threshold percentage), store it as is
  - For a sparse chunk (fullness below the threshold), compress it using <coordinate, value> pairs
- To convert a table into chunks
  - Pass 1: Partition the table into memory-size partitions, each of which contains a number of chunks
  - Pass 2: Read partitions back in one at a time, and chunk each partition in memory

Dimension order
- Dimension order: A, B, C = sort order: C, B, A
[Figure: chunked three-dimensional array with chunks numbered in dimension order]

Basic array cubing
- Computing the aggregate on B, C
- Memory required: 1 chunk (could be more)
[Figure: chunked array projected onto the B, C plane]
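The dense-versus-sparse storage rule from "From tables to arrays" can be sketched as follows. This is a minimal Python illustration under stated assumptions: the threshold is a caller-supplied parameter (`dense_threshold`, a hypothetical name) because the slide's percentage is not available here, a chunk is modelled as a flat list with `None` for empty cells, and the "coordinate" of a cell is taken to be its offset within the chunk.

```python
def store_chunk(chunk, dense_threshold):
    """Choose a storage format for one chunk.

    chunk: flat list of cell values, None marking an empty cell.
    dense_threshold: assumed fraction in [0, 1]; chunks fuller than
    this are kept as plain arrays, the rest become (offset, value) pairs.
    """
    if not chunk:
        return ("sparse", [])
    filled = [(offset, v) for offset, v in enumerate(chunk) if v is not None]
    if len(filled) / len(chunk) > dense_threshold:
        return ("dense", list(chunk))
    return ("sparse", filled)


def load_cell(stored, offset):
    """Read one cell back from either representation."""
    kind, payload = stored
    if kind == "dense":
        return payload[offset]
    return dict(payload).get(offset)


sparse = store_chunk([None, 5, None, None], dense_threshold=0.5)
dense = store_chunk([1, 2, None, 4], dense_threshold=0.5)
print(sparse, dense)
```

The sparse form stores only the occupied cells, so its size is proportional to the number of values rather than to the chunk volume.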

Minimal spanning tree
- Recall the aggregation lattice
- MST of the lattice: parent is always chosen to be the one with minimum size
- Compute each node from its parent in the MST
- Lattice nodes with their sizes, where shown:

  A, B, C: 100
  A, B         A, C: 10     B, C: 50
  A: 2         B: 10        C: 5
  ∅: 1

Multiway array cubing
- Goal: compute all aggregates at the same time in a single pass over the array, using the minimum amount of memory
  - B, C requires 1 chunk
  - A, C requires 4 chunks
  - A, B requires 16 chunks
[Figure: chunked three-dimensional array with chunks numbered in dimension order]

Memory requirement
- Dimension order is $D_1, D_2, \ldots, D_n$
- The aggregate to compute projects out $D_p$ (i.e., GROUP BY $D_1, \ldots, D_{p-1}, D_{p+1}, \ldots, D_n$)
- The memory required is roughly $|D_1| \times |D_2| \times \cdots \times |D_{p-1}|$ chunks
  - Where $|D_i|$ denotes the number of chunks along $D_i$
- It is harder to aggregate away dimensions that come later in the dimension order

Minimum-memory spanning tree
- MMST of the aggregation lattice
  - Parent is always chosen to be the one that makes the child require the minimum memory to compute
  - Note that results are produced in dimension order too, so computation of the entire MMST can be pipelined
- Choose an optimal dimension order to minimize the total amount of memory required by the MMST
  - It turns out that this optimal order is $D_1, D_2, \ldots, D_n$, where $|D_1| \le |D_2| \le \cdots \le |D_n|$

ROLAP versus MOLAP
- The multiway array cubing algorithm (MOLAP) beats sorting-based ROLAP algorithms
  - The compressed array representation is more compact than the table representation
  - Sorting-based ROLAP spends too much time on comparing and copying
  - In MOLAP, order is implied by the array positions
- An alternative ROLAP technique
  - Convert the table to an array
  - Do MOLAP processing
  - Dump the result cube to a table

Next time
- Data mining
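As a check on the memory-requirement rule for multiway array cubing, the short Python sketch below reads the rule as "multiply the chunk counts of all dimensions that precede the projected-out dimension in the dimension order". The function name `chunks_needed` is hypothetical, and the assumption of 4 chunks along each of A, B, and C is inferred from the 1, 4, and 16 chunk figures on the multiway array cubing slide.

```python
from math import prod


def chunks_needed(chunk_counts, p):
    """Rough memory (in chunks) to aggregate away the p-th dimension.

    chunk_counts: number of chunks along D_1..D_n, in dimension order.
    p: 1-based position of the dimension being projected out.
    Returns |D_1| * ... * |D_{p-1}|; the empty product is 1.
    """
    if not 1 <= p <= len(chunk_counts):
        raise ValueError("p must be between 1 and the number of dimensions")
    return prod(chunk_counts[: p - 1])


# Assumed layout: dimension order A, B, C with 4 chunks along each.
counts = [4, 4, 4]
memory = {
    "B, C": chunks_needed(counts, 1),  # projects out A
    "A, C": chunks_needed(counts, 2),  # projects out B
    "A, B": chunks_needed(counts, 3),  # projects out C
}
print(memory)
```

The growth from 1 to 4 to 16 chunks shows why dimensions later in the order are more expensive to aggregate away, and why putting the dimensions with fewer chunks first keeps the products small.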