CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes



Similar documents
CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

OLAP and Data Warehousing! Introduction!

CS54100: Database Systems

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

DATA WAREHOUSING AND OLAP TECHNOLOGY

Multi-dimensional index structures Part I: motivation

Week 3 lecture slides

New Approach of Computing Data Cubes in Data Warehousing

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

DATA WAREHOUSING - OLAP

Week 13: Data Warehousing. Warehousing

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Warehousing and OLAP

Part 22. Data Warehousing

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Chapter 3, Data Warehouse and OLAP Operations

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Data W a Ware r house house and and OLAP II Week 6 1

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

14. Data Warehousing & Data Mining

Review. Data Warehousing. Today. Star schema. Star join indexes. Dimension hierarchies

OLAP Systems and Multidimensional Expressions I

CS2032 Data warehousing and Data Mining Unit II Page 1

Data Warehousing & OLAP

Building Data Cubes and Mining Them. Jelena Jovanovic

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

SQL Server 2012 Business Intelligence Boot Camp

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Warehouse: Introduction

Adobe Insight, powered by Omniture

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Overview of Data Warehousing and OLAP

LEARNING SOLUTIONS website milner.com/learning phone

This tutorial will help computer science graduates to understand the basic-toadvanced concepts related to data warehousing.

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Module 1: Introduction to Data Warehousing and OLAP

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Business Intelligence, Analytics & Reporting: Glossary of Terms

CSE 544 Principles of Database Management Systems. Magdalena Balazinska (magda) Winter 2009 Lecture 1 - Class Introduction

A Technical Review on On-Line Analytical Processing (OLAP)

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

The Cubetree Storage Organization

SQL Server 2008 Performance and Scale

Foundations of Business Intelligence: Databases and Information Management

Basics of Dimensional Modeling

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Business Intelligence & Product Analytics

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Fluency With Information Technology CSE100/IMT100

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

A DATA WAREHOUSE SOLUTION FOR E-GOVERNMENT

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION

70-467: Designing Business Intelligence Solutions with Microsoft SQL Server

MS 20467: Designing Business Intelligence Solutions with Microsoft SQL Server 2012

Designing a Dimensional Model

When to consider OLAP?

Data Warehouse design

Learning Objectives. Definition of OLAP Data cubes OLAP operations MDX OLAP servers

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Data W a Ware r house house and and OLAP Week 5 1

CSE 544 Principles of Database Management Systems. Magdalena Balazinska (magda) Fall 2007 Lecture 1 - Class Introduction

Data Warehouses & OLAP

A Comparative Study on Operational Database, Data Warehouse and Hadoop File System T.Jalaja 1, M.Shailaja 2

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

Data Mining and Database Systems: Where is the Intersection?

Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives

Chapter 3 - Data Replication and Materialized Integration

Business Intelligence for SUPRA. WHITE PAPER Cincom In-depth Analysis and Review

Data Mining for Knowledge Management. Data Warehouses

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

International Journal of Advanced Research in Computer Science and Software Engineering

UNIT-3 OLAP in Data Warehouse

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

Challenges for Data Driven Systems

Transcription:

CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Final Exam Overview Open books and open notes No laptops and no other mobile devices But I may ask you to bring a calculator Content: all lectures except lectures 14 and 17 Review book chapters and papers Review lecture notes (ppt slides) Sample final is posted on course website CSE 544 - Winter 2009 2

Important Paper Sections L2 - Data models: Sec. 1 4 carefully, rest quickly L3 Codd s paper: Sec 1.1-1.4 and all of Sec. 2 But I will not ask you about the details of the operators L4 Book: Sec. 19.1-19.7 L5 Anatomy paper: Sec. 1 4 but skip 2.3 L6 Lecture notes only L7 Join algos paper: All but ignore equations and skim the end Make sure you understand how the algorithms work! L8 Selinger s paper: All L9 and L10: Transactions paper: All CSE 544 - Winter 2009 3

Important Paper Sections L11 Three papers: sections as indicated on website L12 Bigtable: ALL L13 DBMS as a service: Only lecture notes L14 Guest lecture not on the final L15 - Cube paper: everything except 1.2, 3.3-3.6, and 4 L16 - C-store paper: next lecture L17 Not on the final And remember that book chapters give fundamental material CSE 544 - Winter 2009 4

Where We Are We covered all the fundamental topics We are now discussing advanced topics Theme: big data management Parallel data processing (MapReduce, parallel DBMSs) High-performance distributed storage: Bigtable Databases as a service Amazon Web Services, Microsoft Azure, Google App Engine SimpleDB, SQL Data Services, Google App Engine Datastore Today: OLAP, decision support, and data cubes CSE 544 - Winter 2009 5

References Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross-Tab, and Sub-Totals. Jim Gray et. al. Data Mining and Knowledge Discovery 1, 29-53. 1997 Database management systems. Ramakrishnan and Gehrke. Third Ed. Chapter 25 CSE 544 - Winter 2009 6

Why Data Warehouses? DBMSs designed to manage operational data Goal: support every day activities Online transaction processing (OLTP) Ex: Tracking sales and inventory of each Wal-mart store Enterprises also need to analyze and explore their data Goal: summarize and discover trends to support decision making Online analytical processing (OLAP) Ex: Analyzing sales of all Wal-mart stores by quarter and region To support OLAP consolidate all data into a warehouse (note: this is an example of lazy master replication) CSE 544 - Winter 2009 7

Data Warehouse Overview Consolidated data from many sources Must create a single unified schema The warehouse is like a materialized view Very large size: terabytes of data are common Complex read-only queries (no updates) Fast response time is important CSE 544 - Winter 2009 8

Creating a Data Warehouse Extract data from distributed operational databases Clean to minimize errors and fill in missing information Transform to reconcile semantic mismatches Performed by defining views over the data sources Load to materialize the above defined views Build indexes and additional materialized views Refresh to propagate updates to warehouse periodically CSE 544 - Winter 2009 9

Alternative: Distributed DBMS User submits a query at one site Query is defined over data located at different sites Different physical locations Different types of DBMSs Query optimizer finds the best distributed query plan Query executes across all the locations Results shipped to query site and returned to user CSE 544 - Winter 2009 10

Distributed DBMS Limitations Top-down Global, a priori data placement Global query optimization One query at a time; no notion of load balance Distributed transactions, tight coupling Assumes full cooperation of all sites Assumes uniform sites Assumes short-duration operations Limited scalability CSE 544 - Fall 2007 11

Back to Warehouses: Outline Multidimensional data model and operations Data cube & rollup operators Data warehouse implementation issues Other extensions for data analysis CSE 544 - Winter 2009 12

Data Analysis Cycle Formulate query that extracts data from the database Typically ad-hoc complex query with group by and aggregate Visualize the data (e.g., spreadsheet) Dataset is an N-dimensional space Analyze the data Identify interesting subspace by aggregating other dimensions Categorize the data an compare categories with each other Roll-up and drill-down on the data CSE 544 - Winter 2009 13

Multidimensional Data Model Focus of the analysis is a collection of measures Example: Wal-mart sales Each measure depends on a set of dimensions Example: product (pid), location (lid), and time of the sale (timeid) pid locid 10 203 54 102 18 11 296 87 334 25 12 23 76 93 11 Slicing: equality selection on one or more dimensions 13 17 62 154 8 1 2 3 4 timeid Dicing: range selection CSE 544 - Winter 2009 14

Star Schema Representing multidimensional data as relations (ROLAP) Product pid pname category price Location locid city state country Facts table: Sales In BCNF Dimensions tables Product Location Times Not normalized Sales pid timeid locid sales timeid date week month quarter year Times CSE 544 - Winter 2009 15

Dimension Hierarchies Dimension values can form a hierarchy described by attributes Product category pname Time year quarter week month date Location country state city CSE 544 - Winter 2009 16

Desired Operations Histograms (agg. over computed categories) Problem: awkward to express in SQL (paper p.34) Summarize at different levels: roll-up and drill-down Ex: total sales by day, week, quarter, and year Pivoting Ex: pivot on location and time Result of pivot is a cross-tabulation Column values become labels 2005 2006 2007 Total WI CA Total 500 200 700 150 850 1000 250 400 650 900 1450 2350 CSE 544 - Winter 2009 17

Challenge 1: Representation Problem: How to represent multi-level aggregation? Ex: Table 3 in the paper need 2 N columns for N dimensions! Ex: Table 4 has even more columns! And that s without considering any hierarchy on the dimensions! Solution: special all value T.year L.state SUM(S.sales) 2005 WI 500 2005 CA 200 2005 ALL 700......... ALL ALL 2350 Note: SQL-1999 standard uses NULL values instead of ALL CSE 544 - Winter 2009 18

Challenge 2: Computing Agg. Need 2 N different SQL queries to compute all aggregates Expressing roll-up and cross-tab queries is thus daunting Cannot optimize all these independent queries Solution: CUBE and ROLLUP operators CSE 544 - Winter 2009 19

Outline Multidimensional data model and operations Data cube & rollup operators Data warehouse implementation issues Other extensions for data analysis CSE 544 - Winter 2009 20

Data Cube CUBE is the N-dimensional generalization of aggregate Cube in SQL-1999 SELECT T.year, L.state, SUM(S.sales) FROM Sales S, Times T, Locations L WHERE S.timeid=T.timeid and S.locid=L.locid GROUP BY CUBE (T.year,L.state) Creating a data cube requires generating the power set of the aggregation columns CSE 544 - Winter 2009 21

Rollup Rollup produces a subset of a cube Rollup in SQL-1999 SELECT T.year,T.quarter, SUM(S.sales) FROM Sales S, Times T WHERE S.timeid=T.timeid GROUP BY ROLLUP (T.year,T.quarter) Will aggregate over each pair of (year,quarter), each year, and total, but will not aggregate over each quarter CSE 544 - Winter 2009 22

Computing Cubes and Rollups Naive algorithm For each new tuple, update each of 2 N matching cells More efficient algorithm Use intermediate aggregates to compute others Relatively easy for distributive and algebraic functions Updating a cube in response to updates is more challenging CSE 544 - Winter 2009 23

Outline Multidimensional data model and operations Data cube & rollup operators Data warehouse implementation issues Other extensions for data analysis CSE 544 - Winter 2009 24

Indexes Bitmap indexes: good for sparse attributes (few values) M F custid name gender rating 1 2 3 4 0 1 10 Alice F 3 0 0 1 0 1 0 11 Bob M 4 0 0 0 1 1 0 12 Chuck M 1 1 0 0 0 Join indexes: to speed-up specific join queries Example: Join fact table F with dimension tables D1 and D2 Index contain triples of rids <r 1,r 2,r> from D 1, D 2, and F that join Alternatively, two indexes, each one with pairs <v 1,r> or <v 2,r> where v 1, v 2 are values of tuples from D 1, D 2 that join with r CSE 544 - Winter 2009 25

Materialized Views How to choose views to materialize? Physical database tuning How to keep view up-to-date? Could recompute entire view for each update: expensive Better approach: incremental view maintenance Example: recompute only affected partition How often to synchronize? Periodic updates (at night) are typical CSE 544 - Winter 2009 26

Outline Multidimensional data model and operations Data cube & rollup operators Data warehouse implementation issues Other extensions for data analysis CSE 544 - Winter 2009 27

Additional Extensions for Decision Support Window queries SELECT L.state, T.month, AVG(S.sales) over W AS movavg FROM Sales S, Times T, Locations L WHERE S.timeid = T.timeid AND S.locid=L.locid WINDOW W AS (PARTITION BY L.State ORDER BY T.month RANGE BETWEEN INTERVAL 1 MONTH PRECEDING AND INTERVAL 1 MONTH FOLLOWING) Top-k queries: optimize queries to return top k results Online aggregation: produce results incrementally CSE 544 - Winter 2009 28