CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Class Projects
- Class projects are going very well!
- Project presentations: 15 minutes
  - On Wednesday in two weeks
  - There are 14 teams, so we will need to schedule extra time
  - We will give you the grading guidelines ahead of time
- Final reports
  - On Wednesday in two weeks (anytime)
  - See class website for details about content & presentation
  - Unfortunately, cannot accommodate extensions this year

Where We Are
- We covered all the fundamental topics
- We are now discussing various extra topics
  - Distributed databases (query optimization and transactions)
  - Replication
  - Physical database tuning (guest)
  - Data warehousing (today)
  - Model management (guest)
  - Stream processing
  - XML

References
- Jim Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery 1, 29-53, 1997.
- Ramakrishnan and Gehrke. Database Management Systems. Third Ed. Chapter 25.

Why Data Warehouses?
- DBMSs are designed to manage operational data
  - Goal: support everyday activities
  - Online transaction processing (OLTP)
  - Ex: Tracking sales and inventory of each Wal-mart store
- Enterprises also need to analyze and explore their data
  - Goal: summarize and discover trends to support decision making
  - Online analytical processing (OLAP)
  - Ex: Analyzing sales of all Wal-mart stores by quarter and region
- To support OLAP, consolidate all data into a warehouse (note: this is an example of lazy master replication)

Outline
- Multidimensional data model and operations
- Data cube & rollup operators
- Data warehouse implementation issues
- Other extensions for data analysis

Data Analysis Cycle
- Formulate query that extracts data from the database
  - Typically ad-hoc complex query with group by and aggregate
- Visualize the data (e.g., spreadsheet)
  - Dataset is an N-dimensional space
- Analyze the data
  - Identify interesting subspace by aggregating other dimensions
  - Categorize the data and compare categories with each other
  - Roll-up and drill-down on the data

Multidimensional Data Model
- Focus of the analysis is a collection of measures
  - Example: Wal-mart sales
- Each measure depends on a set of dimensions
  - Example: product (pid), location (locid), and time of the sale (timeid)
- [Figure: three-dimensional sales cube with axes pid, locid, and timeid]
- Slicing: equality selection on one or more dimensions
- Dicing: range selection
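Slicing and dicing can be sketched in a few lines of Python. This is a minimal illustration, assuming the cube is held as a dictionary keyed by (pid, locid, timeid); the cell values and the helper names `slice_cube` and `dice_cube` are made up for the example and do not come from the lecture.

```python
# Hypothetical sales cube: (pid, locid, timeid) -> sales. Values are illustrative only.
cube = {
    (11, 1, 1): 25, (11, 1, 2): 8, (11, 2, 1): 35,
    (12, 1, 1): 30, (12, 1, 3): 50, (13, 2, 2): 10,
    (13, 1, 4): 20,
}

# Position of each dimension inside the key tuple.
DIMS = {"pid": 0, "locid": 1, "timeid": 2}

def slice_cube(cube, **equals):
    """Slicing: keep cells whose named dimensions equal the given values."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS[d]] == val for d, val in equals.items())}

def dice_cube(cube, **ranges):
    """Dicing: keep cells whose named dimensions fall in inclusive (lo, hi) ranges."""
    return {k: v for k, v in cube.items()
            if all(lo <= k[DIMS[d]] <= hi for d, (lo, hi) in ranges.items())}

loc1 = slice_cube(cube, locid=1)                      # equality on one dimension
early = dice_cube(cube, timeid=(1, 2), pid=(11, 12))  # ranges on two dimensions
```

Summing the values of `loc1` gives the total sales at location 1, which is the kind of aggregation over the remaining dimensions that the analysis cycle relies on.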

Star Schema
- Representing multidimensional data as relations (ROLAP)
- Fact table, in BCNF
  - Sales(pid, timeid, locid, sales)
- Dimension tables, not normalized
  - Product(pid, pname, category, price)
  - Location(locid, city, state, country)
  - Times(timeid, date, week, month, quarter, year)
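The star schema can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: the column types, key constraints, and sample rows are assumptions, and only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product  (pid INTEGER PRIMARY KEY, pname TEXT, category TEXT, price REAL);
CREATE TABLE Location (locid INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE Times    (timeid INTEGER PRIMARY KEY, date TEXT, week INTEGER,
                       month INTEGER, quarter INTEGER, year INTEGER);
-- Fact table: one row per (product, time, location) with the measure.
CREATE TABLE Sales    (pid INTEGER REFERENCES Product,
                       timeid INTEGER REFERENCES Times,
                       locid INTEGER REFERENCES Location,
                       sales INTEGER,
                       PRIMARY KEY (pid, timeid, locid));
""")

# Made-up sample rows.
conn.executemany("INSERT INTO Product VALUES (?,?,?,?)",
                 [(11, "Jeans", "Apparel", 25.0), (12, "Robot", "Toys", 18.0)])
conn.executemany("INSERT INTO Location VALUES (?,?,?,?)",
                 [(1, "Madison", "WI", "USA"), (2, "Fresno", "CA", "USA")])
conn.executemany("INSERT INTO Times VALUES (?,?,?,?,?,?)",
                 [(1, "2005-03-01", 9, 3, 1, 2005), (2, "2006-03-01", 9, 3, 1, 2006)])
conn.executemany("INSERT INTO Sales VALUES (?,?,?,?)",
                 [(11, 1, 1, 300), (12, 1, 1, 200), (11, 1, 2, 200),
                  (11, 2, 1, 150), (12, 2, 2, 850)])

# Star join: the fact table joined with two of its dimension tables.
result = conn.execute("""
    SELECT T.year, L.state, SUM(S.sales)
    FROM Sales S
    JOIN Times T    ON S.timeid = T.timeid
    JOIN Location L ON S.locid  = L.locid
    GROUP BY T.year, L.state
    ORDER BY T.year, L.state
""").fetchall()
```

Every analytical query follows the same shape: join the fact table with whichever dimension tables supply the grouping attributes, then aggregate the measure.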

Dimension Hierarchies
- Dimension values can form a hierarchy described by attributes
  - Product: category > pname
  - Time: year > quarter > {week, month} > date
  - Location: country > state > city

Desired Operations
- Histograms (aggregation over computed categories)
  - Problem: awkward to express in SQL (paper p. 34)
- Summarize at different levels: roll-up and drill-down
  - Ex: total sales by day, week, quarter, and year
- Pivoting
  - Ex: pivot on location and time
  - Result of pivot is a cross-tabulation
  - Column values become labels

            2005   2006   2007   Total
    WI       500    150    250     900
    CA       200    850    400    1450
    Total    700   1000    650    2350
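The cross-tabulation above can be reproduced with a short pivot routine. This is a sketch in Python rather than anything the lecture prescribes; the input rows are the six (year, state) cells of the table, and the function name `pivot` is an assumption.

```python
from collections import defaultdict

# (year, state, sales) cells taken from the cross-tabulation on the slide.
rows = [
    (2005, "WI", 500), (2005, "CA", 200),
    (2006, "WI", 150), (2006, "CA", 850),
    (2007, "WI", 250), (2007, "CA", 400),
]

def pivot(rows):
    """Cross-tabulate: states become row labels, years become column labels.
    Each input row also contributes to its row total, column total, and grand total."""
    table = defaultdict(lambda: defaultdict(int))
    for year, state, sales in rows:
        for r in (state, "Total"):
            for c in (year, "Total"):
                table[r][c] += sales
    return table

xtab = pivot(rows)

# Print the table in the same layout as the slide.
columns = [2005, 2006, 2007, "Total"]
print("      " + "".join(f"{c!s:>7}" for c in columns))
for label in ("WI", "CA", "Total"):
    print(f"{label:<6}" + "".join(f"{xtab[label][c]:>7}" for c in columns))
```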

Challenge 1: Representation
- Problem: How to represent multi-level aggregation?
  - Ex: Table 3 in the paper needs 2^N columns for N dimensions!
  - Ex: Table 4 has even more columns!
  - And that's without considering any hierarchy on the dimensions!
- Solution: special ALL value

    T.year   L.state   SUM(S.sales)
    2005     WI         500
    2005     CA         200
    2005     ALL        700
    ...      ...        ...
    ALL      ALL       2350

- Note: SQL-1999 standard uses NULL values instead of ALL

Challenge 2: Computing Aggregates
- Need 2^N different SQL queries to compute all aggregates
  - Expressing roll-up and cross-tab queries is thus daunting
  - Cannot optimize all these independent queries
- Solution: CUBE and ROLLUP operators

Data Cube
- CUBE is the N-dimensional generalization of aggregate
- Cube in SQL-1999

    SELECT T.year, L.state, SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY CUBE (T.year, L.state)

- Creating a data cube requires generating the power set of the aggregation columns
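To make the power-set idea concrete, here is a small Python emulation of GROUP BY CUBE with SUM. It is a sketch, not how a DBMS implements the operator: the function name `cube` and the use of the string "ALL" as the placeholder value are assumptions, and the input rows reuse the (year, state) sales figures from the cross-tabulation.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for an aggregated-away dimension (SQL-1999 uses NULL)

# (year, state, sales) rows, i.e. the result of joining Sales, Times, and Locations.
rows = [
    (2005, "WI", 500), (2005, "CA", 200),
    (2006, "WI", 150), (2006, "CA", 850),
    (2007, "WI", 250), (2007, "CA", 400),
]

def cube(rows, n_dims):
    """Emulate GROUP BY CUBE with SUM: aggregate over every subset of the dimensions."""
    result = defaultdict(int)
    for *dims, measure in rows:
        # Enumerate the power set of the grouping columns.
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                key = tuple(dims[i] if i in kept else ALL for i in range(n_dims))
                result[key] += measure
    return dict(result)

c = cube(rows, 2)
```

With two dimensions, each input row updates 2^2 = 4 cells: (year, state), (year, ALL), (ALL, state), and (ALL, ALL).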

Rollup
- Rollup produces a subset of a cube
- Rollup in SQL-1999

    SELECT T.year, T.quarter, SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeid = T.timeid
    GROUP BY ROLLUP (T.year, T.quarter)

- Will aggregate over each pair of (year, quarter), each year, and total, but will not aggregate over each quarter
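The difference from CUBE is that ROLLUP only aggregates along prefixes of the column list. The following Python sketch illustrates this; the function name `rollup`, the "ALL" placeholder, and the sample (year, quarter, sales) rows are all assumptions made for the example.

```python
from collections import defaultdict

ALL = "ALL"  # placeholder for an aggregated-away dimension

# Hypothetical (year, quarter, sales) rows.
rows = [
    (2006, 1, 100), (2006, 2, 150),
    (2007, 1, 120), (2007, 1, 30), (2007, 2, 80),
]

def rollup(rows, n_dims):
    """Emulate GROUP BY ROLLUP with SUM: aggregate over each prefix of the dimensions."""
    result = defaultdict(int)
    for *dims, measure in rows:
        # Keep the first k dimensions, replace the rest with ALL: k = n_dims, ..., 0.
        for k in range(n_dims, -1, -1):
            key = tuple(dims[:k]) + (ALL,) * (n_dims - k)
            result[key] += measure
    return dict(result)

r = rollup(rows, 2)
```

The result contains (year, quarter), (year, ALL), and (ALL, ALL) cells, but no (ALL, quarter) cell, so N dimensions yield N + 1 groupings instead of 2^N.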

Computing Cubes and Rollups
- Naive algorithm
  - For each new tuple, update each of 2^N matching aggregates
- More efficient algorithm
  - Use intermediate aggregates to compute others
  - Relatively easy for distributive and algebraic functions
- Updating a cube in response to updates is more challenging
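The more efficient approach can be sketched as follows: scan the base data once to build the finest group-by, then derive each coarser group-by from an already computed one. SUM and COUNT are distributive, so partial results simply add up; AVG is algebraic, so it is carried as a (sum, count) pair and divided at the end. The function name `coarsen` and the sample rows are assumptions for this illustration.

```python
from collections import defaultdict

ALL = "ALL"

# Hypothetical (year, state, sales) base tuples; note two tuples fall in (2005, WI).
rows = [
    (2005, "WI", 300), (2005, "WI", 200), (2005, "CA", 200),
    (2006, "WI", 150), (2006, "CA", 850),
]

def coarsen(cells, drop):
    """Derive a coarser group-by from a finer one by replacing dimension `drop` with ALL.
    `cells` maps a key tuple to a (sum, count) pair."""
    out = defaultdict(lambda: (0, 0))
    for key, (s, n) in cells.items():
        coarse = key[:drop] + (ALL,) + key[drop + 1:]
        ps, pn = out[coarse]
        out[coarse] = (ps + s, pn + n)
    return dict(out)

# Single scan of the base data for the finest aggregate.
base = defaultdict(lambda: (0, 0))
for year, state, sales in rows:
    s, n = base[(year, state)]
    base[(year, state)] = (s + sales, n + 1)
base = dict(base)

# Every other aggregate is computed from an intermediate result, not from the rows.
by_year = coarsen(base, 1)
by_state = coarsen(base, 0)
total = coarsen(by_year, 0)

grand_sum, grand_count = total[(ALL, ALL)]
grand_avg = grand_sum / grand_count  # algebraic: AVG recovered from (sum, count)
```

A holistic function such as the median has no fixed-size summary of this kind, which is why the slide singles out distributive and algebraic functions as the easy cases.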

Data Warehouse Properties
- Consolidated data from many sources
  - Must create a single unified schema
- Very large size: terabytes of data are common
- Complex read-only queries (no updates)
- Fast response time is important

Creating a Data Warehouse
- Extract data from distributed operational databases
- Clean to minimize errors and fill in missing information
- Transform to reconcile semantic mismatches
  - Performed by defining views over the data sources
- Load to materialize the above defined views
  - Build indexes and additional materialized views
- Refresh to propagate updates to warehouse periodically
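The extract, clean, transform, and load steps can be illustrated with a toy pipeline in Python. Everything here is hypothetical: the two sources, their field names, and the choice to drop incomplete rows rather than fill them in are assumptions made only to show how mismatched schemas are reconciled into one.

```python
# Two hypothetical operational sources with mismatched schemas and units.
source_a = [{"state": "wi", "sales_usd": 500}, {"state": "WI", "sales_usd": None}]
source_b = [{"region": "CA", "sales_cents": 20000}]

def clean(rows, required):
    """Drop rows with a missing required field (a real pipeline might impute instead)."""
    return [r for r in rows if all(r.get(f) is not None for f in required)]

def transform_a(rows):
    """Reconcile source A with the unified schema: upper-case state codes."""
    return [{"state": r["state"].upper(), "sales": r["sales_usd"]} for r in rows]

def transform_b(rows):
    """Reconcile source B: rename region to state and convert cents to dollars."""
    return [{"state": r["region"].upper(), "sales": r["sales_cents"] // 100} for r in rows]

def load(*batches):
    """Materialize the transformed rows into a single warehouse table."""
    return [row for batch in batches for row in batch]

warehouse = load(
    transform_a(clean(source_a, ["state", "sales_usd"])),
    transform_b(clean(source_b, ["region", "sales_cents"])),
)
```

The two `transform_*` functions play the role of the views defined over the data sources, and `load` materializes them.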

Indexes
- Bitmap indexes: good for sparse attributes (few values)

    gender bitmap | customer table               | rating bitmap
    M F           | custid name  gender rating   | 1 2 3 4
    0 1           | 10     Alice F      3        | 0 0 1 0
    1 0           | 11     Bob   M      4        | 0 0 0 1
    1 0           | 12     Chuck M      1        | 1 0 0 0

- Join indexes: to speed up specific join queries
  - Example: Join fact table F with dimension tables D1 and D2
  - Index contains triples of rids <r1, r2, r> from D1, D2, and F that join
  - Alternatively, two indexes, each one with pairs <v1, r> or <v2, r>, where v1, v2 are values of tuples from D1, D2 that join with r
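A bitmap index is easy to sketch with Python integers used as bit vectors. The customer rows below are the three from the slide; the function names and the convention that bit i stands for row i are assumptions of this illustration.

```python
# (custid, name, gender, rating) rows from the slide.
customers = [(10, "Alice", "F", 3), (11, "Bob", "M", 4), (12, "Chuck", "M", 1)]

def build_bitmap_index(rows, col):
    """Map each distinct value of column `col` to an int whose bit i is set
    if and only if row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[col]] = index.get(row[col], 0) | (1 << i)
    return index

def matching_rows(bitmap, rows):
    """Return the rows whose bit is set in `bitmap`."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]

gender = build_bitmap_index(customers, 2)
rating = build_bitmap_index(customers, 3)

# gender = 'M' AND rating IN (3, 4), answered with bitwise operations only.
mask = gender["M"] & (rating.get(3, 0) | rating.get(4, 0))
hits = matching_rows(mask, customers)
```

Conjunctions and disjunctions of predicates reduce to bitwise AND and OR over the bit vectors, which is what makes bitmap indexes attractive for attributes with few distinct values.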

Materialized Views
- How to choose views to materialize?
  - Physical database tuning: see last lecture
- How to keep view up-to-date?
  - Could recompute entire view for each update: expensive
  - Better approach: incremental view maintenance
  - Example: recompute only affected partition
- How often to synchronize?
  - Periodic updates (at night) are typical
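The contrast between full recomputation and incremental maintenance can be shown on a simple SUM view. This Python sketch assumes a view equivalent to SUM(sales) grouped by state; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def recompute(base):
    """Full recomputation: SUM(sales) GROUP BY state over the whole base table."""
    view = defaultdict(int)
    for state, sales in base:
        view[state] += sales
    return dict(view)

def apply_delta(view, inserted=(), deleted=()):
    """Incremental maintenance: touch only the groups affected by the update.
    Note: removing a group once it becomes empty would also require a per-group count."""
    for state, sales in inserted:
        view[state] = view.get(state, 0) + sales
    for state, sales in deleted:
        view[state] -= sales
    return view

base = [("WI", 500), ("CA", 200), ("WI", 150)]
view = recompute(base)

# An update batch: one inserted row and one deleted row.
new_rows = [("CA", 850)]
gone_rows = [("WI", 150)]
apply_delta(view, inserted=new_rows, deleted=gone_rows)

# The incrementally maintained view matches a recomputation over the updated base.
updated_base = [("WI", 500), ("CA", 200), ("CA", 850)]
```

The incremental path does work proportional to the size of the update batch, while recomputation scans the entire base table.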

Additional Extensions for Decision Support
- Window queries

    SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    WINDOW W AS (PARTITION BY L.state
                 ORDER BY T.month
                 RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
                           AND INTERVAL '1' MONTH FOLLOWING)

- Top-k queries: optimize queries to return top k results
- Online aggregation: produce results incrementally
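The semantics of the window query can be sketched in Python: for every input row, average the sales of all rows in the same state whose month lies within one month of the current row. Months are simplified to integers here, and the function name `moving_avg` and the sample rows are assumptions for the illustration.

```python
# Hypothetical (state, month, sales) rows; months are plain integers for simplicity.
rows = [
    ("WI", 1, 100), ("WI", 2, 200), ("WI", 3, 300),
    ("CA", 1, 50), ("CA", 3, 70),
]

def moving_avg(rows):
    """For each row, average sales over the rows of the same state (the partition)
    whose month is within one month before or after the current row's month."""
    out = []
    for state, month, _ in rows:
        window = [s for st, m, s in rows if st == state and abs(m - month) <= 1]
        out.append((state, month, sum(window) / len(window)))
    return out

result = moving_avg(rows)
```

Unlike GROUP BY, the window query returns one output row per input row; the CA rows for months 1 and 3 each see only themselves because no CA row exists for month 2.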