Week 13: Data Warehousing. Warehousing



Similar documents
Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Data Warehousing and OLAP

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Overview of Data Warehousing and OLAP

DATA WAREHOUSING AND OLAP TECHNOLOGY

Part 22. Data Warehousing

Data Warehousing: Data Models and OLAP operations. By Kishore Jaladi

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

DATA WAREHOUSING - OLAP

Week 3 lecture slides

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

14. Data Warehousing & Data Mining

Data Warehousing and OLAP Technology for Knowledge Discovery

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

Data W a Ware r house house and and OLAP II Week 6 1

Data Warehousing Systems: Foundations and Architectures

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

CS2032 Data warehousing and Data Mining Unit II Page 1

Data Warehousing, OLAP, and Data Mining

This tutorial will help computer science graduates to understand the basic-toadvanced concepts related to data warehousing.

Turkish Journal of Engineering, Science and Technology

OLAP and Data Warehousing! Introduction!

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

IST722 Data Warehousing

Data Warehouse: Introduction

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

Fluency With Information Technology CSE100/IMT100

Data Warehousing and Data Mining

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

CS54100: Database Systems

CHAPTER 5: BUSINESS ANALYTICS

When to consider OLAP?

CHAPTER 4 Data Warehouse Architecture

Lection 3-4 WAREHOUSING

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

TIES443. Lecture 3: Data Warehousing. Lecture 3. Data Warehousing. Course webpage:

Module 1: Introduction to Data Warehousing and OLAP

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Multi-dimensional index structures Part I: motivation

Data Warehousing and Online Analytical Processing

Monitoring Genebanks using Datamarts based in an Open Source Tool

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

M Designing and Implementing OLAP Solutions Using Microsoft SQL Server Day Course

Designing a Dimensional Model

Data Warehousing & OLAP

Business Intelligence & Product Analytics

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Lecture 2: Introduction to Business Intelligence. Introduction to Business Intelligence

SAS BI Course Content; Introduction to DWH / BI Concepts

70-467: Designing Business Intelligence Solutions with Microsoft SQL Server

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

CHAPTER 4: BUSINESS ANALYTICS

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Data Mart/Warehouse: Progress and Vision

DATA WAREHOUSE CONCEPTS DATA WAREHOUSE DEFINITIONS

Chapter 3 - Data Replication and Materialized Integration

A Technical Review on On-Line Analytical Processing (OLAP)

OLAP Systems and Multidimensional Expressions I

B.Sc (Computer Science) Database Management Systems UNIT-V

Business Intelligence, Analytics & Reporting: Glossary of Terms

BENEFITS OF AUTOMATING DATA WAREHOUSING

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

BUILDING OLAP TOOLS OVER LARGE DATABASES

UNIT-3 OLAP in Data Warehouse

Data Warehouse design

Ramesh Bhashyam Teradata Fellow Teradata Corporation

IT0457 Data Warehousing. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

OLAP Theory-English version

Building Cubes and Analyzing Data using Oracle OLAP 11g

SQL Server 2012 Business Intelligence Boot Camp

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

BUSINESS ANALYTICS AND DATA VISUALIZATION. ITM-761 Business Intelligence ดร. สล ล บ ญพราหมณ

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

Data Warehousing: A Technology Review and Update Vernon Hoffner, Ph.D., CCP EntreSoft Resouces, Inc.

Transcription:

1 Week 13: Data Warehousing Warehousing Growing industry: $8 billion in 1998 Range from desktop to huge: Walmart: 900-CPU, 2,700 disk, 23TB Teradata system Lots of buzzwords, hype slice & dice, rollup, MOLAP, pivot,... 2

2 Outline What is a data warehouse? Why a warehouse? Models & operations Implementing a warehouse Future directions 3 What is a Warehouse? Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile more 4

3 What is a Warehouse? Collection of tools gathering data cleansing, integrating,... querying, reporting, analysis data mining monitoring, administering warehouse 5 Warehouse Architecture Client Client Query & Analysis Metadata Warehouse Integration Source Source Source 6

4 Motivating Examples Forecasting Comparing performance of units Monitoring, detecting fraud Visualization 7 Why a Warehouse? Two Approaches: Query-Driven (Lazy) Warehouse (Eager)? Source Source 8

5 Query-Driven Approach Client Client Mediator Wrapper Wrapper Wrapper Source Source Source 9 Advantages of Warehousing High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information 10

6 Advantages of Query-Driven No need to copy data less storage no need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources 11 OLTP vs. OLAP OLTP: On Line Transaction Processing Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse 12

7 OLTP vs. OLAP OLTP Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical OLAP Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users 13 Data Marts Smaller warehouses Spans part of organization e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems? 14

8 Warehouse Models & Operators Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other 15 Star product prodid name price p1 bolt 10 p2 nut 5 store storeid city c1 nyc c2 sfo c3 la sale oderid date custid prodid storeid qty amt o100 1/7/97 53 p1 c1 1 12 o102 2/7/97 53 p2 c1 2 11 105 3/8/97 111 p1 c3 5 50 customer custid name address city 53 joe 10 main sfo 81 fred 12 main sfo 111 sally 80 willow la 16

9 17 Star Schema sale orderid date custid prodid storeid qty amt customer custid name address city product prodid name price store storeid city 18 Terms Fact table Dimension tables Measures sale orderid date custid prodid storeid qty amt customer custid name address city product prodid name price store storeid city

10 Dimension Hierarchies store stype city store storeid cityid tid mgr s5 sfo t1 joe s7 sfo t2 fred s9 la t1 nancy snowflake schema constellations region stype tid size location t1 small downtown t2 large suburbs city cityid pop regid sfo 1M north la 5M south region regid name north cold region south warm region 19 Cube Fact table view: sale prodid storeid amt p1 c1 12 p2 c1 11 p1 c3 50 p2 c2 8 Multi-dimensional cube: c1 c2 c3 p1 12 50 p2 11 8 dimensions = 2 20

11 3-D Cube Fact table view: Multi-dimensional cube: sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 dimensions = 3 21 ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing 22

12 Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 81 23 Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 ans date sum 1 81 2 48 24

13 Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodid sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 sale prodid date amt p1 1 62 p2 1 19 p1 2 48 rollup drill-down 25 Aggregates Operators: sum, count, max, min, median, ave Having clause Using dimension hierarchy average by region (within store) maximum by month (within date) 26

14 Cube Aggregation day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 Example: computing sums... c1 c2 c3 p1 56 4 50 p2 11 8 rollup drill-down c1 c2 c3 sum 67 12 50 sum p1 110 p2 19 129 27 Cube Operators day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8... sale(c1,*,*) c1 c2 c3 p1 56 4 50 p2 11 8 sale(c2,p2,*) c1 c2 c3 sum 67 12 50 sum p1 110 p2 19 129 sale(*,*,*) 28

15 Extended Cube * c1 c2 c3 * p1 56 4 50 110 p2 11 8 19 day 2 c1* c267 c3 12 * 50 129 p1 44 4 48 day 1 c1 p2 c2 c3 * p1 12* 44 50 4 62 48 p2 11 8 19 * 23 8 50 81 sale(*,p2,*) 29 Aggregation Using Hierarchies day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 customer region country region A region B p1 56 54 p2 11 8 (customer c1 in Region A; customers c2, c3 in Region B) 30

16 Pivoting Fact table view: sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 Multi-dimensional cube: day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 c1 c2 c3 p1 56 4 50 p2 11 8 31 Query & Analysis Tools Query Building Report Writers (comparisons, growth, graphs, ) Spreadsheet Systems Web Interfaces Data Mining 32

17 Other Operations Time functions e.g., time average Computed Attributes e.g., commission = sales * rate Text Queries e.g., find documents with words X AND B e.g., rank documents by frequency of words X, Y, Z 33 Integration Data Cleaning Data Loading Derived Data Client Query & Analysis Client Metadata Warehouse Integration Source Source Source 34

18 Data Cleaning Migration (e.g., yen dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) billing DB service DB customer1(joe) customer2(joe) merged_customer(joe) Auditing: discover rules & relationships (like data mining) 35 Loading Data Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load 36

19 Derived Data Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh 37 Materialized Views Define new warehouse relations using SQL expressions sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 product id name price p1 bolt 10 p2 nut 5 jointb prodid name price storeid date amt p1 bolt 10 c1 1 12 p2 nut 5 c1 1 11 p1 bolt 10 c3 1 50 p2 nut 5 c2 1 8 p1 bolt 10 c1 2 44 p1 bolt 10 c2 2 4 does not exist at any source 38

20 Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms Client Query & Analysis Client Metadata Warehouse Integration Source Source Source 39 ROLAP Server Relational OLAP Server sale prodid date sum p1 1 62 p2 1 19 p1 2 48 tools utilities ROLAP server Special indices, tuning; Schema is denormalized relational DBMS 40

21 MOLAP Server Multi-Dimensional OLAP Server City A B Sales utilities M.D. tools multidimensional server Product milk soda eggs soap 1 2 3 4 Date could also sit on relational DBMS 41 Index Structures Traditional Access Methods B-trees, hash tables, R-trees, grids, Popular in Warehouses inverted lists bit map indexes join indexes text indexes 42

22 Inverted Lists 18 19 20 23 20 21 22 23 25 26 r4 r18 r34 r35 r5 r19 r37 r40 rid name age r4 joe 20 r18 fred 20 r19 sally 21 r34 nancy 20 r35 tom 20 r36 pat 25 r5 dave 21 r41 jeff 26... age index inverted lists data records 43 Using Inverted Lists Query: Get people with age = 20 and name = fred List for age = 20: r4, r18, r34, r35 List for name = fred : r18, r52 Answer is intersection: r18 44

23 Managing Metadata Warehouse Design Tools Client Query & Analysis Client Metadata Warehouse Integration Source Source Source 45 Metadata Administrative definition of sources, tools,... schemas, dimension hierarchies, rules for extraction, cleaning, refresh, purging policies user profiles, access control,... 46

24 Metadata Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails 47 Design What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 48

25 Tools Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management reliable scripts for cleaning & analyzing data 49 Current State of Industry Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything 50

26 Future Directions Better performance Larger warehouses Easier to use What are companies & research labs working on? 51 Research (1) Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling 52

27 Research (2) Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data 53 Conclusions Massive amounts of data and complexity of queries will push limits of current warehouses Need better systems: easier to use provide quality information 54