Data Warehousing

Outline
  What is a data warehouse?
  Why a warehouse?
  Models & operations
  Implementing a warehouse

What is a Warehouse?
  Collection of diverse data
    subject oriented
    aimed at executive, decision maker
    often a copy of operational data
    with value-added data (e.g., summaries, history)
    integrated
    time-varying
    non-volatile

What is a Warehouse? (continued)
  Collection of tools
    gathering data
    cleansing, integrating, ...
    querying, reporting, analysis
    data mining
    monitoring, administering warehouse
  more

Warehouse Architecture
  [Diagram: several Sources feed an Integration layer, which loads the Warehouse and its Metadata; Query & Analysis clients sit on top of the Warehouse.]

Motivating Examples
  Forecasting
  Comparing performance of units
  Monitoring, detecting fraud
  Visualization

Why a Warehouse?
  Two approaches:
    Query-driven (lazy)
    Warehouse (eager)

Query-Driven Approach
  [Diagram: clients query a Mediator, which reaches each Source through its own Wrapper.]

Advantages of Warehousing
  High query performance
  Queries not visible outside warehouse
  Local processing at sources unaffected
  Can operate when sources unavailable
  Can query data not stored in a DBMS
  Extra information at warehouse
    Modify, summarize (store aggregates)
    Add historical information

Advantages of Query-Driven
  No need to copy data
    less storage
    no need to purchase data
  More up-to-date data
  Query needs can be unknown
  Only query interface needed at sources
  May be less draining on sources

OLTP vs. OLAP
  OLTP: On-Line Transaction Processing
    Describes processing at operational sites
  OLAP: On-Line Analytical Processing
    Describes processing at warehouse

OLTP vs. OLAP (continued)
  OLTP
    Mostly updates
    Many small transactions
    MB-TB of data
    Raw data
    Clerical users
    Up-to-date data
    Consistency, recoverability critical
  OLAP
    Mostly reads
    Queries long, complex
    GB-TB of data
    Summarized, consolidated data
    Decision-makers, analysts as users

Data Marts
  Smaller warehouses
  Spans part of organization
    e.g., marketing (customers, products, sales)
  Do not require enterprise-wide consensus
    but long-term integration problems?

Warehouse Models & Operators
  Data models
    relations
    stars & snowflakes
    cubes
  Operators
    slice & dice
    roll-up, drill down
    pivoting
    other

Star (example data; some cells are not legible in the transcript)
  product(prodid, name, price)
    p1   bolt   1
    p2   nut    5
  store(storeid, city)
    c1   nyc
    c2   sfo
    c3   la
  customer(custid, name, address, city)
    53    joe     1 main     sfo
    81    fred    12 main    sfo
    111   sally   8 willow   la
  sale(orderid, date, custid, prodid, storeid, qty, amt)
    o1    1/7/97   53
    o12   2/7/97   53    p2   c1   2   11
    15    3/8/97   111   p1   c3   5   5

Star Schema
  Fact table at the center:
    sale(orderid, date, custid, prodid, storeid, qty, amt)
  Dimension tables around it, each joined to sale through its key:
    product(prodid, name, price)
    customer(custid, name, address, city)
    store(storeid, city)
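A minimal sketch of the star schema above, built as an in-memory SQLite database from Python. The table and column names follow the slide; the use of SQLite, the reduced column set, and the four sale amounts are assumptions made for illustration only.

```python
import sqlite3

# In-memory database; SQLite is an assumption, any relational DBMS would do.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodid  TEXT PRIMARY KEY, name TEXT);
CREATE TABLE store   (storeid TEXT PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, with a foreign key into each dimension table.
CREATE TABLE sale (
    prodid  TEXT REFERENCES product(prodid),
    storeid TEXT REFERENCES store(storeid),
    amt     INTEGER                      -- the measure
);
INSERT INTO product VALUES ('p1', 'bolt'), ('p2', 'nut');
INSERT INTO store   VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
-- Illustrative amounts (assumed, not taken from a real source).
INSERT INTO sale VALUES ('p1','c1',12), ('p2','c1',11), ('p1','c3',50), ('p2','c2',8);
""")

# A typical star join: the fact table is joined to its dimensions,
# then the measure is aggregated by dimension attributes.
sales_by_name_and_city = conn.execute("""
SELECT p.name, s.city, SUM(f.amt)
FROM sale AS f
JOIN product AS p ON f.prodid  = p.prodid
JOIN store   AS s ON f.storeid = s.storeid
GROUP BY p.name, s.city
ORDER BY p.name, s.city
""").fetchall()
```

The fact table stays narrow (keys plus measures), while descriptive attributes such as name and city live only in the dimension tables.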

Terms
  Fact table
  Dimension tables
  Measures
  [Diagram: the star schema, with sale(orderid, date, custid, prodid, storeid, qty, amt) as the fact table and product, customer, store as dimension tables.]

Dimension Hierarchies
  store -> stype
  store -> city -> region
  snowflake schema
  store(storeid, cityid, tid, mgr)
    s5   sfo   t1   joe
    s7   sfo   t2   fred
    s9   la    t1   nancy
  stype(tid, size, location)
    t1   small   downtown
    t2   large   suburbs
  city(cityid, pop, regid)
    sfo   1M   north
    la    5M   south
  region(regid, name)
    north   cold region
    south   warm region

Cube
  Fact table view: sale(prodid, storeid, amt)
    p1   c1   12
    p2   c1   11
    p1   c3   50
    p2   c2    8
  Multi-dimensional cube:
          c1   c2   c3
    p1    12        50
    p2    11    8
  dimensions = 2

3-D Cube
  Fact table view: sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  Multi-dimensional cube:
    day 1   c1   c2   c3
      p1    12        50
      p2    11    8
    day 2   c1   c2   c3
      p1    44    4
      p2
  dimensions = 3
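The step from the fact table view to the multi-dimensional cube can be sketched in a few lines of Python. This is only an illustration: `build_cube` is a hypothetical helper, and the amounts are the ones implied by the slide totals (81 for day 1, 48 for day 2).

```python
from collections import defaultdict

# Fact rows: (prodid, storeid, date, amt).
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def build_cube(fact_rows, dim_positions, measure_position=3):
    """Sum the measure into cells keyed by the chosen dimension columns."""
    cube = defaultdict(int)
    for row in fact_rows:
        cell = tuple(row[i] for i in dim_positions)
        cube[cell] += row[measure_position]
    return dict(cube)

# 2-D cube (product x store) over the day-1 rows only, as on the "Cube" slide.
day1_rows = [row for row in SALE if row[2] == 1]
cube_2d = build_cube(day1_rows, (0, 1))

# 3-D cube (product x store x date) over all rows.
cube_3d = build_cube(SALE, (0, 1, 2))
```

Cells that never appear in the fact table are simply absent from the dictionary, which mirrors the empty cells in the cube drawings.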

ROLAP vs. MOLAP
  ROLAP: Relational On-Line Analytical Processing
  MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates
  Add up amounts for day 1
  In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
  sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  Result: 81

Aggregates (continued)
  Add up amounts for days 1 and 2
  In SQL: SELECT sum(amt) FROM SALE WHERE date >= 1 AND date <= 2
  sale(prodid, storeid, date, amt), this slide uses dates 1 to 3
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   2   40
    p2   c2   2    8
    p1   c1   3   44
    p1   c2   3    4
  Result: 71

Aggregates (continued)
  Add up amounts by day
  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
  sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  ans(date, sum)
    1   81
    2   48
  Going from sale to ans is a rollup; going from ans back to sale is a drill-down.
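The two aggregate queries from the slides can be run as written against a small SQLite table. A minimal sketch, assuming Python's built-in sqlite3 module and the sale rows implied by the slide totals (81 for day 1, 48 for day 2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodid TEXT, storeid TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# "Add up amounts for day 1"
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# "Add up amounts by day" (ORDER BY added so the row order is deterministic)
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()
```

Here `day1_total` is 81 and `by_day` holds one row per date, matching the ans(date, sum) table on the slide.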

Another Example
  Add up amounts by day, product
  In SQL: SELECT prodid, date, sum(amt) FROM SALE GROUP BY date, prodid
  sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  Result: sale(prodid, date, amt)
    p1   1   62
    p2   1   19
    p1   2   48
  rollup: detail -> result; drill-down: result -> detail

Another Example
  Add up amounts by month
  In SQL: SELECT month, prodid, storeid, sum(amt) FROM SALE JOIN DATE GROUP BY month, prodid, storeid
  (DATE is the dimension table that maps each date to its month.)
  Result: sale(prodid, storeid, month, amt)
    p1   c1   sep   23
    p2   c2   sep   58
    p1   c3   oct   48
  For example, the two September sales of p1 at c1 (12 and 11) roll up to 23.
  rollup: detail -> result; drill-down: result -> detail

Aggregates
  Operators: sum, count, max, min, median, avg
  "Having" clause
  Using dimension hierarchy
    average by region (within store)
    maximum by month (within date)

Operations on the Cube
  slicing (equality selection): e.g., keep only the c1 plane of the cube
  dicing (range selection): e.g., keep only stores c1 to c2
  [Diagram: the product x store x date cube, with one slice and one diced sub-cube highlighted.]
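Roll-up along a dimension hierarchy, slicing, and dicing can each be sketched as a small function over the fact rows. This is an illustration under stated assumptions: the day-to-month mapping `MONTH_OF_DAY` is hypothetical, and the function names are not from the slides.

```python
from collections import defaultdict

# Fact rows: (prodid, storeid, date, amt).
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Hypothetical date dimension: which month each day belongs to.
MONTH_OF_DAY = {1: "sep", 2: "oct"}

def roll_up_to_month(rows):
    """Replace the date by its month and sum the amounts (roll-up)."""
    totals = defaultdict(int)
    for prodid, storeid, date, amt in rows:
        totals[(prodid, storeid, MONTH_OF_DAY[date])] += amt
    return dict(totals)

def slice_rows(rows, storeid):
    """Slicing: equality selection on one dimension."""
    return [row for row in rows if row[1] == storeid]

def dice_rows(rows, first_day, last_day):
    """Dicing: range selection on one dimension."""
    return [row for row in rows if first_day <= row[2] <= last_day]

monthly = roll_up_to_month(SALE)
c1_slice = slice_rows(SALE, "c1")
day1_dice = dice_rows(SALE, 1, 1)
```

Drill-down is the reverse of `roll_up_to_month`: it needs the detail rows, which is why a warehouse keeps them (or a finer-grained aggregate) available.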

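Cube Aggregation
  Example: computing sums
  Starting cube:
    day 1   c1   c2   c3
      p1    12        50
      p2    11    8
    day 2   c1   c2   c3
      p1    44    4
      p2
  Summed over date:
           c1   c2   c3
    p1     56    4   50
    p2     11    8
  Summed over date and product:
           c1   c2   c3
    sum    67   12   50     (total 129)
  Summed over date and store:
           sum
    p1     110
    p2      19
  rollup moves toward the totals; drill-down moves back toward the detail.

Cube Operators
  sale(c1,*,*)   sum over all products and dates for store c1
  sale(c2,p2,*)  sum over all dates for store c2, product p2
  sale(*,*,*)    grand total

Extended Cube
  Each dimension gets an extra value * (all) holding the aggregate over that dimension.
    day 1   c1   c2   c3    *
      p1    12        50   62
      p2    11    8        19
      *     23    8   50   81
    day 2   c1   c2   c3    *
      p1    44    4        48
      *     44    4        48
    *       c1   c2   c3    *
      p1    56    4   50  110
      p2    11    8        19
      *     67   12   50  129
  Example cell: sale(*,p2,*) = 19

Aggregation Using Hierarchies
  Hierarchy: customer -> region -> country
  (customer c1 in Region A; customers c2, c3 in Region B)
  Rolling the cube up from customer to region:
           region A   region B
    p1        56         54
    p2        11          8
  rollup: customer -> region; drill-down: region -> customer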
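The extended cube with its * (all) entries can be computed by adding every fact row to each cell it contributes to. A minimal sketch, assuming cells are addressed as sale(storeid, prodid, date) like the sale(c1,*,*) notation above; `extended_cube` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product as cartesian

# Fact rows: (prodid, storeid, date, amt).
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def extended_cube(rows):
    """Return {(storeid, prodid, date): sum}, where any coordinate may be '*'."""
    cube = defaultdict(int)
    for prodid, storeid, date, amt in rows:
        # Each row contributes to 2**3 cells: every coordinate is either
        # its own value or '*' (aggregated away).
        for s, p, d in cartesian((storeid, "*"), (prodid, "*"), (date, "*")):
            cube[(s, p, d)] += amt
    return dict(cube)

cube = extended_cube(SALE)
store_c1_total = cube[("c1", "*", "*")]   # sale(c1,*,*)
c2_p2_total = cube[("c2", "p2", "*")]     # sale(c2,p2,*)
grand_total = cube[("*", "*", "*")]       # sale(*,*,*)
```

Computing every cell eagerly like this is only reasonable for tiny examples; deciding which of these aggregates to actually store is the materialization question taken up later in the slides.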

Pivoting
  Fact table view: sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  Multi-dimensional cube:
    day 1   c1   c2   c3
      p1    12        50
      p2    11    8
    day 2   c1   c2   c3
      p1    44    4
      p2
  Pivoted to product x store (summed over date):
           c1   c2   c3
    p1     56    4   50
    p2     11    8

Query & Analysis Tools
  Query building
  Report writers (comparisons, growth, graphs, ...)
  Spreadsheet systems
  Web interfaces
  Data mining

Other Operations
  Time functions
    e.g., time average
  Computed attributes
    e.g., commission = sales * rate
  Text queries
    e.g., find documents with words X AND B
    e.g., rank documents by frequency of words X, Y, Z

Implementing a Warehouse
  Monitoring: Sending data from sources
  Integrating: Loading, cleansing, ...
  Processing: Query processing, indexing, ...
  Managing: Metadata, design, ...

Monitoring
  Source types: relational, flat file, IMS, VSAM, WWW, news-wire, ...
  Incremental vs. refresh
  customer(id, name, address, city)
    53    joe     1 main     sfo
    81    fred    12 main    sfo
    111   sally   8 willow   la     <- new

Monitoring Techniques
  Periodic snapshots
  Database triggers
  Log shipping
  Data shipping (replication service)
  Transaction shipping
  Polling (queries to source)
  Application level monitoring
  Advantages & disadvantages!!

Monitoring Issues
  Frequency
    periodic: daily, weekly, ...
    triggered: on "big" change, lots of changes, ...
  Data transformation
    convert data to uniform format
    remove & add fields (e.g., add date to get history)
  Standards (e.g., ODBC)
  Gateways

Integration
  Data cleaning
  Data loading
  Derived data
  [Diagram: the warehouse architecture again (Sources -> Integration -> Warehouse with Metadata -> Query & Analysis), with the Integration layer highlighted.]
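One way to turn periodic snapshots into incremental changes is to compare two consecutive snapshots by key. The sketch below is an assumption about how such a monitor might look, not a technique spelled out on the slides; `snapshot_delta` is a hypothetical helper and the customer rows are the ones from the Monitoring slide.

```python
def snapshot_delta(old, new):
    """Compare two snapshots ({key: row}) and return (inserted, deleted, updated)."""
    inserted = {key: row for key, row in new.items() if key not in old}
    deleted = {key: row for key, row in old.items() if key not in new}
    updated = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return inserted, deleted, updated

# Customer table keyed by id: (name, address, city).
yesterday = {
    53: ("joe", "1 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
today = {
    53: ("joe", "1 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
    111: ("sally", "8 willow", "la"),   # the row marked "new" on the slide
}

inserted, deleted, updated = snapshot_delta(yesterday, today)
```

Only the delta (here a single inserted row) needs to be shipped to the warehouse, as opposed to a refresh, which would resend the whole table.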

Data Cleaning
  Migration (e.g., yen -> dollars)
  Scrubbing: use domain-specific knowledge (e.g., social security numbers)
  Fusion (e.g., customer merging)
    billing DB: customer1(joe)
    service DB: customer2(joe)
    result: merged_customer(joe)
  Auditing: discover rules & relationships (like data mining)

Loading Data
  Incremental vs. refresh
  Off-line vs. on-line
  Frequency of loading
    At night, 1x a week/month, continuously
  Parallel/partitioned load

Derived Data
  Derived warehouse data
    indexes
    aggregates
    materialized views (next slide)
  When to update derived data?
  Incremental vs. refresh

Materialized Views
  Define new warehouse relations using SQL expressions
  sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  product(id, name, price)
    p1   bolt   1
    p2   nut    5
  jointb(prodid, name, price, storeid, date, amt), does not exist at any source
    p1   bolt   1   c1   1   12
    p2   nut    5   c1   1   11
    p1   bolt   1   c3   1   50
    p2   nut    5   c2   1    8
    p1   bolt   1   c1   2   44
    p1   bolt   1   c2   2    4
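The "incremental vs. refresh" choice for derived data can be illustrated with a small aggregate view (total amount per date). This is a sketch under assumptions: the view is a plain dictionary, only insertions arrive, and the function names are hypothetical.

```python
from collections import defaultdict

def refresh(fact_rows):
    """Refresh: recompute the aggregate view from all fact rows."""
    view = defaultdict(int)
    for _prodid, _storeid, date, amt in fact_rows:
        view[date] += amt
    return dict(view)

def apply_increment(view, new_rows):
    """Incremental: fold only the newly loaded rows into the existing view."""
    for _prodid, _storeid, date, amt in new_rows:
        view[date] = view.get(date, 0) + amt
    return view

# Rows already in the warehouse (day 1) and a newly loaded batch (day 2).
loaded = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50), ("p2", "c2", 1, 8)]
batch = [("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

view = refresh(loaded)               # initial build
view = apply_increment(view, batch)  # cheap update after the load
recomputed = refresh(loaded + batch) # what a full refresh would produce
```

For sums and counts under insertions the two approaches agree; deletions, updates, and operators such as median make incremental maintenance harder.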

Processing
  ROLAP servers vs. MOLAP servers
  Index structures
  What to materialize?
  Algorithms

ROLAP Server
  Relational OLAP server
  [Diagram: tools and utilities on top of a ROLAP server, which sits on a relational DBMS.]
  Example relation: sale(prodid, date, sum)
    p1   1   62
    p2   1   19
    p1   2   48
  Special indices, tuning; schema is "denormalized"

MOLAP Server
  Multi-dimensional OLAP server
  [Diagram: M.D. tools and utilities on top of a multidimensional server; the server could also sit on a relational DBMS.]
  Example cube: Sales by Product (milk, soda, eggs, soap), City (A, B), Date (1, 2, 3, 4)

Index Structures
  Traditional access methods
    B-trees, hash tables, grids, ...
  Popular in warehouses
    inverted lists
    bit map indexes
    join indexes
    text indexes

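Inverted Lists
  [Diagram: age index -> inverted lists -> data records]
  age index: entries for ages 18, 19, 20, 21, 22, 23, 25, 26 (the range 20 to 23 is shown expanded)
  inverted lists (rids as they appear in the transcript)
    age 20: r4, r18, r34, r35
    age 21: r5, r19, r37, r4
  data records (rid, name, age)
    r4    joe     20
    r18   fred    20
    r19   sally   21
    r34   nancy   20
    r35   tom     20
    r36   pat     25
    r5    dave    21
    r41   jeff    26
    ...

Using Inverted Lists
  Query: Get people with age = 20 and name = "fred"
  List for age = 20: r4, r18, r34, r35
  List for name = "fred": r18, r52
  Answer is intersection: r18

Bit Maps
  [Diagram: age index -> bit maps -> data records]
  age index: entries for ages 18, 19, 20, 21, 22, 23, 25, 26 (the range 20 to 23 is shown expanded)
  bit maps: one bit vector per age value, one bit per record; a bit is 1 if that record has the value
  data records (id, name, age)
    1   joe     20
    2   fred    20
    3   sally   21
    4   nancy   20
    5   tom     20
    6   pat     25
    7   dave    21
    8   jeff    26
    ...

Using Bit Maps
  Query: Get people with age = 20 and name = "fred"
  List for age = 20: 11011000... (bits 1, 2, 4, 5 set)
  List for name = "fred": 01000000... (bit 2 set, plus one record beyond those shown)
  Answer is intersection (bitwise AND): 01000000... (record 2)
  Good if domain cardinality small
  Bit vectors can be compressed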
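Both index types answer the example query by intersecting per-value entries. A minimal sketch over the eight records shown on the Bit Maps slide, assuming the ages read 20, 21, 25, 26; the helper names are hypothetical, and a Python integer stands in for each bit vector.

```python
from collections import defaultdict

# Data records: (id, name, age).
PEOPLE = [
    (1, "joe", 20), (2, "fred", 20), (3, "sally", 21), (4, "nancy", 20),
    (5, "tom", 20), (6, "pat", 25), (7, "dave", 21), (8, "jeff", 26),
]

def inverted_index(records, column):
    """Map each value of the column to the list of record ids holding it."""
    index = defaultdict(list)
    for record in records:
        index[record[column]].append(record[0])
    return index

def bitmap_index(records, column):
    """Map each value to an int whose bit i is set if record i holds it."""
    index = defaultdict(int)
    for position, record in enumerate(records):
        index[record[column]] |= 1 << position
    return index

# Inverted lists: intersect the two id lists.
age_lists = inverted_index(PEOPLE, 2)
name_lists = inverted_index(PEOPLE, 1)
via_lists = sorted(set(age_lists[20]) & set(name_lists["fred"]))

# Bit maps: AND the two bit vectors, then read off the set bits.
age_bits = bitmap_index(PEOPLE, 2)
name_bits = bitmap_index(PEOPLE, 1)
hits = age_bits[20] & name_bits["fred"]
via_bitmaps = [PEOPLE[i][0] for i in range(len(PEOPLE)) if (hits >> i) & 1]
```

Note that the integer stores record 1 in the least significant bit, so its binary printout reads right to left compared with the left-to-right bit strings on the slide.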

Join
  Combine SALE, PRODUCT relations
  In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodid = PRODUCT.id
  sale(prodid, storeid, date, amt)
    p1   c1   1   12
    p2   c1   1   11
    p1   c3   1   50
    p2   c2   1    8
    p1   c1   2   44
    p1   c2   2    4
  product(id, name, price)
    p1   bolt   1
    p2   nut    5
  jointb(prodid, name, price, storeid, date, amt)
    p1   bolt   1   c1   1   12
    p2   nut    5   c1   1   11
    p1   bolt   1   c3   1   50
    p2   nut    5   c2   1    8
    p1   bolt   1   c1   2   44
    p1   bolt   1   c2   2    4

Join Indexes
  join index: each product row lists the rids of the sale rows it joins with
  product(id, name, price, jindex)
    p1   bolt   1   r1, r3, r5, r6
    p2   nut    5   r2, r4
  sale(rid, prodid, storeid, date, amt)
    r1   p1   c1   1   12
    r2   p2   c1   1   11
    r3   p1   c3   1   50
    r4   p2   c2   1    8
    r5   p1   c1   2   44
    r6   p1   c2   2    4

What to Materialize?
  Store in warehouse results useful for common queries
  Example:
    per-day cubes (day 1, day 2, ...)
    total sales (summed over date):
             c1   c2   c3
      p1     56    4   50
      p2     11    8
    totals per store: 67, 12, 50 (grand total 129)
    totals per product: p1 110, p2 19
  [Diagram: arrows labelled "materialize" mark which of these views are stored in the warehouse.]

Materialization Factors
  Type/frequency of queries
  Query response time
  Storage cost
  Update cost
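A join index precomputes, for each product, which sale rows join with it, so the join can be answered by direct lookups. A minimal sketch using the rids from the slide; the dictionaries stand in for the relations, price is left out, and `build_join_index` and `join_via_index` are hypothetical names.

```python
from collections import defaultdict

# sale rows keyed by rid: (prodid, storeid, date, amt).
SALE = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}
# product rows keyed by id (name only; price omitted in this sketch).
PRODUCT = {"p1": "bolt", "p2": "nut"}

def build_join_index(sale_rows):
    """Map each product id to the rids of the sale rows that reference it."""
    index = defaultdict(list)
    for rid, (prodid, _storeid, _date, _amt) in sale_rows.items():
        index[prodid].append(rid)
    return dict(index)

def join_via_index(prodid, join_index):
    """Return the joined rows for one product without scanning the sale table."""
    name = PRODUCT[prodid]
    return [(prodid, name) + SALE[rid][1:] for rid in join_index[prodid]]

join_index = build_join_index(SALE)
nut_sales = join_via_index("p2", join_index)
```

The index costs extra storage and must be maintained on every load, which is the same trade-off listed under Materialization Factors.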

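Cube Aggregates Lattice
  Lattice of group-by views, from most aggregated (top) to most detailed (bottom):
    all
    city; product; date
    city, product; city, date; product, date
    city, product, date
  Example contents:
    all: 129
    city: c1 67, c2 12, c3 50
    city, product: p1 56 4 50; p2 11 8
    city, product, date: the per-day cubes (day 1: p1 12, 50; day 2: p1 44, 4; ...)
  use greedy algorithm to decide what to materialize

Dimension Hierarchies
  all -> state -> city
  cities(city, state)
    c1   CA
    c2   NY

Dimension Hierarchies (continued)
  Lattice extended with the state level:
    all
    city; product; date; state
    city, product; city, date; product, date; state, product; state, date
    city, product, date; state, product, date
  not all arcs shown...

Interesting Hierarchy
  all
  years
  quarters
  weeks; months
  days
  (days roll up both to weeks and to months; weeks do not nest inside months)
  time(day, week, month, quarter, year), conceptual dimension table
    1   1   1   1   2
    2   1   1   1   2
    3   1   1   1   2
    4   1   1   1   2
    5   1   1   1   2
    6   1   1   1   2
    7   1   1   1   2
    8   2   1   1   2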
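The lattice itself is easy to enumerate, and it tells us which materialized view can answer which group-by: a query can be computed from any view that contains all of its dimensions. The sketch below illustrates that relationship only; it is not the greedy selection algorithm the slide refers to, and the view sizes are assumed row counts.

```python
from itertools import combinations

DIMENSIONS = ("city", "product", "date")

def lattice(dimensions):
    """All group-by views, from the empty grouping ('all') to the full one."""
    return [
        frozenset(combo)
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]

def can_answer(query_dims, view_dims):
    """A view can answer a query if it keeps every dimension the query needs."""
    return set(query_dims) <= set(view_dims)

def cheapest_source(query_dims, materialized):
    """Pick the smallest materialized view (by assumed row count) that can answer the query."""
    candidates = [
        (size, view) for view, size in materialized.items()
        if can_answer(query_dims, view)
    ]
    return min(candidates, key=lambda candidate: candidate[0])[1]

# Assumed sizes (row counts) of the views that happen to be materialized.
MATERIALIZED = {
    frozenset(DIMENSIONS): 6,
    frozenset({"city", "product"}): 5,
    frozenset({"product"}): 2,
}

views = lattice(DIMENSIONS)
source_for_city = cheapest_source({"city"}, MATERIALIZED)
source_for_date = cheapest_source({"date"}, MATERIALIZED)
```

A selection algorithm would use the same `can_answer` relation to estimate how much each additional view reduces query cost before deciding whether to store it.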

Managing
  Metadata
  Warehouse design
  Tools
  [Diagram: the warehouse architecture again (Sources -> Integration -> Warehouse with Metadata -> Query & Analysis).]

Metadata
  Administrative
    definition of sources, tools, ...
    schemas, dimension hierarchies, ...
    rules for extraction, cleaning, ...
    refresh, purging policies
    user profiles, access control, ...

Metadata (continued)
  Business
    business terms & definition
    data ownership, charging
  Operational
    data lineage
    data currency (e.g., active, archived, purged)
    use stats, error reports, audit trails

Design
  What data is needed?
  Where does it come from?
  How to clean data?
  How to represent in warehouse (schema)?
  What to summarize?
  What to materialize?
  What to index?

Tools
  Development
    design & edit: schemas, views, scripts, rules, queries, reports
  Planning & analysis
    what-if scenarios (schema changes, refresh rates), capacity planning
  Warehouse management
    performance monitoring, usage patterns, exception reporting
  System & network management
    measure traffic (sources, warehouse, clients)
  Workflow management
    "reliable scripts" for cleaning & analyzing data

Current State of Industry
  Extraction and integration done off-line
    Usually in large, time-consuming batches
  Everything copied at warehouse
    Not selective about what is stored
    Query benefit vs. storage & update cost
  Query optimization aimed at OLTP
    High throughput instead of fast response
    Process whole query before displaying anything

Future Directions
  Better performance
  Larger warehouses
  Easier to use