Data Warehouse Logical Design Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)
Data Mart logical models
- MOLAP (Multidimensional On-Line Analytical Processing): stores data by using a multidimensional data structure
- ROLAP (Relational On-Line Analytical Processing): uses the relational data model to represent multidimensional data
Data Mart logical models
- MOLAP stands for Multidimensional OLAP. In MOLAP cubes the data aggregations and a copy of the fact data are stored in a multidimensional structure on the server. It is best when extra storage space is available on the server and the best query performance is desired. MOLAP local cubes contain all the data necessary for calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response time, but require additional storage space for the extra copy of the fact data.
- ROLAP stands for Relational OLAP. ROLAP uses the relational data model to represent multidimensional data. In ROLAP cubes a copy of the fact data is not (necessarily) made, and the data aggregates may be stored in tables in the source relational database. A ROLAP cube is best when there is limited space on the server and query performance is not critical. ROLAP local cubes contain the dimensions and cube definitions, but aggregates are calculated on demand. ROLAP cubes require less storage space than MOLAP and HOLAP cubes.
- HOLAP stands for Hybrid OLAP. A HOLAP cube combines ROLAP and MOLAP characteristics: it does not necessarily create a copy of the source data, but data aggregations are stored in a multidimensional structure on the server. HOLAP cubes are best when storage space is limited but faster query responses are needed.
ROLAP
It is based on the Star Schema. A star schema is:
- A set of relations DT1, DT2, ..., DTn (dimension tables), each corresponding to a dimension. Each DTi is characterized by a primary key di and by a set of attributes describing the analysis dimension at different aggregation levels
- A relation FT, the fact table, that imports the primary keys of the dimension tables. The primary key of FT is the combination d1 d2 ... dn; FT also contains an attribute for each measure
Star schema: example
WEEK(ID_Week, Week, Month)
PRODUCT(ID_Product, Product, Type, Category, Supplier)
SHOP(ID_Shop, Shop, City, State, Agent)
SALE(ID_Shop, ID_Week, ID_Product, Quantity, Profit)
The SALE fact table records Quantity and Profit along the hierarchies week -> month, product -> type -> category (plus supplier) and shop -> city -> state (plus agent).
Star schema: considerations
- Dimension table keys are surrogates, for space efficiency reasons
- Dimension tables are de-normalized: product -> type -> category is a transitive dependency
- De-normalization introduces redundancy, but fewer joins to perform
- The fact table contains information expressed at different aggregation levels
OLAP queries on the Star Schema

select City, Week, Type, sum(Quantity)
from Week, Shop, Product, Sale
where Week.ID_Week = Sale.ID_Week
  and Shop.ID_Shop = Sale.ID_Shop
  and Product.ID_Product = Sale.ID_Product
  and Product.Category = 'FoodStuff'
group by City, Week, Type
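The star schema and the query above can be tried end to end. The sketch below builds a tiny in-memory instance with Python's sqlite3; all table and column names follow the example, while the data rows and the supplier/agent values are invented for illustration.

```python
import sqlite3

# Build the star schema of the example on a throwaway in-memory database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE WEEK (ID_Week INTEGER PRIMARY KEY, Week TEXT, Month TEXT);
CREATE TABLE PRODUCT (ID_Product INTEGER PRIMARY KEY, Product TEXT,
                      Type TEXT, Category TEXT, Supplier TEXT);
CREATE TABLE SHOP (ID_Shop INTEGER PRIMARY KEY, Shop TEXT, City TEXT,
                   State TEXT, Agent TEXT);
CREATE TABLE SALE (ID_Shop INTEGER, ID_Week INTEGER, ID_Product INTEGER,
                   Quantity INTEGER, Profit REAL,
                   PRIMARY KEY (ID_Shop, ID_Week, ID_Product));
""")
cur.executemany("INSERT INTO WEEK VALUES (?,?,?)",
                [(1, "W1", "Jan"), (2, "W2", "Jan")])
cur.executemany("INSERT INTO PRODUCT VALUES (?,?,?,?,?)",
                [(1, "Bread", "Bakery", "FoodStuff", "S1"),
                 (2, "Soap", "Hygiene", "NonFood", "S2")])
cur.executemany("INSERT INTO SHOP VALUES (?,?,?,?,?)",
                [(1, "COOP1", "Bologna", "E.R.", "A1")])
cur.executemany("INSERT INTO SALE VALUES (?,?,?,?,?)",
                [(1, 1, 1, 10, 5.0), (1, 2, 1, 20, 10.0), (1, 1, 2, 7, 3.5)])

# The slide's OLAP query: one join per dimension table, then group by.
rows = cur.execute("""
SELECT City, Week, Type, SUM(Quantity)
FROM WEEK JOIN SALE ON WEEK.ID_Week = SALE.ID_Week
          JOIN SHOP ON SHOP.ID_Shop = SALE.ID_Shop
          JOIN PRODUCT ON PRODUCT.ID_Product = SALE.ID_Product
WHERE PRODUCT.Category = 'FoodStuff'
GROUP BY City, Week, Type
""").fetchall()
print(rows)
```

Note how, in a star schema, answering the query always costs exactly one join per involved dimension, regardless of the aggregation level requested.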
Snowflake schema
The snowflake schema reduces the de-normalization of the dimension tables DTi of a star schema
- Removal of some transitive dependencies
Dimension tables of a snowflake schema are composed of:
- A primary key di,j
- A subset of the DTi attributes that directly depend on di,j
- Zero or more external (foreign) keys that allow one to obtain the entire information
Snowflake schema
In a snowflake schema:
- Primary dimension tables: their keys are imported into the fact table
- Secondary dimension tables: reached from the primary ones through external keys
Snowflake schema: example
WEEK(ID_Week, Week, Month)
PRODUCT(ID_Product, Product, ID_Type, Supplier)
TYPE(ID_Type, Type, Category)
SHOP(ID_Shop, Shop, ID_City, Agent)
CITY(ID_City, City, State)
SALE(ID_Shop, ID_Week, ID_Product, Quantity, Profit)
PRODUCT and SHOP are primary dimension tables (DT1,1); TYPE and CITY are secondary dimension tables (DT1,2), reached through the external keys ID_Type and ID_City.
Snowflake schema: considerations
- Reduction of memory space
- New surrogate keys
- Advantages in the execution of queries that only involve attributes contained in the fact table and in the primary dimension tables
Normalization & Snowflake schema If there exists a cascade of transitive dependencies, attributes depending (transitively or not) on the snowflake attribute are placed in a new relation
OLAP queries on the snowflake schema

select City, Week, Type, sum(Quantity)
from Week, Shop, Type, City, Product, Sale
where Week.ID_Week = Sale.ID_Week
  and Shop.ID_Shop = Sale.ID_Shop
  and Shop.ID_City = City.ID_City
  and Product.ID_Product = Sale.ID_Product
  and Product.ID_Type = Type.ID_Type
  and Type.Category = 'FoodStuff'
group by City, Week, Type
Views
- Aggregation allows one to consider concise (summarized) information
- Aggregation computation is very expensive -> pre-computation
- A view denotes a fact table containing aggregate data
Views
A view can be characterized by its aggregation level (pattern)
- Primary views: correspond to the primary aggregation levels
- Secondary views: correspond to secondary aggregation levels (secondary events)
Views (multidimensional lattice)
v1 = {product, date, shop}
v2 = {type, date, city}
v3 = {category, month, city}
v4 = {type, month, region}
v5 = {trimester, region}
vi <= vj iff vi is less aggregated than vj, i.e. vj's data can be computed from vi's data
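The roll-up relationship in the lattice can be sketched in plain Python: a made-up v1 instance at the {product, date, shop} pattern is regrouped into the coarser pattern {type, date, city}, which is exactly why vj's data can be computed from vi's. The mapping dictionaries (product to type, shop to city) and all values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchy mappings: product -> type, shop -> city.
product_type = {"bread": "bakery", "rolls": "bakery", "soap": "hygiene"}
shop_city = {"coop1": "bologna", "coop2": "bologna"}

# v1 = {product, date, shop} -> quantity (the least aggregated view).
v1 = {
    ("bread", "2024-01-01", "coop1"): 10,
    ("rolls", "2024-01-01", "coop2"): 5,
    ("soap",  "2024-01-01", "coop1"): 3,
}

# Roll up v1 to v2 = {type, date, city}: since v1 <= v2 in the lattice,
# v2 is entirely computable from v1 by regrouping and summing.
v2 = defaultdict(int)
for (product, date, shop), qty in v1.items():
    v2[(product_type[product], date, shop_city[shop])] += qty

print(dict(v2))
```

The same regrouping works for any pair vi <= vj; this is what lets a query optimizer answer a query on vj from a materialized vi.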
Partial aggregations Sometimes it is useful to introduce new measures in order to manage aggregations correctly Derived measures: obtained by applying mathematical operators to two or more values of the same tuple
Partial aggregations
Profit = Quantity * Price

Primary table:
Type  Product  Quantity  Price   Profit
T1    P1       5         1.00     5.00
T1    P2       7         1.50    10.50
T2    P3       9         0.80     7.20
                         total:  22.70

Aggregating by Type with SUM(Quantity) and AVG(Price):
Type  Quantity  Price   Profit
T1    12        1.25    15.00
T2     9        0.80     7.20
                total:  22.20

We can't compute profits from the aggregated values as before: 22.20 differs from the true total 22.70. The correct solution consists in aggregating the profits computed on the primary table.
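The pitfall above can be reproduced directly with the slide's numbers: deriving Profit from aggregated Quantity and averaged Price gives a different (wrong) total than aggregating the profits computed at the primary level.

```python
# The slide's primary table: (type, product, quantity, price).
rows = [
    ("T1", "P1", 5, 1.00),
    ("T1", "P2", 7, 1.50),
    ("T2", "P3", 9, 0.80),
]

# Correct: compute Profit = Quantity * Price at the finest level, then sum.
correct = sum(q * p for _, _, q, p in rows)

# Wrong: first aggregate per type (SUM of quantity, AVG of price),
# then multiply the aggregates.
by_type = {}
for t, _, q, p in rows:
    qs, ps = by_type.get(t, ([], []))
    qs.append(q)
    ps.append(p)
    by_type[t] = (qs, ps)
wrong = sum(sum(qs) * (sum(ps) / len(ps)) for qs, ps in by_type.values())

print(round(correct, 2), round(wrong, 2))  # 22.7 vs 22.2
```

The discrepancy appears because multiplication does not distribute over the pair (SUM, AVG); the derived measure must be computed before aggregating.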
Aggregate operators Distributive operator: allows to aggregate data starting from partially aggregated data (e.g. sum, max, min) Algebraic operator: requires further information to aggregate data (e.g. avg) Holistic operator: it is not possible to obtain aggregate data starting from partially aggregate data (e.g. mode, median)
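The algebraic case can be made concrete: AVG cannot be combined from partial averages alone, but it becomes combinable if each partial aggregate also carries the distributive components SUM and COUNT. The partition values below are invented for illustration.

```python
# Per-partition partial aggregates: (sum_of_values, count). Made-up numbers:
# partition A holds 4 rows summing to 12, partition B holds 2 rows summing to 9.
partials = [
    (12.0, 4),
    (9.0, 2),
]

# Wrong: averaging the partial averages ignores the partition sizes.
avg_of_avgs = sum(s / c for s, c in partials) / len(partials)

# Right: combine the distributive components SUM and COUNT, then divide.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
true_avg = total_sum / total_count

print(avg_of_avgs, true_avg)  # 3.75 vs 3.5
```

Holistic operators such as median offer no such fixed set of auxiliary components, which is why aggregate navigators restrict themselves to distributive operators.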
Aggregate operators
- Currently, aggregate navigators are included in commercial DW systems
- They allow one to re-formulate OLAP queries on the best view
- They manage aggregates only by means of distributive operators
Relational schema and aggregate data
It is possible to define different variants of the star schema in order to manage aggregate data
First solution: the data of primary and secondary views are stored in the same fact table
- NULL values for the attributes having aggregation levels finer than the current one
Aggregate data in a unique fact table

SALE
Shop_key  Date_key  Prod_key  qty   profit
1         1         1         170   85
2         1         1         300   150
3         1         1         1700  850

SHOP
Shop_key  shop   city     region
1         COOP1  Bologna  E.R.
2         -      Roma     Lazio
3         -      -        Lazio

Row 1 represents sale values for the single shop, row 2 represents aggregate values for Roma, row 3 represents aggregate values for Lazio, etc.
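A sketch of how this first solution is queried: dimension rows whose finer attributes are NULL mark the aggregated fact rows, so a query at a given level must filter on exactly that NULL pattern to avoid mixing (and double-counting) levels. The data mirrors the slide; everything else is a minimal illustration.

```python
import sqlite3

# Single fact table holding both detailed and aggregated rows; the SHOP
# dimension uses NULLs to mark the aggregation level of each row.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE SHOP (Shop_key INTEGER PRIMARY KEY, shop TEXT, city TEXT, region TEXT);
CREATE TABLE SALE (Shop_key INTEGER, Date_key INTEGER, Prod_key INTEGER,
                   qty INTEGER, profit REAL);
""")
cur.executemany("INSERT INTO SHOP VALUES (?,?,?,?)",
                [(1, "COOP1", "Bologna", "E.R."),
                 (2, None, "Roma", "Lazio"),     # city-level aggregate row
                 (3, None, None, "Lazio")])      # region-level aggregate row
cur.executemany("INSERT INTO SALE VALUES (?,?,?,?,?)",
                [(1, 1, 1, 170, 85), (2, 1, 1, 300, 150), (3, 1, 1, 1700, 850)])

# Region-level data: select only rows whose shop AND city are NULL,
# otherwise finer rows for the same region would be counted again.
row = cur.execute("""
SELECT region, qty
FROM SALE JOIN SHOP ON SALE.Shop_key = SHOP.Shop_key
WHERE shop IS NULL AND city IS NULL
""").fetchone()
print(row)  # ('Lazio', 1700)
```

The need for these NULL-pattern filters in every query is one reason the constellation-schema solution below is often preferred.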
Relational schema and aggregate data
Second solution: distinct aggregation patterns are stored in distinct fact tables (constellation schema)
- Only the size of the fact tables is optimized, but this is already a great improvement
- Max optimization level: separate fact tables, and also repeated dimension tables for the different aggregation levels
Constellation schema
v1 fact table: (Shop_key, Date_key, Product_key, Quantity, Profit, Unitary price, Nr. customers)
v5 fact table: (Shop_key, Date_key, Quantity, Profit, Unitary price, Nr. customers)
SHOP(Shop_key, Shop, Shop city, Shop region, Shop state, Manager district)
DATE(Date_key, Date, Month, Trimester, Year, Day, Week)
PRODUCT(Product_key, Product, Type, Category, Division, Marketing group, Brand, Brand city)
Both fact tables share the same dimension tables.
Alternative solution: replicate the dimension tables at different aggregation levels
v1 fact table: (Shop_key, Date_key, Product_key, Quantity, Profit, Unitary price, Nr. customers)
v5 fact table: (Trimester_key, Region_key, Quantity, Profit, Unitary price, Nr. customers)
TRIMESTER(Trimester_key, Trimester, Year)
REGION(Region_key, Shop region, Shop state)
SHOP(Shop_key, Shop, Shop city, Shop region, Shop state, Manager district)
DATE(Date_key, Date, Month, Trimester, Year, Day, Week)
PRODUCT(Product_key, Product, Type, Category, Division, Marketing group, Brand, Brand city)
Snowflake schema for aggregate data
v1 fact table: (Shop_key, Date_key, Product_key, Quantity, Profit, Unitary price, Nr. customers)
v5 fact table: (Trimester_key, Region_key, Quantity, Profit, Unitary price, Nr. customers)
TRIMESTER(Trimester_key, Trimester, Year)
REGION(Region_key, Shop region, Shop state)
SHOP(Shop_key, Shop, Shop city, Region_key, Manager district)
DATE(Date_key, Date, Month, Trimester, Year, Day, Week)
PRODUCT(Product_key, Product, Type, Category, Division, Marketing group, Brand, Brand city)
Here SHOP references REGION through the external key Region_key, so the replicated dimension is shared between v1 and v5.
Logical design
Logical modelling
Sequence of steps that, starting from the conceptual schema, allow one to obtain the logical schema for a specific data mart
INPUT: conceptual schema, workload, data volume, system constraints
-> logical design ->
OUTPUT: logical schema
Workload
In OLAP systems, the workload is dynamic in nature and intrinsically extemporaneous
- Users' interests change over time
- The number of queries grows as users gain confidence in the system
- OLAP should be able to answer any (unexpected) request
During the requirement collection phase, deduce it from:
- Interviews with users
- Standard reports
Workload
Characterize OLAP operations:
- Based on the required aggregation pattern
- Based on the required measures
- Based on the selection clauses
At system run-time, the workload can be inferred from the system log
Data volume
Depends on:
- Number of distinct values for each attribute
- Attribute size
- Number of events (primary and secondary) for each fact
Determines:
- Table size
- Index size
- Access time
Logical modelling: steps 1. Choice of the logical schema (star/snowflake schema) 2. Conceptual schema translation 3. Choice of the materialized views 4. Optimization
From fact schema to star schema
- Create a fact table containing the measures and the descriptive attributes directly connected to the fact
- For each hierarchy, create a dimension table containing all its attributes
Guidelines
Descriptive attributes (e.g. colour):
- If connected to a dimensional attribute, it has to be included in the dimension table containing that attribute (see the snowflake example, attribute agent)
- If connected to a fact, it has to be included directly in the fact table
Optional attributes (e.g. diet):
- Introduction of null values or ad-hoc values
Guidelines
Cross-dimensional attributes (e.g. VAT):
- A cross-dimensional attribute b defines an N:M association between two or more dimensional attributes a1, a2, ..., ak
- It requires creating a new table that includes b and has the attributes a1, a2, ..., ak as key
Guidelines
Shared hierarchies and convergence
A shared hierarchy is a hierarchy which refers to different elements of the fact table (e.g. caller number, called number)
The dimension table should not be duplicated
Two different situations:
- The two hierarchies contain the same attributes, but with different meanings (e.g. phone call: caller number vs. called number)
- The two hierarchies contain the same attributes only for part of the hierarchy trees
Shared hierarchies and convergence
The two hierarchies contain the same attributes, but with different meanings (e.g. phone call: caller number vs. called number)
CALLS(CALLER_ID, CALLED_ID, DATE_ID, N_OF_CALLS)
USER(USER_ID, PH_NUMR, DISTRICT)
Both CALLER_ID and CALLED_ID reference the single USER dimension table.
Shared hierarchies and convergence
The two hierarchies contain the same attributes only for part of the trees; here we could also decide to replicate the shared portion
SHIPPINGS(STOREHOUSE_ID, ORDER_ID, DATE_ID, NUMBER)
STOREHOUSE(STOREHOUSE_ID, STOREHOUSE, CITY_ID)
ORDERS(ORDER_ID, ORDER, CUSTOMER, CITY_ID)
CITIES(CITY_ID, REGION)
STOREHOUSE and ORDERS share the geographic portion of the hierarchy through CITIES.
Guidelines
Multiple edges
- A bridge table models the multiple edge; the key of the bridge table is composed by the combination of the attributes connected to the multiple edge
- The weight of the edge is the contribution of each edge to the cumulative relationship
BRIDGE(BOOK_ID, AUTH_ID, WEIGHT)
SALES(BOOK_ID, DATE_ID, QUANTITY, PROFIT)
BOOKS(BOOK_ID, BOOK, GENRE)
AUTHORS(AUTH_ID, AUTHOR)
Guidelines
Multiple edges: bridge table
Weighted queries take into account the weight of the edge
Profit for each author:

SELECT AUTHORS.AUTHOR, SUM(SALES.PROFIT * BRIDGE.WEIGHT)
FROM AUTHORS, BRIDGE, BOOKS, SALES
WHERE AUTHORS.AUTH_ID = BRIDGE.AUTH_ID
  AND BRIDGE.BOOK_ID = BOOKS.BOOK_ID
  AND BOOKS.BOOK_ID = SALES.BOOK_ID
GROUP BY AUTHORS.AUTHOR
Guidelines
Multiple edges: bridge table
Impact queries do not take into account the weight of the edge
Sold copies for each author:

SELECT AUTHORS.AUTHOR, SUM(SALES.QUANTITY)
FROM AUTHORS, BRIDGE, BOOKS, SALES
WHERE AUTHORS.AUTH_ID = BRIDGE.AUTH_ID
  AND BRIDGE.BOOK_ID = BOOKS.BOOK_ID
  AND BOOKS.BOOK_ID = SALES.BOOK_ID
GROUP BY AUTHORS.AUTHOR
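Both bridge-table queries can be run on a small example. The sketch below uses sqlite3 with invented data: one book with two authors (weight 0.5 each) and one single-author book, so the weighted and impact totals visibly differ.

```python
import sqlite3

# Mini-instance of the bridge-table schema; all rows are made up.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE BOOKS (BOOK_ID INTEGER PRIMARY KEY, BOOK TEXT, GENRE TEXT);
CREATE TABLE AUTHORS (AUTH_ID INTEGER PRIMARY KEY, AUTHOR TEXT);
CREATE TABLE BRIDGE (BOOK_ID INTEGER, AUTH_ID INTEGER, WEIGHT REAL);
CREATE TABLE SALES (BOOK_ID INTEGER, DATE_ID INTEGER, QUANTITY INTEGER, PROFIT REAL);
""")
cur.executemany("INSERT INTO BOOKS VALUES (?,?,?)",
                [(1, "B1", "Novel"), (2, "B2", "Essay")])
cur.executemany("INSERT INTO AUTHORS VALUES (?,?)", [(1, "Ann"), (2, "Bob")])
cur.executemany("INSERT INTO BRIDGE VALUES (?,?,?)",
                [(1, 1, 0.5), (1, 2, 0.5),   # book 1 is co-authored
                 (2, 2, 1.0)])               # book 2 has a single author
cur.executemany("INSERT INTO SALES VALUES (?,?,?,?)",
                [(1, 1, 10, 100.0), (2, 1, 4, 40.0)])

# Weighted query: profit per author, scaled by the bridge weight.
weighted = dict(cur.execute("""
SELECT AUTHOR, SUM(PROFIT * WEIGHT)
FROM AUTHORS JOIN BRIDGE ON AUTHORS.AUTH_ID = BRIDGE.AUTH_ID
             JOIN SALES ON BRIDGE.BOOK_ID = SALES.BOOK_ID
GROUP BY AUTHOR
""").fetchall())

# Impact query: sold copies per author, ignoring the weight.
impact = dict(cur.execute("""
SELECT AUTHOR, SUM(QUANTITY)
FROM AUTHORS JOIN BRIDGE ON AUTHORS.AUTH_ID = BRIDGE.AUTH_ID
             JOIN SALES ON BRIDGE.BOOK_ID = SALES.BOOK_ID
GROUP BY AUTHOR
""").fetchall())
print(weighted, impact)
```

Note that the weighted totals still sum to the overall profit (140.0), while the impact totals over-count the co-authored book's copies: that is the intended difference between the two query styles.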
Multiple edges with a star schema
If we want to keep the star model, add the author to the fact schema:
SALES(BOOK_ID, AUTH_ID, DATE_ID, QUANTITY, PROFIT)
BOOKS(BOOK_ID, BOOK, GENRE)
AUTHORS(AUTH_ID, AUTHOR)
Here we don't need the weight, because the fact table records quantity and profit per book and per author.
Secondary-view precomputation
The choice of the views to materialize takes into account contrasting requirements:
- Minimization of cost functions: workload cost, view maintenance cost
- System constraints: disk space, time for data update
- User constraints: max answer time, data freshness
Materialized views (MD lattice)
- Exact views: they solve exactly the queries
- Less aggregated views: they solve more than one query
- Candidate views: they could reduce elaboration costs
Materialized views (MD lattice)
[Charts comparing query execution cost, disk space and view computation time for different choices of materialized views in the lattice]
Materialized views
It is useful to materialize a view when:
- It directly solves a frequent query
- It reduces the costs of some queries
It is not useful to materialize a view when:
- Its aggregation pattern is the same as that of another materialized view
- Its materialization does not reduce the cost
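One common way to balance these contrasting requirements is a greedy heuristic in the spirit of Harinarayan, Rajaraman and Ullman's view-selection algorithm: repeatedly materialize the candidate view with the highest benefit (total query-cost saving) until the space budget is exhausted. The lattice below matches the slides' v1..v5, but the view sizes, the cost model (cost of a query = size of the smallest materialized ancestor) and the budget are all invented for illustration.

```python
# Made-up row counts for the views of the lattice.
sizes = {"v1": 100, "v2": 50, "v3": 20, "v4": 20, "v5": 6}
# ancestors[v] = views (including v itself) from which v can be computed.
ancestors = {
    "v1": ["v1"],
    "v2": ["v2", "v1"],
    "v3": ["v3", "v2", "v1"],
    "v4": ["v4", "v2", "v1"],
    "v5": ["v5", "v3", "v4", "v2", "v1"],
}

def query_cost(v, materialized):
    # Cost of answering a query on v = size of the cheapest materialized
    # view that can compute it.
    return min(sizes[a] for a in ancestors[v] if a in materialized)

materialized = {"v1"}   # the primary view must always be available
budget = 60             # remaining disk space (in rows), invented

while True:
    candidates = [v for v in sizes
                  if v not in materialized and sizes[v] <= budget]
    if not candidates:
        break

    def benefit(v):
        # Total reduction in query cost over all views if v is materialized.
        return sum(query_cost(q, materialized)
                   - query_cost(q, materialized | {v}) for q in sizes)

    best = max(candidates, key=benefit)
    if benefit(best) <= 0:
        break
    materialized.add(best)
    budget -= sizes[best]

print(sorted(materialized))  # ['v1', 'v2', 'v5']
```

With these numbers the heuristic first picks v2 (it cheapens four queries at once), then v5 (it fits the leftover space); v3 and v4 stay virtual and are answered from v2, illustrating why a view whose materialization saves too little is not worth its space.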
References
- M. Golfarelli, S. Rizzi: Data Warehouse: teoria e pratica della progettazione. McGraw-Hill, 2002.
- R. Kimball: The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley, 1996.