Modeling of Data Warehouses

Similar documents
OLAP Systems and Multidimensional Expressions I

Data Integration and ETL Process

Data Integration and ETL Process

Designing a Dimensional Model

Basics of Dimensional Modeling

COURSE OUTLINE. Track 1 Advanced Data Modeling, Analysis and Design

Optimizing Your Data Warehouse Design for Superior Performance

When to consider OLAP?

Dimensional Modeling 101. Presented by: Michael Davis CEO OmegaSoft,LLC

How To Write A Diagram

Unlock your data for fast insights: dimensionless modeling with in-memory column store. By Vadim Orlov

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics

Course chapter 9: Data Warehousing Dimensional modeling. F. Radulescu - Data warehousing - Dimensional modeling 1

Benefits of a Multi-Dimensional Model. An Oracle White Paper May 2006

Turning your Warehouse Data into Business Intelligence: Reporting Trends and Visibility Michael Armanious; Vice President Sales and Marketing Datex,

Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance

Multidimensional Modeling - Stocks

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Announcements. SQL is hot! Facebook. Goal. Database Design Process. IT420: Database Management and Organization. Normalization (Chapter 3)

Dimensional Data Modeling

Data Warehouses & OLAP

Dx and Microsoft: A Case Study in Data Aggregation

Mario Guarracino. Data warehousing


Data Warehouse design

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Trivadis White Paper. Comparison of Data Modeling Methods for a Core Data Warehouse. Dani Schnider Adriano Martino Maren Eschermann

Sterling Business Intelligence

Data Warehousing Systems: Foundations and Architectures

Chapter 7 Multidimensional Data Modeling (MDDM)

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Deductive Data Warehouses and Aggregate (Derived) Tables

The Data Warehouse ETL Toolkit

Data warehousing with PostgreSQL

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

Chapter 5: Logical Database Design and the Relational Model Part 2: Normalization. Introduction to Normalization. Normal Forms.

Microsoft Data Warehouse in Depth

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

Data Warehousing and Data Mining

The Benefits of Data Modeling in Data Warehousing

Dimensional Modeling for Data Warehouse

SENG 520, Experience with a high-level programming language. (304) , Jeff.Edgell@comcast.net

ETL TESTING TRAINING

Dimensional Data Modeling for the Data Warehouse

Business Intelligence, Data warehousing Concept and artifacts

INTEROPERABILITY IN DATA WAREHOUSES

Trends in Data Warehouse Data Modeling: Data Vault and Anchor Modeling

OLAP Systems and Multidimensional Queries II

CHAPTER 3. Data Warehouses and OLAP

PeopleSoft EPM 9.1 Supply Chain Management Warehouse PeopleBook

Database Normalization. Mohua Sarkar, Ph.D Software Engineer California Pacific Medical Center

CHAPTER 4 Data Warehouse Architecture

The Design and the Implementation of an HEALTH CARE STATISTICS DATA WAREHOUSE Dr. Sreèko Natek, assistant professor, Nova Vizija,

Building a Cube in Analysis Services Step by Step and Best Practices

Business Warehouse BEX Query Guidelines

Data Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)

ETL-EXTRACT, TRANSFORM & LOAD TESTING

Dimensional Modeling and E-R Modeling In. Joseph M. Firestone, Ph.D. White Paper No. Eight. June 22, 1998

Accessing multidimensional Data Types in Oracle 9i Release 2

Cognos 8 Best Practices

Business Intelligence for SUPRA. WHITE PAPER Cincom In-depth Analysis and Review

B. 3 essay questions. Samples of potential questions are available in part IV. This list is not exhaustive it is just a sample.

Top Data Management Terms to Know Fifteen essential definitions you need to know

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

Part 22. Data Warehousing

Oracle Warehouse Builder 11gR2: Getting Started

Customer-Centric Data Warehouse, design issues. Announcements

Monitoring Genebanks using Datamarts based in an Open Source Tool

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Data Warehousing. SQL Server 2008 R2, Denali

CS2032 Data warehousing and Data Mining Unit II Page 1

OLAP Systems and Multidimensional Expressions II

Object Oriented Databases. OOAD Fall 2012 Arjun Gopalakrishna Bhavya Udayashankar

70-467: Designing Business Intelligence Solutions with Microsoft SQL Server

Kimball Dimensional Modeling Techniques

Data Warehouse Design

BUILDING OLAP TOOLS OVER LARGE DATABASES

Semantic Data Modeling: The Key to Re-usable Data

LEARNING SOLUTIONS website milner.com/learning phone

Data warehouse design

University of Gaziantep, Department of Business Administration

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES

Rocky Mountain Technology Ventures. Exploring the Intricacies and Processes Involving the Extraction, Transformation and Loading of Data

ETL evolution from data sources to data warehouse using mediator data storage

STRATEGIC AND FINANCIAL PERFORMANCE USING BUSINESS INTELLIGENCE SOLUTIONS

Database Design Patterns. Winter Lecture 24

Databases in Organizations

MCQs~Databases~Relational Model and Normalization

Week 3 lecture slides

A Multidimensional Design for the Genesis Data Set

PeopleSoft EPM 9.1 Customer Relationship Management Warehouse PeopleBook

Normalization of Database

Data Warehousing and Decision Support. Torben Bach Pedersen Department of Computer Science Aalborg University

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

OLAP OLAP. Data Warehouse. OLAP Data Model: the Data Cube S e s s io n

Chapter 3 - Data Replication and Materialized Integration

Data Warehousing Concepts

Transcription:

Modeling of Data Warehouses Krzysztof Dembczyński Institute of Computing Science Laboratory of Intelligent Decision Support Systems Politechnika Poznańska (Poznań University of Technology) Intelligent Decision Support Systems Master studies, first semester Academic year 2012/12 (summer course)

Conceptual Schemes of Data Warehouses Dimensional Modeling Summary

Conceptual Schemes of Data Warehouses Dimensional Modeling Summary

Three main goals for logical design: Simplicity, Expressiveness, Performance.

Simplicity: Users should understand the design, Data model should match users conceptual model, Queries should be easy and intuitive to write.

Expressiveness: Include enough information to answer all important queries, Include all relevant data (without irrelevant data).

Performance An efficient physical design should be possible to apply.

Three main goals given in an alternative way: Simplicity, Simplicity, Simplicity,...

There are typically a number of different dimensions from which a given pool of data can be analyzed. This plural perspective, or Multidimensional Conceptual View appears to be the way most business persons naturally view their enterprise. E.F. Codd, S.B. Codd and C.T. Salley, 1993

Three basic conceptual schemes: Star schema, Snowflake schema, Fact constellations.

Star schema: a single table in the middle connected to a number of dimension tables.

Basic notions Measures, e.g. grades, price, quantity. Measures should be aggregative. Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. Table, which relates the dimensions to the measures, is called the fact table (e.g. Student grades). Information about dimensions are represented as a collection of dimension tables (student, academic year, etc.). Each dimension has a set of descriptive attributes.

Fact Table Each fact table contains measurements about a process of interest. Each fact row contains foreign keys to dimension tables and numerical measure columns. Any new fact is added to the fact table. The aggregated fact columns are the matter of the analysis.

Dimension Tables Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc.. Dimension tables contain many descriptive columns. Generally do not have too many rows (in comparison to the fact table). Content is relatively static. The attributes of dimension tables are used for filtering and grouping. Dimension tables describe facts stored in the fact table.

Fact table: narrow, big (many rows), numeric (rows are described by numerical measures), dynamic (growing over time). Dimension table: wide, small (few rows), descriptive (rows are described by descriptive attributes), static. Facts contain numbers, dimensions contain labels

Star Schema

Dimension Hierarchies For each dimension, one can define several hierarchies of attributes.

Snowflake schema: refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.

Normal forms provide criteria for determining a table s degree of vulnerability to inconsistencies, redundancy and update anomalies. One can distinguish the following normal forms: First normal form (1NF), Second normal form (2NF), Third normal from (3NF), Boyce-Codd normal form (BCNF), Fourth normal form (4NF), Fifth normal form (5NF).

Denormalization: Denormalization is the process of attempting to optimize the performance of a database by adding redundant data or by grouping data. Denormalization helps cover up the inefficiencies inherent in relational database software.

Fact constellation schema: multiple fact tables share dimension tables. Such schema appears in a design of a data warehouse for a huge and complex problem. One should start with a simple model, then try to define more complex (project success).

Conceptual Schemes of Data Warehouses Dimensional Modeling Summary

Four Steps in Dimensional Modeling Select the business process to model (e.g. sales). Determine the grain of the business process (e.g. single transaction in a market identified by bar-code scanners at cash register). Choose the dimensions describing the business process (e.g. localization, product, data, promotion, etc.). Identify the numeric measures for the facts (e.g. price, quantity).

Select the business process to model Business process has to be a natural activity performed in the given organization that is supported by an operational database system. Examples: sales, orders, teaching. Business process should not be referred to an organizational business department or function. Selection of the business process should be dependent on its complexity, and also on time and budget of the project.

Determine the grain of the business process Defines what an individual fact table row represents. Determines the maximum level of detail of the warehouse. Examples: line item from a cash register receipt, daily snapshot of inventory level for a product in a warehouse, monthly snapshot for each bank account. Finer-grained fact tables are more expressive, but have more rows and grow faster.

Choose the dimensions describing the business process Decorates fact table with a set of dimensions that determine a candidate key based on the grain and all other dimensions functionally determined by the candidate key. In other words: detailed description of the grain defined in the previous step. Grain of the fact table defines the grain of the dimensions. Examples: localization, product, date, promotion, etc..

Identify the numeric measures for the facts Numeric measures allows to measure the performance of the process. All the measures corresponds to the same grain defined previously. Measures should be numeric and additive (some problems: number of apples and pineapples).

Additional Aspects of Dimensional Modeling: Derived measures vs. calculated measures, Describing dimensions, Date dimension, Surrogate keys, Degenerate Dimensions, Factless fact tables, Slowly changing dimensions, Data warehouse bus architecture, Physical organization of the data warehouse.

Date dimension is a specific and inseparable dimension of the data warehouse design. Data warehouse should be treated as a temporal database. Date dimension allows analysis of time trends.

Typical attributes in date dimension: Date and time (candidate key), day (number) of year, day (number) of month, day (number) of week, day number in epoch, weekday, weekend, 24-hour work day holiday week of year, month, name of month, quarter, year, fiscal year, selling season, back-to-school Christmas

Date and time dimension: For finer-grained time measurements, use separate date and time-of-day dimensions.

Advantages of using date dimension: Capturing interesting date properties (holiday/non-holiday), Avoiding relying on special date-handling SQL functions, Making use of indexes.

Surrogate keys is meaningless integer key that is assigned by the data warehouse. Surrogate keys should be used instead of natural keys (like PESEL number in Poland): Surrogate keys can be shorter, Better handling of exceptional cases (e.g. for missing values try to use a specific row in the dimension table that represents missing value), Surrogate keys avoid to assume implicit semantic, Resistant to change of natural key meaning, Resistant to reuse of old value of natural key.

Degenerate Dimensions Dimension can be merely an identifier, without any interesting attributes: Consider a data warehouses for a retail market, Typical transaction can be described by (id_transaction, product,...), id_transaction is just a unique identifier.

Two options for dealing with such dimensions: Discard the dimension, Use a degenerate dimension: store the dimension identifier directly in the fact table; do not create a separate dimension table; such a dimension serves to group together products that were bought in the same market basket (market basket analysis).

Factless Fact Tables Fact tables take the advantage of sparsity (much less data to store, if events are rare). Fact tables do not contain rows for non-events: no rows for products that did not sell.

Factless Fact Tables Fact table without numeric measures columns. Dummy measure is included that always has value 1. Describe relationships between dimensions. Example: Which products were on promotion in which stores for which days?

Slowly Changing Dimensions Compared to fact tables, contents of dimension tables are relatively stable: New sales transactions occur constantly, New products are introduced rarely, New stores are opened very rarely. Attribute values for existing dimension rows do occasionally change over time: Customer moves to a new address, Administrative reform in Poland, Change of product categorization.

Slowly Changing Dimensions Type 1 (the simplest solution): update the dimension table: Simple to implement, Does not report history, Can affect analysis. Examples: Wrong name of the street (correct approach), Change of customer address: Nowak moves from Poznań to Warsaw: all products bought in Poznań move to Warsaw too!!!

Slowly Changing Dimensions Type II: create a new dimension row: Accurate historical reporting, Pre-computed aggregates unaffected, Dimension table grows over time, Type II requires surrogate keys. Examples: Nowak moves to Warsaw: id name city 23401 Nowak Poznań 23402 Nowak Warsaw Old fact table rows point to the old row, and new fact table rows point to the new row, To report on Nowak s activity over time one can use WHERE name = Nowak.

Slowly Changing Dimensions Type III: create a new dimension columns: Allows reporting using either the old or the new values, Mostly useful for different reorganizations. Examples: Administrative reform in Poland id previous name current name 23401 poznańskie wielkopolskie 23402 pilskie wielkopolskie All the fact table rows point to both values.

Data Warehouse Bus Architecture In order to create an data warehouse for entire organization, which takes into account several processes, one can use a data warehouse bus matrix.

Some Other Problems It is generally reckoned that some 80 percent of data warehouse queries are dimension-browsing queries. This means that they do not access any fact table. Example How many customers do we have? Ch. Todman 2003

Conceptual Schemes of Data Warehouses Dimensional Modeling Summary

Summary Three goals for logical design of data warehouse: simplicity, expressiveness and performance. The most popular conceptual schema: star schema. Designing data warehouses is not an easy task...

Bibliography Ch. Todman, Projektowanie hurtowni danych. Zarzadzanie kontaktami z klientami (CRM), Wydawnictwa Naukowo-Techniczne 2003 R. Kimball, M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley & Sons 2002 E.F. Codd, S.B. Codd, C.T. Salley, Providing OLAP to User-Analysts: An IT Mandate. E. F. Codd Associates 1993