Data warehouse design



Similar documents
Data warehouse Architectures and processes

Data Warehouse: Introduction

Data Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)

Data Warehouse Design

Multidimensional Modeling - Stocks

THE DIMENSIONAL FACT MODEL: A CONCEPTUAL MODEL FOR DATA WAREHOUSES 1

COURSE OUTLINE. Track 1 Advanced Data Modeling, Analysis and Design

Designing a Dimensional Model

Data warehouse life-cycle and design

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

DATA WAREHOUSING AND OLAP TECHNOLOGY

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

Dimensional Data Modeling for the Data Warehouse

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

1. Dimensional Data Design - Data Mart Life Cycle

Fluency With Information Technology CSE100/IMT100

Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance

Advanced Data Management Technologies

Report on the Dagstuhl Seminar Data Quality on the Web

The Data Warehouse ETL Toolkit

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

DATA WAREHOUSE DESIGN

Dimodelo Solutions Data Warehousing and Business Intelligence Concepts

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

PeopleSoft EPM 9.1 Supply Chain Management Warehouse PeopleBook

Basics of Dimensional Modeling

Lection 3-4 WAREHOUSING

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

Customer-Centric Data Warehouse, design issues. Announcements

The Data Quality Continuum*

Data Warehouse (DW) Maturity Assessment Questionnaire

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Sterling Business Intelligence

Foundations of Business Intelligence: Databases and Information Management

SAP InfiniteInsight Explorer Analytical Data Management v7.0

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

How To Write A Diagram

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

2 Associating Facts with Time

CHAPTER 4 Data Warehouse Architecture

POLAR IT SERVICES. Business Intelligence Project Methodology

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Data W a Ware r house house and and OLAP Week 5 1

Optimizing Your Data Warehouse Design for Superior Performance

Data Warehousing with Oracle

Data Warehousing and OLAP

OLAP Theory-English version

Data Warehouse Management Using SAP BW Alexander Prosser

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

Data Warehouse design

Customer Analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.

Establish and maintain Center of Excellence (CoE) around Data Architecture

Database Design Patterns. Winter Lecture 24

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Trivadis White Paper. Comparison of Data Modeling Methods for a Core Data Warehouse. Dani Schnider Adriano Martino Maren Eschermann

Data Warehousing and Decision Support. Torben Bach Pedersen Department of Computer Science Aalborg University

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Week 13: Data Warehousing. Warehousing

Data Warehousing Systems: Foundations and Architectures

Building an Effective Data Warehouse Architecture James Serra

BENEFITS OF AUTOMATING DATA WAREHOUSING

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

MDM and Data Warehousing Complement Each Other

Data Warehouse Logical Modeling and Design (6)

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

Design of a Multi Dimensional Database for the Archimed DataWarehouse

MS Designing and Optimizing Database Solutions with Microsoft SQL Server 2008

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Practical Considerations for Real-Time Business Intelligence. Donovan Schneider Yahoo! September 11, 2006

OLAP Systems and Multidimensional Expressions I

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

THE OPEN UNIVERSITY OF TANZANIA FACULTY OF SCIENCE TECHNOLOGY AND ENVIRONMENTAL STUDIES BACHELOR OF SIENCE IN DATA MANAGEMENT

Data Warehousing and OLAP Technology for Knowledge Discovery

Patterns of Information Management

PeopleSoft EPM 9.1 Customer Relationship Management Warehouse PeopleBook

Presented by: Jose Chinchilla, MCITP

DIMENSIONAL MODELLING

Business Intelligence : a primer

14. Data Warehousing & Data Mining

6NF CONCEPTUAL MODELS AND DATA WAREHOUSING 2.0

Data Warehousing: Data Models and OLAP operations. By Kishore Jaladi

Sizing Logical Data in a Data Warehouse A Consistent and Auditable Approach

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Transcription:

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, Data warehouse design DATA WAREHOUSE: DESIGN - 1 Risk factors Database and data mining group, High user expectation the data warehouse is the solution of the company s problems Data and OLTP process quality incomplete or unreliable data non integrated or non optimized business processes Political management of the project cooperation with information owners system acceptance by end users deployment appropriate training DATA WAREHOUSE: DESIGN - 2 Pag. 1

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, Top-down approach the data warehouse provides a global and complete representation of business data significant cost and time consuming implementation complex analysis and design tasks Bottom-up approach incremental growth of the data warehouse, by adding data marts on specific business areas separately focused on specific business areas limited cost and delivery time easy to perform intermediate checks DATA WAREHOUSE: DESIGN - 3 Database and data mining group, Business Dimensional Lifecycle Planning (Kimball) Requirement definition Project management Architecture design Product selection and installation TECHNOLOGY Dimensional modeling Physiscal design Feeding design and implementation DATA User Application Analysis User Application Development APPLICATIONS Deployment Maintenance DATA WAREHOUSE: DESIGN - 4 warehouse, teoria e pratica della progettazione, McGraw Hill 2006 Pag. 2

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Data mart design user requirements reconciled schema Database and data mining group, CONCEPTUAL DESIGN workload data volume logical model operational source schemas RECONCILIATION reconciled schema fact schema LOGICAL DESIGN logical schema workload data volume DBMS FEEDING DESIGN PHYSICAL DESIGN feeding schema physical schema DATA WAREHOUSE: DESIGN - 5 Requirement analysis Database and data mining group, It collects data analysis requirements to be supported by the data mart implementation constraints due to existing information systems Requirement sources business users operational system administrators The first selected data mart is crucial for the company feeded by (few) reliable sources DATA WAREHOUSE: DESIGN - 6 Pag. 3

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, Application requirements Description of relevant events (facts) each fact represents a of events which are relevant for the company examples: (in the CRM domain) complaints, services characterized by descriptive dimensions (setting the granularity), history span, relevant measures informations are gathered in a glossary Workload description periodical business reports queries expressed in natural language example: number of complaints for each product in the last month DATA WAREHOUSE: DESIGN - 7 Structural requirements Database and data mining group, Feeding periodicity Available space for data derived data (indices, materialized views) System architecture level number dependent or independent data marts Deployment planning start up training DATA WAREHOUSE: DESIGN - 8 Pag. 4

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, Conceptual design DATA WAREHOUSE: DESIGN - 9 Conceptual design Database and data mining group, No currently adopted modeling formalism ER model not adequate Dimensional Fact Model (Golfarelli, Rizzi) graphical model supporting conceptual design for a given fact, it defines a fact schema modelling dimensions hierarchies measures it provides design documentation both for requirement review with users, and after deployment DATA WAREHOUSE: DESIGN - 10 Pag. 5

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Dimensional Fact Model Database and data mining group, Fact it models a set of relevant events (sales, shippings, complaints) it evolves with time Dimension it describes the analysis coordinates of a fact (e.g., each sale is described by the sale, the and the sold product) it is characterized by many, typically categorical, attributes Measure it describes a numerical property of a fact (e.g., each sale is characterized by a sold quantity) aggregates are frequently performed on measures product fact dimension DATA WAREHOUSE: DESIGN - 11 SALE sold quantity sale amount number of customers unit price measure DFM: Hierarchy Database and data mining group, Each dimension can have a set of associated attributes The attributes describe the dimension at different abstraction levels and can be structured as a hierarchy The hierarchy represents a generalization relationship among a subset of attributes in a dimension (e.g., geografic hierarchy for the dimension) The hierarchy represents a functional dependency (1:n relationship) quarter month day holiday week DATA WAREHOUSE: DESIGN - 12 department marketing group type brand city product SALE brand sold quantity sale amount number of customers unit price sale manager sale district city dimension attribute region country hierarchy Pag. 6

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Comparison with ER Database and data mining group, department DEPARTMENT quarter month day holiday week department marketing group type brand city product brand SALE sold quantity sale amount number of customers unit price sale manager sale district city region country country region city COUNTRY REGION CITY MARKETING marketing GROUP YEAR CATEGORY BRAND Brand city CITY QUARTER type TYPE BRAND brand MONTH product PRODUCT quarter month SHOP (0,n) (0,n) SALE (0,n) DATE SALES MGR. sales mgr DATA WAREHOUSE: DESIGN - 13 SALES DISTRICT sale district sold qty sale amount cust. num. unit price HOLIDAY holiday WEEK DAY day week Aggregation Database and data mining group, Events can be aggregated based on the values of the attributes along the hierarchies Measures for the aggregated event are obtained by aggregating the measures of the corresponding events in the original fact schema standard aggregate operators: SUM, MIN, MAX, AVG, COUNT Aggregation computes measures with a coarser granularity than those in the original fact schema detail reduction is usually obtained by climbing a hierarchy DATA WAREHOUSE: DESIGN - 14 Pag. 7

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Aggregation Database and data mining group, quart. I 99 II 99 III 99 IV 99 I 00 II 00III 00IV 00 type product Brillo 100 90 95 90 80 70 90 85 Washing r home Sbianco 20 30 20 10 25 30 35 20 powder cleaning Lucido 60 50 60 45 40 40 50 40 soap Manipulite 15 20 25 30 15 15 20 10 Scent 30 35 20 25 30 30 20 15 Latte F Slurp 90 90 85 75 60 80 85 60 milk Latte U Slurp 60 80 85 60 70 70 75 65 food Yogurt Slurp 20 30 40 35 30 35 35 20 soda Bevimi 20 10 25 30 35 30 20 10 Colissima 50 60 45 40 50 60 45 40 Measure: sold quantity quart. I 99 II 99 III 99 IV 99 I 00 II 00 III 00 IV 00 home clean. 225 225 220 200 190 185 215 170 food 240 270 280 240 245 275 260 195 DATA WAREHOUSE: DESIGN - 15 warehouse, teoria e pratica della Aggregation Database and data mining group, quart. I 99 II 99 III 99 IV 99 I 00 II 00III 00IV 00 type product Brillo 100 90 95 90 80 70 90 85 Washing r home Sbianco 20 30 20 10 25 30 35 20 powder cleaning Lucido 60 50 60 45 40 40 50 40 soap Manipulite 15 20 25 30 15 15 20 10 Scent 30 35 20 25 30 30 20 15 Latte F Slurp 90 90 85 75 60 80 85 60 milk Latte U Slurp 60 80 85 60 70 70 75 65 food Yogurt Slurp 20 30 40 35 30 35 35 20 soda Bevimi 20 10 25 30 35 30 20 10 Colissima 50 60 45 40 50 60 45 40 Measure: sold quantity quart. I 99 II 99 III 99 IV 99 I 00 II 00 III 00 IV 00 home clean. 225 225 220 200 190 185 215 170 food 240 270 280 240 245 275 260 195 type home p. washing 670 605 cleaning soap 200 155 food milk 750 685 soda 280 290 DATA WAREHOUSE: DESIGN - 16 warehouse, teoria e pratica della Pag. 8

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Aggregation Database and data mining group, quart. I 99 II 99 III 99 IV 99 I 00 II 00III 00IV 00 type product Brillo 100 90 95 90 80 70 90 85 Washing r home Sbianco 20 30 20 10 25 30 35 20 powder cleaning Lucido 60 50 60 45 40 40 50 40 soap Manipulite 15 20 25 30 15 15 20 10 Scent 30 35 20 25 30 30 20 15 Latte F Slurp 90 90 85 75 60 80 85 60 milk Latte U Slurp 60 80 85 60 70 70 75 65 food Yogurt Slurp 20 30 40 35 30 35 35 20 soda Bevimi 20 10 25 30 35 30 20 10 Colissima 50 60 45 40 50 60 45 40 Measure: sold quantity quart. I 99 II 99 III 99 IV 99 I 00 II 00 III 00 IV 00 home clean. 225 225 220 200 190 185 215 170 food 240 270 280 240 245 275 260 195 type home p. washing 670 605 cleaning soap 200 155 food milk 750 685 soda 280 290 DATA WAREHOUSE: DESIGN - 17 home clean. 870 760 food 1030 975 warehouse, teoria e pratica della additive: Measure characteristics Database and data mining group, can be aggregated along a given hierarchy by means of the SUM operator not additive: cannot be aggregated along a given hierarchy by means of the SUM operator not aggregable: cannot be aggregated along any hierarchy by means of any aggregate operator DATA WAREHOUSE: DESIGN - 18 Pag. 9

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Aggregation Database and data mining group, non-additivity quarter month day holiday week department marketing group type brand city product brand SALE sold quantity sale amount number of customers unit price (AVG) sale manager sale district city region country DATA WAREHOUSE: DESIGN - 19 Measure classification Stream measures can be evaluated cumulatively at the end of a time period can be aggregated by means of all standard operators examples: sold quantity, sale amount Level measures evaluated at a given time (snapshot) not additive along the time dimension examples: inventory level, account balance Unit measures evaluated at a given time and expressed in relative terms not additive along any dimension examples: unit price of a product Database and data mining group, DATA WAREHOUSE: DESIGN - 20 Pag. 10

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Aggregate operators Database and data mining group, Distributive can always compute higher level aggregations from more detailed data examples: sum, min, max Algebraic can compute higher level aggregations from more detailed data only when supplementary support measures are available examples: avg (it requires count) Olistic can not compute higher level aggregations from more detailed data examples: mode, median DATA WAREHOUSE: DESIGN - 21 Database and data mining group, Distributive operator: SUM quart. I 99 II 99 III 99 IV 99 I 00 II 00III 00IV 00 type product Brillo 100 90 95 90 80 70 90 85 washing r home Sbianco 20 30 20 10 25 30 35 20 powder cleaning Lucido 60 50 60 45 40 40 50 40 soap Manipulite 15 20 25 30 15 15 20 10 Scent 30 35 20 25 30 30 20 15 Measure: sold quantity Latte F Slurp 90 90 85 75 60 80 85 60 milk Latte U Slurp 60 80 85 60 70 70 75 65 food Yogurt Slurp 20 30 40 35 30 35 35 20 soda Bevimi 20 10 25 30 35 30 20 10 Colissima 50 60 45 40 50 60 45 40 quart. I 99 II 99 III 99 IV 99 I 00 II 00 III 00 IV 00 home clean. 225 225 220 200 190 185 215 170 food 240 270 280 240 245 275 260 195 home clean. 870 760 food 1030 975 DATA WAREHOUSE: DESIGN - 22 type home p. washing 670 605 cleaning soap 200 155 food milk 750 685 soda 280 290 warehouse, teoria e pratica della Pag. 11

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Algebraic operator: AVG Database and data mining group, 1999 quart. I 99 II 99 III 99 IV 99 type product Brillo 2 2 2,2 2,5 washing Sbianco 1,5 1,5 2 2,5 home powder Lucido 3 3 3 cleaning Manipulite 1 1,2 1,5 1,5 soap Scent 1,5 1,5 2 Measure: unit price 1999 quart. I 99 II 99 III 99 IV 99 type home wash. p. 1,75 2,17 2,40 2,67 cleaning soap 1,25 1,35 1,75 1,50 avg: 1,50 1,76 2,08 2,09 1999 quart. I 99 II 99 III 99 IV 99 home clean. 1,50 1,84 2,14 2,38 DATA WAREHOUSE: DESIGN - 23 warehouse, teoria e pratica della Advanced DFM Database and data mining group, descritptive attribute quarter month dept. manager manager marketing department group brand city weight type brand product holiday day diet sales manager sales district week SALE sold quantity sale amount number of customers unit price (AVG) region city address phone country descritptive attribute DATA WAREHOUSE: DESIGN - 24 Pag. 12

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Advanced DFM Database and data mining group, quarter month dept. manager manager marketing department group optional edge brand city weight type brand product holiday day diet sales manager sales district week SALE sold quantity sale amount number of customers unit price (AVG) region city address phone country start end cost promotion discount advertisement optional dimension DATA WAREHOUSE: DESIGN - 25 quarter month Advanced DFM dept. manager manager marketing department group brand city weight type brand product holiday day diet sales manager sales district week SALE sold quantity sale amount number of customers unit price (AVG) Database and data mining group, country region city address phone convergence start end cost promotion discount advertisement DATA WAREHOUSE: DESIGN - 26 Pag. 13

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Advanced DFM Database and data mining group, shared hierarchy caller usage. caller district called district called usage caller number called number PHONE CALL number length hour month district usage phone number caller called PHONE CALL number length hour month role product SHIPPING number cost warehouse city region country order customer ship month shared hierarchy shared hierarchy DATA WAREHOUSE: DESIGN - 27 Advanced DFM Database and data mining group, genre SALE author book number amount month multiple edge diagnosis ward ADMISSION cost patient name surname gender income level city birth diagnosis diagnosis group ward name surname ADMISSION gender cost patient income level city birth DATA WAREHOUSE: DESIGN - 28 Pag. 14

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Factless fact schema Database and data mining group, Some events are not characterized by measures empty (i.e., factless) fact schema it records occurrence of an event Used for counting occurred events (e.g., course attendance) representing a coverage set (e.g., promotions which took place) semester address name nationality age student ATTENDANCE (COUNT) course area school gender DATA WAREHOUSE: DESIGN - 29 Representing time DATA WAREHOUSE: DESIGN - 30 Database and data mining group, Data modification over time is explicitly represented by event occurrences time dimension events stored as facts Also dimensions may change over time modifications are typically slower slowly changing dimension [Kimball] examples: client demographic data, product description if required, dimension evolution should be explicitly modeled Pag. 15

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, How to represent time (type I) Snapshot of the current value data is overwritten with the current value it overrides the past with the current situation used when an explicit representation of the data change is not needed example customer Mario Rossi changes marital status after marriage all his purchases correspond to the married customer Also known as type I DATA WAREHOUSE: DESIGN - 31 Database and data mining group, How to represent time (type II) Events are related to the temporally corresponding dimension value after each state change in a dimension a new dimension instance is created new events are related to the new dimension instance events are partitioned after the changes in dimensional attributes example customer Mario Rossi changes marital status after marriage his purchases are partitioned in purchases performed by unmarried Mario Rossi and purchases performed by married Mario Rossi (a new instance of Mario Rossi) Also known as type II DATA WAREHOUSE: DESIGN - 32 Pag. 16

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, How to represent time (type III) All events are mapped to a dimension value sampled at a given time it requires the explicit management of dimension changes during time the dimension schema is modified by introducing two timestamps: validity start and validity end a new attribute which allows identifying the sequence of modifications on a given instance (e.g., a master attribute pointing to the root instance) each state change in the dimension requires the creation of a new instance example customer Mario Rossi changes marital status after marriage validity end timestamp of first Mario Rossi instance is given by the marriage validity start timestamp of the new instance is the same day purchases are partitioned as in type II a new attribute allows tracking all changes of Mario Rossi instance Also known as type III DATA WAREHOUSE: DESIGN - 33 Workload Database and data mining group, Workload defined by standard reports approximate estimates discussed with users Actual workload difficult to evaluate at design time if the data warehouse succeeds, user and query number may grow query type may vary over time Data warehouse tuning performed after system deployment requires monitoring the actual system workload DATA WAREHOUSE: DESIGN - 34 Pag. 17

DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Data volume Database and data mining group, Estimation of the space required by the data mart for data for derived data (indices, materialized views) To be considered event cardinality for each fact domain cardinality (number of distinct values) for hierarchy attributes attribute length It depends on the temporal span of data storage Sparsity occurred events are not all combinations of the dimension elements example: the percentage of products actually sold in each and day is roughly 10% of all combinations DATA WAREHOUSE: DESIGN - 35 Sparsity Database and data mining group, It decreases with increasing data aggregation level May significantly affect the accuracy in estimating aggregated data cardinality DATA WAREHOUSE: DESIGN - 36 Pag. 18