F4: DW Architecture and Lifecycle. Erik Perjons, DSV, SU/KTH, perjons@dsv.su.se





The data warehouse architecture

[Figure: the data warehouse architecture. External sources and source systems feed extract, transform and load; the data warehouse and data marts serve analysis/OLAP, query/reporting and data mining. The figure is divided into the back room and the front room.]

Terminology used in the figure:
- Source systems (RK): legacy systems, OLTP/TP systems
- Data staging area (RK): back end tools
- Data presentation area (RK): the data warehouse, presentation (OLAP) servers
- Data access tools (RK): end user applications, Business Intelligence tools

Source Systems

Source system characteristics:
- the source data is often held in OLTP (Online Transaction Processing) systems, also called TPS (Transaction Processing Systems)
- high level of performance and availability
- often one-record-at-a-time queries
- already occupied by the normal operations of the organisation

OLTP vs. DSS (Decision Support Systems)
OLTP vs. OLAP (Online Analytical Processing)

Source Systems

More operational source system characteristics:
- an OLTP system may be reliable and consistent, but there are often inconsistencies between different OLTP systems
- different OLTP systems use different data formats and data structures, AND DIFFERENT SEMANTICS

Source Systems

Kimball et al.'s assumptions (p. 7):
- source systems are not queried in broad and unexpected ways
- source systems maintain little historical data
- each source system is often a natural stovepipe application

DW architecture: Data staging area

[Figure: the data warehouse architecture (source systems, data staging area, data presentation area, data access tools).]

The Data Staging Area

Often the most complex part of the architecture. It involves:
- extraction (E)
- transformation (T)
- load (L)
- indexing

Implementation options:
- ETL tools can be used
- scripts for extraction, transformation and load are implemented

Data staging area

Extraction means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation, i.e. transformation.

Data staging area

Transformation involves:
- data conversion/transformation (specify transformation rules to convert to a common data format and common terms/semantics)
- data cleaning/cleansing
  - data scrubbing (use domain-specific knowledge, e.g. postal addresses, to check the data)
  - data auditing (discover suspicious patterns and violations of stated rules)
- combining data from multiple sources
- assigning warehouse (surrogate) keys
- data aggregation

Data staging area

A debated question: should the data in the data staging area be stored in a 3NF relational database and then loaded into the presentation area for querying and reporting?

Kimball (p. 8-9): a 3NF relational database in the data staging area requires more time and resources for development, periodic loading and updating, and more storage capacity for the multiple copies of the data.
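The extraction, transformation and load steps listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two source systems, their field names, the sample rows and the city synonym table are hypothetical and do not come from the lecture. It is not an ETL tool, only a picture of where conversion, scrubbing, auditing, surrogate keys and aggregation fit.

```python
# Hypothetical rows from two OLTP systems that use different formats and semantics.
crm_rows = [
    {"cust_no": "210", "name": " anna n ", "city": "STOCKHOLM"},
    {"cust_no": "211", "name": "Lars S", "city": "Malmö"},
]
billing_rows = [
    {"customer": "C210", "amount": "25,00"},
    {"customer": "C211", "amount": "7,50"},
    {"customer": "C999", "amount": "n/a"},  # unknown customer, unparsable amount
]

# Assumed domain knowledge used for scrubbing.
CITY_SYNONYMS = {"sthlm": "Stockholm"}


def extract(rows):
    """Extraction: copy the needed source data into the staging area."""
    return [dict(row) for row in rows]


def clean_city(raw):
    """Scrubbing: map known variants to one common spelling."""
    city = raw.strip()
    return CITY_SYNONYMS.get(city.lower(), city.title())


def transform(crm, billing):
    """Conversion, cleaning, auditing and surrogate key assignment."""
    customers, surrogate = {}, {}
    for row in crm:
        natural_key = "C" + row["cust_no"]           # convert to a common key format
        surrogate[natural_key] = len(surrogate) + 1  # assign warehouse (surrogate) key
        customers[surrogate[natural_key]] = {
            "natural_key": natural_key,
            "name": row["name"].strip().title(),
            "city": clean_city(row["city"]),
        }
    facts, rejected = [], []
    for row in billing:
        try:
            amount = float(row["amount"].replace(",", "."))
            key = surrogate[row["customer"]]
        except (ValueError, KeyError):
            rejected.append(row)  # auditing: the row violates a stated rule
            continue
        facts.append({"customer_key": key, "amount": amount})
    return customers, facts, rejected


def load(customers, facts):
    """Load: build an in-memory presentation area with one aggregate."""
    totals = {}
    for fact in facts:
        key = fact["customer_key"]
        totals[key] = totals.get(key, 0.0) + fact["amount"]
    return {"customer_dim": customers, "fact": facts, "total_per_customer": totals}


customers, facts, rejected = transform(extract(crm_rows), extract(billing_rows))
warehouse = load(customers, facts)
```

Rows that fail the audit are kept in `rejected` rather than silently dropped, so they can be inspected before the next load.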

A Real World Example

[Figure: data flow of a real-world data warehouse. Labels recoverable from the figure are listed below.]

- Flat file (C), DB2Connect
- Various source files: customer data (F), customer data (G), start balance (H), fees, manually adjusted to individual agreements (I)
- Some cleansing and scrubbing may be needed here
- DB2 table(s) (D)
- SQL, C++ (??)
- DB2 preliminary target DW (E)
- + aggregation (new program)
- DB2 final target DW (E)

Staging area for checking, analysing, cleaning, complementing etc. the transaction data.

Three star/join schemas comprising altogether 8 tables.

Fact tables:
- transactions (10 attributes)
- fees (7 attributes)
- start balance (4 attributes)

Dimension tables:
- time (7 attr)
- customer (> 40 attr)
- company (> 90 attr)
- product (13 attr)
- service charged (2 attr)

E complemented with some aggregated tables.

DW architecture: Data presentation area

[Figure: the data warehouse architecture (source systems, data staging area, data presentation area, data access tools).]

Data presentation area

Contents: data warehouse, OLAP servers, data marts.
- What is OLAP?
- Dimensional modelling vs. 3NF modelling
- Data marts
- ROLAP/MOLAP servers

What is OLAP?

- Acronym for On-Line Analytical Processing
- A decision support system (DSS) that supports ad-hoc querying, i.e. enables managers and analysts to manipulate data interactively
- The idea is to allow users to manipulate and visualise the data easily and quickly through multidimensional views, i.e. different perspectives

[Figure: multidimensional view with the labels Service, Quarter, Office, Product and Facts. Kimball: dimensional modelling.]

Dimensional modelling

Service dimension

  Key  Service       Service group
  S1   Local call    Group A
  S2   Intern. call  Group A
  S3   SMS           Group B
  S4   WAP           Group C

Sales dimension

  Key  Seller    Office
  F11  Anders C  Sundsvall
  F12  Lisa B    Sundsvall
  F13  Janis B   Kista

Time dimension

  Date/Key  Month  Quarter  Year
  991011    9910   4-99     99
  991012    9910   4-99     99

Customer dimension

  Key   Customer  Address    Region     Income group
  C210  Anna N    Stockholm  Stockholm  B
  C211  Lars S    Malmö      Skåne      B
  C212  Erik P    Rättvik    Dalarna    C
  C213  Danny B   Stockholm  Stockholm  A
  C214  Åsa S     Stockholm  Stockholm  A

Fact table - Transactions (the first four columns are the keys of the customer, service, sales and time dimensions)

  Customer  Service  Sales  Date    Sum    Number of calls
  C210      S1       F11    991011  25:00  3
  C210      S3       F11    991011  05:00  1
  C212      S2       F13    991011  89:00  1
  C213      S1       F13    991011  12:00  1
  C214      S4       F13    991012  08:00  1

Each row in a dimension table relates to 0..* rows in the fact table, and each fact row relates to exactly 1 row in each dimension table.

Dimensional modelling

Query: For how much did customers in Stockholm use the service Local call in October 1999?

Answer: Σ = 37:00 (fact rows C210/S1 with 25:00 and C213/S1 with 12:00).
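The query on the slide can be reproduced as a star join. The sketch below loads the slide's rows into an in-memory SQLite database using Python's standard `sqlite3` module. The table and column names are hypothetical, the sales dimension is left out because the query does not need it, and the Sum column is assumed to be a duration in minutes:seconds, which the slides do not state.

```python
import sqlite3


def to_seconds(mmss):
    """Convert 'mm:ss' to seconds (assumes the Sum column is a duration)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE service_dim  (service_key TEXT PRIMARY KEY, service TEXT, service_group TEXT);
CREATE TABLE customer_dim (customer_key TEXT PRIMARY KEY, customer TEXT, address TEXT,
                           region TEXT, income_group TEXT);
CREATE TABLE time_dim     (date_key TEXT PRIMARY KEY, month TEXT, quarter TEXT, year TEXT);
CREATE TABLE transactions (customer_key TEXT, service_key TEXT, sales_key TEXT,
                           date_key TEXT, seconds INTEGER, calls INTEGER);
""")
con.executemany("INSERT INTO service_dim VALUES (?, ?, ?)", [
    ("S1", "Local call", "Group A"), ("S2", "Intern. call", "Group A"),
    ("S3", "SMS", "Group B"), ("S4", "WAP", "Group C"),
])
con.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)", [
    ("C210", "Anna N", "Stockholm", "Stockholm", "B"),
    ("C211", "Lars S", "Malmö", "Skåne", "B"),
    ("C212", "Erik P", "Rättvik", "Dalarna", "C"),
    ("C213", "Danny B", "Stockholm", "Stockholm", "A"),
    ("C214", "Åsa S", "Stockholm", "Stockholm", "A"),
])
con.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
    ("991011", "9910", "4-99", "99"), ("991012", "9910", "4-99", "99"),
])
con.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)", [
    ("C210", "S1", "F11", "991011", to_seconds("25:00"), 3),
    ("C210", "S3", "F11", "991011", to_seconds("05:00"), 1),
    ("C212", "S2", "F13", "991011", to_seconds("89:00"), 1),
    ("C213", "S1", "F13", "991011", to_seconds("12:00"), 1),
    ("C214", "S4", "F13", "991012", to_seconds("08:00"), 1),
])

# Star join: constrain on dimension attributes, aggregate the fact measure.
query = """
SELECT SUM(t.seconds)
FROM transactions t
JOIN customer_dim c ON c.customer_key = t.customer_key
JOIN service_dim  s ON s.service_key  = t.service_key
JOIN time_dim     d ON d.date_key     = t.date_key
WHERE c.region = 'Stockholm' AND s.service = 'Local call' AND d.month = '9910'
"""
(total_seconds,) = con.execute(query).fetchone()
answer = f"{total_seconds // 60}:{total_seconds % 60:02d}"
print(answer)  # prints 37:00
```

All constraints are placed on dimension attributes and only the fact measure is aggregated, which is the query pattern a star schema is designed for.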

3NF modelling vs. Dimensional modelling

Key difference between 3NF and dimensional modelling: the degree of normalisation.

3NF modelling
- a logical design technique that eliminates data redundancy in order to maintain consistency and storage efficiency, and makes transactions simple and deterministic
- ER models for an enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables

Dimensional modelling
- a logical design technique that presents data in an intuitive way, i.e. one that is easier for the user to navigate
- allows high-performance access/queries (the complexity of 3NF models overwhelms the database system's optimizer, which means bad performance)
- aims at modelling decision support data

[Kimball et al, p 10-11]

Data presentation area: Data marts

Kimball et al (p. 10-12 and 396):
- we refer to the presentation area as a series of integrated data marts
- a data mart is a flexible set of data, ideally based on the most atomic (granular) data possible to extract from an operational source, and presented in a symmetric (dimensional) model that is resilient when faced with unexpected user queries
- in its most simplistic form, a data mart represents data from a single business process (business process = purchase orders, store inventory and so on)

Data marts

[Figure: two data marts, Calls and Subscription orders, each with the dimensions Service, Quarter and Office.]

The data warehouse bus architecture

[Figure: bus matrix relating the data marts (Orders, Production) to the dimensions (Time, Sales Rep, Customer, Promotion, Product, Plant, Distr. Center).]

[Kimball et al, p 78-79]

Data marts

- A dimensional model for a large data warehouse consists of between 10 and 25 similar-looking data marts.
- Each data mart will have 5 to 15 dimension tables.

The Data marts

Kimball et al.'s strong opinions (p. 10-12):
- all data in the presentation area should be presented, stored and accessed in dimensional models
- the data marts must contain detailed, atomic data (it is unacceptable that the detailed data should be locked up in 3NF models for drill-down)
- the data marts' dimensions should be conformed for drill-across techniques, which tie the data marts together in the data warehouse bus architecture
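To illustrate why conformed dimensions matter for drill-across, the following Python sketch combines two data marts, Calls and Subscription orders, over the shared dimensions Office and Quarter. The fact rows and measure values are hypothetical; only the mart and dimension names are taken from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (office, quarter, measure).
# Office and quarter come from conformed dimensions, so both marts
# use exactly the same member values.
calls = [
    ("Kista", "4-99", 120),
    ("Sundsvall", "4-99", 80),
    ("Kista", "4-99", 30),
]
subscription_orders = [
    ("Kista", "4-99", 5),
    ("Sundsvall", "4-99", 2),
]


def aggregate(rows):
    """Sum the measure per (office, quarter) within one data mart."""
    totals = defaultdict(int)
    for office, quarter, measure in rows:
        totals[(office, quarter)] += measure
    return totals


def drill_across(left, right):
    """Query each mart separately, then merge on the conformed row headers."""
    left_totals, right_totals = aggregate(left), aggregate(right)
    keys = sorted(set(left_totals) | set(right_totals))
    return {key: (left_totals.get(key, 0), right_totals.get(key, 0)) for key in keys}


report = drill_across(calls, subscription_orders)
for (office, quarter), (n_calls, n_orders) in report.items():
    print(office, quarter, n_calls, n_orders)
```

The merge step only works because both marts label offices and quarters identically. If one mart stored "Kista" and the other a differently coded office, the rows would not line up.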

The Data marts

More about data marts:
- far smaller data volumes, fewer data sources
- easier data cleaning process, faster roll-out
- allows a piecemeal approach to some of the enormous integration problems involved in creating an enterprise-wide data model, but complex integration in the long term

Dependent vs. Independent Data marts

[Figure: independent data marts and dependent data marts in relation to the data warehouse.]

The presentation/OLAP servers

Extended relational DBMS (ROLAP servers)
- data stored in a relational database
- star-join schemas
- support for SQL extensions
- index structures

Multidimensional DBMS (MOLAP servers)
- data stored in arrays (n-dimensional arrays)
- direct access to the array data structure
- excellent indexing properties
- poor storage utilisation, especially when the data is sparse

More about presentation servers

Characteristics of data warehouse servers, according to Chaudhuri & Dayal:
- index structures (bitmap indexes, join indexes)
- SQL extensions (operators like Cube, Crossjoin)
- materialised views (pre-aggregations)
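The trade-off between direct array access and poor storage utilisation can be made concrete with the five fact rows from the dimensional modelling example. The Python sketch below stores them in a dense four-dimensional array, flattened in row-major order. The storage layout is an assumption chosen for illustration and does not describe how any particular MOLAP server stores its data.

```python
# Dimension members from the dimensional modelling example.
customers = ["C210", "C211", "C212", "C213", "C214"]
services = ["S1", "S2", "S3", "S4"]
sellers = ["F11", "F12", "F13"]
dates = ["991011", "991012"]
dims = [customers, services, sellers, dates]

# Number of calls per (customer, service, seller, date), from the fact table.
facts = {
    ("C210", "S1", "F11", "991011"): 3,
    ("C210", "S3", "F11", "991011"): 1,
    ("C212", "S2", "F13", "991011"): 1,
    ("C213", "S1", "F13", "991011"): 1,
    ("C214", "S4", "F13", "991012"): 1,
}

# Position of every member within its dimension.
positions = [{member: i for i, member in enumerate(dim)} for dim in dims]


def offset(keys):
    """Row-major offset of a cell: direct access without searching."""
    pos = 0
    for dim, lookup, key in zip(dims, positions, keys):
        pos = pos * len(dim) + lookup[key]
    return pos


# Allocate one cell for every combination of dimension members.
cells = 1
for dim in dims:
    cells *= len(dim)
cube = [0] * cells

for keys, n_calls in facts.items():
    cube[offset(keys)] = n_calls

utilisation = len(facts) / cells
print(cells, len(facts), f"{utilisation:.1%}")  # prints 120 5 4.2%
```

A relational fact table stores only the five rows that exist, whereas the dense array reserves all 120 cells. The array answers a cell lookup with one offset computation, which is the direct access the slide refers to.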

DW architecture: Metadata repository

[Figure: the data warehouse architecture extended with monitoring & administration, refresh and a metadata repository.]

What is metadata?

Data about data / information about data.

Main functions are to give:
- data definitions
- the origin of data
- the structure of data
- rules for the selection and transfer of data
- qualitative and quantitative data about data

Contained in the metadata repository.

The metadata repository

An integrated, complete source of metadata is at the heart of the data warehouse architecture. It:
- supports the information needs of
  - system developers
  - data administrators
  - system administrators
  - users
  - applications on the data warehouse
- has a very complex data structure
- must contain full version history
- must always be up to date

Metadata life cycle activities

- Collection: identify and capture metadata in a central repository
- Maintenance: establish processes to synchronise metadata with the changing data structure
- Deployment: provide metadata to users in the right form and with the right tools

Different types of metadata

- Administrative metadata: includes all information necessary for setting up and using a DW, e.g. information about source databases, DW schemas, dimensions, hierarchies, predefined queries, physical organisation, rules and scripts for extraction, transformation and load, back-end and front-end tools
- Business metadata: business terms and definitions, ownership of data
- Operational metadata: information collected during the operation of the DW, e.g. usage statistics, error reports

DW architecture: End user applications

[Figure: the data warehouse architecture with monitoring & administration and the metadata repository (source systems, data staging area, data presentation area, data access tools).]

End user applications

- Analysis: OLAP tools, BI apps, DSS
- Query/reporting tools
- Data mining

Spreadsheet output of OLAP tool

Dimension attributes shown: product, product group; month, quarter; office, region.

  Product Group  Region  First Quarter - 1997
  Group A        ABC      1245
  Group A        XYZ     34534
  Group B        ABC     45543
  Group B        XYZ     34533

Annotations in the figure:
- Product Group, Region: column headers (join constraints)
- First Quarter - 1997: column header (application constraint)
- Group A/ABC, Group A/XYZ, ...: row headers
- the numbers: answer set representing the focal event

Graphical output of OLAP tool

[Figure: graphical output of an OLAP tool.]

Functionalities of OLAP tools

- Drill-down: decreasing the level of aggregation
- Drill-up/roll-up/consolidation: increasing the level of aggregation
- Drill-across: move between different star-join schemas using conformed dimensions and joins
- Slicing and dicing: the ability to look at the database from different views, e.g. one slice shows all sales of a product type within regions, another slice shows all sales by sales channel within each product type
- Pivoting: e.g. change columns to rows, rows to columns
- Ranking: sorting

Think of an OLAP data structure as a Rubik's Cube of data that users can twist and twirl in different ways to work through what-if and what-happened scenarios [Lee Thé]
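Several of the operations above can be sketched in a few lines of Python on the four rows of the spreadsheet example. The function names and the measure name `value` are hypothetical (the slides do not name the measure). Real OLAP tools run these operations against the server rather than on in-memory lists.

```python
# The four cells of the spreadsheet example (First Quarter 1997).
rows = [
    {"product_group": "Group A", "region": "ABC", "value": 1245},
    {"product_group": "Group A", "region": "XYZ", "value": 34534},
    {"product_group": "Group B", "region": "ABC", "value": 45543},
    {"product_group": "Group B", "region": "XYZ", "value": 34533},
]


def roll_up(rows, *dims):
    """Increase the level of aggregation: keep only the given dimensions."""
    totals = {}
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] = totals.get(key, 0) + row["value"]
    return totals


def slice_rows(rows, **fixed):
    """Slice: fix one or more dimensions to a single member."""
    return [r for r in rows if all(r[d] == m for d, m in fixed.items())]


def pivot(rows, row_dim, col_dim):
    """Pivot: choose which dimension runs down the rows and which across."""
    table = {}
    for row in rows:
        table.setdefault(row[row_dim], {})[row[col_dim]] = row["value"]
    return table


def rank(totals):
    """Ranking: sort aggregated results, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_group = roll_up(rows, "product_group")     # roll-up over region
detail = roll_up(rows, "product_group", "region")  # drill-down to the finest level
abc_only = slice_rows(rows, region="ABC")
by_region = pivot(rows, "region", "product_group")
ranked_regions = rank(roll_up(rows, "region"))
```

Drill-down is the same function called with more dimensions: `detail` returns to the finest level available in these rows, while `by_group` aggregates region away.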

Business Intelligence (BI) apps

Strategic
- Who: strategic leaders
- What: formulate strategy and monitor corporate performance
- Examples: balanced scorecard, strategic planning

- Who: operational managers
- What: execution of strategy against objectives
- Examples: budgeting, sales forecasting

Analytical
- Who: analysts, knowledge workers, controllers
- What: ad-hoc analysis
- Examples: financial and sales analysis, customer segmentation, clickstream analysis

Problems of Data Warehousing

- Complexity of integration
- Hidden problems with source systems
- Data homogenisation
- Underestimation of resources for data loading
- Required data not captured
- High maintenance
- Long duration projects

Why not integrate the legacy applications (OLTP systems) instead?

Operational Data Store (ODS)

No single universal definition...

ODS definition 1: Implemented to deliver operational reporting, especially when neither the legacy nor the modern OLTP systems provide adequate operational reports. Used for fixed queries and for tactical decision making.

ODS definition 2: Built to support real-time interactions, especially in Customer Relationship Management applications. The traditional data warehouse typically is not in a position to support the demand for near-real-time data.

OMG's standards

Meta Object Facility (MOF)

  Layer     Content          Examples
  M3 layer  Meta-metamodel   MOF
  M2 layer  Metamodel        UML metamodel, CWM metamodel
  M1 layer  Model
  M0 layer  Instances        Helen Nagy, Invoice no 34

Common Warehouse Metamodel (CWM)

[Figure: a data warehousing environment with data sources, ETL, data store, data warehouse and data marts, feeding analysis, reporting, visualization and data mining.]

The collection of metamodels in CWM can be used to model the whole data warehousing environment, i.e. from data sources to end-user analysis, as well as data warehouse management.

Common Warehouse Metamodel

- Common Warehouse Metamodel (CWM) is a language specifically designed to model data warehousing and data mining applications, i.e. to integrate data warehousing and business analysis (business intelligence) tools
- CWM has a lot in common with the UML metamodel but has a number of special metamodels (metaclasses), e.g. for modelling relational databases, multidimensional databases, OLAP, schema transformations and XML

[Kleppe et al, p.139-140 (2003)]

Why metamodelling?

[Figure: the same process concepts shown at three levels. Labels recoverable from the figure are listed below.]

- Meta metamodel level or reference model: Transformation, Event and State, connected by "consists of", "precedes" and "succeeds"
- Metamodel level: Function, Event, Activity and State, connected by "precedes" and "succeeds"
- Model level, first notation: Order received, Capture ordered items, Ordered item captured, Check material on stock, Material is not on stock, Material is on stock
- Model level, second notation: Capture ordered items, Ordered item [captured], Check material on stock, Material on stock [checked]

[Rosemann, Green, 2002]

CWM packages

Packages/metamodels per layer:

  Management    Warehouse Process, Warehouse Operation
  Analysis      Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
  Resource      Relational, Record, Multi-Dimensional, XML
  Foundation    Business Information, Data Types, Expressions, Keys and Indexes, Software Deployment, Type Mapping
  Object Model  Core, Behavioral, Relationships, Instance

CWM packages: layers

- Object layer: base metamodels/packages, which are (re)used by the other metamodels/packages
- Foundation layer: extends the object layer with required services, which are (re)used by the other metamodels/packages, e.g. the unique key in the Keys and Indexes metamodel/package is used by relational, OO and record-oriented databases
- Resource layer: defines metamodels/packages for various types of data resources
- Analysis layer: analysis-oriented metadata
- Management layer: describes the data warehousing process as a whole

[Poole et al, p.36-40 (2002)]

CWM packages: relations

[Figure: class diagram relating three packages. Core package: Element, ModelElement, Namespace, ClassifierFeature, Feature, Expression, Classifier, StructuralFeature, ProcedureExpression, Class, Attribute. Relational package and Datatype package: ColumnSet, Column, QueryExpression, NamedColumnSet, QueryColumnSet, Table, View.]

CWM classifier equality

  Object             Package      Classifier (class)  Feature (attribute)
  Relational         Schema       Table               Column
  Record             Record file  RecordDef           Field
  Multi-Dimensional  Schema       Dimension           Dimensioned Object
  XML                Schema       Element Type        Attribute

More about CWM

[Figure: the metamodels of Tool X, Tool Y and Tool Z are each related to a common representation, the CWM packages (<<metamodels>>).]

Business Dimensional Lifecycle

[Figure: the Business Dimensional Lifecycle. Steps recoverable from the figure are listed below.]

- Project Planning
- Business Requirement Definition
- Technology track: Technical Architecture Design, Product Selection & Installation
- Data track: Dimensional Modeling, Physical Design, Data Staging Design & Development
- Application track: End-User Application Specification, End-User Application Development
- Deployment
- Maintenance and Growth
- Project Management (runs throughout)

The Data Warehouse Architecture Framework

The framework is a matrix of level of detail (rows) by architecture area (columns: Data, Back room, Front room, Infrastructure).

Business reqs and audit
- Data: info needed for better decisions; enterprise models
- Back room: how to get, transform and make data available
- Front room: major business issues; how to measure; how to analyse
- Infrastructure: HW/SW capabilities needed vs. what we have

Architecture models and documents
- Data: focal events, facts, dimensions; dimensional models
- Back room: capabilities needed to get and transform data; major data stores
- Front room: users' needs; major classes of analyses; priorities
- Infrastructure: where the data is coming from; calc and storage reqs

Detailed models and specs
- Data: logical and physical models; domains, derivation rules
- Back room: standards, prods to provide capabilities; how to hook together
- Front room: report layouts, derivation; for whom, when
- Infrastructure: how to interact with capabilities; system utilities, calls, APIs...

Implementation
- Data: DB, indexes, backup...
- Back room: write extracts, loads; automate process
- Front room: implement report and analysis env; build rpt; train users
- Infrastructure: install, test infrastructure; connect sources to targets to desktop