Data Mining and Data Warehousing Henryk Maciejewski Data Warehousing and OLAP




Part II Data Warehousing Contents OLAP Approach to Data Analysis Database for OLAP = Data Warehouse Logical model Physical models (ROLAP, MOLAP, HOLAP) Querying multidimensional data DW project methodologies

Further Reading J. Han, M. Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier 2006. W. Inmon: Building the data warehouse, Wiley 2005. F. Silvers: Building and maintaining a data warehouse, CRC Press 2008. www.information-management.com

From DBMS to Analytical Systems... The 1960s: first IT systems The 1970s: DBMS systems On-line transactional processing systems (OLTP) The 1990s: On-line analytical processing (OLAP), data warehousing, data mining Business Intelligence (BI), DSS

IT Systems Generate Data Deluge. IT systems in: retail trade (bar codes, credit cards), banking, insurance, telecoms, healthcare, etc.; science (biology, weather/earth monitoring, sky surveys, ...). Data deluge: WalMart: 20 million transactions per day; Mobil: ca. 100 TB of data (exploration of oil reserves); Human Genome Project: ~GB of data; NASA Earth Observing System: 50 GB per hour (!); DISS solar energy plant monitoring: ~800 numbers / 5 secs.

How to Get Information out of Data. Efficient technologies are available to gather and store data. Simple approaches to data analysis (spreadsheet based, SQL query based, ...) prove inefficient. Technologies and tools are needed for efficient data analysis / knowledge extraction from data; hence OLAP, KDD (Knowledge Discovery in Databases) and DM (Data Mining) emerged. Information: data in context; data that have meaning, relevance and purpose.

Various Approaches to Data Analysis. Data Mining: discovering relationships in data, e.g., customer profiles, models to assess credit risk, etc. Data Warehouse / OLAP: multidimensional data model y(w_1, w_2, ..., w_n); database for OLAP; integrated data (ETL: Extract-Transform-Load). SQL: SQL queries to raw data.

Data Analysis Techniques: SQL Queries. [Diagram: a cross-sectional question goes to a programmer / DB admin, who generates an SQL program that queries several data sources and produces a report.] Drawbacks: considerable coding effort; heavy load on OLTP servers; multiple versions of the truth.

Data Warehouse (W. Inmon 1992). [Diagram: several source data stores feed the Data Warehouse and Data Mart, which serve OLAP / DSS.] Data warehouse: specific structure of database optimized for OLAP (MDDB, snowflake, star schema, ROLAP, MOLAP, HOLAP). ETL: data access; data integration (cleaning, transformation).

Why OLAP Technology is Becoming Indispensable. Getting information out of historical data. Integration of data sources in the enterprise. Cross-sectional analyses of enterprise data: discovering relationships / patterns in large amounts of data, trend analysis, data mining.

OLAP/Data Warehouse: Key Design Issues. Data organization: multidimensional data model (facts seen as a function of dimensions); physical data storage that allows for fast (online) analysis of vast data volumes. Data integration: ensure high quality of analytical data; taming the data chaos; single version of the truth.

OLAP vs. OLTP: Different Applications and Data Model. OLTP: operational data; automation of day-to-day operations of an organization: phone-call billing, order / invoice processing, banking / credit card transactions, etc. OLAP: analytical data; getting information for decision support: Who are our best customers (characteristics)? Churn analysis. How does an increase in sales correlate with quality of service?

OLAP vs. OLTP Summary
Main applications
  OLTP: Automation of operations of organization: entering data on routine day-to-day transactions; fixed-structure reports / summaries created on a regular basis (daily, monthly, etc.)
  OLAP: Decision support: multidimensional statistical analyses, forecasting, ad hoc queries; advanced reporting
Time horizon for data retention
  OLTP: Usually short term (90 days, 1 year)
  OLAP: Long-term data retention, to support historic data analyses, comparative reports, trend analysis over time
Data updates
  OLTP: On the fly, during individual transactions
  OLAP: Static data, updated on a regular basis (e.g., monthly); data collected over time (time-stamped)
Data access
  OLTP: Frequent access to small portions of data (a few or tens of records); simple, well-structured queries
  OLAP: Rare access involving large amounts of data; complex, ad hoc queries

Schedule OLAP Approach to data analysis OLAP vs OLTP OLAP data integration Database for OLAP = Data Warehouse Logical data model multidimensionality Physical data models (ROLAP, MOLAP, HOLAP)

Data chaos Why it is Hard to Run Analytics Based on OLTP Main obstacles for building successful OLAP on top of transactional data: Data awareness Data understanding Data variability Data redundancy (and hence consistency) Data islands in disparate transactional systems

Data Chaos Example Faculty of EE Teachers DB Notes DB Courses DB Tutors DB Faculty of Architecture Exam results DB Problems / difficulties: how to find data how to extract data understand the meaning clean the data Recruitment DB Courses DB Data warehouse

Business Intelligence Based on OLTP? How to get to the data in the DB? How to locate the right table / column? How to understand the meaning of the data? How to clean the data?

Dedicated System for BI (OLAP) ETL (Extract Transform Load) Connect to source DB Integrate / clean Transform to the multidimensional model Multidimensional model of data (facts vs. dimensions)

Example: Multidimensional Model Cubes: Over-hours Availability Fuel consumption

Example: ETL Process ETL for the cube Availability

Data Warehouse Definition
Data warehouse: a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
Subject-oriented: data is organized around subjects of interest to the data analyst (e.g., customer, product, supplier); transactional systems are process-oriented (e.g., order processing).
Integrated: a data warehouse integrates data from several data sources; data characteristics (attributes) must be coded in a consistent way (e.g., consistent coding of SEX: male - female, m - f, 0 - 1).
Non-volatile: data loaded into the data warehouse is a snapshot of operational data at a specific point in time; once loaded, data in the warehouse cannot be changed.
Time-varying: data elements in the warehouse are time-stamped to facilitate analysis of changes / trends over time.

Summary of This Part Concept of OLTP and OLAP Different use, different requirements for Data organization (data model) Database design Need for data integration Overcoming data chaos Ensuring high quality of analytical data in warehouse

Example: OLAP for Student Notes

Example: IBM Tivoli Monitoring Data. Monitoring agents: keep 24 h of detailed data. Data Warehouse: aggregated, timestamped data drawn from agents.

Example: IBM Tivoli Monitoring Data Warehouse
Agent: default attribute groups
  Monitoring Agent for Windows OS: Network_Interface, NT_Processor, NT_Logical_Disk, NT_Memory, NT_Physical_Disk, NT_Server, NT_System
  Monitoring Agent for UNIX: Disk, System
  Monitoring Agent for Linux: Linux_CPU, Linux_CPU_Averages, Linux_CPU_Config, Linux_Disk, Linux_Disk_IO, Linux_Disk_Usage_Trends, Linux_IO_Ext, Linux_Network, Linux_NFS_Statistics
  Monitoring Agent for DB2: KUDDBASEGROUP00, KUDDBASEGROUP01, KUDBUFFERPOOL00, KUDINFO00, KUDTABSPACE

Schedule: Multidimensional Model of OLAP Data; Why OLAP Doesn't Like Normalized DB; Relational OLAP (ROLAP); Multidimensional OLAP (MOLAP); Hybrid OLAP (HOLAP)

OLAP: Multidimensional Model of Data OLAP = multidimensional analysis of data Multidimensional model of data: Measure as a value in multidimensional space of dimensions Numeric measures objects of analysis, also referred to as facts Dimensions variables on which the measure depends / that uniquely determine the measure E.g., measure: sales [$] dimensions: product, shop, date

OLAP: Multidimensional Model of Data. Dimension hierarchies, e.g., geographical hierarchy: shop -> city -> region -> country; time hierarchy: day of week -> week -> month -> year; product hierarchy: item -> type -> group.

Example Model Built in Lab. Multidimensional model for analysis of students' notes. Measure: student's grade (note). Dimensions: characteristics of students; characteristics of teachers; characteristics of courses (group of courses, type of courses, etc.); time hierarchy: calendar semester year; workload of students / teachers, etc. Various statistics will be of interest, e.g., average grade, number of grades, std deviation, distribution, ...

Useful Concepts Aggregation: e.g., computing total sales by year based on more detailed data Drill-down: create more detailed view (i.e., decrease level of aggregation) Rollup: increase level of aggregation Slice-and-dice: reduce dimensionality of data: fix values of some dimensions and observe how data depends on the remaining dimensions
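The operations above can be sketched on a tiny in-memory cube. This is only an illustration, not how an OLAP server implements them: the fact rows, the dimension names in DIMS and the helpers aggregate and slice_ are all made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, shop, year, month) -> sales in $.
facts = {
    ("tea", "shop_A", 2001, 1): 100.0,
    ("tea", "shop_A", 2001, 2): 150.0,
    ("tea", "shop_B", 2001, 1): 80.0,
    ("coffee", "shop_A", 2001, 1): 200.0,
    ("coffee", "shop_B", 2002, 1): 120.0,
}
DIMS = ("product", "shop", "year", "month")


def aggregate(cells, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(float)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_(cells, **fixed):
    """Keep only the cells whose listed dimensions have the given values."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(d)] == v for d, v in fixed.items())
    }


by_year_month = aggregate(facts, ("year", "month"))  # more detailed view
by_year = aggregate(facts, ("year",))                # roll-up: month -> year
tea_by_shop = aggregate(slice_(facts, product="tea"), ("shop",))  # slice, then aggregate
```

Going from by_year to by_year_month is a drill-down; the reverse direction is a roll-up. Fixing product="tea" and looking at the remaining shop dimension is a slice.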


Normalized DB (a Reminder) Database design for OLTP uses Entity Relationship diagrams and normalization techniques Normalized DB: No data redundancy Many tables with many-to-one relationships Optimized for easy / fast updates of data Efficient for constantly changing data Efficient for OLTP

Normalized DB - Example
Task: answer the following OLAP query: Which products were sold to a particular group of customers within a specified time frame?
Tables:
  Product: Product ID, Product name, Product type, ...
  Order item: Order ID, Order item ID, Product ID, Quantity
  Shipment: Shipment ID, Status, Order ID, Order item ID, Customer ID
  Order: Order ID, Customer ID, Order date, Sales rep ID
  Customer: Customer ID, Customer name, Address, City, ...
  Sales rep: Sales rep ID, Sales rep name, District ID
  Contact: Contact ID, Customer ID, Contact name, Contact type
  District: District ID, District name, manager

Normalized DB: Problems with OLAP Queries. Many join operations on tables: low efficiency of SQL queries. Circular join paths: a query can be answered in two different ways, so different results are possible. Complicated database schema: SQL code is difficult to build / maintain.

OLAP: Requirements for Database Design. Simplicity of database schema. Efficiency of multidimensional queries. Consistency and accuracy of data. Database schemas to meet these requirements: Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), Hybrid OLAP (HOLAP).


Relational OLAP. Warehouse data is stored using a relational database server. The multidimensional data model is represented by a star-schema database or a snowflake-schema database. Star schema: a single fact table and a single table for each dimension. A fact table entry consists of: the aggregate value of the measure; foreign keys to the dimension tables (together they form the composite key of the fact table).

Relational OLAP Warehouse data stored using a relational database server Multidimensional data model represented by a star-schema database or snowflake-schema database Snowflake schema: Variant of star schema with (some) dimension tables normalized (for easier maintenance of dimension data)

Example Star Schema
  Sales (fact table): Sales person ID, Product ID, Date ID, Customer ID, Number sold, amount
  Sales person: Sales person ID, Name, Region, Division, Office
  Date: Date ID, Date, Year, Month, Day
  Customer: Customer ID, Name, Sex, Age, Job name
  Product: Product ID, Prod code, Prod name, Prod type, Prod category

Example Snowflake Schema
  Sales (fact table): Sales person ID, Product ID, Date ID, Customer ID, Number sold, amount
  Sales person: Sales person ID, Name, Region, Division, Office
  Date: Date ID, Date, Year, Month, Day
  Customer: Customer ID, Name, Sex, Age, Job ID
  Job Code: Job ID, Job name, Job category
  Product: Product ID, Prod code, Prod name, Prod type, Prod category

ROLAP: Example of OLAP Query
OLAP query: How many products were sold to a specific group of customers in a given time frame? It translates into the following SQL query:
  select sum(number_sold) as number_sold
  from fact_sales a, dimension_date b, dimension_customer c
  where b.date = '21jan2001'd
    and c.sex = 'F'
    and a.dateid = b.dateid
    and a.customerid = c.customerid;
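To try the star-schema query without a warehouse at hand, the following sketch runs an equivalent query against an in-memory SQLite database. The table and column names follow the slide; the sample rows, the ISO date format and the reduced set of columns are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Minimal star schema: one fact table and two of its dimension tables.
con.executescript("""
CREATE TABLE dimension_date (dateid INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE dimension_customer (customerid INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE fact_sales (dateid INTEGER, customerid INTEGER, number_sold INTEGER);
INSERT INTO dimension_date VALUES (1, '2001-01-21'), (2, '2001-01-22');
INSERT INTO dimension_customer VALUES (10, 'F'), (11, 'M');
INSERT INTO fact_sales VALUES (1, 10, 3), (1, 11, 5), (2, 10, 7), (1, 10, 2);
""")

# Each dimension is joined to the fact table through its surrogate key;
# the dimension attributes carry the filter conditions.
query = """
SELECT SUM(a.number_sold) AS number_sold
FROM fact_sales a
JOIN dimension_date b ON a.dateid = b.dateid
JOIN dimension_customer c ON a.customerid = c.customerid
WHERE b.date = ? AND c.sex = ?
"""
(number_sold,) = con.execute(query, ("2001-01-21", "F")).fetchone()
print(number_sold)
```

Only the two fact rows for customer 10 on 2001-01-21 match, so the query sums 3 and 2. Note that every additional dimension adds exactly one join to the fact table, which keeps such queries simple compared with the normalized schema.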


Multidimensional OLAP Warehouse data stored in a multidimensional database (MDDB) MDDB Specialized storage facility that directly reflects multidimensional model of data MDDB can be viewed as an N-dimensional (hyper)cube in which values of numerical measure (object of analysis) are stored Data stored in MDDB is presummarized, i.e., values stored in cross sections of dimensions have been aggregated at the MDDB build time (thus performance of multidimensional (OLAP) queries is high)

MDDB Idea Sample base table: Analysis variable (fact): note Classification variables (dimensions): attributes of students, attributes of teachers, semester, year, faculty, etc.

MDDB Idea
  select sum(note) as SUM, count(*) as N, spec, semester, year
  from base_table
  where spec='inf' and semester=8 and year=2001
  group by spec, semester, year

MDDB Data Aggregation Each crossing of the cube contains specified statistics for the analysis variable(s) Distributive measures can be stored in cube, such as N, SUM, SUMWGT, UWSUM, NMISS, USS, MIN, MAX Algebraic measures can be computed from stored measures, such as AVG=SUM/N
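The distinction between distributive and algebraic measures can be sketched as follows. The Cell class and its field names are hypothetical; it stores only N, SUM, MIN and MAX, shows that two cells can be merged without going back to the detailed data, and derives AVG = SUM/N from the stored values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One crossing of the cube, holding distributive statistics only."""
    n: int
    total: float
    lo: float
    hi: float

    @classmethod
    def from_values(cls, values):
        return cls(len(values), sum(values), min(values), max(values))

    def merge(self, other):
        # Distributive measures combine directly from the partial results.
        return Cell(
            self.n + other.n,
            self.total + other.total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def avg(self):
        # Algebraic measure: computed from the stored SUM and N.
        return self.total / self.n


winter = Cell.from_values([3.0, 4.5, 5.0])
summer = Cell.from_values([2.0, 4.0])
year = winter.merge(summer)
print(year, year.avg)
```

Averaging the two semester averages would give the wrong yearly value because the semesters hold different numbers of grades, which is why AVG is stored as SUM and N rather than as a number.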

MDDB Data Aggregation. Problem with holistic measures, i.e., measures for which no algebraic aggregate function exists, e.g., MEDIAN. In large cube applications, approximate values of holistic measures are computed using algebraic measures.
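A small counterexample, with made-up numbers, shows why MEDIAN cannot be stored per cell and combined later: the median of the sub-medians differs from the median of the detailed data.

```python
from statistics import median

# Two hypothetical partitions of the detailed data (e.g., two cube cells).
part_a = [1, 2, 9]
part_b = [3, 4, 5, 6, 7]

# Correct value: needs all of the detailed data at once.
true_median = median(part_a + part_b)

# Attempt to combine per-cell medians, as one would with SUM or N.
median_of_medians = median([median(part_a), median(part_b)])

print(true_median, median_of_medians)
```

Here the true median is 4.5 while the combined per-cell medians give 3.5, so no fixed set of stored sub-results reproduces the exact answer.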

Cubes and Subcubes OLAP queries related to a subset of dimensions Result is aggregated at query time from the NWAY cube E.g., report on sales of all products over subsequent years sum for all products and all months needs to be computed at run time If there are many dimensions with high cardinality, this can be lengthy Subcubes are used to speed up performance for queries (related to subsets of dimensions) that users are likely to ask most frequently
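The trade-off can be sketched with a toy cube. Everything here is hypothetical (the nway contents, the stored dictionary, the query helper): a subcube is simply a pre-aggregated copy of the NWAY cube over a subset of dimensions, and a query falls back to run-time aggregation only when no stored subcube matches.

```python
from collections import defaultdict

DIMS = ("product", "year", "month")
# Hypothetical NWAY cube: one pre-summarized SUM per (product, year, month) crossing.
nway = {
    ("tea", 2001, 1): 10, ("tea", 2001, 2): 20,
    ("tea", 2002, 1): 5, ("coffee", 2001, 1): 7,
    ("coffee", 2002, 3): 8,
}


def build_subcube(cube, keep):
    """Aggregate the NWAY cube down to the dimensions listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    sub = defaultdict(int)
    for key, total in cube.items():
        sub[tuple(key[i] for i in idx)] += total
    return dict(sub)


# Subcubes chosen in advance and computed once, at cube build time.
stored = {("year",): build_subcube(nway, ("year",))}


def query(keep):
    """Answer from a stored subcube if one matches, else aggregate at run time."""
    keep = tuple(keep)
    if keep in stored:
        return stored[keep], "subcube"
    return build_subcube(nway, keep), "nway"


print(query(("year",)))
print(query(("product",)))
```

The first query is a dictionary lookup; the second has to scan every NWAY cell, which is the cost that grows with the number and cardinality of dimensions.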

Which Subcubes to Store?
Idea: find the categories that will be used most frequently and have the smallest cardinality.
Starnet (spiral) model: put the categories in ascending order of cardinality, then draw a spiral starting with YEAR (most frequent use anticipated, lowest cardinality). The resulting lists of categories are the subcubes:
  YEAR SECTOR REGION GRP_SUPP MONTH GRP SHOP SUPPLIER FAMILY DAY ARTICLE
  YEAR SECTOR REGION GRP_SUPP MONTH GRP SHOP SUPPLIER FAMILY DAY
  ...
  YEAR SECTOR
  YEAR
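One possible reading of the spiral is that each subcube is a prefix of the category list sorted by cardinality. The sketch below follows that reading; the cardinality values are invented so that the resulting order matches the slide, and real values would come from the data.

```python
# Hypothetical cardinalities (number of distinct values) for each category.
cardinality = {
    "YEAR": 5, "SECTOR": 6, "REGION": 8, "GRP_SUPP": 10, "MONTH": 12,
    "GRP": 20, "SHOP": 50, "SUPPLIER": 100, "FAMILY": 200,
    "DAY": 365, "ARTICLE": 5000,
}

# Ascending order of cardinality: the order in which the spiral visits categories.
ordered = sorted(cardinality, key=cardinality.get)

# Candidate subcubes: every prefix of the ordered list, longest first.
subcubes = [ordered[:k] for k in range(len(ordered), 0, -1)]

for dims in subcubes:
    print(" ".join(dims))
```

The longest list corresponds to the most detailed subcube and the shortest to the YEAR-only summary; which of these are actually worth storing still depends on the expected query mix.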

Example: Building MDDB (SAS)
  proc mddb data=grades out=grades_mddb label='mddb for analysis of grade data';
    class year sem sex faculty institute exam type id_title;
    var note /n sum min max;
    hierarchy year sem /name="Time Hierarchy";
    hierarchy faculty institute /name="Affiliation Hierarchy";
  run;
Log:
  NOTE: SAS/MDDB(R) Server Software has been initialized.
  NOTE: N-way complete cells=1455.
  NOTE: "Time Hierarchy" computed from "NWAY" cells=10.
  NOTE: "Affiliation Hierarchy" computed from "NWAY" cells=26.
  NOTE: PROCEDURE MDDB used: real time 1:26.54 cpu time 1:19.82

Example: Building MDDB (SAS). DATA: specifies the base table for the MDDB. CLASS statement: specifies the classification variables (i.e., the NWAY cube dimensions). VAR statement: specifies the analysis variables (with the statistics to be stored in the MDDB; distributive aggregate functions). HIERARCHY statements: specify the subcubes to include in the MDDB. Subcubes can be added / removed (ADDHIER, REMOVEHIER statements).

ROLAP vs. MOLAP. MOLAP: very high query performance; easy maintenance; less scalable (fixed max size of a cube). ROLAP: very scalable; lower query performance; design and maintenance more difficult. Problem with dimensions with very high cardinality. Problem with constantly growing database. Rule of thumb: use MOLAP as long as possible, then switch to... HOLAP.


HOLAP Data Model MDDB Relational DB Star schema Multidimensional data provider (MDP) cache viewer Viewer (OLAP applications) sees a logical MDDB (or a proxy or virtual MDDB) which is presented by the MDP

HOLAP Techniques Racking individual MDDBs for different values of one dimension (e.g., separate MDDBs for subsequent years) Stacking different subcubes stored in separate MDDBs or tables (e.g., YEAR*COUNTRY*PRODUCT local MDDB, YEAR*COUNTRY*PRODUCT*MONTH on remote server) year=2003 2004 2005 2006 Multidimensional data provider (MDP)
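Racking can be sketched as a thin provider object that hides several per-year stores behind one logical cube. The class name, the store layout and the sample figures are all assumptions for illustration; a real MDP would also consult metadata to locate remote MDDBs and relational tables.

```python
# Hypothetical per-year MDDBs ("racking"): each maps (country, product) -> SUM of sales.
mddb_by_year = {
    2003: {("PL", "tea"): 10, ("DE", "tea"): 4},
    2004: {("PL", "tea"): 12},
    2005: {("PL", "coffee"): 9, ("PL", "tea"): 1},
}


class MultidimensionalDataProvider:
    """Presents the racked MDDBs to the viewer as one logical MDDB."""

    def __init__(self, racks):
        self._racks = racks

    def total(self, years, country=None, product=None):
        """Sum sales over the requested years; None means 'all members'."""
        result = 0
        for year in years:
            # Route the request to the MDDB that holds this year, if any.
            for (c, p), value in self._racks.get(year, {}).items():
                if country in (None, c) and product in (None, p):
                    result += value
        return result


mdp = MultidimensionalDataProvider(mddb_by_year)
print(mdp.total([2003, 2004], product="tea"))
print(mdp.total([2005], country="PL"))
```

The viewer calls total without knowing how many physical MDDBs exist, so a new year can be added by registering one more store rather than rebuilding a single large cube.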

When to Use HOLAP? Too much data for one MDDB Access to existing ROLAP solutions Ensuring scalability with growing data volume Flexible integration of distributed data sources Improved performance distributed processing of queries Price: HOLAP metadata must be maintained

DW Architectures MOLAP RDBMS Server MDDBS Server MOLAP Engine RDB Flat files ERP ETL DW (ODS) Create/ store cubes MDDBs MDX XML/A OLTP Data Sources Data Layer OLAP Application Layer Presentation Layer

DW Architectures ROLAP RDBMS Server Analytical Server RDB Flat files ERP ETL DW (ODS) Complex SQL queries MDX XML/A OLTP Data Sources Data Layer OLAP Application Layer Presentation Layer

MS SQL Storage Settings: Proactive Caching. MOLAP: best performance; possible data latency (recent data changes not seen). ROLAP: recent changes in data seen immediately; price: poor performance. Proactive caching: build a MOLAP cache to boost performance? How frequently should the MOLAP cube be rebuilt? Should the outdated MOLAP cube be queried while the cube is rebuilt? Rebuild cubes on schedule or based on changes in data. Minimize latency vs. maximize performance. Partitions: vertical: cubes based on subsets of rows in the fact table; horizontal: cubes based on separate fact tables (e.g., for subsequent years).

MS SQL Server Analysis Services Storage Settings

Standardizing Access to OLAP Data Sources: XML/A. XML for Analysis (XML/A): a standard API between an OLAP client and an OLAP data provider. Design goals: based on open standards, not bound to any language or technology; optimized for the Web: stateless, with a minimal number of round trips. Client and server communicate using XML, HTTP and SOAP.

Standardizing Access to OLAP Data Sources: XML/A. XML/A methods: Discover: retrieve information (metadata) from the provider, such as the list of available cubes and their properties. Execute: request execution of a command by the server (an MDX language command, e.g., an OLAP MDX SELECT).

Multidimensional Expressions Language (MDX) Introduced by Microsoft in OLE DB for OLAP Now considered de facto standard for querying multidimensional data in OLAP cubes Simple form of MDX query expression: SELECT axis_specs ON COLUMNS, axis_specs ON ROWS FROM cube WHERE slicer_specs

MDX By Examples. Examples are based on the cube built in the lab. A tuple uniquely identifies a cell in a cube; it is defined by a combination of attribute members for different attributes. If some attribute is not specified, its All (default) member is used. If the measure is not specified, the first (default) measure defined in the cube is used.

MDX Tuples [Measures].[Note Count] is a tuple To identify a cell, the All member of other attributes was used

MDX Tuples Tuple points to male (M) students in Student Group (Studiengang) A Use ( ) to identify a tuple

MDX Sets of Tuples Two tuples (Note Avg and Note Count) form a set Use { } to identify a set of tuples

MDX Cartesian Products. More axes: Cartesian product. The .members MDX function lists the members of an attribute. ON COLUMNS is axis 0, ON ROWS is axis 1 (up to 128 axes).

MDX Cartesian Products. Now a set of tuples is used in the axis 0 (columns) specification. Each cell is produced as an intersection of its attribute members.

MDX Slicer Axis (WHERE) WHERE clause used to specify set, tuple or member that restrict the members returned for rows and columns
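Since the slide screenshots are not available here, the following sketch assembles a query of the simple SELECT ... ON COLUMNS, ... ON ROWS FROM ... WHERE form as a string. The helper mdx_select, the cube name Grades and the [Student] and [Time] attribute names are hypothetical; only [Measures].[Note Avg] and [Measures].[Note Count] appear in the slides.

```python
def mdx_select(columns, rows, cube, slicer=None):
    """Build a two-axis MDX SELECT; the WHERE (slicer) clause is optional."""
    query = f"SELECT {columns} ON COLUMNS, {rows} ON ROWS FROM [{cube}]"
    if slicer:
        query += f" WHERE {slicer}"
    return query


# Axis 0: a set { } of two single-member tuples (the measures).
columns = "{[Measures].[Note Avg], [Measures].[Note Count]}"
# Axis 1: all members of a hypothetical attribute.
rows = "[Student].[Sex].Members"
# Slicer: a tuple ( ) restricting the cells to one hypothetical year.
slicer = "([Time].[Year].[2001])"

print(mdx_select(columns, rows, "Grades", slicer))
```

Each result cell is the intersection of one column member, one row member and the slicer tuple; attributes not mentioned anywhere fall back to their All (default) member.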


Data Warehouse Project Methodology(-ies) SAS Rapid Data Warehouse Methodology IBM DW / BI Project Methodology Purpose: Ensure disciplined, iterative, approach in the management and implementation of data warehousing projects Enable successful business and technical implementation of the data warehouse

DW Project Methodology - Phases
Assessment: determine whether there exists a realistic need and opportunity to develop a successful DW. Project definition stage (team, sponsor, criteria for success, expectations). Initial assessment of IT infrastructure (is the project feasible?). Outcome: formal document.
Requirements: requirements gathering (in-depth interviews with business people). Reconciliation stage (analyze the gap between expectations and IT capabilities). Outcome: Requirements Definition Document (logical and physical data model; data extraction paths from source OLTP systems; transformations required; DW update schedule).
Design / implementation / deployment: implement the logical data model. Build ETL processes (validate, clean, integrate). Load data to the DW. Design and implement data analysis interfaces. Train users.
Review

DW Specific Requirements - Remarks. Analytical needs in the company: types of reports, time schedules (daily / weekly etc.); hierarchies of data / hierarchies of reports. Identification of data sources. Updates of data in the DW: data integration rules; handling missing / wrong data; time schedule for DW updates. Data latency / performance: should recent changes in OLTP be seen immediately in OLAP? What latency is acceptable? OLAP query performance.

Data Integration Analyze source OLTP systems Determine DBMS systems / data formats Select most appropriate sources / columns (cleanest) Analyze required integration Ensure the same coding conventions ( m-w, male-female, 0-1 ) Identify synonyms, homonyms, analogies Ensure data quality (integrity, accuracy, completeness) data value integrity data structure integrity Define exception handling rules / missing data handling / default values Finally, define data integration rule/algorithm for each variable
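A per-variable integration rule can be as small as a lookup table plus a default for missing or unknown values. The sketch below does this for the SEX example; the target codes "M", "F" and "U", and the reading of 0 as male and 1 as female, are assumptions that would have to be confirmed against each source system.

```python
# Source codings seen across systems, mapped to one warehouse convention.
# Assumption: in the 0-1 coding, 0 means male and 1 means female.
SEX_CODES = {
    "m": "M", "male": "M", "0": "M",
    "w": "F", "f": "F", "female": "F", "1": "F",
}


def normalize_sex(raw, default="U"):
    """Return the warehouse code for a source value; `default` marks unknowns."""
    if raw is None:
        return default  # missing-data rule
    return SEX_CODES.get(str(raw).strip().lower(), default)  # exception rule


rows = [" Male ", "w", 1, "0", None, "x"]
print([normalize_sex(value) for value in rows])
```

Keeping the mapping as data rather than code makes it easy to review with the business side and to extend when a new source with yet another convention is added.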

Example: Synonyms, Homonyms, Analogies. Define how to resolve name conflicts between data sources / columns. Homonyms: same name but different meaning, e.g., Type in one source refers to the model of a car (AURIS, CLIO, etc.), and in another source to the category (pickup, truck, passenger, etc.). Synonyms: different names but the same meaning, e.g., PersonID in one source, EmployeeCode in another. Analogies: attributes describe the same object, but differently, e.g., PaymentMethod in one source refers to cash, check, credit card, and in another to VISA, MasterCard, USD, etc.

Example: Data Integrity
Specify legal relationships between data values (+ required; -- not allowed; o optional):
  Employee attribute     Temporary   Permanent
  Name                   +           +
  Date of birth          +           +
  Contract final date    +           --
  Anniversary date       o           +
Number of values in a relationship: a student can have 0, 1 or n diplomas (undergraduate: 0; graduate: 1 or n).
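The relationship matrix above translates directly into a validation step in the ETL process. This is a minimal sketch: the field names, the employee_type key and the violations helper are hypothetical, and only the +, -- and o rules themselves come from the slide.

```python
REQUIRED, FORBIDDEN, OPTIONAL = "+", "--", "o"

# Rule matrix from the slide, keyed by employee type and (hypothetical) field name.
RULES = {
    "temporary": {
        "name": REQUIRED, "date_of_birth": REQUIRED,
        "contract_final_date": REQUIRED, "anniversary_date": OPTIONAL,
    },
    "permanent": {
        "name": REQUIRED, "date_of_birth": REQUIRED,
        "contract_final_date": FORBIDDEN, "anniversary_date": REQUIRED,
    },
}


def violations(record):
    """List the integrity rules that a source record breaks."""
    problems = []
    for field, rule in RULES[record["employee_type"]].items():
        present = record.get(field) is not None
        if rule == REQUIRED and not present:
            problems.append(f"{field} is required")
        elif rule == FORBIDDEN and present:
            problems.append(f"{field} is not allowed")
    return problems


ok = {"employee_type": "temporary", "name": "A. Nowak",
      "date_of_birth": "1980-05-01", "contract_final_date": "2009-12-31"}
bad = {"employee_type": "permanent", "name": "B. Kowalski",
       "date_of_birth": "1975-02-11", "contract_final_date": "2010-06-30"}
print(violations(ok))
print(violations(bad))
```

Records with a non-empty violation list would be routed to whatever exception handling rule was defined for the variable, rather than loaded into the warehouse as they are.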

Summary. Build a dedicated database for OLAP (data mart / warehouse): data integration; data quality assurance. Database organization: multidimensional model of data; physical data organization (denormalization, aggregation). Benefits from the user's perspective: integrated overall picture of the enterprise; easy access to historical data; trustworthy information returned (single version of the truth); DSS queries with no impact on transactional systems. DW methodology to ensure successful implementation.