Data Warehousing (DW) Online Analytical Processing (OLAP) Data Mining


Business Intelligence Workshop, Helia, May, 2008
DBTechNet
Data Warehousing (DW), Online Analytical Processing (OLAP), Data Mining

Topics
  1. Introduction to BI and CPM
  2. ETL Process
  3. DW Modeling
  4. OLAP
  5. Data Mining

1 Introduction
(c) 2008

Critical Questions About an Enterprise
  Are we on the right track? Yes, we are!
  How about our competitors? Ahead of us!
  Economic trends? Turbulence!
  (Prof. Dipl.-Kfm. A. Roth)

Where do we get the Knowledge from?
  About the enterprise: from the company's operational information systems
  About the market and competitors: from the census bureau, from public statistical data
  About economic trends: from financial and economic publications
  "How you gather, manage, and use information will determine whether you win or lose" (Bill Gates, Business @ The Speed of Thought, 1999)
  So, where is the problem?

Definition and Problems to Solve in Business Intelligence
  Definition: Business Intelligence (BI) refers to processes and technologies that use fact-based systems to analyze a business.
  BI needs to deal with:
  1. Information overload
  2. Missing knowledge
  3. We do not know which are the right questions
  4. We do not know the influencing factors and their impact
  5. Key measures or indicators to steer an enterprise are missing

Aspects: IT View, Business View, Market View
  [Figure: Information Pyramid. Layers as listed on the slide: Data Mining, OLAP, Data Warehouse, OLTP. Labels: EIS, DSS, Operational Systems; Market View, Business View, IT View. Axes: growing knowledge, amount of information.]
  "We're drowning in information and starving for knowledge." (Rutherford D. Rogers, Yale, 1985)

Motivation
  What is the goal of my organization?
  How do we affect the market?
  How do we perform?
  (Prof. Dipl.-Kfm. A. Roth)

Motivation
  Business Intelligence as a critical success factor
  Purpose: support business decision making
  (Prof. Dipl.-Kfm. A. Roth)

Corporate Performance Management (CPM)
  How can we steer an enterprise?
  [Figure: CPM cycle with the steps start, set goals, plan, execute, monitor, analyze, re-plan. Idea from MIK AG: http://www.mik.info]
  BI tools provide the means to steer an enterprise by
    measuring the effect of decisions,
    analyzing the performance, and
    comparing it with the goals.
  Definition: CPM is the framework for steering an enterprise by means of Business Intelligence.

How Can We Measure Corporate Performance?
  Through Key Performance Indicators (KPIs)
  Definition: A KPI is a metric to define and measure the state of, and progress towards, an organization's goal.
  Usually high-level relative values
  [Figure: CPM cycle: set goals, plan, execute, monitor, analyze, re-plan]
  Examples
    Customer KPIs: customer satisfaction, customer attrition (loss)
    Manufacturing KPIs: Overall Equipment Effectiveness, OEE = Availability * Performance * Quality
    Financial KPIs: Profit Margin, PM = Net Income / Sales

Return on Investment (ROI)
  ROI = Turnover * Earnings / Sales
  Financial KPIs have natural metrics.
  Source: Fred Nickols, 2000, originally by Johnson and Kaplan
  But how about soft factor metrics?
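The KPI formulas above can be tried out with a few lines of Python. This is only a sketch: the function names and all input figures are hypothetical, and Turnover is assumed to mean capital turnover (Sales / Investment), which the slide does not spell out at this point.

```python
def oee(availability, performance, quality):
    """Overall Equipment Effectiveness: product of three ratios in [0, 1]."""
    return availability * performance * quality


def profit_margin(net_income, sales):
    """PM = Net Income / Sales."""
    return net_income / sales


def roi(earnings, sales, investment):
    """ROI = Turnover * Earnings / Sales.

    Assumption: Turnover is capital turnover, i.e. Sales / Investment,
    so the result equals Earnings / Investment.
    """
    turnover = sales / investment
    return turnover * earnings / sales


# Hypothetical figures, chosen only for illustration.
print(oee(availability=0.90, performance=0.95, quality=0.99))
print(profit_margin(net_income=50_000, sales=400_000))
print(roi(earnings=50_000, sales=400_000, investment=250_000))
```

Under the stated assumption the sales figure cancels out, so ROI reduces to earnings divided by investment; the two-factor form is useful because it shows whether a change in ROI comes from the margin or from the turnover.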

Soft Factor Metric
  Example: customer satisfaction
    General satisfaction
    Specific satisfaction: quality/price of product, speed of delivery, ...
  How do we compare these?
    Search for a mapping of categorical values to ordinal values
      Totally satisfied (ts): 9
      Partially satisfied (ps): 3
  Meaning of the metric: ts = 3 * ps? No! But ts is-better-than ps.
  Are two metrics comparable? No! But we do weighted comparisons.

Motivation: Why can't we use our OLTP system?
  Missing information
    Need for integration of economic and census data
    Need for soft factors to assess an enterprise
  Missing KPIs and steering parameters
    Need for highly significant KPIs and parameters
  Influencing factors and different perspectives not available
    Need for multidimensional analysis and presentation
  Source: One Hundred & Eighty Degrees Systems Limited, 2004

Motivation: Why can't we use our OLTP system?
  Queries return only explicit information
    Select customer, sum(sales) from Orders where Region. Group by
    We don't know what to ask!
    Need for interactive, explorative analysis
  Inappropriate presentation of information
    Tabular presentation: one-dimensional analysis
    Sales ok? Trend ok? Reason?
    We can't see the problem!
    Need for multidimensional analysis and presentation

Management Cockpit: the CPM paradise
  [Figure: management cockpit. Source: Juergen Daum, New Economy Analyst Report, 2004]
  Source: SAP Whitepaper, SAP SEM / CPM, http://help.sap.com/

The Business Intelligence Process
  [Figure: pipeline with the stages Data Sources (xls, DBS, stats, WWW), ETL, Data Warehouse, Cubes and Data Marts (dimensions Product, Time, Region), Analysis (OLAP, Data Mining)]

Build up
  [Figure: the same pipeline, labelled Design, with ETL spelled out as Extraction, Transformation, Loading]

Data Sources
  General sources: time, geography
  OLTP: master data, transaction data
  Planning: turnover, profit, etc.
  Economic data: business sector data, economic forecast
  Technical data sources supported by SQL Server Integration Services (SSIS)

Extract and Transform
  Select: which data are needed?
  Cleanse: where are the user data?
  Convert: do all facts have the same unit, coding and granularity?
  Harmonize: are there synonyms and homonyms?
  Adjust: grouping, classification?
  Correct: are the data correct?
  Amend: are the data complete?

Extract and Transform: Example
  Select: e.g. http://.../consumptionpercapita/coffee.html
  Cleanse: e.g. strip off HTML tags
    <table border="1" width="21%"> <tr> <td width="58%">country</td> <td width="45%">1987</td> </tr> <tr> <td width="58%">finland</td> <td width="45%">12,04</td> </tr> </table>
  Convert: e.g. convert consumption into kg
  Harmonize: e.g. import with consumption?
  Adjust: e.g. region grouping
  Correct: e.g. incorrect value for D 1989
  Amend: e.g. for NL 1988

  Country   1987      1988      1989
  Finland   12,04     ?         11,68
  Sweden    11,64     11,71     11,08
  Norway    20,13 lb  20,81 lb  18,19 lb
  Denmark   11        10,65     10,2
  Benelux   19,65     20,48     19,89
  Austria   7,75      8,17      8,01
  Germany   7,38      8,17      0,827

Hands on Lab: Integration Services (SSIS)
  1. Open SS Business Intelligence Studio
  2. Create new project
  3. Select
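The Cleanse and Convert steps of this example can be sketched in plain Python, outside SSIS. The class and function names below are hypothetical; the HTML fragment is the one shown on the slide, and the assumption is that values use a decimal comma and carry an optional "lb" suffix as in the table.

```python
from html.parser import HTMLParser

LB_TO_KG = 0.45359237  # definition of the avoirdupois pound in kilograms

SAMPLE_HTML = (
    '<table border="1" width="21%"> <tr> <td width="58%">country</td> '
    '<td width="45%">1987</td> </tr> <tr> <td width="58%">finland</td> '
    '<td width="45%">12,04</td> </tr> </table>'
)


class CellExtractor(HTMLParser):
    """Cleanse step: drop the markup and keep the text of each <td>, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())


def extract_rows(html):
    parser = CellExtractor()
    parser.feed(html)
    return parser.rows


def to_kg(value):
    """Convert step: parse a decimal-comma number, converting 'lb' values to kg."""
    text = value.strip()
    is_lb = text.endswith("lb")
    if is_lb:
        text = text[:-2].strip()
    number = float(text.replace(",", "."))
    return number * LB_TO_KG if is_lb else number


rows = extract_rows(SAMPLE_HTML)
print(rows)
print(to_kg("12,04"), round(to_kg("20,13 lb"), 2))
```

Missing or implausible values (the "?" cell, the Germany 1989 figure) are deliberately not handled here; they belong to the Correct and Amend steps and need a business rule rather than a parser.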

Hands on Lab: Integration Services (SSIS)
  1. Define connection managers for data sources and destinations
  2. Design a data flow from source to destination
  3. Build a control flow

Hands on Lab: Integration Services (SSIS)
  Graphically design control and data flow
  Example 1: loop control, data and error flow
  [Figure labels: control loop, text file data source, error flow, data flow]

Hands on Lab: Integration Services (SSIS)
  Example 2: ETL control flow design and a data flow that takes date entries from sales and purchase orders to build the date dimension
  [Figure labels: start of control flow, Excel data source, data transformation, destination DW, end of control flow]

Data Warehouse Modeling
  [Figure: the BI process pipeline: Data Sources (xls, DBS, stats, WWW), ETL, Data Warehouse, Cubes and Data Marts (Product, Time, Region), OLAP, Data Mining]

Data Warehouse
  Definition: "A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process." William H. "Bill" Inmon (1996)

Data Warehouse: Properties
  Subject-oriented: data is selected and organized to support business analysis
    Optimized for query and analysis
    Objects (facts) and their determining factors (dimensions) are linked together
    Not intended to support OLTP
  Time-variant: accumulates historical data over time
  Non-volatile (archival)
    Data is read-only; it is never updated, only added
    May have redundancies
    Contains pre-calculated aggregations
  Integrated: contains data from different sources (OLTP systems, economic databases, etc.)

5.3 Dimensional Fact Model
  Properties
    Multidimensional model
    Distinction between fact (measures) and dimension
    Structural dimensions
    Attributes of dimensions
    Computed values
  [Figure: dimensional fact model of a Sales fact. Labels: measures (amount, on-stock, value), computed value (average, semi-additive), dimension levels Year, Month, Week and Prod. group, Product, Type, dimension attribute weight]

2.1a Taxonomy of Facts
  [Figure: taxonomy tree of facts. Labels: numerical, categorical, additive, semi-additive, ordinal, nominal, temporal]

DW Schemas
  Star: one fact table, multiple dimension tables
  Galaxy: multiple fact tables, multiple dimension tables
  Snowflake: dimension tables normalized, fact tables aggregated
  All three schemas are relational models in disguise.

Example Star Schema
  [Screenshot: SSAS Source View with dimension tables around one fact table]
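Since all three schemas are relational models, a star schema can be sketched with any relational engine. The example below uses Python's built-in sqlite3 module rather than SQL Server; all table names, column names and rows are hypothetical and serve only to show one fact table joined to two dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product and per time period.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, prod_group TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: foreign keys to every dimension plus the measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'C220', 'car'), (2, 'E240', 'car'), (3, 'T1', 'truck');
INSERT INTO dim_time    VALUES (1, 2007, 12), (2, 2008, 1);
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (2, 1, 150.0), (3, 2, 300.0), (1, 2, 50.0);
""")


def sales_by_group_and_year(connection):
    """Typical star join: aggregate the measure along two dimension attributes."""
    query = """
        SELECT p.prod_group, t.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_time t    ON t.time_id = f.time_id
        GROUP BY p.prod_group, t.year
        ORDER BY p.prod_group, t.year
    """
    return connection.execute(query).fetchall()


for row in sales_by_group_and_year(conn):
    print(row)
```

A snowflake variant would move prod_group into its own table referenced from dim_product; a galaxy variant would add a second fact table that shares dim_time.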

Example Galaxy Schema
  [Screenshot: SSAS Source View with several fact tables sharing a joint dimension table]

Example Snowflake Schema
  [Screenshot: SSAS Source View with a fact table, an aggregated fact table and a normalized product dimension]

Design Rules for DW Schemas
  Use Star if
    dimensions have few or dynamic attributes
    measures are orthogonal
  Use Snowflake if
    dimensions are structured (aggregation)
    measures are orthogonal
  Use Galaxy if
    dimensions are reused
    measures are not orthogonal

Hands on Lab: SQL Server Management Studio
  1. Start the SQL Server Management Studio
  2. Create a new database
  3. Add a new database diagram

Hands on Lab: SQL Server Management Studio
  4. Create tables (enter table definition)
  5. Define foreign keys (manage keys, relationships)
  Drag and drop columns to define foreign keys

Modeling Cubes, OLAP
  [Figure: the BI process pipeline: Data Sources (xls, DBS, stats, WWW), ETL, Data Warehouse, Cubes and Data Marts (Product, Time, Region), OLAP, Data Mining]

5.2 Cube Model
  Multidimensional view of the Data Warehouse
  Dimensions correspond to coordinates
  Structured dimensions
  Facts are a function of multiple dimensions
    Fact: sales = f(product, country, time)
  [Figure: cube with axes product, country, time; product hierarchy with the labels vehicle, truck, car, E240, C220]

5.4 Object-Oriented Model
  Object-oriented view of the Data Warehouse
  Intelligent dimensions and facts: meta-information for dimensions and facts
  Example:
    The Product dimension has hierarchical aggregation
    Costs can be compared with earnings, but not with the number of orders
  The object-oriented structure allows semantically correct navigation and aggregation
  [Figure labels: Hierarchy, Product (level, child), Timespan (start, end), Month, #Orders, price, days]
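The idea that a fact is a function of its dimensions, with a structured product dimension, can be sketched with a dictionary keyed by coordinates. The hierarchy levels vehicle, car, truck, E240 and C220 come from the slide; the truck model "Truck-A", the parent assignments and every sales figure are hypothetical.

```python
# Cells of a tiny cube: sales = f(product, country, time).
sales = {
    ("E240", "Finland", 2007): 10,
    ("C220", "Finland", 2007): 20,
    ("C220", "Germany", 2008): 30,
    ("Truck-A", "Germany", 2008): 5,
}

# Structured product dimension: child -> parent (assumed layout of the hierarchy).
parent = {
    "E240": "car",
    "C220": "car",
    "Truck-A": "truck",
    "car": "vehicle",
    "truck": "vehicle",
}


def ancestors(product):
    """Yield the product itself and every level above it in the hierarchy."""
    node = product
    while node is not None:
        yield node
        node = parent.get(node)


def f(product, country, time):
    """Sales at the given coordinates; `product` may be any hierarchy level."""
    return sum(
        value
        for (p, c, t), value in sales.items()
        if product in ancestors(p) and c == country and t == time
    )


print(f("E240", "Finland", 2007))     # leaf level
print(f("car", "Finland", 2007))      # aggregated over all cars
print(f("vehicle", "Germany", 2008))  # aggregated over the whole dimension
```

Storing only leaf cells and aggregating on demand keeps the sketch short; a real cube would typically pre-compute some of these aggregations.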

MS visualization of a hypercube
  Relational view on the OLAP cube structure

MS visualization of a hypercube
  Pivot table view on the OLAP data
  Drag and drop measures and dimensions on the pivot table

OLAP Storage Models
  MOLAP: multidimensional (md) storage
    Single cube: one large md array with sparse data
    Multi-cube: galaxy-structured md arrays
    Storing the md array on a linear address space
    Optimized OLAP for small cubes
  ROLAP: relational storage
    Storing facts and dimensions in tables
    Storing aggregations in tables
    Best choice for very large cubes
  HOLAP: hybrid storage
    Storing facts and dimensions in tables
    Storing aggregations as md arrays
    Best performance for large cubes

Hands on Lab: SSAS Cube Design
  Start SQL Server Business Intelligence Studio
  Create a new SSAS project
  Add Data Source, View, and create a new cube
  Identify fact and dimension tables
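Storing a multidimensional array on a linear address space means mapping each coordinate tuple to a single offset. The sketch below assumes row-major order, which the slide does not specify, and uses hypothetical function names; a dictionary keyed by offset stands in for sparse storage.

```python
def linear_index(coords, shape):
    """Row-major offset of a cell in a dense multidimensional array."""
    offset = 0
    for coordinate, size in zip(coords, shape):
        if not 0 <= coordinate < size:
            raise IndexError("coordinate out of range")
        offset = offset * size + coordinate
    return offset


def coords_of(offset, shape):
    """Inverse mapping: recover the coordinates from a linear offset."""
    coords = []
    for size in reversed(shape):
        offset, coordinate = divmod(offset, size)
        coords.append(coordinate)
    return tuple(reversed(coords))


# A cube with 2 products, 3 countries and 4 time periods has 24 addresses.
SHAPE = (2, 3, 4)

# Sparse storage: only non-empty cells occupy memory.
cells = {linear_index((1, 2, 3), SHAPE): 99.0}

print(linear_index((1, 2, 3), SHAPE))
print(coords_of(23, SHAPE))
```

The address space grows with the product of the dimension sizes, which is one reason MOLAP is listed as suited to small cubes.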

Hands on Lab: SSAS Cube Design
  Select measures
  Define dimensions and aggregation hierarchies
  Save cube definition

Hands on Lab: SSAS Cube Design
  Select storage model and its parameters
  Process and deploy cube

Hands on Lab: Performing OLAP
  Drill down
  Roll up
  Slice and dice
  Drill through

Data Mining
  [Figure: the BI process pipeline: Data Sources (xls, DBS, stats, WWW), ETL, Data Warehouse, Cubes and Data Marts (Product, Time, Region), OLAP, Data Mining]
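The OLAP operations named in the lab can be sketched on a small in-memory fact list. This is an illustration under assumptions, not how SSAS implements them: the dimension names, the rows and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, year, month, sales)
ROWS = [
    ("car", "North", 2007, 11, 10),
    ("car", "North", 2007, 12, 20),
    ("car", "South", 2008, 1, 30),
    ("truck", "North", 2008, 1, 40),
]
DIMS = ("product", "region", "year", "month")


def aggregate(rows, keep):
    """Sum sales grouped by the dimensions named in `keep`.

    Fewer dimensions in `keep` is a roll up, more is a drill down.
    """
    positions = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)


def dice(rows, **allowed):
    """Keep rows whose value is in the allowed set for each named dimension."""
    return [
        row for row in rows
        if all(row[DIMS.index(d)] in values for d, values in allowed.items())
    ]


def slice_(rows, dim, value):
    """A slice fixes exactly one dimension to one value."""
    return dice(rows, **{dim: {value}})


rolled_up = aggregate(ROWS, ("product", "year"))
drilled_down = aggregate(ROWS, ("product", "year", "month"))
north = slice_(ROWS, "region", "North")
cars_2008 = dice(ROWS, product={"car"}, year={2008})  # detail rows, as in a drill through
print(rolled_up, drilled_down, north, cars_2008, sep="\n")
```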

Decision Tree Classification
  Goal: mapping/prediction of objects to predefined classes based on their attribute values
  Process:
  1. Build a decision tree DT (classification model) with the help of sample objects (training data)
  2. Validate the DT (e.g. precision) with test data
  3. Classify unknown objects
  [Figure: example decision tree]
    car type
      = truck: Risk = low
      ≠ truck: age
        > 60: Risk = low
        ≤ 60: Risk = high

Regression Tree
  Goal: prediction of a numeric value for objects based on a DT with linear regression functions at the leaf level
  Process:
  1. Build a DT with the help of training data
  2. Replace some branches by a linear regression formula
  3. Generate prediction values, tune regression parameters
  4. Testing (like DT)
  5. Prediction (like DT)
  [Figure: example regression tree]
    car type
      = truck: Price = 20k + 2k * weight
      ≠ truck: insurance class
        < III: Price = 10k + 3k * class
        [IV..VI]: Price = 20k + 4k * class + 10 * HP
        > VI: Price = 30k + 6k * class
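Steps 2 and 3 of the classification process can be illustrated by writing the example tree as a function and checking it against labelled test objects. The comparison operators on the non-truck branch are assumed (the extracted slide lost them), the test objects are hypothetical, and the share of correct classifications is used here as a simple stand-in for the validation measure.

```python
def classify_risk(car_type, age):
    """Example tree: trucks are low risk; otherwise age decides.

    Assumption: the branches read 'car type = truck / != truck'
    and 'age > 60 / <= 60'.
    """
    if car_type == "truck":
        return "low"
    return "low" if age > 60 else "high"


def accuracy(test_data):
    """Share of test objects whose predicted class matches the known label."""
    hits = sum(classify_risk(car, age) == label for car, age, label in test_data)
    return hits / len(test_data)


# Hypothetical labelled test objects: (car type, age, known risk class).
TEST_DATA = [
    ("truck", 30, "low"),
    ("car", 70, "low"),
    ("car", 25, "high"),
    ("car", 45, "low"),   # the tree gets this one wrong
]

print(classify_risk("car", 61))
print(accuracy(TEST_DATA))
```

Building the tree from training data (step 1) is what the SSAS decision tree algorithm automates; the hand-written function only shows what the resulting model does when it is applied.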

SSAS Decision Tree Viewer

SSAS Dependency Network

SSAS Decision Tree Prediction

Clustering Basics
  Clustering (grouping) := arrangement of objects into groups such that
    objects in the same cluster are most similar
    objects from different clusters are most dissimilar
  Types of clustering
    Partitioning clusters (an object $o_1$ belongs to only one cluster)
    Hierarchical clusters (nested clusters)
  Distance function d:
    $d(o_1, o_2) \geq 0$
    $d(o_1, o_2) = 0 \Leftrightarrow o_1 = o_2$
    $d(o_1, o_2) = d(o_2, o_1)$
  The similarity of $o_1$ and $o_2$ is defined via the distance function
    The smaller the distance, the more alike the objects are
  Goal function
    Maximize the compactness of the clusters
    Compactness of a cluster $C := \frac{|C|}{\sum_{o_i \in C} d(o_i, c)}$, where c = center of C
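The distance function and the compactness measure can be checked numerically. The sketch assumes Euclidean distance and takes the arithmetic mean of the members as the cluster center; the slide fixes neither choice, and the sample points are hypothetical.

```python
import math


def distance(o1, o2):
    """Euclidean distance (an assumption; any metric satisfying the axioms works)."""
    return math.dist(o1, o2)


def centroid(cluster):
    """Center of a cluster, taken here as the coordinate-wise mean."""
    n = len(cluster)
    return tuple(sum(axis) / n for axis in zip(*cluster))


def compactness(cluster):
    """|C| divided by the summed distances of the members to the center.

    Undefined (division by zero) if all members coincide with the center.
    """
    c = centroid(cluster)
    return len(cluster) / sum(distance(o, c) for o in cluster)


loose = [(0.0, 0.0), (2.0, 0.0)]
tight = [(0.0, 0.0), (1.0, 0.0)]
print(compactness(loose), compactness(tight))
```

Halving the spread of the cluster doubles its compactness, which matches the reading that larger values mean more similar members.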

K-Means Based Clustering
  Algorithm:
  1. Choose k cluster centers (centroids)
  2. Assign each object to its nearest centroid
  3. Recalculate the cluster centers (centroids)
  Repeat steps 2-3 until the centroids stabilize
  [Figure: example with k = 2: objects 1 to 7, initial centroids a and b, recalculated centroids a* and b*]
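The algorithm above translates almost line by line into Python. This is a minimal sketch, not the SSAS implementation: it assumes Euclidean distance, takes the initial centroids as an argument, and uses seven hypothetical points with k = 2 to mirror the size of the slide's example.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Plain k-means; step 1 (choosing the initial centroids) is left to the caller."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate the centroids; an empty cluster keeps its old center.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        # Stop once the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: seven objects, k = 2.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (9, 9)]
centroids, clusters = k_means(POINTS, [(1, 1), (9, 9)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids; with poorly chosen starting points the loop can settle on a different, less compact partition.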

SSAS Clustering
  Implements K-Means and EM clustering
  Both are partitioning algorithms
    K-Means is distance based
    EM is probability based
  Scalable means: one single data scan only

SSAS Cluster Viewer

MS Cluster Profile Viewer

SSAS Cluster Characteristics

SSAS Lift Chart
  Lift = % of correct predictions / % of population

Association Rules
  Example (basket analysis)
  Available items I = {Bread, Coffee, Milk, Cake, Butter, Tea}

  Transaction set T
  t   bought items
  1   Bread, Coffee, Milk, Cake
  2   Coffee, Milk, Cake
  3   Bread, Butter, Coffee, Milk
  4   Milk, Cake
  5   Bread, Cake
  6   Bread

  Support of X = {Coffee, Milk}: Support(X) = 3/6 = 50%
  Support of R = X ∪ {Cake}, i.e. support of the rule Milk, Coffee → Cake: Support(R) = 2/6 = 33%
  Confidence of the rule: Confidence(Milk, Coffee → Cake) = Support(R) / Support(X) = 2/3 = 67%
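The support and confidence figures of the basket example can be reproduced with Python sets. The transaction data is taken from the slide; the function names are hypothetical.

```python
# Transaction set T from the basket analysis example.
TRANSACTIONS = [
    {"Bread", "Coffee", "Milk", "Cake"},
    {"Coffee", "Milk", "Cake"},
    {"Bread", "Butter", "Coffee", "Milk"},
    {"Milk", "Cake"},
    {"Bread", "Cake"},
    {"Bread"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Support of the whole rule divided by the support of its left-hand side."""
    rule_items = set(antecedent) | set(consequent)
    return support(rule_items, transactions) / support(antecedent, transactions)


X = {"Coffee", "Milk"}
print(support(X))
print(support(X | {"Cake"}))
print(confidence(X, {"Cake"}))
```

Confidence is undefined when the antecedent never occurs; a production version would guard against a support of zero.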

SSAS Item Sets Viewer

[Screenshot labels: Probability = Confidence; Importance]

Key Performance Indicators (KPI)
  Idea: measure the performance of an enterprise with simple numbers such as return on investment (ROI), profit, capital turnover
    ROI := Earnings / Investments
    Profit := Revenue - Costs
    Capital turnover := Sales / Investments

SSAS Key Performance Indicators (KPI)
  KPI = f(measures, goal)
  Measures are compared with a goal function
  A KPI is normally analyzed over time
  Define new KPI
  Drag measure to value or goal expression

Time Series
  Definition: A time series (TS) is a sequence of numbers ordered in time at equidistant intervals
    The ordering is relevant (i.e. successive numbers are not independent)
  Additive TS model
    $y(t) := \mathrm{Trend}(t) + \mathrm{Season}(t) + R(t)$, $t \in \{1, 2, 3, \ldots\}$
    Trend is monotonic (linear or non-linear)
    Season is periodic (sine or other)
    R(t) is a random value

SSAS Autoregressive Tree Models for Time-Series Analysis
  Definition: Let $y = (y_1, y_2, \ldots, y_t)$ be a time series TS. The model for TS is called autoregressive if, for all $p < \tau \leq t$, the probability distribution of $y_\tau$ depends as a linear regression on the previous p values $y_{\tau-p}, \ldots, y_{\tau-1}$.
  Definition: An autoregressive tree model is a piecewise linear autoregressive model, where the boundaries are defined by a decision tree.
  [Figure: decision tree with the tests $y_{\tau-1} < a$ and $y_{\tau-1} > b$ (branches true/false) and the leaf distributions $P(y_t) = N(m_1, \sigma_1^2)$, $P(y_t) = N(m_2, \sigma_2^2)$, $P(y_t) = N(m_3, \sigma_3^2)$; plot over t with the boundaries a and b]
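Both definitions can be made concrete with a small sketch. It assumes a linear trend and a sine season for the additive model and a fixed coefficient vector for the autoregressive prediction; the function names, parameters and numbers are hypothetical, and no tree-based splitting is attempted.

```python
import math


def additive_series(n, slope=0.5, amplitude=2.0, period=12, residual=None):
    """y(t) = Trend(t) + Season(t) + R(t) for t = 1..n.

    Assumptions: linear trend, sine season; R(t) defaults to zero so the
    output is deterministic. Pass e.g. `lambda t: random.gauss(0, 1)` for noise.
    """
    residual = residual or (lambda t: 0.0)
    return [
        slope * t + amplitude * math.sin(2 * math.pi * t / period) + residual(t)
        for t in range(1, n + 1)
    ]


def predict_ar(history, coefficients, intercept=0.0):
    """Linear regression on the previous p values.

    coefficients[0] multiplies the most recent value, coefficients[1] the
    one before it, and so on.
    """
    p = len(coefficients)
    recent = history[-p:]
    return intercept + sum(a * y for a, y in zip(coefficients, reversed(recent)))


series = additive_series(24)
# One full season later only the trend has changed: 12 * slope.
print(series[12] - series[0])
print(predict_ar([1.0, 2.0, 3.0], coefficients=[0.5, 0.25], intercept=1.0))
```

An autoregressive tree would hold several such coefficient sets and let a decision tree on earlier values choose which one applies.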

MS Time Series
  Uses a regression tree