Business Intelligence Workshop, Helia, May 2008
DBTechNet

Data Warehousing (DW), Online Analytical Processing (OLAP), Data Mining

Topics
1. Introduction to BI and CPM
2. ETL Process
3. DW Modeling
4. OLAP
5. Data Mining

1 Introduction

(c) 2008

Critical Questions About an Enterprise
- Are we on the right track? Yes, we are!
- How about our competitors? Ahead of us!
- Economic trends? Turbulent!
(Prof. Dipl.-Kfm. A. Roth)

Where Do We Get the Knowledge From?
- About the enterprise
  From the company's operational information systems
- About the market and competitors
  From the census bureau
  From public statistical data
- About economic trends
  From financial and economic publications
"How you gather, manage, and use information will determine whether you win or lose" (Bill Gates, Business @ The Speed of Thought, 1999)
So, where is the problem?

Definition and Problems to Solve in Business Intelligence
Definition: Business Intelligence (BI) refers to processes and technologies using fact-based systems to analyze business
BI needs to deal with:
1. Information overload
2. Missing knowledge
3. We do not know which are the right questions
4. We do not know the influencing factors and their impact
5. Key measures or indicators to steer an enterprise are missing

Aspects
- IT view
- Business view
- Market view
[Figure: Information Pyramid. Labels: Market View, Business View, IT View; Data Mining, OLAP, Data Warehouse, OLTP; EIS, DSS, Operational Systems. Axes: growing knowledge, amount of information]
"We're drowning in information and starving for knowledge." (Rutherford D. Rogers, Yale, 1985)

Motivation
- What is the goal of my organization?
- How do we affect the market?
- How do we perform?
(Prof. Dipl.-Kfm. A. Roth)

Motivation
Business Intelligence as critical success factor
Purpose: support business decision making
(Prof. Dipl.-Kfm. A. Roth)

Corporate Performance Management (CPM)
How can we steer an enterprise?
[Figure: CPM cycle with the steps start, set goals, plan, execute, monitor, analyze, re-plan. Idea from MIK AG: http://www.mik.info]
BI tools provide the means to steer an enterprise by
- Measuring the effect of decisions
- Analyzing the performance
- Comparing with goals
Definition: CPM is the framework for steering an enterprise by means of Business Intelligence

How Can We Measure Corporate Performance?
Through Key Performance Indicators (KPIs)
Definition: A KPI is a metric to define and measure state and progress towards an organization's goals
- Usually high-level relative values
Examples
- Customer KPIs: customer satisfaction, customer attrition (loss)
- Manufacturing KPIs: Overall Equipment Effectiveness, OEE = Availability * Performance * Quality
- Financial KPIs: Profit Margin, PM = Net Income / Sales
[Figure: CPM cycle (set goals, plan, execute, monitor, analyze KPIs, re-plan)]

Return on Investment (ROI)
ROI = Turnover * Earnings / Sales
Financial KPIs have natural metrics
Source: Fred Nickols, 2000, originally by Johnson and Kaplan
But how about soft factor metrics?
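
The KPI formulas on these slides (OEE, profit margin, ROI) are plain arithmetic, so a small sketch may help when checking numbers by hand. The function and parameter names below are illustrative assumptions, and "Turnover" is read as capital turnover (Sales / Investments), following the definition given on the later KPI slide of this deck.

```python
def oee(availability, performance, quality):
    """Overall Equipment Effectiveness; all three inputs are ratios in [0, 1]."""
    return availability * performance * quality


def profit_margin(net_income, sales):
    """PM = Net Income / Sales."""
    return net_income / sales


def capital_turnover(sales, investments):
    """Assumption: 'Turnover' on the slide means Sales / Investments."""
    return sales / investments


def roi(earnings, sales, investments):
    """ROI = Turnover * Earnings / Sales, which reduces to Earnings / Investments."""
    return capital_turnover(sales, investments) * (earnings / sales)
```

With invented figures, `roi(earnings=50, sales=500, investments=250)` multiplies a capital turnover of 2 by an earnings ratio of 0.1. The product form keeps both drivers of ROI visible, which is probably why the slide states it that way.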

Soft Factor Metric
Example: customer satisfaction
- General satisfaction
- Specific satisfaction: quality/price of product, speed of delivery, ...
How do we compare these?
- Search for a mapping of categorical values to ordinal values
  Totally satisfied (ts) = 9
  Partially satisfied (ps) = 3
- Meaning of the metric: ts = 3 * ps? No! But ts is-better-than ps
- Are two metrics comparable? No! But we do weighted comparisons.

Motivation: Why Can't We Use Our OLTP System?
- Missing information
  Need for integration of economic and census data
  Need for soft factors to assess an enterprise
- Missing KPIs and steering parameters
  Need for highly significant KPIs and parameters
- Influencing factors and different perspectives not available
  Need for multidimensional analysis and presentation
Source: One Hundred & Eighty Degrees Systems Limited, 2004

Motivation: Why Can't We Use Our OLTP System?
- Queries return only explicit information
  SELECT customer, SUM(sales) FROM Orders WHERE Region ... GROUP BY ...
  We don't know what to ask!
  Need for interactive, explorative analysis
- Inappropriate presentation of information
  Tabular presentation, one-dimensional analysis
  Sales ok? Trend ok? Reason?
  We can't see the problem!
  Need for multidimensional analysis and presentation
Source: Juergen Daum, New Economy Analyst Report, 2004

Management Cockpit
The CPM paradise
Source: SAP Whitepaper, SAP SEM / CPM, http://help.sap.com/

The Business Intelligence Process
[Figure: Data Sources (xls, DBS, stats, WWW) -> ETL -> Data Warehouse -> Cubes, Data Marts (dimensions Product, Time, Region) -> Analysis (OLAP, Data Mining)]

Extraction, Transformation, Loading
[Figure: the same process chart with build-up and design phases marked]

Data Sources
- General sources: time, geography
- OLTP: master data, transaction data
- Planning: planned turnover, profit, etc.
- Economic data: business sector data, economic forecast
Technical data sources supported by SQL Server Integration Services (SSIS)

Extract and Transform
- Select: which data are needed?
- Cleanse: where are the user data?
- Convert: do all facts have the same unit, coding and granularity?
- Harmonize: do we have synonyms and homonyms?
- Adjust: grouping, classification?
- Correct: are the data correct?
- Amend: are the data complete?

Extract and Transform Example
- Select: e.g. http://.../consumptionpercapita/coffee.html
- Cleanse: e.g. strip off HTML tags
- Convert: e.g. convert consumption into kg
- Harmonize: e.g. import with consumption?
- Adjust: e.g. region grouping
- Correct: e.g. incorrect value for D 1989
- Amend: e.g. for NL 1988

  <table border="1" width="21%">
    <tr> <td width="58%">country</td> <td width="45%">1987</td> </tr>
    <tr> <td width="58%">finland</td> <td width="45%">12,04</td> </tr>
  </table>

Country   1987      1988      1989
Finland   12,04     ?         11,68
Sweden    11,64     11,71     11,08
Norway    20,13 lb  20,81 lb  18,19 lb
Denmark   11        10,65     10,2
Benelux   19,65     20,48     19,89
Austria   7,75      8,17      8,01
Germany   7,38      8,17      0,827

Hands on Lab: Integration Services (SSIS)
1. Open SQL Server Business Intelligence Studio
2. Create new project
3. Select
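
The cleanse and convert steps of the coffee consumption example can be sketched in a few lines. This is only an illustration under stated assumptions: the class and function names are invented, values without a unit are assumed to be in kg already, and the workshop itself performs these steps with SSIS rather than Python.

```python
import re
from html.parser import HTMLParser

LB_TO_KG = 0.45359237  # international avoirdupois pound


class CellCollector(HTMLParser):
    """Cleanse step: drop the HTML tags and keep the cell text, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


def to_kg(raw):
    """Convert step: '12,04' (assumed kg) or '20,13 lb' to a float in kg.

    Returns None for missing or unparsable values, leaving them for the
    correct/amend steps.
    """
    match = re.fullmatch(r"\s*(\d+(?:,\d+)?)\s*(lb)?\s*", raw)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))  # decimal comma to point
    return value * LB_TO_KG if match.group(2) else value
```

Returning `None` instead of guessing a value keeps incomplete cells visible, so they can be handled explicitly in the later correct and amend steps.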

Hands on Lab: Integration Services (SSIS)
1. Define connection managers for data sources and destinations
2. Design a data flow from source to destination
3. Build a control flow

Hands on Lab: Integration Services (SSIS)
Graphically design control and data flow
Example 1: Loop control, data and error flow
[Figure: SSIS designer showing a control loop, a text file data source, the data flow and the error flow]

Hands on Lab: Integration Services (SSIS)
Example 2: ETL control flow design and a data flow taking date entries from sales and purchase orders to build the date dimension
[Figure: SSIS designer showing start of control flow, Excel data source, data transformation, destination DW, end of control flow]

Data Warehouse Modeling
[Figure: BI process chart, Data Warehouse stage]

Data Warehouse
Definition: "A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process." William H. "Bill" Inmon (1996)

Data Warehouse Properties
- Subject-oriented: data is selected and organized to support business analysis
  Optimized for query and analysis
  Objects (facts) and their determining factors (dimensions) are linked together
  Not intended to support OLTP
- Time-variant: accumulates historical data over time
- Non-volatile (archival)
  Data is read-only; it is never updated, only added
  May have redundancies
  Contains pre-calculated aggregations
- Integrated: contains data from different sources (OLTP systems, economic databases, etc.)

5.3 Dimensional Fact Model
Properties
- Multidimensional model
- Distinction between fact (measures) and dimension
- Structural dimensions
- Attributes of dimensions
- Computed values
[Figure: dimensional fact model. Labels: Fact (Sales) with measures amount and onstock value (semi-additive) and computed value average; dimension with Year, Month, Week; dimension Product with Prod. group, Type and dimension attribute weight]

2.1a Taxonomy of Facts
[Figure: taxonomy tree. Fact splits into numerical and categorical; leaf labels: additive, semi-additive, ordinal, nominal, temporal]

DW Schemes
- Star: one fact table, multiple dimension tables
- Galaxy: multiple fact tables, multiple dimension tables
- Snowflake: dimension tables normalized, fact tables aggregated
All 3 schemata are relational models in disguise

Example Star Scheme
SSAS Source View
[Figure: fact table surrounded by dimension tables]
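
Since the slide notes that all three schemata are relational models in disguise, a star scheme can be tried out in any relational engine. The sketch below uses Python's built-in sqlite3 module; the table names, column names and sample rows are invented for illustration and are not the workshop's lab database.

```python
import sqlite3


def build_star(conn):
    """Create one fact table with two dimension tables and load sample rows."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY, name TEXT, prod_group TEXT);
        CREATE TABLE dim_region (
            region_id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            region_id  INTEGER REFERENCES dim_region(region_id),
            amount     REAL);
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                     [(1, "C220", "car"), (2, "E240", "car"), (3, "T1", "truck")])
    conn.executemany("INSERT INTO dim_region VALUES (?, ?)",
                     [(1, "Finland"), (2, "Germany")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 1, 10.0), (2, 2, 20.0), (3, 1, 5.0)])


def sales_by_group(conn):
    """Aggregate the fact table along one attribute of the product dimension."""
    return conn.execute("""
        SELECT p.prod_group, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.prod_group
        ORDER BY p.prod_group
    """).fetchall()
```

A snowflake variant would move `prod_group` into its own table referenced from `dim_product`; a galaxy would add a second fact table that shares these dimension tables.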

Example Galaxy Scheme
SSAS Source View
[Figure: two fact tables sharing a joint dimension table]

Example Snowflake Scheme
SSAS Source View
[Figure: fact table, aggregated fact table and normalized product dimension]

Design Rules for DW Scheme
- Use Star if
  Dimensions have few or dynamic attributes
  Measures are orthogonal
- Use Snowflake if
  Dimensions are structured (aggregation)
  Measures are orthogonal
- Use Galaxy if
  Dimensions are reused
  Measures are not orthogonal

Hands on Lab: SQL Server Management Studio
1. Start the SQL Server Management Studio
2. Create a new database
3. Add a new database diagram

Hands on Lab: SQL Server Management Studio
4. Create tables (enter table definition)
5. Define foreign keys (manage keys and relationships; drag and drop columns to define foreign keys)

Modeling Cubes, OLAP
[Figure: BI process chart, Cubes and Data Marts stage]

5.2 Cube Model
Multidimensional view of the Data Warehouse
- Dimensions correspond with coordinates
- Structured dimensions
- Facts are a function of multiple dimensions
  Fact: sales = f(product, country, time)
[Figure: cube with axes product, country, time; product hierarchy vehicle > truck, car > E240, C220]

5.4 Object-Oriented Model
Object-oriented view of the Data Warehouse
- Intelligent dimensions and facts: meta-information for dimensions and facts
- Example: the Product dimension has hierarchical aggregation; costs can be compared with earnings, but not with the number of orders
- Object-oriented structure allows semantically correct navigation and aggregation
[Figure: class diagram with Hierarchy (level, child), Product, Timespan (start, end), Month, and measures #Orders, price, days]
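
The cube view above, with sales = f(product, country, time), can be mimicked with a dictionary keyed by coordinates. The sketch below is a toy illustration with invented sample data and function names; it is not how SSAS stores or processes cubes.

```python
from collections import defaultdict

DIMENSIONS = ("product", "country", "time")

# Invented sample facts: coordinates (product, country, time) -> sales
SAMPLE_CUBE = {
    ("C220", "DE", 2007): 10.0,
    ("E240", "DE", 2007): 20.0,
    ("C220", "FI", 2008): 5.0,
}


def roll_up(cube, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for coords, sales in cube.items():
        totals[tuple(coords[i] for i in positions)] += sales
    return dict(totals)


def slice_cube(cube, dimension, value):
    """Keep only the cells whose coordinate on `dimension` equals `value`."""
    position = DIMENSIONS.index(dimension)
    return {c: v for c, v in cube.items() if c[position] == value}
```

For example, `roll_up(SAMPLE_CUBE, ["country"])` collapses product and time, while `slice_cube(SAMPLE_CUBE, "time", 2008)` fixes one coordinate and leaves the other dimensions intact.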

MS Visualization of a Hypercube
Relational view on the OLAP cube structure

MS Visualization of a Hypercube
Pivot table view on the OLAP data
Drag and drop measures and dimensions on the pivot table

OLAP Storage Models
- MOLAP: multidimensional (md) storage
  Single cube: one large md array with sparse data
  Multi-cube: galaxy-structured md arrays
  Storing the md array on a linear address space
  Optimized OLAP for small cubes
- ROLAP: relational storage
  Storing facts and dimensions in tables
  Storing aggregations in tables
  Best choice for very large cubes
- HOLAP: hybrid storage
  Storing facts and dimensions in tables
  Storing aggregations as md arrays
  Best performance for large cubes

Hands on Lab: SSAS Cube Design
- Start SQL Server Business Intelligence Studio
- Create a new SSAS project
- Add Data Source, View, and create a new cube
- Identify fact and dimension tables
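
The MOLAP remark about storing a multidimensional array on a linear address space can be illustrated with the usual offset calculation. The slide does not say which layout is used, so row-major order is an assumption here, and the function name is invented.

```python
def linear_offset(coords, shape):
    """Map md coordinates to one linear address (row-major order assumed).

    `shape` holds the number of members per dimension, `coords` the
    zero-based member index on each dimension.
    """
    if len(coords) != len(shape):
        raise ValueError("coords and shape must have the same length")
    offset = 0
    for index, size in zip(coords, shape):
        if not 0 <= index < size:
            raise IndexError("coordinate out of range")
        offset = offset * size + index
    return offset
```

A cube with 2 products, 3 countries and 4 periods occupies 24 consecutive cells in this layout. Every cell is addressable whether or not it holds data, which is why sparsity matters for large MOLAP cubes.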

Hands on Lab: SSAS Cube Design
- Select measures
- Define dimensions and aggregation hierarchies
- Save cube definition

Hands on Lab: SSAS Cube Design
- Select storage model and its parameters
- Process and deploy cube

Hands on Lab: Performing OLAP
- Drill down
- Roll up
- Slice and dice
- Drill through

Data Mining
[Figure: BI process chart, Data Mining stage]

Decision Tree Classification
Goal: mapping/prediction of objects to predefined classes based on their attribute values
Process:
1. Build a decision tree DT (classification model) with the help of sample objects (training data)
2. Validate the DT (e.g. precision) with test data
3. Classify unknown objects
Example tree:
  car type
    = truck: Risk = low
    != truck: age
      > 60: Risk = low
      <= 60: Risk = high

Regression Tree
Goal: prediction of a numeric value for objects based on a DT with linear regression functions on the leaf level
Process:
1. Build a DT with the help of training data
2. Replace some branches by a linear regression formula
3. Generate prediction values, tune regression parameters
4. Testing (like DT)
5. Prediction (like DT)
Example tree:
  car type
    = truck: Price = 20k + 2k * weight
    != truck: insurance class
      < III: Price = 10k + 3k * class
      [IV..VI]: Price = 20k + 4k * class + 10 * HP
      > VI: Price = 30k + 6k * class

SSAS Decision Tree Viewer

SSAS Dependency Network

SSAS Decision Tree Prediction

Clustering Basics
Clustering (grouping) := arrangement of objects into groups such that
- objects in the same cluster are most similar
- objects from different clusters are most dissimilar
Types of clustering
- Partitioning clusters (an object $o_1$ belongs to only one cluster)
- Hierarchical clusters (nested clusters)
Distance function d
- $d(o_1, o_2) \ge 0$; $d(o_1, o_2) = 0 \Leftrightarrow o_1 = o_2$; $d(o_1, o_2) = d(o_2, o_1)$
- Similarity of $o_1$ and $o_2$ is defined via the distance function
- The smaller the distance, the more alike are the objects
Goal function
- Maximize the compactness of the clusters
- Compactness of a cluster C := $|C| \,/\, \sum_{o_i \in C} d(o_i, c)$, where c = center of C

K-Means Based Clustering
Algorithm:
1. Choose k cluster centers (centroids)
2. Assign each object to its nearest centroid
3. Recalculate the cluster centers (centroids)
Repeat steps 2-3 until the centroids stabilize
[Figure: example with k = 2, objects 1 to 7, initial centroids a and b, recalculated centroids a* and b*]
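
The three K-Means steps and the repeat rule can be sketched directly in Python. This is a minimal illustration assuming Euclidean distance and caller-supplied initial centroids; the function names are invented, and SSAS uses its own scalable implementation rather than anything like this.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Repeat assignment and recalculation until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]  # step 1: chosen by the caller
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # centroids are stable
            break
        centroids = updated
    return centroids, clusters


def compactness(cluster, center):
    """|C| divided by the summed distances of the members to the center."""
    total = sum(math.dist(o, center) for o in cluster)
    return len(cluster) / total if total else math.inf
```

The result depends on the initial centroids, so in practice the algorithm is often run several times with different starting points and the most compact solution is kept.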

SSAS Clustering
- Implements K-Means and EM clustering
- Both are partitioning algorithms
- K-Means is distance based
- EM is probability based
- Scalable means: one single data scan only

SSAS Cluster Viewer

MS Cluster Profile Viewer

SSAS Cluster Characteristics

SSAS Lift Chart
Lift = % of correct predictions / % of population

Association Rules
Example (basket analysis)
Available items: I = {Bread, Coffee, Milk, Cake, Butter, Tea}
Transaction set T:
  t   bought items
  1   Bread, Coffee, Milk, Cake
  2   Coffee, Milk, Cake
  3   Bread, Butter, Coffee, Milk
  4   Milk, Cake
  5   Bread, Cake
  6   Bread
Support of X = {Coffee, Milk}: Support(X) = 3/6 = 50%
Support of R = X ∪ {Cake}, i.e. support of the rule Milk, Coffee -> Cake: Support(R) = 2/6 = 33%
Confidence of the rule: Confidence(Milk, Coffee -> Cake) = Support(R) / Support(X) = 2/3 = 67%
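
The support and confidence figures of the basket example can be recomputed with a few lines of Python. The function names are illustrative assumptions; the transaction data is the slide's own, with the item spelled "Bread".

```python
TRANSACTIONS = [
    {"Bread", "Coffee", "Milk", "Cake"},
    {"Coffee", "Milk", "Cake"},
    {"Bread", "Butter", "Coffee", "Milk"},
    {"Milk", "Cake"},
    {"Bread", "Cake"},
    {"Bread"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Support of the whole rule divided by the support of its left-hand side."""
    rule_items = set(antecedent) | set(consequent)
    return support(rule_items, transactions) / support(antecedent, transactions)
```

Calling `support({"Coffee", "Milk"})` and `confidence({"Milk", "Coffee"}, {"Cake"})` reproduces the 3/6 and 2/3 worked out on the slide.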

SSAS Item Sets Viewer

[Screenshot: SSAS rules viewer with the columns Probability (= Confidence) and Importance]

Key Performance Indicators (KPI)
Idea: measure the performance of an enterprise with simple numbers such as return on investment (ROI), profit, capital turnover
- ROI := Earnings / Investments
- Profit := Revenue - Costs
- Capital turnover := Sales / Investments

SSAS Key Performance Indicators (KPI)
KPI = f(measures, goal)
- Measures are compared with a goal function
- A KPI is normally analyzed over time
- Define new KPI
- Drag measure to value or goal expression

Time Series
Definition: A time series (TS) is a sequence of numbers ordered in time at equidistant intervals
- The ordering is relevant (i.e. successive numbers are not independent)
Additive TS model: $y(t) := \mathrm{Trend}(t) + \mathrm{Season}(t) + R(t)$, $t \in \{1, 2, 3, \dots\}$
- Trend is monotonic (linear or non-linear)
- Season is periodic (sine or other)
- R(t) is a random value

SSAS Autoregressive Tree Models for Time-Series Analysis
Definition: Let $y = (y_1, y_2, \dots, y_t)$ be a time series TS. The model for TS is called autoregressive if, for all $p < \tau \le t$, the probability distribution of $y_\tau$ depends as a linear regression on the previous p values $y_{\tau-p}, \dots, y_{\tau-1}$.
Definition: An autoregressive tree model is a piecewise linear autoregressive model where the boundaries are defined by a decision tree.
[Figure: decision tree with split nodes $y_{\tau-1} < a$ and $y_{\tau-1} > b$ (branches labeled false/true) and three leaves $P(y_t) = N(m_1, \sigma_1^2)$, $N(m_2, \sigma_2^2)$, $N(m_3, \sigma_3^2)$; plot over t with the boundaries a and b]
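
The additive model y(t) = Trend(t) + Season(t) + R(t) can be made concrete by generating a synthetic series. The sketch below assumes a linear trend, a sine-shaped season and Gaussian noise for R(t); these choices and all parameter names are illustrative, and any monotonic trend or periodic season would fit the definition equally well.

```python
import math
import random


def additive_series(n, slope=0.5, amplitude=2.0, period=12, noise_sd=0.0, seed=0):
    """Return y(1), ..., y(n) with y(t) = Trend(t) + Season(t) + R(t)."""
    rng = random.Random(seed)  # fixed seed so the series is reproducible
    series = []
    for t in range(1, n + 1):
        trend = slope * t                                         # monotonic part
        season = amplitude * math.sin(2 * math.pi * t / period)   # periodic part
        residual = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
        series.append(trend + season + residual)
    return series
```

With `noise_sd=0` the seasonal part repeats exactly every `period` steps, so values one period apart differ only by the trend. That gives a quick sanity check before handing a series to a forecasting tool.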

MS Time Series
Uses regression tree