Data Warehouse design
Design of Enterprise Systems
University of Pavia
21/11/2013
Data Warehouse design
DATA PRESENTATION
BI Reporting Success Factors
BI platform success factors include:
- Performance
- User interface
- Presentation of the data
- Architecture
- Alignment with the data model
- Ability to answer questions
- Mobility
- Flexibility
- Availability
Information Needs
Information is data interpreted within a context. Information questions yield one or more answers that will help the enterprise. For example:
- Profitability: What was the recent margin between revenues and expenses?
- Trends: Did the business unit sell more or less product this quarter compared to last quarter?
- Ratios: What is the ROI of the data warehouse?
So, BI customers need information supporting:
- Proactive processes
- Reactive processes
- Predefined processes
- Analytic processes
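To make the profitability question concrete, here is a minimal sketch of the kind of query a BI tool might run; the table and column names (fact_finance, revenue, expense) are hypothetical:

```sql
-- Hypothetical fact table: fact_finance(quarter, revenue, expense).
-- Margin per quarter = revenues minus expenses.
SELECT quarter,
       SUM(revenue) - SUM(expense) AS margin
FROM   fact_finance
GROUP  BY quarter
ORDER  BY quarter;
```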
Proactive Processes
The enterprise needs to know when a problem is approaching, with the maximum possible lead time. Once an approaching problem has been observed, that lead time allows the enterprise to align its resources and prepare the best possible response. For proactive processes, a BI Reporting application should relieve customers of the need to remember to query enterprise data; the risk is that a customer will be too busy, or will simply forget, to run the query at the exact moment a problem emerges.
Reactive Processes
The enterprise needs to assess its recent past in the context of long-term and seasonal trends. The information from these assessments helps the enterprise know whether its short-term tactics and long-term strategies are currently working as intended or should be modified in the near future. Business processes such as these are reactive because they allow the enterprise to react to recent events. BI Reporting customers need the toolsets necessary to review and analyze recent events in the context of long-term and seasonal trends.
Predefined Processes
Some business processes are well defined, repeated, and stable. Predefined business processes could include such queries as:
- How many units did we sell?
- How much cash came in, and went out, in the past week?
- What is the net present value of the investments held by each customer?
In a predefined process, everything is known except the answer: the time frame, the query, and the audience are all known. A predefined process has very few variables, if any, that require the help or participation of a member of the enterprise.
Analytic Processes
Sometimes, the question that must be answered is: "What question should I ask?" This search is the analytic process: a search for a correlation between events, for an association between factors within and around the enterprise. Business analysts need a toolset that will enable them to search for the questions that will lead to the answers.
What exactly are analytics?
- Analytics are a subset of business intelligence.
- Analytics are the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions.
- Analytics may be input for human decisions or may drive fully automated decisions.
BI Maturity Model
[Figure: BI maturity model]
Architecture
The architecture of BI Reporting tools includes one or more servers between the data warehouse and its customers:
1. These servers hold a roadmap of the data warehouse.
2. Customers tell a BI Reporting application which information they need through its user interface.
3. The BI Reporting application submits the SQL to the data warehouse.
4. When the result returns from the data warehouse, the BI Reporting application returns it to the customer.
The companies that develop and own BI Reporting tools negotiate partnerships with the companies that develop and own RDBMS platforms. Through such a partnership, the owners of the RDBMS platform share their proprietary information, including application programming interfaces (APIs) and other interfaces, which allows a BI Reporting tool to connect with the greatest possible features and efficiency. The least efficient connectivity is through ODBC; a BI Reporting tool will use ODBC only when no other connectivity is available.
Presentation Methods
BI Reporting tools interact with data warehouse customers using several methods:
- Reports
- Dashboards
- On-Line Analytical Processing (OLAP)
- Data mining and Decision Support Systems (DSSs)
Each method has its own advantages and disadvantages. None of these methods addresses all of the data warehouse customers' needs and skills. For that reason, most BI Reporting tools combine these methods.
Reports
Reports are basically SQL statements with a label. The BI Reporting tool has a library of predefined reports, in which data warehouse customers need to be able to find the exact permutation of Fact and Dimension data they require. The SQL in all the reports can be optimized for maximum query efficiency. The BI Reporting team can test and validate each report, verifying that it does indeed return the data it promises to return. The BI Reporting team can also own and catalog all the BI reports, thus avoiding redundant reports.
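A minimal sketch of what such a labeled, predefined report could look like in SQL; the star-schema names (fact_sales, dim_date, dim_product) are hypothetical:

```sql
-- Hypothetical predefined report: "Units Sold by Product Category and Month".
-- One labeled SQL statement, joining the fact table to its dimensions.
SELECT d.month_name,
       p.category,
       SUM(f.units_sold) AS total_units
FROM   fact_sales  f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_product p ON f.product_key = p.product_key
GROUP  BY d.month_name, p.category
ORDER  BY d.month_name, p.category;
```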
Dashboards
Dashboards require the BI Reporting tool to translate the list of data elements required by the customer into a SQL statement. The BI Reporting tool then submits that SQL to the data warehouse and returns the result set to the data warehouse customer. Dashboards usually use drop-down lists, menus, and user input boxes to indicate the list of data elements and WHERE clauses required by the data warehouse customer. To achieve this translation of data elements and WHERE clauses, a BI Reporting tool must have its own roadmap of the data warehouse.
Dashboards
That roadmap of the data warehouse must be maintained and synchronized with the data warehouse: if the data warehouse changes, the BI Reporting roadmap changes. Dashboards provide flexibility and ad hoc reporting that do not exist with predefined reports. The price for that flexibility is the roadmap of the data warehouse, which includes:
- the cost of a BI Reporting server
- likely the cost of a middleware server
- development and maintenance
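A minimal sketch of how a dashboard's drop-down selections might be translated into SQL; the schema names and the selected values (Region = 'North', Year = 2013) are hypothetical:

```sql
-- SQL a dashboard might generate from its drop-down lists.
-- The BI tool's roadmap maps each selection onto a table and a WHERE clause.
SELECT p.category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_product p ON f.product_key = p.product_key
JOIN   dim_region  r ON f.region_key  = r.region_key
JOIN   dim_date    d ON f.date_key    = d.date_key
WHERE  r.region_name = 'North'   -- from the Region drop-down
AND    d.year        = 2013      -- from the Year drop-down
GROUP  BY p.category;
```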
On-Line Analytical Processing
OLAP applications pre-calculate and store the answers (i.e., result sets) to permutations of Dimensions. The pre-calculated result sets are stored in a multidimensional structure, which is referred to as a Cube. The multidimensional cube is able to navigate directly to the cell that holds the result set associated with the permutation of Dimensions indicated by the customer. As a result, the answer set comes back to the customer with a nearly instant response time.
On-Line Analytical Processing
The final, and best, feature of an OLAP application is its user interface. An OLAP application uses a GUI: the customer is able to point-and-click on a cell that is a reference to a permutation of Dimensions. The result set returns immediately because it has been pre-calculated and stored, allowing the customer to ask questions (via point-and-click) and receive answers in near real time.
Multidimensional OLAP
MOLAP stores all the result sets of all the permutations of Dimensions in an OLAP cube. MOLAP requires significant storage capacity, and the creation of all the result sets in a MOLAP cube requires significant CPU cycles, I/Os, and memory capacity. In exchange, MOLAP provides the fastest performance for the customer.
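A rough sketch of the pre-calculation a MOLAP build performs, expressed here as SQL with GROUP BY CUBE (supported by several RDBMSs, e.g. Oracle, SQL Server, and PostgreSQL); the schema names are hypothetical:

```sql
-- Pre-calculate aggregates for every permutation of three dimensions.
-- GROUP BY CUBE emits one row per combination, including subtotals and the
-- grand total -- conceptually, the cells a MOLAP cube stores in advance.
SELECT d.year,
       r.region_name,
       p.category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_region  r ON f.region_key  = r.region_key
JOIN   dim_product p ON f.product_key = p.product_key
GROUP  BY CUBE (d.year, r.region_name, p.category);
```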
Relational OLAP
ROLAP stores no result sets. Rather, ROLAP identifies the data within an associated data warehouse from which it can calculate all result sets at runtime. When a customer indicates an intersection of Dimensions, the ROLAP cube translates that information into a SQL statement, which is submitted to the data warehouse. The result set comes back as a data value that is reflected in the OLAP GUI. A ROLAP cube requires the least storage capacity on the OLAP server; however, ROLAP transfers the consumption of CPU cycles and I/Os over to the data warehouse. ROLAP provides the slowest performance but the maximum number of Dimensions for the customer.
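A minimal sketch of the single-cell SQL a ROLAP engine might generate at runtime when the customer clicks the intersection Year = 2013, Region = 'North'; schema names are hypothetical as before:

```sql
-- Generated on demand for one cell; nothing is stored in advance.
SELECT SUM(f.sales_amount) AS cell_value
FROM   fact_sales f
JOIN   dim_date   d ON f.date_key   = d.date_key
JOIN   dim_region r ON f.region_key = r.region_key
WHERE  d.year        = 2013
AND    r.region_name = 'North';
```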
Hybrid OLAP
HOLAP is a combination of MOLAP and ROLAP. By pre-calculating and storing most, but not all, of the result sets within an OLAP cube, a HOLAP cube achieves a compromise between capacity, performance, and the permutations of Dimensions available to the customer.
OLAP Operations: Roll-Up
A roll-up involves summarizing the data along a dimension. The summarization rule might be computing totals along a hierarchy or applying a set of formulas such as "profit = sales - expenses".
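A minimal sketch of a roll-up in SQL, summarizing monthly facts up to the quarter level of the time hierarchy (hypothetical schema):

```sql
-- Roll up from month to quarter along the time hierarchy.
SELECT d.year,
       d.quarter,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date   d ON f.date_key = d.date_key
GROUP  BY d.year, d.quarter
ORDER  BY d.year, d.quarter;
```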
OLAP Operations: Drill-Down
Drill-down/up allows the user to navigate among levels of data, ranging from the most summarized (up) to the most detailed (down). For example, the analyst moves from the summary category "Outdoor products" to see the sales figures for the individual products, e.g. outdoor table, outdoor chair, etc.
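A sketch of that drill-down in SQL, descending from the category level to the individual products within "Outdoor products" (hypothetical schema):

```sql
-- Drill down from the category level to individual products.
SELECT p.product_name,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_product p ON f.product_key = p.product_key
WHERE  p.category = 'Outdoor products'
GROUP  BY p.product_name;
```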
OLAP Operations: Slicing
A slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension. For example, the sales of all regions and all product categories of the company in the year 2004 are "filtered" out of the data cube.
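A sketch of that slice in SQL, fixing the time dimension at the single value 2004 while region and product category remain free (hypothetical schema):

```sql
-- Slice: fix one dimension (year = 2004); the result has one fewer dimension.
SELECT r.region_name,
       p.category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_region  r ON f.region_key  = r.region_key
JOIN   dim_product p ON f.product_key = p.product_key
WHERE  d.year = 2004
GROUP  BY r.region_name, p.category;
```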
OLAP Operations: Dicing
The dice operation produces a sub-cube by allowing the analyst to pick specific values on multiple dimensions. For example, the new cube shows the sales figures of a limited number of product categories, while the time and region dimensions cover the same range as before.
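A sketch of that dice in SQL, restricting the product dimension to a few categories while time and region stay unrestricted (hypothetical schema and category names):

```sql
-- Dice: pick specific values on one or more dimensions to form a sub-cube.
SELECT d.year,
       r.region_name,
       p.category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_region  r ON f.region_key  = r.region_key
JOIN   dim_product p ON f.product_key = p.product_key
WHERE  p.category IN ('Outdoor products', 'Kitchen products')
GROUP  BY d.year, r.region_name, p.category;
```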
OLAP Operations: Pivoting
Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could then replace products with time periods to see data across time for a single product.
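A sketch of one face of such a pivot in portable SQL, using conditional aggregation to spread two hypothetical products across columns for a single quarter (some RDBMSs also offer a native PIVOT clause):

```sql
-- Pivot: cities down the rows, products across the columns, one quarter fixed.
SELECT c.city_name,
       SUM(CASE WHEN p.product_name = 'Outdoor table'
                THEN f.sales_amount ELSE 0 END) AS outdoor_table_sales,
       SUM(CASE WHEN p.product_name = 'Outdoor chair'
                THEN f.sales_amount ELSE 0 END) AS outdoor_chair_sales
FROM   fact_sales  f
JOIN   dim_city    c ON f.city_key    = c.city_key
JOIN   dim_product p ON f.product_key = p.product_key
JOIN   dim_date    d ON f.date_key    = d.date_key
WHERE  d.year = 2004 AND d.quarter = 2
GROUP  BY c.city_name;
```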
Data Mining
Data Mining is a search for patterns and associations within data that are not immediately obvious or may be hidden altogether. As a pattern emerges, it may lead to a question that leads to another pattern, which may open up a new line of inquiry and discovery. The inquiry and discovery in Data Mining follow one of two paths:
- Exploratory Analysis: the search for a hypothesis, a business rule that can predict future events and conditions.
- Confirmatory Analysis: the test of a hypothesis. A business rule has been found that requires validation and verification.
An enterprise wants to be able to predict an event or condition, i.e., to find the function and factors in f(x, y, z) = A. In the best-case scenario, the factors x, y, and z are within the power of the enterprise to manipulate; the enterprise can then cause result A to occur by manipulating x, y, and z. In the next-best scenario, x, y, and z are known by the enterprise, which can then know that result A is about to occur whenever x, y, and z have occurred.
Data Mining Tools
Generally available Data Mining tools handle all the statistical and time-series functions, as well as the confidence measurements. These Data Mining tools are powerful software packages that enhance and accelerate the Data Mining process; they include the statistical algorithms and functions that are at the center of Data Mining. Because of all their statistical power, Data Mining tools require that the data be brought to them in specific formats. Data Preparation is usually two or three times the work of the Data Mining itself.
Data Cleansing
A Data Mining tool needs a clean set of data, without any noise that might cause confusion or distraction. Some of the Data Cleansing methods are:
- Missing Values: Identify missing values in the data and fill them in with a reasonable value. This mitigates the risk that an empty spot that does not normally occur in the data will lead the Data Mining tool to believe that the empty spot always occurs.
- Outliers: Identify unreasonable data values. In the data warehouse, these outliers are retained; but in the data presented to a Data Mining tool, they are modified to a more reasonable value. This mitigates the risk that an outlier that does not normally occur in the data will lead the Data Mining tool to believe that the outlier always occurs.
- Sample Bias: Preferably, feed a Data Mining tool a universe (a whole and complete set) of data, not just a sample. A sample of data should only be used when the delivery of a universe of data is physically and logistically impossible (even after asking that person four doors down and two doors over, who can move mountains of data, to help gather the universe of data). If, and only if, the universe of data is impossible, use a sample of data for Data Mining, and check that sample for bias.
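A minimal sketch of the first two methods in SQL, filling missing values with the column mean and capping outliers at a fixed threshold; the table, columns, and the 10,000 threshold are hypothetical:

```sql
-- Prepare a cleansed extract for the Data Mining tool.
-- The warehouse keeps the raw values; only the extract is modified.
SELECT customer_id,
       CASE
         WHEN order_amount IS NULL                       -- missing value:
           THEN (SELECT AVG(order_amount) FROM orders)   --   fill with the mean
         WHEN order_amount > 10000                       -- outlier:
           THEN 10000                                    --   cap at a reasonable value
         ELSE order_amount
       END AS cleansed_amount
FROM   orders;
```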
Data Inspection
A Data Mining tool understands two kinds of variables: Independent Variables and Dependent Variables. In the cause-and-effect concept of the world, wherein every effect is preceded by one or more causes, the Independent Variables are the cause and the Dependent Variable is the effect. In Data Inspection, a BI analyst reviews the meaning, content, and inconsistencies within each variable. The methods applied in Source System Analysis can also be applied to Data Inspection:
- Data profile
- Histogram
- Business Rule validation
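A minimal sketch of a data profile and a histogram in SQL; the table and column names are hypothetical:

```sql
-- Data profile: basic statistics for one variable.
SELECT COUNT(*)            AS row_count,
       COUNT(order_amount) AS non_null_count,
       MIN(order_amount)   AS min_value,
       MAX(order_amount)   AS max_value,
       AVG(order_amount)   AS mean_value
FROM   orders;

-- Histogram: frequency of each distinct value of a variable.
SELECT order_status,
       COUNT(*) AS frequency
FROM   orders
GROUP  BY order_status
ORDER  BY frequency DESC;
```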
Exploratory and Confirmatory Analyses
The hypothesis is that the Independent Variables have some sort of connection to the Dependent Variable. Exploratory Analysis is a search for an explanation of how (not necessarily why) some subset of these Independent Variables relates to, or associates with, the Dependent Variable. The relation, or association, derived from Exploratory Analysis is an algorithm. For example:
- Growth in sales is inversely proportional to changes in price.
- Increases in manufacturing throughput are directly proportional to certification levels.
Confirmatory Analysis begins with the hypothesis. In Confirmatory Analysis, the BI analyst tries to predict the Dependent Variable by using the Independent Variables and the hypothesized algorithm. The variance between the predicted value and the actual value is a measurement of the confidence in the hypothesized algorithm.
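A rough sketch of both analyses in SQL. The aggregate CORR (available in, e.g., Oracle and PostgreSQL) measures linear association for the exploratory step; the table, columns, and the illustrative coefficients in the confirmatory step are hypothetical:

```sql
-- Exploratory: how strongly does price change associate with sales growth?
-- A Pearson coefficient near -1 would support the rule
-- "growth in sales is inversely proportional to changes in price".
SELECT CORR(price_change_pct, sales_growth_pct) AS correlation
FROM   monthly_sales_summary;

-- Confirmatory: measure how far a hypothesized linear rule
-- (predicted growth = 5.0 - 2.0 * price change) misses the actual values.
SELECT AVG(ABS((5.0 - 2.0 * price_change_pct) - sales_growth_pct)) AS mean_abs_error
FROM   monthly_sales_summary;
```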
Summary

Process    | Analysis                                                    | Access mode
Reactive   | Long-term and seasonal trends                               | Dashboards and OLAP
Proactive  | Near future / approaching potential events                  | Alerts (e.g. emails, notifications, etc.) and DSS
Predefined | Repeated events                                             | Reports
Analytic   | Unknown / hypothesis exploration / hypothesis confirmation  | Data mining and DSS
Data Warehouse design
ASSIGNMENT
Deliverable
- Cube mapping with Pentaho Schema Workbench
- OLAP implementation with Saiku (C-tools)
- Dashboard implementation with CDE
- At least one dashboard for each KPI
Deadline: December 17th