INFO 321 Chapter 3: Decision Support Systems
Department of Information Science
Semester 2, 2012

References
  General
  - Kifer, Chapter 17
  - Silberschatz (5th ed.), Chapter 18
  - Data Warehousing for Cavemen (see Blackboard: Other Documents, Decision Support Systems)
  - Coronel, Morris & Rob (9th ed.), Chapter 13
  - Mannino (3rd ed.), Chapter 16
  - http://www.searchcrm.com/
  Oracle11g documentation
  - Data Warehousing Guide
  - OLAP User's Guide
  - OLAP DML Reference

What is decision support? (Kifer 1.4, 17.1; Silberschatz 18.1)
  - Data -> timely, relevant, well-visualised information.
  - Tune information and presentation to specific purposes.

Decision-making occurs at the operational level (see also Figure 3-1)
  - Very short-term.
  - Well-defined inputs.
  - Produced by existing applications or simple front-end tools.
  - Line managers.

Decision-making occurs at the tactical level (see also Figure 3-1)
  - Short-term.
  - Less well-defined inputs.
  - Middle managers.

Decision-making occurs at the strategic level (see also Figure 3-1)
  - Long-term.
  - Ill-defined inputs.
  - Often cannot use pre-existing applications -> Decision Support Systems (DSS) or Executive Information Systems (EIS).
  - Senior managers.

Operational vs. decision support queries
  Operational
  - How many brass reciprocating hammers do we have in stock?
  - How much electrical twine did we sell yesterday?
  Decision support
  - How many brass reciprocating hammers were sold to customers aged 18-25 in large North Island towns over each of the last six months?
  - If we double the advertising budget for electrical twine, how might that affect revenues for the next six months?

There is a strong need for DSS
  - Modern business is very complex.
  - Shrinking time frame for decision-making.
  - Data from multiple sources (see Figure 3-1):
    - internal vs. external
    - formal vs. informal
    - must be sensibly integrated

Components of a DSS (adapted from Coronel, Morris & Rob, Figure 13.2)
  [Figure: external data and operational data pass through data extraction, transformation and loading into the decision support data store. The Decision Support System comprises business data analysis models, an end-user query tool, and an end-user presentation and visualisation tool. The visualisation is illustrated by a chart of sales, expenses and profits for three periods: 14 000 / 9 500 / 4 500; 17 000 / 11 000 / 6 000; 21 000 / 14 000 / 7 000.]

There are many types of decision support tool
  Basic
  - Ad hoc query tools (SQL?).
  - Graph and report generators.
  - Spreadsheets (small data sets only!).
  More advanced
  - Data warehouses.
  - Online analytical processing (OLAP).

Operational vs. decision support data (Coronel, Morris & Rob, Table 13.4; see also Mannino 16.1.1)
  Characteristic      | Operational data                    | Decision support data
  Data currency       | Current operations; real-time data  | Historic data; snapshot of company data; time component (week/month/year)
  Granularity         | Atomic, detailed data               | Summarised data
  Summarisation level | Low; some aggregation               | High; heavily aggregated
  Data structure      | Highly normalised; mostly RDBMS     | Non-normalised; complex structures; some relational, mostly multidimensional
  Transaction type    | Mostly updates                      | Mostly queries
  Transaction volumes | High update volume                  | Periodic loads and summary calculations
  Transaction speed   | Updates are critical                | Retrievals are critical
  Query activity      | Low to medium                       | High
  Query scope         | Narrow range                        | Broad range
  Query complexity    | Simple to medium                    | Very complex
  Data volumes        | Hundreds of MiB to GiB              | Hundreds of GiB to TiB
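
The contrast between operational and decision support queries can be sketched with a small in-memory database. This is only an illustration: the table names (stock, sale), the column names and all data values are hypothetical, and Python's built-in sqlite3 module stands in for whatever DBMS is actually used. The operational query returns one current value; the decision support query filters and aggregates over a time period.

```python
import sqlite3

# Hypothetical tables and data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (product TEXT PRIMARY KEY, on_hand INTEGER);
CREATE TABLE sale (product TEXT, sale_month TEXT,
                   customer_age INTEGER, quantity INTEGER);
INSERT INTO stock VALUES
  ('brass reciprocating hammer', 12), ('electrical twine', 40);
INSERT INTO sale VALUES
  ('brass reciprocating hammer', '2012-05', 19, 2),
  ('brass reciprocating hammer', '2012-05', 42, 1),
  ('brass reciprocating hammer', '2012-06', 23, 3),
  ('brass reciprocating hammer', '2012-06', 24, 1),
  ('electrical twine', '2012-06', 20, 5);
""")

# Operational query: narrow scope, a single current value.
on_hand = conn.execute(
    "SELECT on_hand FROM stock WHERE product = ?",
    ("brass reciprocating hammer",),
).fetchone()[0]

# Decision support query: broad scope, filtered by customer age
# and aggregated per month.
monthly_sales = conn.execute(
    """
    SELECT sale_month, SUM(quantity)
    FROM sale
    WHERE product = ? AND customer_age BETWEEN 18 AND 25
    GROUP BY sale_month
    ORDER BY sale_month
    """,
    ("brass reciprocating hammer",),
).fetchall()

print(on_hand)
print(monthly_sales)
```

On realistic data volumes the second kind of query is the one that motivates summarised, denormalised decision support data.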

Timespan is a key difference
  Operational
  - Very short (current transactions).
  Decision support
  - Long (past and future).
  - Data may not be current.

Granularity is a key difference
  Operational
  - Represent specific transactions (atomic).
  Decision support
  - Varying levels of aggregation (atomic to highly summarised).
  - Drilling down vs. rolling up.

Dimensionality is a key difference (Kifer 17.2; Silberschatz 18.2.1)
  Operational
  - Flat (tables of atomic transactions).
  Decision support
  - Many dimensions.
  - View orders by region per quarter (2D).
  - Compare sales of products during the last six months by region, city, store & customer (4D).

Dimensionality is a key difference (adapted from Coronel, Morris & Rob, Figure 13.13; see also Kifer 17.2 & Silberschatz 18.2.1)
  [Figure: conceptual three-dimensional cube of sales by product (Ball, Bat, Club), location (AKL, CHC, DUD) and time (April, May, June). Sales facts are stored in the cells at the intersection of each product, time and location dimension value.]

Data warehouses store decision support data (Mannino 16.1.2; Coronel, Morris & Rob Table 13.7)
  - Designed and optimised for decision support data.
  - Internal structure quite different from operational databases:
    - aggregated
    - denormalised
    - data from multiple internal/external sources

Data warehouse data are integrated (Coronel, Morris & Rob Table 13.7)
  Operational database data
  - Mostly internal sources.
  - Multiple representations.
  Data warehouse data
  - Both internal and external sources.
  - Transformed, cleaned and summarised during integration.
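
Rolling up and drilling down over the sales cube can be sketched as follows. This is a minimal illustration, not how any particular OLAP tool works: the cell values are invented, and the roll_up helper and DIMENSIONS tuple are hypothetical names. Rolling up drops dimensions and sums the facts; drilling down is the reverse, keeping more dimensions to see finer detail.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, location, month) -> sales amount.
cube = {
    ("Ball", "AKL", "April"): 100, ("Ball", "AKL", "May"): 120,
    ("Ball", "CHC", "April"): 80,  ("Bat", "AKL", "April"): 60,
    ("Bat", "DUD", "June"): 90,    ("Club", "CHC", "May"): 50,
}
DIMENSIONS = ("product", "location", "time")


def roll_up(facts, keep):
    """Sum the facts over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, amount in facts.items():
        totals[tuple(key[p] for p in positions)] += amount
    return dict(totals)


# Highly summarised: one total per product.
by_product = roll_up(cube, ["product"])
# Drill down one level: totals per product and location.
by_product_location = roll_up(cube, ["product", "location"])
# Roll all the way up: a single grand total.
grand_total = roll_up(cube, [])

print(by_product)
print(by_product_location)
print(grand_total)
```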

Data warehouse data are subject-oriented (Coronel, Morris & Rob Table 13.7)
  Operational database data
  - Functional or process-oriented (invoices, payments, products).
  Data warehouse data
  - Facts or measures organised by major subject areas (sales, marketing, etc.).
  - Held according to dimensions or variables of interest: product, customer, region, ...
  - Aggregated data from many operational tables.
  - Queries tuned to specific decision-making needs.

Data warehouse data are time-variant (Coronel, Morris & Rob Table 13.7)
  Operational database data
  - Current transactions with precise time stamps.
  Data warehouse data
  - Time is an important dimension for almost all subject areas.
  - Data aggregated by time, e.g., sales by week, month, quarter, year, ...
  - Historical focus (past and future).

Data warehouse data are non-volatile (Coronel, Morris & Rob Table 13.7)
  Operational database data
  - Frequent changes: dynamic.
  - Often archived periodically.
  Data warehouse data
  - Read-only (occasional batch updates): static.
  - Historical data retained: always growing (GiB ...).

Defining a data warehouse in more detail (Silberschatz 18.3.1; Table 3-2)
  - Read-only database optimised for data analysis and query processing.
  - Data from:
    - legacy/archived databases
    - operational databases
    - other sources
  - Optimisation includes:
    - decisions on aggregations
    - important dimensions
    - appropriate indexing and physical design

Data marts are small, specialised data warehouses
  - Focused subset of data.
  - Clusters of data marts surrounding a central enterprise data warehouse?

Data warehouse analysis is more demanding (Mannino 16.3.1)
  - Some queries may be impossible if not designed for.
  - Not as flexible for ad hoc queries.
  - Users must identify intended use.
  - Data derived from both internal and external sources (e.g., Internet: NZX, Dow Jones, NASDAQ).

The difficulty of data warehouse design (The Standish Group (1997), The Meta Myth; http://standishgroup.com/)
  Interviewer: How many data warehouses have you had?
  Data warehouser: We have had eight.
  Interviewer: To what do you attribute so many warehouses?
  Data warehouser: Seven mistakes...

Facts are a key design aspect (Kifer pp. 713-715; Mannino 16.3.2)
  - A value that we are interested in.
  - Examples: revenue, profits, cost, number of sales.
  - Also known as measures.

Dimensions are a key design aspect (Kifer pp. 713-715; Mannino 16.3.2)
  - A factor/variable that influences the facts.
  - Examples: time, product, customer, salesrep, location.
  - Each has attributes.

Time as a dimension (see also Mannino 16.2.3)
  - Not as simple as it seems!
  - Granularity (unit size): year, month, week, day, hour.
  - Alternate units (periodicity): season, financial year, quarter.

Star schemas for relational data warehouses (Kifer pp. 715-717; Silberschatz 18.3.2; Figure 3-3; see also Data Warehousing Guide ch. 2)
  - Central fact table.
  - Cluster of related dimension tables.
  [Figure: a fact table in the centre, linked to four surrounding dimension tables.]
  - Needed because of inadequate physical data independence? (denormalised)
  - Partial normalisation -> snowflake or starflake structure (also constellation).
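
A star schema and a typical query against it can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch only: the table names (sale_fact, time_dim, product_dim, location_dim), their columns and the data are hypothetical and much simpler than the schemas in Figure 3-3. The fact table holds the measures plus one foreign key per dimension, and a query joins the fact table to the dimensions it wants to group by.

```python
import sqlite3

# Hypothetical star schema: one fact table, three dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale_fact (
    time_id INTEGER REFERENCES time_dim,
    product_id INTEGER REFERENCES product_dim,
    location_id INTEGER REFERENCES location_dim,
    quantity INTEGER,   -- measure
    amount REAL         -- measure
);
INSERT INTO time_dim VALUES (1, 2012, 4), (2, 2012, 5);
INSERT INTO product_dim VALUES (1, 'Ball'), (2, 'Bat');
INSERT INTO location_dim VALUES (1, 'AKL'), (2, 'DUD');
INSERT INTO sale_fact VALUES
    (1, 1, 1, 10, 50.0), (1, 2, 1, 2, 40.0),
    (2, 1, 1, 5, 25.0),  (2, 1, 2, 4, 20.0);
""")

# Star join: total sales amount by product and month.
rows = conn.execute("""
SELECT p.description, t.month, SUM(f.amount)
FROM sale_fact f
JOIN product_dim p ON p.product_id = f.product_id
JOIN time_dim t ON t.time_id = f.time_id
GROUP BY p.description, t.month
ORDER BY p.description, t.month
""").fetchall()

for row in rows:
    print(row)
```

Only the dimensions named in the query need to be joined; location_dim is left out here because the result is not grouped by location.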

Three steps to populate a data warehouse (Kifer 17.6; Mannino 16.4)
  - Extraction: obtaining data from sources.
  - Transformation: altering form of data (includes cleaning).
  - Loading: adding data to warehouse.
  - Possibly intermediate data staging steps.
  - Critical for successful data warehouses.

Performance tuning for data warehouses (Kifer 17.5; Data Warehousing Guide ch. 3; see also INFO 321 Chapter 1)
  - Complex queries -> denormalisation (fewer joins).
  - Mostly read-only + complex queries -> index heavily.
  - Other techniques:
    - normalise dimension tables
    - multiple fact tables for different aggregation levels
    - physical tuning: partitioning, replication, etc.
  - B-tree indexes and hashing generally useful.
  - Bitmap indexes particularly for counting by category queries.
  - Integrated indexes for dimension tables?
  - Function-based indexes could be useful? (Time queries?)

Oracle11g supports data warehouses (Oracle11g Data Warehousing Guide)
  The simple approach
  - Use distribution and replication services.
  - Scales poorly.
  Oracle data mart suite
  - Add-on for constructing Oracle data marts.
  - GUI interface; design & ETL modules; third-party tools.
  Oracle Data Integrator Enterprise Edition
  - Build & manage high-end, complex data warehouses.
  - Combines Oracle Data Integrator and Oracle Warehouse Builder.
  Index support
  - Bitmap & function-based indexes, index-organised tables.
  - Bitmap join indexes.
  Other relevant tools
  - SQL*Loader (possibly in conjunction with Transparent Gateways)
  - export and import (basic)
  Also see Oracle's web site (good luck!).

OLAP tools enable complex data processing (Silberschatz 18.2; Figure 3-4)
  - Complex analysis of multidimensional data.
  - Spreadsheet-like simplicity.
  - Data stored in warehouse or tool's internal proprietary database.
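
The three steps to populate a data warehouse (extraction, transformation, loading) can be sketched as three small functions. This is a toy pipeline under stated assumptions, not a description of any Oracle tool: the CSV text stands in for a real operational source, and the function names, the sale_fact table and the cleaning rules are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be an operational
# database or an external feed.
SOURCE_CSV = """inv_date,customer,amount
15-May-2007,Circuit Central,"$1,300.00"
15-May-2007,small bytes ,$990.00
16-May-2007,Computer House,$480.00
"""


def extract(text):
    """Extraction: obtain raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: clean names and convert currency text to numbers."""
    cleaned = []
    for row in rows:
        cleaned.append((
            row["inv_date"],
            row["customer"].strip().title(),
            float(row["amount"].replace("$", "").replace(",", "")),
        ))
    return cleaned


def load(conn, rows):
    """Loading: add the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sale_fact "
        "(inv_date TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
total = conn.execute("SELECT SUM(amount) FROM sale_fact").fetchone()[0]
print(total)
```

Keeping the steps separate makes it easy to add an intermediate staging step between them, as the slide suggests may be needed.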
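
The idea behind bitmap indexes for counting-by-category queries can be sketched in a few lines. This is a conceptual illustration only, assuming one uncompressed bitmap per distinct value stored as a Python integer; it does not reflect how Oracle actually stores or compresses its bitmap indexes, and the column data and helper names are hypothetical.

```python
# Hypothetical column values for six rows of a fact table.
regions = ["North", "South", "North", "North", "South", "East"]
products = ["Ball", "Ball", "Bat", "Ball", "Bat", "Ball"]


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set
    when row i holds that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def count_rows(bitmap):
    """Count the set bits, i.e. the number of matching rows."""
    return bin(bitmap).count("1")


region_index = build_bitmap_index(regions)
product_index = build_bitmap_index(products)

# How many rows are in region North?
north_count = count_rows(region_index["North"])
# How many rows are in region North AND for product Ball?
# Combining conditions is a single bitwise AND of two bitmaps.
north_ball_count = count_rows(region_index["North"] & product_index["Ball"])

print(north_count, north_ball_count)
```

The counts are answered from the bitmaps alone, without touching the underlying rows, which is why this kind of index suits low-cardinality category columns in mostly read-only data.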

OLAP tools have many capabilities
  - Data transformation.
  - Business modelling.
  - Statistical analysis.
  - Powerful GUI query facility.
  - Visualisation (graphics).

A simple OLAP example using Excel (Kifer 17.3; see also Silberschatz 18.2.3-18.2.5)
  - Sales subject area dimensions: customer, salesreps, product, region, time, ...
  - View sales aggregated by dimensions.
  - Dynamically alter presentation:
    - drill down/roll up
    - slice and dice (see next slide)
    - pivot the table (see demo)
    - highlight exceptions (e.g., high loss products)
    - invent new columns (e.g., % sales revenue)

Slice and dice enables dynamic visualisation (adapted from Coronel, Morris & Rob, Figure 13.14; see also Kifer 17.3.1)
  [Figure: sales cube with dimensions product (Ball, Bat, Club), location (AKL, CHC, DUD) and time (April, May, June), showing the store manager's view of sales data and the product manager's view of sales data as different slices of the cube.]

Another example: Quicken
  - Intuit's Quicken provides some simple OLAP-like features.
  - Drill down to expand a summarised category.
  - Drill down to expenses by month for a particular category.
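
Slicing the sales cube can be sketched as fixing one dimension to a single value and keeping the remaining two-dimensional view. This is a minimal sketch with invented cell values. The slice_cube helper and AXES mapping are hypothetical names, and it is an assumption here that the store manager's view corresponds to fixing a location and the product manager's view to fixing a product.

```python
# Hypothetical cube cells: (product, location, month) -> sales.
cube = {
    ("Ball", "AKL", "April"): 100, ("Ball", "CHC", "April"): 80,
    ("Bat", "AKL", "April"): 60,   ("Bat", "AKL", "May"): 70,
    ("Club", "DUD", "June"): 50,
}
AXES = {"product": 0, "location": 1, "time": 2}


def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and return the
    remaining two-dimensional view."""
    pos = AXES[axis]
    return {
        key[:pos] + key[pos + 1:]: amount
        for key, amount in cells.items()
        if key[pos] == value
    }


# Assumed store manager's view: one location, all products and months.
store_view = slice_cube(cube, "location", "AKL")
# Assumed product manager's view: one product, all locations and months.
product_view = slice_cube(cube, "product", "Bat")

print(store_view)
print(product_view)
```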

Another example: Quicken
  - Drill down to individual category transactions within a month.

OLAP data may be stored in different ways (Kifer 17.4; Silberschatz 18.2.2)
  - Internal proprietary database (often MDD).
  - Access external databases (data warehouses?):
    - relational (ROLAP)
    - multidimensional (MOLAP)
    - both (HOLAP)

Oracle11g SQL has extensive OLAP support (Oracle11g SQL Language Reference: SELECT; see also Kifer 17.3.2 & Silberschatz 18.2.3)
  - GROUP BY CUBE (<columns>).
  - GROUP BY ROLLUP (<columns>).
  - GROUPING SETS (different from SQL:1999's GROUPING function).
  - MODEL clause.
  - Various analytic functions, including RANK, PARTITION BY (see Oracle11g SQL Language Reference: Analytic Functions).
  - Crosstabs using PIVOT and UNPIVOT.
  - ...

Data mining may find hidden trends (Kifer 17.7; Silberschatz 18.4)
  - OLAP & data warehousing help identify trends and relationships.
  - BUT: Some relationships are too complex or subtle to easily notice.
  - Data mining tools claim to sift through databases and find unrecognised relationships and trends.

There are many data mining techniques (Kifer 17.8-17.11; Silberschatz 18.4)
  - Neural networks.
  - Complex visualisation.
  - Genetic algorithms (evolve a solution).
  - Advanced statistical analysis (traditional).
  - See INFO 331 for many of these.

Some examples of data mining
  - Beer and nappies (probably apocryphal).
  - Fraud detection (phone, credit card).
  - MCI's statistical profiles.
  - Risk assessment for car insurance (FIG).
  - NBA strategy analysis.
  - But data mining is not foolproof...
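
What GROUP BY ROLLUP and GROUP BY CUBE compute can be sketched by emulating their grouping sets in plain Python. This is an emulation for illustration, not Oracle's implementation: ROLLUP aggregates over every prefix of the column list, CUBE over every subset, and columns that have been aggregated away are reported as NULL (None here). The helper names and sample rows are hypothetical.

```python
from itertools import combinations


def rollup_sets(columns):
    """Grouping sets for ROLLUP: every prefix of the column list."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets for CUBE: every subset of the column list."""
    return [combo for n in range(len(columns), -1, -1)
            for combo in combinations(columns, n)]


def aggregate(rows, columns, grouping_sets, measure):
    """Sum `measure` per grouping set; columns outside the current
    set are reported as None, like NULL in the SQL result."""
    result = {}
    for gset in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in gset else None for c in columns)
            result[key] = result.get(key, 0) + row[measure]
    return result


# Hypothetical sales rows.
rows = [
    {"region": "North", "product": "Ball", "sales": 10},
    {"region": "North", "product": "Bat", "sales": 5},
    {"region": "South", "product": "Ball", "sales": 7},
]
cols = ["region", "product"]

rollup = aggregate(rows, cols, rollup_sets(cols), "sales")
cube = aggregate(rows, cols, cube_sets(cols), "sales")

print(rollup)
print(cube)
```

With two columns, ROLLUP adds per-region subtotals and a grand total, while CUBE additionally adds per-product subtotals.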

Figure 3-1: Sources of information
SOURCE: Unknown

Table 3-2: Twelve rules that define a data warehouse

Bill Inmon is widely referred to as the father of data warehouses. In 1994, he and Chuck Kelley drew up a list of twelve rules that define a data warehouse.

   1. The data warehouse and operational databases are separated.
   2. Data warehouse data are integrated.
   3. A data warehouse contains historical data over a long time horizon.
   4. Data warehouse data are a snapshot captured at a particular point in time.
   5. Data warehouse data are subject-oriented.
   6. A data warehouse is mainly read-only with periodic batch updates from operational data. No online updates are allowed.
   7. The data warehouse development cycle is data-driven, whereas the classical systems development approach is process-driven. [1]
   8. A data warehouse contains data at several levels of detail: current detail data, old detail data, lightly summarised and highly summarised data.
   9. Database operations in a data warehouse are typically read-only transactions on very large data sets, whereas in an operational database, there are typically many update transactions to a few data entities at a time.
  10. A data warehouse has a system that tracks data sources, transformations and storage.
  11. Metadata are critical for a data warehouse, as they identify and define all data elements. Metadata provide the source, transformation, integration, storage, usage, relationships and history of each data element. [2]
  12. A data warehouse contains a charge-back mechanism for resource usage, in order to enforce optimal use of data by end users. [3]

  [1] There are many who would argue with the latter claim.
  [2] This sounds suspiciously similar to number 10.
  [3] This one seems somewhat out of place, leaving one to wonder whether it was included simply to ensure that there were twelve rules!

Figure 3-3: Star schemas

(a) Orders star schema: daily aggregates by product and vendor
  Fact table
  - ORDER (TIME_ID, PRODUCT_ID, VENDOR_ID, QUANTITY, PRICE, AMOUNT): 85 000 rows
  Dimension tables
  - TIME (TIME_ID, YEAR, QUARTER, MONTH, WEEK, DAY): 1827 rows (5 years)
  - PRODUCT (PRODUCT_ID, DESCRIPTION, PROD_TYPE_ID, BRAND, COLOUR, SIZE, PACKAGE): 3000 rows
  - VENDOR (VENDOR_ID, VENDOR_NAME): 50 rows
  SOURCE: adapted from Rob & Coronel, Figure 13.18

(b) Sales star schema: daily aggregates by store, person and product
  Fact table
  - SALE (TIME_ID, LOCATION_ID, PERSON_ID, PRODUCT_ID, QUANTITY, PRICE, AMOUNT): 3 000 000 rows
  Dimension tables
  - TIME (TIME_ID, YEAR, QUARTER, MONTH, WEEK, DAY): 1827 rows
  - LOCATION (LOCATION_ID, DESCRIPTION, REGION_ID, STATE, CITY): 25 rows
  - PERSON (PERSON_ID, NAME, GENDER): 125 rows
  - PRODUCT (PRODUCT_ID, DESCRIPTION, PROD_TYPE_ID, BRAND, COLOUR, SIZE, PACKAGE): 3000 rows
  SOURCE: adapted from Rob & Coronel, Figure 13.17

Figure 3-4: Operational vs. multidimensional data

Operational View of Sales

  INVOICE_HEADER
  INV_NUM | INV_DATE    | CUST_NO
  2034    | 15-May-2007 | 12345 (Circuit Central)
  2035    | 15-May-2007 | 82739 (Small Bytes)
  2036    | 16-May-2007 | 12345 (Circuit Central)
  2037    | 16-May-2007 | 82739 (Small Bytes)
  2038    | 16-May-2007 | 35348 (Computer House)

  INVOICE_LINE
  INV_NUM | PROD_NUM                                | LINE_PRICE | LINE_QTY
  2034    | M34661 (Microsoft Wireless Mouse)       | $50.00     | 20
  2034    | D99280 (SanDisk USB Flash Drive, 8 GB)  | $30.00     | 10
  2035    | D44346 (Seagate Hard Drive, 2 TB)       | $165.00    | 6
  2036    | M34661 (Microsoft Wireless Mouse)       | $50.00     | 30
  2037    | M34661 (Microsoft Wireless Mouse)       | $50.00     | 10
  2037    | C74316 (D-Link 4-Port Ethernet Switch)  | $35.00     | 5
  2037    | D44346 (Seagate Hard Drive, 2 TB)       | $165.00    | 10
  2038    | S64371 (Creative Speaker System)        | $60.00     | 8

Two-dimensional View of Sales

  Customer Dimension | Time Dimension            | Totals
                     | 15-May-2007 | 16-May-2007 |
  Circuit Central    | $1,300.00   | $1,500.00   | $2,800.00
  Small Bytes        | $990.00     | $2,325.00   | $3,315.00
  Computer House     | -           | $480.00     | $480.00
  Totals             | $2,290.00   | $4,305.00   | $6,595.00

  Sales are located at the intersection of a customer row and a time column.
  Aggregations are calculated for both dimensions.

SOURCE: adapted from Coronel, Morris & Rob, Figure 13.5
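
The step from the operational view to the two-dimensional view in Figure 3-4 can be worked through in code. The invoice data below are transcribed from the figure; the variable names, and the use of plain dictionaries rather than a database, are choices made only for this sketch. Each line amount is LINE_PRICE times LINE_QTY, placed in the cell for the invoice's customer and date, with totals accumulated for both dimensions.

```python
from collections import defaultdict

# Data transcribed from Figure 3-4; variable names are hypothetical.
invoice_header = {  # inv_num -> (inv_date, customer)
    2034: ("15-May-2007", "Circuit Central"),
    2035: ("15-May-2007", "Small Bytes"),
    2036: ("16-May-2007", "Circuit Central"),
    2037: ("16-May-2007", "Small Bytes"),
    2038: ("16-May-2007", "Computer House"),
}
invoice_lines = [  # (inv_num, prod_num, line_price, line_qty)
    (2034, "M34661", 50.00, 20), (2034, "D99280", 30.00, 10),
    (2035, "D44346", 165.00, 6), (2036, "M34661", 50.00, 30),
    (2037, "M34661", 50.00, 10), (2037, "C74316", 35.00, 5),
    (2037, "D44346", 165.00, 10), (2038, "S64371", 60.00, 8),
]

cells = defaultdict(float)       # (customer, date) -> sales
row_totals = defaultdict(float)  # customer -> sales
col_totals = defaultdict(float)  # date -> sales

for inv_num, _prod_num, price, qty in invoice_lines:
    date, customer = invoice_header[inv_num]
    amount = price * qty
    cells[(customer, date)] += amount
    row_totals[customer] += amount
    col_totals[date] += amount

grand_total = sum(row_totals.values())

for customer in ("Circuit Central", "Small Bytes", "Computer House"):
    print(customer, dict(
        (date, cells.get((customer, date), 0.0)) for date in sorted(col_totals)
    ), row_totals[customer])
print("Totals", dict(col_totals), grand_total)
```

The computed cells and totals agree with the two-dimensional view in the figure, including the empty cell for Computer House on 15-May-2007.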