14. Data Warehousing & Data Mining
Data Warehousing Concepts
- Decision support is key for companies that want to turn their organizational data into an information asset
- Data warehouse: "A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process"
  - Subject-oriented
  - Integrated
  - Time-variant
  - Non-volatile
- Applications: OLAP (on-line analytical processing), DSS (decision support systems), EIS (executive information systems), and data mining
Benefits of Data Warehousing
- Potential high returns on investment
- Competitive advantage
- Increased productivity of corporate decision-makers
Comparison of OLTP and Data Warehousing
| OLTP systems | Data warehousing systems |
| --- | --- |
| Holds current data | Holds historic data |
| Stores detailed data | Stores detailed, lightly summarized, and highly summarized data |
| Data is dynamic | Data is largely static |
| Repetitive processing | Ad hoc, unstructured, and heuristic processing |
| High level of transaction throughput | Medium to low transaction throughput |
| Predictable pattern of usage | Unpredictable pattern of usage |
| Transaction driven | Analysis driven |
| Application oriented | Subject oriented |
| Supports day-to-day decisions | Supports strategic decisions |
| Serves a large number of clerical / operational users | Serves a relatively small number of managerial users |
Data Warehouse Architecture
- Operational Data
- Load Manager
- Warehouse Manager
- Query Manager
- Detailed Data
- Lightly and Highly Summarized Data
- Archive / Backup Data
- Meta-Data
- End-user Access Tools
End-user Access Tools
- Reporting and query tools
- Application development tools
- Executive Information System (EIS) tools
- Online Analytical Processing (OLAP) tools
- Data mining tools
Typical Architecture
[Figure: typical data warehouse architecture. Labeled components: operational data sources 1, 2, and 3; load manager; warehouse manager; query manager; DBMS; meta-data; detailed data; lightly summarized data; highly summarized data; archive/backup data; end-user access tools (reporting, query, and application development tools; OLAP tools; data mining tools).]
Data Warehousing Tools and Technologies
- Extraction, Cleansing, and Transformation Tools
- Data Warehouse DBMS
  - Load performance
  - Load processing
  - Data quality management
  - Query performance
  - Terabyte scalability
  - Networked data warehouse
  - Warehouse administration
  - Integrated dimensional tools
  - Advanced query functionality
Data Marts
- A subset of a data warehouse that supports the requirements of a particular department or business function
Designing Data Warehouses
- The Star Schema
  - A logical structure that has a fact table (containing factual data) in the center, surrounded by dimension tables (containing reference data)
- The Snowflake Schema
  - A variant of the star schema where each dimension can have its own dimensions
... Designing Data Warehouses
[Figure: star schema with the Sales fact table in the center, surrounded by four dimension tables]
- Sales (fact table): service_id, time_id, salesrep_id, customer_id, amount, commission, ...
- Service: service_id, service_name, service_type, service_group, ...
- Time: time_id, holiday, quarter, day_of_week, month, date, ...
- Customer: customer_id, name, revenue, credit_rate, new?, ...
- Salesrep: salesrep_id, name, region, manager_id, salary, hire_date, birth_date, ...
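The star schema above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the slides' own implementation: only a few columns per table are kept, the Time dimension is named time_dim here purely for readability, and all sample rows are invented. The query joins the fact table to two dimension tables and aggregates a measure, which is the typical access pattern for a star schema.

```python
import sqlite3


def build_star_schema():
    """Create an in-memory star schema with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables hold reference data.
        CREATE TABLE service  (service_id INTEGER PRIMARY KEY, service_name TEXT, service_type TEXT);
        CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, quarter TEXT, month TEXT);
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE salesrep (salesrep_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        -- The fact table holds measures plus one foreign key per dimension.
        CREATE TABLE sales (
            service_id  INTEGER REFERENCES service(service_id),
            time_id     INTEGER REFERENCES time_dim(time_id),
            salesrep_id INTEGER REFERENCES salesrep(salesrep_id),
            customer_id INTEGER REFERENCES customer(customer_id),
            amount      REAL,
            commission  REAL
        );
    """)
    conn.executemany("INSERT INTO service VALUES (?, ?, ?)",
                     [(1, "Hosting", "IT"), (2, "Consulting", "Advisory")])
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                     [(1, "Q1", "Jan"), (2, "Q2", "Apr")])
    conn.executemany("INSERT INTO customer VALUES (?, ?)",
                     [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO salesrep VALUES (?, ?, ?)",
                     [(1, "Kim", "North"), (2, "Lee", "South")])
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 1, 100.0, 10.0),
        (2, 1, 1, 2, 200.0, 20.0),
        (1, 2, 2, 1, 150.0, 15.0),
        (2, 2, 2, 2, 50.0, 5.0),
        (1, 1, 2, 2, 75.0, 7.5),
    ])
    return conn


def sales_by_region_and_quarter(conn):
    """Total sales amount per (region, quarter), via fact-to-dimension joins."""
    query = """
        SELECT r.region, t.quarter, SUM(s.amount)
        FROM sales AS s
        JOIN salesrep AS r ON r.salesrep_id = s.salesrep_id
        JOIN time_dim AS t ON t.time_id = s.time_id
        GROUP BY r.region, t.quarter
        ORDER BY r.region, t.quarter
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in sales_by_region_and_quarter(build_star_schema()):
        print(row)
```

Every analytical question follows the same shape: join the fact table to whichever dimensions are needed, group by dimension attributes, and aggregate a measure.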
Online Analytical Processing (OLAP)
- OLAP
  - The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
- Multi-dimensional OLAP
  - Cubes of data
  - Example cube dimensions: City, Product type, Time
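A data cube with the three dimensions named above (city, product type, time) can be sketched as a dictionary keyed by coordinate tuples. The cell values, the function names roll_up and slice_cube, and the member names are all hypothetical; the sketch only illustrates consolidation (rolling up over dimensions) and slicing, not any particular OLAP product.

```python
from collections import defaultdict

# Dimension order used for every cell key in the cube.
DIMENSIONS = ("city", "product_type", "time")

# Hypothetical cube cells: (city, product_type, time) -> sales amount.
CELLS = {
    ("London", "Flat", "Q1"): 10,
    ("London", "House", "Q1"): 20,
    ("London", "Flat", "Q2"): 15,
    ("Glasgow", "Flat", "Q1"): 5,
    ("Glasgow", "House", "Q2"): 25,
}


def roll_up(cells, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Keep only the cells whose coordinate on `dimension` equals `member`."""
    position = DIMENSIONS.index(dimension)
    return {c: v for c, v in cells.items() if c[position] == member}


if __name__ == "__main__":
    print(roll_up(CELLS, ("city",)))          # totals per city
    print(roll_up(CELLS, ("city", "time")))   # totals per city and quarter
    print(slice_cube(CELLS, "time", "Q1"))    # the Q1 slice of the cube
```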
Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration
Codd's Rules for OLAP
- Multi-dimensional conceptual view
- Transparency
- Accessibility
- Consistent reporting performance
- Client-server architecture
- Generic dimensionality
- Dynamic sparse matrix handling
- Multi-user support
- Unrestricted cross-dimensional operations
- Intuitive data manipulation
- Flexible reporting
- Unlimited dimensions and aggregation levels
OLAP Tools
- Multi-dimensional OLAP (MOLAP)
  - Multi-dimensional DBMS (MDDBMS)
- Relational OLAP (ROLAP)
  - Creation of multiple multi-dimensional views of the two-dimensional relations
- Managed Query Environment (MQE)
  - Delivers selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally
SQL Extensions
- Sample extensions
  - Decode
  - Cume
    - Example: show the monthly sales for each branch, along with the monthly year-to-date figures
  - MovingAvg(n)
  - MovingSum(n)
  - Rank When
    - Example: list the top 5 branches, based on last year's sales; sort by branch number
  - RatioToReport
  - Tertile
  - Create Macro
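The slides name these extensions without defining them, so the following Python sketch rests on assumed semantics: Cume is taken to be a running total, MovingSum(n) and MovingAvg(n) are taken to cover the current row and the n-1 rows before it, and Rank assigns 1 to the largest value. The function names and sample figures are hypothetical, and this is not the syntax of any SQL dialect.

```python
from itertools import accumulate


def cume(values):
    """Running (cumulative) total, e.g. year-to-date from monthly sales."""
    return list(accumulate(values))


def moving_sum(values, n):
    """Sum over a window of the current value and up to n-1 preceding values."""
    return [sum(values[max(0, i - n + 1): i + 1]) for i in range(len(values))]


def moving_avg(values, n):
    """Average over the same window as moving_sum (shorter at the start)."""
    result = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        result.append(sum(window) / len(window))
    return result


def rank(values):
    """Rank 1 is the largest value; equal values share the same rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def top_n_by_sales(sales_by_branch, n):
    """Branch numbers of the n highest sales figures, sorted by branch number."""
    top = sorted(sales_by_branch, key=sales_by_branch.get, reverse=True)[:n]
    return sorted(top)


if __name__ == "__main__":
    monthly = [10, 20, 30, 40]          # invented monthly sales for one branch
    print(cume(monthly))                # year-to-date figures
    print(moving_avg(monthly, 2))
    print(top_n_by_sales({3: 500, 1: 900, 2: 100, 4: 700}, 3))
```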
Data Mining
- Definition
  - The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions
- Knowledge discovery
  - Association rules
  - Sequential patterns
  - Classification trees
- Goals
  - Prediction
  - Identification
  - Classification
  - Optimization
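Association rules are usually evaluated with two standard measures, support and confidence, which the slide does not spell out. The sketch below computes both over a small set of invented market-basket transactions; the function names and item names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / base


# Invented transactions, each a set of purchased items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

if __name__ == "__main__":
    print(support(TRANSACTIONS, {"bread", "milk"}))       # support of the itemset
    print(confidence(TRANSACTIONS, {"bread"}, {"milk"}))  # rule bread -> milk
```

A rule is typically reported only when both measures exceed thresholds chosen by the analyst.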
Data Mining Techniques
- Predictive Modeling
  - Supervised training with two phases
    - Training phase: building a model using a large sample of historical data called the training set
    - Testing phase: trying the model on new data
- Database Segmentation
- Link Analysis
- Deviation Detection
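The two phases of predictive modeling can be illustrated with a deliberately tiny model, a one-feature threshold classifier (a decision stump). This is a swapped-in technique chosen for brevity, not one named in the slides, and the income figures and labels are invented. The training phase picks the threshold that classifies the training set best; the testing phase measures accuracy on data the model has not seen.

```python
def train_stump(training_set):
    """Training phase: choose threshold t so that 'value >= t' best predicts the label.

    training_set is a list of (value, label) pairs with boolean labels.
    """
    best_threshold, best_correct = None, -1
    for t in sorted({value for value, _ in training_set}):
        correct = sum((value >= t) == label for value, label in training_set)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


def accuracy(threshold, test_set):
    """Testing phase: fraction of unseen examples the model classifies correctly."""
    hits = sum((value >= threshold) == label for value, label in test_set)
    return hits / len(test_set)


# Invented historical data: (income in thousands, bought the product?).
TRAINING_SET = [(20, False), (25, False), (30, False),
                (40, True), (50, True), (60, True)]
# Invented new data, kept separate from the training set.
TEST_SET = [(22, False), (45, True), (35, True), (70, True)]

if __name__ == "__main__":
    threshold = train_stump(TRAINING_SET)
    print("learned threshold:", threshold)
    print("test accuracy:", accuracy(threshold, TEST_SET))
```

Keeping the test set separate matters: accuracy on the training set alone would overstate how well the model handles new data.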