2 OLAP and Data Cubes

This chapter reviews On-line Analytical Processing (OLAP) in Section 2.1 and data cubes in Section 2.2.

2.1 OLAP

Coined by Codd et al. [18] in 1993, OLAP stands for On-Line Analytical Processing. The concept has its roots in earlier products such as IRI Express, the Comshare system, and the Essbase system [67]. Unlike statistical databases, which usually store census data and economic data, OLAP is mainly used for analyzing business data collected from daily transactions, such as sales data and health care data [65]. The main purpose of an OLAP system is to enable analysts to construct a mental image of the underlying data by exploring it from different perspectives, at different levels of generalization, and in an interactive manner.

As a component of decision support systems, OLAP interacts with other components, such as data warehouses and data mining, to assist analysts in making business decisions. A data warehouse usually stores data collected from multiple data sources, such as transactional databases throughout an organization. The data are cleaned and transformed to a common consistent format before they are stored in the data warehouse. Subsets of the data in a data warehouse can be extracted as data marts to meet the specific requirements of an organizational division. Unlike in transactional databases, where data are constantly updated, the data stored in a data warehouse are typically refreshed from the data sources only periodically.

OLAP and data mining both allow analysts to discover novel knowledge about the data stored in a data warehouse. Data mining algorithms automatically produce knowledge in a pre-defined form, such as association rules or classifications. OLAP does not directly generate such knowledge, but instead relies on human analysts to observe it by interpreting the query results. On the other hand, OLAP is more flexible than data mining in the sense that analysts may obtain all kinds of patterns and trends rather than only knowledge of fixed forms. OLAP and data mining can also be combined to enable analysts to obtain data mining results from different portions of the data and at different levels of generalization [39].

In a typical OLAP session, the analyst poses aggregation queries about the underlying data. The OLAP system can usually return the result in a matter of seconds, even though the query may involve a large number of records. Based on the results, the analyst may decide to roll up to coarser-grained data in order to observe global patterns and trends. Upon observing an exception to an established pattern, the analyst may drill down to finer-grained data with more details to catch the outliers. Such a process is repeated in different portions of the data by slicing or dicing the data, until a satisfactory mental image of the data has been constructed.

The requirements on OLAP systems have been defined in different ways, for example by the FASMI (Fast Analysis of Shared Multidimensional Information) test [58] and the Codd rules [18]. Some of the requirements are unique to OLAP. First, to make OLAP analysis an interactive process, the OLAP system must be highly efficient in answering queries. OLAP systems usually rely on extensive pre-computation, indexing, and specialized storage to improve performance. Second, to allow analysts to explore the data from different perspectives and at different levels of generalization, OLAP organizes and generalizes data along multiple dimensions and dimension hierarchies. The data cube model we shall address shortly is one of the most popular abstract models for this purpose.

The data to be analyzed by OLAP are usually stored based on the relational model in the backend data warehouse. The data are organized based on a star schema. Figure 2.1 shows an example of a star schema.
It has a fact table (timeid, orgid, commission), where the first two attributes, timeid and orgid, are called dimensions, and commission is called a measure. Each dimension has an associated dimension table that describes a dimension hierarchy. The dimension tables may contain redundancy, which can be removed by splitting each dimension table into multiple tables, one per attribute in the dimension table. The result is called a snowflake schema, as illustrated in Figure 2.2.

Fig. 2.1. An Example of Star Schema
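
As a minimal sketch of how such a schema is queried, the following Python snippet builds a small star schema in an in-memory SQLite database and rolls the measure up along both dimension hierarchies. The figure itself is not reproduced here, so the dimension table names (time_dim, org_dim), their attributes (month, year, branch, department), and all data values are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: each row places one key in a hierarchy.
    -- Table and attribute names are assumed, not taken from Figure 2.1.
    CREATE TABLE time_dim (timeid INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE org_dim  (orgid  INTEGER PRIMARY KEY, branch TEXT, department TEXT);

    -- Fact table: two dimension keys plus one measure.
    CREATE TABLE fact (
        timeid     INTEGER REFERENCES time_dim(timeid),
        orgid      INTEGER REFERENCES org_dim(orgid),
        commission INTEGER
    );

    -- Made-up rows.
    INSERT INTO time_dim VALUES (1, '2005-01', 2005), (2, '2005-02', 2005), (3, '2006-01', 2006);
    INSERT INTO org_dim  VALUES (10, 'North', 'Sales'), (11, 'South', 'Sales');
    INSERT INTO fact     VALUES (1, 10, 100), (2, 10, 150), (2, 11, 200), (3, 11, 50);
""")

# Roll up from (timeid, orgid) to (year, branch) by joining the fact
# table with both dimension tables and grouping on the coarser attributes.
by_year_branch = conn.execute("""
    SELECT t.year, o.branch, SUM(f.commission)
    FROM fact f
    JOIN time_dim t ON f.timeid = t.timeid
    JOIN org_dim  o ON f.orgid  = o.orgid
    GROUP BY t.year, o.branch
    ORDER BY t.year, o.branch
""").fetchall()

# Roll up further: drop the organization dimension entirely.
by_year = conn.execute("""
    SELECT t.year, SUM(f.commission)
    FROM fact f JOIN time_dim t ON f.timeid = t.timeid
    GROUP BY t.year
    ORDER BY t.year
""").fetchall()

print(by_year_branch)
print(by_year)
```

In this sketch the department value is repeated for every orgid; a snowflake schema would move such repeated attributes into tables of their own.
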

Fig. 2.2. An Example of Snowflake Schema

Popular architectures of OLAP systems include ROLAP (relational OLAP) and MOLAP (multidimensional OLAP). ROLAP provides a front-end tool that translates multidimensional queries into corresponding SQL queries to be processed by the relational backend. ROLAP is thus lightweight and scalable to large data sets, whereas its performance is constrained because optimization techniques in the relational backend are typically not designed for multidimensional queries. MOLAP does not rely on the relational model but instead materializes the multidimensional views. MOLAP can thus provide better performance with the materialized and optimized multidimensional views. However, MOLAP demands substantial storage for materializing the views and is usually not scalable to large datasets due to the multidimensional explosion problem [57]. Using MOLAP for dense parts of the data and ROLAP for the others leads to a hybrid architecture, namely, HOLAP or hybrid OLAP.

2.2 Data Cube

The data cube was proposed as a SQL operator to support common OLAP tasks like histograms (that is, aggregation over computed categories) and subtotals [37]. Even though such tasks are usually possible with standard SQL queries, the queries may become very complex. For example, Table 2.1 shows a simple relation comm, where employee, quarter, and location are the three dimensions, and commission is a measure. We call comm a base relation since it contains the ground facts to be analyzed. Suppose analysts are interested in the sub-total commissions in the two-dimensional cross table in Table 2.2 (other possible ways of visually representing such subtotals are discussed in [37]). The inner table includes the quarterly commission for each employee; below the inner table are the total commissions for each employee; to the right are the total commissions in each quarter; at the bottom right corner is the total commission of all employees in the two quarters.

    employee   quarter   location
    Alice      Q1        Domestic
    Alice      Q1        International
    Bob        Q1        Domestic
    Mary       Q1        Domestic
    Mary       Q1        International
    Bob        Q2        International
    Mary       Q2        Domestic
    Jim        Q2        Domestic

Table 2.1. The Base Relation comm [values of the commission column not recoverable]

Table 2.2. A Two-dimensional Cross Table [rows: Q1, Q2, total (ALL); columns: Alice, Bob, Mary, Jim, total (ALL); cell values not recoverable]

The sub-totals in Table 2.2 can be computed using SQL queries. However, we need to union four GROUP BY queries as follows:

    SELECT employee, quarter, SUM(commission)
    FROM comm
    GROUP BY employee, quarter
    UNION
    SELECT 'ALL', quarter, SUM(commission)
    FROM comm
    GROUP BY quarter
    UNION
    SELECT employee, 'ALL', SUM(commission)
    FROM comm
    GROUP BY employee
    UNION
    SELECT 'ALL', 'ALL', SUM(commission)
    FROM comm

The number of needed unions is exponential in the number of dimensions: with n dimensions, 2^n GROUP BY queries are required. Such a complex query may result in many scans of the base table, leading to poor performance. Because such sub-totals are very common in OLAP queries, it is desirable to define a new operator for the collection of such sub-totals, namely, the data cube.

A data cube is essentially the generalization of the cross table illustrated in Table 2.2. The generalization happens from several perspectives. First, a data cube can be n-dimensional. Table 2.3 shows a three-dimensional data cube built from the base relation comm. Based on the ALL values, the data cube is divided into eight parts, namely, cuboids. The first cuboid is a three-dimensional cube, usually called the core cuboid. The next three cuboids have one ALL value and are the two-dimensional planes. The next three are the one-dimensional lines. The last cuboid has a single value and is a zero-dimensional point.

The second perspective of the generalization is the aggregation function. The aggregation discussed so far is SUM. In general any aggregation function, including customized ones, can be used to construct a data cube. Those functions can be classified into three categories: the distributive, the algebraic, and the holistic. Let $I$ be a set of values, and let $P(I) = \{p_1, p_2, \ldots, p_n\}$ be any partition of $I$. Then an aggregation function $F()$ is distributive if there exists a function $G()$ such that $F(I) = G(\{F(p_i) : 1 \le i \le n\})$. It is straightforward to verify that SUM, COUNT, MIN, and MAX are distributive. The algebraic aggregations have a similar property, obtained by letting the function applied to each part return an m-vector: $F()$ is algebraic if there exist a function $G()$ returning an m-vector and a function $H()$ such that $F(I) = H(\{G(p_i) : 1 \le i \le n\})$. For example, let $G() = \langle \mathrm{SUM}(), \mathrm{COUNT}() \rangle$ and let $H()$ divide the total of the sums by the total of the counts; then AVERAGE is clearly an algebraic function. However, a holistic function like MEDIAN cannot be evaluated on $P(I)$ with any $G()$ that returns a vector of constant length. The significance of distinguishing these three types of aggregation functions lies in the difficulty of computing a data cube.
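
The three categories can be checked on a small example. The sketch below uses made-up values and an arbitrary partition (both are assumptions, not data from the tables in this chapter): SUM is recombined from per-part sums, AVERAGE from per-part (SUM, COUNT) pairs, and MEDIAN is shown not to be recoverable from per-part medians.

```python
from statistics import median

# Made-up measure values and an arbitrary partition of them.
values = [200, 1200, 500, 800, 300, 700]
partition = [values[:2], values[2:5], values[5:]]

# Distributive: SUM over the whole set equals the SUM of the per-part SUMs.
combined_sum = sum(sum(part) for part in partition)

# Algebraic: AVERAGE is not the average of per-part averages, but it can be
# recovered from a constant-size summary per part, here the pair (SUM, COUNT).
pairs = [(sum(part), len(part)) for part in partition]
total = sum(s for s, _ in pairs)
count = sum(c for _, c in pairs)
average = total / count
average_of_averages = sum(s / c for s, c in pairs) / len(pairs)

# Holistic: the median of the per-part medians differs from the true median,
# which illustrates why a single summary value per part is not enough.
median_of_medians = median(median(part) for part in partition)
true_median = median(values)

print(combined_sum, sum(values))
print(average, average_of_averages)
print(median_of_medians, true_median)
```

The last line is only a counterexample for one particular summary; it illustrates, but does not prove, that no constant-size summary works for MEDIAN.
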
Because the cuboids in a data cube form a hierarchy of aggregation, a cuboid can be more easily computed from other cuboids for distributive and algebraic functions. However, for a holistic function, any cuboid must be computed directly from the base table.

The third perspective of the generalization is the dimension hierarchy. For the data cubes shown in Table 2.2 and Table 2.3, each dimension is a two-level pure hierarchy. For example, the employee dimension has basically two attributes, employee and ALL (ALL should be regarded as both an attribute and its only value). In general, each dimension can have many attributes, such as the example in Figure 2.1. The attributes of a dimension may form a lattice instead of a pure hierarchy, such as day, week, month, and year (week and month are incomparable). The attribute in the base table is the lower bound of the lattice, and ALL (regarded as an attribute) is the upper bound. The product of the dimension lattices is still a lattice, as illustrated in Figure 2.3. This lattice is essentially the schema of a data cube.

The lattice structure has played an important role in many aspects of data cubes. For example, materializing the whole data cube usually incurs prohibitive costs in computation and storage. Moreover, materializing a cuboid does not always bring much benefit to answering queries. In Figure 2.3, the core cuboid must be materialized because it cannot be computed from others. However, if any of the cuboids with one ALL value has a size comparable to that of the core cuboid, then it may not need to be materialized, because answering a query using that cuboid incurs costs similar to using the core cuboid instead (the cost of answering a query is roughly proportional to the size of the cuboid being used). Greedy algorithms have been proposed to find an optimal materialization of a data cube under resource constraints [40].

    employee   quarter   location
    Alice      Q1        Domestic
    Alice      Q1        International
    Bob        Q1        Domestic
    Mary       Q1        Domestic
    Mary       Q1        International
    Bob        Q2        International
    Mary       Q2        Domestic
    Jim        Q2        Domestic

    Alice      Q1        ALL
    Bob        Q1        ALL
    Mary       Q1        ALL
    Bob        Q2        ALL
    Mary       Q2        ALL
    Jim        Q2        ALL

    Alice      ALL       Domestic
    Alice      ALL       International
    Bob        ALL       Domestic
    Mary       ALL       Domestic
    Mary       ALL       International
    Bob        ALL       International
    Jim        ALL       Domestic

    ALL        Q1        Domestic
    ALL        Q1        International
    ALL        Q2        International
    ALL        Q2        Domestic

    Alice      ALL       ALL
    Bob        ALL       ALL
    Mary       ALL       ALL
    Jim        ALL       ALL

    ALL        Q1        ALL
    ALL        Q2        ALL

    ALL        ALL       Domestic
    ALL        ALL       International

    ALL        ALL       ALL

Table 2.3. A Three-Dimensional Data Cube [values of the commission column not recoverable]
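
The eight cuboids of a three-dimensional cube such as Table 2.3 correspond to the eight subsets of the dimensions, so they can be enumerated with one GROUP BY query per subset. The following sketch does this with Python and an in-memory SQLite database. The rows follow the base relation comm, but the commission values are made up for illustration and are not those of Table 2.1; the helper name cuboid is likewise an assumption.

```python
import sqlite3
from itertools import combinations

DIMS = ("employee", "quarter", "location")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comm (employee TEXT, quarter TEXT, location TEXT, commission INTEGER)"
)
# Same dimension values as the base relation; commission values are made up.
conn.executemany("INSERT INTO comm VALUES (?, ?, ?, ?)", [
    ("Alice", "Q1", "Domestic", 100),
    ("Alice", "Q1", "International", 200),
    ("Bob", "Q1", "Domestic", 300),
    ("Mary", "Q1", "Domestic", 400),
    ("Mary", "Q1", "International", 500),
    ("Bob", "Q2", "International", 600),
    ("Mary", "Q2", "Domestic", 700),
    ("Jim", "Q2", "Domestic", 800),
])

def cuboid(group_by):
    """Return one cuboid: SUM(commission) grouped by the dimensions in
    `group_by`, with every other dimension replaced by the value 'ALL'."""
    select = ", ".join(d if d in group_by else "'ALL'" for d in DIMS)
    sql = f"SELECT {select}, SUM(commission) FROM comm"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return conn.execute(sql).fetchall()

# One query per subset of the dimensions: 2^3 = 8 cuboids in total,
# from the core cuboid (all three dimensions) down to the single grand total.
cube = {}
for k in range(len(DIMS), -1, -1):
    for group_by in combinations(DIMS, k):
        cube[group_by] = cuboid(group_by)

for group_by, rows in cube.items():
    print(group_by, len(rows))
```

Each query scans the base table again, which is exactly the cost that motivates computing cuboids from one another rather than from the base table.
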

    (ALL, ALL, ALL)
    (ALL, quarter, ALL)   (employee, ALL, ALL)   (ALL, ALL, location)
    (employee, quarter, ALL)   (employee, ALL, location)   (ALL, quarter, location)
    (employee, quarter, location)

Fig. 2.3. The Lattice Structure of Data Cube

Even if the whole data cube needs to be computed, the lattice structure can help to improve the performance of such computations. For example, if the computation is based on sorting the records, then the core cuboid can be sorted in three different ways (by any two of the three attributes). Each choice will simplify the computation of one of the three cuboids with one ALL value because that cuboid can be computed without additional sorting. However, computing the other two cuboids will require the core cuboid to be re-sorted. Based on estimated costs, algorithms exist to make the optimal choice in sorting each cuboid, and those choices can be linked to form pipelines of computation so that the need for re-sorting cuboids can be reduced [2].
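
The effect of the sort order can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions (SUM as the aggregation function, made-up commission values, and a hypothetical helper aggregate_prefix); it is not the algorithm of [2]. After one sort by (employee, quarter, location), every cuboid whose grouping attributes form a prefix of that order is computed in a single pass over its parent, whereas (employee, ALL, location) needs a re-sort.

```python
from itertools import groupby

# Core cuboid rows: (employee, quarter, location, commission).
# The commission values are made up for illustration.
core = [
    ("Alice", "Q1", "Domestic", 100),
    ("Alice", "Q1", "International", 200),
    ("Bob", "Q1", "Domestic", 300),
    ("Mary", "Q1", "Domestic", 400),
    ("Mary", "Q1", "International", 500),
    ("Bob", "Q2", "International", 600),
    ("Mary", "Q2", "Domestic", 700),
    ("Jim", "Q2", "Domestic", 800),
]

def aggregate_prefix(rows, k):
    """Sum the measure over runs of rows that agree on their first k
    attributes. `rows` must already be sorted on that prefix."""
    out = []
    for key, run in groupby(rows, key=lambda r: r[:k]):
        out.append(key + ("ALL",) * (3 - k) + (sum(r[3] for r in run),))
    return out

# Sort once by (employee, quarter, location) ...
core_sorted = sorted(core, key=lambda r: r[:3])

# ... then pipeline: each cuboid is computed from the previous one
# without any further sorting, because SUM is distributive.
emp_qtr = aggregate_prefix(core_sorted, 2)   # (employee, quarter, ALL)
emp = aggregate_prefix(emp_qtr, 1)           # (employee, ALL, ALL)
total = aggregate_prefix(emp, 0)             # (ALL, ALL, ALL)

# (employee, ALL, location) is not a prefix of the chosen order: Mary's two
# Domestic rows are separated by an International row, so a re-sort is needed.
resorted = sorted(core, key=lambda r: (r[0], r[2]))
emp_loc = [
    (e, "ALL", loc, sum(r[3] for r in run))
    for (e, loc), run in groupby(resorted, key=lambda r: (r[0], r[2]))
]

print(emp_qtr)
print(emp)
print(total)
print(emp_loc)
```
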

http://www.springer.com/978-0-387-46273-8