Data Warehouse and OLAP

Similar documents
Overview of Data Warehousing and OLAP

Data W a Ware r house house and and OLAP Week 5 1

Chapter 3, Data Warehouse and OLAP Operations

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

DATA WAREHOUSING AND OLAP TECHNOLOGY

Data Warehousing and Online Analytical Processing

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Data Warehousing & OLAP

Data W a Ware r house house and and OLAP II Week 6 1

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

CHAPTER 3. Data Warehouses and OLAP

Data Warehouse. MIT-652 Data Mining Applications. Thimaporn Phetkaew. School of Informatics, Walailak University. MIT-652: DM 2: Data Warehouse 1

Data Warehousing & OLAP

Data Warehousing and OLAP Technology

2 Data Warehouse and OLAP Technology for Data Mining What is a data warehouse? Amultidimensional data model... 6

DATA WAREHOUSING - OLAP

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Data Mining for Knowledge Management. Data Warehouses


Basics of Dimensional Modeling

CHAPTER 4 Data Warehouse Architecture

Lecture 2 Data warehousing

Building Data Cubes and Mining Them. Jelena Jovanovic

This tutorial will help computer science graduates to understand the basic-toadvanced concepts related to data warehousing.

Data Warehousing & OLAP

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Week 3 lecture slides

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

Integrate multiple, heterogeneous data sources. Data cleaning and data integration techniques are applied

Data Warehousing and OLAP Technology for Knowledge Discovery

Week 13: Data Warehousing. Warehousing

Lecture Data Warehouse Systems

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

New Approach of Computing Data Cubes in Data Warehousing

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Learning Objectives. Definition of OLAP Data cubes OLAP operations MDX OLAP servers

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Fluency With Information Technology CSE100/IMT100

Part 22. Data Warehousing

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

What is OLAP - On-line analytical processing

Data Warehousing and OLAP

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

When to consider OLAP?

Data Warehousing and Data Mining in Business Applications

Lection 3-4 WAREHOUSING

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

Mario Guarracino. Data warehousing

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

(Week 10) A04. Information System for CRM. Electronic Commerce Marketing

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Module 1: Introduction to Data Warehousing and OLAP

Why Business Intelligence

A Technical Review on On-Line Analytical Processing (OLAP)

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

Turkish Journal of Engineering, Science and Technology

Data Warehousing and elements of Data Mining

Data Warehousing, OLAP, and Data Mining

The strategic importance of OLAP and multidimensional analysis A COGNOS WHITE PAPER

A Critical Review of Data Warehouse

Dimensional Modeling for Data Warehouse

14. Data Warehousing & Data Mining

Designing a Dimensional Model

Data Warehouse: Introduction

John R. Vacca INSIDE

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

B.Sc (Computer Science) Database Management Systems UNIT-V

Data Warehousing. Yeow Wei Choong Anne Laurent

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

TIES443. Lecture 3: Data Warehousing. Lecture 3. Data Warehousing. Course webpage:

Introduction. Introduction to Data Warehousing

Lecture 2: Introduction to Business Intelligence. Introduction to Business Intelligence

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

1960s 1970s 1980s 1990s. Slow access to

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

Data Warehousing and Data Mining

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

DATA WAREHOUSE CONCEPTS DATA WAREHOUSE DEFINITIONS

Namrata 1, Dr. Saket Bihari Singh 2 Research scholar (PhD), Professor Computer Science, Magadh University, Gaya, Bihar

OLAP and Data Warehousing! Introduction!

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

Chapter 7 Multidimensional Data Modeling (MDDM)

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

Data Mining. Session 4 Main Theme Data Warehousing and OLAP. Dr. Jean-Claude Franchitti

A Design and implementation of a data warehouse for research administration universities

Data Warehousing Systems: Foundations and Architectures

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

Data Mart/Warehouse: Progress and Vision

Visual Data Mining in Indian Election System

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

CHAPTER 5: BUSINESS ANALYTICS

Transcription:

Data Warehouse and OLAP Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation and can be viewed as an important preprocessing step for data mining. Data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Data Mining 1

What is a Data Warehouse? Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A decision support database that is maintained separately from the organization s operational database Support information processing by providing a solid platform of consolidated, historical data for analysis. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management s decision-making process. Data warehousing: The process of constructing and using data warehouses Data Mining 2

Data Warehouse - Subject-Oriented Organized around major subjects, such as customer, product, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process Data Mining 3

Data Warehouse - Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. Data Mining 4

Data Warehouse - Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain time element Data Mining 5

Data Warehouse - Nonvolatile A physically separate store of data transformed from the operational environment Operational update of data does not occur in the data warehouse environment Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data Data Mining 6

Operational Database Systems and Data Warehouses The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems. Data Mining 7

OLTP vs. OLAP Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model and a subject oriented database design. View: An OLTP system focuses mainly on the current data. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations Data Mining 8

Why a Separate Data Warehouse? High performance for both systems DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation Different functions and different data: missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled Note: There are more and more systems which perform OLAP analysis directly on relational databases Data Mining 9

Multidimensional Data Model: Data Cube Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example, a dimension table for item may contain the attributes item name, brand, and type. Facts are numerical measures. Examples of facts for a sales data warehouse include dollars sold units sold Data Mining 10

Data Cube: A 2-D Data Cube Although we usually think of cubes as 3-D geometric structures, in data warehousing the data cube is n-dimensional. A 2-D data cube: dimensions time and item, the measure displayed (fact) is dollars_sold3.2. In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). Data Mining 11

Data Cube: A 3-D Data Cube A 3-D data cube representation of the data according to the dimensions time, item, and location. The measure displayed is dollars_sold. Data Mining 12

Data Cube: A Lattice of Cuboids A lattice of cuboids, making up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization. The cuboid that holds the lowest level of summarization is called the base cuboid. the base cuboid for the given time, item, location, and supplier dimensions A 3-D (nonbase) cuboid for time, item, and location, summarized for all suppliers. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. this is the total sales summarized over all four dimensions. The apex cuboid is typically denoted by all. Data Mining 13

Conceptual Modeling of Data Warehouses The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional model. A multidimensional model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Data Mining 14

Multidimensional Model - Star Schema The most common modeling paradigm is the star schema, in which the data warehouse contains 1. a large central table (fact table) containing the bulk of the data, with no redundancy, and 2. a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table. In the star schema, each dimension is represented by only one table, and each table contains a set of attributes. Data Mining 15

Star Schema Example time time_key day day_of_the_week month quarter year branch branch_key branch_name branch_type Measures Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales item item_key item_name brand type supplier_type location location_key street city state_or_province country Data Mining 16

Multidimensional Model Snowflake Schema The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Data Mining 17

Snowflake Schema Example time time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key item item_key item_name brand type supplier_key supplier supplier_key supplier_type branch branch_key branch_name branch_type Measures branch_key location_key units_sold dollars_sold avg_sales location location_key street city_key city city_key city state_or_province country Data Mining 18

Multidimensional Model Fact Constellation Schema Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation schema. Data Mining 19

Fact Constellation Schema Example time time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key item item_key item_name brand type supplier_type Shipping Fact Table time_key item_key shipper_key from_location branch branch_key branch_name branch_type Measures location_key units_sold dollars_sold avg_sales location location_key street city province_or_state country to_location dollars_cost units_shipped shipper shipper_key shipper_name location_key shipper_type Data Mining 20

Concept Hierarchies A concept hierarchy defines a sequence of mappings from a set of lowlevel concepts to higher-level, more general concepts. Many concept hierarchies are implicit within the database schema. Concept hierarchies may be provided manually by system users, domain experts, or knowledge engineers, or may be automatically generated based on statistical analysis of the data distribution. A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy. Concept hierarchies may also be defined by discretizing or grouping values for a given dimension, resulting in a set-grouping hierarchy. A total or partial order can be defined among groups of values. Data Mining 21

Concept Hierarchies A concept hierarchy for the dimension location Data Mining 22

Concept Hierarchies: Hierarchical and lattice structures of attributes in warehouse dimensions a hierarchy for location a lattice for time. Data Mining 23

Concept Hierarchies: A concept hierarchy for the attribute price Data Mining 24

OLAP Operations How are concept hierarchies useful in OLAP? In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Data Mining 25

Typical OLAP Operations Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes Data Mining 26

OLAP Operation: Roll-up The roll-up operation (also called the drill-up operation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Data Mining 27

OLAP Operation: Drill-down Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. Data Mining 28

OLAP Operation: Slice and Dice The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The dice operation defines a subcube by performing a selection on two or more dimensions. Data Mining 29

Pivot (rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. OLAP Operation: Pivot Data Mining 30

Data Warehouse Design Process Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record Data Mining 31

A Three-Tier DataWarehouse Architecture Data warehouses often adopt a three-tier architecture Data Mining 32

Data Warehouse Usage Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools Data Mining 33

From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM) Why online analytical mining? High quality of data in data warehouses DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis Mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions Integration and swapping of multiple mining functions, algorithms, and tasks Data Mining 34

An integrated OLAM and OLAP architecture An OLAM server may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, and it usually consists of multiple integrated data mining modules and is more sophisticated than an OLAP server. Data Mining 35

Summary A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data organized in support of management decision making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases. A multidimensional data model is typically used for the design of corporate data warehouses. A multidimensional data model can adopt a star schema, snowflake schema, or fact constellation schema. The core of the multidimensional model is the data cube, which consists of a large set of facts (or measures) and a number of dimensions. Dimensions are the entities or perspectives with respect to which an organization wants to keep records and are hierarchical in nature. A data cube consists of a lattice of cuboids, each corresponding to a different degree of summarization of the given multidimensional data. Data Mining 36

Summary Concept hierarchies organize the values of dimensions into gradual levels of abstraction. On-line analytical processing (OLAP) can be performed in data warehouses using the multidimensional data model. Typical OLAP operations include roll-up, drill-down,, slice-and-dice, pivot (rotate), as well as statistical operations such as ranking and computing moving averages and growth rates. Data warehouses often adopt a three-tier architecture. The bottom tier is a warehouse database server, which is a relational database system. The middle tier is an OLAP server, and The top tier is a client, containing query and reporting tools. A data warehouse contains back-end tools and utilities for populating and refreshing the warehouse. data extraction, data cleaning, data transformation, loading, refreshing, and warehouse management. Data warehouse metadata are data defining the warehouse objects. A metadata repository provides details regarding the warehouse structure, data history, the algorithms used for summarization, mappings from the source data to warehouse form. Data Mining 37