DB Sys. Design & Impl.: Data Cubes. Christos Faloutsos


15-721 DB Sys. Design & Impl.
Data Cubes
Christos Faloutsos
www.cs.cmu.edu/~christos

Roadmap
1) Roots: System R and Ingres
2) Implementation: buffering, indexing, q-opt
3) Transactions: locking, recovery
4) Distributed DBMSs
5) Parallel DBMSs: Gamma, Alphasort
6) OO/OR DBMS
7) Data Analysis, data mining
   - data cubes
   - association rules
8) Benchmarks
9) vision statements
extras (streams/sensors, graphs, multimedia, web, fractals)

15-721 C. Faloutsos

Detailed Outline
- Problem
- Getting the data: Data Warehouses, DataCubes, OLAP
- Supervised learning: decision trees
- Unsupervised learning
  - association rules
  - (clustering)

Citation
Gray, et al.: "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals." Data Mining and Knowledge Discovery 1(1): 29-53 (1997)

Problem
Given: multiple data sources
Find: patterns (classifiers, rules, clusters, outliers...)
[figure: data sources at PGH, NY, SF, feeding an unknown "???"]
  sales(pid, cid, date, $price)
  customers(cid, age, income, ...)

Data Warehousing
First step: collect the data, in a single place (= Data Warehouse)
- How?
- How often?
- How about discrepancies / non-homogeneities?

Data Warehousing
First step: collect the data, in a single place (= Data Warehouse)
- How? A: Triggers/Materialized views
- How often? A: [Art!]
- How about discrepancies / non-homogeneities? A: Wrappers/Mediators

Data Warehousing
Step 2: collect counts. (DataCubes/OLAP) Eg.:

OLAP
Problem: is it true that shirts in large sizes sell better in dark colors?

sales
  cid   pid     Size   Color   $
  C10   Shirt   L      Blue    30
  C10   Pants   XL     Red     50
  C20   Shirt   XL     White   20
  ...

size, color: DIMENSIONS
count: MEASURE
[figure: counts of shirts per size and color, built up over several slides]

DataCube
size, color: DIMENSIONS
count: MEASURE
[figure: DataCube over the dimensions size and color, with measure count]

SQL query to generate DataCube:
Naively (and painfully:)

  select size, color, count(*)
  from sales
  where pid = 'shirt'
  group by size, color

  select size, count(*)
  from sales
  where pid = 'shirt'
  group by size
  ...

SQL query to generate DataCube:
with 'cube by' keyword:

  select size, color, count(*)
  from sales
  where pid = 'shirt'
  cube by size, color

DataCube issues:
Q1: How to store them (and/or materialize portions on demand)
Q2: How to index them
Q3: Which operations to allow
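The 'cube by' query above asks for one count per combination of size and color, plus every subtotal. As a rough sketch of what that result contains (not the paper's algorithm), the Python below computes the same cells in one pass. The rows in `sales`, the `ALL` marker, and the function name `cube_counts` are hypothetical, chosen only for illustration. Many SQL systems spell the same request `GROUP BY CUBE (size, color)`.

```python
from collections import Counter
from itertools import product

ALL = "all"  # marker meaning "aggregated over this dimension"

# Toy fact table (hypothetical rows): (pid, size, color)
sales = [
    ("shirt", "L", "Blue"),
    ("shirt", "L", "Blue"),
    ("shirt", "XL", "White"),
    ("shirt", "M", "Blue"),
    ("pants", "XL", "Red"),
]

def cube_counts(rows, dims):
    """Count rows for every combination of dimension values and ALL."""
    counts = Counter()
    for row in rows:
        values = [row[d] for d in dims]
        # Each row contributes to 2^len(dims) cells: for every dimension,
        # either its actual value or the ALL marker.
        for cell in product(*[(v, ALL) for v in values]):
            counts[cell] += 1
    return counts

# Equivalent of "where pid = 'shirt'" followed by "cube by size, color".
shirts = [r for r in sales if r[0] == "shirt"]
cube = cube_counts(shirts, dims=(1, 2))

for cell in sorted(cube):
    print(cell, cube[cell])
```

The naive approach on the slide needs one group-by query per subset of dimensions; the sketch makes the same point from the other side, since every input row lands in all four (size/all, color/all) cells at once.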

DataCube issues:
Q1: How to store them (and/or materialize portions on demand) A: ROLAP/MOLAP
Q2: How to index them A: bitmap indices
Q3: Which operations to allow A: rollup, drill down, slice, dice
[More details: book by Han+Kamber]

Q1: How to store a datacube?
A1: Relational (ROLAP)

  Color   Size    count
  'all'   'all'   47
  Blue    'all'   4
  Blue    M       3

A2: Multidimensional (MOLAP)
A3: Hybrid (HOLAP)

Pros/Cons: ROLAP
strong points (DSS, Metacube):
- use existing RDBMS technology
- scale up better with dimensionality
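To make the ROLAP layout above concrete, here is a minimal Python sketch in which the cube is kept as plain (color, size, count) rows with 'all' standing for an aggregated dimension. With that layout, rollup, drill down, slice and dice reduce to row lookups and filters, which is one reason an existing RDBMS can serve them. The counts in `cube_rows` and all function names are hypothetical.

```python
ALL = "all"

# Hypothetical cube rows in the ROLAP layout: (color, size, count).
cube_rows = [
    (ALL, ALL, 6),
    ("Blue", ALL, 4), ("Red", ALL, 2),
    (ALL, "M", 3), (ALL, "L", 3),
    ("Blue", "M", 3), ("Blue", "L", 1),
    ("Red", "L", 2),
]

def lookup(color=ALL, size=ALL):
    """Return the stored count of one cell; empty cells are not stored."""
    for c, s, n in cube_rows:
        if (c, s) == (color, size):
            return n
    return 0

def drill_down_size(color):
    """Drill down: expand the (color, all) cell into its per-size cells."""
    return {s: n for c, s, n in cube_rows if c == color and s != ALL}

def slice_color(color):
    """Slice: fix one dimension to a single value."""
    return [(c, s, n) for c, s, n in cube_rows if c == color]

def dice(colors, sizes):
    """Dice: keep the sub-cube for chosen value sets on several dimensions."""
    return [(c, s, n) for c, s, n in cube_rows if c in colors and s in sizes]

# Rollup: move from the detailed cell (Blue, M) up to the total for Blue.
print(lookup("Blue", "M"), "->", lookup("Blue"))
print(drill_down_size("Blue"))
print(slice_color("Red"))
print(dice({"Blue", "Red"}, {"L"}))
```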

Pros/Cons: MOLAP
strong points (EssBase/hyperion.com):
- faster indexing
(careful with: high-dimensionality; sparseness)

HOLAP (MS SQL server OLAP services):
- detail data in ROLAP; summaries in MOLAP

Q1: How to store a datacube?
Q2: Which operations to allow?
Q3: How to index a datacube?

Rollup
[figure]

Drilldown
[figure]

Slice
[figure]

Dice
[figure]

Operations:
- Rollup (aggregate to a coarser level)
- Drilldown (the reverse: expand a total into finer detail)
- Slice (fix one dimension to a single value)
- Dice (select a sub-cube on several dimensions)

Q1: How to store a datacube?
Q2: Which operations to allow?
Q3: How to index a datacube?

Q3: How to index a datacube?
A1: Bitmaps
[figure: one bitmap column per value: S, M, L for size; Red, Blue, Gray for color]

Q3: How to index a datacube?
A2: Join indices (see [Han+Kamber])
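The bitmap answer above can be sketched in a few lines: one bit vector per distinct value, with bit i set when row i holds that value. The Python below is an illustrative sketch that uses plain integers as bit vectors; the rows and the function name `build_bitmaps` are hypothetical, and real systems add compression on top.

```python
# Toy rows (hypothetical): one (size, color) pair per record.
rows = [
    ("S", "Red"),
    ("M", "Blue"),
    ("L", "Blue"),
    ("L", "Gray"),
    ("M", "Red"),
    ("L", "Blue"),
]

def build_bitmaps(rows, column):
    """Return {value: int} where bit i is set iff rows[i][column] == value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

size_index = build_bitmaps(rows, 0)
color_index = build_bitmaps(rows, 1)

# "size = L AND color = Blue" becomes a bitwise AND of two bitmaps.
match = size_index["L"] & color_index["Blue"]
matching_rows = [i for i in range(len(rows)) if (match >> i) & 1]
count = bin(match).count("1")  # the count measure, without touching the rows

print(matching_rows, count)
```

Because the count comes from the bits alone, a conjunctive query on low-cardinality dimensions such as size and color never has to read the fact table.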

D/W - OLAP: Conclusions
- D/W: copy (summarized) data + analyze
- OLAP concepts:
  - DataCube
  - R/M/HOLAP servers
  - dimensions; measures

Outline
- Problem
- Getting the data: Data Warehouses, DataCubes, OLAP
- Supervised learning: decision trees
- Unsupervised learning
  - association rules
  - (clustering)

Decision trees
Problem

  Age   Chol-level   Gender   CLASS-ID
  30    50           M        ??

Decision trees
Pictorially, we have
[figure: labeled points; x axis: num. attr#1 (eg., age); y axis: num. attr#2 (eg., chol-level)]

Decision trees
and we want to label '?'
[figure: same plot, with an unlabeled point '?']

Decision trees
so we build a decision tree:
[figure: same plot, split at 50 on num. attr#1 (eg., age) and at 40 on num. attr#2 (eg., chol-level)]

Decision trees
so we build a decision tree:
[figure: tree with the test age<50 (branches Y/N) and the test chol. <40 (branches Y/N), ...]

Outline
- Problem
- Getting the data: Data Warehouses, DataCubes, OLAP
- Supervised learning: decision trees
  - problem
  - approach
  - scalability enhancements
- Unsupervised learning
  - association rules
  - (clustering)

Decision trees
Typically, two steps:
- tree building
- tree pruning (for over-training/over-fitting)

How?
[figure: labeled points; x axis: num. attr#1 (eg., age); y axis: num. attr#2 (eg., chol-level)]

How?
A: Partition, recursively. Pseudocode:

  Partition(Dataset S)
    if all points in S have same label then return
    evaluate splits along each attribute A
    pick best split, to divide S into S1 and S2
    Partition(S1); Partition(S2)

Q1: how to introduce splits along attribute A_i
Q2: how to evaluate a split?
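The Partition pseudocode above can be turned into a small runnable sketch. The version below makes several assumptions that the slide leaves open: attributes are numeric, every split is binary on a threshold, and the "best" split is the one needing the fewest bits to encode the labels (count times entropy). The data points, taken as (age, chol-level) pairs, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the label distribution of a non-empty list."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(points, labels):
    """Try every (attribute, threshold); return (bits, attr, threshold) or None."""
    best = None
    for a in range(len(points[0])):
        # Skipping the smallest value keeps both sides of the split non-empty.
        for t in sorted({p[a] for p in points})[1:]:
            left = [l for p, l in zip(points, labels) if p[a] < t]
            right = [l for p, l in zip(points, labels) if p[a] >= t]
            bits = len(left) * entropy(left) + len(right) * entropy(right)
            if best is None or bits < best[0]:
                best = (bits, a, t)
    return best

def partition(points, labels):
    """Return a label (leaf) or a dict describing a test node."""
    if len(set(labels)) == 1:          # all points in S have the same label
        return labels[0]
    split = best_split(points, labels)
    if split is None:                  # identical points, mixed labels
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    lo = [(p, l) for p, l in zip(points, labels) if p[a] < t]
    hi = [(p, l) for p, l in zip(points, labels) if p[a] >= t]
    return {"attr": a, "threshold": t,
            "yes": partition(*map(list, zip(*lo))),
            "no": partition(*map(list, zip(*hi)))}

def classify(tree, point):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "yes" if point[tree["attr"]] < tree["threshold"] else "no"
        tree = tree[branch]
    return tree

points = [(25, 30), (30, 90), (45, 20), (60, 35), (70, 30), (55, 70), (65, 80)]
labels = ["+", "+", "+", "+", "+", "-", "-"]
tree = partition(points, labels)
print(tree)
print(classify(tree, (40, 25)))
```

The sketch grows the tree until every leaf is pure, so it covers only the building step; pruning is a separate phase.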

Q1: how to introduce splits along attribute A_i
A: for num. attributes:
   - binary split, or
   - multiple split
   for categorical attributes:
   - compute all subsets (expensive!), or
   - use a greedy algo

Q1: how to introduce splits along attribute A_i
Q2: how to evaluate a split?
A: by how close to uniform each subset is, ie., we need a measure of uniformity:

entropy: $H(p_+, p_-) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
[plot: entropy versus $p_+$, for $p_+$ from 0 to 1]
Any other measure?

entropy: $H(p_+, p_-)$
gini index: $1 - p_+^2 - p_-^2$
[plots: entropy and gini index versus $p_+$, for $p_+$ from 0 to 1]
(How about multiple labels?)
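Both measures are short enough to write out. The Python sketch below takes a list of class probabilities, so it also covers the multiple-label case by summing over all classes (the usual generalization, stated here as an assumption since the slide only poses the question). `node_bits` scores a node by count times entropy. The example counts are illustrative: a node with 7 '+' and 6 '-' points, split into a pure half and a half with 2 '+' and 6 '-'. That the pure half holds the remaining 5 '+' points is an inference.

```python
import math

def entropy(probs):
    """Entropy in bits; classes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def node_bits(counts):
    """Bits to encode all class labels in a node: n * H(p_1, ..., p_k)."""
    n = sum(counts)
    return n * entropy([c / n for c in counts])

# Two labels: both measures peak at p = 0.5 and vanish for a pure node.
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))

# Three labels: same formulas, one more term.
print(entropy([1 / 3] * 3), gini([1 / 3] * 3))

# Scoring one candidate split by the bits it saves.
before = node_bits([7, 6])
after = node_bits([5, 0]) + node_bits([2, 6])
print(round(before, 2), round(after, 2), round(before - after, 2))
```

A split is preferred when `before - after` is large, that is, when the subsets it produces are close to pure.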

Intuition:
- entropy: #bits to encode the class label
- gini: classification error, if we randomly guess '+' with prob. $p_+$

Thus, we choose the split that reduces entropy/classification-error the most. Eg.:
[figure: labeled points; x axis: num. attr#1 (eg., age); y axis: num. attr#2 (eg., chol-level)]

Before split: we need
  $(n_+ + n_-) \cdot H(p_+, p_-) = (7+6) \cdot H(7/13, 6/13)$
bits total, to encode all the class labels.
After the split we need: 0 bits for the first half and
  $(2+6) \cdot H(2/8, 6/8)$
bits for the second half.

Tree pruning
What for?
[figure: the same points, partitioned into many small regions; x axis: num. attr#1 (eg., age); y axis: num. attr#2 (eg., chol-level)]

Tree pruning
Shortcut for scalability: DYNAMIC pruning:
stop expanding the tree, if a node is 'reasonably' homogeneous
- ad hoc threshold [Agrawal+, vldb92]
- Minimum Description Length (MDL) criterion (SLIQ) [Mehta+, edbt96]

Tree pruning
Q: How to do it?
A1: use a training and a testing set: prune the nodes whose removal improves classification on the testing set. (Drawbacks?)
A2: or, rely on MDL (= Minimum Description Length), in detail:

Tree pruning
envision the problem as compression (of what?)
and try to min. the # bits to compress
(a) the class labels AND
(b) the representation of the decision tree

(MDL)
a brilliant idea, eg.: best n-degree polynomial to compress these points:
the one that minimizes (sum of errors + n)

Outline
- Problem
- Getting the data: Data Warehouses, DataCubes, OLAP
- Supervised learning: decision trees
  - problem
  - approach
  - scalability enhancements
- Unsupervised learning
  - association rules
  - (clustering)

Scalability enhancements
- Interval Classifier [Agrawal+, vldb92]: dynamic pruning
- SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but label column has to fit in core)
- SPRINT: even more clever partitioning

Conclusions for classifiers
- Classification through trees
- Building phase: splitting policies
- Pruning phase (to avoid over-fitting)
- For scalability:
  - dynamic pruning
  - clever data partitioning

Overall Conclusions
- Data Mining: of high commercial interest
- DM = DB + ML + Stat
- Data warehousing / OLAP: to get the data
- Tree classifiers (SLIQ, SPRINT)
- Association Rules: 'apriori' algorithm
- (clustering: BIRCH, CURE, OPTICS)

Reading material
- Agrawal, R., T. Imielinski, A. Swami, 'Mining Association Rules between Sets of Items in Large Databases', SIGMOD
- M. Mehta, R. Agrawal and J. Rissanen, 'SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996

Additional references
- Agrawal, R., S. Ghosh, et al. (Aug. 23-27, 1992). An Interval Classifier for Database Mining Applications. VLDB Conf. Proc., Vancouver, BC, Canada.
- Jiawei Han and Micheline Kamber, Data Mining, Morgan Kaufmann, 2001, chapters 2.2-2.3, 6.1-6.2, 7.3.5