On-Line Application Processing. Warehousing Data Cubes Data Mining



Similar documents
OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Multi-dimensional index structures Part I: motivation

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Week 13: Data Warehousing. Warehousing

DATA WAREHOUSING - OLAP

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

IST722 Data Warehousing

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

IT and CRM A basic CRM model Data source & gathering system Database system Data warehouse Information delivery system Information users

Review. Data Warehousing. Today. Star schema. Star join indexes. Dimension hierarchies

Data Warehousing and OLAP

OLAP and Data Warehousing! Introduction!

OLAP OLAP. Data Warehouse. OLAP Data Model: the Data Cube S e s s io n

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

The Cubetree Storage Organization

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data W a Ware r house house and and OLAP II Week 6 1

Part 22. Data Warehousing

DATA WAREHOUSING AND OLAP TECHNOLOGY

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

Week 3 lecture slides

(Week 10) A04. Information System for CRM. Electronic Commerce Marketing

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

Turkish Journal of Engineering, Science and Technology

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

Overview of Data Warehousing and OLAP

University of Gaziantep, Department of Business Administration

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

14. Data Warehousing & Data Mining

Data Warehousing and Data Mining

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Data Warehouse design

Building Data Cubes and Mining Them. Jelena Jovanovic

Mario Guarracino. Data warehousing

CS2032 Data warehousing and Data Mining Unit II Page 1

Data Warehouse: Introduction

Informationslogistik Unit 10: OLTP, OLAP, SAP, Data Warehouse, and Object-relational Databases

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

CASE PROJECTS IN DATA WAREHOUSING AND DATA MINING

Monitoring Genebanks using Datamarts based in an Open Source Tool

M Designing and Implementing OLAP Solutions Using Microsoft SQL Server Day Course

Transactions, Views, Indexes. Controlling Concurrent Behavior Virtual and Materialized Views Speeding Accesses to Data

Data Warehousing and Data Mining. A.A Datawarehousing & Datamining 1

Implementing Data Models and Reports with Microsoft SQL Server 20466C; 5 Days

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Lecture Data Warehouse Systems

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

Why Business Intelligence

When to consider OLAP?

CHAPTER 4 Data Warehouse Architecture

Microsoft Implementing Data Models and Reports with Microsoft SQL Server

Data Warehousing: Data Models and OLAP operations. By Kishore Jaladi

Data Warehousing. Paper

CS54100: Database Systems

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

Main Memory & Near Main Memory OLAP Databases. Wo Shun Luk Professor of Computing Science Simon Fraser University

Data Warehousing & OLAP

Business Intelligence: Multidimensional Data Analysis

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

Cognos 8 Best Practices

Business Intelligence Solutions. Cognos BI 8. by Adis Terzić

Data W a Ware r house house and and OLAP Week 5 1

Delivering Business Intelligence With Microsoft SQL Server 2005 or 2008 HDT922 Five Days

Introduction to Data Warehousing. Ms Swapnil Shrivastava

PREFACE INTRODUCTION MULTI-DIMENSIONAL MODEL. Chris Claterbos, Vlamis Software Solutions, Inc.

2 Data Warehouse and OLAP Technology for Data Mining What is a data warehouse? Amultidimensional data model... 6

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

Data Warehousing Systems: Foundations and Architectures

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

What is OLAP - On-line analytical processing

Data Warehousing Overview

OLAP Systems and Multidimensional Expressions I

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

A Critical Review of Data Warehouse

<Insert Picture Here> Oracle Retail Data Model Overview

Implementing Data Models and Reports with Microsoft SQL Server

Learning Objectives. Definition of OLAP Data cubes OLAP operations MDX OLAP servers

1960s 1970s 1980s 1990s. Slow access to

Introduction to Databases

II. OLAP(ONLINE ANALYTICAL PROCESSING)

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Integrating GIS and BI: a Powerful Way to Unlock Geospatial Data for Decision-Making

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics

Implementing Data Models and Reports with Microsoft SQL Server

Data Warehousing Concepts

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Data Mining Jargon. Bob Muenchen The Statistical Consulting Center

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

A Technical Review on On-Line Analytical Processing (OLAP)

Introduction to Data Mining

Dimensional Data Modeling for the Data Warehouse

Business Intelligence, Analytics & Reporting: Glossary of Terms

Data Warehousing OLAP

Transcription:

On-Line Application Processing Warehousing Data Cubes Data Mining 1

Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, analytic queries. New architectures have been developed to handle analytic queries efficiently. 2

The Data Warehouse The most common form of data integration. Copy sources into a single DB (warehouse) and try to keep it up-to-date. Usual method: periodic reconstruction of the warehouse, perhaps overnight. Frequently essential for analytic queries. 3

OLTP Most database operations involve On- Line Transaction Processing (OTLP). Short, simple, frequent queries and/or modifications, each involving a small number of tuples. Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets. 4

OLAP On-Line Application Processing (OLAP, or analytic ) queries are, typically: Few, but complex queries --- may run for hours. Queries do not depend on having an absolutely up-to-date database. 5

OLAP Examples 1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. 2. Analysts at Wal-Mart look for items with increasing sales in some region. Use empty trucks to move merchandise between stores. 6

Common Architecture Databases at store branches handle OLTP. Local store databases copied to a central warehouse overnight. Analysts use the warehouse for OLAP. 7

Star Schemas A star schema is a common organization for data at a warehouse. It consists of: 1. Fact table : a very large accumulation of facts such as sales. Often insert-only. 2. Dimension tables : smaller, generally static information about the entities involved in the facts. 8

Example: Star Schema Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged. The fact table is a relation: Sales(bar, beer, drinker, day, time, price) 9

Example -- Continued The dimension tables include information about the bar, beer, and drinker dimensions : Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone) 10

Visualization Star Schema Dimension Table (Bars) Dimension Table (Drinkers) Dimension Attrs. Dependent Attrs. Fact Table -Sales Dimension Table (Beers) Dimension Table (etc.) 11

Dimensions and Dependent Attributes Two classes of fact-table attributes: 1. Dimension attributes : the key of a dimension table. 2. Dependent attributes : a value determined by the dimension attributes of the tuple. 12

Example: Dependent Attribute price is the dependent attribute of our example Sales relation. It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes). 13

Approaches to Building Warehouses 1. ROLAP = relational OLAP : Tune a relational DBMS to support star schemas. 2. MOLAP = multidimensional OLAP : Use a specialized DBMS with a model such as the data cube. 14

MOLAP and Data Cubes Keys of dimension tables are the dimensions of a hypercube. Example: for the Sales data, the four dimensions are bar, beer, drinker, and time. Dependent attributes (e.g., price) appear at the points of the cube. 15

Visualization -- Data Cubes beer price bar drinker 16

Marginals The data cube also includes aggregation (typically SUM) along the margins of the cube. The marginals include aggregations over one dimension, two dimensions, 17

Visualization --- Data Cube w/aggregation beer price SUM over all Drinkers bar drinker 18

Example: Marginals Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days). It would also have the sum of price over all bar-beer pairs, all bar-drinkerday triples, 19

Structure of the Cube Think of each dimension as having an additional value *. A point with one or more * s in its coordinates aggregates over the dimensions with the * s. Example: Sales( Joe s Bar, Bud, *, *) holds the sum, over all drinkers and all time, of the Bud consumed at Joe s. 20

Drill-Down Drill-down = de-aggregate = break an aggregate into its constituents. Example: having determined that Joe s Bar sells very few Anheuser-Busch beers, break down his sales by particular A.-B. beer. 21

Roll-Up Roll-up = aggregate along one or more dimensions. Example: given a table of how much Bud each drinker consumes at each bar, roll it up into a table giving total amount of Bud consumed by each drinker. 22

Example: Roll Up and Drill Down $ of Anheuser-Busch by drinker/bar Joe s Bar Nut- House Blue Chalk Jim 45 50 38 Bob 33 36 31 Mary 30 42 40 Roll up by Bar $ of A-B / drinker Jim 133 Bob 100 Mary 112 $ of A-B Beers / drinker Jim Bob Drill down by Beer Mary Bud 40 29 40 M lob 45 31 37 Bud Light 48 40 35 23

Data Mining Data mining is a popular term for queries that summarize big data sets in useful ways. Examples: 1. Clustering all Web pages by topic. 2. Finding characteristics of fraudulent credit-card use. 24

Course Plug Winter 2007-8: Anand Rajaraman and Jeff Ullman are offering CS345A Data Mining. MW 4:15-5:30, Herrin, T185. 25

Market-Basket Data An important form of mining from relational data involves market baskets = sets of items that are purchased together as a customer leaves a store. Summary of basket data is frequent itemsets = sets of items that often appear together in baskets. 26

Example: Market Baskets If people often buy hamburger and ketchup together, the store can: 1. Put hamburger and ketchup near each other and put potato chips between. 2. Run a sale on hamburger and raise the price of ketchup. 27

Finding Frequent Pairs The simplest case is when we only want to find frequent pairs of items. Assume data is in a relation Baskets(basket, item). The support threshold s is the minimum number of baskets in which a pair appears before we are interested. 28

Frequent Pairs in SQL SELECT b1.item, b2.item FROM Baskets b1, Baskets b2 WHERE b1.basket = b2.basket AND b1.item < b2.item GROUP BY b1.item, b2.item HAVING COUNT(*) >= s; Throw away pairs of items that do not appear at least s times. Look for two Basket tuples with the same basket and different items. First item must precede second, so we don t count the same pair twice. Create a group for each pair of items that appears in at least one basket. 29

A-Priori Trick (1) Straightforward implementation involves a join of a huge Baskets relation with itself. The a-priori algorithm speeds the query by recognizing that a pair of items {i, j } cannot have support s unless both {i } and {j } do. 30

A-Priori Trick (2) Use a materialized view to hold only information about frequent items. INSERT INTO Baskets1(basket, item) SELECT * FROM Baskets WHERE item IN ( ); SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= s Items that appear in at least s baskets. 31

A-Priori Algorithm 1. Materialize the view Baskets1. 2. Run the obvious query, but on Baskets1 instead of Baskets. Computing Baskets1 is cheap, since it doesn t involve a join. Baskets1 probably has many fewer tuples than Baskets. Running time shrinks with the square of the number of tuples involved in the join. 32

Example: A-Priori Suppose: 1. A supermarket sells 10,000 items. 2. The average basket has 10 items. 3. The support threshold is 1% of the baskets. At most 1/10 of the items can be frequent. Probably, the minority of items in one basket are frequent -> factor 4 speedup. 33