TIES443. Lecture 3: Data Warehousing. Lecture 3. Data Warehousing. Course webpage: http://www.cs.jyu.fi/~mpechen/ties443.

Similar documents

Chapter 3, Data Warehouse and OLAP Operations

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

Data Warehousing and OLAP Technology

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Data Warehouse. MIT-652 Data Mining Applications. Thimaporn Phetkaew. School of Informatics, Walailak University. MIT-652: DM 2: Data Warehouse 1

Data W a Ware r house house and and OLAP Week 5 1

Overview of Data Warehousing and OLAP

DATA WAREHOUSING AND OLAP TECHNOLOGY

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Mining for Knowledge Management. Data Warehouses

Lecture 2 Data warehousing

Data Warehousing and Online Analytical Processing

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Week 13: Data Warehousing. Warehousing

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Data Warehousing and OLAP

Building Data Cubes and Mining Them. Jelena Jovanovic

14. Data Warehousing & Data Mining

Lecture 2: Introduction to Business Intelligence. Introduction to Business Intelligence

This tutorial will help computer science graduates to understand the basic-toadvanced concepts related to data warehousing.

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

2 Data Warehouse and OLAP Technology for Data Mining What is a data warehouse? Amultidimensional data model... 6

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

DATA WAREHOUSING - OLAP

Data Warehousing: Data Models and OLAP operations. By Kishore Jaladi

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Learning Objectives. Definition of OLAP Data cubes OLAP operations MDX OLAP servers

Week 3 lecture slides

Data W a Ware r house house and and OLAP II Week 6 1

Data Warehouse: Introduction

Lection 3-4 WAREHOUSING

CHAPTER 4 Data Warehouse Architecture

IT0457 Data Warehousing. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Part 22. Data Warehousing

Fluency With Information Technology CSE100/IMT100

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

Data Warehousing and OLAP Technology for Knowledge Discovery

CS2032 Data warehousing and Data Mining Unit II Page 1

Data Warehousing & OLAP

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

DATA WAREHOUSING APPLICATIONS: AN ANALYTICAL TOOL FOR DECISION SUPPORT SYSTEM

Overview. Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Data Warehousing. An Example: The Store (e.g.

Data Warehousing and elements of Data Mining

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Data Warehousing, OLAP, and Data Mining

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

When to consider OLAP?

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

B.Sc (Computer Science) Database Management Systems UNIT-V

New Approach of Computing Data Cubes in Data Warehousing

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Decision Support. Chapter 23. Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1

CS54100: Database Systems

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Data Warehousing Systems: Foundations and Architectures

Foundations of Business Intelligence: Databases and Information Management

Data Warehousing and Data Mining

CHAPTER 3. Data Warehouses and OLAP

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

Multi-dimensional index structures Part I: motivation

BUSINESS ANALYTICS AND DATA VISUALIZATION. ITM-761 Business Intelligence ดร. สล ล บ ญพราหมณ

Data Warehouse. The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way:

A Critical Review of Data Warehouse

BUILDING OLAP TOOLS OVER LARGE DATABASES

Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives

Basics of Dimensional Modeling

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Concepts of Database Management Seventh Edition. Chapter 9 Database Management Approaches

OLAP and Data Warehousing! Introduction!

A Technical Review on On-Line Analytical Processing (OLAP)

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Data Integration and ETL Process

Module 1: Introduction to Data Warehousing and OLAP

Data Warehouse Technology And The MSD Databases

OLAP Systems and Multidimensional Expressions I

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Unit -3. Learning Objective. Demand for Online analytical processing Major features and functions OLAP models and implementation considerations

Foundations of Business Intelligence: Databases and Information Management

Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance

Transcription:

TIES443 Lecture 3 Data Warehousing Mykola Pechenizkiy Course webpage: http://www.cs.jyu.fi/~mpechen/ties443 Department of Mathematical Information Technology University of Jyväskylä November 3, 2006 1 Topics for today What is a data warehouse? Data warehouse architectures Conceptual DW Modelling Physical DW Modelling A multi-dimensional data model Data Cubes OLAP 12 Codd s rules for OLAP Main OLAP operations New buzzwords Data warehouse implementation and maintenance 2 1

Data Warehouse A decision support DB that is maintained separately from the organization s operational databases. Why Separate Data Warehouse? High performance for both systems DBMS tuned for OLTP access methods, indexing, concurrency control, recovery Warehouse tuned for OLAP complex OLAP queries, multidimensional view, consolidation. Different functions and different data Missing data: Decision support requires historical data which operational DBs do not typically maintain Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled 3 Three-Tier Tier Architecture other sources Metadata Monitor & Integrator OLAP Server Analysis Operational DBs Extract Transform Load Refresh Data Warehouse Serve Query/Reporting Data Mining Data Marts ROLAP Server Data Sources Data Storage OLAP Engine Front-End Tools 4 2

Three-Tier Tier Architecture Warehouse database server Almost always a relational DBMS, rarely flat files Schema design Specialized scan, indexing and join techniques Handling of aggregate views (querying and materialization) Supporting query language extensions beyond SQL Complex query processing and optimization Data partitioning and parallelism OLAP servers Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high-level: array Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas Clients Query and reporting tools Analysis tools Data mining tools 5 Warehouse Physical Architectures Client Client Client Central Data Warehouse Source Source Centralized architecture Logical Data Warehouse Physical Data Warehouse Source Source Source Source Federated architecture Tiered architecture 6 3

Three Data Warehouse Models Enterprise warehouse: collects all information about subjects (customers,products,sales,assets, personnel) that span the entire organization Requires extensive business modeling (may take years to design and build) Data Marts: Departmental subsets that focus on selected subjects Marketing data mart: customer, product, sales Faster roll out, but complex integration in the long run Virtual warehouse: views over operational DBs Materialize selective summary views for efficient query processing Easy to build but require excess capability on operat. db servers 7 8 4

Data Warehouse A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making 9 Data Warehouse Subject Subject-Oriented Organized around major subjects, such as customer, product, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process 10 5

Data Warehouse Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted 11 Data Warehouse Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain time element 12 6

Data Warehouse Non Non-Volatile A physically separate store of data transformed from the operational environment Operational update of data does not occur in the data warehouse environment Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data 13 Data Warehouse vs. Heterogeneous DBMS Traditional heterogeneous DB integration Build wrappers/mediators on top of heterogeneous databases Query driven approach When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set Complex information filtering, compete for resources Data warehouse update-driven, high performance Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis 14 7

Data Warehouse vs. Operational DBMS OLTP (On-Line Transaction Processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (On-Line Analytical Processing) Major task of data warehouse system Data analysis and decision making Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries 15 Conceptual Modeling of Data Warehouses ER design techniques not appropriate - design should reflect multidimensional view Star Schema A fact table in the middle connected to a set of dimension tables Snowflake Schema A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact Constellation Schema Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 16 8

Example of a Star Schema Order Order No Order Date Customer Customer No Customer Name Customer Address City Salesperson SalespersonID SalespersonName City Quota Fact Table OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price Product ProductNO ProdName ProdDescr Category CategoryDescription UnitPrice Date DateKey Date City CityName State Country 17 Star Schema A single fact table and a single table for each dimension Every fact points to one tuple in each of the dimensions and has additional attributes Does not capture hierarchies directly Generated keys are used for performance and maintenance reasons Fact constellation: Multiple Fact tables that share many dimension tables Example: Projected expense and the actual expense may share dimensional tables 18 9

Some Terms Relation, which relates the dimensions to the measure of interest, is called the fact table (e.g. sale) Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store) Each dimension can have a set of associated attributes For each dimension, the set of associated attributes can be structured as a hierarchy 19 A Concept Hierarchy: Dimension (location) all all region North_America... Europe country Canada... Mexico Ireland... France city Toronto... Dublin... Belfast office Belfield... Blackrock 20 10

Example of a Snowflake Schema Order No Order Date Customer No Customer Name Customer Address City SalespersonID SalespersonName City Quota Order Customer Salesperson Fact Table OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price Product ProductNO ProdName ProdDescr Category Category UnitPrice Date DateKey Date Month City CityName State Country Month Month Year State Category CategoryName CategoryDescr StateName Country Year Year 21 Example of Fact Constellation Multiple fact tables share dimension tables Shipping Fact Table Time time_key day day_of_the_week month quarter year Branch branch_key branch_name branch_type Measures Sales Fact Table Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_key Location location_key street city Province/street country Time_key Item_key shipper_key from_location to_location Euros_sold unit_shipped shipper shipper_key shipper_name location_key shipper_type 22 11

Multidimensional Data Model Fact relation Two-dimensional cube sale Product Client Amt p1 c1 12 p2 c1 11 p1 c3 50 p2 c2 8 c1 c2 c3 p1 12 50 p2 11 8 Fact relation 3-dimensional cube sale Product Client Date Amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 23 Multidimensional Data Sales volume as a function of product, month, and region Region Dimensions: Product, Location, Time Hierarchical summarization paths Industry Region Year Product Category Country Quarter Product City Month Week Office Day Month 24 12

From Tables to Data Cubes A data warehouse is based on multidimensional data model which views data in the form of a data cube A data cube allows data to be modeled and viewed in multiple dimensions (such as sales) Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables Definitions an n-dimensional base cube is called a base cuboid The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid The lattice of cuboids forms a data cube 25 Cube: A Lattice of Cuboids all time item location supplier 0-D(apex) cuboid 1-D D cuboids time,item time,location time,supplier item,location item,supplier location,supplier 2-D D cuboids time,location,supplier time,item,location time,item,supplier item,location,supplier time, item, location, supplier 3-D D cuboids 4-D(base) cuboid 26 13

A Sample Data Cube TV PC VCR sum Product Date 1Qtr 2Qtr 3Qtr 4Qtr sum Total annual sales of TV in Ireland Ireland France Germany Country sum 27 Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice project and select Pivot (rotate) reorient the cube, visualization, 3D to series of 2D planes. Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL) rankings Typical OLAP Operations time functions: e.g. time avg. 28 14

Typical OLAP Operations: Drill & Roll Drill Down Total Sales Total Sales per city Total Sales per city per store Total Sales per city per store per month Drill Up Drill Down Total Sales Total Sales per city Total Sales per city by category Drill Across Drill Up 29 OLAP Queries: Rollup & Drill-Down Down day 1 day 2 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 c1 c2 c3 p1 56 4 50 p2 11 8 rollup drill-down c1 c2 c3 sum 67 12 50 sum p1 110 p2 19 129 30 15

Typical OLAP Operations: Pivoting Pivoting can be combined with aggregation sale prodid clientid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 c1 c2 c3 Sum 1 23 8 50 81 2 44 4 48 Sum 67 12 50 129 c1 c2 c3 Sum p1 56 4 50 110 p2 11 8 19 Sum 67 12 50 129 The result of pivoting is called a cross-tabulation 31 Browsing a Data Cube Visualization OLAP capabilities Interactive manipulation 32 16

12 Codd s Rules for OLAP 1. Multi-Dimensional Concept View The user should be able to see the data as being multidimensional insofar as it should be easy to 'pivot' or 'slice and dice. (See later.) 2. Transparency The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host, i.e. OLAP server should shield the user for the complexity of the data and application 3. Accessibility OLAP should allow the user to access diverse data stores (relational, nonrelational and legacy systems) but see the data within a common 'schema provided by the OLAP tool, i.e. Users shouldn t have to know the location, type or layout of the data to access it. OLAP server should automate the mapping of the logical schema to the physical data 4. Consistent Reporting Performance There should not be significant degradation in performance with large numbers of dimensions or large quantities of data. 33 12 Codd s Rules for OLAP 5. Client-Server Architecture Since much of the data is on mainframes, and the users work on PCs, the OLAP tool must be able to bring the two together Different clients can be used Data sources must be transparently supported by the OLAP server 6. Generic Dimensionality Data dimensions must all be treated equally. Functions available for one dimension must be available for others. 7. Dynamic Sparse Matrix Handling The OLAP tool should be able to work out for itself the most efficient way to store sparse matrix data. 8. Multi User Support access, integrity, security 34 17

12 Codd s Rules for OLAP 9. Unrestricted Cross-Dimensional Operations e.g., individual office overheads are allocated according to total corporate overheads divided in proportion to individual office sales. Non-additive formulas cause the problems Contribution = Revenue - Total Costs Margin Percentage = Margin / Revenue 10. Intuitive Data Manipulation Navigation should be done by operations on individual cells rather than menus. Dimensions defined should allow automatic reorientation, drill-down, zoom-out, etc Interface must be intuitive 11. Flexible Reporting Row and column headings must be capable of more than one dimension each, and of displaying subsets of any dimension. 12. Unlimited Dimensions and Aggregation Levels 15-20 dimensions are required in modelling, and within each there may be many hierarchical levels, i.e. unlimited aggregation Rare in reporting to go beyond 12 dimensions, 6-7 is usual 35 DW Back-End Tools and Utilities Data extraction: get data from multiple, heterogeneous, and external sources Data cleaning: detect errors in the data and rectify them when possible Data transformation: convert data from legacy or host format to warehouse format: different data formats, languages, etc. Load: sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh propagate the updates from the data sources to the warehouse 36 18

DW Information Flows INFLOW - Processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse. UPFLOW - Processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data. DOWNFLOW - Processes associated with archiving and backing-up/recovery of data in the warehouse. OUTFLOW - Processes associated with making the data available to the end-users. METAFLOW - Processes associated with the management of the metadata. 37 Data Cleaning why? Data warehouse contains data that is analyzed for business decisions More data and multiple sources could mean more errors in the data and harder to trace such errors Results in incorrect analysis finding and resolving inconsistency in the source data detecting data anomalies and rectifying them early has huge payoffs Important to identify tools that work together well Long Term Solution Change business practices and data entry tools Repository for meta-data 38 19

Data Cleaning Techniques Transformation Rules Example: translate gender to sex Uses domain-specific knowledge to do scrubbing Parsing and fuzzy matching Multiple data sources (can designate a preferred source) Discover facts that flag unusual patterns (auditing) Some dealer has never received a single complaint 39 Load Issues: huge volumes of data to be loaded small time window (usually at night) when the warehouse can be taken off-line When to build indexes and summary tables allow system administrator to monitor status, cancel suspend, resume load, or change load rate restart after failure with no loss of data integrity Techniques: batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables sequential loads still too long (~100 days for TB) use parallelism and incremental techniques 40 20

Refresh when to refresh on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-the-minute stock quotes) periodically (e.g., every 24 hours, every week) or after significant events refresh policy set by administrator based on user needs and traffic possibly different policies for different sources how to refresh Full extract from base tables read entire source table or database: expensive Incremental techniques detect & propagate changes on base tables: replication servers logical correctness transactional correctness: incremental load 41 Metadata Repository Administrative metadata source databases and their contents warehouse schema, view & derived data definitions dimensions, hierarchies pre-defined queries and reports data mart locations and contents data partitions data extraction, cleansing, transformation rules, defaults data refresh and purging rules user profiles, user groups security: user authorization, access control Business data business terms and definitions ownership of data charging policies Operational metadata data lineage: history of migrated data and sequence of transf-s applied currency of data: active, archived, purged monitoring information: warehouse usage statistics, error reports, audit trails. 42 21

Design of a DW: A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view: allows selection of the relevant information necessary for the data warehouse (mature) Bottom-up: Starts with experiments and prototypes (rapid) Data source view exposes the information being captured, stored, and managed by operational systems Data warehouse view consists of fact tables and dimension tables Business query view sees the perspectives of data in the warehouse from the view of end-user 43 Data Warehouse Design Process From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record 44 22

DW Design: Issues to Consider What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 45 DW Data Management: Issues to Consider Meta-data Data sourcing Data quality Data security Granularity History- how long and how much? Performance 46 23

Common DW Problems Underestimation of resources for data loading Hidden problems with source systems Required data not captured Increased end-user demands Data homogenization High demand for resources Data ownership High maintenance Long duration projects Complexity of integration 47 Research Issues Data cleaning focus on data inconsistencies, not schema differences data mining techniques Physical Design design of summary tables, partitions, indexes tradeoffs in use of different indexes Query processing selecting appropriate summary tables dynamic optimization with feedback query optimization: cost estimation, use of transformations, search strategies partitioning query processing between OLAP server and backend server. Warehouse Management incremental refresh techniques computing summary tables during load failure recovery during load and refresh process management: scheduling queries, load and refresh use of workflow technology for process management 48 24

OLAP Mining: An Integration of DM and DW Data mining systems, DBMS, Data warehouse systems coupling No coupling, loose-coupling, semi-tight-coupling, tight-coupling On-line analytical mining data integration of mining and OLAP technologies Interactive mining multi-level knowledge Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. Integration of multiple mining functions Characterized classification, first clustering and then association 49 Summary What is a data warehouse Data warehouse architectures Conceptual DW Modelling Physical DW Modelling A multi-dimensional data model Data Cubes Main OLAP operations Data warehouse implementation and maintenance What else did you get from this lecture? 50 25

Additional Slides 51 Some Critics for Data Cubes Jargon for IT professionals Metaphor for end-users Not useful beyond introduction Users expect to see one Rubic s cube Easy to manipulate but difficult to solve Alternatives are better Cognos ways, Thomsen s multi-dimensional domain diagrams, Bulos s Adapt diagrams or simply well designed interactive reports 52 26

DW Development: A Recommended Approach Distributed Data Marts Multi-Tier Data Warehouse Data Mart Data Mart Enterprise Data Warehouse Model refinement Model refinement Define a high-level corporate data model 53 DMQL: Language Primitives Nice presentation on Data Mining Query Languages can be found here: http://www.cs.wisc.edu/edam/slides/data%20mining%20query%20languages.ppt Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_subdimension_list>) Special Case (Shared Dimension Tables) First time as cube definition define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time> 54 27

Defining a Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) 55 Defining a Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) 56 28

Defining a Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales 57 A Star-Net Query Model Shipping Method AIR-EXPRESS Customer Orders CONTRACTS Customer Time ANNUALY QTRLY REGION Location TRUCK CITY COUNTRY Each circle is called a footprint DAILY Promotion ORDER PRODUCT LINE Product PRODUCT ITEM PRODUCT GROUP SALES PERSON DISTRICT DIVISION Organization 58 29

Index Structures This and the following slides on Indexing for DW are adopted with minor modifications from: http://infolab.stanford.edu/~hector/cs245/notes12.ppt Indexing principle: mapping key values to records for associative direct access Most popular indexing techniques in relational database: B+-trees For multi-dimensional data, a large number of indexing techniques have been developed: R-trees Index structures applied in warehouses inverted lists bit map indexes join indexes text indexes 59 20 23 18 19 20 21 22 23 25 26 Inverted Lists r4 r18 r34 r35 r5 r19 r37 r40 rid name age r4 joe 20 r18 fred 20 r19 sally 21 r34 nancy 20 r35 tom 20 r36 pat 25 r5 dave 21 r41 jeff 26... age index Query: Get people with age = 20 and name = fred inverted lists data records List for age = 20: r4, r18, r34, r35 List for name = fred : r18, r52 Answer is intersection: r18 60 30

Bitmap Indexes Bitmap index: An indexing technique that has attracted attention in multi-dimensional database implementation table Customer City Car c1 Detroit Ford c2 Chicago Honda c3 Detroit Honda c4 Poznan Ford c5 Paris BMW c6 Paris Nissan 61 Bitmap Indexes The index consists of bitmaps: Index on City: ec1 Chicago Detroit Paris Poznan 1 0 1 0 0 2 1 0 0 0 3 0 1 0 0 4 0 0 0 1 5 0 0 1 0 6 0 0 1 0 Index on Car: ec1 BMW Ford Honda Nissan 1 0 1 0 0 2 1 0 1 0 3 0 0 1 0 4 0 1 0 0 5 1 0 0 0 6 0 0 0 1 bitmaps bitmaps Index on a particular column Index consists of a number of bit vectors - bitmaps Each value in the indexed column has a bit vector (bitmaps) The length of the bit vector is the number of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column 62 31

Bitmap Indexes Index on a particular column Index consists of a number of bit vectors - bitmaps Each value in the indexed column has a bit vector (bitmaps) The length of the bit vector is the number of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column 63 Bitmap Index 20 23 18 19 20 21 22 23 25 26 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 id name age 1 joe 20 2 fred 20 3 sally 21 4 nancy 20 5 tom 20 6 pat 25 7 dave 21 8 jeff 26... data records Query: Get people with age = 20 and name = fred List for age = 20: 1101100000 List for name = fred : 0100000001 Answer is intersection: 0100000000 Suited well for domains with small cardinality age index bit maps 64 32

Bitmap Index Summary With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), bitmap index offers better access methods for certain queries e.g., selection on two attributes Some commercial products have implemented bitmap index Works poorly for high cardinality domains since the number of bitmaps increases Difficult to maintain - need reorganization when relation sizes change (new bitmaps) 65 Join Combine SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT sale prodid storeid date amt p1 c1 1 12 p2 c1 1 11 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 product id name price p1 bolt 10 p2 nut 5 jointb prodid name price storeid date amt p1 bolt 10 c1 1 12 p2 nut 5 c1 1 11 p1 bolt 10 c3 1 50 p2 nut 5 c2 1 8 p1 bolt 10 c1 2 44 p1 bolt 10 c2 2 4 66 33

Join Indexes product id name price jindex p1 bolt 10 r1,r3,r5,r6 p2 nut 5 r2,r4 join index sale rid prodid storeid date amt r1 p1 c1 1 12 r2 p2 c1 1 11 r3 p1 c3 1 50 r4 p2 c2 1 8 r5 p1 c1 2 44 r6 p1 c2 2 4 67 Join Indexes Traditional indexes map the value to a list of record ids. Join indexes map the tuples in the join result of two relations to the source tables. In data warehouse cases, join indexes relate the values of the dimensions of a star schema to rows in the fact table. For a warehouse with a Sales fact table and dimension city, a join index on city maintains for each distinct city a list of RIDs of the tuples recording the sales in the city Join indexes can span multiple dimensions 68 34