Introduction to Data Warehousing and OLAP




Outline
  Part I: Introduction
    OLAP vs OLTP
    Data Cleaning and Integration
  Part II: Data Models and Warehouse Design
  Part III: Index Structures for Data Warehouses

Types of Data
  Operational Data (OLTP applications)
    Data that "works"
    Frequent updates and queries
    Normalized for efficient search and updates (minimize update anomalies)
    Fragmented, with local relevance
    Point queries: queries accessing individual tuples

Types of Data
  Historical Data (OLAP applications)
    Data that "tells"
    Very infrequent updates
    Integrated data set with global relevance
    Analytical queries that require huge amounts of aggregation
    Performance issues mainly in query response time (not in updates)

Example OLTP Queries
  What is the salary of Mr. Ali?
  What is the address and phone number of the person in charge of the Supplies department?
  How many employees have received an excellent credential in the latest appraisal?

Example OLAP Queries
  How is the employee attrition scene changing over the years across the company?
  Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
  Is it financially viable to continue our manufacturing unit in Taiwan?

A Data Warehouse
  An infrastructure to manage historical data
  Designed to support OLAP queries that make extensive use of aggregation
  Post-retrieval processing (reporting) is at least as complex as the retrieval itself

Warehousing Data
  [Figure: operational data from several OLTP units passes through Data Cleaning and Integration into the Data Warehouse.]

Data Marts
  Data warehouses are seen as a collection of data marts, or historical data about each OLTP segment that feeds into the warehouse
  Data marts are also seen as small warehouses for OLAP activities within a given segment

Data Cleaning and Integration
  [Figure: OLTP Databases feed the DCU and DIU, which populate the Data Warehouse; a back flush returns Updates / Feedback to the OLTP Databases.]

Dirty Data
  Lack of standardization
    Multiple encodings, locales, languages
    Spurious abbreviations: "Allama Iqbal Road" and "A.I. Road" are the same
    Semantic equivalence: "Rawalpindi" is the same as "Pindi"
    Multiple standards: 1.6 kilometers is the same as 1 mile

Dirty Data
  Missing, spurious and duplicate data
    Missing age field for an employee
    Spurious (incorrectly entered) sales values
    Duplication of data sets across OLTP units
    Semantic duplication (M. A. Khan appearing in another data set as Khan Muhammad Ali)

Dirty Data
  Inconsistencies
    Incorrect use of codes (use of M/F in addition to 0/1 for gender)
    Codes with inconsistent or outdated meaning (travel eligibility "C" denoting eligibility to travel only by III class sleeper, which no longer exists)
    Inconsistent duplicate data (two data sets are found to belong to the same person, but carry two different addresses)

Dirty Data
  Inconsistencies
    Inconsistent associations (sales figures provided by the marketing department do not add up to the total sales figures reported by the retail units)
    Semantic inconsistencies (Feb 31st)
    Referential inconsistency (Rs. 10 lakhs sales reported from a unit that has been closed down)

Issues in Data Cleaning
  Cannot be fully automated
  GIGO (Garbage In, Garbage Out)
  Requires considerable knowledge that is tacit and beyond the purview of the warehouse (metrics, geography, govt. policies, etc.)
  Complexity increases (usually geometrically) with the number of data sources
  Complexity increases with the history span that is taken up for cleaning

Steps in Data Cleaning (Rahm and Do [1])
  1. Data Analysis: Analyze the data set to obtain meta-data and detect dirty data
  2. Definition of transformation rules: Transform data from its current dirty form to the required clean form. Transformation can be either at the schema level or at the data level
  3. Rule Verification: Verification of the transformation rules on test data sets
  4. Transformation: Execution of transformation rules on the data set
  5. Backflow: Re-populating data sources with cleaned data

Data Analysis Techniques (Refs [1],[2])
  Problem to be detected         Meta-data used
  Illegal values                 (max, min), (mean, deviation), cardinality
  Spelling mistakes              Hashing, N-gram outliers
  Lack of standards              Column comparison (compare value sets from a given column across tables)
  Duplicate and missing values   Compare cardinality with #rows, detect nulls, use rules to predict incorrect or missing values

Transformation Algorithms
  Hash-Merge for duplicate elimination
  1. Hash tuples based on a given column into buckets
  2. (Tuples with duplicate values are hashed onto the same bucket)
  3. Merge tuples within each bucket separately
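The Hash-Merge steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name hash_merge, the sample rows, and the merge rule (keep the first non-empty value per column) are assumptions and do not come from the slides.

```python
from collections import defaultdict

def hash_merge(rows, key):
    """Eliminate duplicates by hashing rows on `key` and merging per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        # Steps 1-2: rows with equal key values land in the same bucket.
        buckets[hash(row[key])].append(row)

    merged = []
    for bucket in buckets.values():
        # Step 3: merge within each bucket. A hash collision can put
        # different key values in one bucket, so group by the actual value.
        by_value = {}
        for row in bucket:
            target = by_value.setdefault(row[key], {})
            for col, val in row.items():
                # Assumed merge rule: keep the first non-empty value seen.
                if target.get(col) in (None, ""):
                    target[col] = val
        merged.extend(by_value.values())
    return merged

# Hypothetical sample data.
rows = [
    {"name": "M.A. Khan", "address": "50, Lvl Rd.", "dept": "Sales"},
    {"name": "Saleem", "address": "25, LB Rd.", "dept": "R&D"},
    {"name": "M.A. Khan", "address": "", "dept": "Sales"},
]
result = hash_merge(rows, "name")
```

Only exact key matches are merged here; misspelled or reordered names need an approximate technique.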

Transformation Algorithms
  Name        Address        Dept
  M.A. Khan   50, Lvl Rd.    Sales
  Saleem      25, LB Rd.     R&D
  M.A. Khan   50, Lvl Rd.    PR
  Rahim       30, Snky Rd.   Products
  [Figure: the rows above are distributed into hash buckets according to the hash key.]

Transformation Algorithms
  Sorted Neighborhood Technique for misspelling integration
  1. Identify a set of data values within a given row as key
  2. Sort the table based on the key
  3. Slide a window of n rows over the sorted table and merge data values based on rules. (Ex: Merge names if all other values like age, address, dept, etc. match)
  4. Make multiple passes until there are no more merges of records
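A single pass of the sorted neighborhood technique might look like the Python sketch below. The function name, the tuple layout of the rows, and the choice to report matching pairs instead of merging them in place are assumptions made for illustration; the repeated passes of step 4 are left out.

```python
def sorted_neighborhood(rows, key, window, is_duplicate):
    """One pass: sort on `key`, then compare rows inside a sliding window."""
    ordered = sorted(rows, key=key)
    pairs = []
    for i in range(len(ordered)):
        # A window of `window` rows: row i is compared with the next window-1 rows.
        for j in range(i + 1, min(i + window, len(ordered))):
            if is_duplicate(ordered[i], ordered[j]):
                pairs.append((i, j))
    return ordered, pairs

# Rows as (name, address, dept), using the table from the slides.
rows = [
    ("M.A. Khan", "50, Lvl Rd.", "Sales"),
    ("Saleem", "25, LB Rd.", "R&D"),
    ("M.A. Khan", "50, Lvl Rd.", "PR"),
    ("Rahim", "30, Snky Rd.", "Products"),
]
ordered, pairs = sorted_neighborhood(
    rows,
    key=lambda r: (r[0], r[1]),                # sort on name, then address
    window=3,
    is_duplicate=lambda a, b: a[:2] == b[:2],  # rule: name and address match
)
```

After sorting, the two M.A. Khan rows sit next to each other, so a small window is enough to find them.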

Transformation Algorithms
  Rule: Merge rows if name and address match. Window size n = 3.

  Before sorting:
  Name        Address        Dept
  M.A. Khan   50, Lvl Rd.    Sales
  Saleem      25, LB Rd.     R&D
  M.A. Khan   50, Lvl Rd.    PR
  Rahim       30, Snky Rd.   Products

  After sorting:
  Name        Address        Dept
  M.A. Khan   50, Lvl Rd.    Sales
  M.A. Khan   50, Lvl Rd.    PR
  Rahim       30, Snky Rd.   Products
  Saleem      25, LB Rd.     R&D

Transformation Algorithms (Monge and Elkan 97, [3])
  Graph-based transitive closure to reduce the number of passes
  1. Use the sorted neighborhood technique and sort records based on identified keys
  2. Create an undirected graph structure where nodes correspond to records and edges correspond to the "is a duplicate of" relationship
  3. Records R1 and R2 need not be compared in any pass if they belong to the same connected component
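One way to track the connected components in the graph-based method is a union-find (disjoint-set) structure, as in the Python sketch below. Using union-find is an assumption made here for illustration; the class and function names are hypothetical, and the records are assumed to be sorted already.

```python
class DisjointSet:
    """Tracks connected components of the 'is a duplicate of' graph."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def closure_dedup(records, window, is_duplicate):
    """Sliding-window pass that skips pairs already in one component."""
    components = DisjointSet(len(records))
    comparisons = 0
    for i in range(len(records)):
        for j in range(i + 1, min(i + window, len(records))):
            if components.find(i) == components.find(j):
                continue  # step 3: already known to be duplicates
            comparisons += 1
            if is_duplicate(records[i], records[j]):
                components.union(i, j)
    return components, comparisons

# Hypothetical sorted keys; equality stands in for the matching rule.
records = ["khan", "khan", "khan", "rahim"]
components, comparisons = closure_dedup(records, 3, lambda a, b: a == b)
```

For this input the pass performs 4 comparisons, whereas a plain window of 3 would perform 5, because the second and third records are already linked through the first.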

Transformation Algorithms
  [Figure: records R1 to R5 compared under two strategies.]
  Naïve sliding window
    Slide 0: R1, R2, R3
    Slide 1: R2, R3, R4
    Slide 2: R3, R4, R5
  Graph-based transitive closure
    Slide 0: R1, R2, R3
    Slide 1: R3, R4, R5
    No need to compare R1/R2 with R4/R5

Integration
  Combining disparate data sources into a single schematic structure
  Schema Integration: Forming an integrated schematic structure from the disparate data sources
  Data Integration: Cleaning and merging data from different sources

Schema Integration
  Consider the following schemata [4]:
    Cars(serialno, model, colour, stereo, glasstint, ...)
  and
    Autos(serienNr, modelle, farbe)
    Optionen(serienNr, stereo, glastint, ...)

Schema Integration Challenges
  Naming differences
  Structural differences
  Data type differences
  Missing fields
  Semantic differences

Schema Integration
  Generic Architecture of an Integrator
  [Figure: a Mediator / Constructor sits on top of several Wrapper / Extractor components, each of which sits on top of a data source.]

Integration
  Wrapper / Extractor
    Creates a common view across all data sources
    Bridges differences in naming, type and schema structure
    Wrappers do not physically extract data from the data sources
  Mediator / Constructor
    Constructs an integrated schematic structure
    Performs data integration and populates the data warehouse

Tools for Data Cleaning and Integration
  dfpower
    From Dataflux corporation (http://www.dataflux.com/)
    De-duplication engine
    Analyzes data based on values and number of occurrences
    Does not support detection of semantic duplicates based on user-specified rules
    Permits duplicates to be grouped or merged

Tools for Data Cleaning and Integration
  ETI* Data Cleanse
    From Evolutionary Technologies Int'l (http://www.evtech.com/)
    Table-driven data cleaning, matching and quality review, duplicate matching, imprecise spelling correction
    Supports meta-data repositories to store schemas, transformation rules, interrelationships, etc.

Tools for Data Cleaning and Integration
  SSA Name/Data Clustering Engine
    From Search Software America (http://www.searchsoftware.co.uk/)
    Addresses errors in spelling, typing, transcription, nicknames, synonyms, abbreviations, prefix/suffix variations, punctuation, casing, etc.
    Supports user-specified transformation rules
    Scalable up to 500 M records

Summary
  OLAP versus OLTP
  Characteristics of OLAP queries
  Data Warehousing systems
  Data Cleaning Issues
  Dirty Data
  Cleaning Algorithms
  Integration of Data and Schema

Introduction to Data Warehousing and OLAP
Part II: Data Models and Warehouse Design

Example OLAP Queries
  How is the employee attrition scene changing over the years across the company?
  Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
  Is it financially viable to continue our manufacturing unit in Taiwan?

OLAP Query Characteristics
  Aggregation and summarization over large data sets
  Clustering
  Trend detection
  Multi-dimensional projections

A Typical Warehouse
  [Figure: a Hypercube Core together with Materialized Views.]

A Typical Warehouse
  Hypercube Core
    Manages the atomic data elements
    Global schematic structure for the entire warehouse
    Based on the multi-dimensional data model
  Materialized Views
    Physical views for faster aggregate-query answering
    De-normalization of the core

The Sales Hyper Cube
  [Figure: a cube with axes Product, Week and Branch.]

The Sales Hyper Cube
  Sales is the fact
  Branch, Product and Week are dimensions

Operations on Hyper Cubes
  Pivoting: Choosing (rotating the cube on a pivot) a set of dimensions for display
  Slicing-dicing: Select some subset of the cube
  Roll-up: Aggregate a dimension to a smaller dimension (roll up the Week dimension into months)
  Drill-down: Open an aggregated dimension to reveal details (open up months to reveal week-by-week information)
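Roll-up and slicing on the Sales hyper cube can be illustrated with a small Python sketch. The sample facts, the function names, and the simplification that every month has exactly four weeks are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical facts: (branch, product, week) -> sales.
cube = {
    ("North", "Tea", 1): 10,
    ("North", "Tea", 5): 20,
    ("North", "Coffee", 1): 5,
    ("South", "Tea", 2): 7,
}

def week_to_month(week):
    # Assumption for illustration only: each month has exactly four weeks.
    return (week - 1) // 4 + 1

def roll_up_weeks(cube):
    """Roll-up: aggregate the Week dimension into months."""
    monthly = defaultdict(int)
    for (branch, product, week), sales in cube.items():
        monthly[(branch, product, week_to_month(week))] += sales
    return dict(monthly)

def slice_cube(cube, branch):
    """Slice: fix the Branch dimension and keep the remaining two."""
    return {(p, w): s for (b, p, w), s in cube.items() if b == branch}

monthly = roll_up_weeks(cube)
south = slice_cube(cube, "South")
```

Drill-down is the reverse lookup: going from a cell of the monthly result back to the weekly cells of the original cube that contributed to it.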

Implementation of Hyper Cubes
  Multi-dimensional to relational mapping (ROLAP)
    Map hyper cube queries to relational queries and maintain the data cube in a set of RDBMS tables
    Ex: True Relational OLAP from Microstrategy Inc (http://www.microstrategy.com/)
  Native multidimensional model (MOLAP)
    Use a separate storage model for multidimensional data
    Ex: Arbor Essbase (http://www.arborsoft.com/)

Physical models: Star
  [Figure: a Sales Fact Table (Brnch, Prod, Wk, Sales) linked to the Branch, Product and Week Dimension Tables.]

Star Schema
  Features
    Central Fact Table
    Set of supporting dimension tables
    Denormalized data storage
  Advantages
    Simple to comprehend and design
    Small meta-data
    Quick query responses
  Limitations
    Not robust towards changes (changes in dimension tables)
    Enormous amount of redundancy in dimension-table data

Physical models: Snowflake
  [Figure: a fact table (Brnch, Prod, Wk, Sales) surrounded by the tables Branch, Product, Division, Options, Scheme, Unit and Week.]

Snowflake Schema
  Features
    Central Fact Table
    Normalized dimension tables storing atomic data units
  Advantages
    Faster query responses
    Easy updates
  Limitations
    Large amount of meta-data
    May result in too many tables
    Harder to comprehend manually

Physical models: Constellation
  [Figure: a Sales Fact Table (Brnch, Prod, Wk, Sales) and a Discounts Fact Table (Wk, Prod, Sch, Dist) with the Branch, Product, Week and Scheme Dimension Tables.]

Constellation
  Most commonly used architecture
  Used when multiple fact tables are needed
  Usually has a main fact table and several auxiliary fact tables, which are summary tables or materialized views over the main fact table
  Helps in faster query answering for frequently asked queries
  Costlier to update than snowflake

Issues in Data Cubes
  Curse of high dimensionality
    Currently known index structures degrade to linear search when the number of dimensions becomes high
  Categorical dimensions
    In order to run certain algorithms like clustering, dimensions should belong to ordinal classes
    Categorical dimensions are difficult to index
  Ordinal changes during aggregation
    Certain dimensions may change their ordinal property when aggregated and should be indexed at several levels. Ex: Student names are ordered lexicographically, but when aggregated into classes, they are ordered by graduation year.

The Time Dimension
  Mandatory in most warehouse applications
  Has several meanings and roll-up techniques depending on application context
    Simple calendar based rollup
    Fiscal calendar based rollup
    Academic calendar based rollup
  Need to separately index special dates like releases, events, etc.
  Order of traversal of the time dimension is important

Materialized Views
  Summary tables that create physical views of the fact table
  Trade-off between faster query answering and increased complexity during updates
  When to materialize?
    Use the result to search space (RSS) ratio of the query: (#rows returned / #rows scanned)
    Materialize a summary if the RSS ratio is too small and the query is too frequent
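The RSS rule of thumb can be written down as a small Python check. The slides give no concrete cut-offs, so the thresholds max_ratio and min_runs_per_day below, like the function name, are purely hypothetical values chosen for illustration.

```python
def should_materialize(rows_returned, rows_scanned, runs_per_day,
                       max_ratio=0.01, min_runs_per_day=10):
    """Suggest a summary table when a frequent query scans far more than it returns."""
    if rows_scanned == 0:
        return False  # nothing scanned, nothing to gain
    rss = rows_returned / rows_scanned  # result to search space ratio
    # Assumed thresholds: "too small" and "too frequent" are not quantified in the source.
    return rss < max_ratio and runs_per_day >= min_runs_per_day

# A yearly sales total that scans a million fact rows to return 12 rows.
frequent_aggregate = should_materialize(12, 1_000_000, runs_per_day=50)
# A query that returns half of what it scans gains little from a summary.
broad_scan = should_materialize(500_000, 1_000_000, runs_per_day=50)
```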

Revision History Table(s)
  Manages data that is revised over time
  Queries select the appropriate value based on the relevant version
  Usually required for most warehousing applications

  Id   Val       Revised
  1    110,050   01-01-2000
  1    130,045   01-06-2000
  1    140,011   01-01-2001
  (Id 1: turnover per employee)

Designing a Data Warehouse
  [Figure: Enterprise Model, DW Logical Model and DW Physical Model in sequence; roles shown: End-user + DBA, DBA + Automated Tools.]

Enterprise to Warehouse
  Some thumb rules:
    Warehouse logical model closely resembles the enterprise model
    Some transformation is usually necessary from enterprise to warehouse models
    Warehouse logical model should depict denormalized data sets implicit in the enterprise model
    Special planning is required for managing the time dimension and revision histories

OLTP to Warehouse Models
  OLTP databases are usually organized around the enterprise model
  OLTP schemas provide a good starting point for designing OLAP logical models

OLTP to Warehouse Models
  Some thumb rules when converting OLTP schemas into OLAP schemas:
    Look for operational data fields and remove them (Ex: Counter-sales table containing register number, cashier Emp_id)
    Add a time element (and version elements if necessary) to data sets before populating the warehouse
    Decide on derived data and summary tables at design time itself
    Iterate between transformation rule specification, integration and schema design
    Add the commonly required summary information ALL to every domain

Introduction to Data Warehousing and OLAP
Part III: Index structures and query processing

Classes of Dimensions
  Categorical
    {Cats, dogs, sheep, cows, bulls, buffaloes}
  Ordinal
    Totally ordered (integers), partially ordered (credentials of a candidate)
  Sparse
    Small number of data points per value
  Dense
    Large number of data points per value

Multi-dimensional indexes
  Usually based around ordinal classes
  Different kinds of indexes for sparse and dense data sets
  Performance may depend on the storage structure for the data set

Representing Multi-dimensional Data
  Multi-level sorting
    Sorts data based on different dimensions one after the other
    Simple to implement
    Searching is fast if the dominant attribute is part of the query
    Search becomes fragmented if the dominant attribute is omitted from the query

  Dim 1   Dim 2   Dim 3
  1       34      2
  1       34      5
  1       34      10
  1       56      20
  2       45      9
  2       49      10
  3       69      20
  3       69      30
  4       23      29
  4       23      48
  4       40      50

Representing Multi-dimensional Data
  Space filling curves
    Sorted on all attributes at once
    Location of a data point easily computable
    Suffers with increase in number of dimensions
  [Figure: a space-filling curve drawn over a two-dimensional grid.]
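As one concrete example of a space-filling curve, the Python sketch below computes a Z-order (Morton) code by interleaving the bits of two coordinates. The slides do not say which curve they have in mind, so the choice of Z-order, the function name and the 8-bit coordinate width are assumptions.

```python
def z_order(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

# Sorting on the code orders the points on both attributes at once.
points = [(3, 5), (0, 1), (1, 1), (2, 0), (1, 0)]
ordered = sorted(points, key=lambda p: z_order(*p))
```

The position of a point on the curve is computed directly from its coordinates, which is the property the slide refers to; with more dimensions, more bits must be interleaved and neighbouring points end up farther apart on the curve.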

Multi-dimensional Indexes
  Ordered index on multiple attributes
    Considers a composite key as a tuple of simple keys $(k_1, k_2, \ldots, k_n)$
    Ordered index files are maintained by ordering each key in sequence

Multi-dimensional Indexes
  Partitioned Hashing
    Given a composite key $(k_1, k_2, \ldots, k_n)$, partitioned hashing returns n different bucket numbers
    The hash bucket is determined by concatenating the n numbers
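Partitioned hashing can be sketched in Python as below. The function name, the use of CRC32 as the per-component hash, and the number of bits given to each component are assumptions made for illustration.

```python
import zlib

def partitioned_hash(key, bits_per_part):
    """Hash each key component separately and concatenate the results."""
    bucket = 0
    for part, bits in zip(key, bits_per_part):
        # CRC32 is an assumed, deterministic stand-in for a per-attribute hash.
        h = zlib.crc32(repr(part).encode()) & ((1 << bits) - 1)
        bucket = (bucket << bits) | h  # append this component's bits
    return bucket

# Composite key (dept, age): 2 bits for dept, 3 bits for age, 32 buckets in total.
b1 = partitioned_hash(("Sales", 30), (2, 3))
b2 = partitioned_hash(("Sales", 45), (2, 3))
```

Because each component owns a fixed group of bits, a query that specifies only some of the components still narrows the search to the buckets that share those bits.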

Multi-dimensional Indexes
  Grid Files
    Partitions the range of key values for each key into several buckets
    Combinations of buckets of each key form a grid
    A grid file stores a grid like any other multidimensional data set

Grid Files
  [Figure: a grid with Grade (A, B, C, D) on one axis and Roll No. partitions 1 to 5 on the other; grid cells point into a bucket pool.]

  Roll No. partition   Range
  1                    001 to 025
  2                    026 to 050
  3                    051 to 075
  4                    076 to 100
  5                    101 to 125

Multi-dimensional Indexes
  Bit-map indexes
    Used on fields that are sparse (i.e. that have only a small number of distinct values, for example gender, grade, etc.)
    A bit vector enumerates all possible values and sets the corresponding bit for each data element
    Much more compact than other index structures
    Useful for efficiently answering composite queries over multiple bit-vectored fields
    Can be integrated with tree indexes

Multi-dimensional Indexes
  Encoding Bit-map indexes
    Grade = {A, B, C, D, E, F}      Subject = {DB, AI, PDS}
    A = 000001                      DB = 001
    B = 000010                      AI = 010
    C = 000100                      PDS = 100
    D = 001000                      no value = 000
    E = 010000
    F = 100000
    No value = 000000
    Student who has scored A in DB and AI: (000001 && 001 && 001)
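The Python sketch below shows how bitmaps answer a composite query with bitwise AND. It assumes the common layout of one bit vector per distinct value with one bit per row, which is a different arrangement from the per-value codes listed above; the sample rows and function names are hypothetical.

```python
def build_bitmaps(rows, column):
    """One integer bit vector per distinct value; bit i is set if row i has it."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

def matching_rows(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical exam results.
rows = [
    {"student": "s1", "subject": "DB", "grade": "A"},
    {"student": "s1", "subject": "AI", "grade": "B"},
    {"student": "s2", "subject": "DB", "grade": "A"},
    {"student": "s2", "subject": "AI", "grade": "A"},
]
grade = build_bitmaps(rows, "grade")
subject = build_bitmaps(rows, "subject")

# Composite query: one AND per pair of fields, no scan of the rows themselves.
a_in_db = matching_rows(grade["A"] & subject["DB"])
a_in_ai = matching_rows(grade["A"] & subject["AI"])
both = {rows[i]["student"] for i in a_in_db} & {rows[i]["student"] for i in a_in_ai}
```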

Multidimensional Tree Indexes
  KD Trees
    A binary tree structure that can store n-dimensional data points
    Each dimension is compared at the appropriate level
    Useful for point queries

KD Trees
  Let data be represented as 2-dimensional points of the form (x, y) representing (salary, age)
  Example data set:
    (2500, 20) (5000, 32) (4500, 28) (2000, 23) (4800, 25) (1800, 18) (6500, 27)

KD Trees
  Data set: (2500, 20) (5000, 32) (4500, 28) (2000, 23) (4800, 25) (1800, 18) (6500, 27)
  [Figure: the KD tree built from this data set, with nodes (1800, 18), (2000, 23), (2500, 20), (4500, 28), (5000, 32), (4800, 25) and (6500, 27).]
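A minimal Python sketch of KD-tree insertion and point lookup follows, using the (salary, age) data set above. The class and function names are hypothetical, and the rule for ties (equal values go to the right subtree) is an assumption.

```python
class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def insert(node, point, depth=0):
    """Insert a point, comparing one dimension per level (x, then y, then x, ...)."""
    if node is None:
        return Node(point)
    axis = depth % len(point)
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        # Assumed tie rule: equal values go to the right subtree.
        node.right = insert(node.right, point, depth + 1)
    return node

def contains(node, point):
    """Point query: follow a single root-to-leaf path."""
    depth = 0
    while node is not None:
        if node.point == point:
            return True
        axis = depth % len(point)
        node = node.left if point[axis] < node.point[axis] else node.right
        depth += 1
    return False

data = [(2500, 20), (5000, 32), (4500, 28), (2000, 23),
        (4800, 25), (1800, 18), (6500, 27)]
root = None
for p in data:
    root = insert(root, p)
```

Inserting the same points in a different order produces a differently shaped tree, since the first point inserted always becomes the root.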

KD Trees
  Each point divides the search space along one of the dimensions
  The structure of the tree (and hence its performance) is sensitive to the order of insertion of data points

Quad Trees
  Initially, the index contains only one bucket representing the entire space
  If the number of data points in any bucket exceeds the maximum limit, the bucket is split in two along each dimension, and the resulting buckets are added as children of the larger bucket
  When the number of dimensions is 2, splitting results in a quad

Quad Trees
  [Figure: Quad Trees]

R Trees
  Manages regions
  Leaf nodes represent data regions and non-leaf nodes represent virtual (non-data) regions
  A node is split when it contains too many regions
  Addition of regions begins from the root node until the smallest accommodating region is found (possibly by splitting one or more regions)
  Sibling regions may overlap but may not subsume one another

R Trees
  [Figure: R tree regions; legend: Data region, Virtual region.]

R Trees
  Suitable for range, neighborhood and nearness searches
  Tree structure and performance are sensitive to the order in which data regions are added
  Suffers from the curse of high dimensionality

Indexing Categorical Data (Ref [7])
  Categorical Data
    Have no ordinal relationship
    Cannot be compared, except for equality
    Can be represented as sets in many cases
    Example categorical attributes: team members of a given project, ingredients for a given recipe, products manufactured by a unit, etc.
    Comparison operators on sets: equality, membership, superset, subset

Signatures
  Represent a set as a bitmap where each bit corresponds to an object in a larger UoD
  UoD = {set of all ingredients}
  $S, T \subseteq \mathrm{UoD}$: ingredients for two recipes
  $s, t$: corresponding bitmaps of $S$ and $T$
  Queries:
    $S \subseteq T \iff s \wedge \neg t = 0$
    $S \supseteq T \iff t \wedge \neg s = 0$
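The subset and superset tests on signatures reduce to two bitwise operations, as the Python sketch below shows. The universe of ingredients and the function names are hypothetical; the signature here uses exactly one bit per object of the UoD.

```python
def signature(items, universe):
    """Bitmap of a set: bit i is set if universe[i] is in the set."""
    position = {obj: i for i, obj in enumerate(universe)}
    sig = 0
    for item in items:
        sig |= 1 << position[item]
    return sig

def is_subset(s, t):
    """S is a subset of T exactly when s AND (NOT t) is zero."""
    return s & ~t == 0

def is_superset(s, t):
    """S is a superset of T exactly when t AND (NOT s) is zero."""
    return t & ~s == 0

universe = ["flour", "sugar", "egg", "milk"]  # hypothetical UoD
s = signature({"flour", "egg"}, universe)
t = signature({"flour", "egg", "milk"}, universe)
```

With this example, is_subset(s, t) holds and is_superset(s, t) does not, because T contains milk while S does not.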

Signature Trees
  Leaf nodes contain (signature, datapointer) pairs
  Non-leaf nodes are formed by bit-wise ORing of their child nodes
  Traverse the tree by ANDing the query signature with the node signature

  [Figure: example signature tree]
    1111
      1100
        1000
        0100
      1011
        1001
        0011

Extensible Signature Hashing
  Hash tables are constructed based on the most significant d bits of the signature
  Hash levels are extended by extending d whenever overflow occurs

Extensible Signature Hashing
  [Figure: a directory with global depth n = 3 and entries 000, 001, 010, 011, 100, 101, 110, 111 pointing to four buckets.]
    d = 2: bucket for records whose hash value starts with 00
    d = 3: bucket for records whose hash value starts with 010
    d = 3: bucket for records whose hash value starts with 011
    d = 1: bucket for records whose hash value starts with 1

Summary
  The OLAP Hypercube
  Materialized views
  ROLAP and MOLAP implementations
  Star, Snowflake and Constellation
  Time dimensions and revision tables
  Thumb rules for OLAP design
  Multi-dimensional index structures

Furthermore
  Topics not addressed for reasons of brevity
    Query Language constructs
    Data Mining over warehouses
    Handling semi-structured data in warehouses
    Performance Tuning
    Maintenance of materialized views
    Browsing and Visualization

Thank You

References
  1. Erhard Rahm, Hong Hai Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Vol. 23, No. 4, Dec 2000.
  2. Vijay T. Raisinghani. Cleaning Methods in Data Warehousing. PhD seminar report, IIT Bombay, Dec 1999.
  3. A. Monge, C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997. http://citeseer.ist.psu.edu/monge97efficient.html
  4. H. Garcia-Molina, J.D. Ullman, J. Widom. Database Systems: The Complete Book. Pearson Education, 2004.
  5. R. Agrawal, A. Gupta, S. Sarawagi. Modeling multidimensional databases. ICDE, 1997.
  6. Oliver Guenther. Data Warehouses and Data Mining. Course Notes, Humboldt University, Berlin. http://www.wiwi.huberlin.de/~guenther/dw/dw_ws03.html
  7. S. Helmer, G. Moerkotte. A study of four index structures for set-valued attributes of low cardinality. Reihe Informatik 2, University of Mannheim, 1999. pp. 20.

Conferences and Workshops
  DaWaK: Data Warehousing and Knowledge Discovery (http://www.dexa.org/)
  VLDB: Very Large Databases (http://www.vldb.org/)
  EDBT: Extending Database Technology (http://www.edbt.org/)
  DOLAP: ACM International Workshop on Data Warehousing and OLAP (http://www.cis.drexel.edu/faculty/song/dolap.html)

Some WWW Links
  DW Infocenter (http://www.dwinfocenter.org/)
  Data.com (http://www.data.com/)
  The Data Warehousing Institute (http://www.dw-institute.com/)
  KDNuggets, a comprehensive portal on knowledge discovery (http://www.kdnuggets.com/)
  Oracle Data Warehousing Tutorial (regn. required) (http://www.oracle.com/technology/idevelop/online/courses/oln/how_to04.html)
  Data Warehouse: Online Resources (http://www.dci.com/news/datawarehouse/articles/1998/05/links.htm)
  Data Warehousing and OLAP bibliography (http://www.ondelette.com/olap/dwbib.html)