DATA WAREHOUSE AND KNOWLEDGE DISCOVERY




DATA WAREHOUSE AND KNOWLEDGE DISCOVERY
Prof. Fabio A. Schreiber
Dipartimento di Elettronica e Informazione
Politecnico di Milano

DATA WAREHOUSE (DW)
- A TECHNIQUE FOR CORRECTLY ASSEMBLING AND MANAGING DATA COMING FROM DIFFERENT SOURCES TO OBTAIN A DETAILED VIEW OF AN ECONOMIC SYSTEM
- IT IS AN INTEGRATED, PERMANENT, TIME-VARIANT, TOPIC-ORIENTED DATA COLLECTION TO SUPPORT MANAGERIAL DECISIONS
- IT IS THE SEPARATION ELEMENT BETWEEN OLTP AND DSS WORKLOADS
Fabio A. Schreiber data warehouse 1

ON LINE TRANSACTION PROCESSING (OLTP)
- TYPICAL OF ELECTRONIC DATA PROCESSING (EDP) APPLICATIONS
- TRANSACTIONS
  - MUST POSSESS ACID PROPERTIES
  - ARE STRUCTURED AND REPETITIVE
  - ARE SHORT AND ISOLATED
- DATA MUST BE DETAILED AND UP-TO-DATE
- ACCESS TO DATA IS MAINLY MADE THROUGH THE PRIMARY KEY
- DATABASE SIZE VARIES BETWEEN 10^2 MBYTE AND 10 GBYTE
- TRANSACTION THROUGHPUT IS THE MAIN PERFORMANCE METRIC

ON LINE ANALYTICAL PROCESSING (OLAP)
- TYPICAL OF DECISION SUPPORT SYSTEMS
- THE WORKLOAD CONSISTS OF VERY COMPLEX QUERIES ACCESSING MORE THAN 10^6 RECORDS
- DATA ARE OF HISTORICAL TYPE, AGGREGATED FROM DIFFERENT SOURCES
- WAREHOUSE SIZE GOES BEYOND A TBYTE
- QUERY THROUGHPUT AND RESPONSE TIME ARE THE MAIN PERFORMANCE INDEXES

DATA MODELS FOR OLAP
- THEY MUST SUPPORT SOPHISTICATED ANALYSES AND COMPUTATIONS OVER DIFFERENT HIERARCHICAL DIMENSIONS
- THE MOST SUITABLE LOGICAL DATA MODEL IS A MULTIDIMENSIONAL STRUCTURE: THE DATA CUBE
- THE DIMENSIONS OF THE CUBE ARE THE ATTRIBUTES ON WHICH SEARCHES ARE MADE (KEYS)
- EACH DIMENSION IN TURN CAN BE A HIERARCHY
  - DATE {DAY - MONTH - QUARTER - YEAR}
  - PRODUCT {NAME - TYPE - CATEGORY}, E.G. (LAND ROVER - CROSS-COUNTRY - MOTOR VEHICLE)
- THE CELLS OF THE CUBE CONTAIN THE METRIC VALUES PERTAINING TO THE DIMENSIONAL VALUES
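As a rough illustration of the cube model described on this slide, the following Python sketch stores cells in a dictionary keyed by dimension values and derives the coarser levels of the DATE hierarchy from a day. The metric name `units_sold`, the sample numbers, and the key layout are assumptions made for the example; they do not come from the slides.

```python
from datetime import date

def date_hierarchy(d: date) -> dict:
    """Return the coordinate of a day at each level of the DATE hierarchy."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "day": d.isoformat(),
        "month": f"{d.year}-{d.month:02d}",
        "quarter": f"{d.year}-Q{quarter}",
        "year": str(d.year),
    }

# PRODUCT hierarchy from the slide: name -> (type, category)
PRODUCT_HIERARCHY = {"Land Rover": ("cross-country", "motor vehicle")}

# Cube cells: the key holds the dimension values, the value holds the metrics.
# The figures are made up for illustration.
cube = {
    (date(1996, 5, 17), "Land Rover"): {"units_sold": 3},
    (date(1996, 11, 2), "Land Rover"): {"units_sold": 1},
}

levels = date_hierarchy(date(1996, 5, 17))
product_type, product_category = PRODUCT_HIERARCHY["Land Rover"]
```

Here `levels` places the same day at month, quarter, and year granularity, which is the information an aggregation along the DATE dimension needs.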

LOGICAL DATA MODELS FOR OLAP: INSURANCE COMPANY EXAMPLE
[FIGURE: DATA CUBE]
- DIMENSIONS
  - YEAR: 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997
  - AGE: <20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, >80
  - TYPE: LIFE-HEALTH, THIRD-PARTY, FIRE-THEFT, RC-CAR
- METRIC VALUES: POLICY NUMBER, TOTAL AMOUNT

OLAP OPERATIONS
- ROLL-UP: INCREASES THE DATA AGGREGATION LEVEL
- DRILL-DOWN: INCREASES THE DATA DETAIL LEVEL
- SLICE-AND-DICE: SELECTS AND PROJECTS IN ORDER TO REDUCE DATA DIMENSIONALITY
- PIVOTING: SELECTS TWO DIMENSIONS AROUND WHICH TO AGGREGATE METRIC DATA
- RANKING: SORTS DATA FOLLOWING PREDEFINED CRITERIA
- TRADITIONAL OPERATIONS (SELECTION, COMPUTED ATTRIBUTES, ETC.)
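The operations listed on this slide can be sketched over a small in-memory fact list. This is only an illustrative sketch: the fact rows, the `DIMS` tuple, and the function names are assumptions for the example, and a real OLAP server would implement them very differently.

```python
from collections import defaultdict

# Hypothetical facts: (year, age band, policy type, total amount)
FACTS = [
    (1996, "20-30", "LIFE-HEALTH", 100),
    (1996, "30-40", "LIFE-HEALTH", 150),
    (1996, "20-30", "FIRE-THEFT", 40),
    (1997, "20-30", "LIFE-HEALTH", 120),
    (1997, "30-40", "RC-CAR", 60),
]
DIMS = ("year", "age", "type")

def roll_up(facts, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def slice_(facts, dim, value):
    """Keep only the facts with a fixed value on one dimension."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]

def pivot(facts, row_dim, col_dim):
    """Aggregate the measure around two chosen dimensions (nested dict)."""
    table = defaultdict(dict)
    for (r, c), total in roll_up(facts, (row_dim, col_dim)).items():
        table[r][c] = total
    return dict(table)

def rank(totals, top=None):
    """Sort aggregated cells by decreasing measure."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]

by_year = roll_up(FACTS, ("year",))             # coarser view
by_year_age = roll_up(FACTS, ("year", "age"))   # drill-down: one more dimension
life_only = slice_(FACTS, "type", "LIFE-HEALTH")
year_by_type = pivot(FACTS, "year", "type")
best_year = rank(by_year, top=1)
```

In this sketch drill-down is simply a roll-up that keeps more dimensions, so the two operations share one function.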

OLAP OPERATIONS
[FIGURE: DRILL-DOWN AND ROLL-UP BETWEEN A CUBE WITH AXES TYPE, AGE, YEAR AND A CUBE WITH AXES TYPE, AGE, MONTH]

OLAP OPERATIONS
[FIGURE: PIVOTING, THE SAME CUBE SHOWN WITH THE YEAR, AGE, AND TYPE AXES EXCHANGED]

OLAP OPERATIONS
[FIGURE: SLICE AND DICE ON THE CUBE WITH AXES YEAR, AGE, TYPE]

DW ARCHITECTURE
[FIGURE: ARCHITECTURE DIAGRAM]
- SOURCES: DATABASE 1, DATABASE 2, LEGACY DATABASE, SPARSE FILES
- WAREHOUSE CONSTRUCTION
- COMPANY WAREHOUSE
- REPLICATION AND PROPAGATION
- DATA MARTS
- KNOWLEDGE DISCOVERY AND DATA MINING
- INFORMATION ACCESS AND MANAGEMENT
- 1 + 1 = 3 ?

WAREHOUSE CONSTRUCTION
- DATA COME FROM DIFFERENT AND DIRTY SOURCES
  - UNDOCUMENTED LEGACY SYSTEMS
  - PRODUCTION SYSTEMS WITHOUT ANY INTERNAL INTEGRITY CHECK
  - EXTERNAL SOURCES OF DOUBTFUL QUALITY
- IT IS MANDATORY TO RESTORE DATA QUALITY IN ORDER TO BASE RELIABLE DECISIONS ON THE DATA

WAREHOUSE CONSTRUCTION
- TOOLS FOR DATA QUALITY
  - FOR MIGRATION: TRANSFORM AND REFORMAT DATA FROM DIFFERENT SOURCES
  - FOR SCRUBBING: USE DOMAIN KNOWLEDGE TO SCRUB AND HOMOGENIZE
  - FOR AUDITING: DISCOVER RULES AND RELATIONS AMONG DATA AND VERIFY THEIR VALIDITY
- TOOLS FOR DATA LOADING
  - CHECK FOR REFERENTIAL INTEGRITY VIOLATIONS
  - SORT, AGGREGATE, BUILD DERIVED DATA
  - BUILD INDEXES AND OTHER ACCESS PATHS
  - REPRESENT A HEAVY BATCH WORKLOAD
  - NEED TO ORGANIZE LOADING OPERATIONS IN A PARALLEL OR AN INCREMENTAL WAY

WAREHOUSE CONSTRUCTION
- INCREMENTAL LOADING
  - PERFORMED DURING UPDATING; ONLY NEW OR MODIFIED TUPLES ARE LOADED
  - CONFLICTS WITH NORMAL WORK
  - REQUIRES SHORT TRANSACTIONS (<1000 RECORDS)
  - NEEDS COORDINATION TO GUARANTEE THE CONSISTENCY OF INDEXES AND DERIVED DATA
- UPDATING
  - PERIODICALLY APPLIED, FOLLOWING THE NEEDS OF THE APPLICATIONS
  - DUPLICATION SERVER USAGE
    - DATA DISPATCHING: FOR EACH CHANGE OF THE SOURCE TABLE, TRIGGERS ARE USED TO UPDATE A SNAPSHOT LOG TABLE, WHICH IS PROPAGATED THEREAFTER
    - TRANSACTION DISPATCHING: THE STANDARD LOG IS MONITORED AND THE CHANGES ON THE REPLICATED TABLES ARE TRANSFERRED TO THE DUPLICATION SERVER
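The data dispatching scheme (triggers feeding a snapshot log that is propagated later) can be sketched with Python's built-in `sqlite3` module. The table names, columns, and the `propagate` helper are hypothetical, and the sketch covers inserts and updates only; deletions would need a further trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE snapshot_log (id INTEGER, amount INTEGER, op TEXT);
CREATE TABLE warehouse (id INTEGER PRIMARY KEY, amount INTEGER);

-- Every change of the source table is recorded in the snapshot log.
CREATE TRIGGER log_insert AFTER INSERT ON source
BEGIN
    INSERT INTO snapshot_log VALUES (NEW.id, NEW.amount, 'I');
END;
CREATE TRIGGER log_update AFTER UPDATE ON source
BEGIN
    INSERT INTO snapshot_log VALUES (NEW.id, NEW.amount, 'U');
END;
""")

def propagate(conn):
    """Apply the logged changes to the warehouse copy, then empty the log."""
    rows = conn.execute(
        "SELECT id, amount FROM snapshot_log ORDER BY rowid"
    ).fetchall()
    for row_id, amount in rows:
        conn.execute(
            "INSERT OR REPLACE INTO warehouse VALUES (?, ?)", (row_id, amount)
        )
    conn.execute("DELETE FROM snapshot_log")
    conn.commit()
    return len(rows)

conn.execute("INSERT INTO source VALUES (1, 100)")
conn.execute("INSERT INTO source VALUES (2, 200)")
conn.execute("UPDATE source SET amount = 150 WHERE id = 1")
applied = propagate(conn)
warehouse_rows = conn.execute(
    "SELECT id, amount FROM warehouse ORDER BY id"
).fetchall()
```

Only the three logged changes travel to the warehouse copy, which is the point of incremental loading: the source table itself is never rescanned.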

DW DESIGN METHODOLOGY
- INPUT DATA ANALYSIS
  - SELECTION OF RELEVANT INFORMATION SOURCES
  - TRANSLATION INTO A REFERENCE CONCEPTUAL MODEL (E-R)
  - INFORMATION SOURCES ANALYSIS
  - IDENTIFICATION OF FACTS, MEASURES, DIMENSIONS
- INTEGRATION IN A GLOBAL CONCEPTUAL SCHEMA
- DATA WAREHOUSE DESIGN
  - CONCEPTUAL: INTRODUCTION OF AGGREGATED DATA AND HISTORICAL DATA
  - LOGICAL
- DATA MART DESIGN (MULTIDIMENSIONAL DB)
  - IDENTIFICATION OF FACTS AND DIMENSIONS
  - E-R SCHEMA RESTRUCTURING
  - DIMENSIONAL GRAPH DERIVATION
  - TRANSLATION INTO THE LOGICAL MODEL

FREQUENT FAILURE CAUSES
- IGNORING DATA QUALITY
  - ACCURACY
  - COMPLETENESS
  - CONSISTENCY
  - TIMELINESS
  - AVAILABILITY
- NOT STORING NECESSARY DATA
- IGNORING DATA IN EXTERNAL SOURCES
- IGNORING SOFT DATA (E.G. SUBJECTIVE JUDGEMENTS)

MULTIDIMENSIONAL OLAP SERVER (MOLAP)
- DIRECTLY IMPLEMENTS THE CUBE MODEL
- MULTIDIMENSIONAL MATRIX STRUCTURES
  - OPTIMAL FOR DENSE STRUCTURES
  - SEARCH BY ADDRESS BECOMES AN ALGEBRAIC COMPUTATION
- HIGH AND CONSTANT QUERY PROCESSING PERFORMANCE
  - SPECIALIZED ACCESS METHODS
  - AGGREGATION AND COMPILATION ARE PERFORMED IN ADVANCE
- LIMITED SCALABILITY DUE TO PREPROCESSING
- MORE INGENUITY REQUIRED OF THE DATA ADMINISTRATOR
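The remark that search by address becomes an algebraic computation can be illustrated with a dense array stored in row-major order: the position of a cell follows directly from its coordinates. The cube shape and the stored value are assumptions chosen for the example.

```python
def cell_offset(coords, shape):
    """Row-major offset of a cell in a dense multidimensional array."""
    offset = 0
    for index, size in zip(coords, shape):
        if not 0 <= index < size:
            raise IndexError("coordinate out of range")
        offset = offset * size + index
    return offset

# Hypothetical cube: 8 years x 8 age bands x 4 policy types
SHAPE = (8, 8, 4)
cells = [0] * (8 * 8 * 4)

# Store a value for year index 6, age band index 1, policy type index 0.
cells[cell_offset((6, 1, 0), SHAPE)] = 100
```

No index lookup is needed to reach the cell, which explains the constant query time on dense data. The same layout wastes space when most cells are empty.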

RELATIONAL OLAP SERVER (ROLAP)
- USES A STANDARD RDBMS TO IMPLEMENT THE MULTIDIMENSIONAL STRUCTURE BY MEANS OF THE GROUP BY OPERATION
- THE SCHEMA HAS A STAR OR A SNOWFLAKE SHAPE (THE SNOWFLAKE NORMALIZES HIERARCHIES)
- THE TUPLES OF THE CENTRAL FACT TABLE CONSIST OF POINTERS (FOREIGN KEYS) TO THE DIMENSION TABLES AND OF THE VALUES FOR THE DESCRIBED COORDINATES
    f = (k_1, ..., k_n, v_1, ..., v_m)
- DIMENSION TABLES STORE TUPLES WITH THE ATTRIBUTES PERTAINING TO EACH DIMENSION
    d_1 = (k_1, a_1, ..., a_n)
- FACT CONSTELLATION: MANY FACT TABLES SHARE DIMENSION TABLES HAVING THE SAME STRUCTURE
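A minimal star schema queried with GROUP BY can be sketched with Python's built-in `sqlite3` module. The table and column names loosely echo the labels of the schema example (DATEK, QUANTITY, COST), but the exact layout and every data value are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one key plus descriptive attributes.
CREATE TABLE date_dim (datek INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE project_dim (project_id INTEGER PRIMARY KEY, manager TEXT);

-- Fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE fact (
    datek INTEGER REFERENCES date_dim(datek),
    project_id INTEGER REFERENCES project_dim(project_id),
    quantity INTEGER,
    cost INTEGER
);

INSERT INTO date_dim VALUES (1, 11, 1996), (2, 12, 1996), (3, 1, 1997);
INSERT INTO project_dim VALUES (10, 'Rossi'), (20, 'Bianchi');
INSERT INTO fact VALUES (1, 10, 2, 50), (2, 10, 1, 30), (2, 20, 4, 80), (3, 20, 1, 20);
""")

# Roll-up to the YEAR level of the date dimension.
cost_by_year = conn.execute("""
    SELECT d.year, SUM(f.cost)
    FROM fact AS f JOIN date_dim AS d ON f.datek = d.datek
    GROUP BY d.year
    ORDER BY d.year
""").fetchall()

# Drill-down: add the manager attribute of the project dimension.
cost_by_year_manager = conn.execute("""
    SELECT d.year, p.manager, SUM(f.cost)
    FROM fact AS f
    JOIN date_dim AS d ON f.datek = d.datek
    JOIN project_dim AS p ON f.project_id = p.project_id
    GROUP BY d.year, p.manager
    ORDER BY d.year, p.manager
""").fetchall()
```

Each additional dimension in the result costs one more join against the fact table, which is why the design slides stress minimizing joins.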

STAR SCHEMA, SNOWFLAKE, CONSTELLATION
[FIGURE: EXAMPLE SCHEMA; LABELS RECOVERABLE FROM THE DIAGRAM]
- DIMENSION TABLES: COMPONENT (COMPONENT#, ...), PROJECT (PROJECT#, COMPONENT#, ...), INVOICE (INVOICE#, ...), MANAGER (MANAGER#, ...), DATE (DATEK, DATE, MONTH, YEAR)
- FACT TABLE: INVOICE#, DATEK, TOTAL
- FACT TABLE: PROJECT#, MANAGER#, DATEK, QUANTITY, COST
- THE DIAGRAM MARKS THE STAR, SNOWFLAKE, AND CONSTELLATION PARTS OF THE SCHEMA

MAIN PROBLEMS IN DW DESIGN
- LOGICAL STRUCTURES DESIGN
  - IN ORDER TO OPTIMIZE QUERIES
    - NEED TO MINIMIZE JOINS
    - DENORMALIZATION AND DATA REPLICATION
  - TABLE SIZE REDUCTION
    - HORIZONTAL PARTITIONING
    - VERTICAL PARTITIONING BY SPLITTING ROWS (USEFUL FOR DRILL-DOWN)

MAIN PROBLEMS IN DW DESIGN
- PHYSICAL STRUCTURES DESIGN
  - CHOICE OF INDEXES
  - CHOICE OF MATERIALIZED VIEWS
- VIEW AND METADATA MAINTENANCE
- REPLICATION MANAGEMENT
  - HOW AND WHEN TO UPDATE
  - CONSISTENCY MANAGEMENT
- APPLICATIONS IMPLEMENTATION

KNOWLEDGE HIERARCHY
[FIGURE: HIERARCHY WITH AXES ELEMENTS (VOLUME) AND VALUE ADDED]
- LEVELS: DATA, INFORMATION, KNOWLEDGE, WISDOM
- EXAMPLES: VARIABLES, INVOICES, SALES TREND, MARKET RULES, STRATEGIC DECISIONS
- TRANSITIONS BETWEEN LEVELS: STATISTICAL PROCESSING, KNOWLEDGE DISCOVERY PROCEDURES, EXPERIENCE

KNOWLEDGE DISCOVERY AND DATA MINING
- KNOWLEDGE DISCOVERY IN DATABASES AND DATA WAREHOUSES
  - TO IDENTIFY THE MOST SIGNIFICANT INFORMATION
  - TO SHOW IT TO THE USER IN THE MOST CONVENIENT WAY
- DATA MINING
  - APPLICATION OF ALGORITHMS TO RAW DATA IN ORDER TO EXTRACT KNOWLEDGE (RELATIONS, PATTERNS, ...)
  - PREDICTIVE AIM (SIGNAL ANALYSIS, VOICE RECOGNITION, ETC.)
  - DESCRIPTIVE AIM (DECISION SUPPORT SYSTEMS)

KNOWLEDGE DISCOVERY PROCESS (1)
- EVEN IF SPECIALIZED TOOLS ARE AVAILABLE, IT REQUIRES
  - COMPETENCE IN THE TECHNIQUES USED
  - VERY GOOD KNOWLEDGE OF THE APPLICATION DOMAIN
- SEQUENTIAL STEPS
  - SELECTION: CHOICE OF THE SAMPLE DATA ON WHICH THE ANALYSIS SHALL BE FOCUSED
  - PREPROCESSING
    - DATA SAMPLING IN ORDER TO REDUCE THEIR VOLUME
    - DATA SCRUBBING FOR ERRORS AND OMISSIONS

KNOWLEDGE DISCOVERY PROCESS (2)
- TRANSFORMATION: HOMOGENIZATION AND/OR CONVERSION OF DATA TYPES
- DATA MINING: CHOICE OF THE METHOD/ALGORITHM
- INTERPRETATION AND EVALUATION
  - FILTERING OF THE RETRIEVED INFORMATION
  - POSSIBLE REFINEMENT BY REPEATING PREVIOUS STEPS
  - VISUAL PRESENTATION OF THE SEARCH RESULTS (GRAPHICAL OR LOGICAL)

KNOWLEDGE DISCOVERY PROCESS (3)
[FIGURE: RAW DATA -> SELECTION -> TARGET DATA -> PRE-PROCESSING -> PRE-PROCESSED DATA -> TRANSFORMATION -> TRANSFORMED DATA -> DATA MINING -> CORRELATIONS AND PATTERNS -> INTERPRETATION -> KNOWLEDGE]
FROM: G. Piatetsky-Shapiro, 1996

DATA MINING APPLICATIONS
- MAIL-ORDER AND RETAIL SALES
  - WHICH SPECIAL OFFER SHOULD BE MADE
  - HOW TO PLACE GOODS ON THE SHELVES
- MARKETING
  - SALES FORECASTS
  - PRODUCT BUYING PATTERNS
- BANKS
  - LOAN CONTROL
  - CREDIT CARD USAGE (AND MISUSE)
- TELECOMMUNICATIONS
  - SPECIAL RATES

DATA MINING APPLICATIONS
- ASTRONOMY AND ASTROPHYSICS
  - CLASSIFICATION OF STARS AND GALAXIES
- CHEMICAL AND PHARMACEUTICAL RESEARCH
  - DISCOVERY OF NEW COMPOUNDS
  - INTERRELATIONS AMONG CHEMICAL COMPOUNDS
- MOLECULAR BIOLOGY
  - PATTERNS IN GENETIC DATA AND IN MOLECULAR STRUCTURES
- REMOTE SENSING AND METEOROLOGY
  - SATELLITE DATA ANALYSIS
- ECONOMIC AND DEMOGRAPHIC STATISTICS
  - CENSUS DATA ANALYSIS

DATA MINING ALGORITHMS STRUCTURE
- MODEL REPRESENTATION: FORMALISMS TO REPRESENT AND DESCRIBE POSSIBLE PATTERNS
- MODEL EVALUATION: STATISTICAL OR LOGICAL ESTIMATE OF HOW WELL A PATTERN CORRESPONDS TO THE SEARCH CRITERIA
- SEARCH METHOD
  - OF PARAMETERS: SEARCH FOR THE PARAMETERS WHICH OPTIMIZE THE EVALUATION CRITERIA, GIVEN THE SET OF OBSERVATIONS AND THE MODEL REPRESENTATION
  - OF MODEL: THE PARAMETERS ARE APPLIED TO MODELS BELONGING TO THE SAME FAMILY, DIFFERENTIATED BY THE REPRESENTATION TYPE, FOR QUALITY EVALUATION

WHAT KIND OF INFORMATION DO WE GET?
- ASSOCIATIONS: SET OF RULES SPECIFYING THE JOINT OCCURRENCE OF TWO (OR MORE) ELEMENTS
- SEQUENCES: POSSIBILITY OF STATING TEMPORAL SEQUENCES OF EVENTS
- CLASSIFICATIONS: GROUPING OF ELEMENTS INTO CLASSES FOLLOWING A GIVEN MODEL
- CLUSTERS: GROUPING OF ELEMENTS INTO CLASSES WHICH HAVE NOT BEEN DEFINED A PRIORI
- TRENDS: DISCOVERY OF PECULIAR TEMPORAL PATTERNS HAVING A FORECASTING VALUE

DATA MINING TECHNIQUES AND MODELS
- NEURAL NETWORKS
  - VERY POWERFUL TOOL
  - LEARNING CAPABILITY
  - OPAQUE BEHAVIOUR
- INDUCTION
  - DECISION TREES (HIERARCHY OF if-then DECISIONS)
  - RULE INDUCTION (NON-HIERARCHICAL SETS DERIVED FROM DECISION TREES)
  - SELF-EXPLAINING

DATA MINING TECHNIQUES AND MODELS
- RULES DISCOVERY
  - ASSOCIATION RULES (SIMULTANEOUS EVENTS X => Y)
  - SEQUENTIAL ASSOCIATIONS
- DATA VISUALIZATION
  - DATA ARE PREPARED AND GRAPHICALLY PRESENTED IN ORDER TO HIGHLIGHT POSSIBLE IRREGULARITIES OR STRANGE PATTERNS

DATA MINING TECHNIQUES
[TABLE: TECHNIQUES VERSUS THE INFORMATION THEY PROVIDE; THE CELL MARKS ARE NOT RECOVERABLE]
- TECHNIQUES: NEURAL NETWORKS, INDUCTION, RULES DISCOVERY, DATA VISUALIZATION
- INFORMATION: ASSOCIATIONS, SEQUENCES, CLASSIFICATIONS, CLUSTERS, TRENDS

THE MARKET BASKET
- THE BEST-KNOWN MODEL TO WHICH DATA MINING TECHNIQUES ARE APPLIED
- MAINLY, BUT NOT EXCLUSIVELY, USED FOR RETAIL SALE PROBLEMS
- THE GOAL IS TO DISCOVER RECURRENT PATTERNS IN DATA (ASSOCIATION RULES)

THE MARKET BASKET
- I = {i_1, ..., i_k}: SET OF k ELEMENTS (ITEMS)
- B = {b_1, ..., b_n}, b_i ⊆ I: SET OF n SUBSETS (BASKETS) OF I
- EXAMPLES
  - I: GOODS IN A SUPERMARKET; B: A CUSTOMER'S PURCHASE
  - I: WORDS IN A DICTIONARY; B: A DOCUMENT IN A CORPUS
- ASSOCIATION RULE i_1 => i_2
  - i_1 AND i_2 APPEAR TOGETHER IN AT LEAST s% OF THE n BASKETS (SUPPORT)
  - OF ALL THE BASKETS CONTAINING i_1, AT LEAST c% ALSO CONTAIN i_2 (CONFIDENCE)
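Support and confidence as defined on this slide translate almost literally into code. The following is a minimal sketch; the basket contents are hypothetical, and both functions assume that the baskets are Python sets and that the rule body occurs in at least one basket.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, body, head):
    """Fraction of the baskets containing `body` that also contain `head`."""
    return support(baskets, set(body) | set(head)) / support(baskets, body)

# Hypothetical baskets
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(BASKETS, {"bread", "milk"})       # bread and milk together
c = confidence(BASKETS, {"bread"}, {"milk"})  # rule bread => milk
```

With these four baskets, bread and milk appear together in two of them, and two of the three baskets with bread also contain milk.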

EXAMPLES OF RULES
- RULES HAVING DIET COKE AS CONSEQUENT HELP IN DEFINING THE STRATEGIES TO INCREASE THE SALES OF A SPECIFIC PRODUCT
- RULES HAVING GRISSINI (BREADSTICKS) AMONG THEIR ANTECEDENTS HELP IN UNDERSTANDING THE IMPACT OF DISCONTINUING THE SALE OF A SPECIFIC PRODUCT
- RULES HAVING WURSTEL (FRANKFURTERS) AMONG THEIR ANTECEDENTS AND MUSTARD AS CONSEQUENT HIGHLIGHT THE PRODUCT COUPLINGS IN THE ANTECEDENT THAT INDUCE THE SALE OF THE CONSEQUENT

EXAMPLES OF RULES
- RULES CONNECTING ELEMENTS ON SHELF A TO ELEMENTS ON SHELF B HELP IN ALLOCATING PRODUCTS EFFECTIVELY ON THE SHELVES
- THE BEST k RULES HAVING GRISSINI (BREADSTICKS) IN THE CONSEQUENT
  - IN TERMS OF CONFIDENCE
  - IN TERMS OF SUPPORT

ANY PROBLEM?
- c: COFFEE IS IN THE BASKET; NOT c: NO COFFEE IN THE BASKET
- t: TEA IS IN THE BASKET; NOT t: NO TEA IN THE BASKET

             t     NOT t   ROW SUM
  c         20      70       90
  NOT c      5       5       10
  COL SUM   25      75      100

- t => c IS TRUE???
  - SUPPORT s = 20%
  - CONFIDENCE c = P[t AND c] / P[t] = 20/25 = 80%
- PERHAPS, BUT... THOSE WHO BUY COFFEE ANYHOW REACH 90%!!!
- WARNING! A NEGATIVE CORRELATION EXISTS BETWEEN TEA AND COFFEE, SINCE r < 1
  - r = P[t AND c] / (P[t] x P[c]) = 0.89
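The arithmetic of the tea and coffee example can be checked with a few lines of Python. The counts are those given on the slide; the variable names are chosen here for illustration. The ratio called r on the slide is often called lift elsewhere.

```python
# Basket counts (out of 100) from the tea/coffee example
both = 20     # baskets with tea and coffee
tea = 25      # baskets with tea
coffee = 90   # baskets with coffee
total = 100

support = both / total                 # P[t and c]
confidence = both / tea                # P[t and c] / P[t]
baseline = coffee / total              # P[c], coffee buyers regardless of tea
r = (both / total) / ((tea / total) * (coffee / total))
```

The rule t => c looks strong when judged by confidence alone (80%), yet coffee is bought in 90% of all baskets, so tea buyers are actually slightly less likely than average to buy coffee. A value of r below 1 expresses exactly that.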

DATA MINING IN A RELATIONAL ENVIRONMENT
- SQL EXTENSION BY INTEGRATING DATA MINING OPERATORS INTO THE LANGUAGE
- INTEGRATION OF AN SQL SERVER WITH A DATA MINING ENGINE
- OPERATORS
  - MINE RULE FOR ASSOCIATION RULES
  - MINE CLASSIFICATION FOR CLASSIFICATION PROBLEMS
  - MINE INTERVAL FOR DISCRETIZING CONTINUOUS ATTRIBUTES

MINE RULE: EXAMPLE

  MINE RULE SimpleAssociation AS
  SELECT DISTINCT 1..n item AS BODY, 1..1 item AS HEAD, SUPPORT, CONFIDENCE
  FROM Purchase
  GROUP BY transaction
  EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.2

  TR  CLIENT   ITEM            DATE   PRICE    QUANT.
  1   Rossi    Ski-trousers    17/12   90.000  1
  1   Rossi    Climbing-boots  17/12  180.000  1
  2   Bianchi  Sport shirt     18/12   70.000  2
  2   Bianchi  Brown shoes     18/12  250.000  1
  2   Bianchi  Jacket          18/12  400.000  1
  3   Rossi    Jacket          18/12  400.000  1
  4   Bianchi  Sport shirt     19/12   70.000  3
  4   Bianchi  Jacket          19/12  400.000  2

MINE RULE: EXAMPLE

  BODY                      HEAD            SUPPORT  CONFIDENCE
  Ski-trousers              Climbing-boots  0.25     1
  Climbing-boots            Ski-trousers    0.25     1
  Sport shirt               Brown shoes     0.25     0.5
  Sport shirt               Jacket          0.5      1
  Brown shoes               Sport shirt     0.25     1
  Brown shoes               Jacket          0.25     1
  Jacket                    Sport shirt     0.5      0.66
  Jacket                    Brown shoes     0.25     0.33
  Sport shirt, Brown shoes  Jacket          0.25     1
  Sport shirt, Jacket       Brown shoes     0.25     0.5
  Brown shoes, Jacket       Sport shirt     0.25     1
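The rule table of this example can be reproduced with a brute-force enumeration in Python. This is not how a MINE RULE engine works internally; it is a sketch that assumes the Purchase rows group into the four transactions shown in the dictionary below, and it only mimics the semantics of the statement (bodies of one or more items, heads of exactly one item, thresholds 0.1 and 0.2).

```python
from itertools import combinations

# Items per transaction, grouped from the Purchase table of the example
TRANSACTIONS = {
    1: {"Ski-trousers", "Climbing-boots"},
    2: {"Sport shirt", "Brown shoes", "Jacket"},
    3: {"Jacket"},
    4: {"Sport shirt", "Jacket"},
}

def mine_rules(transactions, min_support, min_confidence):
    """Enumerate rules body => head with a single-item head (brute force)."""
    baskets = list(transactions.values())
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def count(itemset):
        # Number of baskets containing every item of `itemset`
        return sum(itemset <= basket for basket in baskets)

    rules = {}
    for size in range(1, len(items)):
        for body in combinations(items, size):
            body_set = frozenset(body)
            body_count = count(body_set)
            if body_count == 0:
                continue
            for head in items:
                if head in body_set:
                    continue
                both = count(body_set | {head})
                sup, conf = both / n, both / body_count
                if sup >= min_support and conf >= min_confidence:
                    rules[(body_set, head)] = (sup, conf)
    return rules

rules = mine_rules(TRANSACTIONS, 0.1, 0.2)
```

The enumeration is exponential in the number of items, so it is only practical for toy data such as this table.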

CLASSIFICATION PROBLEM
- EACH ELEMENT (RECORD) OF A DATA SET IS ASSOCIATED WITH A DISTINCTIVE FEATURE CALLED CLASS, ON THE BASIS OF PREVIOUS EXPERIENCES AND OBSERVATIONS
- PHASE 1: TRAINING
  - A MODEL IS BUILT ON THE KNOWN SET (TRAINING SET) IN SUCH A WAY THAT EACH CLASS IS A PARTITION OF THE SET
- PHASE 2: APPLICATION
  - THE MODEL IS USED TO CLASSIFY NEW DATA

CLASSIFICATION PROBLEM: AN EXAMPLE
- PHASE 1: TRAINING

  MINE CLASSIFICATION CarInsuranceRules AS
  SELECT DISTINCT RULES ID, *, CLASS
  FROM CarInsurance
  CLASSIFY BY Risk

- PHASE 2: APPLICATION

  MINE CLASSIFICATION TEST ClassifiedApplicants AS
  SELECT DISTINCT *, CLASS
  FROM Applicants
  USING CLASSIFICATION FROM CarInsuranceRules AS RULES

CLASSIFICATION PROBLEM: AN EXAMPLE

TRAINING SET (CarInsurance)

  AGE  CAR TYPE  RISK
  17   sports    high
  43   family    low
  68   family    low
  32   truck     low
  23   family    high
  18   family    high
  20   family    high
  45   sports    high
  50   truck     low
  64   truck     high
  46   family    low
  40   family    low

RULES (CarInsuranceRules)

  1. IF Age <= 23 THEN Risk IS High;
  2. IF CarType = sports THEN Risk IS High;
  3. IF CarType IN {family, truck} AND Age > 23 THEN Risk IS Low;
  4. DEFAULT Risk IS Low

APPLICANTS

  AGE  CAR TYPE
  22   family
  60   family
  35   sports

RESULT OF MINE CLASSIFICATION TEST (ClassifiedApplicants)

  AGE  CAR TYPE  CLASS
  22   family    high
  60   family    low
  35   sports    high
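The application phase amounts to evaluating the rules in order on each applicant. The Python sketch below assumes that the comparison in rule 1, whose operator is missing in the transcript, is "Age <= 23"; that reading is consistent with rule 3 and with the classified applicants. The function and variable names are illustrative.

```python
def classify(age, car_type):
    """Apply the four rules in order; the first matching rule wins."""
    if age <= 23:                      # rule 1 (operator assumed to be <=)
        return "high"
    if car_type == "sports":           # rule 2
        return "high"
    if car_type in {"family", "truck"} and age > 23:   # rule 3
        return "low"
    return "low"                       # rule 4: default

TRAINING = [
    (17, "sports", "high"), (43, "family", "low"), (68, "family", "low"),
    (32, "truck", "low"), (23, "family", "high"), (18, "family", "high"),
    (20, "family", "high"), (45, "sports", "high"), (50, "truck", "low"),
    (64, "truck", "high"), (46, "family", "low"), (40, "family", "low"),
]
APPLICANTS = [(22, "family"), (60, "family"), (35, "sports")]

classes = [classify(age, car) for age, car in APPLICANTS]
misclassified = [row for row in TRAINING if classify(row[0], row[1]) != row[2]]
```

Run against the training set, the rules disagree with one record (age 64, truck, high), a reminder that an induced rule set need not fit its training data perfectly.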

DISCRETIZATION PROBLEM
- ONE MUST TRANSFORM A NUMERIC ATTRIBUTE INTO A CATEGORICAL ONE BY DIVIDING THE NUMERIC DOMAIN INTO A SET OF INTERVALS, EACH OF WHICH IS LABELLED WITH A CLASS LABEL, CONSTITUTING THE CATEGORICAL DOMAIN
- DISCRETIZATION METHODS
  - UNSUPERVISED: ONLY THE DISTRIBUTION OF THE VALUES OF THE ATTRIBUTE IS CONSIDERED
  - SUPERVISED: AN EFFORT IS MADE TO KEEP AS MUCH INFORMATION AS POSSIBLE ABOUT THE CLASS ATTRIBUTE ASSOCIATED WITH EACH TUPLE

DISCRETIZATION PROBLEM
- SIMPLE DISCRETIZATION
  - SAME WIDTH INTERVALS: THE NUMERIC DOMAIN IS DIVIDED INTO A GIVEN NUMBER OF EQUAL WIDTH INTERVALS
  - SAME FREQUENCY INTERVALS: INTERVALS ARE NARROWER WHERE VALUES ARE DENSE AND WIDER WHERE VALUES ARE SPARSE

  MINE INTERVAL SameWidthInterval AS
  SELECT DISTINCT ID, LOWER, UPPER
  GENERATING DiscreteCarInsurance
  FROM CarInsurance
  DISCRETIZE Age BY WIDTH USING 3 INTERVALS

DISCRETIZATION PROBLEM: AN EXAMPLE

CarInsurance

  AGE  CAR TYPE  RISK
  17   sports    high
  43   family    low
  68   family    low
  32   truck     low
  23   family    high
  18   family    high
  20   family    high
  45   sports    high
  50   truck     low
  64   truck     high
  46   family    low
  40   family    low

SameWidthInterval

  ID  LOWER  UPPER
  1   17     34
  2   34     52
  3   52     68

DiscreteCarInsurance

  AGE  CAR TYPE  RISK
  1    sports    high
  ---  -----     ----
  1    truck     low
  2    family    low
  ---  -----     ----
  2    truck     low
  3    -----     ----
  3    family    low
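The two simple discretization methods can be sketched in Python on the ages of the CarInsurance example. The function names and the rounding convention are assumptions: with plain division the equal-width boundaries come out as 17, 34, 51, 68, whereas the example table reports 17, 34, 52, 68, so the operator presumably rounds differently.

```python
def equal_width(values, k):
    """Split [min, max] into k intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def equal_frequency(values, k):
    """Split the sorted values into k groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [ordered[(i * n) // k] for i in range(k)] + [ordered[-1]]
    return list(zip(bounds[:-1], bounds[1:]))

def bin_index(value, intervals):
    """1-based ID of the first interval whose upper bound covers the value."""
    for i, (_, upper) in enumerate(intervals, start=1):
        if value <= upper:
            return i
    return len(intervals)

AGES = [17, 43, 68, 32, 23, 18, 20, 45, 50, 64, 46, 40]

width_bins = equal_width(AGES, 3)
freq_bins = equal_frequency(AGES, 3)
discrete_ages = [bin_index(age, width_bins) for age in AGES]
```

The equal-frequency intervals are narrower where the ages cluster, as the slide on simple discretization states, while the equal-width intervals ignore the distribution entirely.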