Data Warehousing and Data Mining. A.A. 04-05 Datawarehousing & Datamining 1

Similar documents

Introduction to Data Mining

Introduction. A. Bellaachia Page: 1

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Building Data Cubes and Mining Them. Jelena Jovanovic

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Information Management course

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

Data Warehousing and Data Mining

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Week 3 lecture slides

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

DATA WAREHOUSING AND OLAP TECHNOLOGY

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Introduction to Data Mining

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Introduction to Data Mining

Chapter 20: Data Analysis

Data Mining as Part of Knowledge Discovery in Databases (KDD)

Data Warehouse: Introduction

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Warehousing and Data Mining

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

An Overview of Database management System, Data warehousing and Data Mining

Foundations of Business Intelligence: Databases and Information Management

Fluency With Information Technology CSE100/IMT100

Data Warehousing Systems: Foundations and Architectures

Data Mining Introduction

Data Mining - Introduction

New Approach of Computing Data Cubes in Data Warehousing

Chapter ML:XI. XI. Cluster Analysis

Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Multi-dimensional index structures Part I: motivation

IST722 Data Warehousing

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Foundations of Business Intelligence: Databases and Information Management

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Data Mining for Successful Healthcare Organizations

Data Mining: Concepts and Techniques

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Database Marketing, Business Intelligence and Knowledge Discovery

DATA WAREHOUSING - OLAP

Course MIS. Foundations of Business Intelligence

SPATIAL DATA CLASSIFICATION AND DATA MINING

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Foundations of Business Intelligence: Databases and Information Management

DATA MINING AND WAREHOUSING CONCEPTS

Data Warehouse design

Data W a Ware r house house and and OLAP II Week 6 1

Technology in Action. Alan Evans Kendall Martin Mary Anne Poatsy. Eleventh Edition. Copyright 2015 Pearson Education, Inc.

Data Warehousing and Online Analytical Processing

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

DATA CUBES E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 DATA CUBES

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

CHAPTER 3. Data Warehouses and OLAP

Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives

Ramesh Bhashyam Teradata Fellow Teradata Corporation

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Warehousing and Data Mining in Business Applications

A Technical Review on On-Line Analytical Processing (OLAP)

University of Gaziantep, Department of Business Administration

Importance or the Role of Data Warehousing and Data Mining in Business Applications

Data Mining System, Functionalities and Applications: A Radical Review

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Lecture 2: Introduction to Business Intelligence. Introduction to Business Intelligence

Datawarehousing and Business Intelligence

Customer Classification And Prediction Based On Data Mining Technique

Concept and Applications of Data Mining. Week 1

Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative Analysis of the Main Providers

Distance Learning and Examining Systems

AMIS 7640 Data Mining for Business Intelligence

A Review of Data Mining Techniques

CS2032 Data warehousing and Data Mining Unit II Page 1

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

not possible or was possible at a high cost for collecting the data.

14. Data Warehousing & Data Mining

Data warehousing. Han, J. and M. Kamber. Data Mining: Concepts and Techniques Morgan Kaufmann.

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

Introduction to Data Mining

Part 22. Data Warehousing

Business Intelligence Solutions. Cognos BI 8. by Adis Terzić

Transcription:

Data Warehousing and Data Mining A.A. 04-05 Datawarehousing & Datamining 1

Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining Association rules Sequential patterns Classification Clustering A.A. 04-05 Datawarehousing & Datamining 2

Introduction and Terminology Evolution of database technology File processing (60s) Relational DBMS (70s) Advanced data models e.g., Object-Oriented, application-oriented (80s) Web-Based Repositories (90s) Data Warehousing and Data Mining (90s) Global/Integrated Information Systems (2000s) A.A. 04-05 Datawarehousing & Datamining 3

Introduction and Terminology Major types of information systems within an organization EXECUTIVE SUPPORT SYSTEMS Few sophisticated users MANAGEMENT LEVEL SYSTEMS Decision Support Systems (DSS) Management Information Sys. (MIS) KNOWLEDGE LEVEL SYSTEMS Office automation, Groupware, Content Distribution systems, Workflow mangement systems TRANSACTION PROCESSING SYSTEMS Enterprise Resource Planning (ERP) Customer Relationship Management (CRM) Many unsophisticated users A.A. 04-05 Datawarehousing & Datamining 4

Introduction and Terminology Transaction processing systems: Support the operational level of the organization, possibly integrating needs of different functional areas (ERP); Perform and record the daily transactions necessary to the conduct of the business Execute simple read/update operations on traditional databases, aiming at maximizing transaction throughput Their activity is described as: OLTP (On-Line Transaction Processing) A.A. 04-05 Datawarehousing & Datamining 5

Introduction and Terminology Knowledge level systems: provide digital support for managing documents (office automation), user cooperation and communication (groupware), storing and retrieving information (content distribution), automation of business procedures (workflow management) Management level systems: support planning, controlling and semi-structured decision making at management level by providing reports and analyses of current and historical data Executive support systems: support unstructured decision making at the strategic level of the organization A.A. 04-05 Datawarehousing & Datamining 6

Introduction and Terminology Multi-Tier architecture for Management level and Executive support systems Decision Making Data Presentation Reporting/Visualization engines Data Analysis OLAP, Data Mining engines PRESENTATION BUSINESS LOGIC Data Warehouses / Data Marts Data Sources Transactional DB, ERP, CRM, Legacy systems DATA A.A. 04-05 Datawarehousing & Datamining 7

Introduction and Terminology OLAP (On-Line Analytical Processing): Reporting based on (multidimensional) data analysis Read-only access on repositories of moderate-large size (typically, data warehouses), aiming at maximizing response time Data Mining: Discovery of novel, implicit patterns from, possibly heterogeneous, data sources Use a mix of sophisticated statistical and highperformance computing techniques A.A. 04-05 Datawarehousing & Datamining 8

Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining Association rules Sequential patterns Classification Clustering A.A. 04-05 Datawarehousing & Datamining 9

Data Warehousing DATA WAREHOUSE Database with the following distinctive characteristics: Separate from operational databases Subject oriented: provides a simple, concise view on one or more selected areas, in support of the decision process Constructed by integrating multiple, heterogeneous data sources Contains historical data: spans a much longer time horizon than operational databases (Mostly) Read-Only access: periodic, infrequent updates A.A. 04-05 Datawarehousing & Datamining 10

Data Warehousing Types of Data Warehouses Enterprise Warehouse: covers all areas of interest for an organization Data Mart: covers a subset of corporate-wide data that is of interest for a specific user group (e.g., marketing). Virtual Warehouse: offers a set of views constructed on demand on operational databases. Some of the views could be materialized (precomputed) A.A. 04-05 Datawarehousing & Datamining 11

Data Warehousing Multi-Tier Architecture DB DB Cleaning extraction Data Warehouse Server Analysis Reporting Data Mining Data sources Data Storage OLAP engine Front-End Tools A.A. 04-05 Datawarehousing & Datamining 12

Data Warehousing Multidimensional (logical) Model Data are organized around one or more FACT TABLEs. Each Fact Table collects a set of omogeneous events (facts) characterized by dimensions and dependent attributes Example: Sales at a chain of stores Product Supplier Store Period Units Sales P1 S1 St1 1qtr 30 1500 P2 S1 St3 2qtr 100 9000 dimensions dependent attributes A.A. 04-05 Datawarehousing & Datamining 13

Data Warehousing Multidimensional (logical) Model (cont d) Each dimension can in turn consist of a number of attributes. In this case the value in the fact table is a foreign key referring to an appropriate dimension table product Code Description supplier Code Name Address dimension table dimension table Product Supplier Store Period Units Sales Fact table dimension table STAR SCHEMA Store Code Name Address Manager A.A. 04-05 Datawarehousing & Datamining 14

Data Warehousing Conceptual Star Schema (E-R Model) PERIOD (0,N) PRODUCT (0,N) (1,1) (1,1) (1,1) (0,N) FACTs STORE (1,1) (0,N) SUPPLIER A.A. 04-05 Datawarehousing & Datamining 15

Data Warehousing OLAP Server Architectures They are classified based on the underlying storage layouts ROLAP (Relational OLAP): uses relational DBMS to store and manage warehouse data (i.e., table-oriented organization), and specific middleware to support OLAP queries. MOLAP (Multidimensional OLAP): uses array-based data structures and pre-computed aggregated data. It shows higher performance than OLAP but may not scale well if not properly implemented HOLAP (Hybird OLAP): ROLAP approach for low-level raw data, MOLAP approach for higher-level data (aggregations). A.A. 04-05 Datawarehousing & Datamining 16

Data Warehousing PC DVD CD Product TV MOLAP Approach Period 1Qtr 2Qtr 3Qtr 4Qtr Europe North America Middle East Total sales of PCs in europe in the 4th quarter of the year Region Far East DATA CUBE representing a Fact Table A.A. 04-05 Datawarehousing & Datamining 17

Data Warehousing OLAP Operations: SLICE Fix values for one or more dimensions E.g. Product = CD A.A. 04-05 Datawarehousing & Datamining 18

Data Warehousing OLAP Operations: DICE Fix ranges for one or more dimensions E.g. Product = CD or DVD Period = 1Qrt or 2Qrt Region = Europe or North America A.A. 04-05 Datawarehousing & Datamining 19

Data Warehousing OLAP Operations: Roll-Up PC DVD CD Product Period year TV Europe North America Aggregate data by grouping along one (or more) dimensions E.g.: group quarters Middle East Far East Region Drill-Down = (Roll-Up) -1 A.A. 04-05 Datawarehousing & Datamining 20

Data Warehousing Cube Operator: summaries for each subset of dimensions TV PC DVD CD SUM 1Qtr 2Qtr 3Qtr 4Qtr SUM Europe North America Middle East Far East SUM Yearly sales of PCs in the middle east Yearly sales of electronics in the middle east A.A. 04-05 Datawarehousing & Datamining 21

Data Warehousing Cube Operator: it is equivalent to computing the following lattice of cuboids product period region product, period product, region period, region product, period, region FACT TABLE A.A. 04-05 Datawarehousing & Datamining 22

Data Warehousing Cube Operator In SQL: SELECT Product, Period, Region, SUM(Total Sales) FROM FACT-TABLE GROUP BY Product, Period, Region WITH CUBE A.A. 04-05 Datawarehousing & Datamining 23

Data Warehousing Product Period Region Tot. Sales CD 1qtr Europe * * All combinations of Product, Period and Region CD 1qtr ALL ALL * * All combinations of Product and Period ALL 1qtr Europe * ALL * CD ALL Europe * ALL * CD ALL ALL * * All combinations of Product ALL 1qtr ALL * * ALL ALL Europe * ALL ALL * ALL ALL ALL * A.A. 04-05 Datawarehousing & Datamining 24

Data Warehousing Roll-up (= partial cube) Operator In SQL: SELECT Product, Period, Region, SUM(Total Sales) FROM FACT-TABLE GROUP BY Product, Period, Region WITH ROLL-UP A.A. 04-05 Datawarehousing & Datamining 25

Data Warehousing Product Period Region Tot. Sales CD 1qtr Europe * * All combinations of Product, Period and Region CD 1qtr ALL ALL * * All combinations of Product and Period CD ALL ALL ALL ALL * * All combinations of Product ALL ALL ALL * Reduces the complexity from exponential to linear in the number of dimensions A.A. 04-05 Datawarehousing & Datamining 26

Data Warehousing it is equivalent to computing the following subset of the lattice of cuboids product period region product, period product, region period, region product, period, region A.A. 04-05 Datawarehousing & Datamining 27

Outline 1. Introduction and Terminology 2. Data Warehousing 3. Data Mining Association rules Sequential patterns Classification Clustering A.A. 04-05 Datawarehousing & Datamining 28

Data Mining Data Explosion: tremendous amount of data accumulated in digital repositories around the world (e.g., databases, data warehouses, web, etc.) Production of digital data /Year: 3-5 Exabytes (10 18 bytes) in 2002 30% increase per year (99-02) See: www.sims.berkeley.edu/how-much-info We are drowning in data, but starving for knowledge A.A. 04-05 Datawarehousing & Datamining 29

Data Mining Knowledge Discovery in Databases KNOWLEDGE Selection & transformation Data Mining Training data Evaluation & presentation Data cleaning & integration Data Warehouse Domain knowledge DB DB Information repositories A.A. 04-05 Datawarehousing & Datamining 30

Data Mining Typologies of input data: Unaggregated data (e.g., records, transactions) Aggregated data (e.g., summaries) Spatial, geographic data Data from time-series databases Text, video, audio, web data A.A. 04-05 Datawarehousing & Datamining 31

Data Mining DATA MINING Process of discovering interesting patterns or knowledge from a (typically) large amount of data stored either in databases, data warehouses, or other information repositories Interesting: non-trivial, implicit, previously unknown, potentially useful Alternative names: knowledge discovery/extraction, information harvesting, business intelligence In fact, data mining is a step of the more general process of knowledge discovery in databases (KDD) A.A. 04-05 Datawarehousing & Datamining 32

Data Mining Interestingness measures Purpose: filter irrelevant patterns to convey concise and useful knowledge. Certain data mining tasks can produce thousands or millions of patterns most of which are redundant, trivial, irrelevant. Objective measures: based on statistics and structure of patterns (e.g., frequency counts) Subjective measures: based on user s belief about the data. Patterns may become interesting if they confirm or contradict a user s hypothesis, depending on the context. Interestingness measures can employed both after and during the pattern discovery. In the latter case, they improve the search efficiency A.A. 04-05 Datawarehousing & Datamining 33

Data Mining Multidisciplinarity of Data Mining Statistics Algorithms Data Mining High- Performance Computing Machine learning DB Technology Visualization A.A. 04-05 Datawarehousing & Datamining 34

Data Mining Data Mining Problems Association Rules: discovery of rules X Y ( objects that satisfy condition X are also likely to satisfy condition Y ). The problem first found application in market basket or transaction data analysis, where objects are transactions and conditions are containment of certain itemsets A.A. 04-05 Datawarehousing & Datamining 35

Association Rules I D =Set of items Statement of the Problem =Set of transactions: t D then t I For an itemset X I, support(x) = fraction or number of transactions containing X ASSOCIATION RULE: X-Y Y, with Y X I support confidence = support(x) = support(x)/support(x-y) PROBLEM: find all association rules with support than min sup and confidence than min confidence A.A. 04-05 Datawarehousing & Datamining 36

Association Rules Market Basket Analysis: Items = products sold in a store or chain of stores Transactions = customers shopping baskets Rule X-YY = customers who buy items in X-Y are likely to buy items in Y Of diapers and beer.. Analysis of customers behaviour in a supermarket chain has revealed that males who on thursdays and saturdays buy diapers are likely to buy also beer. That s why these two items are found close to each other in most stores A.A. 04-05 Datawarehousing & Datamining 37

Association Rules Applications: Cross/Up-selling (especially in e-comm., e.g., Amazon) Cross-selling: push complemetary products Up-selling: push similar products Catalog design Store layout (e.g., diapers and beer close to each other) Financial forecast Medical diagnosis A.A. 04-05 Datawarehousing & Datamining 38

Association Rules Def.: Frequent itemset = itemset with support min sup General Strategy to discover all association rules: 1. Find all frequent itemsets 2. frequent itemset X, output all rules X-YY, with YX, which satisfy the min confindence constraint Observation: Min sup and min confidence are objective measures of interestingness. Their proper setting, however, requires user s domain knowledge. Low values may yield exponentially (in I ) many rules, high values may cut off interesting rules A.A. 04-05 Datawarehousing & Datamining 39

Association Rules Example Transaction-id Items bought Frequent Itemsets Support 10 A, B, C {A} 75% 20 A, C {B} 50% 30 A, D {C} 50% 40 B, E, F {A, C} 50% Min. sup 50% Min. confidence 50% For rule {A} {C} : support = support({a,c}) = 50% confidence = support({a,c})/support({a}) = 66.6% For rule {C} {A} (same support as {A} {C}): confidence = support({a,c})/support({c}) = 100% A.A. 04-05 Datawarehousing & Datamining 40

Association Rules Dealing with Large Outputs Observation: depending on the values of min sup and on the dataset, the number frequent itemsets can be exponentially large but may contain a lot of redundancy Goal: determine a subset of frequent itemsets of considerably smaller size, which provides the same information content, i.e., from which the complete set of frequent itemsets can be derived without further information. A.A. 04-05 Datawarehousing & Datamining 41

Association Rules Notion of Closedness I (items), D (transactions), min sup (support threshold) Closed Frequent Itemsets = {XI : supp(x)min sup & supp(y)<supp(x) IY X} It s a subset of all frequent itemsets For every frequent itemset X there exists a closed frequent itemset YX such that supp(y)=supp(x), i.e., Y and X occur in exactly the same transcations All frequent itemsets and their frequencies can be derived from the closed frequent itemsets without further information A.A. 04-05 Datawarehousing & Datamining 42

Association Rules Notion of Maximality Maximal Frequent Itemsets = {XI : supp(x)min sup & supp(y)<min sup IY X} It s a subset of the closed frequent itemsets, hence of all frequent itemsets For every (closed) frequent itemset X there exists a maximal frequent itemset YX All frequent itemsets can be derived from the maximal frequent itemsets without further information, however their frequencies must be determined from D information loss A.A. 04-05 Datawarehousing & Datamining 43

Association Rules Example of closed and maximal frequent itemsets Closed Frequent Itemsets B B, A, D, F B, L B, G B, H Min sup = 3 Maximal NO YES YES YES YES Support 6 4 4 3 3 Tid 10 20 30 40 50 60 Supporting Transactions all 10, 30, 50, 60 20, 30, 40, 60 10, 40, 60 10, 30, 40 Items B, A, D, F, G, H B, L B, A, D, F, L, H B, L, G, H B, A, D, F B, A, D, F, L, G DATASET A.A. 04-05 Datawarehousing & Datamining 44

Association Rules (All frequent) VS (closed frequent) VS (maximal frequent) (,6) (B,6) (A,4) (D,4) (F,4) (L,4) (G,3) (H,3) (B,4) (B,4) (A,4) (B,4) (A,4) (D,4) (B,4) (B,3) (B,3) = closed & maximal (B,4) (B,4) (B,4) (A,4) = closed but not maximal (B,4) A.A. 04-05 Datawarehousing & Datamining 45

Sequential Patterns Data Mining Problems Sequential Patterns: discovery of frequent subsequences in a collection of sequences (sequence database), each representing a set of events occurring at subsequent times. The ordering of the events in the subsequences is relevant. A.A. 04-05 Datawarehousing & Datamining 46

Sequential Patterns Sequential Patterns PROBLEM: Given a set of sequences, find the complete set of frequent subsequences sequence database A sequence: <(ef)(ab)(df)(c)(b)> SID 10 20 30 40 sequence <(a)(abc)(ac)(d)(cf)> <(ad)(c)(bc)(ae)> <(ef)(ab)(df)(c)(b)> <(e)(g)(af)(c)(b)(c)> An element may contain a set of items. Items within an element are unordered and are listed alphabetically. <(a)(bc)(d)(c)> is a subsequence of <(a)(abc)(ac)(d)(cf)> For min sup = 2, <(ab)(c)> is a frequent subsequence A.A. 04-05 Datawarehousing & Datamining 47

Sequential Patterns Applications: Marketing Natural disaster forecast Analysis of web log data DNA analysis A.A. 04-05 Datawarehousing & Datamining 48

Classification Data Mining Problems Classification/Regression: discovery of a model or function that maps objects into predefined classes (classification) or into suitable values (regression). The model/function is computed on a training set (supervised learning) A.A. 04-05 Datawarehousing & Datamining 49

Classification Statement of the Problem Training Set: T = {t1,, tn} set of n examples Each example ti characterized by m features (ti(a 1 ),, ti(a m )) belongs to one of k classes (C i : 1 i k) GOAL From the training data find a model to describe the classes accurately and synthetically using the data s features. The model will then be used to assign class labels to unknown (previously unseen) records A.A. 04-05 Datawarehousing & Datamining 50

Classification Applications: Classification of (potential) customers for: Credit approval, risk prediction, selective marketing Performance prediction based on selected indicators Medical diagnosis based on symptoms or reactions to therapy A.A. 04-05 Datawarehousing & Datamining 51

Classification Classification process Training Set BUILD model Unseen Data Test Data VALIDATE PREDICT class labels A.A. 04-05 Datawarehousing & Datamining 52

Classification Observations: Features can be either categorical if belonging to unordered domains (e.g., Car Type), or continuous, if belonging to ordered domains (e.g., Age) The class could be regarded as an additional attribute of the examples, which we want to predict Classification vs Regression: Classification: builds models for categorical classes Regression: builds models for continuous classes Several types of models exist: decision trees, neural networks, bayesan (statistical) classifiers. A.A. 04-05 Datawarehousing & Datamining 53

Classification Classification using decision trees Definition: Decision Tree for a training set T Labeled tree Each internal node v represents a test on a feature. The edges from v to its children are labeled with mutually exclusive results of the test Each leaf w represents the subset of examples of T whose features values are consistent with the test results found along the path from the root to w. The leaf is labeled with the majority class of the examples it contains A.A. 04-05 Datawarehousing & Datamining 54

Classification Example: examples = car insurance applicants, class = insurance risk features class Age < 25 Age 23 18 Car Type family sports Risk high high High Car type {sports} 43 68 sports family high low High Low 32 20 truck family low high Model (decision tree) A.A. 04-05 Datawarehousing & Datamining 55

Clustering Data Mining Problems Clustering: grouping objects into classes with the objective of maximizing intra-class similarity and minimizing inter-class similarity (unsupervised learning) A.A. 04-05 Datawarehousing & Datamining 56

Clustering Statement of the Problem GIVEN: N objects, each characterized by p attributes (a.k.a. variables) GROUP: the objects into K clusters featuring High intra-cluster similarity Low inter-cluster similarity Remark: Clustering is an instance of unsupervised learning or learning by observations, as opposed to supervised learning or learning by examples (classification) A.A. 04-05 Datawarehousing & Datamining 57

Clustering Example outlier attribute Y attribute Y object attribute X attribute X A.A. 04-05 Datawarehousing & Datamining 58

Clustering Several types of clustering problems exist depending on the specific input/output requirements, and on the notion of similarity: The number of clusters K may be provided in input or not As output, for each cluster one may want a representative object, or a set of aggregate measurements, or the complete set of objects belonging to the cluster Distance-based clustering: similarity of objects is related to some kind of geometric distance Conceptual clustering: a group of objects forms a cluster if they define a certain concept A.A. 04-05 Datawarehousing & Datamining 59

Clustering Applications: Marketing: identify groups of customers based on their purchasing patterns Biology: categorize genes with similar functionalities Image processing: clustering of pixels Web: clustering of documents in meta-search engines GIS: identification of areas of similar land use A.A. 04-05 Datawarehousing & Datamining 60

Clustering Challenges: Scalability: strategies to deal with very large datasets Variety of attribute types: defining a good notion of similarity is hard in the presence of different types of attributes and/or different scales of values Variety of cluster shapes: common distance measures provide only spherical clusters Noisy data: outliers may affect the quality of the clustering Sensitivity to input ordering: the ordering of the input data should not affect the output (or its quality) A.A. 04-05 Datawarehousing & Datamining 61

Clustering Main Distance-Based Clustering Methods Partitioning Methods: create an initial partition of the objects into K clusters, and refine the clustering using iterative relocation techniques. A cluster is represented either by the mean value of its component objects or by a centrally located component object (medoid) Hierarchical Methods: start with all objects belonging to distinct clusters and then successively merge the pair of closest clusters/objects into one single cluster (agglomerative approach); or start with all objects belonging to one cluster and successively split up a cluster into smaller ones (divisive approach) A.A. 04-05 Datawarehousing & Datamining 62