Chapter 3: Cluster Analysis




Chapter 3: Cluster Analysis
- 3.1 Basic Concepts of Clustering
- 3.2 Partitioning Methods
- 3.3 Hierarchical Methods
- 3.4 Density-Based Methods
- 3.5 Model-Based Methods
- 3.6 Clustering High-Dimensional Data
- 3.7 Outlier Analysis
  - 3.7.1 Definition
  - 3.7.2 Statistical-Based Methods
  - 3.7.3 Distance-Based Methods
  - 3.7.4 Density-Based Local Methods
  - 3.7.5 Deviation-Based Methods

3.7.1 Definition
- Outliers: data objects that do not comply with the general behavior or model of the data
- Outlier detection or analysis is referred to as outlier mining
- Outlier mining has many applications
  - Fraud detection
  - Detecting unusual usage of telecommunication services
  - Identifying the spending behavior of customers with extremely low or extremely high incomes
  - Finding unusual responses to various medical treatments
  - Etc.

Outlier Mining
- Given a set of n data objects and k, the expected number of outliers
- Find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data
- The outlier mining problem can be seen as two sub-problems
  1) Define what data can be considered inconsistent in a given data set
  2) Find an efficient method to mine the outliers so defined
- Data visualization methods are weak at detecting outliers in data with many categorical attributes or in data of high dimensionality
- Hence the need to investigate computer-based techniques to detect outliers

3.7.2 Statistical Distribution-Based Methods
- Assume a distribution model for the given data set (e.g., Normal)
- Identify outliers with respect to the model using a discordancy test
- How does it work?
  - Examine two hypotheses: a working hypothesis and an alternative hypothesis
  - A working hypothesis H is a statement that the entire data set of n objects comes from an initial distribution model F, that is:
    $H: o_i \in F$, where $i = 1, 2, \ldots, n$
  - The hypothesis H is retained if there is no statistically significant evidence supporting its rejection

Discordancy Test
- Verifies whether an object $o_i$ is significantly large (or small) in relation to the distribution F
- Principle
  - Choose a statistic T for discordancy testing
  - Consider the value $v_i$ of the statistic for an object $o_i$
  - If the significance probability $SP(v_i)$ is sufficiently small, $o_i$ is discordant and the working hypothesis is rejected
  - An alternative hypothesis $\bar{H}$, which says that $o_i$ comes from another distribution model G, is adopted
- The result depends on which model F is chosen, because $o_i$ may be an outlier under one model and a perfectly valid value under another

Discordancy Test: Example
- Let $o_1, \ldots, o_n$ represent the data objects
- Compute the sample mean $\mu$ and the standard deviation $\sigma$
- If an object $o_i$ is suspected to be an outlier, compute the test statistic
  $T = \frac{|o_i - \mu|}{\sigma}$
- If T exceeds some critical value, then $o_i$ is an outlier

Discordancy Test: Example
- Consider the following ordered data: 3.84, 4.26, 4.53, 4.60, 5.28, 5.29, 5.74, 5.86
- Consider an additional sample P = 10 (it is suspected that this point might be an outlier)
- Compute $\mu$ and $\sigma$ over all n = 9 values, with the suspected outlier included: $\mu \approx 5.49$, $\sigma \approx 1.82$
- $T = \frac{|10 - 5.49|}{1.82} \approx 2.48$
- With n = 9 and level of significance $\alpha = 0.05$, the critical value is 2.110
- Since $T > 2.110$, there is evidence that P is an outlier
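The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: it assumes the sample standard deviation (n − 1 denominator) and counts the suspected point as part of the sample, which is the reading that reproduces the slide's σ ≈ 1.82 and T ≈ 2.48. The function name `discordancy_statistic` is made up for illustration, and the critical value is simply copied from the slide.

```python
import statistics

def discordancy_statistic(values, suspect):
    """Return T = |suspect - mean| / stdev, computed over `values`."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1)
    return abs(suspect - mu) / sigma

data = [3.84, 4.26, 4.53, 4.60, 5.28, 5.29, 5.74, 5.86]
suspect = 10.0
sample = data + [suspect]   # assumption: the suspect is counted in the sample
CRITICAL_VALUE = 2.110      # n = 9, alpha = 0.05, copied from the slide

t_stat = discordancy_statistic(sample, suspect)
is_outlier = t_stat > CRITICAL_VALUE
print(f"T = {t_stat:.2f}, outlier: {is_outlier}")
```

Leaving the suspected point out of the sample gives a much smaller σ and therefore a much larger T, so which convention is used matters when comparing against a tabulated critical value.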

Alternative Distributions
- Inherent alternative distribution
  - The working hypothesis that all objects come from distribution F is rejected
  - The alternative hypothesis assumes that all objects come from another distribution G:
    $\bar{H}: o_i \in G$, where $i = 1, 2, \ldots, n$
  - F and G may be different distributions, or the same distribution with different parameters
  - Distribution G must have the potential to produce outliers (a different mean or dispersion, or a longer tail)

Alternative Distributions
- Mixture alternative distribution
  - The discordant values are not outliers in the F population but contaminants from some other population G
  - The alternative hypothesis is
    $\bar{H}: o_i \in (1 - \lambda) F + \lambda G$, where $i = 1, 2, \ldots, n$
- Slippage alternative distribution
  - All objects (except a small number) are from the initial model F, with its given parameters
  - The remaining objects are from a modified version of F in which the parameters have been shifted

Characteristics of Statistical-Based Methods
- Most tests are for single attributes, yet many problems require finding outliers in multidimensional space
- Statistical approaches require knowledge about the parameters of the data set
- Statistical methods do not guarantee that all outliers will be found in cases where
  - No specific test was developed
  - The distribution cannot be adequately modeled with any standard distribution

3.7.3 Distance-Based Methods
- Generalize the statistical test-based techniques
- Distance-based outliers are those objects that do not have enough neighbors
- Formally
  - DB(pct, dmin)-outlier: a distance-based outlier with parameters pct and dmin
  - An object o is a DB(pct, dmin)-outlier if at least a fraction pct of the objects lie at a distance greater than dmin from o
- Avoids the excessive computation associated with fitting the observed data to some standard distribution and selecting discordancy tests
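The DB(pct, dmin) definition translates almost directly into a brute-force check. The sketch below is illustrative only: it uses Euclidean distance and measures the fraction over the other n − 1 objects, both of which are assumptions, and the sample points are invented.

```python
import math

def db_outliers(points, pct, dmin):
    """Return indices of DB(pct, dmin)-outliers using a naive O(n^2) scan."""
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        # Count the other objects lying farther than dmin from p.
        far = sum(1 for j, q in enumerate(points)
                  if j != i and math.dist(p, q) > dmin)
        if far >= pct * (n - 1):   # assumption: fraction taken over the others
            outliers.append(i)
    return outliers

# Hypothetical data: a tight group of four points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = db_outliers(points, pct=0.9, dmin=1.0)
print(result)
```

Only the distant point has at least 90% of the other objects farther than dmin away, so it is the only index reported.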

Distance-Based Algorithms
- Index-based algorithms
  - Use multidimensional indexing structures such as R-trees or k-d trees to search for the neighbors of each object o within a radius dmin
  - M is the maximum number of objects within the dmin-neighborhood of an outlier
  - Once M+1 neighbors of object o are found, o is not an outlier
  - Complexity of $O(n^2 k)$
    - n: number of objects
    - k: dimensionality
  - This complexity counts search time only; building the index can itself be computationally very expensive
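The early-termination rule (stop as soon as M+1 neighbors are found) can be sketched as follows. A plain linear scan stands in for the index lookup here, since building an R-tree or k-d tree is outside the scope of a short example; the function name and sample data are hypothetical.

```python
import math

def is_outlier_early_stop(points, i, dmin, M):
    """Check points[i]; stop scanning once M + 1 neighbors lie within dmin."""
    neighbors = 0
    for j, q in enumerate(points):
        if j == i:
            continue
        if math.dist(points[i], q) <= dmin:
            neighbors += 1
            if neighbors > M:   # M + 1 neighbors found: cannot be an outlier
                return False
    return True

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flags = [is_outlier_early_stop(points, i, dmin=1.0, M=1)
         for i in range(len(points))]
print(flags)
```

With a real spatial index, the inner loop would visit only candidates near the query object, but the stopping condition stays the same.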

Distance-Based Algorithms
- Cell-based algorithms
  - The data space is partitioned into cells with a side length equal to $\frac{dmin}{2\sqrt{k}}$
    - dmin: radius around objects
    - k: dimensionality
  - Each cell has two layers surrounding it
    - The first layer is 1 cell thick
    - The second layer is $\lceil 2\sqrt{k} - 1 \rceil$ cells thick (rounded up to the closest integer)
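As a small numeric illustration of the cell geometry, the sketch below computes the side length and the second-layer thickness, assuming the formulas dmin / (2√k) and ⌈2√k − 1⌉, and maps a point to its cell. The helper names `cell_geometry` and `cell_index` are invented for this example.

```python
import math

def cell_geometry(dmin, k):
    """Return (cell side length, second-layer thickness in cells)."""
    side = dmin / (2 * math.sqrt(k))
    second_layer = math.ceil(2 * math.sqrt(k) - 1)
    return side, second_layer

def cell_index(point, side):
    """Map a k-dimensional point to the integer coordinates of its cell."""
    return tuple(math.floor(x / side) for x in point)

side, second_layer = cell_geometry(dmin=1.0, k=2)
print(round(side, 4), second_layer)
print(cell_index((0.8, 0.1), side))
```

With this side length, the diagonal of a cell is dmin / 2, so any two objects in the same cell are within dmin of each other; that is what allows whole cells to be accepted or rejected at once.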

Distance-Based Algorithms
- Cell-based algorithms
  - Count outliers on a cell-by-cell rather than an object-by-object basis
  - For a given cell, the algorithm accumulates three counts
    - C: the number of objects in the cell
    - C+1: the number of objects in the cell and the first layer
    - C+2: the number of objects in the cell and both layers
  - How to determine outliers with these counts?

Distance-Based Algorithms
- Cell-based algorithms
  - Let M be the threshold used to detect outliers (the maximum number of objects within the dmin-neighborhood of an outlier)
  - An object o in the cell can be an outlier only if C+1 ≤ M; otherwise all the objects in the cell are non-outliers
  - If C+2 ≤ M, all the objects in the cell are outliers
  - If C+2 > M, it is possible that some objects in the cell are outliers
    - Do object-by-object processing to detect them
    - Only objects that have no more than M objects in their dmin-neighborhood are outliers
    - The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer
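The three-way decision on a cell can be written as a small function. This is a sketch under assumptions: `count_1` and `count_2` stand for the C+1 and C+2 counts, the string labels are invented, and treating a count of exactly M as still compatible with being an outlier is a choice made here rather than something the slides settle.

```python
def classify_cell(count_1, count_2, M):
    """Decide the status of a cell from its layer counts.

    count_1: objects in the cell plus its first layer (C+1)
    count_2: objects in the cell plus both layers (C+2)
    """
    if count_1 > M:
        return "no outliers"      # every object already has too many neighbors
    if count_2 <= M:
        return "all outliers"     # even the widest neighborhood is too sparse
    return "check objects"        # fall back to object-by-object processing

print(classify_cell(5, 9, M=3))
print(classify_cell(2, 3, M=3))
print(classify_cell(2, 7, M=3))
```

Only cells in the third category need distance computations, which is where the savings over an object-by-object scan come from.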

Characteristics of Distance-Based Methods
- The cell-based algorithm avoids $O(n^2)$ computational complexity
  - Its complexity is $O(c^k + n)$
    - c: a constant depending on the number of cells
    - k: the dimensionality
    - n: the number of objects
- Developed for memory-resident data sets
- Require the user to set both dmin and pct
  - Finding suitable settings for these parameters can involve much trial and error

3.7.4 Density-Based Methods
- Statistical and distance-based methods depend on the overall (global) distribution of the data
- Data are usually not uniformly distributed
- Data can have different density distributions
[Figure: clusters C1 and C2 with objects o1 and o2]

Density-Based Methods
- Define local outliers
  - An object is a local outlier if it is outlying relative to its local neighborhood (with respect to the density of the neighborhood)
- Do not treat being an outlier as a binary property
  - Assess the degree to which an object is an outlier
  - The degree of outlierness is computed as the local outlier factor (LOF) of an object
  - The degree depends on how isolated the object is with respect to the surrounding neighborhood
- Detect both global and local outliers

Density-Based Methods
- To define the local outlier factor of an object, the following concepts should be introduced
  - k-distance
  - k-distance neighborhood
  - Reachability distance
  - Local reachability density

K-distance & K-distance neighborhood
- The k-distance of an object p is the maximal distance between p and its k nearest neighbors
  - Denoted k-distance(p)
- How is k determined?
  - The LOF method sets k to the parameter MinPts used in density-based clustering (e.g., MinPts = 4), giving the MinPts-distance
- The k-distance neighborhood of an object p contains the MinPts-nearest neighbors of p, that is, every object whose distance from p is not greater than MinPts-distance(p)
  - Denoted $N_{k\text{-}distance(p)}(p)$ or $N_k(p)$, also $N_{MinPts}(p)$
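A minimal sketch of the two definitions, assuming Euclidean distance and excluding the object itself from its own neighborhood. The function names and the one-dimensional sample data are hypothetical.

```python
import math

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other object."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def k_neighborhood(points, i, k):
    """Indices of all objects within k_distance of points[i] (ties included)."""
    kd = k_distance(points, i, k)
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= kd]

points = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(k_distance(points, 0, 2), k_neighborhood(points, 0, 2))
print(k_distance(points, 3, 2), k_neighborhood(points, 3, 2))
```

Because ties are included, the neighborhood can contain more than k objects when several objects sit at exactly the k-distance.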

Reachability distance
- The reachability distance of an object p with respect to an object o (where o is within the MinPts-nearest neighbors of p) is denoted $reach\_dist_{MinPts}(p, o)$
  $reach\_dist_{MinPts}(p, o) = \max\{MinPts\text{-}distance(o),\ d(p, o)\}$
- If p is far away from o, the reachability distance between the two is simply their actual distance
- If they are close, the actual distance is replaced by the MinPts-distance of o

Local Outlier Factor (LOF)
- The local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p
  $lrd_{MinPts}(p) = \frac{|N_{MinPts}(p)|}{\sum_{o \in N_{MinPts}(p)} reach\_dist_{MinPts}(p, o)}$
- The local outlier factor (LOF) of p captures the degree to which we call p an outlier
  $LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|}$
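Putting the definitions together, the following sketch computes LOF scores for a small invented data set. It assumes Euclidean distance, includes ties in the neighborhood, and assumes there are no duplicate points (duplicates would make a reachability sum zero). It is a brute-force illustration, not an efficient implementation.

```python
import math

def lof_scores(points, min_pts):
    """Return the LOF score of every point (brute force, no duplicates)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neigh = [], []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[min_pts - 1]                  # MinPts-distance of i
        k_dist.append(kd)
        neigh.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    def reach_dist(p, o):
        return max(k_dist[o], dist[p][o])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [len(neigh[p]) / sum(reach_dist(p, o) for o in neigh[p])
           for p in range(n)]
    # LOF: average ratio of the neighbors' density to the object's own density.
    return [sum(lrd[o] / lrd[p] for o in neigh[p]) / len(neigh[p])
            for p in range(n)]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, min_pts=2)
print([round(s, 2) for s in scores])
```

Points deep inside a uniformly dense region get a score close to 1, while an isolated point gets a score well above 1, which is why LOF is read as a degree of outlierness rather than a yes/no label.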

3.7.5 Deviation-Based Methods
- Identify outliers by examining the main characteristics of objects in a group
- Objects that deviate from this description are outliers
- The term deviations is used to refer to outliers
- Two main methods
  - Sequential exception technique
  - OLAP data cube technique

Summary of Chapter 3
- A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
- Clustering can be used as
  - a main task to gain insights about the data
  - a preprocessing step for other data mining algorithms
- Several applications
  - Market segmentation
  - Pattern recognition
  - Biological studies
  - Spatial data analysis
  - Web document classification, etc.

Summary of Chapter 3
- The quality of clustering can be assessed based on the dissimilarity of objects
- Many techniques have been developed
  - Partitioning methods
  - Hierarchical methods
  - Density-based methods
  - Grid-based methods
  - Model-based methods
  - Clustering high-dimensional data
  - Constraint-based methods

Applications and Tools in Data Mining Summary

1. Financial Data Analysis
- Banks and financial institutions offer a wide variety of banking services
  - Checking and savings accounts for business or individual customers
  - Credit (business, mortgage, and automobile loans)
  - Investment services (mutual funds)
  - Insurance services and stock investment services
- Financial data is relatively complete, reliable, and of high quality
- What to do with this data?

1. Financial Data Analysis
- Design of data warehouses for multidimensional data analysis and data mining
  - Construct data warehouses (data come from different sources)
  - Multidimensional analysis: e.g., view the revenue changes by month, by region, by sector, etc., along with statistical information such as the mean, the maximum, and the minimum values
  - Characterization and class comparison
  - Outlier analysis
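To make the idea of viewing revenue by one dimension at a time concrete, here is a small sketch that groups invented revenue records by month, region, or sector and reports the mean, maximum, and minimum. The records, the `DIMENSIONS` mapping, and the `roll_up` helper are all hypothetical; a real data warehouse would do this with OLAP operations rather than in application code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical revenue records: (month, region, sector, revenue)
records = [
    ("2024-01", "East", "Retail", 120.0),
    ("2024-01", "West", "Retail", 80.0),
    ("2024-02", "East", "Retail", 150.0),
    ("2024-02", "East", "Corporate", 300.0),
]
DIMENSIONS = {"month": 0, "region": 1, "sector": 2}

def roll_up(rows, dimension):
    """Aggregate revenue along one dimension with simple statistics."""
    idx = DIMENSIONS[dimension]
    groups = defaultdict(list)
    for row in rows:
        groups[row[idx]].append(row[3])
    return {key: {"mean": mean(vals), "max": max(vals), "min": min(vals)}
            for key, vals in groups.items()}

by_month = roll_up(records, "month")
by_region = roll_up(records, "region")
print(by_month)
print(by_region)
```

The same records answer different questions depending on the dimension chosen, which is the point of multidimensional analysis.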

1. Financial Data Analysis
- Loan payment prediction and customer credit policy analysis
  - Attribute selection and attribute relevance ranking may help identify important factors and eliminate irrelevant ones
  - Examples of factors related to the risk of loan payment
    - Term of the loan
    - Debt ratio
    - Payment-to-income ratio
    - Customer income level
    - Education level
    - Residence region
  - The bank can adjust its decisions according to the subset of factors selected (use classification)
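One common way to rank attribute relevance is information gain; the slides do not say which measure is intended, so its use here is an assumption. The sketch below ranks two categorical attributes of an invented loan data set by how much they reduce the entropy of the repayment label.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction of `target` obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical loan records.
rows = [
    {"debt_ratio": "high", "region": "north", "repaid": "no"},
    {"debt_ratio": "high", "region": "south", "repaid": "no"},
    {"debt_ratio": "low", "region": "north", "repaid": "yes"},
    {"debt_ratio": "low", "region": "south", "repaid": "yes"},
]
gains = {a: information_gain(rows, a, "repaid")
         for a in ("debt_ratio", "region")}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```

In this toy data the debt ratio fully determines repayment while the region carries no information, so the ranking puts the debt ratio first.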

2. Retail Industry
- Collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service, etc.
- Many stores have web sites where you can buy online; some exist only online (e.g., Amazon)
- Data mining helps to
  - Identify customer buying behaviors
  - Discover customer shopping patterns and trends
  - Improve the quality of customer service
  - Achieve better customer satisfaction
  - Design more effective goods transportation
  - Reduce the cost of business

2. Retail Industry
- Design data warehouses
- Multidimensional analysis
- Analysis of the effectiveness of sales campaigns
  - Advertisements, coupons, discounts, bonuses, etc.
  - Compare transactions that contain sales items during and after the campaign
- Customer retention
  - Analyze the change in customer behavior
- Product recommendation
  - Mining association rules
  - Display associative information to promote sales
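For the product recommendation item, the two basic measures behind association rules, support and confidence, can be sketched on a handful of invented transactions. The item names and helper functions are illustrative only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

A rule such as bread => milk would be shown to shoppers only if both values exceed thresholds chosen by the retailer.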

3. Telecommunication Industry
- Many different ways of communicating
  - Fax, cellular phone, Internet messenger, images, e-mail, computer and Web data transmission, etc.
- Great demand for data mining to help
  - Understand the business involved
  - Identify telecommunication patterns
  - Catch fraudulent activities
  - Make better use of resources
  - Improve the quality of service

3. Telecommunication Industry
- Multidimensional analysis (several attributes)
  - Several features: calling time, duration, location of caller, location of callee, type of call, etc.
  - Compare data traffic, system workload, resource usage, user group behavior, and profit
- Fraudulent pattern analysis
  - Identify potentially fraudulent users
  - Detect attempts to gain fraudulent entry to customer accounts
  - Discover unusual patterns (outlier analysis)

4. Many Other Applications
- Biological data analysis
  - E.g., identification and analysis of the genomes of humans and other species
- Web mining
  - E.g., explore linkage between web pages to compute authority scores (PageRank algorithm)
- Intrusion detection
  - Detect any action that threatens the integrity, confidentiality, or availability of a network resource
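As an illustration of the web mining item, here is a minimal power-iteration sketch of PageRank on an invented three-page graph. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch; every link target must also appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a dict mapping page -> linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages   # dangling page: spread rank evenly
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

Page C receives links from both A and B, so it ends up with the highest score; the scores always sum to 1.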

How to Choose a Data Mining System (Tool)?
- Do data mining systems share the same well-defined operations and a standard query language? No
- Many commercial data mining systems have little in common
  - Different functionalities
  - Different methodologies
  - Different data sets
- You need to carefully choose the data mining system that is appropriate for your task

How to Choose a Data Mining System (Tool)?
- Data types
  - Available systems handle formatted, record-based, relational-like data with numerical and nominal attributes
  - That data could be in the form of ASCII text, relational databases, or data warehouse data
  - It is important to check which kinds of data the system you are choosing can handle
- Operating system
  - A data mining system may run on only one operating system
  - The most popular operating systems that host data mining tools are UNIX/Linux and Microsoft Windows
  - Large industry data mining systems adopt a client-server architecture

How to Choose a Data Mining System (Tool)?
- Data sources
  - Data formats: some systems work only with ASCII text files, whereas many others work with databases
  - It is important that the data mining system supports ODBC (Open Database Connectivity) connections
- Data mining functions and methodologies
  - Some systems provide only one data mining function (e.g., classification); other systems support many functions
  - For a given data mining function (e.g., classification), some systems support only one method; other systems may support many methods (k-nearest neighbor, naive Bayesian, etc.)
  - A data mining system should provide default settings for non-experts

How to Choose a Data Mining System (Tool)?
- Coupling data mining with database (data warehouse) systems
  - No coupling
    - A DM system does not use any function of a DB/DW system
    - Fetch data from a particular source (file)
    - Process the data and then store the results in a file
  - Loose coupling
    - A DM system uses some facilities of a DB/DW system
    - Fetch data from data repositories managed by a DB/DW system
    - Store results in a file or in the DB/DW
  - Semi-tight coupling
    - Efficient implementations of a few essential data mining primitives (sorting, indexing, histogram analysis) are provided by the DB/DW system
  - Tight coupling
    - A DM system is smoothly integrated into the DB/DW system
    - Data mining queries are optimized
    - Tight coupling is highly desirable because it facilitates implementation and provides high system performance
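Loose coupling can be illustrated with Python's built-in `sqlite3` module: the mining code fetches rows from the database, does all the analysis itself, and writes the results back. The table names, the call-duration data, and the simple 1.5-standard-deviation rule are all invented for this sketch; the database here only stores and returns data, which is what distinguishes loose from tight coupling.

```python
import sqlite3
from statistics import mean, stdev

# An in-memory database stands in for the DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, duration REAL)")
conn.executemany("INSERT INTO calls (duration) VALUES (?)",
                 [(d,) for d in [3.0, 4.0, 5.0, 4.0, 60.0]])

# 1) Fetch data from the repository managed by the database.
rows = conn.execute("SELECT id, duration FROM calls").fetchall()

# 2) Mine outside the database (hypothetical rule: 1.5 standard deviations).
durations = [d for _, d in rows]
mu, sigma = mean(durations), stdev(durations)
flagged = [(call_id,) for call_id, d in rows if abs(d - mu) / sigma > 1.5]

# 3) Store the results back in the database.
conn.execute("CREATE TABLE outliers (call_id INTEGER)")
conn.executemany("INSERT INTO outliers (call_id) VALUES (?)", flagged)
stored = [row[0] for row in conn.execute("SELECT call_id FROM outliers")]
print(stored)
conn.close()
```

In a tightly coupled system, the statistics and the filtering would instead be expressed as operations that the database itself executes and optimizes.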

How to Choose a Data Mining System (Tool)?
- Scalability
  - Query execution time should increase linearly with the number of dimensions
- Visualization
  - "A picture is worth a thousand words"
  - The quality and flexibility of visualization tools may strongly influence the usability, interpretability, and attractiveness of the system
- Data mining query language and graphical user interface
  - High-quality user interface
  - It is not common to have a query language in a DM system

Examples of Commercial Data Mining Tools
- Database system and graphics vendors
  - Intelligent Miner (IBM)
  - Microsoft SQL Server 2005
  - MineSet (Purple Insight)
  - Oracle Data Mining (ODM)

Examples of Commercial Data Mining Tools
- Vendors of statistical analysis or data mining software
  - Clementine (SPSS)
  - Enterprise Miner (SAS Institute)
  - Insightful Miner (Insightful Inc.)

Examples of Commercial Data Mining Tools
- Machine learning community
  - CART (Salford Systems)
  - See5 and C5.0 (RuleQuest)
  - Weka, developed at the University of Waikato (open source)

End of The Data Mining Course Questions? Suggestions?