Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)



Similar documents
Data Mining 5. Cluster Analysis

Data Mining for Knowledge Management. Clustering

Chapter 7. Cluster Analysis

資 料 倉 儲 (Data Warehousing)

Machine Learning using MapReduce

STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL

Using Data Mining for Mobile Communication Clustering and Characterization

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Comparison of K-means and Backpropagation Data Mining Algorithms

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Neural Networks Lesson 5 - Cluster Analysis

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

How To Cluster

DATA MINING TECHNIQUES AND APPLICATIONS

Exploratory data analysis approaches unsupervised approaches. Steven Kiddle With thanks to Richard Dobson and Emanuele de Rinaldis

Social Media Mining. Data Mining Essentials

Segmentation of stock trading customers according to potential value

Cluster Analysis: Basic Concepts and Algorithms

Information Management course

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Mining and Business Intelligence CIT-6-DMB. Faculty of Business 2011/2012. Level 6

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Unsupervised learning: Clustering

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Data Mining Part 5. Prediction

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Cluster Analysis: Basic Concepts and Methods

COURSE SYLLABUS. Enterprise Information Systems and Business Intelligence

Introduction to Clustering

Cluster analysis Cosmin Lazar. COMO Lab VUB

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

City University of Hong Kong. Information on a Course offered by the Department of Management Sciences with effect from Semester A in 2012 / 2013

Statistical Databases and Registers with some datamining

A Study of Web Log Analysis Using Clustering Techniques

Clustering Connectionist and Statistical Language Processing

A Comparative Study of clustering algorithms Using weka tools

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

CLUSTER ANALYSIS FOR SEGMENTATION

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Subject Description Form

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

RYERSON UNIVERSITY Ted Rogers School of Information Technology Management And G. Raymond Chang School of Continuing Education

City University of Hong Kong. Information on a Course offered by Department of Management Sciences with effect from Semester A in 2010 / 2011

A Novel Fuzzy Clustering Method for Outlier Detection in Data Mining

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Credit Card Fraud Detection Using Self Organised Map

Clustering UE 141 Spring 2013

Data Mining + Business Intelligence. Integration, Design and Implementation

The SPSS TwoStep Cluster Component

A Relevant Document Information Clustering Algorithm for Web Search Engine

Quality Assessment in Spatial Clustering of Data Mining

Hierarchical Cluster Analysis Some Basics and Algorithms

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Cluster Analysis: Basic Concepts and Algorithms

EXTENDED CENTROID BASED CLUSTERING TECHNIQUE FOR ONLINE SHOPPING FRAUD DETECTION

The Data Mining Process

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Cluster Analysis: Advanced Concepts

Classification and Prediction

K-Means Clustering Tutorial

Clustering. Clustering. What is Clustering? What is Clustering? What is Clustering? Types of Data in Cluster Analysis

MS1b Statistical Data Mining

IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES

Essential Components of an Integrated Data Mining Tool for the Oil & Gas Industry, With an Example Application in the DJ Basin.

Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model

An Overview of Knowledge Discovery Database and Data mining Techniques

資 訊 管 理 個 案. E-commerce: Digital Markets, Digital Goods 1. Facebook, 2. Amazon vs. Walmart (Chap. 10)

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CSci 538 Articial Intelligence (Machine Learning and Data Analysis)

CHAPTER 20. Cluster Analysis

Big Data and Marketing

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Analytics on Big Data

Data Mining K-Clustering Problem

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Principles of Data Mining by Hand&Mannila&Smyth

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Comparison of Data Mining Techniques used for Financial Data Analysis

Data Mining and Soft Computing. Francisco Herrera

Transcription:

Data Mining 資 料 探 勘 Tamkang University 分 群 分 析 (Cluster Analysis) DM MI Wed,, (:- :) (B) Min-Yuh Day 戴 敏 育 Assistant Professor 專 任 助 理 教 授 Dept. of Information Management, Tamkang University 淡 江 大 學 資 訊 管 理 學 系 http://mail. tku.edu.tw/myday/ - -

課 程 大 綱 (Syllabus) 週 次 (Week) 日 期 (Date) 內 容 (Subject/Topics) // 資 料 探 勘 導 論 (IntroducGon to Data Mining) // 關 連 分 析 (AssociaGon Analysis) // 分 類 與 預 測 (ClassificaGon and PredicGon) // 分 群 分 析 (Cluster Analysis) // 個 案 分 析 與 實 作 一 (SAS EM 分 群 分 析 ): Case Study (Cluster Analysis K- Means using SAS EM) // 個 案 分 析 與 實 作 二 (SAS EM 關 連 分 析 ): Case Study (AssociaGon Analysis using SAS EM) // 教 學 行 政 觀 摩 日 (Off- campus study) // 個 案 分 析 與 實 作 三 (SAS EM 決 策 樹 模 型 評 估 ): Case Study (Decision Tree, Model EvaluaGon using SAS EM)

課 程 大 綱 (Syllabus) 週 次 (Week) 日 期 (Date) 內 容 (Subject/Topics) // 期 中 報 告 (Midterm Project PresentaGon) // 期 中 考 試 週 (Midterm Exam) // 個 案 分 析 與 實 作 四 (SAS EM 迴 歸 分 析 類 神 經 網 路 ): Case Study (Regression Analysis, ArGficial Neural Network using SAS EM) // 文 字 探 勘 與 網 頁 探 勘 (Text and Web Mining) // 海 量 資 料 分 析 (Big Data AnalyGcs) // 期 末 報 告 (Final Project PresentaGon) // 畢 業 考 試 週 (Final Exam)

Outline Cluster Analysis K- Means Clustering Source: Han & Kamber ()

Cluster Analysis Used for automagc idengficagon of natural groupings of things Part of the machine- learning family Employ unsupervised learning Learns the clusters of things from past data, then assigns new instances There is not an output variable Also known as segmentagon Source: Turban et al. (), Decision Support and Business Intelligence Systems

Cluster Analysis Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a +.) Source: Han & Kamber ()

Cluster Analysis Clustering results may be used to IdenGfy natural groupings of customers IdenGfy rules for assigning new cases to classes for targegng/diagnosgc purposes Provide characterizagon, definigon, labeling of populagons Decrease the size and complexity of problems for other data mining methods IdenGfy outliers in a specific domain (e.g., rare- event detecgon) Source: Turban et al. (), Decision Support and Business Intelligence Systems

Example of Cluster Analysis Point P P(x,y) p a (, ) p b (, ) p c (, ) p d (, ) p e (, ) p f (, ) p g (, ) p h (, ) p i (, ) p j (, )

Cluster Analysis for Data Mining Analysis methods StaGsGcal methods (including both hierarchical and nonhierarchical), such as k- means, k- modes, and so on Neural networks (adapgve resonance theory [ART], self- organizing map [SOM]) Fuzzy logic (e.g., fuzzy c- means algorithm) GeneGc algorithms Divisive versus AgglomeraGve methods Source: Turban et al. (), Decision Support and Business Intelligence Systems

Cluster Analysis for Data Mining How many clusters? There is not a truly opgmal way to calculate it HeurisGcs are ojen used. Look at the sparseness of clusters. Number of clusters = (n/) / (n: no of data points). Use Akaike informagon criterion (AIC). Use Bayesian informagon criterion (BIC) Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items Euclidian versus Manhalan (recglinear) distance Source: Turban et al. (), Decision Support and Business Intelligence Systems

k- Means Clustering Algorithm k : pre- determined number of clusters Algorithm (Step : determine value of k) Step : Randomly generate k random points as inigal cluster centers Step : Assign each point to the nearest cluster center Step : Re- compute the new cluster centers RepeGGon step: Repeat steps and ungl some convergence criterion is met (usually that the assignment of points to clusters becomes stable) Source: Turban et al. (), Decision Support and Business Intelligence Systems

Cluster Analysis for Data Mining - k- Means Clustering Algorithm Step Step Step Source: Turban et al. (), Decision Support and Business Intelligence Systems

Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular ones include: Minkowski distance: where i = (x i, x i,, x ip ) and j = (x j, x j,, x jp ) are two p- dimensional data objects, and q is a posigve integer If q =, d is Manhalan distance q q p p q q j x i x j x i x j x i x j i d )... ( ), ( + + + =... ), ( p p j x i x j x i x j x i x j i d + + + = Source: Han & Kamber ()

Similarity and Dissimilarity Between Objects (Cont.) If q =, d is Euclidean distance: d( i, j) = ProperGes d(i,j) d(i,i) = ( x i d(i,j) = d(j,i) x j + x i d(i,j) d(i,k) + d(k,j) x j +... + x i p p Also, one can use weighted distance, parametric Pearson product moment correlagon, or other disimilarity measures x j ) Source: Han & Kamber ()

Euclidean distance vs ManhaMan distance Distance of two point x = (, ) and x (, ) x (, ). x = (, ) Euclidean distance: = ((-) + (-) ) / = ( + ) / = ( + ) / = () / =. Manhattan distance: = (-) + (-) = + =

Example The K- Means Clustering Method Assign each objects to most similar center reassign Update the cluster means reassign K= Arbitrarily choose K object as initial cluster center Update the cluster means Source: Han & Kamber ()

K- Means Clustering Step by Step Point P P(x,y) p a (, ) p b (, ) p c (, ) p d (, ) p e (, ) p f (, ) p g (, ) p h (, ) p i (, ) p j (, )

m = (, ) K- Means Clustering Step : K=, Arbitrarily choose K object as initial cluster center M = (, ) Point P P(x,y) p a (, ) p b (, ) p c (, ) p d (, ) p e (, ) p f (, ) p g (, ) p h (, ) p i (, ) p j (, ) Initial m (, ) Initial m (, )

Step : Compute seed points as the centroids of the clusters of the current partition Step : Assign each objects to most similar center m = (, ) M = (, ) Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster K- Means Clustering Initial m (, ) Initial m (, )

Step : Compute seed points as the centroids of the clusters of the current partition Step : Assign each objects to most similar center m = (, ) Euclidean distance b(,) ß à m(,) = ((-) + (-) ) / = ( + (-) ) / = ( + ) / M = (, ) K- Means = () / Clustering =. Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster Euclidean distance b(,) ß à m(,) = ((-) + (-) ) / = ( + (-) ) / = ( + ) / = () / =. p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster Initial m (, ) Initial m (, )

Step : Update the cluster means, Repeat Step,, stop when no more new assignment m = (.,.) M = (.,.) K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

Step : Update the cluster means, Repeat Step,, stop when no more new assignment m = (.,.) M = (..,.) K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

stop when no more new assignment K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

K- Means Clustering (K=, two clusters) stop when no more new assignment K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

Summary Cluster Analysis K- Means Clustering Source: Han & Kamber ()

References Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second EdiGon,, Elsevier Efraim Turban, Ramesh Sharda, Dursun Delen, Decision Support and Business Intelligence Systems, Ninth EdiGon,, Pearson.