COC131 Data Mining - Clustering



Martin D. Sykora, m.d.sykora@lboro.ac.uk
Tutorial 05, Friday 20th March 2009

1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch the Explorer window and select the Preprocess tab. Open the iris data-set (iris.arff; this should be in the ./data/ directory of the Weka install).

2. In week 2 (Tutorial 2) we looked at data preprocessing and visualisation. Before we launch into clustering, it's worth recapping how data can be visualised using Weka. Figure 1 shows plots of instances of the iris data-set with possible class boundaries (or clusters) marked on; you can see these plots in Weka under the Visualize tab. Figure 1(a) plots petal width against petal length, and Figure 1(b) plots sepal width against sepal length. Each plot has new points drawn on, marked 1-4. For each plot (a) and (b) and each point (1-4), which class does the new point belong to, and why? Which plot is most reliable for class discrimination, and why? In 1(b) the overlap D between class boundaries B and C is very large; do you think the class boundaries B and C are meaningful, and why?

3. In Figure 1, determining the class boundaries manually was hard enough with labeled data (i.e. knowing the classes). Now imagine a case where the classes are not known a priori. Figure 2 shows the same plots as Figure 1 but without labeling. Without the labeled data, would you mark the same class boundaries as above? Which class boundaries would be particularly hard to place? What else do you notice about these plots?

4. Select the Cluster tab to get into the clustering perspective of Weka. Under Clusterer, select and run each of the following clustering algorithms:
   - Cobweb, EM, FarthestFirst, SimpleKMeans
   - under Cluster mode, leave Use training set selected
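To make the idea behind step 4 more concrete, here is a minimal from-scratch sketch of the general k-means procedure (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is not Weka's SimpleKMeans implementation; the function name `kmeans`, the fixed iteration count, the choice of starting centroids, and the six sample points are all assumptions made up for illustration.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means sketch: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids

# Hypothetical (petal length, petal width) pairs in cm, invented for this sketch.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (5.0, 1.7)]

# Assumption: start from one point in each visually obvious group.
assignments, centroids = kmeans(points, centroids=[points[0], points[3]])
print(assignments)
print(centroids)
```

Note that no class labels are used anywhere: the grouping comes only from distances between points, which is the situation described in step 3.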

5. Under the Result list you should now have an entry for each clusterer you ran. Right-clicking on an entry gives you the option to Visualize cluster assignments. Inspect the plot for each entry (varying the x- and y-axes) and compare it to the class-labeled plots (under the Visualize tab). How do the clusterers compare with the labeled plots? The perfect clusterer should place all the instances of a particular class in the same cluster. Try changing the numClusters parameter where possible. What happens to the cluster assignments?

6. Now run the EM, FarthestFirst and SimpleKMeans clusterers again, but this time select Classes to clusters evaluation under Cluster mode. In this mode Weka first ignores the class attribute and generates the clustering. Then, during the test phase, it assigns classes to the clusters based on the majority value of the class attribute within each cluster. Finally, it computes the classification error based on this assignment and shows the corresponding confusion matrix.
   - For each clusterer, inspect the confusion matrix and classification error. Which clusterer gives the best result?

7. Select the Preprocess tab to return to the dataset loading perspective of Weka. Open the flag data-set (flagdata.arff; this is available on the tutorial webpage http://www-staff.lboro.ac.uk/ comds2). Go back to the clustering perspective of Weka by selecting the Cluster tab. Under Clusterer, select and run the Cobweb clustering algorithm; this is the only hierarchical clusterer within Weka.
   - make sure the two parameters acuity and cutoff are set to 1.0 and 0.4 respectively (-A 1.0 -C 0.4). (They can be specified through the pop-up window that appears when you click on the text field next to the Choose button.)
   - under Cluster mode, leave Use training set selected

8. In the Clusterer output window you will notice a tree (dendrogram) generated by the clusterer. Do you understand what the tree represents in terms of clusters?
   How many clusters are there in the first level of the tree? The tree can be interpreted as follows:
   - node N or leaf N represents a subcluster, whose parent cluster is N
   - The clustering tree structure is shown as a horizontal tree, where subclusters are aligned in the same column. For example, cluster 1 (referred to in node 1) has three subclusters: 2 (leaf 2), 3 (leaf 3) and 4 (leaf 4).
   - The root cluster is 0. Each line with node 0 defines a subcluster of the root.
   - The number in square brackets after node N represents the number of instances in the parent cluster N.
   - Clusters with [1] at the end of the line are instances.
   - Finally, to view the clustering tree in graphical form, right-click on the last line in the Result list window and then select Visualize tree.
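The classes-to-clusters evaluation described in step 6 can be sketched in a few lines. The sketch below follows the majority-vote description given in this tutorial; Weka's own implementation may differ in detail. The function name `classes_to_clusters` and the ten cluster/class pairs are hypothetical and invented purely for illustration, not taken from a real Weka run.

```python
from collections import Counter

def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and measure the error."""
    # Count how often each class occurs inside each cluster.
    # These per-cluster counts are the rows of a confusion matrix.
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1

    # Assign each cluster the class that occurs most often within it.
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}

    # An instance is misclassified if its true class differs from
    # the class assigned to its cluster.
    errors = sum(
        mapping[cluster] != label
        for cluster, label in zip(cluster_ids, class_labels)
    )
    return mapping, counts, errors / len(class_labels)

# Hypothetical clusterer output and true classes (made up for this sketch).
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
class_labels = [
    "setosa", "setosa", "setosa",
    "versicolor", "versicolor", "versicolor", "virginica",
    "virginica", "virginica", "versicolor",
]

mapping, counts, error_rate = classes_to_clusters(cluster_ids, class_labels)
print(mapping)
for cluster in sorted(counts):
    print(cluster, dict(counts[cluster]))
print(error_rate)
```

The class attribute plays no part in forming the clusters; it is only consulted afterwards, which is why this mode can be used to compare clusterers against known labels.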

Figure 1: Labeled data visualisation, panels (a) and (b)

Figure 2: Unlabeled data visualisation, panels (a) and (b)