CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS


Venkat Venkateswaran
Department of Engineering and Science
Rensselaer Polytechnic Institute
275 Windsor Street, Hartford, CT USA
(+1)

John Maleyeff
Lally School of Management & Technology
Rensselaer Polytechnic Institute
275 Windsor Street, Hartford, CT USA
(+1)

ABSTRACT

A new classification approach is explored where service systems are grouped by dimensions of performance important to customers. Service systems are coded as binary vectors and Ward's Algorithm is used to group these systems into eight clusters, using the simple matching metric to measure distances between vectors. The resulting clusters were analyzed. Across the clusters, similar types of customers (i.e., internal vs. external) and similar process characteristics were evident. Hence, this clustering approach generates sets of services that differ from classifications based purely on process characteristics. The implications of this result for leaders of service innovation efforts are discussed.

KEYWORDS: Cluster analysis, Service operations, Service marketing, Innovation

INTRODUCTION

Innovation is often accomplished by adapting ideas, processes, and techniques successful in one situation to solve problems or make improvements in a seemingly unrelated situation. With respect to service innovation, the challenge would be to identify hidden patterns that exist within multiple services, even those that appear unrelated. For example, emergency room trauma center teams of physicians, nurses, and technicians learned to effectively and quickly treat patients by incorporating methods used by pit crews at automobile racing events [1].

Preliminary results of an ongoing research project are presented below. This research attempts to create sets of services that would be deemed similar because, within each set, customers have similar needs.
For example, both trauma center patients and automobile racers need fast and expert service, with little need for other dimensions of performance that customers of other services would consider important. Once the sets are created, their characteristics are explored to determine whether leaders or innovation teams would gain a better understanding of how innovation could be achieved.

BACKGROUND

Prior work in classifying service systems is plentiful, most of it contained in the service marketing literature. For brevity, only a short background is presented here. Numerous attempts have been made to classify services in an effort to provide some understanding of the special challenges faced by service managers. A popular scheme separates services into four types: the service factory, the service shop, the mass service, and the professional service [2]. However, it is not clear that these classification schemes will help managers who wish to manage or improve customer satisfaction, because they tend to be based on the structure of the service rather than on the needs or wants of customers. For example, Verma showed that only 4 of 22 important management challenges are affected by the differences in this classification scheme [2]. The research presented below uses a mathematical approach to cluster services based on the dimensions of performance deemed important by customers of each service.

METHODOLOGY

The approach to classifying services based on performance dimensions uses a binary vector clustering algorithm. The work began with the creation of a data set consisting of 168 services. Each service was analyzed by a professional employee of the organization who was very familiar with the activities associated with the delivery of the service and had access to customers of the service. The services were not selected at random, but they did form a cross section of various service types, albeit one biased towards services within technologically sophisticated organizations. The database contained a mixture of customer types: many of the services were primarily for internal customers, many served external customers exclusively, and some served both internal and external customers.
No single analyst studied more than one service. All of the analysts were working professionals, enrolled in a part-time graduate management program on the Hartford, Connecticut campus of Rensselaer Polytechnic Institute, in a course called Service Operations Management. Each analyst asked several customers of the service to list strengths and weaknesses of the process, and to list key performance dimensions important to customers. The resulting reports followed a standard template that allowed for easy tabulation of key results. To ensure quality and consistency in the data, the authors studied the data generated from each report and at times modified the resulting list of performance dimensions.

The resulting database that was input to the clustering algorithm consisted of 168 records and 9 fields, one field for each of 9 potential dimensions of performance. A binary code was used to signify whether or not the dimension was important to customers (1=important, 0=unimportant). The following dimensions were specified:

(1) empathy (e.g., courtesy, professionalism);
(2) knowledge (of service providers);
(3) communication (providers with customers);
(4) speed (e.g., responsiveness, turnaround time);
(5) usefulness (e.g., comprehension, completeness, flexibility);
(6) quality (e.g., accuracy, consistency);
(7) tangible (a physical good);
(8) convenience (e.g., availability, ease); and
(9) security (e.g., information, financial, personal).

Clustering Algorithm

The problem then becomes one of clustering 168 binary vectors (one per service) into groups containing like vectors. To do this, a metric is needed to gauge the closeness of each pair of binary vectors. Several different distance metrics have been proposed in the literature. We have used the simple matching metric, which is described as follows: given two binary vectors V₁ and V₂ of length L (here, L = 9), let B denote the number of digits where V₁ and V₂ agree. The intervector distance D(V₁,V₂) is equal to 1 - B/L. We note that 0 ≤ D(V₁,V₂) ≤ 1, and that D(V₁,V₂) is 0 when V₁ = V₂ and 1 when V₁ and V₂ are complements.

Ward's Algorithm is a well-known and widely used algorithm for grouping binary vectors into clusters. We have used the version of this algorithm wherein the user specifies the target number of clusters. The algorithm is agglomerative and begins by placing each vector in its own separate cluster. Thus, in the present case, the method began with 168 clusters. Then, clusters are successively merged in a systematic way until the requisite number of clusters is obtained. We next describe how clusters are selected for merging.

The algorithm computes a medoid for every cluster. This is a member of the cluster (not necessarily unique) that has the smallest sum of distances (based on the simple matching metric) to the other members [3]. Thus, a medoid is the binary vector analogous to the familiar centroid of a cluster of points on a plane. However, a medoid (unlike a centroid) is necessarily a member of the cluster. Next, to determine which pair of clusters to merge, the algorithm considers all pairs for merging and selects the pair with the least variance (calculated as the sum of squares of the distances averaged over the number of members in this tentative cluster).
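The simple matching metric, the medoid, and the merge criterion described above can be sketched in a few lines of Python. This is a minimal brute-force illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, ties are broken by taking the first candidate found, and the four example vectors are invented for illustration and do not come from the study's data set.

```python
from itertools import combinations


def simple_matching_distance(v1, v2):
    """D(V1, V2) = 1 - B/L, where B counts agreeing digits and L is the length."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    agree = sum(a == b for a, b in zip(v1, v2))
    return 1 - agree / len(v1)


def medoid(cluster):
    """Member with the smallest sum of distances to the members of the cluster.

    Assumption: when several members tie, the first one is returned.
    """
    return min(
        cluster,
        key=lambda m: sum(simple_matching_distance(m, x) for x in cluster),
    )


def merge_cost(c1, c2):
    """Variance of a tentative merge: squared distances to the medoid of the
    merged cluster, summed and averaged over the number of members."""
    merged = c1 + c2
    m = medoid(merged)
    return sum(simple_matching_distance(m, x) ** 2 for x in merged) / len(merged)


def cluster_binary_vectors(vectors, k):
    """Start with one cluster per vector and merge the least-variance pair
    until only k clusters remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical services coded on the 9 dimensions in the order listed above
# (Emp, Knw, Cmc, Spd, Use, Qua, Tan, Cnv, Sec); 1 = important, 0 = unimportant.
example_services = [
    (1, 1, 0, 1, 1, 1, 0, 1, 0),
    (1, 1, 0, 1, 0, 1, 0, 1, 0),
    (0, 0, 1, 1, 1, 1, 0, 0, 0),
    (0, 0, 1, 1, 1, 1, 0, 1, 0),
]
example_clusters = cluster_binary_vectors(example_services, k=2)
print(example_clusters)
```

Each merge step re-evaluates every pair of clusters, so this sketch is only practical for small data sets such as the 168 vectors considered here; a production implementation would likely cache pairwise distances.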
Any two clusters under consideration are temporarily merged, a medoid is determined, and the sum of squares of the distances from this medoid to all members is computed. By selecting, at each stage, the pair with minimum variance, the algorithm seeks to merge clusters so that the resulting clusters are round (i.e., their members tend to be equally distant from the medoid of the cluster). The algorithm terminates when the requisite number of clusters has been generated.

RESULTS

After some trial and error, the target number of clusters was set to 8. This level of discrimination was chosen because, with fewer clusters, individual clusters appeared to contain dissimilar services, while more clusters would provide a less useful classification scheme. It is important to note that the clusters generated by Ward's Algorithm are known to be fairly immune to the ordering of the input data. The authors verified this characteristic by running the algorithm on a number of reordered data sets. The numbers of services within each cluster (numbered 1-8 in the tables that follow) were 13, 16, 16, 33, 32, 32, 11, and 12, respectively.

Table 1 provides a summary of the 8 clusters by showing, for each cluster, the percentage of services that indicated each potential dimension as important to that service. In the table, the dimensions are abbreviated (Emp is empathy, Knw is knowledge, Cmc is communication, Spd is speed, Use is usefulness, Qua is quality, Tan is tangibles, Cnv is convenience, and Sec is security). For example, the first row shows that, of the 13 services included within Cluster #1, every one had empathy as an important dimension, 11 of the 13 services (85%) had knowledge as an important dimension, none of the 13 services had communication as an important dimension, etc.

Table 1: Clusters and Associated Dimensions

Cluster  Emp   Knw   Cmc   Spd   Use   Qua   Tan   Cnv   Sec
1        100%  85%   0%    100%  69%   92%   8%    100%  8%
2        19%   0%    0%    88%   0%    94%   13%   94%   0%
3        81%   100%  100%  100%  31%   88%   13%   88%   0%
4        6%    21%   70%   94%   88%   94%   6%    100%  6%
5        13%   34%   0%    91%   84%   94%   25%   0%    0%
6        16%   53%   100%  97%   84%   100%  22%   0%    3%
7        9%    45%   9%    100%  100%  73%   100%  100%  0%
8        75%   67%   8%    100%  0%    100%  58%   8%    0%

To explore the usefulness of the resulting classification scheme, a number of statistical analyses were performed. Perhaps the most important of these analyses compared the clusters with another classification scheme that was based on process characteristics, rather than customer dimensions. Details on this scheme may be obtained from the authors. In Table 2, the process-oriented classifications are abbreviated (A=analysis, C=consultation, E=evaluation, G=gathering, P=planning, and T=troubleshooting). For example, of the 13 services contained in Cluster #1, one service was classified as an analysis process, 4 services were classified as a consultation process, one service was classified as an evaluation process, etc. As implied by the diversity of process types within each cluster and supported by a chi-square statistical analysis, no relationship was evident between these two classification schemes (p=0.235).

An example of two similar processes that were assigned to different clusters helps to explain this result. Both processes involved the testing of material. The algorithm assigned one testing process to cluster 4 and a second testing process to cluster 5. Both testing services included quality, speed, and usefulness as important dimensions, but the service classified in cluster 4 also listed convenience and communication. Therefore, the material testing service assigned to cluster 4 had customers who expected more interaction with the service provider than did the material testing service assigned to cluster 5.

Table 2: Clusters and Associated Service Process Classification

Cluster  A  C  E  G  P  T
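The chi-square comparison between cluster membership and process type mentioned above rests on a contingency table of counts. The sketch below shows how the test statistic and its degrees of freedom could be computed from such a table. It is an illustration only: the function name is hypothetical, the two small tables are invented toy counts rather than the study's data, and the code assumes every row and column total is positive.

```python
def chi_square_independence(observed):
    """Pearson chi-square statistic for independence in an r x c table of counts.

    Returns (statistic, degrees_of_freedom, expected_counts).
    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Toy tables (hypothetical counts): rows stand for clusters, columns for process types.
related_counts = [[10, 20], [20, 10]]
unrelated_counts = [[10, 20], [20, 40]]

related_stat, related_dof, _ = chi_square_independence(related_counts)
unrelated_stat, unrelated_dof, _ = chi_square_independence(unrelated_counts)
print(related_stat, related_dof)
print(unrelated_stat, unrelated_dof)
```

Converting the statistic to a p-value requires the chi-square distribution with the computed degrees of freedom; a statistics library routine such as SciPy's `scipy.stats.chi2_contingency` can do both steps at once.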

Table 3 shows the fraction of services in each cluster whose customers were primarily internal or primarily external, and the average number of functions through which the service flowed in each cluster. For example, 53.8% (7 of the 13) of the services in Cluster #1 served primarily internal customers and 46.2% (6 of the 13) of the services in Cluster #1 served primarily external customers. In some clusters, some services served internal and external customers in about equal measure. In these cases, the internal and external fractions will not add to one. Also, in Cluster #1, an average of 5.2 departments or functions took part in delivering the service. An analysis of variance found no evidence that the number of functions varied across clusters (p=0.164). A chi-square analysis showed that the prevalence of internal or external customers did not vary across clusters at a 5% level of significance (p=0.093). Because this result would be significant at a 10% level, a weak relationship between cluster membership and the prevalence of internal customers may exist.

Table 3: Clusters and Characteristics

Cluster  Internal  External  Functions

IMPLICATIONS

The main result of this exploratory investigation is that a difference exists between a classification scheme based on process characteristics and a scheme based on customer preferences. This result has implications for leaders of service improvement or service innovation teams. It also supports an earlier conclusion by Maleyeff [5], who argued that, based on characteristics unique to service systems, improvement efforts should start by focusing on the information being provided to customers of internal services rather than the physical manifestation of that information.
For example, he suggests that rather than focus an improvement project on speeding up the flow of a payment invoice, project teams should first ensure that the information contained on the invoice is useful, clearly printed, unambiguous in meaning, and accurate.

A secondary implication is a word of caution to leaders who may base the improvement or innovation of services exclusively on process improvements. Many service improvement methodologies, such as those contained in the Lean Six Sigma toolbox [6], are process-based (e.g., mistake proofing, process standardization, or visual workflow control). For example, it would appear that a dimension such as empathy may be ignored by project teams that rely on these methodologies. In the case of the emergency room trauma team learning from pit crews, perhaps the innovation was successful because the customers' needs have similar dimensions (e.g., speed and competency).

FUTURE WORK

This research has some limitations. Because binary data is much less powerful than continuous data, a similar analysis that incorporates dimensions measured on a continuous scale should perhaps be undertaken. The precision and reliability of the data used here can also be questioned, due to the multiple analysts and the potential for mischaracterization of customer preferences. This limitation can be addressed in future analyses. Future research could also investigate whether other well-known metrics (besides the simple matching metric) would generate clusters similar to the clusters obtained above. Finally, a more thorough analysis of the best number of clusters may prove useful.

REFERENCES

[1] Nicholson, Kieran, Hospital teams find vroom to improve by changing race-car tires. Denver Post, April 16, 2004, p. B1.
[2] Verma, Rohit, An empirical analysis of management challenges in service factories, service shops, mass services and professional services. International Journal of Service Industry Management, 2000, 11(1).
[3] Guralnik, V. and Karypis, G., A Scalable Algorithm for Clustering Protein Sequences. Workshop on Data Mining in Bioinformatics, 2001.
[4] Luke, Brian T., Agglomerative Linkages.
[5] Maleyeff, John, Exploration of Internal Service Systems using Lean Principles. Management Decision, 2006, 44(5).
[6] Maleyeff, John, Improving Service Delivery in Government Using Lean Six Sigma. IBM Center for The Business of Government, Washington, DC, 2007.


More information

AP Statistics 2002 Scoring Guidelines

AP Statistics 2002 Scoring Guidelines AP Statistics 2002 Scoring Guidelines The materials included in these files are intended for use by AP teachers for course and exam preparation in the classroom; permission for any other use must be sought

More information

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Classify then Summarize or Summarize then Classify

Classify then Summarize or Summarize then Classify Classify then Summarize or Summarize then Classify DIMACS, Rutgers University Piscataway, NJ 08854 Workshop Honoring Edwin Diday held on September 4, 2007 What is Cluster Analysis? Software package? Collection

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

3. INNER PRODUCT SPACES

3. INNER PRODUCT SPACES . INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

More information

Information Architecture Planning Template for Health, Safety, and Environmental Organizations

Information Architecture Planning Template for Health, Safety, and Environmental Organizations Environmental Conference September 18-20, 2005 The Fairmont Hotel Information Architecture Planning Template for Health, Safety, and Environmental Organizations Presented By: Alan MacGregor ENVIRON International

More information

Linear Codes. In the V[n,q] setting, the terms word and vector are interchangeable.

Linear Codes. In the V[n,q] setting, the terms word and vector are interchangeable. Linear Codes Linear Codes In the V[n,q] setting, an important class of codes are the linear codes, these codes are the ones whose code words form a sub-vector space of V[n,q]. If the subspace of V[n,q]

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Hierarchical Cluster Analysis Some Basics and Algorithms

Hierarchical Cluster Analysis Some Basics and Algorithms Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

EXPERIMENTAL ERROR AND DATA ANALYSIS

EXPERIMENTAL ERROR AND DATA ANALYSIS EXPERIMENTAL ERROR AND DATA ANALYSIS 1. INTRODUCTION: Laboratory experiments involve taking measurements of physical quantities. No measurement of any physical quantity is ever perfectly accurate, except

More information

MATHEMATICS CLASS - XII BLUE PRINT - II. (1 Mark) (4 Marks) (6 Marks)

MATHEMATICS CLASS - XII BLUE PRINT - II. (1 Mark) (4 Marks) (6 Marks) BLUE PRINT - II MATHEMATICS CLASS - XII S.No. Topic VSA SA LA TOTAL ( Mark) (4 Marks) (6 Marks). (a) Relations and Functions 4 () 6 () 0 () (b) Inverse trigonometric Functions. (a) Matrices Determinants

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Another Way to Burn. The rise of the burndown chart

Another Way to Burn. The rise of the burndown chart This article describes a burndown chart based upon test cases rather than effort and describes its advantages. It is intended for readers already familiar with the concept of a burndown chart. The rise

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13

More information

Novel Automatic PCB Inspection Technique Based on Connectivity

Novel Automatic PCB Inspection Technique Based on Connectivity Novel Automatic PCB Inspection Technique Based on Connectivity MAURO HIROMU TATIBANA ROBERTO DE ALENCAR LOTUFO FEEC/UNICAMP- Faculdade de Engenharia Elétrica e de Computação/ Universidade Estadual de Campinas

More information

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

More information

Measurement Information Model

Measurement Information Model mcgarry02.qxd 9/7/01 1:27 PM Page 13 2 Information Model This chapter describes one of the fundamental measurement concepts of Practical Software, the Information Model. The Information Model provides

More information

Trends in Interdisciplinary Dissertation Research: An Analysis of the Survey of Earned Doctorates

Trends in Interdisciplinary Dissertation Research: An Analysis of the Survey of Earned Doctorates Trends in Interdisciplinary Dissertation Research: An Analysis of the Survey of Earned Doctorates Working Paper NCSES 12-200 April 2012 by Morgan M. Millar and Don A. Dillman 1 Disclaimer and Acknowledgments

More information

Teaching Multivariate Analysis to Business-Major Students

Teaching Multivariate Analysis to Business-Major Students Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

K-Means Clustering Tutorial

K-Means Clustering Tutorial K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July

More information

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Discrete vs. continuous random variables Examples of continuous distributions o Uniform o Exponential o Normal Recall: A random

More information

Improving Generalization

Improving Generalization Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

CLUSTERING FOR FORENSIC ANALYSIS

CLUSTERING FOR FORENSIC ANALYSIS IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS

More information

Vaccination Level De-duplication in Immunization

Vaccination Level De-duplication in Immunization Vaccination Level De-duplication in Immunization Information Systems (IIS) One of the major functions of an Immunization Information System (IIS) is to create and maintain an accurate and timely record

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

More information

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Time series clustering and the analysis of film style

Time series clustering and the analysis of film style Time series clustering and the analysis of film style Nick Redfern Introduction Time series clustering provides a simple solution to the problem of searching a database containing time series data such

More information

Available online at www.sciencedirect.com Available online at www.sciencedirect.com. Advanced in Control Engineering and Information Science

Available online at www.sciencedirect.com Available online at www.sciencedirect.com. Advanced in Control Engineering and Information Science Available online at www.sciencedirect.com Available online at www.sciencedirect.com Procedia Procedia Engineering Engineering 00 (2011) 15 (2011) 000 000 1822 1826 Procedia Engineering www.elsevier.com/locate/procedia

More information

Data Desk Professional: Statistical Analysis for the Macintosh. PUB DATE Mar 89 NOTE

Data Desk Professional: Statistical Analysis for the Macintosh. PUB DATE Mar 89 NOTE DOCUMENT RESUME ED 309 760 IR 013 926 AUTHOR Wise, Steven L.; Kutish, Gerald W. TITLE Data Desk Professional: Statistical Analysis for the Macintosh. PUB DATE Mar 89 NOTE 10p,; Paper presented at the Annual

More information

Identification of noisy variables for nonmetric and symbolic data in cluster analysis

Identification of noisy variables for nonmetric and symbolic data in cluster analysis Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,

More information

Clustering Hierarchical clustering and k-mean clustering

Clustering Hierarchical clustering and k-mean clustering Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity

More information

Lean Certification Program Blended Learning Program Cost: $5500. Course Description

Lean Certification Program Blended Learning Program Cost: $5500. Course Description Lean Certification Program Blended Learning Program Cost: $5500 Course Description Lean Certification Program is a disciplined process improvement approach focused on reducing waste, increasing customer

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

More information

Measurement Systems Analysis MSA for Suppliers

Measurement Systems Analysis MSA for Suppliers Measurement Systems Analysis MSA for Suppliers Copyright 2003-2007 Raytheon Company. All rights reserved. R6σ is a Raytheon trademark registered in the United States and Europe. Raytheon Six Sigma is a

More information

Classification of Household Devices by Electricity Usage Profiles

Classification of Household Devices by Electricity Usage Profiles Classification of Household Devices by Electricity Usage Profiles Jason Lines 1, Anthony Bagnall 1, Patrick Caiger-Smith 2, and Simon Anderson 2 1 School of Computing Sciences University of East Anglia

More information