BULETINUL Universității Petrol-Gaze din Ploiești Vol. LXII No. 3/2010 Seria Științe Economice

Applying TwoStep Cluster Analysis for Identifying Bank Customers Profile

Daniela Șchiopu

Petroleum-Gas University of Ploiesti, Informatics Department, 39 Bucuresti Blvd., Ploiești, Romania

Abstract

In this paper we analyze information about the customers of a bank, dividing them into three clusters using the SPSS TwoStep Cluster method. This method suits our case study well because, unlike classical clustering methods, TwoStep handles mixed data (both continuous and categorical variables) and also finds the optimal number of clusters. TwoStep produces three customer profiles. The largest group contains skilled customers whose purpose for the loan is education or business. The second group consists of persons with real estate, but mostly unemployed, who asked for a credit for retraining or for household goods. The third profile gathers people with unknown properties, who request a loan first for a car or a television and only then for education. The benefit of the study is reinforcing the company's profits by managing its clients more effectively.

Key words: TwoStep Cluster, clustering, pre-clustering, CF tree

JEL Classification: C63, C46, C9

Introduction

The applications that can use clustering algorithms belong to various fields. However, most of these algorithms work with either numerical data or categorical data, whereas data from the real world contains both numerical and categorical attributes. TwoStep Cluster is an SPSS method which solves this problem. In the present paper, we intend to identify bank customer profiles, starting from a public dataset provided by a German bank and using TwoStep Cluster. This method has the advantage of determining the proper number of clusters, so the aim is to find this number of profiles, in order to manage existing and potential clients effectively.
In the following sections, we introduce the TwoStep Cluster method and our case study, with its inputs, outputs and the interpretation of the results.

Statistical Approach

Data grouping (or data clustering) is a method that forms classes of objects with similar characteristics. Clustering is often confused with classification, but there is a major difference between them: in classification, objects are assigned to predefined classes, whereas in clustering those classes must be defined as well.
Clustering techniques are used when we expect the data to group together naturally into various categories. The clusters are categories of items with many features in common, for instance customers, events etc. If the problem is complex, other data mining techniques (such as neural networks or decision trees) can be applied before clustering the data.

Classical methods of clustering use hierarchical or partitioning algorithms. Hierarchical algorithms form the clusters successively, on the basis of clusters established before, while partitioning algorithms determine all the clusters at the same time, building different partitions and then evaluating them against certain criteria.

In SPSS, cluster analysis can be performed using TwoStep Cluster, Hierarchical Cluster or K-Means Cluster, each of them relying on a different algorithm to create the clusters. The last two are classical clustering methods, based on hierarchical and partitioning algorithms respectively, while the TwoStep method is especially designed for and implemented in SPSS. In terms of the types of data each can handle, Hierarchical Cluster is limited to small datasets, K-Means is restricted to continuous values, and TwoStep can create cluster models based on both continuous and categorical variables. Next, we describe the TwoStep method, highlighting its advantages in the field under discussion.

The TwoStep Cluster Analysis

TwoStep Cluster is an algorithm primarily designed to analyze large datasets. The algorithm groups the observations into clusters, using a closeness criterion. The procedure uses an agglomerative hierarchical clustering method. Compared to classical methods of cluster analysis, TwoStep handles both continuous and categorical attributes. Moreover, the method can automatically determine the optimal number of clusters.
TwoStep Cluster involves performing the following steps:
o pre-clustering;
o handling atypical values (outliers) - optional;
o clustering.

In the pre-clustering step, the algorithm scans the data records one by one and decides whether the current record can be added to one of the previously formed clusters or should start a new cluster, based on a distance criterion. The method uses two types of distance measure: Euclidian distance and log-likelihood distance.

The pre-clustering procedure is implemented by building a data structure called a CF (cluster feature) tree, which contains the cluster centers. The CF tree consists of levels of nodes, each node having a number of entries. A leaf entry is a final sub-cluster. For each record, starting from the root node, the nearest child node is found recursively, descending along the CF tree. Once a leaf node is reached, the algorithm finds the nearest leaf entry in that node. If the record is within a threshold distance of the nearest leaf entry, the record is absorbed into that leaf entry and the CF tree is updated. Otherwise, a new leaf entry is created. If there is not enough space in the leaf node for another entry, the leaf node is split into two, using the farthest pair of entries as seeds and redistributing the remaining entries based on the closeness criterion.

In the process of building the CF tree, the algorithm includes an optional step for handling atypical values (outliers). Outliers are records that do not fit well into any cluster. In SPSS, the records in a leaf entry are considered outliers if their number is less than a certain percentage of the size of the largest leaf entry in the CF tree; by default, this percentage is 25%. Before rebuilding the CF tree, the procedure searches for potential atypical values and sets them aside. After the CF tree is rebuilt, the procedure checks whether these values fit in the tree without increasing its size. Finally, the values that do not fit anywhere are considered outliers.

If the CF tree exceeds the allowed maximum size, it is rebuilt from the existing CF tree by increasing the threshold distance. The new CF tree is smaller and accommodates new input records.

The clustering stage takes the sub-clusters resulting from the pre-clustering step as input (without the noise, if the optional step was used) and groups them into the desired number of clusters. Because the number of sub-clusters is much smaller than the number of initial records, classical clustering methods can be used successfully. TwoStep uses an agglomerative hierarchical method, which determines the number of clusters automatically. Hierarchical clustering refers to the process by which clusters are repeatedly merged until a single cluster groups all the records. The process starts by defining an initial cluster for each sub-cluster. Then all clusters are compared, and the pair of clusters with the smallest distance between them is merged into one cluster.
The process repeats with a new set of clusters until all clusters have been merged. In this way, it is quite simple to compare solutions with different numbers of clusters.

To calculate the distance between clusters, either the Euclidian distance or the log-likelihood distance can be used. The Euclidian distance can be used only if all variables are continuous. The Euclidian distance between two points is defined as the square root of the sum of the squares of the differences between the coordinates of the points. For clusters, the distance between two clusters is defined as the Euclidian distance between their centers, where a cluster center is the vector of cluster means of each variable.

The log-likelihood distance can be used for both continuous and categorical variables. The distance between two clusters is related to the decrease in the natural logarithm of the likelihood function as they are combined into one cluster. To calculate the log-likelihood distance, it is assumed that the continuous variables have normal distributions, that the categorical variables have multinomial distributions, and that the variables are independent of each other. The distance between clusters i and j is defined as:

d(i, j) = \xi_i + \xi_j - \xi_{\langle i,j \rangle}    (1)
where the term \xi_s for a cluster s in equation (1) is defined by equations (2) and (3):

\xi_s = -N_s \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left( \hat{\sigma}_k^2 + \hat{\sigma}_{sk}^2 \right) + \sum_{k=1}^{K^B} \hat{E}_{sk} \right)    (2)

\hat{E}_{sk} = -\sum_{l=1}^{L_k} \frac{N_{skl}}{N_s} \log \frac{N_{skl}}{N_s}    (3)

with notations: d(i, j) is the distance between clusters i and j; \langle i,j \rangle is the index of the cluster formed by combining clusters i and j; K^A is the total number of continuous variables; K^B is the total number of categorical variables; L_k is the number of categories of the k-th categorical variable; N_s is the total number of data records in cluster s; N_{skl} is the number of records in cluster s whose categorical variable k takes category l; N_{kl} is the number of records in the entire dataset whose categorical variable k takes category l; \hat{\sigma}_k^2 is the estimated variance (dispersion) of continuous variable k for the entire dataset; \hat{\sigma}_{sk}^2 is the estimated variance of continuous variable k in cluster s.

To determine the number of clusters automatically, the method uses two stages. In the first stage, the indicator BIC (Schwarz's Bayesian Information Criterion) or AIC (Akaike's Information Criterion) is computed for each number of clusters in a specified range and used to find an initial estimate of the number of clusters. For J clusters, the two indicators are computed according to equations (4) and (5):

BIC(J) = -2 \sum_{j=1}^{J} \xi_j + m_J \log(N)    (4)

AIC(J) = -2 \sum_{j=1}^{J} \xi_j + 2 m_J    (5)

where

m_J = J \left( 2 K^A + \sum_{k=1}^{K^B} (L_k - 1) \right)    (6)

In the second stage, this initial estimate is refined using the ratio of distance measures between the closest clusters at successive merge steps, the number of clusters with the largest ratio being chosen.

The relative contribution of the variables to forming the clusters is computed for both types of variables (continuous and categorical). For the continuous variables, the importance measure is based on the Student t statistic:

t = \frac{\hat{\mu}_{sk} - \hat{\mu}_k}{\hat{\sigma}_{sk} / \sqrt{N_s}}    (7)

where \hat{\mu}_k is the estimated mean of continuous variable k for the entire dataset and \hat{\mu}_{sk} is the estimated mean of continuous variable k in cluster s. Under H_0 (the null hypothesis), the importance measure has a Student distribution with N_s - 1 degrees of freedom. The significance level is two-tailed.
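Equations (1)-(3) can be made concrete with a short sketch. The following Python code is illustrative only, not the SPSS implementation: clusters are represented as lists of records, each record a pair (continuous values, categorical values), and the function names are our own.

```python
# Sketch of the log-likelihood distance of equations (1)-(3).
# A cluster is a list of records; each record is a pair
# (tuple_of_continuous_values, tuple_of_categorical_values).
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def xi(cluster, global_var):
    """xi_s of equation (2); global_var[k] is the dataset-wide
    variance of the k-th continuous variable."""
    n = len(cluster)
    cont_cols = list(zip(*(r[0] for r in cluster)))  # continuous columns
    cat_cols = list(zip(*(r[1] for r in cluster)))   # categorical columns
    total = 0.0
    for k, col in enumerate(cont_cols):
        total += 0.5 * math.log(global_var[k] + variance(col))
    for col in cat_cols:                             # E_sk of equation (3)
        counts = {}
        for v in col:
            counts[v] = counts.get(v, 0) + 1
        total -= sum((c / n) * math.log(c / n) for c in counts.values())
    return -n * total

def distance(ci, cj, global_var):
    """d(i, j) of equation (1)."""
    return xi(ci, global_var) + xi(cj, global_var) - xi(ci + cj, global_var)

# Two purely categorical clusters: identical within, different between.
ci = [((), ("a",)), ((), ("a",))]
cj = [((), ("b",)), ((), ("b",))]
d = distance(ci, cj, [])  # positive: merging dissimilar clusters loses likelihood
```

Note that each homogeneous cluster has \xi = 0 here (zero entropy), so the whole distance comes from the entropy of the merged cluster.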
For the categorical variables, the importance measure is based on the chi-square test:

\chi^2 = \sum_{l=1}^{L_k} \frac{\left( N_{skl} - N_s N_{kl} / N \right)^2}{N_s N_{kl} / N}    (8)

which, under the null hypothesis, is distributed as \chi^2 with L_k - 1 degrees of freedom.

Regarding the cluster membership of the items, records are allocated according to the specifications for handling atypical values (the noise) and the options for measuring distances. If the option of handling atypical values is not used, each value is assigned to the nearest cluster, according to the chosen distance measure. Otherwise, the values are treated as follows:

o In the case of the Euclidian method, an item is assigned to the nearest cluster if the distance between them is smaller than a critical value,

C = \frac{2}{J K^A} \sum_{j=1}^{J} \sum_{k=1}^{K^A} \hat{\sigma}_{jk}    (9)

Otherwise, the item is declared noise (an outlier).

o If the log-likelihood method is chosen, the noise is assumed to follow a uniform distribution, and the procedure computes both the log-likelihood resulting from assigning an item to a noise cluster and that resulting from assigning it to the nearest non-noise cluster. The item is then assigned to the cluster that obtains the larger log-likelihood. This is equivalent to assigning an item to the nearest cluster if the distance between them is smaller than a critical value; otherwise, the item is designated as noise.

In conclusion, an important advantage of the method is that it operates on mixed data (both continuous and categorical). Another advantage is that, although the TwoStep method works with large datasets, it needs less processing time than other methods. As a disadvantage, the TwoStep method does not allow missing values: items with missing values are excluded from the analysis.
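The overall flow described in this section, a single pre-clustering scan followed by agglomerative merging of the resulting sub-clusters, can be sketched in miniature. This is an illustration, not the SPSS implementation: it uses a leader-style threshold pass instead of a CF tree, Euclidian distance only, and a fixed target number of clusters; the sample data and the threshold are assumptions made for the example.

```python
# Minimal two-phase sketch: threshold pre-clustering, then merging.
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pre_cluster(records, threshold):
    """Phase 1: scan records once; absorb each into the nearest
    sub-cluster if within `threshold`, otherwise start a new one."""
    centers, members = [], []
    for r in records:
        if centers:
            d, i = min((euclid(r, c), i) for i, c in enumerate(centers))
            if d <= threshold:
                members[i].append(r)
                n = len(members[i])  # incremental update of the center
                centers[i] = tuple((c * (n - 1) + x) / n
                                   for c, x in zip(centers[i], r))
                continue
        centers.append(tuple(r))
        members.append([r])
    return centers, members

def merge(centers, k):
    """Phase 2: repeatedly merge the two closest groups of
    sub-cluster centers until only k groups remain."""
    groups = [[c] for c in centers]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                mi = [sum(v) / len(groups[i]) for v in zip(*groups[i])]
                mj = [sum(v) / len(groups[j]) for v in zip(*groups[j])]
                d = euclid(mi, mj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] += groups.pop(j)
    return groups

data = [(0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (9.8, 0.2)]
centers, members = pre_cluster(data, threshold=1.0)  # 3 sub-clusters
groups = merge(centers, k=2)                         # merged down to 2 groups
```

Because phase 2 works on sub-cluster centers rather than raw records, the expensive pairwise comparison runs on a much smaller input, which is the point of the two-step design.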
Case Study

Since TwoStep Cluster is often preferred first for large datasets and second for handling mixed data, we applied this method to public data about the clients of a bank in order to cluster them. (Some of this data was also used in another application, to reduce dimensionality by applying PCA - Principal Component Analysis.) The input and the output of the method are presented below.

A related paper presents a study of client value control using the same method we use in this paper. The authors propose a policy for consolidating a company's profits by selecting clients through the cluster analysis method of CRM (Customer Relationship Management), thereby managing resources better. For the design of a new service policy, they analyze the contribution level of the clients' service pattern: total number of visits to the homepage, service type,
service usage period, total payment, average service period, service charge per homepage visit and profits, through cluster analysis of the clients' data. The clients were grouped into four clusters according to their contribution level in terms of profits.

Input

The dataset used for our case study was obtained from a public database containing credit data of a German bank. The dataset has 1000 records and is presented as a table in SPSS. The table holds information about the duration of the credit, credit history, purpose of the loan, credit amount, savings account, years employed, payment rate, personal status, residency, property, age, housing, number of credits at the bank, job, dependents and credit approval. Table 1 presents part of this data.

Table 1. Source data

Duration  CreditHistory  Purpose     CreditAmount  YearsEmployed  PaymentRate  PersonalStatus
6         critical       television  1169.0        >=7            4.0          male_single
48        ok_til_now     television  5951.0        <4             2.0          female
12        critical       education   2096.0        <7             2.0          male_single
42        ok_til_now     furniture   7882.0        <7             2.0          male_single
24        past_delays    car_new     4870.0        <4             3.0          male_single
36        ok_til_now     education   9055.0        <4             2.0          male_single
24        ok_til_now     furniture   2835.0        >=7            3.0          male_single
36        ok_til_now     car_used    6948.0        <4             2.0          male_single
12        ok_til_now     television  3059.0        <7             2.0          male_divorced
30        critical       car_new     5234.0        unemployed     4.0          male_married

The database contains 9 categorical variables and 7 continuous variables. Continuous variables are standardized by default. Because we use mixed data, only the log-likelihood distance measure is available. In the first run, we let BIC determine the number of clusters, though this can be overridden with a fixed number. The results obtained with AIC do not differ from those obtained with BIC, so below we present only those obtained with the BIC indicator. Regarding the noise (the outliers) in our dataset, we do not check the noise handling option.
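Since the continuous variables are standardized by default before clustering, the transformation amounts to replacing each value by its z-score. A minimal sketch, with hypothetical loan durations as input (the function name is ours, not an SPSS API):

```python
# Z-score standardization of one continuous column.
import math

def standardize(column):
    """Return (x - mean) / standard_deviation for each value."""
    m = sum(column) / len(column)
    sd = math.sqrt(sum((x - m) ** 2 for x in column) / len(column))
    return [(x - m) / sd for x in column]

durations = [6, 48, 12, 42, 24]  # hypothetical credit durations, in months
z = standardize(durations)       # mean 0, unit variance
```

After standardization every continuous variable contributes on a comparable scale, so no single variable (such as the large CreditAmount values) dominates the distance computation.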
Outliers are defined as cases in leaf entries of the CF tree that contain fewer records than the specified percentage of the maximum leaf entry size. An important option offered by SPSS is exporting the CF tree, or the entire model, in XML format. This allows the model to be updated later with additional datasets.

Output

The Auto-Clustering statistics table in the SPSS output can be used to assess the optimal number of clusters in our analysis, as shown in Table 2.
Table 2. Auto-Clustering statistics: Schwarz's Bayesian Criterion (BIC) and the ratio of distance measures, for each candidate number of clusters

In Table 2, although the lowest BIC value is reached for seven clusters, according to the SPSS algorithm the optimal number of clusters is three, because the largest ratio of distance measures is obtained for three clusters. The cluster distribution is shown in Table 3.

Table 3. Cluster distribution

            N      % of Combined   % of Total
Cluster 1   150    15.0%           15.0%
Cluster 2   351    35.1%           35.1%
Cluster 3   499    49.9%           49.9%
Combined    1000   100.0%          100.0%
Total       1000                   100.0%

SPSS also presents the frequencies of each categorical variable. Table 4 shows the frequencies of the SavingsAccount variable.

Table 4. Frequencies for the SavingsAccount variable

            <100           <1000         <500          >=1000        unknown
            F     %        F     %       F     %       F     %       F     %
Cluster 1   88    14.6%    8     12.7%   16    15.5%   5     10.4%   33    18.0%
Cluster 2   240   39.8%    20    31.7%   34    33.0%   15    31.3%   42    23.0%
Cluster 3   275   45.6%    35    55.6%   53    51.5%   28    58.3%   108   59.0%
Combined    603   100.0%   63    100.0%  103   100.0%  48    100.0%  183   100.0%

Note: F - frequency; % - percent within the savings category

The cluster pie chart in Figure 1 shows the relative sizes of the clusters in our three-cluster solution.
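The selection rule behind Table 2 (an initial BIC-based estimate refined by the ratio of distance measures) can be sketched as follows. The numbers below are hypothetical, chosen only to reproduce the qualitative pattern reported above: BIC still decreasing at seven clusters, but the largest distance ratio at three.

```python
# Sketch of the two-stage choice of the number of clusters.
def choose_clusters(bic, ratio_of_distance):
    """bic: {J: BIC(J)}; ratio_of_distance: {J: distance ratio at J}."""
    # Stage 1: BIC must still be decreasing for J to remain a candidate.
    candidates = [j for j in sorted(bic) if j > 1 and bic[j] < bic[j - 1]]
    # Stage 2: among the candidates, the largest distance ratio wins.
    return max(candidates, key=lambda j: ratio_of_distance[j])

bic = {1: 700, 2: 650, 3: 620, 4: 610, 5: 605, 6: 600, 7: 598}
ratio = {2: 1.3, 3: 2.1, 4: 1.1, 5: 1.0, 6: 1.05, 7: 1.0}
best = choose_clusters(bic, ratio)  # 3, as in the paper's solution
```

This illustrates why the lowest BIC alone is not decisive: the sharp jump in the distance ratio at three clusters signals that merging any further would join two genuinely distinct groups.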
Fig. 1. Cluster sizes

For categorical variables, the within-cluster percentage plot shows how each variable is distributed within each cluster. Figure 2 shows the contribution of the variable Property within each of the three clusters. Note that in cluster 1 the predominant property is unknown, while in cluster 2 it is real estate and in cluster 3 the car.

Fig. 2. The weight of Property in each cluster

SPSS provides an importance plot for each variable (categorical or continuous). Figure 3 presents the importance of the categorical variables for the first two clusters. Note that Property and Housing contribute the most to differentiating the first cluster, while PersonalStatus, CreditHistory, Housing and YearsEmployed differentiate the second.

Fig. 3. Categorical variable-wise importance for clusters 1 and 2
Regarding the importance of the continuous variables for cluster 3 (Figure 4), we note that cluster 3 is differentiated by the top four variables (the number of credits at the bank, the dependents, the age and the payment rate) in a positive direction and only by ResidenceSince in a negative direction, the positive variables contributing more to the differentiation of cluster 3.

Fig. 4. Continuous variable-wise importance for cluster 3

Discussion

We draw the following conclusions from the results provided by TwoStep Cluster. The first cluster, covering 15.0% of the records, contains mostly single male customers who occupy management positions (34.5%) or are unemployed (7.3%); they have unknown properties and their loans are approved in a small percentage. Cluster 2, covering 35.1%, contains female or married male customers with real estate (54.6%), mostly unemployed (54.5%) or unskilled (47.5%); the purpose of the loan is appliances, retraining and furniture. The most important cluster is the third. This is the largest cluster (49.9%), containing mostly single or divorced male customers, with the largest savings accounts, between 4 and 7 years employed, occupying management positions (54.7%) or being skilled workers (50.6%), with a good credit history; the purpose of the loan is business, cars (new or used), or education; they have their own housing (65%) and their loans are approved in a large percentage (55.9%).

Conclusions

Clustering methods can be applied in various fields which use large datasets, in order to find hidden patterns. Since most data taken from the real world (as in the banking field, in our case) contains both numerical and categorical attributes, classical clustering algorithms cannot work efficiently with such data. To solve this problem, we showed that the TwoStep method can easily be used, and that it also determines the optimal number of clusters automatically. Applying this method to our data, we identified three customer profiles.
The most important profile contains skilled customers with no bad credit history, whose purpose is to obtain a loan for education or for business. The second profile groups middle-class customers, unemployed but owning real estate, whose loan is for retraining or for household goods. The third profile groups persons with unknown properties, mostly unemployed, who want credit first for things such as new or used cars or a television, and only then for education.
This case study is useful for a bank that intends to consolidate its profit through better management of existing or potential clients when granting loans.

References

1. *** Analiza datelor, available online [accessed July 2010].
2. *** Euclidian distance, available online [accessed July 2010].
3. *** Source data, available at german/ [accessed May 2010].
4. *** SPSS (Statistical Package for the Social Sciences), available online [accessed July 2010].
5. *** The SPSS TwoStep cluster component, Technical report, available at _The%20SPSS%20TwoStep%20Cluster%20Component.pdf [accessed July 2010].
6. *** TwoStep Cluster Analysis, available at statistics/algorithms/4.0/twostep_cluster.pdf [accessed July 2010].
7. Arminger, G., Clogg, C., Sobel, M., Handbook of Statistical Modeling for the Social and Behavioral Sciences, Plenum Press, New York, 1995.
8. Bacher, J., Wenzig, K., Vogler, M., SPSS TwoStep Cluster - A First Evaluation, available online [accessed July 2010].
9. Garson, D., Cluster Analysis, available at cluster.htm [accessed July 2010].
10. Ioniță, I., Șchiopu, D., Using principal component analysis in loan granting, Bulletin of Petroleum-Gas University of Ploieşti, Mathematics, Informatics, Physics Series, vol. LXII, no. 1/2010.
11. Park, H.-S., Baik, D.-K., A study for control of client value using cluster analysis, Journal of Network and Computer Applications, Vol. 29, No. 4, Elsevier, 2006.

Applying the TwoStep Cluster Method in Identifying the Profile of a Bank's Customers

Abstract

This article analyzes data about the customers of a bank, grouping them into three categories using the TwoStep Cluster method in SPSS.
This method is useful in our case because, compared to other traditional clustering methods, TwoStep can work with mixed data (both continuous and categorical) and can also find the optimal number of clusters. The method creates three customer profiles. The largest group consists of skilled customers who want a loan for education or for business. The second group contains persons who own real estate but are mostly unemployed, and who ask for a loan in order to retrain or for household goods. The third profile comprises persons whose properties are not known and who ask for a loan for things such as a car or a television, and only then for education. The advantage of our study is the consolidation of the company's profit through more efficient management of its clients.
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati firstname.lastname@example.org, email@example.com
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
The Analysis of Research Data The design of any project will determine what sort of statistical tests you should perform on your data and how successful the data analysis will be. For example if you decide
Start-up Companies Predictive Models Analysis Boyan Yankov, Kaloyan Haralampiev, Petko Ruskov Abstract: A quantitative research is performed to derive a model for predicting the success of Bulgarian start-up
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
DATA MINING METHODS WITH TREES Marta Žambochová 1. Introduction The contemporary world is characterized by the explosion of an enormous volume of data deposited into databases. Sharp competition contributes
Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
Clustering Via Decision Tree Construction Bing Liu 1, Yiyuan Xia 2, and Philip S. Yu 3 1 Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053.
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION 1.0.8 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Title Introduction to Data Mining Dr Arulsivanathan Naidoo Statistics South Africa OECD Conference Cape Town 8-10 December 2010 1 Outline Introduction Statistics vs Knowledge Discovery Predictive Modeling
Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav
Data Mining Part 2. and Preparation 2.1 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Outline Introduction Measuring the Central Tendency Measuring the Dispersion of Data Graphic Displays References
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India firstname.lastname@example.org 2 School of Computer
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,
DATA ANALYTICS USING R Duration: 90 Hours Intended audience and scope: The course is targeted at fresh engineers, practicing engineers and scientists who are interested in learning and understanding data
Minimum Distance to Means Similar to Parallelepiped classifier, but instead of bounding areas, the user supplies spectral class means in n-dimensional space and the algorithm calculates the distance between