Comparative Analysis of Supervised and Unsupervised Discretization Techniques

Rajashree Dash (1), Rajib Lochan Paramguru (2), Rasmita Dash (3)
Department of Computer Science and Engineering, ITER, Siksha O Anusandhan University, Bhubaneswar, India
(1) rajashree_dash@yahoo.co.in, (2) rajib_kec@yahoo.com, (3) rasmita0@yahoo.co.in

Abstract

Most Machine Learning and Data Mining applications are applicable only to discrete features. However, data in the real world are often continuous in nature. Even for algorithms that can directly deal with continuous features, learning is often less efficient and effective. Discretization addresses this issue by finding intervals of numbers which are more concise to represent and specify. Discretization of continuous attributes is one of the important data preprocessing steps of knowledge extraction. An effective discretization method not only reduces the demand on system memory and improves the efficiency of data mining and machine learning algorithms, but also makes the knowledge extracted from the discretized dataset more compact and easier to understand and use. In this paper, different types of traditional supervised and unsupervised discretization techniques are discussed, along with examples, advantages and drawbacks.

Keywords: Continuous Features, Discrete Features, Discretization, Supervised Discretization, Unsupervised Discretization.

1. Introduction

Due to the development of new techniques, the rate of growth of scientific databases has become very large, which creates both a need and an opportunity to extract knowledge from databases. The data in databases are usually found in a mixed format: nominal, discrete, and/or continuous. Discrete and continuous data are ordinal data types with an order among their values, while nominal values do not possess any order among them. Discrete values are intervals in a continuous spectrum of values. The number of continuous values for an attribute can be infinitely many, but the number of discrete values is often few or finite. Continuous features, also called quantitative features (e.g. people's height or age), can be ranked in order and admit meaningful arithmetic operations. Discrete features, also often referred to as qualitative features (e.g. sex or degree of education), can take only a limited number of values; they can sometimes be arrayed in a meaningful order, but no arithmetic operations can be applied to them.

In the field of Machine Learning and Data Mining, there exist many learning algorithms that are primarily oriented to handling discrete features. However, data in the real world are often continuous in nature. Hence discretization is a commonly used data preprocessing procedure that transforms continuous features into discrete features [1]. It is the process of partitioning continuous variables into categories. Unfortunately, the number of ways to discretize a continuous attribute is infinite. Discretization is a potential time-consuming bottleneck, since the number of possible discretizations is exponential in the number of interval threshold candidates within the domain. Discretization techniques are often used by classification algorithms, genetic algorithms, instance-based learning and a wide range of other learning algorithms. Use of discrete values has a number of advantages:

- Discrete features require less memory space.
- Discrete features are often closer to a knowledge-level representation.
- Data can be reduced and simplified through discretization, making it easier to understand, use and explain.
- Learning will be more accurate and faster using discrete features.

2. Discretization Process

Data discretization is a general purpose pre-processing method that reduces the number of distinct values of a given continuous variable by dividing its range into a finite set of disjoint intervals, and then relates these intervals with meaningful labels [2]. Subsequently, data are analyzed or reported at this higher level of knowledge representation rather than at the level of individual values, which leads to a simplified data representation in the data exploration and data mining process. A discretization process flows in four steps [3], as depicted in Figure 1 (a code sketch of this flow appears at the end of this section):

1. Sort the continuous values of the attribute to be discretized.
2. Evaluate a cut-point for splitting, or adjacent intervals for merging.
3. Split or merge intervals of continuous values according to some criterion.
4. Stop at some point based on a stopping criterion.

Figure 1. Steps of the Discretization Process

The goal of discretization is to find a set of cut points that partition the range into a small number of intervals. Discretization has two main tasks. The first task is to find the number of discrete intervals. Only a few discretization algorithms perform this; often, the user must specify the number of intervals or provide a heuristic rule. The second task is to find the width, or the boundaries, of the intervals given the range of values of a continuous attribute. Usually, in the discretization process, after sorting the data in ascending or descending order with respect to the variable to be discretized, landmarks must be chosen among the whole dataset. In general, the algorithm for choosing landmarks can be either top-down, which starts with an empty list of landmarks and splits intervals, or bottom-up, which starts with the complete list of all the values as landmarks and merges intervals. In both cases there is a stopping criterion, which specifies when to stop the discretization process.
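To make the top-down variant of this flow concrete, the following minimal Python sketch walks the four steps. The functions `score_cut` and `stop` are hypothetical placeholders for the criteria discussed in later sections (entropy, chi-square); they are not part of the original paper.

```python
def top_down_discretize(values, score_cut, stop, max_intervals=10):
    """Generic top-down discretization following the four steps above."""
    data = sorted(values)                  # step 1: sort the continuous values
    cuts = []
    while len(cuts) + 1 < max_intervals:
        # step 2: candidate cut points lie midway between adjacent distinct values
        candidates = [(a + b) / 2 for a, b in zip(data, data[1:])
                      if a != b and (a + b) / 2 not in cuts]
        if not candidates:
            break
        best = min(candidates, key=lambda c: score_cut(data, cuts, c))
        if stop(data, cuts, best):         # step 4: stopping criterion
            break
        cuts.append(best)                  # step 3: split at the chosen cut point
        cuts.sort()
    return cuts
```

A bottom-up variant would instead start with every distinct value as a landmark and merge intervals, as ChiMerge does in Section 5.2.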
3. Classification of Discretization Methods

The motivation for the discretization of continuous features is the need to obtain higher accuracy rates when handling data with high-cardinality attributes. Discretization methods have been developed along different lines due to different needs. The main classifications are: supervised vs. unsupervised, dynamic vs. static, global vs. local, splitting (top-down) vs. merging (bottom-up), and direct vs. incremental [4], [5], [6].

Discretization methods can be supervised or unsupervised depending on whether they use the class information of the data set. Supervised methods make use of the class label when partitioning the continuous features. On the other hand, unsupervised discretization methods do not require the class information to discretize continuous attributes. Supervised discretization can be further characterized as error-based, entropy-based or statistics-based. Unsupervised discretization is seen in earlier methods like equal-width and equal-frequency binning. Discretization methods can also be viewed as dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5, while in the static approach discretization is done prior to the classification task. Next, the distinction between global and local methods is based on the stage at which discretization takes place. Global methods discretize features prior to induction; in contrast, local methods discretize features during the induction process. Empirical results have indicated that global discretization methods often produce superior results compared to local methods, since the former use the entire value domain of a numeric attribute for discretization, whereas local methods produce intervals that are applied to subpartitions of the instance space. Finally, the distinction between top-down and bottom-up discretization methods can be made. Top-down methods consider one big interval containing all known values of a feature and then partition this interval into smaller and smaller subintervals until a certain stopping criterion or an optimal number of intervals is achieved. In contrast, bottom-up methods initially consider a number of intervals, determined by the set of boundary points, and combine these intervals during execution until a certain stopping criterion, such as a chi-square threshold, or an optimal number of intervals is achieved. Another dimension of discretization methods is direct vs. incremental. Direct methods divide the range into k intervals simultaneously (e.g., equal-width and equal-frequency), needing an additional input from the user to determine the number of intervals. Incremental methods begin with a simple discretization and pass through an improvement process, needing an additional criterion to know when to stop discretizing.

4. Unsupervised Discretization Methods

Among the unsupervised discretization methods there are the simple ones (equal-width and equal-frequency interval binning) and the more sophisticated ones based on clustering analysis, such as k-means discretization. Continuous ranges are divided into subranges by the user-specified width or frequency [7].

4.1. Equal-Width Interval Discretization

Equal-width interval discretization is the simplest discretization method. It divides the range of observed values of a feature into k equally sized bins, where k is a parameter provided by the user. The process involves sorting the observed values of the continuous feature and finding the minimum, V_min, and maximum, V_max. The bin width and the boundaries are then computed as:

    Interval = (V_max - V_min) / k
    Boundaries = V_min + i * Interval, for i = 1, ..., k-1

This type of discretization does not depend on the multi-relational structure of the data. However, it is sensitive to outliers, which may drastically skew the range. Its main limitation follows from the possibly uneven distribution of the data points: some intervals may contain many more data points than others.

Table 1. Solved example for Equal-Width Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50

Here V_min = 10 and V_max = 50; let k = 2. So Interval = 20 and the boundary is 10 + 20 = 30, giving the intervals {10, 12, 15, 20, 25} and {30, 40, 50}.
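As a check on this worked example, here is a minimal Python sketch of equal-width binning; the data values are those recovered from the worked examples in Tables 1-6.

```python
def equal_width_cuts(values, k):
    """Boundaries V_min + i * Interval, with Interval = (V_max - V_min) / k."""
    v_min, v_max = min(values), max(values)
    interval = (v_max - v_min) / k
    return [v_min + i * interval for i in range(1, k)]

# Worked example from Table 1: V_min = 10, V_max = 50, k = 2.
data = [10, 12, 15, 20, 25, 30, 40, 50]
print(equal_width_cuts(data, k=2))   # [30.0] -> Interval = 20, boundary = 10 + 20 = 30
```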

4.2. Equal-Frequency Interval Discretization

The equal-frequency algorithm determines the minimum and maximum values of the attribute to be discretized, sorts all values in ascending order, and divides the sorted continuous values into k intervals such that each interval contains approximately n/k adjacent data instances. This algorithm tries to overcome the limitation of equal-width interval discretization by dividing the domain into intervals with the same number of data points. Note that under a naive equal-frequency split, many occurrences of a single continuous value could be assigned to different bins; since data instances with an identical value must be placed in the same interval, it is not always possible to generate exactly k equal-frequency intervals. This method is also called proportional k-interval discretization.

Table 2. Solved example for Equal-Frequency Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50

Let k = 2. Each interval will contain 8/2 = 4 data instances: {10, 12, 15, 20} and {25, 30, 40, 50}.
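A sketch of equal-frequency cut-point selection matching Table 2; placing cut points midway between bins is an implementation choice here, not something the paper prescribes.

```python
def equal_frequency_cuts(values, k):
    """Cut points so that each of the k bins holds about n/k adjacent instances."""
    data = sorted(values)
    n = len(data)
    cuts = []
    for i in range(1, k):
        lo, hi = data[i * n // k - 1], data[i * n // k]  # values around the bin edge
        if lo != hi:                    # identical values must stay in one interval,
            cuts.append((lo + hi) / 2)  # so exactly k bins is not always possible
    return cuts

# Worked example from Table 2: k = 2 puts 8/2 = 4 instances in each interval.
data = [10, 12, 15, 20, 25, 30, 40, 50]
print(equal_frequency_cuts(data, k=2))   # [22.5] -> {10,12,15,20} | {25,30,40,50}
```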
4.3. Clustering-Based Discretization

The k-means clustering method, which remains one of the most popular clustering methods, is also suitable for discretizing continuous-valued variables, because it uses a continuous distance-based similarity measure to cluster data points [8]. In fact, since unsupervised discretization involves only one variable, it is equivalent to a one-dimensional k-means clustering analysis. K-means is a non-hierarchical partitioning clustering algorithm that operates on a set of data points and assumes that the number of clusters to be determined, k, is given. Initially, the algorithm randomly selects k data points to be the so-called centers (or centroids) of the clusters. Then each data point of the given set is associated with the closest center, yielding the initial distribution of the clusters. After this initial step, the next two steps are performed until convergence is obtained:

1. Recompute the centers of the clusters as the average of all values in each cluster.
2. Assign each data point to the closest center, forming the clusters again.

The algorithm stops when no data point needs to be reassigned, or when the number of reassignments falls below a given small number. Convergence of the iterative scheme is guaranteed in a finite number of iterations, but only to a local optimum, depending very much on the choice of the initial cluster centers. Theoretically, clusters formed in this way should minimize the sum of squared distances between data points within each cluster relative to the sum of squared distances between data points of different clusters. The basic limitation of discretization based on k-means clustering is that the outcome mainly depends on the given value of k and on the initially chosen cluster centroids; it is also sensitive to outliers. In [9] a supervised clustering algorithm called SX-means has been proposed, a variation of X-means that extends the k-means algorithm using the Bayesian Information Criterion to decide whether or not to keep dividing into subclusters. This algorithm automatically selects the number of discrete intervals without any user supervision.

Other types of clustering methods can also be used as baselines for designing discretization methods, for example hierarchical clustering methods. As opposed to the k-means method, which is iterative, hierarchical methods can be either divisive or agglomerative. Divisive methods start with a single cluster that contains all the data points; this cluster is then divided successively into as many clusters as needed. Agglomerative methods start by creating a cluster for each data point; these clusters are then merged, two at a time, by a sequence of steps until the desired number of clusters is obtained. Both approaches involve design problems: which cluster to divide or which clusters to merge, and what is the right number of clusters. After the clustering is done, the discretization cut points are defined as the minimum and maximum of the active domain of the attribute plus the midpoints between the boundary points of the clusters (the minimum and maximum in each cluster).

Table 3. Solved example for Discretization Based on K-means Clustering

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50

Let k = 2, and let 15 and 30 be the two randomly chosen initial centroids. The iterations converge to the clusters {10, 12, 15, 20, 25} and {30, 40, 50}, giving a cut point at their midpoint (25 + 30)/2 = 27.5.
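The Table 3 result can be checked with a one-dimensional k-means sketch. Exact floating-point comparison of centroids is adequate here because the example converges exactly, though a tolerance would be safer in general.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """1-D k-means: assign to the closest center, recompute centers, repeat."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:              # each point goes to the closest center
            j = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[j].append(v)
        # recompute centers (assumes no cluster empties out, true for this data)
        new = [sum(c) / len(c) for c in clusters]
        if new == centroids:          # stop: no reassignment occurs any more
            break
        centroids = new
    return clusters, centroids

# Worked example from Table 3: k = 2, initial centroids 15 and 30.
data = [10, 12, 15, 20, 25, 30, 40, 50]
clusters, centers = kmeans_1d(data, [15, 30])
print(clusters)                                    # [[10, 12, 15, 20, 25], [30, 40, 50]]
print((max(clusters[0]) + min(clusters[1])) / 2)   # cut point at 27.5
```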

5. Supervised Discretization Methods

Supervised discretization methods make use of the class label when partitioning the continuous features. Among the supervised discretization methods are entropy-based discretization and interval merging and splitting using chi-square analysis [10].

5.1. Entropy-Based Discretization Method

One of the supervised discretization methods, introduced by Fayyad and Irani, is entropy-based discretization. An entropy-based method uses the class information entropy of candidate partitions to select boundaries for discretization. Class information entropy is a measure of purity: it measures the amount of information needed to specify to which class an instance belongs. The method considers one big interval containing all known values of a feature and then recursively partitions this interval into smaller subintervals until some stopping criterion, for example the Minimum Description Length (MDL) Principle or an optimal number of intervals, is reached, thus creating multiple intervals of the feature. In information theory, the entropy of a given set S, i.e. the expected information needed to classify a data instance in S, is calculated as

    Info(S) = - \sum_i p_i \log_2(p_i)

where p_i is the probability of class i, estimated as C_i / |S|, C_i being the total number of data instances of class i. A log function to the base 2 is used because the information is encoded in bits. The entropy value is bounded from below by 0, when the model has no uncertainty at all, i.e. all data instances in S belong to one class (p_i = 1, and every other class contains 0 instances, p_j = 0 for j != i). It is bounded from above by \log_2(m), where m is the number of classes in S, reached when the data instances are uniformly distributed across the m classes, i.e. p_i = 1/m for all i.

Based on this entropy measure, J. Ross Quinlan developed an algorithm called Iterative Dichotomiser 3 (ID3) to induce the best split point in decision trees. ID3 employs a greedy search to find potential split points within the existing range of continuous values using the following formula:

    Info(S, T) = - p_{left} \sum_{j=1}^{m} p_{j,left} \log_2(p_{j,left}) - p_{right} \sum_{j=1}^{m} p_{j,right} \log_2(p_{j,right})

In the equation, p_{j,left} and p_{j,right} are the probabilities that an instance belonging to class j is on the left or right side of a potential split point T. The split point with the lowest entropy is chosen to split the range into two intervals, and the binary split is continued on each part until a stopping criterion is satisfied. Fayyad and Irani propose a stopping criterion for this generalization using the minimum description length principle (MDLP), which stops the splitting when

    InfoGain(S, T) = Info(S) - Info(S, T) < \delta

where T is a potential interval boundary that splits S into S_1 (left) and S_2 (right), and

    \delta = [ \log_2(n - 1) + \log_2(3^m - 2) - ( m Info(S) - m_1 Info(S_1) - m_2 Info(S_2) ) ] / n

where m is the number of classes in S, m_i is the number of classes in each set S_i, and n is the total number of data instances in S.

Table 4. Solved example for Entropy-Based Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50
    Class                        1   2   1   1   1   2   2   1

The information requirement Info(S, T) is obtained by considering each data value as a splitting point. Here the number of classes is 2. The minimum of Info(S, T), about 0.796, is obtained at attribute value 25, so the splitting point is at 25. The stopping criterion here is a fixed number of intervals, 2.
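The split search of Table 4 can be verified with a short script: `info` implements Info(S), and `best_split` scans every candidate boundary T and returns the minimizing split. The class labels are those recovered from the worked example, up to renaming.

```python
from math import log2

def info(labels):
    """Info(S) = -sum_i p_i log2(p_i), the class information entropy of S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Greedy scan over potential split points T, minimizing Info(S, T)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:   # no boundary inside equal values
            continue
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        weighted = len(left) / n * info(left) + len(right) / n * info(right)
        if weighted < best_info:
            best_t, best_info = pairs[i - 1][0], weighted
    return best_t, best_info

# Worked example from Table 4: the minimum falls at attribute value 25.
data = [10, 12, 15, 20, 25, 30, 40, 50]
cls = [1, 2, 1, 1, 1, 2, 2, 1]
print(best_split(data, cls))   # (25, 0.7956...)
```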

5.2. Chi-Square Based Discretization

Chi-square (χ²) is a statistical measure that conducts a significance test on the relationship between the values of a feature and the class. The χ² statistic determines the similarity of adjacent intervals based on some significance level. It tests the hypothesis that two adjacent intervals of a feature are independent of the class: if they are independent, they should be merged; otherwise they should remain separate.

The top-down method based on chi-square is ChiSplit. It searches for the best split of an interval by maximizing the chi-square criterion applied to the two sub-intervals adjacent to the splitting point: the interval is split if the two sub-intervals differ substantially in the statistical sense. The ChiSplit stopping rule is based on a user-defined chi-square threshold, rejecting the split if the two sub-intervals are too similar.

The bottom-up method based on chi-square is ChiMerge. It searches for the best merge of adjacent intervals by minimizing the chi-square criterion applied locally to two adjacent intervals: they are merged if they are statistically similar. The stopping rule is based on a user-defined chi-square threshold, rejecting the merge if the two adjacent intervals are insufficiently similar. The ChiMerge algorithm is initialized by first sorting the training examples according to their value of the attribute being discretized and then constructing the initial discretization, in which each distinct value of the numerical attribute is considered to be one interval; adjacent intervals with a very similar distribution of classes can immediately be merged. χ² tests are then performed for every pair of adjacent intervals, and the adjacent intervals with the least χ² value are merged together. This merging process proceeds recursively until a predefined stopping criterion is met, i.e. until the χ² value of every adjacent pair exceeds a threshold or a predefined number of intervals has been reached. The threshold is determined by the chosen significance level, with degrees of freedom = number of classes - 1.

The χ² statistic is used to perform a statistical independence test on the relationship between two variables in a contingency table. In a database with data instances labeled with p classes, the χ² statistic at a split point between two adjacent intervals, against the p classes, is computed as:

    \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{p} (A_{ij} - E_{ij})^2 / E_{ij}

where:
    m    = number of intervals being compared (m = 2 for a pair of adjacent intervals)
    p    = number of classes
    A_ij = number of examples in the i-th interval belonging to the j-th class
    R_i  = number of examples in the i-th interval = \sum_{j=1}^{p} A_{ij}
    C_j  = number of examples in the j-th class = \sum_{i=1}^{m} A_{ij}
    N    = total number of examples = \sum_{j=1}^{p} C_j
    E_ij = expected frequency of A_ij = (R_i * C_j) / N

The contingency table is represented by Table 5.

Table 5. Contingency Table

                 Class 1   Class 2   Sum
    Interval 1   A11       A12       R1
    Interval 2   A21       A22       R2
    Sum          C1        C2        N

Table 6. Solved example for ChiMerge-Based Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50
    Class                        1   2   1   1   1   2   2   1

Initially the number of intervals is 8. Combining adjacent intervals having the same class value gives 5 distinct intervals: I1 = {10}, I2 = {12}, I3 = {15, 20, 25}, I4 = {30, 40}, I5 = {50}. Then the χ² value for each pair of adjacent intervals is calculated, and the adjacent intervals with the least χ² are merged, until the target number of intervals, 2, is reached:

    {10} {12} {15,20,25} {30,40} {50}    χ² = 2, 4, 5, 3
    {10,12} {15,20,25} {30,40} {50}      χ² = 1.875, 5, 3
    {10,12,15,20,25} {30,40} {50}        χ² = 3.73, 3
    {10,12,15,20,25} {30,40,50}
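The merge trace of Table 6 is reproduced by the sketch below. Two details are implementation choices rather than statements from the paper: cells whose expected frequency E_ij is zero (a class absent from both intervals) are skipped, which also makes same-class singleton intervals merge first at χ² = 0, and the stopping criterion is a fixed target of 2 intervals.

```python
def chi2(int1, int2, classes):
    """Chi-square statistic for two adjacent intervals of class labels."""
    counts = [[iv.count(c) for c in classes] for iv in (int1, int2)]
    R = [sum(row) for row in counts]                                # interval totals
    C = [counts[0][j] + counts[1][j] for j in range(len(classes))]  # class totals
    N = sum(R)
    x2 = 0.0
    for i in range(2):
        for j in range(len(classes)):
            E = R[i] * C[j] / N          # expected frequency (R_i * C_j) / N
            if E > 0:                    # skip classes absent from both intervals
                x2 += (counts[i][j] - E) ** 2 / E
    return x2

def chimerge(values, labels, target_intervals=2):
    """Repeatedly merge the adjacent pair of intervals with the least chi-square."""
    pairs = sorted(zip(values, labels))
    bounds = [[v] for v, _ in pairs]      # one interval per sorted example
    cls = [[c] for _, c in pairs]
    classes = sorted(set(labels))
    while len(bounds) > target_intervals:
        scores = [chi2(cls[i], cls[i + 1], classes) for i in range(len(cls) - 1)]
        i = scores.index(min(scores))     # least chi-square pair merges
        cls[i:i + 2] = [cls[i] + cls[i + 1]]
        bounds[i:i + 2] = [bounds[i] + bounds[i + 1]]
    return bounds

# Worked example from Table 6 (class labels recovered up to renaming).
data = [10, 12, 15, 20, 25, 30, 40, 50]
labels = [1, 2, 1, 1, 1, 2, 2, 1]
print(chimerge(data, labels))   # [[10, 12, 15, 20, 25], [30, 40, 50]]
```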

A limitation of ChiMerge is that it cannot be used to discretize data for unsupervised learning (clustering) tasks [11]. Also, ChiMerge only attempts to discover first-order (single-attribute) correlations, and thus might not perform correctly when there is a second-order correlation without a corresponding first-order correlation, which can happen if an attribute only correlates in the presence of some other condition. Another shortcoming of ChiMerge is its lack of global evaluation: when deciding which intervals to merge, the algorithm only examines adjacent intervals, ignoring the other surrounding intervals. Because of this restricted local analysis, it is possible that the formation of a large, relatively uniform interval could be prevented by an unlikely run of examples within it. One possible fix is to apply the χ² test to three or more intervals at a time; the χ² formula is easily extended by adjusting the value of the parameter m (the number of intervals) in the χ² calculation.

6. Comparative Analysis of Discretization Methods

A number of discretization methods exist, each having its own characteristics and doing well in different situations; every discretization method has its own strengths. A comparative analysis of the common discretization methods based on the different dimensions of discretization and their limitations is displayed in Table 7.

Table 7. Comparative Analysis of Discretization Methods

    Equal Width: unsupervised, static, global, splitting, direct; stopping criterion: fixed number of bins; sensitive to outliers: yes; same values may go to different intervals: no; time complexity to discretize one attribute of n objects: O(n).
    Equal Frequency: unsupervised, static, global, splitting, direct; stopping criterion: fixed number of bins; sensitive to outliers: no; same values may go to different intervals: yes; time complexity: O(n).
    K-means Clustering Based: unsupervised, static, local, splitting, direct; stopping criterion: no further reassignment of data to the given fixed number of clusters; sensitive to outliers: yes; same values may go to different intervals: no; time complexity: O(ikn), where i is the number of iterations and k the number of intervals.
    Entropy Based: supervised, static, local, splitting, incremental; stopping criterion: threshold or fixed number of intervals; sensitive to outliers: no; same values may go to different intervals: no; time complexity: O(n log(n)).
    ChiMerge Based: supervised, static, global, merging, incremental; stopping criterion: threshold or fixed number of intervals; sensitive to outliers: no; same values may go to different intervals: no; time complexity: O(n log(n)).

7. Conclusion

Discretization of continuous features plays an important role in data pre-processing before applying machine learning and data mining algorithms to real-valued data sets. Since a large number of possible attribute values slows inductive learning and makes it ineffective, one of the main goals of a discretization algorithm is to significantly reduce the number of discrete intervals of a continuous attribute while maximizing the interdependency between the discretized attribute and the class labels, thereby minimizing the information loss due to discretization. This paper has briefly introduced the need for discretization to improve the efficiency of learning algorithms, the various taxonomies of discretization methods, and the ideas and drawbacks of some typical methods, presented in detail by supervised or unsupervised category. The solved examples also show that an unsupervised method like k-means clustering can perform equally well compared to the supervised methods, owing to its use of minimum squared-error partitioning to generate an arbitrary number of partitions reflecting the original distribution of the partitioned attribute. Lastly, a comparative analysis has been given based on the different issues of discretization. No discretization method can ensure the best result for all data sets and all algorithms, so it is of vital importance in practice to select a proper method depending on the data set and the learning context.

8. References

[1] E. Xu, Shao Liangshan, Ren Yongchang, Wu Hao and Qiu Feng, "A New Discretization Approach of Continuous Attributes", Asia-Pacific Conference on Wearable Computing Systems, vol. 5, 2010.
[2] Sheng-yi Jiang, Xia Li, Qi Zheng and Lian-xi Wang, "An Approximate Equal Frequency Discretization Method", WRI Global Congress on Intelligent Systems, vol. 3, no. 4, 2009.
[3] Joao Gama and Carlos Pinto, "Discretization from Data Streams: Applications to Histograms and Data Mining", Symposium on Applied Computing, 2006.
[4] Liu Peng, Wang Qing and Gu Yujia, "Study on Comparison of Discretization Methods", International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, 2009.
[5] Azuraliza Abu Bakar, Zulaiha Ali Othman and Nor Liyana Mohd Shuib, "Building a New Taxonomy for Data Discretization Techniques", 2nd Conference on Data Mining and Optimization, 2009.
[6] Rayner Alfred, "Discretization of Numerical Data for Relational Data with One-to-Many Relations", Journal of Computer Science, vol. 5, no. 7, pp. 519-528, 2009.
[7] Daniela Joita, "Unsupervised Static Discretization Methods in Data Mining", Revista Mega Byte, vol. 9, 2010.
[8] Sellappan Palaniappan and Tan Kim Hong, "Discretization of Continuous Valued Dimensions in OLAP Data Cubes", International Journal of Computer Science and Network Security, vol. 8, 2009.
[9] Haiyang Hua and Huaici Zhao, "A Discretization Algorithm of Continuous Attribute Based on Supervised Clustering", Chinese Conference on Pattern Recognition, pp. 1-5, 2009.
[10] Kotsiantis Sotiris and Kanellopoulos Dimitris, "Discretization Techniques: A Recent Survey", GESTS International Transactions on Computer Science and Engineering, vol. 32, no. 1, 2006.
[11] Kerber Randy, "ChiMerge: Discretization of Numeric Attributes", Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 123-128, 1992.
