Comparative Analysis of Supervised and Unsupervised Discretization Techniques

Rajashree Dash (1), Rajib Lochan Paramguru (2), Rasmita Dash (3)
Department of Computer Science and Engineering, ITER, Siksha O Anusandhan University, Bhubaneswar, India
(1) rajashree_dash@yahoo.co.in, (2) rajib_kec@yahoo.com, (3) rasmita0@yahoo.co.in

Abstract

Most Machine Learning and Data Mining applications are applicable only to discrete features. However, data in the real world are often continuous in nature. Even for algorithms that can directly deal with continuous features, learning is often less efficient and effective. Discretization addresses this issue by finding intervals of numbers which are more concise to represent and specify. Discretization of continuous attributes is one of the important data preprocessing steps of knowledge extraction. An effective discretization method not only reduces the demand on system memory and improves the efficiency of data mining and machine learning algorithms, but also makes the knowledge extracted from the discretized dataset more compact and easier to understand and use. In this paper, different types of traditional supervised and unsupervised discretization techniques are discussed, along with examples, advantages and drawbacks.

Keywords: Continuous Features, Discrete Features, Discretization, Supervised Discretization, Unsupervised Discretization.

1. Introduction

Due to the development of new techniques, the rate of growth of scientific databases has become very large, which creates both a need and an opportunity to extract knowledge from databases. The data in databases are usually found in a mixed format: nominal, discrete, and/or continuous. Discrete and continuous data are ordinal data types with an order among their values, while nominal values do not possess any order among them. Discrete values are intervals in a continuous spectrum of values. The number of continuous values for an attribute can be infinitely many, but the number of discrete values is often few or finite. Continuous features, also called quantitative features (e.g. people's height or age), can be ranked in order and admit meaningful arithmetic operations. Discrete features, also often referred to as qualitative features (e.g. sex or degree of education), can take only a limited number of values; they can sometimes be arrayed in a meaningful order, but no arithmetic operations can be applied to them.

In the field of Machine Learning and Data Mining, there exist many learning algorithms that are primarily oriented to handling discrete features. However, data in the real world are often continuous in nature. Hence discretization is a commonly used data preprocessing procedure that transforms continuous features into discrete features [1]. It is the process of partitioning continuous variables into categories. Unfortunately, the number of ways to discretize a continuous attribute is infinite. Discretization is a potential time-consuming bottleneck, since the number of possible discretizations is exponential in the number of interval threshold candidates within the domain. Discretization techniques are often used by classification algorithms, genetic algorithms, instance-based learning and a wide range of other learning algorithms. Use of discrete values has a number of advantages:

- Discrete features require less memory space.
- Discrete features are often closer to a knowledge-level representation.
- Data can be reduced and simplified through discretization, making it easier to understand, use and explain.
- Learning will be more accurate and faster using discrete features.

2. Discretization Process

Data discretization is a general purpose pre-processing method that reduces the number of distinct values of a given continuous variable by dividing its range into a finite set of disjoint intervals, and then relates these intervals with meaningful labels [2]. Subsequently, data are analyzed or reported at this higher level of knowledge representation rather than at the level of individual values, which leads to a simplified data representation in the data exploration and data mining process. A discretization process flows in four steps [3], as depicted in Figure 1 (a code sketch of this flow appears at the end of this section):

1. Sort the continuous values of the attribute to be discretized.
2. Evaluate a cut-point for splitting, or adjacent intervals for merging.
3. Split or merge intervals of continuous values according to some criterion.
4. Stop at some point based on a stopping criterion.

Figure 1. Steps of the Discretization Process

The goal of discretization is to find a set of cut points that partition the range into a small number of intervals. Discretization has two main tasks. The first task is to find the number of discrete intervals. Only a few discretization algorithms perform this; often, the user must specify the number of intervals or provide a heuristic rule. The second task is to find the width, or the boundaries, of the intervals given the range of values of a continuous attribute. Usually, in the discretization process, after sorting the data in ascending or descending order with respect to the variable to be discretized, landmarks must be chosen among the whole dataset. In general, the algorithm for choosing landmarks can be either top-down, which starts with an empty list of landmarks and splits intervals, or bottom-up, which starts with the complete list of all the values as landmarks and merges intervals. In both cases there is a stopping criterion, which specifies when to stop the discretization process.
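To make the top-down variant of this flow concrete, the following minimal Python sketch walks the four steps. The functions `score_cut` and `stop` are hypothetical placeholders for the criteria discussed in later sections (entropy, chi-square); they are not part of the original paper.

```python
def top_down_discretize(values, score_cut, stop, max_intervals=10):
    """Generic top-down discretization following the four steps above."""
    data = sorted(values)                  # step 1: sort the continuous values
    cuts = []
    while len(cuts) + 1 < max_intervals:
        # step 2: candidate cut points lie midway between adjacent distinct values
        candidates = [(a + b) / 2 for a, b in zip(data, data[1:])
                      if a != b and (a + b) / 2 not in cuts]
        if not candidates:
            break
        best = min(candidates, key=lambda c: score_cut(data, cuts, c))
        if stop(data, cuts, best):         # step 4: stopping criterion
            break
        cuts.append(best)                  # step 3: split at the chosen cut point
        cuts.sort()
    return cuts
```

A bottom-up variant would instead start with every distinct value as a landmark and merge intervals, as ChiMerge does in Section 5.2.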
3. Classification of Discretization Methods

The motivation for the discretization of continuous features is the need to obtain higher accuracy rates when handling data with high-cardinality attributes. Discretization methods have been developed along different lines due to different needs. The main classifications are: supervised vs. unsupervised, dynamic vs. static, global vs. local, splitting (top-down) vs. merging (bottom-up), and direct vs. incremental [4], [5], [6].

Discretization methods can be supervised or unsupervised depending on whether they use the class information of the data set. Supervised methods make use of the class label when partitioning the continuous features. On the other hand, unsupervised discretization methods do not require the class information to discretize continuous attributes. Supervised discretization can be further characterized as error-based, entropy-based or statistics-based. Unsupervised discretization is seen in earlier methods like equal-width and equal-frequency binning. Discretization methods can also be viewed as dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5, while in the static approach discretization is done prior to the classification task. Next, the distinction between global and local methods is based on the stage at which discretization takes place. Global methods discretize features prior to induction; in contrast, local methods discretize features during the induction process. Empirical results have indicated that global discretization methods often produce superior results compared to local methods, since the former use the entire value domain of a numeric attribute for discretization, whereas local methods produce intervals that are applied to subpartitions of the instance space. Finally, the distinction between top-down and bottom-up discretization methods can be made. Top-down methods consider one big interval containing all known values of a feature and then partition this interval into smaller and smaller subintervals until a certain stopping criterion or an optimal number of intervals is achieved. In contrast, bottom-up methods initially consider a number of intervals, determined by the set of boundary points, and combine these intervals during execution until a certain stopping criterion, such as a chi-square threshold, or an optimal number of intervals is achieved. Another dimension of discretization methods is direct vs. incremental. Direct methods divide the range into k intervals simultaneously (e.g., equal-width and equal-frequency), needing an additional input from the user to determine the number of intervals. Incremental methods begin with a simple discretization and pass through an improvement process, needing an additional criterion to know when to stop discretizing.

4. Unsupervised Discretization Methods

Among the unsupervised discretization methods there are the simple ones (equal-width and equal-frequency interval binning) and the more sophisticated ones based on clustering analysis, such as k-means discretization. Continuous ranges are divided into subranges by the user-specified width or frequency [7].

4.1. Equal-Width Interval Discretization

Equal-width interval discretization is the simplest discretization method. It divides the range of observed values of a feature into k equally sized bins, where k is a parameter provided by the user. The process involves sorting the observed values of the continuous feature and finding the minimum, V_min, and maximum, V_max. The bin width and the boundaries are then computed as:

    Interval = (V_max - V_min) / k
    Boundaries = V_min + i * Interval, for i = 1, ..., k-1

This type of discretization does not depend on the multi-relational structure of the data. However, it is sensitive to outliers, which may drastically skew the range. Its main limitation follows from the possibly uneven distribution of the data points: some intervals may contain many more data points than others.

Table 1. Solved example for Equal-Width Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50

Here V_min = 10 and V_max = 50; let k = 2. So Interval = 20 and the boundary is 10 + 20 = 30, giving the intervals {10, 12, 15, 20, 25} and {30, 40, 50}.
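As a check on this worked example, here is a minimal Python sketch of equal-width binning; the data values are those recovered from the worked examples in Tables 1-6.

```python
def equal_width_cuts(values, k):
    """Boundaries V_min + i * Interval, with Interval = (V_max - V_min) / k."""
    v_min, v_max = min(values), max(values)
    interval = (v_max - v_min) / k
    return [v_min + i * interval for i in range(1, k)]

# Worked example from Table 1: V_min = 10, V_max = 50, k = 2.
data = [10, 12, 15, 20, 25, 30, 40, 50]
print(equal_width_cuts(data, k=2))   # [30.0] -> Interval = 20, boundary = 10 + 20 = 30
```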

4.2. Equal-Frequency Interval Discretization

The equal-frequency algorithm determines the minimum and maximum values of the attribute to be discretized, sorts all values in ascending order, and divides the sorted continuous values into k intervals such that each interval contains approximately n/k adjacent data instances. This algorithm tries to overcome the limitation of equal-width interval discretization by dividing the domain into intervals with the same number of data points. Note that under a naive equal-frequency split, many occurrences of a single continuous value could be assigned to different bins; since data instances with an identical value must be placed in the same interval, it is not always possible to generate exactly k equal-frequency intervals. This method is also called proportional k-interval discretization.

Table 2. Solved example for Equal-Frequency Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50

Let k = 2. Each interval will contain 8/2 = 4 data instances: {10, 12, 15, 20} and {25, 30, 40, 50}.
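A sketch of equal-frequency cut-point selection matching Table 2; placing cut points midway between bins is an implementation choice here, not something the paper prescribes.

```python
def equal_frequency_cuts(values, k):
    """Cut points so that each of the k bins holds about n/k adjacent instances."""
    data = sorted(values)
    n = len(data)
    cuts = []
    for i in range(1, k):
        lo, hi = data[i * n // k - 1], data[i * n // k]  # values around the bin edge
        if lo != hi:                    # identical values must stay in one interval,
            cuts.append((lo + hi) / 2)  # so exactly k bins is not always possible
    return cuts

# Worked example from Table 2: k = 2 puts 8/2 = 4 instances in each interval.
data = [10, 12, 15, 20, 25, 30, 40, 50]
print(equal_frequency_cuts(data, k=2))   # [22.5] -> {10,12,15,20} | {25,30,40,50}
```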
4.3. Clustering-Based Discretization

The k-means clustering method, which remains one of the most popular clustering methods, is also suitable for discretizing continuous-valued variables, because it uses a continuous distance-based similarity measure to cluster data points [8]. In fact, since unsupervised discretization involves only one variable, it is equivalent to a one-dimensional k-means clustering analysis. K-means is a non-hierarchical partitioning clustering algorithm that operates on a set of data points and assumes that the number of clusters to be determined, k, is given. Initially, the algorithm randomly selects k data points to be the so-called centers (or centroids) of the clusters. Then each data point of the given set is associated with the closest center, yielding the initial distribution of the clusters. After this initial step, the next two steps are performed until convergence is obtained:

1. Recompute the centers of the clusters as the average of all values in each cluster.
2. Assign each data point to the closest center, forming the clusters again.

The algorithm stops when no data point needs to be reassigned, or when the number of reassignments falls below a given small number. Convergence of the iterative scheme is guaranteed in a finite number of iterations, but only to a local optimum, depending very much on the choice of the initial cluster centers. Theoretically, clusters formed in this way should minimize the sum of squared distances between data points within each cluster relative to the sum of squared distances between data points of different clusters. The basic limitation of discretization based on k-means clustering is that the outcome mainly depends on the given value of k and on the initially chosen cluster centroids; it is also sensitive to outliers. In [9] a supervised clustering algorithm called SX-means has been proposed, a variation of X-means that extends the k-means algorithm using the Bayesian Information Criterion to decide whether or not to keep dividing into subclusters. This algorithm automatically selects the number of discrete intervals without any user supervision.

Other types of clustering methods can also be used as baselines for designing discretization methods, for example hierarchical clustering methods. As opposed to the k-means method, which is iterative, hierarchical methods can be either divisive or agglomerative. Divisive methods start with a single cluster that contains all the data points; this cluster is then divided successively into as many clusters as needed. Agglomerative methods start by creating a cluster for each data point; these clusters are then merged, two at a time, by a sequence of steps until the desired number of clusters is obtained. Both approaches involve design problems: which cluster to divide or which clusters to merge, and what is the right number of clusters. After the clustering is done, the discretization cut points are defined as the minimum and maximum of the active domain of the attribute plus the midpoints between the boundary points of the clusters (the minimum and maximum in each cluster).

Table 3. Solved example for Discretization Based on K-means Clustering

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50

Let k = 2, and let 15 and 30 be the two randomly chosen initial centroids. The iterations converge to the clusters {10, 12, 15, 20, 25} and {30, 40, 50}, giving a cut point at their midpoint (25 + 30)/2 = 27.5.
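The Table 3 result can be checked with a one-dimensional k-means sketch. Exact floating-point comparison of centroids is adequate here because the example converges exactly, though a tolerance would be safer in general.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """1-D k-means: assign to the closest center, recompute centers, repeat."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:              # each point goes to the closest center
            j = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[j].append(v)
        # recompute centers (assumes no cluster empties out, true for this data)
        new = [sum(c) / len(c) for c in clusters]
        if new == centroids:          # stop: no reassignment occurs any more
            break
        centroids = new
    return clusters, centroids

# Worked example from Table 3: k = 2, initial centroids 15 and 30.
data = [10, 12, 15, 20, 25, 30, 40, 50]
clusters, centers = kmeans_1d(data, [15, 30])
print(clusters)                                    # [[10, 12, 15, 20, 25], [30, 40, 50]]
print((max(clusters[0]) + min(clusters[1])) / 2)   # cut point at 27.5
```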

5. Supervised Discretization Methods

Supervised discretization methods make use of the class label when partitioning the continuous features. Among the supervised discretization methods are entropy-based discretization and interval merging and splitting using chi-square analysis [10].

5.1. Entropy-Based Discretization Method

One of the supervised discretization methods, introduced by Fayyad and Irani, is entropy-based discretization. An entropy-based method uses the class information entropy of candidate partitions to select boundaries for discretization. Class information entropy is a measure of purity: it measures the amount of information needed to specify to which class an instance belongs. The method considers one big interval containing all known values of a feature and then recursively partitions this interval into smaller subintervals until some stopping criterion, for example the Minimum Description Length (MDL) Principle or an optimal number of intervals, is reached, thus creating multiple intervals of the feature. In information theory, the entropy of a given set S, i.e. the expected information needed to classify a data instance in S, is calculated as

    Info(S) = - \sum_i p_i \log_2(p_i)

where p_i is the probability of class i, estimated as C_i / |S|, C_i being the total number of data instances of class i. A log function to the base 2 is used because the information is encoded in bits. The entropy value is bounded from below by 0, when the model has no uncertainty at all, i.e. all data instances in S belong to one class (p_i = 1, and every other class contains 0 instances, p_j = 0 for j != i). It is bounded from above by \log_2(m), where m is the number of classes in S, reached when the data instances are uniformly distributed across the m classes, i.e. p_i = 1/m for all i.

Based on this entropy measure, J. Ross Quinlan developed an algorithm called Iterative Dichotomiser 3 (ID3) to induce the best split point in decision trees. ID3 employs a greedy search to find potential split points within the existing range of continuous values using the following formula:

    Info(S, T) = - p_{left} \sum_{j=1}^{m} p_{j,left} \log_2(p_{j,left}) - p_{right} \sum_{j=1}^{m} p_{j,right} \log_2(p_{j,right})

In the equation, p_{j,left} and p_{j,right} are the probabilities that an instance belonging to class j is on the left or right side of a potential split point T. The split point with the lowest entropy is chosen to split the range into two intervals, and the binary split is continued on each part until a stopping criterion is satisfied. Fayyad and Irani propose a stopping criterion for this generalization using the minimum description length principle (MDLP), which stops the splitting when

    InfoGain(S, T) = Info(S) - Info(S, T) < \delta

where T is a potential interval boundary that splits S into S_1 (left) and S_2 (right), and

    \delta = [ \log_2(n - 1) + \log_2(3^m - 2) - ( m Info(S) - m_1 Info(S_1) - m_2 Info(S_2) ) ] / n

where m is the number of classes in S, m_i is the number of classes in each set S_i, and n is the total number of data instances in S.

Table 4. Solved example for Entropy-Based Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50
    Class                        1   2   1   1   1   2   2   1

The information requirement Info(S, T) is obtained by considering each data value as a splitting point. Here the number of classes is 2. The minimum of Info(S, T), about 0.796, is obtained at attribute value 25, so the splitting point is at 25. The stopping criterion here is a fixed number of intervals, 2.
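The split search of Table 4 can be verified with a short script: `info` implements Info(S), and `best_split` scans every candidate boundary T and returns the minimizing split. The class labels are those recovered from the worked example, up to renaming.

```python
from math import log2

def info(labels):
    """Info(S) = -sum_i p_i log2(p_i), the class information entropy of S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Greedy scan over potential split points T, minimizing Info(S, T)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:   # no boundary inside equal values
            continue
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        weighted = len(left) / n * info(left) + len(right) / n * info(right)
        if weighted < best_info:
            best_t, best_info = pairs[i - 1][0], weighted
    return best_t, best_info

# Worked example from Table 4: the minimum falls at attribute value 25.
data = [10, 12, 15, 20, 25, 30, 40, 50]
cls = [1, 2, 1, 1, 1, 2, 2, 1]
print(best_split(data, cls))   # (25, 0.7956...)
```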

5.2. Chi-Square Based Discretization

Chi-square (χ²) is a statistical measure that conducts a significance test on the relationship between the values of a feature and the class. The χ² statistic determines the similarity of adjacent intervals based on some significance level. It tests the hypothesis that two adjacent intervals of a feature are independent of the class: if they are independent, they should be merged; otherwise they should remain separate.

The top-down method based on chi-square is ChiSplit. It searches for the best split of an interval by maximizing the chi-square criterion applied to the two sub-intervals adjacent to the splitting point: the interval is split if the two sub-intervals differ substantially in the statistical sense. The ChiSplit stopping rule is based on a user-defined chi-square threshold, rejecting the split if the two sub-intervals are too similar.

The bottom-up method based on chi-square is ChiMerge. It searches for the best merge of adjacent intervals by minimizing the chi-square criterion applied locally to two adjacent intervals: they are merged if they are statistically similar. The stopping rule is based on a user-defined chi-square threshold, rejecting the merge if the two adjacent intervals are insufficiently similar. The ChiMerge algorithm is initialized by first sorting the training examples according to their value of the attribute being discretized and then constructing the initial discretization, in which each distinct value of the numerical attribute is considered to be one interval; adjacent intervals with a very similar distribution of classes can immediately be merged. χ² tests are then performed for every pair of adjacent intervals, and the adjacent intervals with the least χ² value are merged together. This merging process proceeds recursively until a predefined stopping criterion is met, i.e. until the χ² value of every adjacent pair exceeds a threshold or a predefined number of intervals has been reached. The threshold is determined by the chosen significance level, with degrees of freedom = number of classes - 1.

The χ² statistic is used to perform a statistical independence test on the relationship between two variables in a contingency table. In a database with data instances labeled with p classes, the χ² statistic at a split point between two adjacent intervals, against the p classes, is computed as:

    \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{p} (A_{ij} - E_{ij})^2 / E_{ij}

where:
    m    = number of intervals being compared (m = 2 for a pair of adjacent intervals)
    p    = number of classes
    A_ij = number of examples in the i-th interval belonging to the j-th class
    R_i  = number of examples in the i-th interval = \sum_{j=1}^{p} A_{ij}
    C_j  = number of examples in the j-th class = \sum_{i=1}^{m} A_{ij}
    N    = total number of examples = \sum_{j=1}^{p} C_j
    E_ij = expected frequency of A_ij = (R_i * C_j) / N

The contingency table is represented by Table 5.

Table 5. Contingency Table

                 Class 1   Class 2   Sum
    Interval 1   A11       A12       R1
    Interval 2   A21       A22       R2
    Sum          C1        C2        N

Table 6. Solved example for ChiMerge-Based Discretization

    Data                        V1  V2  V3  V4  V5  V6  V7  V8
    Original Continuous Values  10  12  15  20  25  30  40  50
    Class                        1   2   1   1   1   2   2   1

Initially the number of intervals is 8. Combining adjacent intervals having the same class value gives 5 distinct intervals: I1 = {10}, I2 = {12}, I3 = {15, 20, 25}, I4 = {30, 40}, I5 = {50}. Then the χ² value for each pair of adjacent intervals is calculated, and the adjacent intervals with the least χ² are merged, until the target number of intervals, 2, is reached:

    {10} {12} {15,20,25} {30,40} {50}    χ² = 2, 4, 5, 3
    {10,12} {15,20,25} {30,40} {50}      χ² = 1.875, 5, 3
    {10,12,15,20,25} {30,40} {50}        χ² = 3.73, 3
    {10,12,15,20,25} {30,40,50}
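The merge trace of Table 6 is reproduced by the sketch below. Two details are implementation choices rather than statements from the paper: cells whose expected frequency E_ij is zero (a class absent from both intervals) are skipped, which also makes same-class singleton intervals merge first at χ² = 0, and the stopping criterion is a fixed target of 2 intervals.

```python
def chi2(int1, int2, classes):
    """Chi-square statistic for two adjacent intervals of class labels."""
    counts = [[iv.count(c) for c in classes] for iv in (int1, int2)]
    R = [sum(row) for row in counts]                                # interval totals
    C = [counts[0][j] + counts[1][j] for j in range(len(classes))]  # class totals
    N = sum(R)
    x2 = 0.0
    for i in range(2):
        for j in range(len(classes)):
            E = R[i] * C[j] / N          # expected frequency (R_i * C_j) / N
            if E > 0:                    # skip classes absent from both intervals
                x2 += (counts[i][j] - E) ** 2 / E
    return x2

def chimerge(values, labels, target_intervals=2):
    """Repeatedly merge the adjacent pair of intervals with the least chi-square."""
    pairs = sorted(zip(values, labels))
    bounds = [[v] for v, _ in pairs]      # one interval per sorted example
    cls = [[c] for _, c in pairs]
    classes = sorted(set(labels))
    while len(bounds) > target_intervals:
        scores = [chi2(cls[i], cls[i + 1], classes) for i in range(len(cls) - 1)]
        i = scores.index(min(scores))     # least chi-square pair merges
        cls[i:i + 2] = [cls[i] + cls[i + 1]]
        bounds[i:i + 2] = [bounds[i] + bounds[i + 1]]
    return bounds

# Worked example from Table 6 (class labels recovered up to renaming).
data = [10, 12, 15, 20, 25, 30, 40, 50]
labels = [1, 2, 1, 1, 1, 2, 2, 1]
print(chimerge(data, labels))   # [[10, 12, 15, 20, 25], [30, 40, 50]]
```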

A limitation of ChiMerge is that it cannot be used to discretize data for unsupervised learning (clustering) tasks [11]. Also, ChiMerge only attempts to discover first-order (single-attribute) correlations, and thus might not perform correctly when there is a second-order correlation without a corresponding first-order correlation, which can happen if an attribute only correlates in the presence of some other condition. Another shortcoming of ChiMerge is its lack of global evaluation: when deciding which intervals to merge, the algorithm only examines adjacent intervals, ignoring the other surrounding intervals. Because of this restricted local analysis, it is possible that the formation of a large, relatively uniform interval could be prevented by an unlikely run of examples within it. One possible fix is to apply the χ² test to three or more intervals at a time; the χ² formula is easily extended by adjusting the value of the parameter m (the number of intervals) in the χ² calculation.

6. Comparative Analysis of Discretization Methods

A number of discretization methods exist, each having its own characteristics and doing well in different situations; every discretization method has its own strengths. A comparative analysis of the common discretization methods based on the different dimensions of discretization and their limitations is displayed in Table 7.

Table 7. Comparative Analysis of Discretization Methods

    Equal Width: unsupervised, static, global, splitting, direct; stopping criterion: fixed number of bins; sensitive to outliers: yes; same values may go to different intervals: no; time complexity to discretize one attribute of n objects: O(n).
    Equal Frequency: unsupervised, static, global, splitting, direct; stopping criterion: fixed number of bins; sensitive to outliers: no; same values may go to different intervals: yes; time complexity: O(n).
    K-means Clustering Based: unsupervised, static, local, splitting, direct; stopping criterion: no further reassignment of data to the given fixed number of clusters; sensitive to outliers: yes; same values may go to different intervals: no; time complexity: O(ikn), where i is the number of iterations and k the number of intervals.
    Entropy Based: supervised, static, local, splitting, incremental; stopping criterion: threshold or fixed number of intervals; sensitive to outliers: no; same values may go to different intervals: no; time complexity: O(n log(n)).
    ChiMerge Based: supervised, static, global, merging, incremental; stopping criterion: threshold or fixed number of intervals; sensitive to outliers: no; same values may go to different intervals: no; time complexity: O(n log(n)).

7. Conclusion

Discretization of continuous features plays an important role in data pre-processing before applying machine learning and data mining algorithms to real-valued data sets. Since a large number of possible attribute values slows inductive learning and makes it ineffective, one of the main goals of a discretization algorithm is to significantly reduce the number of discrete intervals of a continuous attribute while maximizing the interdependency between the discretized attribute and the class labels, thereby minimizing the information loss due to discretization. This paper has briefly introduced the need for discretization to improve the efficiency of learning algorithms, the various taxonomies of discretization methods, and the ideas and drawbacks of some typical methods, presented in detail by supervised or unsupervised category. The solved examples also show that an unsupervised method like k-means clustering can perform equally well compared to the supervised methods, owing to its use of minimum squared-error partitioning to generate an arbitrary number of partitions reflecting the original distribution of the partitioned attribute. Lastly, a comparative analysis has been given based on the different issues of discretization. No discretization method can ensure the best result for all data sets and all algorithms, so it is of vital importance in practice to select a proper method depending on the data set and the learning context.

8. References

[1] E. Xu, Shao Liangshan, Ren Yongchang, Wu Hao and Qiu Feng, "A New Discretization Approach of Continuous Attributes", Asia-Pacific Conference on Wearable Computing Systems, vol. 5, 2010.
[2] Sheng-yi Jiang, Xia Li, Qi Zheng and Lian-xi Wang, "An Approximate Equal Frequency Discretization Method", WRI Global Congress on Intelligent Systems, vol. 3, no. 4, 2009.
[3] Joao Gama and Carlos Pinto, "Discretization from Data Streams: Applications to Histograms and Data Mining", Symposium on Applied Computing, 2006.
[4] Liu Peng, Wang Qing and Gu Yujia, "Study on Comparison of Discretization Methods", International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, 2009.
[5] Azuraliza Abu Bakar, Zulaiha Ali Othman and Nor Liyana Mohd Shuib, "Building a New Taxonomy for Data Discretization Techniques", 2nd Conference on Data Mining and Optimization, 2009.
[6] Rayner Alfred, "Discretization of Numerical Data for Relational Data with One-to-Many Relations", Journal of Computer Science, vol. 5, no. 7, pp. 519-528, 2009.
[7] Daniela Joita, "Unsupervised Static Discretization Methods in Data Mining", Revista Mega Byte, vol. 9, 2010.
[8] Sellappan Palaniappan and Tan Kim Hong, "Discretization of Continuous Valued Dimensions in OLAP Data Cubes", International Journal of Computer Science and Network Security, vol. 8, 2009.
[9] Haiyang Hua and Huaici Zhao, "A Discretization Algorithm of Continuous Attribute Based on Supervised Clustering", Chinese Conference on Pattern Recognition, pp. 1-5, 2009.
[10] Kotsiantis Sotiris and Kanellopoulos Dimitris, "Discretization Techniques: A Recent Survey", GESTS International Transactions on Computer Science and Engineering, vol. 32, no. 1, 2006.
[11] Kerber Randy, "ChiMerge: Discretization of Numeric Attributes", Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 123-128, 1992.
