Associative Classification Mining for Website Phishing Classification

Neda Abdelhamid (1), Aladdin Ayesh (1), Fadi Thabtah (2)
(1) Informatics Dept, De Montfort University, Leicester, LE1 9BH
(2) E-Business Dept, CUD, Dubai

Abstract: Website phishing is one of the crucial research topics for the internet community due to the massive number of daily online transactions. Predicting the phishing activity of a website is a typical classification problem in data mining, where different website features such as URL length, prefix and suffix, IP address, etc., are used to discover concealed correlations (knowledge) among these features that are useful for decision makers. In this article, an Associative Classification (AC) data mining algorithm, which uses association rule methods to build classification systems (classifiers), is developed and applied to the important problem of phishing classification. The proposed algorithm employs a classifier building method that discovers vital rules that can be utilised to detect phishing activity based on a number of significant website features. Experiments using the proposed algorithm and three other rule based algorithms have been conducted on real legitimate and fake websites collected from different sources. The results reveal that our algorithm is highly competitive in classifying websites when contrasted with the other rule based classification algorithms with respect to accuracy rate. Further, our algorithm normally extracts smaller classifiers than other AC algorithms because of its novel rule evaluation method, which reduces overfitting.

Keywords: Associative Classification, Data Mining, Phishing Detection, Web Security

1. INTRODUCTION

Associative classification (AC) in data mining is about constructing a classification system (classifier) from input data called the training data set, aiming to accurately predict the class value of unseen data called the test data set [1]. One distinguishing feature of AC algorithms is their ability to discover hidden knowledge and extract it as simple If-Then rules. In the last decade, research studies on AC mining have resulted in the dissemination of various algorithms including CBA [2], CMAR [3], LCA [4], ADA [5] and others. These studies have revealed that AC is able to construct more accurate classifiers than rule based classification approaches including rule induction and decision trees. Nevertheless, the number of rules discovered by AC algorithms is normally huge, which sometimes limits their applicability in business domains. One primary reason for the large number of rules is inherited from association rule discovery: all correlations between the attribute values and the class attribute are tested in the training phase, and many rules are derived. One way to control the exponential growth in the number of rules is to develop rule filtering methods that minimise rule redundancy while building the classifier.

Rule evaluation, sometimes called filtering or pruning, usually occurs while building the classifier in AC mining. Once the complete set of rules is found in the training phase and sorted based on certain conditions (e.g. the rule's confidence, support, body length, etc.), the AC algorithm has to decide how to choose a subset of effective rules to represent the classifier. There are different ways used in AC to choose the classifier's rules. For instance, CBA [2] utilises the database coverage method, where rules that correctly cover a certain number of training cases are marked as accurate rules and the remaining rules are discarded.
Lazy AC algorithms like L3G employ lazy pruning, which stores primary and secondary rules in the classifier.

In this paper, we first treat the problem of generating large classifiers in AC by proposing a new rule evaluation method for removing useless and redundant rules while constructing the classifier. The new rule evaluation method is an enhancement of a current AC algorithm called Multiclass Associative Classification (MAC) [7]. We have enhanced MAC's rule pruning method and its classification procedure: rather than using one rule for prediction, the proposed algorithm utilises a group of rules to enhance the accuracy rate. Further, in building the classifier we developed a rule evaluation method that increases the training coverage per rule in order to reduce the classifier size, so that the end-user can control and understand the classifier easily. The proposed rule evaluation method ensures larger training data coverage per classifier rule by taking into account only the similarity between the rule's body and the training case's attribute values while building the classifier. Other current AC algorithms like MCAR, in contrast, consider both the class similarity between the candidate rule and the training case and the match between the attribute values in the candidate rule body and those of the training case. The two enhancements have resulted in a new algorithm that we call Enhanced Multiclass Associative Classification (EMAC). EMAC's rule evaluation method thus ensures a smaller number of rules in the classifier. We show the applicability of EMAC on a crucial domain related to web security, website phishing classification, whose data is normally criticised as dense because of the correlations among the website's features.

Phishing is considered a form of web threat that is defined as the art of impersonating a website of an honest enterprise, aiming to acquire private information such as usernames, passwords and social security numbers [8]. Phishing websites are created by dishonest people to impersonate webpages of genuine websites. Almost all of these websites have high visual similarity to the legitimate ones in an attempt to defraud innocent people, and some are designed to be almost identical to the genuine ones. Social engineering and technical tricks are commonly combined in order to start a phishing attack [8]. Phishing websites have become a serious problem not only because of their increasing number but also due to the smart strategies used to design them, so that even users with good experience of computers and the internet might be deceived.

The process of detecting the type of a website is a typical classification problem where different features like URL length, sub-domains, added prefix and suffix, etc., are utilised to learn important hidden knowledge among these features. This knowledge is in fact the classification system, which in turn is used to automatically predict the phishing activity of a website when a user browses it. The phishing problem is considered a vital issue in the .com industry, especially e-banking and e-commerce, given the number of online transactions involving payments.

This article deals with two problems:

1) Improvement of current AC algorithms, particularly with regard to the generation of a large number of rules, by proposing a new method that reduces the number of rules discovered without drastically impacting the predictive accuracy of the classifiers. In other words, while constructing the classifier, we would like to minimise the number of rules derived by an AC algorithm. This can help decision makers in understanding, controlling and maintaining the final set of rules, primarily when making a prediction decision.
2) The applicability of AC mining to the website phishing problem, in order to learn important hidden knowledge from the correlations among the website's features. These correlations are extracted as If-Then rules to be used by the end-user for the automatic classification of websites.

A number of fake and legitimate websites collected from known sources like Phishtank and Millersmiles are used in the experimentation section to evaluate the performance of the proposed algorithm. Further, EMAC and three other AC and rule based algorithms have been contrasted with respect to different performance measures like classification accuracy and number of rules. More details are given in Section 4.

This article is structured as follows: Section 2 presents the phishing problem and related definitions to AC in data mining. The proposed algorithm and its main steps are explained in Section 3. Section 4 is devoted to experimentations, and finally conclusions are given in Section 5.

2. THE PHISHING PROBLEM AND ASSOCIATIVE CLASSIFICATION MINING

Typically, a phishing attack starts by sending an email that appears to be from an authentic organisation to victims, urging them to update or validate their information by following a fake URL link within the email body. Email remains the main spreading channel for phishing links, since 65% of phishing attacks start by visiting a link received within an email (Kaspersky Lab, 2013). Two common approaches are used to detect phishing activities: blacklist and features methods [6]. In the blacklist approach, the website URL is compared with those in the blacklist to identify whether it is legitimate or fake. On the other hand, a more realistic approach, which is based on extracting the website features and using a heuristic method to identify phishing activity, has been successfully utilised [9]. Unlike the blacklist approach, the features based approach can distinguish newly created phishing websites in real time [8].
The effectiveness of the features methods depends on selecting a set of significant features that could help in determining the phishy website [9]. Phishing detection for websites is a typical classification problem in data mining, where the goal is to forecast the type of the website based on a number of features that can be stored in the training data set. For simplicity we can consider website phishing detection a two-class problem (binary classification), since the target class has only two possible values: Phishy or Legitimate. Once a webpage is loaded in the browser, a set of features is extracted from it. Those features have an influence in determining the type of the webpage. Website features like IP address, long URL, https and SSL are examples of important features that are used for learning knowledge. An AC data mining model will learn from the websites' features important knowledge (correlations between the feature values and the class attribute) to classify the webpage as either Phishy or Legitimate.

We start formulating the phishing detection problem in AC data mining with definitions given in [4]. Let $T$ denote the domain of the training data containing phishing features and $C$ be a list of classes. Each training case $t \in T$ may be given a single class $c_k$, where $c_k \in C$, and is represented as a pair $(t, c_k)$, where $c_k$ is connected with the data instance $t$ in the training data. Let $H$ denote the set of classifiers for $T \rightarrow C$, where each case $t \in T$ is given a class, and the goal is to find a classifier $h \in H$ that maximises the probability that $h(t) = c$ for each test case. So, for the training data set $T$ with $m$ attributes $A_1, A_2, \ldots, A_m$ and a set of classes $C$:

Definition 1: An attribute value set (AttValSet) can be described as a set of disjoint attribute values contained in a training case, denoted $\langle (A_{i1}, a_{i1}), \ldots, (A_{ik}, a_{ik}) \rangle$.

Definition 2: A rule $r$ is of the form $\langle AttValSet, c \rangle$, where $c \in C$ is the class.
Definition 3: The actual occurrence (ActOccr) of $r$ in $T$ is the number of cases in $T$ that match $r$'s antecedent.

Definition 4: The support count (SuppCount) of $r$ is the number of cases in $T$ that match $r$'s antecedent and belong to a class $c_i$.

Definition 5: A rule $r$ passes the user minimum support threshold (minsupp) if $SuppCount(r)/|T| \geq minsupp$, where $|T|$ is the number of cases in $T$.

Definition 6: A rule $r$ passes the user minimum confidence threshold (minconf) if $SuppCount(r)/ActOccr(r) \geq minconf$.
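Definitions 3 to 6 can be illustrated with a minimal Python sketch. The function names, the dictionary-based case layout and the four toy cases are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
from typing import Dict, List, Tuple

# A training case is (attribute values, class label). Layout is an assumption.
Case = Tuple[Dict[str, str], str]

def body_matches(body: Dict[str, str], attrs: Dict[str, str]) -> bool:
    """True when every attribute value in the rule body appears in the case."""
    return all(attrs.get(a) == v for a, v in body.items())

def act_occr(body: Dict[str, str], cases: List[Case]) -> int:
    """Definition 3: number of cases matching the rule antecedent."""
    return sum(1 for attrs, _ in cases if body_matches(body, attrs))

def supp_count(body: Dict[str, str], cls: str, cases: List[Case]) -> int:
    """Definition 4: cases matching the antecedent AND carrying the rule class."""
    return sum(1 for attrs, c in cases if c == cls and body_matches(body, attrs))

def passes(body: Dict[str, str], cls: str, cases: List[Case],
           minsupp: float, minconf: float) -> bool:
    """Definitions 5 and 6: support and confidence threshold checks."""
    occ = act_occr(body, cases)
    if occ == 0:
        return False
    sup = supp_count(body, cls, cases)
    return sup / len(cases) >= minsupp and sup / occ >= minconf

# Hypothetical toy data, not drawn from the paper's data set.
CASES: List[Case] = [
    ({"ip": "yes", "https": "no"}, "phishy"),
    ({"ip": "yes", "https": "no"}, "phishy"),
    ({"ip": "yes", "https": "yes"}, "legitimate"),
    ({"ip": "no", "https": "yes"}, "legitimate"),
]
```

For the candidate rule "ip = yes, then phishy" on this toy data, ActOccr is 3 and SuppCount is 2, so the support is 2/4 and the confidence is 2/3; the rule passes minconf = 50% but would fail minconf = 70%.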

Generally, an AC algorithm operates in three main phases. Firstly, it discovers all frequent attribute values, i.e. those that hold enough support. Once all frequent attribute values are found, it transforms the subset of them that hold enough confidence into rules. In other words, the algorithm finds and extracts rules that pass the user defined thresholds denoted by minimum support (minsupp) and minimum confidence (minconf). In the second phase, rule pruning operates, where only rules of high quality (confidence and support values) are selected to represent the classifier. Lastly, the classifier is utilised to forecast the class values of new unseen data.

3. THE PROPOSED ALGORITHM

The proposed algorithm utilises AC learning strategies to generate the rules. It comprises three main steps: rule discovery, classifier building and class assignment (the prediction step). In the first step, it iterates over the input training data set, in which the rules are found and extracted using the minsupp and minconf thresholds. In the second step, it tests the discovered rules on the training data set in order to select a subset to represent the classifier. The final step involves assigning classes to test data. The general description of the EMAC learning algorithm is depicted in Figure 1, and details are given in the next subsections. We assume that the input attributes are categorical or continuous. For continuous attributes, any discretisation measure is employed before the training phase. Missing attribute values are treated like other existing values in the data set.

3.1. RULE DISCOVERY

EMAC uses a training method that employs a simple intersection among the locations of ruleitems in the training data set (TIDs) to discover the rules. The TID of a ruleitem holds the row numbers that contain its attribute values and their corresponding class labels in the training data set.
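The TID-based discovery used in this step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Java implementation: TID sets are kept per attribute-value set only, class frequencies are looked up from the rows afterwards, and all function names and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Item = Tuple[str, str]                 # (attribute, value)
Row = Tuple[Dict[str, str], str]       # (attribute values, class label)
TidMap = Dict[FrozenSet[Item], Set[int]]

def size1_tids(rows: List[Row]) -> TidMap:
    """One pass over the data: map each attribute value to its row numbers."""
    tids: Dict[FrozenSet[Item], Set[int]] = defaultdict(set)
    for tid, (attrs, _cls) in enumerate(rows):
        for item in attrs.items():
            tids[frozenset([item])].add(tid)
    return dict(tids)

def frequent(tids: TidMap, min_count: int) -> TidMap:
    """Keep only itemsets that occur in at least min_count rows."""
    return {k: v for k, v in tids.items() if len(v) >= min_count}

def next_level(freq: TidMap) -> TidMap:
    """Build size k+1 candidates by intersecting TID sets of size k itemsets."""
    out: TidMap = {}
    for (a, ta), (b, tb) in combinations(freq.items(), 2):
        union = a | b
        distinct_attrs = {attr for attr, _ in union}
        # Grow by exactly one item and never repeat an attribute.
        if len(union) == len(a) + 1 and len(distinct_attrs) == len(union):
            out[union] = ta & tb
    return out

def class_counts(tid_set: Set[int], rows: List[Row]) -> Counter:
    """Class frequencies among the rows listed in a TID set."""
    return Counter(rows[t][1] for t in tid_set)

# Hypothetical toy rows, not drawn from the paper's data set.
ROWS: List[Row] = [
    ({"ip": "yes", "https": "no"}, "phishy"),
    ({"ip": "yes", "https": "no"}, "phishy"),
    ({"ip": "yes", "https": "yes"}, "legitimate"),
    ({"ip": "no", "https": "yes"}, "legitimate"),
]
F1 = frequent(size1_tids(ROWS), min_count=2)
F2 = frequent(next_level(F1), min_count=2)
```

Because each itemset carries its row numbers, support and confidence can be computed from set sizes and `class_counts` without rescanning the whole training data set.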
The proposed algorithm discovers the frequent ruleitems of size 1 (F1) after iterating over the training data set. Then, it intersects the TIDs of the disjoint ruleitems in F1 to discover the candidate ruleitems of size 2, and after determining F2, the possible remaining frequent ruleitems of size 3 are obtained by intersecting the TIDs of the disjoint ruleitems of F2, and so forth. The TIDs of a ruleitem comprise useful information that is utilised to locate values easily in the training data set, especially when computing the support and confidence of rules. When frequent attribute values are identified, EMAC generates any of them as a rule if it passes the minconf threshold. When an attribute value is connected with more than one class and is frequent, EMAC considers only the class with the largest frequency associated with that attribute value and ignores the others. If the class frequencies connected with the attribute value in the training data set are equal, the choice is random.

3.2. RULE RANKING METHOD

There are several different rule ranking formulas, containing different criteria, considered by scholars in AC. For instance, the CBA algorithm [2] and its successors consider the rule's confidence and support as the main criteria for rule preference, while the CMAR [3] and MCAR [4] algorithms add on top of that the rule's length and the majority class count respectively when rules have identical confidence and support. On the other hand, lazy AC algorithms [10] place specific rules first (rules with a large number of attribute values in their body), since they claim these rules are often more accurate. This approach, though, has been criticised for ending up with very large classifiers that are hard to maintain, understand and update.

We argue that the minority class frequency should be employed as a rule preference parameter, rather than the majority class count as in MCAR, when rules have similar confidence, support and length. This is because the number of rules for a lower frequency class is normally smaller than that of the largest frequency class. Therefore, ranking rules of a smaller frequency class higher gives them a better chance to survive during rule evaluation and become part of the classifier, resulting in more rule representation for each class with low frequency in the training data. We have favoured rules associated with the less frequent class in rule ranking, since such a class is not well represented by rules in the classifier and usually has fewer rules.

Input: Training data D, minsupp and minconf thresholds
Output: A classifier that comprises rules
Step One:
  Iterate over the training data set D with n columns to find all frequent ruleitems
  Convert any frequent ruleitem that passes minconf to a single label rule
  Sort the rules set according to Section 3.2
Step Two:
  Evaluate the complete set of rules discovered in step (1) on the training data set in order to remove redundant rules or rules that have no training data coverage
Step Three:
  Classify test cases
Fig. 1. The proposed algorithm

3.3. CLASSIFIER CONSTRUCTION

After the rules are sorted, a subset of them is chosen to comprise the classifier. The classifier is built by EMAC as follows: for each training case, EMAC iterates over the set of discovered rules and selects the first rule that matches the training case as a classifier rule. The same process is repeated until all training cases are utilised or all candidate rules have been evaluated. If any training data remains uncovered, a default class rule is formed. This rule represents the majority class in the remaining uncovered training data. Finally, EMAC outputs all marked rules to form the classifier. The remaining unmarked rules are discarded by the proposed algorithm, since some higher ranked rules have covered their training cases during classifier building and therefore these unmarked rules have become redundant and useless.

The rule pruning of the proposed algorithm differs from other pruning procedures in AC, such as those of CBA, CMAR and CPAR, in that it does not require the class of the evaluated rule and that of the training case to be the same as a condition of rule significance; rather, it only considers the match between the rule body and the training case. This reduces overfitting of the training data set, since most current AC algorithms mark the candidate rule as a classifier rule only if its body matches the training case and it has the same class as the training case. This may result in more accurate prediction on the training data set but not necessarily on new unseen test cases. We argue that the similarity test between the candidate rule class and the training case class has limited effect on the predictive power of the resulting classifiers during the prediction step. Lastly, one obvious advantage of the proposed rule evaluation method is that it ensures more data coverage per rule, which consequently often leads to fewer rules in the classifier.
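The ranking and body-only coverage test described above can be sketched in Python as follows, together with the group-of-rules vote that the algorithm uses for class assignment. This is a minimal sketch under stated assumptions and not the authors' code: the `Rule` type and all function names are hypothetical, shorter rule bodies are assumed to rank higher (the text does not state the direction), and ties are broken deterministically here where the paper specifies a random choice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

Case = Tuple[Dict[str, str], str]  # (attribute values, class label)

@dataclass(frozen=True)
class Rule:  # hypothetical container for a discovered rule
    body: Tuple[Tuple[str, str], ...]  # ((attribute, value), ...)
    cls: str
    conf: float
    supp: float

def matches(rule: Rule, attrs: Dict[str, str]) -> bool:
    """Body-only test: the class of the case is deliberately not checked."""
    return all(attrs.get(a) == v for a, v in rule.body)

def rank(rules: List[Rule], cases: List[Case]) -> List[Rule]:
    """Confidence, then support, then body length (assumed: shorter first),
    then the less frequent class first."""
    class_freq = Counter(cls for _, cls in cases)
    return sorted(rules, key=lambda r: (-r.conf, -r.supp, len(r.body),
                                        class_freq[r.cls]))

def build_classifier(ranked: List[Rule], cases: List[Case]) -> Tuple[List[Rule], str]:
    """Mark, for each training case, the first ranked rule whose body matches."""
    marked = set()
    uncovered: List[str] = []
    for attrs, cls in cases:
        first = next((i for i, r in enumerate(ranked) if matches(r, attrs)), None)
        if first is None:
            uncovered.append(cls)
        else:
            marked.add(first)
    kept = [ranked[i] for i in sorted(marked)]
    # Default class: majority of uncovered cases, else of the whole training set.
    pool = uncovered or [cls for _, cls in cases]
    return kept, Counter(pool).most_common(1)[0][0]

def predict(kept: List[Rule], default: str, attrs: Dict[str, str]) -> str:
    """Vote by class over all applicable rules; fall back to the default class."""
    votes = Counter(r.cls for r in kept if matches(r, attrs))
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical toy data, not drawn from the paper's data set.
CASES: List[Case] = [
    ({"ip": "yes", "https": "no"}, "phishy"),
    ({"ip": "yes", "https": "no"}, "phishy"),
    ({"ip": "yes", "https": "yes"}, "legitimate"),
    ({"ip": "no", "https": "yes"}, "legitimate"),
    ({"ip": "no", "https": "no"}, "phishy"),
]
RULES = [
    Rule((("ip", "yes"),), "phishy", conf=0.67, supp=0.4),
    Rule((("https", "yes"),), "legitimate", conf=1.0, supp=0.4),
    Rule((("ip", "yes"), ("https", "no")), "phishy", conf=1.0, supp=0.4),
    Rule((("https", "no"),), "phishy", conf=1.0, supp=0.6),
]
KEPT, DEFAULT = build_classifier(rank(RULES, CASES), CASES)
```

On this toy data the two single-attribute `https` rules cover every training case, so the two lower ranked rules are never marked and are dropped, which mirrors how larger coverage per rule shrinks the classifier.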
A smaller classifier is also easier for the end-user to control and understand.

3.4. CLASSIFICATION OF TEST DATA

When a test case is about to be classified, the prediction procedure of the EMAC algorithm works as follows. It iterates over the set of rules stored in the classifier and highlights all rules that are contained in the test case (rules whose body matches some attribute values in the test case). If only one rule is applicable to the test case, the class of that rule is assigned to it. Where multiple rules are applicable, the algorithm categorises these rules into groups according to their classes and counts the number of rules in each group. The class belonging to the group with the largest number of rules is assigned to the test case. If more than one group has the same number of rules, the choice is random. This method, which utilises more than one rule to make the class assignment, improves upon single rule prediction procedures such as those of CBA and MCAR, which take the class of the highest ranked rule in the classifier matching the test case to make the prediction decision. Lastly, when no rules in the classifier are applicable to the test case, the default class (the majority class in the training data set) is assigned to that case.

4. EXPERIMENTAL RESULTS

4.1. DATA AND PHISHING FEATURES

We have investigated a large number of different features contributing to the classification of the type of websites that have been proposed in [8]. We selected nine effective features among them after applying the Chi-square feature selection metric in WEKA to 1228 different websites. The data set utilised in the experiments consists of 547 legitimate and 681 fake websites. It has been collected from the Yahoo directory, the starting point directory, Phishtank and the Millersmiles archives. Seven samples of the websites features data are shown in Table 1, where the class is either 1 (legitimate) or 0 (phishy). The -1 value in the table below denotes Suspicious, which can be either phishy or legitimate, so the end-user is unsure about the feature's value.

[Table 1. Sample of the websites features data. Surviving column headers: IP, URL Length, Prefix Suffix, Sub Domain, HTTPs, Request URL, URL Anchor, Domain age, Class]

The features that we consider are described below.

1. Using IP address: Using an IP address in the hostname part of the URL means the user can almost be sure that someone is trying to steal his personal information.

2. Long URL: Phishers resort to long URLs to hide the suspicious part of the URL, which may redirect the information submitted by the users or redirect the uploaded page to a suspicious domain.

3. Adding Prefix and Suffix to URL: Phishers try to deceive users by reshaping the URL to look like a legitimate one. One technique for doing so is adding a prefix or suffix to the legitimate URL so that the user may not notice any difference.

4. Sub-domain(s) in URL: Another technique used by phishers to deceive users is adding sub-domain(s) to the URL, so that users may believe they are dealing with a credited website.

5. Misuse of HTTPs protocol: The existence of the HTTPs protocol every time sensitive information is being transferred suggests that the user is connected with an honest website. However, phishers may use a fake HTTPs protocol so that users may be deceived.

6. Request URL: A webpage consists of text and some objects such as images and videos. Typically, these objects are loaded into the webpage from the same domain where the webpage exists. If the objects are loaded from a domain different from the domain typed in the URL address bar, the webpage is potentially suspicious.

7. URL of Anchor: Similar to Request URL, but for this feature the links within the webpage might refer to a domain different from the domain typed in the URL address bar. This feature is treated exactly as Request URL.

8. Website Traffic: Legitimate websites have high web traffic since they are visited regularly. Phishing websites often have a short life, so their web traffic either does not exist or their rank is below the limit that gives a website legitimate status.

9. Age of Domain: The website is considered Legitimate if the domain is aged more than 2 years. Otherwise, the website is considered Phishy.

4.2. EXPERIMENTS RESULTS

Ten-fold cross-validation was utilised to evaluate the classification models and to produce error rates in the experiments. Four dissimilar rule based classification algorithms, which utilise a variety of rule learning methodologies, have been considered for contrasting purposes with EMAC. These algorithms are CBA [2], PRISM [11], PART [11] and MCAR [4]. We selected these classification algorithms because, firstly, all of them generate rules in the form of If-Then rules, which allows a fair comparison. Secondly, the chosen algorithms use different learning methodologies in discovering and producing the rules. The learning strategy exploited by CBA is based on the Apriori association rule technique, where frequent ruleitems are produced iteratively based on the minsupp threshold input by the end-user. On the other hand, MCAR uses a vertical mining methodology to discover the rules.
Mainly, it utilises the ruleitems' locations in the training data set (tid-lists) and performs tid-list intersections to compute each ruleitem's support and confidence, which in turn are used to decide whether the ruleitem is a rule. PRISM is a covering algorithm that divides the data set into parts according to the available class labels and produces all rules for each class. For each class, it starts with an empty rule and adds the attribute value with the highest expected accuracy. It stops adding attribute values to the rule body when the candidate rule's expected accuracy reaches 100%; at that point it generates the rule and removes all training data covered by the rule from the training data set. The algorithm repeats the same step until no data belonging to the selected class remains. Once this happens, PRISM begins generating rules for another class, and so forth. When the data in all parts are covered, PRISM merges all rules derived for all class labels and forms the classifier. Lastly, the PART algorithm is a combination of decision tree and rule induction algorithms that constructs partial decision trees.

The experiments were conducted on an i3 machine at 2.0 GHz. The experiments of PRISM were carried out in the Weka software [11]. For the AC algorithms, the CBA source code was obtained from its respective authors, and EMAC and MCAR were implemented in Java. Several researchers in AC, i.e. [2] [3] [4], have revealed that the minsupp threshold usually controls the number of rules generated. Thus, we have followed them in setting the support threshold to 1%-5% in the experiments of CBA, MCAR and the proposed algorithm. The confidence threshold, however, has less impact on the general performance of AC algorithms, and we set it to 50% for CBA, MCAR and EMAC. Figure 2 displays the classification accuracy of the compared algorithms on the nine-feature phishing detection data set.
[Fig. 2. The classification accuracy (%) for the contrasted algorithms derived from the phishing data. Algorithms shown: MAC, MCAR, PRISM, PART]

The figure shows that the proposed algorithm is highly effective in predictive power when contrasted with other AC algorithms as well as rule based ones. Precisely, EMAC has outperformed PRISM and PART by 7.77% and 0.93% respectively. The MCAR algorithm has slightly outperformed the proposed algorithm on the selected nine-feature data set, by 0.21%. However, as we will see shortly, MCAR has produced 56 more rules in the classifier than EMAC, which is an approximately 38% larger classifier to accomplish just 0.21% higher accuracy. We believe that there should be a trade-off between the number of rules produced and classification accuracy, where one can accept a smaller classifier in exchange for slightly lower accuracy. One possible reason for the slight increase in accuracy of MCAR over the proposed algorithm is the way it builds the classifier. In particular, MCAR evaluates each candidate rule derived in the learning phase on the training data set, and a rule is considered significant if it correctly covers at least one training data instance.

The coverage requires that:

1) The candidate rule body (attribute values) must be contained within the training instance
2) The class of the candidate rule and that of the training instance are similar

This rule evaluation process limits the data coverage per rule, since the above two conditions must both be true in order to consider the rule part of the classifier. Alternatively, EMAC inserts the candidate rule into the classifier if only the first condition above is true, relaxing the second condition (class similarity). This normally reduces overfitting by allowing the rule to cover a larger portion of the training cases, which explains the smaller classifiers produced by EMAC compared to MCAR. Figure 3 depicts the number of rules generated by the contrasted algorithms on the data set we consider; it clearly shows that the proposed algorithm extracts smaller classifiers than MCAR and PRISM.

[Fig. 3. The classifier size of MCAR and EMAC derived from the phishing data. Algorithms shown: MAC, MCAR, PRISM, PART]

The main reason for the smaller number of rules in the EMAC classifier compared to MCAR is the way EMAC constructs the classifier: it considers the candidate rule part of the classifier when only its body is within the training instance, and thus no class check is performed by EMAC. This usually results in the candidate rule covering a large number of training instances, and therefore several redundant rules are discarded. In other words, some lower ranked rules end up having no training data coverage and are therefore deleted. The PRISM covering algorithm generates the largest classifier since it has no rule pruning at all. As a matter of fact, PRISM keeps producing rules per class label as long as training instances exist, which explains its very large classifiers. On the other hand, the PART algorithm utilises rule induction and decision tree pruning heuristics to cut down the possible number of rules. To be more precise, it employs the information gain approach from decision trees to build partial trees, and then pessimistic error and reduced error pruning methods are applied to remove candidate rules. This explains its small classifier. Overall, AC algorithms such as MCAR and the proposed algorithm normally extract additional knowledge missed by classic rule based algorithms, and thus they end up with more rules in the classifiers.

5. CONCLUSIONS

Phishing detection is a vital problem in the online community due to the massive number of online transactions performed by users. In this paper, we take advantage of the simplicity of a data mining approach called associative classification, which extracts simple yet effective classifiers containing easy to understand chunks of knowledge, to solve the website phishing detection problem. Since phishing features are often correlated, we propose an algorithm that reduces the number of rules by using a novel evaluation method, which cuts down the number of rules by approximately 38% when contrasted with other AC algorithms like MCAR, without affecting classification accuracy. The new algorithm has been compared with one AC and two rule based classification algorithms with respect to accuracy rate and classifier size on a real websites data set. The data set comprises 1228 websites and nine significant features, collected from different online sources such as Phishtank and the Yahoo directory. The features were chosen after applying the Chi-square testing measure to a larger feature set. After experimentation, the results showed that the proposed algorithm scales well when compared to MCAR, PART and PRISM. Specifically, our algorithm has higher accuracy than PRISM and PART by 7.77% and 0.93% respectively. MCAR has slightly outperformed our algorithm, by 0.21%, yet derived 56 additional rules in the classifier.
In the near future we intend to plug our algorithm into a browser to detect phishing activity on the fly and alert users.

6. REFERENCES

[1] F Thabtah, Q Mahmood, L McCluskey, and H Abdeljaber, "A new Classification based on Association Algorithm," Journal of Information and Knowledge Management, vol. 9, no. 1.
[2] B Liu, W Hsu, and Y Ma, "Integrating Classification and Association Rule Mining," in Knowledge Discovery and Data Mining (KDD), 1998.
[3] W Li, J Han, and J Pei, "CMAR: Accurate and efficient classification based on multiple-class association rule," in Proceedings of the ICDM 01, San Jose, CA, 2001.
[4] F Thabtah, C Peter, and Y Peng, "MCAR: Multi-class Classification based on Association Rule," in The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005, p. 33.
[5] X Wang, K Yue, W Niu, and Z Shi, "An approach for adaptive associative classification," Expert Systems with Applications: An International Journal, vol. 38, no. 9, 2011.
[6] W Liu, X Deng, G Huang, and A Y Fu, "An Antiphishing Strategy Based on Visual Similarity Assessment," in IEEE Educational Activities Department, Piscataway, NJ, USA, 2006.
[7] N Abdelhamid, A Ayesh, F Thabtah, S Ahmadi, and W Hadi, "MAC: A multiclass associative classification algorithm," Journal of Information and Knowledge Management (JIKM).
[8] R M Mohammad, F Thabtah, and L McCluskey, "An Assessment of Features Related to Phishing Websites using an Automated Technique," in The 7th International Conference for Internet Technology and Secured Transactions (ICITST-2012), London.
[9] M Aburrous, M A Hossain, K Dahal, and F Thabtah, "Intelligent phishing detection system for e-banking using fuzzy data mining," Expert Systems with Applications: An International Journal, December.
[10] E Baralis, S Chiusano, and P Garza, "On support thresholds in associative classification," in Proceedings of the 2004 ACM Symposium on Applied Computing, Nicosia, Cyprus, 2004.
[11] E Frank and I Witten, "Generating accurate rule sets without global optimisation," in Proceedings of the Fifteenth International Conference on Machine Learning, Madison, Wisconsin.