The Use of Evolutionary Algorithms in Data Mining. Khulood AlYahya Sultanah AlOtaibi


The Use of Evolutionary Algorithms in Data Mining
Ayush Joshi, Jordan Wallwork, Khulood AlYahya, Sultanah AlOtaibi
MScISE, BScAICS, MScISE, MScACS

Abstract

With the huge amount of data being generated in the world every day, at a rate far higher than it can be analyzed by humans alone, data mining becomes an extremely important task for extracting as much useful information from this data as possible. The standard data mining techniques are satisfactory to a certain extent, but they are constrained by certain limitations, and it is for these cases that evolutionary approaches are both more capable and more efficient. In this paper we present the application of nature-inspired evolutionary techniques to data mining, augmented with human interaction to handle situations in which concept definitions are abstract and hard to define, and hence not quantifiable in an absolute sense. Finally, we propose some ideas for future implementations of these techniques.

Keywords: data mining, knowledge discovery, evolutionary algorithms, interactive evolutionary algorithms, genetic algorithms, genetic programming, co-evolutionary algorithms, rule discovery, classification, clustering, data mining tasks, data mining algorithms

Table of Contents

Introduction
Overview of Data Mining and Knowledge Discovery
    Data Mining Pre-processing
    Data Mining Tasks
        Models and Patterns
    Conventional Techniques of Data Mining
Evolutionary Algorithms and Data Mining
    Genetic Algorithms
    Genetic Programming
    Co-evolutionary Algorithms
    Representation and Encoding
        Rules Representation
        Fuzzy Logic Based Rules Representation
    Genetic Operators
        Crossover
        Mutation
        Fuzzy logic Operators
    Fitness Evaluation
        Objective Fitness Evaluation
        Subjective Fitness Evaluation (Interactive Evolutionary Algorithms)
    Selection and Replacement
    Integrating Conventional Techniques with Evolutionary Algorithms
Applications of Data Mining Using IEA
    Extracting Knowledge from a Text Database
    Extracting Marketing Rules from User Data
    Fraud Detection Using Data Mining and IEA Techniques
Some current work being done
Conclusion and Future Work
References

1 Introduction

In recent years, the massive growth in the amount of stored data has increased the demand for effective data mining methods to discover the hidden knowledge and patterns in these data sets. Data mining means to mine or extract relevant information from any available data of concern to the user. Data mining is not a new idea: it has been around for centuries and has been used for problems like regression analysis, or knowledge discovery from records of various types. As computers spread into almost all conceivable fields of human knowledge and occupation, their advantages were widely advocated, but it soon became clear that the increasing amounts of data that could be generated, stored and analysed required some way to sift through the data and pick out what is important. In earlier days a human or a group of humans would analyse the data by going through it manually and using statistical techniques, but the curve of data generation was far steeper than what could realistically be processed by hand. This led to the emergence of the field of data mining, whose essential aim was to define and formalize standard techniques for extracting information from large data warehouses.

As data mining evolved, it was observed that the data at hand was almost never perfect or suitable to be fed to data mining engines, and needed several steps of pre-processing before it could be mined. Generally the problems would lie in data format, the level of noise or incorrect data, unnecessary data, redundant data, etc. The pre-processing steps clean, integrate and discretize the data and select the most relevant attributes before any mining is performed.
A whole new area called intelligent data analysis has emerged, which uses efficient techniques for mining data from large sets, keeping in mind that the knowledge obtained must be useful while the time for mining is constrained and the user requires results as soon as possible. Some of the methods used to mine data include support vector machines, decision trees, nearest neighbour analysis, Bayesian classification, and latent semantic analysis.

Given the problems associated with conventional data mining techniques, new ways to overcome them were needed, and the application of AI techniques to the field resulted in a very powerful hybrid of techniques. Evolutionary optimization techniques provided a useful and novel solution to these issues, and once data mining was enhanced with evolutionary computation (EC), many of the previously mentioned problems were no longer major issues. Some applications of evolutionary algorithms in data mining that involve human interaction are presented in this paper.

When dealing with concepts that are abstract and hard to define, or with cases where there is a large or variable number of parameters, we still do not have reliable methods for finding solutions. In cases where we are unable to quantify what we want to measure, for instance beauty in images or pleasantness in music, we almost always require a human to drive the solutions through his or her choices. In these situations we use a combination of evolutionary computation and data mining, with a human interacting with the engine to steer the computation towards the solutions or answers he or she is looking for.

This paper begins by describing some concepts in data mining and general evolutionary algorithms. In the later sections we discuss some of the areas where these are implemented, and lastly we give a few ideas of where these techniques may be implemented in the future.

2 Overview of Data mining and Knowledge Discovery

Knowledge discovery and data mining, as defined by Fayyad et al. (1996), is the process of identifying valid, novel, useful, and understandable patterns in data. Data mining has emerged particularly in situations where analysing the data manually or by using simple queries is either impossible or very complicated (Cantú-Paz & Kamath, 2001). Data mining is a multi-disciplinary field that incorporates knowledge from many disciplines, mainly machine learning, artificial intelligence, statistics, signal and image processing, mathematical optimization, and pattern recognition (ibid.).

Knowledge discovery and data mining consist of three main steps that convert a collection of raw data into valuable knowledge: data pre-processing, knowledge extraction, and data post-processing (Freitas, 2003). For the data mining process to be considered successful, the discovered knowledge should be accurate, comprehensible, relevant and interesting for the end user (Cantú-Paz & Kamath, 2001). This section gives an overview of data mining pre-processing, data mining tasks, and the conventional techniques for data mining.

2.1 Data Mining Pre-processing

The purpose of data mining pre-processing is to eliminate outliers, inconsistency and incompleteness in the data in order to obtain accurate results (Freitas, 2003). The pre-processing steps are listed below:

- Data cleaning: prepares the data for the following steps by removing irrelevant data and as much noise as possible. It is done to guarantee the accuracy and the validity of the data.
- Data integration: removes redundant and inconsistent data from data that is collected from different sources.
- Discretization: converts continuous values of attributes to discrete values, e.g. for the attribute Age we can set the minimum value equal to 21 and the maximum value equal to 60.
- Attribute selection: selects, from all the data sets, the data relevant to the analysis process.
- Data mining: after all the previous steps, data mining algorithms or techniques can be applied to the data in order to extract the desired knowledge.
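The discretization step can be illustrated with a short sketch. It assumes equal-width bins between the Age bounds of 21 and 60 mentioned above; the function name `discretize`, the choice of three bins, and the clamping of out-of-range values are illustrative assumptions, not part of the original description.

```python
def discretize(value, lower=21, upper=60, n_bins=3):
    """Map a continuous value to an equal-width bin index in [0, n_bins - 1].

    Hypothetical helper: the bounds follow the Age example in the text,
    the number of bins is an assumption made for illustration.
    """
    # Clamp to the attribute bounds so out-of-range values fall in the edge bins.
    clamped = min(max(value, lower), upper)
    width = (upper - lower) / n_bins
    # The upper bound itself belongs to the last bin.
    return min(int((clamped - lower) / width), n_bins - 1)


ages = [18, 21, 33, 34, 47, 60, 72]
bins = [discretize(a) for a in ages]
print(bins)
```

Other binning schemes (for example equal-frequency bins) would fit the same description; equal width is used here only because it is the simplest to read.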

2.2 Data Mining Tasks

It is very important to define the data mining task that an algorithm should address before designing it for application to a particular problem. There are several data mining tasks, and each of them has a specific purpose in terms of the knowledge to be discovered (Freitas, 2002).

Models and Patterns

In data mining, the term model refers to a high-level description of the data set (Hand, 2001). A model can be either descriptive or predictive. As the names imply, a descriptive model is an unsupervised model that aims to describe the data, while a predictive model is a supervised model that aims to predict values from the data. Patterns are used to define the important and interesting features of the data. An unusual combination of purchased items in a supermarket is an example of a pattern. Models are used to describe the whole data set, while patterns are used to highlight particular aspects of the data.

Predictive Models

According to Kambert (2001), data analysis can generally take either a classification or a prediction form. Regression analysis is an example of a prediction task, namely numeric prediction. The difference between classification and regression is that the target value (response variable) is a quantitative value in regression modeling, while it is a qualitative or categorical value in classification modeling.

Classification Task

Some terms need to be introduced in order to describe classification tasks. The data sets to which classification techniques or algorithms are applied are composed of a number of instances (objects). Each instance has a number of attributes, which have discrete values. The records in database tables, for example, represent the instances, and the fields represent the attributes. In other words, each row represents an object and the columns describe this object in terms of its attributes.
Classification aims to extract hidden knowledge, in the form of patterns, from the values of some attributes in order to predict the value of a particular field or attribute. This target value is known as the class (Dzeroski & Lavrac, 2001). The inputs of the classification algorithm are data instances, and the outputs are the patterns that are used to predict the class to which an instance belongs. Here is an example of a classification rule (Freitas, 2002):

IF (a_given_set_of_conditions_is_satisfied_by_an_instance)    [antecedent part]
THEN (predict_a_certain_class_for_that_instance)              [consequent part]
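As a rough illustration of the IF-THEN structure above, the following sketch stores the antecedent as a list of conditions and the consequent as a class label. The `Rule` class, the attribute names `age` and `income`, and the class label `buys` are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Instance = Dict[str, object]


@dataclass
class Rule:
    # Antecedent: every condition must hold for the rule to fire.
    antecedent: List[Callable[[Instance], bool]]
    # Consequent: the class predicted when the antecedent is satisfied.
    consequent: str

    def predict(self, instance: Instance) -> Optional[str]:
        """Return the predicted class, or None if the rule does not apply."""
        if all(condition(instance) for condition in self.antecedent):
            return self.consequent
        return None


# Hypothetical rule: IF age = young AND income = high THEN class = buys
rule = Rule(
    antecedent=[lambda x: x["age"] == "young", lambda x: x["income"] == "high"],
    consequent="buys",
)

print(rule.predict({"age": "young", "income": "high"}))
print(rule.predict({"age": "old", "income": "high"}))
```

Returning `None` for uncovered instances is a design choice of this sketch; a full classifier would typically fall back to a default class.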

The data in the classification task is divided into two mutually exclusive data sets, the training dataset and the testing dataset (Freitas, 2003). The training dataset is used to build the classification model, and the testing dataset is used to evaluate the predictive performance of the model. Overfitting occurs when the model is over-trained on the training dataset and simply memorizes it, which results in poor predictive performance on the testing dataset. By contrast, underfitting occurs when the model is under-trained and has not learned well from the training data. In underfitting situations, the model consists of a number of rules that cover too many training instances (ibid.).

Regression Task

In general, regression modeling is very similar to classification modeling, except that the target value in regression modeling is a continuous or ordered value. In statistics, regression analysis models the relation of a response value (the output or target value of the predictor) to specified predictor values (the input variables of the predictor). Regression analysis can be linear or multiple regression. Both approximate the relation of a single response variable to continuous predictor values: a single predictor in the former and multiple predictors in the latter (Larose, 2006).

Descriptive model

Clustering Task

Clustering simply means grouping: placing data instances into different groups or clusters such that instances from the same cluster are similar to each other and easily distinguished from the instances that belong to other clusters (Zaki et al., 2010).

Association Analysis Task

Association analysis refers to the process of extracting, from a data set, association rules that describe some interesting relations hidden in this data set. As an illustration, consider the market basket transactions example, where we have two items A and B and the following rule is extracted from the data: {A} -> {B}.
This rule suggests that there is a strong relation between item A and item B in terms of the frequency of their occurrence together (Tan et al., 2006). This means that if item A is in the basket, there is a high probability that item B will be in the basket as well.
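One common way to quantify "frequency of occurrence together" is with the support and confidence of a rule; these two measures are standard in association analysis, although the text above does not name them. The sketch below computes both for {A} -> {B} on a small, made-up list of baskets; the function names and the data are assumptions for illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    return count(transactions, set(antecedent) | set(consequent)) / n_antecedent


# Made-up market baskets, for illustration only.
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"}]

print(support(baskets, {"A", "B"}))       # A and B together in 3 of 5 baskets
print(confidence(baskets, {"A"}, {"B"}))  # B present in 3 of the 4 baskets with A
```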

2.3 Conventional Techniques of Data Mining

Several tools and techniques are available for data mining and knowledge discovery. These techniques have been developed from two main fields: statistics and machine learning. Multivariate analysis, logistic regression, linear discrimination, ID3, k-nearest neighbor, Bayesian classifiers, principal component analysis, and support vector machines are examples of these techniques. They are designed to discover accurate and comprehensible rules, but most of them are not designed to discover interesting rules (Freitas, 2003).

Statistics and machine learning techniques are considered to be the most widely used techniques for data mining, but they have some drawbacks. The models or rules discovered using these techniques are not always optimal. This is due to their sensitivity to noise in the data set, which may cause them to overfit the data (Vafaie & Jong, 1994). They also tend to generate models with a larger number of features than is really necessary, which increases the computational cost of the model (ibid.). Another drawback is that they typically assume a priori knowledge about the data set, which is not available in most cases. Statistical methods have a further problem: they make assumptions about the linearity of the models and the distribution of the data (Terano & Ishino, 1996).

3 Evolutionary Algorithms and Data Mining

Evolutionary algorithms have several features that make them attractive for the data mining process (Freitas, 2003; Vafaie & Jong, 1994). They are a domain-independent technique, which makes them ideal for applications where domain knowledge is difficult to provide. They are able to explore large search spaces and consistently find good solutions. In addition, they are relatively insensitive to noise, and can manage attribute interaction better than the conventional data mining techniques.
Considerable work has therefore been done in recent years to develop new techniques for data mining using evolutionary algorithms. These attempts used evolutionary algorithms for different data mining tasks such as feature extraction, feature selection, classification, and clustering (Cantú-Paz & Kamath, 2001). The main role of evolutionary algorithms in most of these approaches is optimization: they are used to improve the robustness and accuracy of some of the traditional data mining techniques.

Different types of evolutionary algorithms have been developed over the years, such as genetic algorithms, genetic programming, evolution strategies, evolutionary programming, differential evolution, cultural evolution algorithms and co-evolutionary algorithms (Engelbrecht, 2007). Of these, the types used in data mining include genetic algorithms, genetic programming and co-evolutionary algorithms. Genetic algorithms are used for data pre-processing and for post-processing the discovered knowledge, while genetic programming is used for rule discovery and data pre-processing (Freitas, 2003).

This section gives a general overview of genetic algorithms, genetic programming, and co-evolutionary algorithms, followed by an overview of different representation schemes, genetic operators, and fitness evaluation for the purpose of data mining. Finally, a brief discussion of integrating conventional data mining techniques with evolutionary algorithms is given.

3.1 Genetic Algorithms

Genetic algorithms were originally proposed as a general model of adaptive processes, but by far their largest application is in the domain of optimization (Back et al., 1997). They consist of a population of individual solutions that are acted upon by a series of genetic operators in order to generate new, and hopefully better, solutions to a particular problem, and are inspired by natural evolution.

The term genetic algorithm was coined in the early 70s by John Holland, who had been working since the early 60s with systems that generate populations of potential solutions using natural methods. In his paper Outline for a Logical Theory of Adaptive Systems, Holland (1962) describes a system in which a generation tree of populations is generated. A number of solutions (the population) is applied to a number of problems (the environment); the solutions that are able to solve the problems successfully are given reward/activation scores, which enable solutions to be compared with one another, and the best of these are used in the generation of the next branch of the generation tree.

Virtually all modern evolutionary systems have the same general stages:

- A random population of solutions is generated, to be used as the initial population.
- The solutions within the population are evaluated to determine their fitness.
- Solution pairs are first selected, based on their fitness, and are then combined to create offspring, which are added to the next generation of the population.
- Other genetic operators, such as mutation, are also applied to the offspring.

3.2 Genetic Programming

Genetic programming is a specific application of genetic algorithms, used to evolve computer programs. The paradigm was named and developed in the early 90s by John Koza, who initially used genetic programming to evolve LISP programs (Koza, 1992).

While the general premise of genetic programming is the same as that of the basic genetic algorithm (selecting the fittest members of a given population and then crossing and mutating them), the representation of the solutions is radically different, which results in the need for alternative crossover methods. Instead of representing solutions as chromosomes, which are fixed-length sets where each gene has a specific meaning and a limited range of values, genetic programming represents programs as strings, which can grow arbitrarily long. This means that crossover cannot simply occur at a random position anywhere in the program, as this would probably break it, so more careful crossover algorithms need to be developed.

Figure 1: Examples of evolved LISP programs. Fitness calculated by number of outputs closer than 20% to correct output. (Mitchell, 1998)

Koza realized that genetic algorithms could be used to evolve not only programs but also other complex structures, such as equations and rule sets. This is particularly useful when

looking at genetically evolving data mining techniques, since we can use the principles of genetic programming in the evolution of rule constructs. In data mining, genetic programming is considered a more open-ended search technique that can produce many different combinations of attributes. Hence it is very useful for classification and prediction tasks (Freitas, 2003).

3.3 Co-evolutionary Algorithms

In co-evolutionary algorithms, two populations are evolved together, with the fitness function involving the relationship with other individuals. The individuals of the two populations evolve either by competing against each other or by cooperating with each other (Engelbrecht, 2007). The competitive approach is used to obtain exclusivity on a limited resource (Tan et al., 2005), while the cooperative approach is used to gain access to some hard-to-attain resource (ibid.). In the competitive approach, the fitness of an individual in one population is based on direct competition with individuals in the other population. In the cooperative approach, on the other hand, the fitness of an individual in one population is based on how well it cooperates with the individuals in the other population. Co-evolutionary approaches, particularly the cooperative approach, can address some of the problems of evolutionary algorithms with a single population, such as poor performance and convergence to local optima when dealing with problems that have complex solutions (Tan et al., 2005).

Several attempts have been made to apply co-evolutionary algorithms to the field of data mining. One of them is the distributed evolutionary classifier for knowledge discovery in data mining proposed by Tan et al. (2005). In their approach, they use a cooperative evolutionary algorithm to evolve two populations. Each individual of the first population represents a single rule. Each individual of the second population represents a set of rules.
They validated their approach using six datasets. Their classifier performed better than the C4.5 classifier (a well-known algorithm for generating decision trees). The proposed co-evolutionary approach reduces the computation time by sharing the workload among multiple computers. It also achieved a smaller number of rules in the rule set compared with other classification techniques, which increases the comprehensibility of the classification model. Moreover, it is more robust to noise in the data and has robust prediction accuracy.

Another approach that applies co-evolutionary algorithms to data mining is the co-evolutionary system for discovering fuzzy classification rules developed by Mendes et al. (2001). They used two evolutionary algorithms in their system, a genetic programming algorithm and an evolutionary algorithm, to co-evolve two populations. The genetic programming algorithm evolves a population of fuzzy rule sets, and the evolutionary algorithm evolves a population of membership function definitions. The advantage of using the co-evolutionary process is the discovery of fuzzy rule sets and membership function definitions that are better adjusted to each other.
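For contrast with the two-population schemes above, the general stages listed in Section 3.1 can be sketched as a minimal single-population loop. This is only a sketch under stated assumptions: bitstring individuals, tournament selection of size 3, one-point crossover, bit-flip mutation and elitism are all illustrative choices, and the function `evolve` and its parameters are hypothetical rather than taken from any of the cited systems. The toy fitness used here simply counts the 1-bits in an individual.

```python
import random


def evolve(fitness, n_genes=16, pop_size=20, generations=40,
           mutation_rate=0.02, seed=0):
    """Minimal genetic algorithm over bitstrings (illustrative sketch)."""
    rng = random.Random(seed)
    # Stage 1: generate a random initial population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Stage 2: evaluate the population (here: rank it by fitness).
        ranked = sorted(population, key=fitness, reverse=True)
        # Elitism (an assumption of this sketch): carry the best individual over.
        next_generation = [ranked[0][:]]
        while len(next_generation) < pop_size:
            # Stage 3: select two parents by tournament and combine them.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randint(1, n_genes - 1)
            child = parent1[:cut] + parent2[cut:]
            # Stage 4: apply mutation by flipping each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness)


best = evolve(sum)  # toy problem: maximise the number of 1-bits
print(best, sum(best))
```

Because the best individual is always copied into the next generation, the best fitness in this sketch can never decrease from one generation to the next.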

3.4 Representation and Encoding

The traditional method of encoding the genetic rules, which perhaps most closely resembles the way evolution occurs in nature, is to use a direct representation scheme that encodes the population data as a series of bitstrings: binary strings representing the genes that build each chromosome in a population. An 8-bit binary string would represent a population whose genetic data consists of 8 Boolean values, where each bit has some specific meaning. For example, a system looking to design a new car could use its first bit to represent whether the car has two or four doors, the second to represent whether it has 3 or 4 wheels, the third to represent whether the car has a spoiler, etc. In this system, a population member with the value 011xxxxx would represent a car with two doors, four wheels and a spoiler.

The key issue with this kind of representation, however, is that it defines a very specific search space, with a set number of genetic parameters and a very restricted set of values that each of these parameters can take. A simple way to make the genetic algorithm considerably more powerful is to alter the representation so that, rather than being encoded as a set of Boolean values, each gene stores a number, such as an integer or a floating point value. This means that the search space defined by the system described above can be hugely expanded, allowing for a far greater number of genetic possibilities in its population. For example, the second bit represented whether the car had three or four wheels; by using an integer rather than a Boolean, we broaden the system so that it can represent a car with any number of wheels.
Expanding the representation in this way comes with an additional memory overhead: a gene encoded with a 1-byte integer is 8 times the size of a binary gene. However, the number of possible values leaps from 2 to 256, so the memory increase is a small cost to pay for a significant improvement to the representation. Of course, not all genes need to be encoded in the same way; a chromosome can be constructed from any combination of data types that best fits the space being represented. For example, it would not make sense to represent whether a car has a spoiler with an integer, as there are only two possibilities, so a Boolean is sufficient.

Rules Representation

Classification is the most common application of evolutionary algorithms in data mining. There are many techniques for performing the classification task. The rule-based technique is preferred over other classification techniques because rules are more comprehensible (Freitas, 2003). There are two approaches to representing individuals when using evolutionary algorithms for rule discovery: the Michigan and the Pittsburgh approach (ibid.). In the Michigan approach each individual represents a single rule, whereas in the Pittsburgh approach each individual represents a set of rules. The Pittsburgh approach is more suitable for classification tasks because the quality of the rule set is evaluated as a whole, rather than the quality of a single rule. On the other hand, the Michigan approach is more suitable for other kinds of data mining tasks, such as finding a small set of high-quality prediction rules, because each rule is evaluated independently of the other rules (ibid.).
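The difference between the two approaches can be sketched in terms of what a single individual holds. In the sketch below a rule is a list of (attribute, value) conditions plus a predicted class; the attribute names, class labels, the first-matching-rule classification policy and the helper names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, str]          # (attribute, required value)
Rule = Tuple[List[Condition], str]   # (antecedent, predicted class)


def rule_covers(rule: Rule, instance: Dict[str, str]) -> bool:
    """True if the instance satisfies every condition in the antecedent."""
    antecedent, _ = rule
    return all(instance.get(attr) == value for attr, value in antecedent)


# Michigan approach: one individual encodes one rule.
michigan_individual: Rule = ([("doors", "two"), ("spoiler", "yes")], "sports")

# Pittsburgh approach: one individual encodes a whole rule set.
pittsburgh_individual: List[Rule] = [
    ([("doors", "two"), ("spoiler", "yes")], "sports"),
    ([("doors", "four")], "family"),
]


def classify(rule_set: List[Rule], instance: Dict[str, str], default: str = "unknown") -> str:
    """Predict with the first rule that covers the instance (assumed policy)."""
    for rule in rule_set:
        if rule_covers(rule, instance):
            return rule[1]
    return default


def rule_set_accuracy(rule_set: List[Rule], data) -> float:
    """Evaluate a Pittsburgh individual as a whole on labelled data."""
    correct = sum(1 for x, y in data if classify(rule_set, x) == y)
    return correct / len(data)


data = [
    ({"doors": "two", "spoiler": "yes"}, "sports"),
    ({"doors": "four", "spoiler": "no"}, "family"),
    ({"doors": "two", "spoiler": "no"}, "family"),
]
print(rule_set_accuracy(pittsburgh_individual, data))
```

A Michigan individual would instead be scored on its own, for example by how many of the instances it covers have its predicted class.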

3.4.2 Fuzzy Logic Based Rules Representation

Fuzzy logic based rules are not only more readable by humans, but are also easier to evolve than classic rule types. Fuzzy rules use unary functions to classify variables; for example, IS_LOW(Age) could be equivalent to Age < 27. The advantage of using this fuzzy representation is that it allows us to classify variables without needing to know explicitly the range of values that could realistically be expected, as they are simply classified into three distinct groups, low, medium and high, defined by the normalized values of all the data contained in the database. An example of a fuzzy classification function is depicted in Figure 2, where the x-axis denotes the normalized data values and the y-axis denotes the degree of membership in the fuzzy groups.

Figure 2: Fuzzy membership classification (Walter, 2000)

For our rule set, therefore, we will use just two binary operators, AND and OR, four unary operators, NOT, LOW, MEDIUM and HIGH, and an integer value to represent each data variable. We will assign a numerical value to each of these, so for instance the values 0-5 could represent the operators, and the variables could be 6 and above. We will use 1-byte binary representations of these, which gives us up to 250 possible variables (the values 6 to 255); if we require more variables, we can simply choose to use a larger representation (e.g. 2 bytes). If we consider a sample rule, LOW(Age) AND NOT(HIGH(Height) OR HIGH(Weight)), we can see how the binarized form of the rule can be parsed and understood (in this example, we have used a 4-bit representation).

Figure 3

3.5 Genetic Operators

New population members can be evolved by applying a number of genetic operators in order to combine existing chromosomes.
There are two operators that mimic natural methods of recombining genetic material: crossover, which merges two sets of chromosomes in much the same way as sexual reproduction does between animals, and mutation, where genes alter randomly, which happens in real life after organisms reproduce.

Crossover

A common method for crossover is called one-point crossover (Rawlins, 1991). Bitstring representations are discussed here for simplicity, but the methods do not vary with the data type used. In one-point crossover, the two parents' chromosomes are split in the same place, and one part of one set of genes is combined with the complementary part of the other. The crossover point tends to be randomized each time a pair of chromosomes reproduces. As an

example, consider two parent chromosomes of eight genes each. Since there are eight genes in the chromosome, the crossover point can be anywhere between bits 1-2 and bits 7-8. Say the crossover point is between bits 3-4: each parent is then split into its first three bits and its remaining five bits. Depending on which way the parents are combined, the offspring will be either the first three bits of one parent followed by the last five bits of the other, or the reverse.

Generalizing / Specializing Crossover

The purpose of this type of crossover is to generalize a rule when it is overfitting, or to specialize the rule when it is underfitting the data (Freitas, 2003). If binary encoding is used, the generalizing crossover is done using a logical OR and the specializing crossover is done using a logical AND. The example given by Freitas (ibid.) marks the crossover points in two parents and shows the offspring of generalizing crossover (OR) and the offspring of specializing crossover (AND).

Mutation

Mutation is a fairly simple operator, in which bits are flipped to alter a chromosome's genetic makeup. The mutation rate affects how often these mutations occur: a system with a high mutation rate will produce many mutated offspring. Mutation is necessary as it provides renewable variety: it allows the system to explore solutions that may not be reachable by recombination alone.

Generalizing / Specializing Mutation

Different mutation operators can be applied to generalize or specialize a rule. A simple mutation for generalizing a rule deletes one of the conditions in its antecedent part. Conversely, adding a condition to the rule's antecedent is a specializing mutation (Freitas, 2003). Another generalizing/specializing mutation operator subtracts a randomly generated value from, or adds it to, attribute-value conditions (ibid.). For example, if the condition is (years_of_experience > 20), then subtracting a randomly generated value from the threshold (e.g. years_of_experience > 10) is a generalizing mutation, and adding a randomly generated value (e.g. years_of_experience > 25) is a specializing mutation.

Fuzzy logic Operators

With a fuzzy logic representation, we need to come up with new ways to cross and mutate the individuals in our population, ensuring that the rules are still valid within the structure of the grammar.
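Before the fuzzy variants are considered, the bitstring operators described in this section can be sketched as follows. The parent chromosomes are made-up values chosen for illustration (the original bitstrings are not reproduced here), the function names are hypothetical, and applying OR/AND only to the segment between two crossover points is one reading of the generalizing/specializing crossover, stated here as an assumption.

```python
import random


def one_point_crossover(p1, p2, point):
    """Split both parents after `point` bits and swap the tails."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def segment_crossover(p1, p2, lo, hi, combine):
    """Combine the bits in [lo, hi) of both parents with `combine`;
    each child keeps the outer bits of one parent (assumed scheme)."""
    middle = "".join(combine(a, b) for a, b in zip(p1[lo:hi], p2[lo:hi]))
    return p1[:lo] + middle + p1[hi:], p2[:lo] + middle + p2[hi:]


def bit_or(a, b):
    return "1" if a == "1" or b == "1" else "0"   # generalizing


def bit_and(a, b):
    return "1" if a == "1" and b == "1" else "0"  # specializing


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


# Hypothetical eight-gene parents.
parent1, parent2 = "10110010", "01101100"

print(one_point_crossover(parent1, parent2, 3))             # cut between bits 3-4
print(segment_crossover(parent1, parent2, 2, 6, bit_or))    # generalizing
print(segment_crossover(parent1, parent2, 2, 6, bit_and))   # specializing
print(mutate(parent1, 0.1, random.Random(0)))
```

The generalizing offspring can only gain 1-bits in the combined segment and the specializing offspring can only lose them, which matches the intent of relaxing or tightening a rule.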

Mutation

Mutation is a simple enough process: we can interchange the binary functions AND and OR, leaving a syntactically correct rule; we can add or remove NOT before any of the operators, whether unary or binary; and we can substitute any of the fuzzy classification functions LOW, MEDIUM or HIGH for one another.

Crossover

Crossover, however, is a more difficult problem. One-point crossover is not a suitable method here, since it can result in syntactically incorrect rules. Figure 4 shows the result of crossing the rule shown in Figure 3 with another rule, NOT( (MEDIUM(Age) OR LOW(Age)) OR (LOW(Height)) ), at a random point.

Figure 4: Rule 1, Rule 2, and the resulting Rule 3

As the figure shows, crossing at this point has cut a binary operator in half, resulting in a rule that cannot be parsed. For this reason, it is important to look more carefully at the crossover points. Crossover may not occur after a fuzzy classification function; however, it may occur at any point where an AND, OR, or NOT branch begins.

In addition to this method of merging rules, other systems have occasionally combined rules simply by using AND or OR. This can be a useful technique if used infrequently, so we can use this combination method 10% of the time and the merging method the rest of the time (Walter, 2000).
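One way to guarantee syntactically valid offspring is to hold each fuzzy rule as a tree rather than a flat string, and to allow crossover only at nodes where an AND, OR or NOT branch begins. The sketch below uses nested tuples for the two rules quoted in the text; the tree encoding and all function names are assumptions of this sketch, not the representation used by Walter (2000).

```python
FUZZY = ("LOW", "MEDIUM", "HIGH")

# LOW(Age) AND NOT(HIGH(Height) OR HIGH(Weight))
rule1 = ("AND", ("LOW", "Age"),
         ("NOT", ("OR", ("HIGH", "Height"), ("HIGH", "Weight"))))
# NOT((MEDIUM(Age) OR LOW(Age)) OR LOW(Height))
rule2 = ("NOT", ("OR", ("OR", ("MEDIUM", "Age"), ("LOW", "Age")),
                 ("LOW", "Height")))


def is_valid(node):
    """Check that a tree follows the rule grammar."""
    op, *children = node
    if op in ("AND", "OR"):
        return len(children) == 2 and all(is_valid(c) for c in children)
    if op == "NOT":
        return len(children) == 1 and is_valid(children[0])
    return op in FUZZY and len(children) == 1 and isinstance(children[0], str)


def crossover_points(node, path=()):
    """Paths to subtrees rooted at AND, OR or NOT (never inside a fuzzy function)."""
    op, *children = node
    points = []
    if op in ("AND", "OR", "NOT"):
        points.append(path)
        for i, child in enumerate(children, start=1):
            points.extend(crossover_points(child, path + (i,)))
    return points


def subtree(node, path):
    for i in path:
        node = node[i]
    return node


def replace(node, path, new):
    """Return a copy of `node` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return node[:i] + (replace(node[i], path[1:], new),) + node[i + 1:]


def crossover(a, path_a, b, path_b):
    """Swap the subtrees found at the two chosen crossover points."""
    return (replace(a, path_a, subtree(b, path_b)),
            replace(b, path_b, subtree(a, path_a)))


def swap_and_or(node, path):
    """Mutation: interchange AND and OR at the given node."""
    op, *children = subtree(node, path)
    swapped = {"AND": "OR", "OR": "AND"}.get(op, op)
    return replace(node, path, (swapped, *children))


child1, child2 = crossover(rule1, (2, 1), rule2, (1, 1))
print(child1)
print(child2)
```

Because whole branches are exchanged, both children remain parseable, which is exactly what a random cut through the flat string cannot guarantee.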

3.6 Fitness Evaluation

Each member of the population in an evolutionary system has a fitness level, which is defined by how effective the solution is deemed to be at solving a particular problem. The general aim of any genetic algorithm is to adapt the parameters of its population in order to evolve solutions with maximal fitness. Fitness functions can be extremely complicated, as they require some method for quantitatively and qualitatively evaluating solutions when, often, what makes a good solution is not known.

In data mining, the fitness function is used to evaluate the fitness of the prediction rules. As mentioned earlier in this paper, prediction accuracy, comprehensibility and interestingness are the quality criteria for the discovered rules and can be used to measure their fitness. Two main types of fitness evaluation are used in data mining to evaluate the fitness of an individual: objective and subjective fitness evaluation.

A major issue with using evolutionary algorithms in data mining, which needs to be considered when designing the fitness function, is the interestingness of the discovered rules. Evolutionary algorithms are powerful techniques that can perform global search and generate huge numbers of rules; however, these rules can be trivial and uninteresting.

Objective Fitness Evaluation

This section discusses the objective, or quantitative, approaches to evaluating the fitness of discovered rules (Freitas, 2003). Different approaches have been proposed for designing effective objective fitness functions. These approaches are organized according to the quality criteria of discovered rules mentioned earlier. The examples used to illustrate these approaches are represented using the Michigan scheme, which means that each individual represents a single rule.

Prediction Accuracy Criteria

One of the approaches to measuring the prediction accuracy of a rule is to use the confidence factor CF (Freitas, 2003).
Suppose that the rule to be evaluated is as follows: IF A THEN B. Then the confidence factor can be calculated as:

$$\mathrm{CF} = \frac{|A \,\&\, B|}{|A|}$$

where $|A|$ is the number of instances in the data that satisfy all the conditions in the antecedent A of the rule, and $|A \,\&\, B|$ is the number of instances that satisfy all the conditions in A and are classified as class B (Freitas, 2003). For example, if $|A| = 100$ and $|A \,\&\, B| = 60$, then CF is 60%, which gives an insight into how accurate the rule is. The higher a rule's accuracy on the training set, the more likely it is to be selected. This is a very simple way to define prediction accuracy, but one obvious drawback is that it is likely to overfit the data, which would result in poor prediction performance on the test set.
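As a small illustration of the confidence factor, the following Python sketch counts the instances covered by a rule's antecedent and the fraction of them that belong to the predicted class. The data layout (a list of dictionaries with a "class" key) and the function name are assumptions made for this example only.

```python
def confidence_factor(instances, antecedent, target_class):
    """CF = |A & B| / |A| for the rule IF antecedent THEN target_class.

    instances    : list of dicts, each with a "class" entry (assumed layout)
    antecedent   : function returning True when an instance satisfies A
    target_class : the class B predicted by the rule
    """
    covered = [x for x in instances if antecedent(x)]
    if not covered:
        return 0.0  # assumption: a rule that covers nothing scores zero
    hits = sum(1 for x in covered if x["class"] == target_class)
    return hits / len(covered)
```

With 100 covered instances of which 60 belong to class B, the function returns 0.6, matching the worked example above. Computing the same value on a held-out test set is one simple way to notice the overfitting mentioned in the text.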

Another approach mentioned in (Freitas, 2003) is to use a confusion matrix. This is a 2 x 2 matrix that describes the predictive performance of the rule. Recall the previous rule example: IF A THEN B. The confusion matrix of this rule is:

Predicted class | Actual class B | Actual class Not B
B               | TP             | FP
Not B           | FN             | TN

where:
TP = True Positives = number of examples satisfying A and B
FP = False Positives = number of examples satisfying A but not B
FN = False Negatives = number of examples not satisfying A but satisfying B
TN = True Negatives = number of examples satisfying neither A nor B (Freitas, 2003).

Using this matrix, CF can be computed as follows:

$$\mathrm{CF} = \frac{TP}{TP + FP}$$

One important advantage of this approach is that it introduces a measure of rule completeness, Comp:

$$\mathrm{Comp} = \frac{TP}{TP + FN}$$

The rule fitness can now be calculated as follows:

$$\mathrm{Fitness} = \mathrm{CF} \times \mathrm{Comp}$$

Comprehensibility Criteria

The fitness function can be extended to cover the comprehensibility criterion as follows (Freitas, 2003):

$$\mathrm{Fitness} = w_1 (\mathrm{CF} \times \mathrm{Comp}) + w_2\, \mathrm{Simp}$$

where $w_1$ and $w_2$ are user-defined weights and Simp is a measure of the rule's simplicity. One obvious way to measure simplicity is to count the number of conditions in the rule: the fewer the conditions, the simpler the rule. For data mining approaches that use genetic programming, the simplicity of a rule can be measured by counting the number of nodes. One possible method, mentioned in (Freitas, 2003), is to define a maximum number of nodes in a tree (individual) and then calculate the simplicity as follows:

$$\mathrm{Simp} = \frac{\mathit{MaxNodes} - 0.5\,\mathit{NumNodes} - 0.5}{\mathit{MaxNodes} - 1}$$

Interestingness Criteria

Noda et al. (1999) proposed a fitness function composed of two parts: the first measures the degree of interestingness and the second measures the predictive accuracy. The degree-of-interestingness part itself consists of two parts. Users are expected to set the weights of the interestingness and predictive accuracy parts. Another measure of rule interestingness, called the PS measure, was introduced by Piatetsky-Shapiro, cited in (Gebhardt, 1991):

$$PS = |A \,\&\, B| - \frac{|A|\,|B|}{N}$$

where $N$ is the total number of instances. According to Piatetsky-Shapiro, as summarized in (Gebhardt, 1991), there are three principles for rule interestingness (RI) measures:

1. RI = 0 if $|A \,\&\, B| = |A|\,|B| / N$, that is, when the antecedent and the consequent of the rule are statistically independent.
2. RI monotonically increases with $|A \,\&\, B|$ when the other parameters, namely $|A|$ and $|B|$, are fixed. In this case the CF and Comp factors also increase, which means a more interesting rule.
3. RI monotonically decreases with $|A|$ or $|B|$ when the other parameters, namely $|A \,\&\, B|$, are fixed. In this case the CF and Comp factors also decrease, which means a less interesting rule.

Subjective Fitness Evaluation (Interactive Evolutionary Algorithms)

Writing an appropriate objective fitness function can be a very hard task. This is particularly true when domain knowledge or prior knowledge is not available, which makes it difficult to decide what counts as interesting knowledge. In such cases, subjective fitness evaluation can be very useful. Subjective fitness evaluation is done by human experts. In data mining, domain experts evaluate the fitness of the discovered rules according to how interesting they are. Rules can be interesting if they are unexpected and actionable for the user (Liu et al., 1997). In many domains, however, knowledge about the domain data can vary from one user to another.
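Before continuing, the objective measures defined earlier in Section 3.6 (CF, Comp, Simp and the PS measure) can be made concrete with a short Python sketch. The function names and the default weights 0.8 and 0.2 are illustrative assumptions; the source only states that the weights are user-defined.

```python
def confidence(tp, fp):
    """CF = TP / (TP + FP); zero when the rule covers no examples."""
    return tp / (tp + fp) if tp + fp else 0.0

def completeness(tp, fn):
    """Comp = TP / (TP + FN); zero when class B has no examples."""
    return tp / (tp + fn) if tp + fn else 0.0

def simplicity(num_nodes, max_nodes):
    """Simp = (MaxNodes - 0.5 * NumNodes - 0.5) / (MaxNodes - 1).

    Requires max_nodes > 1. Gives 1.0 for a single-node tree and
    0.5 for a tree that uses all max_nodes nodes.
    """
    return (max_nodes - 0.5 * num_nodes - 0.5) / (max_nodes - 1)

def rule_fitness(tp, fp, fn, num_nodes, max_nodes, w1=0.8, w2=0.2):
    """Fitness = w1 * (CF * Comp) + w2 * Simp, with assumed example weights."""
    accuracy_part = confidence(tp, fp) * completeness(tp, fn)
    return w1 * accuracy_part + w2 * simplicity(num_nodes, max_nodes)

def ps_measure(n_a_and_b, n_a, n_b, n_total):
    """PS = |A & B| - |A| * |B| / N; zero when A and B are independent."""
    return n_a_and_b - n_a * n_b / n_total
```

For instance, with TP = 60, FP = 40 and FN = 20, CF is 0.6 and Comp is 0.75, so the accuracy part of the fitness is 0.45 before the simplicity term is added.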
A user's prior knowledge of the domain can be either a general impression (GI), when the user has feelings about the domain, or reasonably precise knowledge (RPK), when the user has a definite idea. Generally, discovered rules are evaluated and ranked against these two types of concept (ibid.).

A major problem with data mining is that the discovered models do not necessarily contain important or interesting rules. They sometimes include trivial rules or, even worse, counterintuitive rules (Pazzani, 2002). One previous approach to this problem is to interact with the domain experts to evaluate the models and to find what is interesting and important. The model found is then adjusted according to the feedback from the domain experts (for example, by adding or removing variables) until an acceptable model is found (ibid.). Subjective fitness evaluation and interactive evolutionary algorithms accelerate this process and probably generate more interesting rules, because they involve the domain expert in the search process to bias the search toward models that are more novel and comprehensible.

Figure 5 (Pazzani, 2002)

Using subjective fitness evaluation in data mining offers many opportunities for future research. For example, Pazzani's idea (2002), illustrated in Figure 5, could be accomplished through the use of subjective fitness evaluation. He describes how the fields of artificial intelligence, statistics, databases and cognitive psychology should be combined to improve the performance of the multi-disciplinary field of data mining. Interactive evolutionary algorithms allow cognitive psychology to be used in developing tools and techniques for data mining and knowledge discovery, by involving the human cognitive process in the search for interesting patterns and the discovery of new knowledge from data sets.

3.7 Selection and Replacement

Selection is the process of choosing which individuals in the population to use for reproduction, and replacement is the process of selecting which individuals in the population will go through to the next generation. Whilst selection and replacement can use the same methods almost interchangeably, they do not both need to be implemented the same way in a particular genetic system: a system may use the roulette wheel method for selection and the absolute method for replacement. There are a number of different methods for making these selections:

Absolute: The n fittest individuals in the population are chosen for breeding, or the n least fit individuals are replaced.
Whilst this seems like a good strategy, it can result in losing individuals that, while being less fit, hold genetic material that could be useful in evolving even better strategies. It is important to keep a mix of genetic material within the population to stop the solutions converging prematurely at local optima.

Random: The opposite of absolute selection is random selection. Here, no regard is given to fitness at all: all individuals are selected with uniform probability. Whilst this method does preserve variety, it can take a very long time to find a good solution, and if it is used as a replacement strategy, good solutions have as much chance of being overlooked as bad ones, meaning that good solutions may never develop.

Roulette Wheel: Looking at the previous two methods, we can see that whilst it is important to focus on the individuals with higher fitness levels, we also need to ensure that we do not throw away potentially useful solutions. The roulette wheel method addresses this by picking randomly, but in proportion to the fitness of the individuals, so that very fit individuals have a higher chance of being selected for breeding, and less fit individuals have a higher chance of being replaced.

3.8 Integrating Conventional Techniques with Evolutionary Algorithms

Several hybrid approaches have been proposed that integrate evolutionary algorithms with one of the conventional techniques in order to tackle some of the problems of those techniques, for example by minimizing the number of selected features and selecting more interesting features. One successful attempt to integrate evolutionary algorithms with data mining is the approach developed by Terano and Ishino (1996). Their approach integrates an evolutionary algorithm with a machine learning technique for data mining, namely an inductive learning technique that generates decision trees. They used the inductive learning algorithm to find rules from the data, and then used an interactive evolutionary algorithm to refine these rules. Their work is discussed in greater depth in the following section.
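The three selection methods of Section 3.7 (absolute, random and roulette wheel) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: fitness values are non-negative with at least one positive value, sampling is done with replacement, and the function names are hypothetical.

```python
import random

def absolute_selection(population, fitness, n):
    """Pick the n fittest individuals (deterministic, no diversity)."""
    return sorted(population, key=fitness, reverse=True)[:n]

def random_selection(population, n, rng):
    """Pick n individuals uniformly at random, ignoring fitness."""
    return [rng.choice(population) for _ in range(n)]

def roulette_selection(population, fitness, n, rng):
    """Pick n individuals with probability proportional to fitness."""
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=n)
```

A short usage example with made-up fitness scores:

```python
scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 94.0}
rng = random.Random(1)
parents = roulette_selection(list(scores), scores.get, 2, rng)
```

For roulette wheel replacement, the weights would have to be inverted so that less fit individuals are more likely to be drawn; how to invert them is a design choice the text leaves open.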
4 Applications of Data Mining Using IEA

In this section we examine some areas where data mining with interactive evolutionary algorithm (IEA) techniques has been successfully applied. The first approach detailed is very general, in that it can be used to classify any text-based data and hence is not limited to any specific discipline. The approach requires textual data in the form of reports, which can be ordinary text files corresponding to the database from which the knowledge is to be extracted.

4.1 Extracting Knowledge from a Text Database

This technique, proposed by Sakurai et al. (2001), details a means of extracting knowledge from any database with the help of domain-dependent dictionaries. The particular application in the paper deals with text mining of daily business reports generated by an institution and the classification of those reports based on knowledge dictionaries. In their experiment, two kinds of knowledge dictionary were used: the key concept dictionary and the concept relation dictionary. The daily business reports are decomposed into words using lexical analysis, and the words are checked against the entries in the key concept dictionary. Each report is then assigned particular concepts according to the words in the report that represent those concepts in the key concept dictionary. Each report is also checked to see whether its key concepts are assigned in the concept relation dictionary. Reports are then classified according to the set of concept relations, and reports having the same text class are put into the same group. This helps end users, who can read only those reports in groups whose topics match their interests; it also gives them an indication of the trends of topics in the reports.

The key concept dictionary contains concepts having common features, concepts and related keywords, and expressions and phrases concerned with the target problem. An example of the key concept dictionary can be seen in the figure below. The concept relation dictionary contains relations, each of which describes a condition and a result; it is a mapping from key concepts to classes. Since creating a dictionary is time-consuming and prone to errors, the paper describes an automatic way of creating a concept relation dictionary.

Figure 6 (Sakurai et al., 2001)

A relation in the concept relation dictionary is like a rule and can be acquired by inductive learning if training examples are available. To build them, words are extracted from the document by lexical analysis and checked to see whether they match an expression in the key concept dictionary.
We therefore make the following assumptions: concept classes are attributes, concepts are values, and the text classes given by the reader are the result classes we want. Together these form a training example. For all attributes that do not have a value, 0 is assigned. An overview of this is depicted in the figure below.

Figure 7 (Sakurai et al., 2001)

For the inductive learning to work we need a fuzzy algorithm, because reports written by humans are not strictly consistent in their descriptions. The learning method described is therefore the IDF algorithm, which is a fuzzy algorithm. This algorithm builds rules from the generated training examples, and the generated rules have the genotype of a tree.
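The construction of a training example described above (concept classes as attributes, concepts as values, 0 for missing attributes, and the reader's text class as the result) can be sketched in Python. The dictionary contents, the regular-expression tokenizer standing in for lexical analysis, and the rule of taking the first matching concept per class are all assumptions made for illustration; they are not taken from Sakurai et al. (2001).

```python
import re

# Hypothetical key concept dictionary:
# concept class -> concept -> expressions that signal the concept.
KEY_CONCEPTS = {
    "Sales": {
        "good sales": ["sold out", "sold well"],
        "bad sales": ["remained unsold", "did not sell"],
    },
    "Display": {
        "stacked": ["stacked"],
        "shelved": ["on the shelf"],
    },
}

def build_training_example(report_text, key_concepts, text_class):
    """Turn one report into an attribute/value training example."""
    # Crude stand-in for lexical analysis: lower-case alphanumeric tokens.
    words = re.findall(r"[a-z0-9]+", report_text.lower())
    text = " " + " ".join(words) + " "
    example = {}
    for concept_class, concepts in key_concepts.items():
        example[concept_class] = 0  # attribute without a value gets 0
        for concept, expressions in concepts.items():
            # Match whole words or phrases only, not substrings of words.
            if any(f" {expr} " in text for expr in expressions):
                example[concept_class] = concept
                break
    example["class"] = text_class  # text class supplied by the reader
    return example
```

Calling `build_training_example("The new tea sold out by noon.", KEY_CONCEPTS, "best")` yields an example whose "Sales" attribute is "good sales" and whose "Display" attribute is 0. A list of such examples is the kind of input a tree-inducing learner would consume.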

The whole process can be seen in Figure 8 below, which shows the inputs and the processes that lead from the input dictionaries and data to the final outputs.

Figure 8 (Sakurai et al., 2001)

The algorithm was tested on daily business reports concerning retail sales, which were classified into 3 classes describing a sales opportunity as best, missed or other. The key concept dictionary was composed of 13 concept classes, each with its own subset of concepts. Reports that contained contradictory descriptions were regarded as unnecessary, and no training examples were generated from them. Using 10-fold cross-validation, the results showed that they were able to generate the concept relation dictionary and obtain better results than IDF on the retail reports.

4.2 Extracting Marketing Rules from User Data

Marketing decisions require good rules derived from customer data, which can be very noisy. Simulated breeding and inductive learning methods have been tested for creating such rules, and they were able to generate simple, easy-to-understand results in a form that can be used directly by the marketing agent. This work was developed by Terano and Ishino (1996). The conventional way to generate efficient decision-making rules was to use statistical methods, but these prove to be weak since they assume that the data being mined is based on linear models. Multivariate analysis, which is popularly used, fails to satisfy the need for both quantitative and qualitative analysis of the data. AI techniques, on the other hand, focus on the problem of feature selection, which is based on machine learning and aims to find the optimal number of features to describe the target concept. This does not work for the current problem, so well-known standard techniques cannot be applied to choose the appropriate features. The approach proposed by the authors is therefore to use both simulated breeding and inductive learning. Inductive learning is used to generate decision rules from the data, emphasizing the relationship between product and features, while simulated breeding is used to obtain the effective features. This work was the first of its kind to specifically address the problem of clarifying the relationship between the product image and features using user questionnaire data.

Simulated breeding is a GA-based technique for evolving offspring. Offspring that are judged by a human expert to have some desired features are allowed to breed. The judgment is made interactively. The technique is used in cases where the fitness function is hard to define. Inductive learning is used to generate the rules in the form of a decision tree, as output for the analysis of features and attribute-value pairs. This specific implementation used C4.5.

Marketing decisions must be made by analysts who need to devise promotion strategies for their product according to an abstract image of that product. They need to keep in mind that the data gathered from users is inherently noisy and is based on complicated models, so simple rules are needed to explain the characteristics of the products. In addition, choosing the product features that realize the image is left to the intuition of the experts, and there is no clear way to do this. The information therefore needs to be organized clearly in order to understand the relationship between the features and the image of the product.
The proposed algorithm consists of the following steps:
1. Inductive learning to classify data
2. Genetic operators to enhance flexibility of feature selection
3. Decision tree selection based on human judgment
4. Developing decision trees with a small number of features which fully explain the data

Offspring selection is deliberately not automated, in order to promote human creativity, which brings in appropriate explanations, and because the problem needs subjective judgment, which makes it very hard to define a formal fitness function. The analysis was carried out on oral care products, using a questionnaire filled in by 2300 users, and the knowledge obtained was tested by a domain expert at the manufacturing company. The domain expert must know the basic principles of inductive learning (IL) and statistics and must understand the outputs obtained. Using the output decision trees, she interactively evaluates the quality of the obtained knowledge.
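The four steps above suggest a human-in-the-loop cycle, which the following Python sketch outlines in the spirit of simulated breeding. It is not the authors' implementation: individuals are assumed to be boolean feature masks, `learn_tree` is a hypothetical callback standing in for an inductive learner such as C4.5, and `judge` is a hypothetical callback standing in for the expert's interactive choice. Population size, generation count and mutation rate are arbitrary example values.

```python
import random

def simulated_breeding(n_features, learn_tree, judge, rng,
                       pop_size=6, generations=3, mutation_rate=0.1):
    """Evolve feature subsets, letting a judge choose which ones breed.

    n_features : number of candidate features (must be at least 2)
    learn_tree : callback, feature mask -> decision tree (any object)
    judge      : callback, (population, trees) -> indices of at least
                 two individuals the expert wants to keep and breed
    """
    # Start from random feature subsets.
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Inductive learning builds one tree per feature subset.
        trees = [learn_tree(mask) for mask in population]
        # The expert inspects the trees and selects the parents.
        parents = [population[i] for i in judge(population, trees)]
        # Genetic operators create the rest of the next generation.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]          # bit-flip mutation
            children.append(child)
        population = parents + children
    return population
```

In an interactive setting `judge` would display the trees and wait for the expert's selection. For automated testing it can be replaced by a stub, for example one that prefers trees built from fewer features, which loosely mirrors the goal of step 4.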


More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

New Modifications of Selection Operator in Genetic Algorithms for the Traveling Salesman Problem

New Modifications of Selection Operator in Genetic Algorithms for the Traveling Salesman Problem New Modifications of Selection Operator in Genetic Algorithms for the Traveling Salesman Problem Radovic, Marija; and Milutinovic, Veljko Abstract One of the algorithms used for solving Traveling Salesman

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Volume 3, Issue 2, February 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 2, February 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 2, February 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com

More information

A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

More information

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

More information

An evolutionary learning spam filter system

An evolutionary learning spam filter system An evolutionary learning spam filter system Catalin Stoean 1, Ruxandra Gorunescu 2, Mike Preuss 3, D. Dumitrescu 4 1 University of Craiova, Romania, catalin.stoean@inf.ucv.ro 2 University of Craiova, Romania,

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

NEURAL NETWORKS IN DATA MINING

NEURAL NETWORKS IN DATA MINING NEURAL NETWORKS IN DATA MINING 1 DR. YASHPAL SINGH, 2 ALOK SINGH CHAUHAN 1 Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India 2 Lecturer, United Institute of Management, Allahabad,

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant

More information

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley)

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley) Machine Learning 1 Attribution Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley) 2 Outline Inductive learning Decision

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information

Neural Networks and Back Propagation Algorithm

Neural Networks and Back Propagation Algorithm Neural Networks and Back Propagation Algorithm Mirza Cilimkovic Institute of Technology Blanchardstown Blanchardstown Road North Dublin 15 Ireland mirzac@gmail.com Abstract Neural Networks (NN) are important

More information

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Comparison of Major Domination Schemes for Diploid Binary Genetic Algorithms in Dynamic Environments

Comparison of Major Domination Schemes for Diploid Binary Genetic Algorithms in Dynamic Environments Comparison of Maor Domination Schemes for Diploid Binary Genetic Algorithms in Dynamic Environments A. Sima UYAR and A. Emre HARMANCI Istanbul Technical University Computer Engineering Department Maslak

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Alpha Cut based Novel Selection for Genetic Algorithm

Alpha Cut based Novel Selection for Genetic Algorithm Alpha Cut based Novel for Genetic Algorithm Rakesh Kumar Professor Girdhar Gopal Research Scholar Rajesh Kumar Assistant Professor ABSTRACT Genetic algorithm (GA) has several genetic operators that can

More information

Genetic Algorithms and Sudoku

Genetic Algorithms and Sudoku Genetic Algorithms and Sudoku Dr. John M. Weiss Department of Mathematics and Computer Science South Dakota School of Mines and Technology (SDSM&T) Rapid City, SD 57701-3995 john.weiss@sdsmt.edu MICS 2009

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information

GPSQL Miner: SQL-Grammar Genetic Programming in Data Mining

GPSQL Miner: SQL-Grammar Genetic Programming in Data Mining GPSQL Miner: SQL-Grammar Genetic Programming in Data Mining Celso Y. Ishida, Aurora T. R. Pozo Computer Science Department - Federal University of Paraná PO Box: 19081, Centro Politécnico - Jardim das

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Stock price prediction using genetic algorithms and evolution strategies

Stock price prediction using genetic algorithms and evolution strategies Stock price prediction using genetic algorithms and evolution strategies Ganesh Bonde Institute of Artificial Intelligence University Of Georgia Athens,GA-30601 Email: ganesh84@uga.edu Rasheed Khaled Institute

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

HELP DESK SYSTEMS. Using CaseBased Reasoning

HELP DESK SYSTEMS. Using CaseBased Reasoning HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets

A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets Artificial Intelligence in Medicine 30 (2004) 27 48 A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets Celia C. Bojarczuk a,1, Heitor

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information
