Comparison of K-means and Backpropagation Data Mining Algorithms

Nitu Mathuriya, Dr. Ashish Bansal

Abstract

Data mining has matured as a field of basic research in computer science and is increasingly applied in many other fields. Many researchers recognize it as a key research topic in data handling and as an important approach to knowledge discovery. It is the process of extracting patterns or predicting previously unknown and useful trends from large quantities of data, drawing on multidisciplinary fields such as statistics, pattern recognition, artificial intelligence, machine learning, databases, and management information systems. Traditional data mining techniques for clustering and classification include K-means, fuzzy C-means, and decision trees. The artificial neural network (ANN) is a data mining technique that differs from these traditional techniques. It is a nonlinear, adaptive dynamic system made of many cells that simulates the structure of biological neural systems. It can build a model by analyzing the patterns in the data and so discover unknown knowledge. In this study, the K-means and back propagation algorithms were applied to the recruitment data of an organization. Experiments were conducted with data collected from an engineering college to support its hiring decisions. The comparative study shows that the K-means clustering algorithm is not well suited to the problem, while the back propagation neural network performs well.

Index Terms: Artificial Neural Network, Back propagation, Data mining, K-means

I. INTRODUCTION

Data mining is the term used to describe the process of extracting value from a database. A data warehouse is a location where information is stored. The type of data stored depends largely on the type of industry and the company.
Many companies store every piece of data they have collected, while others are more ruthless in what they deem to be important [4]. Consider, as an example, an organization recruiting candidates. The present paper gives an engineering application of data mining based on neural networks.

Manuscript received April 15. Nitu Mathuriya, Computer Science, RDPV / Shri Vaishnav Institute of Technology and Science, Indore, India (nitumathurya@gamil.com).

In this paper, the back propagation (BP) neural network is used as the data mining technique to analyze the effects of structural technologic parameters on efficiency in resume filtering. The results of experiments that cluster the data with K-means and classify it with the back propagation algorithm are analyzed. Accuracy of classification is used as the performance metric. Details of the data sets are given in Section VI.

In fuzzy clustering methods such as fuzzy C-means, each feature vector representing the data has a degree of membership in all the clusters, and the algorithm works to minimize an objective function. The K-means algorithm, by contrast, starts with a randomly chosen centre for each cluster and assigns each record in the training set to the cluster whose centre is nearest. It then recalculates the centres of the clusters and continues until there is no significant change in their values. K-means works under the assumption that all attributes are independent and normally distributed. This study is intended to analyze the issues involved in the recruitment of fresh graduates and to find a way to reduce the time and cost involved [1][3].

Artificial intelligence has been defined as the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
Artificial intelligence has provided a number of useful methods for data mining through machine learning. Machine learning is the subfield of artificial intelligence concerned with the design and development of algorithms that allow computers to improve their performance over time (to learn) based on data, such as sensor data or databases. Some of the most popular learning systems include neural networks and support vector machines. A major focus of machine learning research is to automatically produce (induce) models, such as rules and patterns, from data. Hence, machine learning is closely related to fields such as data mining [2].

Dr. Ashish Bansal, Information Technology, Shri Vaishnav Institute of Technology and Science, Indore, India (ashssi@rediffmail.com).

II. BRIEF INTRODUCTION OF DATA PREPROCESSING AND DATA MINING

This paper considers two popular data mining techniques, one traditional and one based on neural networks: K-means and the back-propagation neural network [1][4].

Figure: 1 Data mining process (State the problem and formulate the hypothesis; Collect the data; Data preprocessing; Estimate the model; Interpret the model and draw conclusions)

A. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain. Hence, domain-specific knowledge and experience are usually necessary in order to come up with a meaningful problem statement. Unfortunately, many application studies tend to focus on the data-mining technique at the expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis.

B. Collect the data

This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler): this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process: this is known as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used in a final application of the results.

C. Data Preprocessing

Data-preprocessing steps should not be considered completely independent from other data-mining phases. In each iteration of the data-mining process, all activities, together, could define new and improved data sets for subsequent iterations.
Generally, a good preprocessing method provides an optimal representation for a data-mining technique by incorporating a priori knowledge in the form of application-specific scaling and encoding. The preprocessing phase can be functionally divided into two subphases: data preparation and data-dimensionality reduction.

D. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; in practice, the implementation is usually based on several models, and selecting the best one is an additional task.

E. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models need to be interpretable in order to be useful, because humans are not likely to base their decisions on complex "black-box" models. Note that the goals of accuracy of the model and accuracy of its interpretation are somewhat contradictory. Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, also very important, is considered a separate task, with specific techniques to validate the results. A user does not want hundreds of pages of numeric results, which cannot be understood, summarized, interpreted, or used for successful decision making.

III. K-MEANS CLUSTERING

K-means is one of the simplest unsupervised learning algorithms for clustering problems. The algorithm aims at forming k clusters of n objects such that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. The algorithm randomly selects k of the n objects, and each of them is assigned to one cluster to represent the cluster mean or center [1]. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. A new mean is then computed for each cluster, and the process iterates until the criterion function converges. A square-error criterion is used, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^{2} \qquad (1)$$

where p is the point representing an object and m_i is the mean of cluster C_i.

The steps of the algorithm are:

1. Arbitrarily choose K points in the space representing the objects that are being clustered. These points represent the initial group centroids.
2. Assign each remaining object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move [3][7].
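The four steps above, together with the square-error criterion of equation (1), can be sketched in a few lines of Python. This is a minimal illustration and not the MATLAB implementation used in the experiments: the function names, the `max_iter` and `seed` parameters, the handling of empty clusters, and the sample points are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels, E)."""
    rng = random.Random(seed)
    # Step 1: pick k of the n objects as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every object to the closest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # An empty cluster keeps its old centroid (a choice made here).
            new_centroids.append(mean_point(members) if members else centroids[i])
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Square-error criterion E of equation (1).
    error = sum(squared_distance(p, centroids[lab])
                for p, lab in zip(points, labels))
    return centroids, labels, error


# Hypothetical two-dimensional points, invented for illustration only.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels, error = k_means(data, k=2)
```

Because the initial centroids are chosen at random, different seeds can lead to different final clusters, and the procedure may settle in a local minimum of E rather than the best possible partition.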

IV. BACKPROPAGATION ALGORITHM

Backpropagation, or propagation of error, is a common method of teaching artificial neural networks how to perform a given task. The back propagation algorithm is used in layered feed-forward ANNs. This means that the artificial neurons are organized in layers and send their signals forward, and then the errors are propagated backwards. The back propagation algorithm uses supervised learning, which means that we provide the algorithm with examples of the inputs and outputs we want the network to compute, and then the error (the difference between actual and expected results) is calculated. The idea of the back propagation algorithm is to reduce this error until the ANN learns the training data [6][9].

Summary of the technique:

1. Present a training sample to the neural network.
2. Compare the network's output to the desired output from that sample. Calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
6. Repeat the steps above on the neurons at the previous level, using each one's "blame" as its error [8][5].

Actual algorithm:

1. Initialize the weights in the network (often randomly).
2. Repeat
   * For each example e in the training set do
     1. O = neural-net-output(network, e); forward pass
     2. T = teacher output for e
     3. Calculate error (T - O) at the output units
     4. Compute delta_wi for all weights from hidden layer to output layer; backward pass
     5. Compute delta_wi for all weights from input layer to hidden layer; backward pass continued
     6. Update the weights in the network
   * End
3. Until all examples are classified correctly or the stopping criterion is satisfied.
4. Return (network).

Figure: 2 Back propagation network (input layer, hidden layer, output layer)

V. PROPOSED MODEL

The design of the system requires a complete understanding of the problem domain. The data sets and the input attributes are determined through knowledge engineering in an IT industry. The process involves defining the problem, identifying relevant stakeholders, and learning about current solutions to the problem. It also involves learning domain-specific terminology, the description of the problem, and its restrictions. In this step, interviews were conducted with the domain experts to obtain the information required to solve the problem, knowledge was extracted from the collected information, and a knowledge base was built. Constructing the knowledge base comprises collecting sample data and deciding which data will be needed with respect to the data mining and knowledge discovery goals, including its format and size [3].

Figure: 3 A data mining framework for the problem (Knowledge acquisition from the domain experts; Preprocessing the data; Apply traditional data mining technique; Apply neural network data mining technique; Evaluate the algorithm and choose the structure with highest accuracy; Discuss with the domain experts and choose the rules that best fit the problem)

Fig. 3 shows the steps involved in the mining process. The mining process begins with gathering knowledge from the domain experts. Knowledge acquisition is a process that includes elicitation, collection, analysis, modeling, and validation of knowledge for knowledge engineering.
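As a concrete companion to the backpropagation procedure of Section IV, the following Python sketch trains a network with one hidden layer on a toy problem. It is a minimal illustration under stated assumptions, not the network used in the experiments: the class name `TinyNetwork`, the sigmoid activation, the bias handling, the learning rate, the fixed number of epochs, and the logical-AND training set are all choices made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """Feed-forward network with one hidden layer of sigmoid units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Step 1: initialize the weights randomly; the last weight of each
        # row is a bias term (an implementation choice of this sketch).
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        """Forward pass: returns (hidden activations, outputs O)."""
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb)))
               for row in self.w_out]
        return hidden, out

    def train_example(self, x, target, lr=0.5):
        """One forward and one backward pass for a single example."""
        hidden, out = self.forward(x)
        xb, hb = list(x) + [1.0], hidden + [1.0]
        # Error (T - O) at the output units, scaled by the sigmoid slope.
        delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
        # "Blame" passed back to each hidden unit through the output weights.
        delta_hidden = [
            h * (1.0 - h) * sum(d * self.w_out[k][j]
                                for k, d in enumerate(delta_out))
            for j, h in enumerate(hidden)
        ]
        # Update hidden-to-output weights, then input-to-hidden weights.
        for k, d in enumerate(delta_out):
            for j, v in enumerate(hb):
                self.w_out[k][j] += lr * d * v
        for j, d in enumerate(delta_hidden):
            for i, v in enumerate(xb):
                self.w_hidden[j][i] += lr * d * v


def total_error(net, examples):
    """Sum of squared errors (T - O)^2 over all examples."""
    return sum((t - o) ** 2
               for x, target in examples
               for t, o in zip(target, net.forward(x)[1]))


# Toy training set (logical AND), invented for illustration only.
examples = [((0, 0), (0,)), ((0, 1), (0,)), ((1, 0), (0,)), ((1, 1), (1,))]
net = TinyNetwork(n_in=2, n_hidden=2, n_out=1)
error_before = total_error(net, examples)
for _ in range(2000):  # stopping criterion: a fixed number of epochs
    for x, target in examples:
        net.train_example(x, target)
error_after = total_error(net, examples)
```

Comparing `error_before` with `error_after` shows how far training has reduced the error. A practical stopping criterion would more likely monitor this error, or the accuracy on held-out records, instead of running a fixed number of epochs.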

An important issue in knowledge acquisition is that the knowledge is hidden within the domain experts and does not reside with a single expert. Interviews were conducted with the domain experts to understand the problem and the knowledge required to solve it. The knowledge acquired is used along with the recruitment database maintained in the industry to form the dataset for experimentation [3].

The data collected from the industry is complex and contains noisy, missing, and inconsistent values. The data is preprocessed to improve its quality and make it fit for the data mining task. The data is transformed into appropriate formats to support meaningful analysis. Some more attributes are derived using the acquired knowledge to support the mining process.

The traditional data mining algorithm and the back propagation neural network algorithm are then applied to the preprocessed data, the accuracy of each algorithm is evaluated, and the accuracies of K-means and back propagation are compared. Accuracy is the most important factor in evaluating any data mining model, so the model that gives better accuracy for the problem is selected in discussion with the domain experts. The constructed models were reviewed and evaluated before being used for decision support, with accuracy as the criterion for assessing the performance of each method. Constructive rules were extracted from the technique that had better accuracy.

VI. EXPERIMENTAL RESULTS

The proposed system has been implemented and evaluated with extensive experimentation on the collected datasets. This section presents the details of the data sets, the test results, and their comparison. The metric used to evaluate the clustering and classification algorithms, and to decide the best-suited model, is accuracy. Accuracy is determined as the ratio of records correctly classified during testing to the total number of records tested. The clusters formed were verified for correctness to determine the error.

Two datasets containing the details of applicants to the industry were used for experimentation. The first dataset consists of 400 records and the second of 850 records. More than 60% of the records in the data belong to the rejected category. Hence the machine learning algorithms were very good at recognizing rejected records but were largely unable to identify selected records. Therefore the dataset was deliberately rebalanced so that an almost equal number of records from both categories was used for experimentation. When the records were chosen for the learning process, the distribution of the status in the original data was maintained. The algorithms were trained with records of one dataset and tested with the records in the other dataset. K-means and back propagation were applied with MATLAB, and the accuracy of the algorithms is shown in Table 1, Table 2, Table 3, and Table 4.

Table 1 Results of algorithms trained with Dataset1 and tested with Dataset1

  Algorithm           Accuracy (%)
  K-means             72
  Back propagation

Table 2 Results of clustering algorithms trained with Dataset2 and tested with Dataset2

  Algorithm           Accuracy (%)
  K-means
  Back propagation

Table 3 Results of algorithms trained with Dataset1 and tested with Dataset2

  Algorithm           Accuracy (%)
  K-means             71
  Back propagation

Table 4 Results of algorithms trained with Dataset2 and tested with Dataset1

  Algorithm           Accuracy (%)
  K-means             75
  Back propagation

It is observed that the K-means technique has poor accuracy and is not suitable for this problem domain due to the nature of the data. The accuracies shown in the tables above indicate that the backpropagation algorithm gives better results than the K-means algorithm and is suitable for the problem.
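The accuracy metric defined in this section, the ratio of correctly classified records to the total number of records tested, can be sketched in Python as follows. The paper does not state how K-means clusters were matched to the selected and rejected categories, so the majority-vote mapping in `clusters_to_classes` is an assumption of this sketch, and the labels are hypothetical values invented for illustration.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Ratio of correctly classified records to the total number tested."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def clusters_to_classes(cluster_ids, actual):
    """Map each cluster to the most common true label among its members.

    Assumption of this sketch: the paper does not say how clusters were
    matched to the selected/rejected categories.
    """
    members = {}
    for c, a in zip(cluster_ids, actual):
        members.setdefault(c, Counter())[a] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in members.items()}
    return [majority[c] for c in cluster_ids]


# Hypothetical labels, invented for illustration only.
actual = ["rejected", "rejected", "selected", "selected", "rejected"]
predicted = ["rejected", "selected", "selected", "selected", "rejected"]
print(accuracy(predicted, actual))  # 0.8
```

For a classifier such as the back propagation network, `accuracy` applies directly to the predicted labels. For K-means, the cluster indices would first be converted to class labels, for example with `clusters_to_classes`, before the same ratio is computed.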
The variation in accuracy depends on the training and testing data sets.

VII. CONCLUSION

The problem domain has been studied to extract the knowledge base required to solve the problem. Datasets have been collected and analyzed to identify the input attributes to be used by the algorithms. Popular clustering and classification techniques were deployed in solving the problem. It was observed that K-means clustering is not suitable for this type of data distribution. The popular backpropagation algorithm was also applied to the problem, and it was observed that the model constructed with backpropagation has better accuracy. A set of experiments was conducted to test the proposed approach using a well-defined set of data mining problems. The results indicate that, using the proposed approach, high-quality or useful knowledge can be discovered from the given data sets. In future, this technique can be applied to select deserving candidates for an organization. The use of neural networks in data mining is a promising field of research, especially given the ready availability of large data sets and the reported ability of neural networks to detect and assimilate relationships between a large number of variables.

REFERENCES

[1] Arun K. Pujari, Data Mining Techniques, Universities Press (India) Private Limited, * th impression, 2005.
[2] Haykin, S., Neural Networks, Prentice Hall International Inc., 1999.
[3] N. Sivaram, Research Scholar and V, Department of CSE, Clustering and Classification Algorithms for Recruitment Data Mining, International Journal of Computer Applications ( ), Volume 4, No. 5, July 2010.
[4] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, San Francisco.
[5] Artificial Neural Network, Wikipedia Encyclopedia, Wikimedia Foundation, Inc.
[6] Dr. Yashpal Singh, Alok Singh Chauhan, Journal of Theoretical and Applied Information Technology. Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India.
[7] Bhavani Thuraisingham, Data Mining: Technologies, Techniques, Tools and Trends, CRC Press.
[8] Bradley, I., Introduction to Neural Networks, Multinet Systems Pty Ltd.

Nitu Mathuriya: Bachelor of Engineering (Computer Science), Master of Engineering (Computer Science), Shri Vaishnav Institute of Technology & Science.

Dr. Ashish Bansal: Ph.D. (Information Technology), subject: Techniques for enhancing watermarking using neural networks, Rajiv Gandhi University, Bhopal. Published more than 30 papers in international and national journals and conferences.