DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM
WEKA 3.6.7

UNDER THE GUIDANCE OF
Dr. N.P. DHAVALE, DGM, INFINET Department

SUBMITTED TO
INSTITUTE FOR DEVELOPMENT AND RESEARCH IN BANKING TECHNOLOGY

SUBMITTED BY
R. KAMALESWARI and E. MANIKANDAN
MBA BANKING TECHNOLOGY
PONDICHERRY UNIVERSITY
CERTIFICATE

This is to certify that the project report titled DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM, submitted by R. KAMALESWARI and E. MANIKANDAN, both of MBA, FINAL year, Dept. of BANKING TECHNOLOGY, PONDICHERRY UNIVERSITY, is a record of bona fide work carried out by them under my guidance during the period 10th May 2012 to 6th July 2012 at the Institute for Development and Research in Banking Technology, Hyderabad.

The project work is a research study, which has been successfully completed as per the set objectives.

Dr. N.P. DHAVALE
DGM, INFINET office
IDRBT, Hyderabad
INDEX

1. Introduction
2. WEKA Analyses
3. Algorithms in WEKA
4. WEKA Installation
5. File Formats
6. WEKA Interfaces
7. How to work with WEKA
8. Pattern Analysis
9. Pros and Cons of the Tool
10. Conclusion
INTRODUCTION

Data mining is the process of extracting hidden predictive information from large databases. It is a powerful new technology with great potential to help focus on the most important information in a dataset. Data mining tools predict future trends and behaviours, supporting proactive, knowledge-driven decisions, and can answer business questions that were traditionally time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

The aim of this project is to identify the pattern of events happening in the network, together with their periodicity and location, in order to support managerial decisions that make the best use of cost and time. Data mining tools help to explore large datasets and find patterns with minimal effort. In this project, the WEKA tool has been used to explore the given network data, and the pros and cons of the tool have also been analyzed.

WEKA ANALYSES

GENERAL INFORMATION

Data Mining Tool: WEKA
Edition: version 3.6.7
Website: http://www.cs.waikato.ac.nz/ml/weka/
License: OSS, GNU GPL version 2
Cost: Free
System Requirements: Java 1.5.0 or later version

SYSTEM FEATURES

OS platform: Windows, Mac OS X, Linux
Framework: Java
Scripting Language: Java
Human Interaction: Manual
Interoperability: self
Documentation: 3
Ease of Learning: 4
Usability: 4
Support Community: 5
Extensibility: 5
Reliability: 5
Installation: 5
Data IO/Pre-processing: 3
Data Visualization: 3

Note: The system feature evaluation was based on a 5-point scale, with higher scores indicating better results (high/comprehensive/easy/simple) and lower scores indicating negative results (low/none/difficult/complex).

DATA MINING FUNCTIONALITY

Bayesian Network
Decision Tree
Neural Network
Feature Selection
Clustering
Association Rules
Model Information
Evaluation

DATA SOURCE CHARACTERISTICS

Data Size: Medium
Oracle, Sybase, SQL Server, MySQL, PostgreSQL, MS Access, ODBC, JDBC, ARFF, CSV, MS Excel
no no no no no no no no

APPLICATION DEVELOPMENT SCOPE

WEB
Web Application: possible
Web Interface framework: JSP - Java Server Pages
Ease of development: medium
DESKTOP
Desktop Application: possible
Desktop Interface framework: Java
Ease of development: medium
Widget development: possible

ALGORITHMS IN WEKA

69 data preprocessing tools
76 classification/regression algorithms
12 clustering algorithms
17 attribute/subset evaluators + 11 search algorithms for feature selection
6 algorithms for finding association rules

WEKA INSTALLATION

Download the self-extracting executable that includes Java VM 1.5 and save it to a folder of your choice. To install the tool on Windows, double-click the file weka-3-6-7jre.exe and accept all default values. After the installation completes, WEKA will start.
FILE FORMATS

The default file format of WEKA is ARFF (Attribute-Relation File Format), which is an ASCII text file. An ARFF file contains two sections: a header section followed by a data section. An example of an ARFF file is shown in the figure.

An EXCEL sheet can easily be converted into a CSV file, which can then be loaded into the WEKA tool for analysis. However, the CSV loader in WEKA is not robust: the tool accepts the CSV file only if the EXCEL data has no formatting, no formulas in the cells, and continuous data without any break between the columns.

WEKA INTERFACES

The GUI of WEKA supports four interfaces.

Simple CLI- gives users who do not have a graphical interface option the ability to execute commands from a terminal window.
Explorer- lets the user apply machine learning algorithms to the dataset individually and analyse the corresponding results.
Experimenter- allows users to conduct different experiments to find out which classification algorithms are best suited to the dataset.
Knowledge Flow- offers basically the same functionality as the Explorer, with drag-and-drop operation. The advantage of this option is that it supports incremental learning from previous results.

HOW TO WORK WITH WEKA

EXPLORER

There are six tabs in the Explorer:
1. Preprocess- used to choose the data file to be used by the application and to apply various filters to the dataset.
2. Classify- used to train and test different learning schemes on the pre-processed data file.
3. Cluster- used to apply different functions that identify clusters within the data file.
4. Associate- used to apply different rules to the data file that identify associations within the data.
5. Select attributes- used to apply different rules that reveal how results change when selected attributes are included in or excluded from the experiment.
6. Visualize- used to see what the various manipulations produced on the dataset in 2D form, as scatter plot and bar graph output.

Preprocess tab

To load the CSV file into WEKA, click on the Open File button and choose the CSV file for analysis from the popup menu, as shown in the figure.
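Because the CSV loader is sensitive to stray formatting, one option is to clean the exported CSV and write it out as ARFF before opening it in the Preprocess tab. The following is a minimal sketch in Python, not part of WEKA itself. The function name csv_to_arff, the relation name and the sample rows are hypothetical, and the rule "a column is numeric if every non-empty cell parses as a number, otherwise nominal" is a simplifying assumption.

```python
import csv
import io


def csv_to_arff(csv_text, relation="network_events"):
    """Convert CSV text (header row first) into ARFF text.

    Assumption: a column is numeric when all of its non-empty cells
    parse as numbers; every other column is treated as nominal.
    """
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(csv_text))
        if any(cell.strip() for cell in row)  # drop completely blank lines
    ]
    if not rows:
        raise ValueError("no header row found")
    header, data = rows[0], rows[1:]
    for number, row in enumerate(data, start=1):
        if len(row) != len(header):
            raise ValueError(f"data row {number} has {len(row)} cells, "
                             f"expected {len(header)}")

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    def quote(text):
        if text == "":
            return "?"  # ARFF marker for a missing value
        if any(ch in text for ch in " ,'\"%{}"):
            escaped = text.replace("\\", "\\\\").replace("'", "\\'")
            return "'" + escaped + "'"
        return text

    lines = ["@relation " + quote(relation), ""]
    for index, name in enumerate(header):
        column = [row[index] for row in data if row[index] != ""]
        if column and all(is_number(value) for value in column):
            lines.append("@attribute " + quote(name) + " numeric")
        else:
            nominal = ",".join(quote(value) for value in sorted(set(column)))
            lines.append("@attribute " + quote(name) + " {" + nominal + "}")
    lines += ["", "@data"]
    lines += [",".join(quote(cell) for cell in row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample rows using the attribute names from this report.
sample = (
    "link_id,mins,incident_summary\n"
    "L001,12,power failure\n"
    "L002,45,fibre cut\n"
    "\n"
    "L001,7,power failure\n"
)
arff = csv_to_arff(sample)
print(arff)
```

The header section produced this way declares link_id and incident_summary as nominal attributes and mins as numeric, followed by the data section.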
Once the dataset is loaded, the tool identifies the attributes in the dataset, the number of instances and so on. The visualization in the figure below shows the relationship between two attributes, which can be selected manually by the user.

Once the initial data has been selected and loaded, the user can select options for refining the data. The options in the preprocess window include optional filters to apply, and the user can select or remove different attributes of the dataset as necessary to identify specific information. There are many different filtering options available within
the preprocessing window, and the user can select among them based on need and the type of data present. The Choose button allows the user to select various filters for preprocessing.

Classify tab

The user has the option of applying many different algorithms to the dataset, producing a representation of the information that makes observation easier. It is difficult to know in advance which of the options will provide the best output for the experiment. The best approach is to apply a mixture of the available choices independently and see which yields something close to the desired results. The Classify tab is where the user selects the classifier. There are four test options in the Classify tab: use training set, supplied test set, cross-validation and percentage split. The following figure shows some of the classification algorithms.

The result of the algorithm is displayed in textual form on the lower right. Different algorithms can be applied to the same dataset, and their results can be compared by moving back and forth between the results listed on the lower left.
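To make the difference between the cross-validation and percentage split test options concrete, the following Python sketch shows how instance indices could be divided in each case. It is an illustration only, not WEKA's code. The function names, the seed and the values of 10 folds and 66 percent are assumptions chosen for the example, and the sketch does not stratify the folds by class.

```python
import random


def percentage_split(n_instances, train_percent=66, seed=1):
    """Shuffle the instance indices and cut them into train and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


def cross_validation_folds(n_instances, k=10, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = percentage_split(20)
print("percentage split:", len(train), "training,", len(test), "test")

for fold_number, (train, test) in enumerate(cross_validation_folds(20, k=10), 1):
    print("fold", fold_number, "tests on instances", sorted(test))
```

With a percentage split each instance is used either for training or for testing. With cross-validation every instance is used for testing once and for training in all the other folds, which is why it gives a steadier estimate on small datasets.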
Cluster tab

The Cluster tab opens the process used to identify clusters of occurrences within the dataset and to produce information for the user to analyze. A few options in the cluster window are similar to those described for the Classify tab: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which measures how well the clusters match a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes of the dataset. This can be useful if specific attributes cause the results to be out of range, or for large datasets. The following figure shows the Cluster window and some of its options.

Associate tab

The Associate tab opens a window for selecting the options for finding associations within the dataset. The user selects one of the choices to yield the results.

Select Attributes tab

The Select Attributes tab is used to select the specific attributes used for the calculation process. By default, all of the available attributes are used in the evaluation of the dataset. If the user wanted to exclude certain categories of the data, they would deselect
those specific choices from the list in the cluster window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of attributes and then performs the necessary search for commonality within the data.

Visualize tab

The Visualize tab plots the instances in 2D for various pairs of attributes in the dataset. The following figure shows the visualization window.

EXPERIMENTER

The Experimenter enables the user to create, run, modify and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine whether one of the schemes is (statistically) better than the others.
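As an illustration of what "statistically better" can mean, the sketch below computes a plain paired t statistic over the per-run accuracies of two schemes on the same data splits. This is the textbook test, written in Python for illustration; the test that the Experimenter applies may be a corrected variant. The function name and the accuracy values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two schemes evaluated on the same runs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical accuracies of two schemes over the same 10 runs.
scheme_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.82]
scheme_b = [0.76, 0.76, 0.80, 0.74, 0.78, 0.78, 0.75, 0.80, 0.76, 0.77]

t = paired_t_statistic(scheme_a, scheme_b)
# 2.262 is the two-tailed 5% critical value of Student's t with 9 degrees of freedom.
print("t =", round(t, 2), "significant at 5%:", abs(t) > 2.262)
```

Pairing the runs matters because both schemes see the same splits, so the comparison is made on the differences rather than on two independent averages.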
To work with the Experimenter in simple mode, the user has to enter one or more datasets in the lower left and the classification algorithms to be compared in the lower right. An example is shown in the figure.

After loading the datasets and the algorithms, the next step is to run the experiment. On clicking the Run button in the Run tab, the log window displays the error report. The following figure is the error log of the experiment shown in the previous figure.
The next step is to analyse the output. To do so, click on the Experiment button and then click Perform test. The statistical comparison of the algorithms chosen for the datasets is displayed in the lower right frame. The following figure shows the result of the experiment under the Analyse tab.

The Experimenter also allows the user to distribute the load across different hosts, for which the user has to work in the advanced mode of the Experimenter interface.

KNOWLEDGE FLOW

The KnowledgeFlow presents a "data-flow" inspired interface to WEKA. The user can select WEKA components from a tool bar, place them on a layout canvas and connect them together to form a "knowledge flow" for processing and analyzing data. At present, all of WEKA's classifiers and filters are available in the KnowledgeFlow, along with some extra tools.

The KnowledgeFlow can handle data either incrementally or in batches (the Explorer handles batch data only). Of course, learning from data incrementally requires a classifier that can be updated on an instance-by-instance basis.
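To show what "updated on an instance-by-instance basis" involves, the following Python sketch implements a small naive Bayes learner for nominal attributes whose counts are adjusted one instance at a time. It illustrates the idea only and is not WEKA code. The class name, the smoothing rule and the stream of events are hypothetical.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
        self.seen_values = defaultdict(set)       # attribute index -> distinct values

    def update(self, attributes, label):
        """Fold a single labelled instance into the counts."""
        self.class_counts[label] += 1
        for index, value in enumerate(attributes):
            self.value_counts[(index, label)][value] += 1
            self.seen_values[index].add(value)

    def predict(self, attributes):
        """Return the most probable label, or None before any update."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for index, value in enumerate(attributes):
                seen = self.value_counts[(index, label)][value]
                # Add-one smoothing, with one extra slot for unseen values.
                score += math.log((seen + 1) / (count + len(self.seen_values[index]) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical stream of (location, time of day) -> reason for failure.
stream = [
    (("north", "night"), "power"),
    (("south", "day"), "fibre"),
    (("north", "night"), "power"),
    (("south", "day"), "fibre"),
]

model = IncrementalNaiveBayes()
for attributes, label in stream:
    guess = model.predict(attributes)  # predict first, then learn from the instance
    print(attributes, "predicted:", guess, "actual:", label)
    model.update(attributes, label)
```

Because the model only keeps counts, it never needs the earlier instances again, which is what allows a flow to process a stream without holding the whole dataset in memory.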
Features of the KnowledgeFlow

Intuitive data-flow style layout
Process data in batches or incrementally
Process multiple batches or streams in parallel (each separate flow executes in its own thread)
View models produced by classifiers for each fold in a cross-validation
Visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)

An example of the data flow of an experiment using the KnowledgeFlow interface, along with the textual output, is shown in the figure.

SIMPLE CLI

Behind WEKA's interactive interfaces (the Explorer, the Knowledge Flow and the Experimenter) lies its basic functionality. This can be accessed in raw form through a
command-line interface. Select Simple CLI from the interface choices to bring up a plain textual panel with a line at the bottom on which you enter commands. Alternatively, use the operating system's command-line interface to run the classes in weka.jar directly, in which case you must first set the CLASSPATH environment variable. The window of the Simple CLI is shown in the figure.

PATTERN ANALYSIS

The ultimate goal is to find patterns using the tool. Several patterns have been obtained using WEKA, and the best of them are displayed below.

The link failure dataset has been passed into the J48 algorithm, and the output of the algorithm is displayed in the following figures in text and tree format. The output pattern is obtained using the attributes link_id and mins (period of link failure), with incident_summary (reason for failure) as the class attribute. From the tree, the type of link failure on a particular link can be easily identified.
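J48 is WEKA's implementation of the C4.5 decision tree learner, which chooses the attribute to split on by its gain ratio. The Python sketch below computes that measure for a nominal attribute to show why an attribute such as link_id can end up near the top of the tree. It is an illustration under simplifying assumptions: the rows are hypothetical, and numeric attributes such as mins, which C4.5 splits on a threshold, are not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of a nominal attribute divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical rows: link on which the failure occurred, shift, and reason.
link_id = ["L001", "L001", "L002", "L002", "L003", "L003"]
shift = ["day", "night", "day", "night", "day", "night"]
incident_summary = ["power", "power", "fibre", "fibre", "power", "fibre"]

print("gain ratio of link_id:", round(gain_ratio(link_id, incident_summary), 3))
print("gain ratio of shift:", round(gain_ratio(shift, incident_summary), 3))
```

The attribute with the higher gain ratio separates the reasons for failure more cleanly, so it is the better candidate for the first split.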
The following figure is an example of a clustering pattern, obtained by passing in the dataset on the inbound and outbound utilization of a particular RBI location (Khargar). From this cluster pattern we can see three clusters, which group the instances according to the utilization level at that location.

PROS AND CONS OF THE TOOL

PROS

WEKA contains a wide set of algorithms
Installing and handling the tool is easy
Customization is possible
Easy to compare the results of various algorithms
Manipulation of data inside WEKA is possible
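The grouping of instances by utilization level described under pattern analysis can be sketched with a k-means style procedure. The report does not state which clusterer produced the figure, so k-means is an assumption here, as are the utilization values and the farthest-first choice of starting centroids, which is used only to keep the Python example deterministic.

```python
import math


def k_means(points, k=3, iterations=100):
    """Cluster 2D points; returns (centroids, cluster index for each point)."""
    # Farthest-first start: begin with the first point, then repeatedly add
    # the point that lies farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))

    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Hypothetical (inbound %, outbound %) utilization readings for one location.
utilization = [
    (5, 8), (7, 6), (6, 9),        # low utilization
    (45, 50), (48, 47), (50, 52),  # medium utilization
    (88, 90), (92, 85), (90, 93),  # high utilization
]

centroids, assignment = k_means(utilization, k=3)
for point, cluster in zip(utilization, assignment):
    print(point, "-> cluster", cluster)
```

Each reading is assigned to the nearest centroid, and the centroids are recomputed until no reading changes cluster, which yields one group per utilization level for well separated data like this.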
CONS

CSV loader is not robust
Only small and medium-sized data can be handled
Complexity lies in analyzing the textual output
Minimal visualization: not all algorithms have visualization capability
Parameter values of the algorithms cannot be saved for future datasets
This version does not support time series analysis
Lack of proper and adequate documentation

CONCLUSION

The WEKA tool offers a wide set of machine learning algorithms, so the user can try different algorithms to obtain different sets of patterns, provided the algorithms and their optimal parameter values are known. Without knowing the algorithms and parameter values, the user will be applying the algorithms by trial and error to get patterns. As a result, the best patterns can be easily obtained through WEKA as long as the algorithms and parameter values are known.