Knowledge Reuse in Data Mining Projects and Its Practical Applications

Rodrigo Cunha 1, Paulo Adeodato 1 and Silvio Meira 1

1 Center of Informatics, Federal University of Pernambuco, Caixa Postal Cidade Universitária, Recife-PE, Brazil
{rclvc,pjla,srlm}@cin.ufpe.br

Abstract. The objective of this paper is to provide an integrated environment for knowledge reuse in KDD that, based on previous experience, prevents the recurrence of known errors and reinforces project successes. It combines methodologies from project management, data warehousing, data mining and knowledge representation. Unlike purely algorithmic papers, this one focuses on managerial performance metrics, such as the time taken for solution development and the number of files not automatically managed, while preserving equivalent performance on the technical solution quality metrics. The environment has been validated with metadata collected from previous KDD projects developed and deployed for real-world applications by the development team members. The case study carried out on actual contracted projects has shown that the environment assesses the risk of failure of new projects, controls and documents the whole KDD project development process, and helps in understanding the conditions that lead KDD projects to success or failure.

Keywords: Data mining project, Knowledge reuse in KDD projects, risk assessment of KDD projects.

1 Introduction

Early research on artificial intelligence (AI) focused on the implementation and optimization of algorithms. These algorithms, however, only produced reliable results in very specific applications, in limited domains. The general application of AI to the data generated by real-world activities, that is, data mining, was far from satisfactory, mainly due to the difficulty of integrating databases, the low quality of the available data and the poor understanding of the business operation (application domain). In 1996, Fayyad et al. [1] broadened the scope by placing data mining within a more global process called Knowledge Discovery in Databases (KDD). Also in 1996, potential consumers and suppliers of data mining solutions formed a consortium to create a methodology for systematically developing data mining solutions for real problems. They came up with CRISP-DM (Cross-Industry Standard Process for Data Mining) [2], a non-proprietary methodology for identifying and decomposing a data mining project into several stages shared by all application domains. Those initiatives aimed at standardizing the development process of data mining solutions, which involves the use of several tools for modeling, data visualization, analysis and transformation, performance evaluation and even specific programming tasks.

Once this standard had been created, providing interoperability among the several different platforms in a single environment where all the processes are centralized and documented became one of the most important issues in KDD applications to real-world problems. According to Bartlmae and Riemenschneider [3], another important issue in KDD projects nowadays, mainly due to their complexity and strong user dependence, is the inadequate documentation, management and control of the experience gained in solution development, which leads to errors already known from previous projects recurring in new ones. The lack of a platform capable of reusing the knowledge and lessons learned in previous projects is a practical problem, worsened by the inadequate interoperability among the data mining tools available in KDD platforms [4]. In summary, the lack of proper interoperability together with the lack of knowledge reuse capability in KDD solution development platforms are deficiencies that may lead projects to failure or delay, and cause client dissatisfaction and cost increases.

This paper presents an environment capable of reusing knowledge from previous data mining projects. The environment provides a better understanding of the conditions that make KDD projects fail or succeed, and a simpler and more precise parameter specification for producing high-quality KDD projects that meet the clients' expectations within the planned schedule and budget.

This paper is organized as follows. Section 2 presents the literature survey on approaches related to the proposed environment. Section 3 describes the architecture and functionality of the Knowledge Reuse Environment. Section 4 shows the relevant results for knowledge reuse in data mining projects.
Finally, Section 5 summarizes the research carried out, emphasizes its main results along with their interpretation, states its limitations, and proposes future work for improving the Knowledge Reuse Environment.

2 Literature Review

IMACS (Interactive Market Analysis and Classification System) [5] was one of the first initiatives to consider involving the user in the KDD process. When that system was first proposed, data mining tools provided very limited functionality, and IMACS development has not followed the evolution of those tools. Thus, IMACS only supports the creation of semantic definitions for the data and the formal representation of knowledge. In 1997, CITRUS [6] was proposed based on the CRISP-DM methodology. Two years later, UGM (User Guidance Module) was presented [7] as an improvement to CITRUS, based on experiences of past projects for knowledge reuse. In 2002, IDAs [8] was proposed based on the methodology of Fayyad et al. After the Clementine release, the vast majority of tools (both market and academic) adopted an interface focused on the data mining workflow.

Another relevant work is the application of Case Based Reasoning for Knowledge Management in KDD Projects [3], which proposes an environment aimed at reusing knowledge in data mining projects. The idea is based on the concept of the Experience Factory, where Case Based Reasoning (CBR) helps store and retrieve knowledge packages in a data repository. The Statlog project [9] proposes a methodology to evaluate the performance of different algorithms for machine learning, neural networks and statistics. Although it uses the knowledge reuse concept, the scope of reuse in Statlog is limited to data mining algorithms. Finally, the Distributed Knowledge Discovery Systems project [10] introduces an environment aimed at integrating different data mining tools and platforms related to organizational modeling and integration of solutions. Despite considering integration an important issue, this approach only integrates the data mining tools. It deals with neither metadata acquisition nor meta-data mining on stored knowledge.

In summary, the literature offers isolated initiatives for knowledge reuse, but no contribution that considers both interoperability and knowledge reuse with a focus on the process, as presented in this paper.

3 Knowledge Reuse Environment

In this environment, the knowledge databases are stored in three different structures, according to the types of their contents.

1) Metadata Database: This database stores information from previous projects. Here, metadata are all types of information produced along the KDD project development process, such as the data transformations needed, algorithms used, number of components used, project manager, project duration, overall project cost and client's level of satisfaction, among others. That is, the metadata database stores information ranging from project management to specific algorithms with their corresponding performances. This module consists of two sub-modules: the transactional metadata database and the managerial metadata database. The transactional metadata database stores all of the project's metadata in a relational logical model.
Managerial metadata databases are constructed via an ETL (Extract, Transform, and Load) process [11] carried out on the transactional database. These managerial metadata databases, also called Data Marts, are represented in a star model [11]. The objective of these metadata databases is to support both the project manager and the KDD experts throughout the KDD solution development process.

2) CBR Projects: This module stores the knowledge of past projects through the technique of Case Based Reasoning (CBR) [3]. The purpose of this module is to reuse cases similar to the current project (under development or about to start) so that the data mining expert has adequate conditions for making the best decisions in the new project. Currently, the environment offers three milestones for decision support. In the first milestone, it helps the project manager estimate the risk of the project being a success or a failure, even before it has started, and retrieves the cases most similar to the new project. The second milestone occurs at the project's planning stages, where the goal is to define the most appropriate data mining tasks (classification, forecasting, etc.) based on the most similar past projects. Finally, at the third milestone, available at the preprocessing stage, the environment analyzes and extracts the most similar past transformations of the data.

3) Learned Lessons Database: This module stores the lessons of previous projects through the technique of Case Based Reasoning. Learned lessons consist of problems, solutions, suggestions and observations that the experts have catalogued in previous projects with the objective of sharing them in future projects or training. In short, a learned lesson is an entry in the environment's module that makes the experience lived and catalogued by users available for future use. An example of an actual catalogued lesson refers to importing data in text file format into SPSS (Statistical Package for the Social Sciences). This tool is likely to modify the formatting of numeric variables and truncate those of the string type. The environment now gives a warning about this problem in projects involving text file inputs to SPSS.

4 Knowledge Reuse and Experimental Platform

The knowledge stored can be reused in several ways, from supporting the decision to start a new project to defining the most appropriate data transformation technique. This section presents how it has been used and the results achieved in actual projects.

4.1 Problem Characterization

Here, the decision support system helps decide whether or not a new data mining project should be developed, based on the experience of previous projects. Even before a new project starts, the system estimates its risk of failure (the higher the score, the higher the risk). If the risk is acceptable, the project starts; otherwise, the system presents the conditions that make the project risky, to support project renegotiation or, in extreme cases, even halting the project.
This system helps save much of the money and time spent on rework in ill-specified projects.

A database collected over recent years of data mining project development by NeuroTech was used to assess the environment's performance on an actual meta-data mining problem. The metadata database was populated with 69 data mining projects executed in the past: 27 labeled as successes and 42 labeled as failures. The following three criteria were used to label the project target classes: 1) the contracting client's evaluation (satisfied or dissatisfied); 2) NeuroTech's technical team evaluation (success or failure); and 3) the cost/benefit ratio resulting from the project (success or failure). When a project had a negative evaluation in any of the three criteria, it was labeled a failure; otherwise, it was labeled a success, in this binary classification modeling. Each row of the metadata database represents a developed project whose metadata attributes are stored in its columns. For all projects, there are 19 input attributes (explanatory variables) and one output attribute (dependent variable), which represents the target class label (success or failure). Some of the explanatory variables were the client company's size (based on revenue), the client company's experience with previous DW or KDD projects (number of projects developed) and whether the present project needs behavioral data as input information.

Logistic regression from Weka was the statistical inference technique used for project risk estimation. Due to the small number of labeled examples (69) available for modeling, the leave-one-out method was applied as the experimental data sampling strategy, using MatLab code. The technical performance of the system was assessed using the R-Project software in two distinct forms: 1) separability between the distributions of successes and failures, measured by the KS statistical test; 2) simulation of several decision threshold scenarios on the project scores produced.

4.2 Experimental Results on Risk Assessment

The quality of the meta-data mining is assessed by the usual data mining performance metrics. The performance achieved via leave-one-out reached a maximum value of 0.65 on the KS statistical test [12], which represents a statistically significant difference at α=0.05. This shows that, technically, the system can be used for decision support. For finer decisions, Table 1 presents the scenario for several different score thresholds. For each score band, it presents the rate of failures among the projects, showing that higher score bands contain a higher percentage of failures. New projects that produce scores above 75, for instance, have a very high risk of failure and should be renegotiated for risk reduction before the project starts.

Table 1. Decision scenario for several score thresholds.
Score band    Failures      Successes     Total
[...]          9 (30%)      21 (70%)       30
[...]          7 (64%)       4 (36%)       11
[...]          3 (60%)       2 (40%)        5
above 75      23 (100%)      0 (0%)        23
Total         42 (61%)      27 (39%)       69

Had such a system been available for assessing the risk of these 69 projects in the past, just flagging those with scores above 75 would have prevented 23 of the 42 failed projects without drawing extra attention to any successful project. That would have represented the detection of 55% of the failures (23/42) from the start. As previously stated, this is an important managerial metric for this paper.

4.3 CBR Measurements on Project Similarity

The same 69 projects used in the metadata database application were imported into the CBR Project database. In practice, the CBR implementation complements the logistic regression model by returning the cases most similar to the new project. In the end, the project manager has a score for project risk assessment and a collection of the most similar previous projects for decision support.

For case representation, Case Based Reasoning (CBR) with attribute-value representation [3] was used. Similarity is divided into global similarity and local similarity. The global similarity is a weighted and normalized nearest neighbour measure [13]. The local similarity relates to the attributes that describe the case; in other words, it depends on the nature of the attribute (string, binary, numeric or ordinal). For each attribute of the "string" type, a similarity matrix was constructed by interviewing three of NeuroTech's project managers and averaging their opinions. For the ordinal and binary attributes, local similarity was defined from the absolute difference between the attribute's values. For the numeric attributes, local similarity was defined by a linear function, so that similarity grows as the weighted distance decreases.

Once the structure of the cases and the similarity measures are defined, the CBR problem becomes the retrieval of cases from the knowledge database. The retrieval process consists of a group of sub-tasks. The first sub-task is the assessment of the situation via a query over a group of relevant attributes. The second sub-task is the matching and selection strategy, whose objective is to identify the cases similar to query Q and return the k most similar ones. In this work, the similarity threshold was defined empirically as 0.5; therefore, only cases with similarity greater than or equal to 50% in relation to query Q are returned.

Based on the results achieved, NeuroTech decided to adopt the environment to estimate the failure probability of its projects using logistic regression and to find the most similar previously developed projects using CBR.
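The retrieval step described above can be illustrated with a minimal Python sketch. It is not the environment's actual implementation: the attribute names, value ranges, weights, similarity matrix entry and the two cases are made-up assumptions, and only the overall scheme (local similarities per attribute type, a weighted and normalized global similarity, a 0.5 threshold and the k best cases) follows the text.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, object]
LocalSim = Callable[[object, object], float]


def numeric_sim(value_range: float) -> LocalSim:
    """Linear local similarity: 1.0 for equal values, 0.0 at value_range or beyond."""
    def sim(a, b):
        return max(0.0, 1.0 - abs(a - b) / value_range)
    return sim


def ordinal_sim(n_levels: int) -> LocalSim:
    """Local similarity from the absolute difference of ordinal codes 0..n_levels-1."""
    def sim(a, b):
        return 1.0 - abs(a - b) / (n_levels - 1)
    return sim


def binary_sim(a, b) -> float:
    """Binary attributes: 1.0 when the two values agree, 0.0 otherwise."""
    return 1.0 - abs(int(a) - int(b))


def string_sim(matrix: Dict[Tuple[str, str], float]) -> LocalSim:
    """Look up a symmetric, expert-provided similarity matrix; unknown pairs score 0.0."""
    def sim(a, b):
        if a == b:
            return 1.0
        return matrix.get((a, b), matrix.get((b, a), 0.0))
    return sim


def global_similarity(query: Case, case: Case,
                      local_sims: Dict[str, LocalSim],
                      weights: Dict[str, float]) -> float:
    """Weighted sum of local similarities, normalized by the total weight."""
    total = sum(weights.values())
    score = sum(weights[attr] * local_sims[attr](query[attr], case[attr])
                for attr in weights)
    return score / total


def retrieve(query: Case, case_base: Dict[str, Case],
             local_sims: Dict[str, LocalSim], weights: Dict[str, float],
             k: int = 3, threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return up to k cases with similarity >= threshold, most similar first."""
    scored = [(name, global_similarity(query, case, local_sims, weights))
              for name, case in case_base.items()]
    kept = [item for item in scored if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


# Hypothetical attributes, weights and cases (all values are illustrative only).
LOCAL_SIMS = {
    "client_size": ordinal_sim(3),                        # 0 = small, 1 = medium, 2 = large
    "previous_projects": numeric_sim(10.0),               # assumed range of 10 projects
    "needs_behavioral_data": binary_sim,
    "sector": string_sim({("retail", "banking"): 0.4}),   # assumed expert opinion
}
WEIGHTS = {"client_size": 1.0, "previous_projects": 2.0,
           "needs_behavioral_data": 1.0, "sector": 1.0}
CASE_BASE = {
    "regional_bank": {"client_size": 1, "previous_projects": 0,
                      "needs_behavioral_data": True, "sector": "banking"},
    "telecom_operator": {"client_size": 2, "previous_projects": 8,
                         "needs_behavioral_data": False, "sector": "telecom"},
}
QUERY = {"client_size": 2, "previous_projects": 0,
         "needs_behavioral_data": True, "sector": "retail"}

print(retrieve(QUERY, CASE_BASE, LOCAL_SIMS, WEIGHTS))
```

With these made-up values only the bank case passes the 0.5 threshold, which mirrors how a retailer's project can turn out to be most similar to a project from a different sector once attributes such as prior experience carry more weight.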
Now, new projects go through the model assessment in order to estimate the chance of their success before their development. The score threshold was defined as 75, i.e., only projects with a score below 75 are automatically approved. Every project with a score higher than 75 should be evaluated by the company's committee, formed by the managers in charge of the business area and the customer area, and by the company's chief scientist. The project starts only after the committee's approval; otherwise, some contractual conditions and/or project parameters should be altered based on similar cases and the project submitted again to the model for risk assessment.

Some subjective results have been achieved at NeuroTech with the use of the environment. For instance, a new project contracted by a retail company for a credit scoring solution was evaluated with an 89% chance of failure. When the NeuroTech operations manager used the environment to search for similar cases, the most similar project returned by the CBR system was a project developed for a regional bank. In principle, there was no apparent correlation between a big nationwide retailer and a regional bank. When analyzed in more detail, the bank's project had failed due to characteristics that matched the retailer's current situation, particularly the inexperience of their information technology staff and their lack of commitment to the project. Furthermore, neither the retailer nor the bank had ever developed a data mining project before. As the project had already been negotiated and there was no possibility of aborting it, the manager made two decisions. Firstly, he demanded the full-time dedication of a member of the retailer's technology team and, secondly, he defined a quick basic training course on data mining for the retailer's team as the first project activity. Thus, it was possible to reduce the risks of the new project with the support of the experience from a similar project previously developed.

4.4 Learned Lessons Database Load and Application

To apply this module to actual problems and assess the benefit of its use, a learned lessons database was collected at NeuroTech and imported into the environment. Interviews and forms collected experience from 10 data mining experts at several levels of the company, ranging from technical staff working in modeling to chief officers on the board of directors. A wide spectrum of 61 learned lessons was documented in 6 variables, namely: CRISP-DM stage, task within the CRISP-DM stage, date the lesson was learned, expert who learned the lesson, lesson category and lesson description. These 61 lessons were divided into categories in the following proportions: 35.5% in project risk, 24.2% in best practices, 22.6% in technology and 17.7% distributed among other less frequent categories. Furthermore, the 61 learned lessons were also classified into the following types, with their respective proportions: 58.1% guidelines, 22.6% problems, 16.1% problem solutions and 3.2% general spectrum.

The application of this learned lessons database follows the same Case-Based Reasoning methodology and metrics described in the CBR section above. The only differences are the database used and the objective. Up to now, the learned lessons module has been used at NeuroTech by the operations manager, mainly at the beginning of the project, as a complement to the risk estimation module. The data mining specialists also use the module in two situations: corrective or proactive actions. The corrective situation occurs when a new problem is found, for instance, an error when importing a file into SPSS. In this case, after the mistake happens, the specialist consults the Learned Lessons database to identify the best solution to the problem.
The proactive situation occurs when a new phase of the project begins; for instance, the environment signals the risk of formats being corrupted when files are imported into SPSS. Another proactive action can be taken after concluding the pre-processing phase and before beginning the algorithm application phase: the specialists query the lessons database to check whether any lesson suggests how to avoid repeating the same mistakes in the new phase.

Although the evaluations are subjective, some practical actions have been taken by NeuroTech. For instance, according to the operations manager, a learned lesson helped reduce the risk of a new project for developing a fraud detection solution in telecommunications. The lesson stated that the first requirements specification meeting should not be held with the client's business and information technology teams separately but with both teams at the same time; otherwise, the lack of understanding and alignment would end up increasing the effort and stress for the entire project. In this scenario, the operations manager postponed the meeting to a controlled occasion where both teams would be together.
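As an illustration of the proactive use described in this subsection, the sketch below models a lesson record with the six documented variables and looks up the lessons catalogued for a given CRISP-DM stage. It is only a sketch under stated assumptions: the field names, dates, expert labels and task names are hypothetical, and a plain filter stands in for the CBR retrieval that the environment actually uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Lesson:
    """One learned lesson, described by the six variables listed in the text."""
    stage: str          # CRISP-DM stage
    task: str           # task within that stage
    learned_on: date    # date the lesson was learned
    expert: str         # expert who learned the lesson
    category: str       # e.g. project risk, best practices, technology
    description: str


def lessons_for_stage(lessons: List[Lesson], stage: str,
                      keyword: str = "") -> List[Lesson]:
    """Proactive query: lessons for a stage, optionally filtered by a keyword."""
    needle = keyword.lower()
    return [
        lesson for lesson in lessons
        if lesson.stage.lower() == stage.lower()
        and needle in lesson.description.lower()
    ]


# Two lessons paraphrased from the paper; dates, experts and tasks are placeholders.
LESSONS = [
    Lesson("Data Preparation", "Import data", date(2000, 1, 1), "expert A",
           "technology",
           "Importing text files into SPSS may change numeric formats "
           "and truncate string variables."),
    Lesson("Business Understanding", "Requirements specification",
           date(2000, 1, 1), "expert B", "project risk",
           "Hold the first requirements meeting with the client's business "
           "and IT teams together."),
]

# A specialist entering the data preparation phase checks the known pitfalls.
for lesson in lessons_for_stage(LESSONS, "Data Preparation"):
    print(f"[{lesson.category}] {lesson.description}")
```

The corrective use would run the same lookup after a problem has occurred, typically with a keyword such as the tool name.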

5 Conclusions

This paper has presented an environment for the KDD project development process endowed with knowledge reuse at a high level. This environment offers three ways of reusing knowledge: 1) project risk assessment and risk explanation based on the metadata database; 2) reuse of project procedures and settings via Case Based Reasoning on the metadata database; and 3) guidelines, recommendations and warnings from the learned lessons database.

Several examples of the application of knowledge reuse to real-world problems have been presented in this paper, ranging from supporting the decision of whether or not to start a new risky data mining project to finding the most appropriate data transformation and parameter settings during the development of the data mining solution.

The experiments have shown that risk assessment at the beginning of a project, along with the risk conditions, helps develop a high-quality project, leading to solutions with high chances of meeting the clients' expectations within the planned schedule and budget. The recent success of NeuroTech in the PAKDD 2007 Competition (First Runner-up) [14] and the publication in IJCNN09 [15] have already proved these ideas.

Despite its breadth in terms of managing KDD knowledge, this work has been constrained to the boundaries of binary classification problems. Extensions to other types of problems that were kept out of its scope (e.g. time series forecasting) are already under investigation and will demand a lot of effort.

At the moment, the environment is in full application to real-world problems and, soon, there will be enough metadata for presenting results with statistical significance.

References

1. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. Advances in Knowledge Discovery and Data Mining (1996)
2. Shearer, C.: The CRISP-DM model: the new blueprint for data mining. Journal of Data Warehousing, 5 (2000)
3. Bartlmae, K., Riemenschneider, M.: Case based reasoning for knowledge management in KDD projects. In: Proceedings of the 3rd International Conference on Practical Aspects of Knowledge Management, PAKM 2000, Basel, Switzerland (2000)
4. Rodrigues, M.d.F., Ramos, C., Henriques, P.R.: How to Make KDD Process More Accessible to Users. In: ICEIS (2000)
5. Brachman, R.J., et al.: Integrated Support for Data Archaeology. International Journal of Intelligent and Cooperative Information Systems, 2(2) (1993)
6. Wirth, R., et al.: Towards Process-Oriented Tool Support for Knowledge Discovery in Databases. In: Principles of Data Mining and Knowledge Discovery, Trondheim, Norway (1997)
7. Engels, R.: Component-based User Guidance for Knowledge Discovery and Data Mining Processes. University of Karlsruhe, 234 p. (1999)
8. Bernstein, A., Hill, S., Provost, F.: Intelligent assistance for the data mining process: An ontology-based approach (2002)
9. King, R.D.: The STATLOG Project, 2007. Department of Statistics and Modelling Science
10. Neaga, I.: Framework for Distributed Knowledge Discovery Systems Embedded in Extended Enterprise. Manufacturing Engineering, Loughborough University, Loughborough, United Kingdom (2003)
11. Kimball, R.: The Data Warehouse Lifecycle Toolkit. New York: John Wiley & Sons (1998)
12. Conover, W.J.: Practical Nonparametric Statistics. Vol. 3, New York: John Wiley & Sons (1999)
13. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AICOM, 7(1) (1994)
14. Adeodato, P.J.L., et al.: The Power of Sampling and Stacking for the PAKDD-2007 Cross-Selling Problem. International Journal of Data Warehousing & Mining, 4(2) (2008)
15. Adeodato, P., et al.: The Role of Temporal Feature Extraction and Bagging of MLP Neural Networks for Solving the WCCI 2008 Ford Classification Challenge. In: International Joint Conference on Neural Networks, IJCNN 2009 (accepted) (2009)