DEFINITION AND INSTANTIATION OF AN INTEGRATED DATA MINING PROCESS TIN2004-05873




Jornadas de Seguimiento de Proyectos, 2007
Programa Nacional de Tecnologías Informáticas

Javier Segovia Pérez *
Universidad Politécnica de Madrid
* Email: fsegovia@fi.upm.es

Abstract

In practice, CRISP-DM is the most commonly used data mining process in both industry and academia. CRISP-DM has one major weakness: it is at once a process model, a methodology and a lifecycle, and therefore lacks definition and detail. In addition, the data mining process is completely decoupled from the software engineering process, even though its results have a definite impact on that process. This methodological deficiency is one of the main reasons why many data mining projects are not completed or why, if they are, they fail to meet customer expectations and are not used. This project aims to mitigate these problems. To do this, it sets out to: (i) define and integrate a data mining process with a software process, examining what tasks they have in common, their respective inputs and outputs and CRISP-DM's weaknesses, (ii) develop a process instance tailored to CRISP-DM and unify this instance with the Rational Unified Process (RUP) and (iii) validate and transfer the technology to real cases.

1 Project objectives

The specific project objectives are as follows:

1. Define and Integrate a Data Mining Process with a Software Process. Based on an established software process, like IEEE Std. 1074 or ISO 12207, and on CRISP-DM, the aim is to create an integrated process by carefully examining the tasks they have in common, their respective inputs and outputs and CRISP-DM's weaknesses. This study will include ideas or tasks from other processes enacted in related fields like customer relationship management (CRM) or knowledge engineering.

2. Develop a Process Instance. This objective aims to tailor the above integrated process to a particular software development paradigm, i.e. object orientation based on the Rational Unified Process (RUP). The key goal is to incorporate and extend RUP techniques across the entire integrated process, stressing the software-DM relationship, the connection between software development tasks and pure DM tasks via inputs/outputs, and project management. Techniques borrowed from fields closer to the goals of a DM project, such as knowledge engineering (for requirements elicitation) or CRM (for business modeling), will be tailored and used for tasks where RUP techniques are unsuitable or insufficient for DM goals.

3. Validate and Transfer Technology. The achievement of the above goals will be entirely subject to their validation and practical application. Apart from providing experts and experience to work on the above research goals, some companies have volunteered to collaborate on pilot testing during the project's third year. Additionally, as they have a vested interest in applying the project results to their own DM projects, they will be the first to test the technology transfer to their own teams.

To achieve these objectives, the project has been divided into the following four phases (the task decomposition is detailed in the original proposal):

Phase 1: Define an integrated and generic data mining process
Phase 2: Define an instance of an integrated and generic data mining process or methodology
Phase 3: Integrate this process with a software development methodology
Phase 4: Validate on pilot projects

The project chronogram divided by phases is as follows:

[Chronogram: Phases 1 to 4 plotted against six half-year periods: Jan'05-Jun'05, Jul'05-Dec'05, Jan'06-Jun'06, Jul'06-Dec'06, Jan'07-Jun'07, Jul'07-Dec'07.]

Phases 1, 2 and 3 have been completed successfully, and Phase 4, Validate on pilot projects, is to kick off this year. Phase 4 will be run in conjunction with G2 Marketing Intelligence, a member of the Grey Global Group, which acquired MDS Boole, the company that has been collaborating on the project from the start.
There are also possibilities of running tests with, or transferring technology to, Ingenuity Solutions Bhd. This company is based in Kuala Lumpur, Malaysia, and has worked on several projects with the research group.

2 Project success level

This project is based on the idea that the problems to be solved in the field of data mining today are acquiring the dimensions of engineering problems. The processes to be applied should therefore cover all the activities and tasks required in an engineering process, tasks that CRISP-DM might not be considering. The submitted project proposal was inspired by software engineering (SE) research conducted in recent years on the standardization of SE development processes by learning from other engineering disciplines and applying the experience of software developers. It intended to borrow from these ideas to establish a complete process model for DM that would improve and add to CRISP-DM. To do this, we counted on the invaluable advice of International Software Engineering Research Network members.

Having completed Phases 1 to 3, which cover the theoretical and conceptual tasks of creating the process, our general impression is that the proposal was sound. On the one hand, software engineering standards show that CRISP-DM did not consider, or accounted only vaguely for, some processes, especially management-related processes, while it dealt incorrectly with others, such as business modeling. On the other hand, many of the processes in software engineering standards can, if necessary, be tailored to CRISP-DM without too much difficulty. These aspects are detailed in the following sections.

Additionally, the consortium that developed the CRISP-DM standard [1], a European industrial consortium formed in the 1990s to undertake an original ESPRIT project and led by SPSS and Teradata, has been reviewing the standard's first version, 1.0, since mid-2006 and has set up a Special Interest Group (SIG) to do this. Our project development team has joined this SIG, which has been useful as a preliminary review of the project results. The SIG has held three workshops, two in 2006 and a third on 18 January 2007 in London, at which our research group presented the current project results.

The first conclusion from the workshops is that the problems that we had detected in CRISP-DM as part of the project are precisely the problems that the consortium intends to put right in the new versions. This is an important measure of the project's success: our analysis is based on theoretical research, comparing data mining and software engineering standards and models, whereas theirs is a wholly practical exercise, based on the experience that they and hundreds of customers all over the world have gathered from using CRISP-DM. Both analyses arrive at the same conclusions.
The second finding is that, while the consortium believes that some of the solutions borrowed from software engineering that we propose would be difficult to adopt (due primarily to the preconceived idea that developing software is a very structured and not very agile business, whereas data mining is quite the opposite), they have taken others into account and intend to add them either to version 2.0, scheduled for 2007, or to the planned version 3.0. This is another indication of the success of the project.

2.1 Phase 1. Define and Integrate a Data Mining Process with a Software Process

To define the generic process we reviewed processes and methodologies directly related to data mining or CRM, like KDD Process, SEMMA, Two Crows, 5 A's, 6-σ, CRM Catalyst, Data Mining Industrial Engineering and the Market ConsulTeks proposal to unify RUP and CRISP-DM, as well as the software engineering process model standards IEEE Std. 1074 and ISO 12207.

The first difficulty we encountered was that the data mining field makes no distinction between process model, methodology and lifecycle; they are all fused into one. This made the comparisons needed to arrive at a correct definition quite complicated. In the end it was decided to set up a common development process framework based on the two software engineering standards, IEEE Std. 1074 and ISO 12207, to accommodate the tasks and processes established by CRISP-DM. The figure below summarizes the result.

[Figure: generic and integrated process framework. Process groups shown: Organizational processes (Improvement, Infrastructure, Training); Project management processes (Lifecycle selection, Acquisition, Supply, Initiation, Project planning, Project monitoring and control); Development processes, split into Pre-development processes (Concept exploration, Business modeling, System allocation, Knowledge importation), Development processes (Requirements processes) and Post-development processes (Installation, Operation and support processes, Maintenance, Retirement); Integral processes (Evaluation, Configuration management, Documentation). Legend: Included in CRISP-DM; Not included in CRISP-DM; Partially included in CRISP-DM.]

Clearly, CRISP-DM either fails to define or does not properly consider most of the project management processes, integral processes and organizational processes.

The second difficulty, also a consequence of data mining's conceptual fusion of process and methodology, is that the tasks within a single CRISP-DM phase belong to completely different processes. For example, the first phase in CRISP-DM, business understanding, includes the following tasks:

- Determine business objectives, which is related to Business Modeling and belongs to Development Processes.
- Situation assessment, which is related to three processes: Infrastructure (Organizational Processes), Requirements Processes (Development Processes) and Project Planning (Project Management Processes).
- Determine DM goals, which is related to Requirements Processes and belongs to Development Processes.
- Produce project plan, which is related to Project Planning and belongs to Project Management Processes.

Even so, we have been able to map CRISP-DM tasks and activities reasonably well to the generic and integrated process framework that we have defined, and to identify the processes that CRISP-DM fails to cover either totally or partially. This served as input for the next stage of the project.
As it has been created on the basis of SE standards, the framework has the advantage of providing for a future integration of data mining project activities with software development project activities.
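
The mapping of the business understanding tasks described above can be written down as a small data structure. The following Python sketch is only an illustration: the class and function names are hypothetical and not part of the project, and only the task-to-process pairs are taken from the text. It shows how one can check mechanically that a single CRISP-DM phase spreads over several process groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameworkProcess:
    """A process in the integrated framework (hypothetical helper type)."""
    name: str   # process name, e.g. "Project planning"
    group: str  # process group it belongs to


# CRISP-DM "business understanding" tasks and the framework processes
# they relate to, exactly as listed in the text above.
BUSINESS_UNDERSTANDING = {
    "Determine business objectives": [
        FrameworkProcess("Business modeling", "Development"),
    ],
    "Situation assessment": [
        FrameworkProcess("Infrastructure", "Organizational"),
        FrameworkProcess("Requirements", "Development"),
        FrameworkProcess("Project planning", "Project management"),
    ],
    "Determine DM goals": [
        FrameworkProcess("Requirements", "Development"),
    ],
    "Produce project plan": [
        FrameworkProcess("Project planning", "Project management"),
    ],
}


def groups_touched(mapping):
    """Return the set of process groups that the whole phase spreads over."""
    return {p.group for processes in mapping.values() for p in processes}


def tasks_spanning_groups(mapping):
    """Return, sorted, the tasks that map to more than one process group."""
    return sorted(
        task
        for task, processes in mapping.items()
        if len({p.group for p in processes}) > 1
    )


if __name__ == "__main__":
    print(sorted(groups_touched(BUSINESS_UNDERSTANDING)))
    print(tasks_spanning_groups(BUSINESS_UNDERSTANDING))
```

Run as a script, this reports that the phase touches three process groups and that Situation assessment is the only task split across more than one of them, which is the misplacement the section describes.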

2.2 Phase 2. Define an instance of an integrated and generic data mining process or methodology

Having defined the process, we then had to define a methodology and a lifecycle. As already mentioned, CRISP-DM is a mixture of all three things. Part of the work therefore involved figuring out what was what in order to put it to the best possible use. As mentioned in the original objectives, we examined RUP [2] as an underlying methodology and lifecycle, taking into account that, for use in data mining, it would have to be suitable both for well-defined and complex processes and for short processes with non-existent requirements, known as Exploratory Projects. Projects like these, which are very common in data mining, probably do not go beyond RUP's Inception phase.

RUP appears to be well suited for data mining projects, because it covers almost all development process activities, mandates iterative and incremental development, is intended for a team size of two or more, with no upper limit, and its project management is risk oriented. Additionally, RUP is also used to build business intelligence systems using Data Warehouse, Data Mart and reporting technologies, which suggests that it could be suitable for an environment where software is not developed. But perhaps the most important point for our project is that requirements are expressed as Use Cases and Features, which serve as guidance throughout the process. This is something that was missing from CRISP-DM. The figure below is a comparison between RUP and CRISP-DM tasks and lifecycles.

[Figure: CRISP-DM tasks and lifecycle matched against RUP disciplines.]

The overall findings after tailoring RUP to CRISP-DM are as follows:

- The processes and tasks not covered by CRISP-DM can be covered and organized according to RUP disciplines.
- The CRISP-DM lifecycle is iterative but not incremental: it overlooks the fact that the objectives and intensity of the activities to be carried out can vary from one project phase to another. RUP's distribution of activities across phases is very well suited to this.
- The Agile Unified Process is the RUP version that should be used for most data mining projects [3].

Other very important findings are:

- RUP supports a formal specification language: UML. UML is a common language for customers, end users, and developers. This is of vital importance for data mining projects, where there is continual contact with the customer and a common language, which does not currently exist, is needed.
- RUP defines roles, skills and artifacts to be used in each task, another of the points missing from CRISP-DM.
- Special attention should be paid to RUP's Business Modeling (BM) discipline, as repeatedly pointed out at CRISP-DM workshops.

CRISP-DM's key problem is perhaps how to link data mining results with the ROI (Return on Investment) of the business where they are to be used. The question is, in other words, how to link the Deployment phase with the Business Understanding phase. This is crucial for data mining projects because the average customer finds it difficult to envisage how a data mining project can benefit his or her business (this does not apply to software, because everyone is very well acquainted with its benefits). We put this lack of vision down to the fact that, unlike the RUP Business Modeling objectives, the CRISP-DM Business Understanding (BU) objectives are not very useful for this purpose (the very name denotes the difference: understanding is one thing and modeling another). One of the BM objectives is to understand the structure and the dynamics of the organization in which a system is to be deployed (the target organization).
BU accounts for this point, but, more importantly, BM also covers the following objectives not included in BU:

- To ensure that customers, end users, and developers have a common understanding of the target organization and the project.
- To understand current problems in the target organization and identify potential improvements, which is the seed of a data mining project for that organization.
- To understand how the new system would affect the way customers conduct their business and its potential benefits. This is the link we were looking for between the final ROI and the deployment process, a link that is missing from CRISP-DM and which it dearly needs.
- To derive the system requirements (data mining goals) from business use cases (business goals), a procedure that is not clarified in CRISP-DM.

In view of this, we addressed the following points with the aim of defining an instance of an integrated and generic data mining process or methodology:

- Define RUP as a methodological and lifecycle framework integrating CRISP-DM, providing for an iterative and incremental cycle useful for both exploratory and well-defined projects, and reorder CRISP-DM tasks accordingly.
- Tailor RUP management tasks to CRISP-DM within the methodology.
- Switch the business understanding (business-specific) task for the business modeling task.
- Choose a business process framework [4] to guide the business modeling process. This framework is necessary because business modeling is more important in data mining than in software development.
- Define the business use cases to drive project development. This is a conceptual difference from a software development project, which is driven by use cases defined in the Requirements discipline.
- Define data mining use cases as the building blocks for Requirements.
- Define data mining paradigms (clustering, classification, dependency modeling, deviation detection, sequence analysis, etc.) as the basis for Design (the later Implementation process selects techniques for each paradigm).
- Tailor RUP artifacts and roles to all of the above, and use UML especially for Business Modeling, Requirements, and Design, creating new elements if necessary.
- Design an agile version of the process.

Additionally, we reviewed software Configuration and Change Management techniques, as no such techniques exist in CRISP-DM (the SAS tool alone covers some aspects). We looked at IEEE Std. 828-1998, IEEE Std. 1042-1987 and MIL-HDBK-61, as well as most revision control software tools, and we listed the data mining project elements that should be subject to such control. The RUP tools that support the configuration and change management processes, such as Rational ClearCase and Rational ClearQuest, which are used in conjunction with Unified Change Management (UCM), remain to be analysed.

2.3 Phase 3. Integrate this process with a software development methodology

By having defined an instance of a generic data mining process or methodology integrated with RUP, the data mining process is automatically integrated with the software development methodology.
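
The chain set out in the Phase 2 points (business use cases drive the project, data mining use cases are the building blocks for Requirements, a paradigm is fixed at Design and a technique is chosen at Implementation) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: all class and function names are hypothetical, and the technique catalogue is an illustrative assumption, since the report names the paradigms but not the techniques.

```python
from dataclasses import dataclass, field
from enum import Enum


class Paradigm(Enum):
    """Data mining paradigms named in the text."""
    CLUSTERING = "clustering"
    CLASSIFICATION = "classification"
    DEPENDENCY_MODELING = "dependency modeling"
    DEVIATION_DETECTION = "deviation detection"
    SEQUENCE_ANALYSIS = "sequence analysis"


# ASSUMPTION: example techniques per paradigm, for illustration only.
# The report does not list techniques.
TECHNIQUES = {
    Paradigm.CLUSTERING: ["k-means", "hierarchical clustering"],
    Paradigm.CLASSIFICATION: ["decision tree", "naive Bayes"],
    Paradigm.DEPENDENCY_MODELING: ["association rules"],
    Paradigm.DEVIATION_DETECTION: ["outlier detection"],
    Paradigm.SEQUENCE_ANALYSIS: ["sequential pattern mining"],
}


@dataclass
class DataMiningUseCase:
    """A data mining goal; its paradigm is decided at Design."""
    goal: str
    paradigm: Paradigm


@dataclass
class BusinessUseCase:
    """A business goal and the data mining use cases derived from it."""
    goal: str
    dm_use_cases: list = field(default_factory=list)


def derive_requirements(business_use_cases):
    """Requirements step: pair each business goal with its derived DM goals."""
    return [
        (business.goal, dm.goal)
        for business in business_use_cases
        for dm in business.dm_use_cases
    ]


def candidate_techniques(use_case):
    """Implementation step: techniques available for the chosen paradigm."""
    return TECHNIQUES[use_case.paradigm]


if __name__ == "__main__":
    # Hypothetical example project.
    churn = BusinessUseCase(
        "Reduce customer churn",
        [DataMiningUseCase("Predict which customers will leave",
                           Paradigm.CLASSIFICATION)],
    )
    print(derive_requirements([churn]))
    print(candidate_techniques(churn.dm_use_cases[0]))
```

Keeping the paradigm on the use case and the techniques in a separate catalogue mirrors the separation between Design and Implementation described in the list: the paradigm can be agreed with the customer before any specific algorithm is selected.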
The underlying idea is that, after business modeling, a decision is taken on which business use cases call for prior iterations that cover the data mining project, and the results of those iterations are added as requirements at the requirements stage of the software project.

3 Results indicators

As already mentioned, the project team has joined the CRISP-DM consortium's Update SIG, and explained and discussed the project results at the last workshop, held in January 2007 in London.

The team is also participating in KDubiq (Knowledge Discovery in Ubiquitous Environments). KDubiq is the first Coordination Action (CA) for Ubiquitous Knowledge Discovery, funded by the European Union under IST (Information Society Technology) and FET Open (Future and Emerging Technologies) within the 6th Framework Programme. Ernestina Menasalvas coordinated one of the working groups defining the priorities for the 7th Framework Programme related to data mining processes.

The team is a member of the Spanish Data Mining and Learning Network.

The following table lists the results planned in the project proposal.

Publication type                                                               Submission date (project runs from Year 1 to 3)   Quantity

Planned achievements
Congress papers                                                                Year 2-4                                           7-9
JCR journal publications                                                       Year 2-4                                           4
Congress tutorials                                                             Year 2-3                                           1-2
PhD dissertations                                                              Year 3-4                                           2

Likely achievements depending on the extent of the results
Inclusion of the results in the UPM's Expert in Business Intelligence course   Year 3                                             1
Proposal for editing a handbook on DM processes                                Year 3                                             1
Proposal of a workshop for a Software & Knowledge Engineering congress         Year 3                                             1
Proposal for a book on DM Processes                                            Year 4                                             1
Proposal of a European course within KDNet                                     Year 3                                             1

There are two ongoing PhD dissertations that we expect to be completed in 2007, because this project covers most of their results:

- Definición de un Proceso de Data Mining basado en técnicas de Ingeniería del Software (Definition of a Data Mining Process based on Software Engineering Techniques). Author: Gonzalo Mariscal Vivas.
- Metodología para la definición de requisitos en proyectos de Data Mining (ER-DM) (Methodology for defining data mining project requirements). Author: José Alberto Gallardo Arancibia.

And the following MSc dissertation was defended in 2006:

- Uso de técnicas de educción para el entendimiento del negocio (Using elicitation techniques to understand a business). Author: María Alejandra Ochoa.

As regards publications, the result of the first stage of the project has been submitted to a JCR-listed journal [5]. The second phase of the project ended in June 2006. However, as this coincided with the CRISP-DM consortium's announcement that it was going to review the process and intended to hold workshops to work on and review proposals, we decided to wait until we had participated in these workshops and obtained a first-hand evaluation of our results. So far we have submitted only one congress paper, which addresses the use of ontologies in the requirements phase [6].
In the light of the positive feedback that we received at the last workshop, held in London in January 2007, we intend to pursue all the originally planned objectives, except for Inclusion of the results in the UPM's Expert in Business Intelligence course. Finally, it was suggested to us at the last CRISP-DM workshop that we should write a book on DM processes based on our results. This would signify another achieved objective.

4 References

[1] http://www.crisp-dm.org.
[2] I. Jacobson, G. Booch, and J. Rumbaugh, The Unified Software Development Process, Addison-Wesley Longman Inc., 1999.
[3] S. W. Ambler and R. Jeffries, Agile Modeling: Effective Practices for Extreme Programming and the Unified Process, John Wiley and Sons Inc., 2002.
[4] A. Osterwalder, Y. Pigneur, and C. L. Tucci, Clarifying Business Models: Origins, Present, and Future of the Concept, Communications of AIS, Volume 15, Article 1.
[5] O. Marbán, E. Menasalvas, S. Eibe, and J. Segovia, Towards Data Mining Engineering: a software engineering approach, The Knowledge Engineering Review, Cambridge University Press. Under review.
[6] S. Eibe, M. Valencia, E. Menasalvas, J. Segovia, and P. Sousa, Towards autonomous search engine mining: Conceptualizing context as metadata, AWIC 2007.