VÉDELMI INFOKOMMUNIKÁCIÓ

EXTENDING UML FOR MODELLING OF DATA MINING CASES

Prof. LADISLAV BURITA
VOJTECH ONDRYHAL

The article describes a possible approach to modelling data mining cases. The aim of the paper is to show how a standard modelling language can be used for the data mining process in order to achieve compatibility with other projects based on an incremental approach, especially those using the Unified Process and UML. Common UML elements such as use cases, classes, interfaces, components and nodes can be specialized by extension mechanisms, including stereotypes and tagged (named) values. A new set of UML elements that covers the whole project lifecycle is provided and described for the data mining process. One example is the data mining model element, which extends the UML class element with new tagged values such as input data, output data and model parameters. UML schemas, syntax, semantics and usage examples for the new elements are included in the paper.
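The idea of a data mining model element that extends the UML class with extra values can be illustrated with a minimal sketch. It assumes Python dataclasses as a stand-in for a modelling tool's metamodel. The class names UmlClass and StereotypedClass, the element name ExampleModel and all values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class UmlClass:
    """A plain UML class element, reduced to its name for this sketch."""
    name: str


@dataclass
class StereotypedClass(UmlClass):
    """A UML class specialized by a stereotype and a set of tagged values."""
    stereotype: str = ""
    tagged_values: Dict[str, Any] = field(default_factory=dict)

    def label(self) -> str:
        # UML notation shows the stereotype in guillemets next to the name.
        return f"«{self.stereotype}» {self.name}"


# Hypothetical element: all names and values below are illustrative only.
example_model = StereotypedClass(
    name="ExampleModel",
    stereotype="data mining model",
    tagged_values={
        "input data": "ExampleDataSet",
        "output data": "ExampleResult",
        "model parameters": {"example_parameter": 5},
    },
)

print(example_model.label())
```

Keeping the tagged values in a dictionary mirrors the way a stereotype adds named properties to an existing element without changing the element itself.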
Development Process Methodology

A methodology formally defines the process used to gather requirements, analyze them, and design an application that meets them. There are many methodologies, each differing in some way from the others, and there are many reasons why one methodology may suit a particular project better than another. For example, some are better suited for large enterprise applications, while others are built to design small embedded or safety-critical systems. Some methods better
support large numbers of architects and designers working on the same project, while others work better when used by one person or a small group.

Unified Process and UML Language

The Unified Process and the UML (Unified Modelling Language) are quickly becoming the de facto standards for the development process (software development methodology) within the object-oriented and component-based software communities. The UML is a graphical language for visualizing, specifying, constructing, and documenting the artefacts of a software-intensive system. It offers a standard way to write a system's blueprints, including conceptual things such as business processes and system functions as well as concrete things such as programming language statements, database schemas, and reusable software components [www-uml].

Figure 1 [RUP2000] displays the key concepts of the Rational Unified Process (RUP). The aim of the research is to reuse those concepts in building a data mining methodology.

Figure 1 Key Concepts of Rational Unified Process
Data mining development process methodologies

In the data mining world, several methodologies for data mining projects can be recognized. They are usually tightly connected with software producers such as SAS, SPSS, Oracle or Microsoft. Among the industry-independent methodologies, CRISP-DM is probably the leader. The whole process is described by a four-level hierarchical process model, consisting of sets of tasks at the following levels: phase, generic task, specialized task, process instance. Figure 2 shows the common representation of a data mining project based on CRISP-DM. The data lies at the centre of the process.

Figure 2 Project lifecycle according to CRISP-DM methodology

Integration

In a project where data mining technology is only a part of the whole solution, an integrated environment has to be set up. The Unified Process and UML, as already mentioned, provide an environment already accepted within
software development communities. The next part of the article introduces a possible approach to integrating data mining cases into corporate projects. All the main phases have been refactored, and models have been created according to the Unified Process guidelines. The following changes and additions have been made to the CRISP-DM methodology:

- Roles were introduced. Roles are not explicitly defined in CRISP-DM. Introducing them helps to assign responsibilities properly to persons. For example, the role Analyst is required in the Data Understanding workflow.
- Outputs and products of phases were transformed into artefacts.
- The number of independent deliverables was significantly reduced. Outputs of tasks were integrated and a list of suggested documents was created. Templates in HTML and RTF formats were defined for all documents.
- A modelling tool (Enterprise Architect) was used to model the data mining process. Subsequent documentation can be generated from such a tool, which unifies the output.

Figure 3 Data Mining Process Model Overview (diagram "Process model packages (phases)", package Data Mining Process Model, version 1.0, author Vojtěch Ondryhal). Phases with their tasks and deliverables:
Business Understanding: Project Plan, Requirements, Terminology, Vision
Data Understanding: Analysis Report
Data Preparation: Data Set, Data Set Description
Modelling: Model, Model Description, Model Parameters Settings, Test Design
Evaluation: Evaluation Report, Final Report
Deployment: Deployment Plan, Monitoring And Maintenance Plan
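The package structure of Figure 3 can be captured as a simple lookup table, which makes it easy to check which artefacts a project still owes. The following is only a sketch under stated assumptions: the function names phase_of and missing_artefacts are hypothetical, and the "Data" prefix in phase and artefact names follows the CRISP-DM phase names rather than any tool export.

```python
from typing import Dict, List, Optional

# Phases (packages) and artefacts as listed in the process model overview.
# The "Data" prefix in phase and artefact names follows CRISP-DM naming
# and is an assumption of this sketch.
PROCESS_MODEL: Dict[str, List[str]] = {
    "Business Understanding": ["Project Plan", "Requirements", "Terminology", "Vision"],
    "Data Understanding": ["Analysis Report"],
    "Data Preparation": ["Data Set", "Data Set Description"],
    "Modelling": ["Model", "Model Description", "Model Parameters Settings", "Test Design"],
    "Evaluation": ["Evaluation Report", "Final Report"],
    "Deployment": ["Deployment Plan", "Monitoring And Maintenance Plan"],
}


def phase_of(artefact: str) -> Optional[str]:
    """Return the phase that produces the given artefact, or None if unknown."""
    for phase, artefacts in PROCESS_MODEL.items():
        if artefact in artefacts:
            return phase
    return None


def missing_artefacts(delivered: List[str]) -> Dict[str, List[str]]:
    """Return, per phase, the artefacts that have not been delivered yet."""
    done = set(delivered)
    missing: Dict[str, List[str]] = {}
    for phase, artefacts in PROCESS_MODEL.items():
        todo = [a for a in artefacts if a not in done]
        if todo:
            missing[phase] = todo
    return missing


if __name__ == "__main__":
    print(phase_of("Test Design"))
    print(missing_artefacts(["Vision", "Requirements"])["Business Understanding"])
```

A flat dictionary is enough here because each artefact in the overview belongs to exactly one phase; roles and activities could be added as further keys if needed.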
Business understanding

The artefacts produced during this work are:

- The Vision document provides the first insight into the project. It includes the following parts: background, business objectives, business success criteria, inventory of resources, risks and contingencies, costs and benefits.
- The Requirements document includes requirements, assumptions and constraints, data mining goals and data mining success criteria.
- The Terminology repository (in the form of a document or a model glossary in a tool) holds the relevant business terminology and data mining terminology.
- The Project Plan document, for example in the form of a Gantt chart. The plan lists stages, duration, resources, inputs, outputs and dependencies, including an initial assessment of tools and techniques.

Figure 4 Business understanding workflow detail (package Business Understanding, version 1.0, author Vojtěch Ondryhal). Roles: Business Analyst, Project Manager. Activities: Determine Business Objectives, Assess Situation, Determine Data Mining Goal, Produce Project Plan. Artefacts: Vision, Terminology, Requirements, Project Plan.

Data understanding

The artefact produced in this phase is the Analysis Report document, which contains a report on the initial data collection, a description of the data, and reports on data exploration and data quality.
Figure 5 Data understanding workflow detail (package Data Understanding, version 1.0, author Vojtěch Ondryhal). Role: Analyst. Activities: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality. Artefact: Analysis Report.

Data preparation

This phase creates the data sets that will be used for modelling in the next phases. Each activity displayed in Figure 6 provides a chapter in the Data Set Description document. The Data Set contains real data prepared as an input for modelling. The data are properly selected and cleaned, new data items are constructed where needed, and the data are merged and formatted.

Figure 6 Data preparation workflow detail (package Data Preparation, version 1.0, author Vojtěch Ondryhal). Role: Designer. Activities: Select Data, Clean Data, Construct Data, Integrate Data, Format Data. Artefacts: Data Set, Data Set Description.
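The preparation activities (select, clean, construct, integrate, format) can be read as a chain of small transformations that turns raw records into the data set used for modelling. The sketch below is a minimal illustration in plain Python. The function names, the sample records and the derived attribute "total" are all hypothetical, and a real project would more likely rely on a database or a data mining tool.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def select_data(records: List[Record], fields: List[str]) -> List[Record]:
    """Select: keep only the attributes relevant for modelling."""
    return [{f: r.get(f) for f in fields} for r in records]


def clean_data(records: List[Record]) -> List[Record]:
    """Clean: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def construct_data(records: List[Record]) -> List[Record]:
    """Construct: derive a new data item from existing ones."""
    return [{**r, "total": r["price"] * r["quantity"]} for r in records]


def integrate_data(records: List[Record], lookup: Dict[Any, Record], key: str) -> List[Record]:
    """Integrate: merge in attributes from a second source, matched by key."""
    return [{**r, **lookup.get(r[key], {})} for r in records]


def format_data(records: List[Record], order: List[str]) -> List[Record]:
    """Format: put the attributes in the order the modelling step expects."""
    return [{f: r[f] for f in order} for r in records]


# Hypothetical raw data and lookup table, for illustration only.
raw = [
    {"id": 1, "price": 2.0, "quantity": 3, "note": "x"},
    {"id": 2, "price": None, "quantity": 1, "note": "y"},
]
regions = {1: {"region": "north"}, 2: {"region": "south"}}

data_set = select_data(raw, ["id", "price", "quantity"])
data_set = clean_data(data_set)
data_set = construct_data(data_set)
data_set = integrate_data(data_set, regions, "id")
data_set = format_data(data_set, ["id", "region", "price", "quantity", "total"])

print(data_set)
```

Keeping each activity in its own function matches the way each activity contributes its own chapter to the data set description document.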
Modelling

At the start of the workflow, tests are designed for model training, testing and validation. The model itself is run on the prepared data set to produce results. The Model Parameters Settings artefact lists the parameters required by the model and their values. A model usually behaves differently for different sets of values, so all variants of the settings should be captured and described.

Figure 7 Modelling workflow detail (package Modelling, version 1.0, author Vojtěch Ondryhal). Role: Data Mining Engineer. Activities: Select Modelling Technique, Generate Test Design, Build Model, Assess Model. Artefacts: Test Design, Model, Model Description, Model Parameters Settings, Data Set.

Evaluation

The Evaluation Report indicates how well the results meet the business criteria defined in the Business Understanding phase. During evaluation, models are approved or rejected.

Figure 8 Evaluation phase workflow detail (package Evaluation, version 1.0, author Vojtěch Ondryhal). Roles: Business Analyst, Project Manager, Quality Assurance Manager. Activities: Evaluate Results, Review Process, Determine Next Steps. Artefacts: Evaluation Report, Final Report.
The Final Report contains a review of the whole process and checks whether all required activities have been finished. It also includes a list of possible actions in the project and the decisions on these actions.

Deployment

Deployment is the last workflow in the data mining development process. Deployment packages and a deployment plan for the target environment are created. The Monitoring And Maintenance Plan defines the method of day-to-day checking in order to assure the correctness of the produced results.

Figure 9 Deployment phase workflow details (package Deployment, version 1.0, author Vojtěch Ondryhal). Roles: Deployment Manager, Project Manager. Activities: Plan Deployment, Plan Monitoring And Maintenance, Review Project. Artefacts: Deployment Plan, Monitoring And Maintenance Plan, Final Report.

Conclusion

A possible approach to modelling data mining cases based on UML and CRISP-DM was introduced in the paper. The paper provides insight into more detailed work that includes a detailed description of deliverables, templates and examples. The methodology is based on prototypes that were tried out at the Communication and Information Systems Department of the University of Defence. The advantage of this approach is the unification of project administration (templates, work descriptions, etc.) with other development projects.
References

1. [www-ea] Enterprise Architect web site. http://www.sparxsystems.com.au/
2. [BOHTH05] Buřita L., Ondryhal V., Hodický J., Trunda M., Hlaváček M.: Information Systems, University of Defence, 2005, U-3099 [in Czech]
3. [CD01] CRISP-DM, Step by Step Data Mining Guide v. 1.0, CRISP-DM Consortium. http://www.crisp-dm.org/
4. [RUP2000] Rational Unified Process 2000, online documentation
5. [www-vo] Web pages of the author. http://dcs.unob.cz/~vojtech.ondryhal/ [in Czech]
6. [www-uml] Unified Modelling Language Resource Page. http://www.uml.org/