Managing and Implementing the Data Mining Process Using a Truly Stepwise Approach




Perttu Laurinen (1), Lauri Tuovinen (1), Eija Haapalainen (1), Heli Junno (1), Juha Röning (1) and Dietmar Zettel (2)

(1) Intelligent Systems Group, Department of Electrical and Information Engineering, PO BOX 4500, FIN-90014 University of Oulu, Finland. E-mail: perttu.laurie@ee.oulu.fi
(2) Fachhochschule Karlsruhe, Institut für Innovation und Transfer, Moltkestr. 30, 76133 Karlsruhe, Germany. E-mail: dietmar.zettel@fh-karlsruhe.de

Abstract. Data mining consists of transformation of information with a variety of algorithms to discover the underlying dependencies. The information is passed through a chain of algorithms and usually not stored until it has reached the end of the chain, which may result in a number of difficulties. This paper presents a method for better management and implementation of the data mining process and reports a case study of the method applied to the pre-processing of spot welding data. The developed approach, called truly stepwise data mining, enables more systematic processing of data. It verifies the correctness of the data, allows easier application of a variety of algorithms to the data, manages the work chain, and differentiates between the data mining tasks. The method is based on storage of the data between the main phases of the data mining process. The different layers of the storage medium are defined on the basis of the type of algorithms applied to the data. The layers defined in this research consist of raw data, pre-processed data, features, and models. In conclusion, we present a systematic, easy-to-apply method for implementing and managing the work flow of the data mining process. A case study of applying the method to a resistance spot welding quality estimation project is presented to illustrate the superior performance of the method compared to the currently used approach.

Key words: hierarchical data storage, work flow management, data mining work flow implementation.

1. Introduction

Data mining consists of transformation of information with a variety of algorithms to discover the underlying dependencies. The information is passed through a chain of algorithms, and the success of the process is determined by the outcome. The typical phases of a data mining process are: raw data acquisition, pre-processing, feature extraction, and modeling. The method of managing the interactions between these phases has a major impact on the outcome of the project.

The traditional approach of implementing the data mining process is to combine the algorithms developed for the different phases and to run the data through the chain, as presented in Figure 1. The emphasis in this approach is on the algorithms processing the data. The information is expected to flow smoothly through the chain from the beginning to the end on a single run, and the algorithms are usually implemented within the same application. It is not unusual that the data analyst takes care of all the phases, and little attention is paid to the (non-standard) storage format of the data. This may result in a number of difficulties that detract from the quality of the data mining process. To name a few, the approach makes it more challenging to apply methods implemented in a variety of tools to the data, requires comprehensive knowledge from the analyst, and results in incoherent storage of the research results and data.

The approach proposed in this study has a different perspective toward implementing the data mining process. The emphasis is on a standard way of storing the data between the different phases of the process. This, in turn, increases the independence between the transformations and the data storage. Standard storage makes it possible for the algorithms to access the data through a standard interface, which allows interaction between the data and the algorithms implemented in various applications supporting the interface. The approach will be explained in more detail in the next chapter, and after that, the benefits of applying it will be illustrated with a comparison to the traditional approach and a case study.

Extensive searches of scientific databases and the World Wide Web did not bring to light similar approaches applied to the implementation of the data mining process. However, there are studies and projects on the management of the data mining process.
These studies identify the main phases of the process in a manner similar to that presented in Figure 1 and give a general outline of the steps that should be kept in mind when realizing the process. One of the earliest efforts, perhaps the very earliest one, was CRISP-DM, initiated in 1996 by three companies that proceeded to form a consortium called CRISP-DM (CRoss-Industry Standard Process for Data Mining). CRISP-DM is also the name of the process model created by the consortium, which was proposed to serve as a standard reference for all appliers of data mining [1]. The goal of the process model is to offer a representation of the phases and tasks involved that is generic enough to be applicable to any data mining effort, as well as guidelines on how to apply the process model. Although it is difficult to verify a method as generic beyond all doubt, several studies testify to the usefulness of CRISP-DM as a tool for managing data mining ventures ([2], [3], [4]). The approach proposed in CRISP-DM was extended in RAMSYS [5], which proposed a methodology for performing collaborative data mining work. Other proposals for the data mining process, with many similarities to CRISP-DM, were presented in [6] and [7]. Nevertheless, these studies did not take a stand on what would be an effective implementation of the data mining process in practice. This study proposes an effective approach for implementing the data mining process and compares it to the traditional way of implementing the process, pointing out the obvious advantages of the proposed method.

[Figure 1: The traditional data mining process. Raw data passes through pre-processing, feature extraction, and modelling, with the data flowing directly from one transformation to the next.]

2. A truly stepwise method for managing and implementing the data mining process

This chapter presents a general framework for the proposed method and defines the way in which it can be applied to the management of the data mining process. The two basic issues that result in a number of difficulties when using the traditional approach to data mining are: 1) the transformations are organized in a way that makes them highly dependent on each other, and 2) all transformations are usually calculated at once.

To demonstrate these problems and to present an idea for a solution, the following formalism is used. The data supplied for the analysis can be assumed to be stored in a matrix $X_0$. The result of the analysis, $X_n$, is obtained when the $n$th transformation (function) $f_n$ is applied to the data. The transformations are applied step-by-step to the data, but they are calculated all at once, and the results are not stored until the last transformation has been applied. Using the above notation, the process can be defined as the inner functions of the supplied data, which leads to:

Definition: Stepwise data mining process (the traditional approach). The stepwise data mining process is a chain of inner transformations, $f_1, \ldots, f_n$, that process the raw data, $X_0$, without storing it until the desired data, the output $X_n$, has been obtained:

$$X_n = f_n(f_{n-1}(\ldots f_2(f_1(X_0)) \ldots)).$$

This points out clearly the marked dependence between the transformations and the fact that all transformations are calculated at once. However, the data is not dependent on the transformations in such a way that all transformations would have to be calculated in a single run. The result, $X_n$, might equally well be generated in a truly stepwise application of the transformations, which leads to the definition of the proposed data mining process.
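The traditional chain defined above can be sketched as plain function composition. The following is a minimal illustration only: the three toy transformations (`remove_offset`, `scale_to_unit`, `total`) and the helper `run_chain` are hypothetical names that do not come from the paper, and the input is a short list rather than a matrix.

```python
from functools import reduce


def run_chain(x0, transformations):
    """Compute X_n = f_n(...f_2(f_1(X_0))...) in a single run.

    The intermediate results X_1 ... X_{n-1} exist only inside the fold
    and are discarded, which mirrors the traditional stepwise process.
    """
    return reduce(lambda x, f: f(x), transformations, x0)


# Hypothetical toy transformations standing in for f_1, f_2, f_3.
def remove_offset(xs):
    lowest = min(xs)
    return [x - lowest for x in xs]


def scale_to_unit(xs):
    top = max(xs) or 1.0  # avoid division by zero for an all-zero signal
    return [x / top for x in xs]


def total(xs):
    return sum(xs)


x0 = [2.0, 4.0, 6.0]
x_n = run_chain(x0, [remove_offset, scale_to_unit, total])
print(x_n)
```

Obtaining the output of `scale_to_unit` alone requires running `remove_offset` again, because nothing between the steps is kept.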

Definition: Truly stepwise data mining process. The results, $X_1, \ldots, X_n$, of each transformation, $f_1, \ldots, f_n$, are stored in a storage medium before applying the next transformation in the chain to them:

$$
\begin{aligned}
1)\quad & X_1 = f_1(X_0) \\
2)\quad & X_2 = f_2(X_1) \\
& \;\;\vdots \\
n)\quad & X_n = f_n(X_{n-1}).
\end{aligned}
$$

This approach makes the transformations less dependent on each other: to be able to calculate the $k$th transformation ($k = 1, \ldots, n$), one does not need to calculate all the $(k-1)$ transformations prior to $k$, but just to fetch the data, $X_{k-1}$, stored after the $(k-1)$th transformation and to apply the transformation $k$ to that. The obvious difference between the two processes is that, in the latter, the result of the $k$th transformation is dependent only on the data, $X_{k-1}$, while in the former, it is dependent on $X_0$ and the transformations $f_1, \ldots, f_{k-1}$. The difference between these two definitions, or approaches, might seem small at this stage, but it will be shown below how large it actually is.

In theory, the result of each transformation could be stored in the storage medium (preferably a database). In the context of data mining, however, it is more feasible to store the data only after the main phases of the transformations. The main phases are the same as those shown in Figure 1. Now that the proposed truly stepwise data mining process has been defined and the main phases have been identified, the stepwise process presented in Figure 1 can be altered to reflect the developments, as shown in Figure 2. The apparent change is the emphasis on the storage of the data. In Figure 1, the data flowed from one transformation to another, and the boxes represented the transformations. In Figure 2, the boxes represent the data stored in the different layers, and the transformations make the data flow from one layer to another. Thus, the two figures have all the same components, but the effect of emphasizing the storage of the data is apparent. The notion of the transformations carrying the data between the storage layers also seems more natural than the idea that the data is transmitted between the different transformations.
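The truly stepwise definition can be illustrated with a small sketch in which every result is written to a store before the next step runs. Everything here is an assumption made for illustration: the paper only says the storage medium is preferably a database, so an in-memory SQLite table is used, results are serialized as JSON, and the names `open_store`, `save`, `load`, and `apply_step` are hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a store with one row per stored result X_k (assumed layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(step INTEGER PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def save(conn, step, data):
    conn.execute(
        "INSERT OR REPLACE INTO results (step, data) VALUES (?, ?)",
        (step, json.dumps(data)),
    )
    conn.commit()


def load(conn, step):
    row = conn.execute(
        "SELECT data FROM results WHERE step = ?", (step,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no stored result for step {step}")
    return json.loads(row[0])


def apply_step(conn, k, f_k):
    """Compute X_k = f_k(X_{k-1}) from the stored X_{k-1} and store X_k."""
    x_k = f_k(load(conn, k - 1))
    save(conn, k, x_k)
    return x_k


def remove_offset(xs):
    return [x - min(xs) for x in xs]


def scale_to_unit(xs):
    return [x / max(xs) for x in xs]


conn = open_store()
save(conn, 0, [2.0, 4.0, 6.0])                    # X_0, the raw data
apply_step(conn, 1, remove_offset)                # X_1 = f_1(X_0)
x2_first = apply_step(conn, 2, scale_to_unit)     # X_2 = f_2(X_1)

# Step 2 depends only on the stored X_1, however it got there:
save(conn, 1, [0.0, 1.0, 2.0])                    # e.g. corrected by hand
x2_again = apply_step(conn, 2, scale_to_unit)     # f_1 is not rerun
```

The last two lines show the property the definition emphasizes: the second transformation is recomputed from stored data without touching the first one.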
A few more comments on the diagram should be made before presenting the comparison of the two approaches. Four storage layers are defined, i.e. the layers of raw data, pre-processed data, features, and models. One more layer could be added to the structure: a layer representing the best model selected from the pool of available models. On the other hand, this is not necessary, since the presented approach could be applied to the pool of models, treating the generated models as raw data. In this case, the layers would define the required steps for choosing the best model. Another comment can be made concerning the amount and scope of data stored in the different layers. As the amount of data grows toward the bottom layers, the scope of data decreases, and vice versa. In practice, if the storage capabilities of the system are limited and unlimited amounts of data are available, the stored features may cover a broader range of data than the raw data could. This is pointed out in the figure by the two arrows on the sides.

[Figure 2: The four storage layers of the proposed data mining process. From bottom to top: data layer 1 (raw data), data layer 2 (pre-processed data), data layer 3 (features), and data layer 4 (models). Pre-processing, feature extraction, and analysis/modelling carry the data from one layer to the next. Two arrows on the sides indicate the directions in which the amount of data and the scope of data grow.]

3. The proposed vs. the traditional method

In this chapter, the various benefits of the truly stepwise approach over the stepwise one are illustrated.

Independence between the different phases of the data mining process. In the stepwise approach, the output of a transformation is directly dependent on each of the transformations applied prior to it. To use an old phrase, the chain is as weak as its weakest link. In other words, if one of the transformations does not work properly, none of the transformations following it can be assumed to work properly, either, since each is directly dependent on the output of the previous transformations. In the truly stepwise method, a transformation is directly dependent only on the data stored in the layer immediately prior to the transformation, not on the previous transformations. The transformations prior to a certain transformation do not necessarily have to work perfectly; it is enough that the data stored in the previous layers is correct. From the viewpoint of the transformations, it does not matter how the data was acquired, e.g. whether it was calculated using the previous transformations or even inserted manually.

The multitude of algorithms easily applicable to the data. In the stepwise procedure, the algorithms must be implemented in one way or another inside the same tool, since the data flows directly from one algorithm to another. In the truly stepwise approach, the number of algorithms is not limited to those implemented in a certain tool, but is proportional to the number of tools that implement an interface for accessing the storage medium. The most frequently used interface is the database interface for accessing data stored in a database using SQL. Therefore, if a standard database is used as a storage medium, the number of algorithms is limited to the number of tools implementing a database interface, which is large.

Specialization and teamwork of researchers. The different phases of the data mining process require so much expertise that it is hard to find people who would be experts in all of them. It is easier to find an expert specialized in some of the phases or transformations. However, in most data mining projects, the researcher must apply or know details of many, if not all, of the steps of the data mining chain, to be able to conduct the work. This results in wasted resources, since it takes some of her / his time away from the area she / he is specialized in. Furthermore, when a team of data miners is performing a data mining project, it might be that everybody is doing a bit of everything. This results in confusion in the project management and desynchronization of the tasks. Using the proposed method, the researchers can work on the data relevant to their specialization. When a team of data miners is working on the project, the work can be naturally divided between the workers by allocating the data stored in the different layers to suit the expertise and skills of each person.

Data storage and on-line monitoring. The data acquired in the different phases of the data mining process is stored in a coherent way when, for example, a standard database is used to implement the truly stepwise process.
When the data can be accessed through a standard interface after the transformations, one can peek in on the data at any time during the process. This can be convenient, especially in situations where the data mining chain is delivered as a finished implementation. When using a database interface, one can even select the monitoring tools from a set of readily available software. To monitor the different phases of the stepwise process, it would be necessary to display the output of the transformations in some way, which requires extra work.

Time savings. When the data in the different layers has been calculated once in the truly stepwise process, it does not need to be re-calculated unless it needs to be changed. When working with large data sets, this may result in enormous time savings. Using the traditional method, the transformations must be recalculated when one wants to access the output of any phase of the data mining chain, which results in unnecessary waste of staff and CPU time.

Now that the numerous benefits of the proposed method have been presented, we could ask what the drawbacks of the method are. The obvious cost is the time, care, and effort one has to invest in defining the interface for transferring the intermediate data to the storage space. On the other hand, if this work is left undone, one may have to put twice as much time into tackling the flaws in the data mining process. It might also seem that the calculation of the whole data mining chain using the stepwise process is faster than in the truly stepwise process. That is true, but once the transformations in the truly stepwise process are ready and finished, the process can be run in the stepwise manner. In conclusion, no obvious drawbacks are so far detectable in the truly stepwise process.
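The time-saving argument can be made concrete with a small fetch-or-compute helper: a layer is calculated only if it is not yet in the store. This is a sketch under assumptions, not the paper's implementation; the table layout, the helper name `fetch_or_compute`, and the stand-in function `expensive_preprocessing` are all hypothetical.

```python
import json
import sqlite3


def fetch_or_compute(conn, layer, compute):
    """Return the stored data of a layer, computing it only if missing."""
    row = conn.execute(
        "SELECT data FROM layers WHERE name = ?", (layer,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])          # already calculated once
    data = compute()
    conn.execute(
        "INSERT INTO layers (name, data) VALUES (?, ?)",
        (layer, json.dumps(data)),
    )
    conn.commit()
    return data


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layers (name TEXT PRIMARY KEY, data TEXT NOT NULL)")

calls = {"count": 0}


def expensive_preprocessing():
    """Stand-in for a slow pre-processing sequence (hypothetical)."""
    calls["count"] += 1
    return [0.0, 0.5, 1.0]


first = fetch_or_compute(conn, "preprocessed", expensive_preprocessing)
second = fetch_or_compute(conn, "preprocessed", expensive_preprocessing)
print(calls["count"])
```

Because the stored layer sits behind an ordinary SQL interface, the same table could also be inspected by any monitoring tool that can issue a `SELECT`.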

4. A case study: pre-processing spot welding data

This chapter illustrates the benefits of the proposed method in practice. The idea is here applied to a data mining project analysing the quality of spot welding joints, and a detailed comparison to the traditional approach is made concerning the amount of work required for acquiring pre-processed data.

The spot welding quality improvement project (SIOUX) is a two-year EU-sponsored CRAFT project aiming to create non-destructive quality estimation methods for a wide range of spot welding applications. Spot welding is a welding technique widely used in, for example, the electrical and automotive industries, where more than 100 million spot welding joints are produced daily in the European vehicle industry alone [8]. Non-destructive quality estimates can be calculated based on the shape of the signal curves measured during the welding event [9], [10]. The method results in savings in time, material, environment, and salary costs, which are the kind of advantages that the European manufacturing industry should have in its competition against outsourcing work to cheaper countries.

The collected data consists of information regarding the welded materials, the quality of the welding spot, the settings of the welding machine, and the voltage and current signals measured during the welding event. To demonstrate the data, the left panel of Figure 3 displays a typical voltage curve acquired from a welding spot, and the right panel shows a resistance curve obtained by pre-processing the data.

Figure 3: The left panel shows a voltage signal of a welding spot measured during a welding event. The high variations and the flat regions are still apparent in the diagram. The right panel shows the resistance curve after pre-processing.

The data transformations needed for pre-processing signal curves consist of removal of the flat regions from the signal curves (welding machine inactivity), normalization of the curves to a pre-defined interval, smoothing of the curves using a filter, and calculation of the resistance curve based on the voltage and current signals.
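The four pre-processing transformations listed above can be sketched as follows. The paper does not specify the actual algorithms, so every concrete choice here is an assumption: inactivity is detected by a current threshold, smoothing is a centred moving average, normalization is min-max scaling, resistance is computed sample by sample from Ohm's law (R = V / I), and the order of the steps is one plausible arrangement. All function names are hypothetical.

```python
def remove_inactive(voltage, current, threshold=0.05):
    """Drop samples where the current is near zero (assumed inactivity)."""
    kept = [(v, i) for v, i in zip(voltage, current) if abs(i) > threshold]
    return [v for v, _ in kept], [i for _, i in kept]


def resistance(voltage, current):
    """Sample-wise resistance from Ohm's law, R = V / I."""
    return [v / i for v, i in zip(voltage, current)]


def moving_average(xs, window=3):
    """Centred moving average; the window shrinks at the curve ends."""
    half = window // 2
    out = []
    for k in range(len(xs)):
        lo, hi = max(0, k - half), min(len(xs), k + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out


def normalize(xs, low=0.0, high=1.0):
    """Min-max scaling of a curve to the interval [low, high]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [low for _ in xs]           # constant curve
    return [low + (x - lo) * (high - low) / (hi - lo) for x in xs]


def preprocess(voltage, current):
    v, i = remove_inactive(voltage, current)
    return normalize(moving_average(resistance(v, i)))


# Toy signals: the machine is inactive at the start and at the end.
voltage = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0]
current = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
curve = preprocess(voltage, current)
print(curve)
```

Removing the inactive samples before dividing also keeps the resistance calculation away from division by zero.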
The transformations are implemented in software written specifically for this project, called Tomahawk. The software incorporates all the algorithms required for calculating the quality estimate of a welding spot, along with a database for storing the welding data. The software and the database are closely connected, but independent. The basic principles of the system are presented in Figure 4. The special beauty of Tomahawk lies in the way the algorithms are implemented as a connected chain. Hence, the product of applying all the algorithms is the desired output of the data mining process. The algorithms are called plug-ins, and the way the data is transmitted between each pair of plug-ins is well defined. When the program is executed, the chain of plug-ins is executed at once. This is an implementation of the definition of the stepwise (traditional) data mining process.

[Figure 4: The operating principle of the Tomahawk software. Welding data passes through a chain of plug-ins (plug-in 1: transformation 1, plug-in 2: transformation 2, plug-in 3: transformation 3, ..., plug-in n: transformation n) to produce a quality measure. The architecture is a realization of the stepwise data mining process.]

When the project has been completed, all the plug-ins should be ready and work for all kinds of welding data as seamlessly as presented in Figure 4. However, in the production phase of the system, when the plug-ins are still under active development, three major issues that interfere with the daily work of the development team can be recognized among those discussed in the chapter "The proposed vs. the traditional method".

Independence. It cannot be guaranteed that all parts of the pre-processing algorithms would work as they should for all the available data. However, the researcher working on the pre-processed data is dependent on the pre-processing sequence. Because of this, she / he cannot be sure that the data is always correctly pre-processed.

Specialization and teamwork. The expert working on the pre-processed data might not have the expertise to correctly pre-process the raw data in the context of Tomahawk, which would make it impossible for her / him to perform her / his work correctly.

The multitude of algorithms easily applicable to the data. In the production phase, it is better if the range of algorithms tested on the data is not exclusively limited to the implementation of the algorithms in Tomahawk, since it would require a lot of effort to re-implement algorithms available elsewhere as plug-ins before testing them.

The solution was to develop Tomahawk in such a way that it supports the truly stepwise data mining process. A plug-in capable of storing and delivering pre-processed data was implemented. Figure 5 presents the effects of the developments.
The left panel displays the pre-processing sequence prior to the adjustments. All the plug-ins were calculated at once, and they had to be properly configured to obtain properly pre-processed data. The right panel shows the situation after the adoption of the truly stepwise data mining process. The pre-processing can be done in its own sequence, after which a plug-in that inserts the data into the database is applied. Now the pre-processed data is in the database and available for further use at any given time.

[Figure 5: The left panel shows the application of the stepwise data mining process to the pre-processing of the raw data in Tomahawk: a sequence of pre-processing plug-ins reads from the Tomahawk database and feeds the subsequent plug-ins directly. The right panel shows Tomahawk after the modifications that made it support the truly stepwise data mining process for pre-processing: one plug-in outputs the pre-processed data to the database, and another loads the pre-processed data from the database.]

The first and second issues are simple to solve by using the new approach. The pre-processing expert of the project takes care of properly configuring the pre-processing plug-ins. If the plug-ins need to be re-configured or re-programmed for different data sets, she / he has the requisite knowledge to do it, and after the application of the re-configured plug-ins, the data can be saved in the database. If it is not possible to find a working combination of plug-ins at the current state of development, the data can still be pre-processed manually, which would not be feasible when using the stepwise process. After this, the expert working on pre-processed data can load the data from the database and be confident that the data she / he is working on has been correctly pre-processed.

The third issue is also easy to solve; after the modifications, the set of algorithms that can be tested on the data is no longer limited to those implemented in Tomahawk, but includes tools that have a database interface implemented in them, for example Matlab. This drastically expands the range of available algorithms, which in turn makes it faster to find an algorithm suitable for a given task. As soon as a suitable algorithm has been found, it can be implemented in Tomahawk.
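The arrangement described above, a pre-processing sequence that ends in a plug-in writing to the database and a separate analysis sequence that starts with a plug-in reading from it, might look roughly like the sketch below. This is not Tomahawk's plug-in API, which the paper does not show; plug-ins are modelled as plain callables, the database is an in-memory SQLite table, and all names (`run_plugins`, `make_store_plugin`, `make_load_plugin`, `clip_negative`, `halve`) are hypothetical.

```python
import json
import sqlite3


def run_plugins(data, plugins):
    """Execute a chain of plug-ins, passing each output to the next."""
    for plugin in plugins:
        data = plugin(data)
    return data


def make_store_plugin(conn, item_id):
    """Plug-in that outputs pre-processed data to the database."""
    def store(data):
        conn.execute(
            "INSERT OR REPLACE INTO preprocessed (item_id, data) VALUES (?, ?)",
            (item_id, json.dumps(data)),
        )
        conn.commit()
        return data
    return store


def make_load_plugin(conn, item_id):
    """Plug-in that loads pre-processed data from the database."""
    def load(_unused=None):
        row = conn.execute(
            "SELECT data FROM preprocessed WHERE item_id = ?", (item_id,)
        ).fetchone()
        if row is None:
            raise KeyError(item_id)
        return json.loads(row[0])
    return load


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE preprocessed (item_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
)


# Hypothetical pre-processing plug-ins configured by the pre-processing expert.
def clip_negative(xs):
    return [max(x, 0.0) for x in xs]


def halve(xs):
    return [x / 2 for x in xs]


# Sequence 1 (pre-processing expert): ends by storing the result.
run_plugins([-1.0, 2.0, 4.0],
            [clip_negative, halve, make_store_plugin(conn, "weld_1")])

# Sequence 2 (analyst): starts from the stored data, no pre-processing set-up.
peak = run_plugins(None, [make_load_plugin(conn, "weld_1"), max])
print(peak)
```

The analyst's sequence contains no pre-processing configuration at all, which is the division of labour the two experts rely on.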
Finally, a comparison of the steps required for pre-processing the data in the SIOUX project using the stepwise and truly stepwise approaches is presented. The motivation of the comparison is to demonstrate how large a task it would be for the researcher working on pre-processed data to pre-process the data using the stepwise approach before she / he could start the actual work.

If one wants to acquire pre-processed data using the stepwise approach, it takes the application and configuration of 8 plug-ins to pre-process the data. The left panel of Figure 6 shows one of the configuration dialogs of the plug-ins. This particular panel has 4 numerical values that must be set correctly and the option of setting 6 check boxes. The total number of options the researcher has to set in the 8 plug-ins for acquiring correctly pre-processed data is 68. The 68 options are not the same for all the data gathered in the project, and it requires advanced pre-processing skills to configure them correctly. Therefore, it is quite a complicated task to pre-process the data, and it is especially difficult for a researcher who has not constructed the pre-processing plug-ins. The need to configure the 68 options of the pre-processing sequence would take a lot of time and expertise away from the actual work and still give poor confidence that the data is correctly pre-processed.

To acquire the pre-processed data using the truly stepwise approach, one only needs to fetch the data from the database. The right panel of Figure 6 shows the configuration dialog of the database plug-in, which is used to configure the data fetched for analysis from the database. Using the dialog, the researcher working on the pre-processed data can simply choose the pre-processed data items that will be used in the further analyses, and she / he does not have to bother with the actual pre-processing of the data. The researcher can be sure that all the data loaded from the database has been correctly pre-processed by the expert in pre-processing. From the viewpoint of the researcher responsible for the pre-processing, it is good to know that the sequence of pre-processing plug-ins does not have to be run every time that pre-processed data is needed, and that she / he can be sure that correctly pre-processed data will be used in the further steps of the data mining process.

In conclusion, by using the stepwise process, a researcher working with pre-processed data could never be certain that the data had been correctly pre-processed, or that all the plug-ins had been configured the way they should, which resulted in confusion and uncertainty about the quality of the data. The truly stepwise process, on the other hand, allowed a notably simple way to access the pre-processed data, resulted in time savings, and ensured that the analyzed data were correctly pre-processed.

Figure 6: The left panel shows one of the 8 dialogues that need to be filled in to acquire pre-processed signal curves. The right panel shows the dialogue that is used for fetching raw and pre-processed data directly from the database.

5. Conclusions

This paper presented a new approach for managing the data mining process, called the truly stepwise data mining process. In the truly stepwise process, the transformed data is stored after the main phases of the data mining process, and the transformations are applied to data fetched from the data storage medium. The benefits of the process compared to the stepwise data mining process (the traditional approach) were analyzed. It was noted that the proposed approach increases the independence of the algorithms applied to the data and the number of algorithms easily applicable to the data, and makes it easier to manage and allocate the expertise and teamwork of the data analysts. Also, data storage and on-line monitoring of the data mining process are easier to organize using the new method, and it saves both staff and CPU time. The approach was illustrated using a case study of a spot welding data mining project. The two approaches were compared, and it was demonstrated that the proposed method markedly simplified the tasks of the specialist working on the pre-processed data. In the future, the possibilities to apply the approach on a finer scale will be studied; here it was only applied after the main phases of the data mining process. The feature and model data of the approach will also be demonstrated, and the application of the method will be extended to other data mining projects.

6. Acknowledgements

We would like to express our gratitude to our colleagues at Fachhochschule Karlsruhe, Institut für Innovation und Transfer, in Harms + Wende GmbH & Co.KG [11], in Technax Industrie [12] and in Stanzbiegetechnik GesmbH [13] for providing the data set, the expertise needed in the case study and for numerous other things that made it possible to accomplish this work. We also wish to thank the graduate school GETA [14], supported by the Academy of Finland, for sponsoring this research. Furthermore, this study has been financially supported by the Commission of the European Communities, specific RTD programme "Competitive and Sustainable Growth", G1ST-CT-2002-50245, SIOUX (Intelligent System for Dynamic Online Quality Control of Spot Welding Processes for Cross(X)-Sectoral Applications). It does not necessarily reflect the views of this programme and in no way anticipates the Commission's future policy in this area.

References

[1] P. Chapman, J. Clinton, T. Khabaza, T. Reinartz and R. Wirth, "CRISP-DM 1.0 Step-by-step data mining guide," August, 2000.
[2] Hotz, E., Grimmer, U., Heuser, W. & Nakhaeizadeh, G. 2001. REVI-MINER, a KDD-Environment for Deviation Detection and Analysis of Warranty and Goodwill Cost Statements in Automotive Industry. In Proc. Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), 432-437.
[3] Liu, J.B. & Han, J. 2002. A Practical Knowledge Discovery Process for Distributed Data Mining. In Proc. ISCA 11th International Conference on Intelligent Systems: Emerging Technologies, 11-16.
[4] Silva, E.M., do Prado, H.A. & Ferneda, E. 2002. Text mining: crossing the chasm between the academy and the industry. In Proc. Third International Conference on Data Mining, 351-361.
[5] S. Moyle and A. Jorge, "RAMSYS - A methodology for supporting rapid remote collaborative data mining projects," in ECML/PKDD'01 workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning: Internal SolEuNet Session, 2001, pp. 20-31.
[6] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Publishers, 1999.
[7] R.J. Brachman and T. Anand, "The Process of Knowledge Discovery in Databases: A Human-Centered Approach," in Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy Eds. MIT Press, 1996, pp. 37-58.
[8] TWI World Centre for Materials Joining Technology, information available at their homepage: http://www.twi.co.uk/j32k/protected/bad_3/kssaw001.html, referenced 31.12.2003.
[9] Laurinen, P.; Junno, H.; Tuovinen, L.; Röning, J.; Studying the Quality of Resistance Spot Welding Joints Using Bayesian Networks, Artificial Intelligence and Applications (AIA 2004), February 16-18, 2004, Innsbruck, Austria.
[10] Junno, H.; Laurinen, P.; Tuovinen, L.; Röning, J.; Studying the Quality of Resistance Spot Welding Joints Using Self-Organising Maps, Fourth International ICSC Symposium on Engineering of Intelligent Systems (EIS 2004), February 29 - March 2, 2004, Madeira, Portugal.
[11] Harms + Wende GmbH & Co.KG, the world wide web page: http://www.harmswede.de/, referenced 13.2.2004.
[12] Technax Industrie, the world wide web page: http://www.techaxidustrie.com/, referenced 13.2.2004.
[13] Stanzbiegetechnik GesmbH, the world wide web page: http://www.stazbiegetechik.at/startseite/idex.php, referenced 13.2.2004.
[14] Graduate school GETA, the world wide web page: http://wooster.hut.fi/geta/, referenced 13.2.2004.