A Data Mining Support Environment and its Application on Insurance Data




From: KDD-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved.

A Data Mining Support Environment and its Application on Insurance Data

M. Staudt, J.-U. Kietz, U. Reimer
Swiss Life, Information Systems Research (CH/IFuE), CH-8022 Zurich, Switzerland
{Martin.Staudt,Uwe.Kietz,Ulrich.Reimer}@swisslife.ch

Abstract

Huge masses of digital data about products, customers and competitors have become available for companies in the services sector. In order to exploit the inherent (and often hidden) knowledge for improving business processes, the application of data mining technology is the only way to reach good and efficient results, as opposed to purely manual and interactive data exploration. This paper reports on a project initiated at Swiss Life for mining its data resources from the life insurance business. Based on the Data Warehouse MASY, which collects all relevant data from the OLTP systems for the processing of private life insurance contracts, a Data Mining environment is set up which integrates a palette of tools for automatic data analysis, in particular machine learning approaches. Special emphasis lies on establishing comfortable data preprocessing support for normalised relational databases and on the management of meta data.

Introduction

Although the amount of digital data available in most companies is growing fast due to the rapid technical progress of hardware and data recording technology, the valuable information hidden in the data is barely exploited. Instead of huge collections of low-level data we need abstract and high-level information that is tailored to the users' (mostly management people) needs and can be directly applied for improving decision making processes, for detecting new trends, for elaborating suited strategies, etc. Unfortunately, the data collections from which to derive this information have a chaotic structure, are often erroneous, of doubtful quality and only partially integrated. In order to bridge the gap between both sides, i.e.
to find a reasonable way of turning data into information, we need (efficient) algorithms that can perform parts of the necessary transformations automatically. There will always remain manual steps in this data analysis and information gathering task, like the selection of data subsets and the evaluation of the resulting hypotheses. However, the automatic processing should cover all those parts that cannot be handled properly by humans due to the sheer size of the transformations and their output. Knowledge discovery in databases (KDD) aims at the automatic detection of implicit, previously unknown and potentially useful patterns in the data. One prerequisite for employing automatic analysis tools is a consolidated and homogenized set of data as basis. Data Warehouses provide exactly this, thus forming the ideal first step in setting up a KDD process. In order to explore how Data Mining tools can complement its Data Warehouse, Swiss Life set up a data mining project concerned with the design and implementation of the Data Mining environment ADLER. In particular, ADLER aims at enabling users to execute mining tasks as independently as possible from data mining experts' support and at making a broad range of data mining applications possible. It turned out that the most crucial factor for a successful application of mining algorithms is comprehensive support for preprocessing which takes into account meta data about the data sources and the algorithms. Although the data warehouse already provides a consolidated and homogeneous data collection as basis for analysis tasks, we have to deal with certain task- and algorithm-specific data preparation steps, which include further mining-tool-specific data cleaning as well as restructuring, recoding and transformation of the multi-relational data. The rest of this paper is organized as follows: Section 2 gives some information about the Data Warehouse MASY and the components of the Data Mining environment.
Section 3 summarizes the mining technology employed, i.e. the algorithms and their integration. Section 4 describes the meta data management task within the mining environment and in particular explains with an example from our concrete Data Warehouse schema how preprocessing operations can be supported. Section 5 presents several applications for ADLER. (ADLER stands for Analysis, Data Mining and Learning Environment of Rentenanstalt/Swiss Life.)

Data Warehouse and Data Mining Environment

The masses of digital data available at Swiss Life - not only data from insurance contracts but also from external sources, such as information about the sociodemographic structure and the purchasing power of the population in the various parts of the country - led to the development of the central integrated Data Warehouse MASY (Fritz 1996). MASY comprises data from currently four OLTP systems: contract data (about 700,000 contracts, some 500,000 clients) plus externally available data collections. Some of the data sources are shown in Figure 1. The basic insurance contract data e.g. stems from the systems EVBS and VWS, while GPRA contains (personal) data about all Swiss Life clients and partners. BWV is a publicly available catalogue of all (3 million) Swiss households. Whereas the OLTP systems only contain the actual data, MASY keeps the history of changes, too. MASY is implemented on top of ORACLE and follows a ROLAP warehouse architecture, i.e. employs on top of the relational structures a multidimensional OLAP front-end. The database itself has both a normalized relational schema gained from integrating the schemas of all source systems, and a derived (redundant) denormalized Galaxy schema to efficiently support multi-dimensional access for some predefined tasks. The normalized schema contains around 20 GB of data, distributed over approximately 30 tables and 600 attributes. Figure 2 shows an excerpt of this schema with the 10 main relations.

Based on the homogenized data in MASY, the Data Mining environment ADLER offers tools and a methodology for supporting information extraction from the relational structures. Compared to OLAP tools, which mainly concentrate on interactive exploration, abstraction and aggregation of data (which also leads to new information), this extraction takes place semiautomatically. Interpreting the mining results and initiating certain preprocessing on the data (as discussed below) remains a manual task. Figure 1 shows the overall architecture of ADLER and its relationship to MASY and the OLTP sources respectively.

[Figure 1: Data Mining Environment ADLER]

Data Mining Technology

Since there is not the distinguished data mining approach suited for handling all kinds of analysis tasks, ADLER integrates a broad palette of approaches, including classical statistical ones as well as techniques from the areas of Machine Learning and Neural Nets. Figure 1 shows some algorithms that are available. With respect to their output we can distinguish between the following categories of algorithms:

1. detection of associations, rules and constraints: a common application of these techniques are e.g. market basket analyses.
2. identification of decision criteria: with decision trees as one possible output format we can support tasks like credit assignments.
3. discovery of profiles, prototypes and segmentations: for example, classes of customers with similar properties can be grouped together and handled in a uniform way.

Another categorization concerns the kind of data allowed:

a. sets of attribute-value pairs describing properties of certain data objects represented in one single relation (attribute-based approaches), or
b. tuples from different relations and background knowledge, e.g. Datalog programs (relational approaches).

The latter category in particular allows to include additional background knowledge and arbitrary combinations of different classes of data objects. While statistical algorithms as well as most Machine Learning and commercially available Data Mining algorithms are attribute-based, Inductive Logic Programming approaches fall into the second category of relational approaches (Kietz 1996). Following the two orthogonal categorizations above, the algorithms in Figure 1 can be classified as follows:

Tool     Category  Reference
Explora  1a        (Hoschka & Klösgen 1991)
MIDOS    1b        (Wrobel 1997)
[remaining rows of the tool table are not recoverable from the source]
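The difference between the two data categories can be illustrated with a small sketch (all names and values here are invented, not taken from MASY): an attribute-based algorithm expects a single relation of attribute-value vectors, so multi-relational data must first be flattened, e.g. by joining and summarising the 1:N side.

```python
# Invented toy data: a relational learner could consume these two relations
# directly; an attribute-based one needs a single flattened relation.
clients = [
    {"ptid": 1, "marstat": "married"},
    {"ptid": 2, "marstat": "single"},
]
policies = [  # 1:N related relation (several policies per client)
    {"prptid": 1, "benefit": 100},
    {"prptid": 1, "benefit": 50},
    {"prptid": 2, "benefit": 70},
]

def propositionalize(clients, policies):
    """Flatten the multi-relational data into one attribute-value relation,
    summarising the 1:N side so attribute-based algorithms can handle it."""
    rows = []
    for c in clients:
        total = sum(p["benefit"] for p in policies if p["prptid"] == c["ptid"])
        rows.append({"ptid": c["ptid"], "marstat": c["marstat"],
                     "total_benefit": total})
    return rows
```

Such a flattening step loses the ability to express arbitrary relationships between object classes, which is exactly what the relational (ILP-style) approaches retain.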

[Figure 2: MASY schema excerpt]

Offering a great variety of different mining methods requires an open framework that allows the integration of quite different types of algorithms and a high extensibility. The heart of the ADLER workbench is the Data Mining tool KEPLER (Wrobel et al. 1996), which relies on the idea of plug-ins: its tool description language and a basic API enable the building of wrappers around a given implemented mining algorithm and its inclusion into the applicable set of tools. From a tool description KEPLER can generate masks for the source data and for setting parameters. The type of the mining results (e.g. rule or decision tree) is also given in the tool description. A result handling specification determines whether the tool itself presents the result, KEPLER does it, or some external display tool should be started.

All the mentioned mining algorithms are based on pure main-memory processing. We assume that the main memory in today's high-end workstations (more than 1 GB) should be sufficient to hold the data needed to obtain optimal results with hypothesis spaces that are still small enough to be searchable in reasonable time. Therefore, employing specific disk-based methods does not seem meaningful to us. However, according to our experience it is absolutely necessary to do all the preprocessing (including data selection, sampling, recoding, etc.) on the database directly and not in main memory, because the amount of raw data to be dealt with is in realistic applications by far too large (in our case about 11 GB). As the data warehouse is read-only, we have added a database as a cache between the warehouse and ADLER. All preprocessing is done on this cache database. The result of preprocessing is loaded into main memory for the mining task proper. This approach is in contrast to other Data Mining environments which provide preprocessing only for data in main memory, and certainly not for multi-relational data as ADLER does.
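The cache-database approach can be sketched as follows. This is a minimal illustration using Python's sqlite3 as a stand-in for the ORACLE warehouse and cache (ADLER itself did not use sqlite, and the table, columns and values are invented):

```python
import sqlite3

# Stand-ins for the read-only warehouse and the writable cache database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE partner (ptid INTEGER, bdate TEXT, marstat TEXT)")
warehouse.executemany("INSERT INTO partner VALUES (?, ?, ?)",
                      [(1, "19600415", "married"),
                       (2, "10000101", "single"),    # default date marking a missing value
                       (3, "19701231", "divorced")])

cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE partner (ptid INTEGER, bdate TEXT, marstat TEXT)")

# Step 1: copy the selected subset from the warehouse into the cache.
rows = warehouse.execute("SELECT ptid, bdate, marstat FROM partner").fetchall()
cache.executemany("INSERT INTO partner VALUES (?, ?, ?)", rows)

# Step 2: preprocess on the cache database, not in main memory.
cache.execute("UPDATE partner SET bdate = NULL "
              "WHERE bdate IN ('10000101', '99999999')")

# Step 3: only the preprocessed result is loaded into main memory for mining.
prepared = cache.execute(
    "SELECT ptid, bdate, marstat FROM partner ORDER BY ptid").fetchall()
```

The point of the pattern is that the expensive selection and cleaning work stays on the database side, and only the final, much smaller result set crosses into main memory.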
KEPLER also offers a number of (main-memory-based) preprocessing operations and input/output interfaces. For the integration with MASY these facilities were extended by building an interface to the cache database mentioned above, on which the preprocessing operations are executed and intermediate data is buffered.

Meta Data Management and Preprocessing Support

We now turn to the management of meta data within ADLER. After explaining its basic role we describe the model that was implemented to capture the meta data relevant for mining with ADLER. We also give an example for the support given by ADLER during the preprocessing and transformation phases.

The Role of Meta Data

Meta data plays an important role for the integration task in distributed information systems, both at design and run time. Figure 1 shows three different types of meta data important in our context: The Swiss Life IT department administrates an overall enterprise data model that describes all relevant objects, agents and systems in the company. The actual implementation platform is Rochade. The MASY Data Warehouse meta data is mainly stored in certain ORACLE tables. Parts of it have also been ported to the Rochade repository. Typically, such meta data contains information about the integration and transformation process (technical meta data) and the warehouse contents itself, e.g. different (materialized) views and levels of abstraction that are available for the users (business meta data). Among the meta data relevant for Data Mining we can distinguish between the following categories:

- background knowledge, like product models or household data
- knowledge about data selection and transformations needed to properly feed each mining algorithm with data for a certain task
- an application methodology that gives criteria as to which tools should be used in which way to solve which problems.
The ADLER Meta Model

Instead of handling meta data of single applications (like classical data dictionaries) within the operational database system - a very common way for Data Warehouse solutions - it is much more appropriate to manage it separately from the data and combine it with other kinds of meta data. Consequently, we introduce

[Figure 3: Meta Model of ADLER - Data Representation and Preprocessing]

an independent meta data repository with a powerful representation formalism that plays a prominent role throughout the whole KDD process. ADLER employs the deductive and object-oriented meta data management system ConceptBase (Jarke et al. 1995) and its knowledge representation language Telos to describe the mining meta data, relevant excerpts of the MASY meta data and the links to the enterprise meta data stored in Rochade. In the following, we concentrate on meta data describing the MASY schema and certain transformation classes available for the preprocessing phase. Typical preprocessing operations needed to produce or transform attributes are:

Recoding and discretisation: Most Data Mining tools (including KEPLER) assume a correspondence between the data type of an attribute (e.g. integer, string) and its metric (e.g. nominal, ordinal, scalar). This is usually not the case in a data warehouse, so that recoding or even discretisation is needed, depending on the capabilities of the Data Mining tool and the intended metric for the Data Mining task.

Scaling: Scaling is needed if the scales of two attributes to be compared are different. Without proper scaling, similarity measures, as used for clustering or regression, will produce garbage.

Abstraction and grouping: Nominal attributes with too many different values are useless, or at least difficult, for Data Mining tools without attribute value grouping (e.g. the profession code in MASY has more than 14000 different values). Such attributes have to be abstracted to be of any use.

Aggregation and construction: Constructed attributes can greatly improve data mining results (Rendell & Seshu 1990). Aggregation is most often needed to incorporate the summary of attributes from a 1:N related table into a single table, e.g. the total amount of benefits from different policies of a single household.
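As a rough illustration, the four operation types might look as follows in procedural form. This is a sketch with invented data, cut points and groupings; in ADLER such steps were executed as generated SQL on the cache database:

```python
from statistics import mean, stdev

def recode(value, codebook):
    """Recoding: map a stored code word to the value it stands for
    (the same mechanism covers abstraction/grouping of nominal values)."""
    return codebook.get(value, value)

def discretise(value, cut_points, labels):
    """Discretisation: map a scalar onto ordered intervals."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

def z_scale(values):
    """Scaling: bring attributes with different scales onto a common one,
    so that similarity measures compare like with like."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def aggregate_1_to_n(policies):
    """Aggregation: summarise a 1:N related table into one attribute,
    e.g. total benefits per household (household_id, benefit) pairs."""
    totals = {}
    for household_id, benefit in policies:
        totals[household_id] = totals.get(household_id, 0) + benefit
    return totals
```

For instance, `discretise(34, [18, 65], ["minor", "adult", "senior"])` buckets an age into an ordinal attribute, and `aggregate_1_to_n` turns per-policy rows into a single per-household figure.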
Figure 3 shows the graphical representation of an O-Telos object collection managed by ConceptBase. The displayed objects constitute parts of the meta schema used in ADLER to control the preprocessing and mining activities and to describe the source data and their relationships. We leave out the above mentioned explicit connection to the warehouse sources and start with relations available for the mining process. Apart from database relations (DBRelation) we consider also relations extracted from flat text files available as additional background knowledge not integrated in the warehouse (mainly for experimental purposes). The model also provides so-called dynamic views that do not exist in the database but are created on the fly during the KDD process, and it provides database views defined on top of base relations by SQL expressions. All types of relations have columns, namely attributes (instances of RelAttribute). Attribute values are very often discrete, and possible values of an attribute can be organized in code tables, particularly for reasons of efficiency. An entry in such a table relates a short code word with the actually intended value. Our database currently employs about 250 code tables with the number of entries ranging from 2 to several hundreds. Discrete attributes can be either nominal or ordinal; a further category comprises cardinal data, which is not only ordered but also allows to measure distances between values. These categories are covered by Metrics. Furthermore, from a semantic point of view attributes can be assigned to sorts, like money or years. The data type of an attribute (like NUMBER, STRING) is also recorded in the meta data base but not shown in Figure 3, as well as the allocation of relations in databases, the owner, and other kinds of user information. One or more attributes usually constitute the key of a relation, i.e. their value(s) uniquely identifies a tuple. Single attribute keys are called SimpleKey, like a client identifier (see table PARTNER below).

Particularly interesting for the restructuring of base data by generating new derived relations are relationships between attributes. In the case of basic relationships, e.g. given by foreign keys, we assume the equals relationship between the involved attributes. Typically, joins between relations do not involve arbitrary attributes but only those where it makes sense. The model allows to state such attribute pairs by specifying join links, possibly together with additional information concerning the join result (e.g. 1:N or N:M matching values). Links between attributes labeled comparable_with give additional hints for possible comparisons during the mining phase. More complicated relationships between attributes result from applying attribute transformation operators (instances of AttrTrans) to an attribute (as input) to obtain another (new) attribute (output). The application of such operators will be explained by the example in the section below.

Operators to generate derived relations (RelGen) specify target attributes of the (new) relation and a condition part posing certain restrictions both on these attributes as well as on additional ones (cond-attr). From all involved attributes, the associated relations needed for executing the necessary joins and for performing the selections can be determined automatically.

In order to give an idea how the meta schema explained above can be used in ADLER at runtime, we now show an example stemming from the preparation of our first analyses of the data available in MASY. For the example we use three relations from Figure 2:

PARTNER: client data relation with key attribute PTID for the client number and further attributes, e.g. birthdate (BDATE), marital status (MARSTAT), number of children (NOCHLD);

PARROL: relates insurance policies (VVERT) with partners (PARTNER) with respect to their role in the contract (e.g. insured person or policy holder, PRTYP) and certain other information (e.g. the contract date POLDTE), with foreign keys PRPTID to PARTNER and PRVVID to VVERT;

HHOLD: household data, mainly general data (e.g. type of flats or houses (BLDGTYP), typical professional qualification (PROFQ) and buying power class (BUYPWRCL) of inhabitants) available for so-called cells of 30-50 households in Switzerland, with key HHID.

The records in PARTNER are partly linked to households by the (foreign key) attribute PTHHID. Thus we can obtain further information about the client's environment via the household he is living in. In Telos notation the meta data repository would contain this information as shown in Figure 4. Note that we instantiate the classes in Figure 3. The has-attribute values of the instances of DBRelation simply name the relevant attributes (in the Telos syntax each value listed has an identifier of its own, given in front of the value followed by a colon). The foreign key relationships mentioned above are modeled as instances of the equals relationship from RelAttribute to itself.

PARTNER in DBRelation with
  has-attribute
    a1: PTID;
    a27: BDATE;
    a34: NOCHLD;
    a35: MARSTAT;
    a67: PTHHID;
    ...
end

PARROL in DBRelation with
  has-attribute
    a1: PRID;
    a2: PRPTID;
    a3: PRVVID;
    a8: PRTYP;
    a17: POLDTE;
    ...
end

HHOLD in DBRelation with
  has-attribute
    a1: HHID;
    a49: BLDGTYP;
    a53: PROFQ;
    a55: BUYPWRCL;
    ...
end

PRPTID in RelAttribute with
  equals
    o: PTID
end

PTHHID in RelAttribute with
  equals
    o: HHID
end

Figure 4: Relation and attribute specification
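A RelGen operator names only target and condition attributes; the participating relations and join conditions are derived from the attribute-level meta data. The following sketch illustrates one way such a derivation could work. The attribute-to-relation map and the emitted SQL shape are simplifying assumptions for illustration, not ADLER's actual mapping rules:

```python
# Invented stand-ins for the ConceptBase meta data: which relation each
# attribute belongs to, and the equals (foreign key) links between attributes.
ATTR_REL = {"PTID": "PARTNER", "PTHHID": "PARTNER", "AGE": "PARTNER",
            "BLDGTYP": "HHOLD", "PROFQ": "HHOLD", "HHID": "HHOLD",
            "PRPTID": "PARROL", "POLDTE": "PARROL"}
EQUALS = [("PRPTID", "PTID"), ("PTHHID", "HHID")]

def derive_query(target_attrs, cond_attrs, condition):
    """Derive the participating relations and the implied join conditions
    from attribute-level meta data, then emit a (simplified) SQL query."""
    # The relations follow automatically from the attributes involved.
    rels = sorted({ATTR_REL[a] for a in target_attrs + cond_attrs})
    # equals links between attributes of participating relations become joins.
    joins = [f"{ATTR_REL[a]}.{a} = {ATTR_REL[b]}.{b}"
             for a, b in EQUALS
             if ATTR_REL[a] in rels and ATTR_REL[b] in rels]
    where = " AND ".join(joins + [condition])
    return (f"SELECT {', '.join(target_attrs)} "
            f"FROM {', '.join(rels)} WHERE {where}")
```

For example, asking for PTID, AGE and BLDGTYP under a POLDTE condition pulls in PARTNER, HHOLD and PARROL and both foreign-key joins without the user ever writing them.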
In order to construct a relation NPARTNER for a certain mining task which restricts the client set to those people who made a new contract since the beginning of last year, we perform the transformation steps given below. We are particularly interested in the age and the (weighted) buying power class distribution and want to include the BLDGTYP and PROFQ attributes from HHOLD.

1. The birthdate attribute BDATE of PARTNER has a specific date default for missing values. This default value influences the age distribution, so it should be substituted with NULL. The result is an attribute column BDATE'.

2. From BDATE' and the current date the age AGE of each client can be obtained.

3. The buying power class of a client should be weighted with his marital status and the number of children. We first code the values of the marital status as numerical values (new attribute MARSTAT') under the assumption that single persons can afford more things (value 1), while the value of married people is neutral (0) and the value of divorced and separated persons is negative (-1).

4. The new weighted buying power class WBUYPWRCL results from adding this value to the old class value and further reducing it by one for every two children.

5. Relation NPARTNER is defined as consisting of the attributes constructed during the previous steps and those of HHOLD mentioned above. In addition to the join between PARTNER and HHOLD and the attribute transformations, a join with PARROL is required in order to eliminate clients without new contracts.
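Steps 1-4 can be sketched procedurally as follows (an illustration with invented sample values; ADLER realized such steps declaratively via Telos operator instances and generated SQL):

```python
from datetime import date

MISSING_DATES = {"10000101", "99999999"}  # defaults that mark a missing birthdate
MARSTAT_CODE = {"single": 1, "married": 0, "separated": -1, "divorced": -1}

def step1_null_default(bdate):
    """Step 1: substitute the missing-value default dates with NULL (None)."""
    return None if bdate in MISSING_DATES else bdate

def step2_age(bdate, today=date(1998, 1, 1)):
    """Step 2: derive AGE from the year component of BDATE' (YYYYMMDD)."""
    return None if bdate is None else today.year - int(bdate[:4])

def step3_code_marstat(marstat):
    """Step 3: code marital status numerically (single 1, married 0,
    divorced/separated -1)."""
    return MARSTAT_CODE[marstat]

def step4_wbuypwrcl(buypwrcl, marstat_coded, nochld):
    """Step 4: weighted buying power class, reduced by one per two children."""
    return buypwrcl + marstat_coded - nochld // 2

# Example client: married, 3 children, buying power class 5, born 1960.
bdate = step1_null_default("19600415")
age = step2_age(bdate)
wb = step4_wbuypwrcl(5, step3_code_marstat("married"), 3)  # 5 + 0 - 1 = 4
```

Step 5 is then a join of the constructed attributes with HHOLD and PARROL, filtered on the contract date.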

The join conditions in both cases are derived implicitly from the equals links in the model. The selection condition concerning the contract date POLDTE has to be given explicitly.

Preprocessing Support: An Example

Each of the above steps is realized by instances of AttrTrans (steps 1-4) and RelGen (step 5). Instances of AttrTrans belong to different types of transformation classes (specializations of AttrTrans) with suiting parameters. Instances of RelGen can be understood as generalized multi-join operators which are applied to attributes only, but with a given selection condition and an implicit derivation of the participating relations. The attribute transformations concern either one attribute (NullIntr, AttrDecode, AttrComp for replacing certain values by NULL, performing general value replacements, or arbitrary computations) or several attributes (AttrComb for arbitrary value combinations).

Step 1: build BDATE'

NullBDATE in NullIntr with
  input   i: BDATE
  output  o: BDATE'
  value   v1: "10000101";
          v2: "99999999"
end

The value attribute (defined for NullIntr) specifies the date values (digits of year, month and day) that were misused for marking missing values and should be replaced by NULL.

Step 2: build AGE

CompAGE in AttrComp with
  input   i: BDATE'
  output  o: AGE
  comp_expression
          c: ":o = substr(today,1,4) - substr(:i,1,4)"
end

The comp_expression value in CompAGE allows the derivation of the output value :o for attribute o of CompAGE from the value :i of i by subtracting the year component of :i from the system date today.

Step 3: build MARSTAT'

DecMARSTAT in AttrDecode with
  input   i: MARSTAT
  output  o: MARSTAT'
  from    from1: "single";
          from2: "married";
          from3: "separated";
          from4: "divorced"
  to      to1: 1;
          to2: 0;
          to3: -1;
          to4: -1
end

Each from value in DecMARSTAT is replaced by its corresponding to value.

Step 4: build WBUYPWRCL

CombWBUYPWRCL in AttrComb with
  input   i1: MARSTAT';
          i2: BUYPWRCL;
          i3: NOCHLD
  output  o: WBUYPWRCL
  comp_expression
          c: ":o = :i2 + :i1 - (:i3 div 2)"
end

Step 5: generate NPARTNER

GenNPARTNER in RelGen with
  output  o: NPARTNER
  target  t1: PTID;
          t2: AGE;
          t3: WBUYPWRCL;
          t4: BLDGTYP;
          t5: PROFQ
  cond-attr
          a1: POLDTE
  condition
          c: ":a1 > '19960101'"
end

Besides the advantage of having a documentation of executed transformations, another motivation for modeling the preprocessing steps is to enable the user to specify the required transformations in a stepwise fashion, while the actual generation of a specified relation is done automatically by ADLER, so that the user does not need to have any knowledge of SQL. The automatic generation builds upon the concretely modeled operations like those shown above. The operation classes (like NullIntr) are associated via mapping rules to SQL code. Even for the small number of attributes in our example the resulting (SQL) expressions become very complex, and their direct manual specification is error-prone and laborious. Of course, there should also be GUI support protecting the user from having to type the still cryptic Telos specifications. Furthermore, analyses and therefore also the necessary preprocessing steps are usually reexecuted from time to time on different database states. Storing the transformation sequences speeds up the whole process and facilitates its automation.

Meta Data for Mining Activities

While our plans for recording transformation and preprocessing steps are relatively clear, the considerations of how to represent the actual mining activities are more vague. As a first step it is necessary to model the mining algorithms with respect to different requirements, parameters and results. Each mining session consists of executing algorithms (corresponding to instantiating the mining model) in order to pursue a certain mining goal. The type of the algorithm and the available data sources obviously relate to the mining goal. We think it is useful to record the complete path from the first transformation in the preprocessing phase to the successfully generated result (e.g., a decision tree) which satisfies the intended goals. In future sessions (in particular on new database states) it can then be reconstructed how the whole analysis process took place. Furthermore, we intend to boil down our mining experience for the various kinds of tasks into instructions and generalized examples of how to best approach similar mining tasks. This forms only a very superficial mining methodology, but it should still help new users to get more rapidly acquainted with data mining in ADLER. In addition, the preprocessing and mining activities do not constitute a sequential but a cyclic process, where it is useful even during the same mining session to have the possibility to go back to previous steps in order to reexecute them after having made some modifications.

Applications and Outlook

For Swiss Life, Data Mining has a high potential to support marketing initiatives that preserve and extend the market share of the company. In order to experiment with different mining algorithms a number of concrete applications were identified and selected:

Potential Clients: One might think that the wealthy inhabitants of a certain geographical area are the most promising candidates for acquiring new contracts, but this is usually not the case. An interesting data mining task is therefore to find out what the typical profiles of Swiss Life customers are with respect to the various insurance products. An insurance agent can then use these profiles to determine, from a publicly available register of households, those non-customers of his geographic area who are quite possibly interested in a certain kind of life insurance.

Customer Losses: One way to reach lower cancellation rates for insurance contracts is via preventive measures directed to customers that are angered through personal circumstances or better offers from competitors.
By mining the data about previous cancellations, using unemployment statistics for specific regions and questionnaire data as background knowledge, we expect to obtain classification rules that identify customers who may be about to terminate their insurance contracts.

Other mining tasks concern the identification of differences between the typical Swiss Life customers and those of the competitors, and the segmentation of all persons in the MASY warehouse into the so-called RAD 2000 target groups, based on a set of fuzzy and overlapping group definitions developed by the Swiss Life marketing department some years ago. A subset of our data was anonymized and made available on the Web for further experiments.2

Acknowledgements: We would like to thank the KEPLER team at GMD St. Augustin and Dialogis GmbH for their collaboration and support.3

2 It can be obtained from http://research.swisslife.ch/kdd-sisyphus/.
3 KEPLER is commercially available from Dialogis GmbH (http://www.dialogis.de).

KDD-98 111