From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.¹

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined.
This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Copyright © 1996, American Association for Artificial Intelligence. All rights reserved. AI Magazine, Fall 1996.

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, "If customer bought X, he/she is also likely to buy Y and Z." Such patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

In other areas, a well-publicized system is IBM's ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources.
For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like. CRAYON allows users to create their own free newspaper (supported by ads); NEWSHOUND (<sjmercury.com/hound/>) from the San Jose Mercury News and FARCAST automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data.
Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.

Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician's art of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

Figure 1. An Overview of the Steps That Compose the KDD Process. (Flow: Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation / Evaluation -> Knowledge.)

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern).
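The requirement that discovered patterns be valid on new data, together with the earlier warning that a long enough search finds apparently significant patterns even in randomly generated data, can be illustrated with a small simulation. The following is a minimal sketch in Python and is not part of the original article: the function names, the numbers of cases and fields, and the use of Pearson correlation as the pattern score are all illustrative assumptions. It searches many purely random fields for the one most correlated with a random target and then rechecks that field on held-out cases.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dredge(n_train=30, n_test=300, n_fields=500, seed=0):
    """Search random fields for the best training correlation with a random target.

    Returns (field_index, train_correlation, test_correlation). Every value
    is noise, so any strong training correlation is spurious by construction.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = None
    for j in range(n_fields):
        field = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r_train = pearson(field[:n_train], target[:n_train])
        if best is None or abs(r_train) > abs(best[1]):
            # Certainty is re-estimated on cases the search never saw.
            r_test = pearson(field[n_train:], target[n_train:])
            best = (j, r_train, r_test)
    return best


if __name__ == "__main__":
    index, r_train, r_test = dredge()
    print(f"best of 500 random fields: #{index}")
    print(f"correlation on the 30 searched cases: {r_train:+.2f}")
    print(f"correlation on 300 held-out cases:    {r_test:+.2f}")
```

With settings like these, the selected field typically shows a sizable correlation on the searched cases and a much smaller one on the held-out cases, which is the gap that a measure of certainty estimated on new data is meant to expose.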
An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data-mining algorithm.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section.

The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps.
The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component, which has, by far, received the most attention in the literature.
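The basic flow of steps can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the record fields (income, debt, defaulted), the single threshold-rule pattern language, and the interestingness function (an equal-weight average of confidence and support, standing in for validity and usefulness) are hypothetical choices, not definitions from the article. It strings together selection, cleaning, transformation, data mining, and evaluation, and it keeps only the patterns whose interestingness reaches a user-chosen threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Rule of the form: debt / income > threshold  =>  default."""
    threshold: float
    support: float     # fraction of all cases covered by the rule
    confidence: float  # fraction of covered cases that defaulted


def select(raw, fields=("income", "debt", "defaulted")):
    """Step 2: create the target data set by keeping only the chosen fields."""
    return [{f: record.get(f) for f in fields} for record in raw]


def clean(rows):
    """Step 3: one simple missing-data strategy, dropping incomplete cases."""
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows):
    """Step 4: project each case onto one derived feature, the debt-to-income ratio."""
    return [(r["debt"] / r["income"], bool(r["defaulted"]))
            for r in rows if r["income"] > 0]


def mine(cases, thresholds):
    """Step 7: enumerate candidate patterns over a small, finite search space."""
    patterns = []
    for t in thresholds:
        covered = [defaulted for ratio, defaulted in cases if ratio > t]
        if covered:
            patterns.append(Pattern(threshold=t,
                                    support=len(covered) / len(cases),
                                    confidence=sum(covered) / len(covered)))
    return patterns


def interestingness(p):
    """Hypothetical score: equal-weight mix of validity and usefulness proxies."""
    return 0.5 * p.confidence + 0.5 * p.support


def kdd_process(raw, thresholds, min_interest):
    """Run the steps in order; keep patterns at or above the interestingness threshold."""
    cases = transform(clean(select(raw)))
    found = [p for p in mine(cases, thresholds)
             if interestingness(p) >= min_interest]
    return sorted(found, key=interestingness, reverse=True)


if __name__ == "__main__":
    # Invented records for illustration only.
    records = [
        {"income": 50.0, "debt": 10.0, "defaulted": False},
        {"income": 20.0, "debt": 30.0, "defaulted": True},
        {"income": None, "debt": 5.0, "defaulted": False},
        {"income": 60.0, "debt": 30.0, "defaulted": False},
        {"income": 25.0, "debt": 50.0, "defaulted": True},
    ]
    for pattern in kdd_process(records, thresholds=[0.4, 1.0, 3.0], min_interest=0.7):
        print(pattern)
```

The sketch runs the steps once, in order. An interactive system would let the user loop back, for example by changing the thresholds or the derived feature after inspecting the retained patterns.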

The Data-Mining Step of the KDD Process

The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods.

The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user's hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst.
It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search.

Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes. (Horizontal axis: income; vertical axis: debt.)

In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

The purpose here is to illustrate basic ideas on a small problem in two-dimensional space.

Figure 3. A Simple Linear Classification Boundary for the Loan Data Set. The shaded region denotes class no loan. (Axes: income, debt; regions: no loan, loan.)

Figure 4. A Simple Linear Regression for the Loan Data Set. (Axes: income, debt; the regression line is shown.)

Data-Mining Methods

The two high-level primary goals of data mining in practice tend to be prediction and description. As stated earlier, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa), the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data-mining methods.

Classification is learning a function that maps (classifies) a data item into one of several predefined classes (Weiss and Kulikowski 1991; Hand 1981). Examples of classification methods used as part of knowledge discovery applications include the classifying of trends in financial markets (Apte and Hong 1996) and the automated identification of objects of interest in large image databases (Fayyad, Djorgovski, and Weir 1996). Figure 3 shows a simple partitioning of the loan data into two class regions; note that it is not possible to separate the classes perfectly using a linear decision boundary. The bank might want to use the classification regions to automatically decide whether future loan applicants will be given a loan or not.

Regression is learning a function that maps a data item to a real-valued prediction variable.
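Both definitions can be made concrete on a loan-style data set. The sketch below is an assumption-laden illustration rather than the procedure behind figures 3 and 4: the eight cases in LOANS are invented (they are not the 23 cases of figure 2), the classifier is a minimum-distance-to-class-mean rule chosen because it yields a linear decision boundary with very little code, and the regression is ordinary least squares of debt on income.

```python
def centroid(points):
    """Mean (income, debt) of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_linear_classifier(cases):
    """Fit a linear boundary from a list of (income, debt, defaulted) cases.

    A point is assigned to the class whose mean is nearer, which reduces to
    the sign of w . x + bias. Returns ((w_income, w_debt), bias).
    """
    defaulted = [(i, d) for i, d, y in cases if y]
    good = [(i, d) for i, d, y in cases if not y]
    m_def = centroid(defaulted)
    m_good = centroid(good)
    w = (m_def[0] - m_good[0], m_def[1] - m_good[1])
    bias = -0.5 * ((m_def[0] ** 2 + m_def[1] ** 2)
                   - (m_good[0] ** 2 + m_good[1] ** 2))
    return w, bias


def predicts_default(model, income, debt):
    """Classification: map a data item to one of two predefined classes."""
    (w_income, w_debt), bias = model
    return w_income * income + w_debt * debt + bias > 0


def fit_line(xs, ys):
    """Regression: least-squares slope and intercept of y as a function of x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented (income, debt, defaulted) cases, in thousands; illustration only.
LOANS = [
    (20.0, 40.0, True), (30.0, 50.0, True), (25.0, 30.0, True), (40.0, 60.0, True),
    (60.0, 20.0, False), (70.0, 30.0, False), (80.0, 10.0, False), (50.0, 40.0, False),
]

if __name__ == "__main__":
    model = fit_linear_classifier(LOANS)
    print("new applicant (income 30, debt 45) predicted to default:",
          predicts_default(model, 30.0, 45.0))
    slope, intercept = fit_line([i for i, _, _ in LOANS], [d for _, d, _ in LOANS])
    print(f"debt ~ {slope:.3f} * income + {intercept:.3f}")
```

The two fits differ only in what is predicted: the classifier returns a class label for a new applicant, whereas the regression returns a real-valued estimate of debt.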
Regression applications are many, for example, predicting the amount of biomass present in a forest given remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. Figure 4 shows the result of simple linear regression where total debt is fitted as a linear function of income: The fit is poor because only a weak correlation exists between the two variables.

Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data (Jain and Dubes 1988; Titterington, Smith, and Makov 1985). The categories can be mutually exclusive and exhaustive or consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context include discovering homogeneous subpopulations for consumers in marketing databases and identifying subcategories of spectra from infrared sky measurements (Cheeseman and Stutz 1996). Figure 5 shows a possible clustering of the loan data set into three clusters; note that the clusters overlap, allowing data points to belong to more than one cluster. The original class labels (denoted by x's and o's in the previous figures) have been replaced by a + to indicate that the class membership is no longer assumed known. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating from data the joint multivariate probability density function of all the variables or fields in the database (Silverman 1986).

Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules (Agrawal et al. 1996), multivariate visualization techniques, and the discovery of functional relationships between variables (Zembowicz and Zytkow 1996). Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.

Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale.
For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.

Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values (Berndt and Clifford 1996; Guyon, Matic, and Vapnik 1996; Kloesgen 1996; Matheus, Piatetsky-Shapiro, and McNeill 1996; Basseville and Nikiforov 1993).

The Components of Data-Mining Algorithms

The next step is to construct specific algorithms to implement the general methods we outlined. One can identify three primary components in any data-mining algorithm: (1) model representation, (2) model evaluation, and (3) search. This reductionist view is not necessarily complete or fully encompassing; rather, it is a convenient way to express the key concepts of data-mining algorithms in a relatively unified and compact manner. Cheeseman (1990) outlines a similar structure.

Model representation is the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method. It is equally important that an algorithm designer clearly state which representational assumptions are being made by a particular algorithm. Note that increased representational power for models increases the danger of overfitting the training data, resulting in reduced prediction accuracy on unseen data.

Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters. Note that original labels are replaced by a +. (Axes: income and debt; regions labeled Cluster 1, Cluster 2, and Cluster 3.)

Model-evaluation criteria are quantitative statements (or fit functions) of how well a particular pattern (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by the empirical prediction accuracy on some test set. Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model.

Search method consists of two components: (1) parameter search and (2) model search. Once the model representation (or family of representations) and the model-evaluation criteria are fixed, then the data-mining problem has been reduced to purely an optimization task: Find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model-evaluation criteria given observed data and a fixed model representation. Model search occurs as a loop over the parameter-search method: The model representation is changed so that a family of models is considered.

Some Data-Mining Methods

A wide variety of data-mining methods exist, but here, we only focus on a subset of popular techniques. Each method is discussed in the context of model representation, model evaluation, and search.

Figure 6. Using a Single Threshold on the Income Variable to Try to Classify the Loan Data Set. (Axes: income and debt; threshold t on income separates the No Loan and Loan regions.)

Decision Trees and Rules

Decision trees and rules that use univariate splits have a simple representational form, making the inferred model relatively easy for the user to comprehend. However, the restriction to a particular tree or rule representation can significantly restrict the functional form (and, thus, the approximation power) of the model. For example, figure 6 illustrates the effect of a threshold split applied to the income variable for a loan data set: It is clear that using such simple threshold splits (parallel to the feature axes) severely limits the type of classification boundaries that can be induced.
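A single-threshold rule of this kind also makes the three components concrete. In the minimal sketch below, the representation is the rule "income > t", the evaluation criterion is accuracy on the training cases, and the parameter search simply tries every midpoint between consecutive income values. The data and function names are hypothetical, chosen only for illustration, and exhaustive search is one simple option among many.

```python
def accuracy(threshold, incomes, labels):
    """Fraction of cases classified correctly by the rule
    'predict good status (1) if income > threshold, otherwise 0'."""
    correct = sum((inc > threshold) == bool(lab)
                  for inc, lab in zip(incomes, labels))
    return correct / len(incomes)

def best_threshold(incomes, labels):
    """Parameter search: evaluate midpoints between consecutive distinct incomes."""
    values = sorted(set(incomes))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, incomes, labels))

# Hypothetical toy cases (not the loan data set shown in the figures).
incomes = [15, 22, 30, 38, 45, 52, 60, 75]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

t = best_threshold(incomes, labels)
print(t, accuracy(t, incomes, labels))  # prints: 48.5 0.875
```

Even the best threshold misclassifies one case here, because no single cut on income separates the two classes; a richer representation would be needed to do better.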
If one enlarges the model space to allow more general expressions (such as multivariate hyperplanes at arbitrary angles), then the model is more powerful for prediction but can be much more difficult to comprehend. A large number of decision tree and rule-induction algorithms are described in the machine-learning and applied statistics literature (Quinlan 1992; Breiman et al. 1984).

To a large extent, they depend on likelihood-based model-evaluation methods, with varying degrees of sophistication in terms of penalizing model complexity. Greedy search methods, which involve growing and pruning rule and tree structures, are typically used to explore the superexponential space of possible models. Trees and rules are primarily used for predictive modeling, both for classification (Apte and Hong 1996; Fayyad, Djorgovski, and Weir 1996) and regression, although they can also be applied to summary descriptive modeling (Agrawal et al. 1996).

Nonlinear Regression and Classification Methods

These methods consist of a family of techniques for prediction that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables. Examples include feedforward neural networks, adaptive spline methods, and projection pursuit regression (see Elder and Pregibon [1996], Cheng and Titterington [1994], and Friedman [1989] for more detailed discussions). Consider neural networks, for example. Figure 7 illustrates the type of nonlinear decision boundary that a neural network might find for the loan data set. In terms of model evaluation, although networks of the appropriate size can universally approximate any smooth function to any desired degree of accuracy, relatively little is known about the representation properties of fixed-size networks estimated from finite data sets. Also, the standard squared error and cross-entropy loss functions used to train neural networks can be viewed as log-likelihood functions for regression and classification, respectively (Ripley 1994; Geman, Bienenstock, and Doursat 1992). Back propagation is a parameter-search method that performs gradient descent in parameter (weight) space to find a local maximum of the likelihood function starting from random initial conditions. Nonlinear regression methods, although representationally powerful, can be difficult to interpret. For example, although the classification boundaries of figure 7 might be more accurate than the simple threshold boundary of figure 6, the threshold boundary has the advantage that the model can be expressed, to some degree of certainty, as a simple rule of the form "if income is greater than threshold, then loan will have good status."

Example-Based Methods

The representation is simple: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: The class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.

A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a sensible metric between the variables.
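The sensitivity of nearest-neighbor prediction to the distance metric can be seen in a small sketch. It assumes Euclidean distance and, as one common remedy, rescales each field to zero mean and unit variance before measuring distance. The fields, cases, and function names are hypothetical, and standardization is only one of several ways a metric might be defined.

```python
import math

def make_scaler(rows):
    """Build a function that rescales each field to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for a constant field.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]

    def scale(row):
        return tuple((v - m) / s for v, m, s in zip(row, means, stds))

    return scale

def nearest_neighbor(train_rows, train_labels, query):
    """Return the label of the training case closest to query (Euclidean distance)."""
    closest = min(range(len(train_rows)),
                  key=lambda i: math.dist(train_rows[i], query))
    return train_labels[closest]

# Hypothetical cases: (income in dollars, loan duration in years).
rows = [(30000, 2), (31000, 30), (60000, 3), (61000, 28)]
labels = ["good", "bad", "good", "bad"]
query = (30900, 3)

raw_answer = nearest_neighbor(rows, labels, query)

scale = make_scaler(rows)
scaled_rows = [scale(r) for r in rows]
scaled_answer = nearest_neighbor(scaled_rows, labels, scale(query))

print(raw_answer, scaled_answer)  # prints: bad good
```

With raw values, a difference of a few hundred dollars outweighs a difference of decades in loan duration, so the query is matched to a 30-year loan; after rescaling, it is matched to the case with a similar duration. Neither answer is correct in itself; the point is that the prediction depends on the chosen metric.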
Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: Parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data and not explicitly formulated. Related techniques include kernel-density estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set. (Axes: income and debt; regions labeled No Loan and Loan.)

Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set. (Axes: income and debt; regions labeled No Loan and Loan.)

Probabilistic Graphic Dependency Models

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby both the structure and the parameters of graphic models can be learned directly from databases (Buntine 1996; Heckerman 1996). Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in terms of reducing the model search space. Although still primarily in the research phase, graphic model induction methods are of particular interest to KDD because the graphic form of the model lends itself easily to human interpretation.

Relational Learning Models

Although decision trees and rules have a representation restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. Most research to date on model-evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands in terms of search. See Dzeroski (1996) for a more detailed discussion.

Discussion

Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope; many data-mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion on data-mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which traditionally has been cast as a predictive regression task (autoregressive models, and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based models, and kernel methods. Furthermore, there has been significant interest in descriptive graphic and local data modeling of time series rather than purely predictive modeling (Weigend and Gershenfeld 1993). Thus, although different algorithms and applications might appear different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics).
However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by a second-order polynomial (for example). Thus, there is no universal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We chose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: The state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes.
It is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Furthermore, low noise levels (few data errors) are another consideration. High amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify the aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problem that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables and millions of records and of a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables); so, the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general.
Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.
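The N/1000 figure is simply the number of models multiplied by a per-test level of 0.001. The sketch below checks the arithmetic against a simulation and, for comparison, computes the per-test level obtained by dividing a desired overall level by the number of tests, which is commonly called a Bonferroni adjustment. It assumes that p-values on purely random data are uniformly distributed; the function names and the figures of 100,000 models and a 0.05 overall level are illustrative assumptions.

```python
import random

def expected_false_positives(n_models, level):
    """Expected number of models accepted by chance on purely random data."""
    return n_models * level

def simulate_false_positives(n_models, level, seed=0):
    """Count chance acceptances when every p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    return sum(rng.random() < level for _ in range(n_models))

def bonferroni_level(family_level, n_models):
    """Per-test level so that the chance of any false acceptance
    is at most family_level (by the union bound)."""
    return family_level / n_models

n = 100_000
print(expected_false_positives(n, 0.001))  # about 100 spurious "discoveries"
print(simulate_false_positives(n, 0.001))  # a simulated count near that value
print(bonferroni_level(0.05, n))           # a far stricter per-test level
```

A search over a hundred thousand candidate models can therefore be expected to report roughly a hundred significant patterns even when the data contain no structure at all.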

This point is frequently missed by many initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.

Changing data and knowledge: Rapidly changing (nonstationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only (Matheus, Piatetsky-Shapiro, and McNeill 1996). See also Agrawal and Psaila (1995) and Mannila, Toivonen, and Verkamo (1995).

Missing and noisy data: This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies (Heckerman 1996; Smyth et al. 1996).

Complex relationships between fields: Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed (Dzeroski 1996; Djoko, Cook, and Holder 1995).

Understandability of patterns: In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations (Buntine 1996; Heckerman 1996), rule structuring, natural language generation, and techniques for visualization of data and knowledge.
Rule-refinement strategies (for example, Major and Mangano [1995]) can be used to address a related problem: The discovered knowledge might be implicitly or explicitly redundant.

User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches (for example, Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search (for example, Simoudis, Livezey, and Kerber [1995]).

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodating real-time sensor readings. Examples of integrated KDD systems are described by Simoudis, Livezey, and Kerber (1995) and Stolorz, Nakamura, Mesrobian, Muntz, Shek, Santos, Yi, Ng, Chien, Mechoso, and Farrara (1995).

Concluding Remarks: The Potential Role of AI in KDD

In addition to machine learning, other AI fields can potentially contribute significantly to various aspects of the KDD process. We mention a few examples of these areas here:

Natural language presents significant opportunities for mining in free-form text, especially for automated annotation and indexing prior to classification of text corpora. Limited parsing capabilities can help substantially in the task of deciding what an article refers to. Hence, the spectrum from simple natural language processing all the way to language understanding can help substantially. Also, natural language processing can contribute significantly as an effective interface for stating hints to mining algorithms and visualizing and explaining knowledge derived by a KDD system.
Planning considers a complicated data analysis process. It involves conducting complicated data-access and data-transformation operations; applying preprocessing routines; and, in some cases, paying attention to resource and data-access constraints. Typically, data processing steps are expressed in terms of desired postconditions and preconditions for the application of certain routines, which lends itself easily to representation as a planning problem. In addition, planning ability can play an important role in automated agents (see next item) to collect data samples or conduct a search to obtain needed data sets.

Intelligent agents can be fired off to collect necessary information from a variety of sources. In addition, information agents can be activated remotely over the network or can trigger on the occurrence of a certain event and start an analysis operation. Finally, agents can help navigate and model the World-Wide Web (Etzioni 1996), another area growing in importance.

Uncertainty in AI includes issues for managing uncertainty, proper inference mechanisms in the presence of uncertainty, and the reasoning about causality, all fundamental to KDD theory and practice. In fact, the KDD-96 conference had a joint session with the UAI-96 conference this year (Horvitz and Jensen 1996).

Knowledge representation includes ontologies, new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing the use of prior human knowledge about the underlying process by the KDD system.

These potential contributions of AI are but a sampling; many others, including human-computer interaction, knowledge-acquisition techniques, and the study of mechanisms for reasoning, have the opportunity to contribute to KDD.

In conclusion, we presented some definitions of basic notions in the KDD field. Our primary aim was to clarify the relation between knowledge discovery and data mining. We provided an overview of the KDD process and basic data-mining methods. Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope: There are many data-mining techniques, particularly specialized methods for particular types of data and domain. Although various algorithms and applications might appear quite different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the task of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process. This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD.
We hope this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

Acknowledgments

We thank Sam Uthurusamy, Ron Brachman, and KDD-96 referees for their valuable suggestions and ideas.

Note

1. Throughout this article, we use the term pattern to designate a pattern found in data. We also refer to models. One can think of patterns as components of models, for example, a particular rule in a classification model or a linear component in a regression model.

References

Agrawal, R., and Psaila, G. 1995. Active Data Mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3-8. Menlo Park, Calif.: American Association for Artificial Intelligence.

Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and Verkamo, I. 1996. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Apte, C., and Hong, S. J. 1996. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Basseville, M., and Nikiforov, I. V. 1993. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, N.J.: Prentice Hall.

Berndt, D., and Clifford, J. 1996. Finding Patterns in Time Series: A Dynamic Programming Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Berry, J. Database Marketing. Business Week, September 5.

Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, 37-58, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth.

Brodley, C. E., and Smyth, P. Applying Classification Algorithms in Practice. Statistics and Computing. Forthcoming.

Buntine, W. 1996. Graphical Models for Discovering Knowledge. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Cheeseman, P. 1990. On Finding the Most Probable Model. In Computational Models of Scientific Discovery and Theory Formation, eds. J. Shrager and P. Langley. San Francisco, Calif.: Morgan Kaufmann.

Cheeseman, P., and Stutz, J. 1996. Bayesian Classification (AUTOCLASS): Theory and Results. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Cheng, B., and Titterington, D. M. 1994. Neural Networks: A Review from a Statistical Perspective. Statistical Science 9(1).

Codd, E. F. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. E. F. Codd and Associates.

Dasarathy, B. V. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Washington, D.C.: IEEE Computer Society.

Djoko, S.; Cook, D.; and Holder, L. 1995. Analyzing the Benefits of Domain Knowledge in Substructure Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining. Menlo Park, Calif.: American Association for Artificial Intelligence.

Dzeroski, S. 1996. Inductive Logic Programming for Knowledge Discovery in Databases. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Elder, J., and Pregibon, D. 1996. A Statistical Perspective on KDD. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Etzioni, O. 1996. The World Wide Web: Quagmire or Gold Mine? Communications of the ACM (Special Issue on Data Mining). November. Forthcoming.

Fayyad, U. M.; Djorgovski, S. G.; and Weir, N. 1996. From Digitized Images to On-Line Catalogs: Data Mining a Sky Survey. AI Magazine 17(2).

Fayyad, U. M.; Haussler, D.; and Stolorz, Z. 1996. KDD for Science Data Analysis: Issues and Examples. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). Menlo Park, Calif.: American Association for Artificial Intelligence.

Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R. 1996. Advances in Knowledge Discovery and Data Mining. Menlo Park, Calif.: AAAI Press.

Friedman, J. H. 1989. Multivariate Adaptive Regression Splines. Annals of Statistics 19.

Geman, S.; Bienenstock, E.; and Doursat, R. 1992. Neural Networks and the Bias/Variance Dilemma. Neural Computation 4:1-58.

Glymour, C.; Madigan, D.; Pregibon, D.; and Smyth, P. 1996. Statistics and Data Mining. Communications of the ACM (Special Issue on Data Mining). November. Forthcoming.

Glymour, C.; Scheines, R.; Spirtes, P.; and Kelly, K. 1987. Discovering Causal Structure. New York: Academic.

Guyon, O.; Matic, N.; and Vapnik, N. 1996. Discovering Informative Patterns and Data Cleaning. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Hall, J.; Mani, G.; and Barr, D. 1996. Applying Computational Intelligence to the Investment Process. In Proceedings of CIFER-96: Computational Intelligence in Financial Engineering. Washington, D.C.: IEEE Computer Society.

Hand, D. J. 1994. Deconstructing Statistical Questions. Journal of the Royal Statistical Society A. 157(3).

Hand, D. J. Discrimination and Classification. Chichester, U.K.: Wiley.

Heckerman, D. 1996. Bayesian Networks for Knowledge Discovery. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Hernandez, M., and Stolfo, S. 1995. The MERGE-PURGE Problem for Large Databases. In Proceedings of the 1995 ACM-SIGMOD Conference. New York: Association for Computing Machinery.

Holsheimer, M.; Kersten, M. L.; Mannila, H.; and Toivonen, H. 1996. Data Surveyor: Searching the Nuggets in Parallel. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Horvitz, E., and Jensen, F. 1996. Proceedings of the Twelfth Conference of Uncertainty in Artificial Intelligence. San Mateo, Calif.: Morgan Kaufmann.

Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice-Hall.

Kloesgen, W. 1996. A Multipattern and Multistrategy Discovery Assistant. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Kloesgen, W., and Zytkow, J. 1996. Knowledge Discovery in Databases Terminology. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Kolodner, J. 1993. Case-Based Reasoning. San Francisco, Calif.: Morgan Kaufmann.

Langley, P., and Simon, H. A. 1995. Applications of Machine Learning and Rule Induction. Communications of the ACM 38.

Major, J., and Mangano, J. 1995. Selecting among Rules Induced from a Hurricane Database. Journal of Intelligent Information Systems 4(1).

Manago, M., and Auriol, M. Mining for OR. ORMS Today (Special Issue on Data Mining), February.

Mannila, H.; Toivonen, H.; and Verkamo, A. I. 1995. Discovering Frequent Episodes in Sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95). Menlo Park, Calif.: American Association for Artificial Intelligence.

Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. 1996. Selecting and Reporting What Is Interesting: The KEfiR Application to Healthcare Data. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. San Francisco, Calif.: Morgan Kaufmann.

Piatetsky-Shapiro, G. 1995. Knowledge Discovery in Personal Data versus Privacy: A Mini-Symposium. IEEE Expert 10(5).

Piatetsky-Shapiro, G. Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop. AI Magazine 11(5).

Piatetsky-Shapiro, G., and Matheus, C. 1994. The Interestingness of Deviations. In Proceedings of KDD-94, eds. U. M. Fayyad and R. Uthurusamy. Technical Report WS-03. Menlo Park, Calif.: AAAI Press.

Piatetsky-Shapiro, G.; Brachman, R.; Khabaza, T.; Kloesgen, W.; and Simoudis, E. 1996. An Overview of Issues in Developing Industrial Data Mining and Knowledge Discovery Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), eds. J. Han and E. Simoudis. Menlo Park, Calif.: American Association for Artificial Intelligence.

Quinlan, J. 1992. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

Ripley, B. D. 1994. Neural Networks and Related Methods for Classification. Journal of the Royal Statistical Society B. 56(3).

Senator, T.; Goldberg, H. G.; Wooton, J.; Cottini, M. A.; Umarkhan, A. F.; Klinger, C. D.; Llamas, W. M.; Marrone, M. P.; and Wong, R. W. H. The Financial Crimes Enforcement Network AI System (FAIS): Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine 16(4).

Shrager, J., and Langley, P., eds. 1990. Computational Models of Scientific Discovery and Theory Formation. San Francisco, Calif.: Morgan Kaufmann.

Silberschatz, A., and Tuzhilin, A. 1995. On Subjective Measures of Interestingness in Knowledge Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining. Menlo Park, Calif.: American Association for Artificial Intelligence.

Silverman, B. 1986. Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.

Simoudis, E.; Livezey, B.; and Kerber, R. 1995. Using Recon for Data Cleaning. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining. Menlo Park, Calif.: American Association for Artificial Intelligence.

Smyth, P.; Burl, M.; Fayyad, U.; and Perona, P. 1996. Modeling Subjective Uncertainty in Image Annotation. In Advances in Knowledge Discovery and Data Mining. Menlo Park, Calif.: AAAI Press.

Spirtes, P.; Glymour, C.; and Scheines, R. Causation, Prediction, and Search. New York: Springer-Verlag.

Stolorz, P.; Nakamura, H.; Mesrobian, E.; Muntz, R.; Shek, E.; Santos, J.; Yi, J.; Ng, K.; Chien, S.; Mechoso, C.; and Farrara, J. 1995. Fast Spatio-Temporal Data Mining of Large Geophysical Datasets. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining. Menlo Park, Calif.: American Association for Artificial Intelligence.

Titterington, D. M.; Smith, A. F. M.; and Makov, U. E. 1985. Statistical Analysis of Finite-Mixture Distributions. Chichester, U.K.: Wiley.

U.S. News. Basketball's New High-Tech Guru: IBM Software Is Changing Coaches' Game Plans. U.S. News and World Report, 11 December.

Weigend, A., and Gershenfeld, N., eds. 1993. Predicting the Future and Understanding the Past. Redwood City, Calif.: Addison-Wesley.

Weiss, S. I., and Kulikowski, C. 1991. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning, and Expert Systems. San Francisco, Calif.: Morgan Kaufmann.

Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York: Wiley.

Zembowicz, R., and Zytkow, J. 1996. From Contingency Tables to Various Forms of Knowledge in Databases. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Usama Fayyad is a senior researcher at Microsoft Research. He received his Ph.D. in 1991 from the University of Michigan at Ann Arbor. Prior to joining Microsoft in 1996, he headed the Machine Learning Systems Group at the Jet Propulsion Laboratory (JPL), California Institute of Technology, where he developed data-mining systems for automated science data analysis. He remains affiliated with JPL as a distinguished visiting scientist. Fayyad received the JPL 1993 Lew Allen Award for Excellence in Research and the 1994 National Aeronautics and Space Administration Exceptional Achievement Medal. His research interests include knowledge discovery in large databases, data mining, machine-learning theory and applications, statistical pattern recognition, and clustering. He was program cochair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery and Data Mining). He is general chair of KDD-96, an editor in chief of the journal Data Mining and Knowledge Discovery, and coeditor of the 1996 AAAI Press book Advances in Knowledge Discovery and Data Mining.

Gregory Piatetsky-Shapiro is a principal member of the technical staff at GTE Laboratories and the principal investigator of the Knowledge Discovery in Databases (KDD) Project, which focuses on developing and deploying advanced KDD systems for business applications. Previously, he worked on applying intelligent front ends to heterogeneous databases. Piatetsky-Shapiro received several GTE awards, including GTE's highest technical achievement award for the KEFIR system for health-care data analysis. His research interests include intelligent database systems, dependency networks, and Internet resource discovery. Prior to GTE, he worked at Strategic Information developing financial database systems. Piatetsky-Shapiro received his M.S. in 1979 and his Ph.D. in 1984, both from New York University (NYU). His Ph.D. dissertation on self-organizing database systems received NYU awards as the best dissertation in computer science and in all natural sciences. Piatetsky-Shapiro organized and chaired the first three (1989, 1991, and 1993) KDD workshops and helped in developing them into successful conferences (KDD-95 and KDD-96). He has also been on the program committees of numerous other conferences and workshops on AI and databases. He edited and coedited several collections on KDD, including two books, Knowledge Discovery in Databases (AAAI Press, 1991) and Advances in Knowledge Discovery in Databases (AAAI Press, 1996), and has many other publications in the areas of AI and databases. He is a coeditor in chief of the new Data Mining and Knowledge Discovery journal. Piatetsky-Shapiro founded and moderates the KDD Nuggets electronic newsletter and is the web master for Knowledge Discovery Mine.

Padhraic Smyth received a first-class-honors Bachelor of Engineering from the National University of Ireland in 1984 and an MSEE and a Ph.D. from the Electrical Engineering Department at the California Institute of Technology (Caltech) in 1985 and 1988, respectively.
From 1988 to 1996, he was a technical group leader at the Jet Propulsion Laboratory (JPL). Since April 1996, he has been a faculty member in the Information and Computer Science Department at the University of California at Irvine. He is also currently a principal investigator at JPL (part-time) and is a consultant to private industry. Smyth received the Lew Allen Award for Excellence in Research at JPL in 1993 and has been awarded 14 National Aeronautics and Space Administration certificates for technical innovation. He was coeditor of the book Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996). Smyth was a visiting lecturer in the Computational and Neural Systems and Electrical Engineering Departments at Caltech (1994) and regularly conducts tutorials on probabilistic learning algorithms at national conferences (including UAI-93, AAAI-94, CAIA-95, IJCAI-95). He is general chair of the Sixth International Workshop on AI and Statistics. Smyth's research interests include statistical pattern recognition, machine learning, decision theory, probabilistic reasoning, information theory, and the application of probability and statistics in AI. He has published 16 journal papers, 10 book chapters, and 60 conference papers on these topics.