From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction (note 1).

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined.
This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Copyright 1996, American Association for Artificial Intelligence. All rights reserved / $2.00. AI Magazine, Fall 1996.

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming
intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10 to 15 percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, "If customer bought X, he/she is also likely to buy Y and Z." Such patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

In other areas, a well-publicized system is IBM's ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://...>); CRAYON (<http://crayon.net/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www.sjmercury.com/hound/>) from the San Jose Mercury News and FARCAST (<http://www.farcast.com/>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.
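The danger of data dredging can be made concrete with a small simulation. What follows is a minimal sketch and not part of the original article: it assumes purely random Gaussian data, 30 cases, 200 candidate fields, and the conventional two-sided 5 percent significance level, and the names pearson and dredge are hypothetical. Because every field is noise, any field that passes the significance test is a meaningless pattern of exactly the kind described above.

```python
import random
import statistics

N_CASES = 30  # assumed sample size for this sketch

# Two-sided 5 percent critical value of Student's t for 28 degrees of
# freedom (N_CASES - 2); a standard table value, hard-coded here.
T_CRIT = 2.048


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def dredge(n_features=200, seed=0):
    """Correlate one random target with many unrelated random fields.

    Returns (largest |r| found, number of fields that look "significant").
    """
    rng = random.Random(seed)
    df = N_CASES - 2
    # Convert the t critical value into the equivalent threshold on |r|.
    r_crit = T_CRIT / (df + T_CRIT ** 2) ** 0.5
    target = [rng.gauss(0.0, 1.0) for _ in range(N_CASES)]
    best, hits = 0.0, 0
    for _ in range(n_features):
        field = [rng.gauss(0.0, 1.0) for _ in range(N_CASES)]
        r = abs(pearson(field, target))
        best = max(best, r)
        if r > r_crit:
            hits += 1
    return best, hits


if __name__ == "__main__":
    best, hits = dredge()
    print(f"largest |r| = {best:.3f}; spuriously significant fields = {hits}")
```

At a 5 percent level, roughly 1 in 20 noise fields is expected to clear the threshold by chance, so the count grows with the number of fields searched. This is one reason the validity of a discovered pattern has to be checked on new data.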
The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets
and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported.

The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.

Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician's art of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

Figure 1. An Overview of the Steps That Compose the KDD Process. (Stages shown: Data, Selection, Target Data, Preprocessing, Preprocessed Data, Transformation, Transformed Data, Data Mining, Patterns, Interpretation / Evaluation, Knowledge.)

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern).
An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of
patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data-mining algorithm.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section.

The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different from models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps.
The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component, which has, by far, received the most attention in the literature.
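The flow of steps can be illustrated with a toy pipeline. This is a minimal sketch under loud assumptions, not the authors' system: the records, the debt-to-income feature, the threshold-rule pattern language, and every function name are hypothetical, and interestingness is reduced to plain accuracy on the same data, whereas the article describes it as combining validity, novelty, usefulness, and simplicity.

```python
from dataclasses import dataclass

# Hypothetical raw records: (income, debt, defaulted). None marks a missing field.
RAW = [
    (20, 30, True), (25, 28, True), (30, None, True), (35, 32, True),
    (40, 38, True), (55, 18, False), (60, 20, False), (70, 25, False),
    (80, 15, False), (90, 30, False),
]


@dataclass
class Pattern:
    threshold: float  # rule: "default if debt / income > threshold"
    accuracy: float   # fraction of cases the rule labels correctly


def select(records):
    """Step 2: create the target data set (here, simply keep every record)."""
    return list(records)


def clean(records):
    """Step 3: one simple strategy for missing fields is to drop the record."""
    return [r for r in records if None not in r]


def transform(records):
    """Step 4: project each case onto one derived feature, debt / income."""
    return [(debt / income, defaulted) for income, debt, defaulted in records]


def mine(cases):
    """Step 7: enumerate candidate threshold rules and score each one."""
    patterns = []
    for threshold in sorted({ratio for ratio, _ in cases}):
        correct = sum((ratio > threshold) == defaulted for ratio, defaulted in cases)
        patterns.append(Pattern(threshold, correct / len(cases)))
    return patterns


def evaluate(patterns, min_interest):
    """Step 8: keep only patterns that reach the user-chosen threshold."""
    return [p for p in patterns if p.accuracy >= min_interest]


def kdd(records, min_interest=0.9):
    """Run the steps once, in order, without the loops a real process needs."""
    return evaluate(mine(transform(clean(select(records)))), min_interest)


if __name__ == "__main__":
    for pattern in kdd(RAW):
        print(pattern)
```

Lowering min_interest admits more rules as knowledge, which mirrors the point that knowledge in this sense depends on the functions and thresholds the user chooses. A real process would also iterate, for example returning to the cleaning or transformation step after inspecting the mined rules.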
The Data-Mining Step of the KDD Process

The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods.

The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user's hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst.
It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search.

In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes. (Axes: income, horizontal; debt, vertical.)
The purpose here is to illustrate basic ideas on a small problem in two-dimensional space.

Data-Mining Methods

The two high-level primary goals of data mining in practice tend to be prediction and description. As stated earlier, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa), the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data-mining methods.

Classification is learning a function that maps (classifies) a data item into one of several predefined classes (Weiss and Kulikowski 1991; Hand 1981). Examples of classification methods used as part of knowledge discovery applications include the classifying of trends in financial markets (Apte and Hong 1996) and the automated identification of objects of interest in large image databases (Fayyad, Djorgovski, and Weir 1996). Figure 3 shows a simple partitioning of the loan data into two class regions; note that it is not possible to separate the classes perfectly using a linear decision boundary. The bank might want to use the classification regions to automatically decide whether future loan applicants will be given a loan or not.

Figure 3. A Simple Linear Classification Boundary for the Loan Data Set. The shaded region denotes class no loan. (Axes: income, debt; regions: No Loan, Loan.)

Figure 4. A Simple Linear Regression for the Loan Data Set. (Axes: income, debt; the plot shows the regression line.)

Regression is learning a function that maps a data item to a real-valued prediction variable.
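Both definitions, classification as a map into predefined classes and regression as a map to a real value, can be sketched on a small loan-like data set. The sketch below is illustrative only: the six cases are hypothetical and are not the 23 cases of figure 2, and the nearest-centroid rule is a stand-in chosen because it yields a linear decision boundary; the article does not say which method produced the boundary in figure 3. All function names are assumptions.

```python
# Hypothetical loan-like cases: (income, debt, status). Not the data of figure 2.
CASES = [
    (20.0, 30.0, "default"), (30.0, 40.0, "default"), (40.0, 35.0, "default"),
    (60.0, 20.0, "good"), (70.0, 30.0, "good"), (80.0, 25.0, "good"),
]


def centroids(cases):
    """Mean (income, debt) point of each class."""
    sums = {}
    for income, debt, status in cases:
        acc = sums.setdefault(status, [0.0, 0.0, 0])
        acc[0] += income
        acc[1] += debt
        acc[2] += 1
    return {status: (a[0] / a[2], a[1] / a[2]) for status, a in sums.items()}


def classify(income, debt, cents):
    """Classification: map a data item to the class with the nearest centroid.

    With two classes the decision boundary is a straight line, the
    perpendicular bisector of the segment joining the two centroids.
    """
    def squared_distance(status):
        cx, cy = cents[status]
        return (income - cx) ** 2 + (debt - cy) ** 2

    return min(cents, key=squared_distance)


def fit_line(cases):
    """Regression: least-squares fit of debt as a linear function of income."""
    xs = [case[0] for case in cases]
    ys = [case[1] for case in cases]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


if __name__ == "__main__":
    cents = centroids(CASES)
    print(classify(25.0, 33.0, cents), classify(75.0, 20.0, cents))
    print(fit_line(CASES))
```

The classifier returns a class label for a new applicant, while the fitted line returns a number, the predicted debt for a given income. That difference in output type is the whole distinction between the two tasks as defined here.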
Regression applications are many, for example, predicting the amount of biomass present in a forest given remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. Figure 4 shows the result of simple linear regression where total debt is fitted as a linear function of income: The fit is poor because only a weak correlation exists between the two variables.

Clustering is a common descriptive task
where one seeks to identify a finite set of categories or clusters to describe the data (Jain and Dubes 1988; Titterington, Smith, and Makov 1985). The categories can be mutually exclusive and exhaustive or consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context include discovering homogeneous subpopulations for consumers in marketing databases and identifying subcategories of spectra from infrared sky measurements (Cheeseman and Stutz 1996). Figure 5 shows a possible clustering of the loan data set into three clusters; note that the clusters overlap, allowing data points to belong to more than one cluster. The original class labels (denoted by x's and o's in the previous figures) have been replaced by a + to indicate that the class membership is no longer assumed known. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating from data the joint multivariate probability density function of all the variables or fields in the database (Silverman 1986).

Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules (Agrawal et al. 1996), multivariate visualization techniques, and the discovery of functional relationships between variables (Zembowicz and Zytkow 1996). Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.

Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale.
For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.

Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values (Berndt and Clifford 1996; Guyon, Matic, and Vapnik 1996; Kloesgen 1996; Matheus, Piatetsky-Shapiro, and McNeill 1996; Basseville and Nikiforov 1993).

The Components of Data-Mining Algorithms

The next step is to construct specific algorithms to implement the general methods we outlined. One can identify three primary components in any data-mining algorithm: (1) model representation, (2) model evaluation, and (3) search. This reductionist view is not necessarily complete or fully encompassing; rather, it is a convenient way to express the key concepts of data-mining algorithms in a relatively unified and compact manner. Cheeseman (1990) outlines a similar structure.

Model representation is the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method. It is equally important that an algorithm designer clearly state which representational assumptions are being made by a particular algorithm. Note that increased representational power for models increases the danger of overfitting the training data, resulting in reduced prediction accuracy on unseen data.

Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters. Note that original labels are replaced by a +.

Model-evaluation criteria are quantitative
statements (or fit functions) of how well a particular pattern (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by the empirical prediction accuracy on some test set. Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model.

Search method consists of two components: (1) parameter search and (2) model search. Once the model representation (or family of representations) and the model-evaluation criteria are fixed, then the data-mining problem has been reduced to purely an optimization task: Find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model-evaluation criteria given observed data and a fixed model representation. Model search occurs as a loop over the parameter-search method: The model representation is changed so that a family of models is considered.

Some Data-Mining Methods

A wide variety of data-mining methods exist, but here, we only focus on a subset of popular techniques. Each method is discussed in the context of model representation, model evaluation, and search.

Decision Trees and Rules

Figure 6. Using a Single Threshold on the Income Variable to Try to Classify the Loan Data Set.

Decision trees and rules that use univariate splits have a simple representational form, making the inferred model relatively easy for the user to comprehend. However, the restriction to a particular tree or rule representation can significantly restrict the functional form (and, thus, the approximation power) of the model. For example, figure 6 illustrates the effect of a threshold split applied to the income variable for a loan data set: It is clear that using such simple threshold splits (parallel to the feature axes) severely limits the type of classification boundaries that can be induced.
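The three components can be made concrete with the single-threshold classifier of figure 6. In the sketch below, the representation is a rule of the form "predict a good loan if income exceeds t", the evaluation criterion is training accuracy (chosen here only for simplicity), and the parameter search is an exhaustive scan over candidate thresholds. The data values and function names are hypothetical.

```python
def accuracy(threshold, incomes, labels):
    """Model evaluation: fraction of cases the rule classifies correctly."""
    predictions = [income > threshold for income in incomes]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_threshold(incomes, labels):
    """Parameter search: try the midpoints between sorted distinct incomes."""
    values = sorted(set(incomes))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, incomes, labels))


# Hypothetical loan data: income in thousands, True = loan in good status.
incomes = [18, 22, 25, 31, 34, 40, 47, 52, 60, 75]
labels = [False, False, False, True, False, True, True, True, True, True]

t = best_threshold(incomes, labels)
print(t, accuracy(t, incomes, labels))
```

On this toy data no single threshold classifies every case correctly, because one poor-status case sits between two good-status ones on the income axis. That is the representational limit of an axis-parallel split; a model search would wrap this routine in an outer loop over richer representations.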
If one enlarges the model space to allow more general expressions (such as multivariate hyperplanes at arbitrary angles), then the model is more powerful for prediction but can be much more difficult to comprehend. A large number of decision tree and rule-induction algorithms are described in the machine-learning and applied statistics literature (Quinlan 1992; Breiman et al. 1984).

To a large extent, they depend on likelihood-based model-evaluation methods, with varying degrees of sophistication in terms of penalizing model complexity. Greedy search methods, which involve growing and pruning rule and tree structures, are typically used to explore the superexponential space of possible models. Trees and rules are primarily used for predictive modeling, both for classification (Apte and Hong 1996; Fayyad, Djorgovski, and Weir 1996) and regression, although they can also be applied to summary descriptive modeling (Agrawal et al. 1996).

Nonlinear Regression and Classification Methods

These methods consist of a family of techniques for prediction that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables. Examples include feedforward neural networks, adaptive spline methods, and projection pursuit regression (see Elder and Pregibon [1996], Cheng and Titterington [1994], and Friedman [1989] for more detailed discussions). Consider neural networks, for example. Figure 7 illustrates the type of nonlinear decision boundary that a neural network might find for the loan data set. In terms of model evaluation, although networks of the appropriate size can universally approximate any smooth function to any desired degree of accuracy, relatively little is known about the representation properties of fixed-size networks estimated from finite data sets. Also, the standard squared error and
cross-entropy loss functions used to train neural networks can be viewed as log-likelihood functions for regression and classification, respectively (Ripley 1994; Geman, Bienenstock, and Doursat 1992). Back propagation is a parameter-search method that performs gradient descent in parameter (weight) space to find a local maximum of the likelihood function starting from random initial conditions. Nonlinear regression methods, although powerful in representational power, can be difficult to interpret. For example, although the classification boundaries of figure 7 might be more accurate than the simple threshold boundary of figure 6, the threshold boundary has the advantage that the model can be expressed, to some degree of certainty, as a simple rule of the form "if income is greater than threshold, then loan will have good status."

Example-Based Methods

The representation is simple: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: The class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.

A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a sensible metric between the variables.
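A nearest-neighbor classifier of the kind shown in figure 8 can be sketched in a few lines. The sketch assumes Euclidean distance and uses made-up training cases; the optional `scale` argument is a hypothetical device for showing how the choice of metric changes the answer once variables measured in different units are mixed.

```python
import math


def nearest_neighbor(query, points, labels, scale=None):
    """Return the label of the training point closest to `query`.

    `scale` optionally divides each coordinate difference, one simple way
    to put variables with different units on a comparable footing. The
    choice of scale is part of the distance metric, not a neutral detail.
    """
    if scale is None:
        scale = [1.0] * len(query)

    def distance(p):
        return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, query, scale)))

    best = min(range(len(points)), key=lambda i: distance(points[i]))
    return labels[best]


# Hypothetical training cases: (income in dollars, loan duration in years).
points = [(30000, 2), (31000, 30), (60000, 3)]
labels = ["no loan", "loan", "loan"]

query = (30900, 3)
raw = nearest_neighbor(query, points, labels)
scaled = nearest_neighbor(query, points, labels, scale=[10000, 10])
```

With raw units the income difference dominates and the query is matched to the 30-year case; after rescaling, the 2-year case is closest and the predicted class changes. Neither answer is inherently correct, which is why the metric has to be designed rather than assumed.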
Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set.

Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set.

Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: Parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data and not explicitly formulated. Related techniques include kernel-density
estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

Probabilistic Graphic Dependency Models

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible.

Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby both the structure and the parameters of graphic models can be learned directly from databases (Buntine 1996; Heckerman 1996). Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in terms of reducing the model search space. Although still primarily in the research phase, graphic model induction methods are of particular interest to KDD because the graphic form of the model lends itself easily to human interpretation.

Relational Learning Models

Although decision trees and rules have a representation restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. Most research to date on model-evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands in terms of search. See Dzeroski (1996) for a more detailed discussion.

Discussion

Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope; many data-mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion on data-mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which traditionally has been cast as a predictive regression task (autoregressive models, and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based models, and kernel methods. Furthermore, there has been significant interest in descriptive graphic and local data modeling of time series rather than purely predictive modeling (Weigend and Gershenfeld 1993). Thus, although different algorithms and applications might appear different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics).
However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by a second-order polynomial (for example). Thus, there is no universal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual
search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: The state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes.
It is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Furthermore, low noise levels (few data errors) are another consideration. High amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify the aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problem that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables and millions of records and of a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables); so, the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general.
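The risk of spurious patterns can be seen in a small simulation. The sketch below assumes a hypothetical data set of 30 records in which every field is pure noise, unrelated to the target by construction, and reports the strongest correlation a search would find as more fields are examined. The record count, field counts, and random seed are arbitrary choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(1)
n_records = 30
target = [rng.gauss(0, 1) for _ in range(n_records)]

# 1,000 noise fields; any apparent relationship to the target is chance.
noise_strengths = [
    abs(pearson([rng.gauss(0, 1) for _ in range(n_records)], target))
    for _ in range(1000)
]

# Strongest apparent relationship found when searching the first k fields.
for k in (10, 100, 1000):
    print(k, round(max(noise_strengths[:k]), 2))
```

Because each larger search includes the smaller one, the best correlation found can only grow with the number of fields, even though no field carries any information about the target.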
Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing of statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.
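The arithmetic behind this example is easy to check by simulation. The sketch below assumes that each model's test yields a p-value that is uniformly distributed when the data are purely random, which is the standard behavior of a valid test under the null hypothesis; the number of models and the random seed are arbitrary choices.

```python
import random


def count_false_discoveries(n_models, alpha, seed=0):
    """Count models accepted as significant when the data are pure noise.

    Under the null hypothesis a valid test's p-value is uniform on [0, 1],
    so each model is accepted with probability alpha.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_models))


n_models = 100_000
alpha = 0.001
expected = n_models * alpha  # N/1000 models accepted on average
observed = count_false_discoveries(n_models, alpha)
print(f"expected about {expected:.0f}, simulated {observed}")
```

With 100,000 candidate models, roughly a hundred pass the test although none reflects a real pattern, so a significance level that looks strict for one test is weak evidence after a large search.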
This point is frequently missed by many initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.

Changing data and knowledge: Rapidly changing (nonstationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only (Matheus, Piatetsky-Shapiro, and McNeill 1996). See also Agrawal and Psaila (1995) and Mannila, Toivonen, and Verkamo (1995).

Missing and noisy data: This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies (Heckerman 1996; Smyth et al. 1996).

Complex relationships between fields: Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed (Dzeroski 1996; Djoko, Cook, and Holder 1995).

Understandability of patterns: In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations (Buntine 1996; Heckerman 1996), rule structuring, natural language generation, and techniques for visualization of data and knowledge.
Rule-refinement strategies (for example, Major and Mangano [1995]) can be used to address a related problem: The discovered knowledge might be implicitly or explicitly redundant.

User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches (for example, Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search (for example, Simoudis, Livezey, and Kerber [1995]).

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodating of real-time sensor readings. Examples of integrated KDD systems are described by Simoudis, Livezey, and Kerber (1995) and Stolorz, Nakamura, Mesrobian, Muntz, Shek, Santos, Yi, Ng, Chien, Mechoso, and Farrara (1995).

Concluding Remarks: The Potential Role of AI in KDD

In addition to machine learning, other AI fields can potentially contribute significantly to various aspects of the KDD process. We mention a few examples of these areas here:

Natural language presents significant opportunities for mining in free-form text, especially for automated annotation and indexing prior to classification of text corpora. Limited parsing capabilities can help substantially in the task of deciding what an article refers to. Hence, the spectrum from simple natural language processing all the way to language understanding can help substantially. Also, natural language processing can contribute significantly as an effective interface for stating hints to mining algorithms and visualizing and explaining knowledge derived by a KDD system.
Planning considers a complicated data analysis process. It involves conducting complicated data-access and data-transformation operations; applying preprocessing routines; and, in some cases, paying attention to resource and data-access constraints. Typically, data processing steps are expressed in terms of desired postconditions and preconditions for the application of certain routines, which lends itself easily to representation as a planning problem. In addition, planning ability can play an important role in automated agents (see next item) to collect data samples or conduct a search to obtain needed data sets.

Intelligent agents can be fired off to collect necessary information from a variety of
sources. In addition, information agents can be activated remotely over the network or can trigger on the occurrence of a certain event and start an analysis operation. Finally, agents can help navigate and model the World-Wide Web (Etzioni 1996), another area growing in importance.

Uncertainty in AI includes issues for managing uncertainty, proper inference mechanisms in the presence of uncertainty, and the reasoning about causality, all fundamental to KDD theory and practice. In fact, the KDD-96 conference had a joint session with the UAI-96 conference this year (Horvitz and Jensen 1996).

Knowledge representation includes ontologies, new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing the use of prior human knowledge about the underlying process by the KDD system.

These potential contributions of AI are but a sampling; many others, including human-computer interaction, knowledge-acquisition techniques, and the study of mechanisms for reasoning, have the opportunity to contribute to KDD.

In conclusion, we presented some definitions of basic notions in the KDD field. Our primary aim was to clarify the relation between knowledge discovery and data mining. We provided an overview of the KDD process and basic data-mining methods. Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope: There are many data-mining techniques, particularly specialized methods for particular types of data and domain. Although various algorithms and applications might appear quite different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the task of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD.
We hope this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

Acknowledgments

We thank Sam Uthurusamy, Ron Brachman, and KDD-96 referees for their valuable suggestions and ideas.

Note

1. Throughout this article, we use the term pattern to designate a pattern found in data. We also refer to models. One can think of patterns as components of models, for example, a particular rule in a classification model or a linear component in a regression model.

References

Agrawal, R., and Psaila, G. 1995. Active Data Mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3–8. Menlo Park, Calif.: American Association for Artificial Intelligence.
Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and Verkamo, I. 1996. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Apte, C., and Hong, S. J. 1996. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Basseville, M., and Nikiforov, I. V. 1993. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, N.J.: Prentice Hall.
Berndt, D., and Clifford, J. 1996. Finding Patterns in Time Series: A Dynamic Programming Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Berry, J. 1994. Database Marketing. Business Week, September 5.
Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 37–58. Menlo Park, Calif.: AAAI Press.
Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth.
Brodley, C. E., and Smyth, P. 1996. Applying Classification Algorithms in Practice. Statistics and Computing. Forthcoming.
Buntine, W. 1996. Graphical Models for Discovering Knowledge. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Cheeseman, P. 1990. On Finding the Most Probable Model. In Computational Models of Scientific Discovery and Theory Formation, eds. J. Shrager and P. Langley. San Francisco, Calif.: Morgan Kaufmann.
Cheeseman, P., and Stutz, J. 1996. Bayesian Classification (AUTOCLASS): Theory and Results. In Advances in Knowledge Discovery and Data Mining, eds.
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Cheng, B., and Titterington, D. M. 1994. Neural Networks: A Review from a Statistical Perspective. Statistical Science 9(1).
Codd, E. F. 1993. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. E. F. Codd and Associates.
Dasarathy, B. V. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Washington, D.C.: IEEE Computer Society.
Djoko, S.; Cook, D.; and Holder, L. 1995. Analyzing the Benefits of Domain Knowledge in Substructure Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining. Menlo Park, Calif.: American Association for Artificial Intelligence.
Dzeroski, S. 1996. Inductive Logic Programming for Knowledge Discovery in Databases. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Elder, J., and Pregibon, D. 1996. A Statistical Perspective on KDD. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Etzioni, O. 1996. The World Wide Web: Quagmire or Gold Mine? Communications of the ACM (Special Issue on Data Mining). November. Forthcoming.
Fayyad, U. M.; Djorgovski, S. G.; and Weir, N. 1996. From Digitized Images to On-Line Catalogs: Data Mining a Sky Survey. AI Magazine 17(2).
Fayyad, U. M.; Haussler, D.; and Stolorz, Z. 1996. KDD for Science Data Analysis: Issues and Examples. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). Menlo Park, Calif.: American Association for Artificial Intelligence.
Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R. 1996. Advances in Knowledge Discovery and Data Mining.
Menlo Park, Calif.: AAAI Press.
Friedman, J. H. 1989. Multivariate Adaptive Regression Splines. Annals of Statistics 19.
Geman, S.; Bienenstock, E.; and Doursat, R. 1992. Neural Networks and the Bias/Variance Dilemma. Neural Computation 4:1–58.
Glymour, C.; Madigan, D.; Pregibon, D.; and Smyth, P. 1996. Statistics and Data Mining. Communications of the ACM (Special Issue on Data Mining). November. Forthcoming.
Glymour, C.; Scheines, R.; Spirtes, P.; and Kelly, K. 1987. Discovering Causal Structure. New York: Academic.
Guyon, O.; Matic, N.; and Vapnik, N. 1996. Discovering Informative Patterns and Data Cleaning. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Hall, J.; Mani, G.; and Barr, D. 1996. Applying Computational Intelligence to the Investment Process. In Proceedings of CIFER-96: Computational Intelligence in Financial Engineering. Washington, D.C.: IEEE Computer Society.
Hand, D. J. 1994. Deconstructing Statistical Questions. Journal of the Royal Statistical Society A. 157(3).
Hand, D. J. 1981. Discrimination and Classification. Chichester, U.K.: Wiley.
Heckerman, D. 1996. Bayesian Networks for Knowledge Discovery. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Hernandez, M., and Stolfo, S. 1995. The MERGE-PURGE Problem for Large Databases. In Proceedings of the 1995 ACM-SIGMOD Conference. New York: Association for Computing Machinery.
Holsheimer, M.; Kersten, M. L.; Mannila, H.; and Toivonen, H. 1996. Data Surveyor: Searching the Nuggets in Parallel. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Horvitz, E., and Jensen, F. 1996. Proceedings of the Twelfth Conference of Uncertainty in Artificial Intelligence. San Mateo, Calif.: Morgan Kaufmann.
Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall.
Kloesgen, W. 1996. A Multipattern and Multistrategy Discovery Assistant.
In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Kloesgen, W., and Zytkow, J. 1996. Knowledge Discovery in Databases Terminology. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Kolodner, J. 1993. Case-Based Reasoning. San Francisco, Calif.: Morgan Kaufmann.
Langley, P., and Simon, H. A. 1995. Applications of Machine Learning and Rule Induction. Communications of the ACM 38.
Major, J., and Mangano, J. 1995. Selecting among Rules Induced from a Hurricane Database. Journal of Intelligent Information Systems 4(1).
Manago, M., and Auriol, M. 1996. Mining for OR. ORMS Today (Special Issue on Data Mining), February.
Mannila, H.; Toivonen, H.; and Verkamo, A. I. 1995. Discovering Frequent Episodes in Sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95). Menlo Park, Calif.: American
Association for Artificial Intelligence.
Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. 1996. Selecting and Reporting What Is Interesting: The KEFIR Application to Healthcare Data. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. San Francisco, Calif.: Morgan Kaufmann.
Piatetsky-Shapiro, G. 1995. Knowledge Discovery in Personal Data versus Privacy: A Mini-Symposium. IEEE Expert 10(5).
Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop. AI Magazine 11(5).
Piatetsky-Shapiro, G., and Matheus, C. 1994. The Interestingness of Deviations. In Proceedings of KDD-94, eds. U. M. Fayyad and R. Uthurusamy. Technical Report WS03. Menlo Park, Calif.: AAAI Press.
Piatetsky-Shapiro, G.; Brachman, R.; Khabaza, T.; Kloesgen, W.; and Simoudis, E. 1996. An Overview of Issues in Developing Industrial Data Mining and Knowledge Discovery Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), eds. J. Han and E. Simoudis. Menlo Park, Calif.: American Association for Artificial Intelligence.
Quinlan, J. 1992. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.
Ripley, B. D. 1994. Neural Networks and Related Methods for Classification. Journal of the Royal Statistical Society B. 56(3).
Senator, T.; Goldberg, H. G.; Wooton, J.; Cottini, M. A.; Umarkhan, A. F.; Klinger, C. D.; Llamas, W. M.; Marrone, M. P.; and Wong, R. W. H. 1995. The Financial Crimes Enforcement Network AI System (FAIS): Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine 16(4).
Shrager, J., and Langley, P., eds. 1990. Computational Models of Scientific Discovery and Theory Formation. San Francisco, Calif.: Morgan Kaufmann.
Silberschatz, A., and Tuzhilin, A. 1995. On Subjective Measures of Interestingness in Knowledge Discovery.
In Prceedings f KDD95: First Internatinal Cnference n Knwledge Discvery and Data Mining, Menl Park, Calif.: American Assciatin fr Artificial Intelligence. Silverman, B Density Estimatin fr Statistics and Data Analysis. New Yrk: Chapman and Hall. Simudis, E.; Livezey, B.; and Kerber, R Using Recn fr Data Cleaning. In Prceedings f KDD95: First Internatinal Cnference n Knwledge Discvery and Data Mining, Menl Park, Calif.: American Assciatin fr Artificial Intelligence. Smyth, P.; Burl, M.; Fayyad, U.; and Perna, P Mdeling Subjective Uncertainty in Image Anntatin. In Advances in Knwledge Discvery and Data Mining, Menl Park, Calif.: AAAI Press. Spirtes, P.; Glymur, C.; and Scheines, R Causatin, Predictin, and Search. New Yrk: SpringerVerlag. Stlrz, P.; Nakamura, H.; Mesrbian, E.; Muntz, R.; Shek, E.; Sants, J.; Yi, J.; Ng, K.; Chien, S.; Mechs, C.; and Farrara, J Fast SpatiTempral Data Mining f Large Gephysical Datasets. In Prceedings f KDD95: First Internatinal Cnference n Knwledge Discvery and Data Mining, Menl Park, Calif.: American Assciatin fr Artificial Intelligence. Titteringtn, D. M.; Smith, A. F. M.; and Makv, U. E Statistical Analysis f FiniteMiture Distributins. Chichester, U.K.: Wiley. U.S. News Basketball s New HighTech Guru: IBM Sftware Is Changing Caches Game Plans. U.S. News and Wrld Reprt, 11 December. Weigend, A., and Gershenfeld, N., eds Predicting the Future and Understanding the Past. Redwd City, Calif.: AddisnWesley. Weiss, S. I., and Kulikwski, C Cmputer Systems That Learn: Classificatin and Predictin Methds frm Statistics, Neural Netwrks, Machine Learning, and Epert Systems. San Francisc, Calif.: Mrgan Kaufmann. Whittaker, J Graphical Mdels in Applied Multivariate Statistics. New Yrk: Wiley. Zembwicz, R., and Zytkw, J Frm Cntingency Tables t Varius Frms f Knwledge in Databases. In Advances in Knwledge Discvery and Data Mining, eds. U. Fayyad, G. PiatetskyShapir, P. Smyth, and R. Uthurusamy, Menl Park, Calif.: AAAI Press. 
Usama Fayyad is a senior researcher at Microsoft Research. He received his Ph.D. in 1991 from the University of Michigan at Ann Arbor. Prior to joining Microsoft in 1996, he headed the Machine Learning Systems Group at the Jet Propulsion Laboratory (JPL), California Institute of Technology, where he developed data-mining systems for automated science data analysis. He remains affiliated with JPL as a distinguished visiting scientist. Fayyad received the JPL 1993 Lew Allen Award for Excellence in Research and the 1994 National Aeronautics and Space Administration Exceptional Achievement Medal. His research interests include knowledge discovery in large databases, data mining, machine-learning theory and applications, statistical pattern recognition, and clustering. He was program cochair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery and Data Mining). He is general chair of KDD-96, an editor in chief of the journal Data Mining and Knowledge Discovery, and coeditor of the 1996 AAAI Press book Advances in Knowledge Discovery and Data Mining.
Gregory Piatetsky-Shapiro is a principal member of the technical staff at GTE Laboratories and the principal investigator of the Knowledge Discovery in Databases (KDD) Project, which focuses on developing and deploying advanced KDD systems for business applications. Previously, he worked on applying intelligent front ends to heterogeneous databases. Piatetsky-Shapiro received several GTE awards, including GTE's highest technical achievement award for the KEfiR system for healthcare data analysis. His research interests include intelligent database systems, dependency networks, and Internet resource discovery. Prior to GTE, he worked at Strategic Information developing financial database systems. Piatetsky-Shapiro received his M.S. in 1979 and his Ph.D. in 1984, both from New York University (NYU). His Ph.D. dissertation on self-organizing database systems received NYU awards as the best dissertation in computer science and in all natural sciences. Piatetsky-Shapiro organized and chaired the first three (1989, 1991, and 1993) KDD workshops and helped in developing them into successful conferences (KDD-95 and KDD-96). He has also been on the program committees of numerous other conferences and workshops on AI and databases. He edited and coedited several collections on KDD, including two books, Knowledge Discovery in Databases (AAAI Press, 1991) and Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996), and has many other publications in the areas of AI and databases. He is a coeditor in chief of the new Data Mining and Knowledge Discovery journal. Piatetsky-Shapiro founded and moderates the KDD Nuggets electronic newsletter and is the web master for Knowledge Discovery Mine (<http://info.gte.com/~kdd/index.html>).
Padhraic Smyth received a first-class-honors Bachelor of Engineering from the National University of Ireland in 1984 and an MSEE and a Ph.D. from the Electrical Engineering Department at the California Institute of Technology (Caltech) in 1985 and 1988, respectively. From 1988 to 1996, he was a technical group leader at the Jet Propulsion Laboratory (JPL). Since April 1996, he has been a faculty member in the Information and Computer Science Department at the University of California at Irvine. He is also currently a principal investigator at JPL (part-time) and is a consultant to private industry. Smyth received the Lew Allen Award for Excellence in Research at JPL in 1993 and has been awarded 14 National Aeronautics and Space Administration certificates for technical innovation since […]. He was coeditor of the book Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996). Smyth was a visiting lecturer in the Computational and Neural Systems and Electrical Engineering Departments at Caltech (1994) and regularly conducts tutorials on probabilistic learning algorithms at national conferences (including UAI-93, AAAI-94, CAIA-95, IJCAI-95). He is general chair of the Sixth International Workshop on AI and Statistics, to be held in […]. Smyth's research interests include statistical pattern recognition, machine learning, decision theory, probabilistic reasoning, information theory, and the application of probability and statistics in AI. He has published 16 journal papers, 10 book chapters, and 60 conference papers on these topics.