Data Mining Overview
By: Dr. Michael Gilman, CEO, Data Mining Technologies Inc.

With the proliferation of data warehouses, data mining tools are flooding the market. Their objective is to discover hidden gold in your data. Many traditional report and query tools and statistical analysis systems use the term "data mining" in their product descriptions. Exotic Artificial Intelligence-based systems are also being touted as new data mining tools. What is a data mining tool and what isn't?

The ultimate objective of data mining is knowledge discovery. Data mining methodology extracts hidden predictive information from large databases. With such a broad definition, however, an online analytical processing (OLAP) product or a statistical package could qualify as a data mining tool. That's where technology comes in: for true knowledge discovery a data mining tool should unearth hidden information automatically. By this definition data mining is data-driven, not user-driven or verification-driven.

Verification-Driven Data Mining: An Example

Traditionally the goal of identifying and utilizing information hidden in data has proceeded via query generators and data interpretation systems. A user formulates a theory about a possible relation in a database and converts this hypothesis into a query. For example, a user might hypothesize about the relationship between industrial sales of color copiers and customers' specific industries. He or she would generate a query against a data warehouse and segment the results into a report. Typically, the generated information provides a good overview.

This verification type of data mining is limited in two ways, however. First, it's based on a hunch. In our example, the hunch is that the industry a company is in correlates with the number of copiers it buys or leases. Second, the quality of the extracted information depends on the user's interpretation of the results and is thus subject to error.
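
To make the verification-driven workflow concrete, here is a minimal sketch of the copier query described above. The records, the industry names, and the function name copiers_by_industry are all hypothetical; the point is only that the user, not the software, decides which relationship to look at.

```python
from collections import defaultdict

# Hypothetical sales records: (customer industry, copiers bought or leased).
sales = [
    ("legal", 12), ("legal", 9),
    ("printing", 30), ("printing", 26),
    ("retail", 3), ("retail", 5),
]


def copiers_by_industry(records):
    """Total copier counts per industry, as a user-written query would."""
    totals = defaultdict(int)
    for industry, copiers in records:
        totals[industry] += copiers
    return dict(totals)


# The user chose "industry" as the segmenting field; that choice is the hunch.
report = copiers_by_industry(sales)
for industry, total in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{industry:10s} {total}")
```

The report answers exactly the question that was asked and nothing more; a factor the user never thought to query stays hidden.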
Multifactor analyses of variance and multivariate analyses identify the relationships among factors that influence the outcome of copier sales. Pearson product-moment correlations measure the strength and direction of the relationship between each database field and the dependent variable. One of the problems with this approach, aside from its resource intensity, is that the techniques tend to focus on tasks in which all the attributes have continuous or ordinal values. Many of the attributes are also parametric. A linear classifier, for instance, assumes that a relationship is expressible as a linear combination of the attribute values.
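
As an illustration of the Pearson product-moment correlation mentioned above, the following sketch computes r between one numeric field and the dependent variable. The field names and values are made up for the example, and the helper pearson_r is written out by hand rather than taken from any particular statistics package.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two numeric fields."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long fields with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical database field and dependent variable.
employees = [10, 20, 30, 40, 50]
copier_sales = [1, 2, 3, 4, 5]

r = pearson_r(employees, copier_sales)
print(f"r = {r:.3f}")
```

Note that the calculation only makes sense for numeric fields, which is the limitation the paragraph above points out: a categorical field such as business type has no mean to subtract.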
Statistical methodology assumes normally distributed data, an often tenuous assumption in the real world of corporate data warehouses.

Manual vs. Automatic

Manual data mining stems from the need to know facts, such as regional sales reports stratified by type of business, while automatic data mining comes from the need to discover the factors that influence these sales. One way to identify a true data mining tool is by how it operates on the data: is it manual (top-down) or automatic (bottom-up)? In other words, who originates the query, the user or the software?

Even some sophisticated AI-based tools that use case-based reasoning, a nearest neighbor indexing system, fuzzy (continuous) logic, and genetic algorithms don't qualify as data mining tools since their queries also originate with the user. Certainly the way these tools optimize their search on a dataset is unique, but they do not perform autonomous data discovery.

Neural networks, polynomial networks, and symbolic classifiers, on the other hand, do qualify as true automatic data mining tools because they autonomously interrogate the data for patterns. Neural networks, however, often require extensive care and feeding: they can only work with preprocessed numeric, normalized, scaled data. They also need a fair amount of tuning such as the setting of a stopping criterion, learning rates, hidden nodes, momentum coefficients, and weights. And their results are not always comprehensible.

Another Paradigm

Symbolic classifiers that use machine learning technology hold great potential as data mining tools for corporate data warehouses. These tools do not require any manual intervention in order to perform their analysis. Their strength is their ability to automatically identify key relationships in a database -- to discover rather than confirm trends or patterns in data and to present solutions in usable business formats. They can also handle the type of real-world business data that statistical and neural systems have to "scrub" and scale. Most of these symbolic classifiers are also known as rule-induction programs or decision-tree generators.
They use statistical algorithms or machine-learning algorithms such as ID3, C4.5, AC2, CART, CHAID, CN2, or modifications of these algorithms. Symbolic classifiers split a database into classes that differ as much as possible in their relation to a selected output. That is, the tool partitions a database according to the results of statistical tests conducted on an output by the algorithm instead of by the user. Machine learning algorithms use the data -- not the user's hypothesis -- to automate the stratification process.

To start the process, this type of data mining tool requires a "dependent variable" or outcome, such as copier sales, which should be a field in the database. The rest is automatic. The tool's algorithm tests a multitude of hypotheses in an effort to discover the factors or combination of factors (e.g., business type, location, number of employees) that have the most influence on the outcome.

The algorithm engages in a kind of "20 Questions" game. Presented with a database of 5,000 buyers and 5,000 nonbuyers of copiers, the algorithm asks a series of questions about the values of each record. Its goal is to classify each sample into either a buyer or nonbuyer group. The tool processes every field in every record in the database until it sufficiently splits the buyers from the nonbuyers and learns the main differences between them. Once the tool has learned the crucial attributes it can rank them in order of importance. A user can then exclude attributes that have little or no effect on targeting potential new customers.

Rule Generation

Most data mining tools generate their findings in the format of "if then" rules. Here's an example of a data mining process that discovers ranges for targeting potential product buyers.

CONDITION A
IF CUSTOMER SINCE = 1978 through 1994
AND REVOLVING LIMIT = 5120 through 8900
AND CREDIT/DEBIT RATIO = 67
THEN Potential Buyer = 89%

CONDITION Z
IF CUSTOMER SINCE = 1994 through 1996
AND REVOLVING LIMIT = 1311 through 5120
AND CREDIT/DEBIT RATIO = 67
THEN Potential Buyer = 49%

Advantages of Symbolic Classifiers

Symbolic classifiers do not require an intensive data preparation effort. This is a convenience to end-users who freely mix numeric, categorical, and date variables. Another advantage of these tools is the breadth of the analyses they provide. Unlike traditional statistical methods of data analysis, which require the user to stratify a database into smaller subgroups in order to maximize classification or prediction, data mining tools use all the data as the source of their analysis. Still another advantage is that these tools formulate their solutions in English. They can extract "if-then" business rules directly from the data based on tests they conduct for statistical significance.
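
The "20 Questions" ranking of attributes described earlier can be sketched with an ID3-style information gain calculation. This is only one possible splitting criterion, chosen here because ID3 is among the algorithms the article names; the article does not say which test any given tool applies. The records, field names, and function names below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(records, attribute, target):
    """Entropy reduction on `target` from splitting on `attribute`."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(records, attributes, target):
    """Attributes ordered from most to least influential on the outcome."""
    gains = {a: information_gain(records, a, target) for a in attributes}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records; "buyer" is the dependent variable.
records = [
    {"business_type": "printing", "location": "east", "buyer": "yes"},
    {"business_type": "printing", "location": "west", "buyer": "yes"},
    {"business_type": "retail", "location": "east", "buyer": "no"},
    {"business_type": "retail", "location": "west", "buyer": "no"},
]

ranking = rank_attributes(records, ["business_type", "location"], "buyer")
for name, gain in ranking:
    print(f"{name:15s} gain = {gain:.3f}")
```

In this toy data business type separates buyers from nonbuyers completely while location carries no information, so a user could exclude location from further targeting. Categorical fields are used as they are, with no scaling or numeric encoding.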
They can optimize business conditions by providing answers to decision-makers on important questions. Almost all of the current symbolic classifier-type data mining tools incorporate a methodology for explaining their findings. They also tabulate model error-rates for estimating the goodness of their predictions. In a business environment where small changes in strategy translate to millions of dollars, this type of insight can quickly equate to profits. Some of these tools can also generate graphic decision trees which display a summary of significant patterns and relationships in the data.

The Bottom Line

Many of today's analytic tools have tremendous capabilities for performing sophisticated user-driven queries. They are, however, limited in their abilities to discover hidden trends and patterns in a database. Statistical tools can provide excellent features for describing and visualizing large chunks of data, as well as performing verification-driven data analysis. Autonomous data mining tools, however, based on machine-learning algorithms, are the only tools designed to automate the process of knowledge discovery.

The Ten Steps of Data Mining

Here is a process for extracting hidden knowledge from your data warehouse, your customer information file, or any other company database.

1. Identify The Objective

Before you begin, be clear on what you hope to accomplish with your analysis. Know in advance the business goal of the data mining. Establish whether or not the goal is measurable. Some possible goals are to
- find sales relationships between specific products or services
- identify specific purchasing patterns over time
- identify potential types of customers
- find product sales trends

2. Select The Data

Once you have defined your goal, your next step is to select the data to meet this goal. This may be a subset of your data warehouse or a data mart that contains specific product information. It may be your customer information file. Segment as much as possible the scope of the data to be mined. Here are some key issues:
- Are the data adequate to describe the phenomena the data mining analysis is attempting to model?
- Can you enhance internal customer records with external lifestyle and demographic data?
- Are the data stable: will the mined attributes be the same after the analysis?
- If you are merging databases can you find a common field for linking them?
- How current and relevant are the data to the business goal?

3. Prepare The Data
Once you've assembled the data, you must decide which attributes to convert into usable formats. Consider the input of domain experts, the creators and users of the data.
- Establish strategies for handling missing data, extraneous noise, and outliers
- Identify redundant variables in the dataset and decide which fields to exclude
- Decide on a log or square transformation, if necessary
- Visually inspect the dataset to get a feel for the database
- Determine the distribution frequencies of the data
You can postpone some of these decisions until you select a data mining tool. For example, if you need a neural network or polynomial network you may have to transform some of your fields.

4. Audit The Data

Evaluate the structure of your data in order to determine the appropriate tools.
- What is the ratio of categorical/binary attributes in the database?
- What is the nature and structure of the database?
- What is the overall condition of the dataset?
- What is the distribution of the dataset?
Balance the objective assessment of the structure of your data against your users' need to understand the findings. Neural nets, for example, don't explain their results.

5. Select The Tools

Two concerns drive the selection of the appropriate data mining tool: your business objectives and your data structure. Both should guide you to the same tool. Consider these questions when evaluating a set of potential tools.
- Is the data set heavily categorical?
- What platforms do your candidate tools support?
- Are the candidate tools ODBC-compliant?
- What data format can the tools import?
No single tool is likely to provide the answer to your data mining project. Some tools integrate several technologies into a suite of statistical analysis programs, a neural network, and a symbolic classifier.

6. Format The Solution

In conjunction with your data audit, your business objective and the selection of your tool determine the format of your solution. The key questions are
- What is the optimum format of the solution: decision tree, rules, C code, SQL syntax?
- What are the available format options?
- What is the goal of the solution?
- What do the end-users need: graphs, reports, code?

7. Construct The Model

At this point the data mining process begins. Usually the first step is to use a random number seed to split the data into a training set and a test set and construct and evaluate a model. The generation of classification rules, decision trees, clustering sub-groups, scores, code, weights and evaluation data/error rates takes place at this stage. Resolve these issues:
- Are error rates at acceptable levels? Can you improve them?
- What extraneous attributes did you find? Can you purge them?
- Is additional data or a different methodology necessary?
- Will you have to train and test a new data set?

8. Validate The Findings

Share and discuss the results of the analysis with the business client or domain expert. Ensure that the findings are correct and appropriate to the business objectives.
- Do the findings make sense?
- Do you have to return to any prior steps to improve results?
- Can you use other data mining tools to replicate the findings?

9. Deliver The Findings

Provide a final report to the business unit or client. The report should document the entire data mining process including data preparation, tools used, test results, source code, and rules. Some of the issues are:
- Will additional data improve the analysis?
- What strategic insight did you discover and how is it applicable?
- What proposals can result from the data mining analysis?
- Do the findings meet the business objective?

10. Integrate The Solution

Share the findings with all interested end-users in the appropriate business units. You might wind up incorporating the results of the analysis into the company's business procedures. Some of the data mining solutions may involve
- SQL syntax for distribution to end-users
- C code incorporated into a production system
- Rules integrated into a decision support system.
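
Looking back at step 7, the seeded split into training and test sets and the error-rate check can be sketched as follows. Everything here is an assumption made for illustration: the records, the revolving_limit field (borrowed from the rule example), the 30 percent test fraction, the seed value, and the one-threshold "model", which stands in for whatever a real tool would build.

```python
import random


def split_records(records, test_fraction=0.3, seed=42):
    """Shuffle with a fixed seed so the same split can be reproduced later."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def fit_threshold(train, field):
    """Pick the cut point on `field` that makes the fewest training errors."""
    candidates = sorted({features[field] for features, _ in train})
    return min(
        candidates,
        key=lambda t: sum((f[field] >= t) != label for f, label in train),
    )


def error_rate(threshold, field, records):
    """Fraction of records that the rule `field >= threshold` gets wrong."""
    wrong = sum((f[field] >= threshold) != label for f, label in records)
    return wrong / len(records)


# Hypothetical records: (features, is_buyer).
records = [
    ({"revolving_limit": limit}, limit >= 5120)
    for limit in range(1000, 9000, 400)
]

train, test = split_records(records)
threshold = fit_threshold(train, "revolving_limit")
print("training error:", error_rate(threshold, "revolving_limit", train))
print("test error:", error_rate(threshold, "revolving_limit", test))
```

Reporting the error on the held-out test set rather than on the training set is what makes the error rate a fair estimate for the questions listed under step 7. Keeping the seed in the final report lets someone else reproduce the same split.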
Although data mining tools automate database analysis, they can lead to faulty findings and erroneous conclusions if you're not careful. Bear in mind that data mining is a business process with a specific goal: to extract a competitive insight from historical records in a database.

2000 Data Mining Technologies Inc. All rights reserved.

Data Mining Technologies, Inc. provides knowledge management and decision support software and services for data warehouse/business applications pertaining to Internet e-commerce and direct marketing, healthcare, stock prediction and financial services. Our core product is Nuggets, a desktop data mining toolkit, using the most powerful rule induction engine on the market