



Statistical Themes and Lessons for Data Mining

Data Mining and Knowledge Discovery, 1, 25-42 (1996)
(c) 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

CLARK GLYMOUR (cg09@andrew.cmu.edu)
Department of Cognitive Psychology, Carnegie Mellon University, Pittsburgh, PA 15213

DAVID MADIGAN (madigan@stat.washington.edu)
Department of Statistics, Box 354322, University of Washington, Seattle, WA 98195

DARYL PREGIBON (daryl@research.att.com)
Statistics Research, AT&T Laboratories, Murray Hill, NJ 07974

PADHRAIC SMYTH (smyth@ics.uci.edu)
Information and Computer Science, University of California, Irvine, CA 92717

Editor: Usama Fayyad

Abstract. Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

Keywords: Statistics, uncertainty, modeling, bias, variance

1. Introduction

Sta-tis-tics (noun). The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling. (American Heritage Dictionary).

Statistics is enjoying a renaissance period. Modern computing hardware and software have freed the statistician from narrowly specified models and spawned a fresh approach to the subject, especially as it relates to data analysis. Today's statistical toolkit draws on a rich body of theoretical and methodological research (Table 1).

The field of data mining, like statistics, concerns itself with "learning from data" or "turning data into information". The context encompasses statistics, but with a somewhat different emphasis. In particular, data mining involves retrospective analyses of data: thus, topics such as experimental design are outside the scope of data mining and fall within statistics proper. Data miners are often more interested in understandability than accuracy or predictability per se. Thus, there is a focus on relatively simple interpretable models involving rules, trees, graphs, and so forth. Applications involving very large numbers of variables and vast numbers of measurements are also common in data mining. Thus, computational efficiency and scalability are critically important, and issues of statistical consistency may be a secondary consideration. Furthermore, the current practice of data mining is often pattern-focused rather than model-focused, i.e., rather than building a coherent global model which includes all variables of interest, data mining algorithms (such as any of the many rule induction systems on the market) will produce sets of statements about local dependencies among variables (in rule form).

Table 1. Statisticians have developed a large infrastructure (theory) to support their methods and a language (probability calculus) to describe their approach to quantifying the uncertainty associated with drawing inferences from data. These methods enable one to describe relationships between variables for prediction, quantifying effects, or suggesting causal paths.

Area of Statistics                Description of Activities
experimental design & sampling    how to select cases if one has the liberty to choose
exploratory data analysis         hypothesis generation rather than hypothesis testing
statistical graphics              data visualization
statistical modeling              regression and classification techniques
statistical inference             estimation and prediction techniques

In this overall context, current data mining practice is very much driven by practical computational concerns. However, in focusing almost exclusively on computational issues, it is easy to forget that statistics is in fact a core component. The term "data mining" has long had negative connotations in the statistics literature (Selvin and Stuart, 1966; Chatfield, 1995). Data mining without proper consideration of the fundamental statistical nature of the inference problem is indeed to be avoided. However, a goal of this article is to convince the reader that modern statistics can offer significant constructive advice to the data miner, although many problems remain unsolved. Throughout the article we highlight some major themes of statistics research, focusing in particular on the practical lessons pertinent to data mining.

2. An Overview of Statistical Science

This Section briefly describes some of the central statistical ideas we think relevant to data mining. For a rigorous survey of statistics, the mathematically inclined reader should see, for example, Schervish (1995). For reasons of space we will ignore a number of interesting topics, including time series analysis and meta-analysis.

Probability Distributions. The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables: functions defined on the "events" to which a probability measure assigns values. Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming

a conditional probability measure from a measure on a sample space and some event of positive measure). Essential relations among random variables include independence, conditional independence, and various measures of dependence, of which the most famous is the correlation coefficient. The statistical literature also characterizes families of distributions by properties that are useful in identifying any particular member of the family from data, or by closure properties useful in model construction or inference, for example conjugate families, closed under conditionalization, and the multinormal family, closed under linear combination. A knowledge of the properties of distribution families can be invaluable in analyzing data and making appropriate inferences.

Estimation, Consistency, Uncertainty, Assumptions, Robustness, and Model Averaging. An estimator is a function from sample data to some estimand, such as the value of a parameter. When the data comprise a sample from a larger actual or potential collection governed by some probability distribution, the family of estimators corresponding to all possible samples from that collection also has a probability distribution. Classical statistics investigates such distributions of estimators in order to establish basic properties such as reliability and uncertainty. A variety of resampling and simulation techniques also exist for assessing estimator uncertainty (Efron and Tibshirani, 1993).

Estimation almost always requires some set of assumptions. Such assumptions are typically false, but often useful. If a model (which we can think of as a set of assumptions) is incorrect, estimates based on it can be expected to be incorrect as well. One of the aims of statistical research is to find ways to weaken the assumptions necessary for good estimation. "Robust statistics" (Huber, 1981) looks for estimators that work satisfactorily for larger families of distributions and have small errors when assumptions are violated.

Bayesian estimation emphasizes that alternative models and their competing assumptions are often plausible. Rather than making an estimate based on a single model, several models can be considered and an estimate obtained as the weighted average of the estimates given by the individual models (Madigan and Raftery, 1994). In fact, such Bayesian model averaging is bound to improve predictive performance, on average. Since the models obtained in data mining are usually the results of some automated search procedure, accounting for the potential errors associated with the search itself is crucial. In practice, this often requires a Monte Carlo analysis. Our impression is that the error rates of search procedures proposed and used in the data mining and in the statistical literature are far too rarely estimated in this way. (See Spirtes et al., 1993 for Monte Carlo test design for search procedures.)

Hypothesis Testing. Since statistical tests are widely used, some of their important limitations should be noted. Viewed as a one-sided estimation method, hypothesis testing is inconsistent unless the alpha level of the testing rule is decreased appropriately as the sample size increases. Generally, an alpha-level test of one hypothesis and an alpha-level test of another hypothesis do not jointly provide an alpha-level test of the conjunction of the two hypotheses. In special cases, rules (sometimes called contrasts) exist for simultaneously testing several hypotheses (Miller, 1981). An important corollary for data mining is that the alpha level of a test has nothing directly to do with the probability of error in a search procedure that involves testing a series of hypotheses. If, for example, for each pair of a set of variables, hypotheses of independence are tested at alpha = 0.05, then 0.05 is not the probability of erroneously finding some dependent set of variables when in fact all pairs are independent. Thus, in data mining procedures that use a sequence of hypothesis tests, the alpha level of the tests cannot generally be taken as an estimate of any error probability related to the outcome of the search.

Data miners should note that while error probabilities of tests have something to do with the truth of hypotheses, the connection is somewhat tenuous (see Section 5.3). Hypotheses that are excellent approximations may be rejected in large samples; tests of linear models, for example, typically reject them in very large samples no matter how closely they seem to fit the data.

Model Scoring. The evidence provided by data should lead us to prefer some models or hypotheses to others, and to be indifferent between still other models. A score is any rule that maps models and data to numbers whose numerical ordering corresponds to a preference ordering over the space of models, given the data. For the reasons just considered, scoring rules are often an attractive alternative to tests. Typical rules assign models a value determined by the likelihood function associated with the model, the number of parameters, or dimension, of the model, and the data. Popular rules include the Akaike Information Criterion (Akaike, 1974), Bayes Information Criterion (Raftery, 1995), and Minimum Description Length (Rissanen, 1978). Given a prior probability distribution over models, the posterior probability on the data is itself a scoring function, arguably a privileged one. The Bayes Information Criterion approximates posterior probabilities in large samples.

There is a notion of consistency appropriate to scoring rules; in the large sample limit, almost surely the true model should be among those receiving maximal scores. AIC scores are not, in general, consistent (Schwartz, 1978). There are also uncertainties associated with scores, since two different samples of the same size from the same distribution may yield not only different numerical values for the same model, but even different orderings of models.

For obvious combinatorial reasons, it is often impossible when searching a large model space to calculate scores for all models; it is, however, often feasible to describe and calculate scores for a few equivalence classes of models receiving the highest scores.

In some contexts, inferences made using Bayesian scores can differ a great deal from inferences made with hypothesis tests. Raftery (1995) gives examples of models that account for almost all of the variance of an outcome of interest, and have very high Bayesian scores, but are overwhelmingly rejected by statistical tests.

Markov Chain Monte Carlo. Historically, insurmountable computational difficulties forced data analysts to eschew exact analysis of elaborate hierarchical Bayesian models and complex likelihood calculations. Recent dramatic advances in Monte Carlo methods have, however, liberated analysts from some of these constraints. One particular class of simulation methods, dubbed Markov chain Monte Carlo, originally developed in statistical mechanics, has revolutionized the practice of Bayesian statistics. Smith and Roberts (1993) provide an accessible overview from the Bayesian perspective; Gilks et al. (1996) provide a practical introduction addressing both Bayesian and non-Bayesian perspectives.

Simulation methods may become unacceptably slow when faced with massive data sets. In such cases, recent advances in analytic approximations prove useful; see for example Kooperberg et al. (1996), Kass and Raftery (1995), and Geiger et al. (1996).

Generalized model classes. A major achievement of statistical methodological research has been the development of very general and flexible model classes. Generalized Linear Models, for instance, embrace many classical linear models, and unify estimation and testing theory for such models (McCullagh and Nelder, 1989). Generalized Additive Models show similar potential (Hastie and Tibshirani, 1990). Graphical models (Lauritzen, 1996) represent probabilistic and statistical models with planar graphs, where the vertices represent (possibly latent) random variables and the edges represent stochastic dependences. This provides a powerful language for describing models and the graphs themselves make modeling assumptions explicit. Graphical models provide important bridges between the vast statistical literature on multivariate analysis and such fields as artificial intelligence, causal analysis, and data mining.

Rational Decision Making and Planning. The theory of rational choice assumes the decision maker has available a definite set of alternative actions, knowledge of a definite set of possible alternative states of the world, knowledge of the payoffs or utilities of the outcomes of each possible action in each possible state of the world, and knowledge of the probabilities of various possible states of the world. Given all of this information, a decision rule specifies which of the alternative actions ought to be taken. A large literature in statistics and economics addresses alternative decision rules: maximizing expected utility, minimizing maximum loss, etc. Typically, rational decision making and planning are the goals of data mining, and rather than providing techniques or methods for data mining, the theory of rational choice poses norms for the use of information obtained from a database.

The very framework of rational decision making requires probabilities and a knowledge of the effects alternative actions will have. To know the outcomes of actions is to know something of cause and effect relations, and extracting such causal information is often one of the principal goals of data mining and of statistical inference more generally.

Inference to Causes. Understanding causation is the hidden force behind the historical development of statistics. From the beginning of the subject, in the work of Bernoulli and Laplace, the absence of causal connection between two variables has been taken to imply their probabilistic independence (see Stigler, 1986), and the same idea is fundamental in the theory of experimental design (Fisher, 1958). Early in this century, Wright (1921) introduced directed graphs to represent causal hypotheses (with vertices as random variables and edges representing direct influences), and they have become common representations of causal hypotheses in the social sciences, biology, computer science and engineering.

Kiiveri and Speed (1982) combined directed graphs with a generalized connection between independence and absence of causal connection in what they called the Markov condition: provided Y is not an effect of X, X and Y are conditionally independent given the direct causes of X. They showed that much of the linear modeling literature tacitly assumed the Markov condition; the same is true for causal models of categorical data, and virtually all causal models of systems without feedback. Under additional assumptions, conditional independence therefore provides information about causal dependence. The most common, and most thoroughly investigated, additional assumption is that all conditional independencies are due to the Markov condition applied to the directed graph describing the actual causal processes generating the data, a requirement that has been given many names, including "faithfulness." Directed graphs with associated probability distributions satisfying the Markov condition are called by different names in different literatures: Bayes nets, belief nets, structural equation models, path models, etc. Nonetheless, causal inferences from uncontrolled convenience samples are liable to many sources of error and data miners should proceed with extreme caution.

Prediction. Sometimes one is interested in using a sample, or a database, to predict properties of a new sample, where it is assumed that the two samples are obtained from the same probability distribution. As with estimation, in prediction we are interested both in reliability and in uncertainty, often measured by the variance of the predictor.

Prediction methods for this sort of problem always assume some structure in the probability distribution. In data mining contexts, structure is typically either supplied by human experts, or inferred from the database automatically. Regression, for example, assumes a particular functional form relating variables. Structure can also be specified in terms of constraints, such as independence, conditional independence, higher order conditions on correlations, etc. On average, a prediction method that guarantees satisfaction of the constraints realized in the probability distribution (and no others) will be more accurate and have smaller variance than one that does not. Finding the appropriate constraints to satisfy is the most difficult issue in this sort of prediction. As with estimation, prediction can be improved by model averaging, provided the prior probabilities of the alternative assumptions imposed by the model are available.

3. Is Data Mining "Statistical Deja Vu" (All Over Again)?

In the mid 1960's, the statistics community referred to unfettered exploration of data as "fishing" or "data dredging" (Selvin and Stuart, 1966). The community, enamored by elegant (analytical) mathematical solutions to inferential problems, argued that since their theories were invalidated by "looking at the data", it was wrong to do so. The major proponent of the exploratory data analysis (EDA) school, J.W. Tukey, countered this argument with the obvious retort that statisticians were putting the cart before the horse. He argued that statistical theory should adapt to the scientific method rather than the other way around. Thirty years later, the statistical community has largely adopted Tukey's perspective, and has made considerable progress in serving both masters, namely acknowledging that model search is a critical and unavoidable step in the modeling process, and devising formal methods to account for search in their inferential procedures.

Three themes of modern statistics that are of fundamental importance to data miners are: clarity about goals, appropriate reliability assessment, and adequate accounting for sources of uncertainty.

Clarity about goals. Sometimes data analysis aims to find a convenient, easily computable representation of how the data are distributed in a particular database. In other cases, data analysis aims to predict features of new cases, or new samples, drawn from outside the database used to develop a predictive model (this is particularly challenging in dynamic situations). In yet other cases, data analysis aims to provide a basis for policy. That is, the analysis is intended to yield insight into causal mechanisms that are used to form predictions about new samples that might be produced by interventions or actions that did not apply in the original database from which the model (or models) were developed. Each of these goals presents distinct inference problems, with distinct hazards. Confusing or equivocating over the aim invites the use of inappropriate methods and may result in unfortunate predictions and inferences.

As an example, consider the observational study reported by Chasnoff et al. (1989) comparing babies born to cocaine-using mothers with babies born to non-cocaine-using mothers. The authors concluded: "For women who become pregnant and are users of cocaine, intervention in early pregnancy with cessation of cocaine use will result in improved obstetric outcome". Fortunately, there exists independent evidence to support this causal claim. However, much of Chasnoff et al.'s paper focuses on a statistical analysis (analysis of variance) that has little, if anything, to do with the causal question of interest.

Hand (1994) provides a series of examples illustrating how easy it is to give the right answers to the wrong question. For example, he discusses the problem of analyzing clinical trial data where patients drop out due to adverse side-effects of a particular treatment (Diggle and Kenward, 1994). In this case, the important issue is which population one is interested in modelling: the population at large, or the population who remain within the trial? This problem arises in more general settings than in clinical trials, e.g., non-respondents (refusers) in survey data. In such situations it is important to be explicit about the questions one is trying to answer.

In this general context an important issue (discussed at length in Hand (1994)) is that of formulating statistical strategy, i.e., how does one structure a data analysis problem so that the right question can be asked? Hand's conclusion is that this is largely an "art" because it is less well formalized than the mathematical and computational details of applying a particular technique. This "art" is gained through experience (at present at least) rather than taught. The implication for

data mining is that human judgement is essential for many non-trivial inference problems. Thus, automation can at best only partially guide the data analysis process. Properly defining the goals of an analysis remains a human-centred, and often difficult, process.

Use of methods that are reliable means to the goal, under assumptions the user (and consumer) understands and finds plausible in the context. Statistical theory applies several meanings to the word "Reliability", many of which also apply to model search. For example, under what conditions does a search procedure provide correct information, of the kind sought, with probability one as the sample size increases without bound? Answers to such questions are often elusive and can require sophisticated mathematical analysis. Where answers are available, the data analyst should pay careful attention to the reasonableness of underlying assumptions. Another key data mining question is this: what are the probabilities of various kinds of errors that result from using a method in finite samples? The answers to this question will typically vary with the kinds of errors considered, with the sample size, and with the frequency of occurrence of the various kinds of targets or signals whose description is the goal of inference. These questions are often best addressed by Monte Carlo methods, although in some cases analytic results may be available.

A sense of the uncertainties of models and predictions. Quite often background knowledge and even the best methods of search and statistical assessment should leave the investigator with a range of uncertainties about the correct model, or the correct prediction. The data analyst must quantify these uncertainties so that subsequent decisions can be appropriately hedged. Section 4 provides a compelling example.

Another example involves a current debate in the atmospheric sciences. The question is whether or not specific recurrent pressure patterns can be clearly identified from daily geopotential height records which have been compiled in the Northern Hemisphere since 1948. The existence of well-defined recurrent patterns (or "regimes") has significant implications for models of upper atmosphere low-frequency variability beyond the time-scale of daily weather disturbances (and, thus, models of the earth's climate over large time-scales). Several studies have used a variety of clustering algorithms to detect inhomogeneities ("bumps") in low-dimensional projections of the gridded data (see Michelangeli et al. (1995) and others referred to therein). While this work has attempted to validate the cluster models via resampling techniques, it is difficult to infer from the multiple studies whether regimes truly exist, and, if they do, where precisely they are located. It seems likely that 48 winters worth of data is not enough to identify regimes to any degree of certainty and that there is a fundamental uncertainty (given the current data) about the underlying mechanisms at work. All is not lost, however, since it is also clear that one could quantify model uncertainty in this context, and theorize accordingly (see Section 4).

In what follows we will elaborate on these points and offer a perspective on some of the hazards of data mining.
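The question of finite-sample error probabilities raised in this section, together with the earlier warning that the alpha level of an individual test is not the error rate of a search built from many tests, can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration and does not come from the article: the variables are independent standard normals, each pair is tested with a Fisher z-test on the sample correlation at the two-sided 0.05 level, and the function names are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pair_is_significant(x, y, z_crit=1.959964):
    """Fisher z-test of zero correlation at the two-sided 0.05 level."""
    z = math.atanh(pearson_r(x, y)) * math.sqrt(len(x) - 3)
    return abs(z) > z_crit


def familywise_error_rate(n_vars=6, n_obs=50, n_sims=500, seed=0):
    """Estimate how often a search over all pairs reports at least one
    dependence when every variable is in fact independent."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_sims):
        # Illustrative data: n_vars mutually independent N(0, 1) columns.
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
                for _ in range(n_vars)]
        if any(pair_is_significant(data[i], data[j])
               for i, j in combinations(range(n_vars), 2)):
            false_alarms += 1
    return false_alarms / n_sims


if __name__ == "__main__":
    print("single pair :", familywise_error_rate(n_vars=2))
    print("all 15 pairs:", familywise_error_rate(n_vars=6))
```

With a single pair the estimated rate should sit near the nominal 0.05; with six variables (15 pairwise tests) the chance that the search reports some dependence is many times larger, which is the point made about search procedures.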

4. Characterizing Uncertainty

The statistical approach contends that reporting a single number for a parameter estimate or a prediction is almost always inadequate. Quantification of the uncertainty associated with a single number, while often challenging, is critical for subsequent decision making. As an example, Draper (1995) considered the case of the 1980 Energy Modeling Forum (EMF) at Stanford University where a 43-person working group of economists and energy experts convened to forecast world oil prices from 1981 to 2020. The group generated predictions based on a number of econometric models and scenarios, embodying a variety of assumptions about supply, demand, and growth rates of relevant quantities. A plausible reference scenario and model was selected as representative, but the summary report (EMF, 1982) cautioned against interpreting point predictions based on the reference scenario as "[the working group's] 'forecast' of the oil future, as there are too many unknowns to accept any projection as a forecast." The summary report did conclude, however, that most of the uncertainty about future oil prices "concerns not whether these prices will rise ... but how rapidly they will rise."

In 1980, the average spot price of crude oil was around $32 per barrel. Despite the warning about the potential uncertainty associated with the point estimates, governments and private companies around the world focused on the last sentence in the quotation above, and proceeded to invest an estimated $500 billion, on the basis that the price would probably be close to $40 per barrel in the mid-eighties. In fact, the actual 1986 world average spot price of oil was about $13 per barrel.

Using only the information available to the EMF in 1980, along with thoughtful but elementary statistical methods, Draper (1995) shows that a 90% predictive interval for the 1986 price would have ranged from about $20 to over $90. Note that this interval does not actually contain the actual 1986 price; insightful statistical analysis does not provide clairvoyance. However, decision makers would (and should) have proceeded more cautiously in 1980, had they understood the full extent of their uncertainty.

Correctly accounting for the different sources of uncertainty presents significant challenges. Until recently, the statistical literature focused primarily on quantifying parametric and predictive uncertainty in the context of a particular model. Two distinct approaches are in common use. "Frequentist" statisticians focus on the randomness in sampled data and summarize the induced randomness in parameters and predictions by so-called sampling distributions. "Bayesian" statisticians instead treat the data as fixed, and use Bayes theorem to turn prior opinion about quantities of interest (always expressed by a probability distribution) into a so-called posterior distribution that embraces all the available information. The fierce conflicts between previous generations of frequentists and Bayesians have largely given way in recent years to a more pragmatic approach; most statisticians will base their choice of tool on scientific appropriateness and convenience.
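As a minimal sketch of the two approaches just described, the following compares a frequentist and a Bayesian 90% interval for a single proportion. The data (7 successes in 20 trials), the uniform Beta(1, 1) prior, the normal approximation to the sampling distribution, and the function names are all assumptions chosen for illustration; none of them comes from the article.

```python
import math
import random


def frequentist_interval(successes, trials, z=1.645):
    """Approximate 90% confidence interval based on the (normal
    approximation to the) sampling distribution of the sample proportion."""
    p_hat = successes / trials
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat - z * std_err, p_hat + z * std_err


def bayesian_interval(successes, trials, prior_a=1.0, prior_b=1.0,
                      draws=20000, seed=0):
    """Central 90% posterior interval. With a Beta(prior_a, prior_b) prior
    and binomial data, Bayes theorem gives a Beta posterior; its quantiles
    are approximated here by sorting simulated posterior draws."""
    rng = random.Random(seed)
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    samples = sorted(rng.betavariate(post_a, post_b) for _ in range(draws))
    lower = samples[int(0.05 * draws)]
    upper = samples[int(0.95 * draws) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical data: 7 successes observed in 20 trials.
    print("frequentist 90% interval:", frequentist_interval(7, 20))
    print("Bayesian 90% interval:   ", bayesian_interval(7, 20))
```

The first function treats the data as random and the parameter as fixed; the second treats the data as fixed and describes remaining uncertainty about the parameter with a probability distribution. For moderate samples and a vague prior the two intervals are typically similar, which is one reason the choice between them is often made on grounds of convenience.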

In any event, recent research has led to increased awareness that within-model uncertainty (as discussed in the previous paragraph) may often, in practice, be dominated by between-model uncertainty (Chatfield, 1995; Draper, 1995; Madigan and York, 1995). It is common practice nowadays for statisticians and data miners to use computationally intensive model selection algorithms to seek out a single optimal model from an enormous class of potential models. The problem is that several different models may be close to optimal, yet lead to different inferences. Intuitively, ambiguity over the model should dilute information about effect parameters and predictions, since "part of the evidence is spent to specify the model" (Leamer, 1978, p. 91). Promising techniques for properly accounting for this source of uncertainty include Bayesian model averaging (Draper, 1995) and resampling methods (Breiman, 1996). The main point here is that data miners need to think carefully about model assessment and look beyond commonly used goodness-of-fit measures such as mean square error.

5. What can go wrong, will go wrong

Data mining poses difficult and fundamental challenges to the theory and practice of statistics. While statistics does not have all the answers for the data miner, it does provide a useful and practical framework within which to search for solutions. In this section, we describe some lessons that statisticians have learned when theory meets data.

5.1. Data Can Lie

Data mining applications typically rely on observational (as opposed to experimental) data. Interpreting observed associations in such data is challenging; sensible inferences require careful analysis, and detailed consideration of the underlying factors. Here we offer a detailed example to support this position.

Wen, Hernandez, and Naylor (1995; WHN hereafter) analyzed administrative records of all Ontario general hospital separations (discharges, transfers, or in-hospital deaths) from 1981 to 1990, focusing specifically on patients who had received a primary open cholecystectomy. Some of these patients had in addition received an incidental (i.e. discretionary) appendectomy during the cholecystectomy procedure. Table 2 displays the data on one outcome, namely in-hospital deaths. A chi-square test comparing this outcome for the two groups of patients shows a "statistically significant" difference. This "finding" is surprising since long-term prevention of appendicitis is the sole rationale for the incidental appendectomy procedure; no short-term improvement in outcomes is expected. This "finding" might lead a naive hospital policy maker to conclude that all cholecystectomy patients should have an incidental appendectomy to improve their chances of a good outcome! Clearly something is amiss: how could incidental appendectomy improve outcomes?
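The chi-square comparison mentioned above can be reproduced approximately from the counts reported in Table 2. The sketch below uses the standard Pearson statistic for a 2x2 table without continuity correction; whether WHN used exactly this variant is an assumption, and the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) and p-value
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Counts from Table 2: (deaths, survivors) for each group.
    with_app = (21, 7825)
    without_app = (1394, 190205)
    chi2, p = chi_square_2x2(*with_app, *without_app)
    print("death rate with appendectomy:   ", with_app[0] / sum(with_app))
    print("death rate without appendectomy:", without_app[0] / sum(without_app))
    print("chi-square:", chi2, "p-value:", p)
```

The statistic comfortably exceeds 3.84, the 5% critical value for one degree of freedom, so the difference is "statistically significant" in the sense used above. The test says nothing about why the two groups differ.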

Table 2. In-hospital Survival of Patients Undergoing Primary Open Cholecystectomy With and Without Incidental Appendectomy.

                                  With Appendectomy   Without Appendectomy
In-hospital deaths, No. (%)            21 (0.27%)         1,394 (0.73%)
In-hospital survivors, No. (%)      7,825 (99.73%)      190,205 (99.27%)

WHN did separately consider a subgroup of low-risk patients. For these patients (using ten different definitions of "low-risk"), incidental appendectomy indeed resulted in poorer outcomes. Paradoxically, it could even be the case that appendectomy adversely affects outcomes for both high-risk patients and low-risk patients, but appears to positively affect outcomes when the low-risk and high-risk patients are combined. WHN do not provide enough data to check whether this so-called "Simpson's Paradox" (Simpson, 1951) occurred in this example. However, Table 3 presents data that are plausible and consistent with WHN's data.

Table 3. Fictitious data consistent with the Wen et al. (1995) data.

                 With Appendectomy        Without Appendectomy
                 Low-Risk   High-Risk     Low-Risk   High-Risk
Death                   7          14          100       1,294
Survival            7,700         125      164,009      26,196

Table 4 displays the corresponding proportions of in-hospital death for these fictitious data. Clearly the risk and death categories are directly correlated. In addition, appendectomies are more likely to be carried out on low-risk patients than on high-risk ones. Thus, if we did not know the risk category (age) of a patient, knowing that they had an appendectomy allows us to infer that they are more likely to be lower risk (younger). However, this does not in any way imply that having an appendectomy will lower one's risk. Nonetheless, when risk is omitted from the table, exactly such a fallacious conclusion appears justified from the data.

Table 4. Proportion of in-hospital deaths cross classified by incidental appendectomy and patient risk grouping for the fictitious data of Table 3.

             With Appendectomy   Without Appendectomy
Low-Risk          0.0009               0.0006
High-Risk         0.10                 0.05
Combined          0.003                0.007

Returning to the original data, WHN provide a more sophisticated regression analysis, adjusting for many possible confounding variables (e.g. age, sex, admission status). They conclude that "there is absolutely no basis for any short-term improvement in outcomes" due to incidental appendectomy. This careful analysis agrees with common sense in this case. In general, analyses of observational data demand such care, and come with no guarantees. Other characteristics of available data that connive to spoil causal inferences include:

- Associations in the database may be due in whole or part to unrecorded common causes (latent variables).
- The population under study may be a mixture of distinct causal systems, resulting in statistical associations that are due to the mixing rather than to any direct influence of variables on one another or any substantive common cause.
- Missing values of variables for some units may result in misleading associations among the recorded values.
- Membership in the database may be influenced by two or more factors under study, which will create a "spurious" statistical association between those variables.
- Many models with quite distinct causal implications may "fit" the data equally or almost equally well.
- The frequency distributions in samples may not be well approximated by the most familiar families of probability distributions.
- The recorded values of variables may be the result of "feedback" mechanisms which are not well represented by simple "non-recursive" statistical models.

There is research that addresses aspects of these problems, but there are few statistical procedures yet available that can be used "off the shelf" (the way randomization is used in experimental design) to reduce these risks. Standard techniques such as multiple regression and logistic regression may work in many cases, such as in the appendectomy example, but they are not always adequate guards against these hazards. Indeed, controlling for possibly confounding variables with multiple regression can in some cases produce inferior estimates of effect sizes. Procedures recently developed in the artificial intelligence and statistics literature (Spirtes et al., 1993) address some of the problems associated with latent variables and mixing, but so far only for two families of probability distributions, the normal and multinomial.
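The reversal described in this section can be checked directly from the fictitious counts of Table 3. The following is a minimal sketch: the dictionary layout and the helper `death_rate` are illustrative choices, not anything taken from WHN. It recomputes the within-stratum and pooled proportions of in-hospital death.

```python
# Table 3 counts (fictitious data from the text): (deaths, survivors).
table3 = {
    ("with", "low"): (7, 7700),
    ("with", "high"): (14, 125),
    ("without", "low"): (100, 164009),
    ("without", "high"): (1294, 26196),
}

def death_rate(cells):
    """Proportion of deaths over a list of (deaths, survivors) cells."""
    deaths = sum(d for d, _ in cells)
    total = sum(d + s for d, s in cells)
    return deaths / total

rates = {}
for group in ("with", "without"):
    for risk in ("low", "high"):
        rates[(group, risk)] = death_rate([table3[(group, risk)]])
    # Pooling the two risk groups hides the confounder.
    rates[(group, "combined")] = death_rate(
        [table3[(group, "low")], table3[(group, "high")]]
    )

for risk in ("low", "high", "combined"):
    print(f"{risk:>8}: with = {rates[('with', risk)]:.4f}, "
          f"without = {rates[('without', risk)]:.4f}")
```

Within each risk group the appendectomy patients fare worse, yet the pooled rows reverse the ordering, because most appendectomies were performed on low-risk patients.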

5.2. Sometimes it's not what's in the data that matters

Classical statistical methods start with a random sample, yet in practice, data, or the institutions that give rise to data, can be uncooperative. In such cases, inferences that ignore how the data were "selected" can lead to distorted conclusions.

Consider, for example, the Challenger Space Shuttle accident. The Rogers Commission concluded that an O-ring failure in the solid rocket booster led to the structural breakup and loss of the Challenger. In reconstructing the events leading up to the decision to launch, the Commission noted a mistake in the analysis of thermal-distress data whereby flights with no (i.e. zero) incidents of O-ring damage were excluded from critical plots of O-ring damage and ambient launch temperature since it was felt that they did not contribute any information about the temperature effect. This truncation of the data led to the conclusion that no relationship between O-ring damage and temperature existed, and ultimately, the decision to launch. Dalal et al. (1989) throw statistical light on the matter by demonstrating the strong correlation between O-ring damage and temperature, and quantifying the risk (of catastrophic failure) at 31°F. Had the original analysis used all of the data, it would have indicated that the decision to launch was at best a risky proposition.

In the above case, the selection bias problem was one of "human error" and could easily have been avoided. In most problems, selection bias is an inherent characteristic of the available data and methods of analysis need to deal with it. It is our experience that every data set has the potential for selection bias to invalidate standard inferences. The lessons to be learned here are:

- any technique used to analyze truncated data as if it were a random sample can be fooled, regardless of how the truncation was induced;
- the data themselves are seldom capable of alerting the analyst that a selection mechanism is operating;
- information external to the data at hand is critical in understanding the nature and extent of potential biases.
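The effect of this kind of truncation can be seen on a toy example. The flights below are entirely hypothetical, invented for illustration, and are not the Challenger record analyzed by Dalal et al. (1989); the 65°F cutoff between "cold" and "warm" launches is likewise an arbitrary assumption. The sketch compares mean O-ring incidents per flight with and without the zero-incident flights.

```python
# HYPOTHETICAL flights, invented for illustration only (not the real
# Challenger record): (launch temperature in deg F, O-ring incidents).
hypothetical_flights = [
    (55, 1), (58, 1), (60, 1),                      # cold, all damaged
    (70, 1), (72, 1), (75, 1),                      # warm, damaged
    (66, 0), (67, 0), (68, 0), (70, 0), (72, 0),    # warm, no damage
    (73, 0), (76, 0), (78, 0), (79, 0), (80, 0),    # warm, no damage
]

def mean_incidents(flights, cold_below=65):
    """Mean incidents per flight for cold and for warm launches.
    The cold_below cutoff is an assumed, illustrative threshold."""
    cold = [n for temp, n in flights if temp < cold_below]
    warm = [n for temp, n in flights if temp >= cold_below]
    return sum(cold) / len(cold), sum(warm) / len(warm)

# Truncated analysis: flights with zero incidents are dropped.
truncated = [(temp, n) for temp, n in hypothetical_flights if n > 0]
cold_trunc, warm_trunc = mean_incidents(truncated)

# Full analysis: every flight is kept.
cold_full, warm_full = mean_incidents(hypothetical_flights)

print(f"truncated: cold = {cold_trunc:.2f}, warm = {warm_trunc:.2f}")
print(f"full data: cold = {cold_full:.2f}, warm = {warm_full:.2f}")
```

In the truncated data cold and warm launches look identical, while the full data show cold launches with a much higher incident rate. Nothing inside the truncated list signals that the undamaged flights are missing, which is the second lesson above.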
5.3. The Perversity of the Pervasive P-value

P-values and associated significance (or hypothesis) tests play a central role in classical (frequentist) statistics. It seems natural, therefore, that data miners should make widespread use of P-values. However, indiscriminate use of P-values can lead data miners astray in most applications.

The standard significance test proceeds as follows. Consider two competing hypotheses about the world: the Null Hypothesis, commonly denoted by H0, and the Alternative Hypothesis, commonly denoted by HA. Typically H0 is "nested" within HA; for example, H0 might state that a certain combination of parameters is equal to zero, while HA might place no restriction on the combination. A test statistic, T, is selected and calculated from the data at hand. The idea is that T(data) should measure the evidence in the data against H0. The analyst rejects H0 in favor of HA if T(data) is more extreme than would be expected if H0 were true. Specifically, the analyst computes the P-value, that is, the probability of T being greater than or equal to T(data), given that H0 is true. The analyst rejects H0 if the P-value is less than a preset significance level, α.

There are three primary difficulties associated with this approach:

1. The standard advice that statistics educators provide, and scientific journals rigidly adhere to, is to choose α to be 0.05 or 0.01, regardless of sample size. These particular α-levels arose in Sir Ronald Fisher's study of relatively small agricultural experiments (on the order of 30-200 plots). Textbook advice (e.g., Neyman and Pearson, 1933) has emphasized the need to take account of the power of the test against HA when setting α, and somehow reduce α when the sample size is large. This crucial but vague advice has largely fallen on deaf ears.

2. Raftery (1995) points out that the whole hypothesis testing framework rests on the basic assumption that only two hypotheses are ever entertained. In practice, data miners will consider very large numbers of possible models. As a consequence, indiscriminate use of P-values with "standard" fixed α-levels can lead to undesirable outcomes such as selecting a model with parameters that are highly significantly different from zero, even when the training data are pure noise (Freedman, 1983). This point is of fundamental importance for data miners.

3. The P-value is the probability associated with the event that the test statistic was as extreme as the value observed, or more so. However, the event that actually happened was that a specific value of the test statistic was observed. Consequently, the relationship between the P-value and the veracity of H0 is subtle at best. Jeffreys (1980) puts it this way:

"I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened."

Bayes Factors are the Bayesian analogue of the frequentist P-values and admit a more direct interpretation: the Bayesian analyst computes the posterior probability that a hypothesis is correct. With fixed α-levels, the frequentist and the Bayesian will arrive at very different conclusions. For example, Berger and Sellke (1987) show that data that yield a P-value of 0.05 when testing a normal mean result in a posterior probability for H0 that is at least 0.30 for any "objective" prior distribution. One way to reconcile the two positions is to view Bayes Factors as a method for selecting appropriate α-levels; see Raftery (1995).
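The first two difficulties listed above lend themselves to a short numerical illustration. The sketch below rests on simplifying assumptions that are not in the text: a two-sided z-test with known standard deviation for difficulty 1, and mutually independent tests for difficulty 2. The function names are introduced here only for the example.

```python
import math

def two_sided_p_value(effect, sigma, n):
    """Two-sided P-value of a z-test of H0: mean = 0 when the sample mean
    equals `effect`, with known standard deviation `sigma` and n cases."""
    z = abs(effect) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2))

def prob_false_alarm(alpha, m):
    """Chance that at least one of m independent tests of true null
    hypotheses yields a P-value below alpha (independence is assumed)."""
    return 1 - (1 - alpha) ** m

# Difficulty 1: the same tiny effect (0.01 standard deviations) is
# nowhere near significant for n = 100 but overwhelmingly so for
# n = 1,000,000 when alpha stays fixed.
p_small_n = two_sided_p_value(0.01, 1.0, 100)
p_large_n = two_sided_p_value(0.01, 1.0, 1_000_000)

# Difficulty 2: screening 100 candidate effects on pure noise at
# alpha = 0.05 almost guarantees at least one "significant" result.
any_hit = prob_false_alarm(0.05, 100)

print(f"n = 100:       P = {p_small_n:.3f}")
print(f"n = 1,000,000: P = {p_large_n:.3g}")
print(f"P(at least one false alarm in 100 tests) = {any_hit:.3f}")
```

With very large data sets and very large model spaces, both typical of data mining, a fixed α of 0.05 or 0.01 therefore says little about whether an effect is real or worth acting on.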

5.4. Intervention and Prediction

A specific class of prediction problems involves interventions that alter the probability distribution of the problem, as in predicting the values (or probabilities) of variables under a change in manufacturing procedures, or changes in economic or medical treatment policies. Accurate predictions of this kind require some knowledge of the relevant causal structure, and are in general quite different from prediction without intervention, although the usual caveats about uncertainty and model averaging apply. For graphical representations of causal hypotheses according to the Markov condition, general algorithms for predicting the outcomes of interventions from complete or incomplete causal models were developed in Spirtes et al. (1993). Some of these procedures have been extended and made into a more convenient calculus by Pearl (1995). A related theory without graphical models was developed earlier by Rubin (1974) and others, and by Robbins (1986).

Consider the following example. Herbert Needleman's famous studies of the correlation of lead deposits in children's teeth with their IQs resulted, eventually, in removal of tetraethyl lead from gasoline in the United States. One data set Needleman examined included more than 200 subjects, and measured a large number of covariates. Needleman, Geiger, and Frank (1985) re-analyzed the data using backwards step-wise regression of verbal IQ on these variables and obtained six significant regressors, including lead. Klepper (1988) reanalyzed the data assuming that all of the variables were measured with error. The model assumes that each measured number is a linear combination of the true value and an error, and that the parameters of interest are not the regression coefficients but rather the coefficients relating the unmeasured "true value" variables to the unmeasured true value of verbal IQ. These coefficients are in fact indeterminate or, in econometric terminology, "unidentifiable". An interval estimate of the coefficients that is strictly positive or negative for each coefficient can be made, however, if the amount of measurement error can be bounded with prior knowledge by an amount that varies from case to case. Klepper found that the bound required to ensure the existence of a strictly negative interval estimate for the lead-IQ coefficient was much too strict to be credible; thus he concluded that the case against lead was not nearly as strong as Needleman's analysis suggested.

Allowing the possibility of latent variables, Scheines (1996) reanalyzed the correlations (using TETRAD methodology) and concluded that three of the six regressors could have no influence on IQ. The regression included the three extra variables only because the partial regression coefficient is estimated by conditioning on all other regressors, which is just the right thing to do for linear prediction, but the wrong thing to do for causal inference using the Markov condition. Using the Klepper model, but without the three irrelevant variables, and assigning to all of the parameters a normal prior probability with mean zero and a substantial variance, Scheines then used Markov chain Monte Carlo to compute a posterior probability distribution for the lead-IQ parameter. The probability is very high that lead exposure reduces verbal IQ.
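The difference between predicting by conditioning and predicting under intervention can be made concrete with the fictitious appendectomy counts of Table 3. The sketch below assumes, purely for illustration, that risk group is the only common cause of treatment and death; under that assumption the effect of intervening is obtained by averaging the stratum-specific death rates over the population distribution of risk (a standard adjustment formula, used here as a stand-in and not the general algorithms of Spirtes et al. or Pearl).

```python
# Table 3 counts (fictitious): (deaths, survivors) by treatment and risk.
counts = {
    ("with", "low"): (7, 7700),
    ("with", "high"): (14, 125),
    ("without", "low"): (100, 164009),
    ("without", "high"): (1294, 26196),
}
RISKS = ("low", "high")
TREATMENTS = ("with", "without")

def n_patients(treatment, risk):
    deaths, survivors = counts[(treatment, risk)]
    return deaths + survivors

total = sum(n_patients(t, r) for t in TREATMENTS for r in RISKS)

def p_death_given(treatment):
    """Conditioning: P(death | treatment was observed)."""
    deaths = sum(counts[(treatment, r)][0] for r in RISKS)
    return deaths / sum(n_patients(treatment, r) for r in RISKS)

def p_death_do(treatment):
    """Intervening: sum over risk of P(risk) * P(death | treatment, risk).
    ASSUMPTION: risk group is the only common cause of treatment and death."""
    result = 0.0
    for r in RISKS:
        p_risk = sum(n_patients(t, r) for t in TREATMENTS) / total
        p_death = counts[(treatment, r)][0] / n_patients(treatment, r)
        result += p_risk * p_death
    return result

for t in TREATMENTS:
    print(f"{t:>7}: conditioning = {p_death_given(t):.4f}, "
          f"intervening = {p_death_do(t):.4f}")
```

Conditioning makes appendectomy look protective, while the intervention estimate points the other way, in line with the stratified proportions of Table 4. The answer depends entirely on the assumed causal structure, which is why predictions under intervention require causal knowledge that the data alone do not supply.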

6. Symbiosis in Statistics

Easy access to data in digital form and the availability of software tools for statistical analyses have made it possible for the man in the street to set up shop and "do statistics." Nowhere is this more true today than in data mining. Based on the arguments in this article, let us assume that statistics is a necessary but not sufficient component in the practice of data mining. How well will the statistics profession serve the data mining community? Hoerl et al. (1993), for example, assert that:

"We are our own best customers. Much of the work of the statistical profession is intended for other members of the statistical profession."

Despite this rather negative view of the relevance of statistical research, real-world applications do in fact drive much of what goes on in statistics, although often in a very indirect manner.

As an example consider the field of signal processing and communications, an area where a specialized set of relatively sophisticated statistical methods and models has been honed for practical use. The field was driven by fundamental advances from Claude Shannon and others in the 1940's. Like most of the other contributors to the field, Shannon was not a statistician, but possessed a deep understanding of probability theory and its applications. From the 1950's to the present, due to rapid advances in both theory and hardware, the field has exploded and relevant statistical methods such as estimation and detection have found their way into everyday use in radio and network communications systems. Modern statistical communications reflects the symbiosis of statistical theory and engineering practice. Engineering researchers in the field are in effect "adjunct" statisticians: educated in probability theory and basic statistics, they have the tools to apply statistical methods to their problems of interest. Meanwhile statisticians continue to develop more general models and estimation techniques of potential applicability to new problems in communications.

This type of symbiosis can also be seen in other areas such as financial modelling, speech recognition (where for example hidden Markov models provide the state-of-the-art in the field), and most notably, epidemiology. Indeed, if statistics can claim to have revolutionized any field, it is in the biological and health sciences where the statistical approach to data analysis gave birth to the field of biostatistics.

The relevance of this symbiosis for data mining is that data miners need to understand statistical principles, and statisticians need to understand the nature of the important problems that the data mining community is attacking or being asked to attack. This has been a successful model in the past for fields where statistics has had considerable impact and has the potential to see ongoing success.

7. Conclusion

The statistical literature has a wealth of technical procedures and results to offer data mining, but it also has a few simple methodological morals: prove that estimation and search procedures used in data mining are consistent under conditions reasonably thought to apply in applications; use and reveal uncertainty, don't hide it; calibrate the errors of search, both for honesty and to take advantage of model averaging; don't confuse conditioning with intervening; and finally, don't take the error probabilities of hypothesis tests to be the error probabilities of search procedures.

References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Automat. Contr. AC-19:716-723.

Berger, J.O. and Sellke, T. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence (with discussion). Journal of the American Statistical Association 82:112-122.

Breiman, L. 1996. Bagging predictors. Machine Learning, to appear.

Chasnoff, I.J., Griffith, D.R., MacGregor, S., Dirkes, K., Burns, K.A. 1989. Temporal patterns of cocaine use in pregnancy: perinatal outcome. Journal of the American Medical Association 261(12):1741-4.

Chatfield, C. 1995. Model uncertainty, data mining, and statistical inference (with discussion). Journal of the Royal Statistical Society (Series A) 158:419-466.

Dalal, S.R., Fowlkes, E.B. and Hoadley, B. 1989. Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association 84:945-957.

Diggle, P. and Kenward, M.G. 1994. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49-93.

Draper, D., Gaver, D.P., Goel, P.K., Greenhouse, J.B., Hedges, L.V., Morris, C.N., Tucker, J., and Waternaux, C. 1993. Combining Information: National Research Council Panel on Statistical Issues and Opportunities for Research in the Combination of Information. Washington: National Academy Press.

Draper, D. 1995. Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society (Series B) 57:45-97.

Efron, B. and Tibshirani, R.J. 1993. An Introduction to the Bootstrap. New York: Chapman and Hall.
Energy Modeling Forum 1982. World Oil: Summary report. EMF Report 6, Energy Modeling Forum, Stanford University, Stanford, CA.

Fisher, R.A. 1958. Statistical methods for research workers. New York: Hafner Pub. Co.

Freedman, D.A. 1983. A note on screening regression equations. The American Statistician 37:152-155.

Geiger, D., Heckerman, D., and Meek, C. 1996. Asymptotic model selection for directed networks with hidden variables. Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann.

Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. 1996. Markov chain Monte Carlo in practice. London: Chapman and Hall.

Hand, D.J. 1994. Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society (Series A) 157:317-356.

Hastie, T.J. and Tibshirani, R. 1990. Generalized Additive Models. London: Chapman and Hall.

Hoerl, R.W., Hooper, J.H., Jacobs, P.J., Lucas, J.M. 1993. Skills for industrial statisticians to survive and prosper in the emerging quality environment. The American Statistician 47:280-292.

Huber, P.J. 1981. Robust Statistics. New York: Wiley.

Jeffreys, H. 1980. Some general points in probability theory. In: A. Zellner (Ed.), Bayesian Analysis in Econometrics and Statistics. Amsterdam: North-Holland, 451-454.

Kass, R.E. and Raftery, A.E. 1995. Bayes factors. Journal of the American Statistical Association 90:773-795.

Kiiveri, H. and Speed, T.P. 1982. Structural analysis of multivariate data: A review. Sociological Methodology 209-289.

Kooperberg, C., Bose, S., and Stone, C.J. 1996. Polychotomous regression. Journal of the American Statistical Association, to appear.

Lauritzen, S.L. 1996. Graphical Models. Oxford: Oxford University Press.

Leamer, E.E. 1978. Specification Searches. Ad Hoc Inference with Nonexperimental Data. New York: Wiley.

Madigan, D. and Raftery, A.E. 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89:1335-1346.

Madigan, D. and York, J. 1995. Bayesian graphical models for discrete data. International Statistical Review 63:215-232.

Matheson, J.E. and Winkler, R.L. 1976. Scoring rules for continuous probability distributions. Management Science 22:1087-1096.

McCullagh, P. and Nelder, J.A. 1989. Generalized Linear Models. London: Chapman and Hall.

Michelangeli, P.A., Vautard, R., and Legras, B. 1995. Weather regimes: recurrence and quasi-stationarity. Journal of the Atmospheric Sciences 52(8):1237-56.

Miller, R.G. Jr. 1981. Simultaneous statistical inference (Second Edition). New York: Springer-Verlag.

Neyman, J. and Pearson, E.S. 1933. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society (Series A) 231:289-337.

Raftery, A.E. 1995. Bayesian model selection in social research (with discussion). In Sociological Methodology (ed. P.V. Marsden), Oxford, U.K.: Blackwells, 111-196.

Rissanen, J. 1978. Modeling by shortest data description. Automatica 14:465-471.

Schervish, M.J. 1995. Theory of Statistics. New York: Springer-Verlag.

Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6:461-464.

Selvin, H. and Stuart, A. 1966. Data dredging procedures in survey analysis. The American Statistician 20(3):20-23.
Simpson, E.H. 1951. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society (Series B) 13:238-241.

Smith, A.F.M. and Roberts, G. 1993. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society (Series B) 55:3-23.

Spirtes, P., Glymour, C., and Scheines, R. 1993. Causation, Prediction and Search. Springer Lecture Notes in Statistics. New York: Springer-Verlag.

Stigler, S.M. 1986. The history of statistics: The measurement of uncertainty before 1900. Harvard: Harvard University Press.

Wen, S.W., Hernandez, R., and Naylor, C.D. 1995. Pitfalls in nonrandomized studies: The case of incidental appendectomy with open cholecystectomy. Journal of the American Medical Association 274:1687-1691.

Wright, S. 1921. Correlation and causation. Journal of Agricultural Research 20:557-585.