Statistical Themes and Lessons for Data Mining

Data Mining and Knowledge Discovery, 1, 25-42 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

CLARK GLYMOUR    cg09@andrew.cmu.edu
Department of Cognitive Psychology, Carnegie Mellon University, Pittsburgh, PA 15213

DAVID MADIGAN    madigan@stat.washington.edu
Department of Statistics, Box 354322, University of Washington, Seattle, WA 98195

DARYL PREGIBON    daryl@research.att.com
Statistics Research, AT&T Laboratories, Murray Hill, NJ 07974

PADHRAIC SMYTH    smyth@ics.uci.edu
Information and Computer Science, University of California, Irvine, CA 92717

Editor: Usama Fayyad

Abstract. Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

Keywords: Statistics, uncertainty, modeling, bias, variance

1. Introduction

Sta-tis-tics (noun). The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling. (American Heritage Dictionary).

Statistics is enjoying a renaissance period. Modern computing hardware and software have freed the statistician from narrowly specified models and spawned a fresh approach to the subject, especially as it relates to data analysis. Today's statistical toolkit draws on a rich body of theoretical and methodological research (Table 1).

The field of data mining, like statistics, concerns itself with "learning from data" or "turning data into information". The context encompasses statistics, but with a somewhat different emphasis. In particular, data mining involves retrospective analyses of data: thus, topics such as experimental design are outside the scope of data mining and fall within statistics proper. Data miners are often more interested in understandability than accuracy or predictability per se. Thus, there is a focus on relatively simple interpretable models involving rules, trees, graphs, and so forth. Applications involving very large numbers of variables and vast numbers of measurements are also common in data mining. Thus, computational efficiency
and scalability are critically important, and issues of statistical consistency may be a secondary consideration. Furthermore, the current practice of data mining is often pattern-focused rather than model-focused, i.e., rather than building a coherent global model which includes all variables of interest, data mining algorithms (such as any of the many rule induction systems on the market) will produce sets of statements about local dependencies among variables (in rule form).

Table 1. Statisticians have developed a large infrastructure (theory) to support their methods and a language (probability calculus) to describe their approach to quantifying the uncertainty associated with drawing inferences from data. These methods enable one to describe relationships between variables for prediction, quantifying effects, or suggesting causal paths.

Area of Statistics                Description of Activities
experimental design & sampling    how to select cases if one has the liberty to choose
exploratory data analysis         hypothesis generation rather than hypothesis testing
statistical graphics              data visualization
statistical modeling              regression and classification techniques
statistical inference             estimation and prediction techniques

In this overall context, current data mining practice is very much driven by practical computational concerns. However, in focusing almost exclusively on computational issues, it is easy to forget that statistics is in fact a core component. The term "data mining" has long had negative connotations in the statistics literature (Selvin and Stuart, 1966; Chatfield, 1995). Data mining without proper consideration of the fundamental statistical nature of the inference problem is indeed to be avoided. However, a goal of this article is to convince the reader that modern statistics can offer significant constructive advice to the data miner, although many problems remain unsolved. Throughout the article we highlight some major themes of statistics research, focusing in particular on the practical lessons pertinent to data mining.

2. An Overview of Statistical Science

This Section briefly describes some of the central statistical ideas we think relevant to data mining. For a rigorous survey of statistics, the mathematically inclined reader should see, for example, Schervish (1995). For reasons of space we will ignore a number of interesting topics, including time series analysis and meta-analysis.

Probability Distributions. The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables (functions defined on the "events" to which a probability measure assigns values). Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming
a conditional probability measure from a measure on a sample space and some event of positive measure). Essential relations among random variables include independence, conditional independence, and various measures of dependence, of which the most famous is the correlation coefficient. The statistical literature also characterizes families of distributions by properties that are useful in identifying any particular member of the family from data, or by closure properties useful in model construction or inference, for example conjugate families, closed under conditionalization, and the multinormal family, closed under linear combination. A knowledge of the properties of distribution families can be invaluable in analyzing data and making appropriate inferences.

Estimation, Consistency, Uncertainty, Assumptions, Robustness, and Model Averaging. An estimator is a function from sample data to some estimand, such as the value of a parameter. When the data comprise a sample from a larger actual or potential collection governed by some probability distribution, the family of estimators corresponding to all possible samples from that collection also has a probability distribution. Classical statistics investigates such distributions of estimators in order to establish basic properties such as reliability and uncertainty. A variety of resampling and simulation techniques also exist for assessing estimator uncertainty (Efron and Tibshirani, 1993).

Estimation almost always requires some set of assumptions. Such assumptions are typically false, but often useful. If a model (which we can think of as a set of assumptions) is incorrect, estimates based on it can be expected to be incorrect as well. One of the aims of statistical research is to find ways to weaken the assumptions necessary for good estimation. "Robust Statistics" (Huber, 1981) looks for estimators that work satisfactorily for larger families of distributions and have small errors when assumptions are violated.

Bayesian estimation emphasizes that alternative models and their competing assumptions are often plausible. Rather than making an estimate based on a single model, several models can be considered and an estimate obtained as the weighted average of the estimates given by the individual models (Madigan and Raftery, 1994). In fact, such Bayesian model averaging is bound to improve predictive performance, on average. Since the models obtained in data mining are usually the results of some automated search procedure, accounting for the potential errors associated with the search itself is crucial. In practice, this often requires a Monte Carlo analysis. Our impression is that the error rates of search procedures proposed and used in the data mining and in the statistical literature are far too rarely estimated in this way. (See Spirtes et al., 1993 for Monte Carlo test design for search procedures.)

Hypothesis Testing. Since statistical tests are widely used, some of their important limitations should be noted. Viewed as a one-sided estimation method, hypothesis testing is inconsistent unless the alpha level of the testing rule is decreased appropriately as the sample size increases. Generally, an α level test of one hypothesis and an α level test of another hypothesis do not jointly provide an α level test of the conjunction of the two hypotheses. In special cases, rules (sometimes called contrasts) exist for simultaneously testing several hypotheses (Miller, 1981). An important corollary for data mining is that the α level of a test has nothing directly to do with the probability of error in a search procedure that involves testing a series of hypotheses. If, for example, for each pair of a set of variables, hypotheses of independence are tested at α = 0.05, then 0.05 is not the probability of erroneously finding some dependent set of variables when in fact all pairs are independent. Thus, in data mining procedures that use a sequence of hypothesis tests, the alpha level of the tests cannot generally be taken as an estimate of any error probability related to the outcome of the search.

Data miners should note that while error probabilities of tests have something to do with the truth of hypotheses, the connection is somewhat tenuous (see Section 5.3). Hypotheses that are excellent approximations may be rejected in large samples; tests of linear models, for example, typically reject them in very large samples no matter how closely they seem to fit the data.

Model Scoring. The evidence provided by data should lead us to prefer some models or hypotheses to others, and to be indifferent between still other models. A score is any rule that maps models and data to numbers whose numerical ordering corresponds to a preference ordering over the space of models, given the data. For the reasons just considered, scoring rules are often an attractive alternative to tests. Typical rules assign models a value determined by the likelihood function associated with the model, the number of parameters, or dimension, of the model, and the data. Popular rules include the Akaike Information Criterion (Akaike, 1974), Bayes Information Criterion (Raftery, 1995), and Minimum Description Length (Rissanen, 1978). Given a prior probability distribution over models, the posterior probability on the data is itself a scoring function, arguably a privileged one. The Bayes Information Criterion approximates posterior probabilities in large samples.

There is a notion of consistency appropriate to scoring rules; in the large sample limit, almost surely the true model should be among those receiving maximal scores. AIC scores are not, in general, consistent (Schwartz, 1978). There are also uncertainties associated with scores, since two different samples of the same size from the same distribution may yield not only different numerical values for the same model, but even different orderings of models.

For obvious combinatorial reasons, it is often impossible when searching a large model space to calculate scores for all models; it is, however, often feasible to describe and calculate scores for a few equivalence classes of models receiving the highest scores.

In some contexts, inferences made using Bayesian scores can differ a great deal from inferences made with hypothesis tests. Raftery (1995) gives examples of models that account for almost all of the variance of an outcome of interest, and have very high Bayesian scores, but are overwhelmingly rejected by statistical tests.

Markov Chain Monte Carlo. Historically, insurmountable computational difficulties forced data analysts to eschew exact analysis of elaborate hierarchical Bayesian models and complex likelihood calculations. Recent dramatic advances in Monte Carlo methods have, however, liberated analysts from some of these constraints. One particular class of simulation methods, dubbed Markov chain Monte Carlo, originally developed in statistical mechanics, has revolutionized the practice of Bayesian statistics. Smith and Roberts (1993) provide an accessible overview from the Bayesian perspective; Gilks et al. (1996) provide a practical introduction addressing both Bayesian and non-Bayesian perspectives.

Simulation methods may become unacceptably slow when faced with massive data sets. In such cases, recent advances in analytic approximations prove useful; see for example Kooperberg et al. (1996), Kass and Raftery (1995), and Geiger et al. (1996).

Generalized model classes. A major achievement of statistical methodological research has been the development of very general and flexible model classes. Generalized Linear Models, for instance, embrace many classical linear models, and unify estimation and testing theory for such models (McCullagh and Nelder, 1989). Generalized Additive Models show similar potential (Hastie and Tibshirani, 1990). Graphical models (Lauritzen, 1996) represent probabilistic and statistical models with planar graphs, where the vertices represent (possibly latent) random variables and the edges represent stochastic dependences. This provides a powerful language for describing models and the graphs themselves make modeling assumptions explicit. Graphical models provide important bridges between the vast statistical literature on multivariate analysis and such fields as artificial intelligence, causal analysis, and data mining.

Rational Decision Making and Planning. The theory of rational choice assumes the decision maker has available a definite set of alternative actions, knowledge of a definite set of possible alternative states of the world, knowledge of the payoffs or utilities of the outcomes of each possible action in each possible state of the world, and knowledge of the probabilities of various possible states of the world. Given all of this information, a decision rule specifies which of the alternative actions ought to be taken. A large literature in statistics and economics addresses alternative decision rules: maximizing expected utility, minimizing maximum loss, etc. Typically, rational decision making and planning are the goals of data mining, and rather than providing techniques or methods for data mining, the theory of rational choice poses norms for the use of information obtained from a database.

The very framework of rational decision making requires probabilities and a knowledge of the effects alternative actions will have. To know the outcomes of actions is to know something of cause and effect relations, and extracting such causal information is often one of the principal goals of data mining and of statistical inference more generally.

Inference to Causes. Understanding causation is the hidden force behind the historical development of statistics. From the beginning of the subject, in the work of Bernoulli and Laplace, the absence of causal connection between two variables has been taken to imply their probabilistic independence (see Stigler, 1986), and the same idea is fundamental in the theory of experimental design (Fisher, 1958). Early in this century, Wright (1921) introduced directed graphs to represent causal hypotheses (with vertices as random variables and edges representing direct influences), and they have become common representations of causal hypotheses in the social sciences, biology, computer science and engineering.

Kiiveri and Speed (1982) combined directed graphs with a generalized connection between independence and absence of causal connection in what they called the Markov condition: provided Y is not an effect of X, X and Y are conditionally independent given the direct causes of X. They showed that much of the linear modeling literature tacitly assumed the Markov condition; the same is true for causal models of categorical data, and virtually all causal models of systems without feedback. Under additional assumptions, conditional independence therefore provides information about causal dependence. The most common, and most thoroughly investigated, additional assumption is that all conditional independencies are due to the Markov condition applied to the directed graph describing the actual causal processes generating the data, a requirement that has been given many names, including "faithfulness." Directed graphs with associated probability distributions satisfying the Markov condition are called by different names in different literatures: Bayes nets, belief nets, structural equation models, path models, etc. Nonetheless, causal inferences from uncontrolled convenience samples are liable to many sources of error and data miners should proceed with extreme caution.

Prediction. Sometimes one is interested in using a sample, or a database, to predict properties of a new sample, where it is assumed that the two samples are obtained from the same probability distribution. As with estimation, in prediction we are interested both in reliability and in uncertainty, often measured by the variance of the predictor.

Prediction methods for this sort of problem always assume some structure in the probability distribution. In data mining contexts, structure is typically either supplied by human experts, or inferred from the database automatically. Regression, for example, assumes a particular functional form relating variables. Structure can also be specified in terms of constraints, such as independence, conditional independence, higher order conditions on correlations, etc. On average, a prediction method that guarantees satisfaction of the constraints realized in the probability distribution (and no others) will be more accurate and have smaller variance than one that does not. Finding the appropriate constraints to satisfy is the most difficult issue in this sort of prediction. As with estimation, prediction can be improved by model averaging, provided the prior probabilities of the alternative assumptions imposed by the model are available.

3. Is Data Mining "Statistical Deja Vu" (All Over Again)?

In the mid 1960's, the statistics community referred to unfettered exploration of data as "fishing" or "data dredging" (Selvin and Stuart, 1966). The community, enamored by elegant (analytical) mathematical solutions to inferential problems, argued that since their theories were invalidated by "looking at the data", it was wrong to do so. The major proponent of the exploratory data analysis (EDA) school, J.W. Tukey, countered this argument with the obvious retort that statisticians were putting the cart before the horse. He argued that statistical theory should adapt to the scientific method rather than the other way around. Thirty years later, the statistical community has largely adopted Tukey's perspective, and has made considerable progress in serving both masters, namely acknowledging that model search is a critical and unavoidable step in the modeling process, and devising formal methods to account for search in their inferential procedures.

Three themes of modern statistics that are of fundamental importance to data miners are: clarity about goals, appropriate reliability assessment, and adequate accounting for sources of uncertainty.

Clarity about goals. Sometimes data analysis aims to find a convenient, easily computable representation of how the data are distributed in a particular database. In other cases, data analysis aims to predict features of new cases, or new samples, drawn from outside the database used to develop a predictive model (this is particularly challenging in dynamic situations). In yet other cases, data analysis aims to provide a basis for policy. That is, the analysis is intended to yield insight into causal mechanisms that are used to form predictions about new samples that might be produced by interventions or actions that did not apply in the original database from which the model (or models) were developed. Each of these goals presents distinct inference problems, with distinct hazards. Confusing or equivocating over the aim invites the use of inappropriate methods and may result in unfortunate predictions and inferences.

As an example, consider the observational study reported by Chasnoff et al. (1989) comparing babies born to cocaine-using mothers with babies born to non-cocaine-using mothers. The authors concluded: "For women who become pregnant and are users of cocaine, intervention in early pregnancy with cessation of cocaine use will result in improved obstetric outcome". Fortunately, there exists independent evidence to support this causal claim. However, much of Chasnoff et al.'s paper focuses on a statistical analysis (analysis of variance) that has little, if anything, to do with the causal question of interest.

Hand (1994) provides a series of examples illustrating how easy it is to give the right answers to the wrong question. For example, he discusses the problem of analyzing clinical trial data where patients drop out due to adverse side-effects of a particular treatment (Diggle and Kenward, 1994). In this case, the important issue is which population is one interested in modelling? The population at large versus the population who remain within the trial? This problem arises in more general settings than in clinical trials, e.g., non-respondents (refusers) in survey data. In such situations it is important to be explicit about the questions one is trying to answer.

In this general context an important issue (discussed at length in Hand (1994)) is that of formulating statistical strategy, i.e., how does one structure a data analysis problem so that the right question can be asked? Hand's conclusion is that this is largely an "art" because it is less well formalized than the mathematical and computational details of applying a particular technique. This "art" is gained through experience (at present at least) rather than taught. The implication for
data mining is that human judgement is essential for many non-trivial inference problems. Thus, automation can at best only partially guide the data analysis process. Properly defining the goals of an analysis remains a human-centred, and often difficult, process.

Use of methods that are reliable means to the goal, under assumptions the user (and consumer) understands and finds plausible in the context. Statistical theory applies several meanings to the word "Reliability", many of which also apply to model search. For example, under what conditions does a search procedure provide correct information, of the kind sought, with probability one as the sample size increases without bound? Answers to such questions are often elusive and can require sophisticated mathematical analysis. Where answers are available, the data analyst should pay careful attention to the reasonableness of underlying assumptions. Another key data mining question is this: what are the probabilities of various kinds of errors that result from using a method in finite samples? The answers to this question will typically vary with the kinds of errors considered, with the sample size, and with the frequency of occurrence of the various kinds of targets or signals whose description is the goal of inference. These questions are often best addressed by Monte Carlo methods, although in some cases analytic results may be available.

A sense of the uncertainties of models and predictions. Quite often background knowledge and even the best methods of search and statistical assessment should leave the investigator with a range of uncertainties about the correct model, or the correct prediction. The data analyst must quantify these uncertainties so that subsequent decisions can be appropriately hedged. Section 4 provides a compelling example.

Another example involves a current debate in the atmospheric sciences. The question is whether or not specific recurrent pressure patterns can be clearly identified from daily geopotential height records which have been compiled in the Northern Hemisphere since 1948. The existence of well-defined recurrent patterns (or "regimes") has significant implications for models of upper atmosphere low-frequency variability beyond the time-scale of daily weather disturbances (and, thus, models of the earth's climate over large time-scales). Several studies have used a variety of clustering algorithms to detect inhomogeneities ("bumps") in low-dimensional projections of the gridded data (see Michelangeli et al. (1995) and others referred to therein). While this work has attempted to validate the cluster models via resampling techniques, it is difficult to infer from the multiple studies whether regimes truly exist, and, if they do, where precisely they are located. It seems likely that 48 winters worth of data is not enough to identify regimes to any degree of certainty and that there is a fundamental uncertainty (given the current data) about the underlying mechanisms at work. All is not lost, however, since it is also clear that one could quantify model uncertainty in this context, and theorize accordingly (see Section 4).

In what follows we will elaborate on these points and offer a perspective on some of the hazards of data mining.
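
The finite-sample error question raised above can be explored with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions, not a procedure taken from the article: the "search" tests every pair among ten independent standard normal variables for zero correlation at a nominal α = 0.05, using the Fisher z approximation, and all function names and settings (`simulate_search`, 10 variables, 100 cases, 200 trials) are hypothetical choices. It estimates both the per-test false-positive rate and the fraction of searches that report at least one spurious dependence.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rejects_independence(x, y, z_crit=1.96):
    """Two-sided test of zero correlation at roughly alpha = 0.05,
    based on the Fisher z approximation (an assumption of this sketch)."""
    r = pearson_r(x, y)
    z = math.atanh(r) * math.sqrt(len(x) - 3)
    return abs(z) > z_crit


def simulate_search(n_vars=10, n_cases=100, n_trials=200, seed=0):
    """Estimate per-test and per-search false-positive rates when every
    variable is independent standard normal noise."""
    rng = random.Random(seed)
    tests = false_hits = searches_with_hit = 0
    for _ in range(n_trials):
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_cases)]
                for _ in range(n_vars)]
        hits = 0
        for i in range(n_vars):
            for j in range(i + 1, n_vars):
                tests += 1
                hits += rejects_independence(data[i], data[j])
        false_hits += hits
        searches_with_hit += hits > 0
    return false_hits / tests, searches_with_hit / n_trials


per_test_rate, per_search_rate = simulate_search()
print(f"per-test false-positive rate:  {per_test_rate:.3f}")
print(f"searches with >= 1 false find: {per_search_rate:.3f}")
```

With 45 pairwise tests per search, fully independent tests would give a chance of about 1 - 0.95^45 ≈ 0.90 of at least one false rejection. The pairwise tests here are not exactly independent, so that figure is only a guide, but the simulation makes the same point: the α level of the individual tests says little about the error rate of the search as a whole.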

4. Characterizing Uncertainty

The statistical approach contends that reporting a single number for a parameter estimate or a prediction is almost always inadequate. Quantification of the uncertainty associated with a single number, while often challenging, is critical for subsequent decision making. As an example, Draper (1995) considered the case of the 1980 Energy Modeling Forum (EMF) at Stanford University where a 43-person working group of economists and energy experts convened to forecast world oil prices from 1981 to 2020. The group generated predictions based on a number of econometric models and scenarios, embodying a variety of assumptions about supply, demand, and growth rates of relevant quantities. A plausible reference scenario and model was selected as representative, but the summary report (EMF, 1982) cautioned against interpreting point predictions based on the reference scenario as "[the working group's] 'forecast' of the oil future, as there are too many unknowns to accept any projection as a forecast." The summary report did conclude, however, that most of the uncertainty about future oil prices "concerns not whether these prices will rise... but how rapidly they will rise."

In 1980, the average spot price of crude oil was around $32 per barrel. Despite the warning about the potential uncertainty associated with the point estimates, governments and private companies around the world focused on the last sentence in the quotation above, and proceeded to invest an estimated $500 billion, on the basis that the price would probably be close to $40 per barrel in the mid-eighties. In fact, the actual 1986 world average spot price of oil was about $13 per barrel.

Using only the information available to the EMF in 1980, along with thoughtful but elementary statistical methods, Draper (1995) shows that a 90% predictive interval for the 1986 price would have ranged from about $20 to over $90. Note that this interval does not contain the actual 1986 price; insightful statistical analysis does not provide clairvoyance. However, decision makers would (and should) have proceeded more cautiously in 1980, had they understood the full extent of their uncertainty.

Correctly accounting for the different sources of uncertainty presents significant challenges. Until recently, the statistical literature focused primarily on quantifying parametric and predictive uncertainty in the context of a particular model. Two distinct approaches are in common use. "Frequentist" statisticians focus on the randomness in sampled data and summarize the induced randomness in parameters and predictions by so-called sampling distributions. "Bayesian" statisticians instead treat the data as fixed, and use Bayes theorem to turn prior opinion about quantities of interest (always expressed by a probability distribution) into a so-called posterior distribution that embraces all the available information. The fierce conflicts between previous generations of frequentists and Bayesians have largely given way in recent years to a more pragmatic approach; most statisticians will base their choice of tool on scientific appropriateness and convenience.
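
To make the contrast between the two summaries concrete, here is a minimal sketch for the simplest possible case, an unknown proportion. Everything specific in it is an assumption made for illustration and does not come from the article: the data (7 successes in 20 trials), the uniform Beta(1, 1) prior, and the function names. The frequentist summary uses the normal approximation to the sampling distribution of the sample proportion; the Bayesian summary uses the fact that a Beta prior combined with binomial data yields a Beta posterior.

```python
import math


def frequentist_summary(successes, trials):
    """Point estimate and standard error from the sampling distribution
    of the sample proportion (normal approximation)."""
    p_hat = successes / trials
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat, std_err


def bayesian_summary(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation under a Beta(prior_a, prior_b)
    prior; the Beta family is conjugate to the binomial likelihood."""
    a = prior_a + successes
    b = prior_b + trials - successes
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


# Hypothetical data: 7 successes in 20 trials.
p_hat, std_err = frequentist_summary(7, 20)
post_mean, post_sd = bayesian_summary(7, 20)
print(f"frequentist: estimate={p_hat:.3f}, standard error={std_err:.3f}")
print(f"Bayesian:    posterior mean={post_mean:.3f}, posterior sd={post_sd:.3f}")
```

In this toy case the two answers are close, which is typical when the prior is weak relative to the data. Either way, the analysis reports a measure of uncertainty alongside the point value rather than a single number.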

In any event, recent research has led to increased awareness that within-model uncertainty (as discussed in the previous paragraph) may often, in practice, be dominated by between-model uncertainty (Chatfield, 1995; Draper, 1995; Madigan and York, 1995). It is common practice nowadays for statisticians and data miners to use computationally intensive model selection algorithms to seek out a single optimal model from an enormous class of potential models. The problem is that several different models may be close to optimal, yet lead to different inferences. Intuitively, ambiguity over the model should dilute information about effect parameters and predictions, since "part of the evidence is spent to specify the model" (Leamer, 1978, p. 91). Promising techniques for properly accounting for this source of uncertainty include Bayesian model averaging (Draper, 1995) and resampling methods (Breiman, 1996). The main point here is that data miners need to think carefully about model assessment and look beyond commonly used goodness-of-fit measures such as mean square error.

5. What can go wrong, will go wrong

Data mining poses difficult and fundamental challenges to the theory and practice of statistics. While statistics does not have all the answers for the data miner, it does provide a useful and practical framework within which to search for solutions. In this section, we describe some lessons that statisticians have learned when theory meets data.

5.1. Data Can Lie

Data mining applications typically rely on observational (as opposed to experimental) data. Interpreting observed associations in such data is challenging; sensible inferences require careful analysis, and detailed consideration of the underlying factors. Here we offer a detailed example to support this position.

Wen, Hernandez, and Naylor (1995; WHN hereafter) analyzed administrative records of all Ontario general hospital separations (discharges, transfers, or in-hospital deaths) from 1981 to 1990, focusing specifically on patients who had received a primary open cholecystectomy. Some of these patients had in addition received an incidental (i.e. discretionary) appendectomy during the cholecystectomy procedure. Table 2 displays the data on one outcome, namely in-hospital deaths. A chi-square test comparing this outcome for the two groups of patients shows a "statistically significant" difference. This "finding" is surprising since long-term prevention of appendicitis is the sole rationale for the incidental appendectomy procedure; no short-term improvement in outcomes is expected. This "finding" might lead a naive hospital policy maker to conclude that all cholecystectomy patients should have an incidental appendectomy to improve their chances of a good outcome! Clearly something is amiss: how could incidental appendectomy improve outcomes?
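
The chi-square comparison mentioned above can be reproduced from the counts reported in Table 2 (21 deaths and 7,825 survivors with incidental appendectomy; 1,394 deaths and 190,205 survivors without). The sketch below is one plausible version of such a test, not necessarily the one WHN ran: it assumes the ordinary Pearson statistic without continuity correction, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]], with its 1-degree-of-freedom P-value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: with / without incidental appendectomy; columns: deaths, survivors.
stat, p_value = chi_square_2x2(21, 7825, 1394, 190205)
print(f"chi-square = {stat:.2f}, P-value = {p_value:.2g}")
```

The statistic lies far above 3.84, the 5% critical value for one degree of freedom, so the difference is "significant" in the conventional sense. As the rest of this section argues, that says nothing about whether the appendectomy caused the better outcome.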

Table 2. In-hospital Survival of Patients Undergoing Primary Open Cholecystectomy With and Without Incidental Appendectomy.

                                  With             Without
                                  Appendectomy     Appendectomy
In-hospital deaths, No. (%)       21 (0.27%)       1,394 (0.73%)
In-hospital survivors, No. (%)    7,825 (99.73%)   190,205 (99.27%)

WHN did separately consider a subgroup of low-risk patients. For these patients (using ten different definitions of "low-risk"), incidental appendectomy indeed resulted in poorer outcomes. Paradoxically, it could even be the case that appendectomy adversely affects outcomes for both high-risk patients and low-risk patients, but appears to positively affect outcomes when the low-risk and high-risk patients are combined. WHN do not provide enough data to check whether this so-called "Simpson's Paradox" (Simpson, 1951) occurred in this example. However, Table 3 presents data that are plausible and consistent with WHN's data.

Table 3. Fictitious data consistent with the Wen et al. (1995) data.

            With Appendectomy        Without Appendectomy
            Low-Risk   High-Risk     Low-Risk   High-Risk
Death       7          14            100        1294
Survival    7700       125           164009     26196

Table 4 displays the corresponding proportions of in-hospital death for these fictitious data. Clearly the risk and death categories are directly correlated. In addition, appendectomies are more likely to be carried out on low-risk patients than on high-risk ones. Thus, if we did not know the risk category (age) of a patient, knowing that they had an appendectomy allows us to infer that they are more likely to be lower risk (younger). However, this does not in any way imply that having an appendectomy will lower one's risk. Nonetheless, when risk is omitted from the table, exactly such a fallacious conclusion appears justified from the data.

Table 4. Proportion of in-hospital deaths cross classified by incidental appendectomy and patient risk grouping for the fictitious data of Table 3.

            With             Without
            Appendectomy     Appendectomy
Low-Risk    0.0009           0.0006
High-Risk   0.10             0.05
Combined    0.003            0.007

Returning to the original data, WHN provide a more sophisticated regression analysis, adjusting for many possible confounding variables (e.g. age, sex, admission status). They conclude that "there is absolutely no basis for any short-term improvement in outcomes" due to incidental appendectomy. This careful analysis agrees with common sense in this case. In general, analyses of observational data demand such care, and come with no guarantees. Other characteristics of available data that connive to spoil causal inferences include:

- Associations in the database may be due in whole or part to unrecorded common causes (latent variables).
- The population under study may be a mixture of distinct causal systems, resulting in statistical associations that are due to the mixing rather than to any direct influence of variables on one another or any substantive common cause.
- Missing values of variables for some units may result in misleading associations among the recorded values.
- Membership in the database may be influenced by two or more factors under study, which will create a "spurious" statistical association between those variables.
- Many models with quite distinct causal implications may "fit" the data equally or almost equally well.
- The frequency distributions in samples may not be well approximated by the most familiar families of probability distributions.
- The recorded values of variables may be the result of "feedback" mechanisms which are not well represented by simple "non-recursive" statistical models.

There is research that addresses aspects of these problems, but there are few statistical procedures yet available that can be used "off the shelf" (the way randomization is used in experimental design) to reduce these risks. Standard techniques such as multiple regression and logistic regression may work in many cases, such as in the appendectomy example, but they are not always adequate guards against these hazards. Indeed, controlling for possibly confounding variables with multiple regression can in some cases produce inferior estimates of effect sizes. Procedures recently developed in the artificial intelligence and statistics literature (Spirtes et al., 1993) address some of the problems associated with latent variables and mixing, but so far only for two families of probability distributions, the normal and multinomial.
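
The reversal described for the fictitious data of Table 3 is easy to verify directly. The short sketch below recomputes the death proportions of Table 4 from the Table 3 counts (with appendectomy: 7 deaths and 7,700 survivors at low risk, 14 and 125 at high risk; without: 100 and 164,009 at low risk, 1,294 and 26,196 at high risk). The names `TABLE3`, `death_rate` and `rate` are hypothetical helpers introduced only for this illustration.

```python
# Counts from Table 3 (fictitious data), stored as (deaths, survivors).
TABLE3 = {
    ("with", "low"): (7, 7700),
    ("with", "high"): (14, 125),
    ("without", "low"): (100, 164009),
    ("without", "high"): (1294, 26196),
}


def death_rate(cells):
    """Proportion of deaths over a collection of (deaths, survivors) cells."""
    deaths = sum(d for d, _ in cells)
    total = sum(d + s for d, s in cells)
    return deaths / total


def rate(group, risk=None):
    """Death rate for a treatment group, within one risk stratum or,
    when risk is None, with the two strata combined."""
    cells = [v for (g, r), v in TABLE3.items()
             if g == group and (risk is None or r == risk)]
    return death_rate(cells)


for risk in ("low", "high", None):
    label = risk or "combined"
    print(f"{label:>8}: with={rate('with', risk):.4f} "
          f"without={rate('without', risk):.4f}")
```

Within each risk stratum the death rate is higher with appendectomy, yet the combined rate is lower, because almost all appendectomy patients fall in the low-risk stratum. The direction of the comparison depends entirely on whether risk is included in the table.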

5.2. Sometimes it's not what's in the data that matters

Classical statistical methods start with a random sample, yet in practice, data or the institutions that give rise to data, can be uncooperative. In such cases, inferences that ignore how the data were "selected" can lead to distorted conclusions.

Consider, for example, the Challenger Space Shuttle accident. The Rogers Commission concluded that an O-ring failure in the solid rocket booster led to the structural breakup and loss of the Challenger. In reconstructing the events leading up to the decision to launch, the Commission noted a mistake in the analysis of thermal-distress data whereby flights with no (i.e. zero) incidents of O-ring damage were excluded from critical plots of O-ring damage and ambient launch temperature since it was felt that they did not contribute any information about the temperature effect. This truncation of the data led to the conclusion that no relationship between O-ring damage and temperature existed, and ultimately, the decision to launch. Dalal et al. (1989) throw statistical light on the matter by demonstrating the strong correlation between O-ring damage and temperature, and quantifying the risk (of catastrophic failure) at 31°F. Had the original analysis used all of the data, it would have indicated that the decision to launch was at best a risky proposition.

In the above case, the selection bias problem was one of "human error" and could easily have been avoided. In most problems, selection bias is an inherent characteristic of the available data and methods of analysis need to deal with it. It is our experience that every data set has the potential for selection bias to invalidate standard inferences. The lessons to be learned here are

- that any technique used to analyze truncated data as if it were a random sample can be fooled, regardless of how the truncation was induced;
- the data themselves are seldom capable of alerting the analyst that a selection mechanism is operating;
- information external to the data at hand is critical in understanding the nature and extent of potential biases.

5.3. The Perversity of the Pervasive P-value

P-values and associated significance (or hypothesis) tests play a central role in classical (frequentist) statistics. It seems natural, therefore, that data miners should make widespread use of P-values. However, indiscriminate use of P-values can lead data miners astray in most applications.

The standard significance test proceeds as follows. Consider two competing hypotheses about the world: the Null Hypothesis, commonly denoted by H0, and the Alternative Hypothesis, commonly denoted by HA. Typically H0 is "nested" within HA; for example, H0 might state that a certain combination of parameters is equal to zero, while HA might place no restriction on the combination. A test statistic, T, is selected and calculated from the data at hand. The idea is that T(data) should
measure the evidence in the data against H0. The analyst rejects H0 in favor of HA if T(data) is more extreme than would be expected if H0 were true. Specifically, the analyst computes the P-value, that is, the probability of T being greater than or equal to T(data), given that H0 is true. The analyst rejects H0 if the P-value is less than a preset significance level, α.

There are three primary difficulties associated with this approach:

1. The standard advice that statistics educators provide, and scientific journals rigidly adhere to, is to choose α to be 0.05 or 0.01, regardless of sample size. These particular α-levels arose in Sir Ronald Fisher's study of relatively small agricultural experiments (on the order of 30-200 plots). Textbook advice (e.g., Neyman and Pearson, 1933) has emphasized the need to take account of the power of the test against HA when setting α, and somehow reduce α when the sample size is large. This crucial but vague advice has largely fallen on deaf ears.

2. Raftery (1995) points out that the whole hypothesis testing framework rests on the basic assumption that only two hypotheses are ever entertained. In practice, data miners will consider very large numbers of possible models. As a consequence, indiscriminate use of P-values with "standard" fixed α-levels can lead to undesirable outcomes such as selecting a model with parameters that are highly significantly different from zero, even when the training data are pure noise (Freedman, 1983). This point is of fundamental importance for data miners.

3. The P-value is the probability associated with the event that the test statistic was as extreme as the value observed, or more so. However, the event that actually happened was that a specific value of the test statistic was observed. Consequently, the relationship between the P-value and the veracity of H0 is subtle at best. Jeffreys (1980) puts it this way:

    I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened.

Bayes Factors are the Bayesian analogue of the frequentist P-values and admit a more direct interpretation: the Bayesian analyst computes the posterior probability that a hypothesis is correct. With fixed α-levels, the frequentist and the Bayesian will arrive at very different conclusions. For example, Berger and Sellke (1987) show that data that yield a P-value of 0.05 when testing a normal mean result in a posterior probability for H0 that is at least 0.30 for any "objective" prior distribution. One way to reconcile the two positions is to view Bayes factors as a method for selecting appropriate α-levels; see Raftery (1995).
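The gap between a P-value of 0.05 and the posterior probability of H0 can be made concrete for one family of priors. The sketch below is an illustration under stated assumptions, not a reproduction of Berger and Sellke's full argument: it assumes a two-sided test of a normal mean with known variance, prior probability 1/2 on H0, and, under HA, a normal prior centered at the null value with arbitrary variance. Writing u for the ratio of the null sampling variance to the marginal variance under HA, the Bayes factor in favor of H0 is u^(-1/2) exp(-z^2 (1 - u) / 2), which is smallest at u = 1/z^2 when |z| > 1. The function name `min_posterior_null` is hypothetical.

```python
import math
from statistics import NormalDist


def min_posterior_null(p_value, prior_null=0.5):
    """Smallest P(H0 | data) over normal priors centered at the null value.

    Assumes a two-sided test of a normal mean with known variance.
    """
    # Convert the two-sided P-value into the absolute z statistic.
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    if z <= 1.0:
        # For |z| <= 1 the Bayes factor is minimized at u = 1, where it equals 1.
        bayes_factor = 1.0
    else:
        # Minimum of u**-0.5 * exp(-z**2 * (1 - u) / 2) over u in (0, 1], at u = 1 / z**2.
        bayes_factor = math.sqrt(math.e) * z * math.exp(-z * z / 2.0)
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


bound = min_posterior_null(0.05)
print(f"lower bound on P(H0 | data) when P = 0.05: {bound:.2f}")
```

Within this family the posterior probability of H0 cannot fall much below about 0.32, which is consistent with the "at least 0.30" figure quoted above. Other prior classes give different bounds, so the number should be read as an example rather than a general constant.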
5.4. Intervention and Prediction

A specific class of prediction problems involves interventions that alter the probability distribution of the problem, as in predicting the values (or probabilities) of variables under a change in manufacturing procedures, or changes in economic or medical treatment policies. Accurate predictions of this kind require some knowledge of the relevant causal structure, and are in general quite different from prediction without intervention, although the usual caveats about uncertainty and model averaging apply. For graphical representations of causal hypotheses according to the Markov condition, general algorithms for predicting the outcomes of interventions from complete or incomplete causal models were developed in (Spirtes et al., 1993). Some of these procedures have been extended and made into a more convenient calculus by Pearl (1995). A related theory without graphical models was developed earlier by Rubin (1974) and others, and by Robbins (1986).

Consider the following example. Herbert Needleman's famous studies of the correlation of lead deposits in children's teeth with their IQs resulted, eventually, in removal of tetraethyl lead from gasoline in the United States. One data set Needleman examined included more than 200 subjects, and measured a large number of covariates. Needleman, Geiger, and Frank (1985) re-analyzed the data using backwards step-wise regression of verbal IQ on these variables and obtained six significant regressors, including lead. Klepper (1988) reanalyzed the data assuming that all of the variables were measured with error. The model assumes that each measured number is a linear combination of the true value and an error, and that the parameters of interest are not the regression coefficients but rather the coefficients relating the unmeasured "true value" variables to the unmeasured true value of verbal IQ. These coefficients are in fact indeterminate; in econometric terminology, "unidentifiable". An interval estimate of the coefficients that is strictly positive or negative for each coefficient can be made, however, if the amount of measurement error can be bounded with prior knowledge by an amount that varies from case to case. Klepper found that the bound required to ensure the existence of a strictly negative interval estimate for the lead-IQ coefficient was much too strict to be credible; he therefore concluded that the case against lead was not nearly as strong as Needleman's analysis suggested.

Allowing the possibility of latent variables, Scheines (1996) reanalyzed the correlations (using TETRAD methodology) and concluded that three of the six regressors could have no influence on IQ. The regression included the three extra variables only because the partial regression coefficient is estimated by conditioning on all other regressors, which is just the right thing to do for linear prediction, but the wrong thing to do for causal inference using the Markov condition. Using the Klepper model, but without the three irrelevant variables, and assigning to all of the parameters a normal prior probability with mean zero and a substantial variance, Scheines then used Markov chain Monte Carlo to compute a posterior probability distribution for the lead-IQ parameter. The probability is very high that lead exposure reduces verbal IQ.
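The difference between predicting by conditioning and predicting under intervention can be seen in a toy simulation. The sketch below uses a hypothetical linear model that is not taken from the lead-IQ data: an unrecorded variable U drives both X and Y, and X has no causal effect on Y at all. All names (`simulate`, `ols_slope`, the coefficients and sample size) are assumptions chosen for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n, causal_effect, intervene, rng):
    """Draw n units and return the regression slope of Y on X."""
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unrecorded common cause of X and Y
        if intervene:
            x = rng.gauss(0.0, 1.0)  # X is set from outside, independently of U
        else:
            x = u + rng.gauss(0.0, 1.0)  # X is partly driven by U
        y = causal_effect * x + u + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)


rng = random.Random(0)
observational_slope = simulate(20000, 0.0, False, rng)
interventional_slope = simulate(20000, 0.0, True, rng)

print(f"slope in passively observed data: {observational_slope:.2f}")
print(f"slope when X is set by intervention: {interventional_slope:.2f}")
```

In the observed data the slope settles near 0.5, which is the right coefficient for predicting Y from X in that population. When X is set by intervention the slope is near zero, so the observational coefficient would badly mispredict the result of changing X.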
6. Symbiosis in Statistics

Easy access to data in digital form and the availability of software tools for statistical analyses have made it possible for the man in the street to set up shop and "do statistics." Nowhere is this more true today than in data mining. Based on the arguments in this article, let us assume that statistics is a necessary but not sufficient component in the practice of data mining. How well will the statistics profession serve the data mining community? Hoerl et al. (1993), for example, assert that:

    We are our own best customers. Much of the work of the statistical profession is intended for other members of the statistical profession.

Despite this rather negative view of the relevance of statistical research, real-world applications do in fact drive much of what goes on in statistics, although often in a very indirect manner.

As an example consider the field of signal processing and communications, an area where a specialized set of relatively sophisticated statistical methods and models has been honed for practical use. The field was driven by fundamental advances from Claude Shannon and others in the 1940's. Like most of the other contributors to the field, Shannon was not a statistician, but possessed a deep understanding of probability theory and its applications. Through the 1950's to the present, due to rapid advances in both theory and hardware, the field has exploded and relevant statistical methods such as estimation and detection have found their way into everyday use in radio and network communications systems. Modern statistical communications reflects the symbiosis of statistical theory and engineering practice. Engineering researchers in the field are in effect "adjunct" statisticians: educated in probability theory and basic statistics, they have the tools to apply statistical methods to their problems of interest. Meanwhile statisticians continue to develop more general models and estimation techniques of potential applicability to new problems in communications.

This type of symbiosis can also be seen in other areas such as financial modelling, speech recognition (where for example hidden Markov models provide the state-of-the-art in the field), and most notably, epidemiology. Indeed, if statistics can claim to have revolutionized any field, it is in the biological and health sciences where the statistical approach to data analysis gave birth to the field of biostatistics.

The relevance of this symbiosis for data mining is that data miners need to understand statistical principles, and statisticians need to understand the nature of the important problems that the data mining community is attacking or being asked to attack. This has been a successful model in the past for fields where statistics has had considerable impact and has the potential to see ongoing success.
7. Conclusion

The statistical literature has a wealth of technical procedures and results to offer data mining, but it also has a few simple methodological morals: prove that estimation and search procedures used in data mining are consistent under conditions reasonably thought to apply in applications; use and reveal uncertainty, don't hide it; calibrate the errors of search, both for honesty and to take advantage of model averaging; don't confuse conditioning with intervening; and finally, don't take the error probabilities of hypothesis tests to be the error probabilities of search procedures.

References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Automat. Contr. AC-19:716-723.
Berger, J.O. and Sellke, T. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence (with discussion). Journal of the American Statistical Association 82:112-122.
Breiman, L. 1996. Bagging predictors. Machine Learning, to appear.
Chasnoff, I.J., Griffith, D.R., MacGregor, S., Dirkes, K., Burns, K.A. 1989. Temporal patterns of cocaine use in pregnancy: perinatal outcome. Journal of the American Medical Association 261(12):1741-4.
Chatfield, C. 1995. Model uncertainty, data mining, and statistical inference (with discussion). Journal of the Royal Statistical Society (Series A) 158:419-466.
Dalal, S.R., Fowlkes, E.B. and Hoadley, B. 1989. Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association 84:945-957.
Diggle, P. and Kenward, M.G. 1994. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49-93.
Draper, D., Gaver, D.P., Goel, P.K., Greenhouse, J.B., Hedges, L.V., Morris, C.N., Tucker, J., and Waternaux, C. 1993. Combining information: National Research Council Panel on Statistical Issues and Opportunities for Research in the Combination of Information. Washington: National Academy Press.
Draper, D. 1995. Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society (Series B) 57:45-97.
Efron, B. and Tibshirani, R.J. 1993. An Introduction to the Bootstrap. New York: Chapman and Hall.
Energy Modeling Forum 1982. World Oil: Summary report. EMF Report 6, Energy Modeling Forum, Stanford University, Stanford, CA.
Fisher, R.A. 1958. Statistical methods for research workers. New York: Hafner Pub. Co.
Freedman, D.A. 1983. A note on screening regression equations. The American Statistician 37:152-155.
Geiger, D., Heckerman, D., and Meek, C. 1996. Asymptotic model selection for directed networks with hidden variables. Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. 1996. Markov chain Monte Carlo in practice. London: Chapman and Hall.
Hand, D.J. 1994. Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society (Series A) 157:317-356.
Hastie, T.J. and Tibshirani, R. 1990. Generalized Additive Models. London: Chapman and Hall.
Hoerl, R.W., Hooper, J.H., Jacobs, P.J., Lucas, J.M. 1993. Skills for industrial statisticians to survive and prosper in the emerging quality environment. The American Statistician 47:280-292.
Huber, P.J. 1981. Robust Statistics. New York: Wiley.
Jeffreys, H. 1980. Some general points in probability theory. In: A. Zellner (Ed.), Bayesian Analysis in Econometrics and Statistics. Amsterdam: North-Holland, 451-454.
Kass, R.E. and Raftery, A.E. 1995. Bayes factors. Journal of the American Statistical Association 90:773-795.
Kiiveri, H. and Speed, T.P. 1982. Structural analysis of multivariate data: A review. Sociological Methodology 209-289.
Kooperberg, C., Bose, S., and Stone, C.J. 1996. Polychotomous regression. Journal of the American Statistical Association, to appear.
Lauritzen, S.L. 1996. Graphical Models. Oxford: Oxford University Press.
Leamer, E.E. 1978. Specification Searches. Ad Hoc Inference with Nonexperimental Data. New York: Wiley.
Madigan, D. and Raftery, A.E. 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89:1335-1346.
Madigan, D. and York, J. 1995. Bayesian graphical models for discrete data. International Statistical Review 63:215-232.
Matheson, J.E. and Winkler, R.L. 1976. Scoring rules for continuous probability distributions. Management Science 22:1087-1096.
McCullagh, P. and Nelder, J.A. 1989. Generalized Linear Models. London: Chapman and Hall.
Michelangeli, P.A., Vautard, R., and Legras, B. 1995. Weather regimes: recurrence and quasi-stationarity. Journal of the Atmospheric Sciences 52(8):1237-56.
Miller, R.G. Jr. 1981. Simultaneous statistical inference (Second Edition). New York: Springer-Verlag.
Neyman, J. and Pearson, E.S. 1933. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society (Series A) 231:289-337.
Raftery, A.E. 1995. Bayesian model selection in social research (with discussion). In Sociological Methodology (ed. P.V. Marsden), Oxford, U.K.: Blackwells, 111-196.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14:465-471.
Schervish, M.J. 1995. Theory of Statistics. New York: Springer-Verlag.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6:461-464.
Selvin, H. and Stuart, A. 1966. Data dredging procedures in survey analysis. The American Statistician 20(3):20-23.
Simpson, E.H. 1951. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society (Series B) 13:238-241.
Smith, A.F.M. and Roberts, G. 1993. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society (Series B) 55:3-23.
Spirtes, P., Glymour, C., and Scheines, R. 1993. Causation, Prediction and Search. Springer Lecture Notes in Statistics. New York: Springer-Verlag.
Stigler, S.M. 1986. The history of statistics: The measurement of uncertainty before 1900. Harvard: Harvard University Press.
Wen, S.W., Hernandez, R., and Naylor, C.D. 1995. Pitfalls in nonrandomized studies: The case of incidental appendectomy with open cholecystectomy. Journal of the American Medical Association 274:1687-1691.
Wright, S. 1921. Correlation and causation. Journal of Agricultural Research 20:557-585.