Statistical Themes and Lessons for Data Mining

Data Mining and Knowledge Discovery, 1, 25-42 (1996)
(c) 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

CLARK GLYMOUR
Department of Cognitive Psychology, Carnegie Mellon University, Pittsburgh, PA 15213

DAVID MADIGAN
Department of Statistics, Box 354322, University of Washington, Seattle, WA 98195

DARYL PREGIBON

PADHRAIC SMYTH
Information and Computer Science, University of California, Irvine, CA 92717

Editor: Usama Fayyad

Abstract. Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

Keywords: Statistics, uncertainty, modeling, bias, variance

1. Introduction

Sta-tis-tics (noun). The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling. (American Heritage Dictionary).

Statistics is enjoying a renaissance period. Modern computing hardware and software have freed the statistician from narrowly specified models and spawned a fresh approach to the subject, especially as it relates to data analysis. Today's statistical toolkit draws on a rich body of theoretical and methodological research (Table 1).

The field of data mining, like statistics, concerns itself with "learning from data" or "turning data into information". The context encompasses statistics, but with a somewhat different emphasis. In particular, data mining involves retrospective analyses of data: thus, topics such as experimental design are outside the scope of data mining and fall within statistics proper. Data miners are often more interested in understandability than accuracy or predictability per se. Thus, there is a focus on relatively simple interpretable models involving rules, trees, graphs, and so forth. Applications involving very large numbers of variables and vast numbers of measurements are also common in data mining. Thus, computational efficiency
and scalability are critically important, and issues of statistical consistency may be a secondary consideration. Furthermore, the current practice of data mining is often pattern-focused rather than model-focused, i.e., rather than building a coherent global model which includes all variables of interest, data mining algorithms (such as any of the many rule induction systems on the market) will produce sets of statements about local dependencies among variables (in rule form).

Table 1. Statisticians have developed a large infrastructure (theory) to support their methods and a language (probability calculus) to describe their approach to quantifying the uncertainty associated with drawing inferences from data. These methods enable one to describe relationships between variables for prediction, quantifying effects, or suggesting causal paths.

    Area of Statistics                Description of Activities
    experimental design & sampling    how to select cases if one has the liberty to choose
    exploratory data analysis         hypothesis generation rather than hypothesis testing
    statistical graphics              data visualization
    statistical modeling              regression and classification techniques
    statistical inference             estimation and prediction techniques

In this overall context, current data mining practice is very much driven by practical computational concerns. However, in focusing almost exclusively on computational issues, it is easy to forget that statistics is in fact a core component. The term "data mining" has long had negative connotations in the statistics literature (Selvin and Stuart, 1966; Chatfield, 1995). Data mining without proper consideration of the fundamental statistical nature of the inference problem is indeed to be avoided. However, a goal of this article is to convince the reader that modern statistics can offer significant constructive advice to the data miner, although many problems remain unsolved. Throughout the article we highlight some major themes of statistics research, focusing in particular on the practical lessons pertinent to data mining.

2. An Overview of Statistical Science

This Section briefly describes some of the central statistical ideas we think relevant to data mining. For a rigorous survey of statistics, the mathematically inclined reader should see, for example, Schervish (1995). For reasons of space we will ignore a number of interesting topics, including time series analysis and meta-analysis.

Probability Distributions. The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables (functions defined on the "events" to which a probability measure assigns values). Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming
a conditional probability measure from a measure on a sample space and some event of positive measure). Essential relations among random variables include independence, conditional independence, and various measures of dependence, of which the most famous is the correlation coefficient. The statistical literature also characterizes families of distributions by properties that are useful in identifying any particular member of the family from data, or by closure properties useful in model construction or inference, for example conjugate families, closed under conditionalization, and the multinormal family, closed under linear combination. A knowledge of the properties of distribution families can be invaluable in analyzing data and making appropriate inferences.

Estimation, Consistency, Uncertainty, Assumptions, Robustness, and Model Averaging. An estimator is a function from sample data to some estimand, such as the value of a parameter. When the data comprise a sample from a larger actual or potential collection governed by some probability distribution, the family of estimators corresponding to all possible samples from that collection also has a probability distribution. Classical statistics investigates such distributions of estimators in order to establish basic properties such as reliability and uncertainty. A variety of resampling and simulation techniques also exist for assessing estimator uncertainty (Efron and Tibshirani, 1993).

Estimation almost always requires some set of assumptions. Such assumptions are typically false, but often useful. If a model (which we can think of as a set of assumptions) is incorrect, estimates based on it can be expected to be incorrect as well. One of the aims of statistical research is to find ways to weaken the assumptions necessary for good estimation. "Robust Statistics" (Huber, 1981) looks for estimators that work satisfactorily for larger families of distributions and have small errors when assumptions are violated.

Bayesian estimation emphasizes that alternative models and their competing assumptions are often plausible. Rather than making an estimate based on a single model, several models can be considered and an estimate obtained as the weighted average of the estimates given by the individual models (Madigan and Raftery, 1994). In fact, such Bayesian model averaging is bound to improve predictive performance, on average. Since the models obtained in data mining are usually the results of some automated search procedure, accounting for the potential errors associated with the search itself is crucial. In practice, this often requires a Monte Carlo analysis. Our impression is that the error rates of search procedures proposed and used in the data mining and in the statistical literature are far too rarely estimated in this way. (See Spirtes et al., 1993 for Monte Carlo test design for search procedures.)

Hypothesis Testing. Since statistical tests are widely used, some of their important limitations should be noted. Viewed as a one-sided estimation method, hypothesis testing is inconsistent unless the alpha level of the testing rule is decreased appropriately as the sample size increases. Generally, an α level test of one hypothesis and an α level test of another hypothesis do not jointly provide an α level test of the conjunction of the two hypotheses. In special cases, rules (sometimes called contrasts) exist for simultaneously testing several hypotheses (Miller, 1981). An important corollary for data mining is that the α level of a test has nothing directly to do with the probability of error in a search procedure that involves testing a series of hypotheses. If, for example, for each pair of a set of variables, hypotheses of independence are tested at α = 0.05, then 0.05 is not the probability of erroneously finding some dependent set of variables when in fact all pairs are independent. Thus, in data mining procedures that use a sequence of hypothesis tests, the alpha level of the tests cannot generally be taken as an estimate of any error probability related to the outcome of the search.

Data miners should note that while error probabilities of tests have something to do with the truth of hypotheses, the connection is somewhat tenuous (see Section 5.3). Hypotheses that are excellent approximations may be rejected in large samples; tests of linear models, for example, typically reject them in very large samples no matter how closely they seem to fit the data.

Model Scoring. The evidence provided by data should lead us to prefer some models or hypotheses to others, and to be indifferent between still other models. A score is any rule that maps models and data to numbers whose numerical ordering corresponds to a preference ordering over the space of models, given the data. For the reasons just considered, scoring rules are often an attractive alternative to tests. Typical rules assign models a value determined by the likelihood function associated with the model, the number of parameters, or dimension, of the model, and the data. Popular rules include the Akaike Information Criterion (Akaike, 1974), Bayes Information Criterion (Raftery, 1995), and Minimum Description Length (Rissanen, 1978). Given a prior probability distribution over models, the posterior probability on the data is itself a scoring function, arguably a privileged one. The Bayes Information Criterion approximates posterior probabilities in large samples.

There is a notion of consistency appropriate to scoring rules; in the large sample limit, almost surely the true model should be among those receiving maximal scores. AIC scores are not, in general, consistent (Schwartz, 1978). There are also uncertainties associated with scores, since two different samples of the same size from the same distribution may yield not only different numerical values for the same model, but even different orderings of models.

For obvious combinatorial reasons, it is often impossible when searching a large model space to calculate scores for all models; it is, however, often feasible to describe and calculate scores for a few equivalence classes of models receiving the highest scores.

In some contexts, inferences made using Bayesian scores can differ a great deal from inferences made with hypothesis tests. Raftery (1995) gives examples of models that account for almost all of the variance of an outcome of interest, and have very high Bayesian scores, but are overwhelmingly rejected by statistical tests.

Markov Chain Monte Carlo. Historically, insurmountable computational difficulties forced data analysts to eschew exact analysis of elaborate hierarchical Bayesian models and complex likelihood calculations. Recent dramatic advances in Monte Carlo methods have, however, liberated analysts from some of these constraints. One particular class of simulation methods, dubbed Markov chain Monte Carlo, originally developed in statistical mechanics, has revolutionized the practice of Bayesian statistics. Smith and Roberts (1993) provide an accessible overview from the Bayesian perspective; Gilks et al. (1996) provide a practical introduction addressing both Bayesian and non-Bayesian perspectives.

Simulation methods may become unacceptably slow when faced with massive data sets. In such cases, recent advances in analytic approximations prove useful - see for example Kooperberg et al. (1996), Kass and Raftery (1995), and Geiger et al. (1996).

Generalized model classes. A major achievement of statistical methodological research has been the development of very general and flexible model classes. Generalized Linear Models, for instance, embrace many classical linear models, and unify estimation and testing theory for such models (McCullagh and Nelder, 1989). Generalized Additive Models show similar potential (Hastie and Tibshirani, 1990). Graphical models (Lauritzen, 1996) represent probabilistic and statistical models with planar graphs, where the vertices represent (possibly latent) random variables and the edges represent stochastic dependences. This provides a powerful language for describing models and the graphs themselves make modeling assumptions explicit. Graphical models provide important bridges between the vast statistical literature on multivariate analysis and such fields as artificial intelligence, causal analysis, and data mining.

Rational Decision Making and Planning. The theory of rational choice assumes the decision maker has available a definite set of alternative actions, knowledge of a definite set of possible alternative states of the world, knowledge of the payoffs or utilities of the outcomes of each possible action in each possible state of the world, and knowledge of the probabilities of various possible states of the world. Given all of this information, a decision rule specifies which of the alternative actions ought to be taken. A large literature in statistics and economics addresses alternative decision rules - maximizing expected utility, minimizing maximum loss, etc. Typically, rational decision making and planning are the goals of data mining, and rather than providing techniques or methods for data mining, the theory of rational choice poses norms for the use of information obtained from a database.

The very framework of rational decision making requires probabilities and a knowledge of the effects alternative actions will have. To know the outcomes of actions is to know something of cause and effect relations, and extracting such causal information is often one of the principal goals of data mining and of statistical inference more generally.

Inference to Causes. Understanding causation is the hidden force behind the historical development of statistics. From the beginning of the subject, in the work of Bernoulli and Laplace, the absence of causal connection between two variables has been taken to imply their probabilistic independence (see Stigler, 1986), and the same idea is fundamental in the theory of experimental design (Fisher, 1958). Early in this century, Wright (1921) introduced directed graphs to represent causal hypotheses (with vertices as random variables and edges representing direct influences), and they have become common representations of causal hypotheses in the social sciences, biology, computer science and engineering.

Kiiveri and Speed (1982) combined directed graphs with a generalized connection between independence and absence of causal connection in what they called the Markov condition: provided Y is not an effect of X, X and Y are conditionally independent given the direct causes of X. They showed that much of the linear modeling literature tacitly assumed the Markov condition; the same is true for causal models of categorical data, and virtually all causal models of systems without feedback. Under additional assumptions, conditional independence therefore provides information about causal dependence. The most common, and most thoroughly investigated, additional assumption is that all conditional independencies are due to the Markov condition applied to the directed graph describing the actual causal processes generating the data, a requirement that has been given many names, including "faithfulness." Directed graphs with associated probability distributions satisfying the Markov condition are called by different names in different literatures: Bayes nets, belief nets, structural equation models, path models, etc. Nonetheless, causal inferences from uncontrolled convenience samples are liable to many sources of error and data miners should proceed with extreme caution.

Prediction. Sometimes one is interested in using a sample, or a database, to predict properties of a new sample, where it is assumed that the two samples are obtained from the same probability distribution. As with estimation, in prediction we are interested both in reliability and in uncertainty, often measured by the variance of the predictor.

Prediction methods for this sort of problem always assume some structure in the probability distribution. In data mining contexts, structure is typically either supplied by human experts, or inferred from the database automatically. Regression, for example, assumes a particular functional form relating variables. Structure can also be specified in terms of constraints, such as independence, conditional independence, higher order conditions on correlations, etc. On average, a prediction method that guarantees satisfaction of the constraints realized in the probability distribution - and no others - will be more accurate and have smaller variance than one that does not. Finding the appropriate constraints to satisfy is the most difficult issue in this sort of prediction. As with estimation, prediction can be improved by model averaging, provided the prior probabilities of the alternative assumptions imposed by the model are available.

3. Is Data Mining "Statistical Deja Vu" (All Over Again)?

In the mid 1960's, the statistics community referred to unfettered exploration of data as "fishing" or "data dredging" (Selvin and Stuart, 1966). The community, enamored by elegant (analytical) mathematical solutions to inferential problems, argued that since their theories were invalidated by "looking at the data", it was wrong to do so. The major proponent of the exploratory data analysis (EDA) school, J.W. Tukey, countered this argument with the obvious retort that statisticians were putting the cart before the horse. He argued that statistical theory should adapt to the scientific method rather than the other way around. Thirty years hence, the statistical community has largely adopted Tukey's perspective, and has made considerable progress in serving both masters, namely acknowledging that model search is a critical and unavoidable step in the modeling process, and devising formal methods to account for search in their inferential procedures.

Three themes of modern statistics that are of fundamental importance to data miners are: clarity about goals, appropriate reliability assessment, and adequate accounting for sources of uncertainty.

Clarity about goals. Sometimes data analysis aims to find a convenient, easily computable representation of how the data are distributed in a particular database. In other cases, data analysis aims to predict features of new cases, or new samples, drawn from outside the database used to develop a predictive model (this is particularly challenging in dynamic situations). In yet other cases, data analysis aims to provide a basis for policy. That is, the analysis is intended to yield insight into causal mechanisms that are used to form predictions about new samples that might be produced by interventions or actions that did not apply in the original database from which the model (or models) were developed. Each of these goals presents distinct inference problems, with distinct hazards. Confusing or equivocating over the aim invites the use of inappropriate methods and may result in unfortunate predictions and inferences.

As an example, consider the observational study reported by Chasnoff et al. (1989) comparing babies born to cocaine-using mothers with babies born to non-cocaine-using mothers. The authors concluded: "For women who become pregnant and are users of cocaine, intervention in early pregnancy with cessation of cocaine use will result in improved obstetric outcome". Fortunately, there exists independent evidence to support this causal claim. However, much of Chasnoff et al.'s paper focuses on a statistical analysis (analysis of variance) that has little, if anything, to do with the causal question of interest.

Hand (1994) provides a series of examples illustrating how easy it is to give the right answers to the wrong question. For example, he discusses the problem of analyzing clinical trial data where patients drop out due to adverse side-effects of a particular treatment (Diggle and Kenward, 1994). In this case, the important issue is which population one is interested in modelling: the population at large, or the population who remain within the trial? This problem arises in more general settings than in clinical trials, e.g., non-respondents (refusers) in survey data. In such situations it is important to be explicit about the questions one is trying to answer.

In this general context an important issue (discussed at length in Hand (1994)) is that of formulating statistical strategy, i.e., how does one structure a data analysis problem so that the right question can be asked? Hand's conclusion is that this is largely an "art" because it is less well formalized than the mathematical and computational details of applying a particular technique. This "art" is gained through experience (at present at least) rather than taught. The implication for
data mining is that human judgement is essential for many non-trivial inference problems. Thus, automation can at best only partially guide the data analysis process. Properly defining the goals of an analysis remains a human-centred, and often difficult, process.

Use of methods that are reliable means to the goal, under assumptions the user (and consumer) understands and finds plausible in the context. Statistical theory applies several meanings to the word "Reliability", many of which also apply to model search. For example, under what conditions does a search procedure provide correct information, of the kind sought, with probability one as the sample size increases without bound? Answers to such questions are often elusive and can require sophisticated mathematical analysis. Where answers are available, the data analyst should pay careful attention to the reasonableness of underlying assumptions. Another key data mining question is this: what are the probabilities of various kinds of errors that result from using a method in finite samples? The answers to this question will typically vary with the kinds of errors considered, with the sample size, and with the frequency of occurrence of the various kinds of targets or signals whose description is the goal of inference. These questions are often best addressed by Monte Carlo methods, although in some cases analytic results may be available.

A sense of the uncertainties of models and predictions. Quite often background knowledge and even the best methods of search and statistical assessment should leave the investigator with a range of uncertainties about the correct model, or the correct prediction. The data analyst must quantify these uncertainties so that subsequent decisions can be appropriately hedged. Section 4 provides a compelling example.

Another example involves a current debate in the atmospheric sciences. The question is whether or not specific recurrent pressure patterns can be clearly identified from daily geopotential height records which have been compiled in the Northern Hemisphere since 1948. The existence of well-defined recurrent patterns (or "regimes") has significant implications for models of upper atmosphere low-frequency variability beyond the time-scale of daily weather disturbances (and, thus, models of the earth's climate over large time-scales). Several studies have used a variety of clustering algorithms to detect inhomogeneities ("bumps") in low-dimensional projections of the gridded data (see Michelangeli et al. (1995) and others referred to therein). While this work has attempted to validate the cluster models via resampling techniques, it is difficult to infer from the multiple studies whether regimes truly exist, and, if they do, where precisely they are located. It seems likely that 48 winters worth of data is not enough to identify regimes to any degree of certainty and that there is a fundamental uncertainty (given the current data) about the underlying mechanisms at work. All is not lost, however, since it is also clear that one could quantify model uncertainty in this context, and theorize accordingly (see Section 4).

In what follows we will elaborate on these points and offer a perspective on some of the hazards of data mining.
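
The finite-sample error question raised in this section (how often does a search that runs many tests at a nominal α = 0.05 report something that is not there?) can be explored by simulation. The following is a minimal sketch, not a procedure taken from the article: it assumes mutually independent Gaussian variables, a Fisher z-test of zero correlation for every pair, and illustrative settings (6 variables, 50 observations, 400 trials). All function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rejects_independence(x, y, z_crit=1.96):
    """Two-sided test of zero correlation via the Fisher z-transform.

    z_crit = 1.96 corresponds to a nominal alpha of about 0.05.
    """
    n = len(x)
    z = math.atanh(pearson_r(x, y)) * math.sqrt(n - 3)
    return abs(z) > z_crit


def search_finds_dependence(n_vars, n_obs, rng):
    """Generate mutually independent variables, then test every pair.

    Returns True if the search (wrongly) reports at least one dependent pair.
    """
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    return any(
        rejects_independence(data[i], data[j])
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
    )


def estimate_search_error(n_vars=6, n_obs=50, n_trials=400, seed=0):
    """Monte Carlo estimate of the probability that the search errs."""
    rng = random.Random(seed)
    hits = sum(search_finds_dependence(n_vars, n_obs, rng) for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    rate = estimate_search_error()
    print(f"nominal per-test alpha: 0.05, estimated search error rate: {rate:.2f}")
```

With 6 variables the search runs 15 tests; if those tests were independent, the chance of at least one false rejection would be 1 - 0.95^15, roughly 0.54, so the estimate should land far above the per-test level of 0.05.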
4. Characterizing Uncertainty

The statistical approach contends that reporting a single number for a parameter estimate or a prediction is almost always inadequate. Quantification of the uncertainty associated with a single number, while often challenging, is critical for subsequent decision making. As an example, Draper (1995) considered the case of the 1980 Energy Modeling Forum (EMF) at Stanford University where a 43-person working group of economists and energy experts convened to forecast world oil prices from 1981 to 2020. The group generated predictions based on a number of econometric models and scenarios, embodying a variety of assumptions about supply, demand, and growth rates of relevant quantities. A plausible reference scenario and model was selected as representative, but the summary report (EMF, 1982) cautioned against interpreting point predictions based on the reference scenario as "[the working group's] 'forecast' of the oil future, as there are too many unknowns to accept any projection as a forecast." The summary report did conclude, however, that most of the uncertainty about future oil prices "concerns not whether these prices will rise ... but how rapidly they will rise."

In 1980, the average spot price of crude oil was around $32 per barrel. Despite the warning about the potential uncertainty associated with the point estimates, governments and private companies around the world focused on the last sentence in the quotation above, and proceeded to invest an estimated $500 billion, on the basis that the price would probably be close to $40 per barrel in the mid-eighties. In fact, the actual 1986 world average spot price of oil was about $13 per barrel.

Using only the information available to the EMF in 1980, along with thoughtful but elementary statistical methods, Draper (1995) shows that a 90% predictive interval for the 1986 price would have ranged from about $20 to over $90. Note that this interval does not actually contain the actual 1986 price - insightful statistical analysis does not provide clairvoyance. However, decision makers would (and should) have proceeded more cautiously in 1980, had they understood the full extent of their uncertainty.

Correctly accounting for the different sources of uncertainty presents significant challenges. Until recently, the statistical literature focused primarily on quantifying parametric and predictive uncertainty in the context of a particular model. Two distinct approaches are in common use. "Frequentist" statisticians focus on the randomness in sampled data and summarize the induced randomness in parameters and predictions by so-called sampling distributions. "Bayesian" statisticians instead treat the data as fixed, and use Bayes theorem to turn prior opinion about quantities of interest (always expressed by a probability distribution) into a so-called posterior distribution that embraces all the available information. The fierce conflicts between previous generations of frequentists and Bayesians have largely given way in recent years to a more pragmatic approach; most statisticians will base their choice of tool on scientific appropriateness and convenience.
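
To make the two approaches concrete, here is a small sketch for the simplest possible case, an unknown proportion. It is an illustration under stated assumptions rather than anything from the article: the frequentist summary is a Wald interval built from the estimated standard error of the sampling distribution, the Bayesian summary is a conjugate Beta-binomial update with a uniform Beta(1, 1) prior, and the data (12 successes in 40 trials) are hypothetical.

```python
import math


def frequentist_interval(successes, trials, z=1.96):
    """Wald interval: estimate +/- z * standard error of the sampling distribution."""
    p_hat = successes / trials
    se = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat - z * se, p_hat + z * se


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Bayes theorem in conjugate form: Beta prior + binomial data -> Beta posterior."""
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


if __name__ == "__main__":
    successes, trials = 12, 40  # hypothetical data
    lo, hi = frequentist_interval(successes, trials)
    a, b = beta_posterior(successes, trials)
    mean, sd = beta_mean_sd(a, b)
    print(f"frequentist 95% interval: ({lo:.3f}, {hi:.3f})")
    print(f"posterior Beta({a:.0f}, {b:.0f}): mean {mean:.3f}, sd {sd:.3f}")
```

Both summaries report a spread as well as a point value, which is the section's central point; the two differ in what is treated as random (the data versus the parameter).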
In any event, recent research has led to increased awareness that within-model uncertainty (as discussed in the previous paragraph) may often, in practice, be dominated by between-model uncertainty (Chatfield, 1995, Draper, 1995, Madigan and York, 1995). It is common practice nowadays for statisticians and data miners to use computationally intensive model selection algorithms to seek out a single optimal model from an enormous class of potential models. The problem is that several different models may be close to optimal, yet lead to different inferences. Intuitively, ambiguity over the model should dilute information about effect parameters and predictions, since "part of the evidence is spent to specify the model" (Leamer, 1978, p. 91). Promising techniques for properly accounting for this source of uncertainty include Bayesian model averaging (Draper, 1995) and resampling methods (Breiman, 1996). The main point here is that data miners need to think carefully about model assessment and look beyond commonly used goodness-of-fit measures such as mean square error.

5. What can go wrong, will go wrong

Data mining poses difficult and fundamental challenges to the theory and practice of statistics. While statistics does not have all the answers for the data miner, it does provide a useful and practical framework within which to search for solutions. In this section, we describe some lessons that statisticians have learned when theory meets data.

5.1. Data Can Lie

Data mining applications typically rely on observational (as opposed to experimental) data. Interpreting observed associations in such data is challenging; sensible inferences require careful analysis, and detailed consideration of the underlying factors. Here we offer a detailed example to support this position.

Wen, Hernandez, and Naylor (1995; WHN hereafter) analyzed administrative records of all Ontario general hospital separations (discharges, transfers, or in-hospital deaths) from 1981 to 1990, focusing specifically on patients who had received a primary open cholecystectomy. Some of these patients had in addition received an incidental (i.e. discretionary) appendectomy during the cholecystectomy procedure. Table 2 displays the data on one outcome, namely in-hospital deaths. A chi-square test comparing this outcome for the two groups of patients shows a "statistically significant" difference. This "finding" is surprising since long-term prevention of appendicitis is the sole rationale for the incidental appendectomy procedure - no short-term improvement in outcomes is expected. This "finding" might lead a naive hospital policy maker to conclude that all cholecystectomy patients should have an incidental appendectomy to improve their chances of a good outcome! Clearly something is amiss - how could incidental appendectomy improve outcomes?
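
The chi-square comparison mentioned above can be reproduced from the counts reported in Table 2. The sketch below uses the plain Pearson statistic for a 2x2 table without continuity correction; this is an assumption, since the text does not say which variant WHN computed. The function and variable names are illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
        [[a, b],
         [c, d]].
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Counts from Table 2; columns are (with, without) incidental appendectomy.
deaths = (21, 1394)
survivors = (7825, 190205)

stat = chi_square_2x2(deaths[0], deaths[1], survivors[0], survivors[1])
CRITICAL_05_1DF = 3.841  # upper 5% point of chi-square with 1 degree of freedom

rate_with = deaths[0] / (deaths[0] + survivors[0])
rate_without = deaths[1] / (deaths[1] + survivors[1])

print(f"death rate with appendectomy:    {rate_with:.4%}")
print(f"death rate without appendectomy: {rate_without:.4%}")
print(f"chi-square = {stat:.1f}, significant at 0.05: {stat > CRITICAL_05_1DF}")
```

The statistic is far beyond the 5% critical value, which is exactly why the result is tempting: the test says the two groups differ, but it says nothing about why.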
Table 2. In-hospital Survival of Patients Undergoing Primary Open Cholecystectomy With and Without Incidental Appendectomy.

                                      With             Without
                                      Appendectomy     Appendectomy
    In-hospital deaths, No. (%)       21 (0.27%)       1,394 (0.73%)
    In-hospital survivors, No. (%)    7,825 (99.73%)   190,205 (99.27%)

WHN did separately consider a subgroup of low-risk patients. For these patients (using ten different definitions of "low-risk"), incidental appendectomy indeed resulted in poorer outcomes. Paradoxically, it could even be the case that appendectomy adversely affects outcomes for both high-risk patients and low-risk patients, but appears to positively affect outcomes when the low-risk and high-risk patients are combined. WHN do not provide enough data to check whether this so-called "Simpson's Paradox" (Simpson, 1951) occurred in this example. However, Table 3 presents data that are plausible and consistent with WHN's data.

Table 3. Fictitious data consistent with the Wen et al. (1995) data.
[Table body not recoverable. Rows: Death, Survival. Columns: Low-Risk and High-Risk, each With and Without Appendectomy.]

Table 4 displays the corresponding proportions of in-hospital death for these fictitious data. Clearly the risk and death categories are directly correlated. In addition, appendectomies are more likely to be carried out on low-risk patients than on high-risk ones. Thus, if we did not know the risk category (age) of a patient, knowing that they had an appendectomy allows us to infer that they are more likely to be lower risk (younger). However, this does not in any way imply that having an appendectomy will lower one's risk. Nonetheless, when risk is omitted from the table, exactly such a fallacious conclusion appears justified from the data.

Table 4. Proportion of in-hospital deaths cross classified by incidental appendectomy and patient risk grouping for the fictitious data of Table 3.
[Table body not recoverable. Rows: Low-Risk, High-Risk, Combined. Columns: With Appendectomy, Without Appendectomy.]

Returning to the original data, WHN provide a more sophisticated regression analysis, adjusting for many possible confounding variables (e.g. age, sex, admission status). They conclude that "there is absolutely no basis for any short-term improvement in outcomes" due to incidental appendectomy. This careful analysis agrees with common sense in this case. In general, analyses of observational data demand such care, and come with no guarantees. Other characteristics of available data that connive to spoil causal inferences include:

- Associations in the database may be due in whole or part to unrecorded common causes (latent variables).
- The population under study may be a mixture of distinct causal systems, resulting in statistical associations that are due to the mixing rather than to any direct influence of variables on one another or any substantive common cause.
- Missing values of variables for some units may result in misleading associations among the recorded values.
- Membership in the database may be influenced by two or more factors under study, which will create a "spurious" statistical association between those variables.
- Many models with quite distinct causal implications may "fit" the data equally or almost equally well.
- The frequency distributions in samples may not be well approximated by the most familiar families of probability distributions.
- The recorded values of variables may be the result of "feedback" mechanisms which are not well represented by simple "non-recursive" statistical models.

There is research that addresses aspects of these problems, but there are few statistical procedures yet available that can be used "off the shelf" - the way randomization is used in experimental design - to reduce these risks. Standard techniques such as multiple regression and logistic regression may work in many cases, such as in the appendectomy example, but they are not always adequate guards against these hazards. Indeed, controlling for possibly confounding variables with multiple regression can in some cases produce inferior estimates of effect sizes. Procedures recently developed in the artificial intelligence and statistics literature (Spirtes et al., 1993) address some of the problems associated with latent variables and mixing, but so far only for two families of probability distributions, the normal and multinomial.
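
The reversal described in this section (Simpson's Paradox) is easy to verify numerically. The sketch below uses hypothetical counts chosen only to exhibit the effect; they are not the article's Table 3 values and are not derived from WHN's data. In each risk group the death rate is higher with appendectomy, yet the pooled rate is lower, because appendectomy patients are concentrated in the low-risk group.

```python
# Hypothetical (deaths, patients) counts per risk group and treatment.
counts = {
    "low-risk": {"with": (10, 9000), "without": (10, 10000)},
    "high-risk": {"with": (50, 1000), "without": (4000, 90000)},
}


def death_rate(deaths, patients):
    """Proportion of patients who died."""
    return deaths / patients


def combined_rate(treatment):
    """Death rate for one treatment after pooling the risk groups."""
    deaths = sum(counts[group][treatment][0] for group in counts)
    patients = sum(counts[group][treatment][1] for group in counts)
    return death_rate(deaths, patients)


if __name__ == "__main__":
    for group, by_treatment in counts.items():
        with_rate = death_rate(*by_treatment["with"])
        without_rate = death_rate(*by_treatment["without"])
        print(f"{group:9s}  with: {with_rate:.4f}  without: {without_rate:.4f}")
    print(f"combined   with: {combined_rate('with'):.4f}  "
          f"without: {combined_rate('without'):.4f}")
```

Stratifying by the confounder (risk group) recovers the within-group comparison; pooling hides it. This is the simplest form of the adjustment that regression on confounding variables attempts in the general case.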
13 institutionsthatgiverisetodata,canbeuncooperative.insuchcases,inferences 5.2.Sometimesit'snotwhat'sinthedatathatmatters Classicalstatisticalmethodsstartwitharandomsample,yetinpractice,dataorthe STATISTICALTHEMESANDLESSONSFORDATAMINING thatignorehowthedatawere\selected"canleadtodistortedconclusions. Consider,forexample,theChallengerSpaceShuttleaccident.TheRogersCommissionconcludedthatanO-ringfailureinthesolidrocketboosterledtothe structuralbreakupandlossofthechallenger.inreconstructingtheeventsleadinguptothedecisiontolaunch,thecommissionnotedamistakeintheanalysis ofthermal-distressdatawherebyightswithno(i.e.zero)incidentsofo-ring thetemperatureeect.thistruncationofthedataledtotheconclusionthat temperaturesinceitwasfeltthattheydidnotcontributeanyinformationabout norelationshipbetweeno-ringdamageandtemperatureexisted,andultimately, damagewereexcludedfromcriticalplotsofo-ringdamageandambientlaunch thedecisiontolaunch.dalaletal.(1989)throwstatisticallightonthematter ariskyproposition. andquantifyingtherisk(ofcatastrophicfailure)at31of.hadtheoriginalanalysis bydemonstratingthestrongcorrelationbetweeno-ringdamageandtemperature, usedallofthedata,itwouldhaveindicatedthatthedecisiontolaunchwasatbest couldeasilyhavebeenavoided.inmostproblems,selectionbiasisaninherent standardinferences.thelessonstobelearnedhereare thatanytechniqueusedtoanalyzetruncateddataasifitwasarandomsample, characteristicoftheavailabledataandmethodsofanalysisneedtodealwithit.it isourexperiencethateverydatasethasthepotentialforselectionbiastoinvalidate Intheabovecase,theselectionbiasproblemwasoneof\humanerror"and 37 thedatathemselvesareseldomcapabletoalerttheanalystthataselection canbefooled,regardlessofhowthetruncationwasinduced; mechanismisoperating informationexternaltothedataathandiscritical dataminersastrayinmostapplications. 
5.3. The Perversity of the Pervasive P-value

P-values and associated significance (or hypothesis) tests play a central role in classical (frequentist) statistics. It seems natural, therefore, that data miners should make widespread use of P-values. However, indiscriminate use of P-values can lead data miners astray in most applications.

The standard significance test proceeds as follows. Consider two competing hypotheses about the world: the Null Hypothesis, commonly denoted by H0, and the Alternative Hypothesis, commonly denoted by HA. Typically H0 is "nested" within HA; for example, H0 might state that a certain combination of parameters is equal to zero, while HA might place no restriction on the combination. A test statistic, T, is selected and calculated from the data at hand. The idea is that T(Data) should measure the evidence in the data against H0. The analyst rejects H0 in favor of HA if T(Data) is more extreme than would be expected if H0 were true. Specifically, the analyst computes the P-value, that is, the probability of T being greater than or equal to T(Data), given that H0 is true. The analyst rejects H0 if the P-value is less than a preset significance level, α.

There are three primary difficulties associated with this approach:

1. The standard advice that statistics educators provide, and scientific journals rigidly adhere to, is to choose α to be 0.05 or 0.01, regardless of sample size. These particular α-levels arose in Sir Ronald Fisher's study of relatively small agricultural experiments (on the order of 30-200 plots). Textbook advice (e.g., Neyman and Pearson, 1933) has emphasized the need to take account of the power of the test against HA when setting α, and somehow reduce α when the sample size is large. This crucial but vague advice has largely fallen on deaf ears.

2. Raftery (1995) points out that the whole hypothesis testing framework rests on the basic assumption that only two hypotheses are ever entertained. In practice, data miners will consider very large numbers of possible models. As a consequence, indiscriminate use of P-values with "standard" fixed α-levels can lead to undesirable outcomes such as selecting a model with parameters that are highly significantly different from zero, even when the training data are pure noise (Freedman, 1983). This point is of fundamental importance for data miners.

3. The P-value is the probability associated with the event that the test statistic was as extreme as the value observed, or more so. However, the event that actually happened was that a specific value of the test statistic was observed. Consequently, the relationship between the P-value and the veracity of H0 is subtle at best. Jeffreys (1980) puts it this way: "I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened."

Bayes Factors are the Bayesian analogue of the frequentist P-values and admit to a more direct interpretation: the Bayesian analyst computes the posterior probability that a hypothesis is correct. With fixed α-levels, the frequentist and the Bayesian will arrive at very different conclusions. For example, Berger and Sellke (1987) show that data that yield a P-value of 0.05 when testing a normal mean result in a posterior probability for H0 that is at least 0.30 for any "objective" prior distribution. One way to reconcile the two positions is to view Bayes factors as a method for selecting appropriate α-levels; see Raftery (1995).
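The second difficulty, searching over many models with a fixed α, can be illustrated with a small calculation. The sketch below is an idealization rather than Freedman's (1983) regression experiment: it assumes that every null hypothesis is true and that the tests are independent, so each P-value is uniform on (0, 1). The function names and the default of 2000 replications are assumptions made for this example. Real model searches involve dependent tests, so the numbers are indicative only.

```python
import random


def familywise_error(alpha, n_tests):
    """P(at least one P-value < alpha) for n_tests independent true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulated_familywise_error(alpha, n_tests, n_reps=2000, seed=0):
    """Monte Carlo check, using the fact that a P-value from a continuous
    test statistic is Uniform(0, 1) when the null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        smallest = min(rng.random() for _ in range(n_tests))
        if smallest < alpha:
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    for m in (1, 20, 50):
        print(m, round(familywise_error(0.05, m), 3),
              simulated_familywise_error(0.05, m))
```

Under these assumptions, screening 50 pure-noise candidates at α = 0.05 turns up at least one "significant" result about 92% of the time, which is why the error probability of a single test says little about the error probability of a search.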
5.4. Intervention and Prediction

A specific class of prediction problems involves interventions that alter the probability distribution of the problem, as in predicting the values (or probabilities) of variables under a change in manufacturing procedures, or changes in economic or medical treatment policies. Accurate predictions of this kind require some knowledge of the relevant causal structure, and are in general quite different from prediction without intervention, although the usual caveats about uncertainty and model averaging apply. For graphical representations of causal hypotheses according to the Markov condition, general algorithms for predicting the outcomes of interventions from complete or incomplete causal models were developed in Spirtes et al. (1993). Some of these procedures have been extended and made into a more convenient calculus by Pearl (1995). A related theory without graphical models was developed earlier by Rubin (1974) and others, and by Robbins (1986).

Consider the following example. Herbert Needleman's famous studies of the correlation of lead deposits in children's teeth with their IQs resulted, eventually, in removal of tetraethyl lead from gasoline in the United States. One data set Needleman examined included more than 200 subjects, and measured a large number of covariates. Needleman, Geiger, and Frank (1985) re-analyzed the data using backwards step-wise regression of verbal IQ on these variables and obtained six significant regressors, including lead. Klepper (1988) reanalyzed the data assuming that all of the variables were measured with error. Their model assumes that each measured number is a linear combination of the true value and an error, and that the parameters of interest are not the regression coefficients but rather the coefficients relating the unmeasured "true value" variables to the unmeasured true value of verbal IQ. These coefficients are in fact indeterminate (in econometric terminology, "unidentifiable"). An interval estimate of the coefficients that is strictly positive or negative for each coefficient can be made, however, if the amount of measurement error can be bounded with prior knowledge by an amount that varies from case to case. Klepper found that the bound required to ensure the existence of a strictly negative interval estimate for the lead-IQ coefficient was much too strict to be credible; thus he concluded that the case against lead was not nearly as strong as Needleman's analysis suggested.

Allowing the possibility of latent variables, Scheines (1996) reanalyzed the correlations (using TETRAD methodology) and concluded that three of the six regressors could have no influence on IQ. The regression included the three extra variables only because the partial regression coefficient is estimated by conditioning on all other regressors, which is just the right thing to do for linear prediction, but the wrong thing to do for causal inference using the Markov condition. Using the Klepper model, but without the three irrelevant variables, and assigning to all of the parameters a normal prior probability with mean zero and a substantial variance, Scheines then used Markov chain Monte Carlo to compute a posterior probability distribution for the lead-IQ parameter. The probability is very high that lead exposure reduces verbal IQ.
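The gap between prediction by conditioning and prediction under intervention can be seen in a very small linear model. The sketch below assumes a hypothetical structure that is not taken from the lead-IQ analyses: an unrecorded variable U influences both X and Y, with X = U + e_x and Y = b*X + c*U + e_y, and all three noise sources independent. The function and parameter names are assumptions made for this example.

```python
def observational_slope(b, c, var_u, var_ex):
    """Population slope from regressing Y on X (prediction by conditioning).

    Assumed model: X = U + e_x, Y = b*X + c*U + e_y, with U, e_x, e_y
    mutually independent and U unrecorded.
    """
    var_x = var_u + var_ex
    # Cov(X, Y) = b * Var(X) + c * Cov(X, U), and Cov(X, U) = Var(U).
    cov_xy = b * var_x + c * var_u
    return cov_xy / var_x


def interventional_effect(b):
    """Change in E[Y] per unit change in X when X is set from outside.

    Setting X by intervention cuts the U -> X link, so only b remains.
    """
    return b


if __name__ == "__main__":
    # X has no causal effect on Y (b = 0), yet the regression slope is 0.5.
    print(observational_slope(b=0.0, c=1.0, var_u=1.0, var_ex=1.0))
    print(interventional_effect(b=0.0))
```

In this toy model the regression slope is the right quantity for predicting Y from an observed X, but it overstates what would happen if X were changed by policy, which is the distinction the section draws.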
6. Symbiosis in Statistics

Easy access to data in digital form and the availability of software tools for statistical analyses have made it possible for the man in the street to set up shop and "do statistics." Nowhere is this more true today than in data mining. Based on the arguments in this article, let us assume that statistics is a necessary but not sufficient component in the practice of data mining. How well will the statistics profession serve the data mining community? Hoerl et al. (1993), for example, assert that: "We are our own best customers. Much of the work of the statistical profession is intended for other members of the statistical profession."

Despite this rather negative view of the relevance of statistical research, real-world applications do in fact drive much of what goes on in statistics, although often in a very indirect manner.

As an example consider the field of signal processing and communications, an area where a specialized set of relatively sophisticated statistical methods and models have been honed for practical use. The field was driven by fundamental advances from Claude Shannon and others in the 1940's. Like most of the other contributors to the field, Shannon was not a statistician, but possessed a deep understanding of probability theory and its applications. Through the 1950's to the present, due to rapid advances in both theory and hardware, the field has exploded and relevant statistical methods such as estimation and detection have found their way into everyday use in radio and network communications systems. Modern statistical communications reflects the symbiosis of statistical theory and engineering practice. Engineering researchers in the field are in effect "adjunct" statisticians: educated in probability theory and basic statistics, they have the tools to apply statistical methods to their problems of interest. Meanwhile statisticians continue to develop more general models and estimation techniques of potential applicability to new problems in communications.

This type of symbiosis can also be seen in other areas such as financial modelling, speech recognition (where for example hidden Markov models provide the state-of-the-art in the field), and most notably, epidemiology. Indeed, if statistics can claim to have revolutionized any field, it is in the biological and health sciences where the statistical approach to data analysis gave birth to the field of biostatistics.

The relevance of this symbiosis for data mining is that data miners need to understand statistical principles, and statisticians need to understand the nature of the important problems that the data mining community is attacking or being asked to attack. This has been a successful model in the past for fields where statistics has had considerable impact and has the potential to see ongoing success.
7. Conclusion

The statistical literature has a wealth of technical procedures and results to offer data mining, but it also has a few simple methodological morals: prove that estimation and search procedures used in data mining are consistent under conditions reasonably thought to apply in applications; use and reveal uncertainty, don't hide it; calibrate the errors of search, both for honesty and to take advantage of model averaging; don't confuse conditioning with intervening; and finally, don't take the error probabilities of hypothesis tests to be the error probabilities of search procedures.

References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Automat. Contr. AC-19:716-723.
Berger, J.O. and Sellke, T. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence (with discussion). Journal of the American Statistical Association 82:112-122.
Breiman, L. 1996. Bagging predictors. Machine Learning, to appear.
Chasnoff, I.J., Griffith, D.R., MacGregor, S., Dirkes, K., Burns, K.A. 1989. Temporal patterns of cocaine use in pregnancy: perinatal outcome. Journal of the American Medical Association 261(12):1741-4.
Chatfield, C. 1995. Model uncertainty, data mining, and statistical inference (with discussion). Journal of the Royal Statistical Society (Series A) 158:419-466.
Dalal, S.R., Fowlkes, E.B. and Hoadley, B. 1989. Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association 84:945-957.
Diggle, P. and Kenward, M.G. 1994. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49-93.
Draper, D., Gaver, D.P., Goel, P.K., Greenhouse, J.B., Hedges, L.V., Morris, C.N., Tucker, J., and Waternaux, C. 1993. Combining information: National Research Council Panel on Statistical Issues and Opportunities for Research in the Combination of Information. Washington: National Academy Press.
Draper, D. 1995. Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society (Series B) 57:45-97.
Efron, B. and Tibshirani, R.J. 1993. An Introduction to the Bootstrap. New York: Chapman and Hall.
Energy Modeling Forum 1982. World Oil: Summary report. EMF Report 6, Energy Modeling Forum, Stanford University, Stanford, CA.
Fisher, R.A. 1958. Statistical methods for research workers. New York: Hafner Pub. Co.
Freedman, D.A. 1983. A note on screening regression equations. The American Statistician 37:152-155.
Geiger, D., Heckerman, D., and Meek, C. 1996. Asymptotic model selection for directed networks with hidden variables. Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. 1996. Markov chain Monte Carlo in practice. London: Chapman and Hall.
Hand, D.J. 1994. Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society (Series A) 157:317-356.
Hastie, T.J. and Tibshirani, R. 1990. Generalized Additive Models. London: Chapman and Hall.
Hoerl, R.W., Hooper, J.H., Jacobs, P.J., Lucas, J.M. 1993. Skills for industrial statisticians to survive and prosper in the emerging quality environment. The American Statistician 47:280-292.
Huber, P.J. 1981. Robust Statistics. New York: Wiley.
Jeffreys, H. 1980. Some general points in probability theory. In: A. Zellner (Ed.), Bayesian Analysis in Econometrics and Statistics. Amsterdam: North-Holland, 451-454.
Kass, R.E. and Raftery, A.E. 1995. Bayes factors. Journal of the American Statistical Association 90:773-795.
Kiiveri, H. and Speed, T.P. 1982. Structural analysis of multivariate data: A review. Sociological Methodology 209-289.
Kooperberg, C., Bose, S., and Stone, C.J. 1996. Polychotomous regression. Journal of the American Statistical Association, to appear.
Lauritzen, S.L. 1996. Graphical Models. Oxford: Oxford University Press.
Leamer, E.E. 1978. Specification Searches. Ad Hoc Inference with Nonexperimental Data. New York: Wiley.
Madigan, D. and Raftery, A.E. 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89:1335-1346.
Madigan, D. and York, J. 1995. Bayesian graphical models for discrete data. International Statistical Review 63:215-232.
Matheson, J.E. and Winkler, R.L. 1976. Scoring rules for continuous probability distributions. Management Science 22:1087-1096.
McCullagh, P. and Nelder, J.A. 1989. Generalized Linear Models. London: Chapman and Hall.
Michelangeli, P.A., Vautard, R., and Legras, B. 1995. Weather regimes: recurrence and quasi-stationarity. Journal of the Atmospheric Sciences 52(8):1237-56.
Miller, R.G. Jr. 1981. Simultaneous statistical inference (Second Edition). New York: Springer-Verlag.
Neyman, J. and Pearson, E.S. 1933. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society (Series A) 231:289-337.
Raftery, A.E. 1995. Bayesian model selection in social research (with discussion). In Sociological Methodology (ed. P.V. Marsden), Oxford, U.K.: Blackwells, 111-196.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14:465-471.
Schervish, M.J. 1995. Theory of Statistics. New York: Springer-Verlag.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6:461-464.
Selvin, H. and Stuart, A. 1966. Data dredging procedures in survey analysis. The American Statistician 20(3):20-23.
Simpson, E.H. 1951. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society (Series B) 13:238-241.
Smith, A.F.M. and Roberts, G. 1993. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society (Series B) 55:3-23.
Spirtes, P., Glymour, C., and Scheines, R. 1993. Causation, Prediction and Search, Springer Lecture Notes in Statistics. New York: Springer-Verlag.
Stigler, S.M. 1986. The history of statistics: The measurement of uncertainty before 1900. Harvard: Harvard University Press.
Wen, S.W., Hernandez, R., and Naylor, C.D. 1995. Pitfalls in nonrandomized studies: The case of incidental appendectomy with open cholecystectomy. Journal of the American Medical Association 274:1687-1691.
Wright, S. 1921. Correlation and causation. Journal of Agricultural Research 20:557-585.
