A statistical perspective on data mining

Jonathan Hosking, Edwin Pednault and Madhu Sudan
IBM T.J. Watson Research Center, Yorktown Heights, N.Y., U.S.A.

Abstract

Data mining can be regarded as a collection of methods for drawing inferences from data. The aims of data mining, and some of its methods, overlap with those of classical statistics. However, there are some philosophical and methodological differences. We examine these differences, and we describe three approaches to machine learning that have developed largely independently: classical statistics, Vapnik's statistical learning theory, and computational learning theory. Comparing these approaches, we conclude that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them.

Keywords: classification, frequentist inference, PAC learning, statistical learning theory.

1 Introduction: a statistician looks at data mining

The recent upsurge of interest in the field variously known as data mining, knowledge discovery or machine learning¹ has taken many statisticians by surprise. Data mining attacks such problems as obtaining efficient summaries of large amounts of data, identifying interesting structures and relationships within a data set, and using a set of previously observed data to construct predictors of future observations. Statisticians have well-established techniques for attacking all of these problems. Exploratory data analysis, a field particularly associated with J. W. Tukey [18], is a collection of methods for summarizing and identifying patterns in data. Many statistical models exist for explaining relationships in a data set or for making predictions: cluster analysis, discriminant analysis and nonparametric regression can be used in many data mining problems. It is therefore tempting for a statistician to regard data mining as no more than a branch of statistics.

Nonetheless, the problems and methods of data mining have some distinct features of their own. Data sets can be very much larger than is usual in statistics, running to hundreds of gigabytes or terabytes. Data analyses are on a correspondingly larger scale, often requiring days of computer time to fit a single model. There are differences of emphasis in the approach to modeling: compared with statistics, data mining pays less attention to the large-sample asymptotic properties of its inferences and more to the general philosophy of "learning", including consideration of the complexity of models and of the computations that they require. Some modeling techniques, such as rule-based methods, are difficult to fit into the classical statistical framework, and others, such as neural networks, have an extensive methodology and terminology that has developed largely independently of input from statisticians.

¹Unfortunately, "data mining" is a pejorative term to statisticians, who use it to describe the fitting of a statistical model that is unjustifiably elaborate for a given data set (e.g. [11]). "Machine learning" is probably better, though "learning" is a loaded term.
This paper is a brief introduction to some of the similarities and differences between statistics and data mining. In section 2 we observe some of the differences between the statistical and data-mining approaches to data analysis and modeling. In sections 3–5 we describe in more detail some approaches to machine learning that have arisen in three more-or-less disjoint academic communities: classical statistics, the statistical learning theory of V. Vapnik, and computational learning theory. Section 6 contains some comparisons and conclusions.

2 Statistics and data mining

Both statistics and data mining are concerned with drawing inferences from data. The aim of the inference may be understanding the patterns of correlation and causal links among the data values ("explanation"), or making predictions of future data values ("generalization"). Classical statistics has developed an approach, described further in section 3 below, that involves specifying a model for the probability distribution of the data and making inferences in the form of probability statements.

2.1 Features of data mining

Data-mining methods have in many cases been developed for problems that do not fit easily into the framework of classical statistics and have evolved in isolation from statistics. Even when applied to familiar statistical problems such as classification and regression, they retain some distinct features. We now mention some features of the data-mining approaches and their typical implementations.

Complex models. Some problems involve complex interactions between feature variables, with no simple relationships being apparent in the data. Character recognition is a good example; given a 16×16 array of pixels, it is difficult to formulate a comprehensible statistical model that can identify the character that corresponds to a given pattern of dots. Data-mining techniques such as neural networks and rule-based classifiers have the capacity to model complex relationships and should have better prospects of success in complex problems.

Large problems. By the standards of classical statistics, data mining often deals with very large data sets (10^4 to 10^7 examples). This is in some cases a consequence of the use of complex models, for which large amounts of data are needed to derive secure inferences. In consequence, issues of computational complexity and scalability of algorithms are often of great importance in data mining.

Many discrete variables. Data sets that contain a mixture of continuous and discrete-valued variables are common in practice. Most multivariate analysis methods in statistics are designed for continuous variables. Many data mining methods are more tolerant of discrete-valued variables. Indeed, some rule-based approaches use only discrete variables and require continuous variables to be discretized.

Wide use of cross-validation. Data-mining methods often seek to minimize a loss function expressed in terms of prediction error: for example, in classification problems the loss function might be the misclassification rate on a set of examples not used in the model-fitting procedure. Prediction error is often estimated by cross-validation, a technique known to statistics but used much more widely in data mining. Minimization of the prediction error estimated by cross-validation is a powerful technique that can be used in a nested fashion (the "wrapper method" [7]) to optimize several aspects of the model. These include various parameters that might otherwise be chosen arbitrarily (e.g., the amount of pruning of a decision tree, or the number of neighbors to use in a nearest-neighbor classifier) and the choice of which feature variables are relevant for classification and which can be eliminated from the model.

Few comparisons with simple statistical models. When data mining methods are used on problems to which classical statistical methods are also applicable, direct comparison of the approaches is possible but seems rarely to be performed. Some comparisons have found that the greater complexity of data mining methods is not always justifiable: Ripley [16] cites several examples. Statistical methods are particularly likely to be preferable when fairly simple models are adequate and the important variables can be identified before modeling. This is a common situation in biomedical research, for example. In this context Vach et al. [19] compared neural networks and logistic regression and concluded that the use of neural networks "does not necessarily imply any progress: they fail in translating their increased flexibility into an improved estimation of the regression function due to insufficient sample sizes, they do not give direct insight to the influence of single covariates, and they are lacking uniqueness and reproducibility".

2.2 Classification: an illustrative problem

A common problem in statistics and data mining is to use observations on a set of "feature variables" to predict the value of a "class variable". This problem corresponds to statistical models for classification when the class variable takes a discrete set of values and for regression when the values of the class variable cover a continuous range. To illustrate the range of approaches available in statistics and data mining we consider the classification problem. Many different methods are used for classification. The classical statistical approach is discriminant analysis; starting from this one can list various data-mining methods in decreasing order of their resemblance to classical statistical modeling. More details of many of these methods can be found in [13]. We denote the class variable by y and the feature variables by the vector x = [x1 ... xf]. It is sometimes convenient to think of the feature variables as ordinates of a "feature space", with the aim of the analysis being to partition the feature space into regions corresponding to the different classes (values of y).

Linear/quadratic/logistic discriminant analysis. Discriminant analysis is a classical statistical technique based on statistical models containing, usually, relatively few parameters. The modeling procedure seeks linear or quadratic combinations of the feature variables that identify the boundaries between classes. The most detailed theory applies to cases in which the features are continuous-valued and, within each class, approximately normally distributed.

Projection pursuit. For classification problems, projection pursuit can be thought of as a generalization of logistic discrimination that also involves linear combinations of features but also includes nonlinear transformations of these linear combinations, with the probability of a feature vector x belonging to class k being modeled as

    Σ_{m=1}^{M} β_{km} φ_m( Σ_{j=1}^{f} α_{mj} x_j ).    (1)

The φ_m are prespecified scatterplot smoothing functions, chosen in part for their speed of computation. The nonlinearities and often large numbers of parameters in the model lead one to regard projection pursuit as a "neostatistical" rather than a classical statistical technique.
Radial basis functions. Radial basis functions form another kind of nonlinear neostatistical model. The probability of a feature vector x belonging to class k is modeled as

    Σ_{m=1}^{M} β_m φ(||x − c_m|| / σ_m).

Here ||x − c_m|| is the distance from point x in feature space to the m-th center c_m, σ_m is a scale factor, and φ is a basis function, often chosen to be the Gaussian function φ(r) = exp(−r²).

Neural networks. A common form of neural network for the classification problem, the multilayer feedforward network, can be thought of as a model similar to (1). However, the φ_m transformations are different (generally the logistic function φ_m(t) = 1/{1 + exp(−t)} is used), and more than one layer of logistic transformations may be applied. Neural networks are recognizably close to neostatistical models, but a unique methodology and terminology for neural networks has developed that is unfamiliar to statisticians.

Graphical models. Graphical models, also known as Bayesian networks, belief functions, or causal diagrams, involve the specification of a network of links between feature and class variables. The links specify relations of statistical dependence between particular features; equally importantly, absence of a direct link between two features is an assertion of their conditional independence given the other features appearing in the network. Links in the network can be interpreted as causal relations between features (though this is not always straightforward, as exemplified by the discussion in [15]), which can yield particularly informative inferences. For realistic problems, graphical models involve large numbers of parameters and do not fit well into the framework of classical statistical inference.

Nearest-neighbor methods. At its simplest, the k-nearest-neighbor procedure assigns a class to point x in feature space according to the majority vote of the k nearest data points to x. This is a smoothing procedure, and will be effective when class probabilities vary smoothly over the feature space. Questions arise as to the choice of k and of an appropriate distance measure in feature space. These issues are not easily expressed in terms of classical statistical models. Model specification is therefore determined by maximizing classification accuracy on a set of training data rather than by formally specifying and fitting a statistical model.

Decision trees. A decision tree is a succession of partitions of feature space, each partition usually based on the value taken by a single feature, until the partitions are so fine that each corresponds to a single value of the class variable. This formulation bears little resemblance to classical parametric statistical models. Choice of the best tree representation is obtained by comparing different trees in terms of their predictive accuracy, estimated by cross-validation, and their complexity, often measured by minimum description length.

Rules. Rule-based methods seek to assign class labels to subregions of feature space according to logical criteria such as: if x1 = 3 and x2 ≥ 15 and x2 < 30 then y = 1. Individual rules can be complex and hard to interpret subjectively. Rule-generation methods often involve parameters whose optimal values are unknown. The methods cannot be expressed in terms of classical statistical models, and the parameter values are optimized, as for decision trees, by consideration of a rule set's predictive accuracy and complexity.
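Rule sets of the kind just quoted are straightforward to execute. The following sketch is not code from the paper; the rule list, the second rule and the default class are hypothetical, added only to show an ordered rule set in which the first matching rule assigns the class label.

```python
# An ordered rule set: each rule is (predicate, class label).
# The first rule whose predicate holds assigns the class; if no
# rule fires, a default class is returned.
rules = [
    (lambda x: x[0] == 3 and 15 <= x[1] < 30, 1),  # the rule quoted in the text
    (lambda x: x[1] >= 30, 2),                     # a hypothetical second rule
]

def classify(x, rules, default=0):
    for predicate, label in rules:
        if predicate(x):
            return label
    return default
```

A real rule-generation method would learn the thresholds (3, 15, 30) from data, trading off the rule set's predictive accuracy against its complexity as described above.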
The foregoing list illustrates a wide range of statistical and data-mining approaches to the classification problem. However, each approach requires at some stage the selection of appropriate features x1, ..., xf. It can be argued that this similarity between the approaches outweighs all of their differences. Any given data set may contain irrelevant or poorly measured features which only add noise to the analysis and should for efficiency's sake be deleted; some dependences between class and features may be most succinctly expressed in terms of a function of several features rather than by a single feature. No method can be expected to perform well if it does not use the most informative features: "garbage in, garbage out".

Explicit feature selection criteria have been developed for several of the methods described above. These range from criteria based on significance tests for statistical models to measures based on the impurity of the conditional probability distribution of the class variable given the features, used in decision-tree and rule-based classifiers [10]. As noted above, the "wrapper" method is a powerful and widely applicable technique for feature selection.

Construction of new features can be explicit or implicit. Some techniques such as principal-components regression explicitly form linear combinations of features that are then used as new feature variables in the model. Conversely, the linear combinations Σ_j α_mj x_j of features that appear in the representation (1) for projection-pursuit and neural-network classifiers are implicit constructed features. Construction of nonlinear combinations of features is generally a matter for subjective judgement.

3 Classical statistical modeling

In this section we give a brief summary of the classical "frequentist" approach to statistical modeling and scientific inference. A detailed account of the theory is given by Cox and Hinkley [2]. The techniques used in applied statistical analyses are described in more specialized texts such as [4] for classification problems and [27] for regression.

3.1 Model specification

A statistical model is the specification of a frequency distribution P(z) for the elements of the data vector z. This enables "what happened" (the observed data vector) to be quantitatively compared with "what might have happened, but didn't" (other potentially observable data vectors). We assume that inference focuses on a data vector z, with the available data z_i, i = 1, ..., ℓ, being ℓ instances of z. In many problems, such as regression and classification, the data vector z is decomposed into z = [x, y] and y is modeled as a function of the x values.

In regression and classification problems the conditional distribution of y given x, p(y|x), is of interest; the frequency distribution of x may or may not be relevant. In most statistical regression analyses the model has the form

    y = f(x) + e    (2)

where e is an error term having mean zero and some probability distribution; i.e., it is assumed that the relationship between y and x is observed with error. The alternative specification, in which the functional relationship y = f(x) is exact and uncertainty arises only when predicting y at hitherto unobserved values of x, is much less common: one example is the interpolation of random spatial processes by kriging [8].

In classical statistics, model specification has a large subjective component. Candidates for the distribution of z, or the form of the relationship between y and x, may be obtained from inspection of the data, from familiarity with relations established by previous analysis of similar data sets, or from a scientific theory that entails particular relations between elements of the data vector.

3.2 Estimation

Model specification generally involves an unknown parameter vector θ. This is typically estimated by the maximum-likelihood procedure: the joint probability density function of the data, p(z; θ), is maximized over θ. Maximum-likelihood estimation can be regarded as minimization of the loss function −log p(z; θ). When the data are assumed to be a set of independent and identically distributed vectors z_i, i = 1, ..., ℓ, this loss function is

    −Σ_{i=1}^{ℓ} log p(z_i; θ).

When the data vector is decomposed as z = [x, y], the observed data are similarly decomposed as z_i = [x_i, y_i], and the loss function (negative log-likelihood) is

    −Σ_{i=1}^{ℓ} log p(y_i | x_i; θ).

If the conditional distribution of y_i given x_i is Normal with mean a function of x_i, f(x_i; θ), and variance independent of i, this loss function is equivalent to the sum of squares

    Σ_{i=1}^{ℓ} {y_i − f(x_i; θ)}².

The justification for maximum-likelihood estimation is asymptotic: the estimators are consistent and efficient as the sample size ℓ increases to infinity. Except for certain models whose analysis is particularly simple, classical statistics has little to say about finite-sample properties of estimators and predictors.

Assessment of the accuracy of estimated parameters is an important part of frequentist inference. Estimates of accuracy are typically expressed in terms of confidence regions. In frequentist inference the parameter θ is regarded as fixed but unknown, and does not have a probability distribution. Instead one considers hypothetical repetitions of the process of generation of data from the model with a fixed value θ₀ of the parameter vector, followed by computation of θ̂, the maximum-likelihood estimator of θ. Over these repetitions a probability distribution for θ̂ will be built up. Likelihood theory provides an asymptotic large-sample approximation to this distribution. From it one can determine a region C(θ̂), depending on θ̂, of the space of possible values of θ, that contains the true value θ₀ with probability γ (no matter what this true value may be). C(θ̂) is then a confidence region for θ with confidence level γ. The size of the region is a measure of the accuracy with which the parameter can be estimated.

Confidence regions can also be obtained for subsets of the model parameters and for predictions made from the model. These too are asymptotic large-sample approximations. Confidence statements for parameters and predictions are valid only on the assumption that the model is correct, i.e. that for some value of θ the specified frequency distribution p(z; θ) for z accurately represents the relative frequencies of all of the possible values of z. If the model is false, predictions may be inaccurate and estimated parameters may not be meaningful.
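The equivalence noted above between maximum likelihood under Normal errors and least squares can be made concrete. The sketch below, which is illustrative rather than taken from the paper, fits a straight line f(x; θ) = θ0 + θ1 x by minimizing the sum of squares in closed form; the data values are invented.

```python
def least_squares_fit(xs, ys):
    # Closed-form minimizer of sum_i {y_i - (t0 + t1*x_i)}^2, which under
    # Normal errors with constant variance is also the maximum-likelihood fit.
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    t1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
        (x - xbar) ** 2 for x in xs)
    t0 = ybar - t1 * xbar
    return t0, t1

# invented data: roughly y = 1 + 2x observed with error
t0, t1 = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8])
```

The fitted slope and intercept are the maximum-likelihood estimates θ̂ under the Normal-error model; their sampling variability over hypothetical repetitions is what a confidence region summarizes.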
3.3 Diagnostic checking

Inadequacy of a statistical model may arise from three sources. Overfitting occurs when the model is unjustifiably elaborate, with the model structure in part representing merely random noise in the data. Underfitting is the converse situation, in which the model is an oversimplification of reality, with additional structure being needed to describe the patterns in the data. A model may also be inadequate through having the wrong structure: for example, a regression model may relate y linearly to x when the correct physical relation is linear between log y and log x.

Comparison of parameters with their estimated accuracy provides a check against overfitting. If the confidence region for a parameter includes the value zero, then a simpler model in which the parameter is dropped will usually be deemed adequate.

In the frequentist framework, underfitting by a statistical model is typically assessed by diagnostic goodness-of-fit tests. A statistic T is computed whose distribution can be found, either exactly or as a large-sample asymptotic approximation, under the assumption that the model is correct. If the computed value of T is in the extreme tail of its distribution there is an indication of model inadequacy: either the model is wrong or something very unusual has occurred. An extreme value of T often (but not always) suggests a particular direction in which the model is inadequate, and a way of modifying the model to correct the inadequacy.

Many diagnostic plots and statistics have been devised for particular statistical models. Though not used in formal goodness-of-fit tests, they can be used as the basis of subjective judgements of model adequacy, for identification either of underfitting or of incorrect model structure. For example, the residuals from a regression model that is correctly specified will be approximately independently distributed; if a plot of residuals against the fitted values shows any noticeable structure, this is an indication of model inadequacy and may suggest some way in which the model should be modified.

Diagnostic plots are also used to identify data values that are unusual in some respect. Unusual observations may be outliers, values that are discordant with the pattern of the other data values, or influential values, which are such that a small change in the data value will have a large effect on the estimated values of the model parameters. Such data points merit close inspection to check whether the outliers may have arisen from faulty data collection or transcription, and whether the influential values have been measured with sufficient accuracy to justify conclusions drawn from the model and its particular estimated parameter values. In analyses in which there is the option of collecting additional data at controlled points, for example when modeling the relation y = f(x) where x can be fixed and the corresponding value of y observed, the most informative x values at which to collect more data will be in the neighborhood of outlying and influential data points.

3.4 Model building as an iterative procedure

The sequence of specification, estimation and checking lends itself to an iterative procedure in which model inadequacy revealed by diagnostic checks suggests a modified model specification designed to correct the inadequacy; the modified model is then itself estimated and checked, and the cycle is repeated until a satisfactory model is obtained. This procedure often has a large subjective component, arising from the model specifications and the choice of diagnostic checks. However, formal procedures to identify the best model can be devised if the class of candidate models can be specified a priori. This is the case, for example, when the candidates form a sequence of nested models M1, ..., Mm, whose parameter vectors θ(1), ..., θ(m) are such that every element of θ(j) is also included in θ(j+1). Careful control over the procedure is necessary in order to ensure that inferences are valid, for example that confidence regions for the parameters in the final model have the correct coverage probability.

Classical frequentist statistics has little to say about the choice between nonnested models, for example whether a regression model y = θ(1)_1 x1 + θ(1)_2 x2 is superior to an alternative model log y = θ(2)_1 x1 + θ(2)_2 x3. Such decisions are generally left as a matter of subjective judgement based on the quality of fit of the models, their ease of interpretation and their concordance with known physical mechanisms relating the variables in the model.

Once a satisfactory model has been obtained, further inferences and predictions are typically based on the assumption that the final model is correct. This is problematical in two respects. In many situations one may believe that the true distribution of z has a very complex structure to which any statistical model is at best an approximation. Furthermore, the statistical properties of parameter estimators in the final model may be affected by the fact that several models have been estimated and tested on the same set of data, and failure to allow for this can lead to inaccurate inferences.

As an example of this last problem, we consider stepwise regression. This is a widely used procedure for identifying the best statistical model, in this case deciding which elements of the x component of the data vector should appear in the regression model (2). Because random variability can cause x variables that are actually unrelated to y to appear to be statistically significant, the estimated regression coefficients of the variables selected for the final model tend to be overestimates of the absolute magnitude of the true parameter values. This "selection bias" leads to underestimation of the variability of the error term in the regression model, which can lead to poor results when the final model is used for prediction. In practice it is often better to use all of the available variables rather than a stepwise procedure for prediction [14].

3.5 Recent developments

Developments in statistical theory since the 1970s have addressed some of the difficulties with the classical frequentist approach. Akaike's information criterion [17], and related measures of Schwarz and Rissanen, provide likelihood-based comparisons of nonnested models. Development of robust estimators [6] has made inference less susceptible to outliers and influential data values. Greater use of nonlinear models enables a wider range of x–y relationships to be accurately modeled. Simulation-based methods such as the bootstrap [3] enable better assessment of accuracy in finite samples.

4 Vapnik's statistical learning theory

One reason that classical statistical modeling has a large subjective component is that most of the mathematical techniques used in the classical approach assume that the form of the correct model is known and that the problem is to estimate its parameters. In data mining, on the other hand, the form of the correct model is usually unknown. In fact, discovering an adequate model, even if its form is not exactly correct, is often the purpose of the analysis. This situation is also faced in classical statistical modeling and has led to the creation of the diagnostic checks discussed earlier. However, even with these diagnostics, the classical approach does not provide firm mathematical guidance when comparing different types of models. The question of model adequacy must still be decided subjectively, based on the judgment and experience of the data analyst.

This latter source of subjectivity has motivated Vapnik and Chervonenkis [24, 25, 26] to develop a mathematical basis for comparing models of different forms and for estimating their relative adequacies. This body of work, now known as statistical learning theory, presumes that the form of the correct model is truly unknown and that the goal is to identify the best possible model from a given set of models. The models need not be of the same form and none of them need be correct. In addition, comparisons between models are based on finite-sample statistics, not asymptotic statistics as is usually the case in the classical approach. This shift of emphasis to finite samples enables overfitting to be quantitatively assessed. Thus, the underlying premise of statistical learning theory closely matches the situation actually faced in data mining.

4.1 Model specification

As in classical statistical modeling, models for the data must be specified by the analyst. However, instead of specifying a single (parametric) model whose form is then assumed to be correct, a series of competing models must be specified, one of which will be selected based on an examination of the data. In addition, a preference ordering over the models must also be specified. This preference ordering is used to address the issue of overfitting. In practice, models with fewer parameters or degrees of freedom are preferable to those with more, since they are less likely to overfit the data. When applying statistical learning theory, one searches for the most preferable model that best explains the data.

4.2 Estimation

Estimation plays a central role in statistical learning theory just as it does in classical statistical modeling; however, what is being estimated is quite different. In the classical approach, the form of the model is assumed to be known and, hence, emphasis is placed on estimating its parameters. In statistical learning theory, the correct model is assumed to be unknown and emphasis is placed on estimating the relative performance of competing models so that the best model can be selected.

The relative performance of competing models is measured through the use of loss functions. In general, statistical learning theory considers the loss Q(z; α) between a data vector z and a specific model. In the case of a parametric family of models, the notation introduced earlier is extended so that α defines both the specific parameters of the model and the parametric family to which the model belongs. In this way, models from different families can be compared.

The negative log-likelihood functions employed in classical statistical modeling are also used in statistical learning theory when comparing probability distributions. However, other loss functions are also considered for different kinds of modeling problems. When modeling the joint probability density of the data, the appropriate loss function is the same joint negative log-likelihood used in classical statistical modeling:

    Q(z; α) = −log p(z; α).

Similarly, when the data vector z can be decomposed into two components, z = [x, y], and we are interested in modeling the conditional probability distribution of y as a function of x, then the conditional negative log-likelihood is the appropriate loss function:

    Q(z; α) = −log p(y | x; α).

On the other hand, if we are not interested in the actual distribution of y but only in constructing a predictor f(x; α) for y that minimizes the probability of making an incorrect prediction, then the 0/1 loss function used in pattern recognition is appropriate:

    Q(z; α) = 0 if f(x; α) = y;  1 if f(x; α) ≠ y.
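These loss functions can be compared side by side on a toy data set. The sketch below is a hypothetical illustration (the two-class predictive probabilities are invented, not from the paper); it computes the conditional negative log-likelihood and the 0/1 loss for the same model, together with their averages over the observations.

```python
import math

def nll_loss(p_y_given_x, y):
    # conditional negative log-likelihood: Q(z; a) = -log p(y | x; a)
    return -math.log(p_y_given_x[y])

def zero_one_loss(prediction, y):
    # pattern-recognition loss: Q(z; a) = 0 if f(x; a) == y, else 1
    return 0 if prediction == y else 1

def average_loss(losses):
    # the mean of Q over the observed data vectors
    return sum(losses) / len(losses)

# each observation: (the model's p(y|x) over classes {0, 1}, observed y)
data = [({0: 0.9, 1: 0.1}, 0), ({0: 0.2, 1: 0.8}, 1), ({0: 0.6, 1: 0.4}, 1)]
r_nll = average_loss([nll_loss(p, y) for p, y in data])
r_01 = average_loss([zero_one_loss(max(p, key=p.get), y) for p, y in data])
```

The hard predictor here takes f(x; α) to be the most probable class under p(y | x; α); the third observation is misclassified, so the average 0/1 loss is 1/3 while the negative log-likelihood also penalizes the low probability assigned to the observed class.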
10 wealreadyknewallofthestatisticalpropertiesofthedata.ifthedatavectorzisgeneratedbya Ingeneral,Q(z;)canbechosendependingonthenatureofthemodelingproblemonefaces.Its minimizestheexpectedlossr()withrespecttof(z),where randomprocessaccordingtotheprobabilitymeasuref(z),thenthebestmodelistheonethat lossesimplybettermodelsofthedata. purposeistomeasuretheperformanceofamodelsothatthebestmodelcanbeselected.theonly requirementfromthepointofviewofstatisticallearningtheoryisthat,byconvention,smaller Oncealossfunctionhasbeenselected,identifyingthebestmodelwouldberelativelyeasyif utilitymeasureoftheoutcomegiventhedecision.utilitymeasuresprovideanumericalencoding ofuncertaintyoneiswillingtoacceptinchoosingariskydecisionthathasalowprobabilityof ofwhichoutcomesarepreferredoverothers,aswellasaquantitativemeasurementofthedegree nologyofdecisiontheory,isadecisionvector,zisanoutcome,andq(z;)isthe(negative) ThemodelthatminimizesR()isoptimalfromadecision-theoreticpointofview.Inthetermi- R()=ZQ(z;)dF(z): onemustchoosethemostsuitablemodelonecanidentifybasedonasetofobserveddatavectors probabilitymeasuref(z)thatdenesthestatisticalpropertiesofthedataisunknown.instead, measure thatis,thebestmodelgiventhelossfunction. utilityr()producesanoptimaldecisionconsistentwiththeriskpreferencesdenedbytheutility obtainingahighlydesirableoutcomeversusamoreconservativedecisionwithahighprobability ofamoderateoutcome.choosingthedecisionvectorthathasthebestexpected(negative) distributed,theaveragelossremp(;`)fortheobserveddatacanbeusedasanempiricalestimator zi,i=1;:::;`.assumingthattheobservedvectorsarestatisticallyindependentandidentically oftheexpectedloss,where Unfortunately,inpractice,theexpectedlossR()cannotbecalculateddirectlybecausethe modelsand/ortheirparametersareselectedbyoptimizingnumericalcriteriaofthisgeneralform. 
StatisticallearningtheorypresumesthatmodelsarechosenbyminimizingRemp(;`).Notethat thispresumptionisconsistentwithstandardmodel-ttingproceduresusedinstatisticsinwhich doesminimizingtheaverageempiricallossremp(;`)yieldmodelsthatalsominimizetheexpected Thefundamentalquestionofstatisticallearningtheoryisthefollowing:underwhatconditions Remp(;`)=1``Xi=1Q(zi;): fortheexpectedlosses,notfortheparameters.theexpectedlossr()foramodelisregarded expressedintermsofcondenceregions;however,inthiscase,condenceregionsareconstructed isarandomquantitythatwecansample,sinceitsvaluedependsonthevaluesoftheobserved asxedbutunknown,sincetheprobabilitymeasuref(z)thatdenesthestatisticalpropertiesof byconsideringtheaccuracyoftheempiricallossestimate.asinclassicalstatistics,accuracyis thedatavectorsisxedbutunknown.ontheotherhand,theaverageempiricallossremp(;`) lossr(),sincethelatteriswhatweactuallywanttoaccomplish?thisquestionisanswered datavectorszi,i=1;:::;`,usedinitscalculation.statisticallearningtheorythereforeconsiders condenceregionsforr()givenremp(;`). distinguishesstatisticallearningtheoryfromclassicalstatistics.oneofthefundamentaltheorems modelsareselectedbyminimizingaverageempiricalloss.thislattercaveatisthekeyissuethat dierencebetweentheexpectedandaverageempiricallosseswhiletakingintoaccountthefactthat Toconstructthesecondenceregions,weneedtoconsidertheprobabilitydistributionofthe 10
The landmark contribution of Vapnik and Chervonenkis is a series of probability bounds that they have developed to construct small-sample confidence regions for the expected loss given the average empirical loss. The resulting confidence regions differ from those obtained in classical statistics in three respects. First, they do not assume that the chosen model is correct. Second, they are based on small-sample statistics and are not asymptotic approximations, as is typically the case. Third, a uniform method is used to take into account the degrees of freedom in the set of models one is selecting from, independent of the forms of those models. This method is based on a measurement known as the Vapnik-Chervonenkis (VC) dimension.

A key result of statistical learning theory shows that, in order to account for the fact that models are selected by minimizing average empirical loss, one must consider the maximum difference between the expected and average empirical losses; that is, one must consider the distribution of

    sup_{α ∈ Λ} [ R(α) − R_emp(α; ℓ) ],

where Λ is the set of models one is selecting from.

The reason that the maximum difference must be considered has to do with the phenomenon of overfitting. Intuitively speaking, overfitting occurs when the set of models to choose from has so many degrees of freedom that one can find a model that fits the noise in the data but does not adequately reflect the underlying relationships. As a result, one obtains a model that looks good relative to the training data but that performs poorly when applied to new data. This mathematically corresponds to a situation in which the average empirical loss R_emp(α; ℓ) substantially underestimates the expected loss R(α). Although there is always some probability that the average empirical loss will underestimate the expected loss for a fixed model, both the probability and the degree of underestimation are increased by the fact that we explicitly search for the model that minimizes R_emp(α; ℓ). Because of this search, the maximum difference between the expected and average empirical losses is the quantity that governs the confidence region.

The VC dimension of a set of models can conceptually be thought of as the maximum number of data vectors for which one is pretty much guaranteed to find a model that fits exactly. For example, the VC dimension of a linear regression or discriminant model is equal to the number of terms in the model (i.e., the number of degrees of freedom in the classical sense), since n linear terms can be used to exactly fit n points. The actual definition of VC dimension is more general and does not formally require an exact fit; nevertheless, the intuitive insights gained by thinking about the consequences of exact fits are often valid with regard to VC dimension. For example, one consequence is that, in order to avoid overfitting, the number of data samples should substantially exceed the VC dimension of the set of models to choose from; otherwise, one could obtain an exact fit to arbitrary data.

Because VC dimension is defined in terms of model fitting and numbers of data points, it is equally applicable to linear, nonlinear and nonparametric models, and to combinations of dissimilar model families. This includes neural networks, classification and regression trees, classification and regression rules, radial basis functions, Bayesian networks, and virtually any other model family imaginable. In addition, VC dimension is a much better indicator of the ability of models to fit arbitrary data than is the number of parameters in the models. There are examples of models with only one parameter that have infinite VC dimension and, hence, are able to exactly fit any set of data [22, 23]. There are also models with billions of parameters that have small VC dimensions, which enables one to obtain reliable models even when the number of data samples is much less than the number of parameters. VC dimension coincides with the number of parameters only for certain model families, such as linear regression/discriminant models. VC dimension therefore offers a much more general notion of degrees of freedom than is found in classical statistics.

In the probability bounds obtained by Vapnik and Chervonenkis, the size of the confidence region is largely determined by the ratio of the VC dimension to the number of data vectors. For example, if the loss function Q(z; α) is the 0/1 loss used in pattern recognition, then with probability at least 1 − η,

    R(α) ≤ R_emp(α; ℓ) + (E/2) ( 1 + √(1 + 4 R_emp(α; ℓ)/E) ),

where

    E = (4/ℓ) [ h (ln(2ℓ/h) + 1) − ln(η/4) ]

and where h is the VC dimension of the set of models to choose from. Note that the ratio of the VC dimension h to the number of data vectors ℓ is the dominant term in the definition of E and, hence, in the size of the confidence region for R(α). Other families of loss functions have analogous confidence regions involving the quantity E. The bounds are therefore applicable for an extremely wide range of modeling problems and for any family of models imaginable.

The concept of VC dimension and confidence bounds for various families of loss functions are discussed in detail in books by Vapnik [21, 22, 23]. The remarkable properties of these bounds are that they make no assumptions about the probability distribution F(z) that defines the statistical properties of the data vectors, they are valid for small sample sizes, and they depend only on the VC dimension of the set of models and on the properties of the loss function employed.
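Numerically, the behaviour of this 0/1-loss bound is easy to explore. The sketch below is our own illustration, not code from the paper; the values of R_emp, h, ℓ and η are hypothetical:

```python
import math

def vc_confidence_bound(r_emp, h, n, eta):
    """Upper confidence bound on the expected 0/1 loss R(alpha), given the
    average empirical loss r_emp, VC dimension h, sample size n (the
    text's ell) and confidence parameter eta."""
    # E = (4/n) * (h * (ln(2n/h) + 1) - ln(eta/4))
    e = 4.0 * (h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n
    # R(alpha) <= R_emp + (E/2) * (1 + sqrt(1 + 4*R_emp/E))
    return r_emp + 0.5 * e * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / e))

# The h/n ratio dominates: with the same empirical loss, more data
# (or a lower VC dimension) gives a much tighter bound.
loose = vc_confidence_bound(0.10, h=50, n=1000, eta=0.05)
tight = vc_confidence_bound(0.10, h=50, n=100000, eta=0.05)
```

With h/ℓ = 0.05 the bound is far above the empirical loss, while at h/ℓ = 0.0005 it comes close to it, which is the behaviour the text describes.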
4.3 Model selection

As discussed at the beginning of this section, the data analyst is expected to provide not just a single parametric model, but an entire series of competing models ordered according to preference, one of which will be selected based on an examination of the data. The results of statistical learning theory are then used to select the most preferable model that best explains the data.

The selection process has two components: one is to determine a cut-off point in the preference ordering; the other is to select the model with the smallest average empirical loss R_emp(α; ℓ) from among those models that occur before the cut-off. As the cut-off point is advanced through the preference ordering, both the set of models that appear before the cut-off and the VC dimension of this set steadily increase. This increase in VC dimension has two effects. The first effect is that with more models to choose from one can usually obtain a better fit to the data; hence, the minimum average empirical loss steadily decreases. The second effect is that the size of the confidence region for the expected loss R(α) steadily increases, because the size is governed by the VC dimension. To choose a cut-off point in the preference ordering, Vapnik and Chervonenkis advocate minimizing the upper bound of the confidence region for the expected loss; that is, minimizing the worst-case estimate of R(α). For example, if the 0/1 loss function were being used, one would choose the cut-off so as to minimize the upper bound in the inequality presented above, for a desired setting of the confidence parameter η. The model that minimizes the average empirical loss R_emp(α; ℓ) among those models that occur before the chosen cut-off is then selected as the most suitable model for the data.
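As a concrete sketch of this selection rule (our illustration, not the paper's; the empirical losses, VC dimensions and sample size below are hypothetical), one can compute the worst-case bound at each cut-off and keep the minimizer:

```python
import math

def vc_bound(r_emp, h, n, eta=0.05):
    # Vapnik-Chervonenkis upper bound on R(alpha) for the 0/1 loss.
    e = 4.0 * (h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n
    return r_emp + 0.5 * e * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / e))

def choose_cutoff(emp_losses, vc_dims, n):
    """emp_losses[k] is the smallest average empirical loss among models
    before cut-off k; vc_dims[k] is the VC dimension of that set. As k
    advances, losses fall while dimensions rise; return the cut-off that
    minimizes the worst-case (upper-bound) estimate of the expected loss."""
    bounds = [vc_bound(r, h, n) for r, h in zip(emp_losses, vc_dims)]
    return min(range(len(bounds)), key=bounds.__getitem__)

# Hypothetical preference ordering: fit improves while complexity grows.
emp = [0.30, 0.18, 0.12, 0.10, 0.09]
dims = [2, 5, 20, 200, 2000]
best = choose_cutoff(emp, dims, n=5000)
```

The chosen cut-off is neither the simplest model (which fits poorly) nor the richest (whose confidence region is widest), illustrating the trade-off described above.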
The overall approach is illustrated by the graph in Figure 1. The process balances the ability to find increasingly better fits to the data against the danger of overfitting and thereby selecting a poor model. The preference ordering provides the necessary structure in which to compare competing models while at the same time taking into account their effective degrees of freedom (i.e., VC dimension). The result is a model that minimizes the worst-case loss on future data. The process itself attempts to maximize the rate of convergence to an optimum model as the quantity of available data increases.

[Figure 1: Expected loss and average empirical loss as a function of the preference cut-off. Labels in the original graph: loss; preference cut-off; upper bound on expected loss; minimum average empirical loss; best cut-off.]

4.4 Use of validation data

One drawback to the Vapnik-Chervonenkis approach is that it can be difficult to determine the VC dimension of a set of models, especially for the more exotic types of models. Even for simple linear regression/discriminant models, the situation is not entirely straightforward. The relationship stated above, that the VC dimension is equal to the number of terms in such a model, is actually an upper bound on the VC dimension. If the models are written in a certain canonical form, then the VC dimension is also bounded by the quantity R²A² + 1, where R is the radius of the smallest sphere that encloses the available data vectors and A² is the sum of the squares of the coefficients of the model in its canonical form. As Vapnik has shown [22], this additional bound on the VC dimension makes it possible to obtain linear regression/discriminant models whose VC dimensions are orders of magnitude smaller than the number of terms, even if the models contain billions of terms. This fact is extremely fortunate because it offers a means of avoiding the "curse of dimensionality," enabling reliable models to be obtained even in high-dimensional spaces by basing the preference ordering of the models on the sum of the squares of the model coefficients.
In cases where the VC dimension of a set of models is difficult to determine, the expected loss can be estimated using resampling techniques [3]. In the simplest of these approaches, the available set of data is randomly divided into training and validation sets. The training set is used first to select the best-fitting model for each cut-off point in the preference ordering. The validation set is then used to estimate the expected losses of the selected models by calculating their average empirical losses on the validation data. Finally, the model with the smallest upper bound for the expected loss on the validation data is chosen as the most suitable model.

Because only a finite number of models are evaluated on the validation set (models with continuous parameters imply an infinite set of models), it is very easy to obtain confidence bounds for the expected losses of these models, independent of their exact forms and without having to worry about VC dimension [22]. In particular, the same equations for the confidence bounds are used as before, except that E now has the value

    E = (2/ℓ_v) ( ln N − ln η ),

where N is the number of models evaluated against the validation set and ℓ_v is the size of the validation set. Moreover, because the number N of such models is typically small relative to the size ℓ_v of the validation set, one can obtain tight confidence regions for the expected losses of these models given their average empirical losses on the validation data. Since the same underlying principles are at work, this approach exhibits the same kind of relationship between the expected and average empirical losses as that shown in Figure 1.

Although this validation-set approach has the advantage that it is relatively easy to obtain expected loss estimates, it has the disadvantage that dividing the available data into subsets decreases the overall accuracy of the resulting estimates. This decrease in accuracy is usually not much of a concern when data is plentiful. However, when the sample size is small, fitting models to all of the data and calculating the VC dimension for all relevant sets of models becomes more attractive.
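A sketch of this finite-model-set bound (our own illustration, with hypothetical values; it reuses the same bound form as the VC-dimension case, with the new E):

```python
import math

def validation_bound(r_emp_val, num_models, n_val, eta):
    """Upper confidence bound on the expected loss of a model chosen from
    a finite set of num_models candidates, computed from its average
    empirical loss on a validation set of size n_val."""
    # Same bound form as before, with E = (2/n_val) * (ln N - ln eta).
    e = 2.0 * (math.log(num_models) - math.log(eta)) / n_val
    return r_emp_val + 0.5 * e * (1.0 + math.sqrt(1.0 + 4.0 * r_emp_val / e))

# N enters only logarithmically, so even many candidate models evaluated
# on a modest validation set still yield a tight bound:
few = validation_bound(0.10, num_models=10, n_val=2000, eta=0.05)
many = validation_bound(0.10, num_models=1000, n_val=2000, eta=0.05)
```

Going from 10 to 1000 candidate models widens the bound only slightly, which is why the validation-set bounds stay tight when N is small relative to ℓ_v.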
5 Computational learning theory and PAC learning

The statistical theory of minimization of loss functions provides a general analysis of the conditions under which a class of models is learnable. The theory reduces the task of learning to that of solving a minimization problem: minimizing the average empirical loss on the samples z_1, ..., z_ℓ. The perfect complement to this theory would be an efficient algorithm, for every class of models, that solves this minimization problem. Before even defining efficiency formally (we shall do so soon), we point out that such efficient algorithms are not known to exist. Furthermore, the widespread belief is that such algorithms will not exist for many classes of models. As we shall elaborate presently, this turns out to be related to the famous "Is NP = P?" question from computational complexity theory. Given that the answer to this question is most probably negative, the next best hope would be to characterize the model classes for which efficient algorithms do exist. Unfortunately, such characterizations are also ruled out, owing to the inherent undecidability of such questions. In view of these barriers, it becomes clear that the question of whether a given model class allows for an efficient algorithm to solve the minimization problem has to be tackled on an individual basis.

The computational theory of learning, initiated by Valiant's work in 1984, is devoted to the analysis of these problems. There are plenty of results that show how to solve such minimization problems for various classes of models; these show the diversity within the area of computational learning. We shall, however, focus on results that tend to unify the area. Thus most of this survey is devoted to formulating the right definition for the computational setting and examining several parameters and attributes of the model. We cover some of the salient results in this area in this brief survey.

5.1 Computational model of learning

The complexity of a computational task is the number of elementary steps (addition, subtraction, multiplication, division, comparison, etc.) it takes to perform the computation. This is studied as a function of the input and output size of the function to be computed. The well-entrenched and well-studied notion of efficiency is that of polynomial time: an algorithm is considered efficient if the number of elementary operations it performs is bounded by some fixed polynomial in the input and output sizes. The class of problems which can be solved by such efficient algorithms is denoted by P (for polynomial time). This shall be our notion of efficiency as well.

In order to study the computational complexity of the learning problem, we have to define the input and output sizes carefully. The input to the learning task is a collection of vectors z_1, ..., z_ℓ ∈ R^n, but ℓ itself may be thought of as a parameter to be chosen by the learning algorithm. Similarly, the output of the learning algorithm is again a representation of the model, the choice of which may be left unclear by the problem. The choice could easily allow an inefficient algorithm to pass as efficient, by picking an unnecessarily large number of samples or an unnecessarily verbose representation of the hypothesis. In order to circumvent such difficulties, one forces the running time of the algorithm to be polynomial in n (the input size of a single sample) and the size of the representation of the hypothesis, but not in ℓ, at least not directly. However, the smallest ℓ required to guarantee good convergence grows with the complexity of the model class (roughly, the number of samples needed before every model consistent with the data generalizes well is at least the effective dimension d of the class), so indirectly this does allow the running time to be a polynomial in ℓ.

Finally, a learning algorithm is given a loss function Q and a source of random vectors z ∈ R^n that follow some unknown distribution F(z). The requirement on the algorithm is that, with high probability (bounded away from 1 by a confidence parameter δ), the learning algorithm produces a hypothesis whose prediction ability is very close (given by an accuracy parameter ε) to the best achievable in the class. The running time is allowed to be a polynomial in 1/ε and 1/δ as well.
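For intuition about why ℓ can be polynomial in 1/ε and 1/δ, consider the textbook bound for a finite hypothesis class and a consistent learner (a standard result covered, e.g., by Kearns and Vazirani [9]; it is not derived in this paper and is included only as an illustration):

```python
import math

def pac_sample_size(num_hypotheses, epsilon, delta):
    """Textbook sufficient sample size for PAC learning a finite
    hypothesis class with a consistent learner:
    ell >= (1/epsilon) * (ln |H| + ln(1/delta))."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)

# ell grows linearly in 1/epsilon but only logarithmically in 1/delta
# and in |H|, so the learner's input size stays polynomially bounded.
a = pac_sample_size(2 ** 20, epsilon=0.1, delta=0.01)
b = pac_sample_size(2 ** 20, epsilon=0.01, delta=0.01)
```

Tightening the accuracy from 0.1 to 0.01 multiplies the required sample size by ten, while squaring the hypothesis count or the confidence only adds to it.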
The above discussion can now be formalized in the following definition, which is popularly known as PAC (probably approximately correct) learning. A (generalized) PAC learning algorithm is one that takes two parameters ε (the accuracy parameter) and δ (the confidence parameter), reads ℓ random examples z_1, ..., z_ℓ as input, the choice of ℓ being decided by the algorithm, and outputs a model (hypothesis) h(z_1, ..., z_ℓ), possibly from a different class of models, such that

    Pr_F { (z_1, ..., z_ℓ) ∈ R^{nℓ} : R(h(z_1, ..., z_ℓ)) ≥ inf_{α ∈ Λ} R(α) + ε } ≤ δ,

where R(·) is the same expected loss considered in statistical learning theory. The algorithm is said to be efficient if its running time is bounded by a polynomial in n, 1/ε, 1/δ and the representation size of the model.
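A toy simulation of this definition may help. The example below is entirely our own construction (a hypothetical threshold-learning task with the 0/1 loss and the uniform distribution on [0, 1]; none of the names come from the paper):

```python
import random

def learn_threshold(samples):
    """Toy consistent learner: data are pairs (x, y) with y = 1 iff
    x >= theta for an unknown theta; output the smallest positive
    example seen."""
    positives = [x for x, y in samples if y == 1]
    return min(positives) if positives else 1.0

def expected_loss(theta_hat, theta):
    # Expected 0/1 loss R(.) under the uniform distribution on [0, 1]:
    # hypothesis and target disagree on an interval of this length.
    return abs(theta_hat - theta)

random.seed(0)
theta = 0.37                 # unknown to the learner
ell = 2000                   # sample size chosen by the learner
samples = [(x, int(x >= theta)) for x in (random.random() for _ in range(ell))]
risk = expected_loss(learn_threshold(samples), theta)
```

With ℓ = 2000 the returned hypothesis is, with overwhelming probability, within ε = 0.01 of the best possible expected loss (which is 0 here), matching the "probably approximately correct" reading of the definition.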
Hence the accuracy parameter ε represents the maximum prediction error desired for the model. While the notion of generalized PAC learning (cf. [5]) is itself general enough to study any learning problem, in this survey we shall focus on the Boolean pattern-recognition problems typically examined in computational learning theory. Here the data vector z is partitioned into a vector x ∈ {0,1}^{n−1} and a bit y ∈ {0,1} that is to be predicted. The model is given by a function f: {0,1}^{n−1} → {0,1}, and the loss function Q(z; α) of a vector z = [x, y] is 0 if f(x) = y and 1 otherwise. Henceforth we focus on problems for which Q(z; α) is computable efficiently (i.e., f(x) is computable efficiently).

5.2 Intractable learning problems

It is easy to show that several PAC learning problems are NP-hard if the hypothesis class is restricted (to something fixed). The relevant notion here is the well-studied computational class NP. NP consists of problems that can be solved efficiently by an algorithm that is allowed to make nondeterministic choices. In the case of learning, the nondeterministic machine can nondeterministically guess the α that minimizes the loss, thus solving the problem easily. Of course, the idea of an algorithm that makes nondeterministic choices is merely a mathematical abstraction and not efficiently realizable. The importance of the computational class NP comes from the fact that it captures many widely studied problems, such as the traveling-salesperson problem or the graph-coloring problem. Even more important is the notion of NP-hardness: a problem is NP-hard if the existence of an efficient (polynomial-time) algorithm to solve it would imply a polynomial-time algorithm for every problem in NP. The famous question "Is NP = P?" asks exactly this: do NP-hard problems have efficient algorithms to solve them?

A typical example is that of learning a pattern-recognition problem with the hypothesis class restricted to "3-term DNF". It can be shown that learning 3-term DNF formulae with 3-term DNF is NP-hard. Interestingly, however, it is possible to efficiently learn the broader class "3-CNF", which contains 3-term DNF. Thus this NP-hardness result is not pointing to any inherent computational bottlenecks to the task of learning; it merely advocates a judicious choice of the hypothesis class to make the learning problem tractable.

It is harder to show that a class of problems is hard to learn independent of the representation of choice for the output. In order to show the hardness of such problems, one needs to assume something stronger than NP ≠ P. A common assumption here is that there exist functions which are easy to compute but hard to invert, even on randomly chosen instances. Such functions are common in cryptography, and in particular are at the heart of well-known cryptosystems such as RSA. If this assumption is true, it implies that NP ≠ P. Under this assumption it is possible to show that pattern recognition problems where the pattern is generated by a deterministic finite automaton (or hidden Markov model) are hard to learn under some distributions on the space of the data vectors. Recent results also show that patterns generated by constant-depth Boolean circuits are hard to learn under the uniform distribution.

In summary, the negative results shed new light on two aspects of learning. Learning is easier, i.e., more tractable, when no restrictions are placed on the model used to describe the given data. Furthermore, the complexity of the learning process is definitely dependent on the underlying distribution according to which we wish to learn.
5.3 PAC learning algorithms

We now move to some lessons learnt from positive results in learning. The first of these focuses on the role of the parameters ε and δ in the definition of learning; as we will see, these are not very critical to the learning process. The second issue we will consider is the role of "classification noise" in learning, and we present an alternate model which shows more robustness towards such noise.

The strength of weak learning. Of the two fuzz parameters, ε and δ, used in the definition of PAC learning, it seems clear that ε (the accuracy) is more significant than δ (the confidence), especially for pattern-recognition problems. For such problems, given an algorithm which can learn a model with probability, say, 2/3 (or any confidence strictly greater than 1/2), it is easy to boost the confidence of getting a good hypothesis as follows. Pick a parameter k and run the learning algorithm k times, producing a new hypothesis each time. Denote these hypotheses by h_1, ..., h_k. For the new prediction, use the algorithm whose prediction on any vector x is the majority vote of the predictions of h_1, ..., h_k. It is easy to show, by an application of the law of large numbers, that the majority vote has inaccuracy at most ε with probability 1 − exp(−ck) for some c > 0.

The accuracy parameter, on the other hand, does not appear to allow such simple boosting. It is unclear how one could use a learning algorithm which can learn to predict a model with inaccuracy 1/3 to get a new algorithm which can predict a model with inaccuracy 1%. However, if we are lucky enough to be able to find learning algorithms which learn to predict with inaccuracy 1/3, independent of the distribution from which the data vectors are picked, then we could use the same learning algorithm on the region where our earlier predictions are inaccurate to boost our accuracy. Of course, the problem is that we don't know where our earlier predictions were wrong (if we knew, we would change our prediction!). Though it appears that this reasoning has led us back to square one, it turns out not to be the case. In 1990, Schapire showed how to turn this intuition into a boosting result for the accuracy parameter as well. This result demonstrates a surprising robustness of PAC learning: weak learning (with inaccuracy barely below 1/2) is equivalent to strong learning (with inaccuracy arbitrarily close to 0). We stress, however, that this equivalence relies on learning algorithms that succeed independent of the distribution of the data vectors.
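The confidence-boosting construction described in this section can be simulated directly. The sketch below is our own toy illustration (the "flaky" base learner, its target concept and all names are hypothetical): a learner that returns an accurate hypothesis only with probability 2/3 is boosted by a majority vote over k = 201 runs.

```python
import random

def boost_confidence(learn, k, rng):
    """Run a low-confidence learner k times and predict with the
    majority vote of the resulting hypotheses h1, ..., hk."""
    hyps = [learn(rng) for _ in range(k)]
    def majority(x):
        votes = sum(h(x) for h in hyps)
        return int(2 * votes > len(hyps))
    return majority

# Hypothetical base learner for the target concept y = 1 iff x >= 0.5:
# it returns an accurate hypothesis only with probability 2/3, and a
# useless constant hypothesis otherwise.
def flaky_learner(rng):
    if rng.random() < 2 / 3:
        return lambda x: int(x >= 0.5)
    return lambda x: 0

rng = random.Random(1)
boosted = boost_confidence(flaky_learner, k=201, rng=rng)
errors = sum(boosted(x / 100) != int(x / 100 >= 0.5) for x in range(100))
```

With 201 runs, the probability that fewer than half the hypotheses are accurate is exponentially small in k, mirroring the 1 − exp(−ck) guarantee for the majority vote.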
Learning with noise. Most results in computational learning start by assuming that the data is observed with no prediction noise. This is not an assumption justified by reality; it is made usually to get a basic understanding of the problem. However, in order to make a computational learning result useful in practice, one must allow for noise. Numerous examples are known where an algorithm which learns without classification noise can be converted into one that tolerates some amount of noise as well. However, this is not universally true. To understand why some algorithms are tolerant to errors while others are not, a model of learning called the statistical query model was proposed by Kearns in 1992. This model restricts a learning algorithm in the following way: instead of actually seeing data vectors z as sampled from the space, the learning algorithm works with an oracle and gets to ask "statistical" questions about the data vectors. A typical statistical query asks for the probability that an event defined over the data space occurs for a vector chosen at random from the distribution under which we are attempting to learn. Further, the query is presented with a tolerance parameter; the oracle responds with the probability of the event to within an additive error given by the tolerance. It is easy to see how to simulate this oracle given access to random samples of the data. Furthermore, it is easy to see how to simulate this oracle even when the data vectors come with some classification noise, provided the noise rate is suitably bounded. Thus learning with access only to a statistical query oracle is a sufficient condition for learning with classification noise. Almost all known algorithms that learn with classification noise can be shown to learn in the statistical query model. Thus this model provides a good standpoint from which to analyse the effectiveness of a potential learning strategy when attempting to learn in the presence of noise.

Alternate models for learning. This survey has focused on the PAC model, since it is close to the spirit of data mining. However, a large body of work in computational learning focuses on models other than the PAC model. One such body of work considers learning when one is allowed to ask questions about the data one is trying to learn. Consider for instance a handwriting recognition program which generates some patterns and asks the teacher to indicate what letter each pattern seems to resemble. It is conceivable that such learning programs may be more efficient than passive handwriting recognition programs. A class of learning algorithms that behave in this manner has been studied under the label of learning with queries. Other models for learning that have been studied capture scenarios of supervised learning and learning in an online setting.

5.4 Further reading

We have given a very informal sketch of the various new questions posed by studying the process of learning, or fitting models to given data, from the point of view of computation. Due to space limitations, we do not give a complete list of references to the sources of the results mentioned above. The interested reader is referred to the text on this subject by Kearns and Vazirani [9] for a detailed coverage of the topics above with complete references. Other surveys on this topic include those by Valiant [20] and Angluin [1]. Finally, a number of different lecture notes are now available online on this topic. This survey has in particular used those of Mansour [12], which include pointers to other useful home pages for tracking recent developments in computational learning and their applicability to practical scenarios.
6 Conclusions

The foregoing sections illustrate some differences of approach between classical statistics and data-mining methods that originated in computer science and engineering. Table 1 summarizes what we regard as the principal issues in data analysis that would be considered by statisticians and data miners.

Table 1: Statisticians' and data miners' issues in data analysis.

    Statisticians' issues     Data miners' issues
    Model specification       Accuracy
    Parameter estimation      Generalizability
    Model comparison          Computational complexity
    Diagnostic checks         Model complexity
    Asymptotics               Speed of computation

In addition, the approaches of statistical learning theory and computational learning theory provide productive extensions of classical statistical inference. The inference procedures of classical statistics involve repeated sampling under a given statistical model; they allow for variation across data samples but not for the fact that in many cases the choice of model is dependent on the data. Statistical learning theory bases its inferences on repeated sampling from an unknown distribution of the data, and allows for the effect of model choice, at least within a prespecified class of models that could in practice be very large. The PAC-learning results from computational learning theory seek to identify modeling procedures that have a high probability of near-optimality over all possible distributions of the data. However, the majority of the results assume that the data are noise-free and that the target concept is deterministic. Even with these simplifications, useful positive results for near-optimal modeling are difficult to obtain, and for some modeling problems only negative results have been obtained.

To some extent, the differences between statistical and data-mining approaches to modeling and inference are related to the different kinds of problems on which these approaches have been used. For example, statisticians tend to work with relatively simple models for which issues of computational speed have rarely been a concern. Some of the differences, however, present opportunities for statisticians and data miners to learn from each other's approaches. Statisticians would do well to downplay the role of asymptotic accuracy estimates based on the assumption that the correct model has been identified, and instead give more attention to estimates of predictive accuracy obtained from data separate from those used to fit the model. Data miners can benefit by learning from statisticians' awareness of the problems caused by outliers and influential data values, and by making greater use of diagnostic statistics and plots to identify irregularities in the data and inadequacies in the model.

As noted earlier, statistical methods are particularly likely to be preferable when fairly simple models are adequate and the important variables can be identified before modeling. In problems with large data sets in which the relation between class and feature variables is complex and poorly understood, data-mining methods offer a better chance of success. However, many practical problems fall between these extremes, and the variety of available models for data analysis, exemplified by those listed in section 2.2, offers no sharp distinction between statistical and data-mining methods. No single method is likely to be obviously best for a given problem, and use of a combination of approaches offers the best chance of making secure inferences. For example, a rule-based classifier might use additional feature variables formed from linear combinations of features, computed implicitly by a logistic discriminant or a neural-network classifier. Inferences from several distinct families of models can be combined, either by weighting the models' predictions or by an additional stage of modeling in which predictions from different models are themselves used as input features, an approach known as "stacked generalization" [28]. The overall conclusion is that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them.

Acknowledgements

We are happy to acknowledge helpful discussions with several participants at the Workshop on Data Mining and its Applications, Institute of Mathematics and its Applications, Minneapolis, November 1996 (J.H.), many conversations with Vladimir Vapnik (E.P.), and comments and pointers from Yishay Mansour, Dana Ron and Ronitt Rubinfeld (M.S.).
References

[1] Angluin, D. (1992). Computational learning theory: survey and selected bibliography. In Proceedings of the Twenty-Fourth Annual Symposium on Theory of Computing, 351-369. ACM.
[2] Cox, D. R., and Hinkley, D. V. (1986). Theoretical statistics. London: Chapman and Hall.
[3] Efron, B. (1981). The jackknife, the bootstrap, and other resampling plans, CBMS Monograph 38. Philadelphia, Pa.: SIAM.
[4] Hand, D. J. (1981). Discrimination and classification. Chichester, U.K.: Wiley.
[5] Haussler, D. (1990). Decision theoretic generalizations of the PAC learning model. In Algorithmic Learning Theory, eds. S. Arikawa, S. Goto, S. Ohsuga, and T. Yokomori, pp. 21-41. New York: Springer-Verlag.
[6] Huber, P. J. (1981). Robust statistics. New York: Wiley.
[7] John, G., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129. San Mateo, Calif.: Morgan Kaufmann.
[8] Journel, A. G., and Huijbregts, C. J. (1978). Mining geostatistics. London: Academic Press.
[9] Kearns, M. J., and Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, Mass.: MIT Press.
[10] Kononenko, I., and Hong, S. J. (1997). Attribute selection for modeling. Future Generation Computer Systems, this issue.
[11] Lovell, M. C. (1983). Data mining. Review of Economics and Statistics, 65, 1-12.
[12] Mansour, Y. Lecture notes on learning theory. Available from http://
[13] Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (eds.) (1994). Machine learning, neural and statistical classification. Hemel Hempstead, U.K.: Ellis Horwood.
[14] Miller, A. J. (1983). Contribution to the discussion of "Regression, prediction and shrinkage" by J. B. Copas. Journal of the Royal Statistical Society, Series B, 45, 346-347.
[15] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82, 669-710.
[16] Ripley, B. D. (1994). Comment on "Neural networks: a review from a statistical perspective" by B. Cheng and D. M. Titterington. Statistical Science, 9, 45-48.
[17] Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1986). Akaike information criterion statistics. Dordrecht, Holland: Reidel.
[18] Tukey, J. W. (1977). Exploratory data analysis. Reading, Mass.: Addison-Wesley.
[19] Vach, W., Rossner, R., and Schumacher, M. (1996). Neural networks and logistic regression: part II. Computational Statistics and Data Analysis, 21, 683-701.
[20] Valiant, L. (1991). A view of computational learning theory. In Computation and Cognition: Proceedings of the First NEC Research Symposium, 32-51. Philadelphia, Pa.: SIAM.
[21] Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
[22] Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
[23] Vapnik, V. N. (to appear, 1997). Statistical learning theory. New York: Wiley.
[24] Vapnik, V. N., and Chervonenkis, A. Ja. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264-280. Originally published in Doklady Akademii Nauk USSR, 181 (1968).
[25] Vapnik, V. N., and Chervonenkis, A. Ja. (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications, 26, 532-553.
[26] Vapnik, V. N., and Chervonenkis, A. Ja. (1991). The necessary and sufficient conditions for consistency of the method of empirical risk minimization. Pattern Recognition and Image Analysis, 1, 284-305. Originally published in Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting, 2 (1989).
[27] Weisberg, S. (1985). Applied regression analysis, 2nd edn. New York: Wiley.
[28] Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241-259.
More informationExpectations and Future Direction of MOP Guidelines Matthew Newton, Principal Officer Rehabilitation Standards Division of Resources & Energy
Expectations and Future Direction of MOP Guidelines Matthew Newton, Principal Officer Rehabilitation Standards Division of Resources & Energy Mine Rehab Conference 2014 Best Practice Ecological Rehabilitation
More informationPortfolio Using Queuing Theory
Modeling the Number of Insured Households in an Insurance Portfolio Using Queuing Theory Jean-Philippe Boucher and Guillaume Couture-Piché December 8, 2015 Quantact / Département de mathématiques, UQAM.
More informationAN INTRODUCTION TO MATCHING METHODS FOR CAUSAL INFERENCE
AN INTRODUCTION TO MATCHING METHODS FOR CAUSAL INFERENCE AND THEIR IMPLEMENTATION IN STATA Barbara Sianesi IFS Stata Users Group Meeting Berlin, June 25, 2010 1 (PS)MATCHING IS EXTREMELY POPULAR 240,000
More informationThe term structure of Russian interest rates
The term structure of Russian interest rates Stanislav Anatolyev New Economic School, Moscow Sergey Korepanov EvrazHolding, Moscow Corresponding author. Address: Stanislav Anatolyev, New Economic School,
More informationMethodological aspects of small area estimation from the National Electronic Health Records Survey (NEHRS).
Methodological aspects of small area estimation from the National Electronic Health Records Survey (NEHRS. Vladislav Beresovsky National Center for Health Statistics 3311 Toledo Road Hyattsville, MD 078
More informationRisk-minimization for life insurance liabilities
Risk-minimization for life insurance liabilities Francesca Biagini Mathematisches Institut Ludwig Maximilians Universität München February 24, 2014 Francesca Biagini USC 1/25 Introduction A large number
More informationMansun Chan, Xuemei Xi, Jin He, and Chenming Hu
Mansun Chan, Xuemei Xi, Jin He, and Chenming Hu Acknowledgement The BSIM project is partially supported by SRC, CMC, Conexant, TI, Mentor Graphics, and Xilinx BSIM Team: Prof. Chenming Hu, Dr, Jane Xi,
More informationStirling s formula, n-spheres and the Gamma Function
Stirling s formula, n-spheres and the Gamma Function We start by noticing that and hence x n e x dx lim a 1 ( 1 n n a n n! e ax dx lim a 1 ( 1 n n a n a 1 x n e x dx (1 Let us make a remark in passing.
More informationChapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples
More information529 QuickView Ease of Enrollment and Access to Your Client Accounts
529 QuickView TH E S TAT E T R EA SURER Administered by Nevada State Treasurer OFFICE O F Ease of Enrollment and Access to Your Client Accounts 18 64 4 186 DIO ECETES CIVITAS NE VA D A Access the Client
More informationStatistics 305: Introduction to Biostatistical Methods for Health Sciences
Statistics 305: Introduction to Biostatistical Methods for Health Sciences Modelling the Log Odds Logistic Regression (Chap 20) Instructor: Liangliang Wang Statistics and Actuarial Science, Simon Fraser
More informationSOCIETY OF ACTUARIES/CASUALTY ACTUARIAL SOCIETY EXAM C CONSTRUCTION AND EVALUATION OF ACTUARIAL MODELS EXAM C SAMPLE QUESTIONS
SOCIETY OF ACTUARIES/CASUALTY ACTUARIAL SOCIETY EXAM C CONSTRUCTION AND EVALUATION OF ACTUARIAL MODELS EXAM C SAMPLE QUESTIONS Copyright 005 by the Society of Actuaries and the Casualty Actuarial Society
More informationAKRON PUBLIC SCHOOLS CURRICULUM PACING GUIDE 2013-14
GRADE/COURSE: Drawing and Design Semester The student will: Suggested Artworks Suggested Text/ Resources ELA s One- Three Review Elements of Art and Principles of Design as artworks are viewed, discussed
More informationHome Loan Documents Checklist Malaysians Working In Malaysia
Home Loan Documents Checklist Malaysians Working In Malaysia A. EMPLOYMENT NRIC (copy) Vendor /New Sales & Purchase Agreement Latest 3 months pay slip (for Basic Salary)/Latest 6 months pay slip (for Basic
More informationConstant Elasticity of Variance (CEV) Option Pricing Model:Integration and Detailed Derivation
Constant Elasticity of Variance (CEV) Option Pricing Model:Integration and Detailed Derivation Ying-Lin Hsu Department of Applied Mathematics National Chung Hsing University Co-authors: T. I. Lin and C.
More informationErrata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page
Errata for ASM Exam C/4 Study Manual (Sixteenth Edition) Sorted by Page 1 Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page Practice exam 1:9, 1:22, 1:29, 9:5, and 10:8
More informationTHE SVM APPROACH FOR BOX JENKINS MODELS
REVSTAT Statistical Journal Volume 7, Number 1, April 2009, 23 36 THE SVM APPROACH FOR BOX JENKINS MODELS Authors: Saeid Amiri Dep. of Energy and Technology, Swedish Univ. of Agriculture Sciences, P.O.Box
More informationBayesian Networks. Mausam (Slides by UW-AI faculty)
Bayesian Networks Mausam (Slides by UW-AI faculty) Bayes Nets In general, joint distribution P over set of variables (X 1 x... x X n ) requires exponential space for representation & inference BNs provide
More informationANSWERS TO QUESTIONS FOR GROUP LEARNING
Accounting for a 5 Merchandising Business ANSWERS TO QUESTIONS FOR GROUP LEARNING Q5-1 A merchandising business has a major revenue reduction called cost of goods sold. The computation of cost of goods
More informationPacific Journal of Mathematics
Pacific Journal of Mathematics GLOBAL EXISTENCE AND DECREASING PROPERTY OF BOUNDARY VALUES OF SOLUTIONS TO PARABOLIC EQUATIONS WITH NONLOCAL BOUNDARY CONDITIONS Sangwon Seo Volume 193 No. 1 March 2000
More informationClicking on the + will display the courses available for selection. Science Options for Classes of 2018 If you have not yet completed Earth Science Essentials or Biology, please select these for 2015-2016
More informationEXP 481 -- Capital Markets Option Pricing. Options: Definitions. Arbitrage Restrictions on Call Prices. Arbitrage Restrictions on Call Prices 1) C > 0
EXP 481 -- Capital Markets Option Pricing imple arbitrage relations Payoffs to call options Black-choles model Put-Call Parity Implied Volatility Options: Definitions A call option gives the buyer the
More informationDistribution Analysis
Finding the best distribution that explains your data ENMAX Energy Corporation 8 October, 2015 Introduction Introduction Statistical tests Goodness of fit We often fit observations to a model (e.g., lognormal
More informationOn closed-form solutions of a resource allocation problem in parallel funding of R&D projects
Operations Research Letters 27 (2000) 229 234 www.elsevier.com/locate/dsw On closed-form solutions of a resource allocation problem in parallel funding of R&D proects Ulku Gurler, Mustafa. C. Pnar, Mohamed
More informationSolutions to Exercises, Section 4.5
Instructor s Solutions Manual, Section 4.5 Exercise 1 Solutions to Exercises, Section 4.5 1. How much would an initial amount of $2000, compounded continuously at 6% annual interest, become after 25 years?
More informationStochastic programming approaches to pricing in non-life insurance
Stochastic programming approaches to pricing in non-life insurance Martin Branda Charles University in Prague Department of Probability and Mathematical Statistics 11th International Conference on COMPUTATIONAL
More informationStatistik for MPH: 2. 10. september 2015. www.biostat.ku.dk/~pka/mph15. Risiko, relativ risiko, signifikanstest (Silva: 110-133.) Per Kragh Andersen
Statistik for MPH: 2 10. september 2015 www.biostat.ku.dk/~pka/mph15 Risiko, relativ risiko, signifikanstest (Silva: 110-133.) Per Kragh Andersen 1 Fra den. 1 uges statistikundervisning: skulle jeg gerne
More informationENERGY EFFICIENCY METRICS
ENERGY EFFICIENCY METRICS Ian Househam 011 482 5990 ihouseham@iiec.org Overview of South Africa s Energy Efficiency Strategy Energy Efficiency Strategy set sectoral and economy-wide energy efficiency targets
More informationContents. Dedication List of Figures List of Tables. Acknowledgments
Contents Dedication List of Figures List of Tables Foreword Preface Acknowledgments v xiii xvii xix xxi xxv Part I Concepts and Techniques 1. INTRODUCTION 3 1 The Quest for Knowledge 3 2 Problem Description
More informationVoluntary Voting: Costs and Bene ts
Voluntary Voting: Costs and Bene ts Vijay Krishna y and John Morgan z November 7, 2008 Abstract We study strategic voting in a Condorcet type model in which voters have identical preferences but di erential
More informationNaïve Bayes and Hadoop. Shannon Quinn
Naïve Bayes and Hadoop Shannon Quinn http://xkcd.com/ngram-charts/ Coupled Temporal Scoping of Relational Facts. P.P. Talukdar, D.T. Wijaya and T.M. Mitchell. In Proceedings of the ACM International Conference
More informationA POOLING METHODOLOGY FOR COEFFICIENT OF VARIATION
Sankhyā : The Indian Journal of Statistics 1995, Volume 57, Series B, Pt. 1, pp. 57-75 A POOLING METHODOLOGY FOR COEFFICIENT OF VARIATION By S.E. AHMED University of Regina SUMMARY. The problem of estimating
More informationMarketing & Communications
& 1 & Coordinator / Assistant Supports the Department with the coordination and development of reports. May also be required to perform marketing administrative duties. Diploma $1,800-$2,500 & Oversees
More informationPresenter: Sharon S. Yang National Central University, Taiwan
Pricing Non-Recourse Provisions and Mortgage Insurance for Joint-Life Reverse Mortgages Considering Mortality Dependence: a Copula Approach Presenter: Sharon S. Yang National Central University, Taiwan
More informationProbability and Statistics Vocabulary List (Definitions for Middle School Teachers)
Probability and Statistics Vocabulary List (Definitions for Middle School Teachers) B Bar graph a diagram representing the frequency distribution for nominal or discrete data. It consists of a sequence
More informationCourse Syllabus Business Intelligence and CRM Technologies
Course Syllabus Business Intelligence and CRM Technologies August December 2014 IX Semester Rolando Gonzales I. General characteristics Name : Business Intelligence CRM Technologies Code : 06063 Requirement
More informationELY, WILLIAM, M.A. Pricing European Stock Options using Stochastic and Fuzzy Continuous Time Processes. (2012) Directed by Jan Rychtar. 71 pp.
ELY, WILLIAM, M.A. Pricing European Stock Options using Stochastic and Fuzzy Continuous Time Processes. (2012) Directed by Jan Rychtar. 71 pp. Over the past 40 years, much of mathematical nance has been
More informationBSc in Information Technology Degree Programme. Syllabus
BSc in Information Technology Degree Programme Syllabus Semester 1 Title IT1012 Introduction to Computer Systems 30 - - 2 IT1022 Information Technology Concepts 30 - - 2 IT1033 Fundamentals of Programming
More informationImplementing Propensity Score Matching Estimators with STATA
Implementing Propensity Score Matching Estimators with STATA Barbara Sianesi University College London and Institute for Fiscal Studies E-mail: barbara_s@ifs.org.uk Prepared for UK Stata Users Group, VII
More informationQuantity Purchase Agreement With The State Of Indiana
1 of 5 This is an award of a with the Goodyear Tire & Rubber Company for tire and tire services, per RFP 15-041. The vendor agrees to charge these prices for any products ordered on any QPA release received
More information1. Datsenka Dog Insurance Company has developed the following mortality table for dogs:
1 Datsenka Dog Insurance Company has developed the following mortality table for dogs: Age l Age l 0 2000 5 1200 1 1950 6 1000 2 1850 7 700 3 1600 8 300 4 1400 9 0 Datsenka sells an whole life annuity
More informationCollege Algebra. George Voutsadakis 1. LSSU Math 111. Lake Superior State University. 1 Mathematics and Computer Science
College Algebra George Voutsadakis 1 1 Mathematics and Computer Science Lake Superior State University LSSU Math 111 George Voutsadakis (LSSU) College Algebra December 2014 1 / 91 Outline 1 Exponential
More informationThe Impact of Publicly Available Information on Betting Markets: Implications for Bettors, Betting Operators and Regulators
1 The Impact of Publicly Available Information on Betting Markets: Implications for Bettors, Betting Operators and Regulators Ming-Chien Sung and Johnnie Johnson The 6 th European conference on Gambling
More informationAn Empirical Analysis of Sponsored Search Performance in Search Engine Advertising. Anindya Ghose Sha Yang
An Empirical Analysis of Sponsored Search Performance in Search Engine Advertising Anindya Ghose Sha Yang Stern School of Business New York University Outline Background Research Question and Summary of
More informationMissing data and net survival analysis Bernard Rachet
Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,
More informationUsing the SABR Model
Definitions Ameriprise Workshop 2012 Overview Definitions The Black-76 model has been the standard model for European options on currency, interest rates, and stock indices with it s main drawback being
More informationSome Research Problems in Uncertainty Theory
Journal of Uncertain Systems Vol.3, No.1, pp.3-10, 2009 Online at: www.jus.org.uk Some Research Problems in Uncertainty Theory aoding Liu Uncertainty Theory Laboratory, Department of Mathematical Sciences
More informationProject & Programme Management Training Schedule January 2016 - July 2016
Project & Programme Management Training Schedule January 2016 - July 2016 Upper Tier Bundle One Bundle Two PRINCE2 Foundation & Practitioner M_o_R Foundation & Practitioner APMP APM Professional Individual
More informationPractice problems for Homework 11 - Point Estimation
Practice problems for Homework 11 - Point Estimation 1. (10 marks) Suppose we want to select a random sample of size 5 from the current CS 3341 students. Which of the following strategies is the best:
More informationBACKGROUND DISCUSSION
CITY COMMISSION AGENDA MEMO September 24, 2014 FROM: Brian D. Johnson, P.E., City Engineer MEETING: October 7, 2014 SUBJECT: PRESENTER: Award Construction Contract for Stone Valley Addition, Unit Two,
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More information1 Inleiding 1.1 Probleemstelling 1.2 OverKPMGenKPMGITAdvisory 2 OperationeelRisico 2.1 DefinitieenomschrijvingRisico 2.2 DefinitieenomschrijvingOperationeelRisico 2.3 Regelgeving 2.3.1 LossDatabases
More informationThe Sieve Re-Imagined: Integer Factorization Methods
The Sieve Re-Imagined: Integer Factorization Methods by Jennifer Smith A research paper presented to the University of Waterloo in partial fulfillment of the requirement for the degree of Master of Mathematics
More information5.3 Improper Integrals Involving Rational and Exponential Functions
Section 5.3 Improper Integrals Involving Rational and Exponential Functions 99.. 3. 4. dθ +a cos θ =, < a
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationSUP Ann 6R: Persistency Report
SUP Ann 6R: Persistency Report 1. REP003 Persistency Report Nil Return Declaration 2. Persistency Report Life Policies 3. Persistency Report Stakeholder Pensions Financial Conduct Authority REP003 Persistency
More informationImpact of child care support on female labor supply, family income and public finance
Impact of child care support on female labor supply, family income and public finance Nicholas-James Clavet and Jean-Yves Duclos CIRPÉE, Université Laval May 2011 Preliminary please do not quote Abstract
More informationAppendix for Hierarchical Dirichlet Scaling Process for Multi-labeled Data
Appendix for Hierarchical Dirichlet Scaling Process for Multi-labeled Data Dongwoo Kim DW.KIM@KAIST.AC.KR KAIST, Daeeon, Korea Alice Oh ALICE.OH@KAIST.EDU KAIST, Daeeon, Korea This appendix has been provided
More informationBig Data for Law Firms DAMIAN BLACKBURN
Big Data for Law Firms DAMIAN BLACKBURN PUBLISHED BY IN ASSOCIATION WITH Contents Executive summary VII About the author XI Chapter 1: Introduction to big data 1 Factors leading to big data 2 The three
More informationDetail SE Transaction Set Trailer Summary GE Functional Group Trailer Summary IEA Interchange Control Trailer Summary. ISA Interchange Control Header
820 Payment Order / Remittance Advice Segment ID Description Location ISA Interchange Control Header Heading GS Functional Group Header Heading ST Transaction Set Header Heading 1 BPR Beginning Segment
More information3.4 - BJT DIFFERENTIAL AMPLIFIERS
BJT Differential Amplifiers (6/4/00) Page 1 3.4 BJT DIFFERENTIAL AMPLIFIERS INTRODUCTION Objective The objective of this presentation is: 1.) Define and characterize the differential amplifier.) Show the
More information1.5 / 1 -- Communication Networks II (Görg) -- www.comnets.uni-bremen.de. 1.5 Transforms
.5 / -- Communication Networks II (Görg) -- www.comnets.uni-bremen.de.5 Transforms Using different summation and integral transformations pmf, pdf and cdf/ccdf can be transformed in such a way, that even
More informationPRAXIS Pass Rates Fall 2010 through Spring 2013
PRAXIS Pass Rates Fall 2010 through Spring 2013 Program Semester Test # N Percent Comments BS Elementary Education Fall 2010 0710 4 100% 1 was ACT PRAXIS exempt BS Elementary Education Fall 2010 0172 4
More informationOnline Convex Programming and Generalized Infinitesimal Gradient Ascent
Online Convex Programming and Generalized Infinitesimal Gradient Ascent Martin Zinkevich February 003 CMU-CS-03-110 School of Computer Science Carnegie Mellon University Pittsburgh, PA 1513 Abstract Convex
More informationCould your house sale or purchase be affected by Contaminated Land?
Could your house sale or purchase be affected by Contaminated Land? What is Contaminated Land? The legal definition of Contaminated Land, as provided by Part IIA of the Environmental Protection Act 1990,
More informationTwo Correlated Proportions (McNemar Test)
Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with
More informationDATA MINING IN FINANCE
DATA MINING IN FINANCE Advances in Relational and Hybrid Methods by BORIS KOVALERCHUK Central Washington University, USA and EVGENII VITYAEV Institute of Mathematics Russian Academy of Sciences, Russia
More informationThe Fast Convergence of Incremental PCA
The Fast Convergence of Incremental PCA Akshay Balsubramani UC San Diego abalsubr@cs.ucsd.edu Sanjoy Dasgupta UC San Diego dasgupta@cs.ucsd.edu Yoav Freund UC San Diego yfreund@cs.ucsd.edu Abstract We
More informationHow To Invest In Stocks With Options
Applied Options Strategies for Portfolio Managers Gary Trennepohl Oklahoma State University Jim Bittman The Options Institute Session Outline Typical Fund Objectives Strategies for special situations Six
More information