Hand Segmentation Using Learning-Based Prediction and Verification for Hand Sign Recognition

Yuntao Cui and John J. Weng
Department of Computer Science
Michigan State University
East Lansing, MI 48824, USA
E-mail: {cui, weng}@cps.msu.edu

Abstract

This paper presents a prediction-and-verification segmentation scheme using attention images from multiple fixations. A major advantage of this scheme is that it can handle a large number of different deformable objects presented in complex backgrounds. The scheme is also relatively efficient since the segmentation is guided by the past knowledge through a prediction-and-verification scheme. The system has been tested to segment hands in sequences of intensity images, where each sequence represents a hand sign. The experimental result showed a 95% correct segmentation rate with a 3% false rejection rate.

1 Introduction

The ability to interpret hand signs is essential for human-machine interfaces. Recently, there has been a significant amount of research on vision-based hand sign recognition (e.g. [1, 3, 4, 7, 11, 13]). One of the major difficulties faced by the vision-based approach is segmentation of the moving hand from sometimes complex backgrounds. To avoid this problem, some systems rely on markers. Others use restrictive setups such as a uniform background.

In this paper, we present a learning-based approach to perform the task of hand segmentation. When analyzing a temporal sequence, motion is an obvious choice of visual cue for visual attention. If we assume that the object of interest is moving in a stationary environment, it is not very difficult to roughly determine the position of a moving object in the image using motion information. However, it is not simple if the task is to extract the contour of the object from various backgrounds. Several motion segmentation methods have been proposed. These approaches fall into two categories. Approaches in the first category are designed to deal with rigid moving objects (e.g. [2, 8]). This type of approach achieves a segmentation by either building a reference image of the static background [8], or extracting the motion entity based on 3-D motion models or 2-D velocity-field models [2]. The second type of approach fits a shape to deformable moving objects (e.g. [9]). These models typically need a good initial position to converge. They also need a relatively clean background since the external forces are defined by the image gradient.

In order to overcome the difficulties faced by the segmentation methods for deformable objects mentioned above, we have proposed an eigen-subspace learning approach [5]. In that approach, the object was assumed to be positioned in a rectangular attention image together with the background. The attention image went through a reconstruction based on learning, which can reduce the background interference to a certain degree. However, the reconstruction is not able to fully get rid of the background interference.

One attention window from a single fixation cannot solve the segmentation problem completely. As in human vision, multiple fixations are needed. These multiple fixations have a hierarchical structure. As shown in Fig. 1, the first level of fixation concentrates on the entire hand, while the next level of fixation takes care of different parts of the hand. The attention window of the first level fixation usually contains a part of the background. But as we continue zooming in on the object from different fixations, the attention windows come to focus on different parts of the object. One important feature of these attention windows is that they typically contain much less background than the attention window of the first level fixation. These attention images from multiple fixations can be used as important visual cues to segment the object of interest from the input image. In this paper, we present a learning-based approach which efficiently utilizes the attention images obtained from the multiple fixations through a prediction-and-verification scheme to perform the task of hand segmentation.

Figure 1: An illustration of two level fixations of an input hand image. (Panels: input image; attention window of first level fixation; attention windows of second level fixations.)

2 Valid Segmentation

In this section, we define the verifier f to evaluate the segmentation using function interpolation based on training samples. Given an input image, we can construct an attention image of the hand as shown in Fig. 2.

Figure 2: The illustration of constructing attention images. (Panels: input image; extract and scale the hand; attention image.)

2.1 The Most Expressive Features (MEF)

Let an attention image F of m rows and n columns be an (mn)-dimensional vector. For example, the set of image pixels $\{f(i,j) \mid 0 \le i < m,\ 0 \le j < n\}$ can be written as a vector $V = (v_1, v_2, \ldots, v_d)$ where $v_{mi+j} = f(i,j)$ and $d = mn$. Typically d is very large. The Karhunen-Loeve projection [12] is a very efficient way to reduce a high-dimensional space to a low-dimensional subspace. The vectors produced by the Karhunen-Loeve projection are typically called the principal components. We call these vectors the most expressive features (MEF) in that they best describe the sample population in the sense of linear transform [4].

2.2 Approximation as Function Interpolation

After projecting hand attention images to a low-dimensional MEF space, we are now ready to approximate the verifier f using function interpolation.

Definition 1 Given a training vector $X_{k,i}$ of gesture k in the MEF space, a Gaussian basis function $s_i$ is $s_i(X) = e^{-\|X - X_{k,i}\|^2 / \alpha}$, where $\alpha$ is a positive damping factor, and $\|\cdot\|$ denotes the Euclidean distance.

A very small $\alpha$ tends to reduce the contribution of neighboring training samples.

Definition 2 Given a set of n training samples $L_k = \{X_{k,1}, X_{k,2}, \ldots, X_{k,n}\}$ of gesture k, the confidence level that the input X belongs to class k is defined as $g_k(X) = \sum_{i=1}^{n} c_i s_i(X)$, where $s_i$ is a Gaussian basis function and the coefficients $c_i$ are to be determined by the training samples.
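Definitions 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the exponent of the basis function is divided by the damping factor, and it fixes the coefficients by requiring the confidence level to equal 1 at every training sample, solving the resulting linear system by Gauss-Jordan elimination as described in the text that follows. All function names and the toy data are hypothetical.

```python
import math

def gaussian_basis(x, center, alpha):
    """s_i(X) = exp(-||X - X_{k,i}||^2 / alpha); division by alpha is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / alpha)

def gauss_jordan_solve(a, b):
    """Solve the square system a c = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; built from copies so the inputs stay untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular interpolation matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row (above and below the pivot).
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def fit_coefficients(samples, alpha):
    """Choose c_i so that g_k equals 1 at every training sample of gesture k."""
    matrix = [[gaussian_basis(xj, xi, alpha) for xi in samples] for xj in samples]
    return gauss_jordan_solve(matrix, [1.0] * len(samples))

def confidence(x, samples, coeffs, alpha):
    """g_k(X) = sum_i c_i s_i(X)."""
    return sum(c * gaussian_basis(x, xi, alpha) for c, xi in zip(coeffs, samples))

# Toy MEF vectors for one gesture (illustrative values only).
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alpha = 0.5
coeffs = fit_coefficients(samples, alpha)
near = confidence([0.1, 0.0], samples, coeffs, alpha)  # close to a training sample
far = confidence([5.0, 5.0], samples, coeffs, alpha)   # far from all training samples
```

In this toy setting the confidence stays close to 1 near a training sample and decays toward 0 far from all of them, which is the behavior a threshold on the confidence level relies on.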
The coefficients $c_i$ are determined as follows. Given n training samples, we have n equations

$g_k(X_{k,j}) = \sum_{i=1}^{n} c_i s_i(X_{k,j}), \quad j = 1, \ldots, n,$   (1)

which are linear with respect to the coefficients $c_i$. If we set $g_k(X_{k,j})$ equal to 1 for every j, we can solve the above equations for $c_i$ using the Gauss-Jordan elimination method.

The confidence level defined in Definition 2 can be used to verify a segmentation result.

Definition 3 Given a segmentation result S and a confidence level l, the verifier f outputs valid segmentation for gesture k if $g_k(S) > l$.

Intuitively, a segmentation result S is valid if there is a training sample that is sufficiently close to it.

3 Prediction for Valid Segmentation

This section investigates the problem of how to find a valid segmentation. Our approach is to use the attention images from multiple fixations of training hand images. Given a hand attention image, a fixation image is determined by its fixation position (s, t) and a scale r. Fig. 3 shows the attention images of the 19 fixations from one training sample.

Figure 3: The attention images from 19 fixations of a training sample. The first one is the same as the original hand attention image.

3.1 Overview

Given a training set, we obtain a set of attention images from multiple fixations for each image in the set. Each attention image from a fixation is associated with the segmentation mask of the original hand attention image, the scale r and the position of the fixation (s, t). This information is necessary to recover the segmentation for the entire object.

During the segmentation stage, we first use the motion information to select visual attention. Then, we try different fixations on the input image. An attention image from a fixation of an input image is used to query the training set. The segmentation mask associated with the query result is the prediction. The predicted segmentation mask is then applied to the input image. Finally, we verify the segmentation result to see if the extracted subimage corresponds to a hand gesture that has been learned. If the answer is yes, we have found the solution. This solution can further go through a refinement process. Fig. 4 gives the outline of the scheme.

3.2 Organization of Attention Images from Fixations

In order to achieve a fast retrieval, we build a hierarchical structure to organize the data.

Definition 4 A hierarchical quasi-Voronoi diagram P of S is a set of partitions $P = \{P_1, P_2, \ldots, P_m\}$, where every $P_i = \{P_{i,1}, \ldots, P_{i,n_i}\}$, $i = 1, 2, \ldots, m$, is a partition of S. $P_{i+1} = \{P_{i+1,1}, \ldots, P_{i+1,n_{i+1}}\}$ is a finer Voronoi diagram partition of $P_i$ in the sense that corresponding to every element $P_{i,k} \in P_i$, $P_{i+1}$ contains a Voronoi partition $\{P_{i+1,s}, \ldots, P_{i+1,t}\}$ of $P_{i,k}$.

Figure 5: A 2-D illustration of a hierarchical quasi-Voronoi diagram.

The graphic description in Fig. 5 gives a simplified but intuitive explanation of the hierarchical quasi-Voronoi diagram. The structure is a tree. The root corresponds to the entire space of all the possible inputs. The children of the root partition the space into large cells, as shown by thick lines in Fig. 5. The children of a parent subdivide the parent's cell further into smaller cells, and so on.

3.3 Prediction as Querying the Training Set

Given a training set L, a hierarchical quasi-Voronoi diagram $P = \{P_1, P_2, \ldots, P_n\}$ corresponding to L and a query sample X, the prediction problem is to find a training sample $X' \in L$ such that $\|X - X'\| \le \|X - X''\|$ for any $X'' \in L$ with $X'' \ne X'$. This type of query is a nearest neighbor problem, also known as the post-office problem [10]. Efficient solutions are still lacking for the case with dimension higher than three. In this section, we present an efficient algorithm for the case where the training set is d-supportive (Definition 5).
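As a point of reference for the query just defined, the prediction step can be sketched as an exhaustive nearest-neighbor scan. This is not the hierarchical algorithm of this section; it is the linear-time baseline that a hierarchical search is meant to speed up. The record layout follows the association described in Section 3.1 (mask, scale r, position (s, t)), but the class name, field names, and data are hypothetical.

```python
import math
from typing import NamedTuple, Sequence, Tuple

class FixationRecord(NamedTuple):
    features: Sequence[float]   # training attention image projected to the MEF space
    mask_id: int                # index of the segmentation mask of the source hand image
    scale: float                # fixation scale r
    position: Tuple[int, int]   # fixation position (s, t)

def predict(query, training_set):
    """Return the training record whose features are nearest to the query (Euclidean)."""
    return min(training_set, key=lambda rec: math.dist(query, rec.features))

# Toy training set (illustrative values only).
training_set = [
    FixationRecord([0.0, 0.0], mask_id=0, scale=1.0, position=(0, 0)),
    FixationRecord([3.0, 4.0], mask_id=1, scale=0.5, position=(8, 8)),
    FixationRecord([-2.0, 1.0], mask_id=2, scale=0.5, position=(8, 24)),
]
best = predict([2.5, 3.5], training_set)
```

The returned record carries the mask, scale, and position needed to place the predicted segmentation on the input image before it is passed to the verifier.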
Definition 5 Let S be a set which contains all possible samples. A training set $L = \{L_1, L_2, \ldots, L_n\}$ is a d-supportive training set if for any test sample $X \in S$, there exists i such that $\|X - L_i\| < d$, where $\|\cdot\|$ is the Euclidean distance.

The next two theorems show how to prune the search paths when the training set is d-supportive.

Theorem 1 We have a d-supportive training set $L = \{L_1, L_2, \ldots, L_n\}$, a hierarchical quasi-Voronoi diagram $P = \{P_1, P_2, \ldots, P_n\}$ corresponding to L and a query sample $X \in S$. Let the ith partition be $P_i = \{P_{i,1}, P_{i,2}, \ldots, P_{i,n_i}\}$ and $C = \{C_1, C_2, \ldots, C_{n_i}\}$ be the corresponding centers of regions in $P_i$. Assume that $C_1$ is the center closest to X, i.e., $\|C_1 - X\| \le \|C_i - X\|$ for any $i \ne 1$. Let $C_2$ be any other center and $P_1$ be a boundary hyperplane between the regions represented by $C_1$ and $C_2$ as illustrated in Fig. 6. Then the region of $C_2$ does not contain the nearest training sample to X if the distance between X and the hyperplane $P_1$ is greater than d.

Figure 6: A 2D illustration of nearest neighbor query theorems. (Labels: $C_1$, $C_2$, X, boundary hyperplane, $P_1$, $P_2$, M, m, a, b, d, e, f.)

In order to avoid calculating the point-to-hyperplane distance in a high-dimensional space, we can use the following equivalent theorem.

Theorem 2 Let $\|C_1 - C_2\| = r$, $f = r/2$, $e = r/2 - d$, $\|C_1 - X\| = a$ and $\|C_2 - X\| = b$ as shown in Fig. 6. The region of $C_2$ does not contain the nearest training sample to X if $a^2 - e^2 < b^2 - f^2$.

For the proofs of Theorem 1 and Theorem 2, the reader is referred to [6].

4 Experiments

We have applied our segmentation scheme to the task of hand segmentation in the experiments. The number of gestures we used in our experiment is 40. These gestures have appeared in the signs which have been used
to test the hand sign recognition system [4]. They are illustrated in Fig. 7. The size of the attention window used in the experiment is 32 × 32 pixels.

Figure 4: Overview of the segmentation scheme. (Flowchart labels: input sequence; motion based visual attention; extractor; attention images from multiple fixations; predictor; recalled mask; index; gesture 1, ..., gesture k, ..., gesture n; information for gesture k; information needed by the approximate function (e.g., coefficients); verifier; Confident? (yes/no); refinement; discard.)

Figure 7: 40 hand gestures used in the experiment.

4.1 Training

Two types of training were conducted in the experiments. The first type of training was to get the approximation for the verifier f, which would be used later to check the validity of the segmentation. For each gesture, between 27 and 36 training samples were used to obtain the approximation of the verifier f for that gesture. Given a set of training samples $L = \{X_1, X_2, \ldots, X_n\}$ for gesture k, we empirically determined the damping factor $\alpha$ in the interpolation function as follows:

$\alpha = 0.2 \, \frac{\sum_{i=1}^{n-1} \|X_i - X_{i+1}\|}{n-1}.$   (2)

The second type of training was to generate the attention images from multiple fixations of training samples. In the current implementation, the selection of the fixations is mechanical. In total, 19 fixations were used for each training sample as shown in Fig. 3. The attention images with more than 30% background pixels present in the attention window were discarded. The total number of training attention images is 1742.

4.2 Hand Segmentation

The trained system was tested on the segmentation task from a temporal sequence of intensity images. Each sequence represents a complete hand sign. Fig. 8(a) shows two sample sequences.

In order to speed up the process of the segmentation, we utilize motion information to find a motion attention window. The attention algorithm can detect the rough position of a moving object, but the accuracy is not guaranteed, as shown in Fig. 8(b). We solve this problem by doing some limited search based on the motion attention window. In the current implementation, given a motion attention window with m rows and n columns, we try the candidates with size from (0.5m, 0.5n) to (2m, 2n) using step size (0.5m, 0.5n).

Figure 8: The samples of the experimental results. (a) The input testing sequences; (b) the results of motion-based visual attention are shown using dark rectangles; (c) the results of the segmentation are shown after masking off the background.

We tested the system with 802 images (161 sequences) which were not used in the training. A result was rejected if the system could not find a valid segmentation with a confidence level l. The segmentation was considered correct if the correct gesture segmentation C was retrieved and placed in the right position of the test image. For the case of l = 0.2, we have achieved a 95% correct segmentation rate with a 3% false rejection rate. Fig. 8(c) shows some segmentation results. We summarize the experimental results in Table 1. The time was obtained on an SGI-INDIGO2 workstation.

Table 1: Summary of the experimental data

Number of test images   Correct segmentation   False rejection   Time per image
805                     95%                    3%                58.3 sec.

5 Conclusions and Future Work

A segmentation scheme using attention images from multiple fixations is presented in this paper. The major advantage of this scheme is that it can handle a large number of different deformable objects presented in various complex backgrounds. The scheme is also relatively efficient since the search of the segmentation is guided by the past knowledge through a prediction-and-verification scheme.

In the current implementation, the fixations are generated mechanically. The number of fixations and the positions of fixations are the same regardless of the types of gestures. This is not very efficient. Some gestures may be so simple that a few fixations are enough to recognize them. Nevertheless, in order to achieve the optimal performance, different gestures may require different positions of fixations. In the future, we plan to investigate the generation of the fixations also based on learning. The previous fixations are used to guide the next action. The next action could be (a) termination of the process of generating fixations if the gesture has already been recognized; or (b) finding the appropriate position for the next fixation.
Acknowledgements

The authors would like to thank Yu Zhong, Kal Rayes, Doug Neal, and Valerie Bolster for making themselves available for the experiments. This work was supported in part by NSF grant No. IRI 9410741 and ONR grant No. N00014-95-1-0637.

References

[1] A. Bobick and A. Wilson, "A state-based technique for the summarization and recognition of gesture", in Proc. 5th Int'l Conf. Computer Vision, pp. 382-388, Boston, 1995.
[2] P. Bouthemy and E. Francois, "Motion segmentation and qualitative dynamic scene analysis from an image sequence", in International Journal of Computer Vision, vol. 10, pp. 157-182, 1993.
[3] R. Cipolla, Y. Okamoto and Y. Kuno, "Robust structure from motion using motion parallax", in IEEE Conf. Computer Vision and Pattern Recog., pp. 374-382, 1993.
[4] Y. Cui, D. Swets and J. Weng, "Learning-based hand sign recognition using SHOSLIF-M", in Proc. 5th Int'l Conf. Computer Vision, pp. 631-636, Boston, 1995.
[5] Y. Cui and J. Weng, "2D object segmentation from fovea images based on eigen-subspace learning", in Proc. IEEE Int'l Symposium on Computer Vision, Coral Gables, FL, Nov. 20-22, 1995.
[6] Y. Cui and J. Weng, "A learning-based prediction-and-verification segmentation scheme for hand sign image sequences", Technical Report CPS-95-43, Computer Science Department, Michigan State University, Dec. 1995.
[7] T. Darrell and A. Pentland, "Space-time gestures", in IEEE Conf. Computer Vision and Pattern Recog., pp. 335-340, 1993.
[8] G. W. Donohoe, D. R. Hush and N. Ahmed, "Change detection for target detection and classification in video sequences", in Proc. Int'l Conf. Acoust., Speech, Signal Processing, pp. 1084-1087, 1988.
[9] M. Kass, A. Witkin and D. Terzopoulos, "Snakes: active contour models", in Proc. 1st ICCV, pp. 259-268, 1987.
[10] D. Knuth, The Art of Computer Programming III: Sorting and Searching, Addison-Wesley, Reading, Mass., 1973.
[11] J. J. Kuch and T. S. Huang, "Vision based hand modeling and tracking", in Proc. International Conference on Computer Vision, June 1995.
[12] M. M. Loeve, Probability Theory, Princeton, NJ: Van Nostrand, 1955.
[13] T. E. Starner and A. Pentland, "Visual recognition of American Sign Language using hidden Markov models", in Proc. International Workshop on Automatic Face- and Gesture-Recognition, June 1995.