The EM Algorithm for Mixtures of Factor Analyzers

Zoubin Ghahramani
Geoffrey E. Hinton

Department of Computer Science
University of Toronto
6 King's College Road
Toronto, Canada M5S 1A4
Email: zoubin@cs.toronto.edu

Technical Report CRG-TR-96-1
May 21, 1996 (revised Feb 27, 1997)

Abstract

Factor analysis, a statistical method for modeling the covariance structure of high dimensional data using a small number of latent variables, can be extended by allowing different local factor models in different regions of the input space. This results in a model which concurrently performs clustering and dimensionality reduction, and can be thought of as a reduced dimension mixture of Gaussians. We present an exact Expectation-Maximization algorithm for fitting the parameters of this mixture of factor analyzers.

1 Introduction

Clustering and dimensionality reduction have long been considered two of the fundamental problems in unsupervised learning (Duda & Hart, 1973; chapter 6). In clustering, the goal is to group data points by similarity between their features. Conversely, in dimensionality reduction, the goal is to group (or compress) features that are highly correlated. In this paper we present an EM learning algorithm for a method which combines one of the basic forms of dimensionality reduction, factor analysis, with a basic method for clustering, the Gaussian mixture model. What results is a statistical method which concurrently performs clustering and, within each cluster, local dimensionality reduction.

Local dimensionality reduction presents several benefits over a scheme in which clustering and dimensionality reduction are performed separately. First, different features may be correlated within different clusters and thus the metric for dimensionality reduction may need to vary between different clusters. Conversely, the metric induced in dimensionality reduction may guide the process of cluster formation, i.e. different clusters may appear more separated depending on the local metric.

Recently, there has been a great deal of research on the topic of local dimensionality reduction, resulting in several variants on the basic concept with successful applications to character and face recognition (Bregler and Omohundro, 1994; Kambhatla and Leen, 1994; Sung and Poggio, 1994; Schwenk and Milgram, 1995; Hinton et al., 1995). The algorithm used by these authors for dimensionality reduction is principal components analysis (PCA).
PCA, unlike maximum likelihood factor analysis (FA), does not define a proper density model for the data, as the cost of coding a data point is equal anywhere along the principal component subspace (i.e. the density is un-normalized along these directions). Furthermore, PCA is not robust to independent noise in the features of the data (see Hinton et al., 1996, for a comparison of PCA and FA models). Hinton, Dayan, and Revow (1996), also exploring an application to digit recognition, were the first to extend mixtures of principal components analyzers to a mixture of factor analyzers. Their learning algorithm consisted of an outer loop of approximate EM to fit the mixture components, combined with an inner loop of gradient descent to fit each individual factor model. In this note we present an exact EM algorithm for mixtures of factor analyzers which obviates the need for an outer and inner loop. This simplifies the implementation, reduces the number of heuristic parameters (i.e. learning rates or steps of conjugate gradient descent), and can potentially result in speed-ups.

In the next section we present background material on factor analysis and the EM algorithm. This is followed by the derivation of the learning algorithm for mixtures of factor analyzers in section 3. We close with a discussion in section 4.

2 Factor Analysis

In maximum likelihood factor analysis (FA), a p-dimensional real-valued data vector $x$ is modeled using a k-dimensional vector of real-valued factors, $z$, where $k$ is generally much smaller than $p$ (Everitt, 1984). The generative model is given by:

    x = \Lambda z + u,    (1)

where $\Lambda$ is known as the factor loading matrix (see Figure 1). The factors $z$ are assumed to be $N(0, I)$ distributed (zero-mean independent normals, with unit variance). The p-dimensional random variable $u$ is distributed $N(0, \Psi)$, where $\Psi$ is a diagonal matrix. The diagonality of $\Psi$ is one of the key assumptions of factor analysis: the observed variables are independent given the factors. According to this model, $x$ is therefore distributed with zero mean and covariance $\Lambda \Lambda' + \Psi$, and the goal of factor analysis is to find the $\Lambda$ and $\Psi$ that best model the covariance structure of $x$. The factor variables $z$ model correlations between the elements of $x$, while the $u$ variables account for independent noise in each element of $x$.

[Figure 1: The factor analysis generative model (in vector form): the factors $z$ are mapped through $\Lambda$ to the observation $x$.]
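To make the generative process concrete, here is a minimal Python sketch (our own illustration, not part of the original report; the dimensions and parameter values are arbitrary) that samples data from model (1) and checks the implied covariance $\Lambda\Lambda' + \Psi$:

    import numpy as np

    rng = np.random.default_rng(0)
    p, k, n = 5, 2, 100000            # observed dim., no. of factors, sample size
    Lam = rng.normal(size=(p, k))     # factor loading matrix (arbitrary example)
    psi = rng.uniform(0.1, 0.5, p)    # diagonal of Psi (independent sensor noise)

    Z = rng.normal(size=(n, k))                 # z ~ N(0, I)
    U = rng.normal(size=(n, p)) * np.sqrt(psi)  # u ~ N(0, Psi), Psi diagonal
    X = Z @ Lam.T + U                           # x = Lambda z + u, equation (1)

    # The sample covariance of x should approach Lambda Lambda' + Psi:
    print(np.abs(np.cov(X.T) - (Lam @ Lam.T + np.diag(psi))).max())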
The $k$ factors play the same role as the principal components in PCA: they are informative projections of the data. Given $\Lambda$ and $\Psi$, the expected value of the factors can be computed through the linear projection:

    E(z|x) = \beta x,    (2)

where $\beta \equiv \Lambda' (\Psi + \Lambda\Lambda')^{-1}$, a fact that results from the joint normality of data and factors:

    P\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda' + \Psi & \Lambda \\ \Lambda' & I \end{bmatrix} \right).    (3)

Note that since $\Psi$ is diagonal, the $p \times p$ matrix $(\Psi + \Lambda\Lambda')$ can be efficiently inverted using the matrix inversion lemma:

    (\Psi + \Lambda\Lambda')^{-1} = \Psi^{-1} - \Psi^{-1}\Lambda(I + \Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'\Psi^{-1},

where $I$ is the $k \times k$ identity matrix. Furthermore, it is possible (and in fact necessary for EM) to compute the second moment of the factors,

    E(zz'|x) = \mathrm{Var}(z|x) + E(z|x)E(z|x)'
             = I - \beta\Lambda + \beta x x' \beta',    (4)

which provides a measure of uncertainty in the factors, a quantity that has no analogue in PCA.

The expectations (2) and (4) form the basis of the EM algorithm for maximum likelihood factor analysis (see Appendix A and Rubin & Thayer, 1982):

E-step: Compute $E(z|x_i)$ and $E(zz'|x_i)$ for each data point $x_i$, given $\Lambda$ and $\Psi$.

M-step:

    \Lambda^{new} = \left( \sum_{i=1}^n x_i E(z|x_i)' \right) \left( \sum_{l=1}^n E(zz'|x_l) \right)^{-1}    (5)

    \Psi^{new} = \frac{1}{n} \mathrm{diag}\left\{ \sum_{i=1}^n x_i x_i' - \Lambda^{new} E[z|x_i] x_i' \right\},    (6)

where the diag operator sets all the off-diagonal elements of a matrix to zero.
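In Python (NumPy), one EM iteration can be written compactly. The sketch below is our own minimal illustration under the zero-mean assumption of this section, not the authors' released Matlab code; it uses the matrix inversion lemma so that only a $k \times k$ matrix is ever inverted:

    import numpy as np

    def fa_em_step(X, Lam, psi):
        """One EM iteration for factor analysis, equations (2), (4), (5), (6).
        X: (n, p) zero-mean data; Lam: (p, k) loadings; psi: (p,) diagonal of Psi."""
        n, p = X.shape
        k = Lam.shape[1]
        # E-step. By the matrix inversion lemma, beta = Lam'(Psi + Lam Lam')^{-1}
        # simplifies to (I + Lam' Psi^{-1} Lam)^{-1} Lam' Psi^{-1}.
        PsiInvLam = Lam / psi[:, None]                        # Psi^{-1} Lam
        beta = np.linalg.solve(np.eye(k) + Lam.T @ PsiInvLam, PsiInvLam.T)
        Ez = X @ beta.T                                       # E[z|x_i], eq. (2)
        # sum_i E[zz'|x_i] = n (I - beta Lam) + beta (sum_i x_i x_i') beta', eq. (4)
        sum_Ezz = n * (np.eye(k) - beta @ Lam) + Ez.T @ Ez
        # M-step.
        Lam_new = np.linalg.solve(sum_Ezz, Ez.T @ X).T        # eq. (5)
        psi_new = np.diag(X.T @ X - Lam_new @ (Ez.T @ X)) / n # eq. (6)
        return Lam_new, psi_new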
3 Mixture of Factor Analyzers

Assume we have a mixture of $m$ factor analyzers indexed by $\omega_j$, $j = 1, \ldots, m$. The generative model now obeys the following mixture distribution (see Figure 2):

    P(x) = \sum_{j=1}^m \int P(x|z, \omega_j) P(z|\omega_j) P(\omega_j) \, dz.    (7)

As in regular factor analysis, the factors are all assumed to be $N(0, I)$ distributed, therefore,

    P(z|\omega_j) = P(z) = N(0, I).    (8)

[Figure 2: The mixture of factor analyzers generative model: the mixture indicator $\omega_j$ and the factors $z$ jointly generate the observation $x$ through $\Lambda_j$ and $\mu_j$.]

Whereas in factor analysis the data mean was irrelevant and was subtracted before fitting the model, here we have the freedom to give each factor analyzer a different mean, $\mu_j$, thereby allowing each to model the data covariance structure in a different part of the input space,

    P(x|z, \omega_j) = N(\mu_j + \Lambda_j z, \Psi).    (9)

The parameters of this model are $\{(\mu_j, \Lambda_j)_{j=1}^m, \pi, \Psi\}$;^1 the vector $\pi$ parametrizes the adaptable mixing proportions, $\pi_j = P(\omega_j)$. The latent variables in this model are the factors $z$ and the mixture indicator variable $\omega$, where $w_j = 1$ when the data point was generated by $\omega_j$. For the E-step of the EM algorithm, one needs to compute expectations of all the interactions of the hidden variables that appear in the log likelihood. Fortunately, the following statements can be easily verified:

    E[w_j z|x_i] = E[w_j|x_i] E[z|\omega_j, x_i]    (10)

    E[w_j zz'|x_i] = E[w_j|x_i] E[zz'|\omega_j, x_i].    (11)

Defining

    h_{ij} = E[w_j|x_i] \propto P(x_i, \omega_j) = \pi_j N(x_i - \mu_j; \Lambda_j\Lambda_j' + \Psi)    (12)

and using equations (2) and (10) we obtain

    E[w_j z|x_i] = h_{ij} \beta_j (x_i - \mu_j),    (13)

where $\beta_j \equiv \Lambda_j' (\Psi + \Lambda_j\Lambda_j')^{-1}$. Similarly, using equations (4) and (11) we obtain

    E[w_j zz'|x_i] = h_{ij} \left( I - \beta_j\Lambda_j + \beta_j (x_i - \mu_j)(x_i - \mu_j)' \beta_j' \right).    (14)

The EM algorithm for mixtures of factor analyzers therefore becomes:

E-step: Compute $h_{ij}$, $E[z|x_i, \omega_j]$ and $E[zz'|x_i, \omega_j]$ for all data points $i$ and mixture components $j$.

M-step: Solve a set of linear equations for $\Lambda_j$, $\mu_j$, $\pi_j$ and $\Psi$ (see Appendix B).

The mixture of factor analyzers is, in essence, a reduced dimensionality mixture of Gaussians. Each factor analyzer fits a Gaussian to a portion of the data, weighted by the posterior probabilities, $h_{ij}$. Since the covariance matrix for each Gaussian is specified through the lower dimensional factor loading matrices, the model has $mkp + p$, rather than $mp(p+1)/2$, parameters dedicated to modeling covariance structure.

^1 Note that each model can also be allowed to have a separate $\Psi$ matrix. This, however, changes its interpretation as sensor noise.
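Below is a minimal Python sketch of the E-step just described (equations (12)-(14)). It is our own illustrative code with hypothetical names, not the authors' Matlab implementation; it returns the responsibilities and the per-component posterior moments needed by the M-step of Appendix B:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mfa_e_step(X, mus, Lams, psi, pis):
        """E-step for a mixture of factor analyzers, equations (12)-(14).
        X: (n, p) data; mus: (m, p) means; Lams: list of m (p, k) loading
        matrices; psi: (p,) diagonal of Psi; pis: (m,) mixing proportions."""
        n, p = X.shape
        m = len(Lams)
        h = np.zeros((n, m))    # responsibilities h_ij, eq. (12)
        Ez, Ezz = [], []        # per-component posterior moments of the factors
        for j in range(m):
            k = Lams[j].shape[1]
            cov_j = Lams[j] @ Lams[j].T + np.diag(psi)   # Lam_j Lam_j' + Psi
            h[:, j] = pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=cov_j)
            beta_j = np.linalg.solve(cov_j, Lams[j]).T   # beta_j = Lam_j' cov_j^{-1}
            Ez.append((X - mus[j]) @ beta_j.T)           # E[z|x_i, w_j], cf. eq. (13)
            # E[zz'|x_i, w_j] = I - beta_j Lam_j + beta_j d_i d_i' beta_j', cf. eq. (14)
            Ezz.append(np.eye(k) - beta_j @ Lams[j]
                       + np.einsum('ik,il->ikl', Ez[-1], Ez[-1]))
        h /= h.sum(axis=1, keepdims=True)                # normalize over components
        return h, Ez, Ezz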
4 Discussion

We have described an EM algorithm for fitting a mixture of factor analyzers. Matlab source code for the algorithm can be obtained from ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz. An extension of this architecture to time series data, in which both the factors $z$ and the discrete variables $\omega$ depend on their value at a previous time step, is currently being developed.

One of the important issues not addressed in this note is model selection. In fitting a mixture of factor analyzers the modeler has two free parameters to decide: the number of factor analyzers to use ($m$), and the number of factors in each analyzer ($k$). One method by which these can be selected is cross-validation: several values of $m$ and $k$ are fit to the data and the log likelihood on a validation set is used to select the final values. Greedy methods based on pruning or growing the mixture may be more efficient at the cost of some performance loss. Alternatively, a full-fledged Bayesian analysis, in which these model parameters are integrated over, may also be possible.

Acknowledgements

We thank C. Bishop for comments on the manuscript. The research was funded by grants from the Canadian Natural Science and Engineering Research Council and the Ontario Information Technology Research Center. GEH is the Nesbitt-Burns fellow of the Canadian Institute for Advanced Research.

A EM for Factor Analysis

The expected log likelihood for factor analysis is

    Q = E\left[ \log \prod_i (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left\{ -\frac{1}{2} [x_i - \Lambda z]' \Psi^{-1} [x_i - \Lambda z] \right\} \right]
      = c - \frac{n}{2} \log|\Psi| - \sum_i E\left[ \frac{1}{2} x_i' \Psi^{-1} x_i - x_i' \Psi^{-1} \Lambda z + \frac{1}{2} z' \Lambda' \Psi^{-1} \Lambda z \right]
      = c - \frac{n}{2} \log|\Psi| - \sum_i \left( \frac{1}{2} x_i' \Psi^{-1} x_i - x_i' \Psi^{-1} \Lambda E[z|x_i] + \frac{1}{2} \mathrm{tr}\left[ \Lambda' \Psi^{-1} \Lambda E[zz'|x_i] \right] \right),

where $c$ is a constant, independent of the parameters, and tr is the trace operator.

To re-estimate the factor loading matrix we set

    \frac{\partial Q}{\partial \Lambda} = \sum_i \Psi^{-1} x_i E[z|x_i]' - \sum_l \Psi^{-1} \Lambda^{new} E[zz'|x_l] = 0,

obtaining

    \Lambda^{new} \left( \sum_l E[zz'|x_l] \right) = \sum_i x_i E[z|x_i]',
from which we get equation (5).

We re-estimate the matrix $\Psi$ through its inverse, setting

    \frac{\partial Q}{\partial \Psi^{-1}} = \frac{n}{2} \Psi^{new} - \sum_i \left[ \frac{1}{2} x_i x_i' - \Lambda^{new} E[z|x_i] x_i' + \frac{1}{2} \Lambda^{new} E[zz'|x_i] \Lambda^{new\,\prime} \right] = 0.

Substituting equation (5) into the last term,

    \Psi^{new} = \frac{2}{n} \sum_i \left[ \frac{1}{2} x_i x_i' - \Lambda^{new} E[z|x_i] x_i' + \frac{1}{2} x_i E[z|x_i]' \Lambda^{new\,\prime} \right],

and using the diagonal constraint (the two cross terms, being transposes of one another, have the same diagonal),

    \Psi^{new} = \frac{1}{n} \mathrm{diag}\left\{ \sum_i x_i x_i' - \Lambda^{new} E[z|x_i] x_i' \right\}.

B EM for Mixture of Factor Analyzers

The expected log likelihood for the mixture of factor analyzers is

    Q = E\left[ \log \prod_i \prod_j \left[ (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left\{ -\frac{1}{2} [x_i - \mu_j - \Lambda_j z]' \Psi^{-1} [x_i - \mu_j - \Lambda_j z] \right\} \right]^{w_j} \right].

To jointly estimate the mean $\mu_j$ and the factor loadings $\Lambda_j$ it is useful to define an augmented column vector of factors

    \tilde{z} = \begin{bmatrix} z \\ 1 \end{bmatrix}

and an augmented factor loading matrix $\tilde{\Lambda}_j = [\Lambda_j \ \mu_j]$. The expected log likelihood is then

    Q = E\left[ \log \prod_i \prod_j \left[ (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left\{ -\frac{1}{2} [x_i - \tilde{\Lambda}_j \tilde{z}]' \Psi^{-1} [x_i - \tilde{\Lambda}_j \tilde{z}] \right\} \right]^{w_j} \right]
      = c - \frac{n}{2} \log|\Psi| - \sum_{i,j} \left( \frac{1}{2} h_{ij} x_i' \Psi^{-1} x_i - h_{ij} x_i' \Psi^{-1} \tilde{\Lambda}_j E[\tilde{z}|x_i, \omega_j] + \frac{1}{2} h_{ij} \mathrm{tr}\left[ \tilde{\Lambda}_j' \Psi^{-1} \tilde{\Lambda}_j E[\tilde{z}\tilde{z}'|x_i, \omega_j] \right] \right),

where $c$ is a constant. To estimate $\tilde{\Lambda}_j$ we set

    \frac{\partial Q}{\partial \tilde{\Lambda}_j} = \sum_i h_{ij} \Psi^{-1} x_i E[\tilde{z}|x_i, \omega_j]' - \sum_i h_{ij} \Psi^{-1} \tilde{\Lambda}_j^{new} E[\tilde{z}\tilde{z}'|x_i, \omega_j] = 0.

This results in a linear equation for re-estimating the means and factor loadings,

    \left[ \Lambda_j^{new} \ \ \mu_j^{new} \right] = \tilde{\Lambda}_j^{new} = \left( \sum_i h_{ij} x_i E[\tilde{z}|x_i, \omega_j]' \right) \left( \sum_l h_{lj} E[\tilde{z}\tilde{z}'|x_l, \omega_j] \right)^{-1},    (15)
where

    E[\tilde{z}|x_i, \omega_j] = \begin{bmatrix} E[z|x_i, \omega_j] \\ 1 \end{bmatrix}

and

    E[\tilde{z}\tilde{z}'|x_l, \omega_j] = \begin{bmatrix} E[zz'|x_l, \omega_j] & E[z|x_l, \omega_j] \\ E[z|x_l, \omega_j]' & 1 \end{bmatrix}.

We re-estimate the matrix $\Psi$ through its inverse, setting

    \frac{\partial Q}{\partial \Psi^{-1}} = \frac{n}{2} \Psi^{new} - \sum_{ij} \left[ \frac{1}{2} h_{ij} x_i x_i' - h_{ij} \tilde{\Lambda}_j^{new} E[\tilde{z}|x_i, \omega_j] x_i' + \frac{1}{2} h_{ij} \tilde{\Lambda}_j^{new} E[\tilde{z}\tilde{z}'|x_i, \omega_j] \tilde{\Lambda}_j^{new\,\prime} \right] = 0.

Substituting equation (15) for $\tilde{\Lambda}_j$ and using the diagonal constraint on $\Psi$ we obtain,

    \Psi^{new} = \frac{1}{n} \mathrm{diag}\left\{ \sum_{ij} h_{ij} \left( x_i - \tilde{\Lambda}_j^{new} E[\tilde{z}|x_i, \omega_j] \right) x_i' \right\}.    (16)

Finally, to re-estimate the mixing proportions we use the definition,

    \pi_j = P(\omega_j) = \int P(\omega_j|x) P(x) \, dx.

Since $h_{ij} = P(\omega_j|x_i)$, using the empirical distribution of the data as an estimate of $P(x)$ we get

    \pi_j^{new} = \frac{1}{n} \sum_{i=1}^n h_{ij}.
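To complement the E-step sketch in section 3, here is a matching Python M-step implementing equations (15) and (16) and the update for $\pi_j$. As before, this is a minimal sketch under our own (hypothetical) conventions rather than the released Matlab implementation; it consumes the quantities returned by the E-step sketch above:

    import numpy as np

    def mfa_m_step(X, h, Ez, Ezz):
        """M-step for a mixture of factor analyzers, equations (15) and (16).
        X: (n, p); h: (n, m) responsibilities; Ez[j]: (n, k) = E[z|x_i, w_j];
        Ezz[j]: (n, k, k) = E[zz'|x_i, w_j]."""
        n, p = X.shape
        m = h.shape[1]
        Lams_new, mus_new = [], []
        psi_sum = np.zeros(p)
        for j in range(m):
            k = Ez[j].shape[1]
            # Augmented factors z~ = [z; 1] and their posterior moments.
            Ez_t = np.hstack([Ez[j], np.ones((n, 1))])           # E[z~|x_i, w_j]
            S = np.zeros((k + 1, k + 1))                         # sum_i h_ij E[z~ z~']
            S[:k, :k] = np.einsum('i,ikl->kl', h[:, j], Ezz[j])
            S[:k, k] = S[k, :k] = h[:, j] @ Ez[j]
            S[k, k] = h[:, j].sum()
            XEz = (X * h[:, [j]]).T @ Ez_t                       # sum_i h_ij x_i E[z~]'
            Lam_t = np.linalg.solve(S, XEz.T).T                  # eq. (15)
            Lams_new.append(Lam_t[:, :k])                        # new Lambda_j
            mus_new.append(Lam_t[:, k])                          # new mu_j
            # Accumulate the diagonal of eq. (16).
            psi_sum += np.einsum('ip,ip->p', h[:, [j]] * (X - Ez_t @ Lam_t.T), X)
        psi_new = psi_sum / n                                    # diagonal of new Psi
        pi_new = h.mean(axis=0)                                  # pi_j = (1/n) sum_i h_ij
        return Lams_new, mus_new, psi_new, pi_new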
References

Bregler, C. and Omohundro, S. M. (1994). Surface learning with applications to lip-reading. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 43-50. Morgan Kaufmann Publishers, San Francisco, CA.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.

Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman and Hall, London.

Hinton, G., Revow, M., and Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 1015-1022. MIT Press, Cambridge, MA.

Hinton, G. E., Dayan, P., and Revow, M. (1996). Modeling the manifolds of images of handwritten digits. Submitted for publication.

Kambhatla, N. and Leen, T. K. (1994). Fast non-linear dimension reduction. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 152-159. Morgan Kaufmann Publishers, San Francisco, CA.

Rubin, D. and Thayer, D. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1):69-76.

Schwenk, H. and Milgram, M. (1995). Transformation invariant autoassociation with application to handwritten character recognition. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 991-998. MIT Press, Cambridge, MA.

Sung, K.-K. and Poggio, T. (1994). Example-based learning for view-based human face detection. MIT AI Memo 1521, CBCL Paper 112.