Multi-View Regression via Canonical Correlation Analysis

Sham M. Kakade¹ and Dean P. Foster²
¹ Toyota Technological Institute at Chicago, Chicago, IL 60637
² University of Pennsylvania, Philadelphia, PA 19104

Abstract. In the multi-view regression problem, we have a regression problem where the input variable (which is a real vector) can be partitioned into two different views, where it is assumed that either view of the input is sufficient to make accurate predictions; this is essentially (a significantly weaker version of) the co-training assumption for the regression problem. We provide a semi-supervised algorithm which first uses unlabeled data to learn a norm (or, equivalently, a kernel) and then uses labeled data in a ridge regression algorithm (with this induced norm) to provide the predictor. The unlabeled data is used via canonical correlation analysis (CCA, which is closely related to PCA for two random variables) to derive an appropriate norm over functions. We are able to characterize the intrinsic dimensionality of the subsequent ridge regression problem (which uses this norm) by the correlation coefficients provided by CCA in a rather simple expression. Interestingly, the norm used by the ridge regression algorithm is derived from CCA, unlike in standard kernel methods where a special a priori norm is assumed (i.e. a Banach space is assumed). We discuss how this result shows that unlabeled data can decrease the sample complexity.

1 Introduction

Extracting information relevant to a task in an unsupervised (or semi-supervised) manner is one of the fundamental challenges in machine learning; the underlying question is how unlabeled data can be used to improve performance. In the multi-view approach to semi-supervised learning [Yarowsky, 1995, Blum and Mitchell, 1998], one assumes that the input variable x can be split into two different views (x^{(1)}, x^{(2)}), such that good predictors based on each view tend to agree. Roughly speaking, the common underlying multi-view assumption is that the best predictor from either view has a low error, so the best predictors tend to agree with each other. There are many applications where this underlying assumption is applicable. For example, consider object recognition with pictures from different camera angles: we expect a predictor based on either angle to have good performance. One can even
consider multi-modal views, e.g. identity recognition where the task might be to identify a person with one view being a video stream and the other an audio stream; each of these views would be sufficient to determine the identity. In NLP, an example would be a paired document corpus, consisting of a document and its translation into another language. The motivating example in Blum and Mitchell [1998] is a web-page classification task, where one view was the text in the page and the other was the hyperlink structure.

A characteristic of many of the multi-view learning algorithms [Yarowsky, 1995, Blum and Mitchell, 1998, Farquhar et al., 2005, Sindhwani et al., 2005, Brefeld et al., 2006] is to force agreement between the predictors based on either view. The idea is to force a predictor h^{(1)}(·), based on view one, to agree with a predictor h^{(2)}(·), based on view two, i.e. by constraining h^{(1)}(x^{(1)}) to usually equal h^{(2)}(x^{(2)}). The intuition is that the complexity of the learning problem should be reduced by eliminating hypotheses from each view that do not agree with each other (which can be done using unlabeled data).

This paper studies the multi-view, linear regression case: the inputs x^{(1)} and x^{(2)} are real vectors; the outputs y are real valued; the samples ((x^{(1)}, x^{(2)}), y) are jointly distributed; and the prediction of y is linear in the input x. Our first contribution is to explicitly formalize a multi-view assumption for regression. The multi-view assumption we use is a regret based one, where we assume that the best linear predictor from each view is roughly as good as the best linear predictor based on both views. Denote the (expected) squared loss of a prediction function g(x) by loss(g). More precisely, the multi-view assumption is that

    loss(f^{(1)}) − loss(f) ≤ ε
    loss(f^{(2)}) − loss(f) ≤ ε

where f^{(ν)} is the best linear predictor based on view ν ∈ {1, 2} and f is the best linear predictor based on both views (so f^{(ν)} is a linear function of x^{(ν)} and f is a linear function of x = (x^{(1)}, x^{(2)})). This assumption implies that (only on average) the predictors must agree (shown in Lemma 1). Clearly, if both optimal predictors f^{(1)} and f^{(2)} have small error, then this assumption is satisfied, though this precondition is not necessary. This (average) agreement is explicitly used in the co-regularized least squares algorithms of Sindhwani et al. [2005], Brefeld et al. [2006], which directly constrain such an agreement in a least squares optimization problem.

This assumption is rather weak in comparison to previous assumptions [Blum and Mitchell, 1998, Dasgupta et al., 2001, Abney, 2004]. Our assumption can be viewed as weakening the original co-training assumption (for the classification case). First, our assumption is stated in terms of expected errors only and implies only expected approximate agreement (see Lemma 1). Second, our assumption is only in terms of regret; we do not require that the loss of any predictor be small. Lastly, we make no further distributional assumptions (aside from a bounded second moment on the output variable), such as the commonly used, overly-stringent assumption that the distribution of the views be conditionally
independent given the label [Blum and Mitchell, 1998, Dasgupta et al., 2001, Abney, 2004]. Balcan and Blum [2006] provide a compatibility notion which also relaxes this latter assumption, though it is unclear if this compatibility notion (defined for the classification setting) easily extends to the assumption above.

Our main result provides an algorithm and an analysis under the above multi-view regression assumption. The algorithm used can be thought of as a ridge regression algorithm with regularization based on a norm that is determined by canonical correlation analysis (CCA). Intuitively, CCA [Hotelling, 1935] is an unsupervised method for analyzing jointly distributed random vectors. In our setting, CCA can be performed with the unlabeled data. We characterize the expected regret of our multi-view algorithm, in comparison to the best linear predictor, as a sum of a bias and a variance term: the bias is 4ε, so it is small if the multi-view assumption is good; and the variance is d/n, where n is the sample size and d is the intrinsic dimensionality, which we show to be the sum of the squares of the correlation coefficients provided by CCA. The notion of intrinsic dimensionality we use is related to that of Zhang [2005], which provides a notion of intrinsic dimensionality for kernel methods.

An interesting aspect of our setting is that no a priori assumptions are made about any special norm over the space of linear predictions, unlike in kernel methods which a priori impose a Banach space over predictors. In fact, our multi-view assumption is coordinate free: the assumption is stated in terms of the best linear predictor for the given linear subspaces, which has no reference to any coordinate system. Furthermore, no a priori assumptions about the dimensionality of our spaces are made, thus being applicable to infinite dimensional methods, including kernel methods. In fact, kernel CCA methods have been developed in Hardoon et al. [2004].

The remainder of the paper is organized as follows. Section 2 formalizes our multi-view assumption and reviews CCA. Section 3 presents the main results, where the bias-variance tradeoff and the intrinsic dimensionality are characterized. The Discussion expands on a number of points. The foremost issue addressed is how the multi-view assumption, with unlabeled data, could potentially allow a significant reduction in the sample size. Essentially, in the high (or infinite) dimensional case, the multi-view assumption imposes a norm which could coincide with a much lower intrinsic dimensionality. In the Discussion, we also examine two related multi-view learning algorithms: the SVM-2K algorithm of Farquhar et al. [2005] and the co-regularized least squares regression algorithm of Sindhwani et al. [2005].

2 Preliminaries

The first part of this section presents the multi-view regression setting and formalizes the multi-view assumption. As is standard, we work with a distribution D(x, y) over input-output pairs. To abstract away the difficulties of analyzing the use of a random unlabeled set sampled from D(x), we instead assume that
the second order statistics of x are known. The transductive setting and the fixed design setting (which we discuss later in Section 3) are cases where this assumption is satisfied. The second part of this section reviews CCA.

2.1 Regression with Multiple Views

Assume that the input space X is a subset of a real linear space, which is of either finite dimension (i.e. X ⊂ R^d) or countably infinite dimension. Also assume that each x ∈ X is in ℓ₂ (i.e. x is a square summable sequence). In the multi-view framework, assume each x has the form x = (x^{(1)}, x^{(2)}), where x^{(1)} and x^{(2)} are interpreted as the two views of x. Hence, x^{(1)} is an element of a real linear space X^{(1)} and x^{(2)} is in a real linear space X^{(2)} (and both x^{(1)} and x^{(2)} are in ℓ₂). Conceptually, we should think of these spaces as being high dimensional (or countably infinite dimensional).

We also have outputs y that are in R, along with a joint distribution D(x, y) over X × R. We assume that the second moment of the output is bounded by 1, i.e. E[y² | x] ≤ 1; it is not required that y itself be bounded. No boundedness assumptions on x ∈ X are made, since these assumptions would have no impact on our analysis, as it is only the subspace defined by X that is relevant. We also assume that our algorithm has knowledge of the second order statistics of D(x), i.e. we assume that the covariance matrix of x is known. In both the transductive setting and the fixed design setting, such an assumption holds. This is discussed in more detail in Section 3.

The loss function considered for g : X → R is the average squared error. More formally,

    loss(g) = E[(g(x) − y)²]

where the expectation is with respect to (x, y) sampled from D. We are also interested in the losses for predictors g^{(1)} : X^{(1)} → R and g^{(2)} : X^{(2)} → R, based on the different views, which are just loss(g^{(ν)}) for ν ∈ {1, 2}.

The following assumption is made throughout the paper.

Assumption 1 (Multi-View Assumption) Define L(Z) to be the space of linear mappings from a linear space Z to the reals and define:

    f^{(1)} = argmin_{g ∈ L(X^{(1)})} loss(g)
    f^{(2)} = argmin_{g ∈ L(X^{(2)})} loss(g)
    f = argmin_{g ∈ L(X)} loss(g)

which exist since X is a subset of ℓ₂. The multi-view assumption is that

    loss(f^{(ν)}) − loss(f) ≤ ε

for ν ∈ {1, 2}.

Note that this assumption makes no reference to any coordinate system or norm over the linear functions. Also, it is not necessarily assumed that the losses
themselves are small. However, if loss(f^{(ν)}) is small for ν ∈ {1, 2}, say less than ε, then it is clear that the above assumption is satisfied.

The following lemma shows that the above assumption implies that f^{(1)} and f^{(2)} tend to agree on average.

Lemma 1. Assumption 1 implies that:

    E[(f^{(1)}(x^{(1)}) − f^{(2)}(x^{(2)}))²] ≤ 4ε

where the expectation is with respect to x sampled from D.

The proof is provided in the Appendix. As mentioned in the Introduction, this agreement is explicitly used in the co-regularized least squares algorithms of Sindhwani et al. [2005], Brefeld et al. [2006].

2.2 CCA and the Canonical Basis

A useful basis is that provided by CCA, which we define as the canonical basis.

Definition 1. Let B^{(1)} be a basis of X^{(1)} and B^{(2)} be a basis of X^{(2)}. Let x_1^{(ν)}, x_2^{(ν)}, ... be the coordinates of x^{(ν)} in B^{(ν)}. The pair of bases B^{(1)} and B^{(2)} are the canonical bases if the following holds (where the expectation is with respect to D):

1. Orthogonality Conditions:

    E[x_i^{(ν)} x_j^{(ν)}] = 1 if i = j, and 0 otherwise

2. Correlation Conditions:

    E[x_i^{(1)} x_j^{(2)}] = λ_i if i = j, and 0 otherwise

where, without loss of generality, it is assumed that 1 ≥ λ_i ≥ 0 and that 1 ≥ λ_1 ≥ λ_2 ≥ .... The i-th canonical correlation coefficient is defined as λ_i.

Roughly speaking, the joint covariance matrix of x = (x^{(1)}, x^{(2)}) in the canonical basis has a particularly structured form: the individual covariance matrices of x^{(1)} and x^{(2)} are just identity matrices, and the cross covariance matrix between x^{(1)} and x^{(2)} is diagonal. CCA can also be specified as an eigenvalue problem³ (see Hardoon et al. [2004] for a review).

³ CCA finds such a basis as follows. The correlation coefficient between two (jointly distributed) real values z and z′ is defined as corr(z, z′) = E[zz′] / √(E[z²]E[z′²]). Let Π_a x be the projection of x onto direction a. The first canonical basis vectors b_1^{(1)} ∈ B^{(1)} and b_1^{(2)} ∈ B^{(2)} are the unit length directions a and b which maximize corr(Π_a x^{(1)}, Π_b x^{(2)}), and the corresponding canonical correlation coefficient λ_1 is this maximal correlation. Inductively, the next pair of directions can be found which maximize the correlation subject to the pair being orthogonal to the previously found pairs.
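Although the paper works at the level of population statistics, the construction above is easy to prototype. The following is a minimal sketch (not the authors' code; names such as cca_basis are hypothetical) that computes an empirical canonical basis via the standard whitening-plus-SVD formulation of CCA, which is equivalent to the inductive characterization in the footnote. It assumes numpy, centered unlabeled data matrices X1 and X2 whose rows are paired samples of the two views, and a small regularization term for numerical stability.

```python
import numpy as np

def cca_basis(X1, X2, reg=1e-8):
    """Empirical canonical basis via whitening + SVD of the cross-covariance.

    X1, X2: centered unlabeled data matrices (rows are paired samples).
    Returns maps A, B into the canonical coordinates and the canonical
    correlation coefficients lam (the lambda_i of Definition 1).
    """
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / n

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, lam, Vt = np.linalg.svd(W1 @ S12 @ W2)
    A = W1 @ U       # x^(1) @ A gives the canonical coordinates of view 1
    B = W2 @ Vt.T    # x^(2) @ B gives the canonical coordinates of view 2
    return A, B, lam
```

In the transductive and fixed design settings discussed next, the relevant second order statistics are known exactly rather than estimated, so the resulting coordinates satisfy the orthogonality and correlation conditions of Definition 1 (up to the regularization term).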
3 Learning

Now let us assume we have observed a training sample T = {(x_m^{(ν)}, y_m)}_{m=1}^n of size n from a view ν, where the samples are drawn independently from D. We also assume that our algorithm has access to the covariance matrix of x, so that the algorithm can construct the canonical basis. Our goal is to construct an estimator f̂^{(ν)} of f^{(ν)} (recall f^{(ν)} is the best linear predictor using only view ν) such that the regret

    loss(f̂^{(ν)}) − loss(f^{(ν)})

is small.

Remark 1. (The Transductive and Fixed Design Settings) There are two natural settings where this assumption of knowledge about the second order statistics of x holds: the random transductive case and the fixed design case. In both cases, X is a known finite set. In the random transductive case, the distribution D is assumed to be uniform over X, so each x_m is sampled uniformly from X and each y_m is sampled from D(y | x_m). In the fixed design case, assume that each x ∈ X appears exactly once in T and again y_m is sampled from D(y | x_m). The fixed design case is commonly studied in statistics and is also referred to as signal reconstruction.⁴ The covariance matrix of x is clearly known in both cases.

3.1 A Shrinkage Estimator (via Ridge Regression)

Let the representation of our estimator f̂^{(ν)} in the canonical basis B^{(ν)} be

    f̂^{(ν)}(x^{(ν)}) = Σ_i β̂_i^{(ν)} x_i^{(ν)}        (1)

where x_i^{(ν)} is the i-th coordinate of x^{(ν)} in B^{(ν)}. Define the canonical shrinkage estimator of β_i^{(ν)} as:

    β̂_i^{(ν)} = λ_i Ê[x_i y] = (λ_i / n) Σ_m x_{m,i}^{(ν)} y_m        (2)

Intuitively, the shrinkage by λ_i down-weights directions that are less correlated with the other view. In the extreme case, this estimator ignores the uncorrelated coordinates, those where λ_i = 0. The following remark shows how this estimator has a natural interpretation in the fixed design setting: it is the result of ridge regression with a specific norm (induced by CCA) over functions in L(X^{(ν)}).

⁴ In the fixed design case, one can view each y_m = f(x_m) + η, where η is 0 mean noise, so f(x_m) is the conditional mean. After observing a sample {(x_m^{(ν)}, y_m)}_{m=1}^{|X|} for all x ∈ X (so n = |X|), the goal is to reconstruct f(·) accurately.
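In the canonical basis, Equation 2 is a one-line computation on the labeled sample. Below is a minimal sketch (hypothetical helper names, not from the paper), assuming numpy, a matrix Z whose rows are the n labeled inputs of view ν expressed in the canonical basis (e.g. Z = X1_labeled @ A using the CCA sketch above), the label vector y, and the vector lam of canonical correlation coefficients for those coordinates; coordinates with λ_i = 0 may be dropped, since the estimator ignores them anyway.

```python
import numpy as np

def canonical_shrinkage(Z, y, lam):
    """Equation 2: beta_hat_i = lambda_i * Ehat[x_i y]."""
    n = Z.shape[0]
    emp = Z.T @ y / n        # Ehat[x_i y] for each canonical coordinate i
    return lam * emp         # shrink coordinate i by lambda_i

def predict(Z_new, beta_hat):
    """Evaluate f_hat on new points given in the same canonical basis."""
    return Z_new @ beta_hat
```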
Remark 2. (Canonical Ridge Regression) We now specify a ridge regression algorithm for which the shrinkage estimator is the solution. Define the canonical norm for a linear function in L(X^{(ν)}) as follows: using the representation of a function f^{(ν)} in B^{(ν)} as in Equation 1 (with weights β^{(ν)}), the canonical norm of f^{(ν)} is defined as:

    ‖f^{(ν)}‖_CCA = √( Σ_i ((1 − λ_i)/λ_i) (β_i^{(ν)})² )        (3)

where we overload notation and write ‖f^{(ν)}‖_CCA = ‖β^{(ν)}‖_CCA. Hence, functions which have large weights in the less correlated directions (those with small λ_i) have larger norms. Equipped with this norm, the functions in L(X^{(ν)}) define a Banach space. In the fixed design setting, the ridge regression algorithm with this norm chooses the β^{(ν)} which minimizes:

    (1/|X|) Σ_{m=1}^{|X|} (y_m − β^{(ν)} · x_m^{(ν)})² + ‖β^{(ν)}‖²_CCA

Recall that in the fixed design setting, we have a training example for each x ∈ X, so the sum is over all x ∈ X. It is straightforward to show (by using orthogonality) that the estimator which minimizes this loss is the canonical shrinkage estimator defined above. In the more general transductive case, it is not quite this estimator, since the sampled points {x_m^{(ν)}}_m may not be orthogonal in the training sample (they are only orthogonal when summed over all x ∈ X). However, in this case, we expect that the estimator provided by ridge regression is approximately equal to the shrinkage estimator.

We now state the first main theorem.

Theorem 1. Assume that E[y² | x] ≤ 1 and that Assumption 1 holds. Let f̂^{(ν)} be the estimator constructed with the canonical shrinkage estimator (Equation 2) on training set T. For ν ∈ {1, 2},

    E_T[loss(f̂^{(ν)})] − loss(f^{(ν)}) ≤ 4ε + (Σ_i λ_i²)/n

where the expectation is with respect to the training set T sampled according to D^n.

We comment on obtaining high probability bounds in the Discussion. The proof (presented in Section 3.3) shows that the 4ε results from the bias in the algorithm and the (Σ_i λ_i²)/n results from the variance. It is natural to interpret Σ_i λ_i² as the intrinsic dimensionality. Note that Assumption 1 implies that:

    E_T[loss(f̂^{(ν)})] − loss(f) ≤ 5ε + (Σ_i λ_i²)/n

where the comparison is to the best linear predictor f over both views.
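As a concrete check of Remark 2, the following hypothetical fixed design example (synthetic data, not from the paper) verifies numerically that ridge regression with the penalty Σ_i ((1 − λ_i)/λ_i)(β_i)² on an orthonormal design recovers the canonical shrinkage estimator of Equation 2, which is the estimator analyzed in Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200

# Orthonormal fixed design: (1/n) X^T X = I, playing the role of the
# orthogonality conditions of the canonical basis.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q
lam = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # assumed canonical correlations
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

emp = X.T @ y / n                           # Ehat[x_i y]
ridge = np.linalg.solve(X.T @ X / n + np.diag((1 - lam) / lam), emp)
shrink = lam * emp                          # Equation 2
print(np.allclose(ridge, shrink))           # True
```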
Remark 3. (Intrinsic Dimensionality) Let β̂^{(ν)} be a linear estimator in the vector of sampled outputs, Y = (y_1, y_2, ..., y_n). Note that the previous shrinkage estimator is such a linear estimator (in the fixed design case). We can write β̂^{(ν)} = PY where P is a linear operator. Zhang [2005] defines tr(PᵀP) as the intrinsic dimensionality, where tr(·) is the trace operator. This was motivated by the fact that in the fixed design setting the error drops as tr(PᵀP)/n, which is bounded by d/n in a finite dimensional space. Zhang [2005] then goes on to analyze the intrinsic dimensionality of kernel methods in the random design setting (obtaining high probability bounds). In our setting, the sum Σ_i λ_i² is precisely this trace, as P is a diagonal matrix with entries λ_i.

3.2 A (Possibly) Lower Dimensional Estimator

Consider the thresholded estimator:

    β̂_i^{(ν)} = Ê[x_i y] if λ_i ≥ 1 − √ε, and 0 otherwise        (4)

where again Ê[x_i y] is the empirical expectation (1/n) Σ_m x_{m,i}^{(ν)} y_m. This estimator uses an unbiased estimator of β_i^{(ν)} for those i with large λ_i and thresholds to 0 for those i with small λ_i. Hence, the estimator lives in a finite dimensional space (determined by the number of λ_i which are greater than 1 − √ε).

Theorem 2. Assume that E[y² | x] ≤ 1 and that Assumption 1 holds. Let d be the number of i for which λ_i ≥ 1 − √ε. Let f̂^{(ν)} be the estimator constructed with the thresholded estimator (Equation 4) on training set T. For ν ∈ {1, 2},

    E_T[loss(f̂^{(ν)})] − loss(f^{(ν)}) ≤ 4√ε + d/n

where the expectation is with respect to the training set T sampled according to D^n.

Essentially, the above increases the bias to 4√ε and (potentially) decreases the variance. Such a bound may be useful if we desire to explicitly keep β̂^{(ν)} in a lower dimensional space; in contrast, the explicit dimensionality of the shrinkage estimator could be as large as |X|.
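The thresholded estimator of Equation 4 admits an equally short sketch (again hypothetical, with Z, y, and lam as in the shrinkage sketch above and eps the regret bound ε of Assumption 1); the number of retained coordinates is the d appearing in Theorem 2.

```python
import numpy as np

def thresholded_estimator(Z, y, lam, eps):
    """Equation 4: keep the unbiased estimate Ehat[x_i y] when
    lambda_i >= 1 - sqrt(eps); set all other coordinates to zero."""
    n = Z.shape[0]
    emp = Z.T @ y / n
    keep = lam >= 1.0 - np.sqrt(eps)
    return np.where(keep, emp, 0.0)
```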
3.3 The Bias-Variance Tradeoff

This section provides lemmas for the proofs of the previous theorems. We characterize the bias-variance tradeoff in this error analysis. First, a key technical lemma is useful, for which the proof is provided in the Appendix.

Lemma 2. Let the representation of the best linear predictor f^{(ν)} (defined in Assumption 1) in the canonical basis B^{(ν)} be

    f^{(ν)}(x^{(ν)}) = Σ_i β_i^{(ν)} x_i^{(ν)}        (5)

Assumption 1 implies that

    Σ_i (1 − λ_i)(β_i^{(ν)})² ≤ 4ε

for ν ∈ {1, 2}.

This lemma shows how the weights (of an optimal linear predictor) cannot be too large in coordinates with small canonical correlation coefficients. This is because for those coordinates with small λ_i, the corresponding β_i must be small enough so that the bound is not violated. This lemma provides the technical motivation for our algorithms.

Now let us review some useful properties of the square loss. Using the representations of f̂^{(ν)} and f^{(ν)} defined in Equations 1 and 5, a basic fact for the square loss with linear predictors is that

    loss(f̂^{(ν)}) − loss(f^{(ν)}) = ‖β̂^{(ν)} − β^{(ν)}‖₂²

where ‖x‖₂² = Σ_i x_i². The expected regret can be decomposed as follows:

    E_T[‖β̂^{(ν)} − β^{(ν)}‖₂²] = ‖E_T[β̂^{(ν)}] − β^{(ν)}‖₂² + E_T[‖β̂^{(ν)} − E_T[β̂^{(ν)}]‖₂²]        (6)
                               = ‖E_T[β̂^{(ν)}] − β^{(ν)}‖₂² + Var(β̂^{(ν)})        (7)

where the first term is the bias and the second, Var(β̂^{(ν)}) = Σ_i Var(β̂_i^{(ν)}), is the variance. The proofs of Theorems 1 and 2 follow directly from the next two lemmas.

Lemma 3. (Bias-Variance for the Shrinkage Estimator) Under the preconditions of Theorem 1, the bias is bounded as:

    ‖E_T[β̂^{(ν)}] − β^{(ν)}‖₂² ≤ 4ε

and the variance is bounded as:

    Var(β̂^{(ν)}) ≤ (Σ_i λ_i²)/n

Proof. It is straightforward to see that:

    β_i^{(ν)} = E[x_i y]

which implies that

    E_T[β̂_i^{(ν)}] = λ_i β_i^{(ν)}
Hence, for the bias term, we have:

    ‖E_T[β̂^{(ν)}] − β^{(ν)}‖₂² = Σ_i (1 − λ_i)²(β_i^{(ν)})² ≤ Σ_i (1 − λ_i)(β_i^{(ν)})² ≤ 4ε

For the variance, we have:

    Var(β̂_i^{(ν)}) = (λ_i²/n) Var(x_i^{(ν)} y) ≤ (λ_i²/n) E[(x_i^{(ν)} y)²] = (λ_i²/n) E[(x_i^{(ν)})² E[y² | x]] ≤ (λ_i²/n) E[(x_i^{(ν)})²] = λ_i²/n

The proof is completed by summing over i.

Lemma 4. (Bias-Variance for the Thresholded Estimator) Under the preconditions of Theorem 2, the bias is bounded as:

    ‖E_T[β̂^{(ν)}] − β^{(ν)}‖₂² ≤ 4√ε

and the variance is bounded as:

    Var(β̂^{(ν)}) ≤ d/n

Proof. For those i such that λ_i ≥ 1 − √ε,

    E_T[β̂_i^{(ν)}] = β_i^{(ν)}

Let j be the index at which the thresholding begins to occur, i.e. it is the smallest integer such that λ_j < 1 − √ε. Using that, for i ≥ j, we have 1 < (1 − λ_j)/√ε ≤
(1 − λ_i)/√ε, so the bias can be bounded as follows:

    ‖E_T[β̂^{(ν)}] − β^{(ν)}‖₂² = Σ_{i ≥ j} (E_T[β̂_i^{(ν)}] − β_i^{(ν)})²
                               = Σ_{i ≥ j} (β_i^{(ν)})²
                               ≤ Σ_{i ≥ j} ((1 − λ_i)/√ε) (β_i^{(ν)})²
                               ≤ (1/√ε) Σ_i (1 − λ_i)(β_i^{(ν)})²
                               ≤ 4√ε

where the last step uses Lemma 2. Analogous to the previous proof, for each i < j, we have:

    Var(β̂_i^{(ν)}) ≤ 1/n

and there are d such i.

4 Discussion

Why does unlabeled data help? Theorem 1 shows that the regret drops at a uniform rate (down to 4ε). This rate is the intrinsic dimensionality, Σ_i λ_i², divided by the sample size n. Note that this intrinsic dimensionality is only a property of the input distribution. Without the multi-view assumption (or working in the single view case), the rate at which our error drops is governed by the extrinsic dimensionality of x, which could be large (or countably infinite), making this rate very slow without further assumptions. It is straightforward to see that the intrinsic dimensionality is no greater than the extrinsic dimensionality (since each λ_i is bounded by 1), though it could be much less. The knowledge of the covariance matrix of x allows us to compute the CCA basis and construct the shrinkage estimator, which has the improved convergence rate based on the intrinsic dimensionality. Such second order statistical knowledge can be provided by the unlabeled data, such as in the transductive and fixed design settings.

Let us compare to a ridge regression algorithm (in the single view case), where one a priori chooses a norm for regularization (such as an RKHS norm imposed by a kernel). As discussed in Zhang [2005], this regularization governs the bias-variance tradeoff. The regularization can significantly decrease the variance: the variance drops as d/n, where d is a notion of intrinsic dimensionality defined in Zhang [2005]. However, the regularization also biases the algorithm towards predictors with small norm, and there is no a priori reason that there exists a good predictor with a bounded norm (under the pre-specified norm). In order to obtain a reasonable convergence rate, it must also be the case that the best predictor (or a good one) has a small norm under our pre-specified norm.
In contrast, in the multi-view case, the multi-view assumption implies that the bias is bounded: recall that Lemma 3 showed that the bias was bounded by 4ε. Essentially, our proof shows that the bias induced by using the special norm induced by CCA (in Equation 3) is small. Now it may be the case that we have a priori knowledge of what a good norm is. However, learning the norm (or learning the kernel) is an important open question. The multi-view setting provides one solution to this problem.

Can the bias be decreased to 0 asymptotically? Theorem 1 shows that the error drops down to 4ε for large n. It turns out that we cannot drive this bias to 0 asymptotically without further assumptions, as the input space could be infinite dimensional.

On obtaining high probability bounds. Clearly, stronger assumptions are needed than just a bounded second moment to obtain high probability bounds with concentration properties. For the fixed design setting, if y is bounded, then it is straightforward to obtain high probability bounds through standard Chernoff arguments. For the random transductive case, this assumption is not sufficient; this is due to the additional randomness from x. Note that we cannot artificially impose a bound on x, as the algorithm only depends on the subspace spanned by X, so upper bounds have no meaning; note that the algorithm scales X such that it has an identity covariance matrix (e.g. E[x_i²] = 1). However, if we have a higher moment bound, say on the ratio E[x_i⁴]/E[x_i²], then the Bennett bound can be used to obtain data dependent high probability bounds, though providing these is beyond the scope of this paper.

Related work. The most closely related multi-view learning algorithms are the SVM-2K algorithm of Farquhar et al. [2005] and the co-regularized least squares regression algorithm of Sindhwani et al. [2005]. Roughly speaking, both of these algorithms try to find two hypotheses, h^{(1)}(·) based on view one and h^{(2)}(·) based on view two, which both have low training error and which tend to agree with each other on unlabeled data, where the latter condition is enforced by constraining h^{(1)}(x^{(1)}) to usually equal h^{(2)}(x^{(2)}) on an unlabeled data set.

The SVM-2K algorithm considers a classification setting, and the algorithm attempts to force agreement between the two hypotheses with slack variable style constraints, common to SVM algorithms. While this algorithm is motivated by kernel CCA and SVMs, the algorithm does not directly use kernel CCA, in contrast to our algorithm, where CCA naturally provides a coordinate system. The theoretical analysis in Farquhar et al. [2005] argues that the Rademacher complexity of the hypothesis space is reduced due to the agreement constraint between the two views.

The multi-view approach to regression has been previously considered in Sindhwani et al. [2005]. Here, they specify a co-regularized least squares regression algorithm, which is a ridge regression algorithm with an additional penalty
term which forces the two predictions, from both views, to agree. A theoretical analysis of this algorithm is provided in Rosenberg and Bartlett [2007], which shows that the Rademacher complexity of the hypothesis class is reduced by forcing agreement.

Both of these previous analyses do not explicitly state a multi-view assumption, so it is hard to directly compare the results. In our setting, the multi-view regret is explicitly characterized by ε. In a rather straightforward manner (without appealing to Rademacher complexities), we have shown that the rate at which the regret drops to 4ε is determined by the intrinsic dimensionality. Furthermore, both of these previous algorithms use an a priori specified norm over their class of functions (induced by an a priori specified kernel), and the Rademacher complexities (which are used to bound the convergence rates) depend on this norm. In contrast, our framework assumes no norm; the norm over functions is imposed by the correlation structure between the two views.

We should also note that there are close connections to those unsupervised learning algorithms which attempt to maximize relevant information. The Imax framework of Becker and Hinton [1992], Becker [1996] attempts to maximize the information between two views x^{(1)} and x^{(2)}, for which CCA is a special case (in a continuous version). Subsequently, the information bottleneck provided a framework for capturing the mutual information between two signals [Tishby et al., 1999]. Here, the goal is to compress a signal x^{(1)} such that it captures relevant information about another signal x^{(2)}. The framework here is unsupervised, as there is no specific supervised task at hand. For the case in which the joint distribution of x^{(1)} and x^{(2)} is Gaussian, Chechik et al. [2003] completely characterize the compression tradeoffs for capturing the mutual information between these two signals; CCA provides the coordinate system for this compression. In our setting, we do not explicitly care about the mutual information between x^{(1)} and x^{(2)}; performance is judged only by performance at the task at hand, namely our loss when predicting some other variable y. However, as we show, it turns out that these unsupervised mutual information maximizing algorithms provide appropriate intuition for multi-view regression, as they result in CCA as a basis.

Acknowledgements

We thank the anonymous reviewers for their helpful comments.

References

Steven Abney. Understanding the Yarowsky algorithm. Computational Linguistics, 30(3):365-395, 2004.

Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In Semi-Supervised Learning, pages 111-126. MIT Press, 2006.

S. Becker. Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 1996.
Suzanna Becker and Geoffrey E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161-163, January 1992. doi: 10.1038/355161a0.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT '98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92-100, New York, NY, USA, 1998. ACM Press.

Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel. Efficient co-regularised least squares regression. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 137-144, New York, NY, USA, 2006. ACM Press.

G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables, 2003. URL citeseer.ist.psu.edu/article/chechik03information.html.

Sanjoy Dasgupta, Michael L. Littman, and David McAllester. PAC generalization bounds for co-training, 2001.

Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor, and Sándor Szedmák. Two view learning: SVM-2K, theory and practice. In NIPS, 2005.

David R. Hardoon, Sandor R. Szedmak, and John R. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639-2664, 2004.

H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 1935.

D. Rosenberg and P. Bartlett. The Rademacher complexity of co-regularized kernel classes. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.

V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularized approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.

N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999. URL citeseer.ist.psu.edu/tishby99information.html.

David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Morristown, NJ, USA, 1995. Association for Computational Linguistics.

Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077-2098, 2005.

5 Appendix

We now provide the proof of Lemma 1.

Proof. (of Lemma 1) Let β^{(ν)} be the weights for f^{(ν)} and let β be the weights of f in some basis, so that β^{(ν)} · x^{(ν)} and β · x are the representations of f^{(ν)} and f in this basis. By Assumption 1,

    ε ≥ E(β^{(ν)} · x^{(ν)} − y)² − E(β · x − y)²
      = E(β^{(ν)} · x^{(ν)} − β · x + β · x − y)² − E(β · x − y)²
      = E(β^{(ν)} · x^{(ν)} − β · x)² + 2E[(β^{(ν)} · x^{(ν)} − β · x)(β · x − y)]
Now the normal equations for β (the first derivative conditions for the optimal linear predictor β) state that for each i:

    E[x_i (β · x − y)] = 0

where x_i is the i-th component of x. This implies that both

    E[β · x (β · x − y)] = 0
    E[β^{(ν)} · x^{(ν)} (β · x − y)] = 0

where the last equation follows since x^{(ν)} has components in x. Hence,

    E[(β^{(ν)} · x^{(ν)} − β · x)(β · x − y)] = 0

and we have shown that:

    E(β^{(1)} · x^{(1)} − β · x)² ≤ ε
    E(β^{(2)} · x^{(2)} − β · x)² ≤ ε

The triangle inequality states that:

    E(β^{(1)} · x^{(1)} − β^{(2)} · x^{(2)})² ≤ ( √(E(β^{(1)} · x^{(1)} − β · x)²) + √(E(β^{(2)} · x^{(2)} − β · x)²) )² ≤ (2√ε)² = 4ε

which completes the proof.

Below is the proof of Lemma 2.

Proof. (of Lemma 2) From Lemma 1, we have:

    4ε ≥ E[(β^{(1)} · x^{(1)} − β^{(2)} · x^{(2)})²]
       = Σ_i ( (β_i^{(1)})² + (β_i^{(2)})² − 2λ_i β_i^{(1)} β_i^{(2)} )
       = Σ_i ( (1 − λ_i)(β_i^{(1)})² + (1 − λ_i)(β_i^{(2)})² + λ_i((β_i^{(1)})² + (β_i^{(2)})² − 2β_i^{(1)} β_i^{(2)}) )
       = Σ_i ( (1 − λ_i)(β_i^{(1)})² + (1 − λ_i)(β_i^{(2)})² + λ_i(β_i^{(1)} − β_i^{(2)})² )
       ≥ Σ_i ( (1 − λ_i)(β_i^{(1)})² + (1 − λ_i)(β_i^{(2)})² )
       ≥ Σ_i (1 − λ_i)(β_i^{(ν)})²

where the last step holds for either ν = 1 or ν = 2.