Transfer Learning and Prior Estimation for VC Classes


15-859(B) Machine Learning Theory
Transfer Learning and Prior Estimation for VC Classes
Liu Yang, 04/25/2012

Notation
- Instance space X = R^n.
- Concept space C of classifiers h: X -> {0,1}.
  - Assume C has VC dimension vc < ∞.
- Data distribution D on X.
- Unknown target function h*: the true labeling function (realizable case: h* ∈ C).
- Assume ρ(h, g) = P_{x~D}[h(x) ≠ g(x)], for any classifiers h, g, is a metric on C.
- err(h) = P_{x~D}[h(x) ≠ h*(x)]. (A small Monte Carlo sketch of these quantities follows.)
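To make the notation concrete, here is a minimal Monte Carlo sketch (my addition, not from the slides): D is taken to be a standard Gaussian on R^n, and h, g, h* are hypothetical linear classifiers chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: D = N(0, I) on X = R^n; h, g, h* are linear classifiers.
n = 3
w_h, w_g, w_star = (rng.standard_normal(n) for _ in range(3))
h = lambda X: (X @ w_h >= 0).astype(int)
g = lambda X: (X @ w_g >= 0).astype(int)
h_star = lambda X: (X @ w_star >= 0).astype(int)  # the target h*

X = rng.standard_normal((100_000, n))    # i.i.d. sample from D
rho_hg = np.mean(h(X) != g(X))           # estimates rho(h, g) = P[h(x) != g(x)]
err_h = np.mean(h(X) != h_star(X))       # estimates err(h) = rho(h, h*)
print(rho_hg, err_h)
```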

Transfer Learning
- Principle: solving a new learning problem is easier given that we have solved several already.
- How does it help?
  - The new task may be directly "related" to previous tasks [e.g., Ben-David & Schuller 03; Evgeniou, Micchelli, & Pontil 05].
  - Previous tasks may give us useful sub-concepts [e.g., Thrun 96].
  - We can gather statistical information on the variety of concepts [e.g., Baxter 97; Ando & Zhang 04].
- Example: speech recognition.
  - After training a few times, we have figured out the dialects.
  - Next time, just identify the dialect.
  - Much easier than training a recognizer from scratch.

Model of Transfer Learning
- Motivation: learners are often not too altruistic.
- Layer 1: draw tasks i.i.d. from an unknown prior: target functions h_1*, h_2*, ..., h_T* ~ prior.
- Layer 2: per task t, draw data i.i.d. from the target: (x_1^t, y_1^t), ..., (x_k^t, y_k^t).
- As tasks accumulate, we obtain a better estimate of the prior.
[Figure: two-layer sampling diagram; Tasks 1, ..., T drawn from the prior, each task t generating labeled examples (x_1^t, y_1^t), ..., (x_k^t, y_k^t).]

Identifiability of Priors from Joint Distributions
- Let the prior π be any distribution on C.
  - Example: (w, b) ~ multivariate normal.
- Target h*_π ~ π.
- Data X = (X_1, X_2, ...) i.i.d. D, independent of h*_π.
- Z(π) = ((X_1, h*_π(X_1)), (X_2, h*_π(X_2)), ...).
- Let [m] = {1, ..., m}.
- Denote X_I = {X_i : i ∈ I} (I a subset of the natural numbers), and Z_I(π) = {(X_i, h*_π(X_i)) : i ∈ I}.
- Theorem: Z_[vc](π_1) =_d Z_[vc](π_2) iff π_1 = π_2 (where =_d denotes equality in distribution).

Identifiability of Priors from VC-Dimensional Joint Distributions
- Thresholds (vc = 1): for two points x_1 < x_2, the realizable labelings of (x_1, x_2) are (-,-), (-,+), (+,+), so
  Pr(+,+) = Pr(x_1 is +), Pr(-,-) = Pr(x_2 is -), Pr(+,-) = 0,
  and hence Pr(-,+) = Pr(x_2 is +) - Pr(+,+) = Pr(x_2 is +) - Pr(x_1 is +).
- For any k > 1 points, we can directly reduce the number of labels in the joint probability from k to 1. For example:
  P((-,+)) = P(x_2 is +) - P((+,+))
           = P(x_2 is +) - P(x_1 is +) + P((+,-))   (an unrealized labeling, so this term is 0)
           = P(x_2 is +) - P(x_1 is +).
  So the 1-dimensional (i.e., vc-dimensional) joints determine all higher-dimensional joints. (A numerical check of these identities follows.)
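Here is a quick simulation (my addition) of the threshold identities above, assuming D = Uniform[0,1] and a uniform prior over threshold locations; both choices are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior over thresholds: t ~ Uniform[0,1]; h_t(x) = + iff x >= t.
x1, x2 = 0.3, 0.7                       # two fixed points, x1 < x2
T = rng.uniform(0, 1, size=1_000_000)   # draws of the target threshold

lab1 = (x1 >= T)   # True means x1 labeled +
lab2 = (x2 >= T)   # True means x2 labeled +

print((lab1 & ~lab2).mean())                              # Pr(+,-) ~ 0 (unrealized)
print((lab1 & lab2).mean(), lab1.mean())                  # Pr(+,+) = Pr(x1 is +)
print((~lab1 & lab2).mean(), lab2.mean() - lab1.mean())   # Pr(-,+) = Pr(x2 +) - Pr(x1 +)
```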

Theorem (Proof Sketch): Z_[vc](π_1) =_d Z_[vc](π_2) iff π_1 = π_2
- Let ρ_m(h, g) = (1/m) Σ_{i=1}^m 1[h(X_i) ≠ g(X_i)].
- Then vc < ∞ implies that, with probability 1, for all h, g ∈ C with h ≠ g,
  lim_{m→∞} ρ_m(h, g) = ρ(h, g) > 0.
- ρ is a metric on C by assumption, so w.p. 1 each h ∈ C labels the sequence (X_1, X_2, ...) distinctly: (h(X_1), h(X_2), ...).
- Hence, w.p. 1, the conditional distribution of the label sequence, Z(π) | X, identifies π.
- Hence the distribution of Z(π) identifies π, i.e., Z(π_1) =_d Z(π_2) implies π_1 = π_2. (A small simulation of ρ_m → ρ follows.)

Identifiability of Priors from Joint Distributions
[Figure: the reduction steps, via lower-dimensional conditional distributions and label vectors y moved closer to an unrealized labeling ỹ.]
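A small numerical illustration (my addition) of ρ_m converging to ρ, assuming threshold classifiers and D = Uniform[0,1], where ρ(h, g) = |a - b| for thresholds at a and b:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two distinct thresholds; under D = Uniform[0,1], rho(h, g) = |a - b| = 0.35.
a, b = 0.25, 0.6
h = lambda x: (x >= a).astype(int)
g = lambda x: (x >= b).astype(int)

for m in [100, 10_000, 1_000_000]:
    X = rng.uniform(0, 1, size=m)
    rho_m = np.mean(h(X) != g(X))   # empirical distance rho_m(h, g)
    print(m, rho_m)                 # approaches 0.35 as m grows
```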

Identifiability of Priors from Joint Distributions (continued)
[Two slides of figures only; no text transcribed.]

Transfer Learning Setting
- Collection Π of distributions on C (known).
- Target distribution π* ∈ Π (unknown).
- Independent target functions h_1*, ..., h_T* ~ π* (unknown).
- Independent i.i.d.-D data sets X^(t) = (X_1^(t), X_2^(t), ...), t ∈ [T].
- Define Z^(t) = ((X_1^(t), h_t*(X_1^(t))), (X_2^(t), h_t*(X_2^(t))), ...).
- The learning algorithm gets Z^(1), then produces ĥ_1, then gets Z^(2), then produces ĥ_2, etc., in sequence. (This two-layer sampling process is sketched in code below.)
- Interested in: the values ρ(ĥ_t, h_t*), and the number of label values h_t*(X_j^(t)) the algorithm needs to access.

Estimating the Prior
- Principle: learning would be easier if we knew π*.
- Fact: π* is identifiable from the distribution of Z^(t)_[vc].
- Strategy: take the samples Z^(i)_[vc] from past tasks i = 1, ..., t-1; use them to estimate the distribution of Z^(i)_[vc]; convert that into an estimate π̂_{t-1} of π*; then use π̂_{t-1} in a prior-dependent learning algorithm for the new task h_t*.
- Assume Π is totally bounded in total variation.
- Then we can estimate π* at a bounded rate: ||π* - π̂_t|| < R_t, where R_t → 0 (holds w.h.p.).
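The two-layer sampling process can be sketched as follows (my addition); the Gaussian prior over linear separators and the Gaussian data distribution are stand-ins chosen for illustration, not the slides' choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, k = 5, 20, 50   # input dimension, number of tasks, examples per task

def draw_task():
    """Two-layer sampling: a target from the prior, then data i.i.d. from D."""
    # Layer 1: h_t* ~ pi* (here: weights of a linear separator, ~ N(0, I)).
    w_star = rng.standard_normal(n)
    # Layer 2: X_j^(t) i.i.d. D (here: standard normal), labeled by h_t*.
    X = rng.standard_normal((k, n))
    y = (X @ w_star >= 0).astype(int)
    return X, y

tasks = [draw_task() for _ in range(T)]   # the data sets Z^(1), ..., Z^(T)
```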

Main Theorem
- Proof idea: relate convergence of the estimator for the d-dimensional joint (d = vc) to convergence of the estimator for the prior.
- Chain of reductions:
  prior (∞-dim joint) <- k-dim joints <- k-dim conditionals <- d-dim joints <- d-dim conditionals.
- Standard result: a converging estimator exists for distributions on d examples, for totally bounded families.

Transfer Learning Algorithm
- Given: a prior-dependent learning algorithm A(ε, π) with E[# labels accessed] = Λ(ε, π), producing ĥ with E[ρ(ĥ, h*)] ≤ ε.
- For t = 1, ..., T:
  - If R_{t-1} > ε/4, run prior-independent learning on Z^(t)_[vc/ε] to get ĥ_t.
  - Else let π̂_t = argmin_{π ∈ B(π̂_{t-1}, R_{t-1})} Λ(ε/2, π), and run A(ε/2, π̂_t) on Z^(t) to get ĥ_t.
  (A structural sketch of this loop follows.)
- Theorem: for all t, E[ρ(ĥ_t, h_t*)] ≤ ε, and
  limsup_{T→∞} E[# labels accessed] / T ≤ Λ(ε/2, π*) + vc.
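A structural sketch of the loop (my addition; every helper name here is a placeholder stub standing in for the estimators and learners described above):

```python
def estimate_prior(past_tasks):
    """Stub: return (prior estimate, rate bound R_{t-1}); the 1/(1+t) rate
    stands in for the convergence guarantee from the estimation theorem."""
    return "pi_hat", 1.0 / (1 + len(past_tasks))

def prior_indep_learn(Z, eps):
    return "h_hat (prior-independent)"             # uses ~vc/eps labels

def prior_dep_learn(Z, eps, pi):
    return f"h_hat (prior-dependent, prior={pi})"  # algorithm A(eps, pi)

def transfer_learn(tasks, eps):
    h_hats = []
    for t, Z_t in enumerate(tasks, start=1):
        pi_hat, R = estimate_prior(tasks[: t - 1])
        if R > eps / 4:
            # Prior estimate still too coarse: fall back to a
            # prior-independent learner on the first vc/eps examples.
            h_hats.append(prior_indep_learn(Z_t, eps))
        else:
            # Run A(eps/2, pi_t); the argmin over the ball B(pi_hat, R)
            # is abstracted away in this stub.
            h_hats.append(prior_dep_learn(Z_t, eps / 2, pi_hat))
    return h_hats

print(transfer_learn([object()] * 10, eps=0.5))
```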

Lemma (Relate Prior to k-dim Joint): there is a sequence r_k → 0 s.t. for all θ, θ' ∈ Θ,
  ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| ≤ ||π_θ - π_θ'|| ≤ ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| + r_k.
Proof:
- The left inequality follows from: for any θ, θ' ∈ Θ and t ∈ N,
  ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| ≤ ||P_{Z_t(θ)} - P_{Z_t(θ')}|| = ||π_θ - π_θ'||,
  since marginalizing to k examples cannot increase the total variation distance.
- To show the right inequality: fix θ, θ' ∈ Θ, let γ > 0, and let B ⊆ (X × {-1,+1})^∞ be a measurable set s.t. B nearly attains the supremum defining ||P_{Z_t(θ)} - P_{Z_t(θ')}||.
- Carathéodory's extension theorem implies there exist disjoint sets {A_i}_{i ∈ N}, where each A_i is an event depending on a finite number of data points, s.t.
  P_{Z_t(θ)}(B) - P_{Z_t(θ')}(B) < Σ_{i ∈ N} P_{Z_t(θ)}(A_i) - Σ_{i ∈ N} P_{Z_t(θ')}(A_i) + γ.
- Since these sums are bounded, there must exist n ∈ N s.t.
  Σ_{i ∈ N} P_{Z_t(θ)}(A_i) < γ + Σ_{i=1}^n P_{Z_t(θ)}(A_i),

Relate Prior to k-dim Joint (continued)
so that the tail Σ_{i > n} contributes at most γ.
- As each A_i is a finite-dimensional event, there exist k ∈ N and a measurable A ⊆ (X × {-1,+1})^k s.t. A_1 ∪ ... ∪ A_n is expressible as {Z_{t,k} ∈ A}. Thus [the difference of the two probabilities is bounded by ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| up to γ terms; displayed equations lost in transcription].
- Taking the limit as γ → 0 implies the bound; in particular, it implies there exists a sequence r_k → 0 satisfying the lemma. QED

Relate k-dim Joint to k-dim Conditional
- Want to bound the total variation distance (tvd) between the k-dim joints.
- It is easier to bound the difference between the k-dim conditional distributions.
- Use Jensen's inequality to relate the tvd of the k-dim joints to the k-dim conditionals (written out below):
  ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| ≤ E[ ||P_{Y_{t,k}(θ) | X_{t,k}} - P_{Y_{t,k}(θ') | X_{t,k}}|| ].

Relate k-dim Conditional to d-dim Conditional
- Start from the definition of the total variation distance [as a supremum over label events; displayed equation lost in transcription].
- By Sauer's lemma, this supremum effectively ranges over the realized labelings of the k points [bound lost in transcription].
- Key identity: P(A and B) = P(A) - P(A and not B). Applied to the conditional label probabilities, it produces two terms: one reduces the dimension (the index set I) by 1, and the other brings the y vector one bit closer to an unrealizable labeling.
- Apply this to θ and θ'; we are interested in the tvd between the conditional probabilities.
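Written out, the Jensen step is |E[·]| ≤ E[|·|] applied inside the supremum over events (my rendering; B_x denotes the section of the event B at x):

```latex
\[
\big\| P_{Z_{t,k}(\theta)} - P_{Z_{t,k}(\theta')} \big\|
 = \sup_{B}\, \Big|\, \mathbb{E}\big[\, P(Y_{t,k}(\theta) \in B_{X_{t,k}} \mid X_{t,k})
   - P(Y_{t,k}(\theta') \in B_{X_{t,k}} \mid X_{t,k}) \,\big] \Big|
 \le \mathbb{E}\Big[\, \big\| P_{Y_{t,k}(\theta) \mid X_{t,k}}
   - P_{Y_{t,k}(\theta') \mid X_{t,k}} \big\| \Big],
\]
```

i.e., the joint tvd is at most the expected tvd between the conditionals, using that the marginal distribution of X_{t,k} is the same (D^k) under both θ and θ'.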

Tree Argument: Combinatorics
- The two terms above inductively define a binary tree; branch based on the modification made to the y vector:
  - Branching left gives a difference of probabilities for a set I with one less element than its parent.
  - Branching right gives a difference of probabilities for a y one bit closer to an unrealized labeling.
- At any level, from left to right the nodes have decreasing |I| values. Stop branching upon reaching a set I and a ỹ s.t. either ỹ is an unrealized labeling, or |I| = d.
- Any path can branch left k - d times (total) before reaching a set I with only d elements, and can branch right at most d + 1 times in a row before reaching a ỹ for which both probabilities are zero, so that the difference is zero.

Tree Argument: Conclusions
- Bound the original (root-node) difference of probabilities by the sum of the differences of probabilities over the leaf nodes with |I| = d.
- The depth of any leaf node with |I| = d is at most (k - d)d.
- The maximum width of the tree is at most k - d, so the total number of leaf nodes with |I| = d is at most d(k - d)^2.
- For any ỹ ∈ {-1,+1}^k, [the resulting bound on the conditional tvd follows; displayed equation lost in transcription].

Relate k-dim Joint to d-dim Joint
- Note: [expansion of the k-dim bound in terms of d-dim conditional probabilities; displayed equations lost in transcription].
- By exchangeability, the last line equals [the same expression with a fixed index set of size d].
- We want the d-dim joint instead of the d-dim conditional. Claim: [the expected d-dim conditional distance is bounded in terms of the d-dim joint distance; statement lost in transcription].

Proof of the Claim
- Proof: suppose [the displayed argument is lost in transcription].

Reflect the Path of Proof
- Earlier (Carathéodory's extension theorem, in general form): ||π_θ - π_θ'|| ≤ ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| + r_k.
- Just showed:
  ||P_{Z_t(θ)} - P_{Z_t(θ')}|| ≤ ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| + r_k ≤ (tree) g_k(||P_{Z_{t,d}(θ)} - P_{Z_{t,d}(θ')}||) + r_k,
  where ||P_{Z_{t,k}(θ)} - P_{Z_{t,k}(θ')}|| ≤ 4(2ek)^{2d+2} ||P_{Z_{t,d}(θ)} - P_{Z_{t,d}(θ')}||^{1/2}.
- So in total: for any k ∈ N,
  ||π_θ - π_θ'|| ≤ 4(2ek)^{2d+2} ||P_{Z_{t,d}(θ)} - P_{Z_{t,d}(θ')}||^{1/2} + r_k.
- In particular, r_k → 0 as k → ∞. Let g(ε) = min_k (4(2ek)^{2d+2} ε^{1/2} + r_k).
  (Why does g(ε) → 0 as ε → 0? Let ε_k = (r_k / (4(2ek)^{2d+2}))^2. Then ε_k = o(1) and g(ε_k) ≤ 4(2ek)^{2d+2} ε_k^{1/2} + r_k = 2r_k. Since g is monotonic in ε, lim_{ε→0} g(ε) = lim_{k→∞} g(ε_k) = lim_{k→∞} 2r_k = 0. A numerical check of this limit follows.)

Distribution Estimation Rate
- The last component: the rate of convergence of our estimate of the d-dim joint P_{Z_{t,d}(θ*)}.
- N(ε) is the ε-covering number of the family of d-dim joints.
- Taking θ̂ as the minimum-distance skeleton estimate of Yatracos (1985) achieves expected tvd ε from P_{Z_{t,d}(θ*)}, for some T = [sample-size bound lost in transcription].
- Solving for ε in terms of T implies E[tvd of d-dim joint] → 0 as T → ∞.
- Conclusion for prior estimation:
  - Let w_t be the tvd of the d-dim estimate, so E[w_t] → 0. Pick a sequence R_t s.t. R_t → 0 but E[w_t]/R_t → 0; then Markov's inequality gives P(w_t > R_t) ≤ E[w_t]/R_t → 0.
  - So there is a bound on the tvd that → 0 and holds with probability → 1, as T → ∞.
  - If the tvd of the d-dim joints → 0, plugging into g(·) (just proved) shows the tvd of the priors → 0.
- Together, this proves the theorem.
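A numerical check (my addition) that g(ε) → 0 along the sequence ε_k from the argument above; the particular vanishing sequence r_k used here is a placeholder, not the slides' rate.

```python
import numpy as np

d = 2
def r(k):                  # assumed vanishing sequence r_k (placeholder rate)
    return np.sqrt(d / k)

def coef(k):               # the factor 4(2ek)^(2d+2) from the tree argument
    return 4 * (2 * np.e * k) ** (2 * d + 2)

def g(eps, k_max=2000):
    ks = np.arange(d + 1, k_max, dtype=float)
    return np.min(coef(ks) * np.sqrt(eps) + r(ks))

for k in [10, 100, 1000]:
    eps_k = (r(k) / coef(k)) ** 2
    print(k, eps_k, g(eps_k), 2 * r(k))   # g(eps_k) <= 2 r_k, and r_k -> 0
```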

Rate of Convergence under Hölder-Smoothness
- Definition: [Hölder-smoothness of the prior family, with constant L and exponent α; definition lost in transcription].
- Theorem:
  ||π_θ - π_θ'|| ≤ min_k [ 4(2ek)^{2d+2} ||P_{Z_{t,d}(θ)} - P_{Z_{t,d}(θ')}||^{1/2} + r_k ],
  and under Hölder-smoothness,
  r_k = O(L (d/k · log(k/d))^α).

Rate of Convergence under Hölder-Smoothness (proof)
- By a PAC bound, for any γ > 0, w.p. > 1-γ, a sample partitions C into regions of width < γ.
- [Two displayed steps lost in transcription.]
- By smoothness, w.p. > 1-γ, we have [the prior mass of each region nearly determined] everywhere.

Rate of Convergence under Hölder-Smoothness (continued)
- Thus, for any [pair θ, θ'], w.p. > 1-γ: [displayed bound lost in transcription].
- Since the regions that define [the two estimates] are the same, r_k ≤ [bound lost in transcription].
- Thus, w.p. ≥ 1-γ, [...]; proceeding as before, we get [...]; plugging in, we get (*).

Rate of Convergence under Hölder-Smoothness: Estimation Rate
- Rate of convergence of the estimate of the d-dim joint:
  - The ε-cover size is bounded by a grid argument under Hölder-smoothness; plugging that into the sample complexity of Yatracos (1985) gives
    T = O(ε^{-2} (L/ε)^{d/α} log(1/ε)).
  - Solving for ε, we get (a rough inversion check is given below)
    ε = O(L (log(TL)/T)^{α/(d+2α)}).
- Plugging this into (*) gives the stated theorem (holding for any γ); with [an appropriate choice of γ], QED.
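As a rough consistency check (my addition; the log factor and constants are ignored, so the power of L comes out slightly differently), inverting the sample-complexity bound recovers the stated T^{-α/(d+2α)} dependence:

```latex
\[
T \;\approx\; \varepsilon^{-2}\,(L/\varepsilon)^{d/\alpha}
  \;=\; L^{d/\alpha}\,\varepsilon^{-(2 + d/\alpha)}
\quad\Longrightarrow\quad
\varepsilon \;\approx\; \big( L^{d/\alpha} / T \big)^{\frac{1}{2 + d/\alpha}}
  \;=\; L^{\frac{d}{d + 2\alpha}}\; T^{-\frac{\alpha}{d + 2\alpha}}.
\]
```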

Is this Better than without Transfer?
- The question becomes: how much does knowledge of the target distribution π* help?
- There are some (constant-factor) gains for passive learning [e.g., HKS 1992].
- It really helps in active learning:
  - Earlier, we showed we can get o(1/ε) expected label complexity for all π.
  - For many C (e.g., linear separators), no prior-independent algorithm has this guarantee.
  - Plugging in that method, the transfer method accesses o(1/ε) labels on average.

An Example of Prior-Dependent Learning
- Self-verifying Bayesian active learning (a special type of stopping criterion):
  - Given ε, it adaptively decides the number of queries, then halts.
  - It has the property that E[err] < ε when it halts.
- Question: can you do this with E[# queries] = o(1/ε)? (Passive learning needs ~1/ε labels.)

Example: Intervals (Verification Lower Bound)
- In the non-Bayesian setting, suppose h* is the empty interval and D is uniform on [0,1].
- Given any classifier h, just to verify err(h) < ε:
  - Need to verify that h* is not an interval of width 2ε.
  - Need an example in each of Ω(1/ε) regions to verify this fact.
[Figure: the interval [0,1] split into Ω(1/ε) contiguous regions of width 2ε; h* is the empty interval.]

Interval Example with a Prior
- Algorithm: query random points until finding the first +; do binary search to find the end-points. Halt upon reaching a prespecified prior-based query budget N. Output the posterior's Bayes classifier. (Sketched in code below.)
- Let the budget N be high enough so E[err] < ε:
  - N = o(1/ε) is sufficient for E[err | w* > 0] < ε: if the target width w* > 0, even a prior-independent analysis needs only E[# queries | w*] = O(1/w* + log(1/ε)) = o(1/ε).
  - N = o(1/ε) is sufficient for E[err | w* = 0] < ε: if P(w* = 0) > 0, then after some L = O(log(1/ε)) queries, w.p. > 1-ε, most of the posterior mass is on the empty interval, so the posterior's Bayes classifier has 0 error rate.
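A runnable sketch of the query strategy (my addition and a simplification: the prior-based budget N is passed in directly, and instead of the posterior's Bayes classifier the sketch returns the empirically localized interval):

```python
import numpy as np

rng = np.random.default_rng(4)

def interval_active_learn(target, budget):
    """Query random points until the first +, then binary-search both
    end-points, stopping once the query budget is exhausted."""
    a, b = target  # h* labels x as + iff a <= x <= b (empty interval if a > b)
    label = lambda x: a <= x <= b
    queries, first_plus = 0, None
    while queries < budget:
        x = rng.uniform(0, 1)
        queries += 1
        if label(x):
            first_plus = x
            break
    if first_plus is None:
        return (1.0, 0.0)  # no + found within budget: predict the empty interval

    lo, hi = 0.0, first_plus        # left end-point a lies in [lo, hi]
    while queries < budget and hi - lo > 1e-6:
        mid = (lo + hi) / 2
        queries += 1
        lo, hi = (lo, mid) if label(mid) else (mid, hi)
    lo2, hi2 = first_plus, 1.0      # right end-point b lies in [lo2, hi2]
    while queries < budget and hi2 - lo2 > 1e-6:
        mid = (lo2 + hi2) / 2
        queries += 1
        lo2, hi2 = (mid, hi2) if label(mid) else (lo2, mid)
    return (hi, lo2)                # approximate [a, b]

print(interval_active_learn(target=(0.3, 0.55), budget=60))
```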

Can Do o(1/ε) for Any VC Class
- Theorem: with the prior, can get o(1/ε) query complexity.
- There are methods that find a good classifier in o(1/ε) queries (though they aren't self-verifying) [BHW08].
- We need to set a stopping criterion for those algorithms.
- The stopping criterion we use: a budget.
- Set the budget to be just large enough so that E[err] < ε.
