CS369M: Algorithms for Modern Massive Data Set Analysis                 Lecture 16 - 11/11/2009
Multiplicative Update Algorithms, Boosting and Ensemble Methods
Lecturer: Michael Mahoney                                 Scribes: Mark Wagner and Yuning Sun
*Unedited Notes

1 Graph Partitioning

The expansion of cuts,
    \phi = \min_{S \subseteq V} \frac{E(S, \bar{S})}{\frac{1}{n}|S||\bar{S}|},    (1)
can be relaxed to real numbers,
    \lambda_2 = \min_{x \in \mathbb{R}^V} \frac{\sum_{ij} A_{ij} (x_i - x_j)^2}{\frac{1}{n} \sum_{ij} (x_i - x_j)^2}    (spectral),    (2)
or it can be relaxed to a vector for each vertex.

Claim:
    \lambda_2 = \min_{x_1, \dots, x_n \in \mathbb{R}^n} \frac{\sum_{ij} A_{ij} \|x_i - x_j\|_2^2}{\frac{1}{n} \sum_{ij} \|x_i - x_j\|_2^2}.    (3)
Proof: (3) is a relaxation of (2) (take every x_i along a single fixed direction); conversely, each coordinate of a solution of (3) is feasible for (2), and the best coordinate achieves a ratio no larger than the combined one, so the direct solution of (3) yields that of (2).

Claim: (3) is equal to the SDP
    min   \sum_{ij} A_{ij} \|x_i - x_j\|_2^2
    s.t.  \sum_{ij} \|x_i - x_j\|_2^2 = n,
which is equal to
    min   Tr(L_G X)
    s.t.  Tr(L_{K_n} X) = n,
          X \succeq 0.    (4)

The problem is harder (an SDP versus an eigenvalue problem), but this is useful - looking at duals lets us include extra information.
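To make the relaxations concrete, here is a minimal numerical sketch (not part of the lecture; the 6-node example graph and all names are made up) that computes \lambda_2, the value of relaxation (2), and checks by brute force that it lower-bounds the combinatorial objective (1).

```python
# Minimal sketch (not from the lecture): check numerically that the spectral
# relaxation (2) lower-bounds the cut objective (1) on a small made-up graph.
import itertools
import numpy as np

# Hypothetical example graph: two triangles joined by a single edge.
n = 6
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L_G
lam2 = np.linalg.eigvalsh(L)[1]           # lambda_2 = value of relaxation (2)

# Brute-force the combinatorial objective  E(S, S-bar) / ((1/n)|S||S-bar|).
phi = np.inf
for size in range(1, n):
    for S in itertools.combinations(range(n), size):
        s = np.zeros(n)
        s[list(S)] = 1.0
        cut = s @ L @ s                   # number of edges crossing (S, S-bar)
        phi = min(phi, cut / ((1.0 / n) * size * (n - size)))

print("lambda_2 =", lam2, "  phi =", phi)  # expect lambda_2 <= phi
```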
Fact: The dual of (4) is
    max   y
    s.t.  L_G \succeq \frac{y}{n} L_{K_n}.
A feasible solution is a number y and a matrix Y \succeq 0 such that
    L_G = \frac{y}{n} L_{K_n} + Y.
Recall that
    |S||\bar{S}| = 1_S^T L_{K_n} 1_S    and    E(S, \bar{S}) = 1_S^T L_G 1_S,
but
    1_S^T L_G 1_S = 1_S^T \Big( \frac{y}{n} L_{K_n} + Y \Big) 1_S \ge \frac{y}{n} \, 1_S^T L_{K_n} 1_S.
So the cost of every cut is at least y \cdot \frac{1}{n}|S||\bar{S}|, i.e., any dual-feasible y is a lower bound on (1).

What's going on here? Recall embedding a scaled version of the complete graph in G: we know the expansion and cut values for K_n, and so we can relate them to G. Note: K_n is an expander.

Flow - if a graph H of known expansion can be embedded in G as a flow, then h_H \le h_G (up to the congestion of the embedding). The optimal solution for a fixed H can then be computed as the solution to a concurrent multicommodity flow problem. This gives an O(log n) approximation, which is tight.

Spectral - relax to an eigenvalue problem and use Cheeger's inequality.

ARV-type methods - Can I iteratively construct a graph H (testing its expansion), stop when it is a good expander, and get a bound on h_G? Yes - write it as an SDP, which can be computed faster by using primal-dual ideas:
- ARV - the original O(\sqrt{\log n}) approximation
- AHK - a primal-dual method in theoretical computer science; both use multicommodity flows
- KRV - single-commodity flows using a cut-matching game
- OSVV - extended KRV
- LMO - empirical evaluation; described as a modified spectral method
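Returning to the dual of (4): a feasible y certifies a lower bound on every cut, which is easy to check numerically. The following is a minimal sketch (not from the lecture; the function name is hypothetical) that tests dual feasibility by checking that the slack matrix Y = L_G - (y/n) L_{K_n} is positive semidefinite; the largest feasible y is exactly \lambda_2.

```python
# Minimal sketch (not from the lecture): test dual feasibility of a candidate y
# for the dual of (4).  If L_G - (y/n) L_{K_n} is PSD, then every cut satisfies
# E(S, S-bar) >= y * (1/n)|S||S-bar|, so y lower-bounds the objective (1).
import numpy as np

def dual_certificate_holds(L_G, y, tol=1e-9):
    n = L_G.shape[0]
    L_Kn = n * np.eye(n) - np.ones((n, n))   # Laplacian of the complete graph K_n
    Y = L_G - (y / n) * L_Kn                 # slack matrix; feasibility means Y is PSD
    return np.linalg.eigvalsh(Y).min() >= -tol

# Hypothetical usage with the Laplacian L and lam2 from the previous sketch:
# dual_certificate_holds(L, lam2)          -> True  (y = lambda_2 is feasible)
# dual_certificate_holds(L, lam2 + 0.01)   -> False (anything larger is not)
```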
2 Online Learning

Prediction/inference problem - given data, predict something. There are various ways to formalize this: different assumptions on what the data are (real numbers, graphs, strings), where they come from / how they are generated (e.g., according to an underlying distribution), and what access there is to side information.

Traditional statistics:
- data are generated according to an underlying distribution
- learn parameters describing the distribution
- evaluate quality by the Risk - the expected value of some loss function over the distribution of the data
- ERM, SRM

What if the data are not generated by some underlying process? With no assumptions, it is hard to predict.

Idea: data elements \{y_i, x_i\} \in \mathbb{R} arrive sequentially; predict the next element. The prediction is evaluated by a loss function, e.g. the number of incorrect predictions. There is access to side information, namely the predictions of a set of experts. Experts make predictions according to some rule (deterministic, random, adversarial, etc.). At each time step, the experts also incur a loss.

Goal: we want a loss not too much worse than that of the best expert.

Also: when predicting at time t you have access to
- your predictions and losses in the past
- the predictions and losses of the experts in the past

What are the experts?
- an oracle
- a statistical model
- certain steps in an algorithm
- basis functions

3 Multiplicative weights update rule

- maintain a probability distribution over the experts
- at each step, increase or decrease the weights multiplicatively, i.e. by multiplying by (1 \pm \epsilon)
- \epsilon = a parameter that judges how much confidence to place in an expert's prediction / a regularization parameter

Discrete Experts:
- a set of experts E that make predictions f_{E_i,t} \in \mathbb{R}
- a set of vectors \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1\} = weights on the experts
- \ell_t(i) = loss of expert i at stage t
- \ell_t(\hat{p}_t, y_t) = loss of the algorithm = \sum_{i=1}^n x_i \ell_t(i)

Algorithm:
1. W_{0,i} = 1
2. when y_t and the experts' predictions arrive, the algorithm uses this update rule:
       W_{t+1,i} = W_{t,i} (1-\epsilon)^{\ell_t(i)} = (1-\epsilon)^{\sum_{\tau \le t} \ell_\tau(i)} = e^{-\eta \sum_{\tau \le t} \ell_\tau(i)},   where \eta = \log\frac{1}{1-\epsilon}.
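A minimal sketch (not from the lecture) of the update rule above, with a made-up 0/1 loss matrix; the choice \epsilon = \sqrt{\log n / T} anticipates the regret bound derived below.

```python
# Minimal sketch (not from the lecture) of the multiplicative weights update rule.
import numpy as np

def multiplicative_weights(losses, eps):
    """losses[t, i] = loss of expert i at step t, assumed to lie in [0, 1]."""
    T, n = losses.shape
    W = np.ones(n)                              # W_{0,i} = 1
    alg_loss = 0.0
    for t in range(T):
        p = W / W.sum()                         # probability distribution over experts
        alg_loss += p @ losses[t]               # expected loss of the algorithm at step t
        W = W * (1.0 - eps) ** losses[t]        # W_{t+1,i} = W_{t,i} (1 - eps)^{l_t(i)}
    return alg_loss, losses.sum(axis=0).min()   # algorithm's loss, best expert's loss

# Hypothetical usage: n = 10 experts, T = 1000 steps, random 0/1 losses.
rng = np.random.default_rng(0)
losses = rng.integers(0, 2, size=(1000, 10)).astype(float)
eps = np.sqrt(np.log(10) / 1000)
print(multiplicative_weights(losses, eps))      # algorithm not much worse than best expert
```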
Thm: For any expert E_j, j \in [n],
    \sum_{t=1}^T \ell_t(\hat{p}_t) \le \frac{\log n}{\epsilon} + (1+\epsilon) \sum_{t=1}^T \ell_t(j).

Proof: use a potential function argument.

First, relate the potential function W_t = \sum_{i=1}^n W_{t,i} to the loss of any single expert i:
    W_{T+1} \ge W_{T+1,i} = (1-\epsilon)^{\sum_t \ell_t(i)} = e^{-\eta \sum_t \ell_t(i)}.

Next, relate the potential function to the performance of the algorithm:
    W_{t+1} = \sum_{i=1}^n W_{t+1,i} = \sum_{i=1}^n W_{t,i} (1-\epsilon)^{\ell_t(i)}.
Note that (1-\epsilon)^x \le 1 - \epsilon x for 0 \le x \le 1. So
    W_{t+1} \le \sum_i W_{t,i} (1 - \epsilon \ell_t(i))
            = W_t \Big( 1 - \epsilon \sum_i \frac{W_{t,i}}{W_t} \ell_t(i) \Big)
            = W_t (1 - \epsilon \ell_t(\hat{p}_t))
            \le W_t \exp(-\epsilon \ell_t(\hat{p}_t))
            \le W_0 \exp\Big( -\epsilon \sum_{\tau \le t} \ell_\tau(\hat{p}_\tau) \Big).
Combining the two bounds (W_0 = n),
    e^{-\eta \sum_t \ell_t(i)} \le W_{T+1} \le n \, e^{-\epsilon \sum_t \ell_t(\hat{p}_t)},
so
    -\eta \sum_t \ell_t(i) \le \log n - \epsilon \sum_t \ell_t(\hat{p}_t)
and
    \sum_t \ell_t(\hat{p}_t) \le \frac{\log n}{\epsilon} + \frac{\eta}{\epsilon} \sum_t \ell_t(i) \le \frac{\log n}{\epsilon} + (1+\epsilon) \sum_t \ell_t(i).

Define the regret
    R_T = \sum_{t=1}^T \ell_t(\hat{p}_t) - \min_{f \in E} \sum_{t=1}^T \ell_t(f) \le \frac{\log n}{\epsilon} + \epsilon \sum_t \ell_t(i) \le 2\sqrt{T \log n}
if \epsilon = \sqrt{\frac{\log n}{T}}.

Q: is \sqrt{T \log n} large or small? If extra information is given that one expert will be perfect, you can find the best expert while making at most \log n mistakes - the multiplicative weights update rule says you are not much worse than this scenario in more general cases.

Applications to algorithms:
- AHK generalize the losses to matrix losses to solve SDPs in \tilde{O}(n^2) time.
- KRV - a cut-matching game to solve sparsest cut (a skeleton of the game loop is sketched after this list). There are 2 players: a cut player and a matching player.
  1. G_0 = \emptyset (the empty graph).
  2. In each round, the cut player chooses a bisection (S, \bar{S}) and the matching player chooses a perfect matching M_t across (S, \bar{S}). Then G_{t+1} \leftarrow G_t + M_t.
  3. The game stops when G_t is an expander, e.g. \lambda_{G_t} \ge 1/10.
  4. The value of the game is the number of steps it took. Goal: the cut player wants to stop soon (find an expander fast), the matching player wants to delay stopping.
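A rough skeleton (not from the lecture) of the cut-matching game loop referred to above; `cut_player` and `matching_player` are hypothetical placeholders: in KRV the cut player is a cheap spectral/random-walk procedure and the matching player routes a perfect matching across the bisection with a single-commodity flow in the input graph.

```python
# Rough skeleton (not from the lecture) of the KRV cut-matching game.
# `cut_player(H)` should return a bisection S with |S| = n/2; `matching_player(S)`
# should return the adjacency matrix of a perfect matching across (S, S-bar),
# routed as a flow in the input graph (both are hypothetical placeholders here).
import numpy as np

def cut_matching_game(n, cut_player, matching_player, lam_target=0.1, max_rounds=100):
    H = np.zeros((n, n))                              # G_0: empty graph on n vertices
    for t in range(max_rounds):
        lam2 = np.linalg.eigvalsh(np.diag(H.sum(1)) - H)[1]
        if lam2 >= lam_target:                        # stopping rule: H is an expander
            return H, t                               # value of the game = number of rounds
        S = cut_player(H)                             # cut player picks a bisection (S, S-bar)
        M = matching_player(S)                        # matching player picks a matching across it
        H = H + M                                     # G_{t+1} = G_t + M_t
    return H, max_rounds
```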
Dual algorithm:
1. Let G'_t = \gamma G_t.
2. Approximate the 2nd eigenvector of L_{G'_t}. The degree of approximation is governed by the regularization parameter.
3. Use the bisection (S_{n/2}, \bar{S}_{n/2}) from the sweep cut. Call a flow-based improvement algorithm to get a cut (T, \bar{T}) and a matching M_t; until the stopping rule is satisfied, let G_{t+1} = G_t + M_t. Return the best cut (T, \bar{T}).

Why would you hope/expect that these multiplicative weight update algorithms would perform well in practice?
- they are faster than the naive computation
- they often give better answers than the exact algorithm

Boosting - an example of an ensemble method

Given X, learn C : X \to \{0, 1\}, a classification rule from some concept class \mathcal{C}. Risk = E[error]. Define: a \gamma-weak learning algorithm is one that has error \le 1/2 - \gamma; a strong learning algorithm is one with error \le \epsilon. Can one combine a set of weak learners into a strong learner?

Idea - weak learners are a little better than chance, so combining them doesn't make things worse. If they are different, then we can hope for improvement by averaging predictions.

Boosting - AdaBoost - do boosting by sampling. Take a sample of the data and use the algorithm to boost on that sample; do this in an iterative manner by updating weights on the data points to find new classification rules for the data points that are misclassified.

Events - the hypotheses for the classification rule output at each step. At each step, get a classification rule h_t (a weak learner); the final classification algorithm uses a combination of the h_t as its prediction.
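To make the last paragraph concrete, here is a minimal AdaBoost-style sketch (not the lecture's exact algorithm; this is the reweighting rather than the sampling variant, and `weak_learner` is a hypothetical placeholder for any \gamma-weak learner returning a \pm 1 classifier).

```python
# Minimal AdaBoost-style sketch (not from the lecture): reweight misclassified
# points and combine the weak rules h_t into a final classifier.
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """y in {-1, +1}; weak_learner(X, y, D) returns a callable h with h(X) in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                      # initial weights on the data points
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # weak rule trained on the weighted data
        pred = h(X)
        err = D[pred != y].sum()                 # weighted error of h_t
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        D = D * np.exp(-alpha * y * pred)        # up-weight the misclassified points
        D /= D.sum()
        hs.append(h)
        alphas.append(alpha)
    # final rule: sign of the weighted combination of the weak rules
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```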