A Constant-Factor Approximation Algorithm for the Link Building Problem

Martin Olsen¹, Anastasios Viglas², and Ilia Zvedeniouk²

¹ Center for Innovation and Business Development, Institute of Business and Technology, Aarhus University, Birk Centerpark 15, DK-7400 Herning, Denmark, martino@hih.au.dk
² School of Information Technologies, University of Sydney, 1 Cleveland St, NSW 2006, Australia, taso.viglas@sydney.edu.au, izve6419@uni.sydney.edu.au

Abstract. In this work we consider the problem of maximizing the PageRank of a given target node in a graph by adding k new links. We consider the case that the new links must point to the given target node (backlinks). Previous work [7] shows that this problem has no fully polynomial time approximation scheme unless P = NP. We present a polynomial time algorithm yielding a PageRank value within a constant factor from the optimal. We also consider the naive algorithm where we choose backlinks from nodes with high PageRank values compared to the outdegree, and show that the naive algorithm performs much worse on certain graphs compared to the constant factor approximation scheme.

1 Introduction

Search engine optimization (SEO) is a fast growing industry that deals with optimizing the ranking of web pages in search engine results. SEO is a complex task, especially since the specific details of search and ranking algorithms are often not publicly released, and can also change frequently. One of the key elements of optimizing for search engine visibility is the external link popularity [9], which is based on the structure of the web graph. The problem of obtaining optimal new backlinks in order to achieve good search engine rankings is known as Link Building, and leading experts from the SEO industry consider Link Building to be an important aspect of SEO [9].

The PageRank algorithm is one of the popular methods of defining a ranking according to the link structure of the graph. The definition of PageRank [3] uses random walks based on the random surfer model.
The random surfer walk is defined as follows: the walk can start from any node in the graph and at each step the surfer chooses a new node to visit. The surfer usually chooses (uniformly at random) an outgoing link from the current node, and follows it. But with a small probability at each step the surfer might choose to ignore the current node's outgoing links, and just zap to any node in the graph (chosen uniformly at random). The random surfer walk is a random walk on the graph with random restarts every few steps. This random walk has a unique stationary probability distribution that assigns the probability value $\pi_i$ to node $i$. This value is the PageRank of node $i$, and can be interpreted as the probability for the random surfer of being at node $i$ at any given point during the walk. We refer to the random restart as zapping. The parameter that controls the zapping frequency is the probability of continuing the random walk at each step, $\alpha > 0$.

The high level idea is that the PageRank algorithm will assign high PageRank values to nodes that would appear more often in a random surfer type of walk. In other words, the nodes with high PageRank are hot-spots that will see more random surfer traffic, resulting directly from the link structure of the graph. If we add a small number of new links to the graph, the PageRank values of certain nodes can be affected very significantly. The Link Building problem arises as a natural question: given a specific target node in the graph, what is the best set of k links that will achieve the maximum increase for the PageRank of the target node?

We consider the problem of choosing the optimal set of backlinks for maximizing $\pi_x$, the PageRank value of some target node $x$. A backlink (with respect to a target node $x$) is a link from any node towards $x$. Given a graph $G(V, E)$ and an integer $k$, we want to identify the $k \ge 1$ links to add to node $x$ in $G$ in order to maximize the resulting PageRank of $x$, $\pi_x$. Intuitively, the new links added should redirect the random surfer walk towards the target node as much as possible. For example, adding a new link from a node of very high PageRank would usually be a good choice.

1.1 Related Work and Contribution

The PageRank algorithm [3] is based on properties of Markov chains. There are many results related to the computation of PageRank values [5, 2] and recalculating PageRank values after adding a set of new links in a graph [1].
The Link Building problem that we consider in this work is known to be NP-hard [7]. That work also shows that there is no fully polynomial time approximation scheme (FPTAS) for Link Building unless NP = P, and that the problem is W[1]-hard with parameter k. A related problem considers the case where a target node aims at maximizing its PageRank by adding new outlinks. Note that in this case, new outlinks can actually decrease the PageRank of the target node. This is different to the case of the Link Building problem with backlinks, where the PageRank of the target can only increase [1]. For the problem of maximizing PageRank with outlinks we refer to [1, 4], which contain, among other things, guidelines for optimal linking structure.

In Sect. 2 we give background to the PageRank algorithm. In Sect. 3 we formally introduce the Link Building problem. In Sect. 3.1 we consider the naive and intuitively clear algorithm for Link Building where we choose backlinks from the nodes with the highest PageRank values compared to their outdegree (plus one). We show how to construct graphs where we obtain a surprisingly high approximation ratio. The approximation ratio is the value of the optimal solution divided by the value of the solution obtained by the algorithm. In Sect. 3.2, we present a polynomial time algorithm yielding a PageRank value within a constant factor from the optimal and therefore show that the Link Building problem is in the class APX.

2 Background: The PageRank Algorithm

The PageRank algorithm was proposed by Brin, Page [3] and Brin, Page, Motwani and Winograd [8] as a webpage ranking method that captures the importance of webpages. Loosely speaking, a link pointing to a webpage is considered a vote of importance for that webpage. A link from an important webpage is better for the receiver than a link from an unimportant webpage.

We consider directed graphs $G = (V, E)$ that are unweighted, and therefore we count multiple links from a node $u$ to a node $v$ as a single link. The graph may represent a set of webpages $V$ with hyperlinks between them, $E$, or any other linked structure.

We define the following random surfer walk on $G$: at every step the random surfer will choose a new node to visit. If the random surfer is currently visiting node $u$ then the next node is chosen as follows: (1) with probability $\alpha$ the surfer chooses an outlink from $u$, $(u, v)$, uniformly at random and visits $v$. If the current node $u$ happens to be a sink (and therefore has no outlinks) then the surfer picks any node $v \in V$ uniformly at random; (2) with probability $1 - \alpha$ the surfer visits any node $v \in V$ chosen uniformly at random; this is referred to as zapping. A typical value for the probability $\alpha$ is 0.85. The random surfer walk is therefore a random walk that usually follows a random outlink, but every few steps it essentially restarts the random walk from a random node in the graph.

Since the new node depends only on the current position in the graph, the sequence of visited pages is a Markov chain with state space $V$ and transition probabilities that can be defined as follows. Let $P = \{p_{ij}\}$ denote a matrix derived from the adjacency matrix of the graph $G$, such that $p_{ij} = \frac{1}{\mathrm{outdeg}(i)}$ if $(i, j) \in E$ and 0 otherwise ($\mathrm{outdeg}(i)$ denotes the outdegree of $i$, the number of outgoing edges from node $i \in V$). If $\mathrm{outdeg}(i) = 0$ then $p_{ij} = \frac{1}{n}$, where $n = |V|$.
The transition probability matrix of the Markov chain that describes the random surfer walk can therefore be written as

$$Q = \frac{1-\alpha}{n}\,\mathbb{1}_{n,n} + \alpha P,$$

where $\mathbb{1}_{n,n}$ is the $n \times n$ matrix with every entry equal to 1. This Markov chain is aperiodic and irreducible and therefore has a unique stationary probability distribution $\pi$, the eigenvector associated with the dominant eigenvalue of $Q$. For any positive initial probability distribution $x_0$ over $V$, the iteration $x_0^T Q^l$ will converge to the stationary probability distribution $\pi^T$ for large enough $l$. This is referred to as the power method [5].

The distribution $\pi = (\pi_1, \ldots, \pi_n)^T$ is defined as the PageRank vector of $G$. The PageRank value of a node $u \in V$ is the expected fraction of visits to $u$ after $i$ steps for large $i$, regardless of the starting point. For example, a node that is reachable from many other nodes in the graph via short directed paths will have a larger PageRank.
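As a rough illustration of the power method described above, the following Python sketch repeatedly applies $Q$ to a probability vector on an adjacency-list graph. The function name `pagerank`, the `out_links` representation, and the stopping tolerance are assumptions made for this example and are not taken from the paper; sinks are treated as linking to every node, as in the definition of $P$.

```python
def pagerank(out_links, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Power iteration for Q = (1 - alpha)/n * ones + alpha * P.

    out_links[i] is the set of nodes that node i links to (hypothetical
    input format). An empty set marks a sink, which jumps uniformly.
    """
    n = len(out_links)
    pi = [1.0 / n] * n  # any positive starting distribution works
    for _ in range(max_iter):
        sink_mass = sum(pi[i] for i in range(n) if not out_links[i])
        # Zapping plus the uniform jump taken from sinks.
        base = (1.0 - alpha) / n + alpha * sink_mass / n
        new = [base] * n
        for i, targets in enumerate(out_links):
            if targets:
                share = alpha * pi[i] / len(targets)
                for j in targets:
                    new[j] += share
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi
```

On a directed 3-cycle such as `pagerank([{1}, {2}, {0}])`, symmetry forces every node to receive the same value, which is a convenient sanity check.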

3 The Link Building Problem

The k backlink (or Link Building) problem is defined as follows:

Definition 1. The LINK BUILDING problem:
Instance: A triple $(G, x, k)$ where $G(V, E)$ is a directed graph, $x \in V$ and $k \in \mathbb{Z}^+$.
Solution: A set $S \subseteq V \setminus \{x\}$ with $|S| = k$ maximizing $\pi_x$ in $G(V, E \cup (S \times \{x\}))$.

For fixed $k = 1$ this problem can be solved in polynomial time by simply calculating the new potential PageRanks of the target node after adding a link from each node. This requires $O(n)$ PageRank calculations. The argument is similar for any fixed $k$. As mentioned in Sect. 1.1, if $k$ is part of the input then the problem becomes NP-hard.

3.1 Naive Selection of Backlinks

When choosing new incoming links in a graph, based on the definition of the PageRank algorithm, higher PageRank nodes appear to be more desirable. If we naively assume that the PageRank values will not change after inserting new links to the target node, then the optimal new sources for links to the target would be the nodes with the highest PageRank values compared to outdegree plus one. This leads us to the following naive but intuitively clear algorithm:

Naive(G, x, k)
    Compute all PageRanks $\pi_i$, for all $(i \in V : (i, x) \notin E)$
    Return the k webpages with highest values of $\frac{\pi_i}{d_i + 1}$, where $d_i$ is the outdegree of page $i$

Fig. 1. The naive algorithm

The algorithm simply computes all initial PageRanks and chooses the $k$ nodes with the highest value of $\frac{\pi_i}{d_i + 1}$. It is well understood [7] that the naive algorithm is not always optimal. We will now show how to construct graphs with a surprisingly high approximation ratio, roughly 13.8 for $\alpha = 0.85$, for the naive algorithm.

Lower Bound for the Approximation Ratio of the Naive Algorithm

We define a family of input graphs ("cycle versus sink" graphs) that have the following structure: There is a cycle with $k$ nodes, where each node has a number of incoming links from $t_c$ other nodes (referred to as tail nodes). Tail nodes are used to boost the PageRanks of certain pages in the input graph and have an outdegree of 1. There are also $k$ sink nodes (no outlinks), each one with a tail of $t_s$ nodes pointing to it. The target node is $x$ and it has outlinks towards all of the sinks. Figure 2 illustrates this family of graphs. Assume also that there is an isolated large clique with size $t_i$.

Fig. 2. A cycle versus sink graph for the naive algorithm.

Due to symmetry all pages in the cycle will have the same PageRank $\pi_c$ and the $k$ sink pages will have the PageRank $\pi_s$. All tail nodes have no incoming links and will also have the same PageRank, denoted by $\pi_t$. The PageRank of the target node is $\pi_x$ and the PageRank of each node in the isolated clique is $\pi_i$. The initial PageRanks for this kind of symmetric graph can be computed by writing a linear system of equations based on the identity $\pi^T = \pi^T Q$. The total number of nodes is $n = k(t_s + t_c + 2) + t_i + 1$.

$$\begin{aligned}
\pi_t &= \frac{1-\alpha}{n} + \frac{\alpha k}{n}\,\pi_s \\
\pi_x &= \pi_t \\
\pi_s &= \pi_t + \alpha\left(\frac{\pi_x}{k} + t_s \pi_t\right) \\
\pi_c &= \pi_t + \alpha\left(\pi_c + t_c \pi_t\right) \\
\pi_i &= \pi_t + \alpha \pi_i .
\end{aligned}$$

We need to add $k$ new links towards the target node. We will pick the sizes of the tails $t_c$, $t_s$, and therefore the PageRanks in the initial network, so that the PageRank (divided by outdegree plus one) of the cycle nodes is slightly higher than the same quantity for the sink nodes. Therefore the naive algorithm will choose to add $k$ links from the $k$ cycle nodes. Once one link has been added, the rest of the cycle nodes are not desirable anymore, a fact that the naive algorithm fails to observe. The optimal solution is to add $k$ links from the sink nodes. In order to make sure cycle nodes are chosen by the naive algorithm, we need to ensure that

$$\frac{\pi_c}{\mathrm{outdeg}(c)+1} > \frac{\pi_s}{\mathrm{outdeg}(s)+1} \;\Rightarrow\; \frac{\pi_c}{2} > \pi_s \;\Rightarrow\; \frac{\pi_c}{\pi_s} = 2 + \delta$$

for some $\delta > 0$. We then parameterize our tails:

$$t_c = u \tag{1}$$
$$t_i = u^2 \tag{2}$$
$$t_s = \frac{u}{2(1 - \lambda\alpha)} \tag{3}$$

where $u$ determines the size of the graph and $\lambda$ is the solution of $\pi_c / \pi_s = 2 + \delta$, giving

$$\lambda = \frac{\left((\alpha^2 - \alpha)\delta + 2\alpha^2\right) k u + 2\left((\alpha - 1)\delta + 2\alpha - 1\right) k + 2(\alpha^2 - \alpha)\delta + 4(\alpha^2 - \alpha)}{2\alpha^2 k u + \left((2\alpha^2 - 2\alpha)\delta + 4\alpha^2 - 2\alpha\right) k + (2\alpha^3 - 2\alpha^2)\delta + 4\alpha^3 - 4\alpha^2} .$$

We can solve for $\lambda$ for any desired value of $\delta$. Note also that we choose the clique size $t_i$ to be $u^2$ so that the clique asymptotically dominates all the tails.

The naive algorithm will therefore add $k$ links from the cycle nodes, which results in the following linear system for the PageRanks:

$$\begin{aligned}
\pi_t^g &= \frac{1-\alpha}{n} + \frac{\alpha k}{n}\,\pi_s^g \\
\pi_x^g &= \pi_t^g + \alpha k\,\frac{\pi_c^g}{2} \\
\pi_s^g &= \pi_t^g + \alpha\left(\frac{\pi_x^g}{k} + t_s \pi_t^g\right) \\
\pi_c^g &= \pi_t^g + \alpha\left(\frac{\pi_c^g}{2} + t_c \pi_t^g\right) \\
\pi_i^g &= \pi_t^g + \alpha \pi_i^g .
\end{aligned}$$

The optimal solution is to choose $k$ links from the sink nodes, with a resulting PageRank vector described by the following system:

$$\begin{aligned}
\pi_t^o &= \frac{1-\alpha}{n} \\
\pi_x^o &= \pi_t^o + \alpha k\,\pi_s^o \\
\pi_s^o &= \pi_t^o + \alpha\left(\frac{\pi_x^o}{k} + t_s \pi_t^o\right) \\
\pi_c^o &= \pi_t^o + \alpha\left(\pi_c^o + t_c \pi_t^o\right) \\
\pi_i^o &= \pi_t^o + \alpha \pi_i^o .
\end{aligned}$$

We solve these systems and calculate the approximation ratio of the naive algorithm:

$$\frac{\pi_x^o}{\pi_x^g} = \frac{(\alpha^3 - 2\alpha^2)\, k\, t_s + (\alpha^2 - 2\alpha)\, k + \alpha - 2}{(\alpha^4 - \alpha^2)\, k\, t_c + (\alpha^3 - \alpha)\, k - \alpha^3 + 2\alpha^2 + \alpha - 2} . \tag{4}$$

We now set our tails as described above in Equations 1-3 and let $u, k \to \infty$. So for large values of the tail sizes we get the following limit:

$$\lim_{u,k \to \infty} \frac{\pi_x^o}{\pi_x^g} = \frac{2 - \alpha}{(\alpha^3 - \alpha^2 - \alpha + 1)\,\delta + 2\alpha^3 - 2\alpha^2 - 2\alpha + 2} . \tag{5}$$

Now letting $\delta \to 0$ (as any positive value serves our purpose) we get the following theorem.

Theorem 1. Consider the Link Building problem with target node $x$. Let $G = (V, E)$ be some directed graph. Let $\pi_x^o$ denote the highest possible PageRank that the target node can achieve after adding $k$ links, and $\pi_x^g$ denote the PageRank after adding the links returned by the naive algorithm from Fig. 1. Then for any $\epsilon > 0$ there exist infinitely many different graphs $G$ where

$$\frac{\pi_x^o}{\pi_x^g} > \frac{2 - \alpha}{2(1 - \alpha)(1 - \alpha^2)} - \epsilon . \tag{6}$$

Note that $\epsilon$ can be written as a function of $u$, $\delta$, $k$ and $\alpha$. As $u, k \to \infty$, $\epsilon \to 0$, giving the asymptotic lower bound. For $\alpha = 0.85$ the lower bound is about 13.8.

3.2 Link Building is in APX

In this section we present a greedy polynomial time algorithm for Link Building, computing a set of $k$ new backlinks to target node $x$ with a corresponding value of $\pi_x^G$ within a constant factor from the optimal value. In other words, we prove that Link Building is a member of the complexity class APX.

We also introduce $z_{ij}$ as the expected number of visits of node $j$ starting at node $i$ without zapping within the random surfer model. These values can be computed in polynomial time [1].

Proof of APX Membership

Now consider the algorithm consisting of $k$ steps where at each step we add a backlink to node $x$ producing the maximum increase in $\frac{\pi_x}{z_{xx}}$; the pseudo code of the algorithm is shown in Fig. 3. This algorithm runs in polynomial time, producing a solution to the Link Building problem within a constant factor from the optimal value, as stated by the following theorem. So, Link Building is a member of the complexity class APX.

r-Greedy(G, x, k)
    S := ∅
    repeat k times
        Let u be a node which maximizes the value of $\frac{\pi_x}{z_{xx}}$ in $G(V, E \cup \{(u, x)\})$
        S := S ∪ {u}
        E := E ∪ {(u, x)}
    Report S as the solution

Fig. 3. Pseudo code for the greedy approach.
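The greedy loop of Fig. 3 can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than the authors' implementation: the helper names (`step`, `pagerank`, `visits_xx`, `r_greedy`), the adjacency-set representation, and the fixed iteration count are all assumptions of this sketch. It takes $z_{xx}$ to be entry $(x, x)$ of $(I - \alpha P)^{-1}$ and evaluates it by summing the series $\sum_l (\alpha P)^l$; the paper refers to [1] for computing these values and does not prescribe this particular method.

```python
def step(v, links, alpha):
    """Return alpha * v^T P; sinks are treated as linking to every node."""
    n = len(links)
    sink_mass = sum(v[i] for i in range(n) if not links[i])
    out = [alpha * sink_mass / n] * n
    for i, targets in enumerate(links):
        for j in targets:
            out[j] += alpha * v[i] / len(targets)
    return out


def pagerank(links, alpha, iters=300):
    """Power iteration; valid because the entries of pi keep summing to 1."""
    n = len(links)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [(1.0 - alpha) / n + s for s in step(pi, links, alpha)]
    return pi


def visits_xx(links, x, alpha, iters=300):
    """z_xx: expected visits to x, starting at x, before the first zap."""
    v = [0.0] * len(links)
    v[x] = 1.0
    total = 1.0  # the start at x counts as one visit
    for _ in range(iters):
        v = step(v, links, alpha)
        total += v[x]
    return total


def r_greedy(out_links, x, k, alpha=0.85):
    """Pick k backlink sources, each maximizing pi_x / z_xx when added."""
    links = [set(t) for t in out_links]  # work on a copy
    chosen = []
    for _ in range(k):
        best_u, best_score = None, -1.0
        for u in range(len(links)):
            if u == x or x in links[u]:
                continue  # skip the target and existing backlinks
            links[u].add(x)
            score = pagerank(links, alpha)[x] / visits_xx(links, x, alpha)
            links[u].remove(x)
            if score > best_score:
                best_u, best_score = u, score
        if best_u is None:
            break  # no candidate sources left
        links[best_u].add(x)
        chosen.append(best_u)
    return chosen
```

For example, in the five-node graph `[set(), set(), {1}, {1}, {1}]` with target node 0, the first source chosen is node 1, the sink that already collects the walk from nodes 2, 3 and 4. Each greedy step recomputes both quantities from scratch for every candidate, which keeps the sketch short at the cost of efficiency.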

Theorem 2. We let $\pi_x^G$ and $z_{xx}^G$ denote the values obtained by the r-Greedy algorithm in Fig. 3. Denoting the optimal value by $\pi_x^o$, we have the following:

$$\pi_x^G \;\ge\; \pi_x^o\, \frac{z_{xx}^G}{z_{xx}^o}\left(1 - \frac{1}{e}\right) \;\ge\; \pi_x^o\, (1 - \alpha^2)\left(1 - \frac{1}{e}\right),$$

where $e = 2.71828\ldots$ and $z_{xx}^o$ is the value of $z_{xx}$ corresponding to $\pi_x^o$.

Proof. Proposition 2.1 in [1] by Avrachenkov and Litvak states the following:

$$\pi_x = \frac{1 - \alpha}{n}\, z_{xx} \Big(1 + \sum_{i \ne x} r_{ix}\Big), \tag{7}$$

where $r_{ix}$ is the probability that a random surfer starting at $i$ reaches $x$ before zapping. This means that the algorithm in Fig. 3 greedily adds backlinks to $x$ in an attempt to maximize the probability of reaching node $x$ before zapping, for a surfer dropped at a node chosen uniformly at random. We show in Lemma 1 below that $r_{ix}$ in the graph obtained by adding links from $X \subseteq V$ to $x$ is a submodular function of $X$. Informally, this means that adding the link $(u, x)$ early in the process produces a higher increase of $r_{ix}$ compared to adding the link later. We also show in Lemma 2 below that $r_{ix}$ is not decreasing after adding $(u, x)$, which is intuitively clear. We now conclude from (7) that $\frac{\pi_x}{z_{xx}}$ is a submodular and nondecreasing function, since $\frac{\pi_x}{z_{xx}}$ is a sum of submodular and nondecreasing terms. When we greedily maximize a nonnegative nondecreasing submodular function we will always obtain a solution within a fraction $1 - \frac{1}{e}$ from the optimal, according to [6] by Nemhauser et al. We now have that:

$$\frac{\pi_x^G}{z_{xx}^G} \;\ge\; \frac{\pi_x^o}{z_{xx}^o}\left(1 - \frac{1}{e}\right).$$

Finally, we use the fact that $z_{xx}^G$ and $z_{xx}^o$ are numbers between 1 and $\frac{1}{1 - \alpha^2}$.

For $\alpha = 0.85$ this gives an upper bound on $\frac{\pi_x^o}{\pi_x^G}$ of approximately 5.7. It must be stressed that this upper bound is considerably smaller if $z_{xx}$ is close to the optimal value prior to the modification: if $z_{xx}$ cannot be improved then the upper bound is $\frac{e}{e - 1} \approx 1.58$. It may be the case that we obtain a bigger value of $\pi_x$ by greedily maximizing $\pi_x$ instead of $\frac{\pi_x}{z_{xx}}$, but $\pi_x$ (the PageRank of the target node throughout the Link Building process) is not a submodular function of $X$, so we cannot use the approach above to analyze this situation.
To see that $\pi_x$ is not submodular we just have to observe that adding a backlink from a sink node, creating a short cycle, late in the process will produce a higher increase in $\pi_x$ compared to adding the link early in the process.

Proof of Submodularity and Monotonicity of $r_{ix}$

Let $f_i(X)$ denote the value of $r_{ix}$ in $G(V, E \cup (X \times \{x\}))$, the graph obtained after adding links from all nodes in $X$ to $x$.
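To make the quantities $f_i(X)$ concrete, the sketch below evaluates them numerically by repeatedly applying the one-step recursion that appears as (9) in the proof of Lemma 1, starting from $f_x = 1$ and $f_i = 0$ for every other node. The function name `reach_probs`, its arguments, and the number of rounds are assumptions of this example; it is a numerical illustration for small graphs, not part of the proof.

```python
def reach_probs(out_links, x, added, alpha=0.85, rounds=300):
    """Approximate f_i(X) for all i: the probability of reaching x from i
    before zapping, after adding links (u, x) for every u in `added`.

    out_links[i] is the set of nodes that i links to (hypothetical input
    format). A node with no links is a sink and is treated as linking to
    all n nodes, as in Sect. 2.
    """
    n = len(out_links)
    links = [set(t) | ({x} if i in added else set())
             for i, t in enumerate(out_links)]
    f = [0.0] * n
    f[x] = 1.0
    for _ in range(rounds):
        new = []
        for i in range(n):
            if i == x:
                new.append(1.0)  # the target is already reached
            else:
                targets = links[i] or range(n)  # sinks jump uniformly
                new.append(alpha * sum(f[j] for j in targets) / len(targets))
        f = new
    return f
```

After $r$ rounds the list holds the probabilities of reaching $x$ within $r$ steps, so the values increase towards $f_i(X)$ as the number of rounds grows. Comparing `reach_probs(g, x, A | {y})` minus `reach_probs(g, x, A)` against the same difference for a larger set `B` gives a quick empirical look at the diminishing-returns property on a given graph.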

Lemma 1. $f_i$ is submodular for every $i \in V$.

Proof. Let $f_i^r(X)$ denote the probability of reaching $x$ from $i$ without zapping, in $r$ steps or less, in $G(V, E \cup (X \times \{x\}))$. We will show by induction in $r$ that $f_i^r$ is submodular. We will show the following for arbitrary $A \subseteq B$ and $y \notin B$:

$$f_i^r(B \cup \{y\}) - f_i^r(B) \;\le\; f_i^r(A \cup \{y\}) - f_i^r(A). \tag{8}$$

We start with the induction basis $r = 1$. It is not hard to show that the two sides of (8) are equal for $r = 1$. For the induction step: if you want to reach $x$ in $r + 1$ steps or less, you have to follow one of the links to your neighbors and reach $x$ in $r$ steps or less from the neighbor:

$$f_i^{r+1}(X) = \frac{\alpha}{\mathrm{outdeg}(i)} \sum_{j : i \to j} f_j^r(X), \tag{9}$$

where $j : i \to j$ denotes the nodes that $i$ links to; this set includes $x$ if $i \in X$. The outdegree of $i$ is also dependent on $X$. If $i$ is a sink in $G(V, E \cup (X \times \{x\}))$ then we can use (9) with $\mathrm{outdeg}(i) = n$ and $\{j : i \to j\} = V$; as explained in Sect. 2, the sinks can be thought of as linking to all nodes in the graph. Please also note that $f_x^r(X) = 1$. We will now show that the following holds for every $i \in V$, assuming that (8) holds for every $i \in V$:

$$f_i^{r+1}(B \cup \{y\}) - f_i^{r+1}(B) \;\le\; f_i^{r+1}(A \cup \{y\}) - f_i^{r+1}(A). \tag{10}$$

1. $i \in A$: The set $j : i \to j$ and $\mathrm{outdeg}(i)$ are the same for all four terms in (10). We use (9) and the induction hypothesis to see that (10) holds.
2. $i \in B \setminus A$:
   (a) $i$ is a sink in $G(V, E)$: The left hand side of (10) is 0 while the right hand side is positive or 0 according to Lemma 2 below.
   (b) $i$ is not a sink in $G(V, E)$: In this case $j : i \to j$ includes $x$ on the left hand side of (10) but not on the right hand side. This is the only difference between the two sets, and $\mathrm{outdeg}(i)$ is one bigger on the left hand side. We now use (9), the induction hypothesis and $\forall X : f_x^r(X) = 1$.
3. $i = y$: We rearrange (10) such that the two terms including $y$ are the only terms on the left hand side. We now use the same approach as for the case $i \in B \setminus A$.
4. $i \in V \setminus (B \cup \{y\})$: As the case $i \in A$.

Finally, we use $\lim_{r \to \infty} f_i^r(X) = f_i(X)$ to prove that (8) holds for $f_i$.

Lemma 2. $f_i$ is nondecreasing for every $i \in V$.

Proof. We shall prove the following by induction in $r$ for $y \notin B$:

$$f_i^r(B \cup \{y\}) \;\ge\; f_i^r(B). \tag{11}$$

We start with the induction basis $r = 1$.

1. $i = y$: The left hand side is $\frac{\alpha}{\mathrm{outdeg}(y)}$, where $\mathrm{outdeg}(y)$ is the new outdegree of $y$, and the right hand side is at most $\frac{\alpha}{n}$ (if $y$ is a sink in $G(V, E)$).
2. $i \ne y$: The two sides are the same.

For the induction step, assume that (11) holds for $r$ and all $i \in V$. We will show that the following holds:

$$f_i^{r+1}(B \cup \{y\}) \;\ge\; f_i^{r+1}(B). \tag{12}$$

1. $i = y$:
   (a) $i$ is a sink in $G(V, E)$: The left hand side of (12) is $\alpha$ and the right hand side is smaller than $\alpha$.
   (b) $i$ is not a sink in $G(V, E)$: We use (9) in (12) and obtain simple averages on both sides, with bigger numbers on the left hand side due to the induction hypothesis.
2. $i \ne y$: Again we can obtain averages where the numbers are bigger on the left hand side due to the induction hypothesis.

Again we use $\lim_{r \to \infty} f_i^r(X) = f_i(X)$ to conclude that (11) holds for $f_i$.

4 Discussion and Open Problems

We have presented a constant-factor approximation polynomial time algorithm for Link Building. We also presented a lower bound for the approximation ratio achieved by a perhaps more intuitive and simpler greedy algorithm. The problem of developing a polynomial time approximation scheme (PTAS) for Link Building remains open.

References

1. Avrachenkov, K., Litvak, N.: The effect of new links on Google PageRank. Stochastic Models 22(2), 319-331 (2006)
2. Bianchini, M., Gori, M., Scarselli, F.: Inside pagerank. ACM Transactions on Internet Technology 5(1), 92-128 (Feb 2005)
3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems 30(1-7), 107-117 (1998)
4. De Kerchove, C., Ninove, L., Van Dooren, P.: Maximizing PageRank via outlinks. Linear Algebra and its Applications 429(5-6), 1254-1276 (Sep 2008)
5. Langville, A., Meyer, C.: Deeper inside pagerank. Internet Mathematics 1(3), 335-380 (2004)
6. Nemhauser, G., Wolsey, L., Fisher, M.: An analysis of approximations for maximizing submodular set functions I. Mathematical Programming 14(1), 265-294 (1978)
7. Olsen, M.: Maximizing PageRank with New Backlinks. CIAC Algorithms and complexity pp. 37-48 (2010)
8. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab (November 1999), http://ilpubs.stanford.edu:8090/422/
9. SEOmoz: Search engine 2009 ranking factors (2009), http://www.seomoz.org/