Filtering: A Method for Solving Graph Problems in MapReduce


Silvio Lattanzi, Google, Inc., New York, NY, USA
Benjamin Moseley, University of Illinois, Urbana, IL, USA
Sergei Vassilvitskii, Yahoo! Research, New York, NY, USA
Siddharth Suri, Yahoo! Research, New York, NY, USA

ABSTRACT
The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems for sufficiently dense graphs. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we parameterize our algorithms by the amount of memory available on the machines, allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.

Categories and Subject Descriptors
F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

* Work done while visiting Yahoo! Labs.
Partially supported by NSF grants CCF and CCF.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA'11, June 4-6, 2011, San Jose, California, USA. Copyright 2011 ACM .../11/06...$10.00.

General Terms
Algorithms, Theory

Keywords
MapReduce, Graph Algorithms, Matchings

1. INTRODUCTION
The amount of data available and requiring analysis has grown at an astonishing rate in recent years. For example, Yahoo! processes over 100 billion events, amounting to over 120 terabytes, daily [7]. Similarly, Facebook processes over 80 terabytes of data per day [17]. Although the amount of memory in commercially available servers has also grown at a remarkable pace in the past decade, and now exceeds a once unthinkable amount of 100 GB, it remains woefully inadequate to process such huge amounts of data. To cope with this deluge of information people have (again) turned to parallel algorithms for data processing. In recent years MapReduce [3], and its open source implementation, Hadoop [20], have emerged as the standard platform for large scale distributed computation. About 5 years ago, Google reported that it processes over 3 petabytes of data using MapReduce in one month [3]. Yahoo! and Facebook use Hadoop as their primary method for analyzing massive data sets [7, 17]. Moreover, over 100 companies and 10 universities are using Hadoop [6, 1] for large scale data analysis. Many different types of data have contributed to this growth. One particularly rich datatype that has captured the interest of both industry and academia is massive graphs. Graphs such as the World Wide Web can easily consist of billions of nodes and trillions of edges [16].
Citation graphs, affiliation graphs, instant messenger graphs, and phone call graphs have recently been studied as part of social network analysis. Although it was previously thought that graphs of this nature are sparse, the work of Leskovec, Kleinberg and Faloutsos [14] dispelled this notion. The authors analyzed the growth over time of 9 different massive graphs from 4 different domains and showed that graphs grow dense over time. Specifically, if n(t) and e(t) denote the number of nodes and edges at time t, respectively, they show that e(t) ∝ n(t)^{1+c}, where 1 ≥ c > 0. The lowest value of c they find is 0.08, but they observe three graphs with c > 0.5. The algorithms we present are efficient for such dense graphs, as well as their sparser counterparts. Previous approaches to graph algorithms in MapReduce attempt to shoehorn message passing style algorithms into the framework
[9, 15]. These algorithms often require O(d) rounds, where d is the diameter of the input graph, even for such simple tasks as computing connected components, minimum spanning trees, etc. A round in a MapReduce computation can be very expensive time-wise, because it often requires a massive amount of data (on the order of terabytes) to be transmitted from one set of machines to another. This is usually the dominant cost in a MapReduce computation, so minimizing the number of rounds is essential for efficient MapReduce computations. In this work we show how many fundamental graph algorithms can be computed in a constant number of rounds. We use the previously defined model of computation for MapReduce [13] to perform our analysis.

1.1 Contributions
All of our algorithms take the same general approach, which we call filtering, and proceed in two stages. They begin by using the parallelism of MapReduce to selectively drop, or filter, parts of the input, with the goal of reducing the problem size until the result is small enough to fit into a single machine's memory. In the second stage the algorithms compute the final answer on this reduced input. The technical challenge is to drop enough edges while still being able to compute either an optimal or a provably near optimal solution. The filtering step differs in complexity depending on the problem and takes a few slightly different forms. We exhibit the flexibility of this approach by showing how it can be used to solve a variety of graph problems. In Section 2.4 we apply the filtering technique to computing the connected components and minimum spanning trees of dense graphs. The algorithm, which is much simpler and more efficient than the one that appeared in [13], partitions the original input and solves a subproblem on each partition. It recurses until the data set is small enough to fit into the memory of a single machine. In Section 3 we turn to the problem of matchings, and show how to compute a maximal matching in three MapReduce rounds in the model of [13].
The algorithm works by solving a subproblem on a small sample of the original input. We then use this interim solution to prune out the vast majority of edges of the original input, thus dramatically reducing the size of the remaining problem, and recursing if it is not small enough to fit onto a single machine. The algorithm allows for a tradeoff between the number of rounds and the available memory. Specifically, for graphs with at most n^{1+c} edges and machines with memory at least n^{1+ε}, our algorithm requires O(⌈c/ε⌉) rounds. If the machines have memory O(n), then our algorithm requires O(log n) rounds. We then use this algorithm as a building block, and show algorithms for computing an 8-approximation for maximum weighted matching, a 2-approximation to vertex cover and a 3/2-approximation to edge cover. For all of these algorithms, the number of machines used will be at most O(N/η), where N is the size of the input and η is the memory available on each machine. That is, these algorithms require just enough machines to fit the entire input across all of the machines. Finally, in Section 4 we adapt the seminal work of Karger [10] to the MapReduce setting. Here the filtering succeeds with a limited probability; however, we argue that we can replicate the algorithm enough times in parallel so that one of the runs succeeds without destroying the minimum cut.

1.2 Related Work
The authors of [13] give a formal model of computation for MapReduce called MRC, which we briefly summarize in the next section. There are two models of computation that are similar to MRC. We describe these models and their relationship to MRC in turn. We also discuss how known algorithms in those models relate to the algorithms presented in this work. The algorithms presented in this paper run in a constant number of rounds when the memory per machine is superlinear in the number of vertices (n^{1+ε} for some ε > 0). Although this requirement is reminiscent of the semi-streaming model [4], the similarities end there, as the two models are very different.
One problem that is hard to solve in semi-streaming but can be solved in MRC is graph connectivity. As shown in [4], in the semi-streaming model it is impossible to answer connectivity queries without a superlinear amount of memory. In MRC, however, previous work [13] shows how to answer connectivity queries when the memory per machine is limited to n^{1−ε}, albeit at the cost of a logarithmic number of rounds. Conversely, a problem that is trivial in the semi-streaming model but more complicated in MRC is finding a maximal matching. In the semi-streaming model one simply streams through the edges, adding each edge to the current matching if it is feasible. As we show in Section 3, finding a maximal matching in MRC is a computable, but nontrivial, endeavor. The technical challenge for this algorithm stems from the fact that no single machine can see all of the edges of the input graph; rather, the model requires the algorithm designer to parallelize the processing.¹ Although parallel algorithms are seeing a resurgence, this is an area that was widely studied previously under different models of parallel computation. The most popular is the PRAM model, which allows for a polynomial number of processors with shared memory. There are hundreds of papers on solving problems in this model, and previous work [5, 13] shows how to simulate certain types of PRAM algorithms in MRC. Most of these results yield MRC algorithms that require Ω(log n) rounds, whereas in this work we focus on algorithms that use O(1) rounds. Nonetheless, to compare with previous work, we next describe PRAM algorithms that either can be simulated in MRC, or could be directly implemented in MRC. Israeli and Itai [8] give an O(log n) round algorithm for computing maximal matchings on a PRAM. It could be implemented in MRC, but would require O(log n) rounds. Similarly, [19] gives a distributed algorithm which yields a constant factor approximation to the weighted matching problem. This algorithm, which could also be implemented in MRC, takes O(log n) rounds.
Finally, Karger's algorithm is in RNC but also requires O(log n) rounds. We show how to implement it in MapReduce in a constant number of rounds in Section 4.

2. PRELIMINARIES

2.1 MapReduce Overview
We remind the reader of the salient features of the MapReduce computing paradigm (see [13] for more details). The input, and all intermediate data, is stored in ⟨key; value⟩ pairs, and the computation proceeds in rounds. Each round is split into three consecutive phases: map, shuffle and reduce. In the map phase the input is processed one tuple at a time. All ⟨key; value⟩ pairs emitted by the map phase which have the same key are then aggregated by the MapReduce system during the shuffle phase and sent to the same machine. Finally, each key, along with all the values associated with it, is processed together during the reduce phase. Since all the values with the same key end up on the same machine, one can view the map phase as a kind of routing step that determines which values end up together. The key acts as a (logical) address of the machine, and the system makes sure all of the ⟨key; value⟩ pairs with the same key are collected on the same machine.

¹ In practice this requirement stems from the fact that even streaming through a terabyte of data requires a nontrivial amount of time, as the machine remains I/O bound.

To simplify our reasoning about the model, we can combine the reduce and the subsequent map phase. Looking at the computation through this lens, in every round each machine performs some computation on the set of ⟨key; value⟩ pairs assigned to it (reduce phase), and then designates which machine each output value should be sent to in the next round (map phase). The shuffle ensures that the data is moved to the right machine, after which the next round of computation can begin. In this simpler model, we shall only use the term machines, as opposed to mappers and reducers. More formally, let ρ_j denote the reduce function for round j, and let μ_{j+1} denote the map function for the following round of an MRC algorithm [13], where j ≥ 1. Now let φ_j(x) = μ_{j+1} ∘ ρ_j(x). Here ρ_j takes as input some set of ⟨key; value⟩ pairs, denoted by x, and outputs another set of ⟨key; value⟩ pairs. We define the operator ∘ to feed the output of ρ_j(x) to μ_{j+1} one ⟨key; value⟩ pair at a time. Thus φ_j denotes the operation of first executing the reducer function, ρ_j, on the set of values in x and then executing the map function, μ_{j+1}, on each ⟨key; value⟩ pair output by ρ_j(x) individually. This syntactic change allows the algorithm designer to avoid defining mappers and reducers and instead define what each machine does during each round of computation and specify which machine each output ⟨key; value⟩ pair should go to. We can now translate the restrictions on ρ_j and μ_j from the MRC model of [13] to restrictions on φ_j. Since we are joining the reduce and the subsequent map phase, we combine the restrictions imposed on both of these computations. There are three sets of restrictions: those on the number of machines, the memory available on each machine, and the total number of rounds taken by the computation. For an input of size N, and a sufficiently small ε > 0, there are N^{1−ε} machines, each with N^{1−ε} memory available for computation.
As a result, the total amount of memory available to the entire system is O(N^{2−2ε}). See [13] for a discussion and justification. An algorithm in MRC belongs to MRC^i if it runs in worst case O(log^i N) rounds. Thus, when designing an MRC^0 algorithm there are three properties that need to be checked:

Machine Memory: In each round the total memory used by a single machine is at most O(N^{1−ε}) bits.
Total Memory: The total amount of data shuffled in any round is O(N^{2−2ε}) bits. (In other words, the total amount of data shuffled in any round must be less than the total amount of memory in the system.)
Rounds: The number of rounds is a constant.

2.2 Total Work and Work Efficiency
Next we define the amount of work done by an MRC algorithm by taking the standard definition of work efficiency from the PRAM setting and adapting it to the MapReduce setting. Let w(N) denote the amount of work done by an r-round MRC algorithm on an input of size N. This is simply the sum of the amount of work done during each round of computation. The amount of work done during round i of a computation is the product of the number of machines used in that round, denoted p_i(N), and the worst case running time of each machine, denoted t_i(N). More specifically,

    w(N) = Σ_{i=1}^{r} w_i(N) = Σ_{i=1}^{r} p_i(N) t_i(N).    (1)

If the amount of work done by an MRC algorithm matches the running time of the best known sequential algorithm, we say the MRC algorithm is work efficient.

Algorithm: MST(V, E)
1: if |E| < η then
2:   Compute T = MST(E)
3:   return T
4: end if
5: l ← Θ(|E|/η)
6: Partition E into E_1, E_2, ..., E_l, where |E_i| < η, using a universal hash function h: E → {1, 2, ..., l}.
7: In parallel: Compute T_i, the minimum spanning tree on G(V, E_i).
8: return MST(V, ∪_i T_i)

Figure 1: Minimum spanning tree algorithm

2.3 Notation
Let G = (V, E) be an undirected graph, and denote n = |V| and m = |E|. We call G c-dense if m = n^{1+c}, where 0 < c ≤ 1. In what follows we assume that the machines have some limited memory η. We will assume that the number of available machines is O(m/η).
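As a quick sanity check of Figure 1, the recursion can be simulated in a few lines of Python. This is a minimal single-process sketch, not a distributed implementation: the "machines" run sequentially, Python's built-in hash stands in for the universal hash h, and a plain union-find Kruskal replaces the O(m α(m, n))-time MST subroutine; the names mst_forest and filtering_mst are hypothetical. The sketch assumes η exceeds the number of vertices, as the analysis does, since otherwise a partition may contain no cycle and the edge set would not shrink (the code falls back to a direct computation in that case).

```python
def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mst_forest(edges):
    """Kruskal's algorithm: return the edges of a minimum spanning forest.
    Every discarded edge is the heaviest edge on some cycle, so by the
    cycle property it cannot belong to the MST of any supergraph."""
    parent = {}
    kept = []
    for w, u, v in sorted(edges):
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            kept.append((w, u, v))
    return kept

def filtering_mst(edges, eta):
    """Sequential simulation of Figure 1: hash-partition the edges into
    pieces of roughly eta edges, prune each piece to a spanning forest,
    and recurse on the union of the survivors."""
    if len(edges) < eta:
        return mst_forest(edges)
    l = len(edges) // eta + 1
    parts = [[] for _ in range(l)]
    for e in edges:
        parts[hash(e) % l].append(e)      # stand-in for the universal hash h
    survivors = [e for part in parts for e in mst_forest(part)]
    if len(survivors) == len(edges):      # toy safeguard: eta was too small
        return mst_forest(survivors)
    return filtering_mst(survivors, eta)
```

Because a non-forest edge of a partition is the heaviest edge on some cycle, the cycle property guarantees it is safe to discard, so the recursion returns exactly the MST whenever the edge weights are distinct.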
Notice that the number of machines is just the number required to fit the input on all of the machines simultaneously. All of our algorithms will consider the case where η = n^{1+ε} for some ε > 0. For a constant ε, the algorithms we define take a constant number of rounds and lie in MRC^0 [13], beating the Ω(log n) running time provided by the PRAM simulation constructions (see Theorem 7.1 in [13]). However, even when η = O(n) our algorithms run in O(log n) rounds. This exposes the memory vs. rounds tradeoff, since most of the algorithms presented take fewer rounds as the memory per machine increases. We now proceed to describe the individual algorithms, in order of progressively more complex filtering techniques.

2.4 Warm Up: Connected Components and Minimum Spanning Trees
We present the formal algorithm for computing minimum spanning trees (the connected components algorithm is identical). The algorithm works by partitioning the edges of the input graph into subsets of size η and sending each subgraph to its own machine. Then, each machine throws out any edge that is guaranteed not to be part of any MST because it is the heaviest edge on some cycle in that machine's subgraph. If the resulting graph fits into the memory of a single machine, the algorithm terminates; otherwise, it recurses on the smaller instance. We give the pseudocode in Figure 1. We assume the algorithm is given a c-dense graph, that each machine has memory η = O(n^{1+ε}), and that the number of machines is l = Θ(n^{c−ε}). Thus the algorithm only uses enough memory, across the entire system, to store the input. We show that every iteration reduces the input size by a factor of n^{ε}, and thus after ⌈c/ε⌉ iterations the algorithm terminates.

LEMMA 2.1. Algorithm MST(V, E) terminates after ⌈c/ε⌉ iterations and returns the minimum spanning tree.

PROOF. To show correctness, note that any edge that is not part of the MST of a subgraph of G is also not part of the MST of G, by the cycle property of minimum spanning trees.
It remains to show that (1) the memory constraints of each machine are never violated and (2) the total number of rounds is limited. Since the partition is done randomly, an easy Chernoff argument shows that no machine gets assigned more than η edges
with high probability. Finally, note that |∪_i T_i| ≤ l(n − 1) = O(n^{1+c−ε}). Therefore after ⌈c/ε⌉ − 1 iterations the input is small enough to fit onto a single machine, and the overall algorithm terminates after ⌈c/ε⌉ rounds.

LEMMA 2.2. The MST(V, E) algorithm does O((cm/ε) α(m, n)) total work.

PROOF. During a specific iteration, randomly partitioning E into E_1, E_2, ..., E_l requires a linear scan over the edges, which is O(m) work. Computing the minimum spanning tree T_i of each part of the partition using the algorithm of [2] takes O(l · (m/l) · α(m, n)) = O(m α(m, n)) work. Computing the MST of the sparsified graph on one machine using the same algorithm requires l(n − 1)α(m, n) = O(m α(m, n)) work. Thus, for constant ε, the MRC algorithm uses O(m α(m, n)) work. Since the best known sequential algorithm [11] runs in time O(m) in expectation, the MRC algorithm is work efficient up to a factor of α(m, n).

3. MATCHINGS AND COVERS
The maximum matching problem and its variants play a central role in theoretical computer science, so it is natural to ask whether it is possible to efficiently compute a maximum matching, or, more simply, a maximal matching, in the MapReduce framework. The question is not trivial. Indeed, due to the constraints of the model, it is not possible to store (or even stream through) all of the edges of the graph on a single machine. Furthermore, it is easy to come up with examples where a partitioning technique similar to that used for MSTs (Section 2.4) yields an arbitrarily bad matching. Simply sampling the edges uniformly, or even using one of the sparsification approaches [18], appears unfruitful, because good sparsifiers do not necessarily preserve maximal matchings. Despite these difficulties, we show that by combining a simple sampling technique and a postprocessing strategy it is possible to compute an unweighted maximal matching, and thus a 2-approximation to the unweighted maximum matching problem, using only machines with memory of size O(n) and O(log n) rounds.
More generally, we show that we can find a maximal matching on c-dense graphs in O(⌈c/ε⌉) rounds using machines with Ω(n^{1+ε}) memory; only three rounds are necessary if ε = 2c/3. We extend this technique to obtain an 8-approximation algorithm for maximum weighted matching, and use similar approaches to approximate the vertex and edge cover problems. This section is organized as follows: first we present the algorithm for unweighted maximal matching, then we explain how to use this algorithm to solve the weighted maximum matching problem. Finally, we show how the techniques can be adapted to solve the minimum vertex and minimum edge cover problems.

3.1 Unweighted Maximal Matchings
The algorithm works by first sampling O(η) edges and finding a maximal matching M_1 on the resulting subgraph. Given this matching, we can safely remove edges that are in conflict with it (i.e. those incident on nodes in M_1) from the original graph G. If the resulting filtered graph H is small enough to fit onto a single machine, the algorithm augments M_1 with a matching found on H. Otherwise, we augment M_1 with the matching found by recursing on H. Note that since the size of the graph decreases from round to round, the effective sampling probability increases, resulting in a larger fraction of the remaining graph being sampled. Formally, let G(V, E) be a simple graph where n = |V| and |E| ≤ n^{1+c} for some c > 0. We begin by assuming that each of the machines has at least η memory. We fix the exact value of η later, but require that η ≥ 40n. We give the pseudocode for the algorithm below:

1. Set M = ∅ and S = E.
2. Sample every edge (u, v) ∈ S uniformly at random with probability p = η/(10|S|). Let E′ ⊆ E be the set of sampled edges.
3. If |E′| > η, the algorithm fails. Otherwise, give the graph G(V, E′) as input to a single machine and compute a maximal matching M′ on it. Set M = M ∪ M′.
4. Let I be the set of unmatched vertices in G. Compute the subgraph of G induced by I, G[I], and let E[I] be the set of edges in G[I]. If |E[I]| > η, set S = E[I] and return to step 2.
Otherwise, continue to step 5.
5. Compute a maximal matching M″ on G[I] and output M = M ∪ M″.

To proceed we need the following technical lemma, which shows that, with high probability, every induced subgraph with sufficiently many edges has at least one edge in the sample.

LEMMA 3.1. Let E′ ⊆ E be a set of edges chosen independently with probability p. Then with probability at least 1 − e^{−n}, for all I ⊆ V, either |E[I]| < 2n/p or E[I] ∩ E′ ≠ ∅.

PROOF. Fix one such subgraph, G[I] = (I, E[I]), with |E[I]| ≥ 2n/p. The probability that none of the edges in E[I] were chosen to be in E′ is (1 − p)^{|E[I]|} ≤ (1 − p)^{2n/p} ≤ e^{−2n}. Since there are at most 2^n possible induced subgraphs G[I], the probability that there exists one with no edge in E′ is at most 2^n e^{−2n} ≤ e^{−n}.

Next we bound the number of iterations the algorithm takes. Note that the term iteration refers to the number of times the algorithm is repeated; it does not refer to a MapReduce round.

LEMMA 3.2. If η ≥ 40n then the algorithm runs for at most O(log n) iterations with high probability. Furthermore, if η = n^{1+ε}, where 0 < ε < c is a fixed constant, then the algorithm runs for at most ⌈c/ε⌉ iterations with high probability.

PROOF. Fix an iteration i of the algorithm and let p be the sampling probability for this iteration. Let E_i be the set of edges at the beginning of this iteration, and denote by I the set of unmatched vertices after this iteration. From Lemma 3.1, if |E[I]| ≥ 2n/p then an edge of E[I] will be sampled with high probability. Note that no edge in E[I] is incident on any edge in M. Thus, if an edge from E[I] were sampled, our algorithm would have chosen this edge to be in the matching, contradicting the fact that no vertex in I is matched. Hence, |E[I]| ≤ 2n/p = 20|E_i|n/η with high probability. Now consider the first iteration of the algorithm, and let G_1(V_1, E_1) be the induced graph on the unmatched nodes after the first iteration. The above argument implies that |E_1| ≤ 20|E|n/η ≤ |E|/2, since η ≥ 40n. Similarly, |E_2| ≤ 20|E_1|n/η ≤ |E_1|/2 ≤ |E|/4, so after i iterations |E_i| ≤ |E|/2^i. The first part of the claim follows.
To conclude the proof, note that if η = n^{1+ε} we have |E_i| ≤ |E|(20/n^{ε})^i, and thus the algorithm terminates after ⌈c/ε⌉ iterations.

We continue by showing the correctness of the algorithm.

THEOREM 3.1. The algorithm finds a maximal matching of G = (V, E) with high probability.

PROOF. First consider the case that the algorithm does not fail. Assume, for the sake of contradiction, that there exists an edge
(u, v) ∈ E such that neither u nor v is matched in the final matching M that is output. Consider the last iteration of the algorithm. Since (u, v) ∈ E and u and v are not matched, (u, v) ∈ E[I]. Since this is the last run of the algorithm, a maximal matching M″ of G[I] is computed on one machine. Since M″ is maximal, either u or v or both must be matched in it. All of the edges of M″ get added to M in the last step, which gives our contradiction. Next, consider the case that the algorithm fails. This occurs when the set of edges E′ has size larger than η in some iteration of the algorithm. Note that E[|E′|] = |S|p = η/10 in a given iteration. By the Chernoff bound it follows that |E′| > η with probability smaller than 2^{−η/40} ≤ 2^{−n} (since η ≥ 40n). By Lemma 3.2 the algorithm completes in at most O(log n) iterations, thus by the union bound the total failure probability is bounded by O(log n · 2^{−n}).

Finally we show how to implement this algorithm in MapReduce.

COROLLARY 3.1. The maximal matching algorithm can be implemented in three MapReduce rounds when η = n^{1+2c/3}. Furthermore, when η = n^{1+ε} the algorithm runs for 3⌈c/ε⌉ rounds, and O(log n) rounds when η ≥ 40n.

PROOF. By Lemma 3.2 the algorithm runs for one iteration with high probability when η = n^{1+2c/3}, and for ⌈c/ε⌉ iterations when η = n^{1+ε}. Therefore it only remains to describe how to compute the graph G[I]. For this we appeal to Lemma 6.1 in [13], where the set S_i is the set of edges incident on node i, and the function f_i drops an edge if it is matched and keeps it otherwise. Hence, each iteration of the algorithm requires 3 MapReduce rounds.

LEMMA 3.3. The maximal matching algorithm presented above is work efficient when η = n^{1+ε}, where 0 < ε < c is a fixed constant.

PROOF. By Lemma 3.2, when η = n^{1+ε} there are at most a constant number of iterations of the algorithm. Thus it suffices to show that O(m) work is done in a single iteration.
Sampling each edge with probability p requires a linear scan over the edges, which is O(m) work. Computing a maximal matching on one machine can be done using a straightforward greedy semi-streaming algorithm requiring |E′| ≤ η ≤ m work. Computing G[I] can be done as follows. Load M onto each of the O(m/η) machines and partition E among those machines. Then, if an edge in E is incident on an edge in M, the machines drop that edge; otherwise that edge is in G[I]. This results in O(m) work to load all of the data onto the machines and O(m) work to compute G[I]. Since G[I] has at most m edges, computing M″ on one machine using the best known greedy semi-streaming algorithm also requires O(m) work.

Since the vertices in a maximal matching provide a 2-approximation to the vertex cover problem, we get the following corollary.

COROLLARY 3.2. A 2-approximation to the optimal vertex cover can be computed in three MapReduce rounds when η = n^{1+2c/3}. Further, when η = n^{1+ε} the algorithm runs for 3⌈c/ε⌉ rounds, and O(log n) rounds when η ≥ 40n. This algorithm does O(m) total work when η = n^{1+ε} for a constant ε > 0.

Experimental Validation
In this section we experimentally validate the above algorithm and demonstrate that it leads to significant runtime improvements in practice. Our data set consists of a sample of the Twitter follower network, previously used in [1]. The graph has over 50 million nodes and 2,669,502,959 edges, and takes about 44GB when stored on disk.

Figure 2: The running time of the MapReduce matching algorithm for different values of p, as well as the baseline provided by the streaming implementation.

We implemented the greedy streaming algorithm for maximal matching as well as the three phase MapReduce algorithm described above. The streaming algorithm remains I/O bound and completes in 81 minutes. The total running times for the MapReduce algorithm with different values of the sampling probability p are given in Figure 2. The MapReduce algorithm achieves a significant speedup (over 10x) for a large range of values of p.
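For intuition, the three phases being timed — sample, match the sample on one machine, filter — can be condensed into a single-process Python sketch. This is an illustrative simulation, not the Hadoop implementation used in the experiments: the function names are hypothetical, the sampling rate η/(10|S|) is taken from step 2 of the algorithm, and the greedy pass plays the role of the single machine computing a maximal matching.

```python
import random

def greedy_extend(matching, edges):
    # Greedily add edges whose endpoints are both still free.
    matched = {v for e in matching for v in e}
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))

def maximal_matching_filtering(edges, eta, seed=0):
    """Single-process sketch of the sampling/filtering matching algorithm:
    sample edges at rate eta / (10 * |S|), match the sample greedily 'on
    one machine', drop every edge incident to a matched vertex, and repeat
    on the survivors until they fit in 'memory' eta."""
    rng = random.Random(seed)
    matching, remaining = [], list(edges)
    while len(remaining) > eta:
        p = eta / (10 * len(remaining))       # sampling rate from step 2
        sample = [e for e in remaining if rng.random() < p]
        greedy_extend(matching, sample)
        matched = {v for e in matching for v in e}
        # Filtering step: keep only the edges of the induced subgraph G[I].
        remaining = [(u, v) for u, v in remaining
                     if u not in matched and v not in matched]
    greedy_extend(matching, remaining)        # finish on a single machine
    return matching
```

The returned matching is maximal by construction: every edge ever discarded already had a matched endpoint, and the final greedy pass handles whatever survives.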
The speedup is the result of the fact that a single machine never scans the whole input. Both the sampling in stage 1 and the filtering in stage 2 are performed in parallel. Note that the parallelization does not come for free, and the MapReduce system has nontrivial overhead over the straightforward streaming implementation. For example, when p = 1 the MapReduce algorithm essentially implements the streaming algorithm (since all of the edges are mapped onto a single machine), yet the running time is almost 2.5 times slower. Overall, these results show that the algorithms proposed are not only interesting from a theoretical viewpoint, but are viable and useful in practice.

3.2 Maximum Weighted Matching
We present an algorithm that computes a constant factor approximation to the maximum weighted matching problem using a constant number of MapReduce rounds. Our algorithm takes advantage of both the sequential and parallel power of MapReduce. Indeed, it computes several matchings in parallel and then combines them on a single machine to compute the final result. We assume that the maximum weight of an edge is polynomial in |E|, and we prove an 8-approximation guarantee. Our analysis is motivated by the work of Feigenbaum et al. [4], but is technically different since no single machine sees all of the edges. The input of the algorithm is a simple graph G(V, E) and a weight function w: E → R. We assume that |V| = n and |E| = n^{1+c} for a constant c > 0. Without loss of generality, assume that min{w(e) : e ∈ E} = 1 and let W = max{w(e) : e ∈ E}. The algorithm works as follows:

1. Split the graph G into G_1, G_2, ..., G_{⌈log W⌉}, where G_i is
the graph on the vertex set V containing the edges with weights in (2^{i−1}, 2^i].
2. For 1 ≤ i ≤ ⌈log W⌉, run the maximal matching algorithm on G_i. Let M_i be the maximal matching for the graph G_i.
3. Set M = ∅. Consider the edge sets sequentially, in descending order M_{⌈log W⌉}, ..., M_2, M_1. When considering an edge e ∈ M_i, add it to the matching if and only if M ∪ {e} is a valid matching. After all edges are considered, output M.

LEMMA 3.4. The above algorithm outputs an 8-approximation to the weighted maximum matching problem.

PROOF. Let OPT be the maximum weighted matching in G and denote by V(M_i) the set of vertices incident on the edges in M_i. Consider an edge (u, v) = e ∈ OPT such that e ∈ G_j for some j. Let i* be the maximum i such that {u, v} ∩ V(M_i) ≠ ∅. Note that i* must exist, and i* ≥ j, else we could have added e to M_j. Therefore, w(e) ≤ 2^{i*}. Now for every such edge (u, v) ∈ OPT we select one vertex from {u, v} ∩ V(M_{i*}); without loss of generality, let v be the selected vertex. We say that v is a blocking vertex for e. With each blocking vertex v we associate its incident edge in M_{i*} and call it the blocking edge for e. Let V_b(i) be the set of blocking vertices in V(M_i); we have that

    Σ_{i=1}^{⌈log W⌉} 2^i |V_b(i)| ≥ Σ_{e ∈ OPT} w(e).

This follows from the fact that every vertex can block at most one e ∈ OPT and that OPT is a valid matching. Note also that, from the definition of a blocking vertex, if (u, v) ∈ M ∩ M_j then u, v ∉ ∪_{k<j} V_b(k). Now suppose that an edge (x, y) ∈ M_k is discarded by step 3 of the algorithm. This can happen if and only if there is an edge with a higher weight, already present in the matching, adjacent to x or y. Formally, there is a (u, v) ∈ M with (u, v) ∈ M_j, j > k and {u, v} ∩ {x, y} ≠ ∅. Without loss of generality assume that {u, v} ∩ {x, y} = {x}, and consider such an edge (x, v). We say that (x, v) killed the edge (x, y) and the vertex y. Notice that an edge (u, v) ∈ M with (u, v) ∈ M_j kills at most two edges in every M_k with k < j, and kills at most two nodes in V_b(k).
Finally, we also define (u, v)_b as the set of blocking vertices associated with the blocking edge (u, v). Now consider V_b(k): each blocking vertex was either killed by one of the edges in the matching M, or is adjacent to one of the edges in M ∩ M_k. Furthermore, the total weight of the edges in OPT that were blocked by a blocking vertex killed by (u, v) ∈ M ∩ M_j is at most

    Σ_{i=1}^{j−1} 2^i |{v ∈ V_b(i) killed by (u, v)}| ≤ Σ_{i=1}^{j−1} 2^{i+1} ≤ 2^{j+1} ≤ 4w((u, v)).    (2)

To conclude, note that each edge in OPT that is not in M was either blocked directly by an edge in M, or was blocked by a vertex that was killed by an edge in M. To bound the former, consider an edge (u, v) ∈ M_j ∩ M. This edge can be incident on at most 2 edges in OPT, each of weight at most 2^j ≤ 2w((u, v)), and thus the weight of OPT edges incident on (u, v) is at most 4w((u, v)). Putting this together with Equation (2) we conclude:

    8 Σ_{(u,v) ∈ M} w((u, v)) ≥ Σ_{e ∈ OPT} w(e).

Furthermore, we can show that this analysis of our algorithm is essentially tight. Indeed, there exists a family of graphs where our algorithm finds a solution of weight w(OPT)/(8 − o(1)) with high probability. We prove the following lemma in Appendix A.

LEMMA 3.5. There is a graph where our algorithm computes a solution that has value w(OPT)/(8 − o(1)) with high probability.

Finally, suppose that the weight function w: E → R is such that w(e) ≤ O(poly(|E|)) for all e ∈ E, and that each machine has memory at least η ≥ max{n log n, n log W}. Then we can run the above algorithm in MapReduce using only one more round than the maximal matching algorithm. In the first round we split G into G_1, ..., G_{⌈log W⌉}; then we run the maximal matching algorithm of the previous subsection in parallel on the ⌈log W⌉ subgraphs. In the last round, we run the final combining step on a single machine. The last step is always possible because we have at most |V| log W edges, each of whose weights can be stored in O(log W) bits.

THEOREM 3.2. There is an algorithm that finds an 8-approximation to the maximum weighted matching problem on a c-dense graph using machines with memory η = n^{1+ε} in 3⌈c/ε⌉ + 1 rounds with high probability.

COROLLARY 3.3.
There is an algorithm that, with high probability, finds an 8-approximation to the maximum weighted matching problem and runs in four MapReduce rounds when η = n^{1+2c/3}.

To conclude the analysis, we now bound the amount of work performed by the maximum matching algorithm.

LEMMA 3.6. The amount of work performed by the maximum matching algorithm presented above is O(m) when η = n^{1+ε}, where 0 < ε < c is a fixed constant.

PROOF. The first step of the algorithm requires O(m) work, as it can be done using a linear scan over the edges. In the second step, by Lemma 3.3, each machine performs work that is linear in the number of edges assigned to it. Since the edges are partitioned across the machines, the total work done in the second step is O(m). Finally, we can perform the third step by a semi-streaming algorithm that greedily adds edges in the order M_{log W}, ..., M_2, M_1, requiring O(m) work.

3.3 Minimum Edge Cover
Next we turn to the minimum edge cover problem. An edge cover of a graph G(V, E) is a set of edges E' ⊆ E such that each vertex of V is an endpoint of at least one edge in E'. The minimum edge cover is an edge cover E' of minimum size. Let G(V, E) be a simple graph. The algorithm to compute an edge cover is as follows:
1. Find a maximal matching M of G using the procedure described in Section 3.1.
2. Let I be the set of uncovered vertices. For each uncovered vertex, take any edge incident on the vertex in I. Let this set of edges be U.
3. Output E' = M ∪ U.
Note that this procedure produces a feasible edge cover E'. To bound the size of E', let OPT denote the size of the minimum edge cover for the graph G and let OPT_m denote the size of the maximum matching in G. It is known that the size of the minimum edge cover of a graph is equal to |V| − OPT_m. We also know that |U| = |V| − 2|M|. Therefore,

  |E'| = |V| − |M| ≤ |V| − (1/2) OPT_m,

since a maximal matching has size at least (1/2) OPT_m. Knowing that OPT_m ≤ |V|/2 and using Corollary 3.1 to bound the number of rounds, we have the following theorem.

THEOREM 3.3. There is an algorithm that, with high probability, finds a 3/2-approximation to the minimum edge cover in MapReduce. If each machine has memory η ≥ 40n then the algorithm runs in O(log n) rounds. Further, if η = n^{1+ε}, where 0 < ε < c is a fixed constant, then the algorithm runs in 3c/(2ε) + 1 rounds.

COROLLARY 3.4. There is an algorithm that, with high probability, finds a 3/2-approximation to the minimum edge cover in four MapReduce rounds when η = n^{1+2c/3}.

Now we prove that the amount of work performed by the edge cover algorithm is O(m).

LEMMA 3.7. The amount of work performed by the edge cover algorithm presented above is O(m) when η = n^{1+ε}, where 0 < ε < c is a fixed constant.

PROOF. By Lemma 3.3, when η = n^{1+ε} the first step of the algorithm can be done performing O(m) operations. The second step can be performed by a semi-streaming algorithm that requires O(m) work. Thus the claim follows.

4. MINIMUM CUT
Whereas in the previous algorithms the filtering was done by dropping certain edges, this algorithm filters by contracting edges. Contracting an edge will obviously reduce the number of edges, and may either keep the number of vertices the same (in the case we contracted a self loop), or reduce it by one. To compute the minimum cut of a graph we appeal to the contraction algorithm introduced by Karger [10]. The algorithm has a well known property that the random choices made in the early rounds succeed with high probability, whereas those made in the later rounds have a much lower probability of success. We exploit this property by showing how to filter the input in the first phase (by contracting edges) so that the remaining graph is guaranteed to be small enough to fit onto a single machine, yet large enough to ensure that the failure probability remains bounded.
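As a point of reference, the sequential contraction primitive of Karger's algorithm [10], which this phase simulates in a distributed fashion, can be sketched as follows. This is an illustrative single-machine sketch only; the function name and the union-find representation are ours, not part of the algorithm described here.

```python
import random

def contract(edges, n, target):
    """Contract randomly ordered edges of an n-vertex multigraph
    (vertices 0..n-1) until only `target` super-vertices remain;
    return the surviving edges between distinct super-vertices."""
    parent = list(range(n))

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    remaining = n
    order = edges[:]
    random.shuffle(order)  # stands in for tagging edges with random labels
    for u, v in order:
        if remaining == target:
            break
        ru, rv = find(u), find(v)
        if ru != rv:           # skip edges internal to a super-vertex
            parent[ru] = rv
            remaining -= 1
    # Self loops are dropped; parallel edges are kept, as in a multigraph.
    return [(find(u), find(v)) for u, v in edges if find(u) != find(v)]
```

On a path 0-1-2-3 contracted down to two super-vertices, exactly one edge survives regardless of the random order, mirroring how the distributed phase leaves a small residual instance for a single machine.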
Once the filtering phase is complete, and the problem instance is small enough to fit onto a single machine, we can employ any one of the well known methods to find the minimum cut in the filtered graph. We then decrease the failure probability by running several executions of the algorithm in parallel, thus ensuring that in one of the copies the minimum cut survives the filtering phase.
The complicating factor in the scheme above is contracting the right number of edges so that the properties above hold. We proceed by labeling each edge with a random number between 0 and 1 and then searching for a threshold t so that contracting all of the edges with label less than t results in the desired number of vertices. Typically such a search would take logarithmic time; however, by doing the search in parallel across a large number of machines, we can reduce the depth of the recursion tree to be constant. Moreover, to compute the number of vertices remaining after the first t edges are contracted, we refer to the connected components algorithm in Section 2.4. Since the connected components algorithm uses a small number of machines, we can show that even with many parallel invocations we will not violate the machine budget. We present the algorithm and its analysis below. The algorithm uses two subroutines, Find_t and Contract, which are defined in turn.

Algorithm 1 MinCut(E)
1: for i = 1 to n^{δ1} (in parallel) do
2:   tag e ∈ E with a number r_e chosen uniformly at random from [0, 1]
3:   t ← Find_t(E, 0, 1)
4:   E_i ← Contract(E, t)
5:   C_i ← min cut of E_i
6: end for
7: return minimum cut over all C_i

4.1 Find Algorithm
The pseudocode for the algorithm to find the correct threshold is given below. The algorithm performs a parallel search on the value t so that contracting all edges with label at most t results in a graph with n^{δ3} vertices. The algorithm invokes n^{δ2} copies of the connected components algorithm, each of which uses at most n^{c−ε} machines with n^{1+ε} memory.
Algorithm Fid t(e, mi, max) 1: {Uses δ+c/ɛ machies.} : γ max mi δ 3: for j = 1 to δ (i parallel) do 4: τ j mi +jγ 5: E j {e E r e τ j} 6: cc j umber of coected compoets i G = (V, E j) 7: ed for 8: if there exists a j such that cc j = δ 3 the 9: retur j 10: else 11: retur Fid t(e, τ j, τ j+1) where j is the smallest value s.t. cc j < δ 3, cc j+1 > δ 3 1: ed if 4. Cotractio Algorithm We state the cotractio algorithm ad prove bouds o its performace. Algorithm 3 Cotract(E, t) 1: CC coected compoets i {e E r e t} : let h : [] [ δ 4 ] be a uiversal hash fuctio 3: map each edge (u, v) to machie h(u) ad h(v) 4: map the assigmet of ode u to its coected compoet CC(u), to machie h(u) 5: o each reducer reame all istaces of u to CC(u) 6: map each edge (u, v) to machie h(u) + h(v) 7: Drop self loops (edges i same coected compoet) 8: Aggregate parallel edges LEMMA 4.1. The Cotract algorithm uses δ 4 machies with O( m ) space with high probability. δ 4 PROOF. Partitio V ito parts P j = {v V j 1 < deg(v) j }. Sice the degree of each ode is bouded by, there are at most log parts i the partitio. Defie the volume of part j as V j = P j j. Parts havig volume less tha m 1 ɛ could all be mapped to oe reducer without violatig its space restrictio. We ow focus o parts with V j > m 1 ɛ, ad so let P j be such a part. Thus P j cotais betwee m1 ɛ ad m1 ɛ vertices. Let ρ j j
8 be a arbitrary reducer. Sice h is uiversal, the probability that ay vertex v P j maps to ρ is exactly δ 4. Therefore, i expectatio, the umber of vertices of P j mappig to ρ is at most m1 ɛ. j δ 4 Sice each of these vertices has degree at most j, i expectatio the umber of edges that map to ρ is at most m1 ɛ. Let the radom variable X j deote the umber of vertices from P j that map δ 4 to ρ. Say that a bad evet happes if more tha 4m1 ɛ vertices of V j j map to ρ. Cheroff bouds tell us that the probability of such a evet happeig is O(1/ δ 4 ), Pr»X j > 10m1 ɛ δ 4 < «10m 1 ɛ δ 4 < 1. (3) δ 4 Takig a uio boud over all δ 4 reducers ad log parts, we ca coclude that the probability of ay reducer beig overloaded is bouded below by 1 o(1). 4.3 Aalysis of the MiCut Algorithm We proceed to boud the total umber of machies, maximum amout of memory, ad the total umber of rouds used by the MiCut algorithm. LEMMA 4.. The total umber of machies used by the MiCut algorithm is δ 1 ` δ +c ɛ + δ 4. PROOF. The algorithm begis by ruig δ 1 parallel copies of a simpler algorithm which first ivokes Fid t, to fid a threshold t for each istace. This algorithm uses δ parallel copies of a coected compoet algorithm, which itself uses c ɛ machies (see Sectio.4). After fidig the threshold, we ivoke the Cotract algorithm, which uses δ 4 machies per istace. Together this gives the desired umber of machies. LEMMA 4.3. The memory used by each machie durig the executio of MiCut is bouded by max{ δ 3, 1+c δ 4, 1+ɛ }. PROOF. There are three distict steps where we must boud the memory. The first is the the searchig phase of Fid t. Sice this algorithm executes istaces of the coected compoets algorithm i parallel, the results of Sectio.4 esure that each istace uses at most η = 1+ɛ memory. The secod is the cotractio algorithm. Lemma 4.1 assures us that the iput to each machie is of size at most O( m δ 4 ). 
Finally, the last step of MinCut requires that we load an instance with n^{δ3} vertices, and hence at most n^{2δ3} edges, onto a single machine.

LEMMA 4.4. Suppose the amount of memory available per machine is η = n^{1+ε}. MinCut runs in O(1/(εδ2)) rounds.

PROOF. The only variable part of the running time is the number of rounds necessary to find a threshold τ so that the number of connected components in Find_t is exactly n^{δ3}. Observe that after the k-th recursive call, the number of edges with threshold between min and max is m/n^{kδ2}. Therefore the algorithm must terminate after at most (1+c)/δ2 recursive calls, which is constant for constant δ2.

We are now ready to prove the main theorem.

THEOREM 4.1. Algorithm MinCut returns the minimum cut in G with probability at least 1 − o(1), uses at most η = n^{1+ε} memory per machine and completes in O(1/ε²) rounds.

PROOF. We first show that the success probability is at least

  1 − (1 − n^{2(δ3−1)})^{n^{δ1}}.

The algorithm invokes n^{δ1} parallel copies of the following approach: (1) simulate Karger's contraction algorithm [10] for the first n − n^{δ3} steps, resulting in a graph G_t, and (2) identify the minimum cut on G_t. By a corollary of [12], step (1) succeeds with probability at least p = Ω(n^{2(δ3−1)}). Since the second step can be made to fail with 0 probability, each of the parallel copies succeeds with probability at least p. By running n^{δ1} independent copies of the algorithm, the probability that all of the copies fail in step (1) is at most (1 − p)^{n^{δ1}}.
To prove the theorem, we must find a setting of the parameters δ1, δ2, δ3, and δ4 so that the memory, machines, and correctness constraints are satisfied:

  max{n^{2δ3}, n^{1+c−δ4}} ≤ η = n^{1+ε}   (Memory)
  n^{δ1}(n^{δ2+c−ε} + n^{δ4}) = o(m) = o(n^{1+c})   (Machines)
  (1 − n^{2(δ3−1)})^{n^{δ1}} = o(1)   (Correctness)

Setting δ1 = 1 − ε/2, δ2 = ε, δ3 = (1+ε)/2, and δ4 = c satisfies all of them.

Acknowledgments
We would like to thank Ashish Goel, Jake Hofman, John Langford, Ravi Kumar, Serge Plotkin and Cong Yu for many helpful discussions.

5. REFERENCES
[1] E. Bakshy, J. Hofman, W. Mason, and D. J. Watts. Everyone's an influencer: Quantifying influence on Twitter. In Proceedings of WSDM, 2011.
[2] Bernard Chazelle. A minimum spanning tree algorithm with inverse-Ackermann type complexity. Journal of the ACM, 47(6), November 2000.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, 2004.
[4] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207-216, December 2005.
[5] Michael T. Goodrich. Simulating parallel algorithms in the MapReduce framework with applications to parallel computational geometry. Second Workshop on Massive Data Algorithmics (MASSIVE 2010), June 2010.
[6] Hadoop Wiki - Powered By.
[7] Blake Irving. Big data and the power of Hadoop. Yahoo! Hadoop Summit, June 2010.
[8] Amos Israeli and A. Itai. A fast and simple randomized parallel algorithm for maximal matching. Information Processing Letters, 22(2):77-80, 1986.
[9] U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. HADI: Fast diameter estimation and mining in massive graphs with Hadoop. Technical report, CMU, December 2008.
[10] David R. Karger. Global min-cuts in RNC and other ramifications of a simple min-cut algorithm. In Proceedings of SODA, pages 21-30, January 1993.
[11] David R. Karger, Philip N. Klein, and Robert E. Tarjan. A randomized linear-time algorithm for finding minimum spanning trees. In Proceedings of STOC, pages 9-15, New York, NY, USA, 1994. ACM.
[12] David R. Karger and Clifford Stein. An Õ(n²) algorithm for minimum cuts. In Proceedings of STOC, May 1993.
[13] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of SODA, 2010.
[14] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005.
[15] Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Number 7 in Synthesis Lectures on Human Language Technologies. Morgan and Claypool, April 2010.
[16] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of SIGMOD, Indianapolis, IN, USA, June 2010. ACM.
[17] Mike Schroepfer. Inside large-scale analytics at Facebook. Yahoo! Hadoop Summit, June 2010.
[18] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. In Proceedings of STOC, New York, NY, USA, 2008. ACM.
[19] Mirjam Wattenhofer and Roger Wattenhofer. Distributed weighted matching. In Proceedings of DISC. Springer, 2003.
[20] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.
[21] Yahoo! Inc. Press Release. Yahoo! partners with four top universities to advance cloud computing systems and applications research. April 2009.

APPENDIX
A. WEIGHTED MATCHING LOWER BOUND
LEMMA A.1. There is a graph where our algorithm computes a solution that has value w(OPT)/(8 − o(1)) with high probability.

PROOF. Let G(V, E) be a graph on n nodes and m edges, and fix a weight bound W. We say that a bipartite graph G(V, E_1, E_2) is balanced if |E_1| = |E_2|.
Consider the following graph: there is a central balanced bipartite clique, B_1, on n/log W nodes, and all the edges of the clique have weight W/2 + 1. Each side of the central bipartite clique is also part of log W − 1 other balanced bipartite cliques. We refer to those cliques as B^{E_1}_2, B^{E_1}_3, ..., B^{E_1}_{log W}, B^{E_2}_2, B^{E_2}_3, ..., B^{E_2}_{log W}. In both B^{E_1}_i and B^{E_2}_i the edges have weight W/2^i + 1. Furthermore, every node in B_1 is also connected to an additional node of degree one by an edge of weight W, and every node in B^{E_1}_i \ B_1 and B^{E_2}_i \ B_1 is connected to a node of degree one by an edge of weight W/2^{i−1}. Figure 3 shows the subgraph composed of B_1 and the two cliques B^{E_1}_i and B^{E_2}_i.

Figure 3: The subgraph (we have drawn only the edges of weight W/2^{i−1}, W/2^i + 1, W/2^i and W/2^{i+1} + 1) of the graph G for which our algorithm finds a solution with value w(OPT)/(8 − o(1)) with high probability.

Note that the optimal weighted maximum matching for this graph is the one composed of all the edges incident to a node of degree one; its total value is

  Σ_{i=1}^{log W} 2s · W/2^{i−1} = 2sW Σ_{k=0}^{log W − 1} 2^{−k} = (4 − o(1)) sW,

where s denotes the number of vertices on one side of each clique. Now we will show an upper bound on the performance of our algorithm that holds with high probability. Recall that in step one our algorithm splits the graph into subgraphs G_1, ..., G_{log W}, where the edges in G_i have weight in (W/2^i, W/2^{i−1}]; then it computes a maximal matching on each G_i using the technique shown in the previous subsection. In particular, the algorithm works as follows: it samples the edges in G_i with probability p_i = C|V_i|^{1+ε}/|E_i|, then computes a maximal matching on the sample, and finally it tries to match the unmatched nodes using the edges between the unmatched nodes in G_i. To understand the value of the matching returned by the algorithm, consider the graph G_i: note that this graph is composed only of the edges in B^{E_1}_i and B^{E_2}_i and the edges connected to nodes of degree one with weight W/2^{i−1}. We refer to those last edges as the heavy edges in G_i.
Note that the heavy edges are all connected to vertices in B^{E_1}_i \ B_1 and B^{E_2}_i \ B_1. Let s be the number of vertices in a side of B^{E_1}_i, and note that G_i, for i > 1, has 6s nodes and 2s² + 2s edges. Recall that we sample edges with probability p_i = C|V_i|^{1+ε}/|E_i| = C(6s)^{1+ε}/(2s² + 2s), so in expectation we sample 2s · p_i = Θ((6s)^ε) heavy edges; thus, using the Chernoff bound, the probability that the number of sampled heavy edges is greater than or equal to 3(6s)^{(1−ε)/2} is at most e^{−3(6s)^{(1−ε)/2}}. Further, notice that by Lemma 3.1, every set of nodes in B^{E_1}_i or B^{E_2}_i spanning (6s)^{1+ε} edges contains at least one sampled edge with probability at least 1 − e^{−Ω((6s)^{2ε})}. Thus the maximum number of nodes left unmatched in B^{E_1}_i or B^{E_2}_i after step 2 of the maximal matching algorithm is smaller than (6s)^{(1+ε)/2} + Θ(3(6s)^{(1−ε)/2}), so even if we matched those nodes with the heavy edges in G_i we use at most (6s)^{(1+ε)/2} + Θ(3(6s)^{(1−ε)/2}) of those edges. Thus we have that for every G_i with i > 1, the maximal matching algorithm uses at most

  Θ(6(6s)^{(1−ε)/2}) + (6s)^{(1+ε)/2} = o(s / log s)

heavy edges with probability at least 1 − e^{−Ω((6s)^{2ε})}. With the same reasoning, just with different constants, we notice that the same fact holds also for G_1. So we have that for every G_i we use only o(|V_i| / log |V_i|) heavy edges with probability 1 − e^{−Θ(|V_i|^{2ε})}; further, notice that every maximal matching that the algorithm computes always matches the nodes in B_1, because it is always possible to use the edges that connect those nodes to the nodes of degree one. Knowing that, we notice that the final matching that our algorithm outputs is composed of the maximal matching of G_1 plus all the heavy edges in the maximal matchings of G_2, ..., G_{log W}. Thus we have that, with probability Π_i (1 − e^{−Θ(|V_i|^{2ε})}) = 1 − o(1),³ the total weight of the computed solution is upper bounded by

  (W/2 + 1) s + Σ_{i=1}^{log W} o(s / log s) · W/2^{i−1} = sW/2 + o(sW),

and so the ratio between the optimum and the solution is

  (4 − o(1)) sW / (sW/2 + o(sW)) = 8 − o(1).

Thus the claim follows.

³Note that every |V_i| = Θ(n / log W).
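The bucketing scheme analyzed above (split edges into weight classes, compute a maximal matching per class, then merge greedily from the heaviest class down) can be sketched sequentially as follows. This is an illustrative single-machine sketch: the function names are ours, and the greedy pass below stands in for the distributed maximal matching subroutine of Section 3.1.

```python
import math

def bucketed_matching(weighted_edges):
    """8-approximate maximum weighted matching: round weights into
    classes (2^{i-1}, 2^i], take a maximal matching per class, then
    merge greedily from the heaviest class down."""
    buckets = {}
    for u, v, w in weighted_edges:
        i = max(0, math.ceil(math.log2(w)))   # w falls in (2^{i-1}, 2^i]
        buckets.setdefault(i, []).append((u, v, w))

    def maximal_matching(edges):
        # Sequential stand-in for the distributed maximal matching step.
        matched, M = set(), []
        for u, v, w in edges:
            if u not in matched and v not in matched:
                matched.update((u, v))
                M.append((u, v, w))
        return M

    M, matched = [], set()
    for i in sorted(buckets, reverse=True):   # descending weight classes
        for u, v, w in maximal_matching(buckets[i]):
            # Keep an edge only if it stays a valid matching (step 3).
            if u not in matched and v not in matched:
                matched.update((u, v))
                M.append((u, v, w))
    return M
```

On the path 0-1-2-3 with weights 8, 1, 8, the two weight-8 edges are accepted first and the weight-1 edge is blocked, matching the descending-order merge described in step 3.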