The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines

Muhammad Anis Uddin Nasir #1, Gianmarco De Francisci Morales 2, David García-Soriano 3, Nicolas Kourtellis 4, Marco Serafini $5
# KTH Royal Institute of Technology, Stockholm, Sweden
Yahoo Labs, Barcelona, Spain
$ Qatar Computing Research Institute, Doha, Qatar
1 anisu@kth.se, 2 gdfm@apache.org, 3 davidgs@yahoo-inc.com, 4 kourtell@yahoo-inc.com, 5 mserafini@qf.org.qa

Abstract—We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce PARTIAL KEY GROUPING (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.

I. INTRODUCTION

Distributed stream processing engines (DSPEs) such as S4,1 Storm,2 and Samza3 have recently gained much attention owing to their ability to process huge volumes of data with very low latency on clusters of commodity hardware. Streaming applications are represented by directed acyclic graphs (DAGs) where vertices, called processing elements (PEs), represent operators, and edges, called streams, represent the data flow from one PE to the next. For scalability, streams are partitioned into sub-streams and processed in parallel on replicas of the PE called processing element instances (PEIs).

Applications of DSPEs, especially in data mining and machine learning, typically require accumulating state across the stream by grouping the data on common fields [1, 2]. Akin to MapReduce, this grouping in DSPEs is usually implemented by partitioning the stream on a key and ensuring that messages with the same key are processed by the same PEI. This partitioning scheme is called key grouping. Typically, it maps keys to sub-streams by using a hash function. Hash-based routing allows each source PEI to route each message solely via its key, without needing to keep any state or to coordinate among PEIs. Alas, it also results in load imbalance, as it represents a single-choice paradigm [3] and disregards the popularity of a key, i.e., the number of messages with the same key in the stream, as depicted in Figure 1.

1 https://incubator.apache.org/s4
2 https://storm.incubator.apache.org
3 https://samza.incubator.apache.org

Fig. 1: Load imbalance generated by skew in the key distribution when using key grouping. The color of each message represents its key.

Large web companies run massive deployments of DSPEs in production. Given their scale, good utilization of the resources is critical. However, the skewed distribution of many workloads causes a few PEIs to sustain a significantly higher load than others. This suboptimal load balancing leads to poor resource utilization and inefficiency. Another partitioning scheme, called shuffle grouping, achieves excellent load balancing by using round-robin routing, i.e., by sending each message to a new PEI in cyclic order, irrespective of its key. However, this scheme is mostly suited for stateless computations.
Shuffle grouping may require an additional aggregation phase and more memory to express stateful computations (Section II). Additionally, it may cause a decrease in accuracy for data mining algorithms (Section VI). In this work, we focus on the problem of load balancing of stateful applications in DSPEs when the input stream follows a skewed key distribution. In this setting, load balancing is attained by having upstream PEIs create a balanced partition of messages for downstream PEIs, for each edge of the DAG. Any practical solution for this task needs to be both streaming and distributed: the former constraint enforces the use of an online algorithm, as the distribution of keys is not known in advance, while the latter calls for a decentralized solution with minimal coordination overhead in order to ensure scalability.
To address this problem, we leverage the power of two choices [4] (PoTC), whereby the system picks the least loaded out of two candidate PEIs for each key. However, to maintain the semantics of key grouping while using PoTC (i.e., so that one key is handled by a single PEI), sources would need to track which of the two possible choices has been made for each key. This requirement imposes a coordination overhead every time a new key appears, so that all sources agree on the choice. In addition, sources should then store this choice in a routing table. Each edge in the DAG would thus require a routing table for every source, each with one entry per key. Given that a typical stream may contain billions of keys, this solution is not practical.

Instead, we propose to relax the key grouping constraint and allow each key to be handled by both candidate PEIs. We call this technique key splitting; it allows us to apply PoTC without the need to agree on, or keep track of, the choices made. As shown in Section V, key splitting guarantees good load balance even in the presence of skew.

A second issue is how to estimate the load of a downstream PEI. Traditional work on PoTC assumes global knowledge of the current load of each server, which is challenging in a distributed system. Additionally, it assumes that all messages originate from a single source, whereas messages in a DSPE are generated in parallel by multiple sources. In this paper we prove that, interestingly, a simple local load estimation technique, whereby each source independently tracks the load of downstream PEIs, performs very well in practice. This technique gives results that are almost indistinguishable from those given by a global load oracle. The combination of these two techniques (key splitting and local load estimation) enables a new stream partitioning scheme named PARTIAL KEY GROUPING.

In summary, we make the following contributions.
- We study the problem of load balancing in modern distributed stream processing engines.
- We show how to apply PoTC to DSPEs in a principled and practical way, and propose two novel techniques to do so: key splitting and local load estimation.
- We propose PARTIAL KEY GROUPING, a novel and simple stream partitioning scheme that applies to any DSPE. When implemented on top of Apache Storm, it requires a single function and less than 20 lines of code.4
- We measure the impact of PKG on a real deployment on Apache Storm. Compared to key grouping, it improves the throughput of an example application on real-world datasets by up to 60%, and the latency by up to 45%.

4 Available at https://github.com/gdfm/partial-key-grouping

II. PRELIMINARIES AND MOTIVATION

We consider a DSPE running on a cluster of machines that communicate by exchanging messages following the flow of a DAG, as discussed. In this work, we focus on balancing the data transmission along a single edge in a DAG. Load balancing across the whole DAG is achieved by balancing along each edge independently. Each edge represents a single stream of data, along with its partitioning scheme.

Given a stream under consideration, let the set of upstream PEIs (sources) be S, and the set of downstream PEIs (workers) be W, with sizes |S| = S and |W| = W (see Figure 1). The input to the engine is a sequence of messages m = ⟨t, k, v⟩, where t is the timestamp at which the message is received, k ∈ K is the message key (the key space K has size |K| = K), and v is the value. The messages are presented to the engine in ascending order by timestamp.

A stream partitioning function P_t : K → N maps each key in the key space to a natural number, at a given time t. This number identifies the worker responsible for processing the message. Each worker is associated to one or more keys.
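To make the model concrete, the sketch below expresses the two classical schemes as instances of a stream partitioning function. It is illustrative only: the interface and class names are ours, not the API of any particular DSPE.

```java
// Illustrative sketch of the model above; names are ours, not a DSPE API.
interface StreamPartitioner {
    // P_t: map the key of the next message to a worker index in [0, W).
    int route(String key);
}

// Key grouping (single choice): P_t(k) = H(k) mod W.
final class KeyGrouping implements StreamPartitioner {
    private final int numWorkers;
    KeyGrouping(int numWorkers) { this.numWorkers = numWorkers; }
    public int route(String key) {
        // floorMod avoids negative indices for negative hash codes.
        return Math.floorMod(key.hashCode(), numWorkers);
    }
}

// Shuffle grouping: round-robin, ignores the key entirely.
final class ShuffleGrouping implements StreamPartitioner {
    private final int numWorkers;
    private int next = -1;
    ShuffleGrouping(int numWorkers) { this.numWorkers = numWorkers; }
    public int route(String key) {
        next = (next + 1) % numWorkers;
        return next;
    }
}
```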
We use a definition of load similar to others in the literature (e.g., Flux [5]). At time t, the load of a worker i is the number of messages handled by the worker up to t:

L_i(t) = |{ ⟨τ, k, v⟩ : P_τ(k) = i ∧ τ ≤ t }|

In principle, depending on the application, two different messages might impose a different load on workers. However, in most cases these differences even out, and modeling such application-specific differences is not necessary.

We define the imbalance at time t as the difference between the maximum and the average load of the workers:

I(t) = max_i( L_i(t) ) − avg_i( L_i(t) ), for i ∈ W

We tackle the problem of identifying a stream partitioning function that minimizes the imbalance, while at the same time avoiding the downsides of shuffle grouping.

A. Existing Stream Partitioning Functions

Data is sent between PEs by exchanging messages over the network. Several primitives are offered by DSPEs for sources to partition the stream, i.e., to route messages to different workers. There are two main primitives of interest: key grouping (KG) and shuffle grouping (SG).

KG ensures that messages with the same key are handled by the same PEI (analogous to MapReduce). It is usually implemented through hashing. SG routes messages independently, typically in a round-robin fashion. SG provides excellent load balance by assigning an almost equal number of messages to each PEI. However, no guarantee is made on the partitioning of the key space, as each occurrence of a key can be assigned to any PEI. SG is the perfect choice for stateless operators. However, with stateful operators one has to handle, store, and aggregate multiple partial results for the same key, thus incurring additional costs.

In general, when the distribution of input keys is skewed, the number of messages that each PEI needs to handle can vary greatly. While this problem is not present for stateless operators, which can use SG to evenly distribute messages, stateful operators implemented via KG suffer from load imbalance. This issue generates a degradation of the service level, or reduces the utilization of the cluster, which must be provisioned to handle the peak load of the single most loaded server.
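As a worked example of the definitions of L_i(t) and I(t) above, the following minimal snippet (illustrative only, not from the paper's implementation) computes the imbalance from a load vector:

```java
import java.util.Arrays;

final class ImbalanceMetric {
    // loads[i] = L_i(t): number of messages handled by worker i up to time t.
    static double imbalance(long[] loads) {
        long max = Arrays.stream(loads).max().orElse(0);
        double avg = Arrays.stream(loads).average().orElse(0.0);
        return max - avg; // I(t) = max_i L_i(t) - avg_i L_i(t)
    }

    public static void main(String[] args) {
        // Example: 4 workers; skew concentrates load on worker 0.
        long[] loads = {85, 25, 25, 25};
        System.out.println(imbalance(loads)); // prints 45.0 (85 - 40)
    }
}
```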
Example. To make the discussion more concrete, we introduce a simple application that will be our running example: streaming top-k word count. This application is an adaptation of the classical MapReduce word count to the streaming paradigm, where we want to generate a list of the top-k words by frequency at periodic intervals (e.g., every T seconds). It is also a common application in many domains, for example to identify trending topics in a stream of tweets.

Implementation via key grouping. Following the MapReduce paradigm, the implementation of word count described by Neumeyer et al. [6] or Noll [7] uses KG on the source stream. The counter PE keeps a running counter for each word. KG ensures that each word is handled by a single PEI, which thus has the total count for the word in the stream. At periodic intervals, the counter PEIs send their top-k counters to a single downstream aggregator to compute the top-k words. While this application is clearly simplistic, it models quite well a general class of applications common in data mining and machine learning whose goal is to create a model by tracking aggregated statistics of the data.

Clearly, KG generates load imbalance as, for instance, the PEI associated to the key "the" will receive many more messages than the one associated with "Barcelona". This example captures the core of the problem we tackle: the distribution of word frequencies follows a Zipf law where a few words are extremely common while a large majority are rare. Therefore, an even distribution of keys such as the one generated by KG results in an uneven distribution of messages.

Implementation via shuffle grouping. An alternative implementation uses shuffle grouping on the source stream to get partial word counts. These counts are sent downstream to an aggregator every T seconds via key grouping. The aggregator simply combines the counts for each key to get the total count and selects the top-k for the final result. Using SG requires a slightly more complex logic, but it generates an even distribution of messages among the counter PEIs. However, it suffers from other problems. Given that there is no guarantee which PEI will handle a key, each PEI potentially needs to keep a counter for every key in the stream. Therefore, the memory usage of the application grows linearly with the parallelism level. Hence, it is not possible to scale to a larger workload by adding more machines: the application is not scalable in terms of memory. Even if we resort to approximation algorithms, in general, the error depends on the number of aggregations performed, thus it grows linearly with the parallelism level. We analyze this case in further detail, along with other application scenarios, in Section VI.

B. Key grouping with rebalancing

One common solution for load balancing in DSPEs is operator migration [5, 8, 9, 10, 11, 12]. Once a situation of load imbalance is detected, the system activates a rebalancing routine that moves part of the keys, and the state associated with them, away from an overloaded server. While this solution is easy to understand, its application in our context is not straightforward, for several reasons. Rebalancing requires setting a number of parameters, such as how often to check for imbalance and how often to rebalance. These parameters are often application-specific, as they involve a trade-off between imbalance and rebalancing cost that depends on the size of the state to migrate. Further, implementing a rebalancing mechanism usually requires major modifications of the DSPE at hand.
This task may be hard, and is usually seen with suspicion by the communities driving open source projects, as witnessed by the many variants of Hadoop that were never merged back into the main line of development [13, 14, 15].

In our context, rebalancing implies migrating keys from one sub-stream to another. However, this migration is not directly supported by the programming abstractions of some DSPEs. Storm and Samza use a coarse-grained stream partitioning paradigm. Each stream is partitioned into as many sub-streams as the number of downstream PEIs. Key migration is not compatible with this partitioning paradigm, as a key cannot be uncoupled from its sub-stream. In contrast, S4 employs a fine-grained paradigm where the stream is partitioned into one sub-stream per key value, and there is a one-to-one mapping of a key to a PEI. The latter paradigm easily supports migration, as each key is processed independently.

A major problem with mapping keys to PEIs explicitly is that the DSPE must maintain several routing tables: one for each stream. Each routing table has one entry for each key in the stream. Keeping these tables is impractical because the memory requirements are staggering. In a typical web mining application, each routing table can easily have billions of keys. For a moderately large DAG with tens of edges, each with tens of sources, the memory overhead easily becomes prohibitive. Finally, as already mentioned, for each stream there are several sources sending messages in parallel. Modifications to the routing table must be consistent across all sources, so they require coordination, which creates further overhead. For these reasons, we consider an alternative approach to load balancing.

III. PARTIAL KEY GROUPING

The problem described so far currently lacks a satisfying solution. To solve it, we resort to a widely-used technique in the literature of load balancing: the so-called power of two choices (PoTC). While this technique is well-known and has been analyzed thoroughly both from a theoretical and a practical perspective [16, 17, 18, 19, 4, 20], its application in the context of DSPEs is not straightforward and has not been previously studied.

Introduced by Azar et al. [17], PoTC is a simple and elegant technique that achieves load balance when assigning units of load to workers. It is best described in terms of balls and bins. Imagine a process where a stream of balls (units of work) is distributed to a set of bins (the workers) as evenly as possible. The single-choice paradigm corresponds to putting each ball into one bin selected uniformly at random. By contrast, the power of two choices selects two bins uniformly at random, and puts the ball into the least loaded one. This
simple modification of the algorithm has powerful implications that are well known in the literature (see Sections IV, VII).

Single choice. The current solution used by all DSPEs to partition a stream with key grouping corresponds to the single-choice paradigm. The system has access to a single hash function H_1(k). The partitioning of keys into sub-streams is determined by the function P_t(k) = H_1(k) mod W, where mod is the modulo operator. The single-choice paradigm is attractive because of its simplicity: the routing does not require maintaining any state and can be done independently in parallel. However, it suffers from a problem of load imbalance [4]. This problem is exacerbated when the distribution of input keys is skewed.

PoTC. When using the power of two choices, we have two hash functions H_1(k) and H_2(k). The algorithm maps each key to the sub-stream assigned to the least loaded worker between the two possible choices, that is:

P_t(k) = argmin_i { L_i(t) : H_1(k) = i ∨ H_2(k) = i }

The theoretical gain in load balance with two choices is exponential compared to a single choice. However, using more than two choices only brings constant factor improvements [17]. Therefore, we restrict our study to two choices. PoTC introduces two additional complications. First, to maintain the semantics of key grouping, the system needs to keep state and track the choices made. Second, the system has to know the load of the workers in order to make the right choice. We discuss these two issues next.

A. Key Splitting

A naïve application of PoTC to key grouping requires the system to store one bit of information for each key seen, to keep track of which of the two choices needs to be used thereafter. We refer to this variant as static PoTC. Static PoTC incurs some of the problems discussed for key grouping with rebalancing. Since the actual worker to which a key is routed is determined dynamically, sources need to keep a routing table with an entry per key. As already discussed, maintaining this routing table is often impractical.

In order to leverage PoTC and make it viable for DSPEs, we relax the requirement of key grouping. Rather than mapping each key to one of the two possible choices, we allow it to be mapped to both choices. Every time a source sends a message, it selects the worker with the lowest current load among the two candidates associated to that key. This technique, called key splitting, introduces several new trade-offs.

First, key splitting allows the system to operate in a decentralized manner, by allowing multiple sources to take decisions independently in parallel. As in key grouping and shuffle grouping, no state needs to be kept by the system and each message can be routed independently.

Second, key splitting enables far better load balancing compared to key grouping. It allows using PoTC to balance the load on the workers: by splitting each key over multiple workers, it handles the skew in the key popularity. Moreover, given that all its decisions are dynamic and based on the current load of the system (as opposed to static PoTC), key splitting adapts to changes in the popularity of keys over time.

Third, key splitting reduces the memory usage and aggregation overhead compared to shuffle grouping. Given that each key is assigned to exactly two PEIs, the memory needed to store its state is just a constant factor higher than when using key grouping. Instead, with shuffle grouping the memory grows linearly with the number of workers W. Additionally, state aggregation needs to happen only once for the two partial states, as opposed to W − 1 times in shuffle grouping.
This improvement also allows reducing the error incurred during aggregation for some algorithms, as discussed in Section VI.

From the point of view of the application developer, key splitting gives rise to a novel stream partitioning scheme called PARTIAL KEY GROUPING, which lies in-between key grouping and shuffle grouping. Naturally, not all algorithms can be expressed via PKG. The functions that can leverage PKG are the same ones that can leverage a combiner in MapReduce, i.e., associative functions and monoids. Examples of applications include naïve Bayes, heavy hitters, and streaming parallel decision trees, as detailed in Section VI. On the contrary, other functions such as computing the median cannot be easily expressed via PKG.

Example. Let us examine the streaming top-k word count example using PKG. In this case, each word is tracked by two counters on two different PEIs. Each counter holds a partial count for the word, while the total count is the sum of the two partial counts. Therefore, the total memory usage is 2K, i.e., O(K). Compare this result to SG, where the memory is O(W × K). Partial counts are sent downstream to an aggregator that computes the final result. For each word, the application sends two counters, and the aggregator performs a constant-time aggregation. The total work for the aggregation is O(K). Conversely, with SG the total work is again O(W × K). Compared to KG, the implementation with PKG requires additional logic, some more memory, and has some aggregation overhead. However, it also provides a much better load balance, which maximizes the resource utilization of the cluster. The experiments in Section V show that the benefits outweigh the costs.

B. Local Load Estimation

PoTC requires knowledge of the load of each worker to take its routing decision. A DSPE is a distributed system and, in general, sources and workers are deployed on different machines. Therefore, the load of each worker is not readily available to each source. Interestingly, we prove that no communication between sources and workers is needed to effectively apply PoTC. We propose a local load estimation technique, whereby each source independently maintains a local load-estimate vector with one element per worker. The load estimates are updated using only local information about the portion of the stream sent by each source. We argue that in order to achieve global load balance it is sufficient that each source independently balances the load it generates across all workers.
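Putting key splitting and local load estimation together, the whole scheme fits in a few lines. The sketch below is our own illustration: the paper's simulations use a 64-bit Murmur hash, whereas here two salted CRC32 hashes from the JDK stand in for H_1 and H_2 to keep the example dependency-free.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal sketch of Partial Key Grouping with local load estimation.
// One instance lives inside each source; no coordination is required.
final class PartialKeyGrouping {
    private final int numWorkers;
    private final long[] localLoad; // this source's local load-estimate vector

    PartialKeyGrouping(int numWorkers) {
        this.numWorkers = numWorkers;
        this.localLoad = new long[numWorkers];
    }

    // Two hash functions H1, H2 simulated by salting the key (illustrative).
    private int hash(String key, int salt) {
        CRC32 crc = new CRC32();
        crc.update((salt + "|" + key).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numWorkers);
    }

    int route(String key) {
        int first = hash(key, 13);  // H1(k)
        int second = hash(key, 17); // H2(k)
        // Key splitting + PoTC: send to the candidate this source
        // has loaded the least so far (ties go to the first choice).
        int selected = localLoad[first] <= localLoad[second] ? first : second;
        localLoad[selected]++; // update only the local estimate
        return selected;
    }
}
```

Each source routes and updates only its own estimate vector; the argument that follows shows why balancing each local portion suffices to balance the global load.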
The correctness of local load estimation follows directly from our definition of load in Section II. The load on a worker i is simply the sum of the loads that each source j imposes on that worker: L_i(t) = Σ_{j∈S} L_i^j(t). Each source j can keep an estimate of the load on each worker i based on the load it has generated, L_i^j. As long as each source keeps its own portion of load balanced, then the overall load on the workers will also be balanced. Indeed, the maximum overall load is at most the sum of the maximum loads that each source sees locally. It follows that the maximum imbalance is also at most the sum of the local imbalances.

IV. ANALYSIS

We proceed to analyze the conditions under which PKG achieves good load balance. Recall from Section II that we have a set W of n workers at our disposal and receive a sequence of m messages k_1, ..., k_m with values from a key universe K. Upon receiving the i-th message with key k_i ∈ K, we need to decide its placement among the workers; decisions are irrevocable. We assume one message arrives per unit of time. Our goal is to minimize the eventual maximum load L(m), which is the same as minimizing the imbalance I(m). A simple placement scheme such as shuffle grouping provides an imbalance of at most one, but we would like to limit the number of workers processing each key to d ∈ N+.

Chromatic balls and bins. We model our problem in the framework of balls-and-bins processes, where keys correspond to colors, messages to colored balls, and workers to bins. Choose d independent hash functions H_1, ..., H_d : K → [n] uniformly at random. Define the Greedy-d scheme as follows: at time t, the t-th ball (whose color is k_t) is placed in the bin with minimum current load among H_1(k_t), ..., H_d(k_t), i.e., P_t(k_t) = argmin_{i ∈ {H_1(k_t),...,H_d(k_t)}} L_i(t). Recall that with key splitting there is no need to remember the choice the next time a ball of the same color appears. Observe that when d = 1, each ball color is assigned to a unique bin, so no choice has to be made; this models hash-based key grouping. At the other extreme, when d ≥ n, all bins are valid choices, and we obtain shuffle grouping.

Key distribution. Finally, we assume the existence of an underlying discrete distribution D supported on K from which ball colors are drawn, i.e., k_1, ..., k_m is a sequence of m independent samples from D. Without loss of generality, we identify the set K of keys with N+ or, if K is finite of cardinality |K| = K, with [K] = {1, ..., K}. We assume the keys are ordered by decreasing probability: if p_i is the probability of drawing key i from D, then p_1 ≥ p_2 ≥ p_3 ≥ ... and Σ_{i∈K} p_i = 1. We also identify the set W of bins with [n].

A. Imbalance with PARTIAL KEY GROUPING

Comparison with standard problems. As long as we keep getting balls of different colors, our process is identical to the standard Greedy-d process of Azar et al. [17]. This occurs with high probability provided that m is small enough. But for sufficiently large m (e.g., when m ≥ 1/p_1), repeated keys will start to arrive. Recall that for any number of choices d ≥ 2, the maximum load after throwing m balls of different colors into n bins with the standard Greedy-d process is m/n + ln ln n / ln d + O(1). Unfortunately, such strong bounds on the imbalance (independent of m) cannot apply to our setting. To gain some intuition on what may go wrong, consider the following examples where d = 2. Note that for the maximum load not to be much larger than the average load, the number of bins used must not exceed O(1/p_1), where p_1 is the maximum key probability.
Indeed, at any time we expect the two bins H_1(1), H_2(1) to contain together at least a p_1 fraction of all balls, just counting the occurrences of a single key. Hence the expected maximum load among the two grows at a rate of at least p_1/2 per unit of time, while the overall average load increases by exactly 1/n per unit of time. Thus, if p_1 > 2/n, the expected imbalance at time m will be lower bounded by (p_1/2 − 1/n)m, which grows linearly with m. This holds irrespective of the placement scheme used.

However, requiring p_1 ≤ 2/n is not enough to prevent imbalance Ω(m). Consider the uniform distribution over n keys. Let B = ∪_i {H_1(i), H_2(i)} be the set of all bins that belong to one of the potential choices for some key. As is well-known, the expected size of B is n(1 − (1 − 1/n)^(2n)) ≈ (1 − 1/e²)n. So the keys use only a (1 − 1/e²) ≈ 0.865 fraction of all bins, and roughly 0.135n bins will remain unused. In fact, the imbalance after m balls will be at least m/(0.865n) − m/n ≈ 0.156·m/n. The problem is that most concrete instantiations of our two random hash functions cause the existence of an overpopulated set B of bins inside which the average bin load must grow faster than the average load across all bins. (In fact, this case subsumes our first example above, where B was {H_1(1), H_2(1)}.)

Finally, even in the absence of overpopulated bin subsets, some inherent imbalance is due to deviations between the empirical and true key distributions. For instance, suppose there are two keys 1, 2 with equal probability 1/2 and n = 4 bins. With constant probability, key 1 is assigned to bins 1, 2 and key 2 to bins 3, 4. This situation looks perfect because the Greedy-2 choice will send each occurrence of key 1 to bins 1, 2 alternately, so the loads of bins 1, 2 will always be equal up to ±1. However, the number of balls with key 1 seen is likely to deviate from m/2 by roughly Θ(√m), so either the top two or the bottom two bins will receive m/4 + Ω(√m) balls, and the imbalance will be Ω(√m) with constant probability.

In the remainder of this section we carry out our analysis, which, broadly construed, asserts that the above are the only impediments to achieving good balance.

Statement of results. We noted that once the number of bins n exceeds 2/p_1 (where p_1 is the maximum key frequency), the maximum load will be dominated by the loads of the bins to which the most frequent key is mapped. Hence the main case of interest is where p_1 = O(1/n). We focus on the case where the number of balls is large compared to the number of bins. The following results show that partial key grouping can significantly reduce the maximum load (and the imbalance) compared to key grouping.
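Before stating the bounds, we note that these regimes can be observed empirically. The following small Monte-Carlo sketch of the Greedy-d process is our own illustration; the parameters and the uniform key distribution are arbitrary choices satisfying p_1 ≤ 1/(5n).

```java
import java.util.Random;

// Monte-Carlo sketch of the Greedy-d balls-and-bins process analyzed here.
final class GreedyDSimulation {
    public static void main(String[] args) {
        int n = 50;            // bins (workers)
        int keys = 10_000;     // key universe; keys drawn uniformly, p1 = 1e-4
        int m = n * n * 10;    // number of balls, m >> n
        int d = 2;             // number of choices per key (d = 1 models KG)
        Random rnd = new Random(42);

        // Fix d random hash functions by assigning d choices per key.
        int[][] choices = new int[keys][d];
        for (int k = 0; k < keys; k++)
            for (int j = 0; j < d; j++) choices[k][j] = rnd.nextInt(n);

        long[] load = new long[n];
        for (int t = 0; t < m; t++) {
            int k = rnd.nextInt(keys); // draw a key from D (uniform here)
            int best = choices[k][0];
            for (int j = 1; j < d; j++)
                if (load[choices[k][j]] < load[best]) best = choices[k][j];
            load[best]++; // place the ball in its least loaded choice
        }

        long max = 0;
        for (long l : load) max = Math.max(max, l);
        System.out.printf("avg=%.1f max=%d imbalance=%.1f%n",
                (double) m / n, max, max - (double) m / n);
    }
}
```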
Theorem 4.1: Suppose we use n bins and let m ≥ n². Assume a key distribution D with maximum probability p_1 ≤ 1/(5n). Then the imbalance after m steps of the Greedy-d process satisfies, with probability at least 1 − 1/n,

I(m) = O( (m/n) · (ln n / ln ln n) ), if d = 1;
I(m) = O( m/n ), if d ≥ 2.

As the next result shows, the bounds above are best possible.5

Theorem 4.2: There is a distribution D satisfying the hypothesis of Theorem 4.1 such that the imbalance after m steps of the Greedy-d process satisfies, with probability at least 1 − 1/n,

I(m) = Ω( (m/n) · (ln n / ln ln n) ), if d = 1;
I(m) = Ω( m/n ), if d ≥ 2.

We omit the proof of Theorem 4.2 (it follows by considering a uniform distribution over 5n keys). The next section is devoted to the proof of the upper bound, Theorem 4.1.

5 However, the imbalance can be much smaller than the worst-case bounds from Theorem 4.1 if the probability of most keys is much smaller than p_1, which is the case in many setups.

B. Proof

Concentration inequalities. We recall the following results, which we need to prove our main theorem.

Theorem 4.3 (Chernoff bounds): Suppose {X_i} is a sequence of independent random variables with X_i ∈ [0, M], and let Y = Σ_i X_i and µ = Σ_i E[X_i]. Then for all β ≥ µ, Pr[Y ≥ β] ≤ C(µ, β, M), where

C(µ, β, M) = exp( −( β ln(β/(eµ)) + µ ) / M ).

Theorem 4.4 (McDiarmid's inequality): Let X_1, ..., X_n be a vector of independent random variables and let f be a function satisfying |f(a) − f(a′)| ≤ 1 whenever the vectors a and a′ differ in just one coordinate. Then

Pr[ f(X_1, ..., X_n) > E[f(X_1, ..., X_n)] + λ ] ≤ exp( −2λ²/n ).

The µ_r measure of bin subsets. For every nonempty set of bins B ⊆ [n] and 1 ≤ r ≤ d, define

µ_r(B) = Σ { p_i : {H_1(i), ..., H_r(i)} ⊆ B }.

We will be interested in µ_1(B) (which measures the probability that a random key from D will have its first choice inside B) and µ_d(B) (which measures the probability that a random key from D will have all its choices inside B). Note that µ_1(B) = Σ_{j∈B} µ_1({j}) and µ_d(B) ≤ µ_1(B).

Lemma 4.5: For every B ⊆ [n], E[µ_1(B)] = |B|/n and, if p_1 ≤ 1/n,

Pr[ µ_1(B) ≥ (|B|/n) · eλ ] ≤ λ^(−λ|B|).

Proof: The first claim follows from linearity of expectation and the fact that Σ_i p_i = 1. For the second, let |B| = k. Using Theorem 4.3, Pr[ µ_1(B) ≥ (k/n) · eλ ] is at most

C( k/n, (k/n)·eλ, p_1 ) ≤ exp( −(k·eλ·ln λ)/(n·p_1) ) ≤ exp( −kλ ln λ ),

since p_1 ≤ 1/n.

Lemma 4.6: For every B ⊆ [n], E[µ_d(B)] = (|B|/n)^d and, provided that p_1 ≤ 1/(5n),

Pr[ µ_d(B) ≥ |B|/n ] ≤ (e|B|/n)^(5|B|).

Proof: Again, the first claim is easy. For the second, let |B| = k. Using Theorem 4.3, Pr[ µ_d(B) ≥ k/n ] is at most

C( (k/n)^d, k/n, p_1 ) ≤ exp( −(k/(n·p_1)) · ln( (n/k)^(d−1)/e ) ) ≤ exp( −5k · ln( n/(ek) ) ),

since p_1 ≤ 1/(5n) (the last step uses d ≥ 2).

Corollary 4.7: Assume p_1 ≤ 1/(5n) and d ≥ 2. Then, with high probability,

max{ µ_d(B) / (|B|/n) : B ⊆ [n], |B| ≤ n/5 } ≤ 1.

Proof: We use Lemma 4.6 and the union bound. The probability that the claim fails to hold is bounded by

Σ_{k ≤ n/5} Σ_{B : |B|=k} Pr[ µ_d(B) ≥ k/n ] ≤ Σ_{k ≤ n/5} (n choose k) · (ek/n)^(5k) ≤ Σ_{k ≤ n/5} (en/k)^k · (ek/n)^(5k) = o(1/n),

where we used (n choose k) ≤ (en/k)^k, valid for all n, k.

For a scheduling algorithm A and a set B ⊆ [n] of bins, write L^A_B(t) = max_{j∈B} L_j(t) for the maximum load among the bins in B after t balls have been processed by A.

Lemma 4.8: Suppose there is a set A ⊆ [n] of bins such that for all T ⊆ A, µ_d(T) ≤ |T|/n. Then A = Greedy-d satisfies L^A_A(m) = O(m/n) + L^A_{[n]\A}(m) with high probability.

Proof: We use a coupling argument. Consider the following two independent processes P and Q: P proceeds as Greedy-d, while Q picks the bin for each ball independently at random from [n] and increases its load. Consider any time t at which the load vector is ω_t ∈ N^n and M_t = M(ω_t) is the set of bins with maximum load.
After handling the t-th ball, let X_t denote the event that P increases the maximum load in A because the new ball has all its choices in M_t ∩ A, and let Y_t denote the event that Q increases the maximum load in A. Finally, let Z_t denote the event that P increases the maximum load in A because the new ball has some choice in M_t ∩ A and some choice in M_t \ A, but the load of one of its choices in M_t ∩ A is no larger. We identify these events with their indicator random variables.
Note that the maximum load in A at the end of process P is L^P_A(m) = Σ_{t∈[m]} (X_t + Z_t), while at the end of process Q it is L^Q_A(m) = Σ_{t∈[m]} Y_t. Conditioned on any load vector ω_t, the probability of X_t is

Pr[X_t | ω_t] = µ_d(M_t ∩ A) ≤ |M_t ∩ A| / n = Pr[Y_t | ω_t],

so Pr[X_t | ω_t] ≤ Pr[Y_t | ω_t], which implies that for any b ∈ N, Pr[ Σ_{t∈[m]} X_t ≥ b ] ≤ Pr[ Σ_{t∈[m]} Y_t ≥ b ]. But with high probability the maximum load of process Q is b = O(m/n), so Σ_t X_t = O(m/n) holds with at least the same probability. On the other hand, Σ_t Z_t ≤ L^P_{[n]\A}(m), because each occurrence of Z_t increases the maximum load in A, and once a time t is reached such that L^P_A(t) > L^P_{[n]\A}(m), event Z_t must cease to happen. Therefore

L^P_A(m) = Σ_{t∈[m]} X_t + Σ_{t∈[m]} Z_t ≤ O(m/n) + L^P_{[n]\A}(m),

yielding the result.

Proof of Theorem 4.1: Let A = { j ∈ [n] : µ_1({j}) ≥ 3e/n }. Observe that every bin j ∉ A has µ_1({j}) < 3e/n, and this implies that, conditioned on the choice of hash functions, the maximum load of all bins outside A is at most 20m/n with high probability.6 Therefore our task reduces to showing that the maximum load of the bins in A is O(m/n). Assume that the thesis of Corollary 4.7 holds, which happens except with probability o(1/n). Consider the sequence of random variables given by X_j = µ_1({j}), and let f = f(X_1, X_2, ...) = Σ_j I[ µ_1({j}) > 3e/n ], so that |A| = f. By Lemma 4.5, E[|A|] = E[f] ≤ n/27. Moreover, the function f satisfies the hypothesis of Theorem 4.4. We conclude that, with high probability, |A| ≤ n/5. By Corollary 4.7, we then have that for all B ⊆ A, µ_2(B) ≤ |B|/n. Thus Lemma 4.8 applies to A. This means that after throwing m balls, the maximum load among the bins in A is O(m/n), as we wished to show.

6 This follows by majorization with the process that throws every ball into its first choice; see, e.g., Azar et al. [17].

V. EVALUATION

We assess the performance of our proposal by using both simulations and a real deployment. In so doing, we answer the following questions:
Q1: What is the effect of key splitting on PoTC?
Q2: How does local estimation compare to a global oracle?
Q3: How robust is PARTIAL KEY GROUPING?
Q4: What is the overall effect of PARTIAL KEY GROUPING on applications deployed on a real DSPE?

A. Experimental Setup

Datasets. Table I summarizes the datasets used. We use two main real datasets, one from Wikipedia and one from Twitter. These datasets were chosen for their large size, their different degrees of skewness, and because they are representative of the Web and online social network domains.

TABLE I: Summary of the datasets used in the experiments: number of messages, number of keys, and percentage of messages having the most frequent key (p_1).

Dataset       Symbol  Messages  Keys   p_1 (%)
Wikipedia     WP      22M       2.9M   9.32
Twitter       TW      1.2G      31M    2.67
Cashtags      CT      690k      2.9k   3.29
Synthetic 1   LN1     10M       16k    14.71
Synthetic 2   LN2     10M       1.1k   7.01
LiveJournal   LJ      69M       4.9M   0.29
Slashdot0811  SL1     905k      77k    3.28
Slashdot0902  SL2     948k      82k    3.11

The Wikipedia dataset (WP)7 is a log of the pages visited during a day in January 2008. Each visit is a message, and the page's URL represents its key. The Twitter dataset (TW) is a sample of tweets crawled during July 2012. Each tweet is parsed and split into its words, which are used as the keys for the messages. An additional Twitter dataset (CT) comprises a sample of tweets crawled in November 2013; the keys for the messages are the cashtags in these tweets. A cashtag is a ticker symbol used in the stock market to identify a publicly traded company, preceded by the dollar sign (e.g., $AAPL for Apple). Popular cashtags change from week to week. This dataset allows us to study the effect of shift of skew in the key distribution.
We also generate two synthetic datasets (LN1, LN2) with keys following a log-normal distribution, a commonly used heavy-tailed skewed distribution [21]. The parameters of the distribution (µ1 = 1.789, σ1 = 2.366; µ2 = 2.245, σ2 = 1.133) come from an analysis of Orkut, and try to emulate workloads from the online social network domain [22]. Finally, we experiment on three additional datasets comprised of directed graphs8 (LJ, SL1, SL2). We use the edges in the graph as messages and the vertices as keys. These datasets are used to test the robustness of PKG to skew in partitioning the stream at the sources, as explained next. They also represent a different kind of application domain: streaming graph mining.

7 http://www.wikibench.eu/?page_id=60
8 http://snap.stanford.edu/data

Simulation. We process the datasets by simulating the DAG presented in Figure 1. The stream is composed of timestamped keys that are read by multiple independent sources (S) via shuffle grouping, unless otherwise specified. The sources forward the received keys to the workers (W) downstream. In our simulations we assume that the sources perform data extraction and transformation, while the workers perform data aggregation, which is the most computationally expensive part of the DAG. Thus, the workers are the bottleneck in the DAG and the focus of load balancing.

B. Experimental Results

Q1. We measure the imbalance in the simulations when using the following techniques:
H: Hashing, which represents standard key grouping (KG) and is our main baseline. We use a 64-bit Murmur hash function to minimize the probability of collision.
PoTC: Power of two choices without key splitting, i.e., traditional PoTC applied to key grouping.
On-Greedy: Online greedy algorithm that picks the least loaded worker to handle a new key.
Off-Greedy: Offline greedy that sorts the keys by decreasing frequency and executes On-Greedy.
PKG: PoTC with key splitting.

Note that PKG is the only method that uses key splitting. Off-Greedy knows the whole distribution of keys, so it represents an unfair comparison for online algorithms.

TABLE II: Average imbalance when varying the number of workers for the Wikipedia and Twitter datasets.

              WP                              TW
W             5      10     50     100        5      10     50     100
PKG           0.8    2.9    5.9e5  8.0e5      0.4    1.7    2.74   4.0e6
Off-Greedy    0.8    0.9    1.6e6  1.8e6      0.4    0.7    7.8e6  2.0e7
On-Greedy     7.8    1.4e5  1.6e6  1.8e6      8.4    92.7   1.2e7  2.0e7
PoTC          15.8   1.7e5  1.6e6  1.8e6      2.2e4  5.1e3  1.4e7  2.0e7
Hashing       1.4e6  1.7e6  2.0e6  2.0e6      4.1e7  3.7e7  2.4e7  3.3e7

Table II shows the results of the comparison on the two main datasets, WP and TW. Each value is the average imbalance measured throughout the simulation. As expected, hashing performs the worst, creating a large imbalance in all cases. While PoTC performs better than hashing in all the experiments, it is outclassed by On-Greedy on TW. On-Greedy performs very close to Off-Greedy, which is a good result considering that it is an online algorithm. Interestingly, PKG performs even better than Off-Greedy. Relaxing the constraint of KG allows achieving a load balance comparable to offline algorithms. We conclude that PoTC alone is not enough to guarantee good load balance, and key splitting is fundamental not only to make the technique practical in a distributed system, but also to make it effective in a streaming setting.

As expected, increasing the number of workers also increases the average imbalance. The behavior of the system is binary: either well balanced or largely imbalanced. The transition between the two states happens when the number of workers surpasses the limit O(1/p_1) described in Section IV, which happens around 50 workers for WP and 100 for TW.

Q2. Given the aforementioned results, we focus our attention on PKG henceforth. So far, it still uses global information about the load of the workers when deciding which choice to make. Next, we experiment with local estimation, i.e., each source performs its own estimation of the worker load, based on the sub-stream processed so far. We consider the following alternatives:

G: PKG with global information of worker load.
L: PKG with local estimation of worker load and different numbers of sources, e.g., L5 denotes S = 5.
LP: PKG with local estimation and periodic probing of worker load every T_p minutes. For instance, L5P1 denotes S = 5 and T_p = 1. When probing is executed, the local estimate vector is set to the actual load of the workers.

Fig. 2: Fraction of average imbalance with respect to the total number of messages for each dataset, for different numbers of workers and sources.

Figure 2 shows the average imbalance (normalized to the size of the dataset) with the different techniques, for different numbers of sources and workers, and for several datasets. The baseline (H) always imposes a very high load imbalance on the workers. Conversely, PKG with local estimation (L) always has a lower imbalance.
Furthermore, the difference from the global variant (G) is always less than one order of magnitude. Finally, this result is robust to changes in the number of sources.

Figure 3 displays the imbalance of the system through time, I(t), for TW, WP and CT, with 5 sources and with W = 10 and 50. Results for W = 5 and W = 100 are omitted as they are similar to W = 10 and W = 50, respectively. PKG with global information (G) and its variant with local estimation (L5) perform best. Interestingly, even though both G and L achieve very good load balance, their choices are quite different. In an experiment measuring the agreement on the destination of each message, G and L have only 47% Jaccard overlap. Hence, L reaches a local minimum which is very close in value to the one obtained by G, although different. Also in this case, good balance can only be achieved up to a number of workers that depends on the dataset. When that number is exceeded, the imbalance increases rapidly, as seen in the cases of WP, and partially of CT, for W = 50, where all techniques lead to the same high load imbalance.
Fig. 3: Fraction of imbalance through time for different datasets, techniques, and numbers of workers, with S = 5.

Finally, we compare our local estimation strategy with a variant that makes use of periodic probing of worker load every minute (L5P1). Probing removes any inconsistency in the load estimates that the sources may have accumulated. However, interestingly, this technique does not improve the load balance, as shown in Figure 3. Even increasing the frequency of probing does not reduce imbalance (results not shown in the figure for clarity). In conclusion, local information is sufficient to obtain good load balance, therefore it is not necessary to incur the overhead of probing.

Q3. To operationalize this question, we use the directed-graph datasets. We use KG to distribute the messages to the sources to test the robustness of PKG to skew at the sources, i.e., when each source forwards an uneven part of the stream. We simulate a simple application that computes a function of the incoming edges of a vertex (e.g., in-degree, PageRank). The input key for the source PE is the source vertex id, while the key sent to the worker PE is the destination vertex id, that is, the source PE inverts the edge. This schema projects the out-degree distribution of the graph on the sources, and the in-degree distribution on the workers, both of which are highly skewed.

Figure 4 shows the average imbalance for the experiments with a skewed split of the keys to sources for the LJ social graph (results on SL1 and SL2 are similar to LJ and are omitted due to space constraints). For comparison, we include the results when the split is performed uniformly, using shuffle grouping of keys on sources. On average, the imbalance generated by the skew on sources is similar to the one obtained with uniform splitting. As expected, the imbalance slightly increases as the number of sources and workers increases but, in general, it remains at very low absolute values.

Fig. 4: Fraction of average imbalance with uniform and skewed splitting of the input keys on the sources when using the LJ graph.

To answer Q3, we additionally experiment with drift in the skew distribution by using the cashtag dataset (CT). The bottom row of Figure 3 demonstrates that all techniques achieve a low imbalance, even though the change of key popularity through time generates occasional spikes. In conclusion, PKG is robust to skew on the sources, and can therefore be chained to key grouping. It is also robust to the drift in key distribution common to many real-world streams.

Q4. We implement and test our technique on the streaming top-k word count example, and perform two experiments to compare PKG, KG, and SG on WP. We choose word count as it is one of the simplest possible examples, thus limiting the number of confounding factors. It is also representative of many data mining algorithms, such as the ones described in Section VI (e.g., counting frequent items or co-occurrences of feature-class pairs). Due to the requirement of real-world deployment on a DSPE, we ignore techniques that require coordination (i.e., PoTC and On-Greedy). We use a topology configuration of a single source along with 9 workers (counters) running on a Storm cluster of 10 virtual servers.
We report overall throughput, end-to-end latency, and memory usage. In the first experiment, we emulate different levels of CPU consumption per key by adding a fixed delay to the processing. We prefer this solution over implementing a specific application in order to be able to control the load on the workers. We choose a range that is able to bring our configuration to a saturation point, although the raw numbers would vary for different setups. Even though real deployments rarely operate at saturation point, PKG allows better resource utilization, therefore supporting the same workload on a smaller number of machines. In this case, the minimum delay (0.1ms) corresponds approximately to reading 400kB sequentially from memory, while the maximum delay (1ms) corresponds to 1/10th of a disk seek.9 Nevertheless, even more expensive tasks exist: parsing a sentence with NLP tools can take up to 500ms.10 The system does not perform aggregation in this setup, as we are only interested in the raw effect on the workers.

Figure 5(a) shows the throughput achieved when varying the CPU delay for the three partitioning strategies. Regardless of the delay, SG and PKG perform similarly, and their throughput is higher than KG's. The throughput of KG is reduced by 60%

9 http://brenocon.com/dean_perf.html
10 http://nlp.stanford.edu/software/parser-faq.shtml#
when the CPU delay increases tenfold, while the impact on PKG and SG is smaller (about 37% decrease). We deduce that reducing the imbalance is critical for clusters operating close to their saturation point, and that PKG is able to handle bottlenecks similarly to SG and better than KG. In addition, the imbalance generated by KG translates into longer latencies for the application. When the workers are heavily loaded, the average latency with KG is up to 45% larger than with PKG. Finally, the benefits of PKG over SG regarding memory are substantial. Overall, PKG (3.6M counters) requires about 30% more memory than KG (2.9M counters), but about half the memory of SG (7.2M counters).

Fig. 5: (a) Throughput for PKG, SG and KG for different CPU delays. (b) Throughput for PKG and SG vs. average memory for different aggregation periods.

In the second experiment, we fix the CPU delay to 0.4ms per key, as it is the saturation point for KG in our setup. We activate the aggregation of counters at different time intervals T to emulate different application policies for when to receive up-to-date top-k word counts. In this case, PKG and SG need additional memory compared to KG to keep partial counters. Shorter aggregation periods reduce the memory requirements, as partial counters are flushed often, at the cost of a higher number of aggregation messages. Figure 5(b) shows the relationship between throughput and memory overhead for PKG and SG; the throughput of KG is shown for comparison. For all values of the aggregation period, PKG achieves higher throughput than SG, with lower memory overhead and similar average latency per message. When the aggregation period is above 30s, the benefits of PKG compensate for its extra overhead and its overall throughput is higher than when using KG.

VI. APPLICATIONS

PKG is a novel programming primitive for stream partitioning, and not every algorithm can be expressed with it. In general, all algorithms that use shuffle grouping can use PKG to reduce their memory footprint. In addition, many algorithms expressed via key grouping can be rewritten to use PKG in order to get better load balancing. In this section we provide a few such examples with common data mining algorithms, and show the advantages of PKG. Henceforth, we assume that each message contains a data point for the application, e.g., a feature vector in a high-dimensional space.

A. Naïve Bayes Classifier

A naïve Bayes classifier is a probabilistic model that assumes independence of features. It estimates the probability of a class C given a feature vector X by using Bayes' theorem. In practice, the classifier works by counting the frequency of co-occurrence of each feature and class value. The simplest way to parallelize this algorithm is to spread the counters across several workers via vertical parallelism, i.e., each feature is tracked independently in parallel. Following this design, the algorithm can be implemented by the same pattern used for the KG example in Section II-A. Sparse datasets often have a skewed distribution of features, e.g., for text classification. Therefore, this implementation suffers from the same load imbalance, which PKG solves.

Horizontal parallelism can also be used to parallelize the algorithm, i.e., by shuffling messages to separate workers. This implementation uses the same pattern as the DAG in the SG example in Section II-A.
The count for a single feature-class pair is distributed across several workers, and needs to be combined at prediction (query) time. This combination requires broadcasting the query to all the workers, as a feature can be tracked by any worker. This implementation, while balancing the work better than key grouping, requires an expensive query stage that may be affected by stragglers.

PKG tracks each feature on two workers and avoids replicating counters on all workers. Furthermore, the two workers are deterministically assigned for each feature. Thus, at query time, the algorithm needs to probe only two workers for each feature, rather than having to broadcast it to all the workers. The resulting query phase is less expensive and less sensitive to stragglers than with shuffle grouping.

B. Streaming Parallel Decision Tree

A decision tree is a classification algorithm that uses a tree-like model where nodes are tests on features, branches are possible outcomes, and leaves are class assignments. Ben-Haim and Tom-Tov [1] propose an algorithm to build a streaming parallel decision tree that uses approximated histograms to find the test value for continuous features. Messages are shuffled among W workers. Each worker generates histograms independently for its sub-stream, one histogram for each feature-class-leaf triplet. These histograms are then periodically sent to a single aggregator that merges them to get an approximated histogram for the whole stream. The aggregator uses this final histogram to grow the model by taking split decisions for the current leaves in the tree. Overall, the algorithm keeps W × D × C × L histograms, where D is the number of features, C is the number of classes, and L is the current number of leaves. The memory footprint of the algorithm depends on W, so it is impossible to fit larger models by increasing the parallelism. Moreover, the aggregator needs to merge W × D × C histograms each time a split decision is tried, and merging the histograms is one of the most expensive operations.

Instead, PKG reduces both the space complexity and the aggregation cost. If applied on the features of each message, a single
feature is tracked by two workers, for an overall cost of only 2 × D × C × L histograms. Furthermore, the aggregator needs to merge only two histograms per feature-class-leaf triplet. This scheme makes it possible to alleviate memory pressure by adding more workers, as the space complexity does not depend on W.

C. Heavy Hitters and Space Saving

The heavy hitters problem consists in finding the top-k most frequent items occurring in a stream. The SPACESAVING [23] algorithm solves this problem approximately in constant time and space. Recently, Berinde et al. [2] have shown that SPACESAVING is space-optimal, and how to extend its guarantees to merged summaries. This result allows for parallelized execution by merging partial summaries built independently on separate sub-streams. In this case, the error bound on the frequency of a single item depends on a term representing the error due to the merging, plus another term which is the sum of the errors of each individual summary for a given item i:

f̂_i − f_i ≤ ε_f + Σ_{j=1..W} ε_j,

where f_i is the true frequency of item i and f̂_i is the estimated one, each ε_j is the error from summarizing the j-th sub-stream, and ε_f is the error from summarizing the whole stream, i.e., from merging the summaries. Observe that the error bound depends on the parallelism level W. Conversely, by using KG, the error for an item depends only on a single summary, thus it is equivalent to the sequential case, at the expense of poor load balancing. Using PKG we achieve both benefits: the load is balanced among workers, and the error for each item depends on the sum of only two error terms, regardless of the parallelism level. However, the individual error bounds may depend on W.

VII. RELATED WORK

Various works in the literature either extend the theoretical results on the power of two choices, or apply them to the design of large-scale systems for data processing.

Theoretical results. Load balancing in a DSPE can be seen as a balls-and-bins problem, where m balls are to be placed in n bins. The power of two choices has been extensively researched from a theoretical point of view for balancing the load among machines [4, 20]. Previous results consider each ball to be equivalent. For a DSPE, this assumption holds if we map balls to messages and bins to servers. However, if we map balls to keys, more popular keys should be considered to be heavier. Talwar and Wieder [21] tackle the case where each ball has a weight drawn independently from a fixed weight distribution X. They prove that, as long as X is smooth, the expected imbalance is independent of the number of balls. However, the solution assumes that X is known beforehand, which is not the case in a streaming setting. Thus, in our work we take the standard approach of mapping balls to messages.

Another assumption common in previous works is that there is a single source of balls. Existing algorithms that extend PoTC to multiple sources execute several rounds of intra-source coordination before taking a decision [16, 19, 24]. Overall, these techniques incur a significant coordination overhead, which becomes prohibitive in a DSPE that handles thousands of messages per second.

Stream processing systems. Existing load balancing techniques for DSPEs are analogous to key grouping with rebalancing [5, 8, 9, 10, 11, 12]. In our work, we consider operators that allow replication and aggregation, similar to a standard combiner in MapReduce, and show that it is sufficient to balance the load among two replicas based on local load estimation. We refer to Section II-B for a more extensive discussion of key grouping with rebalancing.
Flux monitors the load of each operator, ranks servers by load, and migrates operators from the most loaded to the least loaded server, from the second most loaded to the second least loaded, and so on [5]. Aurora* and Medusa propose policies for migrating operators in DSPEs and federated DSPEs [8]. Borealis uses a similar approach, but it also aims at reducing the correlation of load spikes among operators placed on the same server [9]. This correlation is estimated by using a finite set of load samples taken in the recent past. Gedik [10] developed a partitioning function (a hybrid between explicit mapping and consistent hashing of items to servers) for stateful data parallelism in DSPEs that leverages item frequencies to control migration cost and imbalance in the system. Similarly, Balkesen et al. [11] proposed frequency-aware hash-based partitioning to achieve load balance. Castro Fernandez et al. [12] propose integrating common operator state management techniques for both checkpointing and migration.

Other distributed systems. Several storage systems use consistent hashing to allocate data items to servers [25]. Consistent hashing substantially produces a random allocation and is designed to deal with systems where the set of servers available varies over time. In this paper, we propose replicating DSPE operators on two servers selected at random. One could use consistent hashing also to select these two replicas, using the replication technique used by Chord [26] and other systems.

Sparrow [27] is a stateless distributed job scheduler that exploits a variant of the power of two choices [24]. It employs batch probing, along with late binding, to assign the m tasks of a job to the least loaded of d × m randomly selected workers (d ≥ 1). Sparrow considers only independent tasks that can be executed by any worker. In DSPEs, a message can only be sent to the workers that are accumulating the state corresponding to the key of that message. Furthermore, DSPEs deal with messages that arrive at a much higher rate than Sparrow's fine-grained tasks, so we prefer to use local load estimation.

In the domain of graph processing, several systems have been proposed to solve the load balancing problem, e.g., Mizan [28], GPS [29], and xDGP [30]. Most of these systems perform dynamic load rebalancing at runtime via vertex migration. We have already discussed why rebalancing is impractical in our context in Section II.
Finally, SkewTune [31] solves the problem of load balancing in MapReduce-like systems by identifying and redistributing the unprocessed data from stragglers to other workers. Techniques such as SkewTune are a good choice for batch processing systems, but cannot be directly applied to DSPEs.

VIII. CONCLUSION

Despite being a well-known problem in the literature, load balancing has not been exhaustively studied in the context of distributed stream processing engines. Current solutions fail to provide satisfactory load balance when faced with skewed datasets. To solve this issue, we introduced PARTIAL KEY GROUPING, a new stream partitioning strategy that allows better load balance than key grouping while incurring less memory overhead than shuffle grouping. Compared to key grouping, PKG is able to reduce the imbalance by up to several orders of magnitude, thus improving the throughput and latency of an example application by up to 45%.

This work gives rise to further interesting research questions. Is it possible to achieve good load balance without foregoing atomicity of processing of keys? What are the necessary conditions, and how can it be achieved? In particular, can a solution based on rebalancing be practical? And, in a larger perspective, which other primitives can a DSPE offer to express algorithms effectively while making them run efficiently? While most DSPEs have settled on just a small set, the design space still remains largely unexplored.

REFERENCES

[1] Y. Ben-Haim and E. Tom-Tov, "A Streaming Parallel Decision Tree Algorithm," JMLR, vol. 11, pp. 849-872, 2010.
[2] R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss, "Space-optimal heavy hitters with strong error bounds," ACM Trans. Database Syst., vol. 35, no. 4, pp. 1-28, 2010.
[3] G. H. Gonnet, "Expected length of the longest probe sequence in hash code searching," J. ACM, vol. 28, no. 2, pp. 289-304, 1981.
[4] M. Mitzenmacher, "The power of two choices in randomized load balancing," IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 10, pp. 1094-1104, 2001.
[5] M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin, "Flux: An adaptive partitioning operator for continuous query systems," in ICDE, 2003, pp. 25-36.
[6] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in ICDM Workshops, 2010, pp. 170-177.
[7] M. G. Noll, "Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm," www.michael-noll.com/blog/2013/01/18/implementing-realtime-trending-topics-in-storm, 2013.
[8] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. B. Zdonik, "Scalable distributed stream processing," in CIDR, vol. 3, 2003, pp. 257-268.
[9] Y. Xing, S. Zdonik, and J.-H. Hwang, "Dynamic load distribution in the Borealis stream processor," in ICDE, 2005, pp. 791-802.
[10] B. Gedik, "Partitioning functions for stateful data parallelism in stream processing," The VLDB Journal, pp. 1-23, 2013.
[11] C. Balkesen, N. Tatbul, and M. T. Özsu, "Adaptive input admission and management for parallel stream processing," in DEBS. ACM, 2013, pp. 15-26.
[12] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, "Integrating scale out and fault tolerance in stream processing using operator state management," in SIGMOD, 2013, pp. 725-736.
[13] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," PVLDB, vol. 2, no. 1, pp. 922-933, 2009.
[14] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)," PVLDB, vol. 3, no. 1-2, pp.
515-529, 2010.
[15] H. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in SIGMOD, 2007, pp. 1029-1040.
[16] M. Adler, S. Chakrabarti, M. Mitzenmacher, and L. Rasmussen, "Parallel Randomized Load Balancing," in STOC, 1995, pp. 119-130.
[17] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal, "Balanced allocations," SIAM J. Comput., vol. 29, no. 1, pp. 180-200, 1999.
[18] J. Byers, J. Considine, and M. Mitzenmacher, "Geometric generalizations of the power of two choices," in SPAA, 2003, pp. 54-63.
[19] C. Lenzen and R. Wattenhofer, "Tight bounds for parallel randomized load balancing: Extended abstract," in STOC, 2011, pp. 11-20.
[20] M. Mitzenmacher, R. Sitaraman et al., "The power of two random choices: A survey of techniques and results," in Handbook of Randomized Computing, 2001, pp. 255-312.
[21] K. Talwar and U. Wieder, "Balanced allocations: the weighted case," in STOC, 2007, pp. 256-265.
[22] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, "Characterizing user behavior in online social networks," in IMC, 2009.
[23] A. Metwally, D. Agrawal, and A. El Abbadi, "Efficient computation of frequent and top-k elements in data streams," in ICDT, 2005, pp. 398-412.
[24] G. Park, "A Generalization of Multiple Choice Balls-into-Bins," in PODC, 2011, pp. 297-298.
[25] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin, "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web," in STOC, 1997, pp. 654-663.
[26] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," SIGCOMM Computer Communication Review, vol. 31, no. 4, pp. 149-160, 2001.
[27] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, "Sparrow: distributed, low latency scheduling," in SOSP, 2013, pp. 69-84.
[28] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis, "Mizan: a system for dynamic load balancing in large-scale graph processing," in EuroSys, 2013, pp. 169-182.
[29] S. Salihoglu and J. Widom, "GPS: A graph processing system," in SSDBM, 2013, p. 22.
[30] L. Vaquero, F. Cuadrado, D. Logothetis, and C. Martella, "xDGP: A Dynamic Graph Processing System with Adaptive Partitioning," arXiv, vol. abs/1309.1049, 2013.
[31] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: Mitigating Skew in MapReduce Applications," in SIGMOD, 2012, pp. 25-36.