SMART: Scalable, Bandwidth-Aware Monitoring of Continuous Aggregation Queries

: Scalable, Bandwdth-Aware Montorng of Contnuous Aggregaton Queres Navendu Jan, Praveen Yalagandula, Mke Dahln, and Yn Zhang Unversty of Texas at Austn HP Labs ABSTRACT We present, a scalable, bandwdth-aware montorng system that maxmzes result precson of contnuous aggregate queres over dstrbuted data streams. Whle prevous approaches reduce bandwdth cost under fxed precson constrants, n practce, montorng systems may stll ncur a substantal cost rskng overload under bursty traffc condtons. therefore bounds the worst-case system cost to provde overload reslence and to facltate practcal deployment of montorng systems. The prmary challenge for s how to select dynamc updates at each node n a dstrbuted system to maxmze global precson whle keepng per-node montorng bandwdth below a specfed budget. To address ths challenge, s herarchcal algorthm (1) allocates bandwdth budgets n a nearoptmal manner to maxmze global precson and (2) selftunes bandwdth settngs to mprove precson under dynamc workloads. Our prototype mplementaton of provdes key solutons to (a) prortze pendng updates for mult-attrbute queres, (b) buld bounded fan-n, load-aware aggregaton trees to mprove accuracy and fast anomaly detecton, and (c) combne temporal batchng wth arthmetc flterng to reduce load and to quantfy result staleness. Fnally, our evaluaton usng smulatons and a network montorng applcaton shows that mproves accuracy by up to an order of magntude compared to unform bandwdth allocaton and performs close to the optmal algorthm under modest bandwdth budgets. 1. INTRODUCTION Dstrbuted stream processng systems [1,19,32] must provde hgh performance and hgh fdelty for query processng as such systems grow n scale and complexty. In these systems, data streams are often bursty where nput rates may unexpectedly ncrease over tme [3, 23]. Examples nclude network traffc montorng, dentfyng dstrbuted denalof-servce (DDoS) attacks on the Internet, fnancal stocks montorng, web clck stream analyss, and event-drven montorng n sensor networks. Therefore, t s desrable for these systems to bound the montorng load whle stll provdng useful accuracy guarantees on the query results. Exstng technques [21,22,24,26,37,4] am to address ths problem by mnmzng the montorng cost whle promsng an a pror numerc error bound (e.g., ± 1%) on query precson. Unfortunately, although these technques effectvely reduce load under fxed precson, they are unsutable n dynamc, hgh-volume stream processng envronments for three reasons. (1) Settng precson requres workload knowledge: Choosng error bounds a pror s unntutve when workloads are not known n advance or may change unpredctably over tme. (Should the error be 1% or 3%?) Conversely, t may be easer to set the montorng budget (e.g., a system admnstrator s wllng to pay.1% of network bandwdth for montorng). (2) Bad precson settng hurts performance: A bad choce of the error bound may sgnfcantly degrade the qualty of a query result [6] (e.g., when the error bound s too large) or ncur a hgh communcaton and processng cost (e.g., when the error bound s too small). (3) Bursty traffc mposes unacceptable overheads: Even wth reasonable error bounds, n practce, montorng systems may stll ncur a substantal cost rskng overload under bursty and often unpredctable traffc condtons (see motvatng example descrbed below). An ntutve soluton to handle hgh load s to smply provson adequate resources n antcpaton of the worst-case load. However, snce over-provsonng s nether economcally vable [36] nor scalable [2], t s mportant to provde protecton mechansms to bound the bandwdth cost. Motvatng Example: We present a smple example to llustrate the challenges for scalable montorng under bursty workloads. We smulate a set of 1 data sources connected to a centralzed montor wth ncomng bandwdth lmt of up to messages per second. The nput workload dstrbuton s modeled based on the standard exponental dstrbuton wth a parameter λ, and upon each arrval, the value of attrbute a at data source s updated accordng to random walk model n whch the value ether ncreases or decreases by an amount sampled unformally from [., 1.]. Fgure 1 shows the load-error tradeoff for a sngle data source. As expected, on ncreasng the error budget, arthmetc flterng [21, 26] quckly decreases the load as majorty updates get fltered. To quantfy the montorng cost, we use λ=1, a baselne error budget of 2 and ts correspondng expected load of. (Fgure 1), effectvely settng the total expected cost for montorng 1 data sources as messages per second. Fgure 2 shows the nduced message load at the central montor under fxed error of 2 per attrbute. We observe that under peak data arrvals, the system ncurs up to 4x hgher cost to meet the error bounds over the expected cost of messages per second. Thus, montorng systems may nduce hgh overload to bound precson under bursty workloads. Fgure 3 shows the dvergence between data source attrbutes and 1

Expected Messages per second 1 Load-Error Tradeoff.9.8.7.6..4.3.2.1 1 2 3 4 6 7 8 9 1 Error Budget Messages per second 2 1 1 Load (Fxed Error = 2) Expected Load 1 2 3 4 6 7 8 9 1 Tme (seconds, *1) Dvergence 1 9 8 7 6 4 3 2 1 Dvergence (Fxed BW=) Expected Dvergence 1 2 3 4 6 7 8 9 1 Tme (seconds, *1) Fgure 1: Expected outgong message load vs. error budget for a sngle attrbute at a data source under random walk workload. Fgure 2: Montorng system nduces overload under bursty workloads to bound result error. The system ncurs up to 4x overload over the expected load. Fgure 3: Montorng system causes hgh naccuracy under bursty workloads to bound load. The result error s up to x hgher over the expected dvergence. ther cached values at the central montor under fxed load of. In ths case, the system may provde hghly naccurate results havng up to x error over the expected result error of 2. As a result, exstng montorng technques are neffectve under bursty workloads they rsk overload, loss of accuracy, or both. Ths smple example clearly shows that large-scale montorng systems must bound the bandwdth cost whle stll provdng query results wth useful accuracy guarantees. Our Contrbutons: To address these challenges, we propose, the frst (to the best of our knowledge) scalable, bandwdth-aware montorng system that adaptvely sets bandwdth constrants to maxmze precson of contnuous aggregate queres over dstrbuted data streams. addresses the above challenges wth the followng technques: (1) Maxmze precson under fxed bandwdth budget: formulates a bandwdth-aware optmzaton problem whose goal s to maxmze the result precson of an aggregate query n a herarchcal aggregaton tree subject to gven bandwdth constrants. Ths model provdes a closed-form, near-optmal soluton to self-tune bandwdth settngs for achevng hgh accuracy usng only local and aggregated nformaton at each node n the tree. Further, provdes a graceful degradaton n query precson as the avalable bandwdth decreases. Our expermental results show that self-tunng bandwdth settngs mproves accuracy by an order of magntude over unform bandwdth allocaton and performs close to the optmal algorthm under modest bandwdth budgets. (2) Scalablty va multple aggregaton trees: bulds on recent work that uses dstrbuted hash tables (DHTs) to construct scalable, load-balanced forests of self-organzng aggregaton trees [, 12, 29, 39]. Scalablty to tens of thousands of nodes and mllons of attrbutes s acheved by mappng dfferent attrbutes to dfferent trees. For each tree n ths forest of aggregaton trees, s self-tunng algorthm adjusts bandwdth settngs to acheve hgh precson under dynamc workloads where the estmate of load vs. precson trade-offs, and hence the optmal dstrbuton, can change over tme. (3) Hgh performance by combnng arthmetc flterng and temporal batchng: ntegrates temporal batchng wth arthmetc flterng to reduce the montorng load and to quantfy staleness of query results. For hgh-volume workloads, t s advantageous for nodes to batch multple updates (that arrve close n tme) and send a sngle combned update, thereby more effectvely usng budgeted bandwdth for refreshng updates. In an aggregaton tree, ths temporal batchng allows leaf sensors to condense a seres of updates nto a perodc report and allows nternal nodes to combne updates from dfferent subtrees before transmttng them further. Further, t bounds the delay (e.g., at most 3 seconds) from when an update occurs at a leaf untl t s reported at the root. To mnmze query result staleness, prortzes updates so that they propagate from the leaves to the root n the allotted tme whle meetng bandwdth budgets. We have mplemented a prototype of on SDIMS, a scalable aggregaton system [39] bult on top of FreePastry [13]. Our prototype mplementaton provdes two key optmzatons. Frst, t constructs bounded fan-n, bandwdth-aware aggregaton trees to mprove accuracy and to quckly detect anomales n heterogeneous envronments. Recent studes [9, 11, 18, 21, 22] suggest that only a few attrbutes (e.g., network flows) generate a sgnfcant fracton of total traffc n many montorng applcatons. Thus, for fast anomaly detecton, an aggregaton tree should quckly route ther updates towards the root such that no nternal node becomes a bottleneck due to ether hgh n-degree or low bandwdth. Second, for mult-attrbute queres, provdes a refresh schedule that selects and prortzes attrbutes for refreshng n order to mnmze the error n query results. We evaluate through extensve smulatons and a real network montorng applcaton of detectng dstrbuted heavy htters. Experence wth ths applcaton bult on llustrates the mproved precson and scalablty benefts: for a gven montorng budget, s adaptvty can sgnfcantly mprove the query precson whle montorng a large number of attrbutes. Compared to unform bandwdth allocaton, mproves accuracy by up to an order of magntude and provdes accuracy wthn 27% of the optmal algorthm under modest bandwdth budgets. 2

Example Queres: We lst several real-world applcaton queres whch form the bass of montorng system. Q 1 Q 2 Q 3 Q 4 Montor dstrbuted heavy htters that send the hghest traffc n aggregate across all network endponts. Fnd top-k ports across all nodes that have been heavly scanned n the recent past ndcatng worm actvty. Montor anomaly condtons e.g., SUM(nodes sensng fre) threshold, MAX(chemcal concentraton) n sensor networks. Montor the top-k popular web objects n a wde-area content dstrbuton network e.g., Akama. All these aggregate queres requre processng a large number of rapd update streams n lmted bandwdth/batterylfe envronments, and can beneft from. In Secton, we show results for Q 1 usng a real network montorng system we have mplemented. In summary, ths paper makes the followng contrbutons. The frst research effort that dentfes the key lmtatons of prevous fx error, mnmze load technques and addresses them by boundng the worst-case load whle stll provdng query results wth useful accuracy. A practcal, bandwdth-aware montorng system that adapts bandwdth budgets to maxmze precson of contnuous aggregate queres under hgh-volume, dynamc workloads. Our mplementaton provdes key optmzatons for mprovng accuracy and fast anomaly detecton n realworld heterogeneous envronments. Our evaluaton demonstrates that provdes a key substrate for scalable montorng: t provdes hgh accuracy n dynamc envronments and performs close to the optmal algorthm under modest bandwdth budgets. The rest of ths paper s organzed as follows. Secton 2 provdes background descrpton of SDIMS [39], a scalable DHT-based aggregaton system, and precson-performance tradeoffs that underle. Secton 3 descrbes the adaptve algorthm that self-tunes bandwdth settngs to mprove result accuracy. Secton 4 presents the mplementaton of n our SDIMS aggregaton system. Secton presents the expermental evaluaton of. Fnally, Secton 6 dscusses related work, and Secton 7 provdes conclusons. 2. POINT OF DEPARTURE extends SDIMS [39] whch embodes two key abstractons for scalable montorng: aggregaton and DHTbased aggregaton. then ntroduces controlled tradeoffs between precson bounds and montorng load. 2.1 DHT-based Herarchcal Aggregaton Aggregaton s a fundamental abstracton for scalable montorng [, 12, 18, 29, 37, 39] because t allows applcatons to access summary vews of global nformaton and detaled vews of rare events and nearby nformaton. 37 Vrtual Nodes (Internal Aggregaton Ponts) 18 19 7 11 7 12 3 4 2 9 6 1 9 3 1 1 11 1 11 11 111 Physcal Nodes (Leaf Sensors) Fgure 4: The aggregaton tree for key n an eght node system. Also shown are the aggregate values for a smple SUM() aggregaton functon. s aggregaton abstracton defnes a tree spannng all nodes n the system. As Fgure 4 llustrates, each physcal node s a leaf and each subtree represents a logcal group of nodes 1. An nternal non-leaf node, whch we call a vrtual node, s smulated by a physcal leaf node of the subtree rooted at the vrtual node. Fgure 4 llustrates the computaton of a smple SUM aggregate. leverages DHTs [29,31,3] to construct a forest of aggregaton trees and maps dfferent attrbutes to dfferent trees for scalablty. DHT systems assgn a long (e.g., 16 bts), random ID to each node and defne a routng algorthm to send a request for ID to a node root such that the unon of paths from all nodes forms a tree DHTtree rooted at the node root. By aggregatng an attrbute wth ID = hash(attrbute) along the aggregaton tree correspondng to DHTtree, dfferent attrbutes are load balanced across dfferent trees. Ths approach can provde aggregaton that scales to large numbers of nodes and attrbutes [, 29, 39]. 2.2 Query Result Approxmaton quantfes the precson of query results n terms of numerc error between the reported result and the actual value. We formally defne the numerc approxmaton of a query result usng Arthmetc Imprecson [21, 26]. Arthmetc mprecson (AI) determnstcally bounds the numerc dfference between a reported value of an attrbute and ts true value [21, 26, 27, 4]. For example, an AI = 1% bounds that the reported value ether underestmates or overestmates the true value by at most 1%. When applcatons do not need exact answers [21, 26, 37, 4], arthmetc mprecson provdes an effectve way to reduce system load by ntroducng addtonal flterng on update propagaton. Next, we descrbe the AI mechansm of how enforces the numerc error bounds whle maxmzng load reducton. AI Mechansm: We frst descrbe the basc mechansm for enforcng AI for each aggregaton subtree n the system. To enforce AI, each aggregaton subtree T for an attrbute has an error budget δ T that defnes the maxmum naccuracy of any result the subtree wll report to ts parent for that attrbute. The root of each subtree dvdes ths error budget among tself δ self and ts chldren δ c (wth δ T δ self + P c chldren δc), and the chldren recursvely 1 Logcal groups can correspond to admnstratve domans (e.g., department or unversty) or groups of nodes wthn a doman (e.g., a /28 subnet wth 14 hosts n the CS department) [16, 39]. L3 L2 L1 L 3

do the same. Note that δ c determnes the chld s error flter to cull as many updates as possble before sendng to the parent, and δ self s useful for applyng addtonal flterng after combnng all updates receved from the chldren. Here we present the AI mechansm for the SUM aggregate snce t s lkely to be common n network montorng [21,26] and fnancal applcatons [1]; other standard aggregaton functons (e.g., MAX, MIN, AVG, etc.) are smlar and defned precsely n an extended techncal report [2]. Ths arrangement reduces system load by flterng small updates that fall wthn the range of values cached by a subtree s parent. In partcular, after a node A wth error budget δ T reports a range [V mn, V max] for an attrbute value to ts parent (where V max V mn + δ T ), f the node A receves an update from a chld, the node A can skp updatng ts parent as long as t can ensure that the true value of the attrbute for the subtree les between V mn and V max,.e., f V mn P c chldren V c mn V max P c chldren V c max where Vmn c and Vmax c denote the most recent update receved from chld c. mantans per-attrbute δ T values so that dfferent attrbutes wth dfferent error requrements and dfferent update patterns can use dfferent δ budgets n dfferent subtrees. 2.3 Case-study Applcaton To gude the system development of and to drve our performance evaluaton, we have bult a case-study applcaton usng SDIMS aggregaton framework: a dstrbuted heavy htter detecton servce. Dstrbuted Heavy Htters (DHH) detecton s mportant for both montorng traffc anomales such as DDoS attacks, botnet attacks, and flash crowds as well as accountng and bandwdth provsonng [11]. Heavy htters are enttes that account for at least a specfed proporton of the total actvty measured n terms of number of packets, bytes, connectons, etc. [11] n a dstrbuted system for example, the top 1 IPs that account for a sgnfcant fracton of total ncomng traffc n the last 1 mnutes [11]. To answer ths dstrbuted query, the key challenge s scalablty for aggregatng per-flow statstcs for tens of thousands to mllons of concurrent flows across all the network endponts n real-tme. For example, a subset of the Ablene [2] traces used n our experments nclude 8 thousand flows that send about 2 mllon updates per hour. To process ths workload, a centralzed system needs to handle about 7 messages per second at the central montor. To scalably compute the global heavy htters lst, we chan two aggregatons where the results from the frst feed nto the second. Frst, SDIMS calculates the total ncomng traffc for each destnaton address from all nodes n the system usng SUM as the aggregaton functon and hash(hh-step1, destip) as the key. For example, tuple (H = hash(hh-step1, 72.179.8.7), 9 KB) at the root of the aggregaton tree T H ndcates that a total of 9 KB of data was receved for 72.179.8.7 across all vantage ponts n the network durng the last tme wndow. In the second step, we feed these aggregated total bandwdths for each destnaton IP nto a SELECT-TOP-1 aggregaton wth key hash(hh-step2, TOP-1) to dentfy the TOP-1 heavy htters among all flows. (1) In Secton we show how s self-tunng algorthm adapts bandwdth settngs to montor a large number of attrbutes and provdes hgh result accuracy by flterng majorty of mce flows [11] (attrbutes wth low frequency) whle prortzng updates for the heavy-htter flows; we expect typcal montorng applcatons to have a large number of mce flows but only a few heavy htters. 3. DESIGN In ths secton we present the desgn and descrbe our self-tunng algorthm that adapts bandwdth settngs at each node to maxmze query precson under gven bandwdth budgets. 3.1 System Model We focus on dstrbuted stream processng envronments wth a large number of data sources that perform n-network aggregaton to compute contnuous aggregate queres over ncomng data streams. The bandwdth resources for query processng may be lmted at a number of ponts n the network. In partcular, a node j s outgong bandwdth (B O j ) may be constraned, a node j s ncomng bandwdth (Bj I ) may be constraned, or both. Note that constranng the ncomng bandwdth bounds (a) the control traffc overhead for montorng and (b) the CPU processng load for computng the aggregaton functon across ncomng data nputs. As dscussed n Secton 1, boundng ths ncomng load s mportant for handlng abnormal traffc condtons. Fnally, these bandwdth capactes may vary (a) among nodes n heterogeneous envronments and (b) wth tme f traffc s shared wth other applcatons. At any tme, each attrbute s numerc value s bounded by an AI error δ. If δ s small, then updates may frequently drve an attrbutes value out of ts last reported range [V mn, V max ], forcng the system to send messages to update the range. A system can, however, reduce ts bandwdth requrements by ncreasng δ. Thus, f a herarchcal aggregaton system has bandwdth constrants, t should determne a δ value at each aggregaton pont that meets the bandwdth constrants nto and out of that pont, and t should select these δ values so as to mnmze the total AI for the attrbute. In partcular, rather than splttng each node s ncomng bandwdth evenly among ts chldren, the system attempts to assgn bandwdth to where t wll do the most good by reducng the resultng mprecson of the node s aggregate values. In the rest of ths secton, we descrbe the algorthm for mnmzng mprecson whle meetng bandwdth constrants n four steps. Frst, we descrbe a smplfed algorthm for a one-level aggregaton tree and statc workloads. For each leaf node n the system, ths algorthm calculates an deal error settng δ and correspondng expected bandwdth consumpton b such that (1) each leaf node s outbound bandwdth (.e., rate of updates sent to the root) s at most ts outgong bandwdth budget, (2) the root node s ncomng bandwdth (.e., sum of update rates nbound from chldren) s at most ts ncomng bandwdth budget, and (3) the sum of the δ s s mnmzed gven the frst two constrants. Second, we descrbe how to handle dynamc workloads where the estmate of AI error δ vs. bandwdth trade-offs, and hence the optmal dstrbuton, can change over tme. A key challenge here s throttlng the rate at whch the system 4

Symbol Meanng B O outgong bandwdth constrant for node B I ncomng bandwdth constrant for node u nput update rate at node σ standard devaton of node s nput workload chld() all chldren of node δ node s AI error settng b node s outgong bandwdth load δ opt node s optmal AI error settng b opt node s optmal outgong bandwdth load Table 1: Summary of key notatons. redstrbutes bandwdth budgets across nodes snce such redstrbuton also ncurs bandwdth costs. Thrd, we generalze the algorthm to handle mult-level aggregaton trees. Fnally, we dscuss how our mplementaton copes wth varablty. In partcular, sets the per-node δs so that the average bandwdth meets a target. However, spkes of update load for an attrbute or concdent updates for multple attrbutes could cause nstantaneous bandwdth to exceed the target. To avod such nstantaneous overload, therefore prortzes pendng updates based on the mpact they wll have on ther aggregate values and drans them to the network at the target rate. 3.2 One-Level Tree Quantfy AI Precson vs. Bandwdth Tradeoff: To estmate the optmal dstrbuton of load budgets among dfferent nodes, we need a smple way of quantfyng the amount of query error reducton that can be acheved when a gven bandwdth budget s used for AI flterng. Intutvely, the flterng gan depends on the sze of the error budget relatve to the nherent varablty n the underlyng data dstrbuton. Specfcally, as llustrated n Fgure, f the precson threshold δ at node s much smaller than the standard devaton σ of the underlyng data dstrbuton, δ s unlkely to flter many data updates but stll consume valuable bandwdth. Meanwhle, f δ s above σ, we would expect the load to decrease quckly as δ ncreases untl the pont where a large fracton of updates are fltered. To quantfy the tradeoff between load and precson, we draw from the basc formulaton of Jan et al. [21] to develop a smple metrc n for capturng the tradeoff between load and error budget. Our metrc utlzes Chebyshev s nequalty whch gves a bound on the probablty of devaton of a gven random varable from ts mathematcal expectaton n terms of ts varance. Let X be a random varable wth fnte mathematcal expectaton µ and varance σ 2. Chebyshev s nequalty states that for any k, P r( X µ kσ) 1 k 2 (2) For AI flterng, the term kσ represents the error budget δ for node. Substtutng for k n Equaton 2 gves: P r( X µ δ ) σ2 δ 2 Intutvely, ths equaton mples that f δ σ.e., the error budget s smaller than the standard devaton (mplyng k 1), then δ s unlkely to flter many data updates (Fgure.) In ths case, Equaton 3 provdes only a weak bound on the message cost: the probablty that each ncomng update wll trgger an outgong message s upper bounded by 1. (3) u Message Load σ AI Error Fgure : Expected message load vs. AI error. However, f δ kσ for any k 1, the fracton of unfltered updates s probablstcally bounded by σ2. In general, δ 2 gven the nput update rate u for node wth error budget δ, the expected message cost for node per unt tme s: j M = MIN u, σ 2 δ 2 u ff We use the expected message cost M to optmally set node s outgong bandwdth b opt. Estmate Optmal Bandwdth Settngs under Fxed Load: To estmate the optmal load dstrbuton and settngs of AI error δs at each node n a one-level tree rooted at node r, we formulate an optmzaton problem of mnmzng the total error for a SUM aggregate at root r under gven bandwdth budgets Br I (at the root) and B O (at chld ). Specfcally, we have: chld(r), b opt s.t. P MIN δ opt chld(r) P σ 2 u (δ opt chld(r) ) 2 BI r = σ2 u (δ opt ) 2 mn{bo, u } where b opt denotes the optmal settng of outgong bandwdth of node to meet P the global objectve of mnmzng the total error δ opt subject to the constrants chld(r) of (1) node s outgong bandwdth budget (.e., b opt mn{b O P, u }) and (2) ncomng bandwdth budget at root r (.e., b opt Br I ). Ths formulaton s based on the chld(r) AI flterng model.e., node suppresses updates wthn δ of ts last transmtted update. δ opt (4) () subject to the constrant that Enforcng Incomng Bandwdth Constrant: To solve Equaton (), we frst relax the outgong bandwdth constrants, b opt mn{b O, u } snce the ncomng bandwdth determnes the processng overhead whch seems lkely to become a bgger bottleneck than the outgong bandwdth. Later, we provde an optmal soluton when the outgong bandwdth constrants are also enforced. We use the method of Lagrangan Multplers to fnd the extremum of the objectve functon f : `P «P σ g : 2 u B I (δ opt ) 2 r. The above formulaton yelds a closed-form and computa-

tonally nexpensve optmal soluton [2]: v P 3 u σ 2 c u c t δ opt c chld(r) = qσ 3 2 u (6) B I r whch provdes a closed-form formula for settng b opt b opt = σ2 u ) = 2 BI r (δ opt : p 3 σ 2 u P 3 (7) σ 2 c u c c chld(r) As a smple example, f u = σ = 1, qthen each node sets the same δ for the attrbute as δ = N, and the bandwdth budget for node wll be 1 δ 2 B I r = BI r N.e., gven a total bandwdth budget and unform data dstrbuton across nodes, each node gets a unform share of the bandwdth to update each attrbute. Note that P to set b opt (Equaton (7)), each node needs to know 3 σ 2 c u c and root r s ncomng bandwdth c chld(r) budget B I r ; computes a smple SUM aggregate to obtan ths nformaton at each parent and propagates down to all ts chldren n an aggregaton tree. As a smple optmzaton, these messages are pggy-backed on data updates. Enforcng Outgong Bandwdth Constrants: Note that the above optmal load assgnment assumes that the outgong bandwdth constrant b opt c mn{bc O, u c } holds for every chld c. If for a node, the above optmal soluton doesn t satsfy b opt mn{b O, u }, then we need to set b opt = mn{b O, u } to satsfy ts bandwdth constrant. Ths stuaton may arse n heterogeneous envronments where a subset of nodes may experence larger nput loads (e.g., DDoS attacks) or may become severely resourceconstraned (e.g., sensor networks wth low power devces). Note that settng b opt = mn{b O, u } can free up part of r s ncomng capacty Br I, whch can be reassgned to other chldren to ncrease ther bandwdth budget thereby mprovng the overall accuracy. therefore apples an teratve algorthm that n each teraton determnes all saturated chldren C sat at each step, fxes ther load budgets and error settngs, and recomputes Equatons (6), (7) for all the remanng chldren (assumng chld set C sat s absent). A chld j s labeled saturated f b opt j Bj O.e., the avalable outgong bandwdth s less than or equal to that requred by the optmal soluton. If chld set C sat s empty, the procedure termnates gvng the optmal bandwdth and AI error settngs for each node; all saturated nodes set bandwdth equal to ther outgong load threshold. Note that for our DHT-based aggregaton trees, the fan-n for a node s typcally 16 (.e., a 4-bt correcton per hop) so the teratve algorthm runs n constant tme (at most 16 tmes). 3.3 Self-Tunng Bandwdth Settngs The above optmal soluton s derved assumng that σ and u are gven and reman constant. In practce, σ and u may change over tme dependng on the workload characterstcs. therefore performs self-tunng to dynamcally readjust the bandwdth settngs to mnmze error. Relaxaton: A self-tunng algorthm that adapts too rapdly may react napproprately to transent stuatons. We thus apply exponental smoothng to adjust the bandwdth settngs.e., where α =.. b new = α b opt + (1 α) b (8) Cost-Beneft Throttlng: Fnally, each node needs to send messages to the root to mnmze the query error. Therefore, we need to prortze sendng those messages that beneft precson the most. A nave refreshng algorthm that updates attrbutes n a round-robn fashon could easly spend network messages that do not reduce error whle consumng valuable bandwdth resource. Lmtng sendng of nonuseful updates s a partcular concern for applcatons lke DHH that montor a large number of attrbutes, only a few of whch are actve enough to be worth optmzng. To address ths challenge, after computng the new error budgets, a node computes a charge metrc for each attrbute a, whch estmates the reducton n error ganed by refreshng a: charge a = (T curr T a lastsent) D a (9) where T curr s the current tme, TlastSent a s the last tme an attrbute a s update was sent, and D a denotes the devaton between a s current value at that node and ts last reported range [V mn, V max] to the parent. For example, f a s AI error range cached at the parent s [1, 2] and a new update wth value 11 arrves at a chld node, we expand the range to [1,11] at the chld to nclude the new value settng D a =1. Notce that an attrbute s charge wll be large f () there s a large error mbalance (.e., D a s large), or () there s a long-lastng mbalance (e.g., T curr TlastSent a s large). Further, only usng D a may hurt precson f the attrbute has a repeated behavor of quckly dvergng after the last refresh. Therefore, the latter term prortzes attrbutes who are lkely to agan dverge slowly after beng refreshed thereby gvng a long-term precson beneft. Snce redstrbuton also consumes bandwdth budgets, we only send messages to readjust bandwdth settngs b new when dong so s lkely to reduce the tme-averaged error for those attrbutes by at least a threshold τ (.e., f charge a > τ). Further, to ensure the nvarant that a node does not exceed ts bandwdth budget whle sendng updates for multple attrbutes, we present a smple technque to prortze updates across multple attrbutes n Secton 3.. 3.4 Mult-Level Trees To scale to a large number of nodes, we extend our basc algorthm for a one-level tree to a dstrbuted algorthm for a mult-level aggregaton herarchy. Note that for a one-level aggregaton tree, leaf nodes use AI error δ to flter updates. In addton, to reduce the communcaton and processng load n an aggregaton herarchy, the nternal nodes may also retan a δ self to help prevent updates receved from ther chldren from beng propagated further up the tree [1, 21]. At any nternal node n the aggregaton tree, the selftunng algorthm works n a smlar manner as that n a one-level tree case: the nternal node s a local root for each of ts mmedate chldren, and the bandwdth targets B I for the parent and Bc O for each chld c are nput as constrants n the optmzaton problem formulaton n Equaton (). The key dfference s that for chldren who are nternal nodes, we use ther AI error δ self n ths optmzaton framework. To estmate the optmal bandwdth and AI error settngs for each chld, the parent node p tracks ts ncomng update rate 6

.e., the aggregate number of messages sent by all ts chldren per tme unt (u p ) and the standard devaton (σ p ) of updates receved from ts chldren. Note that u c, σ c reports are accumulated by chld c untl they can be pggy-backed on an update message to ts parent. Gven ths nformaton, the parent node estmates for each chld c, the optmal outgong bandwdth settngs b opt c and the correspondng AI error settngs δ opt c(self) that mnmze the total error n the aggregate value computed across the chldren gven bandwdth constrants. Ths procedure s performed at each parent n the aggregaton tree. 3. Prortzng Pendng Updates Fnally, we descrbe how our mplementaton addresses an mportant and practcal challenge of varablty n bandwdth targets. In partcular, we want to avod stuatons where a node may exceed nstantaneous bandwdth target ether due to (1) spkes of update load for an attrbute, (2) concdent updates for multple attrbutes, or (3) sudden ncrease n avalable bandwdth, but may stll meet ts average bandwdth target. Ths scenaro s undesrable snce a node may flood ts parent wth updates that far exceed the parent s ncomng capacty. To address ths problem, our mplementaton provdes a prorty heap that stores all pendng updates that need to be sent ordered by ther prorty. Usng ths heap data structure, effcently prortzes pendng updates based on the mpact they wll have on ther aggregate values and drans them to the network at a target rate. Specfcally, at each tme step, a node keeps removng the maxmum prorty attrbute from the heap and sendng t to the correspondng parent untl the node s nstantaneous target bandwdth s reached. Note that to compare prortes across dfferent attrbutes that have dfferent value ranges across dfferent queres, we normalze an attrbute a s refresh prorty (Equaton (9)) wth respect to ther standard devaton by dvdng the devaton D a by the standard devaton σ a of that attrbute. Further, to support queres and attrbutes (wthn a query) wth dfferent mportance values, provdes a smple mechansm of weghtng an attrbute s refresh prorty wth ts mportance value. 4. IMPLEMENTATION In ths secton, we descrbe several mportant optmzatons and key ssues for mplementng n our prototype mplementaton. Frst, we present a key performance optmzaton of temporal batchng of updates to reduce load and to quantfy staleness of query results n large-scale montorng systems. Second, we descrbe how to buld bandwdthaware aggregaton trees to mprove accuracy and fast anomaly detecton n heterogeneous network envronments. Then, we descrbe how handles falures and reconfguratons. Fnally, we dscuss how to mprove precson for dfferent aggregates. 4.1 Temporal Imprecson ntegrates temporal mprecson wth arthmetc flterng to provde staleness guarantees on query results and to reduce the montorng load. Temporal Imprecson (TI) bounds the delay from when an event/update occurs untl t s reported [24, 26, 34, 4]. A temporal mprecson of TI (e.g., TI = 3 seconds) guarantees that every event that occurred TI or more seconds level 4 level 3 level 2 level 1 level Event level 4 level 3 level 2 level 1 level Event TI TI (a) Send synchronzed updates every TI seconds. TI/4 TI/2 3TI/4 TI (b) Send unsynchronzed updates every TI/ l seconds. Next TI nterval starts here Fgure 6: For a gven TI bound, ppelned delays wth synchronzed clocks (a) allows nodes to send less frequently than unppelned delays wthout synchronzed clocks (b). ago s reflected n the reported result; events younger than TI may or may not be reflected. In, each attrbute has a TI nterval durng whch ts updates are batched nto a combned message, checked f the combned update drves the aggregate value out of the last reported AI range, and then pushed nto the prorty queue to be sent to the parent. Temporal mprecson benefts montorng applcatons n two ways. Frst, t accounts for nherent network and processng delays n the system; gven a worst case per-hop cost hop max even mmedate propagaton provdes a temporal guarantee no better than l hop max where l s the maxmum number of hops from any leaf to the root of an aggregaton tree. Second, explctly exposng TI allows to reduce load by usng temporal batchng: a set of updates at a leaf sensor are condensed nto a perodc report or a set of updates that arrve at an nternal node over a tme nterval are combned nto a sngle message before beng sent further up the tree [2]. Ths temporal batchng mproves scalablty by reducng processng and network load as we show usng experments on a network montorng applcaton n Secton. mplements TI usng a smple mechansm of havng each node send updates to ts parent once per TI/l seconds smlar to TAG [24] as shown n Fgure 6(b). Further, to maxmze the possblty of batchng updates, when clocks are synchronzed 2, ppelnes delays across tree levels so that each node sends once every (TI ) seconds wth each level s sendng tme staggered so that the updates from level arrve just before level + 1 can send (Fgure 6(a)). The term accounts for the worst-case per hop delays and maxmum clock skew; detals are n the extended techncal report [2]. 4.2 Bandwdth-Aware Tree Constructon As descrbed n Secton 2, leverages DHTs [29 31, 3] to construct a forest of aggregaton trees and maps dfferent attrbutes to dfferent trees [, 12, 29, 39] for scalablty and load balancng. then uses these trees to perform n-network aggregaton. 2 Algorthms n the lterature can acheve clock synchronzaton among nodes to wthn one mllsecond [38]. 7

constructs bounded fan-n, bandwdth-aware aggregaton trees to mprove result accuracy and quckly detect anomales n heterogeneous envronments. Recent studes [9, 11, 18, 21, 22] suggest that only a few attrbutes (e.g., elephant flows [11]) generate a sgnfcant fracton of total traffc n many montorng applcatons. Thus, to provde fast anomaly detecton, an aggregaton tree should quckly route the updates of elephant flows towards the root such that no nternal node becomes a processng/communcaton bottleneck due to ether hgh n-degree or low bandwdth. DHTs provde dfferent degrees of flexblty n choosng neghbors and next-hop paths n buldng aggregaton trees [14]. Many DHT mplementatons [13, 3] use proxmty (usually round-trp latency) n the underlyng network topology to select neghbors n the DHT overlay. Ths neghbor selecton n turn determnes the parent-chld relatonshps n the aggregaton tree. However, n a heterogeneous envronment where dfferent nodes have dfferent bandwdth budgets, an aggregaton tree formed solely based on RTTs may degrade the qualty precson of the query result for two reasons. Frst, a node may not have suffcent outgong bandwdth to send updates up n the tree even though ts underlyng tree may be suffcently well-provsoned. In such an envronment, ths node becomes a bottleneck as the updates sent by the underlyng subtree go wasted wthout beneftng the query precson. Second, a resource-lmted parent may not be able to process the aggregate outgong update rate of all ts chldren. Thus, a practcal protocol for buldng trees would be to bound the number of chldren at each nternal node and use both latences and bandwdth constrants to select the best parent at each tree level. To mprove accuracy and quckly dentfy anomales, bulds DHT-based aggregaton trees as follows: Bound the fan-n (.e., number of chldren) at each parent node. In our mplementaton, a chld node selects ts parent such that the fan-n at the parent s at most 16.e., each parent has maxmum up to 16 chldren. Use both network latency and avalable bandwdth capacty as the proxmty metrc for selectng parent nodes. orders DHT neghbors of a node such that they have the hghest ncomng bandwdth capacty and have network latency below a specfed threshold. Thus, nodes close n proxmty and havng hgh bandwdth capactes are hghly lkely to be selected as parent nodes. Fnally, note that n some envronments, t mght be useful to select nodes wth low bandwdth as parents e.g., f the nput workloads at the leaf nodes comes from an ndependent unform dstrbuton, then a node closer to the root s expected to receve very few updates snce an aggregate (e.g., SUM) s lkely to become more stable gong towards the root. However, n practce, real workloads (1) are often nonunform wth few attrbutes generatng a sgnfcant fracton of the total traffc [11] and (2) exhbt both temporal and spatal skewness wth nput rates unexpectedly ncreasng over tme and across nodes. We quantfy the effectveness of constructng bounded fan-n, bandwdth-aware aggregaton trees n Secton. 4.3 Robustness Falures and reconfguratons are common n large scale systems. As a result, a query mght return a stale answer when nodes whose nputs are needed to compute the aggregate result become unreachable. More mportantly, n a large scale montorng system, such falures can nteract badly wth our technques for provdng scalablty herarchy, arthmetc flterng, and temporal batchng. For example, f a montorng subtree s slent over an nterval, t s dffcult to dstngush between two cases: (a) the subtree has sent no updates because the nputs have not sgnfcantly changed or (b) the nputs have sgnfcantly changed but the subtree s unable to transmt ts report. As a result, reported results can devate arbtrarly from the truth. Addressng ths fundamental problem of node falures and network dsruptons n large-scale montorng systems s beyond the scope of ths paper. In a separate study, we have developed a new consstency metrc called Network Imprecson (NI) that characterzes and quantfes the accuracy of query results n the face of falures, network dsruptons, and system reconfguratons; the detals are avalable n an extended techncal report [2]. 4.4 Improvng precson for dfferent aggregates focuses on SUM aggregate snce t s lkely to be common n network montorng [21, 26] and fnancal applcatons [1]; maxmzng precson for the AVG aggregate s smlar to SUM. For MAX, MIN aggregates, f the AI error budget s fxed, then the best error assgnment s to gve equal AI error budget to all the leaf nodes. However, snce bandwdth budget s lmted n practce, each node may set ts precson dfferently to meet ts avalable bandwdth. An ntutve soluton to compute global MAX, MIN values s to smply broadcast them to each node, and a node sends an update only f t changes the global aggregate. However, ths approach may lmt scalablty n large-scale data stream systems that montor a large number of dynamc attrbutes. For the TOP-K aggregate, acheves hgh accuracy by prortzng updates based on the hghest aggregate values and the result devaton D a. Our experments n Secton show that ths approach works well n practce. We plan to develop mechansms for other aggregates such as quantles n future work.. EXPERIMENTAL EVALUATION In ths secton, we present the precson and scalablty results of an extensve expermental study of our algorthm n a data streamng envronment. Frst, we use smulatons to evaluate the mprovement n query result accuracy due to s adaptve bandwdth settngs. Second, we quantfy the accuracy acheved by for the DHH applcaton n a network montorng mplementaton. To perform ths experment, we mplemented a prototype of usng SDIMS aggregaton system [39] on top of FreePastry [31]. We used Ablene [2] netflow traces and performed the evaluaton on 12 node nstances mapped to 3 physcal machnes n the department Condor cluster. Fnally, we nvestgate the precson benefts of constructng bandwdth-aware tree constructon usng our prototype. In summary, our expermental results show that s an mportant and effectve substrate for scalable montorng: provdes hgh result accuracy whle boundng the montorng load, contnuously adapts to dynamc workloads, and acheves sgnfcant precson benefts for an mportant real-world montorng applcaton of detectng ds- 8

4 3 3 2 2 1 1 Optmal Unform 3 3 2 2 1 1 Optmal Unform 14 12 1 8 6 4 Optmal Unform 2.2.4.6.8 1 Bandwdth Fracton.2.4.6.8 1 Bandwdth Fracton.2.4.6.8 1 Bandwdth Fracton Fgure 7: provdes hgher precson benefts as skewness n a workload ncreases. The three fgures show average error vs. load budget for three dfferent skewness settngs (a) 2:8% (b) :% (c) 9:1%. trbuted heavy htters..1 Smulaton Experments In ths secton, we quantfy the result accuracy acheved by compared to unform bandwdth allocaton and an dealzed optmal algorthm. Frst, we assess the effectveness of adaptve bandwdth settngs n mprovng result precson as skewness n a workload ncreases. Second, we analyze the effect of fluctuatng bandwdth. Fnally, we evaluate for dfferent workloads. In all experments, all actve sensors are at the leaf nodes of an aggregaton tree. Each sensor generates a data value every tme unt (round) for two sets of synthetc workloads for 1, rounds: (1) a random walk pattern n whch the value ether ncreases or decreases by an amount sampled unformally from [., 1.], and (2) a Gaussan dstrbuton wth standard devaton 1 and mean. We smulate m {1, 1} data sources each havng n {1, 1} attrbutes n one-level and herarchcal topologes and under fxed and fluctuatng bandwdth loads. All attrbutes have equal weghts, messages have the same sze, and each message uses one unt of bandwdth. Evaluatng Update Rate Skewness: Frst, we evaluate the precson benefts of compared to other approaches as skewness n a workload ncreases. We compare t wth (1) the optmal algorthm under the dealzed and unrealstc model of perfect global knowledge of each attrbute s dvergence at each data source and (2) a unform allocaton polcy where the ncomng bandwdth capacty B at a parent s allocated equally ( B ) among ts c chldren. c Although ths smple polcy s correct (the total ncomng load from the chldren s guaranteed to never exceed the ncomng load of ther parent at all tmes), t s not generally the best polcy as we show n our experments. We frst perform a smple experment for a one-level tree, and later show the results for general herarchcal topologes. Fgure 7 shows the query precson acheved by for m=1 and n=1 under the followng skewness settngs: (a) 2:8% (b) :% (c) 9:1%. For example, the 2:8% skewness represents that randomly selected 2% attrbutes are updated wth probablty.1 whle the remanng ones are updated consstently every second under the random walk model. In all subsequent graphs n ths secton, the x- axs denotes the bandwdth budget as a fracton of the total cost m.n of refreshng all the attrbutes across all nodes; the y-axs shows the resultng average error. For 2:8% skewness, snce only a small fracton of attrbutes are stable, can only reclam up to 2% load budget from stable attrbutes sources and dstrbute t to dynamc sources to reduce ther error. For small bandwdth budgets, mproves accuracy by up to 3% compared to unform allocaton. The optmal algorthm mproves accuracy by 27% over. As the load budget ncreases, converges to the optmal soluton. mproves error by 4% over unform allocaton under 2% load budget and by more than an order of magntude under suffcently large budgets. For the : case, can reclam % of the total load budget compared to unform allocaton and gve t to unstable sources. In ths case, reduces error by up to % over unform polcy at 4% bandwdth and acheves almost the accuracy of the optmal soluton. Fnally, for 9% skewness, acheves the accuracy of the optmal algorthm even under 2% fracton of the total bandwdth and mproves accuracy by more than an order of magntude over unform allocaton. We observed qualtatvely smlar results for other m and n settngs. Note that the advantage of s self-tunng algorthm depends on the skewness n the workload. We expect that for systems montorng a large numbers of attrbutes (e.g., top-k heavy htters query), some attrbutes (e.g., the elephants) wll have hgh varablty n data values and update rates so these attrbutes gan only a modest advantage n accuracy from, whle other attrbutes (e.g., the mce) wll have large ratos and hence, the query wll gan large advantages snce a top-k query only needs to provde hgh accuracy for the top-k flows. We typcally expect many more mce attrbutes than elephant attrbutes for common montorng applcatons. Effect of Fluctuatng Bandwdth: Next, we evaluate the effectveness of under fluctuatng bandwdth. We vary the ncomng bandwdth over tme followng a sne wave pattern and set the maxmum rate of bandwdth change to 1%, 2%, and 3% for m = 1 nodes each havng n = 1 attrbutes. We use update rate skewness of % as descrbed above. From Fgure 8, we observe that for 1% varaton, provdes % reducton n error over unform allocaton at 4% bandwdth fracton. As we ncrease the bandwdth fluctuaton from 1% to 3%, reduces error by about 7% under 4% bandwdth fracton, and more than an order of magntude for larger fractons. Further, n all cases, acheves accuracy close to the optmal algorthm. Evaluatng Dfferent Workloads: Fnally, we evaluate the performance of under dfferent confguratons 9

3 3 Optmal Unform 3 3 Optmal Unform 3 3 Optmal Unform 2 2 1 1 2 2 1 1 2 2 1 1.2.4.6.8 1 Bandwdth Fracton.2.4.6.8 1 Bandwdth Fracton.2.4.6.8 1 Bandwdth Fracton Fgure 8: effectvely reduces naccuracy even under fluctuatng bandwdth. The maxmum bandwdth varaton n the fgure s (a) 1% (b) 2% and (c) 3%. by varyng nput data dstrbuton, standard devaton (step szes), and update frequency at each node. The workload data dstrbuton s generated from a random walk pattern and Gaussan models. For standard devaton/stepsze, 7% nodes have unform parameters as prevously descrbed; the remanng 3% nodes have these parameters proportonal to rank (.e., wth localty) or randomly assgned (.e., no localty) from the range [., 1]. Fgure 9 shows the correspondng results for dfferent settngs of data dstrbuton and standard devaton for m=1 and n=1. The update frequency s set to.7 skewness as descrbed prevously. We make three key observatons. Frst, mnmzes error close to the optmum algorthm under the rank based assgnment (Fgure 9(a),(c)). Second, under random assgnment, acheves lesser accuracy benefts snce updates generated from wthn the same subtree are not correlated. In both cases, as the bandwdth ncreases, quckly mnmzes error close to the optmal. Thrd, because step-szes are based on node rank, prortzes attrbutes havng the largest stepszes and apples cost-beneft throttlng to ensure that the precson benefts exceed costs. The unform polcy, however, does not make such a dstncton equally favorng all attrbutes that need to be refreshed, thereby ncurrng a hgh error. Fnally, under lmted bandwdth, refreshng mce attrbutes wth small step szes does not sgnfcantly reduce the result error for queres such as TOP-K heavy htters but consumes valuable bandwdth resources. For all these confguratons, reduces error by up to an order of magntude over unform allocaton. The optmal approach reduces error by 2% over. Overall, across all confguratons n Secton.1, reduces naccuracy by up to an order of magntude compared to unform allocaton and s wthn 27% of the optmal algorthm under modest bandwdth fracton..2 Testbed Experments In ths secton, we quantfy the error reducton of reported results due to self-tunng precson for the heavy htter montorng applcaton. We use multple netflow traces obtaned from the Ablene [2] Internet2 backbone network. The traces were collected from 3 Ablene routers for 1 hour; each router logged per-flow data every mnutes, and we splt these logs nto 12 buckets based on the hash of source IP. As descrbed n Secton 2.3, our DHH applcaton executes a Top-1 heavy htter query on ths dataset for trackng the top 1 flows 1 9 8 7 6 4 3 2 1 Optmal Unform.2.4.6.8 1 Bandwdth Fracton 1 9 8 7 6 4 3 2 1 Optmal Unform.2.4.6.8 1 Bandwdth Fracton 9 8 7 6 4 3 2 1 Optmal Unform.2.4.6.8 1 Bandwdth Fracton 1 9 8 7 6 4 3 2 1 Optmal Unform.2.4.6.8 1 Bandwdth Fracton Fgure 9: Precson benefts of vs. optmal algorthm and unform allocaton for dfferent {workload, step szes/standard devaton} confguratons: (a) random walk, rank (b) random walk, random (c) Gaussan, rank, (d) Gaussan, random. (destnaton IP as key) n terms of bytes receved over a 3 second movng wndow shfted every 1 seconds. We analyzed ths workload [2] and observed that the 12 sensors track roughly 8, flows and send around 2 mllon updates n an hour. Further, t shows a heavy-taled Zpf-lke dstrbuton: 6% flows send less than 1 KB of aggregate traffc, 9% flows less than KB, and 99% of the flows send less than 33 KB durng the 1-hour run; the maxmum aggregate flow value s about 179.4 MB. We observed a smlar heavy-taled dstrbuton for the number of updates per flow (attrbute) [2]. For ths experment, we fxed the outgong bandwdth as a constant between. and 1 messages per node per second. Snce, we bound the fan-n of an nternal node n our DHT-based aggregaton tree to 16, the maxmum ncomng load at any node s thus 16 messages per second whch s a reasonable processng load n our envronment. Fgure 1 plots the outgong load per node on the x-axs and the result precson acheved for the top-1 heavy htter query on the y-axs. The dfferent lnes n the graph correspond to a temporal batchng nterval of 1 seconds, 3 seconds, 6 seconds, and fve mnutes. Each data pont denotes the average result dvergence for the TOP-1 heavy htters set. 1

4 4 3 3 2 2 1 1 TI = 1s TI = 3s TI = 6s TI = m 1 2 3 4 6 7 8 9 1 Messages sent per node per second Fgure 1: Bandwdth load vs. query result precson for the DHH applcaton. Tree-BW metrc 1.8.6.4.2 BW-Unaware DHT BW-aware DHT.1.2.3.4..6.7.8.9 1 Skewness n Bandwdth Fgure 11: Bandwdth skewness vs. normalzed Tree-BW metrc for bandwdth-aware and -unaware DHTs. It s mportant to note that for ths query, the result set was consstent under TI of 1, 3, and 6 seconds wth the true result set. However, under TI of mnutes, % results dffered by less than 3%. We observe that under TI = 1 seconds, can reduce average error from 23% at. load budget to about 2% at load budget of 1. Comparng across TI values, the average error ncreases wth ncrease n the TI batchng nterval. However, we beleve that for many montorng envronments, the performance benefts of TI wll outwegh the ncreased error as the bandwdth load can be sgnfcantly decreased due to temporal batchng..3 Evaluatng Bandwdth-aware Tree Formaton In ths secton, we evaluate the benefts of s bandwdthaware tree constructon. As descrbed n Secton 4, uses both bandwdth and latency as proxmty metrcs n DHT routng compared to only latency n tradtonal DHT mplementatons. For quanttatve comparson, we compute a Tree-BW metrc for a tree as the sum of L B across all nodes, where L s the number of leaves n the subtree rooted at node and B s the bandwdth of node. Ths weghted summaton metrc s hgher for trees that select nternal nodes havng hgher bandwdth capactes. For ths experment, we classfy nodes belongng to two dfferent classes of bandwdth budgets: 1Mbps and 1Mbps. A.1 skewness n bandwdth mples that 1% of the nodes have 1Mbps bandwdth and the remanng have 1Mbps bandwdth. Fgure 11 compares the benefts of s tree constructon usng bandwdth-aware DHT routng aganst trees constructed n a bandwdth-oblvous unaware for a 124-node system; the y-axs shows the normalzed Tree- BW (wth respect to the maxmum value observed) metrc for varous bandwdth skewness settngs (x-axs). Note that s bandwdth-aware DHT routng acheves better Tree-BW metrc values by up to a factor of 3.7x over a bandwdth-unaware DHT. 6. RELATED WORK s motvated by pror fx error, mnmze load technques [21,22,24,26,37,4], but t departs n three sgnfcant ways drven by our focus on provdng overload reslence for practcal, scalable montorng. Frst, reformulates the key optmzaton problem, whch we beleve s an mportant contrbuton. Whle pror approaches reduce cost under fxed precson constrants, n practce, bursty, hgh-volume workloads stll rsk overloadng a montorng system as dscussed n Secton 1. Infact, t s precsely durng these abnormal events that a montorng system needs to provde hgh accuracy results wth a real-tme query response. therefore bounds the worst-case system cost to provde overload reslence. Second, the techncal advances to solve ths new problem are sgnfcant. Whle uses Chebyshev nequalty [21] to capture the load vs. error trade-off, t faces new constrants of lmted bandwdth budgets both at a parent and each of ts chldren n a herarchcal aggregaton tree. Gven these constrants, self-tunes bandwdth settngs n a near-optmal manner to acheve hgh accuracy under dynamc workloads. Thrd, s real-world mplementaton provdes key solutons to mprove result accuracy and fast anomaly detecton n montorng systems. Recently, load sheddng technques [3, 36] have been proposed to handle system overload condtons. The key dea s to carefully drop some tuples to reduce processng load but at the expense of reducng the accuracy of query answers. Further, these approaches handle load spkes assumng ether CPU [3, 36] or the man memory [17] as the key resource bottleneck. In comparson, bandwdth s the prmary resource constrant n. Best-effort cache synchronzaton technques [7,28] am to mnmze the dvergence between source data objects and cached copes n a one-level tree. In these technques, each object s treated ndvdually for refreshng. In comparson, performs herarchcal query processng for scalablty and t proves valuable to co-relate updates to the same attrbute at dfferent data sources n order to maxmze accuracy of an aggregate query result. To our knowledge, there has not been pror work on maxmzng precson of aggregate queres n herarchcal trees under lmted bandwdth. Babcock and Olston [4] focus on effcently computng top k aggregate values gven fxed error n a sngle-level tree, but do not consder how to maxmze precson of top k results under fxed bandwdth n herarchcal trees. Slbersten et al. propose a samplng-based approach (combned wth global knowledge) at randomly chosen tme steps for computng top-k queres n sensor networks [33]. Ther approach focuses on returnng the k nodes wth the hghest 11

sensor readngs In comparson, focuses on largescale aggregaton queres such as dstrbuted heavy htters whch requre computng a total aggregate value for each of the tens of thousands to mllons of attrbute across all the nodes n the network. Fnally, gven the unpredctable traffc nature of network anomales (e.g., DDoS attacks, flash crowds, botnet attacks), samples based on past readngs are unlkely to be effectve. Sketches are small-space data structures that provde approxmate answers to aggregate queres [8]. These technques, however, requre error bounds to be set a pror to provde the approxmaton guarantees. 7. CONCLUSIONS We desgned, mplemented, and evaluated a scalable, load-aware montorng system that performs self-tunng of bandwdth budgets to maxmze precson of contnuous aggregate queres. In future work, we plan to examne technques for secure nformaton aggregaton n dstrbuted stream processng envronments spannng multple admnstratve domans as well as develop a broad range of dstrbuted montorng applcatons that could beneft from. 8. REFERENCES [1] D. J. Abad, Y. Ahmad, M. Balaznska, U. Centntemel, M. Chernack, J.-H. Hwang, W. Lndner, A. S. Maskey, A. Rasn, E. Ryvkna, N. Tatbul, Y. Xng, and S. Zdonk. The Desgn of the Boreals Stream Processng Engne. In CIDR, Aslomar, Calforna, 2. [2] Ablene nternet2 network. http://ablene.nternet2.edu/. [3] B. Babcock, M. Datar, and R. Motwan. Load sheddng for aggregaton queres over data streams. In ICDE, 24. [4] B. Babcock and C. Olston. Dstrbuted top-k montorng. In SIGMOD, June 23. [] A. Bharambe, M. Agrawal, and S. Seshan. Mercury: Supportng Scalable Mult-Attrbute Range Queres. In SIGCOMM, Portland, OR, August 24. [6] R. Cheng, B. Kao, S. Prabhakar, A. Kwan, and Y. Tu. Adaptve stream flters for entty-based queres wth non-value tolerance. In VLDB, 2. [7] J. Cho and H. Garca-Molna. Synchronzng a database to mprove freshness. In SIGMOD, 2. [8] G. Cormode and M. Garofalaks. Sketchng streams through the net: dstrbuted approxmate query trackng. In VLDB, 2. [9] C. D. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Ggascope: A stream database for network applcatons. In SIGMOD, 23. [1] A. Delgannaks, Y. Kotds, and N. Roussopoulos. Herarchcal n-network data aggregaton wth qualty guarantees. In EDBT, 24. [11] C. Estan and G. Varghese. New drectons n traffc measurement and accountng. In SIGCOMM, 22. [12] M. J. Freedman and D. Mazres. Sloppy Hashng and Self-Organzng Clusters. In IPTPS, 23. [13] FreePastry. http://freepastry.rce.edu. [14] P. K. Gummad, R. Gummad, S. D. Grbble, S. Ratnasamy, S. Shenker, and I. Stoca. The mpact of dht routng geometry on reslence and proxmty. In SIGCOMM, 23. [1] R. Gupta and K. Ramamrtham. Optmzed query plannng of contnuous aggregaton queres n dynamc data dssemnaton networks. In WWW, pages 321 33, 27. [16] N. J. A. Harvey, M. B. Jones, S. Sarou, M. Themer, and A. Wolman. SkpNet: A Scalable Overlay Network wth Practcal Localty Propertes. In USITS, March 23. [17] Y. Hu, D. M. Chu, and J. C. Lu. Adaptve flow aggregaton - a new soluton for robust flow montorng under securty attacks. In NOMS, 26. [18] R. Huebsch, J. M. Hellersten, N. Lanham, B. T. Loo, S. Shenker, and I. Stoca. Queryng the Internet wth PIER. In VLDB, 23. [19] N. Jan, L. Amn, H. Andrade, R. Kng, Y. Park, P. Selo, and C. Venkatraman. Desgn, mplementaton, and evaluaton of the lnear road benchmark on the stream processng core. In SIGMOD, 26. [2] N. Jan, D. Kt, P. Mahajan, P. Yalagandula, M. Dahln, and Y. Zhang. PRISM: Precson-Integrated Scalable Montorng (extended). Techncal Report TR-7-1, UT Austn Department of Computer Scences, 27. [21] N. Jan, D. Kt, P. Mahajan, P. Yalagandula, M. Dahln, and Y. Zhang. Star: Self tunng aggregaton for scalable montorng. In VLDB, 27. [22] R. Keralapura, G. Cormode, and J. Ramamrtham. Communcaton-effcent dstrbuted montorng of thresholded counts. In SIGMOD, 26. [23] J. Klenberg. Bursty and herarchcal structure n streams. In KDD, 22. [24] S. R. Madden, M. J. Frankln, J. M. Hellersten, and W. Hong. TAG: a Tny AGgregaton Servce for Ad-Hoc Sensor Networks. In OSDI, 22. [2] R. Mahajan, S. Bellovn, S. Floyd, J. Vern, and P. Scott. Controllng hgh bandwdth aggregates n the network. In ACM SIGCOMM CCR, 21. [26] C. Olston, J. Jang, and J. Wdom. Adaptve flters for contnuous queres over dstrbuted data streams. In SIGMOD, 23. [27] C. Olston and J. Wdom. Offerng a precson-performance tradeoff for aggregaton queres over replcated data. In VLDB, Sept. 2. [28] C. Olston and J. Wdom. Best-effort cache synchronzaton wth source cooperaton. In SIGMOD, 22. [29] C. G. Plaxton, R. Rajaraman, and A. W. Rcha. Accessng Nearby Copes of Replcated Objects n a Dstrbuted Envronment. In ACM SPAA, 1997. [3] S. Ratnasamy, P. Francs, M. Handley, R. Karp, and S. Shenker. A Scalable Content Addressable Network. In SIGCOMM, 21. [31] A. Rowstron and P. Druschel. Pastry: Scalable, Dstrbuted Object Locaton and Routng for Large-scale Peer-to-peer Systems. In Mddleware, 21. [32] M. A. Shah, J. M. Hellersten, and E. Brewer. Hghly-avalable, fault-tolerant, parallel dataflows. In SIGMOD, 24. [33] A. S. Slbersten, R. Braynard, C. Ells, K. Munagala, and J. Yang. A samplng-based approach to optmzng top-k queres n sensor networks. In ICDE, 26. [34] A. Sngla, U. Ramachandran, and J. Hodgns. Temporal notons of synchronzaton and consstency n Beehve. In Proc. SPAA, 1997. [3] I. Stoca, R. Morrs, D. Karger, F. Kaashoek, and H. Balakrshnan. Chord: A scalable Peer-To-Peer lookup servce for nternet applcatons. In ACM SIGCOMM, 21. [36] N. Tatbul, U. Çetntemel, and S. Zdonk. Stayng FIT: Effcent Load Sheddng Technques for Dstrbuted Stream Processng. In VLDB, 27. [37] R. van Renesse, K. Brman, and W. Vogels. Astrolabe: A robust and scalable technology for dstrbuted system montorng, management, and data mnng. TOCS, 23. [38] D. Vetch, S. Babu, and A. Pasztor. Robust synchronzaton of software clocks across the nternet. In IMC, 24. [39] P. Yalagandula and M. Dahln. A scalable dstrbuted nformaton management system. In Proc SIGCOMM, 24. [4] H. Yu and A. Vahdat. Desgn and evaluaton of a cont-based contnuous consstency model for replcated servces. TOCS, 22. 12