Reliable State Monitoring in Cloud Datacenters

Shicong Meng, Arun K. Iyengar, Isabelle M. Rouvellou, Ling Liu, Kisung Lee, Balaji Palanisamy, Yuzhe Tang
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
IBM Research T.J. Watson, Hawthorne, NY 10532, USA
{smeng@cc., lingliu@cc., kslee@, balaji@cc., yztang@}gatech.edu, {arun, rouvellou}@us.ibm.com

Abstract—State monitoring is widely used for detecting critical events and abnormalities of distributed systems. As the scale of such systems grows and the degree of workload consolidation increases in Cloud datacenters, node failures and performance interferences, especially transient ones, become the norm rather than the exception. Hence, distributed state monitoring tasks are often exposed to impaired communication caused by such dynamics on different nodes. Unfortunately, existing distributed state monitoring approaches are often designed under the assumption of always-online distributed monitoring nodes and reliable inter-node communication. As a result, these approaches often produce misleading results which in turn introduce various problems to Cloud users who rely on state monitoring results to perform automatic management tasks such as auto-scaling. This paper introduces a new state monitoring approach that tackles this challenge by exposing and handling communication dynamics such as message delay and loss in Cloud monitoring environments. Our approach delivers two distinct features. First, it quantitatively estimates the accuracy of monitoring results to capture uncertainties introduced by messaging dynamics. This feature helps users to distinguish trustworthy monitoring results from ones heavily deviated from the truth, yet significantly improves monitoring utility compared with simple techniques that invalidate all monitoring results generated in the presence of messaging dynamics. Second, our approach also adapts to non-transient messaging issues by reconfiguring distributed monitoring algorithms to minimize monitoring errors.
Our experimental results show that, even under severe message loss and delay, our approach consistently improves monitoring accuracy, and when applied to Cloud application auto-scaling, outperforms existing state monitoring techniques in terms of the ability to correctly trigger dynamic provisioning.

I. INTRODUCTION

State monitoring is a fundamental building block for many distributed applications and services hosted in Cloud datacenters. It is widely used to determine whether the aggregated state of a distributed application or service meets some predefined conditions [1]. For example, a web application owner may use state monitoring to check if the aggregated access observed at distributed application-hosting servers exceeds a pre-defined level [2]. Table I lists several common applications of state monitoring. Most existing state monitoring research efforts have been focused on minimizing the cost and the performance impact of state monitoring. For example, a good number of state monitoring techniques developed in this line of work focus on threshold-based state monitoring by carefully partitioning monitoring tasks between local nodes and coordinator nodes such that the overall communication cost is minimized [3][2][4][1]. Studies along this direction often make strong assumptions on monitoring-related communications, such as 100% node availability and instant message delivery. These assumptions, however, often do not hold in real Cloud deployments. Many Cloud systems and applications utilize hundreds or even thousands of computing nodes to achieve high throughput and scalability. At this level of scale, node/network failures, especially transient ones, are fairly common [5][6]. Furthermore, resource sharing techniques and virtualization in Cloud datacenters often introduce performance interference and degradation, and cause computing nodes to respond slowly or even become temporarily unavailable [7][8]. Such unpredictable dynamics in turn introduce message delay and loss to monitoring-related communications.
Monitoring approaches designed without considering such messaging dynamics would inevitably produce unreliable results. Even worse, users are left in the dark without knowing that the monitoring output is no longer reliable. For instance, state monitoring techniques assuming 100% node availability or instant message delivery [3][1] would wait for messages from failed nodes indefinitely without notifying users about potential errors in monitoring results. Consequently, actions performed based on such unreliable results can be harmful or even catastrophic [9]. Furthermore, simple error-avoiding techniques such as invalidating monitoring results whenever messaging dynamics exist do not work well either, as certain issues such as performance interference can last fairly long, and the scale of Cloud monitoring tasks makes failures very common. For example, even if the probability of one node failing is 0.001, the probability of observing messaging dynamics in a task involving 500 nodes would be 1 − (1 − 0.001)^500 ≈ 0.4. Invalidating monitoring results whenever problems exist would render 40% of monitoring results useless. In this paper, we present a new state monitoring framework that incorporates messaging dynamics, in terms of message delay and message loss, into monitoring result reporting and distributed monitoring coordination. Our framework provides two fundamental features for state monitoring. First, it estimates the accuracy of monitoring results based on the impact of messaging dynamics, which provides valuable information for users to decide whether monitoring results are trustworthy. Second, it minimizes the impact of dynamics whenever possible by continuously adapting to changes in monitoring communication and striving to produce accurate monitoring results. When combined, these two features shape a reliable state monitoring model that can tolerate communication dynamics and mitigate their impact on monitoring results. To the best of our knowledge, our approach is the first state monitoring framework that explicitly handles messaging dynamics in large-scale distributed monitoring.
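The failure-probability arithmetic in the example above can be checked directly; a minimal sketch:

```python
# Checking the arithmetic from the text: with a per-node failure
# probability of 0.001, the chance that a 500-node monitoring task
# sees at least one problem node is 1 - (1 - 0.001)^500.
p_node, n = 0.001, 500
p_any = 1 - (1 - p_node) ** n
print(round(p_any, 2))  # 0.39, i.e. roughly 40% of results would be affected
```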
We perform extensive experiments, including both trace-driven and real deployment ones.

TABLE I: Examples of State Monitoring
- Content Delivery: Monitoring the total access to a file mirrored at multiple servers to decide if serving capacity is sufficient.
- Rate Limiting [2]: Limiting a user's total access to a cloud service deployed at multiple physical locations.
- Traffic Engineering [10]: Monitoring the overall traffic from an organization's sub-network (consisting of distributed hosts) to the Internet.
- Quality of Service [11]: Monitoring and adjusting the total delay of a flow, which is the sum of the actual delay in each router on its path.
- Fighting DoS Attack: Detecting a DoS attack by counting SYN packets arriving at different hosts within a sub-network.
- Botnet Detection [12]: Tracking the overall simultaneous TCP connections from a set of hosts to a given destination.

The results show that our approach produces good accuracy estimation and minimizes monitoring errors introduced by messaging dynamics via adaptation. Compared with existing monitoring techniques, our approach significantly reduces problematic monitoring results in performance monitoring for Cloud application auto-scaling [13] in the presence of messaging dynamics, and improves application response time by up to 30%. The rest of this paper is organized as follows. In Section II, we introduce the problem of reliable state monitoring. Section III presents the details of our approach. We discuss our experimental evaluation in Section IV. Section V summarizes related work and Section VI concludes this paper.

II. PROBLEM DEFINITION

Most existing state monitoring studies employ an instantaneous state monitoring model, which triggers a state alert whenever a predefined threshold is violated. Specifically, the instantaneous state monitoring model [3][14][15][16][17] detects state alerts by comparing the current aggregate value with a global threshold.
Given the monitored value on monitor i at time t, x_i(t), i ∈ [1, n], where n is the number of monitors involved in the monitoring task, and the global threshold T, it considers the state at time t to be abnormal and triggers a state alert if Σ_{i=1}^{n} x_i(t) > T, which we refer to as a global violation. To perform state monitoring, this line of existing work employs a distributed monitoring framework with multiple monitors and one coordinator (Figure 1). The global threshold T is decomposed into a set of local thresholds T_i, one for each monitor i, such that Σ_{i=1}^{n} T_i ≤ T. As a result, as long as x_i(t) ≤ T_i, ∀i ∈ [1, n], i.e., the monitored value at any node is lower than or equal to its local threshold, the global threshold cannot be exceeded because Σ_{i=1}^{n} x_i(t) ≤ Σ_{i=1}^{n} T_i ≤ T. In this case, monitors do not need to report their local values to the coordinator. When x_i(t) > T_i on monitor i, it is possible that Σ_{i=1}^{n} x_i(t) > T (a global violation). Hence, monitor i sends a message to the coordinator to report a local violation with the value x_i(t). The coordinator, after receiving the local violation report, invokes a global poll procedure which notifies other monitors to report their local values, and then determines whether Σ_{i=1}^{n} x_i(t) ≤ T. The focus of existing works is to find optimal local threshold values that minimize the overall communication cost. For instance, if a monitor often observes relatively higher x_i, it may be assigned a higher T_i so that it does not frequently report local violations to the coordinator and trigger expensive global polls.

A. Reliable State Monitoring and Challenges

Existing state monitoring works [3][14][15][16][17][1] often share the following assumptions: 1) nodes involved in a monitoring task are perfectly reliable in the sense that they are always available and responsive to monitoring requests; 2) a monitoring message can always be reliably and instantly delivered from one node to another. These two assumptions, however, do not always hold in Cloud datacenters. First, Cloud applications and services are often hosted by a massive number of distributed computing nodes.
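The local-threshold decomposition described in Section II can be sketched as follows; this is a minimal illustration, and the class names and reporting interface are ours, not from the paper:

```python
# Sketch of threshold-based distributed state monitoring: each monitor i
# stays silent while x_i <= T_i; a local violation (x_i > T_i) triggers a
# global poll, in which the coordinator checks sum(x_i) against T.

class Monitor:
    def __init__(self, local_threshold):
        self.T_i = local_threshold

    def observe(self, x_i):
        # Report to the coordinator only when the local threshold is crossed.
        return x_i if x_i > self.T_i else None  # None -> stay silent

class Coordinator:
    def __init__(self, global_threshold):
        self.T = global_threshold

    def global_poll(self, values):
        """values: readings x_i(t) from all monitors; True -> global violation."""
        return sum(values) > self.T

# Example: T = 300 decomposed into six local thresholds of 50 (sum <= T).
monitors = [Monitor(50) for _ in range(6)]
readings = [20, 30, 10, 40, 45, 60]          # only the last exceeds its T_i
reports = [m.observe(x) for m, x in zip(monitors, readings)]
assert any(r is not None for r in reports)   # local violation -> global poll
coord = Coordinator(300)
assert coord.global_poll(readings) is False  # sum = 205 <= 300: no violation
```

Note how a local violation alone does not imply a global one; the global poll resolves the ambiguity, which is exactly why lost or delayed reports matter.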
Failures, especially transient ones, are common for nodes of such large-scale distributed systems [5], [6]. Second, Cloud datacenters often employ virtualization techniques to consolidate workloads and provide management flexibilities such as virtual machine cloning and live migration. Despite its benefits, virtualization also introduces a number of challenges such as performance interference among virtual machines running on the same physical host. Such interference could introduce serious network performance degradation, including heavy message delays and message drops [7], [8]. Note that reliable data delivery protocols such as TCP cannot prevent monitoring message loss caused by failures of monitoring nodes or networks, nor can they avoid message delay. To provide robustness against messaging dynamics, Jain et al. [9] recently proposed to employ a set of coarse network performance metrics to reflect the status of monitoring communication, e.g., the number of nodes contributing to a monitoring task. The intention is to allow users to decide how trustworthy monitoring results are based on the values of such metrics. While this approach certainly has its merits in certain monitoring scenarios, it also has some limitations. First, it considers the status of a monitor as either online or offline, and overlooks situations involving message delays. For instance, a monitor node may appear online, but it may introduce considerable latencies to messages sent to or received from it. Such message delays are as important as message loss caused by offline nodes, because they may also lead to mis-detection of anomalies. In fact, anecdotal evidence [18] suggests that communication latency caused by virtual machine interference in virtualized Cloud datacenters is a common and serious issue. Second, it is difficult for users to interpret the impact of reported network-level issues on monitoring accuracy. If one of the nodes fails to report its local monitoring data, is the corresponding monitoring result still reliable?
The scale of distributed Cloud monitoring exacerbates the problem, as message delay or loss can be quite common given the number of participating nodes, e.g., hundreds of web servers for large Cloud applications, and even thousands of servers for Hadoop clusters. If we simply invalidate the monitoring results whenever message delay or loss occurs, we would end up with frequent gaps in monitoring data and low monitoring utility. On the contrary, if we choose to use such monitoring results, how should we assess their accuracy given the observed message delay and loss? Figure 1 shows a motivating example where a distributed rate limiting monitoring task involves one coordinator and six monitors. As a Cloud service often runs on distributed servers across multiple datacenters, service providers need to perform distributed rate limiting to ensure that the aggregated access rate
of a user does not exceed the purchased level.

Fig. 1: A Motivating Example. (The figure shows a coordinator checking ΣX_i > T with T = 300 over six monitors A to F; monitors A through E observe values in the range [10-] and F in [20-300]; T_A = T_B = T_C = T_D = T_E = T_F = 50.)

The task in Figure 1 continuously checks the access rate of a user (x_i) on all 6 servers (A to F) and triggers a state alert when the sum of the access rates on all 6 sites exceeds the global threshold T = 300. The numbers under each monitor indicate the range of values observed by the monitor. Such range statistics can be obtained through long-term observations. For simplicity, we assume the local thresholds employed by all monitors have the same value 50, i.e., T_A = T_B = T_C = T_D = T_E = T_F = 50. Estimating monitoring accuracy based on messaging dynamics information is difficult. Simply using the scope of message delay or loss to infer accuracy can be misleading. For example, if monitors A, B, and C (50% of the total monitors) all fail to respond in a global poll during a transient failure, one may come to the conclusion that the result of the global poll should be invalidated, as half of the monitors do not contribute to the result. However, as monitors A, B and C observe relatively small monitored values (e.g., most users access server F, which is geographically closer), the corresponding global poll results may still be useful. For instance, if the global poll suggests that x_D + x_E + x_F sums to a value sufficiently below 300, we can conclude that there is no global violation, i.e., Σ_{i∈{A...F}} x_i ≤ 300, with high confidence, because the probability that Σ_{i∈{A...C}} x_i stays within the remaining headroom is fairly high given the observed value ranges of A, B and C. On the contrary, if monitor F fails to respond, even though F is only one node, the uncertainty of monitoring results increases significantly. For example, if Σ_{i∈{A...E}} x_i = 150, it is hard to tell whether a global violation exists due to the high variance of F's observed values. An ideal approach should provide users an intuitive accuracy estimate such as "the current monitoring result is correct with a probability of 0.93", instead of simply reporting the statistics of message delay or loss.
Such an approach must quantitatively estimate the accuracy of monitoring results. It should also be aware of the state monitoring algorithm context, as the algorithm has two phases: the local violation reporting phase and the global poll phase. Third, accuracy estimation alone is not enough to provide reliable monitoring and minimize the impact of messaging quality degradation. Resolving node failures may take time. Network performance degradation caused by virtual machine interference often lasts for a while until one virtual machine is migrated to another host. As a result, messaging dynamics can last for some time. Without self-adaptive monitoring to minimize the corresponding accuracy loss, users may lose access to any meaningful monitoring result during a fairly long period, which may not be acceptable for Cloud users who pay for Cloud monitoring services such as CloudWatch [19]. For instance, if node F continuously experiences message loss, local violation reports sent from F are very likely to be dropped. Consequently, the coordinator does not trigger global polls when it receives no local violation reports. If a true violation exists, e.g., x_A = 45, x_B = 45, x_C = 45, x_D = 45, x_E = 45, x_F = 110 and Σ_{i∈{A...F}} x_i = 335, the coordinator will mis-detect it. One possible approach to reduce monitoring errors introduced by such messaging dynamics is to let healthy nodes, i.e., nodes not affected by messaging dynamics, report their local values at a finer granularity to compensate for the information loss on problem nodes. In the above example, if we reduce the local thresholds on nodes A, B, C, D, E to 30, the coordinator will receive local violations from nodes A, B, C, D and E, and trigger a global poll. Even if F also fails to respond to the global poll, the coordinator can find that Σ_{i∈{A,...,E}} x_i = 225. For the sake of the example, suppose x_F is uniformly distributed over [20, 300]. The coordinator can infer that the probability of a global violation is high. This is because a global violation exists if x_F > 75, which is very likely (> 0.8) given x_F's distribution.
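The inference in this example can be checked numerically: with the five received values summing to 225 against T = 300, a violation requires x_F > 75, and under the assumed uniform distribution on [20, 300] that probability exceeds 0.8. A minimal sketch:

```python
# Probability that the unreachable monitor F pushes the sum over T = 300,
# given the five received values sum to 225 and x_F ~ Uniform[20, 300]
# (the distribution assumed in the running example).
lo, hi = 20.0, 300.0
received_sum = 225.0
T = 300.0
gap = T - received_sum                       # violation iff x_F > gap = 75
p_violation = (hi - max(gap, lo)) / (hi - lo)
print(round(p_violation, 3))                 # 0.804, i.e. > 0.8 as stated
```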
Similarly, adaptation can also be used to rule out the possibility of global violations. For instance, if node E is troubled by messaging dynamics, we can increase E's local threshold so that the probability of detecting a local violation on E is trivial. Correspondingly, we also reduce the thresholds on the rest of the nodes to 45 to ensure the correctness of monitoring (Σ_i T_i ≤ T). As a result, as long as Σ_{i∈{A,...,D,F}} x_i < 230, we can infer that there is no global violation with high probability, even though node E is under the impact of messaging dynamics. While this type of self-adaptation seems promising, designing such a scheme is difficult and relies on answers to a number of fundamental questions: how should we divide the global threshold when there are multiple problem nodes so as to minimize the possible error they may introduce, especially when they observe different levels of message loss and delay? In the rest of this paper, we address these challenges and present the details of our reliable state monitoring approach.

III. RELIABLE STATE MONITORING

State monitoring continuously checks whether a monitored system enters a critical pre-defined state. Hence, state monitoring tasks usually generate binary results which indicate either that a state violation exists (positive detection) or that no state violation exists (negative detection). Beyond this basic result, our reliable state monitoring approach also marks the estimated accuracy of a monitoring result in the form of error probabilities. For positive detections, the error probability is the probability of false positives. For negative detections, the error probability is the probability of false negatives. To perform accuracy estimation, we design estimation schemes for the local violation reporting and global poll processes respectively. These schemes leverage information on messaging dynamics and per-node monitored value distributions to capture uncertainties caused by messaging dynamics. In addition, we also examine the unique problem of out-of-order global polls caused by message delay.
The final accuracy estimation results synthesize the uncertainties observed at different stages of the state monitoring algorithm. Besides accuracy estimation, our approach also minimizes errors caused by non-transient messaging dynamics via two parallel
Fig. 2: Detection Window. (A violation occurring within the window [t − w, t] counts as a valid detection; violations detected outside the window are invalid.)

directions of adjustments on distributed monitoring parameters. One tries to minimize the chance that troubled nodes deliver local violation reports to the coordinator. Since they may fail to deliver these reports, such adjustments essentially minimize the uncertainties caused by them. The other direction of adjustment is to configure healthy nodes to report their local monitored values more often. This allows the coordinator to make better accuracy estimates, which in turn helps to detect or rule out a global violation with high confidence.

A. Messaging Dynamics

Although a Cloud datacenter may encounter countless types of failures and anomalies at different levels (network/server/OS/etc.), their impact on monitoring-related communication can often be characterized by message delay and message loss. For brevity, we use the term messaging dynamics to refer to both message delay and loss. Depending on the seriousness of messaging dynamics, the monitoring system may observe different difficulties in inter-node communication, from slight message delay to complete node failure (100% message loss rate or indefinite delay). The focus of our study is utilizing message delay and loss information to provide reliable state monitoring functionalities via accuracy estimation and accuracy-driven self-adaptation. Our approach obtains message delay and loss information in two ways. One is direct observation in global polls, e.g., the coordinator knows whether it has received a response from a certain monitor on time. The other is utilizing existing techniques such as [9] to collect pair-wise message delay and loss information between a monitor and the coordinator. Note that our approach is orthogonal to the messaging quality measurement techniques, as it takes the output of the measurement to perform accuracy estimation and self-adaptation. Our approach only requires basic messaging dynamics information. For message delay, it requires a histogram that records the distribution of observed message delays.
For message loss, it takes the message loss rate as input.

B. Detection Window

We introduce the concept of a detection window to allow users to define their tolerance level for result delays. Specifically, a detection window is a sliding time window with length w. We consider a global violation V detected at time t a correctly detected one if its actual occurrence time t_o ∈ [t − w, t]. Note that multiple global violations may occur between the current time t and t − w, as Figure 2 shows. We do not distinguish different global violations within the current detection window, as users often care about whether there exists a global violation within the detection window rather than exactly how many global violations there are. The concept of the detection window is important for capturing the dynamic nature of state monitoring in real-world deployments.

C. Accuracy Estimation

Recall that the distributed state monitoring algorithm we introduced in Section II has two stages: the local violation reporting stage and the global poll stage. As message delay and loss have an impact on both stages, our analysis of their accuracy impact needs to be conducted separately. When message delay or loss occurs during local violation reporting, the coordinator may fail to receive a local violation report and trigger a global poll in time. Consequently, it may mis-detect a global violation if one does exist, and introduce false negative results. To estimate the monitoring accuracy at this stage, the coordinator continuously updates the estimated probability of failing to receive one or more local violations based on the current messaging dynamics situation and per-monitor value distributions. When message delay or loss occurs during a global poll, the coordinator cannot collect all necessary information on time, which again may cause the coordinator to mis-detect a global violation and introduce false negatives. Hence, we estimate the probability of mis-detecting a global violation based on the values collected during the global poll and the value distributions of troubled monitors.

Local Violation Reporting.
To facilitate the accuracy estimation at the local violation reporting stage, each monitor maintains a local histogram that records the distribution of local monitored values. Much previous research [14][16][20][1] suggests that such distribution statistics of recent monitored values provide good estimates of future values. Specifically, each monitor i maintains a histogram of the values that it sees over time as H_i(x), where H_i(x) is the probability of monitor i observing the value x. We use equi-depth histograms to keep track of the data distribution. For generality purposes, we assume that the monitored value distribution is independent of messaging dynamics. To ensure that the histogram reflects recently seen values more prominently than older ones, each monitor continuously updates its histogram with exponential aging. A monitor also periodically sends its local histogram to the coordinator. We first look at the probability of monitor i failing to report a local violation, which can be computed as P(f_i) = P(v_i)P(m_i), where P(v_i) is the probability of detecting a local violation on monitor i, and P(m_i) is the probability of a message sent from monitor i failing to reach the coordinator due to messaging dynamics. P(v_i) = P(x_i > T_i), where x_i and T_i are the monitored value and the local threshold on monitor i respectively. P(x_i > T_i) can be easily computed based on T_i and the distribution of x_i provided by the histogram of monitor i. P(m_i) depends on the situation of message delay and loss. Let P(p_i) be the probability of a message sent from monitor i to the coordinator being dropped. Let P(d_i) be the probability of a reporting message sent from monitor i to the coordinator being delayed beyond users' tolerance, i.e., the local violation report is delayed by more than a time length of w (the detection window size) so that the potential global violation associated with the delayed local violation report becomes invalid even if detected. Given P(p_i) and P(d_i), we have P(m_i) = 1 − (1 − P(p_i))(1 − P(d_i)). The rationale here is that if a local violation report successfully
reaches the coordinator, it must not have been dropped or heavily delayed. Both P(p_i) and P(d_i) can be easily determined based on the measurement output of messaging dynamics. P(p_i) is simply the message loss rate. P(d_i) can be computed as P(d_i) = P(l_i > w), where l_i is the latency of messages sent from monitor i to the coordinator, and P(l_i > w) is easy to obtain given the latency distribution of messages. Clearly, P(m_i) grows with P(p_i) and P(d_i), and P(m_i) = 0 when messaging dynamics do not exist. During the local violation reporting phase, the overall probability of the coordinator failing to receive local violations, P(F), depends on all monitors. Therefore, we have P(F) = 1 − Π_{i=1}^{n}(1 − P(f_i)), where n is the number of monitors, and we consider local violations on different monitors to be independent for generality. Clearly, P(F) grows with the number of problem monitors. With P(F), the probability of false negatives caused by missing local violation reports, P_l, can be estimated as P_l = cP(F), where c is referred to as the conversion rate between local violations and global violations, i.e., the percentage of local violations leading to true global violations. The coordinator maintains c based on its observations of previous local violations and global violations.

Global Polls. Recall that in the original state monitoring algorithm, when the coordinator receives a local violation report, it initiates the global poll process, where it requests all monitors to report their current local monitored values. However, when message delay and loss exist, the coordinator may receive a delayed report about a local violation that actually occurred at an earlier time t. As a result, when the coordinator invokes a global poll, it requests all monitors to report their previous local monitored values observed at time t. To support this functionality, monitors locally keep a record of previous monitored values observed within the detection window (a sliding time window with size w). Values observed even earlier are discarded, as the corresponding monitoring results are considered expired.
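The local-violation-stage estimates above (P(m_i), P(F), and P_l = cP(F)) can be sketched as follows; all numeric inputs are made-up illustrative values, not measurements from the paper:

```python
# Sketch of the local-violation-reporting-stage accuracy estimation:
#   P(m_i) = 1 - (1 - P(p_i))(1 - P(d_i))
#   P(F)   = 1 - prod_i (1 - P(v_i) * P(m_i))
#   P_l    = c * P(F)

def p_message_fail(p_drop, p_delay_beyond_w):
    """P(m_i): a report from monitor i is dropped or delayed beyond w."""
    return 1.0 - (1.0 - p_drop) * (1.0 - p_delay_beyond_w)

def p_missed_local_violations(p_violation, p_fail):
    """P(F): the coordinator misses at least one local violation report."""
    prod = 1.0
    for pv, pm in zip(p_violation, p_fail):
        prod *= 1.0 - pv * pm            # 1 - P(f_i), with P(f_i) = P(v_i)P(m_i)
    return 1.0 - prod

# Three monitors: P(v_i) would come from local histograms, messaging
# statistics from the measurement layer; here they are illustrative.
p_v = [0.10, 0.05, 0.20]
p_m = [p_message_fail(0.02, 0.01), p_message_fail(0.0, 0.0), p_message_fail(0.3, 0.1)]
P_F = p_missed_local_violations(p_v, p_m)
c = 0.5                                   # conversion rate, maintained by coordinator
P_l = c * P_F                             # false-negative probability at this stage
assert p_m[1] == 0.0 and 0.0 < P_F < 1.0
```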
Once the coordinator initiates the global poll process, our accuracy estimation also enters the second stage, where we estimate the possibility of mis-detecting global violations due to message delay and loss in the global poll process. The estimation starts when the coordinator does not receive all responses on time. Since the coordinator does not report anything until it receives all monitoring data, the probability of detecting a state violation given the set of received monitored values is

P(V) = P{Σ_{i∈K} x_i > T − Σ_{i∉K} x_i}    (1)

where K is the set of monitors whose responses do not reach the coordinator, and the monitors not in K are the rest of the monitors. The right-hand side of the equation can be determined based on the value histograms of monitors. At any time point, the probability of detecting a global violation is the probability of detecting a global violation within the time window of delay tolerance.

Out-of-Order Global Polls. Due to the existence of message delays, local violation reports sent from different monitors may arrive out of order. Accordingly, as new global poll processes may start before previous global poll processes finish, the coordinator may be involved in multiple ongoing global poll processes at the same time, as Figure 3 shows.

Fig. 3: Out-of-order Global Polls. (The figure shows a coordinator receiving out-of-order local violation reports and running multiple ongoing global polls, e.g., with P(V) = 0.67 and P(V) = 0.15.)

When the coordinator receives a local violation report r, it first checks its timestamp t_r (the local violation occurrence time) to see if t_r < t − w, where t is the current time (the report receiving time) and w is the user-specified detection window size. If true, it ignores the local violation report, as the report has expired. Otherwise, it initiates a global poll process and uses t_r as its timestamp. As each global poll may take a different time to finish (due to message delay or loss), the coordinator continuously checks the lifetime of global polls and removes those whose t_r satisfies t_r < t − w. For accuracy estimation, users are interested in whether there exist one or more global violations within the time interval [t − w, t].
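Equation (1) can be evaluated from the monitors' value histograms; one straightforward way is Monte-Carlo sampling over the missing monitors' histograms. A minimal sketch (the histogram format and the sampling approach are our illustration, not the paper's implementation):

```python
# Monte-Carlo sketch of Equation (1):
#   P(V) = P{ sum_{i in K} x_i > T - sum_{i not in K} x_i }
# where K is the set of monitors whose responses are missing.
import random

random.seed(7)

def p_violation(received_sum, missing_histograms, T, samples=50_000):
    """missing_histograms: per missing monitor, a list of (value, prob) bins."""
    hits = 0
    for _ in range(samples):
        s = received_sum
        for hist in missing_histograms:
            r, acc, drawn = random.random(), 0.0, hist[-1][0]
            for value, prob in hist:     # draw one bin from the histogram
                acc += prob
                if r <= acc:
                    drawn = value
                    break
            s += drawn
        if s > T:
            hits += 1
    return hits / samples

# One unreachable monitor whose histogram puts most mass on small values.
hist_F = [(10, 0.5), (50, 0.3), (150, 0.2)]
p = p_violation(received_sum=200, missing_histograms=[hist_F], T=300)
assert abs(p - 0.2) < 0.02    # violation only when x_F = 150 (prob 0.2)
```

Here the received values sum to 200 against T = 300, so a violation requires the missing value to exceed 100, which happens only in the 150-valued bin.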
When there are multiple ongoing global polls, there are multiple potential global violations requiring verification. Accordingly, our accuracy estimation should address whether there exists at least one ongoing global poll leading to a global violation. Let P_j(V) be the probability of triggering a global violation in global poll j. P_j(V) can be determined based on Equation 1. The probability P_g of at least one global poll out of M ongoing ones triggering a global violation is

P_g = 1 − Π_{j=1}^{M}(1 − P_j(V))

Clearly, P_g increases quickly when the coordinator observes a growing number of ongoing global polls. If P_g is sufficiently high, our monitoring algorithm will report a possible state violation. This is particularly useful for situations with a few monitors suffering serious message delay or loss, because no global polls can finish if these nodes cannot send their responses in time, and the coordinator can never trigger a global violation when running existing state monitoring algorithms.

Combining Estimations of Both Stages. While we have considered the accuracy estimation problem for the local violation reporting and global poll stages separately, a running coordinator often experiences both local violation failures and incomplete global polls at the same time. Hence, combining estimations from both stages is critical for delivering correct accuracy estimation results. The overall probability of false negatives can be computed as β = 1 − (1 − P_l)(1 − P_g), where P_l and P_g are the probabilities of false negatives introduced by failed local violation reporting and failed global polls respectively. Note that β ≤ P_l + P_g, as the event of mis-detecting a global violation due to failed local violation reporting and the event of mis-detecting a global violation due to failed global polls are not mutually exclusive.

A Balanced State Monitoring Algorithm. The original state monitoring algorithm invokes global polls only when it receives local violation reports, and triggers state alerts only after the coordinator collects responses from all monitors. When messaging dynamics exist, such an algorithm has two issues. First, it may miss
opportunities to invoke global polls. Second, it never produces false positive results, but may introduce many false negative results. We introduce a balanced state monitoring algorithm that minimizes the overall monitoring error. The balanced algorithm is obtained through two revisions of the original algorithm. First, when P(F), the probability of failing to receive local violation reports at the coordinator, is sufficiently large (e.g., 0.95), the algorithm triggers a global poll. Second, if the estimated false negative probability β in the global poll phase rises above 50%, the monitoring algorithm also reports a state violation, with a false positive probability of 1 − β. The balanced algorithm is more likely to detect global violations compared with the original algorithm, especially when β is large.

D. Accuracy-Oriented Adaptation

Sometimes monitors may experience long-lasting message loss and delays. For instance, a Xen-based guest domain continuously generating intensive network IO may cause considerable CPU consumption on Domain0, which further leads to constant packet queuing for other guest domains running on the same host [7][8]. As a result, monitor processes running on troubled guest domains would experience continuous messaging dynamics until the performance interference is resolved. Reliable state monitoring should also adapt to such non-transient messaging dynamics and minimize accuracy loss whenever possible. Recall that the distributed state monitoring algorithm employs local thresholds to minimize the number of local violation reports sent to the coordinator. This technique, however, introduces extra uncertainty when messaging dynamics exist, because the coordinator cannot distinguish the case where a monitor does not detect a local violation from the case where a monitor fails to report a local violation. Our approach minimizes such uncertainties through two simultaneous adjustments of local thresholds.
First, it adjusts local thresholds on troubled monitors to reduce their chance of detecting local violations, as the corresponding reports may not arrive at the coordinator, which in turn introduces uncertainties. Second, it also adjusts local thresholds on healthy monitors to increase their local violation reporting frequencies, maximizing the information available to the coordinator so that it can provide good accuracy estimation. The adjustment on healthy monitors is also important for monitoring correctness, where we ensure Σ_i T_i ≤ T. As the impact of message delay and loss on local violation reporting can be measured by the expected number of failed local violation reports E(f_r), we formulate the local threshold adjustment problem as a constrained optimization problem as follows:

min E(f_r) = Σ_{i=1}^{n} P(v_i|T_i)·P(m_i)   s.t.   Σ_{i=1}^{n} T_i ≤ T

where P(v_i|T_i) is the conditional probability of reporting a local violation on monitor i given its local threshold T_i, and P(m_i) is the probability of failing to send a message to the coordinator. Since we do not have a closed form for P(v_i|T_i) = P(x_i > T_i) (only histograms of x_i), we replace P(v_i|T_i) with its upper bound obtained by applying Markov's inequality (Chebyshev's inequality does not yield a closed form): P(x_i > T_i) ≤ E(x_i)/T_i. Since x_i is positive in most scenarios and E(x_i) can be obtained through x_i's histograms, applying this approximation and a Lagrange multiplier leads us to a closed form solution. We find the resulting adjustments perform well in practice. In addition, we invoke adaptation only when at least one node experiences relatively long-lasting (e.g., 5 minutes) messaging dynamics, to avoid frequent adaptation.

IV. EVALUATION

Our experiments consist of both trace-driven simulation and real system evaluation. The trace-driven experiments evaluate the performance of our approach with access traces of the 1998 World Cup official website, hosted by 30 servers distributed across the globe [21]. We used the server log data consisting of 57 million page requests distributed across the servers.
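Returning to the threshold-adjustment optimization above: under the Markov-bound approximation the objective becomes Σ_i P(m_i)·E(x_i)/T_i, and since it decreases in every T_i, the constraint binds at Σ_i T_i = T. The Lagrange condition -P(m_i)E(x_i)/T_i² + λ = 0 then yields T_i proportional to sqrt(P(m_i)·E(x_i)). A minimal sketch of this allocation, under our reconstruction of the derivation (names are ours, not the paper's):

```python
import math

def adjust_thresholds(global_T, fail_probs, expected_vals):
    """Allocate local thresholds T_i to minimize the Markov upper bound
    on the expected number of failed local violation reports:
        E(f_r) <= sum_i P(m_i) * E(x_i) / T_i,  s.t. sum_i T_i = T.
    The Lagrange-multiplier solution is
        T_i = T * sqrt(P(m_i) * E(x_i)) / sum_j sqrt(P(m_j) * E(x_j)),
    so troubled monitors (large P(m_i)) receive larger thresholds and
    report less often, while healthy monitors report more often."""
    weights = [math.sqrt(p * e) for p, e in zip(fail_probs, expected_vals)]
    total = sum(weights)  # assumed positive: every monitor has E(x_i) > 0
    return [global_T * w / total for w in weights]
```

For instance, with a global threshold of 3000 and two monitors whose failure probabilities are 0.1 and 0.4 (equal E(x_i)), the troubled monitor gets twice the threshold of the healthy one.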
We evaluate the monitoring accuracy achieved by our approach under a variety of messaging dynamics in this set of experiments. The other part of our experiments leverages our monitoring techniques to support auto-scaling of Cloud applications, where server instances can be added to the resource pool of an application dynamically based on the current workload [13]. We deploy a distributed RUBiS [22], an auction web application modeled after ebay.com for performance benchmarking, and use state monitoring to trigger new server instance provisioning. For the real system evaluation, we are interested in the impact of improved monitoring accuracy on real world application performance.

A. Results

Figure 4 shows the state violation detection percentage of different monitoring approaches under different levels and types of messaging quality degradation. Here the y-axis is the percentage of state violations detected by the monitoring algorithm over state violations detected by an oracle which can detect all violations in a given trace. In our comparison, we consider four monitoring algorithms: 1) Oblivious, the existing instantaneous monitoring algorithm, which is oblivious to inter-node messaging quality; 2) Estimation, the instantaneous monitoring algorithm enhanced with our accuracy estimation techniques; 3) Adaptation, the instantaneous monitoring algorithm enhanced with our accuracy-oriented adaptation techniques; 4) Estimation+Adaptation, the instantaneous monitoring algorithm enhanced with both estimation and adaptation techniques. We emulate a distributed rate limiting monitoring task which triggers state violations whenever it detects that the overall request rate (the sum of the request rates on all monitors) exceeds a global threshold (set to 3000 per second). The task involves 30 monitors, each of which monitors the request rate of one server by reading the corresponding server request trace periodically. Furthermore, we set the detection window size to 15 seconds, which means a state violation is considered successfully detected if the time of detection is at most 15 seconds later than the occurrence time of the state violation.
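The detection-window criterion above can be scored with a small function. This is an illustrative sketch with hypothetical timestamp lists (all names are ours): a true violation counts as detected if some detection occurs no earlier than the violation and at most `window` seconds after it.

```python
def detection_rate(true_violations, detections, window=15.0):
    """Fraction of true violation times (seconds) that have at least one
    detection within [v, v + window]. This mirrors the oracle-based
    detection percentage on the y-axis of Figure 4."""
    if not true_violations:
        return 1.0
    detected = sum(
        1 for v in true_violations
        if any(v <= d <= v + window for d in detections)
    )
    return detected / len(true_violations)
```

For example, with violations at t = 0 and t = 100 and detections at t = 10 and t = 130, only the first violation falls within its 15-second window, giving a detection rate of 0.5.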
Figure 4(a) illustrates the performance of the different algorithms under increasing message delay. Here the x-axis shows the levels of injected message delay. For delay at level k (k = 0, 1, 2, 3, 4), we pick 20·k% of the messages of a problem monitor and inject a randomly chosen delay of at least 5 seconds. By default, we randomly pick 10% of the monitors to be problem monitors. While there are many ways to inject message delays, we use the above injection method for the sake of simplicity and interpretation. The detection rate of the Oblivious algorithm drops quickly as
Fig. 4: State Violation Detection Rate (y-axis: percentage of detected state alerts): (a) under increasing level of delay; (b) under increasing level of message loss; (c) under increasing level of mixed message delay and loss; (d) with increasing number of problem monitors.

the delay level increases, primarily because its global poll process always waits until messages from all monitors arrive, and the resulting delay in violation reporting often exceeds the delay tolerance interval. The Estimation algorithm performs much better, as it can estimate the probability of a state violation based on incomplete global poll results, which allows the scheme to report a state violation when the estimated probability is high (above 0.9 in our experiment). For instance, when an incomplete global poll yields a total request rate close to the global threshold, it is very likely that a state violation exists even though responses from problem monitors are not available. The Adaptation scheme, however, provides limited improvement when used alone. This is because accuracy-oriented adaptation by itself only reduces the chance of a problem monitor reporting a local violation; without accuracy estimation, the scheme still waits for all responses in global polls. With both accuracy estimation and adaptation, the Estimation+Adaptation scheme achieves a significantly higher detection rate. In Figure 4(b), we use different levels of message loss to evaluate the performance of the different algorithms. Similar to the injection of delay, we randomly pick 20·k% of the messages of a problem node to drop for level-k message loss. The relative performance of the four algorithms is similar to what we observed in Figure 4(a), although the detection rate achieved by each algorithm drops slightly compared with that in Figure 4(a), as delayed messages often still help to detect state violations compared with completely dropped messages.
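The level-k injection scheme can be sketched as follows. This is an illustrative reconstruction (function names and the delay range are our assumptions; the paper's exact delay upper bound is not reproduced here): for a problem monitor, 20·k% of its messages are picked at random and either delayed or dropped.

```python
import random

def inject_dynamics(send_times, level, mode="delay",
                    delay_range=(5.0, 10.0), rng=None):
    """Emulate level-k messaging dynamics on one problem monitor's
    message stream. 20*k% of messages are affected: in "delay" mode an
    affected message arrives late by a random amount drawn from
    delay_range (range illustrative); in "loss" mode it never arrives
    (arrival time None). Unaffected messages arrive instantly."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    frac = 0.2 * level
    arrivals = []
    for t in send_times:
        if rng.random() < frac:
            if mode == "loss":
                arrivals.append(None)
            else:
                arrivals.append(t + rng.uniform(*delay_range))
        else:
            arrivals.append(t)
    return arrivals
```

At level 0 the stream is untouched; at level 5 (frac = 1.0) every message is delayed or dropped, which is the degenerate worst case rather than one used in the experiments.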
For the rest of the experiments, we inject mixed message delay and loss, instead of message delay or loss alone, for a comprehensive reliability evaluation. Similarly, level-k delay and loss means that 10·k% of messages are randomly chosen to drop and another 10·k% of messages are randomly chosen to be delayed. Figure 4(c) shows the violation detection performance of the different algorithms given increasing levels of mixed message delay and loss. We observe similar results in this figure, and the performance achieved by our approach lies between those achieved in the two previous figures. In Figure 4(d), we vary the percentage of problem nodes from 20% (the default case) to higher values. The result suggests that our approach consistently improves monitoring accuracy. Nevertheless, when problem monitors become dominant, its performance is relatively worse than that in the three previous figures. Figure 5(a) shows the corresponding percentage of false positives (reporting state violations when none exist) produced. Recall that the original monitoring algorithm does not produce false positives (0 false positives for Oblivious in Figure 5(a)), as its global polls report state violations only when the completely collected responses confirm state violations, which, however, causes a high false negative rate (shown in Figure 5(b)). Figure 5(b) shows the false negative (reporting no state violation when at least one exists) rates of all schemes. All three of our schemes achieve fairly low false positive and false negative rates.

Fig. 5: Errors in State Violation Detection: (a) comparison of false positives; (b) comparison of false negatives.

Fig. 6: Accuracy Improvement Breakup: (a) with increasing message loss and delay levels; (b) with increasing percentage of problem monitors.
Figure 6 illustrates the three key efforts our approach makes to improve monitoring accuracy and the corresponding portion of correctly reported state violations that are missed by the original monitoring algorithm. Here Adaptation refers to the effort of reconfiguring local thresholds, Soft-Global-Poll refers to the effort of triggering global polls when the estimated local violation reporting probability is high (instead of upon receiving a local violation report), and Estimated-Alert refers to the effort of reporting a state violation when the estimated probability is sufficiently high. Note that multiple efforts may contribute to a correctly reported state violation at the same time. Among the three efforts, in both Figure 6(a) and Figure 6(b), Estimated-Alert clearly contributes the most, as incomplete global polls are the main reason for false negatives in the original monitoring algorithm. Figure 7 shows the performance difference of RUBiS with auto-scaling enabled by the different monitoring schemes. We deploy a PHP version of RUBiS in Emulab [23] where it has a set of web
Fig. 7: Impact on Cloud application auto-scaling: (a) comparison of response time; (b) comparison of timeouts.

servers and a database backend. Each web server runs in a small-footprint Xen-based virtual machine (1 vCPU), and the database runs on a dedicated physical machine. This ensures that the database is not the performance bottleneck. We periodically introduce workload bursts to RUBiS, and use state monitoring to check whether the total number of timeout requests on all web servers exceeds a given threshold, i.e., one monitor runs on each web server to observe local timeout requests. RUBiS initially runs with 5 web servers. When violations are detected, we gradually add new web servers one by one to absorb workload bursts until no violation is detected (auto-scaling). Similarly, when no violations are detected for 5 minutes, we gradually remove the dynamically added web servers one by one. We introduce messaging delay and loss to monitor-coordinator communication in the same way as in the trace-driven experiments. The y-axis of Figure 7 shows the average response time and the number of timeout requests of RUBiS, normalized by those of the Oblivious scheme. Clearly, as our enhanced schemes detect more state violations, they can more reliably trigger auto-scaling when there is a workload burst, which in turn reduces response time and request timeouts by up to 30%. In addition, accuracy estimation achieves a higher detection rate than self-adaptation does. This is because monitors on load-balanced web servers often observe similar timeouts, and accuracy estimation can often confirm global violations based on partial monitoring data.

V. RELATED WORK

Most existing state monitoring works [3][16][15][17][1] study communication-efficient detection of constraint violations. These approaches often assume reliable inter-node communication, and are subject to producing misleading results in the presence of messaging dynamics that are common in Cloud monitoring environments.
Jain et al. [9] study the impact of hierarchical aggregation, arithmetic filtering, and temporary batching in an unreliable network. They propose to gauge the degree of inaccuracy based on the number of unreachable monitoring nodes and the number of duplicated monitoring messages caused by DHT overlay maintenance. While this work provides insight for understanding the interplay between monitoring efficiency and accuracy given message losses, it also has several limitations, as we mentioned in Section II-A, such as not considering delay and difficulties in assessing application-level monitoring accuracy. Our work is complementary to [9], as we try to move forward the understanding of monitoring reliability by studying accuracy estimation and self-adaptation in state monitoring.

VI. CONCLUSION AND FUTURE WORK

We have presented a reliable state monitoring approach that enables estimation of monitoring accuracy based on observed messaging dynamics, and self-adaptation to disruptions. We built a prototype system based on the presented techniques and evaluated the system in various settings. The results suggest that the system can effectively deliver reliable monitoring results and accuracy estimation. As part of our ongoing work, we are working on safeguarding multiple state monitoring tasks. We are also studying providing reliability features to other types of monitoring.

ACKNOWLEDGMENT

This work was initiated while the first author was interning with IBM T.J. Watson, and was also supported by an IBM PhD fellowship for the first author. The last three authors are partially supported by grants from NSF NetSE and NSF CyberTrust, an IBM faculty award, and a grant from Intel ISTC Cloud Computing.

REFERENCES

[1] S. Meng, T. Wang, and L. Liu, "State monitoring in cloud datacenters," IEEE Transactions on Knowledge and Data Engineering (TKDE), Special Issue on Cloud Data Management, 2011.
[2] B. Raghavan, K. V. Vishwanath, S. Ramabhadran, K. Yocum, and A. C. Snoeren, "Cloud control with distributed rate limiting," in SIGCOMM'07.
[3] M. Dilman and D. Raz, "Efficient reactive monitoring," in INFOCOM'01.
[4] S. Meng, S. R. Kashyap, C. Venkatramani, and L. Liu, "REMO: Resource-aware application state monitoring for large-scale distributed systems," in ICDCS, 2009, pp. 248-255.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI, 2004, pp. 137-150.
[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed storage system for structured data," in OSDI, 2006.
[7] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat, "Enforcing performance isolation across virtual machines in Xen," in Middleware'06.
[8] X. Pu, L. Liu, Y. Mei, S. Sivathanu, Y. Koh, and C. Pu, "Understanding performance interference of I/O workload in virtualized cloud environments," in IEEE Cloud, 2010.
[9] N. Jain, P. Mahajan, D. Kit, P. Yalagandula, M. Dahlin, and Y. Zhang, "Network imprecision: A new consistency metric for scalable monitoring," in OSDI, 2008, pp. 87-102.
[10] L. Gao, M. Wang, and X. S. Wang, "Quality-driven evaluation of trigger conditions on streaming time series," in SAC, 2005.
[11] L. Raschid, H.-F. Wen, A. Gal, and V. Zadorozhny, "Monitoring the performance of wide area applications using latency profiles," in WWW'03.
[12] G. Gu, R. Perdisci, J. Zhang, and W. Lee, "BotMiner: Clustering analysis of network traffic for protocol- and structure-independent botnet detection," in USENIX Security Symposium, 2008, pp. 139-154.
[13] Auto Scaling, http://aws.amazon.com/autoscaling/.
[14] C. Olston, J. Jiang, and J. Widom, "Adaptive filters for continuous queries over distributed data streams," in SIGMOD Conference, 2003.
[15] R. Keralapura, G. Cormode, and J. Ramamirtham, "Communication-efficient distributed monitoring of thresholded counts," in SIGMOD, 2006.
[16] I. Sharfman, A. Schuster, and D. Keren, "A geometric approach to monitoring threshold functions over distributed data streams," in SIGMOD, 2006.
[17] S. Agrawal, S. Deb, K. V. M. Naidu, and R. Rastogi, "Efficient detection of distributed constraint violations," in ICDE, 2007.
[18] Visual evidence of Amazon EC2 network issues, https://www.cloudkick.com/blog/2010/jan/12/visual-ec2-latency/, 2010.
[19] Amazon CloudWatch (Beta), http://aws.amazon.com/cloudwatch/.
[20] N. Jain, M. Dahlin, Y. Zhang, D. Kit, P. Mahajan, and P. Yalagandula, "STAR: Self-tuning aggregation for scalable monitoring," in VLDB, 2007.
[21] M. Arlitt and T. Jin, "1998 World Cup web site access logs," http://www.acm.org/sigcomm/ita/, August 1998.
[22] RUBiS, http://rubis.ow2.org/.
[23] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar, "An integrated experimental environment for distributed systems and networks," in OSDI, 2002.