Design and Analysis of Benchmarking Experiments for Distributed Internet Services

Eytan Bakshy
Facebook
Menlo Park, CA

Eitan Frachtenberg
Facebook
Menlo Park, CA

ABSTRACT

The successful development and deployment of large-scale Internet services depend critically on performance. Even small regressions in processing time can translate directly into significant energy and user experience costs. Despite the widespread use of distributed server infrastructure (e.g., in cloud computing and Web services), there is little research on how to benchmark such systems to obtain valid and precise inferences with minimal data collection costs. Correctly A/B testing distributed Internet services can be surprisingly difficult because interdependencies between user requests (e.g., for search results, social media streams, photos) and host servers violate assumptions required by standard statistical tests.

We develop statistical models of distributed Internet service performance based on data from Perflab, a production system used at Facebook which vets thousands of changes to the company's codebase each day. We show how these models can be used to understand the tradeoffs between different benchmarking routines, and what factors must be taken into account when performing statistical tests. Using simulations and empirical data from Perflab, we validate our theoretical results, and provide easy-to-implement guidelines for designing and analyzing such benchmarks.

Keywords
A/B testing; benchmarking; cloud computing; crossed random effects; bootstrapping

Categories and Subject Descriptors
G.3 [Probability and Statistics]: Experimental Design

1. INTRODUCTION

Benchmarking experiments are used extensively at Facebook and other Internet companies to detect performance regressions and bugs, as well as to optimize the performance of existing infrastructure and backend services. They often involve testing various code paths with one or more users on multiple hosts.
This introduces multiple sources of variability: each user request may involve different amounts of computation because results are dynamically customized to the user's own data. The amount of data varies from user to user and often follows heavy-tailed distributions for quantities such as friend count, feed stories, and search matches [2]. Furthermore, different users and requests engage different code paths based on their data, and can be affected differently by just-in-time (JIT) compilation and caching. User-based benchmarks therefore need to take into account a good mix of both service endpoints and users making the requests, in order to capture the broad range of performance issues that may arise in production settings.

An additional source of variability comes from utilizing multiple hosts, as is common among cloud and Web-based services. Testing in a distributed environment is often necessary due to engineering or time constraints, and can even improve the representativeness of the overall performance data, because the production environment is also distributed. But each host can have its own performance characteristics, and even these vary over time. The performance of computer architectures has grown increasingly nondeterministic over the years, and performance innovations often come at a cost of lower predictability. Examples include: dynamic frequency scaling and sleep states; adaptive caching, branch prediction, and memory prefetching; and modular, individual hardware components comprising the system, each with its own variance [5, 16].

Our research addresses challenges inherent to any benchmarking system which tests user traffic on a distributed set of hosts. Motivated by problems encountered in the development of Perflab [14], an automated system responsible for A/B testing hundreds of code changes each day at Facebook, we develop statistical models of user-based, distributed benchmarking experiments.
These models help us address real-life challenges that arise in a production benchmarking system, including high variability in performance tests, and high rates of false positives in detecting performance regressions.

Our paper is organized as follows. Using data from our production benchmarking system, we motivate the use of statistical concepts, including two-stage sampling, dependence, and uncertainty (Section 3). With this framework, we develop a statistical model that allows us to explore the consequences of how known sources of variation (requests and hosts) affect a hypothetical benchmark. We then use these models to formally evaluate the precision of different experimental designs of distributed benchmarks (Section 4). The variance expressions for each design also inform how statistical tests should be computed. Finally, we propose and evaluate a bootstrapping procedure in Section 5 which shows good performance both in our simulations and production tests.

2. PROBLEM DESCRIPTION

We often want to make changes to software (a new user interface, a feature, a ranking model, a retrieval system, or virtual host) and need to know how each change affects performance metrics such as computation time or network load. This goal is typically accomplished by directing a subset of user traffic to different machines (hosts) running different versions of software and comparing their performance. This scheme poses a number of challenges that affect our ability to detect statistically significant differences, both with respect to Type I errors (false positives, where two versions are deemed to have different performance when in fact they don't) and Type II errors (an inability to detect significant differences where they do exist). These problems can result in financial loss, wasted engineering time, and misallocation of resources.

As mentioned earlier, A/B testing is challenging both from the perspective of reducing the variability in results, and from the perspective of statistical inference. Performance can vary significantly due to many factors, including characteristics of user requests, machines, and compilation. Experiments not designed to account for these sources of variation can obscure performance reversions. Furthermore, methods for computing confidence intervals for user-based benchmarking experiments which do not take into account these sources of variation can result in high false-positive rates.

2.1 Design goals

When designing an experiment to detect performance reversions, we strive to accomplish three main goals:

(1) Representativeness: we wish to produce performance estimates that are valid (reflecting performance for the production use case), take samples from a representative subset of test cases, and generalize well to the test cases that weren't sampled;
(2) Precision: performance estimates should be fine-grained enough to discern even small effects, per our tolerance limits;
(3) Economy: estimates should be obtained with minimal cost (measured in wall time), subject to resource constraints.

In short, we wish to design a representative experiment that minimizes both the number of observations and the variance of the results.
To simplify, we consider a single endpoint or service, and use the terms request, query, and user interchangeably (footnote 1).

Footnote 1: We discuss how our approach can be generalized to multiple endpoints in Section 7.

2.2 Testing environment

To motivate our model, as well as validate our results empirically, we discuss Facebook's in-house benchmarking system, Perflab [14]. Perflab can assess the performance impact of new code without actually installing it on the servers used by real users. This enables developers to use Perflab as part of their testing sequence even before they commit code. Moreover, Perflab is used to automatically check all code committed into the code repository, to uncover both logic and performance problems. Even small performance issues need to be monitored and corrected continuously, because if they are left to accumulate they can quickly lead to capacity problems and unnecessary expenditure on additional infrastructure. Problems uncovered by Perflab or other tests that cannot be resolved within a short time may cause the code revision to be removed from the push and delayed to a subsequent push, after the problems are resolved.

A Perflab experiment is started by defining two versions of the software/environment (which can be identical for an "A/A test"). Perflab reserves a set of hosts to run either version, and prepares them for a batch of requests by restarting them and flushing the caches of the backend services they rely on (which are copied and isolated from the production environment, to ensure Perflab has no side effects). The workload is selected as a representative subset of endpoints and users from a previous sample of live traffic. It is replayed repeatedly and concurrently at the systems under test, which queue up the requests so that the system is under a relatively constant, near-saturation load, to represent a realistically high production load. This continues until we get the required number of observations, but we discard a number of initial observations (the "warm-up" period) and final observations (the "cool-down" period).
This range was determined empirically by analyzing the autocorrelation function [13] for individual requests as a function of repetition number. Perflab then resets the hosts, swaps software versions so that hosts previously running version A now run version B and vice versa, and reruns the experiment to collect more observations.

For each observation, we collect all the metrics of interest, which include: CPU time and instructions; database/key-value fetches and bandwidth; memory use; HTML size; and various others. These metrics detect not only performance changes, as in CPU time and instructions, but can also detect bugs. For example, a sharp drop in the number of database fetches (for a given user and endpoint) between two software versions is more often than not an undesired side effect and an indication of a faulty code path, rather than a miraculous performance improvement, since the underlying user data hasn't changed between versions.

3. STATISTICS OF USER-BASED BENCHMARKS

In order to understand how best to design user-based benchmarks, we must first consider how one captures (samples) metrics over a population of requests. We begin by discussing the relationship between sampling theory and data collection for benchmarks. This leads us to confront sources of variability, first from requests, and then from hosts. We conclude this section with a very general model that brings these aspects together, and serves as the basis for the formal analysis of experimental designs for distributed benchmarks involving user traffic.

3.1 Sampling

Two-stage sampling of user requests

We wish to estimate the average performance of a given version of software, and do so by sending requests (e.g., loading a particular endpoint for various users) to a service, and measuring the sample average $\bar{y}$ of some outcome $y_i$ for each observation $i$. Our certainty about the true average value $\bar{Y}$ of the complete distribution $Y$ increases with the number of observations.
If each observation of $Y$ is independent and identically distributed (iid), then the standard error for our estimate of $\bar{Y}$ is $SE_{iid} = s/\sqrt{N}$, where $N$ is the number of observations, and $s$ is the sample standard deviation of our observed $y_i$'s. The standard error is the standard deviation of our estimate of $\bar{Y}$, and we can use it to obtain the confidence interval for $\bar{Y}$ at significance level $\alpha$ by computing $\bar{y} \pm \Phi^{-1}(1 - \alpha/2)\, SE$, where $\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal. For example, the 95% confidence interval for $\bar{Y}$ can be computed as $\bar{y} \pm 1.96\, SE$.

Oftentimes, data collected from benchmarks involving user traffic is not iid. For example, if we repeat requests for users multiple times, the observations may be clustered along certain values. Figure 1 shows how this kind of clustering occurs with respect to CPU time (footnote 2) for requests fulfilled to a heavily trafficked endpoint at Facebook. Here, we see 300 repetitions for 4, 16, and 256 different requests (recall that by different requests, we are referring to different users for a fixed endpoint). CPU time is distributed approximately normally for any given request, but the amount of variability within a single request is much smaller than the overall variability between requests. When hundreds of different requests are mixed in, in fact, it is difficult to tell that this aggregate distribution is really the mixture of many, approximately normally distributed clusters.

Footnote 2: Because relative performance is easier to reason about than absolute performance in this context, all outcomes in this paper are standardized, so that for any particular benchmark for an endpoint, the grand mean is subtracted from the outcome and then divided by the standard deviation.

[Figure 1: Clustering in the distribution of CPU times for requests. Panels correspond to different numbers of requests. Requests are repeated 300 times. Axes: CPU time (x), count (y).]

[Figure 2: Standard error for clustered data (e.g., requests) as a function of the number of requests and repetitions for different intraclass correlation coefficients, $\rho$. Multiple repeated observations of the same request have little effect on the SE when $\rho$ is large. Axes: repetitions (x), standard error (y); panels by number of requests.]

If observations are clustered in some way (as in Figure 1), the effective sample size could be much smaller than the total number of observations. Measurements for the same request tend to be correlated, such that, e.g., the execution times for requests for the same user are more similar to one another than those for requests related to different users. This is captured by the intraclass correlation coefficient (ICC) $\rho$, which is the ratio of between-cluster variation to total variation [28]:

$$\rho = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}$$

where $\sigma_\alpha$ is the standard deviation of the request effect and $\sigma_\varepsilon$ is the standard deviation of the measurement noise. If $\rho$ is close to 1, then repeated observations for the same request are nearly identical, and when it is close to 0, there is little relation between observations of the same request.

It is easy to see that if most of the variability occurs between clusters, rather than within clusters (high $\rho$), additional repeated measurements for the same cluster do not help with obtaining a good estimate of the mean of $Y$, $\bar{Y}$. This idea is captured by the design effect [10], $d_{eff} = 1 + (T - 1)\rho$, which is the ratio of the variance of the clustered sample (e.g., multiple samples from the same request) to that of the simple random sample (e.g., random samples of multiple independent requests), where $T$ is the number of times each request is repeated.
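The quantities in this subsection can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function names and the numeric values are illustrative assumptions rather than anything taken from Perflab, and the effective sample size $RT / d_{eff}$ is the usual textbook reading of the design effect rather than a formula stated above.

```python
import math
from statistics import NormalDist


def iid_standard_error(s, n):
    """Standard error of the mean for n iid observations with sample SD s."""
    return s / math.sqrt(n)


def confidence_interval(y_bar, se, alpha=0.05):
    """Two-sided normal-approximation interval: y_bar +/- z * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return y_bar - z * se, y_bar + z * se


def intraclass_correlation(sigma_alpha, sigma_eps):
    """ICC: between-request variance divided by total variance."""
    var_between = sigma_alpha ** 2
    return var_between / (var_between + sigma_eps ** 2)


def design_effect(T, rho):
    """Variance inflation from repeating each request T times."""
    return 1 + (T - 1) * rho


def effective_sample_size(R, T, rho):
    """Assumed helper: total observations divided by the design effect."""
    return R * T / design_effect(T, rho)


# Hypothetical values: request SD three times the noise SD.
rho = intraclass_correlation(sigma_alpha=3.0, sigma_eps=1.0)
print("rho:", rho)
print("design effect, T=300:", design_effect(300, rho))
print("effective n, R=16, T=300:", effective_sample_size(16, 300, rho))
print("95% CI:", confidence_interval(0.0, iid_standard_error(1.0, 4800)))
```

With these hypothetical inputs, 16 requests repeated 300 times each carry roughly as much information about the mean as 18 independent observations, which is why the iid interval on the last line would be far too narrow for clustered data.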
Under sampling designs where we may choose both the number of requests $R$ and repetitions $T$, the standard error is instead:

$$SE_{clust} = \sigma \sqrt{\frac{1}{RT}\left(1 + (T - 1)\rho\right)} \tag{1}$$

such that additional repetitions only reduce the standard error of our estimate if $\rho$ is small. Figure 2 shows the relationship between the standard error and $R$ for small and large values of $\rho$. Among the top endpoints benchmarked at Facebook, including news feed loads and search queries, nearly all have values of $\rho$ between 0.8 and [...]. In other words, from our experience with user traffic, mere repetition of requests is not an effective means of increasing precision when the intraclass correlation, $\rho$, is large.

Sampling on a budget

Exploring the tradeoffs between collecting more requests vs. more repetitions needs to take into account not only the value of $\rho$ but also the costs of switching requests. For example, it is often necessary to first warm up a virtual host or cache to reduce temporal effects (e.g., serial autocorrelation) [16] to get accurate estimates for an effect. That is, the fixed cost required to sample a request is often much greater than the cost of gaining additional repetitions. In situations in which $\rho$ is smaller, it may be useful to consider a larger number of repetitions per request. There is also an additional fixed cost $S_f$ to set up an experiment (or a "batch"), which includes resetting the hardware and software as necessary and waiting for the runtime systems to warm up. So, if the fixed cost for benchmarking a single request is $C_f$, the marginal cost for an additional repetition of a request is $C_m$, and the available budget is $B$, one could minimize Eq. 1 subject to the constraint that $S_f + R(C_f + T C_m) \le B$.

3.2 Models of distributed benchmarks

While benchmarking on a single host is simple enough, we are often constrained to run benchmarks in a short period of time, which can be difficult to accomplish on only one host. Facebook, for example, runs automated performance tests on every single code commit to its main site, thousands of times a day [14].
It is therefore imperative to conduct benchmarks in parallel on multiple hosts. Using multiple hosts also has the benefit of surfacing performance issues that may affect one host and not another, for further investigation (footnote 3). But multiple hosts also introduce new sources of variability, which we model statistically in the following sections.

Footnote 3: Such investigation sometimes leads to identifying faulty systems in the benchmarking environment, but can also reveal unintended consequences of the software interacting with different subsystems.

To motivate and illustrate these models, let us look at an example of empirical data from a Perflab A/A benchmark (Figure 3). Two requests (corresponding to service queries for two distinct users under the same software version) are repeated across two different machines in two batches, running thirty times on each. For each batch, the software on the hosts was restarted, and caches and JIT environments were warmed up. The observations correspond to CPU time measured for an endpoint running under a PHP virtual machine. Note that the repetitions of the same request on the same machine in the same batch are relatively similar and show few time trends. Furthermore, the between-request variability (circle vs. square shapes) tends to be greater than the per-machine variability (e.g., the green vs. purple dots). Between-batch effects appear to be similar in size to the noise.

[Figure 3: CPU time for two requests executed over 30 repetitions, executed on two hosts over two different batches. Each shape represents a different request, and each color represents a different host. Panels correspond to different batches. Lines represent a model fit based on Eq. 3. Axes: repetition (x), CPU time (y).]

Random effects formulation

An observation from a benchmark can be thought of as being generated by several independent effects. In the simplest case, we could think of a single observation (e.g., CPU time) for some particular endpoint or service as having some average value (e.g., 500ms), plus some shift due to the user involved in the request (e.g., +500ms for a user with many friends), plus a shift due to the host executing the request (e.g., -200ms for a faster host), plus some random noise.

This formulation is referred to as a crossed random effects model, and we refer to host-level and request-level shifts as random effects [3, 26] or levels [28]. Formally, we denote the request corresponding to an observation $i$ as $r[i]$, and the host corresponding to that observation as $h[i]$. In the equation below, we denote the random effects (random variables) for requests, hosts, and noise for each observation $i$ as $\alpha_{r[i]}$, $\beta_{h[i]}$, and $\varepsilon_i$, respectively. The mean value is given by $\mu$. $f$ denotes a transformation (a "link function"). For additive models, $f$ is simply the identity function, but for multiplicative models (e.g., each effect induces a certain percent increase or decrease in performance), $f$ may be the exponential function (footnote 4), $\exp(\cdot)$:

$$Y_i = f(\mu + \alpha_{r[i]} + \beta_{h[i]} + \varepsilon_i)$$

Under the heterogeneous random effects model [26], each request and host can have its own variance.
This model can be complex to manipulate analytically, and difficult to estimate in practice, so one common approach is to instead assume that random effects for requests or hosts are drawn from the same distribution.

Footnote 4: For the sake of simplicity we work with additive models, but it is often desirable to work with the multiplicative model, or equivalently $\log(Y_i)$. The remaining results hold equally for the log-transformed outcomes.

Homogeneous random effects model for a single batch. Under this model, random effects for hosts and requests are respectively drawn from a common distribution (footnote 5).

$$Y_i = \mu + \alpha_{r[i]} + \beta_{h[i]} + \varepsilon_i$$
$$\alpha_r \sim \mathcal{N}(0, \sigma_\alpha^2), \quad \beta_h \sim \mathcal{N}(0, \sigma_\beta^2), \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2) \tag{2}$$

In this homogeneous random effects model, each request has a constant effect, which is sampled from some common normal distribution, $\mathcal{N}(0, \sigma_\alpha^2)$. Any variability from the same request on the same host is independent of the request.

Homogeneous random effects model for multiple batches. Modern runtime environments are non-deterministic and, for various reasons, restarting a service may cause the characteristic performance of hosts or requests to deviate slightly. For example, the dynamic request order and mix can vary and affect the code paths that the JIT compiles and keeps in cache. This can be specified by additional batch-level effects for hosts and requests.

We model this behavior as follows: each time a service is initialized, we draw a batch effect for each request, $\gamma_{r[i],b[i]} \sim \mathcal{N}(0, \sigma_\gamma^2)$, and similarly for each host, $\eta_{h[i],b[i]} \sim \mathcal{N}(0, \sigma_\eta^2)$. That is, each time a host is restarted in some way (a new "batch"), there is some additional noise introduced that remains constant throughout the execution of the batch, either pertaining to the host or the request. Note that per-request and per-host batch effects are independent from batch to batch (footnote 6), similar to how $\varepsilon_i$ is independent between observations:

$$Y_i = \mu + \alpha_{r[i]} + \beta_{h[i]} + \gamma_{r[i],b[i]} + \eta_{h[i],b[i]} + \varepsilon_i$$
$$\alpha_r \sim \mathcal{N}(0, \sigma_\alpha^2), \quad \beta_h \sim \mathcal{N}(0, \sigma_\beta^2), \quad \gamma_{r,b} \sim \mathcal{N}(0, \sigma_\gamma^2), \quad \eta_{h,b} \sim \mathcal{N}(0, \sigma_\eta^2), \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2) \tag{3}$$

Estimation

How large is each of these effects in practice?
We begin with some observations from a real benchmark to illustrate the sources of variability, and then move on to estimating model parameters for several endpoints in production.

Models such as Eq. 2 and Eq. 3 can be fit efficiently to benchmark data via restricted maximum likelihood estimation using off-the-shelf statistical packages, such as lmer in R. In Figure 3, the horizontal lines indicate the predicted values for each endpoint using the model from Eq. 3 fit to an A/A test (i.e., an experiment in which both versions have the same software) using lmer. In general, we find that the request-level random effects are much greater than the host-level effects, which are similar in magnitude to the batch-level effects. We summarize estimates for top endpoints at Facebook in Table 1, and use them in subsequent sections to illustrate our analytical results.

[Table 1: Parameter estimates for the model in Eq. 3 for the five most trafficked endpoints benchmarked at Facebook. Columns: endpoint, $\mu$, $\sigma_\alpha$, $\sigma_\beta$, $\sigma_\gamma$, $\sigma_\eta$, $\sigma_\varepsilon$.]

Footnote 5: Clearly, some requests vary more than others (cf. Figure 1). This might be caused by, e.g., the fact that some users require additional data fetches or processing time, which can produce greater variance in response time. Log-transforming the outcome variable may help reduce this heteroskedasticity.

Footnote 6: This formal assumption corresponds to no carryover effects from one batch to another [7]. As we describe in Section 2.2, the system we use in production is designed to eliminate carryover effects.
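To make the generative story behind Eq. 3 concrete, the sketch below draws synthetic observations from the additive multi-batch model using only the Python standard library. It is an illustration under stated assumptions, not Perflab's procedure: the function name `simulate_eq3` is hypothetical, every request is run on every host in every batch (a fully crossed layout chosen for simplicity), and the standard deviations are made-up values rather than the Table 1 estimates.

```python
import random
from statistics import pvariance


def simulate_eq3(n_requests, n_hosts, n_batches, n_reps, mu, sd, seed=0):
    """Draw (request, host, batch, y) rows from the additive model of Eq. 3.

    sd maps "alpha", "beta", "gamma", "eta", "eps" to standard deviations.
    Assumption: each request runs n_reps times on each host in each batch.
    """
    rng = random.Random(seed)
    # Request and host effects are drawn once and persist across batches.
    alpha = [rng.gauss(0, sd["alpha"]) for _ in range(n_requests)]
    beta = [rng.gauss(0, sd["beta"]) for _ in range(n_hosts)]
    rows = []
    for b in range(n_batches):
        # Batch-level effects are redrawn independently for every batch.
        gamma = [rng.gauss(0, sd["gamma"]) for _ in range(n_requests)]
        eta = [rng.gauss(0, sd["eta"]) for _ in range(n_hosts)]
        for r in range(n_requests):
            for h in range(n_hosts):
                for _ in range(n_reps):
                    eps = rng.gauss(0, sd["eps"])
                    y = mu + alpha[r] + beta[h] + gamma[r] + eta[h] + eps
                    rows.append((r, h, b, y))
    return rows


# Hypothetical standard deviations on a standardized outcome scale.
example_sd = {"alpha": 0.9, "beta": 0.1, "gamma": 0.1, "eta": 0.05, "eps": 0.3}
rows = simulate_eq3(n_requests=50, n_hosts=4, n_batches=2, n_reps=10,
                    mu=0.0, sd=example_sd)
print(len(rows), "observations; sample variance:",
      pvariance([y for *_, y in rows]))
```

Data generated this way can be written out and passed to a mixed-model fitter such as lmer to check that the variance components are recovered, which is a useful sanity check before fitting real benchmark data.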

4. EXPERIMENTAL DESIGN

Benchmarking experiments for Internet services are most commonly run to compare two different versions of software using a mix of requests and hosts (servers). How precisely we are able to measure differences can depend greatly on which requests are delivered to which machines, and what versions of the software those machines are running. In this section, we generalize the random effects model for multiple batches to include experimental comparisons, and derive expressions for how the standard error of experimental comparisons (i.e., the difference in means) depends on aspects of the experimental design. We then present four simple experimental designs that cover a range of benchmarking setups, including live benchmarks and carefully controlled experiments, and derive their standard errors. Finally, we show how basic parameters, including the number of requests, hosts, and repetitions, affect the standard error of different designs.

4.1 Formulation

We treat the problem formally using the potential outcomes framework [24], in which we consider outcomes (e.g., CPU time) for an observation $i$ (a request-host pair running within a particular batch), running under either version (the experimental condition), whose assignment is denoted by $D_i = 0$ or $D_i = 1$. We use $Y_i^{(1)}$ to denote the potential outcome of $i$ under the treatment, and $Y_i^{(0)}$ for the control. Although we cannot simultaneously observe both potential outcomes for any particular $i$, we can compute the average treatment effect, $\delta = E[Y_i^{(1)} - Y_i^{(0)}]$, because by linearity of expectation, it is equal to the difference in means $E[Y_i^{(1)}] - E[Y_i^{(0)}]$ across different populations when $D_i$ is randomly assigned.

In a benchmarking experiment, we identify $\delta$ by specifying a schedule of delivery of requests to hosts, along with the hosts' assignments to conditions. The particulars of this assignment procedure constitute the experimental design, and they can substantially affect the precision with which we can estimate $\delta$.

We generalize the random effects model in Eq. 2 to include average treatment effects and treatment interactions:

$$Y_i^{(d)} = \mu^{(d)} + \alpha_{r[i]}^{(d)} + \beta_{h[i]}^{(d)} + \gamma_{r[i],b[i]} + \eta_{h[i],b[i]} + \varepsilon_i^{(d)}$$
$$\vec{\alpha}_r \sim \mathcal{N}(0, \Sigma_\alpha), \quad \vec{\beta}_h \sim \mathcal{N}(0, \Sigma_\beta), \quad \gamma_{r,b} \sim \mathcal{N}(0, \sigma_\gamma^2), \quad \eta_{h,b} \sim \mathcal{N}(0, \sigma_\eta^2), \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2) \tag{4}$$

Our goal therefore is to identify the true difference in means, $\delta = \mu^{(1)} - \mu^{(0)}$. Unfortunately, we can never observe $\delta$ directly, and instead must estimate it from data, with noise. Exactly how much noise there is depends on which hosts and requests are involved, the software versions, and how many batches are needed to run the experiment.

More formally, we denote the number of observations for a particular request-host-batch tuple $\langle r, h, b \rangle$ running under the treatment condition $d$ by $n_{rhb}^{(d)}$. We express the total noise from requests, hosts, and residual error under each condition as:

$$\phi_R^{(d)} \equiv \sum_r \left[ n_{r\bullet\bullet}^{(d)} \alpha_r^{(d)} + \sum_b n_{r\bullet b}^{(d)} \gamma_{r,b} \right]$$
$$\phi_H^{(d)} \equiv \sum_h \left[ n_{\bullet h\bullet}^{(d)} \beta_h^{(d)} + \sum_b n_{\bullet hb}^{(d)} \eta_{h,b} \right]$$
$$\phi_E^{(d)} \equiv \sum_{i=1}^{2N} \varepsilon_i^{(d)} \mathbf{1}[D_i = d],$$

where, e.g., $n_{r\bullet b}^{(d)}$ represents the total number of observations involving a request $r$ executed in batch $b$ under condition $d$. We take the total number of observations per condition to be equal, so that $n_{\bullet\bullet\bullet}^{(0)} = n_{\bullet\bullet\bullet}^{(1)} = N$. We can then write down our estimate, $\hat{\delta}$, as

$$\hat{\delta} = \delta + \frac{1}{N} \left[ (\phi_R^{(1)} - \phi_R^{(0)}) + (\phi_H^{(1)} - \phi_H^{(0)}) + (\phi_E^{(1)} - \phi_E^{(0)}) \right].$$

To obtain confidence intervals for $\hat{\delta}$, we also need to know its variance, $V[\hat{\delta}]$. Following Bakshy & Eckles [4], we approach the problem by first describing how observations from each error component are repeated within each condition. We define the duplication coefficients [22, 23]:

$$\nu_R^{(d)} \equiv \frac{1}{N} \sum_r \left( n_{r\bullet\bullet}^{(d)} \right)^2 \qquad \nu_H^{(d)} \equiv \frac{1}{N} \sum_h \left( n_{\bullet h\bullet}^{(d)} \right)^2,$$

which are the average number of observations sharing the same request ($\nu_R$) or host ($\nu_H$). We then define the between-condition duplication coefficient [4], which gives a measure of how balanced hosts or requests are across conditions:

$$\omega_R \equiv \frac{1}{N} \sum_r n_{r\bullet\bullet}^{(0)} n_{r\bullet\bullet}^{(1)} \qquad \omega_H \equiv \frac{1}{N} \sum_h n_{\bullet h\bullet}^{(0)} n_{\bullet h\bullet}^{(1)}.$$
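The duplication coefficients are simple functions of how often each request and host appears in a schedule. The following sketch computes them with the Python standard library; the function name, the tuple layout `(host, request, version)`, and the two toy schedules are assumptions made for illustration.

```python
from collections import Counter


def duplication_coefficients(schedule):
    """Compute nu_R, nu_H, omega_R and omega_H for a benchmark schedule.

    schedule: iterable of (host, request, version) tuples, one per
    observation, with version 0 (control) or 1 (treatment).
    """
    req = {0: Counter(), 1: Counter()}
    host = {0: Counter(), 1: Counter()}
    totals = Counter()
    for h, r, d in schedule:
        req[d][r] += 1
        host[d][h] += 1
        totals[d] += 1
    if totals[0] != totals[1] or totals[0] == 0:
        raise ValueError("conditions need equal, nonzero observation counts")
    n = totals[0]  # observations per condition (N in the text)

    def within(counts):
        # (1/N) * sum of squared per-unit counts inside one condition.
        return sum(c * c for c in counts.values()) / n

    def between(counts):
        # (1/N) * sum over units of control count times treatment count.
        return sum(c * counts[1][key] for key, c in counts[0].items()) / n

    return {
        "nu_R": (within(req[0]), within(req[1])),
        "nu_H": (within(host[0]), within(host[1])),
        "omega_R": between(req),
        "omega_H": between(host),
    }


# Toy schedules (hypothetical): two hosts, two requests per condition.
same_request_same_host = [("h0", "r0", 0), ("h1", "r1", 0),
                          ("h0", "r0", 1), ("h1", "r1", 1)]
disjoint_requests_and_hosts = [("h0", "r0", 0), ("h0", "r1", 0),
                               ("h1", "r2", 1), ("h1", "r3", 1)]
print(duplication_coefficients(same_request_same_host))
print(duplication_coefficients(disjoint_requests_and_hosts))
```

In the first toy schedule every request and host appears equally in both conditions, so the between-condition coefficients are as large as the within-condition ones; in the second, nothing is shared across conditions and both $\omega$ values are zero.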
Furthermore, we only consider experimental designs in which the request-level and host-level duplication are the same in both conditions, so that $\nu_R^{(0)} = \nu_R^{(1)}$ and $\nu_H^{(0)} = \nu_H^{(1)}$, and omit the superscripts in subsequent expressions (footnote 7). Because batch-level random effects are independent, the variance of their sums over batches can be expressed in terms of duplication coefficients, so that, e.g.,

$$V\left[ \frac{1}{N} \sum_r \sum_b n_{r\bullet b}^{(d)} \gamma_{r,b} \right] = \frac{1}{N^2} \sum_r \left( n_{r\bullet\bullet}^{(d)} \right)^2 \sigma_\gamma^2 = \frac{1}{N} \nu_R \sigma_\gamma^2,$$

and because all random effects are independent, it is straightforward to show that the variance of $\hat{\delta}$ can be written in terms of these duplication factors:

$$V[\hat{\delta}] = \frac{1}{N} \Big[ \big( \nu_R (\sigma_{\alpha^{(1)}}^2 + \sigma_{\alpha^{(0)}}^2 + 2\sigma_\gamma^2) - 2\omega_R \sigma_{\alpha^{(0)},\alpha^{(1)}} \big) + \big( \nu_H (\sigma_{\beta^{(1)}}^2 + \sigma_{\beta^{(0)}}^2 + 2\sigma_\eta^2) - 2\omega_H \sigma_{\beta^{(0)},\beta^{(1)}} \big) + \big( \sigma_{\varepsilon^{(0)}}^2 + \sigma_{\varepsilon^{(1)}}^2 \big) \Big] \tag{5}$$

Footnote 7: Note that the individual components, e.g., which requests appear in the treatment and control, need not be the same for the duplication coefficients to be equal.

This expression illustrates how repeated observations of the same request or host affect the variance of our estimator $\hat{\delta}$. Host-level variation is multiplied by how often hosts are repeated in the data, and similarly for requests. Variance is reduced when requests or hosts appear equally in both conditions, since, for example, $\omega_R$ is greatest when $n_{r\bullet\bullet}^{(0)} = n_{r\bullet\bullet}^{(1)}$ for all $r$. We use these facts to explore how different experimental designs (ways of delivering requests to hosts under different conditions) affect the precision with which we can estimate $\delta$.

4.2 Four designs for distributed benchmarks

In this section, we describe four simple experimental designs that reflect basic engineering tradeoffs, and analyze their standard errors under a common set of conditions. In particular, we consider our certainty about experimental effects when:

1. The sharp null is true; that is, the experiment has no effects at all, so that $\delta$ is zero and all variance components are the same (e.g., $\sigma_{\alpha^{(0)}}^2 = \sigma_{\alpha^{(1)}}^2 = \sigma_{\alpha^{(0)},\alpha^{(1)}}$) [4], or
2. There are no treatment interactions, but there is a constant additive effect $\delta$ (footnote 8).

Although these requirements are rather narrow, they correspond to a meaningful and common scenario in which benchmarks are used for difference detection; by minimizing the standard error of the experiment, we are better able to detect situations in which there is a deviation from no change in performance.

[Table 2: Example schedules for the four experimental designs for experiments with two hosts ($H = 2$) in which two requests are executed per condition ($R = 2$). Each example shows four observations, each with a host ID, request ID (e.g., a request to an endpoint for a particular user), software version, and batch number. Sub-tables: (a) Unbalanced, (b) Request balanced, (c) Host balanced, (d) Fully balanced. Columns: Host, Req., Ver., Batch.]

In the four designs we discuss in the following sections, we constrain the design space in a few ways to simplify presentation. First, we assume symmetry with respect to the pattern of delivery of requests; we repeat each request the same number of times in each condition, so that $N = RT$. Second, because executing the same request on multiple hosts within the same condition would increase $\nu_R$ (and therefore inflate $V[\hat{\delta}]$), we only consider designs in which requests are executed on at most one machine per condition. And finally, since by (1) and (2) the variances of the error terms are equal, we drop the superscripts for each $\sigma^2$.
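As a worked step, it may help to write out what Eq. 5 reduces to once the superscripts are dropped. The display below is a sketch of that algebra under conditions (1) or (2), assuming that equal variance components mean $\sigma_{\alpha^{(0)}}^2 = \sigma_{\alpha^{(1)}}^2 = \sigma_{\alpha^{(0)},\alpha^{(1)}} = \sigma_\alpha^2$, and likewise for $\beta$ and $\varepsilon$:

```latex
\begin{aligned}
V[\hat{\delta}]
  &= \frac{1}{N}\Big[\nu_R\,(2\sigma_\alpha^2 + 2\sigma_\gamma^2) - 2\omega_R\,\sigma_\alpha^2
     + \nu_H\,(2\sigma_\beta^2 + 2\sigma_\eta^2) - 2\omega_H\,\sigma_\beta^2
     + 2\sigma_\varepsilon^2\Big] \\
  &= \frac{2}{N}\Big[(\nu_R - \omega_R)\,\sigma_\alpha^2 + \nu_R\,\sigma_\gamma^2
     + (\nu_H - \omega_H)\,\sigma_\beta^2 + \nu_H\,\sigma_\eta^2
     + \sigma_\varepsilon^2\Big].
\end{aligned}
```

Read this way, the request main effect contributes only through $\nu_R - \omega_R$ and the host main effect only through $\nu_H - \omega_H$, so balancing a factor across conditions (making $\omega$ equal to $\nu$) removes its main effect from the variance while leaving the batch-level terms $\sigma_\gamma^2$ and $\sigma_\eta^2$ in place.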
We consider two classes of designs: single-batch experiments, in which half of all hosts are assigned to the treatment and the other half to the control, and two-batch experiments, in which hosts run both versions of the software. Note that the $R$ requests are split among two batches in the latter case (footnote 9).

Footnote 8: When the outcome variable is log-transformed, this $\delta$ corresponds to a multiplicative effect.

Footnote 9: Alternatively, one can think of the experiment as involving subjects and items [3] (which correspond to hosts and requests). The two-batch experiments correspond to within-subjects designs, and single-batch experiments correspond to between-subjects designs.

In the single-batch design, each of the $R$ requests is executed on either the first block of hosts or the second block, where each block is assigned to either the treatment or the control. Therefore, when computing the host-level duplication coefficient $\nu_H$ for a particular condition, we only sum across $H/2$ hosts:

$$\nu_H = \frac{1}{RT} \sum_{h=1}^{H/2} \left( \frac{RT}{H/2} \right)^2 = \frac{2RT}{H}.$$

In the two-batch design, each of the $R$ requests is executed in both the treatment and control, spread across all $H$ hosts:

$$\nu_H = \frac{1}{RT} \sum_{h=1}^{H} \left( \frac{RT}{H} \right)^2 = \frac{RT}{H}.$$

Definitions

Unbalanced design ("live benchmarking"). In the unbalanced design, each host only executes one version of the software, and each request is processed once. It can be carried out in one batch (see example layout in Table 2 (a)). This design is the simplest to implement since it does not require the ability to replay requests. It is often the design of choice when benchmarking live requests only, whether as a necessary feature or merely as a choice of convenience: it obviates the need for a possibly complex infrastructure to record and replay requests. The variance of the difference-in-means estimator for the unbalanced design is:

$$V_{UB}(\hat{\delta}) = 2T(\sigma_\alpha^2 + \sigma_\gamma^2) + \frac{4RT}{H}(\sigma_\beta^2 + \sigma_\eta^2) + 2\sigma_\varepsilon^2$$

The standard error is $\sqrt{\frac{1}{RT} V_{UB}(\hat{\delta})}$.
Expanding and simplifying this expression yields:

$$SE_{UB}(\hat{\delta}) = \sqrt{2 \left( \frac{1}{R}(\sigma_\alpha^2 + \sigma_\gamma^2) + \frac{2}{H}(\sigma_\beta^2 + \sigma_\eta^2) + \frac{1}{RT}\sigma_\varepsilon^2 \right)}$$

The convenience of the unbalanced design comes at a price: the standard error term includes error components both from requests and hosts. We will show later that this design is the least powerful of the four: it achieves significantly lower accuracy (or wider confidence intervals) for the same resource budget.

Request balanced design ("parallel benchmarking"). The request balanced design executes the same request in parallel on different hosts using different versions of the software. This requires that one has the ability to split or replay traffic, and can be done in one batch (see example layout in Table 2 (b)). Its standard error is defined as:

$$SE_{RB}(\hat{\delta}) = \sqrt{2 \left( \frac{1}{R}\sigma_\gamma^2 + \frac{2}{H}(\sigma_\beta^2 + \sigma_\eta^2) + \frac{1}{RT}\sigma_\varepsilon^2 \right)}$$

The request balanced design cancels out the effect of the request, but noise due to each host is amplified by the average number of requests per host, $R/H$. Compared to live benchmarking, this design offers higher accuracy with similar resources. Compared to sequential benchmarking, this design can be run in half the wall time, but with twice as many hosts. It is therefore suitable for benchmarking when request replaying is available and time budget is more important than host budget.

Host balanced design ("sequential benchmarking"). The host balanced design executes requests using different versions of the software on the same host, and thus requires two batches. Each request is again only executed once, so a request replaying ability is not required (see Table 2 (c)).

Compared to live benchmarking, sequential benchmarking takes twice as long to run, but can use just half the number of hosts. It may therefore be useful in situations with no replay ability and a limited number of hosts for benchmarking. It is also more accurate, for a given number of observations, as shown by its standard error:

$$SE_{HB}(\hat{\delta}) = \sqrt{2\left( \frac{1}{R}(\sigma_\alpha^2 + \sigma_\gamma^2) + \frac{1}{H}\sigma_\eta^2 + \frac{1}{RT}\sigma_\varepsilon^2 \right)}$$

The host balanced design cancels out the effect of the host, but one is left with noise due to the request. Note that the number of hosts does not affect the precision of the SEs in this design.

Figure 4: Standard errors for each of the four experimental designs (fully balanced, host balanced, request balanced, unbalanced) as a function of the number of requests and hosts. Lines are the theoretical standard errors given by the formulas for each design, and points are empirical standard errors from 10,000 simulations. Panels indicate the number of hosts used in the benchmark, and the number of requests (x-axis) is on a log scale; the y-axis is the standard error.

Fully balanced design ("controlled benchmarking"). The fully balanced design achieves the most accuracy with the most resources. It executes the same request on the same host using different versions of the software. This requires the ability to split or replay traffic, and takes two batches to complete (see example layout in Table 2 (d)). This design has by far the least variance of the four designs, and is the best choice for benchmarking experiments in terms of accuracy, as shown by the standard error:

$$SE_{FB}(\hat{\delta}) = \sqrt{2\left( \frac{1}{R}\sigma_\gamma^2 + \frac{1}{H}\sigma_\eta^2 + \frac{1}{RT}\sigma_\varepsilon^2 \right)}$$

This design requires replay ability, as well as twice the machines of sequential benchmarking and twice the wall time (batches) of live and parallel benchmarking. However, it is so much more accurate than the other designs that it requires far fewer observations (requests) to reach a similar level of accuracy.
Depending on the tradeoffs between request running time costs and batch setup cost, as well as the desired accuracy, this design may end up taking fewer computational resources than the other designs.

Analysis

We analyze each design via simulation and visualization. We obtained realistic simulation parameters from our model fits to the top endpoint (Table 1). Our simulation then simply draws random $\hat{y}$ values from normal distributions using these parameters. For any given design, the simulations differ in that request effects, host effects, and random noise are redrawn from normal distributions with standard deviations $\sigma_\alpha$, $\sigma_\beta$, and $\sigma_\varepsilon$, respectively.

Figure 5: Distribution of $\hat{\delta}$s for each of the four experimental designs (fully balanced, host balanced, request balanced, unbalanced; x-axis: $\hat{\delta}$, y-axis: density). The data was generated by simulating 10,000 hypothetical experiments from the random effects model in Eq. 3 using parameter estimates from Endpoint 1 in Table 1 with 16 hosts and 256 requests, for each design.

We first consider simulations for each experimental design with a fixed number of hosts and requests and zero average effects. Figure 5 shows the distribution of $\hat{\delta}$s generated from 10,000 simulations. We can see that the fully balanced design has by far the least variance in $\hat{\delta}$, followed by the request balanced, host balanced, and unbalanced designs.

Next, we explore the parameter space of varying hosts and requests in Figure 4. In all cases, adding hosts or adding requests narrows the SE. For the unbalanced and host balanced designs, the effect of the number of requests on the SE is much more pronounced than that of the number of hosts: in the former because variability due to requests is much higher than that due to hosts, as shown in Figure 1; and in the latter because we control for the hosts. Similarly, the request balanced design controls for requests, and therefore shows little effect from varying the number of requests. Finally, the fully balanced design exhibits both the smallest SE in absolute terms and the least sensitivity to the number of hosts.
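The four closed-form standard errors from the Definitions section can be compared numerically. The sketch below is a minimal illustration rather than Perflab code: the function name `standard_errors` and the variance-component values are hypothetical (they are not the Table 1 estimates), and it assumes that the host term of the unbalanced design carries the same 2/H factor as the request balanced design, since both run in a single batch.

```python
import math


def standard_errors(R, H, T, s_alpha, s_beta, s_gamma, s_eta, s_eps):
    """Analytic SE of the difference-in-means estimator for each design.

    R, H, T: number of requests, hosts, and repetitions per request.
    s_*: standard deviations of the random effects (hypothetical inputs).
    """
    noise = s_eps ** 2 / (R * T)
    request_all = (s_alpha ** 2 + s_gamma ** 2) / R   # request effects not cancelled
    request_bal = s_gamma ** 2 / R                    # request balanced: alpha cancels
    host_all = 2 * (s_beta ** 2 + s_eta ** 2) / H     # single batch: H/2 hosts per arm
    host_bal = s_eta ** 2 / H                         # host balanced: beta cancels
    variances = {
        "unbalanced": 2 * (request_all + host_all + noise),
        "request balanced": 2 * (request_bal + host_all + noise),
        "host balanced": 2 * (request_all + host_bal + noise),
        "fully balanced": 2 * (request_bal + host_bal + noise),
    }
    return {design: math.sqrt(v) for design, v in variances.items()}


# Hypothetical variance components; 16 hosts and 256 requests as in Figure 5.
se = standard_errors(R=256, H=16, T=1,
                     s_alpha=1.0, s_beta=0.1, s_gamma=0.05, s_eta=0.02, s_eps=0.1)
for design, value in sorted(se.items(), key=lambda kv: kv[1]):
    print(f"{design:>16}: {value:.4f}")
```

Because each balanced design only removes terms from the unbalanced formula, the fully balanced standard error can never exceed any of the other three under this model, whatever parameter values are supplied.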

5. BOOTSTRAPPING

So far we have discussed simple models that help us understand the main levers that can improve statistical precision in distributed, user-based benchmarks. Our theoretical results, however, assume that the model is correct, and that parameters are known or can easily be estimated from data. This is generally not the case, and in fact, estimating models from the data may require many more observations than are necessary to estimate an average treatment effect.¹⁰ In this section we review a simple non-parametric method for performing statistical inference: the bootstrap. We then evaluate how well it does at reconstructing known standard errors based on simulations from the random effects model. Finally, we demonstrate the performance of the bootstrap on real production benchmarks from Perflab.

5.1 Overview of the bootstrap

Often we wish to generate confidence intervals for an average without making strong assumptions about how the data was generated. The bootstrap [12] is one such technique. The bootstrap distribution of a sample statistic (e.g., the difference in means between two experimental conditions) is the distribution of that statistic when observations are resampled [12] or reweighted [23, 25]. We describe the latter method because it is easiest to implement in a computationally efficient manner.

The most basic way of getting a confidence interval for an average treatment effect for iid data is to reweight observations independently, and repeat this process R times. We assign each observation i a weight $w_{r,i}$ drawn from a mean-one random variable (e.g., Uniform(0,2) or Pois(1)) [23, 25], using each replicate number r and observation number i as a random seed, and use the weights to average the data:

$$\hat{\delta}^*_r = \frac{1}{N_r} \sum_i w_{r,i}\, y_i\, I(D_i = 1) - \frac{1}{M_r} \sum_i w_{r,i}\, y_i\, I(D_i = 0)$$

Here, $N_r$ and $M_r$ denote the sum of the bootstrap weights (e.g., $N_r = \sum_i w_{r,i} I(D_i = 1)$) under the treatment and control, respectively. This process produces a distribution of our statistic, the sample difference in means, $\hat{\delta}^*_{r=1,\dots,R}$. One can then summarize this distribution to obtain confidence intervals.
For example, one can compute the 95% confidence interval for $\hat{\delta}$ by taking the 2.5th and 97.5th quantiles of the bootstrap distribution of $\hat{\delta}$. Another method is to use the central limit theorem (CLT). The distribution of our statistic is expected to be asymptotically normal, so one can compute the 95% interval using the quantiles of the normal distribution with a mean and standard deviation set to the sample mean and standard deviation of the $\hat{\delta}^*_r$s. The CLT intervals are generally more stable than directly computing the quantiles of the bootstrap distribution, so we use this method throughout the remaining sections.

Just as the iid standard errors discussed earlier underestimate the variability in $\bar{Y}$, we expect the iid bootstrap to underestimate the variance of $\hat{\delta}$ when observations are clustered, yielding overly narrow ("anti-conservative") confidence intervals and high false positive rates [21]. The solution to this problem is to use a clustered bootstrap. In the clustered bootstrap, weights are assigned for each factor level, rather than for each observation number. For example, if we wish to use a clustered bootstrap based on the host ID, so as to capture variability due to the host, we can assign a single weight to all observations for a particular host. In this case, we would use host IDs and replicate numbers as our random number seed, and for each replicate, compute:

$$\hat{\delta}^*_r = \frac{1}{N_r} \sum_i w_{r,h[i]}\, y_i\, I(D_i = 1) - \frac{1}{M_r} \sum_i w_{r,h[i]}\, y_i\, I(D_i = 0)$$

¹⁰ For example, identifying host-level effects requires executing the same request on multiple machines, and identifying request-level effects requires multiple repetitions of the same request. Both increase request-level duplication, which reduces efficiency when request-level effects are large relative to the noise, $\sigma_\varepsilon$ (Sec. 4.2).

5.2 Bootstrapping experimental differences

Before examining the behavior of the bootstrap on production data, we validate its performance based on how well it approximates known standard errors, as generated by our idealized benchmark models from Section 3.
To illustrate how this works, we plot the true distribution of $\hat{\delta}$s drawn from the 10,000 simulations for the fully balanced design, shown in Figure 6, along with the $\hat{\delta}^*$s under the host-clustered and request-clustered bootstrap for a single experiment. The host-clustered bootstrap, which accounts for the variation induced by repeated observations of the same host, tends to produce estimates of the standard errors that are similar to the true standard error, while the request-clustered bootstrap tends to be too narrow, in that it underestimates the variability of $\hat{\delta}$.

Figure 6: Comparison of the distribution of $\hat{\delta}$s generated by 10,000 simulations (solid line, "true distribution") with the distribution of bootstrapped $\hat{\delta}^*$s from a single experiment (dashed lines, "host bootstrap" and "request bootstrap"). The x-axis is $\hat{\delta}$ and the y-axis is density.

Figure 7: Visualization of bootstrap confidence intervals from 500 simulated A/A tests, run with different experimental designs (unbalanced, host balanced, request balanced, fully balanced) and bootstrap methods (iid, request, and host bootstrap). Experiments are ranked by estimated effect size (x-axis: rank; y-axis: estimated treatment effect). Shaded error bars indicate false positives. The request-level and iid bootstraps yield overly narrow confidence intervals.
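The reweighting scheme of Section 5.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Perflab implementation: the names `bootstrap_deltas` and `clt_interval` and the toy data are hypothetical; weights come from a single seeded generator rather than from hashing replicate numbers with IDs; and Exp(1) weights (mean one, variance one) are swapped in for the Uniform(0,2) or Pois(1) examples in the text because they are almost surely positive, which avoids an empty arm when only a few hosts are reweighted.

```python
import random
import statistics


def bootstrap_deltas(y, d, cluster, n_reps=500, seed=0):
    """Reweighted difference in means, one value per bootstrap replicate.

    y: outcomes; d: 1 for treatment, 0 for control; cluster: one label per
    observation. Unique labels give the iid bootstrap, host IDs give the
    host-clustered bootstrap.
    """
    rng = random.Random(seed)
    levels = sorted(set(cluster))
    deltas = []
    for _ in range(n_reps):
        # One mean-one weight per cluster level, shared by its observations.
        weight = {c: rng.expovariate(1.0) for c in levels}
        num = [0.0, 0.0]  # weighted outcome sums, indexed by condition
        den = [0.0, 0.0]  # weight sums (M_r for control, N_r for treatment)
        for y_i, d_i, c_i in zip(y, d, cluster):
            num[d_i] += weight[c_i] * y_i
            den[d_i] += weight[c_i]
        deltas.append(num[1] / den[1] - num[0] / den[0])
    return deltas


def clt_interval(deltas, z=1.96):
    """Normal-approximation 95% interval from the bootstrap replicates."""
    center = statistics.mean(deltas)
    spread = statistics.stdev(deltas)
    return center - z * spread, center + z * spread


# Toy unbalanced A/A layout: 4 hosts per arm, 50 observations per host,
# with large host offsets and small within-host variation.
offsets = [-1.5, -0.5, 0.5, 1.5]
y, d, host = [], [], []
for arm in (0, 1):
    for k, offset in enumerate(offsets):
        for j in range(50):
            y.append(offset + 0.01 * (j % 5))
            d.append(arm)
            host.append((arm, k))

deltas_iid = bootstrap_deltas(y, d, list(range(len(y))))
deltas_host = bootstrap_deltas(y, d, host)
se_iid = statistics.stdev(deltas_iid)
se_host = statistics.stdev(deltas_host)
print(f"iid SE: {se_iid:.3f}, host-clustered SE: {se_host:.3f}")
print("host-clustered 95% CI:", clt_interval(deltas_host))
```

In this toy layout nearly all of the variation sits between hosts, so the iid bootstrap treats 200 observations per arm as independent and reports a much smaller standard error than the host-clustered bootstrap, which effectively has only four units per arm.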

Figure 8: Empirical validation of bootstrap estimators for the top 10 endpoints using production data from Perflab. Left: Type I error rates by endpoint for the host-clustered and iid bootstrap (the request-clustered bootstrap has a Type I error rate of > 20% for all endpoints and is not shown); the solid line indicates the desired 5% Type I error rate, and the dashed lines indicate the range of possible observed Type I error rates that would be consistent with a true error rate of 5%. Right: Standard error of the host-clustered, request-clustered, and iid bootstrap relative to the host-clustered standard error.

Next, we evaluate three bootstrapping procedures (iid, host-clustered, and request-clustered) with each of the four designs for a single endpoint. To do this, we use the same model parameters from endpoint 1 as in previous plots. To get an intuitive picture of how the confidence intervals are distributed, we run 500 simulated A/A tests, and for each configuration, we rank-order experiments by the point estimates of $\hat{\delta}$ and visualize their confidence intervals (Figure 7).
Shaded regions represent false positives (i.e., their confidence intervals do not cross 0). The iid bootstrap consistently produces the widest CIs for all designs. This happens because the iid bootstrap doesn't preserve the balance of hosts or requests across conditions when resampling. There is also a clear relationship between the width of CIs and the Type I error rate, in that there is a higher proportion of Type I errors when the CIs are too narrow.

To more closely examine the precision of each bootstrap method with each design, we run 10,000 simulated A/A tests for each configuration and summarize their results in Table 3. For the request-balanced design, we also include an additional bootstrap strategy, which we call the host-block bootstrap, where pairs of hosts that execute the same requests are bootstrapped. This ensures that when whole hosts are bootstrapped, the balance of requests is not broken. This method turns out to produce standard errors with good coverage for the request-balanced design, and so we will henceforth refer to the host-block bootstrap strategy as the host-clustered bootstrap in further analyses. We can see that the host-clustered bootstrap appears to estimate the true SE most accurately for all designs.

5.3 Evaluation with production data

Finally, having verified that the bootstrap method provides a conservative estimate of the standard error when the true standard error is known (because it was generated by our statistical model), we turn toward testing the bootstrap on raw production data from Perflab, which uses the fully balanced design. To do this, we conduct 256 A/A tests using identical binaries of the Facebook WWW codebase. Figure 8 summarizes the results from these tests for the top 10 most visited endpoints that are benchmarked by Perflab. Consistent with the results of our simulations, we find that the request-level bootstrap is massively anti-conservative, and produces confidence intervals that are far too narrow, resulting in a high false positive rate (i.e., > 20%) across all endpoints.
Similarly, we find that our empirical results echo those of the simulations: the iid bootstrap produces estimates of the standard error that are far wider than they should be. The host-clustered bootstrap, however, produces Type I error rates that are not significantly different from 5% for all but 3 of the endpoints; in these cases, the confidence intervals are slightly conservative, as is desired in our use case. For these reasons, all production benchmarks conducted at Facebook with Perflab use the host-clustered bootstrap.

Design             Bootstrap    Type I err.   ŜE   SE
unbalanced         request      21.4%
unbalanced         iid          17.8%
unbalanced         host          3.6%
request balanced   request      71.6%
request balanced   iid          10.8%
request balanced   host          0.8%
request balanced   host-block    5.7%
host balanced      request       8.4%
host balanced      iid           6.8%
host balanced      host          5.0%
fully balanced     request      46.8%
fully balanced     host          4.8%
fully balanced     iid           0.0%

Table 3: Comparison of Type I error rates and standard errors for each experimental design and bootstrap strategy. ŜE indicates the average estimated standard error from the bootstrap, while SE indicates the true standard error from the analytical formulae. Anti-conservative bootstrap estimates of the standard errors produce high false positive rates (e.g., > 5%).

6. RELATED WORK

Many researchers have recognized the obstreperous nature of performance measurement tools, and addressed it piecemeal. Some focus on controlling architectural performance variability, which is still a very active research field. Statistical inference tools can be applied to reduce the effort of repeated experimentation [11, 19]. These studies focus primarily on managing host-level variability (and even intra-host-level variance), and do not devote much attention to the variability of software.
Variability in software performance has been examined in many other studies that attempt to quantify the efficacy of techniques such as: allowing for a warm-up period [8]; reducing random performance fluctuations using regression benchmarking [18]; randomizing multi-threaded simulations [1]; and application-specific benchmarking [27]. In addition, there is an increasing interest specifically in the performance of online systems and in the modeling of large-scale workloads and dynamic content [2, 6].

Several studies addressed the holistic performance evaluation of hardware, software, and users. Cheng et al. described a system called Monkey that captures and replays TCP workloads, allowing the repeated measuring of the system under test without generating synthetic workloads [9]. Gupta et al., in their DieCast system, addressed the challenge of scaling down massive-scale distributed systems into representative benchmarks [15]. In addition, representative characteristics of user-generated load are critical for benchmarking large-scale online systems, as discussed by Manley et al. [20]. Their system, hbench:web, tries to capture user-level variation in terms of user sessions and user equivalence classes. Request-level variation is modeled statistically, as opposed to being measured directly.

A few studies also proposed statistical models for benchmarking experiments. Kalibera et al. showed that simply averaging multiple repetitions of a benchmark without accounting for sources of variation can produce unrepresentative results [17, 18]. They developed models focusing on minimizing experimentation time under software/environment random effects, which can be generalized to other software-induced variability. These models are similar to the model developed here, but they do not take into account host-level or user-level effects, which are central to the experimental design and statistical inference problem we wish to address. To the best of our knowledge, this is the first work to rigorously address distributed benchmarking in the context of user data.

7. CONCLUSIONS AND FUTURE WORK

Benchmarking for performance differences in large Internet software systems poses many challenges. On top of the many well-studied requirements for accurate performance benchmarking, we have the additional dimension of user requests. Because of the potentially large performance variation from one user request to another, it is crucial to take the clustering of user requests into account.
Yet another complicating dimension, which is nevertheless critical to scaling large testing systems, is distributed benchmarking across multiple hosts. It too introduces non-trivial complications with respect to how hosts interact with request-level effects and experimental design to affect the standard error of the benchmark.

In this paper we developed a statistical model to understand and quantify these effects, and explored their practical impact on benchmarking. This model enables the analytical development of experimental designs with different engineering and efficiency tradeoffs. Our results from these models show that a fully balanced design, accounting for both request variability and host variability, is optimal for the goal of minimizing the benchmark's standard error given a fixed number of requests and machines. Although this design may require more computational resources than the other three, it is ideal for Facebook's rapid development and deployment mode because it minimizes developer resources.

Design effects due to repeated observations from the same host also show how residual error terms that cannot be canceled out via balancing, such as batch-level host effects due to JIT optimization and caching, can be an important lever for further increasing precision, especially when there are few hosts relative to the number of requests.

From a practical point of view, estimating the model parameters to compute the standard errors for these experiments (especially in a live system) can be costly and complex. We showed how non-parametric bootstrapping techniques can be used to estimate the standard error of each design. Using empirical data from Facebook's largest differential benchmarking system, Perflab, we confirm that this technique can reliably capture the true standard error with good accuracy. Consequently, all production Perflab experiments use a fully balanced design and the host-level bootstrap to evaluate changes in key metrics, including CPU time, instructions, memory usage, etc.
Our hope is that this paper provides a simple and actionable understanding of the procedures involved in benchmarking for performance changes in other contexts as well. With a more quantitatively informed approach, practitioners can select the most suitable experimental design to minimize the benchmark's run time for any desired level of accuracy.

Our results focus on measuring the performance of a single endpoint or service. Often one wishes to benchmark multiple such endpoints or services. Pooling across these endpoints to obtain a composite standard error is trivial if one can reasonably regard the performance of each endpoint as independent.¹¹ If there is a strong correlation between endpoints, however, the models here must be extended to take into account covariances between observations from different endpoints.

Finally, another area of development for future work is a more extensive evaluation of the costs associated with each of the four basic designs. For example, one might devise a cost model which takes as inputs the desired accuracy and parameters such as batch setup time, fixed and marginal costs per request, etc. These cost functions would depend on the particulars of the benchmarking platform. For example, in some systems the number of requests needed for stable measurements may depend on some maximum load per host. How much warm-up different subsystems need might also depend on how services are distributed across machines.

8. ACKNOWLEDGEMENTS

Perflab is developed by many engineers on the Site Efficiency Team at Facebook. We are especially thankful for all of the help from David Harrington and Eunchang Lee, who worked closely with us on implementing and deploying the testing routines discussed in this paper. This work would not be possible without their help and collaboration. We also thank Dean Eckles for his suggestions on formulating and fitting the random effects model, and Alex Deng for his excellent feedback on this paper.

9. REFERENCES

[1] A. R. Alameldeen, C. J. Mauer, M. Xu, P. J. Harper, M. M. K. Martin, D. J. Sorin, M. D.
Hill, and D. A. Wood. Evaluating non-deterministic multi-threaded commercial workloads. In Proceedings of the Fifth Workshop on Computer Architecture Evaluation Using Commercial Workloads, pages 30–38,
[2] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of the 12th Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance '12), London, UK, June
[3] R. H. Baayen, D. J. Davidson, and D. M. Bates. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4): ,

¹¹ If each endpoint i constitutes some fraction $w_i$ of the traffic, and the outcomes for each endpoint are uncorrelated, the variance of δ pooled across all endpoints is simply $\mathrm{Var}[\hat{\delta}_{pool}] = \sum_i w_i^2 \, \mathrm{Var}[\hat{\delta}_i]$. This independence assumption can be made more plausible by loading endpoints with disjoint sets of users.

[4] E. Bakshy and D. Eckles. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In