Location, Location, Location! Modeling Data Proximity in the Cloud




Birjodh Tiwana (tiwana@eecs.umich.edu), University of Michigan, Ann Arbor, MI
Hitesh Ballani (hiballan@microsoft.com), Microsoft Research, Cambridge, UK
Mahesh Balakrishnan (maheshba@microsoft.com), Microsoft Research, Mountain View, CA
Z. Morley Mao (zmao@eecs.umich.edu), University of Michigan, Ann Arbor, MI
Marcos K. Aguilera (aguilera@microsoft.com), Microsoft Research, Mountain View, CA

ABSTRACT

Cloud applications have increasingly come to rely on distributed storage systems that hide the complexity of handling network and node failures behind simple, data-centric interfaces (such as PUTs and GETs on key-value pairs). While these interfaces are very easy to use, the application is completely oblivious to the location of its data in the network; as a result, it has no way to optimize the placement of data or computation. In this paper, we propose exposing the network location of data to applications. The primary challenge is that data does not usually exist at a single point in the network; it can be striped, replicated, cached and coded across different locations, in arbitrary ways that vary across storage systems. For example, an item that is synchronously mirrored in both Seattle and London will appear equally far from both locations for writes, but equally close to both locations for reads. Accordingly, we describe Contour, a system that allows applications to query and manipulate the location of data without requiring them to be aware of the physical machines storing the data, the replication protocols used or the underlying network topology.

Categories and Subject Descriptors: C.2.1 [Network Architecture and Design]: Network Topology; C.4 [Performance of Systems]: Modeling Techniques

General Terms: Design, Measurement, Performance

Keywords: Cloud Computing, Location, Network Topology, Key-Value Stores

1. INTRODUCTION

Existing cloud platforms offer developers storage services with simple, data-centric interfaces to store and retrieve application data (for example, Microsoft Azure's Blob Store and Amazon's S3 allow PUTs and GETs on key-value pairs). Behind such simple interfaces, these services use complex machinery to ensure that data is available and persistent in the face of network and node failures. As a result, developers can focus on application functionality without having to reason about complex failure scenarios.

Unfortunately, this simplicity comes at a cost: applications have little or no information regarding the location of their data in the network. Without this information, applications cannot optimize their execution by moving computation closer to data, data closer to users, or related data closer to each other. These kinds of optimizations can be crucial for applications executing across different data centers (where network latencies can be very high), as well as within hierarchical data center networks (where bandwidth can be limited). The current state-of-the-art solution for this problem involves guesswork: the cloud determines data placement by predicting the future access patterns of the application based on past history, while treating the application as a black box.
This approach can be expensive and counter-productive, since the application typically has more accurate information than the cloud about its own future behavior. In addition, without input from the application, the cloud can optimize only simple aggregates of low-level metrics such as bandwidth usage or access latency. Requesting information from the application (i.e., its future access patterns or the high-level metrics of interest to it) is a possible solution; however, applications can have arbitrary optimization criteria that are difficult to express to the cloud without complicating the storage interface.

In this paper, we examine a different approach: exposing the location of data to applications and allowing them to optimize their own execution. We want applications to be able to estimate the time taken to update or retrieve data from different network locations. We also want to enable cloud interfaces that allow applications to move computation closer to data (and vice versa), as well as request new computational resources near existing data. Importantly, we want to do so without breaking the abstraction of data-centric storage; applications must not be aware of physical storage servers or the underlying network topology.

The primary challenge is that data does not exist at a single location in the network. Data can be striped, mirrored, cached and coded across different points in the network, using protocols with widely varying semantics. Consider an example in which a machine in Los Angeles accesses data synchronously mirrored at data centers in Seattle and London. For reads, the data will appear to reside in Seattle, since only the local data center needs to be contacted. For writes, the data will appear to reside in London, since both data centers are contacted in parallel for a successful write operation. Note that different protocols (where the client machine waits for a response from the first mirror before updating the second, or lets one mirror directly update the other) lead to different access latencies to the data.

To capture such protocol-specific behavior, we propose the idea of replication topologies. These are simple representations of the interactions between different servers triggered by reads or writes to a storage service. Importantly, replication topologies are not meant to be complete descriptions of protocols; instead, they capture only the network pathways taken by the protocol in failure-free operation. We find that most protocols used in practice (such as synchronous and asynchronous mirroring, erasure codes, chain replication and different types of quorums) can be modeled with very simple replication topologies.

We describe the design of Contour, a system that uses replication topologies to provide data-centric location functionality. Contour provides estimates of the latency to retrieve or update data in a storage service from any node in the network. It also supports higher-level functionality such as closest-node discovery (e.g., finding a node that is closest to a key/value pair for reads) and constraint satisfaction (e.g., finding a node that can update a particular key/value pair within 100 ms and read another within 10 ms). These interfaces can be used directly by applications to optimize performance. They can also be used by cloud subsystems to support new application-facing interfaces for moving data closer to given network locations, or for requesting new resources near existing data.

To work with Contour, a storage service has to provide it with the replication topology used to access data. Contour combines this information with link-level network topology information such as RTTs, bandwidth and loss rates to estimate data access latencies from any other location in the cloud. This link-level information is collected continuously in the background, amortizing the cost of measurement across multiple applications.
Synthesizing accurate estimates in this manner is a significant challenge; however, we believe that cloud platforms have enough monitoring infrastructure in place on their internal networks to make this approach viable. Additionally, high-level functions such as closest-node discovery do not require very accurate latency estimates; it is sufficient if the estimate for the closest node is lower than for any other node.

In this paper, our target applications are Internet services catering to a geo-distributed user base. Contour is equally relevant for different applications, such as MapReduce jobs running within bandwidth-constrained data centers, or enterprise applications split across private and public clouds. We omit discussing these applications in detail for lack of space.

2. LOCATION IN THE CLOUD

To understand why location is important in the cloud, we examine the anatomy of a typical application. Cloud applications are composed of three distinct types of entities:

Clients are machines or devices accessing the service from the public Internet. When a user types in the URL of a service into her browser, the request is redirected (typically through DNS-based load-balancing) to a data center hosting the service.

Compute Nodes are the work-horses of the cloud, executing application logic for services (known as worker roles in Azure and instances on EC2). Compute nodes usually do not store persistent state, though they can store soft session state and be sticky with respect to individual user sessions. Within each data center, incoming requests are directed to compute nodes by a layer of web-servers that accept and manage HTTP connections from users.

Storage Services store all application data and are accessed by compute nodes via simple, data-centric interfaces. Under the hood, these are distributed storage systems running complex protocols to ensure that data is always available and durable, even when machines, disks and networks fail. Each cloud provider offers a range of such services, with different interfaces (such as key-value stores, queues, or even linear address spaces) and persistence guarantees.

Consider the example of FaceTracker, a hypothetical face recognition application for mobile phones: when the user of the phone points its camera towards some person, the image is matched against a library of her friends' photos. In a cloud-based version of this application, the client would be the phone itself, and the library of photos would be stored in a cloud-based storage service (say, a key-value store). Images are uploaded from the phone to a compute node, which runs face recognition algorithms on them, fetching photos of the user's friends from the key-value store with GET calls. Successfully matched images are occasionally added to the library with PUT calls to improve accuracy.

Of the three types of entities described, the application obviously has no control over the location of clients, while it can move data around at some cost in terms of bandwidth and time. Computation is the easiest to move, since compute nodes have no persistent state. Compute nodes typically need to be close to clients to minimize interaction latencies; for FaceTracker, it is reasonable to assume that a user's compute node is always in the data center closest to her location. In practice, deployed systems do a good job of redirecting clients to compute nodes in close-by data centers; consequently, we do not focus on the location of compute nodes relative to clients.

Against this backdrop, we examine application scenarios where the location of data matters with respect to compute nodes, using FaceTracker as a running example.

Obtaining new resources: An application may need new compute nodes near existing data items. Alternatively, it may want to place a newly created data item near an existing compute node. When Alice uses FaceTracker for the first time, her request is directed to an existing compute node, which creates her photo library close to itself.

Load balancing and failure recovery: When a compute node processing some data fails or gets overloaded, the application may need to shift computation to a different compute node near the same data. When the compute node matching Alice's incoming images fails, FaceTracker restarts the task on a different compute node close to her photo library.

Dispersed data: When a task accesses a set of different data items, the application may want to locate it on a compute node equally close to all these data items, or closer to some of them than the rest. Alice wants to match her camera's feed against all her friends' libraries; accordingly, FaceTracker locates her matching task on a compute node optimally placed with respect to all their libraries.

Moving data closer to computation: When users move or change their access patterns, the application may want to relocate data to be closer to the new compute nodes handling those users. When Alice moves from New York City to San Francisco, FaceTracker moves her photo library closer to the new compute node in California now matching her images.

Shared data: When a data item is being concurrently accessed by multiple compute nodes, the application may want to place the data item equally close to all the compute nodes, or prioritize some over others. Alice is in New York and wants to match her camera's feed against the photo library of Bob, who lives in Seattle. Since moving her matching task to Seattle will increase her latency and cost to upload images, FaceTracker moves Bob's library closer to her compute node.
3. REPLICATION TOPOLOGIES

Thus far, we have established the need for applications to know and change the location of data. This is relatively easy to achieve if the cloud stores each data item on a single machine in the network; the location of data is simply the location of the machine storing it. In this scenario, existing node-centric models for network location (such as network coordinates or tree-based models) can be used to estimate access latencies for the data. While such models usually provide low-level path properties between nodes (such as RTT and bandwidth), it is possible to convert these metrics into estimates of data transfer times; for example, the time taken to retrieve 1 MB of data via TCP/IP.

Unfortunately, data in the cloud does not usually reside on a single machine, making node-centric network models ineffective in this context. Storage services typically replicate data over multiple nodes, using a wide range of different protocols. We use replication as a catch-all term, spanning techniques such as caching, mirroring, erasure coding and striping. When a read or write operation is issued on an item in a storage service, multiple storage servers communicate with each other to ensure the right semantics for the operation. The exact patterns of communication (which servers talk to each other, and whether they wait for responses from each other before replying back to the node issuing the request) depend on the replication protocol used.

The key insight in this paper is that the critical path of interactions between storage servers for any replication protocol can be captured using simple representations called replication topologies. Replication topologies can be drawn as DAGs, where a directed edge between two nodes represents a message going from one node to the other. Edges with solid lines correspond to messages that carry the actual payload being read or written, whereas edges with dashed lines correspond to small control messages. To express message dependencies precisely, we allow a single machine to be represented by multiple nodes in the graph.

[Figure 1: Read/write replication topologies for a synchronous mirroring protocol with one primary (A) and two secondaries (B, C). Solid edges are payload messages; dashed edges are control messages.]

To understand replication topologies better, consider a simple synchronous mirroring protocol consisting of a primary replica and two secondary replicas. Write operations are first sent to the primary, which synchronously stores them on the secondary replicas before responding back to the compute node. All read operations are satisfied directly by the primary; the secondaries are read from only in the case of primary failure. We show the corresponding replication topology for this protocol in Figure 1, where a compute node is shown accessing an item synchronously mirrored on a primary A and two secondaries B and C. For reads, the replication topology simply consists of a single outgoing edge from the compute node to A, and then back to the compute node. For writes, we have an outgoing edge from the compute node to A, which then has two outgoing edges to B and C, representing the mirroring messages. B and C then respond back to A, which in turn responds back to the compute node.

Given this replication topology and the locations of the compute node, A, B and C in the network, it is possible to estimate the time taken to write or retrieve a value. For example, if the value is of size 5 MB, the total time taken by the compute node to write it can be computed as follows. First, we compute the time taken to transfer 5 MB from the compute node to A. Since A contacts B and C in parallel, we then take the max of the time taken along those two paths. Each path involves sending 5 MB from A to B or C, respectively, and then receiving a short acknowledgment message back. Lastly, we add the time taken to send an acknowledgment from A to the compute node. In other words, the latency to write a value is a simple function over inter-node data transfer latencies, using sum and max operators.

In addition to providing estimates of data access latencies, replication topologies also allow us to understand how these latencies are impacted by the location of each replica in the network. In the example above, moving the compute node closer to B or C does not necessarily improve performance for writes; what matters is the proximity of the compute node to A, and of A to B and C. Similarly, the location of B or C has no impact on the compute node's read performance. Such knowledge of the replication topology can be used to implement functionality such as closest-node discovery more efficiently.

Our representation includes two more operators to indicate that a node should contact, or wait for, the closest subset of its neighbors. Figure 2 shows the replication topology for an erasure coding protocol; an item is stored as six coded pieces on six different machines (A to F), of which any four pieces are sufficient to reconstruct the original item. We show a variant of this protocol where the compute node contacts all six machines in parallel, but waits for only the first four machines that respond. To implement this, we introduce the first-k operator on the DAG, indicating that the node waits for only the first k incoming messages. A different variant of this protocol might have the compute node contact only the four closest machines, rather than all six; to model this, we use a closest-t operator on the outgoing edges of a node to indicate that it contacts only its t closest neighbors.
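To make the write computation above and the first-k read concrete, here is a minimal sketch (our illustration, not the authors' implementation); the link table, the "CN" name for the compute node, and the RTT-plus-serialization transfer model are all assumptions.

```python
# Minimal sketch of evaluating a latency function over a replication
# topology. Link numbers, node names and the transfer model are assumptions.

LINKS = {  # (src, dst): (rtt_seconds, bytes_per_second)
    ("CN", "A"): (0.010, 100e6),
    ("A", "B"): (0.080, 50e6),
    ("A", "C"): (0.090, 50e6),
}

def transfer(src, dst, nbytes):
    """Crude estimate: one round trip plus serialization time. In Contour
    this would come from continuous link-level measurements."""
    rtt, bw = LINKS.get((src, dst)) or LINKS.get((dst, src)) or (0.050, 50e6)
    return rtt + nbytes / bw

def write_latency(value_size, ack_size=64):
    """Synchronous mirroring (Figure 1): payload goes CN -> A, then A -> B
    and A -> C in parallel; latency is a sum along the critical path with
    a max over the parallel branches."""
    t = transfer("CN", "A", value_size)            # payload to the primary
    t += max(transfer("A", m, value_size) +        # payload to each secondary
             transfer(m, "A", ack_size)            # control ack back to A
             for m in ("B", "C"))                  # wait for both mirrors
    t += transfer("A", "CN", ack_size)             # final ack to compute node
    return t

def read_latency_first4(piece_size, machines="ABCDEF"):
    """(4,2) erasure coding (Figure 2): contact all six machines in parallel
    and wait for the first four responses, i.e. the 4th-fastest round trip."""
    times = sorted(transfer("CN", m, 64) + transfer(m, "CN", piece_size)
                   for m in machines)
    return times[3]

print(f"5 MB write: {write_latency(5_000_000):.3f} s")
print(f"read, first 4 of 6 pieces: {read_latency_first4(1_250_000):.3f} s")
```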
We believe that replication topologies are general enough to model a wide range of replication protocols. For instance, quorum-based protocols are similar to erasure coding in behavior, in that only a subset of nodes needs to be contacted (or waited for). One challenging aspect of some protocols is that their behavior can change over time; for example, an asynchronous primary-backup protocol may allow reads from the secondary except when the primary has just been updated. We can model these protocols as exhibiting different replication topologies at different points in time.

4. THE CONTOUR SYSTEM

In this section, we describe the design of the Contour system, and how it interacts with applications and storage services.

4.1 Contour and the Storage Service

Contour expects the storage service to implement a simple interface that returns the read or write replication topology for a passed-in key. For example, if the storage service implements synchronous mirroring, the returned replication topology will be identical to the diagram in Figure 1, with actual IP addresses substituted for abstract node identifiers (e.g., 192.168.0.1 instead of A).
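The paper does not pin this interface down concretely; one plausible shape for it, together with the value-size lookup described in the next paragraphs, might look as follows (the names, types and fields are hypothetical, not a specified API).

```python
# Hypothetical sketch of the two calls Contour needs from a storage service.
from dataclasses import dataclass, field

@dataclass
class Edge:
    src: str           # IP address of the sending node
    dst: str           # IP address of the receiving node
    payload: bool      # True for payload messages, False for control messages

@dataclass
class ReplicationTopology:
    edges: list[Edge] = field(default_factory=list)
    first_k: dict[str, int] = field(default_factory=dict)  # node -> number of
                                                           # responses to wait for

class StorageService:
    def get_topology(self, key: str, op: str) -> ReplicationTopology:
        """Return the read or write ('get'/'put') replication topology for
        this key, with concrete IPs in place of abstract node identifiers."""
        raise NotImplementedError

    def get_value_size(self, key: str) -> int:
        """Return the size in bytes of the value stored under this key, so
        Contour can turn a read topology into a latency estimate."""
        raise NotImplementedError
```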

[Figure 2: Read replication topology for a (4,2) erasure coding protocol where any four coded blocks are sufficient to read the whole item; the compute node contacts all six machines (A to F) and waits for only the first four responses.]

As described in the previous section, computing the access latency from this replication topology involves a simple function over inter-node data transfer latencies (for example, the time taken to transfer 5 MB from 192.168.0.1 to 192.168.0.2), involving sum, min and max operators. Contour determines this function from the replication topology and evaluates it using estimates for inter-node data transfer latencies obtained from its own link-level measurements.

A secondary interface that the storage service needs to implement is a call that returns the size of the value corresponding to a passed-in key. This is used by Contour to compute access latencies for reads given the replication topology.

4.2 Contour and the Application

The basic functionality provided by Contour is data access latency estimation: applications (or cloud subsystems) can estimate the time taken to read or write data from any compute node in the network. Over this basic primitive, Contour builds more specific functionality; for example, it allows applications to find the node from a set closest to a particular unit of data. It also supports finding nodes that satisfy access latency constraints with respect to multiple data items. This is useful if the application wants to choose an existing compute node to run a particular task based on the data it accesses. It is also useful for cloud allocation or scheduling components that want to satisfy application-specified requirements for new compute nodes.

All these calls identify individual data units with an opaque key parameter, which depends on the underlying storage service involved; it can be a simple key, a block number in a linear address space, or a (row, column) pair. Contour does not actively operate on the key in any way; it merely uses it as a parameter to retrieve the replication topology from the storage service.

On their own, these interfaces naturally support any application scenario that involves moving computation closer to data. With the involvement of the storage service, they can also support scenarios that involve moving data closer to specific network locations. In these cases, we expect the storage service to support an application-facing interface that allows data to be moved closer to some node. The storage service can then use Contour's interfaces to estimate the distance of different replication topologies from the target node to find one that fits the access latency constraint.
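As a sketch of what these application-facing calls might look like as a library, with illustrative signatures that are our assumptions rather than Contour's actual API:

```python
# Illustrative sketch of Contour's application-facing calls; `key` stays
# opaque, exactly as described above.
class Contour:
    def access_latency(self, node: str, key: str, op: str) -> float:
        """Estimated seconds for `node` to read ('get') or write ('put') the
        data unit named by `key`. Backed by the replication topology plus
        link-level measurements; body omitted in this sketch."""
        raise NotImplementedError

    def closest_node(self, nodes: list[str], key: str, op: str) -> str:
        """Closest-node discovery: the member of `nodes` with the lowest
        estimated access latency for `key`."""
        return min(nodes, key=lambda n: self.access_latency(n, key, op))

    def satisfying_nodes(self, nodes: list[str], constraints) -> list[str]:
        """Constraint satisfaction: nodes meeting every (key, op, bound)
        triple, e.g. [("k1", "put", 0.100), ("k2", "get", 0.010)]."""
        return [n for n in nodes
                if all(self.access_latency(n, k, op) <= bound
                       for k, op, bound in constraints)]
```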
4.3 Design Considerations

The simplest way to implement Contour is as a centralized service, accessed by applications via local libraries. Every time an application wishes to estimate access latencies to a key, it can first issue a query to a close-by Contour server, which correspondingly retrieves the replication topology from the storage service. While simple to implement, this purely pull-based approach can result in high query latencies. Caching responses both at the Contour server and the application machine can reduce latencies, but introduces the possibility of staleness. An alternative involves introducing push-based mechanisms. Applications could register their interest in specific keys with their local Contour server, which in turn registers its interest in those keys with the storage service. When the Contour server is notified by the storage service of a change in the replication topology for a key, or the underlying network topology changes, it can notify the compute node interested in that key.

5. DISCUSSION

Contour in different settings: We focused this paper on Internet services running across geographically distant data centers. Contour generalizes easily to other types of cloud applications with one alteration: the way that Contour computes data transfer latencies from low-level link metrics such as RTT, bandwidth and loss rates can change depending on the setting. For example, a different methodology may be required to compute data transfer latencies within a data center, as opposed to over wide-area links. This functionality can be abstracted away into a module responsible for generating estimates of data transfer latencies between two physical nodes.
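Such a module might be a small interface with one implementation per setting; the sketch below and its latency formulas are illustrative assumptions on our part, not Contour's design.

```python
# Sketch of a pluggable transfer-time estimator, one subclass per setting.
from abc import ABC, abstractmethod

class TransferEstimator(ABC):
    @abstractmethod
    def seconds(self, src: str, dst: str, nbytes: int) -> float:
        """Estimated time to move nbytes from src to dst."""

class WideAreaEstimator(TransferEstimator):
    """Wide-area links: charge an extra round trip for connection setup
    (a deliberately crude TCP approximation)."""
    def __init__(self, rtt: float, bandwidth: float):
        self.rtt, self.bw = rtt, bandwidth

    def seconds(self, src, dst, nbytes):
        return 2 * self.rtt + nbytes / self.bw

class IntraDCEstimator(TransferEstimator):
    """Within a data center: propagation is tiny, so a fixed per-request
    service time at the receiver dominates small transfers."""
    def __init__(self, rtt: float, bandwidth: float, service: float = 0.002):
        self.rtt, self.bw, self.service = rtt, bandwidth, service

    def seconds(self, src, dst, nbytes):
        return self.rtt + self.service + nbytes / self.bw
```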

Modeling accesses to storage media: One aspect of using Contour within a single data center is that the latency of accessing storage media (such as disk or flash) needs to be modeled as well, since it constitutes a much larger fraction of end-to-end latency in such settings. We can modify replication topologies to model media access latencies by introducing new nodes and edges into the graph as appropriate. For example, if A is a storage server in a replication topology, it could be rewritten as two nodes instead, A1 and A2, with a directed edge going from one to the other. All incoming edges into A would now go into A1, and all outgoing edges would now leave from A2.

Does the cloud care about its own privacy? An intriguing aspect of Contour's approach is that the cloud could end up revealing information about its composition. In a geo-distributed setting, this could result in applications learning the number of data centers or their location; within a data center, they could possibly learn the type of topology used or the number of machines. While this possibility exists even in the absence of Contour, the ability to collect large numbers of latency estimates without actively transferring data provides applications an inexpensive way to infer the cloud's internal details. This is an open problem for Contour; one possibility involves adding jitter to estimates to offer the cloud some measure of privacy.

Other data-centric metrics: Contour provides applications with the estimated latency to retrieve or update data from a storage service. A different metric of interest could be cost in terms of dollars; for example, if an access results in traffic between data centers, the application may be charged for it by the cloud. Accordingly, Contour could report to the application the cost of a data access. Another possible data-centric metric is availability; Contour could provide the probability that an access succeeds, based on the failure rates of the network paths involved in the replication topology.

6. RELATED WORK

Contour is inspired by existing work on network models for predicting path properties such as latency between Internet end-hosts; for example, network coordinate systems such as Vivaldi [3] attempt to embed inter-node latencies in a coordinate space. Other work in this space includes systems that offer different models [5], narrow location-centric functionality [6] or general query interfaces [4]. Contour can be viewed as an attempt to extend this class of work to provide a data-centric notion of network location.

Volley [1] does automated data placement for geo-distributed applications based on user request patterns. A key difference from our work is that Volley ignores write performance and uses a simple, fixed replication strategy; accordingly, it equates the location of data with the location of its closest replica. In addition, Volley does not seek to retain a data-centric application interface. Lastly, Volley represents a design where optimization is handled by infrastructure instead of the applications themselves; as mentioned earlier, it is difficult for such designs to handle arbitrary application priorities.

Another related system is PADS [2], which provides an architecture for building distributed storage systems by defining policies for locating and updating replicas. Like Contour, PADS comes up with a general representation for different replication policies; however, its goal is more ambitious, since it tries to completely define the protocol. In contrast, Contour's replication topologies are meant only to model the end-to-end latencies exhibited by the protocol.

7. CONCLUSION

Modern cloud platforms make it easy for developers to write applications by abstracting away node-level details under data-centric interfaces. However, doing so robs developers of the ability to understand and optimize application performance. We present a system called Contour that allows nodes to reason about the location of data in the network without breaking the abstraction of data-centric storage.
At the core of Contour are replication topologies, abstractions that express the critical server interactions that occur on data accesses.

Acknowledgments

We'd like to thank Vijayan Prabhakaran, Venugopalan Ramasubramanian and Patrick Stuedi for their feedback during the project. We would also like to thank the HotNets reviewers for their detailed comments on the paper.

8. REFERENCES

[1] S. Agarwal, J. Dunagan, N. Jain, S. Saroiu, A. Wolman, and H. Bhogan. Volley: Automated Data Placement for Geo-Distributed Cloud Services. In NSDI 2010.
[2] N. Belaramani, J. Zheng, A. Nayate, R. Soulé, M. Dahlin, and R. Grimm. PADS: A Policy Architecture for Distributed Storage Systems. In NSDI 2009.
[3] F. Dabek, R. Cox, F. Kaashoek, and R. Morris. Vivaldi: A Decentralized Network Coordinate System. In SIGCOMM 2004.
[4] H. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, and A. Venkataramani. iPlane: An Information Plane for Distributed Services. In OSDI 2006.
[5] V. Ramasubramanian, D. Malkhi, F. Kuhn, M. Balakrishnan, A. Gupta, and A. Akella. On the Treeness of Internet Latency and Bandwidth. In SIGMETRICS 2009.
[6] B. Wong, A. Slivkins, and E. Sirer. Meridian: A Lightweight Network Location Service without Virtual Coordinates. In SIGCOMM 2005.