Wide-Area Traffic Management for Cloud Services


Wide-Area Traffic Management for Cloud Services

Joe Wenjie Jiang

A Dissertation Presented to the Faculty of Princeton University in Candidacy for the Degree of Doctor of Philosophy

Recommended for Acceptance by the Department of Computer Science

Advisers: Jennifer Rexford and Mung Chiang

April 2012

Copyright by Joe Wenjie Jiang. All rights reserved.

Abstract

Cloud service providers (CSPs) need effective ways to distribute content across wide area networks. Providing large-scale, geographically replicated online services presents new opportunities for coordination between server selection (to match subscribers with servers), traffic engineering (to select efficient paths for the traffic), and content placement (to store content on specific servers). Traditional designs isolate these problems, which degrades performance, scalability, reliability and responsiveness. We leverage the theory of distributed optimization, cooperative game theory and approximation algorithms to provide solutions that jointly optimize these design decisions, which are usually controlled by different institutions of a CSP.

This dissertation proposes a set of wide-area traffic management solutions, which consists of the following three thrusts:

(i) Sharing information: We develop three cooperation models with an increasing amount of information exchange between the ISP's (Internet Service Provider) traffic engineering and the CDN's (Content Distribution Network) server selection. We show that straightforward ways of sharing information can be quite sub-optimal, and propose a Nash bargaining solution to reduce the efficiency loss. This work sheds light on ways that different groups of a CSP can communicate to improve their performance.

(ii) Joint control: We propose a content distribution architecture by federating geographically or administratively separate groups of last-mile CDN servers (e.g., nano data centers) located near end users. We design a set of mechanisms to solve a joint content placement and request routing problem under this architecture, achieving both scalability and cost optimality. This work demonstrates how to jointly control multiple traffic management decisions that may have different resolutions (e.g., inter vs. intra ISP), and may happen at different timescales (e.g., minutes vs. several times a day).

(iii) Distributed implementation: Today's cloud services are offered to a large number of geographically distributed clients, leading to the need for decentralized traffic control. We present DONAR, a distributed mapping service that outsources replica selection, while providing a sufficiently expressive service interface for specifying mapping policies based on performance, load, and cost. Our solution runs on a set of distributed mapping nodes for directing local client requests, and requires only a lightweight exchange of summary statistics for coordination between mapping nodes. This work exemplifies a decentralized design that is simultaneously scalable, reliable, and accurate.

Collectively, these solutions are combined to provide a synergistic traffic management system for CSPs who wish to offer better performance to their clients at a lower cost. The main contribution of this dissertation is to develop new design techniques to make this process more systematic, automated and effective.

Acknowledgments

I owe tremendous thanks to the many people I have been fortunate to meet, who helped build my research and shared their wisdom of life.

I am deeply grateful to my advisors, Jennifer Rexford and Mung Chiang, for influencing me by their pursuit of top quality research. I am fortunate to have had both of them advise me, an unparalleled opportunity to learn to apply theory in solving practical problems. Thanks to Jen for her selfless offerings throughout my entire graduate study: the freedom needed to pursue exciting ideas, the knowledge needed to solve complex problems, the optimism needed to challenge the unknowns, and the skills needed for perfection in all aspects of research: writing, presentation, communication and teamwork. Thanks to Mung for his dedication that helped shape this thesis, his insightful thoughts on both research and life, his guidance on being a professional researcher, and his passion that always encourages me to explore new areas fearlessly. I could not ask for more from them.

I would like to thank Mike Freedman, Andrea LaPaugh, and Augustin Chaintreau for serving on my thesis committee. Mike is also the co-author and mentor of the DONAR project presented in Chapter 4. His taste in research and sharp comments have always been a reliable source for improving my work. Thanks to Andrea for enlightening my presentation. Thanks to Augustin for being hands-on, which has made this thesis more solid and complete.

I would also like to thank the contributors to this thesis. Rui Zhang-Shen co-authored the work in Chapter 2. I am also grateful for her mentorship during the early stage of my Ph.D. Thanks to Stratis Ioannidis, Laurent Massoulié and Fabio Picconi for helping with the proofs and measurement data in Chapter 3. Thanks to Laurent and Christophe Diot for giving me the opportunity to intern at Technicolor Research Lab in Paris. Thanks to Stratis for demonstrating the quality of a true theorist, and to the other researchers for making the internship an unforgettable experience. Thanks to Patrick Wendell, the gifted young undergrad, for all the system implementation work in Chapter 4. I enjoyed our brainstorming, deadline fights, and lunch sandwiches.

Special thanks to my other collaborators, including S.-H. Gary Chan, Minghua Chen, Sangtae Ha, Tian Lan, Shao Liu, Srinivas Narayana, D. Tony Ren, Bin Wei, and Shaoquan Zhang, for their inputs to the work that I am not able to present in this thesis. Thanks to Gary for hosting my summer visit at HKUST, and providing generous resources and support to conduct my research.

Thanks to Mike Schlansker, Yoshio Turner, and Jean Tourrilhes for mentoring my internship at HP Labs, making it a productive and pleasant winter in Palo Alto.

I would like to thank a long list of members (past and present) from the Cabernet group: Ioannis Avramopoulos, Matthew Caesar, Alex Fabrikant, Sharon Goldberg, Robert Harrison, Elliott Karpilovsky, Eric Keller, Changhoon Kim, Haakon Ringberg, Michael Schapira, Srinivas Narayana, Martin Suchara, Peng Sun, Yi Wang, Minlan Yu, Rui Zhang-Shen, and Yaping Zhu; and from the EDGE Lab: Ehsan Aryafar, Jiasi Chen, Amitabha Ghosh, Sangtae Ha, Prashanth Hande, Jiayue He, Jianwei Huang, Hazer Inaltekin, Ioannis Kamitsos, Hongseok Kim, Haris Kremo, Tian Lan, Ying Li, Jiaping Liu, Shao Liu, Chris Leberknight, Soumya Sen, Chee Wei Tan, Felix Wong, Dahai Xu, and Yung Yi. Working in a mixed culture of computer science and electrical engineering, and with folks of versatile expertise, has given me a unique opportunity to broaden my scope of knowledge and ways of thinking. It was great to have worked with and learned from them. Special thanks to Melissa Lawson for her years of dedication in serving as the graduate coordinator.

Friends have immensely enriched my life at graduate school. I thank: Dalin Shi, for being like a brother; Yunzhou Wei, for beers and basketball; Yiyue Wu, for being the best roommate; Wei Yuan, for being a good old-school friend; Yinyin Yuan, for comparing Princeton and Cambridge, and introducing the art of wine; Pei Zhang, for offering fun and selfless help; and many others from the computer science department, for their support and for making me feel at home.

I would like to thank John C.S. Lui for sharing his wisdom, and continually encouraging me to follow the heart.

I'd like to acknowledge Princeton University, the National Science Foundation, the Air Force Office of Scientific Research, and Technicolor Labs for their financial support.

Thanks to Yuanyuan for bringing me happiness and being part of my life. Thanks to my parents for their enduring love and support. This dissertation is dedicated to you.

Contents

Abstract

1 Introduction
  An Overview of Today's Cloud Service Providers
  Requirements of CSP Traffic Management
  The Need for Sharing Information
  The Need for a Joint Design
  The Need for a Decentralized Solution
  Design Approaches
  A Top-Down Design
  Optimization as a Design Language
  Optimization Decomposition for a Distributed Solution
  Contributions

Cooperative Server Selection and Traffic Engineering in an ISP Network
  Introduction
  Traffic Engineering (TE) Model
  Server Selection (SS) Models
  Server Selection Problem
  Server Selection with End-to-end Info: Model I
  Server Selection with Improved Visibility: Model II
  Analyzing TE-SS Interaction
  TE-SS Game and Nash Equilibrium
  2.4.2 Global Optimality under Same Objective and Absence of Background Traffic
  Efficiency Loss
  The Paradox of Extra Information
  Pareto Optimality and Illustration of Sub-Optimality
  A Joint Design: Model III
  Motivation
  Nash Bargaining Solution
  COTASK Algorithm
  Performance Evaluation
  Simulation Setup
  Evaluation Results
  Related Work
  Summary

Federating Content Distribution across Decentralized CDNs
  Introduction
  Problem Formulation and Solution Structure
  System Model
  A Global Optimization for Minimizing Costs
  System Architecture
  A Decentralized Solution to the Global Problem
  Standard Dual Decomposition
  A Distributed Implementation
  Request Routing and Service Assignment
  Inter-Domain Request Routing
  Intra-Domain Service Assignment
  Optimality of Uniform Slot Policy
  Content Placement
  Designated Slot Placement
  An Algorithm Constructing a Designated Slot Placement
  3.6 Performance Evaluation
  Uniform-slot service assignment
  Synthesized trace based simulation
  BitTorrent trace-based simulation
  Related Work
  Summary

DONAR: Decentralized Server Selection for Cloud Services
  Introduction
  A Case for Outsourcing Replica Selection
  Decentralized Replica-Selection System
  Research Contributions and Roadmap
  Configurable Mapping Policies
  Customer Goals
  Application Programming Interface
  Expressing Policies with DONAR's API
  Replica Selection Algorithms
  Global Replica-Selection Problem
  Distributed Mapping Service
  Decentralized Selection Algorithm
  DONAR's System Design
  Efficient Distributed Optimization
  Providing Flexible Mapping Mechanisms
  Secure Registration and Dynamic Updates
  Reliability through Decentralization
  Implementation
  Evaluation
  Trace-Based Simulation
  Predictability of Client Request Rate
  Prototype Evaluation
  4.6 Related Work
  Summary

Conclusion
  Summary of Contribution
  Synergizing Three Traffic Management Solutions
  Open Problems and Future Work
  Optimizing CSP Operational Costs
  Traffic Management Within a CSP Backbone
  Long Term Server Placement
  Traffic Management within Data Centers
  Concluding Remarks

List of Figures

1.1 Four parties in the cloud ecosystem: Internet Service Provider (ISP), Content Distribution Network (CDN), content provider, and client
CSP traffic management decisions
The interaction between traffic engineering (TE) and server selection (SS)
An Example of the Paradox of Extra Information
A numerical example illustrating sub-optimality
ISP and CP cost functions
The TE-SS tussle vs. CP's traffic intensity (Abilene topology)
TE and SS performance improvement of Model II and III over Model I. (a-b) Abilene network under low traffic load: moderate improvement; (c-d) Abilene network under high traffic load: more significant improvement, but more information (in Model II) does not necessarily benefit the CP and the ISP (the paradox of extra information)
Performance evaluation over different ISP topologies. Abilene: small cut graph; AT&T, Exodus: hub-and-spoke with shortcuts; Level 3: complete mesh; Sprint: in between
Decentralized solution to the global problem GLOBAL
The repacking policy improves the resource utilization by allowing existing downloads to be migrated
Placement Algorithm
3.4 Characterization of real-life BitTorrent trace. (a) Cumulative counts of downloads/boxes. (b) Per-country counts of downloads/boxes. (c) Predictability of content demand in one hour interval over one month period
Dropping probability decreases fast with uniform slot strategy. Simulation in a single class with a catalog size of C =
Performance of full solution with decentralized optimization, content placement scheme and uniform-slot policy, under the parameter settings C = 1000, D = 10, B = 1000, Ū = 3, M =
Performance of different algorithms over a real 30-day BitTorrent trace
DONAR uses distributed mapping nodes for replica selection. Its algorithms can maintain a weighted split of requests to a customer's replicas, while preserving client-replica locality to the greatest extent possible
DONAR's Application Programming Interface
Multi-homed route control versus wide-area replica selection
Interactions on a DONAR node
Software architecture of a DONAR node
DONAR adapts to split weight changes
Network performance implications of replica-selection policies
Sensitivity analysis of using tolerance parameter ε_i
Stability of area code request rates
Server request loads under closest replica policy
Proportional traffic distribution observed by DONAR (Top) and CoralCDN (Bottom), when an equal-split policy is enacted by DONAR. Horizontal gray lines represent the ε tolerance ±2% around each split rate
Client performance during equal split

List of Tables

2.1 Summary of results and engineering implications
Summary of key notation
Link capacities, ISP's and CP's link cost functions in the example of Paradox of Extra Information
Distributed algorithm for solving problem (2.12a)
To cooperate or not: possible strategies for content provider (CP) and network provider (ISP)
Summary of key notations
Operations performed by a class tracker
Summary of key notations
Decentralized solution of server selection

Chapter 1

Introduction

The Internet is increasingly a platform for online services (such as Web search, social networks, multiplayer games and video streaming) distributed across multiple locations for better reliability and performance. In recent years, large investments have been made in massive data centers supporting computing services, by Cloud Service Providers (CSPs) such as Facebook, Google, Microsoft, and Yahoo!. The significant investment in capital outlay by these companies represents an ongoing trend of moving applications, e.g., for desktops or resource-constrained devices like smartphones, into the cloud. The trend toward geographically replicated online services will only continue and increasingly include small enterprises, with the success of cloud-computing platforms like Amazon Web Services (AWS) [1]. CSPs usually host a wide range of applications and online services, and lease the computational power, storage and bandwidth to their customers, for instance, the video subscription service Netflix.

Each online service has specific performance requirements. For example, Web search and multiplayer games need low end-to-end latency; video streaming needs high throughput, whereas social networks require a scalable way to store and cache user data. Clients access these services from a wide variety of geographic locations over access networks with wildly different performance. CSPs undoubtedly care about the end-to-end performance their customers experience. For instance, even small increases in round-trip times have a significant impact on their revenue [2]. On the other hand, CSPs must consider operational costs, such as the price they pay their upstream network providers for bandwidth, and the electricity costs in their data centers. Large CSPs can easily send and receive petabytes of traffic a day [3], and spend tens of millions of dollars per year on electricity [4]. As such, CSPs increasingly need effective ways to manage their traffic in order to optimize client performance and operational costs.

1.1 An Overview of Today's Cloud Service Providers

Traditionally, traffic management over wide-area networks has been performed independently by administratively separated entities, which together make up today's cloud ecosystem. As illustrated in Figure 1.1, there are four parties in the ecosystem:

[Figure 1.1: Four parties in the cloud ecosystem: Internet Service Provider (ISP), Content Distribution Network (CDN), content provider, and client.]

Internet Service Provider: Internet Service Providers (ISPs, a.k.a. network providers, such as AT&T [5]) provide connectivity, or the bandwidth "pipes" to transport content, simply treating it as packets and thus remaining oblivious to content sources. A traditional ISP's primary role is to deploy infrastructure, manage connectivity, and balance traffic load inside its network by computing the network routing decisions. In particular, an ISP solves the traffic engineering (TE) problem, i.e., adjusting the routing configuration to the prevailing traffic. The goal of TE is to ensure efficient routing to minimize congestion, so that users experience low packet loss, high throughput, and low latency, and that the network can gracefully absorb bursty traffic.

Content Distribution Network: Content Distribution Networks (CDNs, e.g., Akamai [6]) provide the infrastructure to replicate content across geographically diverse data centers (or servers). Today's CDNs strategically place servers across geographically distributed locations, and replicate content over a number of designated servers. CDNs solve a server selection (SS) problem, i.e., determining which servers should deliver content to each end user. The goal of server selection is to meet user demand, minimize network latency to reduce user waiting time, and balance server load to increase throughput. Sometimes, the job of server selection can be outsourced to a third party, in addition to the CDN, to allow customers to specify their own high-level policies, based on performance, server and network load, and cost.

Content Provider: Content Providers (CPs, e.g., Netflix [7]), a.k.a. tenants of the cloud service, produce, organize and deliver content to their clients by renting a set of CDN servers for better availability, performance and reliability. Content providers generate revenue by delivering content to clients. Increasingly, cloud computing offers an attractive approach where the cloud provider offers elastic server and network resources, while allowing customers to design and implement their own services. While network routing and server selection are handled by individual CDNs and ISPs, such customers are left largely to handle content placement on their own, i.e., placing and caching content on appropriate servers to meet client demands and reduce end-to-end latency.

Client: Clients, who consume the content, choose their ISPs for Internet connectivity, and subscribe to various content providers for their services. Therefore, the user-perceived performance is affected by many factors that are influenced by different ISPs, CDNs and CPs.
As the Internet increasingly becomes a platform for online services, the boundaries between these parties have become more blurred than ever. We formally define a CSP as a service provider that plays two or more roles in the cloud ecosystem, for instance:

ISP + CDN, e.g., AT&T, which deploys CDNs inside its own network.
CDN + CP, e.g., YouTube, which deploys many geographically distributed servers to stream videos.
ISP + CDN + CP, e.g., Google, which has its own data centers and network backbone, and provides online services to its customers.

In all of the above scenarios, CSPs have a unique opportunity to coordinate between traffic-management tasks that were previously controlled by different institutions. In particular, today's CSPs optimize client performance and operational costs by controlling (i) content placement, i.e., placing and caching content in servers (or data centers) that are close to clients, (ii) server selection, i.e., directing clients across the wide area to an appropriate service location (or "replica"), and (iii) network routing, i.e., selecting wide-area paths to clients, or intra-domain paths within a CSP's own backbone, as shown in Figure 1.2.

[Figure 1.2: CSP traffic management decisions. Panels: (a) Content placement, (b) Server selection, (c) Network routing.]

To handle wide-area server selection, CSPs usually run DNS servers or HTTP proxies (front-end proxies) at multiple locations and have these nodes coordinate to distribute client requests across the data centers. To have control over wide-area path performance, large CSPs often build their own backbone network to inter-connect their data centers, or connect each data center to multiple upstream ISPs when they do not have a backbone. Further, some CSPs deploy a large number of "nano" data centers [8] and need to place specific content in each server, or may additionally cache popular content in the front-end proxies when they direct client requests.

In the rest of this chapter, we outline the detailed design requirements (Section 1.2), introduce our design methodologies (Section 1.3), and summarize our main contributions (Section 1.4).

1.2 Requirements of CSP Traffic Management

Today's CSPs usually do not make optimal traffic management decisions, and so fall short of the high performance and low cost that they could achieve, due to (i) limited visibility, (ii) independent control, and (iii) poor scalability. To address these challenges, we propose a complete set of networking solutions that promote information sharing, joint control, and distributed implementation.

1.2.1 The Need for Sharing Information

Traffic management decisions are made by administratively separated groups of the CSP, or even affected by other institutions such as intermediate ISPs that transit traffic between clients and the CSP. They often have relatively poor visibility into each other, leading to sub-optimal decisions.

Misaligned objectives lead to conflicting decisions. Conventionally, ISPs and CDNs (or content providers) optimize different objectives. Typically, ISPs adjust the routing configurations for all traffic inside their networks, in addition to the CDN traffic, in the hope of minimizing network congestion, achieving high throughput and low latency, and reducing operational costs such as traffic transit costs. On the other hand, CDNs direct clients to the closest data center to reduce round-trip times, without regard to the resulting costs and network congestion. As a consequence, the network routing and server selection decisions are usually at odds. For example, an ISP may prefer to route the CDN traffic on a longer path with lower congestion, while a CDN may direct a client to a closer data center that is reached through an expensive provider path.

Incomplete visibility leads to sub-optimal decisions. In making server-selection decisions, a CDN needs to predict the client round-trip latency to a server, which depends on the wide-area path performance. In practice, the CDN has limited visibility into the underlying network topology and routing, and therefore has limited ability to predict client performance in a timely and accurate manner. A CDN rates server performance using IP geolocation databases [9, 10] or Internet path performance prediction tools [11], which are usually load oblivious. Therefore, without information about link loads and capacities, a CDN may direct excessive traffic to a geographically closer server, leading to overloaded links.
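The load-oblivious failure mode described above can be sketched in a few lines of Python. This is a toy illustration and not taken from the dissertation: the two servers `near` and `far`, the single client region `nyc`, the distances, the capacities, and the greedy `assign` helper are all hypothetical names and values chosen for the example. The sketch compares a closest-server policy with one that also checks remaining capacity.

```python
def assign(clients, servers, load_aware):
    """Greedily map each client request to a server.

    clients: list of (region, demand) pairs (hypothetical workload).
    servers: dict name -> {"capacity": float, "distance": {region: km}}.
    Returns the resulting load per server name.
    """
    load = {name: 0.0 for name in servers}
    for region, demand in clients:
        names = list(servers)
        if load_aware:
            # Keep only servers with spare capacity; fall back to all
            # servers if none can absorb the request.
            fits = [n for n in names
                    if load[n] + demand <= servers[n]["capacity"]]
            names = fits or names
        # Among the candidates, pick the geographically closest server.
        best = min(names, key=lambda n: servers[n]["distance"][region])
        load[best] += demand
    return load


# Hypothetical setup: one nearby and one distant server, equal capacity.
servers = {
    "near": {"capacity": 10.0, "distance": {"nyc": 50}},
    "far": {"capacity": 10.0, "distance": {"nyc": 800}},
}
clients = [("nyc", 4.0)] * 4  # four requests of 4 units each

print(assign(clients, servers, load_aware=False))
print(assign(clients, servers, load_aware=True))
```

With these made-up numbers the closest-server policy sends all demand to `near` and exceeds its capacity, while the capacity-aware variant spills the excess to `far`. A real CDN would need link loads and routing information from the ISP, which is exactly the visibility this section argues is missing.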

1.2.2 The Need for a Joint Design

Traffic management decisions are made independently by different institutions or different groups in the same company, yet they clearly influence each other. As such, today's practice often achieves much lower performance or higher costs than a coordinated solution.

Separate optimization does not achieve globally optimal performance. We observe that separate decision making does not yield globally optimal performance, even given complete visibility into all participating systems. For example, separating server selection and traffic engineering, e.g., carefully optimizing one decision on top of the other, leads to sub-optimal equilibria, even when the CDN is given accurate and timely information about the network. In general, such separate optimizations do not yield mutually optimal performance, motivating a joint design for a coordinated solution.

Decision making happens at different timescales. Since traffic management decisions are made by different institutions, they usually are optimized at different timescales. For instance, the ISP runs traffic engineering at the timescale of hours, although it could run on a much smaller timescale. Server selection is usually optimized at a smaller timescale, such as minutes, to achieve accurate load balancing across servers. Depending on content providers' choices, how often the content placement decision is updated can vary widely, ranging from a few times a day to on-the-fly caching. These heterogeneities raise the need for a joint control that is both accurate and practical.

1.2.3 The Need for a Decentralized Solution

Traffic management decisions are often made in a centralized manner, leading to high complexity and poor scalability. The functional separation implied by today's architecture, and the large number of network elements (e.g., data centers, servers, wide-area network paths, and clients), raise the need for a decentralized solution in our design.

Functional separation is practical and efficient. A joint design naturally leads to a central coordinator for controlling all traffic management decisions inside a CSP. However, we want a modularized design that functionally separates these decisions, e.g., between the ISP and the CDN, yet achieves a jointly optimal solution. Today's server selection, network routing, and content placement are themselves performed by large distributed systems managed by separate groups in the same company, or even outsourced to third parties (e.g., Akamai running Bing's front-end servers, or DONAR [10]). Tightly coupling these systems would lead to a complex design that is difficult to administer. Instead, these systems should continue to operate separately and run on existing infrastructures, with lightweight coordination to arrive at good collective decisions.

Distributed implementation is scalable and reliable. The need for scalability and reliability should drive the design of our system, leading to a distributed solution that consists of a set of spatially distributed network nodes, such as mapping nodes (for server selection), backbone or edge routers (for network routing), and nano data centers and proxy servers (for content placement), to handle traffic management in an autonomous and collaborative manner. While a simple approach with a central coordinator is straightforward, it introduces a single point of failure, as well as an attractive target for attackers trying to bring down the service. Further, it requires the infrastructure nodes to interact with the controller, leading to excessive communication overhead. Finally, a centralized solution adds additional delay, making the system less responsive to sudden changes in client demands (e.g., flash crowds). To overcome these limitations, we need a scalable and reliable distributed solution that runs on individual nodes while still attaining globally optimal performance.

Meeting all the above requirements poses several significant challenges. First, each online service runs at multiple data centers at different locations, which vary in their capacity (e.g., number of servers), their connectivity to the Internet (e.g., upstream ISPs for multihoming, or bandwidth provisioning for individual nano data centers), and their proximity to clients. These heterogeneities present many practical constraints in our design. Second, some problems, e.g., expressing the joint control as an optimization problem, do not admit a simple formulation that is computationally tractable. We need advanced optimization and approximation techniques to make the solution computationally efficient and easy to implement, yet with a provable guarantee of optimality. Third, as client demands and network conditions vary over time, our solution should adapt well to these changes. We address these challenges in this dissertation.

21 1.3 Design Approaches To address the wide-area networking needs of CSPs, our research foows methodoogies that are taiored to our specific design goas, as summarized in the foowing three thrusts A Top-Down Design Today s CSPs depoy onine services across mutipe ISP networks and data centers. Traffic management decisions, such as server seection, can have different contro granuarities. For exampe, a CDN depoys its servers inside many ISP networks. When directing cient requests, a CDN s decision consists of two eves: inter-domain (e.g., which ISP to choose) and intra-domain (e.g., which server to choose within an ISP network). Further, traffic management decisions are made at different timescaes. Content pacement and network routing are updated reativey ess frequenty, whie server seection are re-optimized at a smaer timescae in order to adapt to changing traffic and network conditions. To address these issues, our design foows a top-down principe: make decisions from arge (e.g., ISP- and data center-wide) to sma (e.g., server-wide) resoutions, and from arge (e.g., houry) to sma (e.g., of severa minutes) timescaes. Such a design choice has severa merits. First, it simpifies the probem and aows us to divide-and-conquer the probem of a prohibitivey arge size. Second, the top-down design is scaabe as the number of data centers and cients grows. Third, it reduces the impementation compexity as it aows a separation of contro at different institutions Optimization as a Design Language Convex optimization [12] is a specia cass of mathematica optimization probems that can be soved numericay very efficienty. We utiize convex optimization to formuate traffic management probems faced by today s CSPs. We carefuy define a set of performance metrics, e.g., atency and throughput, as the objectives in the optimization probem. We are aso abe to capture various operationa costs, e.g., network congestion, bandwidth and eectricity costs, in the objective function. 
Optimization aso aows us to express rea-word constraints, such as ink capacity, and to freey customize CSP poicies, for instance, each data center specifies a traffic spit weight or bandwidth cap. The soid foundation in convex programming aows us to sove 8

22 these probems in a computationay efficient way, and with the optimaity guarantee that many heuristics used in practice today coud not achieve. We are faced with the chaenge, however, that many practica probems such as joint contro of traffic engineering and server seection do not have a straightforward convex formuation, and hence we cannot directy appy efficient soution techniques. To overcome this imitation, we deveop methods to convert or approximate the non-convex probem into a convex form, with a provabe bound on the optimaity gap Optimization Decomposition for a Distributed Soution To enabe decentraized traffic management rather than a centraized approach, we everage the optimization decomposition theory to derive distributed soutions that are provaby optima. Distributed agorithms are notoriousy prone to osciations (e.g., distributed mapping nodes overreact based on their own oca information) and inaccuracy (e.g., the system does not optimize the designated objectives). Our design must avoid faing into these traps. We utiize prima-based and dua-based decomposition methods to decoupe oca decision variabes, e.g., server seection at each mapping node, and content pacement at each server. We anayticay prove that our decentraized agorithms converge to the optima soutions of the optimization probems, and vaidate their effectiveness in practice with rea-ife traffic traces. 1.4 Contributions This dissertation is about design, anaysis, and evauation of a set of wide-area traffic management soutions for coud service providers, incuding new ways to (i) share information among cooperating parties, (ii) jointy optimize over mutipe design variabes, and (iii) impement a decentraized soution for wide-area traffic contro. We present these soutions on three stages sharing information, joint contro, and decentraized impementation that today s CSPs can perform to maximize cient performance and minimize operationa costs: Sharing Information (Chapter 2). 
We study how a CSP overcomes limited visibility by incentivizing information sharing among different parties. In particular, we examine the cooperation between network routing and server selection. With the strong motivation for ISPs to provide content services, they are faced with the question of whether to stay with the current

design or to start sharing information. We develop three cooperation models with an increasing amount of information exchange between the ISP and the CDN. We show that straightforward ways to share information can still be quite sub-optimal, and propose a solution that is mutually beneficial for both the ISP and the CDN. This work sheds light on ways the ISP and the CDN can share information, starting from the current practice, to move towards a full cooperation that is unilaterally actionable, backward compatible, and incrementally deployable.

Joint Control (Chapter 3). Today, CSP traffic management decisions are made independently by administratively separate groups. We study how to enable joint control of multiple traffic management decisions. In particular, we consider a content delivery architecture based on geographically or administratively separate groups of last-mile servers (nano data centers) located within users' homes. We propose a set of mechanisms to manage joint content replication and request routing within this architecture, achieving both scalability and cost optimality. This work demonstrates an example of joint optimization over content placement and server selection decisions that may have varied resolutions (e.g., inter- vs. intra-ISP), and may happen at different timescales (e.g., minutes vs. several times a day).

Decentralized Implementation (Chapter 4). Today's cloud services are often offered to geographically distributed clients, leading to the need for a decentralized traffic management solution inside a CSP's network. We present DONAR, a distributed system that can offload the burden of server selection, while providing these services with a sufficiently expressive interface for specifying mapping policies. Our solution runs on a set of distributed mapping nodes for directing local client requests; it is simple and efficient, and requires only a lightweight exchange of summary statistics for coordination between mapping nodes.
This work exemplifies a decentralized design that is simultaneously scalable, accurate, and effective.

Through the examination of the three systems, we believe our solutions together shed light on the fundamental network support that CSPs should build for their online services. We summarize our key contributions as follows:

A timely study of prevalent cloud services and applications: We use three application examples to provide a set of wide-area networking solutions for CSPs, including: (i) cooperative server selection and traffic engineering for information sharing, (ii) content distribution among federated CDNs for joint control, and (iii) DONAR, a distributed server selection system, for decentralized implementation. While we do not enumerate every possible traffic-management task that today's CSPs have to handle, many of our solution techniques will be useful to CSPs who wish to deploy efficient, reliable, and scalable distributed services.

A mathematical framework for CSP traffic management: We provide a mathematical framework to formulate CSP traffic management, including server selection, network routing, and content placement. Such a framework also allows us to accurately define key performance and cost metrics that are common to a wide range of online services and applications.

Practical optimization formulation: We provide optimization formulations for CSPs to maximize performance or minimize cost. Our practical problem formulation also allows CSPs to freely express many policy and operational constraints that are often considered today.

Scalable solution architecture: We propose scalable traffic management solutions through separation of control and distributed algorithms. We design a set of simple system architectures for CSPs to implement these solutions through local measurements and message passing that involve a judicious amount of communication overhead.

Evaluation with real traffic traces: We evaluate the benefits of our proposed solutions through trace-driven simulations that employ real networks and traffic traces, including: (i) realistic backbone topologies of tier-1 ISPs, (ii) content download traces from the biggest BitTorrent network, Vuze, and (iii) client request traces from an operational CDN, CoralCDN. We demonstrate that our solutions are in practice effective, scalable, and adaptive.

Design, implementation, and deployment of system prototypes: Together with our collaborators, we apply the distributed algorithm design to DONAR, a decentralized server-selection service that is implemented, prototyped, and deployed on CoralCDN and the Measurement Lab.
Through live experiments with DONAR, we demonstrate that our solution performs well in the wild.

The advent of cloud computing presents new opportunities for traffic management across wide-area networks. While this problem has long existed since the birth of the Internet, today's cloud

service providers are faced with the dramatically increasing scale of content-centric, geographically distributed services that run across data centers, making it more significant. Today's CSPs usually rely on ad hoc techniques for controlling their traffic across data centers and to and from clients. This dissertation proposes a set of networking solutions for CSPs who wish to offer better performance to users at a lower cost. Our research develops new methods to make this process more systematic, and offers a new design paradigm for effective, automated techniques for scalable wide-area traffic control.

Chapter 2

Cooperative Server Selection and Traffic Engineering in an ISP Network

This chapter focuses on the challenge of sharing information among administratively separated groups that are in charge of different traffic management decisions [13]. Traditionally, ISPs make profit by providing Internet connectivity, while content providers (CPs) play the more lucrative role of delivering content to users. As network connectivity is increasingly a commodity, ISPs have a strong incentive to offer content to their subscribers by deploying their own content distribution infrastructure. Providing content services in an ISP network presents new opportunities for coordination between traffic engineering (to select efficient routes for the traffic) and server selection (to match servers with subscribers). In this work, we develop a mathematical framework that considers three models with an increasing amount of cooperation between the ISP and the CP. We show that separating server selection and traffic engineering leads to sub-optimal equilibria, even when the CP is given accurate and timely information about the ISP's network in a partial cooperation. More surprisingly, extra visibility may result in a less efficient outcome, and such performance degradation can be unbounded. Leveraging ideas from cooperative game theory, we propose an architecture based on the concept of the Nash bargaining solution. Simulations on realistic backbone topologies are performed to quantify the performance differences among the three models. Our results apply both when a network provider attempts to provide content, and when separate ISP and CP entities wish to cooperate. This study is a step toward a systematic understanding of the interactions between those who provide and operate networks and those who generate and distribute content.

2.1 Introduction

ISPs and CPs are traditionally independent entities. ISPs only provide connectivity, or the "pipes" to transport content. As in most transportation businesses, connectivity and bandwidth are becoming commodities and ISPs find their profit margin shrinking [14]. At the same time, content providers generate revenue by utilizing existing connectivity to deliver content to ISPs' customers. This motivates ISPs to host and distribute content to their customers. Content can be enterprise-oriented, like web-based services, or residential-based, like triple play as in AT&T's U-Verse [15] and Verizon FiOS [16] deployments. When ISPs and CPs operate independently, they optimize their performance without much cooperation, even though they influence each other indirectly. When ISPs deploy content services or seek cooperation with a CP, they face the question of how much can be gained from such cooperation and what kind of cooperation should be pursued.

A traditional service provider's primary role is to deploy infrastructure, manage connectivity, and balance traffic load inside its network. In particular, an ISP solves the traffic engineering (TE) problem, i.e., adjusting the routing configuration to the prevailing traffic. The goal of TE is to ensure efficient routing to minimize congestion, so that users experience low packet loss, high throughput, and low latency, and so that the network can gracefully absorb flash crowds.

To offer its own content service, an ISP replicates content over a number of strategically placed servers and directs requests to different servers.
The CP, whether as a separate business entity or as a new part of an ISP, solves the server selection (SS) problem, i.e., determining which servers should deliver content to each end user. The goal of SS is to meet user demand, minimize network latency to reduce user waiting time, and balance server load to increase throughput.

To offer both network connectivity and content delivery, an ISP is faced with coupled TE and SS problems, as shown in Figure 2.1. TE and SS interact because TE affects the routes that carry the CP's traffic, and SS affects the offered load seen by the network. Actually, the degrees of

freedom are also the mirror image of each other: the ISP controls the routing matrix, which is the constant parameter in the SS problem, while the CP controls the traffic matrix, which is the constant parameter in the TE problem.

[Figure 2.1: The interaction between traffic engineering (TE) and server selection (SS). Diagram labels: TE (minimize congestion), SS (minimize latency), routes, CP traffic, non-CP traffic.]

In this chapter, we study several approaches an ISP could take in managing traffic engineering and server selection, ranging from running the two systems independently to designing a joint system. We refer to the CP as the part of the system that manages server selection, whether it is performed directly by the ISP or by a separate company that cooperates with the ISP. This study allows us to explore a migration path from the status quo to different models of synergistic traffic management. In particular, we consider three scenarios with increasing amounts of cooperation between traffic engineering and server selection:

Model I: no cooperation (current practice).
Model II: improved visibility (sharing information).
Model III: a joint design (sharing control).

Model I. Content services could be provided by a CDN that runs independently on the ISP network. However, the CP has limited visibility into the underlying network topology and routing, and therefore has limited ability to predict user performance in a timely and accurate manner. We model a scenario where the CP measures the end-to-end latency of the network and greedily assigns each user to the servers with the lowest latency to the user, a strategy some CPs employ today [17]. We call this "SS with end-to-end info". In addition, TE assumes the offered traffic is

unaffected by its routing decisions, despite the fact that routing changes can affect path latencies and therefore the CP's traffic. When the TE problem and the SS problem are solved separately, their interaction can be modeled as a game in which they take turns to optimize their own networks and settle in a Nash equilibrium, which may not be Pareto optimal. Not surprisingly, performing TE and SS independently is often sub-optimal because (i) server selection is based on incomplete (and perhaps inaccurate) information about network conditions and (ii) the two systems, acting alone, may miss opportunities for a joint selection of servers and routes. Models II and III capture these two issues, allowing us to understand which factor is more important in practice.

Model II. Greater visibility into network conditions should enable the CP to make better decisions. There are, in general, four types of information that could be shared: (i) physical topology information [18, 19, ?], (ii) logical connectivity information, e.g., routing in the ISP network, (iii) dynamic properties of links, e.g., OSPF link weights, background traffic, and congestion level, and (iv) dynamic properties of nodes, e.g., bandwidth and processing power that can be shared. Our work focuses on a combination of these types of information, i.e., (i)-(iii), so that the CP is able to solve the SS problem more efficiently, i.e., to find the optimal server selection.

Sharing information requires minimal extensions to existing solutions for TE and SS, making it amenable to incremental deployment. Similar to the results in the parallel work [20], we observe and prove that TE and SS, each optimizing separately over its own variables, converge to a globally optimal solution when the two systems share the same objective and there is no background traffic. However, when the two systems have different or even conflicting performance objectives (e.g., SS minimizes end-to-end latency and TE minimizes congestion), the equilibrium is not optimal.
In addition, we find that Model II sometimes performs worse than Model I; that is, extra visibility into network conditions sometimes leads to a less efficient outcome, and the CP's latency degradation can be unbounded. The facts that both Model I and Model II in general do not achieve optimality, and that extra information (Model II) sometimes hurts performance, motivate us to consider next a clean-slate joint design for selecting servers and routes.

Model III. A joint design should achieve Pareto optimality for TE and SS. In particular, our joint design's objective function gives rise to the Nash bargaining solution [21]. The solution not only

guarantees efficiency, but also fairness between synergistic or even conflicting objectives of two players. It is a point on the Pareto optimal curve where both TE and SS have better performance compared to the Nash equilibrium. We then apply the optimization decomposition technique [22] so that the joint design can be implemented in a distributed fashion with a limited amount of information exchange.

The analytical and numerical evaluation of these three models allows us to gain insights for designing a cooperative TE and SS system, summarized in Table 2.1. The conventional approach of Model I requires minimum information passing, but suffers from sub-optimality and unfairness. Model II requires only minor changes to the CP's server selection algorithm, but the result is still not Pareto optimal and performance is not guaranteed to improve; it may even degrade in some cases. Model III ensures optimality and fairness through a distributed protocol, requires a moderate increase in information exchange, and is incrementally deployable. Our results show that letting the CP have some control over network routing is the key to effective TE and SS cooperation.

Table 2.1: Summary of results and engineering implications.
Model | Optimality | Information | Fairness | Arch. change
I | Large gap; not Pareto-optimal | No exchange; measurement only | No | Current practice
II | Not Pareto-optimal; special case global-optimal; more info. may hurt | Topology, capacity; routing; background traffic | No | Minor CP changes; better SS algorithm
III | Pareto-optimal; 5-30% improvement | Topology; link prices | Yes | Clean-slate design; incrementally deployable; CP given more control

We perform numerical simulations on realistic ISP topologies, which allow us to observe the performance gains and losses over a wide range of traffic conditions. The joint design shows significant improvement for both the ISP and the CP. The simulation results further reveal the impact of topologies on the efficiency and fairness of the three system models.
Our results apply both when a network provider attempts to provide content, and when separate ISP and CP entities wish to cooperate. For instance, an ISP playing both roles would find the optimality analysis useful for avoiding a low-efficiency operating region. A cooperating ISP and CP would appreciate the distributed implementation of the Nash bargaining solution, which allows for an incremental deployment.
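As a rough illustration of how a Nash bargaining solution singles out one point on a Pareto curve, the sketch below uses a made-up cost frontier that is not taken from this chapter's experiments. It assumes the standard bargaining objective of maximizing the product of each player's improvement over a disagreement point, with the Nash equilibrium playing that role. The function name and all numbers are hypothetical.

```python
def nash_bargaining_point(frontier, disagreement):
    """Return the (te_cost, ss_cost) pair that maximizes the product of gains.

    frontier: iterable of candidate (te_cost, ss_cost) pairs (hypothetical data).
    disagreement: (te_cost, ss_cost) at the fallback point, e.g. a Nash equilibrium.
    """
    te0, ss0 = disagreement
    # Only points that improve on the disagreement point for both players count.
    candidates = [(te, ss) for te, ss in frontier if te < te0 and ss < ss0]
    if not candidates:
        return disagreement
    return max(candidates, key=lambda p: (te0 - p[0]) * (ss0 - p[1]))


# Toy Pareto curve (an assumption): lowering the TE cost t raises the SS cost 1/t.
frontier = [(0.5 + 0.01 * k, 1.0 / (0.5 + 0.01 * k)) for k in range(151)]
te_star, ss_star = nash_bargaining_point(frontier, disagreement=(2.5, 2.5))
```

Because this toy frontier and disagreement point are symmetric in the two players, the selected point treats them equally, which is the fairness property the text attributes to the bargaining solution.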

The rest of this chapter is organized as follows. Section 2.2 presents a standard model for traffic engineering. Section 2.3 presents our two models for server selection, when given minimal information (i.e., Model I) and more information (i.e., Model II) about the underlying network. Section 2.4 studies the interaction between TE and SS as a game and shows that they reach a Nash equilibrium. Section 2.5 analyzes the efficiency loss of Model I and Model II in general. We show that the Nash equilibria achieved in both models are not Pareto optimal. In particular, we show that more information is not always helpful. Section 2.6 discusses how to jointly optimize TE and SS by implementing a Nash bargaining solution. We propose an algorithm that allows practical and incremental implementation. We perform large-scale numerical simulations on realistic ISP topologies in Section 2.7. Finally, Section 2.8 presents related work, and Section 2.9 concludes the chapter and discusses our future work.

2.2 Traffic Engineering (TE) Model

In this section, we describe the network model and formulate the optimization problem that the standard TE model solves. We also start introducing the notation used in this work, which is summarized in Table 2.2.

Consider a network represented by graph $G = (V, E)$, where $V$ denotes the set of nodes and $E$ denotes the set of directed physical links. A node can be a router, a host, or a server. Let $x_{ij}$ denote the rate of flow $(i, j)$, from node $i$ to node $j$, where $i, j \in V$. Flows are carried on end-to-end paths consisting of some links. One way of modeling routing is $W = \{w_{lp}\}$, i.e., $w_{lp} = 1$ if link $l$ is on path $p$, and 0 otherwise. We do not limit the number of paths, so $W$ can include all possible paths, but in practice it is often pruned to include only paths that actually carry traffic. The capacity of a link $l \in E$ is $C_l > 0$.

Given the traffic demand, traffic engineering changes routing to minimize network congestion.
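The path-based routing model can be made concrete with a small sketch. It assumes a hypothetical three-link network with two paths, and computes each link's load as the sum of the flows on the paths that use it, using the link-on-path indicator defined above. Link names, path names, rates, and capacities are all made up for illustration.

```python
def link_loads(w, path_flow):
    """Sum path flows onto links: load[l] = sum over p of w[l][p] * path_flow[p]."""
    return {
        link: sum(on_path * path_flow[p] for p, on_path in row.items())
        for link, row in w.items()
    }


# w[l][p] = 1 if link l is on path p, else 0 (hypothetical topology).
w = {
    "a": {"p1": 1, "p2": 0},
    "b": {"p1": 1, "p2": 1},
    "c": {"p1": 0, "p2": 1},
}
path_flow = {"p1": 2.0, "p2": 3.0}         # traffic rate carried on each path
capacity = {"a": 4.0, "b": 6.0, "c": 4.0}  # every link capacity is positive

loads = link_loads(w, path_flow)
utilization = {l: loads[l] / capacity[l] for l in loads}
```

Link "b" is shared by both paths, so it ends up with the highest utilization. This is the kind of hot spot that a congestion-minimizing routing would try to relieve.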
In practice, network operators control routing either by changing OSPF (Open Shortest Path First) link weights [23] or by establishing MPLS (Multiprotocol Label Switching) label-switched paths [24]. In this chapter we use the multi-commodity flow solution to route traffic, because (a) it is optimal, i.e., it gives the routing with minimum congestion cost, and (b) it can be realized by routing protocols that use MPLS tunneling, or, as recently shown, in a distributed fashion by a

new link-state routing protocol, PEFT [25]. Let $r_l^{ij} \in [0, 1]$ denote the proportion of traffic of flow $(i, j)$ that traverses link $l$. To realize the multi-commodity flow solution, the network splits each flow over a number of paths. Let $R = \{r_l^{ij}\}$ be the routing matrix. Let $f_l$ denote the total traffic traversing link $l$, and we have $f_l = \sum_{(i,j)} x_{ij} r_l^{ij}$.

Table 2.2: Summary of key notation.
Notation | Interpretation
$G$ | Network graph $G = (V, E)$. $V$ is the set of nodes, $E$ is the set of links
$S$ | $S \subset V$, the set of CP servers
$T$ | $T \subset V$, the set of users
$C_l$ | Capacity of link $l$
$r_l^{ij}$ | Proportion of flow $i \to j$ traversing link $l$
$R$ | The routing matrix $R : \{r_l^{ij}\}$, TE variable
$R_{bg}$ | Background routing matrix $\{r_l^{ij}\}_{(i,j) \notin S \times T}$
$X$ | Traffic matrix of all communication pairs, $X = \{x_{ij}\}_{(i,j) \in V \times V}$
$x_{st}$ | Traffic rate from server $s$ to user $t$
$X_{cp}$ | $X_{cp} = \{x_{st}\}_{(s,t) \in S \times T}$, SS variable
$M_t$ | User $t$'s demand rate for content
$B_s$ | Service capacity of server $s$
$x_l^{st}$ | The amount of traffic for the $(s, t)$ pair on link $l$
$\hat{X}_{cp}$ | $\hat{X}_{cp} = \{x_l^{st}\}_{(s,t) \in S \times T}$, the generalized SS variable
$f_l^{cp}$ | CP's traffic on link $l$
$f_l^{bg}$ | Background traffic on link $l$
$f_l$ | $f_l = f_l^{cp} + f_l^{bg}$, total traffic on link $l$; $f = \{f_l\}_{l \in E}$
$D_p$ | Delay of path $p$
$D_l$ | Delay of link $l$
$g(\cdot)$ | Cost function used in ISP traffic engineering
$h(\cdot)$ | Cost function used in CP server selection

Now traffic engineering can be formulated as the following optimization problem:

\begin{align}
\mathbf{TE}(X):\quad \text{minimize}\quad & TE = \sum_{l} g_l(f_l) \tag{2.1a}\\
\text{subject to}\quad & f_l = \sum_{(i,j)} x_{ij}\, r_l^{ij} \le C_l, \quad \forall l \tag{2.1b}\\
& \sum_{l \in \mathrm{In}(v)} r_l^{ij} - \sum_{l \in \mathrm{Out}(v)} r_l^{ij} = I_{v=j}, \quad \forall (i,j),\ \forall v \in V \setminus \{i\} \tag{2.1c}\\
\text{variables}\quad & 0 \le r_l^{ij} \le 1, \quad \forall (i,j),\ \forall l \tag{2.1d}
\end{align}

where $g_l(\cdot)$ represents a link's congestion cost as a function of the load, $I_{v=j}$ is an indicator function which equals 1 if $v = j$ and 0 otherwise, $\mathrm{In}(v)$ denotes the set of incoming links to node $v$, and $\mathrm{Out}(v)$ denotes the set of outgoing links from node $v$.

In this model, TE does not differentiate between the CP's traffic and background traffic. In fact, TE assumes a constant traffic matrix $X$, i.e., the offered load between each pair of nodes, which can either be a point-to-point background traffic flow, or a flow from a CP's server to a user. As we will see later, this common assumption is undermined when the CP performs dynamic server selection.

For computational tractability, ISPs usually consider cost functions $g_l(\cdot)$ that are convex, continuous, and non-decreasing. By using such an objective, TE penalizes high link utilization and balances load inside the network. We follow this approach and discuss the analytical form of $g_l(\cdot)$ that ISPs use in practice in a later section.

2.3 Server Selection (SS) Models

While traffic engineering usually assumes that the traffic matrix is point-to-point and constant, both assumptions are violated when some or all of the traffic is generated by the CP. A CP usually has many servers that offer the same content, and the servers selected for each user depend on the network conditions. In this section, we present two novel CP models which correspond to Models I and II introduced in Section 2.1. The first one models the current CP operation, where the CP relies on end-to-end measurement of the network condition in order to make server selection decisions; the second one models the situation when the CP obtains enough information from the ISP to calculate the effect of its actions.

Server Selection Problem

The CP solves the server selection problem to optimize the perceived performance of all of its users. We first introduce the notation used in modeling server selection.
In the ISP's network, let $S \subset V$ denote the set of the CP's servers, which are strategically placed at different locations in the network. For simplicity we assume that all content is duplicated at all servers; our results can be extended to the general case. Let $T \subset V$ denote the set of users who request content from the

servers. A user $t \in T$ has a demand for content at rate $M_t$, which we assume to be constant during the time a CP optimizes its server selection. We allow a user to simultaneously download content from multiple servers, because node $t$ can be viewed as an edge router in the ISP's network that aggregates the traffic of many end hosts, which may be served by different servers.

To differentiate the CP's traffic from background traffic, we denote by $x_{st}$ the traffic rate from server $s$ to user $t$. To satisfy the traffic demand, we need

\[ \sum_{s \in S} x_{st} = M_t. \]

In addition, the total amount of traffic aggregated at a server $s$ is limited by its service capacity $B_s$, i.e.,

\[ \sum_{t \in T} x_{st} \le B_s. \]

We denote $X_{cp} = \{x_{st}\}_{s \in S, t \in T}$ as the CP's decision variable.

One of the goals in server selection is to optimize the overall performance of the CP's customers. We use an additive link cost for the CP based on latency models, i.e., each link has a cost, and the end-to-end path cost is the sum of the link costs along the way. As an example, suppose the content is delay-sensitive (e.g., IPTV), and the CP would like to minimize the average or total end-to-end delay of all its users. Let $D_p$ denote the end-to-end latency of a path $p$, and $D_l(f_l)$ denote the latency of link $l$, modeled as a convex, non-decreasing, and continuous function of the amount of flow $f_l$ on the link. By definition, $D_p(f) = \sum_{l \in p} D_l(f_l)$. By making the $x_{st}$ decisions, the CP implicitly decides the network flow, and as a consequence the overall latency experienced by the CP's users, which can be rewritten as:

\begin{align}
SS &= \sum_{(s,t)} \sum_{p \in P(s,t)} x_p^{st} \cdot D_p(f) \notag\\
&= \sum_{(s,t)} \sum_{p \in P(s,t)} x_p^{st} \sum_{l \in p} D_l(f_l) \notag\\
&= \sum_{l} D_l(f_l) \sum_{(s,t)} \sum_{p \in P(s,t):\, l \in p} x_p^{st} \notag\\
&= \sum_{l} f_l^{cp} \cdot D_l(f_l) \tag{2.2}
\end{align}

where $P(s, t)$ is the set of paths serving flow $(s, t)$ and $x_p^{st}$ is the amount of flow $(s, t)$ traversing path $p \in P(s, t)$. Let $h_l(\cdot)$ represent the cost of link $l$, which we assume is convex, non-decreasing, and continuous. In this example, $h_l(f_l^{cp}, f_l) = f_l^{cp} \cdot D_l(f_l)$. Thus, the link cost $h_l(\cdot)$ is a function of the CP's total traffic $f_l^{cp}$ on the link, as well as the link's total traffic $f_l$, which also includes background traffic.

Expression (2.2) provides a simple way to calculate the total user-experienced end-to-end delay (simply sum over all the links), but it requires knowledge of the load on each link, which is possible only in Model II. Without such knowledge (Model I), the CP can rely only on end-to-end measurement of delay.

Server Selection with End-to-end Info: Model I

In today's Internet architecture, a CP does not have access to an ISP's network information, such as topology, routing, link capacity, or background traffic. Therefore a CP relies on measured or inferred information to optimize its performance. To minimize its users' latencies, for instance, a CP can assign each user to the servers with the lowest (measured) end-to-end latency to the user. In practice, the server selection algorithms of content distribution networks like Akamai are based on this principle [17]. We call it "SS with end-to-end info" and use it as our first model.

The CP monitors the latencies from all servers to all users, and makes server selection decisions to minimize users' total delay. Since the demand of a user can be arbitrarily divided among the servers, we can think of the CP as greedily assigning each infinitesimal demand to the best server. The placement of this traffic may change the path latency, which is monitored by the CP.
Thus, at the equilibrium, the servers which send (non-zero) traffic to a user should have the same end-to-end latency to the user, because otherwise the server with lower latency would be assigned more demand, causing its latency to increase; and the servers not sending traffic to a user should have higher latency than those that serve the user. This is sometimes called the Wardrop equilibrium [26]. The SS model with end-to-end info is very similar to selfish routing [27, 28], where each flow tries to minimize its average latency over multiple paths without coordinating with other flows. It is known that the equilibrium point in selfish routing can be viewed as the solution to a global convex optimization problem [27]. Therefore, SS with end-to-end info has a unique equilibrium point under mild assumptions.
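The equal-latency condition can be checked on a two-server toy instance. The sketch below is illustrative only. It assumes one user, one single-link path per server, no background traffic, and hypothetical linear latencies D_s(x) = a_s + b_s * x. Under those assumptions the greedy equilibrium minimizes the sum of the integrals of the latencies, i.e. a*x + (b/2)*x^2 per server, while the delay-optimal split minimizes the sum of x * D(x), i.e. a*x + b*x^2. The helper names are not from the dissertation.

```python
def split_demand(demand, a1, c1, a2, c2):
    """Minimize a1*x1 + c1*x1**2 + a2*x2 + c2*x2**2 with x1 + x2 = demand, x >= 0.

    Requires c1 + c2 > 0. Returns (x1, x2).
    """
    # Stationarity: a1 + 2*c1*x1 = a2 + 2*c2*(demand - x1).
    x1 = (a2 - a1 + 2.0 * c2 * demand) / (2.0 * (c1 + c2))
    x1 = min(max(x1, 0.0), demand)  # clip to the feasible interval
    return x1, demand - x1


def total_delay(x, a, b):
    """Total user delay: the sum over servers of x_s * D_s(x_s)."""
    return sum(xi * (ai + bi * xi) for xi, ai, bi in zip(x, a, b))


# Hypothetical latencies: server 1 has constant latency 1, server 2 has latency x.
a, b, demand = (1.0, 0.0), (0.0, 1.0), 1.0

# Greedy (equal-latency) equilibrium: quadratic coefficient is b/2.
greedy = split_demand(demand, a[0], b[0] / 2, a[1], b[1] / 2)
# Delay-optimal split: quadratic coefficient is b.
optimal = split_demand(demand, a[0], b[0], a[1], b[1])

greedy_delay = total_delay(greedy, a, b)
optimal_delay = total_delay(optimal, a, b)
```

In this instance the greedy rule sends all demand to server 2, whose latency then matches server 1's, whereas the delay-optimal split uses both servers and achieves a lower total delay. This is one simple way to see why an equilibrium reached from end-to-end measurements alone need not minimize total delay.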

Although the equilibrium point is well-defined and is the solution to a convex optimization problem, in general it is hard to compute the solution analytically. Thus we leverage the idea of Q-learning [29] to implement a distributed iterative algorithm to find the equilibrium of SS with end-to-end info. The algorithm is guaranteed to converge even under dynamic network environments with cross traffic and link failures, and hence can be used in practice by CPs. The detailed description and implementation can be found in [30]. We show in Section 2.5 that SS with end-to-end info is sub-optimal. We use it as a baseline for how well a CP can do with only the end-to-end latency measurements.

Server Selection with Improved Visibility: Model II

We now describe how a CP can optimize server selection given complete visibility into the underlying network, but not into the ISP objective. That is, this is the best the CP can do without changing the routing in the network. We also present an optimization formulation that allows us to analytically study its performance.

Suppose that content providers are able to either obtain information on network conditions directly from the ISP, or infer it through their measurement infrastructure. In the best case, the CP is able to obtain complete information about the network, i.e., routing decisions and link latency. This situation is characterized by problem (2.3). To optimize the overall user experience, the CP solves the following cost minimization problem:

\begin{align}
\mathbf{SS}(R):\quad \text{minimize}\quad & SS = \sum_{l} h_l(f_l^{cp}, f_l) \tag{2.3a}\\
\text{subject to}\quad & f_l^{cp} = \sum_{(s,t)} x_{st}\, r_l^{st}, \quad \forall l \tag{2.3b}\\
& f_l = f_l^{cp} + f_l^{bg} \le C_l, \quad \forall l \tag{2.3c}\\
& \sum_{s \in S} x_{st} = M_t, \quad \forall t \tag{2.3d}\\
& \sum_{t \in T} x_{st} \le B_s, \quad \forall s \tag{2.3e}\\
\text{variables}\quad & x_{st} \ge 0, \quad \forall (s,t) \tag{2.3f}
\end{align}

where we denote $f_l^{bg} = \sum_{(i,j) \ne (s,t)} x_{ij} r_l^{ij}$ as the non-CP traffic on link $l$, which is a parameter to the optimization problem. If the cost function $h_l(\cdot)$ is increasing and convex in the variable $f_l^{cp}$, one can verify that (2.3) is a convex optimization problem, and hence has a unique global optimal value. To ease our presentation, we relax the server capacity constraint by assuming very large server bandwidth caps in the remainder of this chapter. Our results hold in the general case when the server constraints exist.

SS with improved visibility (2.3) is amenable to an efficient implementation. The problem can either be solved centrally, e.g., at the CP's central coordinator, or via a distributed algorithm similar to that used for Model I. We solve (2.3) centrally in our simulations, since we are more interested in the performance improvement brought by complete information than in any particular algorithm for implementing it.

2.4 Analyzing TE-SS Interaction

In this section, we study the interaction between the ISP and the CP when they operate independently without coordination in both Model I and Model II, using a game-theoretic model. The game formulation allows us to analyze the stability condition, i.e., we show that alternating TE and SS optimizations will reach an equilibrium point. In addition, we find that when the ISP and the CP optimize the same system objective, their interaction achieves global optimality under Model II. Results in this section are also found in a parallel work [20].

TE-SS Game and Nash Equilibrium

We start with the formulation of a two-player non-cooperative Nash game that characterizes the TE-SS interaction.

Definition 1. The TE-SS game consists of a tuple $\langle N, A, U \rangle$. The player set is $N = \{isp, cp\}$. The action sets are $A_{isp} = \{R\}$ and $A_{cp} = \{X_{cp}\}$, where the feasible sets of $R$ and $X_{cp}$ are defined by the constraints in (2.1) and (2.3), respectively. The utility functions are $U_{isp} = TE$ and $U_{cp} = SS$.

Figure 2.1 shows the interaction between SS and TE.
In both Model I and Model II, the ISP chooses the best response strategy, i.e., the ISP always optimizes (2.1) given the CP's strategy $X_{cp}$.

Similarly, the CP chooses the best response strategy in Model II by solving (2.3). However, the CP's strategy in Model I is not the best response, since it is not able to optimize the objective (2.3) due to poor network visibility. Indeed, the utility the CP implicitly optimizes in SS with end-to-end info is [27]

\[ U_{cp} = \sum_{l \in E} \int_0^{f_l} D_l(u)\, du. \]

This later helps us understand the stability conditions of the game.

Consider a particular game procedure in which the ISP and the CP take turns to optimize their own objectives by varying their own decision variables, treating that of the other player as constant. Specifically, in the $(k + 1)$-th iteration, we have

\begin{align}
R^{(k+1)} &= \operatorname*{argmin}_{R}\ TE\big(X_{cp}^{(k)}\big) \tag{2.4a}\\
X_{cp}^{(k+1)} &= \operatorname*{argmin}_{X_{cp}}\ SS\big(R^{(k+1)}\big) \tag{2.4b}
\end{align}

Note that the two optimization problems may be solved at different timescales. The ISP runs traffic engineering at the timescale of hours, although it could run on a much smaller timescale. Depending on the CP's design choices, server selection is optimized a few times a day, or at a smaller timescale such as the seconds or minutes of a typical content transfer. We assume that each player has fully solved its optimization problem before the other one starts.

Next we prove the existence of a Nash equilibrium of the TE-SS game. We establish the stability condition when the two players use general cost functions $g_l(\cdot)$ and $h_l(\cdot)$ that are continuous, non-decreasing, and convex. While TE's formulation is the same in Model I and Model II, we consider the two SS models, i.e., SS with end-to-end info and SS with improved visibility.

Theorem 2. The TE-SS game has a Nash equilibrium for both Model I and Model II.

Proof. It suffices to show that (i) each player's strategy space is a nonempty compact convex subset, and (ii) each player's utility function is continuous and quasi-concave on its strategy space, and then follow the standard proof in [31]. The ISP's strategy space is defined by the constraint set of (2.1), which consists of affine equalities and inequalities, and is hence a convex compact set.
Since g ( ) is continuous and convex, we can easiy verify that the objective function (2.1a) is quasi-convex on R = {r ij }. CP s strategy space is defined by the constraint set of (2.3), which is aso convex and compact. 25

Similarly, if $h_l(f_l^{cp})$ is continuous and convex, the objective function (2.3a) is quasi-convex on $X_{cp}$. In particular, consider the special case in which the CP minimizes latency (2.2). When the CP solves SS with end-to-end info, $h_l(f_l) = \int_0^{f_l} D_l(u)\,du$. When the CP solves SS with improved visibility, $h_l(f_l^{cp}) = f_l^{cp} D_l(f_l)$. In both cases, if $D_l(\cdot)$ is continuous, non-decreasing, and convex, so is $h_l(\cdot)$.

While there exists a Nash equilibrium, it does not guarantee that alternating optimizations (2.4) lead to one. In Section 2.7 we demonstrate the convergence of alternating optimizations through simulation. In general, the Nash equilibrium may not be unique, in terms of both decision variables and objective values. Next, we discuss a special case where the Nash equilibrium is unique and can be attained by alternating optimizations (2.4).

Global Optimality under Same Objective and Absence of Background Traffic

In the following, we consider a special case of the TE-SS game, in which the ISP and the CP optimize the same objective function, i.e., $g_l(\cdot) = h_l(\cdot)$, so
\[ TE = SS = \sum_{l} \Phi_l(f_l), \tag{2.5} \]
when there is no background traffic. One example is when the network carries only the CP traffic, and both the ISP and the CP aim to minimize the average traffic latency, i.e., $\Phi_l(f_l) = f_l D_l(f_l)$. An interesting question that naturally arises is whether the two players' alternating best response to each other's decision can lead to a socially optimal point.

Define a notion of global optimum, which is the optimal point of the following optimization problem.

TESS-special
\begin{align}
\text{minimize} \quad & \sum_{l} \Phi_l(f_l) \tag{2.6a} \\
\text{subject to} \quad & f_l = \sum_{(s,t)} x_l^{st} \le C_l, \quad \forall l \tag{2.6b} \\
& \sum_{s \in S} \Big( \sum_{l \in In(v)} x_l^{st} - \sum_{l \in Out(v)} x_l^{st} \Big) = M_t \cdot I_{v=t}, \quad \forall v \notin S, \ \forall t \in T \tag{2.6c} \\
\text{variables} \quad & x_l^{st} \ge 0, \quad \forall (s,t) \tag{2.6d}
\end{align}
where $x_l^{st}$ denotes the traffic rate for flow $(s,t)$ delivered on link $l$. The variable $x_l^{st}$ allows a global coordinator to route a user's demand from any server in any way it wants, thus problem (2.6) establishes an upper bound on how well one can do to minimize the traffic latency. Note that $x_l^{st}$ captures both $R$ and $X_{cp}$, which offers more degrees of freedom for a joint routing and server-selection problem. Its mathematical properties will be discussed further below.

The special case TE-SS game (2.5) has a Nash equilibrium, as shown in Theorem 2. The Nash equilibrium may not be unique in general. This is because when there is no traffic between a server-user pair, the TE routing decision for this pair can be arbitrary without affecting its utility. In the worst case, a Nash equilibrium can be arbitrarily suboptimal relative to the global optimum. Suppose there exists non-zero traffic demand between any server-user pair as considered in [20], e.g., by appending an infinitesimally small traffic load to every server-user pair. We show that alternating optimizations (2.4) reach a unique Nash equilibrium, which is also an optimal solution to (2.6). TE-SS interaction does not sacrifice any efficiency in this special case, and the optimal operating point can be achieved by unilateral iterative best response, without the need for global coordination. This result is shown in [20], in which the idea is to prove the equivalence of the Nash equilibrium and the global optimum. We show an alternative proof by considering alternating projections of variables onto a convex set, which is presented as follows.

Consider the following optimization problem:
\begin{align}
\text{minimize} \quad & \sum_{l} \Phi_l(f_l) \tag{2.7a} \\
\text{subject to} \quad & f_l = \sum_{(s,t)} x_{st}\, r_l^{st} \le C_l, \quad \forall l \tag{2.7b} \\
& \sum_{s \in S} x_{st} = M_t, \quad \forall t \tag{2.7c} \\
& \sum_{l \in In(v)} r_l^{st} - \sum_{l \in Out(v)} r_l^{st} = I_{v=t}, \quad \forall (s,t), \ \forall v \in V \setminus \{s\} \tag{2.7d} \\
\text{variables} \quad & 0 \le r_l^{st} \le 1, \quad x_{st} \ge 0 \tag{2.7e}
\end{align}
The alternating optimizations (2.4) in the special case TE-SS game solve (2.7) by applying the non-linear Gauss-Seidel algorithm [32], which consists of alternating projections of one variable onto the steepest descending direction while keeping the other fixed. Note that (2.7) is a non-convex problem, since it involves the product of two variables $r_l^{st}$ and $x_{st}$. However, we next show that it is equivalent to the convex problem (2.6).

Lemma 3. The non-convex problem (2.7) that the TE-SS game solves is equivalent to the convex problem (2.6).

Proof. We show that there is a one-to-one mapping between the feasible solutions of (2.6) and (2.7). Consider a feasible solution $\{x_l^{st}\}$ of (2.6). Let $x_{st} = \sum_{l \in In(t)} x_l^{st} - \sum_{l \in Out(t)} x_l^{st}$ and $r_l^{st} = x_l^{st} / x_{st}$ if $x_{st} \neq 0$. To avoid the case of $x_{st} = 0$, suppose there is infinitesimally small background traffic for every $(s,t)$ pair, so the one-to-one mapping holds. On the other hand, for each feasible solution $\{x_{st}, r_l^{st}\}$ of (2.7), let $x_l^{st} = x_{st}\, r_l^{st}$. It is easy to see that for every feasible solution $\{x_l^{st}\}$, the derived $\{x_{st}, r_l^{st}\}$ is also feasible, and vice versa. Since the two problems have the same objective function, they are equivalent.

Lemma 4. The Nash equilibrium of the TE-SS game is an optimal solution to (2.6).

Proof. The key idea of the proof is to check the KKT conditions [12] of the TE and SS optimization problems at Nash equilibrium (step I), and show that they also satisfy the KKT condition of the global problem (2.6) (step II).

Step I: Consider a feasible solution $\{x_{st}, r_l^{st}\}$ at Nash equilibrium, i.e., each one is the best response of the other. To assist our proof, we define $\phi_l(f_l) = \Phi_l'(f_l)$ as the marginal cost of link $l$.

We first show the optimality condition of SS. Let $\phi_{st} = \sum_{l} \phi_l(f_l)\, r_l^{st}$, which denotes the marginal cost of the $(s,t)$ pair. By the definition of Nash equilibrium, for any $s$ such that $x_{st} > 0$, we have $\phi_{st} \le \phi_{s't}$ for any $s' \in S$, by inspecting the KKT condition of the SS optimization. This implies that servers with positive rate have the same marginal latency, which is no greater than that of servers with zero rate. Let $\phi_t = \phi_{st}$ for all $x_{st} > 0$.

We next show the optimality condition of TE. Consider an $(s,t)$ server-user pair. Let $\delta_{sv}$ denote the average marginal cost from node $s$ to $v$, which can be recursively defined as
\[ \delta_{sv} = \begin{cases} \dfrac{\sum_{l=(u,v) \in In(v)} (\delta_{su} + \phi_l)\, r_l^{st}}{\sum_{l \in In(v)} r_l^{st}} & \text{if } v \neq s \\[2ex] 0 & \text{if } v = s \end{cases} \]
The KKT condition of the TE optimization is: for all $v \in V$ and all $l = (u,v),\ l' = (u',v) \in In(v)$, $r_l^{st} > 0$ implies $\delta_{su} + \phi_l \le \delta_{su'} + \phi_{l'}$. In other words, for any node $v$, the marginal cost accumulated from any incoming link with positive flow is equal, and no greater than that of incoming links with zero flow. So we can define $\delta_{sv} = \delta_{su} + \phi_l$ for all $l = (u,v) \in In(v)$ with $r_l^{st} > 0$. In fact, $\delta_{st} = \sum_{l=(u,t)} (\delta_{su} + \phi_l)\, r_l^{st} = \sum_{l} \phi_l\, r_l^{st} = \phi_{st}$, by inspecting flow conservation at each node and the fact that any $(s,t)$ path has the same marginal latency as observed above. Combining the two KKT conditions gives us the necessary and sufficient condition for Nash equilibrium:
\begin{align}
\delta_{su} + \phi_l &\le \delta_{su'} + \phi_{l'}, && \text{if } r_l^{st} > 0, \quad \forall s \in S, \ \forall v \in V, \ \forall l, l' \in In(v) \notag \\
\delta_{st} &\le \delta_{s't}, && \text{if } x_{st} > 0, \quad \forall s, s' \in S \tag{2.8}
\end{align}
An intuitive explanation is to consider the marginal latency of any path $p$ that is realized by the routing decision. Let $P(t)$ be the set of paths that connect all possible servers and the user $t$. Let $\phi_p = \sum_{l \in p} \phi_l$. A path $p$ is active if $r_l^{st} > 0$ for all $l \in p$, which means there is a positive flow between $(s,t)$. Then the above condition can be translated into the following argument: for any paths $p, p' \in P(t)$, $\phi_p \le \phi_{p'}$ if $p$ is active. In other words, all active paths have the same marginal latency, which is no greater than that of non-active paths.

Step II: We show the KKT condition of (2.6). Let $\{x_l^{st}\}$ be an optimal solution to (2.6). Similarly, we define the marginal latency from node $s$ to $v$ as
\[ \Delta_{sv} = \begin{cases} \dfrac{\sum_{l=(u,v) \in In(v)} (\Delta_{su} + \phi_l)\, x_l^{st}}{\sum_{l \in In(v)} x_l^{st}} & \text{if } v \neq s \\[2ex] 0 & \text{if } v = s \end{cases} \]
The KKT condition of (2.6) is the following:
\begin{align}
\Delta_{su} + \phi_l &\le \Delta_{su'} + \phi_{l'}, && \text{if } x_l^{st} > 0, \quad \forall s \in S, \ \forall v \in V, \ \forall l, l' \in In(v) \notag \\
\Delta_{st} &\le \Delta_{s't}, && \text{if } x_l^{st} > 0 \text{ for some } l, \quad \forall s, s' \in S \tag{2.9}
\end{align}
One can readily check the equivalence of conditions (2.8) and (2.9). To be more specific, suppose $\{x_{st}, r_l^{st}\}$ is a Nash equilibrium that satisfies (2.8); we can construct $\{x_l^{st}\}$ as discussed in the proof of Lemma 3, which also satisfies (2.9), and vice versa.

Lemma 5. The alternating TE and SS optimizations (2.4) converge to a Nash equilibrium of the TE-SS game.

Proof. Consider the objective of (2.7), which serves as a Lyapunov function. Since the two players have the same utility function, and each step of (2.4) is one player's best response with the other's decision fixed, the trajectory of the objective value is a non-increasing sequence. In addition, the objective of (2.7) is lower-bounded by the optimal value of (2.6). Therefore, there exists a limit point of the sequence. It remains to show that this limit point is indeed an equilibrium.

Consider the sequence $(X_{cp}^{(k)}, R^{(k)})$, where $k = 1, 2, \ldots$ is the step index. Denote the limit point of the objective (2.7a) by $\Phi^*$. Since the feasible space of $\{X_{cp}, R\}$ is compact, there exists a limit point $(X_{cp}^*, R^*)$ as $k \to \infty$. By the definition of (2.7a), $X_{cp}^* = \arg\min_{X_{cp}} SS(R^*)$. To show that $R^*$ is also a best response to $X_{cp}^*$, suppose by contradiction that there exists $\tilde{R}$ such that with $X_{cp}^*$ there is a lower objective value $\tilde{\Phi} < \Phi^*$. By the continuity of the objective function, there exists a large $k$ such that $(\tilde{R}, X_{cp}^{(k)})$ results in an objective value $\hat{\Phi}$ within an $\epsilon$-ball of $\tilde{\Phi}$ for arbitrarily small $\epsilon$, namely, $\hat{\Phi} \le \tilde{\Phi} + \epsilon < \Phi^*$. Therefore, the best response to $X_{cp}^{(k)}$ results in an objective value no greater than $\tilde{\Phi} + \epsilon$, which is less than $\Phi^*$. This contradicts the fact that the sequence is lower-bounded by $\Phi^*$. Thus we complete the proof that $(X_{cp}^*, R^*)$ is a Nash equilibrium.
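The alternating best responses (2.4) and the monotone objective sequence used in the proof of Lemma 5 can be illustrated on a toy instance. The following is a minimal sketch under stated assumptions, not the dissertation's implementation: one user with demand M = 2, two servers, two disjoint parallel links per server, and hypothetical affine link latencies $D(f) = a + bf$, so that the shared objective is $\sum_l f_l D_l(f_l)$. All numbers and function names are illustrative. Because each block subproblem is a one-dimensional convex quadratic, each best response has a closed form.

```python
# Toy alternating TE/SS optimization with a shared objective (hypothetical data).
# PATHS[s] holds (a, b) latency coefficients for the two links leaving server s.
PATHS = {0: ((1.0, 1.0), (2.0, 1.0)), 1: ((1.5, 2.0), (1.5, 0.5))}
DEMAND = 2.0  # user demand M_t


def clip(value, low, high):
    return max(low, min(high, value))


def total_cost(x, r):
    """Shared objective: sum over links of f * D(f), with D(f) = a + b * f."""
    cost = 0.0
    for s in (0, 1):
        (a1, b1), (a2, b2) = PATHS[s]
        f1, f2 = x[s] * r[s], x[s] * (1.0 - r[s])
        cost += f1 * (a1 + b1 * f1) + f2 * (a2 + b2 * f2)
    return cost


def te_step(x, r):
    """TE best response: split each server's rate over its two links, x fixed."""
    new_r = list(r)
    for s in (0, 1):
        if x[s] > 0.0:  # with zero rate the split is arbitrary; keep it unchanged
            (a1, b1), (a2, b2) = PATHS[s]
            stationary = (a2 - a1 + 2.0 * b2 * x[s]) / (2.0 * x[s] * (b1 + b2))
            new_r[s] = clip(stationary, 0.0, 1.0)
    return new_r


def ss_step(r):
    """SS best response: divide the demand between the servers, r fixed."""
    alpha, beta = [], []
    for s in (0, 1):
        (a1, b1), (a2, b2) = PATHS[s]
        alpha.append(r[s] * a1 + (1.0 - r[s]) * a2)
        beta.append(b1 * r[s] ** 2 + b2 * (1.0 - r[s]) ** 2)
    stationary = (alpha[1] - alpha[0] + 2.0 * beta[1] * DEMAND) / (2.0 * (beta[0] + beta[1]))
    x0 = clip(stationary, 0.0, DEMAND)
    return [x0, DEMAND - x0]


def alternate(iterations=60):
    x, r = [1.0, 1.0], [0.5, 0.5]
    history = [total_cost(x, r)]
    for _ in range(iterations):
        r = te_step(x, r)          # (2.4a): routing given server selection
        history.append(total_cost(x, r))
        x = ss_step(r)             # (2.4b): server selection given routing
        history.append(total_cost(x, r))
    return x, r, history


if __name__ == "__main__":
    x, r, history = alternate()
    print("server rates:", x, "split ratios:", r, "objective:", history[-1])
```

For this instance, equalizing the marginal costs $a + 2bf$ across all used links by hand gives server rates of 8/9 and 10/9, and the iteration approaches the same point, which matches the global-optimality claim for the special case with positive traffic on every server-user pair.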

Theorem 6. The alternating optimizations of TE and SS (2.4) in the special TE-SS game (2.5) achieve the global optimum of (2.6).

Proof. Combining Lemmas 3, 4, and 5 leads to the statement.

The special case analysis establishes a lower bound on the efficiency loss of TE-SS interaction. In general, there are two sources of misalignment between the optimization problems of TE and SS: (i) different shapes of the cost functions, e.g., the delay function, and (ii) the existence of background traffic in the TE problem. The above special case illustrates what might happen if these differences are avoided. However, in general, such misalignment will lead to a significant efficiency loss, as we show in the next section. The evaluation results shown in Section 2.7 further highlight the difference (i).

2.5 Efficiency Loss

We next study the efficiency loss in the general case of the TE-SS game, which may be caused by incomplete information, or unilateral actions that miss the opportunity to achieve a jointly attainable optimal point. We present two case studies that illustrate these two sources of suboptimal performance. We first present a toy network and show that under certain conditions the CP performs even worse in Model II than Model I, despite having more information about underlying network conditions. We next propose the notion of Pareto-optimality as the performance benchmark, and quantify the efficiency loss in both Model I and Model II.

The Paradox of Extra Information

Consider an ISP network illustrated in Figure 2.2. We designate an end user node, $T = \{F\}$, and two CP servers, $S = \{B, C\}$. The end user has a content demand of $M_F = 2$. We also allow two background traffic flows, $A \to D$ and $A \to E$, each of which has one unit of traffic demand. Edge directions are noted on the figure, so one can figure out the possible routes, i.e., there are two paths for each traffic flow (clockwise and counter-clockwise).
To simplify the analysis and deliver the most essential message from this example, suppose that both TE and SS costs on the four thin links are negligible so the four bold links constitute the bottleneck of the network. In Table 2.3, we list the link capacities, the ISP's cost function $g_l(\cdot)$, and the link latency function $D_l(\cdot)$. Suppose the CP aims to minimize the average latency of its traffic. We compare the Nash equilibrium of two situations when the CP optimizes its network by SS with end-to-end info and SS with improved visibility.

[Figure 2.2: An Example of the Paradox of Extra Information. Nodes F, D, E, B, C, A.]

link           l1: BD        l2: BE            l3: CD            l4: CE
C_l            1 + ε         1 + ε             1 + ε             1 + ε
D_l(f_l)       f_1           f_2 / ε           f_3 / ε           f_4
g_l(x)         g_1(·) = g_2(·) = g_3(·) = g_4(·)

Table 2.3: Link capacities, ISP's and CP's link cost functions in the example of Paradox of Extra Information.

The stability condition for the ISP at Nash equilibrium is $g_1'(f_1) = g_2'(f_2) = g_3'(f_3) = g_4'(f_4)$. Since the ISP's link cost functions are identical, the total traffic on each link must be identical. On the other hand, the stability condition for the CP at Nash equilibrium is that $(B, F)$ and $(C, F)$ have the same marginal latency. Based on these observations, we can derive two Nash equilibrium points. When the CP takes the strategy of SS with end-to-end info, let

Model I:
\begin{align*}
X_{CP} &: \{ x_{BF} = 1, \ x_{CF} = 1 \} \\
R &: \{ r_1^{BF} = 1 - \alpha, \ r_2^{BF} = \alpha, \ r_3^{CF} = \alpha, \ r_4^{CF} = 1 - \alpha, \\
  &\quad\ \ r_1^{AD} = \alpha, \ r_3^{AD} = 1 - \alpha, \ r_2^{AE} = 1 - \alpha, \ r_4^{AE} = \alpha \}
\end{align*}
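As a numerical companion to this example, the sketch below evaluates the CP's total latency under two routing patterns: the Model I pattern above, and the same pattern with $\alpha$ and $1 - \alpha$ swapped, which is the Model II equilibrium the text discusses next. It is a hedged illustration rather than part of the original analysis. The latency functions $D_1 = f_1$, $D_2 = f_2/\epsilon$, $D_3 = f_3/\epsilon$, $D_4 = f_4$ are read from Table 2.3 as reconstructed here and should be treated as an assumption; the function names and the chosen values of $\alpha$ and $\epsilon$ are hypothetical.

```python
# Evaluate the CP latency cost at the two routing patterns of the example.
# Keys are (flow, bottleneck link index); values are traffic rates on that link.
# All demands in the example are one unit per (server, user) or background pair,
# so the rate on a link equals the routing fraction.

CP_FLOWS = ("BF", "CF")


def routing_pattern(alpha, model):
    """Routing of the Model I equilibrium, or its alpha <-> 1 - alpha swap (Model II)."""
    a = alpha if model == "I" else 1.0 - alpha
    return {
        ("BF", 1): 1.0 - a, ("BF", 2): a,
        ("CF", 3): a, ("CF", 4): 1.0 - a,
        ("AD", 1): a, ("AD", 3): 1.0 - a,
        ("AE", 2): 1.0 - a, ("AE", 4): a,
    }


def link_loads(routing):
    """Total traffic (CP plus background) on each of the four bottleneck links."""
    loads = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
    for (_flow, link), rate in routing.items():
        loads[link] += rate
    return loads


def cp_latency_cost(routing, eps):
    """SS objective: CP traffic on each link times that link's latency."""
    loads = link_loads(routing)
    # Assumed latency functions: links 1 and 4 are "fast", links 2 and 3 are "slow".
    latency = {1: loads[1], 2: loads[2] / eps, 3: loads[3] / eps, 4: loads[4]}
    return sum(rate * latency[link]
               for (flow, link), rate in routing.items() if flow in CP_FLOWS)


if __name__ == "__main__":
    for alpha, eps in [(0.1, 0.1), (0.01, 0.01)]:
        ss_one = cp_latency_cost(routing_pattern(alpha, "I"), eps)
        ss_two = cp_latency_cost(routing_pattern(alpha, "II"), eps)
        print(f"alpha={alpha}, eps={eps}: SS_I={ss_one:.2f}, SS_II={ss_two:.2f}")
```

In both patterns every bottleneck link carries one unit of traffic, so the ISP is indifferent between them, while the CP's cost differs sharply because its traffic is placed mostly on the fast links in one pattern and mostly on the slow links in the other.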

One can check that the Model I point is indeed a Nash equilibrium solution, where $f_1 = f_2 = f_3 = f_4 = 1$, and $D_{BF} = D_{CF} = 1 - \alpha + \alpha/\epsilon$. The CP's objective is $SS_I = 2(1 - \alpha + \alpha/\epsilon)$. When the CP takes the strategy of SS with improved visibility, let

Model II:
\begin{align*}
X_{CP} &: \{ x_{BF} = 1, \ x_{CF} = 1 \} \\
R &: \{ r_1^{BF} = \alpha, \ r_2^{BF} = 1 - \alpha, \ r_3^{CF} = 1 - \alpha, \ r_4^{CF} = \alpha, \\
  &\quad\ \ r_1^{AD} = 1 - \alpha, \ r_3^{AD} = \alpha, \ r_2^{AE} = \alpha, \ r_4^{AE} = 1 - \alpha \}
\end{align*}
This is a Nash equilibrium point, where $f_1 = f_2 = f_3 = f_4 = 1$, and $d_{BF} = d_{CF} = \alpha(1 + \alpha) + (1 - \alpha)(1/\epsilon + (1 - \alpha)/\epsilon^2)$. The CP's objective is $SS_{II} = 2(\alpha + (1 - \alpha)/\epsilon)$. When $0 < \epsilon < 1$ and $0 \le \alpha < 1/2$, we have the counter-intuitive $SS_I < SS_{II}$, since $SS_{II} - SS_I = 2(1 - 2\alpha)(1/\epsilon - 1) > 0$: more information may hurt the CP's performance. In the worst case,
\[ \lim_{\alpha \to 0,\ \epsilon \to 0} \frac{SS_{II}}{SS_I} = \infty, \]
i.e., the performance degradation can be unbounded.

This is not surprising, since the Nash equilibrium is generally non-unique, both in terms of equilibrium solutions and equilibrium objectives. When the ISP's and the CP's objectives are misaligned, the ISP's decision may route the CP's traffic on bad paths from the CP's perspective. In this example, the paradox happens when the ISP routes the CP traffic on good paths in Model I (though SS makes decisions based on incomplete information), and the ISP mis-routes the CP traffic to bad paths in Model II (though SS gains better visibility). In practice, such a scenario is likely to happen, since the ISP cares about link congestion (link utilization), while the CP cares about latency, which depends not only on link load, but also on propagation delay. Thus partial collaboration between the ISP and the CP by only passing information is not sufficient to achieve global optimality.

Pareto Optimality and Illustration of Sub-Optimality

As in the above example, one of the causes of sub-optimality is that the objectives of TE and SS are not necessarily aligned. To measure efficiency in a system with multiple objectives, a common approach is to explore the Pareto curve. For points on the Pareto curve, we cannot improve one objective further without hurting the other. The Pareto curve characterizes the tradeoff of potentially conflicting goals of different parties. One way to trace the tradeoff curve is to optimize a weighted sum of the objectives:
\begin{align}
\text{minimize} \quad & TE + \gamma \cdot SS \tag{2.10a} \\
\text{variables} \quad & R \in \mathcal{R}, \ X_{cp} \in \mathcal{X}_{cp} \tag{2.10b}
\end{align}
where $\gamma \ge 0$ is a scalar representing the relative weight of the two objectives. $\mathcal{R}$ and $\mathcal{X}_{cp}$ are the feasible regions defined by the constraints in (2.1) and (2.3):
\begin{align*}
\mathcal{R} \times \mathcal{X}_{cp} = \Big\{ (r_l^{ij}, x_{st}) \ \Big| \ & 0 \le r_l^{ij} \le 1; \ x_{st} \ge 0; \\
& f_l = \sum_{(i,j)} x_{ij}\, r_l^{ij} \le C_l, \ \forall l \in E; \quad \sum_{s \in S} x_{st} = M_t, \ \forall t \in T; \\
& \sum_{l \in In(v)} r_l^{ij} - \sum_{l \in Out(v)} r_l^{ij} = I_{v=j}, \ \forall v \in V \setminus \{i\} \Big\}
\end{align*}
The problem (2.10) is not easy to solve. In fact, the objective of (2.10) is no longer convex in the variables $\{r_l^{st}, x_{st}\}$, and the feasible region defined by the constraints of (2.10) is not convex. One way to overcome this problem is to consider a relaxed decision space that is a superset of the original solution space. Instead of restricting each player to its own operating domain, i.e., the ISP controls routing and the CP controls server selection, we introduce a joint routing and content delivery problem. Let $x_l^{st}$ denote the rate of traffic carried on link $l$ that belongs to flow $(s,t)$. Such a convexification of the original problem (2.10) gives more freedom to the joint TE and SS problem. Denote the generalized CP decision variable as $\hat{X}_{cp} = \{x_l^{st}\}_{s \in S, t \in T}$, and $R_{bg} = \{r_l^{ij}\}_{(i,j) \notin S \times T}$ as the background routing matrix. Consider the following optimization problem:

TESS-weighted
\begin{align}
\text{minimize} \quad & TE + \gamma \cdot SS \tag{2.11a} \\
\text{subject to} \quad & f_l^{cp} = \sum_{(s,t)} x_l^{st}, \quad \forall l \tag{2.11b} \\
& f_l = f_l^{cp} + \sum_{(i,j) \notin S \times T} x_{ij}\, r_l^{ij} \le C_l, \quad \forall l \tag{2.11c} \\
& \sum_{l \in In(v)} r_l^{ij} - \sum_{l \in Out(v)} r_l^{ij} = I_{v=j}, \quad \forall (i,j) \notin S \times T, \ \forall v \in V \setminus \{i\} \tag{2.11d} \\
& \sum_{s \in S} \Big( \sum_{l \in In(v)} x_l^{st} - \sum_{l \in Out(v)} x_l^{st} \Big) = M_t \cdot I_{v=t}, \quad \forall v \notin S, \ \forall t \in T \tag{2.11e} \\
\text{variables} \quad & x_l^{st} \ge 0, \quad 0 \le r_l^{ij} \le 1 \tag{2.11f}
\end{align}
Denote the feasible space of the joint variable as $A = \{\hat{X}_{cp}, R_{bg}\}$. If we vary $\gamma$ and plot the achieved TE objectives versus SS objectives, we obtain the Pareto curve.

To illustrate the Pareto curve and efficiency loss in Model I and Model II, we plot in Figure 2.3 the Pareto curve and the Nash equilibria in the two-dimensional objective space (TE, SS) for the network shown in Figure 2.2. The simulation shows that when the CP leverages the complete information to optimize (2.3a), it is able to achieve lower delay, but the TE cost suffers. Though it is not clear which operating point is better, both equilibria are away from the Pareto curve, which shows that there is room for performance improvement in both dimensions.

[Figure 2.3: A numerical example illustrating sub-optimality. Plot titled "Measure of efficiency loss": SS cost versus TE cost, showing the Pareto curve, the operating region, and the Model I and Model II equilibria.]

2.6 A Joint Design: Model III

Motivated by the need for a joint TE and SS design, we propose the Nash bargaining solution to reduce the efficiency loss observed above. Using the theory of optimization decomposition, we derive a distributed algorithm by which the ISP and the CP can act separately and communicate with a limited amount of information exchange.

Motivation

An ISP providing content distribution service in its own network has control over both routing and server selection. So the ISP can consider the characteristics of both types of traffic (background and CP) and jointly optimize a carefully chosen objective. The jointly optimized system should meet at least two goals: (i) optimality, i.e., it should achieve Pareto optimality so the network resources are efficiently utilized, and (ii) fairness, i.e., the tradeoff between two non-synergistic objectives should be balanced so both parties benefit from the cooperation.

One natural design choice is to optimize the weighted sum of the traffic engineering goal and the server selection goal as shown in (2.11). However, solving (2.11) for each $\gamma$ and adaptively tuning $\gamma$ in a trial-and-error fashion is impractical and inefficient. First, it is not straightforward to weigh the tradeoff between the two objectives. Second, one needs to compute an appropriate weight parameter $\gamma$ for every combination of background load and CP traffic demand. In addition, the offline computation does not adapt to dynamic changes of network conditions, such as cross traffic or link failures. Last, tuning $\gamma$ to explore a broad region of system operating points is computationally expensive.

Besides the system considerations above, the economic perspective requires a fair solution. Namely, the joint design should benefit both TE and SS.
In addition, such a model also applies to a more general case when the ISP and the CP are different business entities. They cooperate only when the cooperation leads to a win-win situation, and the division of the benefits should be fair, i.e., one who makes a greater contribution to the collaboration should be able to receive more reward, even when their goals are conflicting.

While the joint system is designed from a clean slate, it should admit an incremental deployment from the existing infrastructure. In particular, we prefer that the functionalities of routing and server selection be separated, with minor changes to each component. The modularized design allows us to manage each optimization independently, with a judicious amount of information exchange. Designing for scalability and modularity is beneficial to both the ISP and the CP, and allows their cooperation either as a single entity or as different ones.

Based on all the above considerations, we apply the concept of Nash bargaining solution [21, 33] from cooperative game theory. It ensures that the joint system achieves an efficient and fair operating point. The solution structure also allows a modular implementation.

Nash Bargaining Solution

Consider a Nash bargaining solution which solves the following optimization problem:

NBS
\begin{align}
\text{maximize} \quad & (TE_0 - TE)(SS_0 - SS) \tag{2.12a} \\
\text{variables} \quad & \{\hat{X}_{cp}, R_{bg}\} \in A \tag{2.12b}
\end{align}
where $(TE_0, SS_0)$ is a constant called the disagreement point, which represents the starting point of their negotiation. Namely, $(TE_0, SS_0)$ is the status quo we observe before any cooperation. For instance, one can view the Nash equilibrium in Model I as a disagreement point, since it is the operating point the system would reach without any further optimization. By optimizing the product of performance improvements of TE and SS, the Nash bargaining solution guarantees the joint system is optimal and fair. A Nash bargaining solution is defined by the following axioms, and is the only solution that satisfies all four axioms [21, 33]:

Pareto optimality. A Pareto optimal solution ensures efficiency.

Symmetry. The two players should get an equal share of the gains through cooperation, if the two players' problems are symmetric, i.e., they have the same cost functions, and have the same objective value at the disagreement point.

Expected utility axiom. The Nash bargaining solution is invariant under affine transformations. Intuitively, this axiom suggests that the Nash bargaining solution is insensitive to different units used in the objective and can be efficiently computed by affine projection.

Independence of irrelevant alternatives. This means that adding extra constraints in the feasible operating region does not change the solution, as long as the solution itself is feasible.

The choice of the disagreement point is subject to different economic considerations. For a single network provider who wishes to provide both services, it can optimize the product of improvement ratios by setting the disagreement point to be the origin, i.e., equivalent to $TE \cdot SS / (TE_0 \cdot SS_0)$. For two separate ISP and CP entities who wish to cooperate, the Nash equilibrium of Model I may be a natural choice, since it represents the benchmark performance of current practice, which is the baseline for any future cooperation. It can be obtained from the empirical observations of their average performance. Alternatively, they can choose their preferred performance level as the disagreement point, written into the contract. In this work, we use the Nash equilibrium of Model I as the disagreement point to compare the performance of our three models.

COTASK Algorithm

In this section, we show how the Nash bargaining solution can be implemented in a modularized manner, i.e., keeping SS and TE functionalities separate. This is important because modularized design increases the re-usability of legacy systems with minor changes, like existing CDN deployments.
In terms of cooperation between two independent financial entities, the modularized structure presents the possibility of cooperation without revealing confidential internal information to each other.

We next develop COTASK (COoperative TrAffic engineering and Server selection inside an ISP networK), a protocol that implements NBS by separate TE and SS optimizations and communication between them. We apply the theory of optimization decomposition [22] to decompose problem (2.12) into subproblems. The ISP solves a new routing problem, which controls the routing of background traffic only. The CP solves a new server selection problem, given the network topology information. The ISP also passes the routing control of content traffic to the CP, offering more freedom in how content can be delivered on the network. They communicate via underlying link prices, which are computed locally using traffic levels on each link.

Consider the objective (2.12a), which can be converted into
\[ \text{maximize} \quad \log(TE_0 - TE) + \log(SS_0 - SS) \]
since the log function is monotonic and the feasible solution space is unaffected. The introduction of the log functions helps reveal the decomposition structure of the original problem. Two auxiliary variables $\bar{f}_l^{cp}$ and $\bar{f}_l^{bg}$ are introduced to reflect the preferred CP traffic level from the ISP's perspective and the preferred background traffic level from the CP's perspective. Problem (2.12) can be rewritten as
\begin{align}
\text{maximize} \quad & \log\Big(TE_0 - \sum_{l} g_l(f_l^{bg} + \bar{f}_l^{cp})\Big) + \log\Big(SS_0 - \sum_{l} h_l(f_l^{cp} + \bar{f}_l^{bg})\Big) \tag{2.13a} \\
\text{subject to} \quad & f_l^{cp} = \sum_{(s,t)} x_l^{st}, \quad f_l^{bg} = \sum_{(i,j) \notin S \times T} x_{ij}\, r_l^{ij}, \quad \forall l \tag{2.13b} \\
& \bar{f}_l^{cp} = f_l^{cp}, \quad \bar{f}_l^{bg} = f_l^{bg}, \quad f_l^{cp} + f_l^{bg} \le C_l, \quad \forall l \tag{2.13c} \\
& \sum_{l \in In(v)} r_l^{ij} - \sum_{l \in Out(v)} r_l^{ij} = I_{v=j}, \quad \forall (i,j) \notin S \times T, \ \forall v \in V \setminus \{i\} \tag{2.13d} \\
& \sum_{s \in S} \Big( \sum_{l \in In(v)} x_l^{st} - \sum_{l \in Out(v)} x_l^{st} \Big) = M_t \cdot I_{v=t}, \quad \forall v \notin S, \ \forall t \in T \tag{2.13e} \\
\text{variables} \quad & x_l^{st} \ge 0, \quad 0 \le r_l^{ij} \le 1, \ \forall (i,j) \notin S \times T, \quad \bar{f}_l^{cp}, \ \bar{f}_l^{bg} \tag{2.13f}
\end{align}
The consistency constraint on the auxiliary variable and the original variable ensures that the solution is equivalent to that of problem (2.12). We take the partial Lagrangian of (2.13) as
\begin{align*}
L(x_l^{st}, r_l^{ij}, \bar{f}_l^{cp}, \bar{f}_l^{bg}, \lambda_l, \mu_l, \nu_l) = {} & \log\Big(TE_0 - \sum_{l} g_l(f_l^{bg} + \bar{f}_l^{cp})\Big) + \log\Big(SS_0 - \sum_{l} h_l(f_l^{cp} + \bar{f}_l^{bg})\Big) \\
& + \sum_{l} \mu_l (f_l^{bg} - \bar{f}_l^{bg}) + \sum_{l} \nu_l (f_l^{cp} - \bar{f}_l^{cp}) + \sum_{l} \lambda_l (C_l - f_l^{bg} - f_l^{cp})
\end{align*}

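$\lambda_l$ is the link price, which reflects the cost of overshooting the link capacity, and $\mu_l$, $\nu_l$ are the consistency prices, which reflect the cost of disagreement between the ISP and the CP on the preferred link resource allocation. Observe that $f_l^{cp}$ and $f_l^{bg}$ can be separated in the Lagrangian function. We take a dual decomposition approach, and (2.13) is decomposed into two subproblems:

SS-NBS
\begin{align}
\text{maximize} \quad & \log\Big(SS_0 - \sum_{l} h_l(f_l^{cp} + \bar{f}_l^{bg})\Big) + \sum_{l} \big(\nu_l f_l^{cp} - \mu_l \bar{f}_l^{bg} - \lambda_l f_l^{cp}\big) \tag{2.14a} \\
\text{subject to} \quad & f_l^{cp} = \sum_{(s,t)} x_l^{st}, \quad \forall l \tag{2.14b} \\
& \sum_{s \in S} \Big( \sum_{l \in In(v)} x_l^{st} - \sum_{l \in Out(v)} x_l^{st} \Big) = M_t \cdot I_{v=t}, \quad \forall v \notin S, \ \forall t \in T \tag{2.14c} \\
\text{variables} \quad & x_l^{st} \ge 0, \ \forall (s,t) \in S \times T, \quad \bar{f}_l^{bg} \tag{2.14d}
\end{align}
and

TE-NBS
\begin{align}
\text{maximize} \quad & \log\Big(TE_0 - \sum_{l} g_l(f_l^{bg} + \bar{f}_l^{cp})\Big) + \sum_{l} \big(\mu_l f_l^{bg} - \nu_l \bar{f}_l^{cp} - \lambda_l f_l^{bg}\big) \tag{2.15a} \\
\text{subject to} \quad & f_l^{bg} = \sum_{(i,j) \notin S \times T} x_{ij}\, r_l^{ij}, \quad \forall l \tag{2.15b} \\
& \sum_{l \in In(v)} r_l^{ij} - \sum_{l \in Out(v)} r_l^{ij} = I_{v=j}, \quad \forall (i,j) \notin S \times T, \ \forall v \in V \setminus \{i\} \tag{2.15c} \\
\text{variables} \quad & 0 \le r_l^{ij} \le 1, \ \forall (i,j) \notin S \times T, \quad \bar{f}_l^{cp} \tag{2.15d}
\end{align}
The optimal solutions of (2.14) and (2.15) for a given set of prices $\mu_l$, $\nu_l$, and $\lambda_l$ define the dual function $\mathrm{Dual}(\mu_l, \nu_l, \lambda_l)$. The dual problem is given as:
\begin{align}
\text{minimize} \quad & \mathrm{Dual}(\mu_l, \nu_l, \lambda_l) \tag{2.16a} \\
\text{variables} \quad & \lambda_l \ge 0, \ \mu_l, \ \nu_l \tag{2.16b}
\end{align}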
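A dual problem of this kind is commonly solved by a projected subgradient iteration on the prices: each party solves its own subproblem for the current prices, and the link then moves its price against the remaining capacity slack. The sketch below illustrates only this mechanism on a deliberately simplified stand-in, and it is not the COTASK algorithm itself. It assumes a single link, replaces the SS-NBS and TE-NBS subproblems with hypothetical concave utilities $w \log(1 + f)$ that have closed-form best responses, and omits the consistency prices $\mu_l$ and $\nu_l$. The weights, capacity, step size, and function names are all illustrative.

```python
# Dual decomposition on one link with a capacity price (simplified stand-in).
# Each party maximizes  weight * log(1 + f) - price * f  over 0 <= f <= f_max,
# and the link updates its price by a projected subgradient step.


def best_response(price, weight, f_max=10.0):
    """Closed-form maximizer of weight * log(1 + f) - price * f on [0, f_max]."""
    if price <= 0.0:
        return f_max  # a free link is used up to the cap
    return min(f_max, max(0.0, weight / price - 1.0))


def solve_by_price_updates(capacity=2.0, w_cp=1.0, w_bg=2.0,
                           step=0.05, iterations=300):
    price = 1.0  # initial link price, must be non-negative
    for _ in range(iterations):
        f_cp = best_response(price, w_cp)   # stand-in for the CP subproblem
        f_bg = best_response(price, w_bg)   # stand-in for the ISP subproblem
        slack = capacity - f_cp - f_bg      # subgradient of the dual w.r.t. price
        price = max(0.0, price - step * slack)  # projection onto price >= 0
    return price, best_response(price, w_cp), best_response(price, w_bg)


if __name__ == "__main__":
    price, f_cp, f_bg = solve_by_price_updates()
    print(f"price={price:.4f}, f_cp={f_cp:.4f}, f_bg={f_bg:.4f}")
```

With these hypothetical numbers the optimality conditions $w/(1 + f) = \lambda$ and $f_{cp} + f_{bg} = C$ give $\lambda = 0.75$, $f_{cp} = 1/3$, and $f_{bg} = 5/3$, which the iteration approaches. The price rises when the link is overloaded and falls when capacity is left unused, so neither party needs to see the other's utility.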

COTASK Algorithm

ISP: TE algorithm
(i) Receives link price $\lambda_l$ and consistency prices $\mu_l$, $\nu_l$ from physical links $l \in E$
(ii) ISP solves (2.15a) and computes $R_{bg}$ for background traffic
(iii) ISP passes $f_l^{bg}$, $\bar{f}_l^{cp}$ information to each link $l$
(iv) Go back to (i)

CP: SS algorithm
(i) Receives link price $\lambda_l$ and consistency prices $\mu_l$, $\nu_l$ from physical links $l \in E$
(ii) CP solves (2.14a) and computes $\hat{X}_{cp}$ for content traffic
(iii) CP passes $f_l^{cp}$, $\bar{f}_l^{bg}$ information to each link $l$
(iv) Go back to (i)

Link: price update algorithm
(i) Initialization step: set $\lambda_l \ge 0$, and $\mu_l$, $\nu_l$ arbitrarily
(ii) Updates link price $\lambda_l$ according to (2.17)
(iii) Updates consistency prices $\mu_l$, $\nu_l$ according to (2.18) and (2.19)
(iv) Passes $\lambda_l$, $\mu_l$, $\nu_l$ information to TE and SS
(v) Go back to (ii)

Table 2.4: Distributed algorithm for solving problem (2.12a).

We can solve the dual problem with the following price updates:
\begin{align}
\lambda_l(t+1) &= \big[ \lambda_l(t) - \beta_\lambda \big( C_l - f_l^{bg} - f_l^{cp} \big) \big]^+, \tag{2.17} \\
\mu_l(t+1) &= \mu_l(t) - \beta_\mu \big( f_l^{bg} - \bar{f}_l^{bg} \big), \tag{2.18} \\
\nu_l(t+1) &= \nu_l(t) - \beta_\nu \big( f_l^{cp} - \bar{f}_l^{cp} \big), \tag{2.19}
\end{align}
where the $\beta$'s are diminishing step sizes or small constant step sizes often used in practice [34]. Table 2.4 presents the COTASK algorithm that implements the Nash bargaining solution in a distributed manner. In the COTASK algorithm, the ISP solves the new version of TE, i.e., TE-NBS, and the CP solves the new version of SS, i.e., SS-NBS. In terms of information sharing, the CP learns the network topology from the ISP. They do not directly exchange information with each other. Instead, they report $f_l^{cp}$ and $f_l^{bg}$ information to underlying links, which pass the computed price information back to TE and SS. It is possible to further implement TE or SS in a distributed manner, such as at the user or server level.

There are two main challenges in the practical implementation of COTASK. First, TE needs to adapt quickly to network dynamics. Fast-timescale TE has recently been proposed in various works. Second, an extra price update component is required on each link, which involves price computation and message passing between TE and SS. This functionality can potentially be implemented in routers.

Theorem 7. The distributed algorithm COTASK converges to the optimum of (2.12).

Proof. The COTASK algorithm is precisely captured by the decomposition method described above. Certain choices of step sizes, such as $\beta(t) = \beta_0 / t$, where $\beta_0 > 0$, guarantee that the algorithm converges to a global optimum [35].

2.7 Performance Evaluation

In this section, we use simulations to demonstrate the efficiency loss that may occur for real network topologies and traffic models. We also compare the performance of the three models. We solve the Nash bargaining solution centrally, without using the COTASK algorithm, since we are primarily interested in its performance. Complementary to the theoretical analysis, the simulation results allow us to gain a better understanding of the efficiency loss under realistic network environments. These simulation results also provide guidance to network operators who need to decide which approach to take, sharing information or sharing control.

Simulation Setup

We evaluate our models under ISP topologies obtained from Rocketfuel [36]. We use the backbone topology of the research network Abilene [37] and several major tier-1 ISPs in North America. The choice of these topologies also reflects different geometric properties of the graph. For instance, Abilene is the simplest graph with two bottleneck paths horizontally. The backbones of AT&T and Exodus have a hub-and-spoke structure with some shortcuts between node pairs. The topology of Level 3 is almost a complete mesh, while Sprint is in between these two kinds.
We simulate the traffic demand using a gravity model [38], which reflects the pairwise communication pattern on the Internet. The content demand of a CP user is assumed to be proportional to the node population.

The TE cost function $g(\cdot)$ and the SS cost function $h(\cdot)$ are chosen as follows. ISPs usually model congestion cost with a convex increasing function of the link load. The exact shape of the function $g_l(f_l)$ is not important, and we use the same piecewise linear cost function as in [23], given below:
\[ g_l(f_l, C_l) = \begin{cases}
f_l & 0 \le f_l/C_l < 1/3 \\
3 f_l - \tfrac{2}{3} C_l & 1/3 \le f_l/C_l < 2/3 \\
10 f_l - \tfrac{16}{3} C_l & 2/3 \le f_l/C_l < 9/10 \\
70 f_l - \tfrac{178}{3} C_l & 9/10 \le f_l/C_l < 1 \\
500 f_l - \tfrac{1468}{3} C_l & 1 \le f_l/C_l < 11/10 \\
5000 f_l - \tfrac{16318}{3} C_l & 11/10 \le f_l/C_l < \infty
\end{cases} \]
The CP's cost function can be a performance cost such as latency, or a financial cost charged by ISPs. We consider the case where latency is the primary performance metric, i.e., the content traffic is delay sensitive, like video conferencing or live streaming. So we let the CP's cost function $h_l(\cdot)$ be of the form given by (2.2), i.e., $h_l(f_l) = f_l^{cp} D_l(f_l)$. A link's latency $D_l(\cdot)$ consists of queuing delay and propagation delay. The propagation delay is proportional to the geographical distance between nodes. The queuing delay is approximated by the M/M/1 model, i.e.,
\[ D_{queue} = \frac{1}{C_l - f_l}, \quad f_l < C_l, \]
with a linear approximation when the link utilization is over 99%. We relax hard capacity constraints by penalizing traffic overshooting the link with a high cost, for consistency throughout this work. The shapes of the TE link cost function and the queuing delay function are illustrated in Figure 2.4. We intentionally choose the cost functions of TE and SS to be similar in shape. This allows us to quantify the efficiency loss of Model I and Model II even when their objectives are relatively well aligned, as well as the improvement brought by Model III.
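The two cost functions described above can be written down directly. The following is a minimal sketch with hypothetical function names, not the simulation code used for the evaluation. Two points are assumptions rather than facts from the text: the slope 5000 of the last segment of $g_l$ is inferred from continuity at $f_l/C_l = 11/10$, and the linear approximation of the queuing delay above 99% utilization is taken to be the tangent line of $1/(C_l - f_l)$ at that point, since the text does not specify which line is used.

```python
# Piecewise-linear TE link cost and M/M/1-style link latency (illustrative sketch).


def te_link_cost(f, c):
    """Congestion cost g_l(f_l, C_l): convex, increasing, piecewise linear in f."""
    u = f / c  # link utilization
    if u < 1 / 3:
        return f
    if u < 2 / 3:
        return 3 * f - (2 / 3) * c
    if u < 9 / 10:
        return 10 * f - (16 / 3) * c
    if u < 1:
        return 70 * f - (178 / 3) * c
    if u < 11 / 10:
        return 500 * f - (1468 / 3) * c
    return 5000 * f - (16318 / 3) * c  # slope inferred from continuity (assumption)


def queuing_delay(f, c, knee=0.99):
    """M/M/1 delay 1 / (C - f), extended linearly beyond `knee` utilization.

    The extension is the tangent at f0 = knee * C (an assumption), which keeps
    the delay finite and convex when traffic overshoots the capacity.
    """
    f0 = knee * c
    if f <= f0:
        return 1.0 / (c - f)
    slope = 1.0 / (c - f0) ** 2
    return 1.0 / (c - f0) + slope * (f - f0)


def cp_link_cost(f_cp, f_total, c, propagation_delay):
    """h_l = f_l^cp * D_l(f_l), with D_l = propagation delay + queuing delay."""
    return f_cp * (propagation_delay + queuing_delay(f_total, c))


if __name__ == "__main__":
    for utilization in (0.2, 0.5, 0.8, 0.95, 1.05):
        print(utilization, te_link_cost(utilization, 1.0), queuing_delay(utilization, 1.0))
```

Both functions stay flat at low utilization and rise steeply near capacity, which is the similarity in shape that the setup relies on.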

[Figure 2.4: ISP and CP cost functions. (a) TE link cost $g_l(f_l)$ versus $f_l/C_l$; (b) link queuing delay versus $f_l/C_l$.]

[Figure 2.5: The TE-SS tussle vs. CP's traffic intensity (Abilene topology). (a) TE cost vs. CP traffic percentage; (b) SS cost vs. CP traffic percentage. Curves: Model I (No cooperation), Model II (Sharing Info), Model III (Sharing Control).]

Evaluation Results

Tussle between background and CP's traffic

We first demonstrate how the CP's traffic intensity affects the overall network performance. We fix the total amount of traffic and tune the ratio between background traffic and the CP's traffic. We evaluate the performance of different models when CP traffic grows from 1% to 100% of the total traffic. Figure 2.5 illustrates the results on the Abilene topology. The general trend of both TE and SS objectives for all three models is that the cost first decreases as the CP traffic percentage grows, and later increases as the CP's traffic dominates the network.

The decreasing trend is due to the fact that the CP's traffic is self-optimized by selecting servers close to a user, thus offloading the network. The increasing trend is more interesting, suggesting that when a higher percentage of total traffic is CP-generated, the negative effect of the TE-SS interaction is amplified, even when the ISP and the CP share similar cost functions. Low link congestion usually means low end-to-end latency, and vice versa. However, the two objectives differ in the following: (i) TE might penalize high utilization before queueing delay becomes significant, in order to leave as much room as possible to accommodate changes in traffic, and (ii) the CP considers both propagation delay and queueing delay, so it may choose a moderately-congested short path over a lightly-loaded long path. This explains why the optimization efforts of the two players are at odds.

Network congestion vs. performance improvement

We now study the network conditions under which more performance improvement is possible. We evaluate the three models on the Abilene topology. Again, we fix the total amount of traffic and vary the CP's traffic percentage. Now we change link capacities and evaluate two scenarios: when the network is moderately congested and when the network is highly congested. We show the performance improvement of Model II and Model III over Model I (in percentages) and plot the results in Figure 2.6. Figures 2.6(a-b) show the improvement of the ISP and the CP when the network is under low load. Generally, Model II and Model III improve both TE and SS, and Model III outperforms Model II in most cases, with the exception that Model II is sometimes biased towards SS. However, neither the ISP's nor the CP's improvement is substantial (note the different scales of the y-axes), except when CP traffic is trivial (1%). This is because when the network is under low load, the slopes of the TE and SS cost functions are flat, thus leaving little space for improvement. Figures 2.6(c-d) show the results when the network is under high load. The performance improvement becomes more significant, especially at the two extremes: when CP traffic is trivial or prevalent. This suggests that when CP traffic is dominant, there is large room for improvement when the two objectives are similar in shape. However, observe that while Model III always improves TE and SS, Model II could sometimes perform worse than Model I. This indicates that there are more inferior Nash equilibria when a larger fraction of CP traffic exists.

[Figure 2.6: TE and SS performance improvement of Model II and III over Model I. Panels plot the percentage of SS cost saving (CP's performance improvement) and the percentage of TE cost saving (ISP's performance improvement) against CP traffic percentage, for Model II (sharing information) and Model III (sharing control). (a-b) Abilene network under low traffic load: moderate improvement; (c-d) Abilene network under high traffic load: more significant improvement, but more information (in Model II) does not necessarily benefit the CP and the ISP (the paradox of extra information).]

Impact of ISP topologies

We evaluate our three models on different ISP topologies. The topological properties of the different graphs are discussed earlier. The CP's traffic is 80% of the total traffic, and link capacities are set such that the networks are under high traffic load. Our findings are depicted in Figure 2.7. Note that the performance improvement is relatively more significant in more complex graphs. Simple topologies with small min-cut sizes are networks where the apparent paradox of more (incomplete) information is likely to happen. Besides the TE and SS objectives, we also plot the maximum link utilization to illustrate the level of congestion in the network. Higher network load leaves more space for potential improvement. Also, Model III generally improves this metric, which might be another important consideration for network providers.

[Figure 2.7: Performance evaluation over different ISP topologies. (a) ISP improvement on different networks (percentage of TE cost saving); (b) CP improvement on different networks (percentage of SS cost saving); (c) maximum link utilization on different networks. Networks: Abilene, AT&T, Exodus, Level3, Sprint. Abilene: small cut graph; AT&T, Exodus: hub-and-spoke with shortcuts; Level 3: complete mesh; Sprint: in between.]

2.8 Related Work

This work is an extension of our earlier workshop paper [39]. Additions in this paper include the following: a more general CP model, analysis of optimality conditions in three cooperation models, the paradox of extra information, implementation of the Nash bargaining solution, and large-scale evaluation.

The most similar work is a parallel work [20], which studied the interaction between content distribution and traffic engineering. The authors show the optimality conditions for two separate problems to converge to a socially optimal point, as discussed earlier in this chapter. They provide a 4/3-bound on efficiency loss for linear cost functions, and discuss generalizations to multiple ISPs and overlay networks.

Some earlier work studied the self-interaction within ISPs or CPs themselves. In [28], the authors used simulation to show that selfish routing is close to optimal in Internet-like environments, without much performance degradation. [40] studied the problem of load balancing by overlay routing, and how to alleviate race conditions among multiple co-existing overlays. [41] studied the resource allocation problem at the inter-AS level, where ISPs compete to maximize their revenues. [42] applied the Nash bargaining solution to solve an inter-domain ISP peering problem.
The need for cooperation between content providers and network providers is raising much discussion in both the research community and industry. [43] used price theory to reconcile the tussle between peer-assisted content distribution and the ISP's resource management. [19] proposed a communication portal between ISPs and P2P applications, which P2P applications can consult for ISP-biased network information to reduce network providers' cost without sacrificing their performance. [18] proposed an oracle service run by the ISP, so that P2P users can query for a ranked neighbor list according to certain performance metrics. [44] utilized existing network views collected from content distribution networks to drive biased peer selection in BitTorrent, so that cross-ISP traffic can be significantly reduced and download rates improved. [45] studied the interaction between underlay routing and overlay routing, which can be thought of as a generalization of server selection. The authors studied the equilibrium behaviors when the two problems have conflicting goals. Our work explores when and why sub-optimality appears, and proposes a cooperative solution to address these issues. [46] studied the economic aspects of traditional transit providers and content providers, and applied cooperative game theory to derive an optimal settlement between these entities.

2.9 Summary

We examine the interplay between traffic engineering and content distribution. While the problem has long existed, the dramatically increased amount of content-centric traffic, e.g., CDN and P2P traffic, makes it more significant. With the strong motivation for ISPs to provide content services, they are faced with the question of whether to stay with the current design or to start sharing information or control. This work sheds light on ways ISPs and CPs can cooperate, and serves as a starting point to better understand the interaction between those that operate networks and those that distribute content.

                 CP no change             CP change
ISP no change    current practice         partial collaboration
ISP change       partial collaboration    joint system design

Table 2.5: To cooperate or not: possible strategies for content provider (CP) and network provider (ISP)

Traditionally, ISPs provide and operate the pipes, while content providers distribute content over the pipes. In terms of what information can be shared between ISPs and CPs and what control can be jointly performed, there are four general categories, as summarized in Table 2.5. The top left corner is the current practice, which may give an undesirable Nash equilibrium. The bottom right corner is the joint design, which achieves optimal operation points. The top right corner is the case where the CP receives extra information and adapts its control accordingly, and the bottom left corner is the case of content-aware networking. This work studies three of the four corners in the table. Starting from the current practice, moving towards the bottom right corner of the table while the two parties remain separate business entities requires unilaterally-actionable, backward-compatible, and incrementally-deployable migration paths that are yet to be discovered.

Chapter 3

Federating Content Distribution across Decentralized CDNs

This chapter focuses on the challenge of joint control over multiple traffic management decisions that are operated by multiple institutions at different time-scales [47]. We consider a content delivery architecture based on geographically-distributed groups of last-mile CDN servers, e.g., set-top boxes located within users' homes. In contrast to Chapter 2, these servers may belong to administratively separate domains, e.g., multiple ISPs or CDNs. We propose a set of mechanisms to jointly manage content replication and request routing within this architecture, achieving both scalability and cost optimality. Specifically, our solution consists of two parts. First, we identify the replication and routing variables for each group of servers, based on distributed message-passing between these groups. Second, we describe algorithms for content placement and request mapping at the server granularity within each group, based on the decisions made in the first step. We formally prove the optimality of these methods, and confirm their efficacy through evaluations based on BitTorrent traces. In particular, we observe a reduction of network costs by more than 50% over state-of-the-art mechanisms.

3.1 Introduction

The total Internet traffic per month in 2011 is already in excess of Bytes [48]. Video-on-demand traffic alone is predicted to grow to three times this amount by 2015 [48]. This foreseen growth prompts a rethinking of the current content delivery architecture. Today's content delivery networks (CDNs) operate in isolation, hence missing the opportunity to pool the resources of individual CDNs. Recognizing the potential of federating CDNs, industry stakeholders have created the IETF CDNi working group [49] to standardize protocols for CDN interoperability. Another promising evolution consists of extending the CDN to the last mile of content delivery by incorporating servers at the network periphery. This approach leverages small servers within users' homes, such as set-top boxes or broadband gateways, as advocated by the Nano-Datacenter consortium [50], or dedicated toaster-sized appliances promoted by business initiatives [51]. By inter-connecting these diffuse clouds, user requests directed to one operator may be forwarded to another for the purpose of availability and proximity.

Operating a federation of decentralized CDNs requires efficient traffic management among a collection of distinct service providers. Traffic crossing the provider boundaries may experience degraded performance, such as extra latency. Within each provider, traffic exchange between last-mile servers located at different ISPs also implies increased billing costs. Our work aims at developing solutions that minimize cross-traffic costs and accommodate user demands in a scalable manner.

Distributing content among a collection of operators presents an optimization problem over several degrees of freedom: (i) content replication within each operator, (ii) request mapping to different operators, and (iii) service assignment to individual servers. Traditional design addresses these problems separately, and yet they clearly impact one another.
In this work, we present a set of solutions that collectively achieve all of the following:
• A joint optimization over both content placement and routing with provably-optimal performance.
• A decentralized implementation that facilitates coordination between different administrations.
• A scalable service assignment scheme that is easy to implement.
• An adaptive content caching algorithm with low operational costs.

This chapter proceeds along the following steps. We first describe an optimization problem featuring both placement and routing variables, whose solution gives a lower bound on the best achievable costs (Section 3.2). We then propose a scheme distributed between the decentralized CDN operators, which identifies content replication and routing policies at the operator level (Section 3.3). We next develop content management and request routing strategies at the server level within each operator (Sections 3.4 and 3.5). In conjunction, the operator-level and server-level schemes are proven to achieve optimal network costs (Theorems 1, 2 and 3). At a methodological level, these results rely on decomposition of optimization problems coupled with primal-dual techniques, and on Lyapunov stability applied to fluid limits. In addition to this theoretical underpinning, our solution is further validated experimentally in Section 3.6. Simulations driven by BitTorrent traces, with all the associated features of real traffic (bursty, long-tail distribution, geographical heterogeneities), show that our approach reduces network costs by more than half, as compared to state-of-the-art solutions based on LRU cache management and nearest-neighbor routing. We present related work in Section 3.7 and then conclude.

Problem Formulation and Solution Structure

In this section, we introduce our system model and a global optimization problem that the content provider solves. We also propose a content distribution architecture that allows a scalable, efficient, and decentralized solution to the global problem. The key notations are summarized in Table 3.1.

System Model

The system consists of a set $\mathcal{B}$ of boxes, where $B = |\mathcal{B}|$, and a distinguished node $s$, the content server. The content server $s$ is owned and operated by a content provider, such as YouTube or Netflix, who wishes to deliver content (e.g., videos or songs) to home users who subscribe to its services.
The boxes, such as set-top boxes or network-attached storage (NAS), are installed at users' homes, providing common Internet connectivity and limited storage. The collection of delivered content constitutes a set $\mathcal{C}$, where $C = |\mathcal{C}|$, called the content catalog. All content is replicated at server $s$. The storage and upload capacities of boxes are leased out to the content provider, which uses these resources to off-load part of the traffic load on the server $s$.

Notation           Interpretation
$s$                Content server.
$\mathcal{B}$      Set of all boxes in the system.
$\mathcal{C}$      Content catalog.
$\mathcal{D}$      Set of all box classes.
$\mathcal{B}^d$    Boxes in class $d \in \mathcal{D}$.
$M^d$              Storage capacity of boxes in $\mathcal{B}^d$.
$U^d$              Upload capacity of boxes in $\mathcal{B}^d$.
$\mathcal{F}_b$    Cache content of box $b$.
$p_c^d$            Replication ratio of content $c$ in $\mathcal{B}^d$.
$\lambda_c^d$      Request rate for content $c \in \mathcal{C}$ in $\mathcal{B}^d$.
$w^{dd'}$          Cost of transferring content from $d' \in \mathcal{D} \cup \{s\}$ to $d \in \mathcal{D}$.
$r_c^{dd'}$        Rate of requests for $c$ routed from $d \in \mathcal{D}$ to $d' \in \mathcal{D} \cup \{s\}$.
$r_c^d$            Aggregate rate of incoming requests for $c$ to $d \in \mathcal{D} \cup \{s\}$.
$R^d$              $R^d = B^d U^d$, the total upload capacity in class $d$.

Table 3.1: Summary of key notations

Box Classes

A content provider's service covers geographically-diverse regions and different ISPs. Customers differ in their Internet connectivity (e.g., bandwidth) and even in box storage capacity. We partition the set $\mathcal{B}$ of boxes into $D$ classes $\mathcal{B}^d$ of size $B^d = |\mathcal{B}^d|$, where $d \in \mathcal{D} = \{1, \ldots, D\}$. Such partitioning may correspond, e.g., to grouping together boxes managed by the same ISP. Different levels of aggregation or granularity can also be used to refine the geographic diversity. For example, each class may comprise boxes within the same city or even the same city block. We allow classes to be heterogeneous, i.e., the storage and bandwidth capacities of boxes may differ across classes.

We denote by $M^d$ the storage capacity of boxes in $\mathcal{B}^d$, e.g., the number of items a box can store. We assume that all content items are of identical size; for instance, the original content is chopped into chunks of a fixed size, and the catalog is viewed as a collection of chunks rather than the original items. For each box $b \in \mathcal{B}^d$, let $\mathcal{F}_b \subseteq \mathcal{C}$, where $|\mathcal{F}_b| = M^d$,
be the set of content cached in box $b$. For each class $d$, let

$$
p_c^d = \frac{\sum_{b \in \mathcal{B}^d} \mathbb{1}_{c \in \mathcal{F}_b}}{B^d}, \quad \forall c \in \mathcal{C} \tag{3.1}
$$

be the fraction of boxes in $\mathcal{B}^d$ that store content $c \in \mathcal{C}$. We call $p_c^d$ the replication ratio of item $c$ in class $d$. As the total storage capacity of class $d$ is $B^d M^d$, it is easy to see that, when all caches are full, the replication ratios satisfy

$$
\sum_{c \in \mathcal{C}} p_c^d = M^d, \quad \forall d \in \mathcal{D}. \tag{3.2}
$$

For a fixed set of replication ratios $\{p_c^d\}$, there are many combinations of the exact content placement profiles $\{\mathcal{F}_b\}_{b \in \mathcal{B}^d}$. Therefore, the replication ratio can be viewed as a class-wide description of content placement decisions.

We denote by $U^d$ the upload capacity of boxes in $\mathcal{B}^d$. A box can upload at most $U^d$ content items concurrently, each at a fixed rate. Alternatively, each box has $U^d$ upload "slots": once a box receives a request for content it stores, a free upload slot, if one exists, is taken to serve the request and upload the requested content. For example, a box with 5 Mbps of dedicated upload capacity has $U^d = 5$ slots and is able to serve at most 5 concurrent requests, e.g., video streams, each at a 1 Mbps rate. The service time, e.g., the duration of a streaming session, is assumed to be exponentially distributed with unit mean. Slots remain busy until the upload terminates, at which point they become free again. When all $U^d$ slots of a box are busy, it is unable to serve any additional requests.

As such, we use a loss model [52], a key feature of our design, rather than a queueing model, to capture the service behavior of the system. This choice is based on several reasons. First, an incoming request must be immediately served by a box, or otherwise re-routed to the server, rather than waiting in a queue. Second, most of today's content services, such as video streaming, require a constant bit-rate and do not consume extra bandwidth. Third, we primarily focus on a heavy traffic scenario in which a box's upload bandwidth is rarely under-utilized, as it is in the content provider's interest to offload traffic from the server as much as possible.
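As a small illustration of the box model, the following Python sketch computes the replication ratios of (3.1) from the cache contents of the boxes in one class, and models upload slots without a queue, in the spirit of the loss model. The `Box` class and the function names are hypothetical and are not part of the system described here.

```python
class Box:
    """A box with a fixed cache F_b and a fixed number of upload slots (no queue)."""

    def __init__(self, cache, slots):
        self.cache = frozenset(cache)  # cached content items, |F_b| = M^d
        self.slots = slots             # upload capacity U^d, in slots
        self.busy = 0                  # slots currently uploading

    def try_serve(self, item):
        """Take a free slot if the item is cached; otherwise the request is lost here."""
        if item in self.cache and self.busy < self.slots:
            self.busy += 1
            return True
        return False

    def release(self):
        """Free one slot when an upload terminates."""
        if self.busy > 0:
            self.busy -= 1


def replication_ratios(boxes, catalog):
    """Fraction of boxes in the class that store each item, as in (3.1)."""
    n = len(boxes)
    return {c: sum(1 for b in boxes if c in b.cache) / n for c in catalog}
```

With four boxes of storage capacity 2 holding {a, b}, {a, c}, {b, c} and {a, b}, the ratios are 0.75, 0.75 and 0.5, which sum to the storage capacity 2, as (3.2) requires when all caches are full.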

Request Load

Users (boxes) generate content requests at varying rates across different classes. In particular, each $b \in \mathcal{B}^d$ generates requests for content $c$ according to a Poisson process with rate $\tilde{\lambda}_c^d$. The aggregate request rate for $c$ in class $d$ is $\lambda_c^d = \tilde{\lambda}_c^d B^d$, which scales proportionally to the class size. When a box $b \in \mathcal{B}^d$ storing $c \in \mathcal{C}$ (i.e., $c \in \mathcal{F}_b$) generates a request for $c$, it is served by the local cache, and no downloading is necessary. Otherwise, the request must be served by either the content server $s$ or some other box, in $\mathcal{B}^d$ or in a different class.

We denote by $r_c^{dd'}$ the aggregate request rate routed from class $d$ boxes to class $d'$ boxes, and by $r_c^{ds}$ the request rate routed directly to the server $s$. To meet users' demands, these rates must satisfy

$$
r_c^{ds} + \sum_{d' \in \mathcal{D}} r_c^{dd'} = \lambda_c^d (1 - p_c^d), \quad \forall c \in \mathcal{C},\ d \in \mathcal{D}, \tag{3.3}
$$

i.e., requests not immediately served by local caches in class $d$ are served by server $s$ or a box in some class $d' \in \mathcal{D}$.

Loss Probabilities

Not all requests for content $c$ that arrive at a class $d$ can be served by boxes in $d$. For example, it is possible that no free upload slots exist in the class when the request for $c$ arrives. In such a case, we assume that the request has to be dropped from class $d$ and re-routed to the server $s$. An important performance metric that a content provider cares about is the request loss probability, as it wishes to offload traffic from the server. Let $\nu_c^d$ be the loss probability of item $c$ in class $d$, i.e., the steady-state probability that a request for a content item $c$ is dropped upon its arrival and needs to be re-routed to the server $s$. In general, $\nu_c^d$ depends on the following three factors: (a) the arrival rates $\{r_c^d\}_{c \in \mathcal{C}}$ of requests for different content, where $r_c^d = \sum_{d' \in \mathcal{D}} r_c^{d'd}$ is the aggregate request rate for content $c$ received by class $d$, (b) the content placement profile $\{\mathcal{F}_b\}_{b \in \mathcal{B}^d}$ in class $d$ boxes, and (c) the service assignment algorithm that maps incoming requests to boxes that serve them. We say that the requests for item $c$ are served with high probability (w.h.p.) in class $d$ if

$$
\lim_{B^d \to \infty} \nu_c^d(B^d) = 0, \tag{3.4}
$$

i.e., as the total number of boxes increases, the probability that a request for content $c$ is dropped goes to zero. Two necessary constraints for (3.4) to hold for $d \in \mathcal{D}$ are:

$$
\sum_{c \in \mathcal{C}} r_c^d < B^d U^d, \quad \forall d \in \mathcal{D} \tag{3.5}
$$

$$
r_c^d < B^d U^d p_c^d, \quad \forall c \in \mathcal{C},\ d \in \mathcal{D}. \tag{3.6}
$$

Constraint (3.5) states that the aggregate traffic load imposed on class $d$ should not exceed the total upload capacity over all boxes. In addition, (3.6) states that the traffic imposed on $d$ by requests for $c$ should not exceed the total capacity of the boxes storing $c$. In Sections 3.4 and 3.5, we will show that (3.5) and (3.6) are also sufficient for (3.4), by presenting (a) a service assignment algorithm that maps incoming requests to boxes, and (b) an algorithm for placing the content $\{\mathcal{F}_b\}_{b \in \mathcal{B}^d}$, such that requests for a content item $c \in \mathcal{C}$ are served w.h.p., given that (3.5) and (3.6) hold.

A Global Optimization for Minimizing Costs

We next introduce a global optimization problem that allows the content provider to minimize its operational costs by reducing the cross traffic, while ensuring close-to-zero service loss probabilities.

Minimizing Cross-Traffic Costs

Serving a user request from one class $d$ in another class $d'$ requires transferring content across the class boundaries. The cross-traffic presents a significant operational cost to the content provider. For example, the content provider needs to pay the bandwidth cost at a fixed rate for its outgoing traffic [3]. In particular, we group boxes by ISPs, and the cross-traffic costs are dictated by the transit agreements between peering ISPs, which may vary from one to another. As such, we denote by $w^{dd'}$ the unit bandwidth cost for routing traffic from ISP $d'$ to $d$. As it is the content provider's goal to offload traffic from the central server $s$, we also introduce a cost $w^{ds}$ that represents the unit traffic cost of serving class $d$ requests by the server, such that $w^{ds} > w^{dd'}$ for any $d'$.
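The necessary conditions (3.5) and (3.6) are simple to check for a given class. The sketch below is a minimal illustration; the function name and the dictionary-based representation of the rates and replication ratios are assumptions made here for the example.

```python
def class_is_feasible(incoming, p, num_boxes, upload_slots):
    """Check the necessary conditions (3.5) and (3.6) for one class.

    incoming: dict mapping content c to the aggregate incoming rate r_c^d
    p:        dict mapping content c to the replication ratio p_c^d
    """
    capacity = num_boxes * upload_slots  # R^d = B^d * U^d
    total_ok = sum(incoming.values()) < capacity                           # (3.5)
    per_item_ok = all(r < capacity * p[c] for c, r in incoming.items())    # (3.6)
    return total_ok and per_item_ok
```

For instance, with 100 boxes of 2 slots each, incoming rates of 120 and 50 for two items with replication ratios 0.7 and 0.3 satisfy both conditions, whereas raising the second rate to 70 violates (3.6) even though the total load still fits within (3.5).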

The total weighted cross-traffic costs in the system can be formulated as

$$
\sum_{c \in \mathcal{C}} \sum_{d \in \mathcal{D}} \Big( w^{ds} r_c^{ds} + \sum_{d' \in \mathcal{D}} \big( w^{dd'} r_c^{dd'} (1 - \nu_c^{d'}) + w^{ds} r_c^{dd'} \nu_c^{d'} \big) \Big),
$$

considering the fact that a fraction $\nu_c^{d'}$ of the content $c$ requests arriving at class $d'$ are re-routed to the server due to losses. Given that (3.4) holds, i.e., the loss probability is arbitrarily small as $B$ grows large, the total system costs can be approximated as:

$$
\sum_{c \in \mathcal{C},\, d \in \mathcal{D}} \Big( w^{ds} r_c^{ds} + \sum_{d' \in \mathcal{D}} w^{dd'} r_c^{dd'} \Big) \tag{3.7}
$$

Joint Request Routing and Content Placement Optimization

We next present a global optimization problem that allows a content provider to minimize its operational cost, by controlling request routing and content placement decisions for each class $d$. The content provider deploys and manages these decentralized CDN boxes, and pays the cross-traffic costs to ISPs. It is therefore in its interest to minimize the operational costs. In particular, the service provider needs to determine (a) the content $\mathcal{F}_b$ placed in each box $b$, and (b) where the request generated by each box should be directed to, if the content is not locally cached. Solving this problem over millions of boxes poses a significant scalability challenge. Further, deciding where to place content and how to route requests is, in general, a combinatorial problem and hence computationally intractable. To address these issues, we propose a divide-and-conquer approach that first solves a global optimization problem across classes, and then implements detailed decisions inside each class. Through such an approximation, we wish to minimize the global cost, and at the same time ensure that all requests are served w.h.p. as the system size scales.

Let $r^d = \{r_c^{dd'}\}_{d' \in \mathcal{D} \cup \{s\},\, c \in \mathcal{C}}$ and $p^d = \{p_c^d\}_{c \in \mathcal{C}}$ be the request rates and replication ratios in class $d$, respectively. Let

$$
F^d(r^d) = \sum_{c \in \mathcal{C}} \Big( w^{ds} r_c^{ds} + \sum_{d' \in \mathcal{D}} w^{dd'} r_c^{dd'} \Big) \tag{3.8}
$$

be the total cost generated by class $d$ traffic. A lower bound on the operator's cost is provided by the solution to the following linear program:

\begin{align}
& \text{GLOBAL} \tag{3.9a} \\
\text{minimize} \quad & \sum_{d \in \mathcal{D}} F^d(r^d) \tag{3.9b} \\
\text{subject to} \quad & \sum_{c \in \mathcal{C}} p_c^d = M^d, \quad \forall d \in \mathcal{D} \tag{3.9c} \\
& \sum_{d' \in \mathcal{D}} r_c^{dd'} + r_c^{ds} = \lambda_c^d (1 - p_c^d), \quad \forall c \in \mathcal{C},\ d \in \mathcal{D} \tag{3.9d} \\
& \sum_{c \in \mathcal{C}} r_c^d < R^d, \quad \forall d \in \mathcal{D} \tag{3.9e} \\
& r_c^d < R^d p_c^d, \quad \forall c \in \mathcal{C},\ d \in \mathcal{D} \tag{3.9f} \\
\text{variables} \quad & r_c^{dd'} \ge 0,\ r_c^{ds} \ge 0,\ p_c^d \ge 0, \quad \forall c \in \mathcal{C},\ d, d' \in \mathcal{D}
\end{align}

where $R^d = B^d U^d$ is the total upload capacity in class $d$. The objective of the optimization problem is to minimize the total cost incurred by content transfers. Constraints (3.9c) and (3.9d) correspond to equations (3.2) and (3.3); they state that the full storage capacity of each class is used and that all requests are eventually served, respectively. Constraints (3.9e) and (3.9f) correspond to (3.5) and (3.6), respectively. This optimization problem is a linear program and can be solved efficiently using standard optimization techniques. We later introduce a distributed solution that is scalable and well suited to the geo-distributed infrastructure, without the need for a centralized coordinator that is prone to a single point of failure.

System Architecture

We propose an efficient solution to minimize the operator's costs, given that request rates and replication ratios are chosen so as to solve the optimization problem (3.9), and that within each class service assignment and content placement are performed such that (3.4) holds, as we will describe later in Sections 3.4 and 3.5. To implement these decisions in a distributed fashion, we propose a scalable system architecture by introducing class trackers. These trackers are deployed by the content provider inside each class to manage content distribution for boxes. They collectively solve
the global optimization problem (3.9) in a distributed manner, and at the same time manage content placement and service assignment within their class.

Each class tracker has a complete view of the current state of every box inside its own class. In particular, it has full visibility into (a) which content items are stored in each box, (b) how many free upload slots each box has, and (c) which content each box is uploading. These statistics can be measured and maintained locally at the tracker, as both placing content at boxes and assigning requests to boxes are handled through the trackers, as we discuss below. Class trackers have no a-priori knowledge of the states of boxes in other classes, and only require a lightweight exchange of summary statistics among themselves. In particular, the operations performed by the tracker in class $d$ are as follows:

• Global Optimization. The tracker in class $d$ determines (a) the replication ratio, i.e., the fraction $p_c^d$ of boxes in the class that store content $c \in \mathcal{C}$, and (b) the rates of requests $r_c^{dd'}$, $d' \in \mathcal{D} \cup \{s\}$, that are forwarded to the server or to other class trackers, by solving GLOBAL in a distributed fashion. When performing this optimization, the tracker takes into account the traffic of requests entering class $d$, as well as certain congestion signals it receives from other class trackers.
• Request Routing. Having determined the rates $r_c^{dd'}$, the tracker implements a routing policy for requests generated within its class. In particular, a box that has a cache miss contacts the tracker, which then determines where the request should be routed to (e.g., the server, a box within class $d$, or the tracker in another class) so that the total rates to $d' \in \mathcal{D} \cup \{s\}$ are given by $r_c^{dd'}$.
• Service Assignment. Whenever a request (either internal or external) for a content item is to be served by a box in class $d$, the tracker determines which box in $\mathcal{B}^d$ is going to serve this request. In particular, the tracker uses a service assignment policy that determines how incoming requests for content $c$ should be mapped to a box $b$ that stores $c$, i.e., $c \in \mathcal{F}_b$. If an incoming request cannot be served, the tracker re-routes this request to the server $s$.
• Content Placement. After deciding the replication ratios $p_c^d$ of each content item in class $d$, the tracker allocates the content items to boxes. That is, for each box $b \in \mathcal{B}^d$, it determines $\mathcal{F}_b$ in a manner such that (3.1) is satisfied.
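One simple way a tracker could realize the request routing step is to route each cache-miss request at random, with probabilities proportional to the target rates $r_c^{dd'}$. The Python sketch below illustrates this idea only; randomized proportional splitting is an assumption made for the example and is not stated in the text, and the function names and destination labels are hypothetical.

```python
import random


def routing_probabilities(rates):
    """Normalize target rates r_c^{dd'} (keyed by destination) into probabilities."""
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("at least one destination must have a positive rate")
    return {dest: r / total for dest, r in rates.items()}


def route_request(rates, rng=random):
    """Pick a destination (a class or the server 's') for one cache-miss request.

    Destinations are drawn with probability proportional to their target rate,
    so that long-run traffic shares match the rates chosen by the optimization.
    """
    dests = list(rates)
    weights = [rates[d] for d in dests]
    return rng.choices(dests, weights=weights, k=1)[0]
```

For example, `routing_probabilities({"s": 1.0, "isp1": 3.0})` sends a quarter of the misses to the server and the rest to class `isp1`; passing a seeded `random.Random` instance as `rng` makes the choices reproducible.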

Mgmt Type    Macroscopic/Global    Microscopic/Local
Content      Replication Ratio     Content Placement
Request      Request Routing       Service Assignment

Table 3.2: Operations performed by a class tracker

The above operations are summarized in Table 3.2. They can be grouped into content management operations (replication ratio and placement) and request management operations (routing and service assignment). These operations are of different control granularities, and happen at different time-scales. For instance, content replication and request routing are operations that involve the macroscopic behavior of a class, characterized by aggregate information such as the replication ratios $p_c^d$ and the request rates $r_c^{dd'}$. In contrast, content placement and service assignment are decisions that involve the microscopic management of resources in a class, at the level of individual boxes. In the following sections, we describe each of these operations in more detail: we first show how to determine the replication ratios and request routing in a distributed manner, followed by the service assignment and content placement policies that together provably guarantee that all incoming content requests to a class are served w.h.p.

A Decentralized Solution to the Global Problem

We now present how the trackers solve GLOBAL to determine their content replication ($p^d$) and request routing ($r^d$) parameters in a distributed fashion. In short, trackers exchange messages and adapt these values over several rounds. Our solution ensures that both the request rates and the replication ratios adapt in a smooth fashion, i.e., changes between two iterations are incremental and the system does not oscillate wildly. This is important, as abrupt changes to $p^d$ require reshuffling the content of many boxes, which results in a considerable cost in data transfers. Our presentation proceeds as follows: we first discuss why GLOBAL is difficult to solve in a distributed fashion with standard methods, and then present our distributed implementation.

3.3.1 Standard Dual Decomposition

Consider the partial Lagrangian

$$
L(r, p; \alpha, \beta) = \sum_{d} F^d(r^d) + \sum_{d} \beta^d \Big( \sum_{c, d'} r_c^{d'd} - R^d \Big) + \sum_{d} \sum_{c} \alpha_c^d \Big( \sum_{d'} r_c^{d'd} - R^d p_c^d \Big)
$$

where $\beta^d$ and $\alpha_c^d$ are the dual variables (Lagrange multipliers) associated with the constraints (3.9e) and (3.9f), respectively. Observe that $L$ is separable in the primal variables, i.e., it can be written as $L(r, p; \alpha, \beta) = \sum_d L^d(r^d, p^d; \alpha, \beta)$, where

$$
L^d(r^d, p^d; \alpha, \beta) = F^d(r^d) + \sum_{d'} \Big( \beta^{d'} \sum_{c} r_c^{dd'} + \sum_{c} \alpha_c^{d'} r_c^{dd'} \Big) - \beta^d R^d - \sum_{c} \alpha_c^d R^d p_c^d.
$$

This suggests a standard dual decomposition algorithm [53] for solving GLOBAL. The algorithm runs in multiple rounds, during which the tracker in class $d$ maintains and updates both the primal and dual variables $r^d(t)$, $p^d(t)$, $\alpha^d(t)$, $\beta^d(t)$ corresponding to its class. We write these as functions of the round $t = 0, 1, \ldots$, as they are adapted at each round.

The tracker in class $d$ also maintains estimates of the following quantities. First, for every $c \in \mathcal{C}$, it maintains an estimate of $\lambda_c^d$, i.e., the request rate of $c$ from boxes within its own class. Second, it maintains estimates of the quantities $r_c^d$, i.e., the request rate for content $c$ to be served by boxes in $\mathcal{B}^d$. In practice, $\lambda_c^d$ and $r_c^d$ can be estimated by the use of appropriate counters, one for each rate. These are incremented, e.g., whenever a box within $\mathcal{B}^d$ issues a new request or an external request is received. The desired rate estimates can be obtained at the end of each round by dividing the counters by the round duration $T$. At the beginning of a new round, all counters are reset to zero.

Using the estimates of $r_c^d$, the tracker updates the dual variables $\alpha_c^d$, $\beta^d$ at the end of each round, increasing them when the constraints (3.9e) and (3.9f) are violated and decreasing them when the constraints are loose. That is, for all $c \in \mathcal{C}$,

\begin{align}
\alpha_c^d(t) &= \max \big\{ 0,\ \alpha_c^d(t-1) + \gamma(t) \big( r_c^d(t-1) - R^d p_c^d \big) \big\} \tag{3.10a} \\
\beta^d(t) &= \max \big\{ 0,\ \beta^d(t-1) + \gamma(t) \big( \textstyle\sum_{c} r_c^d(t-1) - R^d \big) \big\} \tag{3.10b}
\end{align}

where $\gamma(t)$ is a decreasing gain factor. At the end of round $t$, each tracker shares its current dual variables with all other trackers. Having all dual variables $\alpha$, $\beta$ in the system, the trackers adapt their primal variables, reducing traffic forwarded to congested classes and increasing traffic forwarded to uncongested ones. This can be performed by setting:

\[ (r^d, p^d)(t+1) = \operatorname*{argmin}_{(r^d, p^d) \in I^d} L^d(r^d, p^d; \alpha(t), \beta(t)), \tag{3.11} \]

where $I^d$ is the set of pairs $(r^d, p^d)$ defined by (3.9c) and (3.9d) as well as the non-negativity constraints. Combined, adaptations (3.10) and (3.11) are known to converge to an optimal solution of the primal problem when the functions $L^d$ are strictly convex (see, e.g., [53], Section 3.4.2). Unfortunately, this is not the case in our setup, as the $L^d$ are linear in both $r^d$ and $p^d$. In practice, the lack of strict convexity makes $r^d$, $p^d$ oscillate wildly with every application of (3.11). This is disastrous in our system: wide oscillations of $p^d$ imply that a large fraction of boxes in $B^d$ need to change their content in each iteration. This is both impractical and costly; ideally, we would like each iteration to change the content of each class smoothly, so that the cost of implementing these changes is negligible.

3.3.2 A Distributed Implementation

We use an interior point method that deals with the lack of strict convexity, called the method of multipliers [53]. First, we convert (3.9e) and (3.9f) to equality constraints by introducing appropriate slack variables $y^d$, $z^d = [z^d_c]_{c \in C}$:

\[ \sum_{c \in C} r^d_c + y^d = R^d, \quad \forall d \in D \tag{3.12a} \]
\[ r^d_c + z^d_c = R^d p^d_c, \quad \forall c \in C,\ d \in D \tag{3.12b} \]
\[ y^d \ge 0,\ z^d_c \ge 0, \quad \forall c \in C,\ d \in D \tag{3.12c} \]

Under this modification, GLOBAL has the following properties. First, the objective (3.9b) is separable in the local variables $(r^d, p^d, y^d, z^d)$ corresponding to each class. Second, the constraints (3.12a) and (3.12b) coupling the local variables are linear equalities. Finally, the remaining constraints (3.9c) and (3.9d) as well as the positivity constraints define a bounded

convex domain for the local primal variables. As a result, the method of multipliers admits a distributed implementation (see [53], Example 4.4). Applying this distributed implementation to GLOBAL, we obtain the algorithm summarized in Fig. 3.1.

Tracker d at the end of round k:
  Obtain estimates of $\lambda^d_c$, $r^d_c$, $c \in C$.
  // Update dual variables
  $s^d_{tot} \leftarrow \frac{1}{|D|} \big( \sum_{c \in C} r^d_c + y^d - R^d \big)$
  $\beta^d \leftarrow \beta^d + \theta s^d_{tot}$
  for each content c
    $s^d_c \leftarrow \frac{1}{|D|} \big( r^d_c + z^d_c - R^d p^d_c \big)$
    $\alpha^d_c \leftarrow \alpha^d_c + \theta s^d_c$
  end for
  Broadcast $(\alpha^d, \beta^d, s^d, s^d_{tot})$ to other trackers $d' \in D$
  Receive dual variables from all other trackers $d' \in D$
  // Update primal variables
  $(r^d, p^d, z^d, y^d) \leftarrow \operatorname{argmin}_{I^d} \mathrm{LOCAL}^d(r^d, p^d, z^d, y^d, \alpha, \beta, s, s_{tot})$

Figure 3.1: Decentralized solution to the global problem GLOBAL.

The tracker in class $d$ maintains the following local variables, which correspond to the primal and dual variables of (3.9): $r^d(t)$, $p^d(t)$, $z^d(t)$, $y^d(t)$, $\alpha^d(t)$, $\beta^d(t)$. As in the dual decomposition method, the tracker in class $d$ also maintains estimates of $\lambda^d_c$, i.e., the request rate of $c$ from boxes within its own class, and $r^d_c$, i.e., the request rate for content $c$ to be served by boxes in $B^d$. Using these estimates, the primal and dual variables are updated as follows. At the end of round $t$, the tracker in class $d$ uses the estimates of $r^d_c$ to see whether constraints (3.12a) and (3.12b) are violated or not. In particular, the tracker computes the congestion signals

\[ s^d_{tot}(t) = \Big( \sum_c r^d_c(t) + y^d(t) - R^d \Big) / |D|, \]
\[ s^d_c(t) = \big( r^d_c(t) + z^d_c(t) - R^d p^d_c(t) \big) / |D|, \quad \forall c \in C. \]
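The dual-variable part of Figure 3.1 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `dual_step`, the dictionary-based representation, and the argument names are hypothetical and not part of the dissertation; the arithmetic follows the congestion signals and multiplier updates shown in the figure.

```python
def dual_step(r, y, z, p, R, alpha, beta, theta, num_classes):
    """One dual update for a single tracker (hypothetical helper).

    r, z, p, alpha: dicts mapping content -> value for this class.
    y, R, beta, theta: floats. num_classes plays the role of |D|.
    Returns (new_alpha, new_beta, s, s_tot).
    """
    # Congestion signal for the aggregate capacity constraint (3.12a).
    s_tot = (sum(r.values()) + y - R) / num_classes
    new_beta = beta + theta * s_tot

    s = {}
    new_alpha = {}
    for c in r:
        # Congestion signal for the per-content constraint (3.12b).
        s[c] = (r[c] + z[c] - R * p[c]) / num_classes
        new_alpha[c] = alpha[c] + theta * s[c]
    return new_alpha, new_beta, s, s_tot


# Small illustrative call with made-up numbers.
example = dual_step(
    r={"a": 2.0, "b": 1.0}, y=1.0,
    z={"a": 0.0, "b": 0.5}, p={"a": 0.75, "b": 0.25},
    R=4.0, alpha={"a": 1.0, "b": 0.0}, beta=0.3, theta=2.0, num_classes=2,
)
```

Unlike the projected update (3.10), this step has no `max{0, .}` projection, because the slack variables turn the constraints into equalities.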

The tracker then updates the dual variables as follows:

\[ \beta^d(t) = \beta^d(t-1) + \theta(t)\, s^d_{tot}(t), \]
\[ \alpha^d_c(t) = \alpha^d_c(t-1) + \theta(t)\, s^d_c(t), \quad \forall c \in C, \]

where $\{\theta(t)\}_{t \in \mathbb{N}}$ are positive and non-decreasing. Subsequently, the tracker broadcasts to every other tracker in $D$ a message containing its new dual variables as well as the congestion signals: $\alpha^d(t)$, $\beta^d(t)$, $s^d(t)$, $s^d_{tot}(t)$. Note that these comprise $2(|D| + 1)$ values in total.

For any $d, d' \in D$, let

\[ G^{d'}_{tot}(r^d, y^d) = \sum_c r_c^{dd'} + 1_{d=d'}\, y^d, \qquad G^{d'}_c(r^d, p^d, z^d) = r_c^{dd'} + 1_{d=d'} \big( z^d_c - R^d p^d_c \big). \]

Intuitively, these capture the contribution of the primal variables of class $d$ to the constraints (3.12a) and (3.12b) of class $d'$. After the tracker in class $d$ has received all the messages sent by other trackers, it solves the following quadratic program:

LOCAL$^d$($r^d(t)$, $p^d(t)$, $z^d(t)$, $y^d(t)$, $\alpha(t)$, $\beta(t)$, $s(t)$, $s_{tot}(t)$)

\begin{align}
\text{minimize} \quad & F^d(r^d) \tag{3.13a} \\
& + \sum_{d'} \beta^{d'}(t)\, G^{d'}_{tot}(r^d, y^d) + \sum_{d',c} \alpha^{d'}_c(t)\, G^{d'}_c(r^d, p^d, z^d) \tag{3.13b} \\
& + \frac{\theta(t)}{2} \sum_{d'} \Big[ \big( G^{d'}_{tot}(r^d - r^d(t),\, y^d - y^d(t)) + s^{d'}_{tot}(t) \big)^2 \tag{3.13c} \\
& \qquad + \sum_c \big( G^{d'}_c(r^d - r^d(t),\, p^d - p^d(t),\, z^d - z^d(t)) + s^{d'}_c(t) \big)^2 \Big] \tag{3.13d} \\
\text{subject to} \quad & (r^d, p^d, z^d, y^d) \in J^d, \quad d \in D \tag{3.13e} \\
\text{variables} \quad & r^d, p^d, z^d, y^d, \quad d \in D
\end{align}

where $J^d$ is the set of quadruplets $(r^d, p^d, y^d, z^d)$ defined by (3.9c) and (3.9d) as well as the non-negativity constraints. LOCAL$^d$ thus receives as input all the dual variables $\alpha$, $\beta$, the congestion variables $s^d$, $s^d_{tot}$, as well as all the local primal variables at round $t$. The last four are included in

the quadratic terms appearing in the objective function, and ensure the smoothness of the changes to the primal variables from one round to the next. The following theorem follows directly from the analysis in [53] and establishes that this algorithm indeed converges to an optimal solution.

Theorem 8. Assume that the tracker in class $d$ correctly estimates $\lambda^d_c$, $r^d_c$ in each iteration, and that $\{\theta(t)\}_{t \in \mathbb{N}}$ is a non-decreasing sequence of non-negative numbers. Then $\lim_{t \to \infty} (r^d(t), p^d(t)) = (r^{d,*}, p^{d,*})$, where $(r^{d,*}, p^{d,*})_{d \in D}$ is an optimal solution to (3.9).

3.4 Request Routing and Service Assignment

In this section, we study how to route user requests, based on the replication ratios $\{p^d\}$ and request rates $\{r^d\}$ calculated by the global cost optimization problem (3.9). Our request routing operates at two resolutions: (a) inter-domain request routing, in which a user request is first directed to a class $d$ (either its home class or a remote class), and (b) intra-domain service assignment, in which a user request arriving at class $d$ is further assigned to a specific box within class $d$ that is capable of serving this request. The key idea of our service assignment strategy is that, at any time, the following property holds: given that conditions (3.9e) and (3.9f) are true for a class $d$, all requests routed to class $d$ are served w.h.p. Intuitively, our strategy ensures that, as the total box population size $B$ grows, the probability that a request reaching a class cannot be accommodated converges to zero.

3.4.1 Inter-Domain Request Routing

Content requests that are not served locally due to cache misses must be directed to another box inside the home class, to a remote class, or to the server $s$. The global optimization problem (3.9) calculates the inter-domain request routing decision, i.e., $r^d$, for every class $d$. Eq. (3.9d) dictates that, for every content $c$, the designated forwarding rates to all classes and the server sum up to $\lambda_c (1 - p^d_c)$, the effective request rate coming out of this class.
In practice, a class $d$ request is directed to another class (or server) $d' \in D \cup \{s\}$ with a probability proportional to $r_c^{dd'}$. As a result, request volumes forwarded from class $d$ to $d'$ are independent Poisson processes with rates $r_c^{dd'}$. This provides a simple way to implement the request routing decisions optimized by the global problem.
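The proportional routing rule above can be sketched as a weighted random choice. This is a minimal illustration, not the dissertation's implementation: the function name `route_request` and the dictionary of per-destination rates are assumptions introduced here.

```python
import random


def route_request(rates, rng):
    """Pick a destination class (or the server) for one request.

    rates: dict mapping destination -> forwarding rate r_c^{dd'} for the
           requested content c (hypothetical representation).
    rng:   a random.Random instance, passed in for reproducibility.
    """
    destinations = list(rates)
    weights = [rates[d] for d in destinations]
    if sum(weights) <= 0:
        raise ValueError("at least one forwarding rate must be positive")
    # Destination d' is chosen with probability r_c^{dd'} / sum of rates.
    return rng.choices(destinations, weights=weights, k=1)[0]


rng = random.Random(0)
picks = [route_request({"home": 0.0, "remote": 3.0}, rng) for _ in range(20)]
```

Splitting a Poisson request stream by independent random choices of this kind yields independent Poisson streams per destination, which is the property the text relies on.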

3.4.2 Intra-Domain Service Assignment

We next present intra-domain service assignment policies, i.e., policies for selecting a specific box inside a class, given the inter-domain request routing decisions made in the first step. The service assignment policy determines which box an incoming request should be assigned to upon its arrival. If no available box is found, the request is re-routed to the server. In what follows, we drop the superscript $d$, since we primarily focus on the analysis of a single class. We therefore denote by $\mathcal{B}$ the set of boxes in the class, by $B = |\mathcal{B}|$ the size of the box population, by $U$ and $M$ their upload and storage capacities, respectively, and by $r_c$ the arrival rates of requests for content $c$.

We study two service assignment policies, repacking and uniform slot, that map a request to a box that is capable of serving this request. For both policies, we establish a necessary and sufficient condition under which the system can serve all incoming traffic asymptotically almost surely, i.e., w.h.p., as the system size increases. In other words, we characterize the capacity region for the two service assignment policies, given a particular content placement decision by the system.

Uniform Slot Policy. Under this policy, an incoming request for content $c$ is assigned to a box selected among all boxes currently storing $c$ and having an empty upload slot. Each such box is selected with a probability proportional to the number of its empty slots. Equivalently, the request is matched to an upload slot selected uniformly from all free upload slots of boxes that store $c$.

Definition 9. Let $X_b$ be the number of free upload slots of a box $b \in \mathcal{B}$. Under the uniform slot policy, an incoming request to class $d$ for content $c$ is matched to a slot selected uniformly at random among the $\sum_{b \in \mathcal{B} : c \in F_b} X_b$ free slots over all boxes. Note that if this sum is zero (i.e., no box storing content $c$ has a free upload slot), the incoming request is re-routed to the server.

Repacking Policy.
A limitation of the uniform slot policy is that it does not guarantee that an incoming request can always be accommodated, even though the system may be capable of serving the request by migrating existing services. One way to address this is through repacking [54, 55], as illustrated in the example of Figure 3.2(a).
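For comparison with repacking, the uniform slot policy of Definition 9 can be sketched as follows. The data layout (a dictionary of stored content sets and a dictionary of free-slot counts) and the name `uniform_slot_assign` are assumptions made for this illustration.

```python
import random


def uniform_slot_assign(content, stored, free_slots, rng):
    """Assign one request under the uniform slot policy.

    stored:     dict box -> set of contents cached at that box (F_b).
    free_slots: dict box -> number of free upload slots (X_b); mutated.
    Returns the chosen box, or None if the request must go to the server.
    """
    candidates = [b for b in stored
                  if content in stored[b] and free_slots[b] > 0]
    if not candidates:
        return None  # no free slot among boxes storing the content
    # Choosing a box with probability proportional to its free slots is
    # equivalent to choosing one free slot uniformly at random.
    weights = [free_slots[b] for b in candidates]
    box = rng.choices(candidates, weights=weights, k=1)[0]
    free_slots[box] -= 1
    return box


rng = random.Random(1)
stored = {"b1": {"c1", "c2"}, "b2": {"c2"}}
free = {"b1": 0, "b2": 2}
first = uniform_slot_assign("c1", stored, free, rng)   # b1 is full
second = uniform_slot_assign("c2", stored, free, rng)  # only b2 is eligible
```

The first call illustrates the limitation discussed above: the request for `c1` is rejected even though migrating a download away from `b1` could have freed a slot.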

[Figure 3.2: The repacking policy improves the resource utilization by allowing existing downloads to be migrated. Panels show boxes and requests before and after repacking: (a) one-hop migration; (b) two-hop migration.]

In this example, a request for content $c_1$ arrives but no box storing $c_1$ currently has a free slot. There exists a box $b$ that stores $c_1$ but is using one of its upload slots for serving another content $c_2$. In addition, there exists another box $b'$ that both stores $c_2$ and has a free slot. If the download of content $c_2$ is migrated from $b$ to $b'$, it immediately releases a free slot at box $b$ and allows the incoming request for $c_1$ to be served. Even if all boxes storing $c_2$ have no empty slots, it is possible that they can free one of their busy slots through a chain of migration events, as illustrated in Figure 3.2(b). We term this strategy repacking. Note that repacking does not require any change of existing content placement on any box, and thus can be implemented in practice.

In general, a repacking operation might involve multiple migrations of existing services in order to free resources for an incoming request. To define it more precisely, consider a bipartite graph $G(V_{req} \cup V_{slot}, E)$, where $V_{req}$ is the set of request vertices, each corresponding to a request, and $V_{slot}$ is the set of slot vertices, each corresponding to an upload slot of some box. $E$ is the set of edges, where an edge between a request and a slot exists if the slot belongs to a box that

can serve the request, i.e., that stores the requested content. The service capacity of the system is characterized by the following lemma:

Lemma 10. All requests in $V_{req}$ can be served if and only if there exists a maximum matching $\mathcal{M} \subseteq E$ on $G$ that is incident to all nodes in $V_{req}$.

We define the repacking policy formally by introducing a matching problem as follows. At any time, let the graph $G(V_{req} \cup V_{slot}, E)$ include the existing requests being served, and let $\mathcal{M}$ be the maximum matching that represents the existing service assignment. When a new request arrives, we expand the graph $G$ by adding the request to $V_{req}$ and creating edges between the request and all upload slots in $V_{slot}$ that can serve it. If a new maximum matching $\mathcal{M}'$ that covers the new request exists, the incoming request is said to be served. In particular, $\mathcal{M}'$ can be adapted from $\mathcal{M}$ by finding an augmenting path that includes the new request in the bipartite graph. In practice, the augmenting path indicates how existing requests should be migrated. We now formally define the repacking policy as follows:

Definition 11. Under the repacking policy, an incoming request is matched to a slot by migrating existing service assignments, through the search for an augmenting path that yields a maximum matching in the expanded bipartite graph. If no such maximum matching exists, the request is dropped and re-routed to the server.

As the size of the bipartite graph $G$ cannot exceed $2 B^d U^d$, the repacking policy can be implemented in time polynomial in the number of boxes. The repacking policy achieves a lower request dropping rate than the uniform slot policy, since allowing services to be migrated improves the resource utilization. However, this comes at the cost of a higher complexity, since we need to solve a maximum matching problem at the arrival of every new request. Further, migrating existing services may introduce additional user latency.
[54] shows that, given certain conditions on content placement and request rates, the repacking policy achieves zero request dropping rate with high probability, i.e., all incoming requests directed to a class can be successfully served. While the uniform slot policy is simple and easy to implement in practice, it raises the question of whether it achieves the same optimal performance as the repacking policy.
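The augmenting-path search behind Definition 11 can be sketched with the standard depth-first augmentation used for bipartite matching. The representation below (one content set per upload slot, a dictionary holding the current matching) and the name `repack` are assumptions for illustration; the example mirrors the one-hop migration of Figure 3.2(a).

```python
def repack(new_req, wanted, slot_contents, slot_of):
    """Try to serve new_req by migrating existing assignments.

    wanted:        dict request -> requested content (must include new_req).
    slot_contents: list where entry s is the content set of the box that
                   owns upload slot s.
    slot_of:       dict request -> slot index (current matching); it is
                   updated in place only if an augmenting path is found.
    Returns True if the request is served, False if it must be dropped.
    """
    request_at = {s: q for q, s in slot_of.items()}

    def augment(req, visited):
        for s, contents in enumerate(slot_contents):
            if wanted[req] in contents and s not in visited:
                visited.add(s)
                holder = request_at.get(s)
                # Use the slot if it is free, or if its current request
                # can itself be migrated further along the path.
                if holder is None or augment(holder, visited):
                    request_at[s] = req
                    slot_of[req] = s
                    return True
        return False

    return augment(new_req, set())


# Box b stores {c1, c2} and is busy serving c2; box b' stores {c2} and is free.
slot_contents = [{"c1", "c2"}, {"c2"}]
wanted = {"q1": "c2", "q2": "c1", "q3": "c1"}
matching = {"q1": 0}
served_q2 = repack("q2", wanted, slot_contents, matching)
served_q3 = repack("q3", wanted, slot_contents, matching)
```

After the first call, the download for `q1` has been migrated to the second slot, which frees the first slot for `q2`. The second call fails because no slot storing `c1` can be freed, so the matching is left unchanged.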

3.4.3 Optimality of Uniform Slot Policy

The effectiveness of a service assignment policy depends on (i) how contents are placed across boxes, and (ii) the request rates for different contents. We study the conditions under which incoming requests can be successfully served with zero dropping rate, i.e., we characterize the capacity region of the two service assignment policies. We first introduce a condition under which the repacking policy achieves zero dropping rate, and then extend the result to the uniform slot policy. Given that the repacking policy is more sophisticated than the uniform slot policy, one might expect the latter to exhibit a higher loss probability. Nonetheless, we prove that the uniform slot policy achieves the same performance asymptotically, i.e., when the system size (number of boxes) is large. This suggests a practical design choice: the simple, lightweight uniform slot policy can be used without losing performance optimality.

Consider a collection of contents $F \subseteq C$ such that $|F| = M$. Let $\mathcal{B}_F = \{b \in \mathcal{B} : F_b = F\}$ be the set of boxes that store exactly $F$. These sets partition $\mathcal{B}$ into sub-classes, each comprising boxes that store identical contents. As the number of boxes $B = |\mathcal{B}|$ goes to infinity, the request arrival rates $r_c$ and the size of each sub-class $B_F = |\mathcal{B}_F|$ scale proportionally to $B$. That is, the quantities

\[ \rho_c = r_c / (BU), \qquad \beta_F = B_F / B, \qquad \forall c \in C,\ \forall F \subseteq C \]

are constants that do not depend on $B$. The scaling follows from our traffic model: as the number of boxes increases, the content demand, the aggregate storage, and the total upload capacity grow proportionally with the system population.

We formally define the performance metric as follows. Let $\nu_c$ be the loss probability of content $c$, i.e., the steady state probability that a request for content $c$ is dropped (and re-routed to the server) upon its arrival. We say that the requests for item $c$ are served asymptotically almost surely (w.h.p.) if

\[ \lim_{B \to \infty} \nu_c(B) = 0. \tag{3.14} \]

We characterize the capacity region of the uniform slot and the repacking policies by answering the following question: given a fixed content placement across boxes, what are the conditions on request arrival rates such that (3.14) holds? Consider the following condition:

\[ \sum_{c \in A} r_c < \sum_{F : F \cap A \neq \emptyset} B_F U, \quad \forall A \subseteq C. \tag{3.15} \]

Eq. (3.15) states that, for any set of contents $A \subseteq C$, the total request rate for these contents is strictly less than the aggregate upload capacity of the boxes that store at least one content in $A$. This condition is similar to Hall's theorem, which establishes the existence of a maximum matching in a bipartite graph, except that here both the traffic and the graph change over time. [54] shows that, if the repacking policy is used, condition (3.15) is sufficient for every content request to be served w.h.p., i.e., (3.14) holds. We present the following theorem, which establishes that if the uniform slot policy is used, (3.15) is also sufficient for every content request to be served w.h.p.

Theorem 12. Given that (3.15) holds, and that requests are assigned to boxes according to the uniform slot policy, (3.14) is true, i.e., requests for every content $c \in C$ are served w.h.p.

We stress that this is an asymptotic result: for any system of finite size, the repacking policy outperforms the uniform slot policy in terms of the loss probabilities $\nu_c$. This, however, comes at the cost of a more complicated implementation for the repacking policy, which requires solving a maximum matching problem upon each arrival of a content request. In practice, a variant of the repacking policy that only involves migration chains of limited length, e.g., re-assigning requests that incur one or two migrations, can be used to further lower loss probabilities at some computational cost. However, as the system size $B$ increases, the relative benefit vanishes. We later show in the evaluation that the uniform slot policy performs very close to the repacking policy in practice, under realistic traffic traces and system settings.
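For a small catalog, condition (3.15) can be checked directly by enumerating every non-empty subset $A$. The sketch below is a brute-force illustration only (its running time is exponential in the number of contents); the function name and the list-of-sets placement format are assumptions, not something taken from the dissertation.

```python
from itertools import combinations


def satisfies_condition_3_15(rates, box_contents, upload):
    """Brute-force check of condition (3.15) for one class.

    rates:        dict content -> request rate r_c.
    box_contents: list with one content set F_b per box.
    upload:       upload capacity U of each box (number of slots).
    """
    contents = list(rates)
    for size in range(1, len(contents) + 1):
        for subset in combinations(contents, size):
            A = set(subset)
            demand = sum(rates[c] for c in A)
            # Boxes storing at least one content of A contribute U each.
            capacity = upload * sum(1 for f in box_contents if A & set(f))
            if not demand < capacity:
                return False
    return True


ok = satisfies_condition_3_15(
    {"c1": 1.0, "c2": 1.0}, [{"c1"}, {"c2"}], upload=2)
overloaded = satisfies_condition_3_15(
    {"c1": 3.0, "c2": 0.5}, [{"c1"}, {"c2"}], upload=2)
# Each content fits alone, but the pair exceeds the single box's capacity.
joint_violation = satisfies_condition_3_15(
    {"c1": 1.5, "c2": 1.5}, [{"c1", "c2"}], upload=2)
```

The third case shows why the condition must hold for every subset and not only for individual contents.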
We present the detailed proof of Theorem 12 below. Readers can skip the proof without affecting the understanding of other sections.

Proof of Theorem 12. We partition the set of boxes $\mathcal{B}$ according to the contents they store in their cache. In particular, the boxes in $\mathcal{B}$ are grouped into $L$ sub-classes $\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_L$, where all boxes in the $i$-th class $\mathcal{B}_i$, $1 \le i \le L$, store the same set of contents $F_i \subseteq C$ such that $|F_i| = M$.

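We denote by $\mathcal{L} = \{1, 2, \ldots, L\}$ the set of sub-class indices and by $B_i = |\mathcal{B}_i|$ the number of boxes in the $i$-th sub-class. For $A \subseteq C$, let $\mathrm{class}(A) = \{i \in \mathcal{L} : F_i \cap A \neq \emptyset\}$ be the set of sub-classes of boxes storing at least one content in $A$. Let

\[ \beta_i = \frac{B_i U}{B U},\ i \in \mathcal{L}, \qquad \text{and} \qquad \rho_c = \frac{r_c}{B U},\ c \in C. \]

Note that, by our scaling assumption, the above quantities remain constant as $B \to \infty$. Note that (3.15) can be rewritten as:

\[ \sum_{c \in A} \rho_c < \sum_{j \in \mathrm{class}(A)} \beta_j, \quad \forall A \subseteq C. \]

Let $X_i$ be the number of empty slots in the $i$-th sub-class. Then, under the uniform slot service assignment policy, the stochastic process $X : \mathbb{R}_+ \to \mathbb{N}^L$ is a Markov process and can be described as follows:

\[ X_i(t) = X_i(0) + E^+_i \Big( \int_0^t \big( B_i U - X_i(\tau) \big)\, d\tau \Big) - E^-_i \Big( \int_0^t \sum_{c \in F_i} r_c \frac{X_i(\tau)}{\sum_{j : c \in F_j} X_j(\tau)}\, d\tau \Big), \quad i \in \mathcal{L}, \tag{3.16} \]

where $E^+_i$, $E^-_i$, $i \in \mathcal{L}$, are independent unit-rate Poisson processes. In the above, we assume by convention that, for all $i \in \mathcal{L}$ and all $c \in F_i$,

\[ \frac{X_i(\tau)}{\sum_{j : c \in F_j} X_j(\tau)} = 0 \quad \text{whenever} \quad \sum_{j : c \in F_j} X_j(\tau) = 0. \]

We will say that a mapping $x : \mathbb{R}_+ \to [0, 1]^L$ is a fluid trajectory of the system if it satisfies the following set of equations:

\[ x_i(t) = x_i(0) + \beta_i t - \int_0^t x_i(\tau)\, d\tau - \sum_{c \in F_i} \rho_c \int_0^t z_{c,i}(x(\tau))\, d\tau, \quad i \in \mathcal{L}, \tag{3.17} \]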

where $z_{c,i}$, $c \in F_i$, are functions satisfying $z_{c,i}(x) \ge 0$ and

\[ z_{c,i}(x) = \frac{x_i}{\sum_{j : c \in F_j} x_j}, \quad \text{if } \sum_{j : c \in F_j} x_j > 0, \tag{3.18a} \]
\[ \sum_{i \in \mathrm{class}(\{c\})} z_{c,i}(x) \le 1, \quad \text{otherwise.} \tag{3.18b} \]

Given a vector $x^0 \in [0, 1]^L$, we define $S(x^0)$ to be the set of fluid trajectories defined by the integral equation (3.17) with initial condition $x(0) = x^0$.

Lemma 13. Consider a sequence of positive numbers $\{B^k\}_{k \in \mathbb{N}}$ such that $\lim_{k \to \infty} B^k = +\infty$, and a sequence of initial conditions $X^k(0) = [x^k_i]_{1 \le i \le L}$ such that the limit $\lim_{k \to \infty} \frac{1}{B^k} X^k(0) = x^0$ exists. Let $\{X^k(t)\}_{t \in \mathbb{R}_+}$ denote the Markov process given by (3.16) when $B = B^k$, and consider the rescaled process $x^k(t) = \frac{1}{B^k U} X^k(t)$, $t \in \mathbb{R}_+$. Then, for all $T > 0$ and all $\epsilon > 0$,

\[ \lim_{k \to \infty} P \Big( \inf_{x \in S(x^0)} \sup_{t \in [0, T]} \| x^k(t) - x(t) \| \ge \epsilon \Big) = 0. \]

Proof. The proof is adapted from Massoulié [56]. The only key steps that we need to verify for our system are that, for every $i$ and every $c \in F_i$, (a) $X_i / \sum_{j : c \in F_j} X_j$ is bounded (by 1), and (b) at its points of discontinuity $x^*$ (that is, points where $\sum_{j : c \in F_j} X_j$ is zero), $\limsup_{x \to x^*} X_i / \sum_{j : c \in F_j} X_j = 1$ (where convergence is taken, e.g., in the $\|\cdot\|_1$ norm). Both are easy to verify.

Lemma 13 implies that (a) the set of fluid trajectories $S(x)$ is non-empty and (b) the rescaled process $x^k$ converges in probability, on every finite interval $[0, T]$, to a fluid trajectory as $B \to \infty$. We therefore turn our attention to studying the asymptotic behavior of such fluid trajectories. Given an $x^0 \in [0, 1]^L$, consider a fluid trajectory $x \in S(x^0)$. Since the $z_{c,i}$ are bounded by 1, (3.17) implies that $x$ is Lipschitz continuous (with parameter $\sum_c \rho_c$). Hence, by Rademacher's theorem, $\dot{x}$ exists almost everywhere and is given by

\[ \dot{x}_i = \beta_i - x_i - \sum_{c \in F_i} \rho_c z_{c,i}(x), \quad i \in \mathcal{L}. \tag{3.19} \]
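To build intuition for the fluid dynamics (3.19), one can integrate them numerically with a forward Euler scheme. The sketch below is an illustration under assumptions: the function names are hypothetical, and when no sub-class storing $c$ has free capacity the term $z_{c,i}$ is simply set to zero, which is one admissible choice consistent with the stated bounds rather than a choice prescribed by the text.

```python
def fluid_step(x, beta, rho, stores, dt):
    """One forward-Euler step of the fluid ODE (3.19).

    x:      list of free-slot fractions x_i, one per sub-class.
    beta:   list of sub-class sizes beta_i (fractions of the population).
    rho:    dict content -> normalized request rate rho_c.
    stores: list of content sets F_i, one per sub-class.
    """
    new_x = []
    for i, xi in enumerate(x):
        drift = beta[i] - xi
        for c in stores[i]:
            total = sum(x[j] for j in range(len(x)) if c in stores[j])
            if total > 0:
                drift -= rho[c] * xi / total  # z_{c,i} as in (3.18a)
            # total == 0: z_{c,i} is taken to be 0 (assumption, see lead-in)
        new_x.append(min(1.0, max(0.0, xi + dt * drift)))
    return new_x


def simulate(x0, beta, rho, stores, dt=0.1, steps=500):
    x = list(x0)
    for _ in range(steps):
        x = fluid_step(x, beta, rho, stores, dt)
    return x


# One sub-class storing a single content with normalized load 0.5.
final = simulate([1.0], beta=[1.0], rho={"c": 0.5}, stores=[{"c"}])
```

In this single-class example the ODE reduces to $\dot{x} = 0.5 - x$, so the free-slot fraction settles at $\beta - \rho = 0.5$, bounded away from zero as the proof requires.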

Define the limit set [57] of ODE (3.19) to be

\[ J = \lim_{t \to \infty} \bigcup_{y \in [0,1]^L} \{ x(s),\ s \ge t : x(0) = y \}. \]

Then, Lemma 13 implies (see Thm. 3 in Benaïm and Le Boudec [58]) that, as $k$ tends to infinity, the support of the steady state probability of $X^k$ converges to a subset of $J$. Thus, to show that queries for every content item succeed asymptotically almost surely, it suffices to show that $x_i > 0$ for every $x \in J$. This is indeed the case. Let $I_0(x) = \{i : x_i = 0\}$ be the zero-valued coordinates of $x$, and let $C(I) = \{c \in C : \mathrm{class}(\{c\}) \subseteq I\}$ denote the set of items stored only by classes in $I \subseteq \mathcal{L}$. Consider the following candidate Lyapunov function:

\[ G(x) = \sum_{i \in \mathcal{L}} \beta_i \log(x_i) - \sum_{i \in \mathcal{L}} x_i - \sum_{c \in C} \rho_c \log \Big( \sum_{j \in \mathrm{class}(\{c\})} x_j \Big), \quad \text{if } x > 0, \]

and $G(x) = -\infty$ otherwise.

Lemma 14. Under (3.15), $G$ is continuous in $[0, 1]^L$.

Proof. Consider an $x^*$ such that $I \equiv I_0(x^*) \neq \emptyset$. Consider a sequence $x^k \in [0, 1]^L$, $k \in \mathbb{N}$, such that $x^k \to x^*$ in the $\|\cdot\|_\infty$ norm (or any equivalent norm in $\mathbb{R}^L$). We need to show that $\lim_{k \to \infty} G(x^k) = -\infty$. If $I_0(x^k) \neq \emptyset$ for some $k$, then $G(x^k) = -\infty$; hence, w.l.o.g., we can assume that $x^k \in (0, 1]^L$. Then $G(x^k) = A^k + B^k$, where

\[ A^k = \sum_{i \in \mathcal{L} \setminus I} \beta_i \log(x^k_i) - \sum_{i \in \mathcal{L}} x^k_i - \sum_{c \in C \setminus C(I)} \rho_c \log \Big( \sum_{j \in \mathrm{class}(\{c\})} x^k_j \Big), \]
\[ B^k = \sum_{i \in I} \beta_i \log(x^k_i) - \sum_{c \in C(I)} \rho_c \log \Big( \sum_{j \in \mathrm{class}(\{c\})} x^k_j \Big). \]

Then, by the continuity of the log function and the fact that $x^*_i > 0$ for all $i \notin I$, it is easy to see that $\lim_{k \to \infty} A^k$ exists and is finite. On the other hand, if $C(I) = \emptyset$, $B^k$ obviously converges to $-\infty$. Assume thus that $C(I)$ is non-empty. Partition $I$ into classes $I_1, \ldots, I_m$ such that $x^k_i = y^k_\ell$ for all

87 i I (i.e., a coordinates in a cass assume the same vaue). B k β i og(x k i ) ( ρ c og i I c C(I) m = og(y k ) m β i og(y k ) i I =1 =1 m og(y k ) β i i I =1 c C(I ) max j cass({c}) xk j ρ c c C(I) ) ρ c 1 y k =max j cass({c}) x k j since 1 y k =max j cass({c}) x k j 1 cass({c}) I = and og(y k ) < 0 for k arge enough. Under (3.15), the above quantity tends to as k, and the emma foows. Suppose that I 0 (x(t)) =, i.e.. x(t) > 0. Then (3.19) gives d(og x i ) dt = ẋi = β i 1 1 ρ c = G x i x i c F i j:c F j x j x i Hence, dg( x(t)) dt = i G x i ẋ i = i ( ) 2 G x i 0, x i i.e., when at x, G is increasing as time progresses under the dynamics (3.19). This, impies that if x(t) > 0 then the fuid trajectory wi stay bounded away from any x s.t. I 0 (x), as G(x ) = and by Lemma 14 to reach such an x the quantity G(x(t)) woud have to decrease, a contradiction. Suppose now that I = I 0 (x(t)). We wi show that I 0 (x(t + δ) =, for sma enough δ. Our previous anaysis for the case I 0 (x) therefore appies and the theorem, as the imit set L cannot incude points x such that I 0 (x 0 ). By (3.17) and (3.18), fuid trajectories are Lipschitz continuous; hence, for δ sma enough, x i (t + δ) > 0 for a i / I. Moreover, by (3.19) d i I x i dt = β i ρ c z c,i (x) (3.18a) = β i i I i I c F i i I i I (3.18b) β i (3.15) ρ c > 0. i I c C(I) c F i C(I) ρ c z c,i (x) 74

Hence, for $\delta$ small enough, there exists at least one $i \in I$ such that $x_i(t + \delta) > 0$. Given that $x_i$, $i \notin I$, will stay bounded away from zero within this interval, this implies that within $\delta$ time (where $\delta$ is small enough) all coordinates in $I$ will become positive. Hence, $I_0(x(t + \delta)) = \emptyset$ for small enough $\delta$.

3.5 Content Placement

Condition (3.15) stipulates that every subset of $C$ should be stored by enough boxes to serve incoming requests for the contents in this set. In this section, we describe a content placement scheme under which, if the conditions (3.9e) and (3.9f) of GLOBAL hold, then so does (3.15). As a result, provided that (3.9e) and (3.9f) hold, this content placement scheme combined with the uniform slot policy ensures that all requests are served w.h.p. As the replication ratios $p^d$ change from one round of the optimization to the next, so does the content placement; our algorithm exploits the fact that changes in $p^d$ are smooth by reconfiguring the placement with as few content exchanges as possible.

3.5.1 Designated Slot Placement

For every box $b \in \mathcal{B}^d$, we identify a special storage slot which we call the designated slot. We denote the content of this slot by $D_b$ and the remaining contents of $b$ by $L_b = F_b \setminus \{D_b\}$. For all $c \in C$, let $E^d_c = \{b \in \mathcal{B}^d : D_b = c\}$ be the set of boxes storing $c$ in their designated slot. As the sets $E^d_c$ are disjoint, we have

\[ \sum_{F : F \cap A \neq \emptyset} B^d_F U^d = \sum_{b \in \mathcal{B}^d : F_b \cap A \neq \emptyset} U^d \ \ge \sum_{b \in \mathcal{B}^d : D_b \in A} U^d = \sum_{c \in A} |E^d_c|\, U^d, \quad \forall A \subseteq C. \]

Hence, the following lemma holds:

Lemma 15. If $|E^d_c| > r^d_c / U^d$ for all $c \in C$, then (3.15) holds.

Lemma 15 implies that the fraction of boxes that store content $c$ in their designated slot should exceed $r^d_c / (B^d U^d)$ but not $p^d_c$. The following lemma states that such fractions can be easily computed if the conditions in GLOBAL hold.

Lemma 16. Given a class $d$, consider $r^d_c$ and $p^d_c$, $c \in C$, for which (3.9e) and (3.9f) hold. There exist $q^d_c \in [0, 1]$, $c \in C$, such that

\[ \sum_c q^d_c = 1, \qquad 0 \le r^d_c / (B^d U^d) < q^d_c \le p^d_c \le 1, \quad \forall c \in C. \tag{3.20} \]

Moreover, such $q^d_c$ can be computed in $O(|C| \log |C|)$ time.

Proof. We give a constructive proof that calculates $q^d_c$ satisfying (3.20). For brevity, we write $BU$ for $B^d U^d$. If $M^d = 1$, the lemma trivially holds for $q^d_c = p^d_c$. Now suppose that $M^d \ge 2$. Let $\epsilon = 1 - \sum_{c \in C} r^d_c / BU$ and $\epsilon_c = p^d_c - r^d_c / BU$. From (3.9e) and (3.9f), we have that $\epsilon > 0$ and $\epsilon_c > 0$. Sort the $\epsilon_c$ in increasing order, so that $\epsilon_{c_1} \le \epsilon_{c_2} \le \ldots \le \epsilon_{c_{|C|}}$. If $\epsilon_{c_1} \ge \epsilon$, then the lemma holds for $q^d_c = r^d_c / BU + \epsilon / |C|$. Assume thus that $\epsilon_{c_1} < \epsilon$. Let $k = \max\{j : \sum_{i=1}^{j} \epsilon_{c_i} < \epsilon\}$. Then $1 \le k < |C|$, as by (3.9c) $\sum_{c \in C} \epsilon_c = M - 1 + \epsilon > M - 1 \ge \epsilon$ for $M \ge 2$. Then $\epsilon' = \epsilon - \sum_{i=1}^{k} \epsilon_{c_i} > 0$, by the definition of $k$. Let $q^d_{c_i} = r^d_{c_i} / BU + \epsilon_{c_i}$ for $i \le k$ and $q^d_{c_i} = r^d_{c_i} / BU + \epsilon' / (|C| - k)$ for $i > k$. Then $q^d_{c_i} = p^d_{c_i} > r^d_{c_i} / BU$ for $i \le k$. For $i > k$, $q^d_{c_i} > r^d_{c_i} / BU$ as $\epsilon' > 0$, while $\epsilon' / (|C| - k) \le \epsilon' \le \epsilon_{c_{k+1}}$, as otherwise $\sum_{i=1}^{k+1} \epsilon_{c_i} < \epsilon$, a contradiction with the definition of $k$. Moreover, $\epsilon_{c_{k+1}} \le \epsilon_{c_i}$ for all $i > k$, so $q^d_{c_i} \le r^d_{c_i} / BU + \epsilon_{c_i} = p^d_{c_i}$. Finally, $\sum_c q^d_c = \sum_{i=1}^{k} (r^d_{c_i} / BU + \epsilon_{c_i}) + \sum_{i=k+1}^{|C|} r^d_{c_i} / BU + \epsilon' = \sum_c r^d_c / BU + \epsilon = 1$, and the lemma follows.

In other words, if (3.9e) and (3.9f) of GLOBAL hold, ensuring that requests for all contents are served w.h.p. in class $d$ is achieved by placing content $c$ in the designated slot of at least $q^d_c B^d$ boxes, where the $q^d_c$ are determined as in Lemma 16. We call such a placement scheme a designated slot placement. Below, we describe an algorithm that, given ratios $q^d_c$ and $p^d_c$, places content in class $d$ in a way that these ratios are satisfied.

3.5.2 An Algorithm Constructing a Designated Slot Placement

For simplicity, we drop the superscript $d$ in the remainder of this section, though we are referring to content placement in a single class. We focus on the scenario where we are given an initial content placement $\{F_b\}_{b \in \mathcal{B}}$ over $B$ boxes in set $\mathcal{B}$.
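The constructive proof of Lemma 16 above translates into a short sorting-based procedure. The following sketch is illustrative: the function name `designated_fractions` and the dictionary inputs are assumptions, `load[c]` stands for $r_c / (BU)$, and the inputs are assumed to satisfy the lemma's hypotheses with $M \ge 2$.

```python
def designated_fractions(load, p):
    """Compute fractions q_c as in the proof of Lemma 16.

    load: dict content -> r_c / (B U); assumed to sum to less than 1.
    p:    dict content -> replication ratio p_c, with load[c] < p[c] <= 1.
    """
    contents = list(load)
    eps = 1.0 - sum(load.values())                  # spare designated capacity
    slack = {c: p[c] - load[c] for c in contents}   # epsilon_c in the proof
    order = sorted(contents, key=lambda c: slack[c])

    if slack[order[0]] >= eps:
        # Every content can absorb an equal share of the spare capacity.
        return {c: load[c] + eps / len(contents) for c in contents}

    # k = largest prefix of the sorted slacks whose sum stays below eps.
    k, used = 0, 0.0
    while k < len(order) and used + slack[order[k]] < eps:
        used += slack[order[k]]
        k += 1
    remaining = eps - used                          # epsilon' in the proof

    q = {}
    for i, c in enumerate(order):
        if i < k:
            q[c] = p[c]                             # saturate: q_c = p_c
        else:
            q[c] = load[c] + remaining / (len(order) - k)
    return q


load = {"a": 0.2, "b": 0.3, "c": 0.1}
p = {"a": 0.25, "b": 0.9, "c": 0.85}   # ratios sum to M = 2
q = designated_fractions(load, p)
```

In this example only content `a` is saturated at its replication ratio, and the remaining spare capacity is split evenly between `b` and `c`. The sort dominates the cost, matching the $O(|C| \log |C|)$ bound in the lemma.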
Our algorithm, outlined in Figure 3.3, receives this placement as well as target replication ratios $q'_c$ and $p'_c$, $c \in C$, satisfying (3.20). It outputs a new

content placement $\{F'_b\}_{b \in \mathcal{B}}$ in which $q'_c B$ boxes store $c$ in their designated slot, while approximately $p'_c B$ boxes store $c$ overall. Moreover, it does so with as few changes of box contents as possible.

Input: Initial placement $\{F_b\}_{b \in \mathcal{B}}$ and target ratios $q'_c$, $p'_c$
Let $A^+ := \{c \in C : q_c > q'_c\}$, $A^- := \{c \in C : q_c < q'_c\}$;
while there exists $b \in \mathcal{B}$ s.t. $D_b \in A^+$ and $L_b \cap A^- \neq \emptyset$
  Pick $c \in L_b \cap A^-$, and swap it locally with the content of $D_b$.
  Update $q$, $\pi$, $A^+$, $A^-$ accordingly
while there exists $b \in \mathcal{B}$ s.t. $D_b \in A^+$ and $L_b \cap A^- = \emptyset$
  Pick $c \in A^-$ and place $c$ in the designated slot $D_b$;
  Update $q$, $\pi$, $A^+$, $A^-$ accordingly
Let $C^+ := \{c : \pi_c > \pi'_c\}$; $C^- := \{c : \pi_c < \pi'_c\}$; $C^0 := C \setminus (C^+ \cup C^-)$.
Let $G := \{b \in \mathcal{B}$ s.t. $C^+ \cap L_b \neq \emptyset$ and $C^- \setminus (\{D_b\} \cup L_b) \neq \emptyset\}$;
while ($G \neq \emptyset$) or (there exists $c \in C^-$ s.t. $(\pi'_c - \pi_c) B \ge 2$)
  if ($G \neq \emptyset$) then
    Pick any $b \in G$
    Replace some $c \in C^+ \cap L_b$ with some $c' \in C^- \setminus (\{D_b\} \cup L_b)$;
  else
    Pick $c \in C^-$ s.t. $(\pi'_c - \pi_c) B \ge 2$.
    Find a box $b$ that does not store $c$.
    Pick $c' \in C^0 \cap L_b$ and replace $c'$ with $c$.
  Update $G$, $\pi$, $C^+$, $C^-$, $C^0$ accordingly.

Figure 3.3: Placement Algorithm

We assume that $q'_c B$ and $p'_c B$ are integers; for large $B$, this is a good approximation. Let $q_c$, $p_c$ be the corresponding designated slot and overall fractions in the input placement $\{F_b\}_{b \in \mathcal{B}}$. Let $\pi_c = p_c - q_c$ and $\pi'_c = p'_c - q'_c$. A lower bound on the number of cache modification operations needed to reach the target replication ratios is given by $B\delta/2$, where $\delta = \sum_c |p_c - p'_c|$. Our algorithm's performance in terms of cache modification operations will be expressed as a function of the cache size $M$ and of the quantities $\alpha = \sum_c |q_c - q'_c|$ and $\beta = \sum_c |\pi_c - \pi'_c|$:

Theorem 17. The content placement algorithm in Fig. 3.3 leads to a content replication $\{F'_b\}_{b \in \mathcal{B}}$ in which exactly $q'_c B$ boxes store $c$ in their designated slot, and $\hat{p}_c B$ boxes store $c$ overall, where $\sum_c |\hat{p}_c - p'_c| B < 2M$ and $|\hat{p}_c - p'_c| B \le 1$ for all $c \in C$. The total number of write operations is at most $B[\alpha + (M - 1)(\alpha + \beta)]/2$.

In other words, the algorithm produces a placement in which at most $2M$ contents are either under- or over-replicated, each one by only one replica.

Proof of Thm. 17.
To modify the designated slots, the algorithm picks any over-replicated content $c$ in the set $A^+ = \{c : q_c > q'_c\}$. For any box holding $c$ in its designated slot, it checks whether it

holds in its normal slots an under-replicated content $c' \in A^- = \{c : q_c < q'_c\}$. If such content exists, it renames the corresponding slot as designated and the slot holding $c$ as normal. This incurs no cache delete-write, and reduces the $\ell_1$ distance (i.e., the imbalance) between the vectors $Bq$ and $Bq'$ by 2. This is repeated until an under-replicated content $c'$ cannot be found within the normal cache slots of boxes storing some $c \in A^+$. If there still are over-replicated items in $A^+$, some $c' \in A^-$ is selected arbitrarily and overwrites $c$ within the designated slot. This again reduces the imbalance by 2, involving one delete-write operation. At the end of this phase, the replication rates within the designated slots have reached their target $Bq'$, incurring a cost of at most $B\alpha/2$ operations. The resulting caches are free of duplicate copies. Also, after these operations, the intermediate replication rates within the normal cache slots, $B\tilde{\pi}_c$, verify $|B\tilde{\pi}_c - B\pi_c| \le |Bq_c - Bq'_c|$.

We now consider the transformation of the intermediate replication rates $\tilde{\pi}_c$ into the replication rates $\pi'_c$. To this end, we distinguish contents $c$ that are over-replicated, under-replicated and perfectly replicated by introducing $C^+ = \{c : \tilde{\pi}_c > \pi'_c\}$, $C^- = \{c : \tilde{\pi}_c < \pi'_c\}$, $C^0 = \{c : \tilde{\pi}_c = \pi'_c\}$. For any box $b$, if there exist $c \in C^+ \cap L_b$ and $c' \in C^- \setminus (\{D_b\} \cup L_b)$, the algorithm replaces $c$ by $c'$ within $L_b$, thereby reducing the $\ell_1$ distance (i.e., the imbalance) between the vectors $B\tilde{\pi}$ and $B\pi'$ by 2. We call the corresponding operation a greedy reduction. Eventually, the algorithm may arrive at a configuration where no such changes are possible. Then, for any box $b$ such that $C^+ \cap L_b$ is not empty, necessarily $C^- \subseteq (L_b \cup \{D_b\})$. Hence, the size of $C^-$ is at most $M - 1$. In that case, the algorithm picks some content $c'$ that is under-replicated by at least 2 replicas, and finds a box $b$ which does not hold $c'$, i.e., $c' \in C^- \setminus (\{D_b\} \cup L_b)$.
It also selects some content $c$ within $C^0 \cap L_b$: such content must exist, since $|C^-| \le M - 1$, and necessarily $C^- \cap L_b \subseteq C^- \setminus \{c'\}$ has size strictly less than $M - 1$, the size of $L_b$; the remaining content $c$ must belong to $C^0$, since otherwise we could have performed a greedy reduction. We then replace content $c$ by content $c'$. This does not change the imbalance, but augments the size of the set $C^-$: indeed, content $c$ is now under-replicated (one replica missing). We then try to do a greedy reduction, i.e., a replacement of an over-replicated content by $c$ if possible. If not, we repeat the previous step, i.e., we identify some content under-replicated by at least 2 and create a new replica in place of some perfectly replicated item, thereby augmenting the size of $C^-$ while maintaining the imbalance. In at most $M - 1$ steps, the algorithm inflates the size of $C^-$ to

at least $M$, at which stage we know that some greedy imbalance reduction can be performed. This procedure terminates when the size of $C^-$ is at most $M - 1$, and each of the corresponding contents is missing only one replica. The imbalance, which after the termination of the designated slot placement was at most $\sum_c |B\pi''_c - B\pi'_c| \le B(\alpha + \beta)$, has been reduced to at most $2M$. The number of write operations involved in each imbalance reduction by 2 is at most $M - 1$. Thus the total number of write operations in this second phase is upper-bounded by $(M - 1)(B/2)(\alpha + \beta)$.

3.6 Performance Evaluation

In this section, we evaluate our algorithms under synthesized traces and a real-life BitTorrent traffic trace. We implement an event-driven simulator that captures box-level content placement and service assignment. In particular, we implement the solutions we proposed in Section 3.3 (LOCAL_d optimization), Section 3.4 (Uniform Slot Policy), and the section on Designated Slot Placement. We also implement ISP trackers that execute the decentralized solution in Figure 3.1, e.g., redirecting requests, estimating demands and exchanging messages for decentralized optimization. In the rest of this section, all evaluations are performed using a cost matrix with class-pairwise costs drawn uniformly from [0, 1]. The cost from a class to itself is always zero, and the cost is set to the constant 3 if a content is downloaded from the server, either due to a dropped request or a designated routing decision.

Uniform-slot service assignment

First, we show that the uniform-slot policy achieves close-to-optimal service assignment, given that content-wise capacity constraints are respected. We focus on a single ISP and assign requests to individual boxes under the uniform-slot policy. Figure 3.5 shows the dropping probability under various settings of $B$ and $M$. We utilize a synthesized trace with Poisson arrivals and exponentially distributed service times. Content popularity follows a Zipf distribution and we let the total traffic rate grow proportionally to the system population.
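The synthesized workload described above can be sketched as follows. This is a minimal illustration rather than the simulator used in the evaluation: the function names, the Zipf exponent of 1.0, and the tuple layout of a request are assumptions introduced here.

```python
import random


def zipf_weights(catalog_size, exponent=1.0):
    """Normalized Zipf popularity: weight of rank r is proportional to 1 / r**exponent.

    The exponent value is an assumption; the text only says "Zipf".
    """
    raw = [1.0 / (rank ** exponent) for rank in range(1, catalog_size + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def synth_trace(num_requests, arrival_rate, mean_service, catalog_size, seed=0):
    """Return a list of (arrival_time, content_id, service_time) tuples.

    Exponential inter-arrival gaps give Poisson arrivals; service times are
    exponentially distributed with the given mean.
    """
    rng = random.Random(seed)
    weights = zipf_weights(catalog_size)
    contents = range(catalog_size)
    now = 0.0
    trace = []
    for _ in range(num_requests):
        now += rng.expovariate(arrival_rate)
        content = rng.choices(contents, weights=weights, k=1)[0]
        service = rng.expovariate(1.0 / mean_service)
        trace.append((now, content, service))
    return trace
```

To mimic a total traffic rate that grows with the system population, one could call `synth_trace` with `arrival_rate` set proportionally to the number of boxes $B$.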
As predicted by the theory, the dropping probability quickly vanishes as $B$ grows. For the same $B$, the dropping probability is higher when $M$ is larger. This is due to the reduced accuracy of content placement when $M$ is large. In addition, when storage is plentiful, our scheme tries to allocate popular content in caches in order to minimize cost. This results in more requests

from less popular content, which are originally sent to the server. Serving less popular content understandably incurs more drops.

Figure 3.4: Characterization of real-life BitTorrent trace. (a) Cumulative counts of downloads/boxes. (b) Per-country counts of downloads/boxes. (c) Predictability of content demand in one hour interval over one month period. [Panels (a) and (b) plot downloads and unique IPs against country in decreasing order of total downloads; panel (c) plots relative difference (%) against content ID in decreasing order of popularity.]

Synthesized trace based simulation

We next show the efficiency of our full solution with decentralized optimization, content placement scheme and uniform-slot policy. We again utilize a synthesized trace generated as in the previous experiment. Content popularities are heterogeneous across classes. Figure 3.6 illustrates the average empirical cost per request, compared to the fluid prediction under distributed optimization and the optimal cost computed offline. The distributed algorithm quickly converges to the global

optimum after a handful of iterations. The results suggest that in the steady state, the empirical cost achieved by our solution is very close to the offline optimal.

Figure 3.5: Dropping probability decreases fast with uniform slot strategy. Simulation in a single class with a catalog size of C = 100. [Plot of dropping probability P(B) against number of boxes B, one curve per value of M.]

Figure 3.6: Performance of full solution with decentralized optimization, content placement scheme and uniform-slot policy, under the parameter settings C = 1000, D = 10, B = 1000, Ū = 3, M = 4. [Plot of global cost over time for the empirical, numerical and optimal cases.]

BitTorrent trace-based simulation

We finally employ a real-life trace collected from the Vuze network, one of the most popular BitTorrent clients. To collect the traces, we ran 1000 DHT nodes with randomly-chosen IDs, and logged all DHT put messages routed to our nodes. During 30 days we traced the downloads of around 2 million unique files by 8 million unique IP addresses. We determined the country of each IP using Maxmind's GeoLite City database [59]. To limit the runtime of our simulations, we trim the traces as follows. We consider only the top 1,000 most popular files, which contribute 52% of the total downloads. Figure 3.4(a) illustrates the total number of download events and unique IPs, grouped by countries in a decreasing order

of total downloads during the 30-day period, and the cumulative counts in Figure 3.4(b). The traces show that one user issues, on average, approximately one content request. We select the top 20 countries as classes in our simulation.

Figure 3.7: Performance of different algorithms over a real 30-day BitTorrent trace. [Plot of average download cost per hour for the offline optimal, decentralized, LRU-closest and LFU-closest algorithms.]

So far we have shown the efficiency of our algorithm given fixed request rates. In practice, the content popularity may change over time and cannot be accurately predicted from the history. The content demand should remain sufficiently stable in order for our algorithm to work well and not oscillate wildly. In our simulation, we measure content demands based on a 24-hour interval, and use the demand rates from the previous day as the prediction for the next. Figure 3.4(c) shows the relative difference between the predicted rate and the true rate, grouped by content in decreasing order of popularity. Each data point is averaged over the entire 30-day period. The results show that demand is stable for the majority of content. The contents with high variations are new items that reach their popularity peaks during the first few days after their release.

In the following simulations, we let each user have 2 upload slots and 2 storage slots. We compare our solution to two common caching algorithms, namely Least Recently Used (LRU) and Least Frequently Used (LFU). The request routing follows the closest approach: when a download request is issued, we go over all classes in increasing cost order, i.e., starting from the requester's class. The request is accepted whenever a class is found for which there exists at least one box with one or more free slots that stores the requested content. If no such class is found, the request is redirected to the server. We run the trace under four content placement and routing algorithms: our solution, offline optimal, LRU-closest and LFU-closest. All algorithms start with the same

configuration of content placement. Figure 3.7 shows the average download cost for the four cases. The results show that our solution significantly reduces the cross-traffic between classes most of the time compared to the two heuristics. The spikes are mainly due to incorrectly predicting the demand of content with highly varying popularities. The offline algorithm works best because of its perfect demand prediction and its instant memory reshuffles.

3.7 Related Work

Peer-assistance of CDNs has been studied from several perspectives. Recent work has shown that it can reduce CDN server traffic [60] and energy consumption [8] by more than 60%. Early research compared the efficiency of prefetching policies for peer-assisted VoD [61], and bandwidth allocation across different P2P swarms [62]. Issues regarding peer incentivization [63, 64, 65] have also been studied.

Minimizing cross-traffic is extensively studied in the context of ISP-friendly P2P system design, whose goal is to encourage content downloads within the same ISP. This is known to reduce both ISP cross-traffic and download delays, and is thus mutually beneficial to ISPs and peers alike [66, 67]. Typical implementations bias the selection of download sources towards nearby peers; peer proximity can be inferred either through client-side mechanisms [44] or through a service offered by the ISP [68, 69, 19]. In the latter case, the ISP can explicitly recommend which neighbors to download content from by solving an optimization problem that minimizes cross-traffic [19]. In the context of peer-assisted CDNs, an objective that minimizes a weighted sum of cross-traffic and the load on the content server can also be considered [69]. Prior work on ISP-friendliness reduces cross-traffic solely by performing service assignment to suitable peers. In our context, we have an additional control knob on top of service assignment, namely content placement: our optimization selects not only where requests are routed, but also where content is stored.
In the cooperative caching problem, clients generate a set of requests for items, which need to be mapped to caches that can serve them; each client/cache pair assignment is also associated with an access cost. The goal is to decide how to place items in caches and assign requests to caches that can serve them, so that the total access cost is minimized. The problem is NP-hard, and a 10-approximation algorithm is known [70]. Motivated by CDN topologies, Borst et al. [71]

obtain lower approximation ratios as well as competitive online algorithms for the case where cache costs are determined by weights in a star graph. A polynomial algorithm is known in the case where caches are organized in a hierarchical topology and the replica accessed is always the nearest replica [72]. Our work significantly departs from the above studies by explicitly dealing with bandwidth constraints, assuming a stochastic demand for items and proposing an adaptive, distributed algorithm for joint content placement and service assignment among multiple classes of identical caches.

Finally, recent work [73, 74, 54] has considered cache management specifically in the context of P2P VoD systems, and is in this sense close to our work. However, the multi-group dimension is not addressed in these works; also, with the exception of [54], the request service policies in [73, 74] differ from ours, which is based on a loss model.

3.8 Summary

We offer a solution to regulate cross-traffic and minimize content delivery costs in decentralized CDNs. We present an optimal request routing scheme that accommodates user demands, an effective service mapping algorithm that is easy to implement within each operator, and an adaptive content caching algorithm with low operational costs. Through a live BitTorrent trace-based simulation, we demonstrate that our distributed algorithm is simultaneously scalable, accurate and responsive.

Chapter 4

DONAR: Decentralized Server Selection for Cloud Services

This chapter focuses on the challenge of a distributed implementation of traffic management solutions across a large number of geographically distributed clients [10]. Geo-replicated services need an effective way to direct client requests to a particular location, based on performance, load, and cost. This chapter presents DONAR, a distributed system that can offload the burden of replica selection, while providing these services with a sufficiently expressive interface for specifying mapping policies. Most existing approaches for replica selection rely on either central coordination (which has reliability, security, and scalability limitations) or distributed heuristics (which lead to suboptimal request distributions, or even instability). In contrast, the distributed mapping nodes in DONAR run a simple, efficient algorithm to coordinate their replica-selection decisions for clients. The protocol solves an optimization problem that jointly considers both client performance and server load, allowing us to show that the distributed algorithm is stable and effective. Experiments with our DONAR prototype, providing replica selection for CoralCDN and the Measurement Lab, demonstrate that our algorithm performs well in the wild. Our prototype supports DNS- and HTTP-based redirection, IP anycast, and a secure update protocol, and can handle many customer services with diverse policy objectives.

4.1 Introduction

Cloud services need an effective way to direct clients across the wide area to an appropriate service location (or "replica"). For many companies offering distributed services, managing replica selection is an unnecessary burden. In this chapter, we present the design, implementation, evaluation, and deployment of DONAR, a decentralized replica-selection system that meets the needs of these services.

A Case for Outsourcing Replica Selection

Many networked services handle replica selection themselves. However, even the simplest approach of using DNS-based replica selection requires running a DNS server that tracks changes in which replicas are running and customizes the IP address(es) returned to different clients. These IP addresses may represent single servers in some cases. Or, they may be virtualized addresses that each represent a cluster of colocated machines, with an on-path load balancer directing requests to individual servers. To handle wide-area replica selection well, these companies need to (i) run the DNS server at multiple locations, for better reliability and performance, (ii) have these nodes coordinate to distribute client requests across the replicas, to strike a good trade-off between client performance, server load, and network cost, and (iii) perhaps switch from DNS to alternate techniques, like HTTP-based redirection or proxying, that offer finer-grain control over directing requests to replicas.

One alternative is to outsource the entire responsibility for running a Web-based service to a CDN with ample server and network capacity (e.g., Akamai [6]). Increasingly, cloud computing offers an attractive alternative where the cloud provider offers elastic server and network resources, while allowing customers to design and implement their own services. Today, such customers are left largely to handle the replica-selection process on their own, with at best limited support from individual cloud providers [75] and third-party DNS hosting platforms [76, 77].
Instead, companies should be able to manage their own distributed services while outsourcing replica selection to a third party or their cloud provider(s). These companies should merely specify high-level policies, based on performance, server and network load, and cost. Then, the replica-selection system should realize these policies by directing clients to the appropriate replicas and

adapting to policy changes, server replica failures, and shifts in the client demands.

To be effective, the replica-selection system must satisfy several important goals for its customers. It must be:

- Expressive: Customers should have a sufficiently expressive interface to specify policies based on (some combination of) performance, replica load, and server and bandwidth costs.
- Reliable: The system should offer reliable service to clients, as well as stable storage of customer policy and replica configuration data.
- Accurate: Client requests should be directed to the service replicas as accurately as possible, based on the customer's replica-selection policy.
- Responsive: The replica-selection system should respond quickly to changing client demands and customer policies without introducing instability.
- Flexible: The nodes should support a variety of replica-selection mechanisms (e.g., DNS and HTTP-redirection).
- Secure: Only the customer, or another authorized party, should be able to create or change its selection policies.

In this chapter, we present the design, implementation, evaluation, and deployment of DONAR, a decentralized replica-selection system that achieves these goals. DONAR's distributed algorithm solves a formal optimization problem that jointly considers client locality, server load, and policy preferences. By design, DONAR facilitates replica selection for many services; however, its underlying algorithms remain relevant in the case of a single service performing its own replica selection, such as a commercial CDN.

Decentralized Replica-Selection System

The need for reliability and performance should drive the design of a replica-selection system, leading to a distributed solution that consists of multiple mapping nodes handling a diverse mix of clients, as shown in Figure 4.1. These mapping nodes could be HTTP ingress proxies that route client requests from a given locale to the appropriate data centers, the model adopted by Google and Yahoo.
Or the mapping nodes could be authoritative DNS servers that resolve local queries for the names of Web sites, the model adopted by Akamai and most CDNs. Furthermore, these DNS servers may use IP anycast to leverage BGP-based failover and to minimize client request

latency. Whatever the mechanism for interacting with clients, each mapping node has only a partial view of the global space of clients. As such, these mapping nodes need to make different decisions; for example, node 1 in Figure 4.1 directs all of its clients to the leftmost replica, whereas node 3 divides its clients among the remaining three replicas.

Figure 4.1: DONAR uses distributed mapping nodes for replica selection. Its algorithms can maintain a weighted split of requests to a customer's replicas, while preserving client-replica locality to the greatest extent possible. [Diagram: client requests arrive at mapping nodes 1 to 3, which direct them to four customer replicas receiving 50%, 25%, 13% and 12% of the requests.]

Each mapping node needs to know how to both direct its clients and adapt to changing conditions. The simplest approach would be to have a central coordinator that collects information about the mix of clients per customer service, as well as the request load from each mapping node, and then informs each mapping node how to direct future requests. However, a central coordinator introduces a single point of failure, as well as an attractive target for attackers trying to bring down the service. Further, it incurs significant overhead for the mapping nodes to interact with the controller. While some existing services do perform centralized computation (for example, Akamai uses a centralized hierarchical stable-marriage algorithm for assigning clients to its CDN servers [78]), the overhead of backhauling client information can be more prohibitive when it must be done for each customer service using DONAR. Finally, a centralized solution adds additional delay, making the system less responsive to sudden changes in client request rates (i.e., flash crowds).

To overcome these limitations, DONAR runs a decentralized algorithm among the mapping nodes themselves, with minimal protocol overhead that does not grow with the number of clients. Designing a distributed algorithm that is simultaneously scalable, accurate, and responsive is an open challenge, not addressed by previous heuristic-based [79, 80, 81, 82] or partially centralized [83] solutions. Decentralized algorithms are notoriously prone to oscillations (where the nodes over-react based on their own local information) and inaccuracy (where the system does not balance replica load effectively). DONAR must avoid falling into these traps.

Research Contributions and Roadmap

In designing, implementing, deploying, and evaluating DONAR, we make three main research contributions:

- Simple and expressive interface for customer policies (Section 4.2): DONAR has a simple policy interface where each replica location specifies a split weight or a bandwidth cap, and expected client performance is captured through a performance penalty. These three sets of parameters may change over time. We show that this interface can capture diverse replica-selection goals based on client proximity, server capacity, 95th-percentile billing, and so on. The API leads us to introduce a formal optimization problem that jointly considers customers' (sometimes conflicting) preferences.
- Stable, efficient, and accurate distributed replica-selection algorithm (Section 4.3): DONAR consists of a distributed collection of mapping nodes for better reliability and performance. Since these nodes each handle replica selection for a different mix of clients, they cannot simply solve the optimization problem independently. Rather than resorting to a centralized architecture that pushes results to each mapping node, we decompose the optimization problem into a distributed algorithm that requires only minimal coordination between the nodes.
We show that our decentralized algorithm provably converges to the solution of the optimization problem, ensuring that DONAR does not over- or under-react to shifts in client demands.
- Scalable, secure, reliable, and flexible prototype system (Section 4.4): Our DONAR prototype implements the distributed optimization algorithm, supports DNS- and HTTP-based replica selection, and stores customer data in a scalable storage system. DONAR uses IP anycast for fast failover and good client redirection performance. The prototype includes a secure update

protocol for policy changes and receives a periodic feed of IP2Geo data [9] to support policies based on client location. Our prototype is used to provide distributed DNS resolution for the Measurement Lab testbed [84] and for a portion of the CoralCDN service [85].

Experiments in Section 4.5 evaluate both our distributed algorithm operating at scale (through trace-driven simulations of client requests to CoralCDN) and a small-scale deployment of our prototype system (providing DNS-based replica selection to jointly optimize client proximity and server load for part of CoralCDN). These experiments demonstrate that DONAR offers effective, customized replica selection in a scalable and efficient fashion. Section 4.6 compares DONAR to related work, and Section 4.7 concludes.

4.2 Configurable Mapping Policies

DONAR realizes its customers' high-level policy goals, while shielding them from the complicated internals of distributed replica selection. By customer, we mean a service provider that outsources replica selection to DONAR. Customers configure their policies by communicating directly with any DONAR mapping node. This section first motivates and introduces DONAR's policy API. It then describes a number of application scenarios that leverage this interface to express sophisticated mapping policies.

Customer Goals

Customers use DONAR to optimally pair clients with service replicas. What customers consider "optimal" can differ: some may seek simply to minimize the network latency between clients and replicas, others may seek to balance load across all replicas, while still others may try to optimize the assignment based on the billing costs of their network operators or hosting services. In general, however, we can decompose a customer's preferences into those associated with the wide-area network (the network performance a client would experience if directed to a particular replica) and those associated with its own replicas (the load on the servers and the network at each location).
DONAR considers both factors in its replica-selection algorithm.

Minimizing network costs. Mapping policies commonly seek to pair clients with replicas that offer good performance. While replica load can affect client-perceived performance, the network

path has a significant impact as well. For example, web transfers or interactive applications often seek small network round-trip times (RTTs), while bulk transfers seek good network throughput (although RTT certainly can also impact TCP throughput).

DONAR's algorithms use an abstract cost(c, i) function to indicate the performance penalty between a particular client-replica pair. This function allows us a considerable amount of expressiveness. For instance, to optimize latency, cost simply can be RTT, directly measured or estimated via network coordinates. If throughput is the primary goal, cost can be calculated as the penalty of network congestion. In fact, the cost function can be any shape (e.g., a translated logistic function of latency or congestion) to optimize the worst case or percentile-based performance. The flexibility of the cost function also allows for intricate policies, e.g., always mapping one client to a particular server replica, or preferring a replica through a peering link over a transit link.

DONAR's current implementation starts with a shared cost function for all of its customers, which represents expected path latency. As a result, our customers do not need to calculate and submit large and complex cost functions independently. In the case where a customer's notion of cost differs from that shared function, our interface allows them to override DONAR's default mapping decisions (discussed in more detail in the following section).

To estimate path latency, services like DONAR could use a variety of techniques: direct network measurements [86, 6, 87], virtual coordinates [88, 89], structural models of path performance [90, 11], or some hybrid of these approaches [82]. DONAR's algorithm can accept any of these techniques; research towards improving network cost estimation is very complementary to our interests. Our current prototype uses a commercial IP geolocation database [9] to estimate network proximity, although this easily could be replaced with an alternative solution.
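To make the shapes of cost(c, i) discussed above concrete, the following is a minimal sketch, not DONAR's implementation. The function names and the logistic parameters (`midpoint_ms`, `steepness`) are assumptions chosen for illustration.

```python
import math


def rtt_cost(rtt_ms):
    """Latency-oriented cost: the penalty is simply the round-trip time."""
    return rtt_ms


def logistic_cost(rtt_ms, midpoint_ms=100.0, steepness=0.05):
    """Translated logistic penalty in (0, 1).

    Latencies well below the (assumed) midpoint cost almost nothing, while
    latencies above it are penalized heavily, which emphasizes avoiding
    bad cases over shaving milliseconds from already-good ones.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (rtt_ms - midpoint_ms)))


def closest_replica(costs):
    """Pick the replica id with the smallest cost for one client.

    `costs` maps replica id -> cost(c, i) for a fixed client c.
    """
    return min(costs, key=costs.get)
```

For example, `closest_replica({"nyc": rtt_cost(12.0), "lon": rtt_cost(80.0)})` selects `"nyc"`; swapping in `logistic_cost` keeps the same ordering but compresses differences among replicas that are all comfortably below the midpoint.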
Balancing client requests across replicas. Unlike pairwise network performance, which is largely shared amongst DONAR's customers, traffic distribution preferences vary widely from customer to customer. In the simplest scenarios, services may want to equally balance request rates across replicas, or they may want decisions based solely on network proximity (i.e., cost). More advanced considerations, however, are both possible and important. For example, given 95th-percentile billing mechanisms, customers could try to minimize costs by reducing the frequency of peak consumption, or to eliminate overage costs by rarely exceeding their committed rates. In

any replica-mapping scheme, request capacity, bandwidth cost, and other factors will influence the preferred rate of request arrival at a given replica. The relative importance of these factors may vary from one replica to the next, even for the same customer.

Application Programming Interface

Through considering a number of such policy preferences, which we review in the next section, we found that enabling customers to express two simple factors was a powerful tool. DONAR allows its customers to dictate a replica's (i) split weight, $w_i$, the desired proportion of requests that a particular replica $i$ should receive out of the customer's total set of replicas, or (ii) bandwidth cap, $B_i$, the upper bound on the exact number of requests that replica $i$ should receive. In practice, different requests consume different amounts of server and bandwidth resources; we expect customers to set $B_i$ based on the relationship between the available server/bandwidth resources and the average resources required to service a request.

These two factors enable DONAR's customers to balance load between replicas or to cap load at an individual replica. For the former, a customer specifies $w_i$ and $\epsilon_i$ to indicate that it wishes replica $i$ to receive a $w_i$ fraction of the total request load (see footnote 1), but is willing to deviate up to $\epsilon_i$ in order to achieve better network performance. If $P_i$ denotes the true proportion of requests directed to $i$, then

$|P_i - w_i| \le \epsilon_i$

This $\epsilon_i$ is expressed in the same unit as the split rate; for example, if a customer wants each of its ten replicas to receive a split rate of 0.1, setting $\epsilon$ to 0.02 indicates that each replica should receive 10% ± 2% of the request load.

Alternatively, a customer can also specify a bandwidth cap $B_i$ per replica, to require that the exact amount of received traffic at $i$ does not exceed $B_i$. If $B$ is the total constant load across all replicas, then

$B \cdot P_i \le B_i$

Footnote 1: In reality, the customer may select weights that do not sum to 1, particularly since each replica may assign its weight independently.
DONAR simply normalizes the weights to sum to 1, e.g., proportion $w_i / \sum_j w_j$ for replica $i$.
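The split and cap constraints, together with the weight normalization from the footnote, can be checked with a few lines of code. This is an illustrative sketch only; the helper names are assumptions and do not come from DONAR.

```python
def normalize_weights(weights):
    """Scale raw split weights so they sum to 1: w_i / sum_j w_j."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {i: w / total for i, w in weights.items()}


def within_split(p_i, w_i, eps_i):
    """Split constraint: |P_i - w_i| <= eps_i."""
    return abs(p_i - w_i) <= eps_i


def within_cap(total_load, p_i, cap_i):
    """Bandwidth-cap constraint: B * P_i <= B_i."""
    return total_load * p_i <= cap_i
```

With ten replicas at weight 0.1 and a tolerance of 0.02, `within_split(0.11, 0.1, 0.02)` holds while `within_split(0.13, 0.1, 0.02)` does not, matching the 10% ± 2% example in the text.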

Functionality                     DONAR API Call
create a DONAR service            s = create()
add a replica instance            i = add(s, repl, ttl)
set split weight                  set(s, i, w_i, ε_i)
set bandwidth cap                 set(s, i, B_i)
match a client-replica pair       match(s, clnt, i)
prefer a particular replica       preference(s, clnt, i)
remove a replica instance         remove(s, i)

Figure 4.2: DONAR's Application Programming Interface

Note that if a set of $B_i$'s becomes infeasible given a service's traffic load, DONAR reverts to $w_i$ splits for each instance. If those are not present, excess load is spread equally amongst replicas.

Figure 4.2 shows DONAR's customer API. A customer creates a DONAR service by calling create(). A customer uses add() and remove() to add or remove a replica instance from its service's replica set. The record will persist in DONAR for the specified time-to-live period (ttl), unless it is explicitly removed before this period expires. A short ttl serves to make add() equivalent to a soft-state heartbeat operation, where an individual replica can execute it repeatedly to express its liveness. To communicate its preferred request distribution, the customer uses set() for a replica instance $i$. This function takes either ($w_i$, $\epsilon_i$) or $B_i$, but never both (as the combination does not make logical sense with respect to the same replica at the same instant in time). We do allow a customer simultaneously to use both $w_i$ and $B_i$ for different subsets of replicas, which we show later by example.

Some customers may want to impose more explicit constraints on specific client-replica pairs, for cost or performance reasons. For example, a customer may install a single replica inside an enterprise network for handling requests only from clients within that enterprise (and, similarly, the clients in the enterprise should only go to that server replica, no matter what the load is). A customer expresses this hard constraint using the match() call.
Alternatively, a customer may have a strong preference for directing certain clients to a particular replica (e.g., due to network peering arrangements), giving them priority over other clients. A customer expresses this policy using the preference() call. In contrast to match(), the preference() call is a soft constraint; in the (presumably rare) case that replica $i$ cannot handle the total load of the high-priority clients,

some of these client requests are directed to other, less-preferred replicas. These two options mimic common primitives in today's commercial CDNs.

In describing the API, we have not specified exactly who or what is initiating these API function calls (e.g., a centralized service manager, individual replicas, etc.) because DONAR imposes no constraint on the source of customer policy preferences. As we demonstrate next, different use cases require updating DONAR from different sources, at different times, and based on different inputs.

Expressing Policies with DONAR's API

Customers can use this relatively simple interface to implement surprisingly complex mapping policies. It is important to note that, internally, DONAR will minimize the network cost between clients and replicas within the constraints of any customer policy. To demonstrate use of the API, we describe increasingly intricate scenarios.

Selecting the closest replica. To implement a "closest replica" policy, a customer simply sets the bandwidth caps of all replicas to infinity. That is, they impose no constraint on the traffic distribution. This update can be generated from a single source or independently from each replica.

Balancing workload between datacenters. Consider a service provider running multiple datacenters, where datacenter $i$ has $w_i$ servers (dedicated to this service), and jobs are primarily constrained by server resources. The service provider's total capacity at these datacenters may change over time, however, due to physical failures or maintenance, the addition of new servers or VM instances, the repurposing of resources amongst services, or temporarily bringing down servers for energy conservation. In any case, they want the proportion of traffic arriving at each datacenter to reflect current capacity. Therefore, each datacenter $i$ locally keeps track of its number of active servers as $w_i$, and calls set(s, i, $w_i$, $\epsilon_i$), where $\epsilon_i$ can be any tolerance parameter. The datacenter need only call set() when the number of active servers changes.
DONAR then proportionally maps client requests according to these weights. This scenario is simple because (i) $w_i$ is locally measurable within each datacenter, and (ii) $w_i$ is easily computable, i.e., decreasing $w_i$ when servers fail and increasing it when they recover.
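The datacenter-balancing scenario can be illustrated with a toy, in-memory stand-in for a mapping node. Everything below is hypothetical: `DonarStub` and `target_split` are names invented for this sketch, only the create/add/set call shapes mirror Figure 4.2, and the real system is a networked service rather than a local object.

```python
class DonarStub:
    """Hypothetical in-memory stand-in for a DONAR mapping node (not the real system)."""

    def __init__(self):
        self._services = {}

    def create(self):
        """Create a service and return its handle s."""
        s = len(self._services)
        self._services[s] = {}
        return s

    def add(self, s, repl, ttl):
        """Register replica address `repl` with a time-to-live; return its id i."""
        replicas = self._services[s]
        i = len(replicas)
        replicas[i] = {"addr": repl, "ttl": ttl, "w": None, "eps": None, "cap": None}
        return i

    def set(self, s, i, *args):
        """set(s, i, w_i, eps_i) sets a split weight; set(s, i, B_i) sets a cap."""
        rec = self._services[s][i]
        if len(args) == 2:
            rec["w"], rec["eps"], rec["cap"] = args[0], args[1], None
        elif len(args) == 1:
            rec["w"], rec["eps"], rec["cap"] = None, None, args[0]
        else:
            raise TypeError("set() takes (w_i, eps_i) or (B_i), never both")

    def target_split(self, s):
        """Normalized target proportions for replicas that declared a weight."""
        weighted = {i: r["w"] for i, r in self._services[s].items() if r["w"] is not None}
        total = sum(weighted.values())
        return {i: w / total for i, w in weighted.items()}


donar = DonarStub()
svc = donar.create()
dc_a = donar.add(svc, "192.0.2.1", 60)   # documentation-range addresses
dc_b = donar.add(svc, "192.0.2.2", 60)
donar.set(svc, dc_a, 30, 0.05)           # 30 active servers
donar.set(svc, dc_b, 10, 0.05)           # 10 active servers
split_before = donar.target_split(svc)   # proportions follow capacity
donar.set(svc, dc_a, 20, 0.05)           # 10 servers fail in datacenter A
split_after = donar.target_split(svc)
```

Each datacenter only needs to call `set` when its own server count changes, which is why the scenario requires no coordination between datacenters.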

                         Multi-Homing                      Replicated Service
Available Resources      Local links with different        Wide-area replicas with different
to Optimize              bandwidth capacities and costs    bandwidth capacities and costs
Resource Allocation      Proportion of traffic             Proportion of new clients
                         to assign each link               to assign each replica

Figure 4.3: Multi-homed route control versus wide-area replica selection

Enforcing heuristics-based replica selection. DONAR implements a superset of the policies that are used in legacy systems to achieve heuristics-based replica selection. For example, OASIS [82] allows replicas to withdraw themselves from the server pool once they reach a given self-assessed load threshold, and then maps clients to the closest replica remaining in the pool. To realize this policy with DONAR, each replica $i$ independently calls set(s, i, $B_i$), where $B_i$ is the load threshold it estimates from previous history. $B_i$ can be dynamically updated by each replica, by taking into account the prior observed performance given its traffic volume. DONAR then strictly optimizes for network cost among those replicas still under their workload threshold $B_i$.

Optimizing 95th-percentile billing costs. Network transit providers often charge based on the 95th-percentile bandwidth consumption rates, as calculated over 5-minute or 30-minute periods in a month. To minimize such "burstable billing" costs, a distributed service could leverage an algorithm recently proposed for multi-homed route control under 95th-percentile billing [91]. This algorithm, while intended for traffic engineering in multi-homed enterprise networks, could also be used to divide clients amongst service replicas; Figure 4.3 summarizes the relationship between these two problems. The billing optimization could be performed by whatever management system the service already runs to monitor its replicas.
This system would need to track each replica's traffic load, predict future traffic patterns, run a billing cost optimization algorithm to compute the target split ratio $w_i$ for the next time period, and output these parameters to DONAR through set(s, i, w_i, 0) calls. Of course, any mapping service capable of proportionally splitting requests (such as weighted round-robin DNS) could achieve such dynamic splits, but only by ignoring client network performance. DONAR accommodates frequently updated split ratios while still assuring that clients reach a nearby replica.
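As a rough illustration of why 95th-percentile billing rewards controlled bursts, the sketch below computes the billable rate from a month of bandwidth samples. The nearest-rank percentile definition and the sample values are assumptions; transit providers may define the percentile differently.

```python
def percentile_95(samples_mbps):
    """Return the 95th-percentile sample using the nearest-rank method.

    Nearest-rank is an assumption of this sketch, not a billing standard.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# A 30-day month has 8640 five-minute samples; the top 5% (432 samples) are not billed.
quiet_month = [100] * 8240 + [900] * 400   # bursts fit inside the free 5%
busy_month = [100] * 8140 + [900] * 500    # bursts spill past the free 5%
billable_quiet = percentile_95(quiet_month)
billable_busy = percentile_95(busy_month)
```

A management system choosing split ratios could use a calculation of this kind to decide which replica can still absorb a burst without raising its billable rate.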

Notation        Interpretation
$N$             Set of mapping nodes
$C_n$           Set of clients for node $n$; $C_n \subseteq C$, the set of all clients
$I$             Set of service replicas
$R_{nci}$       Proportion of traffic load that is mapped to replica $i$ from client $c$ by node $n$
$\alpha_{cn}$   Proportion of node $n$'s traffic load from client $c$
$s_n$           Proportion of total traffic load (from $C$) on node $n$
$w_i$           Traffic split weight of replica $i$
$P_i$           True proportion of requests directed to replica $i$
$B_i$           Bandwidth cap on replica $i$
$B$             Total traffic load across all replicas
$\epsilon_i$    Tolerance of deviation from desired traffic split

Table 4.1: Summary of key notations

Shifting traffic to under-utilized replicas. Content providers often deploy services over a large set of geographically diverse replicas. Providers commonly want to map clients to the closest replica unless pre-determined bandwidth capacities are violated. Under such a policy, certain nodes may see too few requests due to an unattractive network location (i.e., they are rarely the closest node). To remedy this problem, providers can leverage both traffic options in combination. For busy replicas, they call set(s, i, B_i) to impose a bandwidth cap, while for unpopular replicas, they can call set(s, i, k, 0), so that replica $i$ receives at least some fixed traffic proportion $k$. In this way, they avoid under-utilizing instances, while offering the vast majority of clients a nearby replica.

4.3 Replica Selection Algorithms

This section first formulates the global replica-selection problem, based on the mapping policies we considered earlier. We then propose a decentralized solution running on distributed mapping nodes. We demonstrate that each mapping node locally optimizing the assignment of its client population, and judiciously communicating with other nodes, can lead to a globally optimal assignment. We summarize the notation we use in Table 4.1.

Global Replica-Selection Problem

We have discussed two important policy factors that motivate our replica-selection decisions. Satisfying one of these components (e.g., network performance), however, typically comes at the

expense of the other (e.g., accurate load distribution). As DONAR allows customers to freely express their requirements through their choice of $w_i$, $\epsilon_i$, and $B_i$, customers can express their willingness to trade off performance for load distribution.

To formulate the replica-selection problem, we introduce a network model. Let $C$ be the set of clients, and $I$ the set of server replicas. Denote $R_{ci}$ as the proportion of traffic from client $c$ routed to replica $i$, which we solve for in our problem. The global client performance (penalty) can be calculated as

$$\mathrm{perf}^g = \sum_{c \in C} \sum_{i \in I} R_{ci} \cdot \mathrm{cost}(c, i) \qquad (4.1)$$

Following our earlier definitions, let $P_i = \sum_{c \in C} R_{ci}$ be the true proportion of requests directed to replica $i$. Our goal is to minimize this performance penalty, i.e., to match each client with a good replica, while satisfying the customer's requirements. That goal can be expressed as the following optimization problem $RS^g$:

$$\begin{aligned} \text{minimize} \quad & \mathrm{perf}^g & (4.2a) \\ \text{subject to} \quad & |P_i - w_i| \le \epsilon_i & (4.2b) \\ & B \cdot P_i \le B_i, \ \forall i & (4.2c) \end{aligned}$$

$B$ is the total amount of traffic, a constant parameter that can be calculated by summing the traffic observed at all replicas. Note that for each replica $i$, either constraint (4.2b) or (4.2c) is active, but not both.

The optimization problem can also handle the match() and preference() constraints outlined in Section. A call to match() imposes a hard constraint that client $c$ only uses replica $i$, and vice versa. This is easily handled by removing the client and the replica from the optimization problem entirely, and solving the problem for the remaining clients and replicas. The preference() call imposes a soft constraint that client $c$ has priority for mapping to replica $i$, assuming the replica can handle the load. This is easily handled by scaling down cost(c, i) by some large constant factor so the solution maps client $c$ to replica $i$ whenever possible.
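The objective (4.1) and the constraints (4.2b) and (4.2c) can be evaluated for a candidate assignment with a few lines of code. The sketch below is illustrative only: the function names and the dictionary layout are assumptions, and R[c][i] is taken to be expressed as a fraction of total traffic, so that the $P_i$ values sum to one.

```python
def global_penalty(R, cost):
    """perf^g: sum over clients c and replicas i of R[c][i] * cost[c][i]."""
    return sum(R[c][i] * cost[c][i] for c in R for i in R[c])


def replica_proportions(R):
    """P_i: sum over clients c of R[c][i]."""
    P = {}
    for c in R:
        for i, share in R[c].items():
            P[i] = P.get(i, 0.0) + share
    return P


def satisfies_policy(P, total_load, split=None, cap=None):
    """Check (4.2b) for split-weight replicas and (4.2c) for capped replicas.

    split maps replica -> (w_i, eps_i); cap maps replica -> B_i.
    """
    split = split or {}
    cap = cap or {}
    ok_split = all(abs(P.get(i, 0.0) - w) <= eps for i, (w, eps) in split.items())
    ok_cap = all(total_load * P.get(i, 0.0) <= b for i, b in cap.items())
    return ok_split and ok_cap


# Hypothetical two-client, two-replica instance.
R = {"c1": {"east": 0.5, "west": 0.0}, "c2": {"east": 0.1, "west": 0.4}}
cost = {"c1": {"east": 1.0, "west": 5.0}, "c2": {"east": 4.0, "west": 1.0}}
P = replica_proportions(R)
```

An optimizer would search over R; this sketch only scores and validates a given R, which is often enough to sanity-check a solver's output.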

4.3.2 Distributed Mapping Service

Solving the global optimization problem directly requires a central coordinator that collects all client information, calculates the optimal mappings, and directs clients to the appropriate replicas. Instead, we solve the replica-selection problem using a distributed mapping service. Let $N$ be the set of mapping nodes. Every mapping node $n \in N$ has its own view $C_n$ of the total client population, i.e., $C_n \subseteq C$. A mapping node $n$ receives a request from a client $c \in C_n$. The node maps the client to a replica $i \in I$ and returns the result to that client. In practice, each client $c$ can represent a group of aggregated end hosts, e.g., according to their postal codes. This is necessary to keep request rates per client stable (see Section for more discussion). Therefore, we allow freedom in directing one client to one or multiple replicas. We then expand the decision variable $R_{ci}$ to $R_{nci}$, which is the fraction of traffic load that is mapped to replica $i$ from client $c$ by node $n$, i.e., $\sum_i R_{nci} = 1$.

Each DONAR node monitors requests from its own client population. Let $s_n$ be the normalized traffic load on node $n$ (that is, $n$'s fraction of the total load from $C$). Different clients may generate different amounts of workload; let $\alpha_{cn} \in [0, 1]$ denote the proportion of $n$'s traffic that comes from client $c$ (where $\sum_{c \in C_n} \alpha_{cn} = 1$). The information of $R_{nci}$ and $\alpha_{cn}$ can be measured at DONAR nodes locally, whereas collecting $s_n$ and $P_i$ requires either central aggregation of each node's local client load and decisions, or each node to exchange its load with its peers. The distributed node deployment also allows DONAR to check the feasibility of problem (4.2a) easily: given local load information, calculate the total load $B$ and determine whether (4.2a) accepts a set of $\{w_i, B_i\}_{i \in I}$ parameters.
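One plausible form of this feasibility check is sketched below. The text does not spell out the test DONAR performs, so the condition used here is an assumption: a set of parameters is treated as feasible when some vector of proportions $P_i$ summing to one fits inside the per-replica bounds implied by (4.2b) and (4.2c). All names are hypothetical.

```python
def total_load(node_loads):
    """B: total traffic, summed from the loads reported by each mapping node."""
    return sum(node_loads)


def is_feasible(B, split=None, cap=None):
    """Assumed feasibility test: can proportions P_i summing to 1 meet all bounds?

    split maps replica -> (w_i, eps_i), giving w_i - eps_i <= P_i <= w_i + eps_i.
    cap maps replica -> B_i, giving 0 <= P_i <= B_i / B.
    """
    split = split or {}
    cap = cap or {}
    lower = sum(max(0.0, w - eps) for w, eps in split.values())
    upper = sum(min(1.0, w + eps) for w, eps in split.values())
    upper += sum(min(1.0, b / B) for b in cap.values())
    return lower <= 1.0 <= upper


B = total_load([300.0, 500.0, 200.0])  # hypothetical per-node loads
too_tight = is_feasible(B, split={"east": (0.5, 0.05)}, cap={"west": 400.0})
workable = is_feasible(B, split={"east": (0.5, 0.05)}, cap={"west": 600.0})
```

In the first call the replicas can absorb at most 95% of the traffic, so no assignment satisfies the policy; raising the cap in the second call makes the problem feasible.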
The global replica-selection problem $RS^g$, after introducing mapping nodes and the new variable $R_{nci}$, remains the same formulation as (4.2a)-(4.2c), with

$$\mathrm{perf}^g = \sum_{n \in N} s_n \sum_{c \in C_n} \alpha_{cn} \sum_{i \in I} R_{nci} \cdot \mathrm{cost}(c, i) \qquad (4.3)$$

$$P_i = \sum_{n \in N} s_n \sum_{c \in C_n} \alpha_{cn} R_{nci}$$

A simple approach to solving this problem is to deploy a central coordinator that collects all necessary information and calculates the decision variables for all mapping nodes. We seek to

avoid this centralization, however, for several reasons: (i) coordination between all mapping nodes is required; (ii) the central coordinator becomes a single point of failure; (iii) the coordinator requires information about every client and node, leading to $O(|N| \cdot |C| \cdot |I|)$ communication and computational complexity; and (iv) as traffic load changes, the problem needs to be recomputed and the results re-disseminated. This motivates us to develop a decentralized solution, where each DONAR node runs a distributed algorithm that is simultaneously scalable, accurate, and responsive to change.

Decentralized Selection Algorithm

We now derive a decentralized solution for the global problem $RS^g$. Each DONAR node will perform a smaller-scale local optimization based on its client view. We later show how local decisions converge to the global optimum within a handful of algorithmic iterations. We leverage the theory of optimization decomposition to guide our design.

Consider the global performance term $\mathrm{perf}^g$ in (4.3), which consists of the local client performance contributed by each node:

$$\mathrm{perf}^g = \sum_{n \in N} \mathrm{perf}^n, \quad \text{where} \quad \mathrm{perf}^n = s_n \sum_{c \in C_n} \sum_{i \in I} \alpha_{cn} R_{nci} \cdot \mathrm{cost}(c, i) \qquad (4.4)$$

Each node optimizes local performance on its client population, plus a load term imposed on the replicas. For a replica $i$ with a split-weight constraint (the case of a bandwidth-cap constraint follows similarly and is shown in the final algorithm),

$$\mathrm{load} = \sum_{i \in I} \lambda_i \left( (P_i - w_i)^2 - \epsilon_i^2 \right) \qquad (4.5)$$

where $\lambda_i$ is interpreted as the unit price of violating the constraint. We will show later how to set this value dynamically for each replica. Notice that the load term involves decision variables from all nodes. To decouple it, rewrite $P_i$ as

$$P_i = \sum_{n \in N} P_{ni} = P_{ni} + \sum_{n' \in N \setminus \{n\}} P_{n'i} = P_{ni} + P_{-ni}$$

where $P_{ni} = s_n \sum_{c \in C_n} \alpha_{cn} R_{nci}$ is the traffic load contributed by node $n$ on replica $i$ (that is, the requests from those clients directed to $i$ by DONAR node $n$), and $P_{-ni}$ is the traffic load contributed by nodes other than $n$, independent of $n$'s decisions. Then the local replica-selection problem for node $n$ is formulated as the following optimization problem $RS^n$:

$$\begin{aligned} \text{minimize} \quad & \mathrm{perf}^n + \mathrm{load}^n & (4.6a) \\ \text{variables} \quad & R_{nci}, \ \forall c \in C_n, \ i \in I & (4.6b) \end{aligned}$$

where $\mathrm{load}^n = \mathrm{load}$, $\forall n$. To solve this local problem, a mapping node needs to know (i) local information about $s_n$ and $\alpha_{cn}$, (ii) prices $\lambda_i$ for all replicas, and (iii) the aggregated $P_{-ni}$ information from other nodes. Equation (4.6a) is a quadratic programming problem, which can be solved efficiently by standard optimization solvers (we evaluate computation time in Section 4.5.1).

Initialization
  For each replica $i$: Set an arbitrary price $\lambda_i \ge 0$.
  For each node $n$: Set an arbitrary decision $R_{nci}$.
Iteration
  For each node $n$:
  (1) Collects the latest $\{P_{n'i}\}_{i \in I}$ for other $n'$.
  (2) Collects the latest $\lambda_i$ for every replica $i$.
  (3) Solves $RS^n$.
  (4) Computes $\{P_{ni}\}_{i \in I}$ and updates the info.
  (5) With probability $1/|N|$, for every replica $i$:
  (6)   Collects the latest $P_i = \sum_n P_{ni}$.
  (7)   Computes $\lambda_i \leftarrow \max\{0, \lambda_i + \theta((P_i - w_i)^2 - \epsilon_i^2)\}$, or $\lambda_i \leftarrow \max\{0, \lambda_i + \theta((B \cdot P_i)^2 - B_i^2)\}$.
  (8)   Updates $\lambda_i$.
  (9) Stops if $\{P_{n'i}\}_{i \in I}$ from other $n'$ do not change.

Table 4.2: Decentralized solution of server selection

We formally present the decentralized algorithm in Table 4.2. At the initialization stage, each node picks an arbitrary mapping decision, e.g., one that only optimizes local performance. Each replica sets an arbitrary price, say $\lambda_i = 0$. The core components of the algorithm are the local updates by each mapping node, and the periodic updates of replica prices. Mapping decisions

are made at each node $n$ by solving $RS^n$ based on the latest information. Replica prices ($\lambda_i$) are updated based on the inferred traffic load to each $i$. Intuitively, the price increases if $i$'s split weight or bandwidth requirement is violated, and decreases otherwise. This can be achieved via additive updates (shown by $\theta$). We both collect and update the $P_{ni}$ and $\lambda_i$ information through a data store service, as discussed later.

While the centralized solution requires $O(|N| \cdot |C| \cdot |I|)$ communication and computation at the coordinator, the distributed solution has much less overhead. Each node needs to share its mapping decisions of size $|I|$ with all others, and each replica's price $\lambda_i$ needs to be known by each node. This implies $|N|$ messages, each of size $O((|N| - 1) \cdot |I| + |I|) = O(|N| \cdot |I|)$. Each node's computational complexity is of size $O(|C_n| \cdot |I|)$.

The correctness of the decentralized algorithm is given by the following theorem:

Theorem 18. The distributed algorithm shown in Table 4.2 converges to the optimal solution of $RS^g$, given that (i) each node $n$ iteratively solves $RS^n$ in a circular fashion, i.e., $n = 1, 2, \ldots, |N|, 1, \ldots$, and (ii) each replica price $\lambda_i$ is updated on a larger timescale, i.e., after all nodes' decisions converge given a set of $\{\lambda_i\}_{i \in I}$.

Proof. It suffices to show that the distributed algorithm is an execution of the dual algorithm that solves $RS^g$. We only show the case of the split-weight constraint (4.2b); the case of the bandwidth-cap constraint (4.2c) follows similarly.

First, following the Lagrangian method for solving an optimization problem, we derive the Lagrangian of the global problem $RS^g$. Constraint (4.2b) is equivalent to

$$(P_i - w_i)^2 \le \epsilon_i^2$$

The Lagrangian of $RS^g$ is written as:

$$L(R, \lambda) = \mathrm{perf}^g + \sum_{i \in I} \lambda_i \left( (P_i - w_i)^2 - \epsilon_i^2 \right) = \sum_{n \in N} \mathrm{perf}^n + \mathrm{load} \qquad (4.7)$$

where $\lambda_i \ge 0$ is the Lagrange multiplier (replica price) associated with the split-weight constraint on replica $i$, and $R = \{R_{nci}\}_{n \in N, c \in C_n, i \in I}$ is the primal variable. The dual algorithm requires us to

minimize the Lagrangian (4.7) for a given set of $\{\lambda_i\}_{i \in I}$:

$$\begin{aligned} \text{minimize} \quad & L(R, \lambda) \\ \text{variable} \quad & R \end{aligned}$$

A distributed algorithm implied by condition (i) solves (4.7), because each node $n$ iteratively solving $RS^n$ in a circular fashion simply implements the nonlinear Gauss-Seidel algorithm (per [32, Ch. 3, Prop. 3.9], [92, Prop. ]):

$$R_n^{(t+1)} = \operatorname{argmin}_{R_n} \ RS^g\left(\ldots, R_{n-1}^{(t+1)}, R_n, R_{n+1}^{(t)}, \ldots\right) \qquad (4.8)$$

where $R_n = \{R_{nci}\}_{c \in C_n, i \in I}$ denotes node $n$'s decision variable. Given a set of $\{\lambda_i\}_{i \in I}$, the distributed solution converges for three reasons. First, the objective function (4.7) is continuously differentiable and convex on the entire set of variables. Second, each step of $RS^n$ is a minimization of (4.7) with respect to its own variable $R_n$, assuming the others are held constant. Third, the optimal solution of each $RS^n$ is uniquely attained, since its objective function is quadratic. The three conditions together ensure that the limit point of the sequence $\{R_n^{(t)}\}_{n \in N}$ minimizes (4.7) for a given set of $\{\lambda_i\}_{i \in I}$.

Second, we need to solve the master dual problem:

$$\begin{aligned} \text{maximize} \quad & f(\lambda) \\ \text{subject to} \quad & \lambda \ge 0 \end{aligned}$$

where $f(\lambda) = \min_R L(R, \lambda)$ is the minimization solved in the first step. Since the solution to (4.7) is unique, the dual function $f(\lambda)$ is differentiable, and the master dual problem can be solved by the following gradient projection method:

$$\lambda_i \leftarrow \max\left\{ 0, \ \lambda_i + \theta \left( (P_i - w_i)^2 - \epsilon_i^2 \right) \right\}, \ \forall i$$

where $\theta > 0$ is a small positive step size. Condition (ii) guarantees that the dual prices $\lambda$ are updated on a larger timescale, as the dual algorithm requires.
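The gradient projection step above is a one-line computation per replica. The sketch below shows it for both the split-weight form and the bandwidth-cap form listed in Table 4.2; the function names and the numeric values are illustrative assumptions.

```python
def split_violation(P_i, w_i, eps_i):
    """(P_i - w_i)^2 - eps_i^2: positive when the split tolerance is exceeded."""
    return (P_i - w_i) ** 2 - eps_i ** 2


def cap_violation(P_i, B, B_i):
    """(B * P_i)^2 - B_i^2: positive when the bandwidth cap is exceeded."""
    return (B * P_i) ** 2 - B_i ** 2


def update_price(price, theta, violation):
    """Projected additive update: lambda_i <- max(0, lambda_i + theta * violation)."""
    return max(0.0, price + theta * violation)


# A replica receiving 50% of traffic against a 25% target sees its price rise;
# a replica exactly on target sees its price decay toward zero.
raised = update_price(0.0, 0.5, split_violation(0.50, 0.25, 0.05))
lowered = update_price(0.01, 0.5, split_violation(0.25, 0.25, 0.05))
```

The projection onto non-negative values is what keeps a satisfied constraint from pushing traffic toward a replica: once the price reaches zero the replica simply stops influencing the local objective.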

The duality gap of $RS^g$ is zero, and the solution to each $RS^n$ is also unique. This finally guarantees that the equilibrium point of the decentralized algorithm is also the optimal solution of the global problem $RS^g$.

The correctness of the distributed algorithm relies on an appropriate ordering of local updates from each node, i.e., in a round-robin fashion as shown in (4.8), and a less frequent replica price update. In practice, however, we allow nodes and replicas to update independently and without coordination. We find that the algorithm's convergence is not sensitive to this lack of coordination, which we demonstrate in our evaluation. In fact, the decentralized solution works well even at the scale of thousands of mapping nodes. For a given replica-selection problem, the decentralized solution usually converges within a handful of iterations, and the equilibrium point is also the optimal solution to the global problem.

4.4 DONAR's System Design

This section describes the design of DONAR, which provides distributed replica selection for large numbers of customers, each of whom has its own set of service replicas and different high-level preferences over replica selection criteria. DONAR implements the policy interface and distributed optimization mechanism we defined in the last two sections. Each DONAR node must also reliably handle client requests and customer updates. DONAR nodes should be geographically dispersed themselves, for greater reliability and better performance. Our current deployment, for example, consists of globally dispersed machines on both the PlanetLab [93] and VINI [94] network platforms.

Figure 4.4 depicts a single DONAR node and its interactions with various system components. This section is organized around these components. Section discusses how DONAR nodes combine customer policies, mapping information shared by other nodes, and locally-available cost information, in order to optimally map clients to customers' service replicas.
Section describes DONAR's front-end mechanisms for performing replica selection (e.g., DNS, HTTP redirection, HTTP proxying, etc.), and Section details its update protocol for registering and updating customer services, replicas, and policies. Finally, Section describes DONAR's back-end

distributed data store (for reliably disseminating data) and use of IP anycast (for reliably routing client requests).

[Figure 4.4: Interactions on a DONAR node. Labeled elements: (1) DNS, (2) Mapping Request, (3) Storage; Policy Update, Customer Service, Network Costs, DONAR Nodes with Storage.]

Efficient Distributed Optimization

Given policy preferences for each customer, DONAR nodes must translate high-level mapping goals into a specific set of rules for each node. DONAR's policy engine realizes a variant of the algorithm described in Table 4.2. In particular, all nodes act asynchronously in the system, so no round-based coordination is required. We demonstrate in Section 4.5 that this variant converges in practice.

Request rate and cost estimation. Our model assumes that each node has an estimate of the request rate per client. As suggested in Section 4.3, each client represents a group of similarly located end-hosts (we also refer to this entity as a "client region"). When a DONAR node comes online and begins receiving client requests, it tracks the request volume per unit time per client region. While our optimization algorithm models a static problem, and therefore constant request rates, true request rates vary over time. To address this limitation, DONAR nodes use an exponentially-weighted moving average of previous time intervals (with α = 0.8 and 10-minute intervals in our current deployment). Still, rapid changes in a particular client region might lead to suboptimal mapping decisions. Using trace data from a popular CDN, however, we

show in Section that relative request rates per region do not significantly vary between time intervals.

Our model also assumes a known cost(c, i) function, which quantifies the cost of pairing client $c$ with instance $i$. In our current deployment, each DONAR node has a commercial-grade IP geolocation database [9] that provides this data and is updated weekly. We use a cost(c, i) function that is the normalized Euclidean distance between the IP prefixes of the client $c$ and instance $i$ (like that described in Section 4.3).

Performing local optimization. DONAR nodes arrive at globally optimal routing behavior by periodically re-running the local optimization problem. In our current implementation, nodes each run the same local procedure at regular intervals (about every 2 minutes, using some randomized delay to desynchronize computations).

As discussed in Section 4.3.3, the inputs to each local optimization are the aggregate traffic information sent by all remote DONAR nodes, $\{B \cdot P_{-ni}\}_{i \in I}$; the customer policy parameters, $\{w_i, B_i\}_{i \in I}$; and the proportions of local traffic coming from each client region to that node, $\{\alpha_{cn}\}_{c \in C_n}$. The first two inputs are available through DONAR's distributed data store, while nodes locally compute their clients' proportions.

Given these inputs, the node must decide how to best map its clients. This reduces to the local optimization problem $RS^n$, which minimizes a node's total client latency given the constraints of the customer's policy specification and the mapping decisions of other nodes (treating their most recent update as a static assignment). The outcome of this optimization is a new set of local rules $\{R_{nci}\}_{c \in C_n, i \in I}$, which dictates the node's mapping behavior.

Given these new mapping rules, the node now expects to route different amounts of traffic to each replica. It then computes these new expected traffic rates per instance, using the current mapping policy and its historical client request rates $\{\alpha_{cn}\}_{c \in C_n}$.
It then updates its existing per-replica totals $\{B \cdot P_i\}_{i \in I}$ in the distributed data store, so that the implications of its new local policy propagate to other nodes. If the new solution violates customer constraints (bandwidth caps are overshot or splits exceed the allowed tolerance), the node will update the constraint multipliers $\{\lambda_i\}_{i \in I}$ with probability $1/|N|$. Thus, in the case of overload, the multipliers will be updated, on average, once per cycle of local updates.
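The request-rate smoothing and the per-region proportions $\alpha_{cn}$ used as optimization inputs can be sketched as follows. The text gives the smoothing factor (0.8) but not which term it weights, so applying it to the newest interval is an assumption here; the smoothing factor is also unrelated to $\alpha_{cn}$ despite the shared symbol. Function and region names are hypothetical.

```python
def ewma_update(previous_estimate, observed_rate, smoothing=0.8):
    """Blend the newest interval's request rate into the running estimate.

    Assumption: `smoothing` is the weight on the newest observation.
    """
    if previous_estimate is None:
        return observed_rate  # the first interval seeds the estimate
    return smoothing * observed_rate + (1.0 - smoothing) * previous_estimate


def region_proportions(estimates):
    """alpha_cn: each client region's share of this node's estimated load."""
    total = sum(estimates.values())
    if total <= 0:
        raise ValueError("node has no estimated load")
    return {region: rate / total for region, rate in estimates.items()}


# One client region observed over two 10-minute intervals.
estimate = ewma_update(None, 100.0)
estimate = ewma_update(estimate, 200.0)
shares = region_proportions({"region-1": estimate, "region-2": 20.0})
```

A larger weight on the newest interval makes the node react faster to shifts in a region's demand, at the cost of noisier inputs to the local optimization.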

4.4.2 Providing Flexible Mapping Mechanisms

A variety of protocol-level mechanisms are employed for wide-area replica selection today. They include (i) dynamically generated DNS responses with short TTLs, according to a given policy, (ii) HTTP redirection from a centralized source and/or between replicas, and (iii) persistent HTTP proxies to tunnel requests. To offer customers maximum flexibility, DONAR offers all three of these mechanisms.

To use DONAR via DNS, a domain's owner will register each of its replicas as a single A record with DONAR, and then point the NS records for its domain (e.g., example.com) to ns.donardns.org. DONAR nameservers will then respond to requests for example.com with an appropriate replica given the domain's selection criteria.

To use DONAR for HTTP redirection or proxying, a customer adds HTTP records to DONAR (a record type in DONAR update messages), such as mapping example.com to us-east.example.com, us-west.example.com, etc. DONAR resolves these names and appropriately identifies their IP addresses for use during its optimization calculations.² This customer then hands off DNS authority to DONAR as before. When DONAR receives a DNS query for the domain, it returns the IP address of the client's nearest DONAR node. Upon receiving the corresponding HTTP request, a DONAR HTTP server uses the requests' Host: header fields to determine which customer domains the requests correspond to. It queries its local DONAR policy engine for the appropriate replica, and then redirects or proxies the request to that replica.

The DONAR software architecture is designed to support the easy addition of new protocols. DONAR's DNS nameserver and HTTP server run as separate processes, communicating with the local DONAR policy engine via a standardized socket protocol.
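The Host-header lookup performed by the HTTP front-end can be pictured with the short sketch below. It is not DONAR's code: the function name, the set of registered domains, and the normalization rules (lower-casing, dropping a port and a trailing dot) are assumptions made for illustration.

```python
def customer_domain_for(host_header, registered_domains):
    """Return the registered customer domain named by a Host header, or None."""
    host = host_header.strip().lower()
    # Drop an optional ":port" suffix (IPv6 literals are not handled in this sketch).
    name, sep, port = host.rpartition(":")
    if sep and port.isdigit():
        host = name
    host = host.rstrip(".")  # tolerate a fully qualified name with a trailing dot
    return host if host in registered_domains else None


registered = {"example.com"}  # hypothetical customer domains known to this node
match = customer_domain_for("Example.com:8080", registered)
miss = customer_domain_for("unknown.org", registered)
```

Once the domain is identified, the front-end would ask the policy engine for a replica and issue a redirect or proxy the request, as described above.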
Section describes additional implementation details.

² In this HTTP example, customers need to ensure that each name resolves either to a single IP address or a set of collocated replicas.

Secure Registration and Dynamic Updates

Since DONAR aims to accommodate many simultaneous customers, it is essential that all customer-facing operations be completely automated and not require human intervention. Additionally, since DONAR is a public service, it must authenticate client requests and prevent replay attacks. To

meet these goals, we have developed a protocol that provides secure, automatic account creation and facilitates frequent policy updates.

Account creation. In DONAR, a customer account is uniquely identified by a private/public key pair. DONAR reliably stores its customers' public keys, which are used to cryptographically verify signatures on account updates (described next). Creating accounts in DONAR is completely automated, i.e., no central authority is required to approve account creation. To create a DONAR account, a customer generates a public/private key pair and simply begins adding new records to DONAR, signed with the private key. If a DONAR node sees an unregistered public key in update messages, it generates the SHA-1 hash of the key, hash, and allocates the domain <hash>.donardns.net to the customer.

Customers have the option of validating a domain name that they own (e.g., example.com). To do so, a customer creates a temporary CNAME DNS record that maps validate-<hash>.example.com to donardns.net. Since only someone authoritative for the example.com namespace will be able to add this record, its presence alone is a sufficient proof of ownership. The customer then sends a validation request for example.com to a DONAR node. DONAR looks for the validation CNAME record in DNS and, if found, will replace <hash>.donardns.net with example.com for all records tied to that account.

DONAR Update Protocol (DUP). Customers interact with DONAR nodes through the DONAR Update Protocol (DUP). Operating over UDP, DUP allows customers to add and remove service replicas, express the policy parameters for these replicas as described in Section 4.3, as well as implicitly create and explicitly verify accounts, per above. DUP is similar in spirit to the DNS UPDATE protocol [95], with some important additional features. These include mandatory RSA signatures, nonces for replay protection, DONAR-specific meta-data (such as split weight or bandwidth cap), and record types outside of DNS (for HTTP mapping).
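The naming scheme used for account creation and domain validation can be sketched in a few lines. The text specifies a SHA-1 hash of the public key but not how the key is serialized or how the hash is encoded; the raw-bytes input and hexadecimal encoding below are assumptions, as are the function names.

```python
import hashlib


def account_label(public_key_bytes):
    """Hex SHA-1 digest of the serialized public key (encoding is an assumption)."""
    return hashlib.sha1(public_key_bytes).hexdigest()


def account_domain(public_key_bytes):
    """Domain allocated to a newly seen key: <hash>.donardns.net."""
    return f"{account_label(public_key_bytes)}.donardns.net"


def validation_record(public_key_bytes, customer_domain):
    """(name, target) of the temporary CNAME that proves domain ownership."""
    name = f"validate-{account_label(public_key_bytes)}.{customer_domain}"
    return name, "donardns.net"


key = b"hypothetical serialized public key"
domain = account_domain(key)
cname_name, cname_target = validation_record(key, "example.com")
```

A 40-character hex digest, even with the "validate-" prefix, stays within the 63-octet limit on a single DNS label, so either name is a valid hostname.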
DUP is record-based (like DNS), allowing forward compatibility as we add new features such as additional policy options.

Reliability through Decentralization

DONAR provides high availability by gracefully tolerating the failure of individual DONAR nodes. To accomplish this, DONAR incorporates reliable distributed data storage (for customer records) and ensures that clients will be routed away from failed nodes.

Distributed Data Storage. DONAR provides distributed storage of customer record data. A customer update (through DUP) should be able to be received and handled by any DONAR node, and the update should then be promptly visible throughout the DONAR network. There should be no central point of failure in the storage system, and the system should scale out with the inclusion of new DONAR nodes without any special configuration.

To provide this functionality, DONAR uses the CRAQ storage system [96] to replicate record data and account information across all participating nodes. CRAQ automatically re-replicates data under membership changes and can provide either strong or eventual consistency of data. Its performance is optimized for read-heavy workloads, which we expect in systems like DONAR, where the number of client requests will likely be orders of magnitude greater than the number of customer updates. DONAR piggybacks on CRAQ's group-membership functionality, built on Zookeeper [97], in order to alert DONAR nodes when a node fails. While such group notifications are not required for DONAR's availability, this feature allows DONAR to quickly recompute and reconverge to optimal mapping behavior following node failures.

Route control with IP anycast. While DONAR's storage system ensures that valuable data is retained in the face of node failures, it does not address the issue of routing client requests away from failed nodes. When DONAR is used for DNS, it can partially rely on its clients' resolvers performing automatic failover between authoritative nameservers. This failover significantly increases resolution latency, however. Furthermore, DONAR node failure presents an additional problem when used with HTTP-based selection. Most web browsers do not fail over between multiple A records (in this case, DONAR HTTP servers), and browsers (and now browser plugins such as Java and Flash as well) purposely "pin" DNS names to specific IP addresses to prevent "DNS rebinding" attacks [98].
³ These cached host-to-address bindings often persist for several minutes. To address both cases, DONAR is designed to work over IP anycast, not only for answering DNS queries but also for processing updates.

³ A browser may pin DNS-to-IP mappings even after observing destination failures; otherwise, an attacker may forge ICMP "host unreachable" messages to cause it to unpin a specific mapping.

In our current deployment, a subset of DONAR nodes run on VINI [94], a private instance of PlanetLab which allows tighter control over the network stack. These nodes run an instance of Quagga that peers with TransitPortal instances at each site [99], and thus each site announces

DONAR's /24 prefix through BGP. If a node loses connectivity, the BGP session will drop and the Transit Portal will withdraw the wide-area route. To handle application-level failures, a watchdog process on each node monitors the DONAR processes and withdraws the BGP route if a critical service fails.

Implementation

The software running on each DONAR node consists of several modular components, which are detailed in Figure 4.5. They constitute a total of approximately 10,000 lines of code (C++ and Java), as well as another 2,000 lines of shell scripts for system management.

DONAR has been running continuously on Measurement Lab (M-Lab) since October. All services deployed on M-Lab have a unique domain name, service.account.donar.measurement-lab.org, that provides a closest-node policy by default, selecting from among the set of M-Lab servers. Two of the most popular M-Lab services are more closely integrated with DONAR: the Network Diagnostic Tool (NDT) [100], which is used for the Federal Communications Commission's Consumer Broadband Test, and NPAD [101]. NDT and NPAD run the DUP protocol, providing DONAR nodes with status updates every 5 minutes. Since December of 2009, DONAR has also handled 15% of CoralCDN's DNS traffic [85], around 1.25 million requests per day. This service uses an equal-split policy amongst its replicas.

DONAR's current implementation supports three front-end mapping mechanisms, implemented as separate processes for extensibility, which communicate with the main DONAR policy engine over a UNIX domain socket and a record-based ASCII protocol. The DNS front-end is built on the open-source PowerDNS resolver, which supports customizable storage backends. DONAR provides a custom HTTP server for HTTP redirection, built on top of the libmicrohttpd embedded HTTP library. DONAR also supports basic HTTP tunneling, i.e., acting as a persistent proxy between clients and replicas, via a custom-built HTTP proxy. However, due to the bandwidth constraints of our deployment platform, it is currently disabled.
At the storage layer, DONAR nodes use CRAQ to disseminate customer records (public key, domain, and replica information), as well as traffic request rates to each service replica (from each

DONAR node). CRAQ's key-value interface offers basic set/get functionality; DONAR's primitive data types are stored with an XDR encoding in CRAQ.

[Figure 4.5: Software architecture of a DONAR node. Inputs: DNS Requests, HTTP Requests, Customer Updates. Components: PowerDNS Resolver, Custom HTTP Server/Proxy, DONAR Update Server, DONAR Policy Engine, Distributed Storage Service (CRAQ) holding Customer Record Data and Global Traffic Data.]

The DONAR policy engine is written in C++, built using the Tame extensions [102] to the SFS asynchronous I/O libraries. In order to assign client regions to replica instances, the engine solves a quadratic program of size $|C_n| \cdot |I|$, using a quadratic solver from the MOSEK [103] optimization library.

The DONAR update server, written in Java, processes DUP requests from customers, including account validation, policy updates, and changes in the set of active replicas. It uses CRAQ to disseminate record data between nodes. While customers can build their own applications that directly speak DUP, we also provide a publicly-available Java client that performs these basic operations.

4.5 Evaluation

Our evaluation of DONAR is in three parts. Section simulates a large-scale deployment given trace request data and simulated mapping nodes. Section uses the same dataset to verify that client request volumes are reasonably stable from one time period to the next. Finally, Section evaluates our prototype performing real-world replica selection for a popular CDN.

[Figure 4.6: DONAR adapts to split weight changes. Axes: Time (Minute) versus Replica Request Distribution; phases A, B, C, D.]

[Figure 4.7: Network performance implications of replica-selection policies. Axes: Time (Minute) versus Geo Distance (Km); series: Round Robin, Decentralized Alg, Centralized Alg, Closest.]

Trace-Based Simulation

We use trace data to demonstrate the performance and stability of our decentralized replica-selection algorithm. We analyzed DNS log files from CoralCDN. Our dataset consists of 9,918,780 requests over a randomly-selected 24-hour period (July 28, 2009). On that day, CoralCDN's infrastructure consisted of 76 DNS servers which dispatched clients to any of 308 HTTP proxies distributed world-wide. Locality information was obtained through Quova's commercial geolocation database [9].

In the trace-based simulation, we aggregate clients by geographic location. This results in 371 distinct client regions in our trace. We choose 10 hypothetical mapping nodes to represent a globally-distributed set of authoritative nameservers. Each request is assigned to the nearest nameserver, partitioning the client space. Four replica instances are selected from locations in the US-East, US-West, Europe, and Asia.
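The partitioning of client regions among mapping nodes can be reproduced with a short sketch. The text does not state the distance metric used for "nearest", so great-circle distance is an assumption here, and all coordinates and names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def partition_by_nearest(regions, nodes):
    """Map each client region to the name of its nearest mapping node."""
    return {
        name: min(nodes, key=lambda n: great_circle_km(location, nodes[n]))
        for name, location in regions.items()
    }


nodes = {"us-east": (40.7, -74.0), "europe": (50.1, 8.7)}
regions = {"boston": (42.4, -71.1), "paris": (48.9, 2.4)}
assignment = partition_by_nearest(regions, nodes)
```

Each resulting group of regions plays the role of one node's client view $C_n$ in the simulation.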

We feed each mapping node with client request rates and distributions, which are inferred from the trace for every 10-minute interval. In the evaluation, we allow DONAR nodes to communicate asynchronously, i.e., they do not talk in a circular fashion as in Table 4.2. Instead, we overlap updates such that there is a 40% probability that at least half of the nodes update simultaneously. All nodes perform an update once each minute. We use the quadratic programming solver in MOSEK [103], and each local optimization takes about 50 ms on a 2.0 GHz dual-core machine.

Load split weight. Customers can submit different sets of load split weights over time, and we show how DONAR dynamically adapts to such changes. Figure 4.6 shows the replica request distribution over a 2-hour trace. We vary the desired split weight four times, at 0, 30, 60 and 90 minutes. Phase A shows replica loads quickly converging from a random initial point to a split weight of 40/30/20/10. Small perturbations occur at the beginning of every 10 minutes, since client request rates and distributions change. Replica loads quickly converge to the original level as DONAR re-optimizes based on the current load. In Phase B, we adjust the split weight to 60/20/15/5, and replica loads shift to the new level usually within 1 or 2 minutes. Note that the split weight and traffic load can change simultaneously, and DONAR is very responsive to these changes. In Phase C, we implement an equal-split policy, and Phase D re-balances the load to an uneven distribution. In this experiment we chose $\epsilon_i = 0.01$, so there is very little deviation from the exact split weight. This example demonstrates the responsiveness and convergence of the decentralized algorithm, even when the local optimizations run asynchronously.

Network performance. We next investigate the network performance under an equal-split policy among replicas, i.e., all four replicas expect 25% of load and tolerate $\epsilon_i = 1\%$ deviation. We use a 6-hour trace from the above dataset, starting at 9pm EST.
We compare DONAR's decentralized algorithm to three other replica-selection algorithms. Round Robin maps incoming requests to the four replicas in a round-robin fashion, achieving equal load distribution. Centralized Alg uses a central coordinator to calculate mapping decisions for all nodes and all clients, and thus does not require inter-node communication. Closest always maps a client to the closest replica and achieves the best network performance. The performance of these algorithms is shown in Figure 4.7. The best (minimum) distance, realized by Closest, is quite stable over time. Round Robin achieves the worst network performance, about 300% to 400% more than the minimum, since 75% of requests go to sub-optimal replicas. DONAR's decentralized algorithm achieves much better performance, realizing 10% to 100% above the minimum distance in exchange for better load distribution. Note that the decentralized solution is very close to that of a central coordinator. It is also interesting to note that DONAR's network cost increases over time. This can be explained by diurnal patterns in different areas: the United States was approaching midnight while Asia reached its traffic peak at noon. Requiring 25% load at each of the two US servers understandably hurts the network performance.

Figure 4.8: Sensitivity analysis of using tolerance parameter ε_i. The original table also has mean and variance columns.

Policy             % of clients to closest replica
ε_i = 0
ε_i = 1%           69.56%
ε_i = 5%           77.13%
ε_i = 10%          86.34%
closest replica    100%

Sensitivity to tolerance parameter. When submitting split weights, a customer can use ε_i to strike a balance between strict load distribution and improved network performance. Although an accurate choice of ε_i depends on the underlying traffic load, we use our trace to shed some light on its usage. In Figure 4.8, we employ an equal-split policy, and try ε_i = 0, 1%, 5% and 10%. We show the percentage of clients that are mapped to the closest replica, and the mean performance and variance, over the entire 6-hour trace. We also compare them to the closest-replica policy. Surprisingly, tolerating a 1% deviation from a strict equal split allows 15% more clients to map to the closest replica. A 5% tolerance improves a further 7% of clients, and an additional 9% improvement is possible with a 10% tolerance. This demonstrates that ε_i provides a very tangible mechanism for trading off network performance and traffic distribution.

Predictability of Client Request Rate

So far we have shown rapid convergence given a temporarily fixed set of request rates per client. In reality, client request volume will vary, and DONAR's predicted request volume for a given client may not accurately forecast client traffic.
For DONAR to work well, client traffic rates must be sufficiently predictable at a granularity that remains useful to our customers. Our current deployment uses a fixed interval size of 10 minutes. We now show, via analysis of the same trace data, that request rates are sufficiently predictable on this timescale.

Figure 4.9: Stability of area code request rates. Top: relative difference (%) vs. area code in decreasing order of popularity (percentile). Bottom: cumulative traffic vs. area code in decreasing order of popularity (percentile).

Figure 4.9 (top) plots the relative difference between our estimated rate and the true rate for each client group, i.e., a value of zero indicates a perfect prediction of request volume. Each data point is the average difference over a 2-hour interval for one client. Figure 4.9 (bottom) is a CDF of all traffic from these same clients, which shows that the vast majority of incoming traffic belongs to groups whose traffic is very stable. The high variation in request rate of the last 50% of groups accounts for only 6% of total traffic.

Coarser-grained client aggregation (i.e., larger client groups) will lead to better request stability, but at the cost of locality precision. In practice, services prefer a fine granularity in terms of rate intervals and client location. Commercial CDNs and other replicated services, which see orders of magnitude more traffic than CoralCDN, would be able to achieve much finer granularity while keeping rates stable, such as tracking requests per minute, per IP prefix.
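The two quantities plotted in Figure 4.9 can be computed along the following lines. This is a hedged sketch: normalizing the prediction error by the true rate, the function names, and the sample volumes are all assumptions chosen for illustration rather than the exact procedure used on the trace.

```python
def relative_difference(estimated, actual):
    """Prediction error as a percentage of the true rate; 0 means a perfect prediction."""
    return 100.0 * abs(estimated - actual) / actual

def tail_traffic_share(volumes, tail_fraction=0.5):
    """Share of total traffic contributed by the least popular tail_fraction of groups."""
    ranked = sorted(volumes, reverse=True)             # most popular group first
    cut = len(ranked) - int(len(ranked) * tail_fraction)
    return sum(ranked[cut:]) / sum(ranked)

# Hypothetical per-group request volumes with a heavy-headed popularity distribution.
volumes = [500, 300, 140, 30, 20, 10]
error = relative_difference(estimated=110, actual=100)
share = tail_traffic_share(volumes)   # traffic share of the bottom half of groups
```

With a skewed popularity distribution like the one sketched here, even large relative errors in the tail barely affect the aggregate load that the optimization has to balance.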

Figure 4.10: Server request loads under closest replica policy (HTTP clients at each proxy, in ten-minute intervals).

Prototype Evaluation

We now evaluate our DONAR prototype when used to perform replica selection for CoralCDN. CoralCDN disseminates content by answering HTTP requests on each of its distributed replicas. Since clients access CoralCDN via the nyud.net domain suffix, they require a DNS mechanism to perform replica selection. For our experiments, we create a DONAR account for the nyud.net suffix, and add ten CoralCDN proxies as active replicas. We then point a subset of the CoralCDN NS records to DONAR's mapping nodes in order to direct DNS queries.

Closest Replica. We first implement a closest replica policy by imposing no constraint on the distribution of client requests. We then track the client arrival rate at each replica, calculated in 10-minute intervals over the course of three days. As Figure 4.10 demonstrates, the volume of traffic arriving at each replica varies highly according to replica location. The busiest replica in each interval typically sees ten times more traffic than the least busy. For several periods a single server handles more than 40% of requests. The diurnal traffic fluctuations, which are evident in the graph, also increase the variability of traffic on each replica.

Each CoralCDN replica has roughly the same capacity, and each replica's performance diminishes with increased client load, so a preferable outcome is one in which loads are relatively uniform. Despite a globally distributed set of replicas serving distributed clients, a naïve replica selection strategy results in highly disproportionate server loads and fails to meet this goal. Furthermore, due to diurnal patterns, there is no way to statically provision our servers in order to equalize load under this type of selection policy. Instead, we require a dynamic mapping layer to balance the goals of client proximity and load distribution, as we show next.

Figure 4.11: Proportional traffic distribution observed by DONAR (top: DNS answers for each proxy, in ten-minute intervals) and CoralCDN (bottom: HTTP clients at each proxy, in ten-minute intervals), when an equal-split policy is enacted by DONAR. Horizontal gray lines represent the ε tolerance ±2% around each split rate.

Controlling request distribution. We next leverage DONAR's API to dictate a specific distribution of client traffic amongst replicas. In this evaluation we partition the ten Coral servers to receive equal amounts of traffic (10% each), each with an allowed deviation of 2%. We then measure both the mapping behavior of each DONAR node and the client arrival rate at each CoralCDN replica. Figure 4.11 (top) shows the proportion of requests mapped to each replica as recorded by DONAR nodes. Replica request volumes fluctuate within the allowed range as DONAR adjusts to changing client request patterns. The few fluctuations that extend past the allowed tolerance are due to inaccuracies in request load prediction and intermediate solutions which (as explained in Section 4.3) may temporarily violate constraints. Figure 4.11 (bottom) depicts the arrival rate of clients at each CoralCDN node. The discrepancy between these graphs is an artifact of the choice to use DNS-based mapping, a method which offers complete transparency at the cost of some control over mapping behavior (as discussed in Section 4.6). Nonetheless, request distribution remains within 5% of the desired split during most time periods. Customers using HTTP-based mechanisms would see client arrival exactly equal to that observed by DONAR.

Figure 4.12: Client performance during equal split (HTTP clients at each proxy by ranked order from closest, DONAR vs. Round Robin).

Measuring client performance. We next evaluate the performance of CoralCDN clients in the face of the specific traffic distribution requests imposed in the prior experiment. Surprisingly, DONAR is able to sufficiently balance client requests while pairing clients with nearby servers with very high probability. This distinguishes DONAR from simple weighted load balancing schemes, which forgo a consideration of network performance in favor of achieving a specific split. While DONAR nodes define fractional rules for each client, in practice nearly every client is pinned to a particular replica at optimum. We can thus show the proportion of clients paired with their n-th preferred replica. Figure 4.12 plots this data, contrasting DONAR with a round-robin mapping policy. With DONAR, more than 50% of clients are mapped to the nearest node and 75% are mapped within the top three, a significant improvement over the traditional round-robin approach.
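The ranked-order metric behind Figure 4.12 (the proportion of clients paired with their n-th preferred replica) can be sketched as below. The data layout, function name, and sample distances are hypothetical. The sketch also assumes each client is pinned to a single replica, as the text notes is nearly always the case at optimum.

```python
from collections import Counter

def rank_distribution(distances, assignment):
    """Fraction of clients mapped to their 1st, 2nd, ... closest replica.

    distances:  {client: {replica: distance}}
    assignment: {client: replica the client is pinned to}
    """
    counts = Counter()
    for client, replica in assignment.items():
        # Replicas ordered from closest to farthest for this client.
        ordered = sorted(distances[client], key=distances[client].get)
        counts[ordered.index(replica) + 1] += 1
    n = len(assignment)
    return {rank: counts[rank] / n for rank in sorted(counts)}

# Hypothetical distances (km) from four clients to three replicas.
distances = {
    "c1": {"A": 10, "B": 50, "C": 90},
    "c2": {"A": 70, "B": 20, "C": 40},
    "c3": {"A": 30, "B": 60, "C": 15},
    "c4": {"A": 25, "B": 35, "C": 80},
}
assignment = {"c1": "A", "c2": "B", "c3": "C", "c4": "B"}
ranks = rank_distribution(distances, assignment)
```

In this toy assignment three of four clients reach their closest replica and one is shifted to its second choice, the kind of small displacement a load constraint tends to induce.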

4.6 Related Work

Network distance estimation. A significant body of research has examined techniques for network latency and distance estimation, useful for determining cost(c,i) in DONAR. Some work focused on reducing the overhead for measuring the IP address space [104, 90, 105, 106]. Alternatively, virtual coordinate systems [88, 89] estimated latency based on synthetic coordinates. More recent work considered throughput, routing problems, and abnormalities as well [11, 87]. Another direction is geographic mapping techniques, including whois data [107, 108], or extracting information from location-based naming conventions for routers [109, 110].

Configurable Mapping Policies. Existing services provide limited customization of mapping behavior. OASIS [82] offers a choice of two heuristics which jointly consider server load and client-server proximity. ClosestNode [111] supports locality policies based on programmatic on-demand network probing [86]. Amazon's EC2, a commercial service, allows basic load balancing between virtual machine instances. For those who are willing to set up their own distributed DNS infrastructure, packages like MyXDNS [112] facilitate extensive customization of mapping behavior, though only by leaving all inputs and decisions up to the user.

Efficacy of Request Redirection. Several studies have evaluated the effectiveness of DNS-based replica selection, including the accuracy of using DNS resolvers to approximate client location [113] or the impact of caching on DNS responsiveness [114]. Despite these drawbacks, DNS remains the preferred mechanism for server selection by many industry leaders, such as Akamai [6]. While HTTP tunneling and request redirection offer greater accuracy and finer-grain control for services operating on HTTP, they introduce additional latency and overhead.

4.7 Summary

DONAR enables online services to offer better performance to their clients at a lower cost, by directing client requests to appropriate server replicas.
DONAR's expressive interface supports a wide range of replica-selection policies, and its decentralized design is scalable, accurate, and responsive. Through a live deployment with CoralCDN, we demonstrate that our distributed algorithm is accurate and efficient, requiring little coordination among the mapping nodes to adapt to changing client demands. In our ongoing work, we plan to expand our CoralCDN and M-Lab deployments and start making DONAR available to other services. In doing so, we hope to identify more ways to meet the needs of networked services.

Chapter 5

Conclusion

Wide-area traffic management is a well-studied problem in the area of transit networks. The advent of cloud computing presents a new opportunity to CSPs for coordination between traffic-management tasks that were previously controlled by different institutions. Today's CSPs face a number of significant challenges in managing their traffic across wide-area networks, including (i) the limited visibility among administratively separated groups that control different management decisions, (ii) the loss of efficiency due to independent decision making in contrast to a coordinated solution, and (iii) the poor scalability as the need rises for running joint traffic management across a large number of geo-distributed data centers, applications, and clients. This dissertation follows a principled approach to address these problems with a set of new networking solutions, a scalable architectural design, and theoretical results with provable optimality guarantees.

In this chapter, we first summarize our main contributions in Section 5.1. We then discuss how to combine the different solutions to provide a new traffic management system in Section 5.2. We briefly propose future research directions in Section 5.3, and conclude with Section 5.4.

5.1 Summary of Contribution

In this section, we revisit the design requirements that are raised in Chapter 1, and summarize our contributions in this dissertation.

First, we observe that CSPs who have their own backbones, e.g., an ISP that wishes to deploy a CDN to serve its customers, have the opportunity to coordinate network routing and server selection decisions. Traditionally, CSPs do not make optimal decisions for both, due to the limited visibility that the traffic engineering group and the CDN have into each other. To understand who should provide what information, we develop three abstractions with an increasing amount of cooperation between the two parties, starting from (1) today's status quo that relies on end-to-end measurement at the edge, to (2) a partial cooperation by exposing the network topology and routes at the core, and towards (3) a full cooperation by communicating their objectives. We rigorously show that cooperation model (2) leads to a global optimum when traffic engineering and the CDN match in their interests. In general, however, it suffers from performance sub-optimality and, counter-intuitively, might be hurt by the extra visibility. Through a re-design of the two systems by introducing a Nash bargaining solution, we alleviate the performance efficiency loss while keeping the functional separation of existing systems. As a design example motivated by information sharing, this work provides a spectrum of design choices for CSPs on the trade-off space between performance, complexity, and architectural changes.

Second, we study how to maximize performance and minimize cost for CSPs that provide geographically replicated content services over multiple ISP networks. We propose a joint content placement and server selection solution for last-mile decentralized CDNs located within users' homes. The need for a joint design presents significant challenges: different traffic management decisions have varied resolutions, e.g., ranging from an individual server to an ISP, and may happen at different timescales, e.g., ranging from seconds to hours, which makes the joint problem hard to solve in practice.
We take a divide-and-conquer approach, and offer a new set of network solutions to capture these heterogeneities. Our proposed algorithms are simple, provably optimal, and can be implemented by a scalable architecture. As a design example motivated by joint control, this work sheds light on overcoming the issues of resource heterogeneity, computational tractability, and implementation complexity that are important for CSPs who wish to perform a joint optimization over different traffic management decisions.

Third, we address the need for CSPs to perform traffic management operations over a large number of data centers, applications and clients, e.g., offering a reliable replica-selection service for geographically distributed clients. Most existing replica-selection systems run on a set of distributed mapping nodes for directing client requests, but require either central coordination (which has reliability, security, and scalability limitations) or distributed heuristics (which are notorious for poor optimality, accuracy, and stability). We propose a distributed solution to coordinate server-selection decisions among a set of mapping nodes, allowing them to collectively optimize client performance and enforce management policies such as server load-balancing. As a design example motivated by distributed implementation, this work offers a new paradigm for CSPs to perform effective and distributed traffic management for geo-distributed online services.

The three proposed solutions in this dissertation offer complementary methods for CSPs to manage their traffic across wide-area networks. In practice, each of these design paradigms can be applied to specific systems that CSPs wish to deploy, for the purpose of sharing information, joint control, and distributed implementation.

5.2 Synergizing Three Traffic Management Solutions

A CSP might choose to implement one of the proposed traffic management solutions based on its needs. However, combining these solutions brings additional benefits to a CSP. In this section, we briefly show how to integrate the three solutions to provide a synergistic traffic management system, as described below.

An emerging trend of content distribution on the Internet is to place content close to the edge, e.g., end users, for better proximity and caching. As such, today's CSPs, such as social network providers, wish to deploy the CDN at multiple ISPs, or subscribe to multiple CDNs, for better geographic coverage. The service provider can leverage our solution in Chapter 3 to decide how to place a large catalog of content, e.g., user data in a social network, across CDNs (located at different ISPs), and how to direct user requests at the granularity of CDNs (or ISPs).
Within each CDN (or ISP), our solution in Chapter 2 can be used to coordinate server selection and network routing in an ISP network. Further, DONAR, which we propose in Chapter 4, can be used to provide the replica-selection service for each CDN.

Combining these solutions simplifies the management task of a CSP by separating functionalities between different institutions. A CSP can focus on the coordination between a CDN and traffic engineering inside a single ISP network, without the need to worry about traffic management decisions affected by other CDNs and ISPs. In addition, the CDN-specific server-selection task can be outsourced to DONAR to handle customized management policies, without the need to separately implement a replica-selection scheme for each CDN. Therefore, a CSP is freed from the more complicated and error-prone management that it would otherwise have to handle.

5.3 Open Problems and Future Work

As the Internet is increasingly a platform for online services, there are a number of important questions that CSPs need to further investigate beyond our work in this dissertation.

Optimizing CSP Operational Costs

Offering good performance to clients at a reasonable cost is the lifeblood of any CSP. In making traffic management decisions, cloud service providers should weigh operational costs for resources like bandwidth and electricity. This dissertation provides simple abstractions to handle primarily bandwidth costs, by minimizing cross-ISP traffic, and expressing server-load policies in DONAR. In practice, the bandwidth pricing model can be more complicated, such as the 95th-percentile billing used by network transit providers to charge for bandwidth consumption, or a tiered pricing plan that is being adopted by cloud-computing platforms like Amazon AWS [1]. On the other hand, power draw is one of the major capital outlays in modern data centers [115]. Electricity utility costs may vary spatially, e.g., from one data center location to another, and temporally, e.g., from the peak hour to the valley hour. CSPs have the opportunity to minimize their costs by directing clients to cheap servers, routing traffic on inexpensive links, and migrating jobs to under-utilized data centers. Bandwidth pricing has been proposed to reduce such costs [116].
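To make the 95th-percentile billing model mentioned above concrete, the sketch below computes a billable rate from periodic usage samples. It assumes one common convention (sort the samples, discard the top 5%, and charge for the highest remaining sample). Providers differ in sampling interval and rounding, so the function name and sample values here are purely illustrative.

```python
def billable_rate_95th(samples_mbps):
    """95th-percentile sample: sort, drop the top 5%, take the max of the rest."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ranked = sorted(samples_mbps)
    # ceil(0.95 * n) - 1, computed in integer arithmetic to avoid float rounding.
    index = (95 * len(ranked) + 99) // 100 - 1
    return ranked[index]

# Hypothetical usage samples in Mbps: steady traffic with short bursts.
one_burst = [100] * 19 + [900]       # 1 of 20 samples is a burst (5%)
two_bursts = [100] * 18 + [900] * 2  # 2 of 20 samples are bursts (10%)
rate_one = billable_rate_95th(one_burst)
rate_two = billable_rate_95th(two_bursts)
```

Under this convention a burst confined to 5% of the samples is free, while a slightly longer one sets the bill. This non-linearity is what makes such a pricing model harder to capture than a simple per-byte cost.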
Innovative solutions are needed to capture these location-dependent, time-varying costs, allowing CSPs to have more accurate and effective control over their operational costs.

Traffic Management Within a CSP Backbone

Large online service providers, such as Google and Yahoo, have their dedicated network backbones. In addition to serving Web-like traffic, some services need to send a large volume of data, e.g., search indices, app updates, or backup storage, to multiple data centers. On the surface, the traffic management problem within a CSP backbone seems similar to the traffic engineering performed within an ISP backbone. However, the two problems differ in a number of important ways. First, the CSP has control over the server (e.g., the sending rates) and the network (e.g., which path carries the traffic, or the rate split in the case of multiple paths). This presents the opportunity for joint control over rate allocation and network routing. Second, the CSP hosts a large number of services and applications that have different performance goals (e.g., low latency vs. high throughput), and different priorities (e.g., voice vs. data backup). Third, in contrast to intradomain traffic engineering that handles a relatively stable traffic matrix (e.g., client demands), some CSP services, like analytics for search, have an infinite traffic backlog and wish to consume whatever spare bandwidth the network can offer. This introduces new performance goals, e.g., maximizing throughput, rather than minimizing congestion. All scenarios above raise the need for new approaches to manage traffic in a new CSP network environment.

Long Term Server Placement

As the traffic demand for online services continues to grow, with the increasing popularity of video streaming, social networks and mobile apps, CSPs need to make long-term decisions for upgrading their infrastructure, including improving their peering connectivity with ISPs, and deploying more data centers and CDN servers. To offer more reliable services to clients, CSPs are often faced with two types of choices: (i) placing a small number of servers and establishing better peering contracts in terms of bandwidth, such as Level 3, or (ii) deploying a large number of servers in many ISPs that have better edge access to end users, such as Akamai. In making these decisions, CSPs need to consider the user performance (e.g., latency, throughput), operational costs (e.g., bandwidth, power), and capital investment (e.g., servers, routers).
As such, new performance and cost models are needed to drive the long-term planning of server placement and ISP peering selection.

Traffic Management within Data Centers

CSPs usually exploit economies of scale by consolidating a large number of servers in their data centers. Cloud services allow tenants to subscribe to a handful of servers or virtual machines (VMs) to host their applications. As the demand for cloud-based services continues to rise, data centers increasingly need efficient traffic management to improve resource utilization inside the network.

Compared to wide-area networks, intra-data-center traffic management shares many common ingredients, such as network routing and content placement. The statistical multiplexing of data center resources, such as CPU and memory, also introduces a new control knob, server (VM) placement: servers from a tenant must be placed with great care to ensure that their aggregate resource usage on the network and hosts is fulfilled. The work in [117] takes a first step to jointly manage routing and VM placement, leaving many interesting research directions similar to those addressed in this dissertation.

5.4 Concluding Remarks

In conclusion, this dissertation presented the design, evaluation, and implementation of a set of new wide-area traffic management solutions for large-scale cloud services. We have proposed (i) a cooperation scheme between a network provider and a content provider in the face of a growing trend of content-centric networking, (ii) a joint content distribution solution across a federation of decentralized CDNs, and (iii) a distributed mapping system that outsources server selection for geo-replicated online services. The evolution of the Internet has driven these exciting research topics, allowing us to revisit previous approaches we use to manage a network. Today's cloud service providers usually apply ad hoc techniques for controlling their traffic. As different parts of the network become a single platform for online services, we have unique opportunities to design the network from a clean slate, following the philosophy of networking as a discipline [118]: "Protocols were built to master complexity, but it is the ability to extract simplicity that lays the intellectual foundations. Abstractions are keys to extracting simplicity." We believe that the abstractions developed in this dissertation are promising enablers for designing wide-area networking solutions in a more systematic, automated and effective way, and will shed light on many others that are yet to be discovered.

Bibliography

[1] Amazon Web Services.
[2] R. Kohavi, R. M. Henne, and D. Sommerfield, "Practical guide to controlled experiments on the web: Listen to your customers not to the hippo," in Proc. ACM SIGKDD.
[3] Z. Zhang, M. Zhang, A. Greenberg, Y. C. Hu, R. Mahajan, and B. Christian, "Optimizing cost and performance in online service provider networks," in Proc. USENIX NSDI.
[4] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, "Cutting the electric bill for Internet-scale systems," in Proc. ACM SIGCOMM.
[5] AT&T.
[6] Akamai.
[7] Netflix.
[8] V. Valancius, N. Laoutaris, L. Massoulie, C. Diot, and P. Rodriguez, "Greening the Internet with nano data centers," in Proc. CoNEXT.
[9] Quova.
[10] P. Wendell, J. W. Jiang, M. J. Freedman, and J. Rexford, "DONAR: Decentralized server selection for cloud services," in Proc. ACM SIGCOMM.
[11] H. V. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, and A. Venkataramani, "iPlane: An information plane for distributed services," in Proc. USENIX OSDI.

[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press.
[13] W. Jiang, R. Zhang-Shen, J. Rexford, and M. Chiang, "Cooperative content distribution and traffic engineering in an ISP network," in Proc. ACM SIGMETRICS.
[14] W. B. Norton, "Video Internet: The next wave of massive disruption to the U.S. peering ecosystem," Sept. Equinix white paper.
[15] AT&T, U-verse.
[16] Verizon, FiOS.
[17] A.-J. Su, D. R. Choffnes, A. Kuzmanovic, and F. E. Bustamante, "Drafting behind Akamai (Travelocity-based detouring)," in Proc. ACM SIGCOMM.
[18] V. Aggarwal, A. Feldmann, and C. Scheideler, "Can ISPs and P2P users cooperate for improved performance?," ACM SIGCOMM Computer Communication Review, vol. 37, no. 3.
[19] H. Xie, Y. R. Yang, A. Krishnamurthy, Y. Liu, and A. Silberschatz, "P4P: Provider Portal for (P2P) Applications," in Proc. ACM SIGCOMM.
[20] D. DiPalantino and R. Johari, "Traffic engineering versus content distribution: A game theoretic perspective," in Proc. IEEE INFOCOM.
[21] J. F. Nash, "The bargaining problem," Econometrica, vol. 28.
[22] D. P. Palomar and M. Chiang, "A tutorial on decomposition methods for network utility maximization," IEEE J. on Selected Areas in Communications, vol. 24, no. 8.
[23] B. Fortz and M. Thorup, "Internet traffic engineering by optimizing OSPF weights," in Proc. IEEE INFOCOM.
[24] D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, and J. McManus, "RFC 2702: Requirements for Traffic Engineering Over MPLS," September.
[25] D. Xu, M. Chiang, and J. Rexford, "Link-state routing with hop-by-hop forwarding can achieve optimal traffic engineering," in Proc. IEEE INFOCOM.

[26] J. Wardrop, "Some theoretical aspects of road traffic research," the Institute of Civil Engineers, vol. 1, no. 2.
[27] T. Roughgarden and Éva Tardos, "How bad is selfish routing?," J. ACM, vol. 49, no. 2.
[28] L. Qiu, Y. R. Yang, Y. Zhang, and S. Shenker, "On selfish routing in Internet-like environments," in Proc. ACM SIGCOMM.
[29] M. Littman and J. Boyan, "A distributed reinforcement learning scheme for network routing," Tech. Rep. CMU-CS, Robotics Institute, Carnegie Mellon University.
[30] W. Jiang, R. Zhang-Shen, J. Rexford, and M. Chiang, "Cooperative content distribution and traffic engineering in a provider network," Tech. Rep. TR, Department of Computer Science, Princeton University.
[31] M. J. Osborne and A. Rubinstein, A Course in Game Theory. MIT Press.
[32] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall.
[33] K. Binmore, A. Rubinstein, and A. Wolinsky, "The Nash bargaining solution in economic modelling," RAND Journal of Economics, vol. 17.
[34] J. He, R. Zhang-Shen, Y. Li, C.-Y. Lee, J. Rexford, and M. Chiang, "DaVinci: Dynamically Adaptive Virtual Networks for a Customized Internet," in Proc. CoNEXT.
[35] D. P. Bertsekas, Nonlinear Programming. Athena Scientific.
[36] N. Spring, R. Mahajan, and D. Wetherall, "Measuring ISP topologies with Rocketfuel," in Proc. ACM SIGCOMM.
[37] Abilene.
[38] M. Roughan, M. Thorup, and Y. Zhang, "Performance of estimated traffic matrices in traffic engineering," SIGMETRICS Perform. Eval. Rev., vol. 31, no. 1.
[39] W. Jiang, R. Zhang-Shen, J. Rexford, and M. Chiang, "Cooperative content distribution and traffic engineering," in ACM NetEcon, August.

[40] W. Jiang, D.-M. Chiu, and J. C. S. Lui, "On the interaction of multiple overlay routing," Perform. Eval., vol. 62, no. 1-4.
[41] S. C. Lee, W. Jiang, D.-M. C. Chiu, and J. C. Lui, "Interaction of ISPs: Distributed resource allocation and revenue maximization," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 2.
[42] G. Shrimali, A. Akella, and A. Mutapcic, "Cooperative interdomain traffic engineering using Nash bargaining and decomposition," in Proc. IEEE INFOCOM.
[43] M. J. Freedman, C. Aperjis, and R. Johari, "Prices are right: Managing resources and incentives in peer-assisted content distribution," in Proc. International Workshop on Peer-To-Peer Systems, February.
[44] D. R. Choffnes and F. E. Bustamante, "Taming the torrent: a practical approach to reducing cross-ISP traffic in peer-to-peer systems," in Proc. ACM SIGCOMM.
[45] Y. Liu, H. Zhang, W. Gong, and D. Towsley, "On the interaction between overlay routing and underlay routing," in Proc. IEEE INFOCOM.
[46] R. T. Ma, D. Chiu, J. C. Lui, V. Misra, and D. Rubenstein, "On cooperative settlement between content, transit and eyeball internet service providers," in Proc. CoNEXT.
[47] J. W. Jiang, S. Ioannidis, L. Massoulie, and F. Picconi, "Orchestration of massively distributed CDNs," tech. rep., Department of Computer Science, Princeton University.
[48] "Cisco visual networking index: Forecast and methodology."
[49]
[50] Nanodatacenters:
[51] People's CDN:
[52] K. W. Ross, Multiservice Loss Networks for Broadband Telecommunications Networks. Springer-Verlag.
[53] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Athena Scientific.

[54] B. R. Tan and L. Massoulie, "Optimal content placement for peer-to-peer video-on-demand systems," in Proc. IEEE INFOCOM.
[55] F. Kelly, "Loss networks," The Annals of Applied Probability, no. 1.
[56] L. Massoulié, "Structural properties of proportional fairness," The Annals of Applied Probability, vol. 17, no. 3.
[57] H. Kushner and G. Yin, Stochastic approximation and recursive algorithms and applications. Springer.
[58] M. Benaïm and J.-Y. LeBoudec, "A class of mean field interaction models for computer and communication systems," Perform. Eval.
[59]
[60] C. Huang, A. Wang, J. Li, and K. W. Ross, "Understanding hybrid CDN-P2P: why Limelight needs its own Red Swoosh," in NOSSDAV.
[61] C. Huang, J. Li, and K. W. Ross, "Peer-assisted VoD: Making Internet video distribution cheap," in Proc. International Workshop on Peer-To-Peer Systems.
[62] R. S. Peterson and E. G. Sirer, "Antfarm: Efficient content distribution with managed swarms," in Proc. USENIX NSDI.
[63] Y. Chen, Y. Huang, R. Jana, H. Jiang, M. Rabinovich, B. Wei, and Z. Xiao, "When is P2P technology beneficial for IPTV services," in NOSSDAV.
[64] Y. F. Chen, Y. Huang, R. Jana, H. Jiang, M. Rabinovich, J. Rahe, B. Wei, and Z. Xiao, "Towards capacity and profit optimization of video-on-demand services in a peer-assisted IPTV platform," Multimedia Systems, vol. 15, no. 1.
[65] V. Misra, S. Ioannidis, A. Chaintreau, and L. Massoulié, "Incentivizing peer-assisted services: A fluid Shapley value approach," in Proc. ACM SIGMETRICS.
[66] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T. Bates, and A. Zhang, "Improving traffic locality in BitTorrent via biased neighbor selection," in Proc. International Conference on Distributed Computing Systems.

144 [67] R. Cuevas, N. Laoutaris, X. Yang, G. Siganos, and P. Rodriguez, Deep diving into BitTorrent ocaity, in Proc. IEEE INFOCOM, [68] V. Aggarwa, A. Fedmann, and C. Scheideer, Can ISPs and P2P users cooperate for improved performance?, ACM SIGCOMM Computer Communication Review, vo. 37, pp , Juy [69] J. Wang, C. Huang, and J. Li, On ISP-friendy rate aocation for peer-assisted VoD, in Proc. ACM Mutimedia, [70] I. Baev, R. Rajaraman, and C. Swamy, Approximation agorithms for data pacement in arbitrary networks, SIAM Journa of Computing, vo. 38, no. 4, [71] S. C. Borst, V. Gupta, and A. Waid, Distributed caching agorithms for content distribution networks, in Proc. IEEE INFOCOM, [72] M. R. Korupou, C. G. Paxton, and R. Rajaraman, Pacement agorithms for hierarchica cooperative caching, Journa of Agorithms, vo. 38, [73] Y. Zhou, T. Z. J. Fu, and D. M. Chiu, Modeing and anaysis of P2P repication to support VoD service, in Proc. IEEE INFOCOM, [74] W. Wu and J. C. S. Lui, Exporing the optima repication strategy in P2P-VoD systems: characterization and evauation, in Proc. IEEE INFOCOM, [75] AmazonAWS, Eastic oad baancing. Easticoadbaancing/, [76] DynDNS. [77] UtraDNS. [78] B. Maggs, Persona communication, [79] M. Coajanni, P. S. Yu, and D. M. Dias, Scheduing agorithms for distributed web servers, in Proc. Internationa Conference on Distributed Computing Systems, May [80] M. Conti, C. Nazionae, E. Gregori, and F. Panzieri, Load distribution among repicated Web servers: A QoS-based approach, in Workshop Internet Server Perf., May

145 [81] V. Cardeini, M. Coajanni, and P. S. Yu, Geographic oad baancing for scaabe distributed web systems, in MASCOTS, August [82] M. J. Freedman, K. Lakshminarayanan, and D. Mazières, OASIS: Anycast for any service, in Proc. USENIX NSDI, May [83] M. Pathan, C. Vecchioa, and R. Buyya, Load and proximity aware request-redirection for dynamic oad distribution in peering CDNs, in OTM, November [84] MeasurementLab. [85] M. J. Freedman, E. Freudentha, and D. Mazières, Democratizing content pubication with Cora, in Proc. USENIX NSDI, March [86] B. Wong, A. Sivkins, and E. G. Sirer, Meridian: A ightweight network ocation service without virtua coordinates, in Proc. ACM SIGCOMM, [87] R. Krishnan, H. V. Madhyastha, S. Srinivasan, S. Jain, A. Krishnamurthy, T. Anderson, and J. Gao, Moving beyond end-to-end path information to optimize CDN performance, in Proc. ACM SIGCOMM, [88] E. Ng and H. Zhang, Predicting Internet network distance with coordinates-based approaches, in Proc. IEEE INFOCOM, [89] F. Dabek, R. Cox, F. Kaashoek, and R. Morris, Vivadi: A decentraized network coordinate system, in Proc. ACM SIGCOMM, [90] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, and L. Zhang, IDMaps: A goba Internet host distance estimation service, Trans. Networking, October [91] D. K. Godenberg, L. Qiu, H. Xie, Y. R. Yang, and Y. Zhang, Optimizing cost and performance for mutihoming, in Proc. ACM SIGCOMM, [92] D. P. Bertsekas, Noninear Programming. Athena Scientific, [93] PanetLab. [94] A. Bavier, N. Feamster, M. Huang, L. Peterson, and J. Rexford, In VINI veritas: Reaistic and controed network experimentation, in Proc. ACM SIGCOMM,

146 [95] S. Thomson, Y. Rekhter, and J. Bound, Dynamic updates in the domain name system (DNS UPDATE), RFC [96] J. Terrace and M. J. Freedman, Object storage on CRAQ: High-throughput chain repication for read-mosty workoads, in USENIX Annua Technica Conference, June [97] Zookeeper [98] D. Dean, E. W. Feten, and D. S. Waach, Java security: From HotJava to Netscape and beyond, in Symp. Security and Privacy, May [99] V. Vaancius, N. Feamster, J. Rexford, and A. Nakao, Wide-area route contro for distributed services, in USENIX Annua Technica Conference, June [100] Internet2, Network Diagnostic Too (NDT) [101] M. Mathis, J. Heffner, and R. Reddy, Network Path and Appication Diagnosis (NPAD) [102] M. Krohn, E. Koher, and F. M. Kaashoek, Events can make sense, in USENIX Annua Technica Conference, August [103] MOSEK, [104] J. Guyton and M. Schwartz, Locating nearby copies of repicated Internet servers, in Proc. ACM SIGCOMM, [105] W. Theimann and K. Rotherme, Dynamic distance maps of the Internet, in Proc. IEEE INFOCOM, [106] Y. Chen, K. H. Lim, R. H. Katz, and C. Overton, On the stabiity of network distance estimation, SIGMETRICS Perform. Eva. Rev., vo. 30, no. 2, pp , [107] IP to Lat/Long server [108] D. Moore, R. Periakaruppan, and J. Donohoe, Where in the word is netgeo.caida.org?, in Proc. INET, June

147 [109] V. N. Padmanabhan and L. Subramanian, An investigation of geographic mapping techniques for Internet hosts, in Proc. ACM SIGCOMM, [110] M. J. Freedman, M. Vutukuru, N. Feamster, and H. Baakrishnan, Geographic ocaity of IP prefixes, in Proc. Internet Measurement Conference, October [111] B. Wong and E. G. Sirer, CosestNode.com: An open access, scaabe, shared geocast service for distributed systems, SIGOPS OSR, vo. 40, no. 1, [112] H. A. Azoubi, M. Rabinovich, and O. Spatscheck, MyXDNS: A resquest routing DNS server with decouped server seection, in WWW, May [113] Z. M. Mao, C. D. Cranor, F. Dougis, M. Rabinovich, O. Spatscheck, and J. Wang, A precise and efficient evauation of the proximity between Web cients and their oca DNS servers, in USENIX Annua Technica Conference, June [114] J. Pang, A. Akea, A. Shaikh, B. Krishnamurthy, and S. Seshan, On the responsiveness of DNS-based network contro, in Proc. Internet Measurement Conference, October [115] A. Greenberg, J. Hamiton, D. A. Matz, and P. Pate, The cost of a coud: research probems in data center networks, ACM SIGCOMM Computer Communication Review, vo. 39, no. 1, pp , [116] H. Baani, P. Costa, T. Karagiannis, and A. Rowstron, The price is right: Towards ocationindependent costs in datacenters, in Proc. SIGCOMM Workshop on Hot Topics in Networking, [117] J. W. Jiang, T. Lan, S. Ha, M. Chen, and M. Chiang, Joint VM pacement and routing for data center traffic engineering, in Proc. IEEE INFOCOM Mini Conference, [118] S. Shenker, The future of networking, and the past of protocos. Presented at distinguished cooquium series, Computer Science Dept., Princeton University,