On Distributed Computation Rate Optimization for Deploying Cloud Computing Programming Frameworks

Jia Liu, Cathy H. Xia, Ness B. Shroff, Xiaodong Zhang
Dept. of Electrical and Computer Engineering; Dept. of Integrated Systems Engineering; Dept. of Computer Science and Engineering
The Ohio State University, Columbus, OH 43210, U.S.A.
{liu, shroff}@ece.osu.edu, xia.52@osu.edu, zhang@cse.ohio-state.edu

ABSTRACT
With the rapidly growing challenges of big data analytics, the need for efficient and distributed algorithms to optimize cloud computing performance is unprecedentedly high. In this paper, we consider how to optimally deploy a cloud computing programming framework (e.g., MapReduce and Dryad) over a given underlying network hardware infrastructure to maximize the end-to-end computation rate and minimize the overall computation and communication costs. The main contributions in this paper are three-fold: (i) we develop a new network flow model with a generalized flow-conservation law to enable a systematic design of distributed algorithms for computation rate utility maximization problems (CRUM) in cloud computing; (ii) based on the network flow model, we reveal key separable properties of the dual functions of Problem CRUM, which further lead to a distributed algorithm design; and (iii) we offer important networking insights and meaningful economic interpretations for the proposed algorithm and point out their connections to and distinctions from distributed algorithm design in traditional data communications networks. This paper serves as an important first step towards the development of a theoretical foundation for distributed computation analytics in cloud computing.

1. INTRODUCTION
With the rapid advances in information technologies, recent years have witnessed growing challenges in storing, processing, and analyzing large data sets in many areas, such as social networks, web services, genomic research, network traffic management, complex physics simulations, and environmental research, to name just a few [1, 2].
Traditional centralized relational database management systems, first developed more than four decades earlier, cannot handle the unstructured and dynamic nature of today's large data sets, and their performance scales poorly as data set sizes increase. As a result, distributed cloud computing platforms that have a large number of networked computing nodes with massively parallel structures have emerged as an attractive solution for handling big data analytics.

Among cloud computing programming frameworks for big data analytics, perhaps the most famous example is the so-called MapReduce [3], designed initially by Google for scanning large amounts of textual data to create web search indexes for the entire Internet. Another notable example is Dryad [4], which is Microsoft's counterpart to MapReduce. Dryad has been seen by some researchers as an approach to address the pitfalls of the MapReduce framework [5] in that: (i) it allows for general styles of computation by using a directed acyclic graph (DAG), which goes well beyond just map and reduce phases; and (ii) it allows communications between stages to happen over more than just files stored in distributed file systems. Furthermore, recent research showed that MapReduce and Dryad can be mathematically unified under a well-defined matrix-based model [6].

However, as promising as they are, the actual deployments of cloud computing frameworks such as MapReduce and Dryad remain in their infancy, and there are many technical issues to be resolved. For example, in the most popular MapReduce implementation, known as the Hadoop project [7], there is only one active job tracker to schedule and monitor all map and reduce tasks. This not only creates a single-point-of-failure vulnerability, but also scales poorly, which defeats the whole purpose of distributedness in cloud computing.
To date, although there exist some heuristic design rules of thumb (e.g., moving computation nodes closer to their data sources to avoid unnecessary communications), little effort has been made to establish a mathematical foundation to systematically develop distributed algorithms and control schemes that optimize the deployments and performance of cloud computing programming frameworks. Hence, the goal of our paper is to take the first step to fill this gap.

In this paper, we study how to optimally decompose and allocate the subcomputation tasks in a cloud computing programming framework over an underlying hardware infrastructure, such that the end-to-end computation rate can be maximized and the overall computation and communication costs can be minimized. Further, algorithms for solving this problem need to be implemented in a distributed fashion. Toward this end, motivated by Dryad (which also subsumes MapReduce, as pointed out earlier), we model a cloud computing programming framework as computing a set of generic functions $\{\Theta_k\}$ that share a common set of distributed incoming data sources, where the data dependency can be represented by a multiple-input multiple-output directed acyclic graph (MIMO-DAG), as shown in the illustrated example in Fig. 1(a).¹ On the other hand, the underlying networked system upon which any cloud computing framework needs to be deployed can also be represented by a graph. For example, Fig. 1(b) illustrates a 2-D torus interconnection topology commonly found in large data centers or supercomputers that support cloud computing jobs.

Copyright is held by author/owners. Performance Evaluation Review, Vol. 40, No. 4, March 2013.

[Figure 1: An illustrative example of deploying a cloud computing programming framework over an underlying network infrastructure: (a) Computing functions $\Theta_1$ and $\Theta_2$ at computation rate $\lambda$; the data dependencies are represented by a MIMO-DAG structure; (b) An illustration of a 16-node 2-D torus interconnection topology commonly seen in large data centers or supercomputers.]

¹ The addition and multiplication operations in Fig. 1(a) are only for illustrative purposes. Each vertex could be any general data processing operation.

As will be shown later, to capture the key features of cloud computing programming frameworks and incorporate computation and communication constraints of the underlying networks, we develop a generic mathematical modeling framework, based on which we formulate the computation rate utility maximization problem (CRUM). Our main results and contributions in this paper are three-fold:

By accounting for in-network data aggregation flows, we show that one can construct a new network flow model with a generalized flow-conservation law to address both communication limits and computation costs in Problem CRUM. We point out that this utility framework and the associated generalized flow-conservation law are important in that they share the same separable structure as in the classical network utility maximization problems in data communication networks [8, 9].
Thus, they enable the design of polynomial-time distributed algorithms for cloud computing by drawing on experience gained from the classical network utility maximization theory.

By appropriately reformulating Problem CRUM in the Lagrangian dual domain, we reveal key separable properties of the dual function of Problem CRUM. These key structural properties enable us to obtain closed-form expressions for the primal and dual update schemes, which further lead to a fully distributed algorithmic implementation for big data analytics in cloud computing.

We offer important networking insights and economics interpretations for the proposed distributed algorithm. We also point out the connections to and distinctions from the dual decomposition based distributed algorithms in the classical network utility maximization theory for data communications networks, thus further advancing our understanding of distributed approaches in cloud computing network optimization theory. We also provide numerical examples to show the efficacy of our proposed distributed algorithm.

To our knowledge, this paper is among the first that treat the design of distributed algorithms in MIMO-DAG based cloud computing frameworks (e.g., Dryad) via a rigorous theoretical network-flow optimization approach. The remainder of this paper is organized as follows. In Section 2, we review some related work in the literature, putting our work in a comparative perspective. Section 3 introduces the new network flow model with a generalized flow-conservation law for deploying cloud computing frameworks. Section 4 develops the principal components of our proposed distributed algorithm for solving Problem CRUM. Section 5 provides numerical results and Section 6 concludes this paper.

2. RELATED WORK
Our work is closely related to (i) distributed cross-layer utility maximization theory for data communication networks (see, e.g., [10] for an overview) and (ii) in-network computation techniques. In the in-network computation literature, our work is most related to [11], where Shah et al.
developed a network flow model for in-network computation in sensor network applications. The model in [11] extends the conventional flow-conservation law in the network flow literature [12] to in-network computation applications. However, the model in [11] is restricted to simple tree topologies that are used for data aggregation in sensor networks. In contrast, our network flow model works with generic MIMO-DAGs, which are the most appropriate models for complex cloud computing programming frameworks, such as Dryad and MapReduce. Also, in our network model, we incorporate general network utility and communication/computation cost functions, which were not considered in [11].

Our network model also shares some similarities with the load shedding and distributed resource control problem of stream processing networks (e.g., [13-15]). But our work

differs from these works in the following two important aspects: First, although the issue of flow imbalance in stream processing networks was also pointed out in [14, 15], the flow imbalance in [14, 15] was caused by different flow production and consumption rates between upstream and downstream nodes. In contrast, the flow imbalance in this work is due to subcomputation in clouds, which is a fundamentally different cause. Second, in [13], the task-to-server assignment relationship is assumed to be given and the authors only studied the end-to-end utility rate maximization. In contrast, the task-to-server assignment relationship is also part of the overall optimization in our work.

Our network flow model also has connections with the graph embedding problems in graph theory (e.g., [16-18]). But these problems differ from ours in that their embedding objectives were to minimize some graph-theoretic performance metrics, such as dilation (i.e., the maximum distance in the network between adjacent tree nodes).

3. NETWORK MODEL AND PROBLEM FORMULATION
In this section, we first present the modeling details of cloud computing programming frameworks and the underlying network infrastructure in Section 3.1. Then, the concept of mapping between a programming framework and network infrastructure is introduced in Section 3.2. Next, we develop a new network flow model with a generalized flow-conservation law in Section 3.3. Finally, based on the network flow model, we formulate the computation rate optimization problem in Section 3.4.

3.1 Modeling Cloud Computing Frameworks and Network Infrastructure
As mentioned earlier, we model a cloud computing programming framework by a MIMO-DAG. Here, we denote a MIMO-DAG by $\mathcal{D} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ represent the sets of vertices and edges, respectively, with $|\mathcal{V}| = V$ and $|\mathcal{E}| = E$. We note that this MIMO-DAG structure captures the parallel processing and multi-stage aggregation nature of typical cloud computing programming frameworks, such as Dryad or multiple concatenations of Map/Reduce.
We suppose that there are $S$ vertices of in-degree zero in $\mathcal{D}$, corresponding to the $S$ distributed input data sources. Without loss of generality (w.l.o.g.), we label these source vertices as $g_1, \ldots, g_S$. For modeling convenience, there are $K$ vertices in $\mathcal{D}$ having in-degree one and out-degree zero, corresponding to the virtual sinks that absorb the final output of $\Theta_k$, $k = 1, \ldots, K$. Again, w.l.o.g., we label such sinks as $g_{V-K+1}, \ldots, g_V$. The remaining nodes $g_{S+1}, \ldots, g_{V-K}$ perform the computations prescribed by $\{\Theta_k\}$.

The input data stream at each source $g_i$ is denoted as $\{\chi_i(k)\}_{k=0}^{\infty}$, where $\chi_i(k)$ represents the $i$-th input element at time instant $k$. The infinite input data streams could represent, for example, the big data phenomenon exemplified by the large data sets processed by the MapReduce or Dryad frameworks in cloud computing. In this paper, all sources are assumed to be synchronized. We note that synchronization in cloud computing systems is also an active research topic (see, e.g., [19] and references therein) and its details are beyond the scope of this paper.

For simplicity, we assume that all input data streams and all directed edges in the MIMO-DAG structure have a homogeneous input rate $\lambda$, which can also be thought of as the end-to-end computation rate of $\{\Theta_k\}$. We remark that the cases with heterogeneous rates due to compression or expansion processing can be handled similarly by introducing coefficients in front of the data rates [14] in our utility maximization formulation described later. One can further transform such cases into a homogeneous case using modeling techniques in [13] by changing the measure and absorbing the coefficient into the resource usage parameters. Hence, this homogeneous rate assumption does not lose any generality.

We let $h(e)$ and $t(e)$ denote the head and tail vertices of each directed edge $e \in \mathcal{E}$, respectively. We let $\mathcal{S} \triangleq \{e \in \mathcal{E} : h(e) \in \{g_1, \ldots, g_S\}\}$ be the set of all edges originating from the sources. We note that $|\mathcal{S}| \ge S$ in general.
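As a concrete illustration of these definitions, the following minimal Python sketch stores a small MIMO-DAG as a list of directed edges and recovers the in-degree-zero sources and the source-edge set. The graph is a hypothetical toy example (it is not the DAG of Fig. 1(a)), and the names `edges`, `source_vertices`, and `source_edges` are assumptions made for illustration. Following the definition of the source-edge set above, h(e) is taken to be the vertex an edge leaves and t(e) the vertex it enters.

```python
# Hypothetical toy MIMO-DAG (not the one in Fig. 1(a)).
# Each edge is stored as name -> (h(e), t(e)), where h(e) is the vertex the
# edge leaves and t(e) is the vertex it enters.
edges = {
    "e1": ("g1", "g3"),
    "e2": ("g2", "g3"),
    "e3": ("g2", "g4"),
    "e4": ("g3", "g5"),
    "e5": ("g4", "g5"),
    "e6": ("g5", "g6"),  # g6 plays the role of a virtual sink
}


def source_vertices(edges):
    """Vertices with in-degree zero, i.e. the distributed data sources."""
    heads = {h for h, _ in edges.values()}
    tails = {t for _, t in edges.values()}
    return heads - tails


def source_edges(edges):
    """Edges whose h(e) is a source vertex (the source-edge set in the text)."""
    sources = source_vertices(edges)
    return {name for name, (h, _) in edges.items() if h in sources}


sources = source_vertices(edges)
s_edges = source_edges(edges)
print(sorted(sources), sorted(s_edges))
```

In this toy graph the source g2 feeds two edges, so there are more source edges than source vertices, which is the situation the inequality above allows for.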
Likewise, we let $\mathcal{K} \triangleq \{e \in \mathcal{E} : t(e) \in \{g_{V-K+1}, \ldots, g_V\}\}$ denote the set of edges terminated at the sinks. Note that, since each sink edge represents a unique output (see Fig. 1(a)), we have $|\mathcal{K}| = K$. W.l.o.g., we label the edges in $\mathcal{E}$ in such a way that the first $|\mathcal{S}|$ edges $e_1, \ldots, e_{|\mathcal{S}|}$ and the last $K$ edges $e_{E-K+1}, \ldots, e_E$ are the source and sink edges, respectively. In Fig. 1(a), for example, $\mathcal{S} = \{e_1, \ldots, e_6\}$ and $\mathcal{K} = \{e_{12}, e_{13}\}$.

We point out that each directed edge in $\mathcal{D}$ can be viewed as a subcomputation task. For example, in Fig. 1(a), edge $e_7$ corresponds to the computation result $\chi_1 + \chi_2$. Therefore, the terms "edge" and "subcomputation" are sometimes used interchangeably in this paper. It should be noted, however, that a subcomputation does not necessarily correspond to a unique edge. For example, in Fig. 1(a), edges $e_7$ and $e_8$ both carry the same subcomputation $\chi_1 + \chi_2$. The set of successor edges of $e$ in $\mathcal{D}$ is defined as $\Psi(e) \triangleq \{e' \in \mathcal{E} : t(e) = h(e')\}$. For example, in Fig. 1(a), the successor edges of $e_2$ are $e_7$ and $e_8$. Note that from the structure of $\mathcal{D}$, we have $\Psi(e_i) = \emptyset$ for all $e_i \in \mathcal{K}$. Finally, we remark that the MIMO-DAG structure degenerates into a tree topology if it has a single sink and each edge only has one successor edge. Therefore, the tree-structure model considered in [11] can be viewed as a special case of our model.

On the other hand, in cloud computing environments (e.g., data centers or supercomputers), we have a networked system as the physical computing and communication infrastructure (i.e., a cloud computing hardware platform). Usually, the hardware platform also exhibits certain carefully constructed parallel structures. For example, Fig. 1(b) illustrates a 2-D torus interconnection topology that is commonly seen in cloud hardware platforms. Advanced supercomputers nowadays could employ torus interconnections with even higher dimensions (e.g., the 5-D torus in the IBM Blue Gene/Q architecture).
But we point out that these special topology properties are not essential to our network models and our distributed algorithm design, both of which apply to general interconnection topologies. In this paper, we also model the underlying cloud computing hardware platform by a graph $\mathcal{G} = \{\mathcal{N}, \mathcal{L}\}$, where $\mathcal{N}$ and $\mathcal{L}$ are the sets of nodes and links with $|\mathcal{N}| = N$ and $|\mathcal{L}| = L$, respectively. Depending on the size and scale of the hardware system, each node in $\mathcal{N}$ could represent, e.g., a CPU core, a node card, or even a computer rack. Each link in $\mathcal{L}$ represents the communication connection between the nodes. It could represent a local high-speed bus if the nodes are co-located on the same card or a fiber link if the nodes are located at different racks.

As mentioned earlier, to avoid unnecessary data communications in cloud computing models such as Map/Reduce and Dryad, it is highly desirable to allocate subcomputation tasks involving input data sources and outputs at their physical locations. Therefore, w.l.o.g., we label the nodes of $\mathcal{N}$ as $n_1, \ldots, n_N$ in such a way that the first $S$ nodes $n_1, \ldots, n_S$ correspond to the physical locations of the $S$ data sources of $\mathcal{D}$.

3.2 Mapping a Cloud Computing Framework onto a Network Infrastructure
Clearly, deploying a given cloud computing framework on a cloud computing hardware platform amounts to mapping all edges in $\mathcal{D}$ onto $\mathcal{G}$ in an appropriate fashion. As in standard graph theory terms, we define a path $P$ as a sequence of nodes in $\mathcal{N}$ such that every two adjacent nodes $n_j$ and $n_k$ in $P$ satisfy $(n_j, n_k) \in \mathcal{L}$. The first and the last nodes in $P$ are called the start and end nodes and are denoted as $\alpha(P)$ and $\beta(P)$, respectively. In the extreme case, $P$ could contain only one node, say $n_k$. In this case, $\alpha(P) = \beta(P) = n_k$ and the link $(n_k, n_k)$ degenerates into a self-loop. Here, if an edge is mapped to a path $P$, it means that the data are transmitted from $\alpha(P)$ via the specified links to $\beta(P)$ and then the corresponding subcomputation is performed at $\beta(P)$. In the special case where an edge is mapped to a self-loop, the corresponding subcomputation task is performed locally and no communication is needed. We denote the set of all paths in $\mathcal{G}$ as $\mathcal{P}$. A valid mapping of $\mathcal{D}$ onto $\mathcal{G}$ is defined as follows:

Definition 1. A mapping $M : \mathcal{E} \to \mathcal{P}$ is valid if: (1) $\alpha(M(e_i)) = n_k$ if $e_i \in \mathcal{S}$ and $h(e_i) = g_k$, $k = 1, \ldots, S$; (2) $\beta(M(e_i)) = n_k$ if $e_i \in \mathcal{K}$ and $n_k$ is the physical output location of $e_i$; and (3) $\alpha(M(e_j)) = \beta(M(e_i))$ if $e_j \in \Psi(e_i)$.

In Definition 1, conditions (1) and (2) imply that the source and the sink nodes in $\mathcal{D}$ have to match their physical locations in the underlying network, and condition (3) requires that the logical successor relationships of the edges in $\mathcal{D}$ be respected after the mapping. In general, there are many different ways to map $\mathcal{D}$ onto $\mathcal{G}$. For example, it is not difficult to see that Figs. 2(a) and 2(b) illustrate two valid mappings of the MIMO-DAG structure in Fig. 1(a) onto the network in Fig. 1(b).
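The three conditions of Definition 1 can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the three-node line network, the three-edge DAG, and all names (`links`, `dag`, `source_location`, `output_location`, `is_valid_mapping`) are hypothetical and are not taken from Figs. 1 or 2. A path is represented as a list of node names, and a single-node path stands for a self-loop.

```python
# Hypothetical toy instance (not Fig. 1): three physical nodes on a line.
links = {("n1", "n2"), ("n2", "n1"), ("n2", "n3"), ("n3", "n2")}

# DAG edges e -> (h(e), t(e)); e1 and e2 are source edges, e3 is the sink edge.
dag = {"e1": ("g1", "g3"), "e2": ("g2", "g3"), "e3": ("g3", "g4")}
source_location = {"g1": "n1", "g2": "n2"}  # physical location of each source
output_location = {"e3": "n3"}              # physical output location of each sink edge


def successors(e):
    """Psi(e): the edges that leave the vertex where e ends."""
    return [f for f, (h, _) in dag.items() if h == dag[e][1]]


def is_path(path):
    """A single node is a self-loop; otherwise adjacent nodes must be linked."""
    return len(path) >= 1 and all((a, b) in links for a, b in zip(path, path[1:]))


def is_valid_mapping(mapping):
    for e, path in mapping.items():
        if not is_path(path):
            return False
        head = dag[e][0]
        # Condition (1): a source edge starts at its source's physical node.
        if head in source_location and path[0] != source_location[head]:
            return False
        # Condition (2): a sink edge ends at its physical output location.
        if e in output_location and path[-1] != output_location[e]:
            return False
        # Condition (3): every successor edge starts where this edge ends.
        if any(mapping[f][0] != path[-1] for f in successors(e)):
            return False
    return True


good = {"e1": ["n1", "n2"], "e2": ["n2"], "e3": ["n2", "n3"]}
bad = {"e1": ["n1", "n2"], "e2": ["n2"], "e3": ["n3"]}
print(is_valid_mapping(good), is_valid_mapping(bad))
```

The second mapping fails only condition (3): e3 would start at n3 although its predecessors e1 and e2 both end at n2.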
Note that in a valid mapping, the node sequence corresponding to an edge in $\mathcal{D}$ could consist of multiple connected links in $\mathcal{G}$. For example, in Fig. 2(b), edge $e_8$ consists of links $(n_5, n_9)$, $(n_9, n_{10})$, and $(n_{10}, n_{11})$. We note that each mapping may use a different computation rate, as shown in Figs. 2(a) and 2(b). For example, in Fig. 2(a), the label "$e_i : 2$" represents that the computation rate of each edge $e_i$ is 2 in this mapping. Further, a time-sharing among valid mappings also yields a valid mapping. For instance, Fig. 3 shows a time-sharing between the two mappings in Figs. 2(a) and 2(b), each accounting for 50% of time. It is not difficult to see that the computation rate of this time-sharing mapping is $0.5 \times 2 + 0.5 \times 3 = 2.5$.

[Figure 2: Two valid mappings of the MIMO-DAG structure in Fig. 1(a) onto the network in Fig. 1(b): (a) Mapping 1 with computation rate $\lambda = 2$; (b) Mapping 2 with computation rate $\lambda = 3$. Each mapping has a different computation rate.]

3.3 A Network Flow Model with a Generalized Flow-Conservation Law
It can be seen from the above discussions that the computation rate region of a cloud computing framework over a given network infrastructure can be exhausted by all time-sharing strategies among all possible valid mappings. As a result, any performance metric related to the computation rate can be optimized by determining an optimal time-sharing strategy. However, since the total number of valid mappings is exponential in $N$, finding an optimal time-sharing strategy directly is intractable. Therefore, we need to exploit additional structure of the networked system to address this challenge. To this end, our basic idea is to develop a new network flow model with a generalized flow-conservation law for computation analytics.
The fundamental rationale behind this approach is that, similar to the classical model in the multicommodity network flow literature [12], the structural properties of the new network flow model would also lead to polynomial-time solutions if designed appropriately.

Unfortunately, developing a flow-conserved network flow model for computation analytics is nontrivial. Because computations are performed at each node in the network, the classical flow-conservation law [12] fails to hold in computation analytics network flows. For example, from the zoom-in view for node $n_{10}$ in Fig. 4, it can be seen that the total incoming flow rate is:

$$x^{e_7}_{(n_9,n_{10})} + x^{e_8}_{(n_9,n_{10})} + x^{e_9}_{(n_{11},n_{10})} + x^{e_9}_{(n_{11},n_{10})} = 5,$$

where the two $e_9$ terms are the contributions of the two time-shared mappings (at rates 1.5 and 1, respectively). However, due to the addition computations performed at node $n_{10}$, the total outgoing flow rate is:

$$x^{e_9}_{(n_{10},n_9)} + x^{e_8}_{(n_{10},n_{11})} + x^{e_{12}}_{(n_{10},n_{14})} = 4.$$

This shows that, for computation analytics in cloud computing, the total input and output network flows are unbalanced.

[Figure 3: A time-sharing between the two mappings in Fig. 2(a) and Fig. 2(b), each accounting for 50% of time.]

[Figure 4: Zoom-in view of node $n_{10}$ in Fig. 3. Due to the addition computations, the total incoming and outgoing flow rates at $n_{10}$ are not equal.]

Although the traditional flow-conservation law no longer holds here, upon a closer look at each incoming edge and its corresponding outgoing or successor edges at $n_{10}$, it is not difficult to observe another form of flow-conservation law. As shown in Fig. 4, for the incoming edge $e_9$ (at rate 1.5) that does not involve any computation at $n_{10}$, we have $x^{e_9}_{(n_{11},n_{10})} = x^{e_9}_{(n_{10},n_9)}$, i.e., the incoming rate equals the outgoing rate. The same is also true for the incoming edge $e_8$. On the other hand, for the incoming edge $e_7$ that is involved in the addition with the incoming edge $e_9$ (at rate 1), we have $x^{e_7}_{(n_9,n_{10})} = x^{e_{12}}_{(n_{10},n_{14})}$ and $x^{e_9}_{(n_{11},n_{10})} = x^{e_{12}}_{(n_{10},n_{14})}$, i.e., the outgoing rate of the successor edge of $e_7$ is equal to the incoming rate of $e_7$.

To model this new form of flow-conservation law, it is important to recognize that, for a subcomputation, say $e$, that is injected into node $n$, a portion of its flow is terminated at $n$ to compute the successor edges of $e$ (i.e., $\Psi(e)$). For convenience, we label the links in the physical network as $1, \ldots, L$ instead of using node pairs $(n_j, n_k)$. We let $x_l^{e} \ge 0$ represent the flow amount of subcomputation task $e$ on link $l$.
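The imbalance at node n10, and the per-edge balance that replaces it, can be reproduced from the flow labels of Fig. 4. The sketch below is only a bookkeeping check: the dictionaries transcribe the labels as they appear in the figure, while the helper name `rate_per_edge` and the assumption that e12 is produced only at n10 (so that its production rate equals its outgoing rate there) are ours.

```python
# Per-link flow labels around node n10, read from Fig. 4 (time-shared mapping).
# Each entry is (subcomputation edge, rate). e9 appears twice on link
# (n11, n10): once for each of the two time-shared mappings.
incoming = {
    ("n9", "n10"): [("e7", 1.0), ("e8", 1.5)],
    ("n11", "n10"): [("e9", 1.5), ("e9", 1.0)],
}
outgoing = {
    ("n10", "n9"): [("e9", 1.5)],
    ("n10", "n11"): [("e8", 1.5)],
    ("n10", "n14"): [("e12", 1.0)],
}


def rate_per_edge(flows):
    """Sum the rates of each subcomputation edge over a set of links."""
    totals = {}
    for labels in flows.values():
        for edge, rate in labels:
            totals[edge] = totals.get(edge, 0.0) + rate
    return totals


rate_in = rate_per_edge(incoming)
rate_out = rate_per_edge(outgoing)
total_in = sum(rate_in.values())    # classical conservation would require
total_out = sum(rate_out.values())  # total_in == total_out

# Flow of each incoming edge that is terminated (consumed) at n10.
terminated = {e: r - rate_out.get(e, 0.0) for e, r in rate_in.items()}

# Assumption for this sketch: e12 is the successor of both e7 and e9 and is
# produced only at n10, so its production rate equals its outgoing rate.
produced_e12 = rate_out["e12"]
print(total_in, total_out, terminated, produced_e12)
```

The totals differ (5 in, 4 out), but each predecessor of e12 has exactly as much flow terminated at n10 as e12 is produced there, and the pass-through edge e8 has none terminated.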
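In all valid mappings $M$, since the start node of a source edge $e_i \in \mathcal{S}$ and the end node of a sink edge $e_{i'} \in \mathcal{K}$ are always mapped to their respective physical nodes in the network, we let $\mathrm{Src}(e_i) \triangleq \alpha(M(e_i))$ and $\mathrm{Dst}(e_{i'}) \triangleq \beta(M(e_{i'}))$ be the physical source and end nodes of $e_i$ and $e_{i'}$ for simplicity. Then, based on the previous observations, we have the following result:

Lemma 1. When computation analytics are performed over an underlying networked system, the following generalized flow-conservation law holds:

$$\sum_{l \in \mathcal{O}(k)} x_l^{e_i} + y_k^{e_j} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} = \lambda, \quad \forall e_i \in \mathcal{S},\ e_j \in \Psi(e_i),\ n_k = \mathrm{Src}(e_i), \tag{1}$$

$$\sum_{l \in \mathcal{O}(k)} x_l^{e_i} + y_k^{e_j} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} = 0, \quad \forall e_i \in \mathcal{E} \setminus \mathcal{K},\ e_j \in \Psi(e_i),\ \text{and } n_k \ne \mathrm{Src}(e_i) \text{ if } e_i \in \mathcal{S}, \tag{2}$$

$$\sum_{l \in \mathcal{O}(k)} x_l^{e_i} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} = 0, \quad \forall e_i \in \mathcal{K},\ n_k \ne \mathrm{Dst}(e_i), \tag{3}$$

$$\sum_{l \in \mathcal{I}(k)} x_l^{e_i} + y_k^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} = \lambda, \quad \forall e_i \in \mathcal{K},\ n_k = \mathrm{Dst}(e_i), \tag{4}$$

where $\mathcal{O}(k)$ and $\mathcal{I}(k)$ represent the sets of outgoing and incoming links at node $n_k$, respectively; and $y_k^{e_i}$ represents the subcomputation $e_i$ generated at node $n_k$.

Here, the equalities in (1) and (2) represent that the total incoming flow rate of any non-sink edge $e_i$ at node $n_k$ should be equal to the sum of the outgoing flow rates of $e_i$ and its successor edge $e_j$, and this relationship holds for every successor edge of $e_i$. Moreover, Eq. (1) states that if $e_i$ is a source edge and $n_k$ happens to be its source node, then the net injection rate of $e_i$ at $n_k$ is $\lambda$. On the other hand, Eqs. (3) and (4) say that, for a sink edge $e_i$ that does not have any successor edge, the net output flow rate should be equal to $\lambda$ at its physical location and zero elsewhere.

3.4 Problem Formulation
We let $C_l$ denote the capacity of link $l$. Since the total network flow traversing a link cannot exceed the link's capacity, we have $\sum_{i=1}^{E} x_l^{e_i} \le C_l$, $l = 1, \ldots, L$. We let the network utility be a function of $\lambda$, denoted by $U(\lambda) : \mathbb{R}_+ \to \mathbb{R}$. We assume that $U(\lambda)$ is concave, monotonically increasing, and twice continuously differentiable. The concavity of $U(\cdot)$ represents the diminishing returns effect. When $U(\cdot)$ is linear (as a special case of concavity), we are simply maximizing the computation rate itself.

In the physical network, each outgoing edge at a node represents a subcomputation (see, e.g., edges $e_9$, $e_{10}$, and $e_{13}$ in Fig. 4), which incurs certain costs (e.g., consumed energy per unit amount of computation).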
We let $\rho_k$ represent the unit cost for performing computation at node $n_k$. Then, the total cost due to computation at node $n_k$ can be computed as:

$$\sum_{i=1}^{|\mathcal{S}|} \rho_k \Big( \lambda + y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \Big) + \sum_{i=|\mathcal{S}|+1}^{E-K} \rho_k \Big( y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \Big). \tag{5}$$

In (5), the term $\lambda + y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i}$ is the total flow rate of source edge $e_i$ that is terminated at node $n_k$ and used to compute its successor edges. Likewise, the term $y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i}$ represents the total flow rate of non-source and non-sink edge $e_i$ terminated at $n_k$.

Our objective is to maximize the net network utility, defined as network utility minus the total costs. Putting together the discussions earlier, we can formulate the problem as follows:

CRUM:
$$\begin{aligned}
\text{Maximize} \quad & U(\lambda) - \sum_{k=1}^{N} \Bigg[ \sum_{i=1}^{|\mathcal{S}|} \rho_k \Big( \lambda + y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \Big) + \sum_{i=|\mathcal{S}|+1}^{E-K} \rho_k \Big( y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \Big) \Bigg] \\
\text{subject to} \quad & \text{Eqs. (1), (2), (3), (4),} \\
& \sum_{i=1}^{E} x_l^{e_i} \le C_l, \quad l = 1, \ldots, L, \\
& x_l^{e_i} \ge 0,\ \forall i, l; \quad \lambda \ge 0.
\end{aligned}$$

Now, it is not difficult to recognize that with the proposed network flow model in Lemma 1, Problem CRUM is a convex program. Moreover, the nice separable structure of the objective function enables the design of a distributed algorithm to solve Problem CRUM, which will be the focus of the next section.

4. DISTRIBUTED SOLUTION PROCEDURE
In this section, we present the key steps in designing a distributed algorithm based on dual decomposition to solve Problem CRUM. In Section 4.1, we first reformulate Problem CRUM into its Lagrangian dual problem and show how to appropriately decompose the Lagrangian dual problem. Based on these results, we introduce the basic idea in designing a distributed algorithm in Section 4.2. In Section 4.3, we offer some interesting networking and economics interpretations of the proposed distributed algorithm. Then, we present some numerical results in Section 5.

4.1 Lagrangian Dual Reformulation and Decomposition
As mentioned earlier, since Problem CRUM is a convex program, it can be equivalently solved in its Lagrangian dual domain because of a zero duality gap [20].
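To make the zero-duality-gap statement concrete, the sketch below evaluates the dual function of a one-variable toy problem, maximize log(1 + lam) subject to lam <= C, and compares its minimum with the primal optimum. This toy problem is an assumption chosen for illustration and is far simpler than Problem CRUM; the capacity value, the grid of multipliers, and the function names are hypothetical.

```python
import math

C = 3.0  # hypothetical capacity: maximize log(1 + lam) subject to lam <= C


def primal_optimum():
    """log(1 + lam) is increasing, so the constraint is tight at lam = C."""
    return math.log(1.0 + C)


def dual_function(mu):
    """Phi(mu) = max over lam >= 0 of log(1 + lam) + mu * (C - lam)."""
    # Stationarity gives 1 / (1 + lam) = mu, projected onto lam >= 0.
    lam = max(1.0 / mu - 1.0, 0.0)
    return math.log(1.0 + lam) + mu * (C - lam)


# Minimize the dual function over a grid of positive multipliers.
grid = [i / 1000.0 for i in range(1, 2001)]
best_mu = min(grid, key=dual_function)
gap = dual_function(best_mu) - primal_optimum()
print(best_mu, gap)
```

Every grid point gives an upper bound on the primal optimum (weak duality), and the smallest such bound coincides with the primal optimum, which is what a zero duality gap means for a convex program of this kind.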
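To solve Problem CRUM in its Lagrangian dual domain, we first slightly modify the constraints in (1)-(4) into inequality constraints as follows:

$$\sum_{l \in \mathcal{O}(k)} x_l^{e_i} + y_k^{e_j} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} \ge \lambda, \quad \forall e_i \in \mathcal{S},\ e_j \in \Psi(e_i),\ n_k = \mathrm{Src}(e_i), \tag{6}$$

$$\sum_{l \in \mathcal{O}(k)} x_l^{e_i} + y_k^{e_j} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} \ge 0, \quad \forall e_i \in \mathcal{E} \setminus \mathcal{K},\ e_j \in \Psi(e_i),\ \text{and } n_k \ne \mathrm{Src}(e_i) \text{ if } e_i \in \mathcal{S}, \tag{7}$$

$$\sum_{l \in \mathcal{O}(k)} x_l^{e_i} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} \ge 0, \quad \forall e_i \in \mathcal{K},\ n_k \ne \mathrm{Dst}(e_i), \tag{8}$$

$$\sum_{l \in \mathcal{I}(k)} x_l^{e_i} + y_k^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \le \lambda, \quad \forall e_i \in \mathcal{K},\ n_k = \mathrm{Dst}(e_i). \tag{9}$$

It is not difficult to show that these modifications do not affect the solution at optimality. Interestingly, these modifications can also be interpreted from a network stability perspective: the total service rate at each node is no less than the total arrival rate.

Next, we associate dual variables $\mu_k^{ij} \ge 0$ and $w_k^{i} \ge 0$ with each constraint in (6)-(7) and (8)-(9), respectively. For notational simplicity, we use $\Psi_i$ to represent the index set $\{j : e_j \in \Psi(e_i)\}$. Also, we use vectors $\mathbf{x}$, $\boldsymbol{\mu}$, and $\mathbf{w}$ to group all $x$-, $\mu$-, and $w$-variables. By accommodating the constraints into the objective function, the Lagrangian can be written as follows:

$$\begin{aligned}
L(\lambda, \mathbf{x}, \boldsymbol{\mu}, \mathbf{w}) = {} & U(\lambda) - \sum_{k=1}^{N} \rho_k \Bigg[ \sum_{i=1}^{|\mathcal{S}|} \Big( \lambda + y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \Big) + \sum_{i=|\mathcal{S}|+1}^{E-K} \Big( y_k^{e_i} + \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - \sum_{l \in \mathcal{O}(k)} x_l^{e_i} \Big) \Bigg] \\
& + \sum_{i=1}^{E-K} \sum_{j \in \Psi_i} \sum_{k=1}^{N} \mu_k^{ij} \Big( \sum_{l \in \mathcal{O}(k)} x_l^{e_i} + y_k^{e_j} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} - \lambda \mathbb{1}_{\{e_i \in \mathcal{S},\, n_k = \mathrm{Src}(e_i)\}} \Big) \\
& + \sum_{i=E-K+1}^{E} \sum_{k=1}^{N} w_k^{i} \Big( \sum_{l \in \mathcal{O}(k)} x_l^{e_i} - \sum_{l \in \mathcal{I}(k)} x_l^{e_i} - y_k^{e_i} + \lambda \mathbb{1}_{\{n_k = \mathrm{Dst}(e_i)\}} \Big).
\end{aligned} \tag{10}$$

Then, the Lagrangian dual function can be written as:

$$\Phi(\boldsymbol{\mu}, \mathbf{w}) = \sup_{\lambda, \mathbf{x}} \Big\{ L(\lambda, \mathbf{x}, \boldsymbol{\mu}, \mathbf{w}) \ \Big|\ \sum_{i=1}^{E} x_l^{e_i} \le C_l,\ \forall l;\ x_l^{e_i} \ge 0,\ \forall i, l;\ \lambda \ge 0 \Big\}. \tag{11}$$

Finally, the dual problem can be written as:

$$\text{D:} \quad \text{Minimize } \Phi(\boldsymbol{\mu}, \mathbf{w}) \quad \text{subject to } \boldsymbol{\mu} \ge \mathbf{0},\ \mathbf{w} \ge \mathbf{0}. \tag{12}$$

Next, it is important to recognize that the Lagrangian dual function in (11) possesses a decomposable structure, which leads to a distributed computation scheme. More specifically, by appropriately switching summation orders and rearranging terms in (11), we have the following result (see Appendix A for proof details):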

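Proposition 2. The Lagrangian dual function in (11) can be decomposed in a source-wise and link-wise fashion as follows:

$$\Phi(\boldsymbol{\mu}, \mathbf{w}) = \Phi_F(\boldsymbol{\mu}, \mathbf{w}) + \sum_{l=1}^{L} \Phi_l(\boldsymbol{\mu}, \mathbf{w}) + \sum_{k=1}^{N} \Phi_{n_k}(\boldsymbol{\mu}, \mathbf{w}),$$

where $\Phi_F(\boldsymbol{\mu}, \mathbf{w})$, $\Phi_l(\boldsymbol{\mu}, \mathbf{w})$, and $\Phi_{n_k}(\boldsymbol{\mu}, \mathbf{w})$ represent the flow control, per-link routing, and per-node computation subproblems at each source, each link, and each node, respectively. Here, $\Phi_F(\boldsymbol{\mu}, \mathbf{w})$ and $\Phi_l(\boldsymbol{\mu}, \mathbf{w})$ are defined as follows, respectively:

$$\Phi_F(\boldsymbol{\mu}, \mathbf{w}) \triangleq \max_{\lambda \ge 0} \Bigg\{ U(\lambda) - \lambda \Bigg[ \sum_{i=1}^{|\mathcal{S}|} \Big( \sum_{j \in \Psi_i} \mu_{\mathrm{Src}(i)}^{ij} + \sum_{k=1}^{N} \rho_k \Big) - \sum_{i=E-K+1}^{E} w_{\mathrm{Dst}(i)}^{i} \Bigg] \Bigg\};$$

$$\Phi_l(\boldsymbol{\mu}, \mathbf{w}) \triangleq \max \Bigg\{ \sum_{i=1}^{E} \big( \tilde{w}_{\mathrm{Tx}(l)}^{i} - \tilde{w}_{\mathrm{Rx}(l)}^{i} \big) x_l^{e_i} \Bigg\} \quad \text{s.t. } \sum_{i=1}^{E} x_l^{e_i} \le C_l,\ x_l^{e_i} \ge 0,\ \forall i,$$

where $\mathrm{Tx}(l)$ and $\mathrm{Rx}(l)$ denote the transmitting and receiving nodes of link $l$, and $\tilde{w}_k^{i}$, $k = 1, \ldots, N$, $i = 1, \ldots, E$, are defined as:

$$\tilde{w}_k^{i} \triangleq \begin{cases} \rho_k + \sum_{j \in \Psi_i} \mu_k^{ij}, & i = 1, \ldots, E-K, \\ w_k^{i}, & i = E-K+1, \ldots, E. \end{cases} \tag{13}$$

The per-node computation subproblem is

$$\Phi_{n_k}(\boldsymbol{\mu}, \mathbf{w}) \triangleq \max_{y_k^{e_i} \ge 0,\ \forall i} \Bigg\{ \sum_{i=1}^{E} \Big[ \sum_{j \in \Psi_i} \mu_k^{ij} \big( y_k^{e_j} - y_k^{e_i} \big) - \hat{w}_k^{i} y_k^{e_i} \Big] \Bigg\},$$

where $\hat{w}_k^{i}$, $k = 1, \ldots, N$, $i = 1, \ldots, E$, are defined as:

$$\hat{w}_k^{i} \triangleq \begin{cases} \rho_k, & i = 1, \ldots, E-K, \\ w_k^{i}, & i = E-K+1, \ldots, E. \end{cases} \tag{14}$$

Based on Proposition 2, the Lagrangian dual problem D in (12) can be transformed into the following master dual problem:

$$\text{MD:} \quad \text{Minimize } \Phi_F(\boldsymbol{\mu}, \mathbf{w}) + \sum_{l=1}^{L} \Phi_l(\boldsymbol{\mu}, \mathbf{w}) + \sum_{k=1}^{N} \Phi_{n_k}(\boldsymbol{\mu}, \mathbf{w}) \quad \text{subject to } \boldsymbol{\mu} \ge \mathbf{0},\ \mathbf{w} \ge \mathbf{0}.$$

Then, the task of solving the Lagrangian dual problem D boils down to distributedly solving the subproblems $\Phi_F(\boldsymbol{\mu}, \mathbf{w})$, $\Phi_l(\boldsymbol{\mu}, \mathbf{w})$, and $\Phi_{n_k}(\boldsymbol{\mu}, \mathbf{w})$, and then the master dual problem MD.

4.2 Design of Distributed Algorithm
Note that at each source node, the flow control subproblem $\Phi_F$ has a concave objective function with a single non-negative decision variable. Thus, $\Phi_F$ can be trivially and efficiently solved (e.g., by a simple bisection search method). Also, it can be observed that the routing subproblem $\Phi_l$ at each link is a low-dimensional linear programming problem with a single constraint and $O(EV^2)$ variables, and all problem coefficients are locally available, which means that it can be efficiently solved as well.

Algorithm 1: A subgradient algorithm for solving Problem CRUM.
Initialization:
1. Choose initial starting points $\boldsymbol{\mu}(0)$ and $\mathbf{w}(0)$. Let $m = 0$.
Main Iteration:
2. Compute the source computation rate $\lambda(m)$ by solving the flow control subproblem $\Phi_F(\boldsymbol{\mu}(m), \mathbf{w}(m))$ by using, e.g., the bisection method.
3. Compute the routing decisions $x_l^{e_i}(m)$ at each link by solving the routing subproblem $\Phi_l(\boldsymbol{\mu}(m), \mathbf{w}(m))$, a linear programming problem.
4. Choose an appropriate step size $\pi_m$. Compute the subgradients $d_{\mu,k}^{ij}(m)$ and $d_{w,k}^{i}(m)$ using (17) and (18) with $\lambda(m)$ and $x_l^{e_i}(m)$.
5. Update dual variables $\mu_k^{ij}(m+1)$ and $w_k^{i}(m+1)$ with $d_{\mu,k}^{ij}(m)$ and $d_{w,k}^{i}(m)$.
6. If $\|\boldsymbol{\mu}(m+1) - \boldsymbol{\mu}(m)\| < \epsilon$ and $\|\mathbf{w}(m+1) - \mathbf{w}(m)\| < \epsilon$, then return $\lambda(m)$ and $x_l^{e_i}(m)$. Otherwise, let $m \leftarrow m + 1$ and go to Step 2.

For the master dual problem MD, since its objective function is piece-wise differentiable, one can apply the subgradient method [20]. More specifically, starting with an initial $\boldsymbol{\mu}(0)$ and $\mathbf{w}(0)$ and after evaluating $\Phi_F(\boldsymbol{\mu}(m), \mathbf{w}(m))$ and $\Phi_l(\boldsymbol{\mu}(m), \mathbf{w}(m))$ in the $m$-th iteration, we update the dual variables as follows:

$$\mu_k^{ij}(m+1) = \max\{ \mu_k^{ij}(m) - \pi_m d_{\mu,k}^{ij}(m),\ 0 \}, \tag{15}$$

$$w_k^{i}(m+1) = \max\{ w_k^{i}(m) - \pi_m d_{w,k}^{i}(m),\ 0 \}, \tag{16}$$

where $\pi_m > 0$ denotes a positive scalar step size, and $d_{\mu,k}^{ij}(m)$ and $d_{w,k}^{i}(m)$ represent the subgradients with respect to the dual variables $\mu_k^{ij}$ and $w_k^{i}$ in the $m$-th iteration, respectively. It is known that for the subgradient iterative schemes in (15) and (16) to converge, a sufficient condition is that the step size $\pi_m$ satisfies $\pi_m \to 0$ as $m \to \infty$, $\sum_{m=0}^{\infty} \pi_m = \infty$, and $\sum_{m=0}^{\infty} \pi_m^2 < \infty$ [20]. A popular step size selection strategy is the divergent harmonic series: $\pi_m = \beta / m$, $m = 1, 2, \ldots$, where $\beta$ is some given system parameter. For the master dual problem MD, the subgradients for the Lagrangian dual function in (11) can be computed as:

$$d_{\mu,k}^{ij}(m) = \sum_{l \in \mathcal{O}(k)} x_l^{e_i}(m) + y_k^{e_j}(m) - \sum_{l \in \mathcal{I}(k)} x_l^{e_i}(m) - y_k^{e_i}(m) - \lambda(m) \mathbb{1}_{\{e_i \in \mathcal{S},\, n_k = \mathrm{Src}(e_i)\}}, \tag{17}$$

$$d_{w,k}^{i}(m) = \sum_{l \in \mathcal{O}(k)} x_l^{e_i}(m) - \sum_{l \in \mathcal{I}(k)} x_l^{e_i}(m) - y_k^{e_i}(m) + \lambda(m) \mathbb{1}_{\{e_i \in \mathcal{K},\, n_k = \mathrm{Dst}(e_i)\}}, \tag{18}$$

where $\mathbb{1}_{\{\cdot\}}$ represents the indicator function, which equals 1 if the condition in $\{\cdot\}$ is satisfied and 0 otherwise. To summarize, the design of the distributed subgradient algorithm for solving Problem CRUM is illustrated in Algorithm 1.

4.3 Networking and Economics Interpretations
Several interesting networking and economics insights for the subgradient-based distributed algorithm are in order.

Queue length interpretations of the dual variables: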
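Before turning to the interpretation, the iteration pattern of Algorithm 1 (solve the flow control subproblem by bisection, form the subgradient from the constraint slack, and take a projected step with a harmonic step size) can be made concrete on a scalar toy problem. The sketch below is an illustration under stated assumptions, not the algorithm for Problem CRUM itself: it has one rate, one price, one constraint lam <= C, and the utility log(1 + lam); the constants `C`, `LAM_MAX`, `BETA` and both function names are hypothetical.

```python
C = 3.0         # hypothetical link capacity for a one-link toy problem
LAM_MAX = 10.0  # hypothetical upper bound on the rate used by the bisection
BETA = 0.1      # hypothetical step-size parameter, pi_m = BETA / m


def solve_flow_control(price, tol=1e-10):
    """Maximize log(1 + lam) - price * lam over [0, LAM_MAX] by bisection
    on the (decreasing) derivative 1 / (1 + lam) - price."""
    def slope(lam):
        return 1.0 / (1.0 + lam) - price

    if slope(0.0) <= 0.0:
        return 0.0
    if slope(LAM_MAX) >= 0.0:
        return LAM_MAX
    lo, hi = 0.0, LAM_MAX
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dual_subgradient(mu0=1.0, iterations=2000):
    """Projected subgradient iteration on the price mu of the toy problem."""
    mu = mu0
    lam = solve_flow_control(mu)
    for m in range(1, iterations + 1):
        lam = solve_flow_control(mu)   # primal step at the current price
        d = C - lam                    # subgradient: slack of lam <= C
        step = BETA / m                # harmonic step size
        mu = max(mu - step * d, 0.0)   # projected update, same form as (15)
    return lam, mu


lam, mu = dual_subgradient()
print(lam, mu)
```

With these choices the price decreases from its starting value toward 1 / (1 + C), where the rate chosen by the flow control step just fills the capacity. The sign convention matches (15): a negative slack (overload) raises the price, and a positive slack lowers it.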

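It can be seen that by dividing both sides of (15) and (16) by the step size $\pi_m$ and letting $Q_{ij}^{(n)}(m) \triangleq \mu_{ij}^{(n)}(m)/\pi_m$ and $Q_i^{(n)}(m) \triangleq w_i^{(n)}(m)/\pi_m$, we have:
$$Q_{ij}^{(n)}(m+1) = \max\Big\{ Q_{ij}^{(n)}(m) - y_{e_j}^{(n)} + y_{e_i}^{(n)} + \sum_{l \in \mathcal{I}(n)} x_{e_i}^{(l)} - \sum_{l \in \mathcal{O}(n)} x_{e_i}^{(l)} + \lambda\, \mathbf{1}\{e_i \in \mathcal{S},\ n = \mathrm{Src}(e_i)\},\ 0 \Big\}, \tag{19}$$
$$Q_i^{(n)}(m+1) = \max\Big\{ Q_i^{(n)}(m) - \lambda\, \mathbf{1}\{e_i \in \mathcal{S},\ n = \mathrm{Dst}(e_i)\} + \sum_{l \in \mathcal{I}(n)} x_{e_i}^{(l)} - \sum_{l \in \mathcal{O}(n)} x_{e_i}^{(l)} + y_{e_i}^{(n)},\ 0 \Big\}. \tag{20}$$
A closer look reveals that (19) and (20) behave exactly the same as the queue length evolution seen in traditional data communications networks. Indeed, if we let $Q_{ij}^{(n)}(m)$ represent the queuing backlog for computing the successive edge $e_j$ of edge $e_i$ at node $n$ in time instant $m$, then this queuing backlog and the dual variable $\mu_{ij}^{(n)}(m)$ are intimately related (they differ only by a scaling factor $\pi_m$). By the same token, $Q_i^{(n)}(m)$ can be similarly interpreted as the queuing backlog of the terminating edges $e_i \in \mathcal{K}$ at node $n$ in time instant $m$.

Connections to the back-pressure algorithm: Clearly, in cases where in-network computation is disabled, i.e., by forcing $x_{e_j} = 0$ for all $e_j \in \Psi(e_i)$ and for all $n$, the network degenerates into a traditional communication network. As expected, it is easy to check that the queue length evolutions in (19) and (20) reduce to the conventional queue length evolution:
$$Q_i^{(n)}(m+1) = \max\Big\{ Q_i^{(n)}(m) - \sum_{l \in \mathcal{O}(n)} x_{e_i}^{(l)} + \sum_{l \in \mathcal{I}(n)} x_{e_i}^{(l)} + \lambda \big( \mathbf{1}\{e_i \in \mathcal{S},\ n = \mathrm{Src}(e_i)\} - \mathbf{1}\{e_i \in \mathcal{S},\ n = \mathrm{Dst}(e_i)\} \big),\ 0 \Big\}. \tag{21}$$
Moreover, since all dual $\mu$-variables become 0, it is easy to verify that the computation subproblem $\Phi_n$ admits a trivial optimal solution: $y_{e_i}^{(n)} = 0$, $\forall n, i$. Also, with the dual $\mu$-variables being 0, it is not difficult to see that the routing subproblem $\Phi_l$ admits a trivial solution as follows: for a given link $l$, pick a subcomputation task $e_i$ that has the largest $\big(w_i^{(\mathrm{Tx}(l))} - w_i^{(\mathrm{Rx}(l))}\big)$-value, say $e_{i^*}$, and let $e_{i^*}$ use up the link capacity. This is exactly the same strategy used in the celebrated back-pressure algorithm that was first discovered in [21].

Pricing interpretation of the dual subgradient updating scheme: The dual variables $\mu_{ij}^{(n)}(m)$ and $w_i^{(n)}(m)$ can also be economically interpreted as congestion prices at node $n$ during the $m$-th iteration. The dual updating scheme of the subgradient algorithm can then be viewed as a pricing scheme. When node $n$ becomes increasingly congested, then $d_{ij}^{\mu,n} < 0$.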
From (15), we can see that the price of node $n$ will be increased. On the other hand, when node $n$ becomes less congested, then $d_{ij}^{\mu,n} > 0$. Again, from (16), it can be seen that the price will be decreased.

[Figure 5: Outgoing links and self-loop flow rates at each node for edge $e_{12}$. Horizontal axis: Node ID; vertical axis: Flow Rate; series: North Out, South Out, East Out, West Out, Self Comp.]

[Figure 6: Outgoing links and self-loop flow rates at each node for edge $e_{13}$. Horizontal axis: Node ID; vertical axis: Flow Rate; series: North Out, South Out, East Out, West Out, Self Comp.]

5. NUMERICAL RESULTS

In this section, we use Fig. 1 as an example to illustrate our proposed network flow model and distributed solution procedure. That is, we want to optimize the deployment of the MIMO-DAG computation framework in Fig. 1a onto the 16-node 2-D torus interconnected network in Fig. 1b. The physical locations of both $\Theta_1$ and $\Theta_2$ are at node $n_{16}$. Here, we let the capacity of each link in Fig. 1b be 10 and let the per-node unit computation cost be 0.001. Our objective is to maximize the computation rate, i.e., letting $U(\lambda) = \lambda$. After optimization, the maximum computation rate is 15.95. Due to space limitation and the large number of optimal link flow rate variables for this example ($5 \times 16 \times 13 = 1040$ variables), we only plot in Figs. 5 and 6 the outgoing link and self-loop flow rates for edges $e_{12}$ and $e_{13}$ to illustrate part of the optimal solution. The North Out, South Out, East Out, and West Out labels in Figs. 5 and 6 represent the outgoing links at each node in Fig. 1b along the specified directions, respectively. Recall from Fig. 1a that edges $e_{12}$ and $e_{13}$ are sink edges, which correspond to the final results of functions $\Theta_1$ and $\Theta_2$, respectively. Surprisingly, it can be seen that the computations of these sink edges are not deployed close to the physical output node $n_{16}$. The majority of the computations of $e_{12}$ and $e_{13}$ are done at node 2 (see the rates of the self-loops at node 2 and node 1).
This shows that the heuristic proximity deployment rule may not be optimal. From Figs. 5 and 6, it is also not difficult to see the optimal routing paths for edges $e_{12}$ and

$e_{13}$. For example, the optimal routing paths for edge $e_{13}$ are:
$n_1 \xrightarrow{W} n_4 \xrightarrow{N} n_{16}$,
$n_2 \xrightarrow{E} n_3 \xrightarrow{E} n_4 \xrightarrow{N} n_{16}$,
$n_2 \xrightarrow{S} n_6 \xrightarrow{E} n_7 \xrightarrow{E} n_8 \xrightarrow{S} n_{12} \xrightarrow{S} n_{16}$,
$n_2 \xrightarrow{W} n_1 \xrightarrow{W} n_4 \xrightarrow{N} n_{16}$,
$n_2 \xrightarrow{N} n_{14} \xrightarrow{E} n_{15} \xrightarrow{E} n_{16}$,
$n_{12} \xrightarrow{S} n_{16}$,
where the letter above each arrow denotes the routing direction at that hop.

Next, we illustrate in Figs. 7 and 8 the changes of the maximum end-to-end computation rate with respect to the changes of the per-node unit computation cost and the link capacity, respectively. In Fig. 7, we can see, as expected, that the end-to-end computation rate decreases as the unit computation cost at each node increases from 1 to 10 units (the step size is 0.001). Likewise, we can see from Fig. 8 that the end-to-end computation rate increases as the link capacity increases from 10 to 100.

[Figure 7: The change of maximum end-to-end computation rate with respect to the change of per-node unit computation cost. Horizontal axis: Normalized Nodal Computation Cost; vertical axis: Normalized Computation Rate.]

[Figure 8: The change of maximum end-to-end computation rate with respect to the change of link capacity. Horizontal axis: Normalized Link Capacity; vertical axis: Normalized Computation Rate.]

6. CONCLUSION

In this paper, we investigated the design of distributed algorithms for cloud computing programming frameworks deployments. We formulated the computation rate utility maximization problem (CRUM) by developing a new network flow model with a generalized flow-conservation law. Based on this enabling framework, we developed a dual decomposition based distributed algorithm to solve Problem CRUM. We provided important networking interpretations and key implementation insights for our proposed algorithm and pointed out its connections to and distinctions from distributed algorithm design in traditional data communications networks. Collectively, these results serve as the first building block of a new theoretical framework for the deployment of cloud computing programming frameworks.

7. REFERENCES

[1] A. S.
Szalay, Extreme data-intensive scientific computing, IEEE Computing in Science and Engineering, vol. 13, no. 6, pp. 34-41, Nov. 2011.
[2] Data, data everywhere, Economist, 2010. [Online]. Available: http://www.economist.com
[3] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, in Proc. USENIX OSDI, San Francisco, CA, Dec. 6-8, 2004, pp. 137-149.
[4] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed data-parallel programs from sequential building blocks, in Proc. ACM SIGOPS/EuroSys, Lisboa, Portugal, Mar. 21-23, 2007, pp. 59-72.
[5] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, Parallel data processing with MapReduce: A survey, ACM SIGMOD Record, vol. 40, no. 4, pp. 11-20, Dec. 2011.
[6] Y. Huai, R. Lee, S. Zhang, C. H. Xia, and X. Zhang, DOT: A matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems, in Proc. ACM SOCC, Cascais, Portugal, Oct. 27-28, 2011.
[7] Hadoop. http://hadoop.apache.org.
[8] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle, Layering as optimization decomposition: A mathematical theory of network architecture, Proc. IEEE, vol. 95, no. 1, pp. 255-312, Jan. 2007.
[9] F. P. Kelly, A. K. Maulloo, and D. K. H. Tan, Rate control in communications networks: Shadow prices, proportional fairness and stability, Journal of the Operational Research Society, vol. 49, pp. 237-252, 1998.
[10] X. Lin, N. B. Shroff, and R. Srikant, A tutorial on cross-layer optimization in wireless networks, IEEE J. Sel. Areas Commun., vol. 24, no. 8, pp. 1452-1463, Aug. 2006.
[11] V. Shah, B. K. Dey, and D. Manjunath, Network flows for functions, in Proc. IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia, Jul. 31-Aug. 5, 2011, pp. 234-238.
[12] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali, Linear Programming and Network Flows, 4th ed. New York: John Wiley & Sons Inc., 2010.
[13] H. Feng, Z. Liu, C. Xia, and L.
Zhang, Load shedding and distributed resource control of stream processing networks, Performance Evaluation, vol. 64, no. 9-12, pp. 1102-1120, Oct. 2007.
[14] H. Zhao, C. H. Xia, Z. Liu, and D. Towsley, A unified modeling framework for distributed resource allocation of general fork and join processing networks, in Proc. ACM Sigmetrics, New York, NY, Jun. 14-18, 2010.
[15] Z. Liu, A. Tang, C. H. Xia, and L. Zhang, A decentralized control mechanism for stream processing

networks, Annals of Operations Research, vol. 170, no. 1, pp. 161-182, Sep. 2009.
[16] F. T. Leighton, M. J. Newman, A. G. Ranade, and E. J. Schwabe, Dynamic tree embeddings in butterflies and hypercubes, SIAM Journal of Computing, vol. 21, no. 4, pp. 639-654, Aug. 1992.
[17] O. Wohlmuth and F. Mayer-Lindenberg, A method for the embedding of arbitrary trees into hypercubes, in Proc. ACM Symposium on Applied Computing, 1998, pp. 569-574.
[18] V. Heun and E. W. Mayr, Efficient dynamic embeddings of arbitrary binary trees into hypercubes, Journal of Algorithms, vol. 43, pp. 51-84, 2002.
[19] A. W. Malik, A. Park, and R. M. Fujimoto, Optimistic synchronization of parallel simulations in cloud computing environments, in Proc. IEEE International Conference on Cloud Computing, Bangalore, India, Sep. 21-25, 2009, pp. 49-56.
[20] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, 3rd ed. New York, NY: John Wiley & Sons Inc., 2006.
[21] L. Tassiulas and A. Ephremides, Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks, IEEE Trans. Autom. Control, vol. 37, no. 12, pp. 1936-1948, Dec. 1992.

APPENDIX

A. PROOF OF PROPOSITION 2

To derive the subproblem $\Phi_F(\mu, w)$, we combine all terms in (10) related to $\lambda$, which leads to:
[ S Uλ λ j Ψ i N Srci + ρ =1 E i=+1 ] Dsti.
Note that the above expression is exactly the objective function of $\Phi_F$ in Proposition 2. Next, we derive the objective function of $\Phi_l$, which is relatively more involved. First, it is not difficult to verify that by switching the summation order from node-based to link-based, we can rewrite the fourth term in (10) as:
E N i= 1 =1 = L E =1 i=+1 O Tx wi x I N E ye i. =1 i=+1 (22)
By the same token, we can immediately rewrite the partial second term (excluding the summation involving $\lambda$) in (10) as:
= L =1 N =1 ρ O I ρ Tx ρ x N N =1 i= S +1 =1 i= S +1 ρ y e i ρ y e i.
(23)
Now, for the more complex third term in (10), we have
E N j Ψ i =1 a = = b = E =1 E =1 + N N E =1 L =1 =1 O O I + y e j j Ψ i O N E j Ψ i I y e j Tx x j Ψ i j Ψ i N E + =1 j Ψ i I j Ψ i j Ψ i x e j, (24)
where (a) holds because the summations of the $x_{e_i}$-variables do not involve index $j$ and can be taken outside of the summation with respect to index $j$; and (b) follows from the same token as in (22) and (23) and the fact that switching the summation orders of the $x_{e_j}$-variables does not change their sum value. Next, by adding (22), (23), and (24) together and defining new w- and ŵ-variables as in (13) and (14), we arrive at a summation of the following two terms:
} L E Tx wi x, (25)
N =1 [ E j Ψ i y e j ŵ i ye i ], (26)
which are exactly the objective functions of $\Phi_l$ and $\Phi_n$. Note that the outer summation in (25) is with respect to link indices, so each summand can be decomposed and computed at each link locally. Likewise, the outer summation in (26) is with respect to node indices, so each summand can be decomposed and computed at each node locally. Finally, noting that the constraints E xe i,, can also be separated in a link-wise fashion, we then have that the maximization of $L(\lambda, x, \mu, w)$ can be equivalently computed as the sum of a series of maximization subproblems at the source and each link, which are defined as in Proposition 2. This completes the proof.
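As an illustration of how the decomposed subproblems and the dual update fit together, the following is a minimal sketch of the dual subgradient iteration in the special case where in-network computation is disabled, so that only the flow control subproblem, the per-link routing subproblem, and the price update remain. It is not the authors' implementation. The three-node topology, the link capacities, the utility $U(\lambda) = \log(1 + \lambda)$, the cap `lam_max`, and all function and variable names are assumptions made purely for illustration. With a single commodity, the back-pressure rule reduces to sending at full link capacity whenever the price differential across the link is positive.

```python
def solve_flow_control(price, lam_max=10.0, tol=1e-9):
    """Maximize log(1 + lam) - price * lam over [0, lam_max] by bisection.

    The derivative g(lam) = 1 / (1 + lam) - price is decreasing in lam,
    so we bisect on its sign. (Utility and lam_max are assumed here.)
    """
    if 1.0 - price <= 0.0:                      # g(0) <= 0: optimum at lam = 0
        return 0.0
    if 1.0 / (1.0 + lam_max) - price >= 0.0:    # g(lam_max) >= 0: optimum at cap
        return lam_max
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 1.0 / (1.0 + mid) - price > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def run_dual_subgradient(num_iters=5000, beta=0.5):
    """Toy dual subgradient loop with step size beta / m (hypothetical setup)."""
    # Assumed topology: source node 0, destination node 2.
    links = {(0, 1): 2.0, (1, 2): 1.0, (0, 2): 1.0}   # (tx, rx) -> capacity
    src, dst = 0, 2
    # Node prices (dual variables); the destination price is pinned at 0.
    w = {0: 1.0, 1: 1.0, dst: 0.0}
    lam, x = 0.0, {}
    for m in range(1, num_iters + 1):
        step = beta / m                                # diminishing step size
        # Flow control subproblem at the source.
        lam = solve_flow_control(w[src])
        # Per-link routing subproblem: use the whole capacity if and only if
        # the price differential across the link is positive.
        x = {(a, b): (cap if w[a] - w[b] > 0.0 else 0.0)
             for (a, b), cap in links.items()}
        # Projected subgradient update of the prices at non-destination nodes.
        for n in (0, 1):
            out_flow = sum(r for (a, _), r in x.items() if a == n)
            in_flow = sum(r for (_, b), r in x.items() if b == n)
            d = out_flow - in_flow - (lam if n == src else 0.0)
            w[n] = max(w[n] - step * d, 0.0)
    return w, lam, x


if __name__ == "__main__":
    prices, rate, routing = run_dual_subgradient()
    print("prices:", prices)
    print("source rate:", rate)
    print("last routing decision:", routing)
```

In this toy instance the maximum flow from node 0 to node 2 is 2, so the source price should settle near $U'(2) = 1/3$ and the source rate near 2. The per-iteration link rates keep switching between 0 and full capacity, which is typical of subgradient methods applied to linear subproblems; averaging the primal iterates over time is a common remedy.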