ROUTING ALGORITHM BASED COST MINIMIZATION FOR BIG DATA PROCESSING

D. Vinotha, PG Scholar, Department of CSE, RVS Technical Campus, vinothacse54@gmail.com
Dr. Y. Baby Kalpana, Head of the Department, Department of CSE, RVS Technical Campus, ybkalps@gmail.com

ABSTRACT: The information explosion is the rapid increase in the amount of published information or data and the effects of this abundance. As the amount of available data grows, managing the information becomes more difficult, which can lead to information overload. Therefore, it is imperative to study the cost minimization problem for big data processing in geo-distributed data centres. Cost-efficient big data processing is hindered by the following weaknesses. First, data locality may result in a waste of resources. Second, the links in networks vary in transmission rate and cost according to their unique features, such as the distances and physical optical fiber facilities between data centers. To overcome these weaknesses, the cost minimization problem for big data processing via joint optimization of task assignment, data placement, and routing in geo-distributed data centers is studied. Finally, a comparison is made and the changes and improvements are studied.

Keywords: Big data, Data centre resizing, Routing algorithm, Data centres, Markov chain process

1. INTRODUCTION

The explosive growth of demands on big data processing imposes a heavy burden on computation, storage, and communication in data centers, which incurs considerable operational expenditure for data center providers. Therefore, cost minimization has become an emergent issue for the upcoming big data era. Unlike conventional cloud services, one of the main features of big data services is the tight coupling between data and computation, as computation tasks can be conducted only when the corresponding data is available.
As a result, three factors, i.e., task assignment, data placement and data movement, deeply influence the operational expenditure of data centers. In this paper, we are motivated to study the cost minimization problem via a joint optimization of these three factors for big data services in geo-distributed data centers. Big data analysis has shown its great potential in unearthing valuable insights from data to improve decision making, minimize risk and develop new products and services. By 2015, 71% of worldwide data center hardware spending will come from big data processing, which will surpass $126.2 billion. The study of the cost minimization problem via a joint optimization of the three factors that deeply influence the operational expenditure of data centers (task assignment, data placement and data movement) has been introduced for big data services in geo-distributed data centers. To describe the task completion time with consideration of both data transmission and computation, a two-dimensional Markov chain has been proposed and the average task completion time derived in closed form. Furthermore, the problem has been modelled as a Mixed-Integer Non-Linear Programming (MINLP) problem and an efficient solution to linearize it has been proposed. The high efficiency of that proposal is validated by extensive simulation-based studies [5].

2 RELATED WORKS

2.1 Multi-level Power Management

The coordination problem among individual power management approaches has been studied, with two key contributions. First, a power management solution that coordinates different individual approaches has been proposed and validated.
Using simulations based on 180 server traces from nine different real-world enterprises, the correctness, stability, and efficiency advantages of the solution are demonstrated. Second, using the unified architecture as the base, a detailed quantitative sensitivity analysis is performed and conclusions are drawn about the impact of different architectures, implementations, workloads, and system design choices. The main advantage is this detailed sensitivity analysis, which evaluates several interesting variations in the architecture and implementation, and in the mechanisms and policies space. Power delivery, electricity consumption, and heat management are becoming key challenges in data center environments. The demerit is that the individual solutions to these problems operate without coordination between them [9].

2.2 Poisson Model

Predicting the next request of a user as she visits Web pages has gained importance as Web-based activity increases. There are a number of different approaches to prediction. This work concentrates on the discovery and modeling of the user's aggregate interest in a session. The approach relies on the premise that the visiting time of a page is an indicator of the user's interest in that page. Even the same person may have different desires at different times. The model has an advantage over previous proposals in terms of speed and memory usage. The experiments show that the model can be used on Web sites with different structures. To confirm this finding, the model is compared to two previously proposed recommendation models. Results show that this model improves the efficiency significantly. If the representation is not appropriate for the model, the prediction accuracy will decrease [2].

2.3 Geographical Load Balancing

Whether geographical load balancing can encourage the use of green renewable energy and reduce the use of brown fossil fuel energy has been explored. The work makes two contributions. First, two distributed algorithms for achieving optimal geographical load balancing are derived. Second, it is shown that if electricity is dynamically priced in proportion to the instantaneous fraction of the total energy that is brown, then geographical load balancing provides significant reductions in brown energy use. Geographical load balancing provides a huge opportunity for environmental benefit as the penetration of green, renewable energy sources increases. Specifically, an enormous challenge facing the electric grid is that of incorporating intermittent, unpredictable renewable sources such as wind and solar. Geographical load balancing aims to reduce energy costs, but this can come at the expense of increased total energy usage. By routing to a data center farther from the request source to use cheaper energy, the data center may need to complete the job faster, and so use more service capacity, and thus more energy, than if the request were served closer to the source [6].
2.4 Cost Minimization

Data centre resizing (DCR) has been proposed to reduce the computation cost by adjusting the number of activated servers via task placement. To describe the rate-constrained computation and transmission in big data processing, a two-dimensional Markov chain has been proposed and the expected task completion time derived in closed form. To deal with the high computational complexity of solving the MINLP, it is linearized into a mixed-integer linear programming (MILP) problem, which can be solved using a commercial solver. DCR and task placement are usually jointly considered to match the computing requirement [5]. Table 1 summarizes the various references.

3 SYSTEM MODEL

Based on the study of data placement, task assignment, data center resizing and routing, the overall operational cost in large-scale geo-distributed data centers for big data applications is minimized. We first characterize the data processing procedure using a two-dimensional Markov chain and derive the expected completion time in closed form, based on which the joint optimization is formulated as an MINLP problem. To tackle the high computational complexity of solving the MINLP, we linearize it into an MILP problem. Extensive experiments show that the joint-optimization solution has a substantial advantage over two-step separate optimization. The K shortest path algorithm is used to find the minimum-cost paths for routing.

3.1 Big data and Data Flow

Collecting the dataset for big data is the first task. The whole system can be modelled as a directed graph G = (N, E). Nodes receive data flows from source nodes and forward them according to the routing strategy. The weight of each link w(u, v), representing the corresponding communication cost, is defined in terms of C_R and C_L, the inter-data-centre traffic cost and the local transmission cost respectively, such that C_R > C_L.
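The graph model of Section 3.1 can be illustrated with a small sketch. Because the weight equation itself is not reproduced here, the rule used below is an assumption: a link between nodes in different data centres costs C_R, and a link inside one data centre costs C_L. The numeric values, the node names, and the helper names `link_weight` and `build_graph` are hypothetical.

```python
# Assumed example costs; the only property taken from the text is C_R > C_L.
C_R = 4.0  # inter-data-centre traffic cost (assumed value)
C_L = 1.0  # local transmission cost (assumed value)


def link_weight(u, v, data_centre_of):
    """Return w(u, v) under the assumed rule: C_L inside a data centre, C_R across."""
    return C_L if data_centre_of[u] == data_centre_of[v] else C_R


def build_graph(edges, data_centre_of):
    """Build a directed, weighted adjacency list for G = (N, E)."""
    graph = {node: [] for node in data_centre_of}
    for u, v in edges:
        graph[u].append((v, link_weight(u, v, data_centre_of)))
    return graph


# Hypothetical topology: two servers in dc1, one server in dc2.
data_centre_of = {"s1": "dc1", "s2": "dc1", "s3": "dc2"}
edges = [("s1", "s2"), ("s2", "s3")]
graph = build_graph(edges, data_centre_of)
print(graph)
```

An adjacency list keeps the weights next to each outgoing link, which is the form a shortest-path routine typically consumes.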

3.2 Data placement

We define a binary variable y_{jk} to denote whether chunk k is placed on server j:

y_{jk} = 1 if chunk k is placed on server j, and y_{jk} = 0 otherwise.

In the distributed file system, we maintain P copies of each chunk k \in K, which leads to the following constraint:

\sum_{j \in J} y_{jk} = P, \forall k \in K.

Furthermore, the data stored in each server j \in J cannot exceed its storage capacity. The data placement and task assignment are transparent to the data users with guaranteed QoS. Each data chunk k on server j has a processing rate and a loading rate. The processing procedure can then be described by a two-dimensional Markov chain. The QoS requirement is expressed by (5), (6) and (7).

3.3 Routing of distributed data centers and Cost minimization

This section addresses the cost minimization problem for big data processing via joint optimization of task assignment, data placement, and routing in geo-distributed data centers. Specifically, the following issues are considered in the joint optimization. Servers are equipped with limited storage and computation resources. Each data chunk has a storage requirement and will be required by big data tasks.

K Shortest Path Routing Algorithm

The K shortest path routing algorithm is an extension of the shortest path routing algorithm in a given network. It is sometimes crucial to have more than one path between two nodes in a given network. In the event there are additional constraints, paths other than the shortest path can be computed. To find the shortest path one can use shortest path algorithms such as Dijkstra's algorithm or the Bellman-Ford algorithm and extend them to find more than one path. The K shortest path routing algorithm is a generalization of the shortest path problem. The algorithm finds not only the shortest path, but also K - 1 other paths in order of increasing cost. K is the number of shortest paths to find.
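The K shortest path routing described above can be sketched as a generalization of Dijkstra's algorithm that keeps partial paths in a priority queue and counts how often each node has been reached. This is a minimal sketch and not necessarily the implementation used in the experiments. It permits paths with loops, and the function name `k_shortest_paths`, the adjacency-list format and the example graph are assumptions made for illustration.

```python
import heapq


def k_shortest_paths(graph, source, target, k):
    """Return up to k (cost, path) pairs from source to target, cheapest first.

    graph maps each node to a list of (neighbour, weight) pairs with
    positive weights. Paths may revisit nodes (the variant with loops).
    """
    found = []                   # completed source-to-target paths
    count = {}                   # how many times each node has been popped
    heap = [(0, [source])]       # candidate paths ordered by cost
    while heap and count.get(target, 0) < k:
        cost, path = heapq.heappop(heap)
        u = path[-1]
        count[u] = count.get(u, 0) + 1
        if u == target:
            found.append((cost, path))
        if count[u] <= k:
            # Extend the popped path by every outgoing link of u.
            for v, w in graph.get(u, []):
                heapq.heappush(heap, (cost + w, path + [v]))
    return found


# Hypothetical example network with link costs.
example = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 1), ("t", 5)],
    "b": [("t", 1)],
    "t": [],
}
paths = k_shortest_paths(example, "s", "t", 3)
for cost, path in paths:
    print(cost, " -> ".join(path))
```

Limiting each node to K extensions bounds the number of candidate paths, so the search terminates even when the graph contains cycles.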
The problem can be restricted to K shortest paths without loops (loopless K shortest paths) or may allow paths with loops [4].

3.4 Task assignment

If a task is distributed to a server where its requested data chunk does not reside, it needs to wait for the data chunk to be transferred. Each task should be responded to within time D. Moreover, in practical data center management, many task prediction mechanisms based on historical statistics have been developed and applied to decision making in data centers. To keep the data center settings up to date, data center operators may make adjustments according to the task prediction period by period. To deal with the high computational complexity of solving the MINLP, we linearize it into a mixed-integer linear programming (MILP) problem, which can be solved using a commercial solver. Extensive numerical studies show the high efficiency of the proposed joint-optimization based algorithm. The flow of work is explained in Fig 1.1. During the file transfer, files of size > 10 MB are transferred to their destination. If the cost of sending a file from S to D exceeds the server cost, cost minimization is performed, where D is the number of copies.

Algorithm

Dijkstra's algorithm can be generalized to find the K shortest paths.

    P = empty
    count_u = 0, for all u in V
    insert path P_s = {s} into B with cost 0
    while B is not empty and count_t < K:
        let P_u be the shortest cost path in B with cost C
        B = B - {P_u}, count_u = count_u + 1
        if u = t then P = P U {P_u}
        if count_u <= K then
            for each vertex v adjacent to u:
                let P_v be a new path with cost C + w(u, v) formed by concatenating edge (u, v) to path P_u
                insert P_v into B

The costs for the number of servers, communication and operation are then determined (8).

[Figure: (a) server cost, (b) communication cost, (c) operation cost; x-axes: number of replicas / number of servers; series: joint and K-map]

4 JOINT OPTIMIZATION

To linearize the constraints that arise from the product of two variables, joint optimization is done. We define a new variable as in (9), which can be equivalently replaced by the linear constraints (10) and (11). The constraints can then be written in linear form as (12) and (13). In a similar way, we define a new variable, which can be linearized by (14), (15) and (16).

5 PERFORMANCE MEASURE

The performance of the routing algorithm (K-map) is analyzed and compared with a separate optimization scheme (joint), in which the minimum number of servers to be activated is found and the traffic routing scheme is described using the network flow model. The result graphs cover the non-joint, joint and genetic algorithm performance. In the graphs, the values of joint and K-map are compared. The values for K-map are higher than those obtained with the joint linear method. Based on this individual

6 CONCLUSION:

The data placement, task assignment, data center resizing and routing have been studied jointly to minimize the overall operational cost in large-scale geo-distributed data centers for big data applications. The data processing procedure is first characterized using a two-dimensional Markov chain and the expected completion time is derived in closed form, based on which the joint optimization is formulated as an MINLP problem. To tackle the high computational complexity of solving the MINLP, it is linearized into an MILP problem.
Extensive experiments show that the joint-optimization solution has a substantial advantage over two-step separate optimization. Extensive numerical studies also show the high efficiency of the proposed joint-optimization based algorithm. This work can be enhanced by coupling a genetic algorithm with a grid search method to solve mixed-integer nonlinear programming problems.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] S. Gunduz and M. Ozsu, "A Poisson model for user accesses to web pages," in Computer and Information Sciences - ISCIS 2003, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, vol. 2869, pp. 332-339.
[3] H. Xu, C. Feng, and B. Li, "Temperature Aware Workload Management in Geo-distributed Datacenters," in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2013, pp. 33-36.
[4] http://en.wikipedia.org/wiki/K_shortest_path_routing
[5] L. Gu, D. Zeng, P. Li, and S. Guo, "Cost Minimization for Big Data Processing in Geo-Distributed Data Centers," IEEE Transactions on Emerging Topics in Computing, 2014, doi: 10.1109/TETC.2014.2310456.
[6] Z. Liu, M. Lin, A. Wierman, S. H. Low, and L. L. Andrew, "Greening Geographical Load Balancing," in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2011, pp. 233-244.
[7] Z. Liu, Y. Chen, C. Bash, A. Wierman, D. Gmach, Z. Wang, M. Marwah, and C. Hyser, "Renewable and Cooling Aware Workload Management for Sustainable Data Centers," in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2012, pp. 175-186.
[8] I. Marshall and C. Roadknight, "Linking cache performance to user behaviour," Computer Networks and ISDN Systems, vol. 30, no. 22-23, pp. 2123-2130, 1998.
[9] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, "No Power Struggles: Coordinated Multi-level Power Management for the Data Center," in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2008, pp. 48-59.
[10] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: novel erasure codes for big data," in Proceedings of the 39th International Conference on Very Large Data Bases, ser. PVLDB '13. VLDB Endowment, 2013, pp. 325-336.
[11] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, "Cutting the Electric Bill for Internet-scale Systems," in Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM). ACM, 2009, pp. 123-134.
[12] R. Urgaonkar, B. Urgaonkar, M. J. Neely, and A. Sivasubramaniam, "Optimal Power Cost Management Using Stored Energy in Data Centers," in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2011, pp. 221-232.