
A Survey of Cloud Computing

Guanfeng

Nov 7, 2010

Abstract

The principal service provided by cloud computing is that the underlying infrastructure of cloud-based platforms and systems, which often consists of compute resources such as storage, processors, RAM, and load balancers, is entirely abstracted from the consumer of the software or services. This characteristic calls for well-designed programming models, runtime systems, and communication protocols built around the requirement of efficiently sharing computing capacity among large numbers of different nodes. Such systems provide several means to reduce complexity and allow programmers to employ parallel and distributed resources without introducing their own cumbersome and incompatible mechanisms. In this survey, we analyze and summarize several techniques proposed in this research field. Our study focuses on programming models, algorithms, and protocols used in real cloud platforms, examining their system implementations on different resource bases, their concurrency and fault tolerance mechanisms, and their approaches to load balancing and schedule management.

1 Introduction

With the vigorous development of the Internet industry and the rapid increase in the volume of data that needs to be processed, people have become more and more interested in utilizing parallel computing resources to address their needs.

However, programming against or even maintaining a distributed system is complicated and expensive. The differing demands of individual customers, such as system availability and performance, and the variety of computing resources involved, such as commodity PC clusters, multi-core and shared-memory systems, or hybrids of the two, make the topic considerably harder. Thus, research in cloud computing has for several years focused on resource-abstraction-layer modeling and systems building, that is, on devising programming models and related protocols that relieve users of intricate resource management under reasonably general conditions. Several effective solutions have been proposed, such as MapReduce [1], Phoenix [2], LATE [3], Centrifuge [4], and many others.

MapReduce was proposed as a programming model for datacenter algorithms and can be applied to many real-world tasks. Inverted index construction, a commonly used program, can easily be computed on a large cluster with MapReduce. A computation consists of two parts, Map and Reduce, both defined by the user. The model first splits the input data records into pieces. The master node then dynamically selects slave workers among the candidate nodes and assigns one or more Map tasks, whose functionality is specified by the user, to each of them. Each worker passes over its received input data and processes each record in the way determined by the Map function. Subsequently, the worker stores its intermediate key/value pairs and partitions them into a fixed number of regions, determined by a user-supplied partitioning function. Notified by the master node, Reduce workers learn where the corresponding hashed partitions reside. After remotely reading the processed pairs (in a real cluster infrastructure) as input to the second phase, Reduce, they sort the intermediate values and group them by key. Finally, the output files of all Reduce invocations constitute the result of the user-defined computation.
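To make the data flow above concrete, the following minimal single-process sketch builds an inverted index by running Map, hash-partitioning the intermediate pairs, and then grouping and reducing each partition. It is our own illustration rather than the paper's code; the function names map_fn and reduce_fn and the partition count R are assumptions.

# Minimal single-process sketch of the MapReduce data flow for an
# inverted index: Map emits (word, doc_id), the partitioner hashes the
# key into R regions, Reduce collects the document list per word.
from collections import defaultdict

R = 4  # number of Reduce partitions (assumed for illustration)

def map_fn(doc_id, text):
    for word in text.split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    return word, sorted(set(doc_ids))

def run_job(documents):
    # Map phase: partition intermediate pairs by hash(key) % R.
    partitions = [defaultdict(list) for _ in range(R)]
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            partitions[hash(key) % R][key].append(value)
    # Reduce phase: each partition is grouped by key and reduced.
    result = {}
    for part in partitions:
        for key in sorted(part):
            word, postings = reduce_fn(key, part[key])
            result[word] = postings
    return result

if __name__ == "__main__":
    docs = {"d1": "cloud computing survey", "d2": "cloud storage"}
    print(run_job(docs))  # e.g. {'cloud': ['d1', 'd2'], 'computing': ['d1'], ...}

In a real cluster the partitions would be written to local disk and fetched remotely by the Reduce workers, but the Map, partition, and Reduce stages follow the same structure.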

2 Infrastructure

Platforms and runtime systems built on different computing resources have different characteristics and demands. The MapReduce programming model proposed by Google is implemented on commodity PC clusters. During execution it must be concerned with the locations of intermediate files, network traffic, communication protocols between machines, and hardware failures. Some of these concerns do not apply to other resource bases; for example, Phoenix, an implementation of MapReduce for shared-memory systems, focuses instead on buffer management, a quite different fault recovery mechanism, and performance management (such as using prefetching to optimize locality). For both multi-core chips and conventional symmetric multi-processor environments, Phoenix manages Map-Reduce buffers and Reduce-Merge buffers, with well-specified access permissions, to store intermediate and final task output key/value pairs separately. Another system, Centrifuge, was proposed by Microsoft for in-memory server pools. This datacenter lease manager provides a usable replacement for an internal load balancer while guaranteeing system correctness. Moreover, the nodes inside a computing cluster may not be homogeneous and can make progress at different speeds; Longest Approximate Time to End (LATE) is a scheduling algorithm designed for this kind of heterogeneous situation.

3 Availability and Fault Tolerance

In the MapReduce implementation, the master node periodically sends an ICMP echo request to each worker and waits for a response. If no answer arrives within a certain time, the master assumes the worker has crashed, and any incomplete Map or Reduce tasks previously assigned to that worker are re-executed on other nodes.
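As a rough illustration of this detection-and-reexecution loop (our own sketch, not Google's code; the data structures and the timeout value are assumptions), the master could track the last time each worker answered a ping and requeue that worker's in-flight tasks once a deadline passes.

# Sketch of master-side failure detection: workers that miss the ping
# deadline are declared dead and their in-flight tasks go back to the
# pending queue for re-execution on other nodes.
import time

PING_TIMEOUT = 30.0  # seconds without a reply before a worker is presumed dead

class Master:
    def __init__(self):
        self.last_reply = {}   # worker_id -> timestamp of last ping reply
        self.running = {}      # worker_id -> set of task ids in flight
        self.pending = []      # tasks waiting to be (re)assigned

    def record_ping_reply(self, worker_id):
        self.last_reply[worker_id] = time.time()

    def check_workers(self):
        now = time.time()
        for worker_id, last in list(self.last_reply.items()):
            if now - last > PING_TIMEOUT:
                # Worker presumed crashed: requeue its incomplete tasks.
                self.pending.extend(self.running.pop(worker_id, set()))
                del self.last_reply[worker_id]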

Even in the face of a large-scale failure, this re-execution scheme still guarantees that the computation terminates. The master node itself writes periodic checkpoints of all computation state, which can be reloaded into a replacement copy if the master fails. Assuming deterministic Map and Reduce functions, the MapReduce implementation produces the same output, only with a longer execution time, and thereby meets the fault tolerance requirement.

The Phoenix system uses a different approach to worker failure detection. Since a shared-memory system cannot, and has no need to, use network-layer utilities to test the vital signs of workers, the scheduler simply marks a worker as a transient failure if it does not complete a task within a reasonable amount of time. The failure is upgraded to permanent if the task fails several times or the worker fails to finish successfully at a high frequency. Workers with the permanent-failure label are no longer trusted and receive no further tasks. A potential problem here, in our view, is how to determine the reasonable processing time slot. There are pros and cons both to letting users indicate this value and to letting the system determine it entirely on its own. If the system sets the time range too short, even if not far from the reasonable value, many computing resources are wasted re-executing the same tasks. Conversely, if the time range is much longer than the task needs, it increases the time to detect a failure and degrades performance as a consequence.

In the Centrifuge model, the Manager service, which implements the logic of lease management, partitioning, and adaptive load balancing, consists of two parts: a Paxos group, and a set consisting of a leader server and standby servers. The Paxos group is a general mechanism for distributed consensus; in this implementation it elects a leader from the candidates and provides a highly available store. Every operation the leader executes that changes its internal state must be reported to the Paxos group. This sequence of requests is stored at Paxos and handed to a designated standby server when that server is elected as the new leader after the old one crashes.
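A minimal sketch of that hand-off follows; it is our own illustration, and the ReplicatedLog class here merely stands in for the Paxos group rather than reflecting Centrifuge's actual interface. Every state-changing operation is appended to the replicated log before it is applied, so a standby promoted to leader can rebuild the state by replaying the log.

# Sketch of leader state replication through a consensus-backed log: the
# leader records every state-changing operation in the log (standing in for
# the Paxos group) before applying it, and a newly elected leader replays
# the log to reconstruct the previous leader's internal state.
class ReplicatedLog:
    """Stand-in for the highly available store provided by the Paxos group."""
    def __init__(self):
        self.entries = []

    def append(self, operation):
        self.entries.append(operation)   # in reality: agreed on by consensus

class Manager:
    def __init__(self, log):
        self.log = log
        self.state = {}

    def execute(self, key, value):
        self.log.append((key, value))    # report the operation first
        self.state[key] = value          # then apply it locally

    @classmethod
    def promote_standby(cls, log):
        # A standby elected as the new leader replays the stored sequence.
        leader = cls(log)
        for key, value in log.entries:
            leader.state[key] = value
        return leader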

For efficiency, the Owner and Lookup libraries send requests only to the valid leader. They broadcast their messages to all servers only if the leader they previously contacted has become unresponsive. An update occurs when a library receives a response from a new Manager node; from then on, that node becomes the sole destination of all subsequent requests.

4 Performance

In a computer cluster, network bandwidth is an important factor to account for. When a large number of Map workers must read their input files from remote disks, they consume a great deal of scarce network capacity and inevitably lower performance. Google's implementation therefore feeds the location information of the source input data into the master node's decisions and tries to assign as many Map tasks as possible to physical nodes that hold the corresponding input data on their local disks. In addition, it spawns redundant (backup) executions to counter potential stragglers toward the end of both stages. A straggler here is a machine that spends an unreasonably long time executing its tasks during the last few waves of the computation. During this period, customers are left waiting for just a few workers, which may even require the help of the fault tolerance mechanisms later, and this obviously prolongs the total execution time seen by users. Having the master let several machines perform the same operation is proposed to avoid this poor outcome and speed up the processing rate. Another way this model improves performance is by decreasing task granularity. Ideally, the more task pieces there are to schedule, the more balanced a state can be reached, and, within the memory capacity of the master node, finer granularity accelerates recovery as well. However, in our view the overhead introduced by the additional partitioning operations should also be considered.
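A locality preference like the one just described can be sketched as follows; this is our own simplification, with the data structures and fallback order assumed rather than taken from the paper. The master first looks for an idle worker co-located with the split's replica and only then assigns the task to any idle worker.

# Sketch of locality-aware Map task assignment: prefer an idle worker on a
# host that already stores the input split, otherwise fall back to any
# idle worker (rack-level preference is omitted for brevity).
def assign_map_task(task, split_locations, idle_workers):
    """task: task id; split_locations: hosts holding the input split;
    idle_workers: dict host -> worker object with an assign() method."""
    for host in split_locations:
        worker = idle_workers.get(host)
        if worker is not None:
            worker.assign(task)          # data-local assignment, no network read
            return worker
    if idle_workers:
        host, worker = next(iter(idle_workers.items()))
        worker.assign(task)              # remote read; consumes network bandwidth
        return worker
    return None                          # no idle worker; task stays queued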

The situation for shared-memory systems is different again. Instead of taking physical location information into account, the runtime employs a prefetching scheme, similar to double buffering in streaming models, to optimize performance. Moreover, Phoenix uses hardware compression of intermediate pairs to conserve bandwidth and cache resources. In heterogeneous environments, the LATE algorithm provides a modified scheduling mechanism to achieve good performance on its target infrastructure; we discuss it in the next section.

5 Scheduling and Load Balancing

Phoenix and the original Hadoop system [5] have much in common in their scheduling: they split input data, assign Map tasks to workers, and notify Reduce workers of file locations or memory addresses. Nevertheless, a new algorithm for heterogeneous computing resources can still be designed that outperforms the existing ones. LATE begins by challenging the assumptions of homogeneity and linear task progress. Hadoop's scheduling algorithm ranks tasks in the following order: failed tasks, unscheduled tasks, and straggler tasks. A monitor inside the system selects speculative tasks by their progress score. The progress score is a real number from 0 to 1; for a Map task it is the fraction of input read, while a Reduce task's execution is split evenly across its three phases. A task that has run for longer than one minute and whose score is less than the average minus 0.2 is marked as a straggler. The master also guarantees that at most one speculative copy of each task is running at a time. LATE instead picks stragglers by estimating and comparing the tasks' finish times in the future. It provides a simple heuristic to estimate the remaining time and lets users plug in their own estimation methods as well. The scheduler computes the progress rate as ProgressScore / ExecutionTime, and the estimated time left as (1 - ProgressScore) / ProgressRate.
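Under those definitions, LATE's heuristic for choosing which running task to back up can be sketched as follows. This is a minimal illustration with assumed field names; the real scheduler also applies the thresholds and the cap described next.

# Sketch of LATE's remaining-time heuristic: progress_rate = score / elapsed,
# time_left = (1 - score) / progress_rate. The running task with the longest
# estimated time left is the preferred candidate for speculative execution.
def time_left(progress_score, elapsed_seconds):
    if progress_score <= 0.0:
        return float("inf")              # no progress yet: longest estimated time left
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate

def pick_speculative_candidate(running_tasks):
    """running_tasks: list of (task_id, progress_score, elapsed_seconds)."""
    return max(running_tasks, key=lambda t: time_left(t[1], t[2]), default=None)

# Example: the slower task (lower rate, more work left) is backed up first.
tasks = [("t1", 0.9, 60.0), ("t2", 0.2, 60.0)]
print(pick_speculative_candidate(tasks))  # -> ("t2", 0.2, 60.0)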

A task is considered a straggler if its progress rate is below the SlowTaskThreshold. Furthermore, the system only launches speculative tasks on fast nodes; a SlowNodeThreshold is used to distinguish fast nodes from slow ones. Another parameter, SpeculativeCap, limits the total number of speculative tasks that can run at the same time. Although these three parameters are set to the 25th percentile of task progress rates, the 25th percentile of node progress, and 10% of the available task slots respectively, and this choice performs well in experiments, the authors should still give more justification for these values and for what the best combination is.

For in-memory server pools, Centrifuge is proposed as a replacement for an internal network load balancer. It provides a Manager service, a Lookup library, and an Owner library. When a server wants to publish an event to a specified topic, it queries its Lookup library using the hash of the topic name, and the library returns the address of the appropriate server hosting that topic. The servers linked against the Owner library receive leases on these topics, based on the same hash keys. When such a server receives a publish message, it must check with its Owner library that it still holds a valid lease before forwarding the event to all parties. The Manager service's job is to partition a flat key namespace into variable-length ranges across all servers linked against the Owner library, and to distribute the corresponding partitioning assignments through a lease protocol. The Manager reassigns leases both for adaptive load management and in response to requests from Owner libraries. The Lookup library maintains, for each range, the lease generation number and the corresponding Owner node; it contacts the Manager once every 30 seconds to fetch incremental changes and update its table. To solve the problem of message races, Centrifuge introduces a protocol with two sequence numbers, somewhat like a vector clock: if one side receives a message that does not carry the receiver's most recent sequence number, the message was sent before the previous message was processed and should be dropped.
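The following sketch (our own simplification with hypothetical class and field names, not Centrifuge's API) illustrates the two checks described above: routing a topic to an owner by hashing it into a key range, and refusing to act on a publish unless the owner still holds a lease of the expected generation for that range.

# Sketch of Centrifuge-style routing and lease checking: the key namespace
# is split into ranges, each range has an owner and a lease generation, and
# an owner only acts on a request if its lease on the range is still valid.
import hashlib

def topic_key(topic_name):
    # Hash the topic name into a flat 64-bit key space.
    return int.from_bytes(hashlib.sha1(topic_name.encode()).digest()[:8], "big")

class LookupTable:
    """Maps key ranges to (owner_address, lease_generation); refreshed from
    the Manager periodically (every 30 seconds in the paper)."""
    def __init__(self, ranges):
        # ranges: list of (range_start, range_end, owner_address, generation)
        self.ranges = ranges

    def owner_for(self, topic_name):
        key = topic_key(topic_name)
        for start, end, owner, generation in self.ranges:
            if start <= key <= end:
                return owner, generation
        return None, None

class OwnerLibrary:
    def __init__(self):
        self.leases = {}   # range_start -> generation currently held

    def holds_valid_lease(self, range_start, generation):
        # Forwarding is allowed only if the held lease matches the generation
        # the client observed; otherwise the publish must be rejected.
        return self.leases.get(range_start) == generation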

6 Discussion

At present, two kinds of solutions dominate the cloud computing area. The first is IaaS, Infrastructure as a Service, such as Rackspace Cloud [6], Joyent SmartMachines [7], and Amazon EC2 [8]. The keys to this model are server virtualization and highly automated management. Through large-scale deployment of virtualized hosts backed by high-performance storage clusters, these offerings can serve customers with effectively unlimited virtual servers in a short time, and users obtain full control and can run any application on top of them. Nevertheless, IaaS still requires users to handle several issues within their own framework or architecture, such as security, replication among data servers, and fault tolerance in application servers. The other solution is PaaS, Platform as a Service, an outgrowth of Software as a Service, such as Google App Engine [9], SmartPlatform Beta [10], and perhaps OpenStack in the future [11]. With this solution, users no longer need to worry about problems of scalability or performance stemming from the platform itself. One potential pitfall is that PaaS carries some risk of lock-in if the offering requires proprietary service interfaces or development languages. Still, it saves clients considerable time and trouble, so that they can concentrate on the business logic they really care about, and it may well become the main trend in the future.

There are still many open research issues in cloud computing. First, we believe some international standards should be established to avoid vendor lock-in, since switching to other products or services entails substantial costs for both developers and customers. Second, security and privacy are large and practical concerns for any Internet service.

Cloud computing is no exception: it must guard against malicious attacks and other hacking behavior. What is more, the proposed programming models, such as MapReduce, do not perform well on diverse data, which poses a challenge to both consistency and efficiency [12]. Furthermore, the lack of a development methodology for such systems and services contributes to serious efficiency and bottleneck problems; this kind of development should be automated in most cases. In addition, an abstraction layer focused on cluster management for virtual machines could be separated out of the present model structure. Last but not least, the scale of cloud computing services needs to keep expanding to support massive enterprise applications, and well-performing substitutes are needed for in-house management and monitoring tools.

References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Dec. 2004.
[2] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 13th IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 13-24, Feb. 2007.
[3] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of OSDI, pages 29-42, San Diego, CA, Dec. 2008.
[4] A. Adya, J. Dunagan, and A. Wolman. Centrifuge: Integrated Lease Management and Partitioning for Cloud Services. In Proceedings of USENIX NSDI, Apr. 2010.
[5] Apache Hadoop.
[6] Rackspace Cloud.
[7] Joyent SmartMachines.
[8] Amazon EC2.
[9] Google App Engine.
[10] SmartPlatform Beta.
[11] OpenStack.
[12] N. Rapolu, K. Kambatla, S. Jagannathan, and A. Grama. Transactional Support in MapReduce for Speculative Parallelism.
