Final Project Proposal
CSCI.6500 Distributed Computing over the Internet
Qingling Wang 660795696

1. Purpose

Implement an application layer on a Hybrid Grid-Cloud infrastructure that automatically, or at least semi-automatically, determines the best environment settings for the user's tasks. It will simplify the procedure of deploying computational tasks on a Cloud or Hybrid Grid-Cloud infrastructure, especially for end users who know little about the technical details of Cloud or Grid Computing. The user can obtain different computing performance by adjusting a few parameters, such as time consumption and cost. The application does not need to restart the whole batch of tasks: it suspends to pick up the new settings, then resumes computing from the stop point.

2. Background

Cloud Computing

There is no concrete definition of Cloud Computing yet. Generally speaking, it is a reliable, virtualized, Internet-based architecture that provides on-demand computing, storage and many other services. Shared resources and virtualization are its key features. Cloud Computing has the following outstanding advantages.

1. Enables users to access their data everywhere.
2. Provides on-demand services.
3. Provides more powerful computing capacities.
4. Has a more reliable and flexible architecture, guaranteeing QoS for users.

Organizations are impressed with these powerful computing and storage capabilities. Indeed, the Cloud provides on-demand, scalable and fault-tolerant services over globally scattered hardware resources. But Cloud Computing is not a universal architecture; it may give low performance in some situations.

1. It is hard to estimate exactly how many compute instances and data transfers are still needed.
2. The coordination between different virtual machines degrades performance.
3. It is difficult to separate the data into independent data sets.
4. For an individual customer or a small group, public cloud computing services are still very expensive.
Usually, an existing grid has been used for years inside an organization, and it would be a waste to give up all those grid resources. The ideal approach is to find the proper way and time to combine the Grid with the Cloud, such that computing capability is maximized at a somewhat lower cost.

Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. It is a large-scale distributed processing infrastructure designed to efficiently distribute large amounts of work across a set of machines. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. The application is divided into small work units, and each unit can be executed on any computational node. The framework also has a distributed file system, which stores data on the various nodes. Thus, Hadoop MapReduce is suitable for testing the performance of Grid, Cloud and Grid&Cloud (G&C) on different data sizes. Usually, one of the machines (a physical machine in the Grid, a virtual machine in the Cloud, either kind in G&C) acts as the master, which is responsible for scheduling tasks among the other machines (slaves). The master machine can also execute tasks itself.

End User's Concerns

End users do not care which environment they are using; performance, cost and ease of use are the most important qualities of a framework for them. Most papers about Cloud Computing still focus on introducing what Cloud Computing is and the advantages of using it. Actually, users, especially those with little knowledge of Grid/Cloud Computing, need concrete result data to compare in order to gain a real understanding of it. Thus, real experimental data would help a lot.

3. Scope

Hadoop MapReduce helps to obtain the experimental results on Grid, Cloud and Hybrid Grid-Cloud.
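The split/map/collect flow that Hadoop MapReduce provides can be illustrated with a minimal plain-Java sketch. This is not the Hadoop API: it runs entirely in memory, and the word-count task, class name and method names are only illustrative of how a master splits work into independent units and collects the reduced results.

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory sketch of the MapReduce pattern: the input is split
// into independent work units (lines), each unit is mapped on its own
// (as a slave node would do), and the emitted pairs are grouped by key
// and reduced, as the framework does after collecting node results.
public class MapReduceSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each line is an independent unit emitting (word, 1) pairs.
        List<Map.Entry<String, Integer>> emitted = lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            .filter(w -> !w.isEmpty())
            .map(w -> Map.entry(w, 1))
            .collect(Collectors.toList());
        // Shuffle + reduce phase: group by key and sum the counts.
        return emitted.stream().collect(Collectors.toMap(
            Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(List.of("grid cloud grid", "cloud hadoop"));
        System.out.println(counts); // each word mapped to its total count
    }
}
```

In real Hadoop the map and reduce phases run on different nodes over HDFS blocks; the sketch only shows the dataflow shape that makes the work units independent.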
For the computation-intensive tasks, I will use a linear classification algorithm from the Machine Learning domain, called the Pocket Algorithm, run it on Hadoop, and gather results from the different environments.

Grid: the RPI Grid, which includes several single-core and some dual-core physical machines.

Cloud: virtual machines run on the same physical machines as the Grid, with Xen 3.4.1 hosting the VM instances.

4. Framework
There are mainly three layers in this framework.

[Figure: three-layer framework architecture — the Client Input Parser, Schedule Process, Performance Evaluation and Temporary Staging Storage at the top; Hadoop (MapReduce) with an XML configure file in the middle; and the Grid, the Xen-hosted Cloud and a public Cloud at the bottom.]

The bottom layer is the whole physical environment: a Grid comprised of several single-core or dual-core servers. The Cloud uses the same server machines as the Grid but runs them under Xen, so that the Cloud can host a number of virtual instances. This layer also contains a public Cloud (i.e. Amazon EC2), from which we can obtain extra computing capability when necessary. The second layer is Hadoop, which is deployed on the bottom layer. We mainly use MapReduce, and sometimes perhaps HBase, to execute the computational task. MapReduce is responsible for splitting the data, distributing it to the different DataNodes and collecting the distributed results. HBase is used to store intermediate results or other useful
status info. The top layer is what we need to implement. It is responsible for interacting with end users, and is composed of three correlated components: the Client Input Parser, the Temporary Staging Storage and the Schedule Process.

5. Method

Generally speaking, there are two parts. The first part is to collect experimental results, analyze the data, and draw some valuable conclusions. We will use Hadoop to run the experiment. Hadoop has a built-in command called jar, which supports executing a jar file. The experiment will be computation intensive. In the Java project, we will use a well-known algorithm from Machine Learning, the Pocket Algorithm, to separate two different sets of points (2,000 uniquely random points in total). It takes almost two hours to run on a laptop and almost one hour on the server, so it is suitable for a computation-intensive experiment. We will collect all the statistics from this experiment on Grid, Cloud and Hybrid Grid-Cloud, then analyze the reasons for the different performance in the different environments, and conclude what the best settings are for different inputs (size changes, block-splitting changes and so on) and different end-user requirements.

The second part is the implementation. Based on the conclusions, we write an application layer with three related components: the Client Input Parser, the Schedule Process and the Temporary Staging Storage. It automatically, or at least semi-automatically, maps the user's demands onto a proper running environment, executes the task and returns the results. The user can also change the demand parameters during execution, but only at certain time points. The Client Input Parser takes the client's input and parses it into an XML file. The Schedule Process takes the XML file as input and, based on the current computation capacity of the environment, produces several practical plans for executing the task. The Schedule Process analyzes the pros and cons of each plan, mainly with respect to time consumption, cost, fault tolerance, etc.
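For reference, the Pocket Algorithm used in the experiment can be sketched in a few lines of self-contained Java: an ordinary perceptron update loop that keeps in its "pocket" the best weight vector seen so far. The toy data set, iteration count and seed below are illustrative only, not the actual 2,000-point experiment, and the sequential version ignores the MapReduce parallelization.

```java
import java.util.Random;

// Toy sketch of the Pocket Algorithm for linear classification:
// run perceptron updates, but keep ("pocket") the weights with the
// fewest misclassifications observed so far.
public class PocketAlgorithm {
    // Predict +1 or -1 for point (x1, x2) with weights w = (bias, w1, w2).
    public static int classify(double[] w, double x1, double x2) {
        return (w[0] + w[1] * x1 + w[2] * x2) >= 0 ? 1 : -1;
    }

    // Count misclassified points under weights w.
    public static int errors(double[] w, double[][] xs, int[] ys) {
        int e = 0;
        for (int i = 0; i < xs.length; i++)
            if (classify(w, xs[i][0], xs[i][1]) != ys[i]) e++;
        return e;
    }

    // Train for a fixed number of iterations; return the pocketed weights.
    public static double[] train(double[][] xs, int[] ys, int iters, long seed) {
        Random rng = new Random(seed);
        double[] w = new double[3];
        double[] pocket = w.clone();
        int bestErrors = errors(pocket, xs, ys);
        for (int t = 0; t < iters; t++) {
            int i = rng.nextInt(xs.length);
            if (classify(w, xs[i][0], xs[i][1]) != ys[i]) {
                // Standard perceptron update on a misclassified point.
                w[0] += ys[i];
                w[1] += ys[i] * xs[i][0];
                w[2] += ys[i] * xs[i][1];
                int e = errors(w, xs, ys);
                if (e < bestErrors) { bestErrors = e; pocket = w.clone(); }
            }
        }
        return pocket;
    }

    public static void main(String[] args) {
        // Two linearly separable point sets: label +1 iff x1 + x2 > 1.
        double[][] xs = {{0, 0}, {1, 0}, {0, 1}, {1, 1}, {2, 1}, {0.2, 0.3}};
        int[] ys = {-1, -1, -1, 1, 1, -1};
        double[] pocket = train(xs, ys, 2000, 42);
        System.out.println(errors(pocket, xs, ys)); // should reach 0 on this toy set
    }
}
```

Unlike the plain perceptron, the pocketed weights never get worse as training proceeds, which is why the algorithm also behaves sensibly when the two point sets are not perfectly separable.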
It will return the analysis back to the user, who can choose whichever plan they want through the application-layer interface. The Schedule Process will choose the plan with the best performance in case the user does not choose any of the plans. During execution, the user can reset their demand settings, but only at certain interruptible points. The Schedule Process will then suspend at that point and save all the parameters, states and other useful status into the Temporary Staging Storage. It generates a new configure file for the environment settings that satisfies the end user's new requirements. Finally, the Schedule Process resumes and runs Hadoop again to obtain the new results.

Sometimes the computing capability inside the organization (this includes the Grid and the private Cloud) is not enough, and we need to request public Cloud Computing capacity, such as Amazon EC2. In our method, if none of the feasible plans can satisfy the client's requirements, we then consider the public cloud. Cost and privacy then become the biggest issues.

6. Timetable

Experiment Preparation
11.1-11.5    Finish the MapReduce version of the Pocket Algorithm
11.8-11.12   Finish the computation-intensive experiments in all environments

Implementation Part
11.15-11.19  Implement the Client Input Parser and Temporary Staging Storage of the Application Layer
11.22-11.26  Implement the Schedule Process of the Application Layer
11.29-12.3   Test it; case study using this Application Layer

Conclusion
12.6-12.10   Conclusion, Future Work

7. Reference

The reference paper is also the one I will present.

Hyunjoo Kim, Yaakoub el Khamra, Shantenu Jha, Manish Parashar, "Exploring Application and Infrastructure Adaptation on Hybrid Grid-Cloud Infrastructure," in HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 402-412, Chicago, Illinois, USA, 2010.