Summary

Cloud computing has become one of the key terms in the IT industry. The cloud represents the internet, or an infrastructure for communication between components, providing and receiving services through the internet. Cloud services offer many benefits to clients, giving them the possibility to store and process data using a scalable architecture with an almost infinite storage capacity. With the amount of data moved to the cloud constantly increasing, cloud services deal with millions of files every day, storing and processing them according to client requests. Facebook is a real example of this type of cloud service, where data ranges from terabytes to petabytes and needs to be stored and processed in a timely manner. This increase in data being transferred to the cloud makes it more pressing to develop data management mechanisms that improve the performance of cloud services when handling requests to store and process large data. This project aims to give an understanding of data management in the cloud, evaluating solutions and mechanisms to speed up data transfer and data processing. Several scenarios will be simulated using a cloud simulator, with a number of experiments conducted to evaluate techniques for reducing the time to transfer and process data.

Acknowledgements

I would like to thank my family for supporting my studies and believing in my dreams. I would like to thank my tutor Nick for all the support given over the last 3 years, and my supervisor for his interest in this project, for scheduling weekly meetings, and for the feedback given during the progress of the project.

Contents

1. Introduction
   1.1 Introduction
   1.2 Why Cloud Simulation
   1.3 Why Data Storage
   1.4 Project Problem
   1.5 Objectives
   1.6 Minimum Requirements
   1.7 Project Plan
       1.7.1 Milestones
   1.8 Methodology
   1.9 Summary
2. Background Research
   2.1 Introduction
   2.2 Cloud Computing
   2.3 Cloud Features
   2.4 Cloud Models
       2.4.1 Software as a Service
       2.4.2 Platform as a Service
       2.4.3 Infrastructure as a Service
   2.5 Types of Cloud
       2.5.1 Private Clouds
       2.5.2 Public Clouds
       2.5.3 Hybrid Clouds
   2.6 Cloud Applications
   2.7 Virtualization
   2.8 Data Management
       2.8.1 Data Management Issues
       2.8.2 Data Storage and Query Processing
       2.8.3 Scalability and Consistency
       2.8.4 Database Management Systems with multi-tenants
   2.9 Evaluation Performance using Cloud Simulation
   2.10 Data Transfer
       2.10.1 CloudSim
       2.10.2 CloudSim Architecture
       2.10.3 CloudSim Usability
       2.10.4 CloudSim Capabilities
       2.10.5 CloudSim Limitations
   2.11 Data Processing
       2.11.1 Hadoop Distributed File System
       2.11.2 HDFS Goals: a) Hardware Failure, b) Streaming Data Access, c) Large Data Sets
       2.11.3 Block Division
       2.11.4 HDFS Architecture
       2.11.5 Data Replication
   2.12 Related Work
   2.13 Summary
3. Design
   3.1 Introduction
   3.2 Methodology
   3.3 Example Problem
   3.4 Solution
       3.4.1 Data Transfer: Application of CloudSim
       3.4.2 Data Processing Performance
   3.5 Summary
4. Implementation
   4.1 Introduction
   4.2 Experiments
       4.2.1 CloudSim Experiments (Scenarios 1-4, each with Objectives and Design, and Results)
       4.2.2 HDFS Experiments (Scenarios 1-3, each with Objectives and Design, and Results)
5. Evaluation
   5.1 Introduction
   5.2 Achieving Minimum Requirements
   5.3 Evaluation of Project Methodology
   5.4 Evaluation of Project Experiments
   5.5 Evaluation of CloudSim Results
   5.6 Evaluation of HDFS Results
   5.7 Future Work and Possible Extensions
   Conclusion

Chapter 1 Introduction

1.1 Introduction

The aim of the project is to evaluate mechanisms for data management in the cloud using cloud simulation. Data management in the cloud has become important, as cloud computing offers storage as a service: clients can move their data to the cloud, getting the benefit of accessing it anywhere, together with the larger storage capacity offered by their cloud provider. This project will focus on data storage, and show how cloud simulation can help us get a better understanding of storing data, as well as processing data, in the cloud. The project will involve cloud simulation experiments to check the performance of transferring and processing data in the cloud.

1.2 Why Cloud Simulation?

In a cloud environment, some services or applications have to be tested before they are provided to customers, in order to know how the service or application will behave while it is being used. Because cloud simulation helps to develop experiments simulating cloud environments, most users can easily become familiar with it. It allows them to perform tests on their services, with the power to repeat these tests and control the cloud environment, at no cost to them, and they get an idea of the service's performance before introducing it to the real cloud environment.

1.3 Why Data Storage?

Cloud computing offers many storage services, but one of the biggest concerns is the possibility of one client's data being mixed with data from other clients. Data storage is therefore a challenge for providers, who must manage and extract data in the cloud: it is important to isolate data belonging to a single client from the others. A good and fast way of accessing data also needs to be provided, to make it easier for users to get their files in the cloud and to reduce the level of data loss. Data management is thus an important factor for service providers in the cloud if they are to succeed in delivering services to clients. For example, Google offers Google Cloud Storage, which allows people to access, store and protect data files. It lets us manage our data on Google's reliable infrastructure, which is scalable and efficient, giving robust storage with quick and easy access. SkyDrive is a cloud storage service from Microsoft that lets us store data in the cloud and offers a set of tools to manage it, such as Word, Excel and PowerPoint. Amazon offers Amazon S3, which has an interface that allows data to be stored and retrieved at any time, from anywhere. It gives access to the scalable, inexpensive and fast infrastructure that Amazon runs across its global network of sites.

Figure 1: Data storage in cloud [25]

1.4 Project Problem

As the quantity of data that clients want to move to the cloud increases every day, and with it the need for cloud providers to store and process data in a timely manner, mechanisms are necessary to measure the performance of the applications used to store and process data, in terms of the time it would take to complete clients' requests. Using cloud simulation it is possible to generate a set of metrics to evaluate the performance of storing data while it is transferred to the cloud, and of processing a client's data once it is already there. By analysing the results obtained from simulations using these metrics, it is possible to review the pros and cons of an implementation in a specific data storage and processing scenario, and therefore to establish which resources are needed to improve a system so that it offers better performance for storing and processing data in the cloud.

1.5 Objectives

The objectives of the project are:
- Explore cloud computing and its architecture
- Understand data management in the cloud
- Investigate data storage in the cloud
- Understand simulation in the cloud, and explore a cloud simulator to run experiments simulating cloud scenarios
- Explore data management issues and their solutions

1.6 Minimum Requirements

The minimum requirements are:
- Investigate cloud simulator architecture (CloudSim and the HDFS simulator)
- Identify the cloud services and cloud resources that are key to implementing a cloud data management scenario
- Understand communication between entities in CloudSim and the HDFS simulator
- Implement a simulation experiment considering a scenario for data management in the cloud
- Implement a working cloud simulation experiment for data storage in the cloud

1.7 Project Plan

The project was divided into 6 stages, with a number of tasks set for each stage in order to keep the project on schedule. A Gantt chart was produced to set the time period for each stage. In the first stage, a literature review was done, covering theoretical concepts of cloud computing and its architecture, which helped to achieve the goals of the other stages of the project. The second stage includes an analysis of the issues with data management in the cloud, and of solution approaches for implementing methods that improve it. The third stage covers planning the project structure: which data management problem will be covered, and the selection of the solution approach and of the simulators used to evaluate the solution. Stage four includes the design of scenarios to evaluate the solution, and the implementation of these scenarios using the selected simulators. In stage five, tests are performed on each scenario; results are collected and discussed. Finally, in stage six, an evaluation of the solution is performed, with individual and comparative analysis of the simulated scenarios.

1.7.1 Milestones

To track the project's progress, end points for each stage have been set in the project plan, to ensure that each stage has been completed.
A) Literature review (25/02/13)
B) Analysis of the problem (01/03/13)
C) Plan the structure of the project (08/03/13)
D) Design and implementation (19/04/13)
E) Tests (19/04/13)
F) Evaluation and writing up (05/05/13)

1.8 Methodology

To achieve the minimum requirements of the project, research into cloud computing must be completed. The research will cover cloud computing architecture, cloud simulation features and architecture, and a brief survey of existing cloud applications. Data management issues will be presented in this project, and some solutions will be proposed to minimise them. Performance evaluation in the cloud can be done in three ways: through direct experiments, which involve designing a real cloud environment to provide services; mathematical modelling, where the evaluation is modelled with mathematical algorithms and formulas such as equations; and cloud simulation, which allows users to perform experiments simulating cloud environments to test cloud services. Simulation was used as the methodology for this project, as it offers the tools necessary to simulate scenario environments and to calculate time, energy consumption, data cost, and data processing results for each simulation experiment. These tools allow users to design experiments, setting parameters according to the experiment's objective, then run them and show the results. There are several cloud simulators that can help users develop experiments simulating scenarios, for a better understanding of how cloud services work and of how to implement new

solutions for problems in cloud services. CloudSim and the HDFS (Hadoop Distributed File System) simulator will be used as the simulation frameworks for this project, as they offer the tools needed to simulate data transfer and calculate the time to transfer data, and also to simulate data processing for large files. Simulation experiments will be performed for a better understanding of data transfer and data processing in the cloud.

1.9 Summary

The project aim and minimum requirements have been stated, and the project methodology has been explained. The next chapter will focus on background research on cloud computing, its architecture and existing applications.

Figure: Project plan Gantt chart. Tasks (Literature Review, Problem Analysis, Project Plan, Design, Implementation, Tests, Evaluation, Project Report) scheduled against the milestone dates 21/01/13, 25/02/13, 01/03/13, 08/03/13, 19/04/13, 05/05/13 and 08/05/13.

Chapter 2 Background Research

2.1 Introduction

The objective of this chapter is to go through the concepts of cloud computing and the design and architecture of the CloudSim and HDFS simulators.

2.2 Cloud Computing

Cloud computing is computation performed on distributed servers, where data is stored and processed without users knowing where that data is located. Cloud computing as a system involves features such as scalability, reliability, transparency and redundancy. Cloud computing can be defined as the set of hardware, networks, storage, services, and interfaces that combine to deliver aspects of computing as a service [1]. NIST (the National Institute of Standards and Technology) defines cloud computing as: "A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction." [2] Cloud computing can be considered the next step in the evolution of the internet, as it provides the means for everything (from applications, computing power and computing infrastructure to storage, business and personal collaboration processes, and software) to be delivered to clients as a service, wherever and whenever it is required, over the internet. With these characteristics, cloud computing becomes flexible, because many machines work together, and centralised, constituting a single system in the cloud that provides services to customers through the internet.

2.3 Cloud Features

A cloud computing service includes several elements: distributed servers, datacenters and clients.

Clients: a client in cloud computing is anything connected to a Local Area Network (LAN) that requests and receives services. In other words, clients are the devices that end users interact with to manage their information in the cloud. Clients can be divided into 3 types: mobile, thin and thick [3].
1. Mobile - mobile devices, such as smartphones, tablets and laptops.
2. Thin - devices without internal hard drives; all the work is done by the servers, and the devices only display the results (data).
3. Thick - devices that connect to the cloud (internet) using web browsers.

Datacenter: a set of servers on which files and applications are stored. It can be a room full of servers that run user requests, accessed via the internet [3]. Cloud computing also virtualizes servers, installing software that enables multiple instances of a virtual server to be used. In this way, many virtual servers can run on one physical server.

Distributed Servers: these servers are located in different places, but in a cloud environment they look as if they are running next to each other. This gives the service provider more flexibility in security and options. For example, if something goes wrong at one site in the cloud system, because the solution to a problem exists on many servers, the system can still be accessed through another site.

2.4 Cloud Models

Cloud computing offers resources as a service, divided into three classes: Software as a Service, Platform as a Service and Infrastructure as a Service.

2.4.1 Software as a Service (SaaS)

Software as a Service is a means of offering an application as a service to customers, who can access it via the internet [3]. It is a way of performing tasks where the software is the service itself.
One of the benefits of this type of service is that the service provider offers the software, and customers do not need to install or manage it, or buy hardware to run it, as the service provider handles everything.

The software can be accessed through a thin client interface (e.g. a web browser). In this model, service providers are able to capture data about customer behaviour while customers access the application. As the service provider has full control over the version, compatibility is guaranteed: all users get the same version of the software. The same happens with the infrastructure; the service provider has control over the infrastructure that runs the software, which reduces the cost of implementation and upgrades [4].

2.4.2 Platform as a Service (PaaS)

PaaS offers services with the required resources, such as developer tools, to build applications and services on top of the compute infrastructure, with no need to download or install any software [3]. These applications and services can involve web service integration, database integration, storage, scalability and application design. PaaS gives everything needed to create, implement, test and host software in the cloud. PaaS is built using one or more Infrastructure as a Service (IaaS) offerings, which stay invisible to the service providers that use PaaS. The service provider takes on the responsibility of maintaining and controlling the underlying cloud infrastructure, while the consumer keeps full control of the application [5]. This model offers services that represent a compromise between complexity and flexibility, allowing applications to be implemented quickly and loaded into the cloud with little configuration [6].

2.4.3 Infrastructure as a Service (IaaS)

IaaS does not offer an application service to customers as SaaS and PaaS do. IaaS provides hardware, on which customers can put anything they want [3]. IaaS offers access to virtual hardware resources, including virtual machines, network and memory. It allows consumers to deploy applications more efficiently by removing the complexities of managing their own infrastructure [5]. IaaS also allows consumers to build their own virtual cluster.
This allows service providers to purchase hardware resources and equipment that are shared among consumers and can be used for whatever they require.

Figure 2: Cloud Computing Service Model Diagram [26]

Figure 2 shows how cloud services work. Users interact with the cloud through the internet using laptops, mobile phones and tablets. All user requests are processed in the cloud. The services provided in the cloud work as a pyramid, where each service is built and run on top of another. Infrastructure as a Service offers the basic resources, which are used by the other services. Software as a Service, the application in the cloud, sits on top, as it uses resources from all the other services and runs on the resources provided by the Platform as a Service. Each service thus assists the other: a service provider developing an application to be executed in the cloud is assisted by the platform provider, which offers the tools necessary to develop and execute the application, while the platform provider in turn works with resources received from the infrastructure provider.

2.5 Types of Cloud

There are different methods of implementing applications in the cloud. Cloud computing is based on various types of clouds; the main types are private, public and hybrid.

2.5.1 Private Clouds

Private clouds are built exclusively for a single user (e.g. a company or organisation). In a private cloud, the company that owns it has total control of the infrastructure used and of the applications implemented in the cloud [7]. Normally a private cloud is built in a private datacenter. One benefit of having a private cloud is that data is not shared and can only be accessed by the organisation which owns it.

2.5.2 Public Clouds

Public clouds are those where the software of different users stays in a shared system and can be accessed by anyone in the cloud [8]. Public clouds can be bigger than private clouds, and they allow more scalability of resources. With this characteristic, public clouds reduce the need to buy additional equipment to cover temporary needs, moving the infrastructure risk to the providers of that infrastructure, as it becomes their responsibility to manage software updates, security patches, and so on [9]. It is also possible to give some private features to a public cloud for a single user by creating a virtual private datacenter, which gives that user greater visibility of the whole infrastructure. Public clouds are more efficient for temporary applications.

2.5.3 Hybrid Clouds

A hybrid cloud is a combination of two or more clouds that remain unique entities but are bound together, offering the benefits of each type of cloud [10]. A hybrid cloud can take on features of private and public clouds: it allows a private cloud to have its resources enlarged with resources from a public cloud. In hybrid clouds, some applications run exclusively in the public cloud, while the critical ones stay under the responsibility of the private cloud [11]. A hybrid cloud can be implemented either to serve a continuous demand or to satisfy a temporary one.
The quality of the implementation of a hybrid cloud determines its efficiency.

2.6 Cloud Applications

As cloud computing removes the need to buy expensive hardware to host large software applications, applications in the cloud are hosted in such a way that customers do not need to provide the server space; that is done by the service provider. Cloud computing has the power to deliver applications and to manipulate and share data. The most common applications in the cloud are based on storage and databases [3].

Storage: cloud storage has some positive things to offer its users. When storing data in the cloud, we can access it from anywhere, and we do not need to use the same computer to access it, as it can be reached from any device with an internet connection. Storage applications in the cloud include:
- Google Apps: a service that offers applications to edit documents (Google Docs), chat (Google Talk) and email (Gmail). Every resource is managed by Google; the client only needs to set up an account.
- Amazon S3 (Simple Storage Service): the best-known cloud storage service, built to make web-scale computing easier for developers. It offers a simple web service interface that can be used to store any data, at any time, from anywhere on the internet.
- YouTube: hosts millions of video files uploaded by its users.
- Panda Cloud Antivirus: an anti-virus program from Panda Software, where most of the work needed to search for and remove malware is done in the cloud.

Database: a repository for storing information, with links within the information that help make the data searchable [3]. Cloud computing allows multiple applications to connect to one database running on a cluster by using shared services. These applications are isolated from each other, with explicit portions of the database processing allocated to each application [12]. This service becomes Database as a Service, which avoids the complexity and cost of running our own database.
Database as a Service can offer some benefits: it is easy to use, there are no servers to provision and no redundant systems to worry about [3]. Database applications in the cloud include:
- SQL Server Data Services (SSDS): schema-free data storage, with SOAP or REST APIs.
- SQL Azure: part of the Windows Azure platform, a set of hosted services, infrastructure, web data and services. It offers the full relational database functionality of SQL Server, but working in the cloud as a computing service, hosted in Microsoft datacenters across the world.

2.7 Virtualization

Virtualization is one of the main elements of cloud computing. It matters to cloud computing because it is the means by which users access services in the cloud. It creates a virtual environment, with virtual machines, that hides the physical characteristics of the hardware [14]. Virtualization can be done in different ways: one approach allows a single server to be used as many virtual servers, while another lets multiple servers be used as one virtual server. Virtualization is considered full virtualization when a complete machine installation can run on another machine [3], which is how virtual machines run in a cloud environment. Virtual machines are also used to emulate operating systems on one platform, recreating the resources of the emulated platform on virtual hardware for each system. Paravirtualization is the technique that allows multiple operating systems to run on a single piece of hardware, sharing system resources such as processors and memory [3]. In full virtualization the entire system is emulated, but in paravirtualization the management module operates with an operating system that has been adapted to run in a virtual machine.

2.8 Data Management

One of the main reasons for adopting cloud computing is the benefit of transferring and processing large amounts of data. Data management and data processing play a big role in many cloud applications: as data stays stored in the cloud, it is necessary to provide satisfactory service performance, expressed in terms of latency and high availability, and to meet service level agreements regardless of the quantity of data and of workload changes.

2.8.1 Data Management Issues

Cloud computing has advantages, but for data management, developers can come across some issues when implementing applications.
When implementing an application, a developer in many cases has to deal with large sets of files. If the data is large, the developer has to distribute it across many systems and use parallel processing, to prevent the data from being processed on a single system, which would increase the time required to process it and offer low efficiency.
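The idea of splitting a large data set into chunks that are processed in parallel can be sketched as follows. This is a minimal single-machine illustration of the principle, not code from the project; the chunk size and data are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedProcessing {
    // Split a data set into fixed-size chunks, so each chunk can be
    // handled by a different system (here: a different thread) in parallel.
    static List<List<Integer>> split(List<Integer> data, int chunkSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < data.size(); i += chunkSize) {
            chunks.add(data.subList(i, Math.min(i + chunkSize, data.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) data.add(i);

        // Process each chunk in parallel; the "work" here is just a sum.
        long total = split(data, 3).parallelStream()
                .mapToLong(chunk -> chunk.stream().mapToInt(Integer::intValue).sum())
                .sum();
        System.out.println(total); // 55: same result, but the work was divided
    }
}
```

Dividing the work this way means no single system holds the whole data set, which is the point of the parallel approach described above.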

2.8.2 Data Storage and Query Processing

With a significant increase in data, and in requests to extract value from that data, service providers have to manage and analyse a large amount of data if they want to offer high performance in their services and isolate data in the cloud. As the data is managed in many partitions, it becomes hard to offer transactional guarantees such as atomicity and isolation. To deal with these problems, some solutions have been developed combining techniques such as MapReduce or parallel Database Management Systems (DBMS) [15]. The challenge is therefore to define an architecture focused on the query processing mechanism and on parallel file systems, such as the Hadoop Distributed File System (HDFS), to give an architecture with different levels of isolation. HDFS was inspired by GFS (Google File System), which offers reliable and efficient access to data using big clusters. GFS can also be used to measure the performance of a replication system on clusters. HDFS stores large files across various servers, achieving reliability by means of data replication. GFS works with three components: a master server, multiple clients and multiple chunk servers, where the chunks are stored in datacenters managed by servers [16]. MapReduce was introduced by Google, and it is a framework in which each task is done with 2 functions: Map and Reduce. The Map function receives a set of input files and, according to the user's specification, emits a set of tuples in a dictionary (key-value) format. The Reduce function receives the set of values associated with each key, and for each Reduce invocation it emits a set of tuples that are stored in output files. This project will encounter issues related to data storage in terms of data processing and response time.
The size of the data and the method of distribution is a challenge: people are interested in cloud computing because they can store large files, but at the same time they want this data processed quickly when they need it. Considering that this project will be implemented using simulators, it will be possible to design cloud environments that measure the time to transfer data of large sizes and to distribute data across datacenters. These techniques for storing and processing large data are currently used by organisations such as Facebook, Google, YouTube and Amazon, in order to offer better performance in their services, storing and processing hundreds of terabytes of data every day.
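As an illustration of the Map and Reduce functions described above, the classic word-count example can be sketched in plain Java. This is a simplified, single-machine sketch of the programming model, not the Hadoop API itself:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map phase: emit a (word, 1) tuple for every word in the input.
    // Reduce phase: for each key, sum the values associated with it.
    static Map<String, Integer> mapReduce(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                  // iterate over input records
            for (String word : line.split("\\s+")) { // Map: emit (word, 1)
                counts.merge(word, 1, Integer::sum); // Reduce: sum per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data in the cloud",
                                           "big files in the cloud");
        System.out.println(mapReduce(input));
    }
}
```

In real MapReduce the Map and Reduce phases run on different machines, with the framework grouping the emitted tuples by key in between; this sketch collapses both phases into one loop to show the data flow.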

2.8.3 Scalability and Consistency

Scalability has to be transparent to users, allowing them to store their data in the cloud without knowing the location of the data or the way it is accessed. However, as many solutions in the cloud focus on scalability, in general they offer only weak consistency of data, meaning that after the system is updated, if a user accesses it, there is no guarantee that the value returned will also be updated. This kind of consistency does not permit the development of a wide range of applications, such as online services that cannot work with inconsistent values. Some approaches use aspects of data storage and query processing to guarantee scalability, but the best way to solve this issue is to develop solutions that combine these aspects so as to improve system performance without compromising data consistency [15].

2.8.4 Database Management Systems (DBMS) with multi-tenants

In DBMS as a cloud service, tenants accessing the system and sharing resources can affect the performance of the system. Resource provisioning therefore has to be efficient, as the workloads of a DBMS offered as a service can vary when tenants access the system more frequently at certain moments.

2.9 Evaluation Performance Using Cloud Simulation

Cloud applications may offer many services, such as social networking, data storage, content delivery and web hosting, and the cloud provider needs to offer a cloud environment that responds to customers' needs. Data sets have to be evaluated and analysed by the service provider in order to offer efficient access to them and to replicate them on several servers. Because of that, cloud scenarios have to be evaluated to test the replication of data on several servers, data retrieval, and effective methods to send, process and access data. To perform these evaluations, simulation tools have been developed to reproduce tests in a cloud environment.
These simulation tools let service providers and cloud customers test their services in an environment that allows them to repeat the tests and keep control over them. These tests let service providers determine cloud service quality and quantity, and help optimise the evaluation of their services, as a simulation test is cheaper and faster than performing the same test in a real cloud environment [17]. To perform these tests, there are toolkits that allow service providers and users to simulate their services or applications in a cloud environment. At the moment, the toolkit most used in this area is the CloudSim framework.
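The kind of replication test mentioned above can be illustrated with a small sketch that places each data block on several servers. The round-robin placement, the replication factor of 3 and the cluster size are hypothetical choices for illustration; real systems such as HDFS use more elaborate, rack-aware placement policies:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacement {
    // Assign each block's replicas to distinct servers, round-robin,
    // so that losing one server never loses all copies of a block.
    static List<List<Integer>> place(int blocks, int servers, int replicas) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> targets = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                targets.add((b + r) % servers); // spread replicas across servers
            }
            placement.add(targets);
        }
        return placement;
    }

    public static void main(String[] args) {
        // 4 blocks, 5 servers, replication factor 3:
        // block 0 -> servers [0, 1, 2], block 1 -> [1, 2, 3], ...
        System.out.println(place(4, 5, 3));
    }
}
```

A simulation can then vary the number of servers or the replication factor and observe the effect on availability and storage overhead, which is exactly the kind of experiment the evaluation tools above support.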

iCanCloud is a platform that models and simulates cloud computing systems; it predicts trade-offs between performance and cost for a set of applications executed on given hardware, and provides information about both. The GreenCloud simulator is an extension of the NS2 network simulator that simulates cloud environments; GreenCloud provides users with a detailed model of the energy consumed by the elements of a datacenter, such as switches and servers. CloudSim is a simulation toolkit that reports time, power and traffic consumption; it is based on the Java platform, building on already-developed modules such as SimJava and GridSim [18]. The HDFS simulator is designed in Java; HDFS itself is a distributed file system designed to run on commodity hardware, and it is highly fault-tolerant. Hadoop is a Java-based framework with two subsystems: the Hadoop engine, which executes MapReduce applications, and HDFS, which handles data management and access. As explained in Chapter 1, CloudSim and the HDFS simulator are the ones used in this project. This project focuses on data transfer, and CloudSim offers the tools necessary to simulate users sending data to the cloud and to calculate the time taken to transfer and store it. The HDFS simulator was designed to simulate cloud systems dealing with large data; as the second part of the project concerns data processing, and the HDFS simulator can help developers understand and implement environments that store and process large data files, distributing data across the datacenter while avoiding failures, it was used as the simulator framework for data processing.

2.10 Data Transfer

2.10.1 CloudSim

CloudSim is a simulation framework that allows the modelling of cloud experiments using simulated infrastructures and application services [17].
As a framework, CloudSim offers support to model and simulate large-scale cloud nodes, and a platform to model datacenters, service brokers, scheduling, and allocation policies. CloudSim also has some notable features: a virtualization engine, which deals with the creation and management of multiple, independent, co-hosted virtualized services on a datacenter node; and the flexibility to switch between space-shared and time-shared allocation of processing cores to virtualized services [17]. These features can speed up the development of algorithms, protocols and ways of implementing them in cloud computing.
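The basic quantity such a simulation reports for data transfer can be sketched with a simple model. The numbers are hypothetical, and the model deliberately ignores contention and protocol overhead: transfer time is estimated as network latency plus file size divided by bandwidth:

```java
public class TransferTimeModel {
    // Estimate the time (in seconds) to move a file to the cloud:
    // one-off latency plus size divided by available bandwidth.
    static double transferTime(double fileSizeMB, double bandwidthMBps, double latencySec) {
        return latencySec + fileSizeMB / bandwidthMBps;
    }

    public static void main(String[] args) {
        // A 1000 MB file over a 100 MB/s link with 50 ms latency
        double t = transferTime(1000.0, 100.0, 0.05);
        System.out.printf("estimated transfer time: %.2f s%n", t); // 10.05 s
    }
}
```

A simulator refines this picture by modelling many concurrent transfers, shared links and allocation policies, but the per-file estimate above is the quantity the experiments in later chapters measure.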

CloudSim Architecture

Figure 3: CloudSim Architecture [17]

Figure 3 shows the CloudSim architecture and its layers. The bottom layer contains the simulation engine, which is responsible for the operations that create, manage and delete simulation entities [17]. The next layer contains the main classes used to implement the framework and is composed of several modules. The network module maps client requests to datacenters and calculates possible delays of messages between datacenters and clients. The cloud resources module manipulates and coordinates simulation events, and also manages data about the infrastructure provided by the simulated datacenter. The cloud services module covers virtual machine provisioning and the allocation of resources such as system memory, data storage and communication bandwidth [23]. The virtual machine services module manages and executes the cloudlets (tasks) sent by clients.

Just above sit the user interface structures, through which communication between entities occurs; as this is done through an interface, virtual machines and cloudlets can be manipulated. The top layer represents the code that the user of the framework writes to create the simulation environment. The scheduling policy governs the creation of decision policies and schedulers, which guide the simulation process [23]. It uses a decision component called the broker, but CloudSim also allows the implementation of policies for allocating virtual machines to hosts within the same datacenter, for scheduling virtual machines on hosts, and for scheduling cloudlets on virtual machines.

CloudSim Usability

To use CloudSim, users need a background in Java, as the toolkit is written entirely in Java. With Java knowledge, users can write code using elements of its library to develop experiments simulating the desired scenario. Using CloudSim is not just a matter of writing code, changing parameters, running the program and collecting results; it requires an understanding of how the simulator works.

CloudSim Capabilities

CloudSim ships with its source code, which can be extended to make the simulator suit the problem the user wants to solve, so users are free to make changes to the source, adding classes that adapt CloudSim to a specific scenario. CloudSim is flexible and requires little time and effort to perform tests and simulations in the cloud. It can simulate both small-scale scenarios and large-scale scenarios with many datacenters, with little overhead in initialisation time and memory consumption.
It also uses virtualisation to create many virtual services, each managed on a node of a datacenter.

CloudSim Limitations

CloudSim is a powerful tool for simulating cloud computing environments, but it is not a toolkit that can be used by setting parameters alone: users need to write Java code to access its library. It also does not support every cloud scenario, which requires extensions as discussed earlier. A user who is not familiar with the Java language will not be able to use the simulator directly; for that reason CloudReports, an extension of CloudSim, was developed to let users perform CloudSim simulations through a simple graphical interface.

2.11 Data Processing

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS), like every distributed file system, was designed to run across machines and to let users store and process data on many different pieces of hardware, normally connected by a Local Area Network (LAN). HDFS differs from typical distributed file systems in that it is highly fault-tolerant and can be deployed on hardware that is not particularly powerful [19]. Several distributed file system implementations exist today, such as the Gluster File System (GlusterFS), Moose File System (MooseFS), General Parallel File System (GPFS), OpenAFS, Network File System and Google File System (GFS).

HDFS is an open-source framework written in Java that uses a master-slave architecture. An HDFS cluster has a primary NameNode, used as the master server, which is responsible for managing the file system namespace; in addition it regulates access to data by clients [20]. The NameNode oversees a number of DataNodes, normally one DataNode per machine. A DataNode manages the storage attached to the machine it runs on. The file system namespace is used in HDFS to let data be stored in files [20]. Each file is divided into blocks, which are distributed across DataNodes, supporting parallel processing. Blocks are replicated on many DataNodes, which prevents processing from stopping when a node failure occurs [21]. The more nodes HDFS uses, the higher the probability that one of these nodes fails. To protect the system against failure, DataNodes receive copies of blocks: HDFS keeps three replicas of each block, two of which go to nodes sharing the same rack, while the third goes to a node on a different rack.

HDFS Goals

a) Hardware Failure

HDFS consists of many server machines, each storing part of the files in the system.
Given the huge number of elements (machines) in HDFS, and the fact that each element has a non-trivial probability of failure, some elements of HDFS will always be non-functional [21]. One of the HDFS goals is therefore a core architecture capable of detecting faults and recovering from them quickly and automatically.
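The rack-aware placement described above (two replicas on one rack, a third replica on a different rack) can be sketched as follows. This is a toy illustration, not the HDFS implementation: the cluster layout and function name are hypothetical, and real HDFS uses a more elaborate policy.

```python
import random

def place_replicas(racks):
    """Toy version of the replica placement described in the text:
    two replicas on one rack, the third on a different rack."""
    rack_a, rack_b = random.sample(list(racks), 2)   # two distinct racks
    first, second = random.sample(racks[rack_a], 2)  # two nodes, same rack
    third = random.choice(racks[rack_b])             # one node, other rack
    return [first, second, third]

# Hypothetical cluster: two racks with three DataNodes each.
cluster = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
replicas = place_replicas(cluster)
```

Whatever random choices are made, the result always contains three distinct DataNodes spanning exactly two racks, mirroring the trade-off between fault-tolerance (surviving a whole-rack failure) and write cost (only one inter-rack transfer).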

b) Streaming Data Access

Applications running on HDFS require streaming access to their data sets. They have a specific purpose and run against a specific file system, following a design aimed at batch processing rather than interactive use. HDFS puts less emphasis on low latency and more on throughput of data access.

c) Large Data Sets

HDFS was designed to support large data sets. A file in HDFS is typically big, from gigabytes to terabytes in size. To support data of this size, HDFS offers high aggregate data bandwidth and can scale to hundreds of nodes in a single cluster. HDFS is able to hold millions of files in one instance [21]. HDFS is also a good solution for people working with data such as blogs, where they deal with large volumes of unstructured files and have no idea in advance of how the data will be used: HDFS lets them store data even when it is not well structured, and process it where it is stored. Google is a real example of a system dealing with large data sets, storing hundreds of terabytes a day.

Block Division

Some files are too big to be stored on a single hard drive. A way around this problem is to divide files and distribute them across many machines. The file distribution is done transparently; the developer using HDFS only has to set the correct configuration parameters. Before storing a file, HDFS divides it into a sequence of blocks of equal size. The default size is 64 megabytes, which can be changed if necessary. This block size is much bigger than that used by other file systems, which commonly use 512 bytes per block. After splitting a file, HDFS starts the distribution, addressing blocks to different nodes.
If the data assigned to a block is not big enough to fill the space reserved for it, the remaining space is not wasted: it can be used for other data.
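The division scheme can be illustrated with a short sketch (the function name is hypothetical): a file is cut into fixed-size blocks, and the final, partial block simply holds the remainder.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the 64 MB default mentioned above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes
    would be divided into; the last block may be smaller."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
# -> three 64 MB blocks plus one final 8 MB block
```

The partial final block is what the paragraph above refers to: its unused space on disk is not reserved exclusively for that block.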

Architecture

An HDFS cluster consists of one NameNode, the master server, usually deployed on one exclusive node, the node with the best performance. It is responsible for managing the file system namespace and regulating how files are accessed by clients. There are many DataNodes, normally one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS stores user data in files. It then performs block division, storing blocks across the set of DataNodes. File system namespace operations, such as renaming, opening and closing files and directories, are executed by the NameNode [21]. Serving read and write requests from clients' file systems is the responsibility of the DataNodes, which also perform block creation, replication and deletion as instructed by the NameNode.

The first communication between master and slave occurs when a DataNode registers with the NameNode. The DataNode then reports to the NameNode, sending information about the blocks it stores as well as about its local changes. This communication between NameNode and DataNodes is crucial in helping the NameNode decide which nodes will store a given block. If the NameNode is unable to receive information from a DataNode, that DataNode is asked to register again.

Figure 4: HDFS Architecture [21]

Figure 4 shows the architecture of HDFS, in which the NameNode and DataNodes are software components running on machines that typically use the GNU/Linux operating system. As HDFS is written in Java, it can be deployed on any machine that supports Java, so any such machine can run a NameNode or a DataNode. A typical HDFS deployment dedicates one machine to running the NameNode, while each of the other machines runs one DataNode. The NameNode also acts as the arbiter for all metadata. HDFS is designed so that user data never flows through the NameNode [21].

Data Replication

As HDFS splits files into blocks, it also replicates these blocks to increase the durability and availability of files. By default HDFS keeps three replicas, allocated on different nodes. Since communication between machines on the same rack is faster than between machines on different racks, when selecting a replica for a process HDFS gives priority to replicas on the same rack [22]. One of the biggest benefits of replication is that the system becomes highly fault-tolerant and reliable: if a node fails, the process is executed by another machine containing a replica of the block, without any need to transfer data or interrupt the application. All of this happens transparently, as Hadoop offers mechanisms to restart the process without anyone noticing that a node failed during execution [22]. A fault decreases the number of replicas of a block, so to restore reliability the NameNode consults metadata about DataNode faults and restarts the replication process on other DataNodes.

2.12 Related Work

In the literature review I examined work based on experiments in CloudSim that evaluated simulations using space-shared and time-shared policies. The authors sent sets of tasks to virtual machines, in groups of 50 [23].
They collected the results and found that in the space-shared experiments every task completed after 20 minutes, as the number of tasks had no effect on the execution time of a single task. In the time-shared experiments, however, the execution time of a single task was affected as the number of tasks submitted to the virtual machines increased. The first set of 50 tasks completed before the others: at the start of execution the hosts were not overloaded, so tasks executed quickly, and as tasks completed, hosts became available for more tasks to be executed [23].
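The behaviour reported in [23] can be reproduced with a toy model (not CloudSim itself; the function name and unit task length are hypothetical): under space-sharing the first tasks finish early and each task's own run time is constant, while under time-sharing every task on a VM shares the CPU and finishes only when the whole batch does.

```python
def completion_times(n_tasks, n_vms, task_len, policy):
    """Toy model of when tasks finish on a pool of VMs.
    space-shared: each VM runs one task at a time, queueing the rest.
    time-shared: all tasks on a VM share the CPU equally, so every
    task finishes together, later under higher load."""
    per_vm = -(-n_tasks // n_vms)  # tasks per VM (ceiling division)
    if policy == "space":
        # one finish time per queue slot: slot k ends at (k+1)*task_len
        return [task_len * (slot + 1) for slot in range(per_vm)]
    else:  # "time"
        # all tasks on a VM share the CPU and finish simultaneously
        return [task_len * per_vm]

# 50 tasks on 10 VMs, unit task length:
space = completion_times(50, 10, 1.0, "space")
time_shared = completion_times(50, 10, 1.0, "time")
```

In the space-shared run the first batch finishes after one task length while the queue drains in rounds; in the time-shared run a single task's effective execution time grows with the number of tasks, matching the observation above.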

Another piece of work, using the HDFS simulator, ran an experiment to check whether the number of replicas affects system performance. The authors calculated the mean time taken to repair the system when failures occur, comparing one experiment that replicated each block three times with another using 8 replicas per block. They observed that as the replication level increases, the mean time to repair the system increases too, because a higher replication level means more blocks per node [24]. If a node fails, the system has to re-replicate more blocks than it would with fewer replicas per block.

2.13 Summary

This chapter went through topics related to cloud computing and its architecture, as well as some applications developed and used around the world. The next chapter covers how this literature review is used and how the experiments in this project are designed, including the use of HDFS and CloudSim in data management.
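The effect reported in [24] — a higher replication level means more block copies per node, hence more re-replication work after a failure — can be shown with a toy estimate (all figures hypothetical):

```python
def blocks_to_rereplicate(total_blocks, n_nodes, replication):
    """Toy estimate of re-replication work after one node fails:
    with r replicas of each block, a node holds on average
    total_blocks * r / n_nodes block copies, all of which must be
    copied elsewhere to restore the replication level."""
    return total_blocks * replication / n_nodes

work_r3 = blocks_to_rereplicate(1000, 100, 3)  # 3 replicas per block
work_r8 = blocks_to_rereplicate(1000, 100, 8)  # 8 replicas per block
```

With the same cluster size, raising the replication factor from 3 to 8 raises the average number of copies lost with a single node, and with it the repair time, as the cited experiment found.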

Chapter 3 Design

3.1 Introduction

With the use of cloud computing, people constantly need to find ways to deal with large data files. Many organisations collect billions of bytes of data every day and need to manage it to satisfy their users' demands. Recently, with the increase in cloud computing adoption, some companies want to gain the benefits of cloud technologies by transferring their data to the cloud; they then want to process this data in the cloud as well. One of the biggest motivations for sending data to the cloud is the need to have large data files stored and accessible from anywhere, so people decide not only to transfer data to the cloud but also to have it processed there. A big issue when transferring data to the cloud is the time taken to transfer and then process it, as nowadays requests demand that tasks be completed in as short a time as possible.

3.2 Methodology

The methods and data used to design this project were mostly acquired through web sites and a few books. All the material was analysed and then studied in depth, in order to progress with the project by choosing the material best suited to the problem and its solution. The experiments in the project were designed by relating the background research to the scenario needed to solve the problem presented. To test performance, two simulators were used: one to check transfer time and another for data processing time.

3.3 Example Problem

Recently PC makers have raised storage capacity to the terabyte scale. This development has made users' lives easier, as they can now benefit from having more capacity to save their files. The biggest problem is therefore not the size of data to store, but the quantity of data that a system can process. Suppose we have a company offering a web service whose system is accessed daily by thousands of users, sending, reading and writing data in short slots of time. To deal with this, the organisation needs a system architecture with methods capable of classifying data, enabling the system to process it quickly and efficiently. Many companies use techniques based on activity logs, writing user requests to log files to record the different tasks performed by each user. This technique can cause serious problems if used in the wrong way: an organisation can accumulate a lot of data in a short time and become unable either to process or to organise it. Data is then wasted, losing the opportunities it would offer to improve the performance of the system. Moreover, everything would be done serially, implying that data is processed on an individual system with a single point of failure.

3.4 Solution

To solve the problem presented above, the best approach is to develop distributed parallel environments, which provide higher performance than serial environments. With parallel computation, an application can be executed concurrently across different elements of a machine, regardless of whether that machine is distributed or multi-processor. With this technique it is possible to build a system with many points handling failures, which increases the system's fault-tolerance.
To address this problem, many concepts and strategies have been developed to offer simple and efficient parallel computation techniques. With these techniques it is possible to create a robust system able to meet users' demands. The first idea is to implement a system able to process data in a short time, with many points handling possible failures, and capable of running on low-cost machines. This is where HDFS comes in, as it is designed to process large data files in a short time while offering a highly fault-tolerant system.

3.5 Data Transfer

Data transfer is the first task of the project, as good methods are necessary to transfer data to the cloud in order to store it and then process it as the user requires. In the cloud, data can be sent by many users, and it is necessary to organise data while transferring it, addressing data sent by users to machines in the datacenter for better system performance.

Application of CloudSim

CloudSim helped this project predict the time taken to receive sets of data, giving an idea of how long a system would take to receive a certain amount of data and store it in datacenters. CloudSim was used in the first part of this project: as the first task was to measure data transfer times, CloudSim provided the time taken to transfer sets of data. Each set had at least 50 files, and tests were performed using different data sizes and varying the number of files in each set. CloudSim was a good choice for this part of the project, as it can simulate a cloud environment using virtualization, simulating a datacenter, the hosts within it, and virtual machines. The experiment creates datacenters, and each datacenter receives a number of virtual machines used to run the users' requests, the cloudlets. Cloudlets are created with their own ids representing users, and each holds files; sets of cloudlets are sent to virtual machines, where each virtual machine runs 3 or more cloudlets according to the number of cloudlets sent and the number of virtual machines in the datacenter. Datacenters are responsible for hosting virtual machines and storing the data received from cloudlets. The cost of using each datacenter is reported, as well as the time to transfer files to it. But the project is not only about transferring data to the cloud; the second part concerns data processing. Once data has been transferred to the cloud, a user may require some processing on it.
The HDFS simulator therefore becomes essential: CloudSim was not designed to process large data files, which creates the need for a system that supports them, processes data quickly, and reports the time taken to do so. Basically, CloudSim deals with moving data to the cloud, and the HDFS simulator processes the data in the cloud.

Figure 3.1: Data Transfer Process (users send cloudlets carrying data to virtual machines allocated on the hosts of a datacenter)

The datacenter entity in CloudSim manages a number of hardware elements, called hosts in the simulator. Each host (physical machine) allocates one or more virtual machines, according to the virtual machine allocation policy set by the cloud provider. In the simulator this policy controls operations related to the life cycle of virtual machines, such as assigning a virtual machine to a specific host, creating virtual machines and destroying them. A datacenter can manage several hosts, which in turn manage the life cycles of virtual machines. A host is the element of the simulator representing a server in the cloud; it is given a processing capacity in millions of instructions per second, memory, storage space and a policy for allocating CPU to virtual machines. Interfaces are implemented on the hosts to support modelling and simulating virtual machines with one or more CPUs. The allocation of virtual machines is the process of creating instances on hosts that match critical characteristics (such as storage and memory capacity), configuration and requirements from the cloud provider.
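The allocation process described above can be sketched as a toy first-fit policy (names and capacities hypothetical; CloudSim's own policies are richer), matching each virtual machine with the first host that still has enough free RAM:

```python
def allocate_vms(hosts, vms):
    """Toy first-fit VM allocation: place each VM on the first host
    with enough free RAM, or record None if no host can take it."""
    placement = {}
    free = dict(hosts)                 # remaining RAM per host
    for vm, ram in vms.items():
        for host in free:
            if free[host] >= ram:
                free[host] -= ram      # reserve capacity on this host
                placement[vm] = host
                break
        else:
            placement[vm] = None       # allocation failed
    return placement

# Hypothetical cluster: two hosts with 4096 MB of RAM each.
hosts = {"host1": 4096, "host2": 4096}
vms = {"vm1": 2048, "vm2": 2048, "vm3": 2048, "vm4": 4096}
plan = allocate_vms(hosts, vms)
```

Here vm1 and vm2 fill host1, vm3 falls through to host2, and vm4 cannot be placed at all — the same kind of life-cycle decision (create, assign, reject) that the simulator's allocation policy makes.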

At each level, CloudSim implements two provisioning policies, time-shared and space-shared. Thus there are four possible combinations of the default policies implemented by CloudSim:

1. Space-shared for virtual machines and space-shared for hosts: cloudlets are executed one by one inside the virtual machines, and virtual machines are also executed one by one, so there is a queue of virtual machines to execute, and cloudlets queue inside each virtual machine.
2. Space-shared for virtual machines and time-shared for hosts: cloudlets are executed one by one in virtual machines, but virtual machines run in parallel inside the host.
3. Time-shared for virtual machines and space-shared for hosts: cloudlets run in parallel inside the virtual machine, but virtual machines are executed one by one inside the host.
4. Time-shared for virtual machines and hosts: cloudlets and virtual machines are executed in parallel.

3.6 Data Processing

A very efficient solution to this problem is HDFS, as it is capable of processing a large quantity of data in short time slots. Using its design, the user only needs to develop the required application, without worrying about data handling or the distribution of tasks across the system. To improve performance, the system splits tasks into individual parts (blocks). This division helps to contain possible problems in the system, supporting activities such as parallel execution and controlling the complexity of the data. It gives companies adopting this system an easy way to process data through a powerful, highly fault-tolerant system that distributes data automatically, with data management built in. The second part of the project therefore concerns processing the data transferred to the cloud.
The HDFS simulator plays an important role here, as it was designed to simulate scenarios using large data files, and its performance is intended to resemble that of HDFS in the real world. It comes with the tools needed to simulate a data processing scenario, using the master-slave architecture; and since HDFS is part of Hadoop, which builds on MapReduce, it also maps data across the system. With the HDFS simulator, experiments were implemented to simulate large data files being processed in the cloud. Basically, data is split into blocks, each holding 64 megabytes of the file by default. The HDFS simulator also provides the average time to repair the system when failures occur.

Once the file is split and every block is ready, the system (the HDFS simulator) distributes these blocks across nodes (DataNodes). While distributing the blocks it also replicates them, keeping three replicas per block. DataNodes are instances registered with the NameNode. For the experiment, several scenarios were simulated and tested, with different data sizes, varying the number of DataNodes in the system and the size of blocks from 32 megabytes to 128 megabytes, in order to find the scenario with the best response time for processing data.

Figure 3.2: Data writing process [27] (the user asks the NameNode to write file A; the NameNode directs blocks 1-3 to DataNodes 1, 2 and 3)

In figure 3.2 the user requests to write a file to the cloud; the file is broken into smaller blocks, and the NameNode allows these blocks to be written to machines (DataNodes) across the cluster. As the size of the file grows, more blocks are created and used, and more DataNodes work on the file in parallel. At the same time, the DataNodes replicate the blocks, placing the same data on multiple machines across the cluster.

Figure 3.3: Data reading process [27] (the user asks to read file A and the NameNode returns the block locations, e.g. Blk 1 on DN1, DN2, DN6; Blk 2 on DN2, DN3, DN4; Blk 3 on DN1, DN3, DN5)
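The write and read interactions sketched in the figures can be captured with a toy NameNode (a hypothetical class, not the simulator's API) that only records which DataNodes hold each block; the data itself never passes through it.

```python
class ToyNameNode:
    """Minimal sketch of the NameNode's bookkeeping role: it maps
    each block of a file to the DataNodes that should hold it."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.blockmap = {}  # (filename, block_no) -> [DataNode names]

    def write(self, filename, n_blocks):
        for b in range(n_blocks):
            # round-robin placement of each replica (simplified; real
            # HDFS uses rack-aware placement)
            targets = [self.datanodes[(b + i) % len(self.datanodes)]
                       for i in range(self.replication)]
            self.blockmap[(filename, b)] = targets

    def read(self, filename, n_blocks):
        # return, per block, the DataNodes the client should contact
        return [self.blockmap[(filename, b)] for b in range(n_blocks)]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"])
nn.write("fileA", 3)
locations = nn.read("fileA", 3)
```

As in the figures, the client then talks directly to the listed DataNodes to move block data, while the NameNode handles only metadata.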

Figure 3.3 shows the user asking to read file A; the NameNode searches for the DataNodes holding chunks of file A, and once it has the list of DataNodes with file A's blocks, it sends the location of each block in the cluster. The user thus gets the addresses of the DataNodes holding the blocks and reads one block at a time.

3.7 Performance

A major purpose of the project was to measure the performance of cloud simulators for data management, so an important point was choosing the best measurement method. Many metrics could have been used, but only those related to the problem presented were adopted. The most suitable way to measure simulation performance for this problem was execution time, in both simulators. Although it is the simplest measure of performance, many execution times were collected during the project: one simulator was used to measure execution time for moving data, and the second to measure execution time for data processing. A further performance measurement is the average time to repair the system when there are failures: the simulator counts time while repairing the failed nodes and returns the mean time to repair all failed nodes in the system. This is a useful measurement for simulating and predicting possible node breakdowns in a real-world system.

3.8 Summary

This project has a simple design and is straightforward in terms of performance metrics, as it is based on evaluating and testing the simulators' performance. The principal aim was to investigate the performance of the simulators, so less time was spent on design than on performing tests.

Chapter 4 Implementation

4.1 Introduction

This chapter covers simulation scenarios using CloudSim and the HDFS simulator, aiming to evaluate the methods for data transfer and data processing that each simulator offers, with regard to response time and storage capacity. Several scenarios were deployed, simulating tasks related to the problem presented in chapter 3. Each simulator played a role in a specific scenario, providing the tools necessary to simulate and evaluate such tasks in the cloud. Some scenarios were simulated using virtualization and others by simulating a distributed system service.

Simulations in CloudSim were performed as a series of experiments, each with different parameter values. The parameters in CloudSim include the number of virtual machines (VMs), number of cloudlets, file size, number of CPUs per virtual machine, and number of hosts in the datacenter. Cloudlets were responsible for sending the files to virtual machines. The default parameter values were 8 hosts, 10 virtual machines, 50 cloudlets, 1 terabyte per file, and 1 CPU per VM. These were chosen as the minimum values and were incremented for each experiment, so that the datacenter scaled up as the experiments increased each parameter's value.

The HDFS simulator covered the simulation experiments for data distribution and processing. It used the number of blocks to represent file size, for which input and output streams were simulated. The main parameters of the HDFS simulator used in the experiments were: number of DataNodes, number of blocks (file size) and block size. A minimum value was also set for each parameter: 1000 DataNodes, 1 terabyte per file, and 32 megabytes per block for the first experiment, values that were then increased to measure the performance of the distributed file system with more nodes and varying block sizes.

4.2 Experiments

CloudSim Simulations

Scenario 1: Increasing file size and number of cloudlets

Objectives and Design

The aim of this experiment is to evaluate how an increasing number of service requests influences the performance of the system, in terms of the time taken to store the files. The experiment also contributes to measuring the maximum storage CloudSim can handle. The parameters for this scenario started at the default values stated in section 4.1. These values were increased to enable a better analysis of the variation in data transfer time. The experiment used the time-shared policy for cloudlet allocation on virtual machines. Cloudlets were created, each holding a file and requiring 1000 instructions for its execution on the hosts. The number of hosts in the datacenter and the number of virtual machines stayed the same, 10 VMs with 1 CPU each, but the number of cloudlets and the file size were increased several times, in order to observe when CloudSim stops transferring files and to obtain the average time to transfer a file of a given size.

Results

Figure 4.1: Comparison of time taken to transfer files, varying size from 1 to 10 terabytes (one file per cloudlet; 50, 100 and 150 cloudlets)

Figure 4.2: Comparison of time taken to transfer files of different sizes, using 50 and 100 cloudlets (one file per cloudlet, 8 hosts, 10 VMs, 1 CPU)

Figure 4.3: Comparison of time taken to transfer files of different sizes, using 150, 200 and 250 cloudlets (one file per cloudlet, 8 hosts, 10 VMs, 1 CPU)

The experiments varying the number of files to be stored in the datacenter were performed and the results collected. The parameters included 10 virtual machines and 50 cloudlets, each cloudlet sending a file to the virtual machines, with the file size increasing from 1 terabyte to 20 terabytes. The last experiment sent 250 cloudlets with 14 terabytes per file. From the results of each experiment, graphs were produced showing how time changes with the parameter values. The results in figure 4.1 show that time increases as the number of files sent to the datacenter increases. However, the time difference between the groups of cloudlets is very small, as they take almost the same time to transfer the data, so the number of cloudlets does not influence the system's performance in terms of time. The graphs show that the time taken by the group of 250 cloudlets transferring 14-terabyte files is roughly the same as the time taken by 150 cloudlets sending files of the same size. Since the number of virtual machines was the same for every set of cloudlets, and each virtual machine runs one cloudlet at a time, the performance of the virtual machines per set of cloudlets does not change. It would have been reasonable to stop at a file size of 14 terabytes, when not every file could be sent to cloud storage, but one purpose of this scenario was to find the file size that CloudSim could not handle. Looking again at figure 4.3 and observing the plot for 150 cloudlets, it only goes up to 14 terabytes: CloudSim was able to send sets of 150 files from 1 terabyte up to 14 terabytes, but when trying to send 150 files of 15 terabytes it only sent 59 files, as it reached the storage capacity. The same happened with the sets of 200 and 250 cloudlets.
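The cut-off behaviour — files transfer until the datacenter storage is exhausted, after which the remainder are dropped — can be expressed as a simple capacity check; the capacity figure below is hypothetical, not the one used in the experiment.

```python
def files_stored(n_files, file_size_tb, capacity_tb):
    """Toy capacity check: how many whole files of file_size_tb
    terabytes fit into capacity_tb terabytes of storage."""
    return min(n_files, capacity_tb // file_size_tb)

# With a hypothetical 100 TB datacenter, 150 files of 15 TB each:
stored = files_stored(150, 15, 100)  # only 6 files fit
```

Once the file size crosses the point where `n_files * file_size_tb` exceeds the capacity, the number of files actually stored is capped by the storage rather than by the number of cloudlets sent, which is the pattern observed in the experiment.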

Scenario 2: Increasing the number of virtual machines

Objectives and Design
The aim of this experiment is to evaluate how increasing the number of virtual machines running the cloudlets sent by users influences the average time taken to transfer files and store them in the datacenter. This scenario used fewer cloudlets than the previous one, while the number of virtual machines was increased in order to observe the variation in the transfer time. The experiment used the time-shared policy for allocating cloudlets to virtual machines. The number of hosts and CPUs stays at the default. Each cloudlet sends one file, and cloudlets were modelled as requiring 1000 instructions for their execution on the hosts. The number of virtual machines hosted in the datacenter increases for the same number of cloudlets and file size, in order to check how this reduces the time taken to transfer files.

Results
The experiments varied the number of virtual machines hosted in the datacenter, collecting results to give a better analysis of the file transfer performance in the cloud. The parameters for the first experiment included 20 hosts, 10 virtual machines with 1 CPU each, and 500 cloudlets, each cloudlet sending one file of 20 terabytes to a virtual machine; the number of virtual machines was then increased from 10 to 500. Using the results collected from each experiment, a graph was produced to show how the time changes with the parameter values.

[Plot: time (secs) against number of virtual machines; 500 cloudlets; files of 8, 10 and 12 terabytes] Figure 4.3: Comparing time taken to transfer files while increasing the number of virtual machines

Increasing the number of virtual machines hosted in the datacenter influenced the transfer time: more resources are available and the cloudlets are better distributed across the virtual machines, so each virtual machine runs fewer cloudlets, which results in less time for each virtual machine to store its cloudlets' files in the datacenter. Creating more virtual machines therefore increases system performance, as it reduces the transfer time. The results in figure 4.3 show that the average time to transfer a file of 12 terabytes is highest with a cluster of 10 virtual machines and falls as the number of virtual machines increases. However, from around 300 virtual machines onwards the time taken remains the same in every plot: even when the number of virtual machines keeps increasing, the performance does not change. At this point the virtual machines no longer influence the transfer time, because more hardware resources are needed to improve their performance. With many virtual machines running on only a few physical machines, the best approach is to reduce the number of virtual machines allocated to each physical machine, distributing them across the machines in the cluster and enabling virtual machine migration, which gives better performance for both the hardware and the virtual machines.
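The plateau at around 300 virtual machines can be modelled with a simple sketch (an illustrative model, not the CloudSim scheduler; the 15 cores per host and 100 seconds per file are hypothetical values): once the number of virtual machines exceeds the total CPU cores the hosts can offer, extra virtual machines add no real parallelism.

```python
import math

def transfer_time(num_cloudlets, num_vms, num_hosts, cores_per_host, secs_per_file):
    # A host can give full speed to at most one VM per core, so VMs
    # beyond the total core count contribute no extra parallelism.
    effective_vms = min(num_vms, num_hosts * cores_per_host, num_cloudlets)
    waves = math.ceil(num_cloudlets / effective_vms)  # batches of files
    return waves * secs_per_file

print(transfer_time(500, 10, 20, 15, 100))   # 5000
print(transfer_time(500, 300, 20, 15, 100))  # 200
print(transfer_time(500, 400, 20, 15, 100))  # 200 (plateau: cores exhausted)
```

In this toy model, 20 hosts with 15 cores each cap the effective parallelism at 300, so 300 and 400 virtual machines give the same time, matching the flat region of the plot.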

Scenario 3: Increasing the number of hosts in the datacenter

Objectives and Design
The aim of this experiment is to evaluate how the number of physical machines in the datacenter affects the performance of the virtual machines, and hence the time to transfer and store files in the cloud. The experiment used the time-shared policy for allocating cloudlets to virtual machines. Cloudlets were created, each holding one file, and were modelled as requiring 1000 instructions for their execution on the hosts. The file size remains the same at 20 terabytes per file, while the number of hosts and virtual machines increases, in order to observe how the virtual machines contribute to a better performance and an improved file transfer time.

Results
Tests varying the number of virtual machines and hosts in the datacenter were performed and a collection of results was made. The parameters for the first test included 10 virtual machines and 1000 cloudlets. The last test used 1000 cloudlets, 500 virtual machines and 400 hosts.

[Plot: time (secs) against number of virtual machines; transferring files of 12 terabytes; 500 cloudlets with 10 hosts and 500 cloudlets with 400 hosts] Figure 4.4: Comparing time taken to transfer files, using 10 and 400 hosts in the datacenter

Figure 4.4 shows the performance of the experiment when increasing the number of hosts to 400. The graph shows that having more machines in the datacenter does not directly affect the data transfer time: it compares the experiment from the previous scenario, running 500 cloudlets on a datacenter with 10 hosts, against the case of 500 cloudlets sending files to virtual machines running on 400 hosts. The performance of both cases remains the same, as they take the same time to send the files while the number of virtual machines varies. The difference appears at 300 virtual machines, where the cluster with 10 hosts becomes overloaded. The datacenter reaches its limit and the hosts do not offer enough memory for the virtual machines to run the cloudlets and store the files with good performance. Since there is no more space on any host, when a virtual machine lands on an overloaded host it cannot be migrated to an under-loaded one. The case with 500 cloudlets and 400 hosts behaves differently: from the point where 300 virtual machines are in use, the time keeps reducing. This happens because there are enough hardware resources to feed the virtual machines, so when a host becomes overloaded, virtual machines can migrate to a host with enough space to accommodate them. The virtual machines therefore get sufficient memory and are reallocated according to their computing resources.

Scenario 4: Increasing file size using the space-shared and time-shared policies

Objectives and Design
The aim of this experiment is to evaluate how the allocation policies affect the time to transfer files. The experiment measures the performance of a system using the time-shared policy against a system using the space-shared policy when allocating cloudlets to virtual machines.
The main goal of the experiment is to check whether the two policies behave similarly. The parameters for both systems include 250 cloudlets, each cloudlet requiring enough instructions to correspond to 20 minutes of execution on the hosts, together with 100 virtual machines. The file size was varied from 1 to 14 terabytes for each policy, in order to check which one offers the better performance, taking less time to transfer the files.

Results
[Plot: time (secs) against file size (terabytes); transferring 250 files; time-shared vs space-shared] Figure 4.5: Comparing time taken to transfer files, using the time-shared and space-shared policies

Figure 4.5 shows that there is a significant difference between the two policies. The experiment using the space-shared policy shows better results in terms of transfer time than the experiment using the time-shared policy. This behaviour is related to the fact that when a system allocates cloudlets (users' tasks) to virtual machines using the time-shared policy, all cloudlets run in parallel inside the virtual machines: they all start at the same time and share the CPU of the virtual machine, which leaves less processing power for the virtual machine to complete each task (cloudlet). The space-shared policy gives better performance because cloudlets run one by one inside the virtual machines, so each virtual machine uses its full power for each cloudlet. Each cloudlet starts in its own time, without any need to share the virtual machine's CPU with other cloudlets, and if the cloudlets are transferring files of equal size, each cloudlet takes the same time to complete. In short, the space-shared policy offers better management of the virtual machines' CPUs.
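The difference between the two policies can be sketched with a toy model of cloudlet finish times (an illustrative model rather than the CloudSim implementation; the optional overhead factor for time-sharing is a hypothetical parameter): under space-sharing each cloudlet gets the full CPU in turn and files complete steadily, while under time-sharing every cloudlet shares the CPU and all finish together at the worst-case time.

```python
def space_shared_finish_times(n, secs_per_file):
    # one cloudlet at a time per VM: completions arrive steadily
    return [secs_per_file * (i + 1) for i in range(n)]

def time_shared_finish_times(n, secs_per_file, overhead=0.0):
    # all cloudlets share the CPU: each gets 1/n of it, so every
    # cloudlet finishes at the end (plus any sharing overhead)
    finish = n * secs_per_file * (1.0 + overhead)
    return [finish] * n

print(space_shared_finish_times(4, 10))  # [10, 20, 30, 40]
print(time_shared_finish_times(4, 10))   # [40.0, 40.0, 40.0, 40.0]
```

With zero overhead the total makespans are equal, but the first space-shared file is done after 10 seconds while the first time-shared file takes 40; any additional gap between the policies would correspond to the cost of sharing the CPU.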

4.2.2 HDFS Simulations

Scenario 1: Increasing the size of the file to process

Objectives and Design
This set of experiments aims to give a better understanding of HDFS performance. The main objective of this scenario is to check how long HDFS takes to process a file in the cloud, and to relate the time taken to the size of the file. The scenario simulates a cluster with 1000 datanodes. A data file input stream is simulated in order to get the length of the file. The file is split into blocks of 64 megabytes each, so the block size is set to the default of 64 megabytes. It is assumed that the file is already in the cloud and the user only wants to process the data. The tests used several files, each with a different size.

Results
Several tests were performed, with file sizes varying from 1 to 10 terabytes. Using the time taken in each test, a graph was produced to analyse the HDFS performance in terms of the time it takes to process files in the cloud. One purpose of this scenario was to get the average time HDFS takes to locate the blocks of a file and address each block, sending the id of the datanode where each block is stored, allowing the user to process the file.
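The block arithmetic behind these experiments can be written down directly (a simple calculation, not simulator code), using the default 64 MB block size and the replication factor of 3 used in this scenario:

```python
import math

def block_counts(file_size_bytes, block_size_mb=64, replication=3):
    """Return (unique blocks, total stored replicas) for a file."""
    block_bytes = block_size_mb * 1024 * 1024
    blocks = math.ceil(file_size_bytes / block_bytes)
    return blocks, blocks * replication

TB = 1024 ** 4
print(block_counts(1 * TB))   # (16384, 49152)
print(block_counts(10 * TB))  # (163840, 491520)
```

Even a 1-terabyte file already spreads 16384 blocks (49152 replicas) across the 1000 datanodes, around 49 replicas per node, which is why the load on each datanode grows so quickly with file size.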

[Plot: time (secs) against file size (terabytes); reading files and writing their contents to new files in the cloud; 1000 nodes] Figure 4.5: Processing files of different sizes in the cloud

The graph of the results shows that increasing the file size results in more time to process the file. As the file size increases, the number of blocks distributed in the system grows, since blocks in HDFS are replicated; in this experiment the replication factor was set to 3. Also, as the file gets bigger and the cluster has only 1000 nodes, each datanode holds more blocks, which can increase the overloading rate of the datanodes, and there will be fewer racks relative to the data held, which implies more failures in the system. The system also uses a policy of allocating one replica of each block to a different rack, and this affects the writing process: blocks must be transferred to multiple racks, which increases the time as the number of blocks increases. The number of failures may also affect the time to process data. When a failure happens during processing, after the Namenode has sent the address of the datanode holding the block the user requested, and the rack holding that datanode fails during the reading process, the system must locate another datanode in another rack holding a replica of the block, in order to recover and send the block to the user for processing. Another factor in the increase of time with file size is the communication between nodes: communication between nodes in different racks usually goes through a network switch.
In most cases the bandwidth between datanodes in the same rack is greater than the bandwidth between datanodes in different racks. This means that if there are more failures in the system, more datanodes from different racks will be used to send replicas of the blocks, and communication between these datanodes will be slower, as they sit in different racks. Replica selection is another factor influencing the time to process data as the file gets bigger. As the file size increases, it is normal to have more failures, which results in fewer datanodes working. To reduce the reading latency, the system tries to use, for the reading process, the block replicas that are closest to the reading datanode. If there is a replica in the same rack as the reader, that replica is selected for the reading process; when there is no replica in the same rack, it takes more time to select the next replica of the block. Another reason linked to failures in the system is that when a failure occurs, some datanodes lose their connection with the Namenode. The Namenode detects the loss of connection when it stops receiving heartbeat signals from a datanode. When the Namenode detects that no more signals are coming from a datanode, it marks that datanode as dead and consequently stops sending tasks to it, which leaves fewer datanodes working to process data. While the remaining datanodes are processing data, the Namenode analyses which blocks need to be replicated again, having detected dead datanodes in the system, and starts replicating those blocks on working datanodes across the cluster. The content of the blocks is also crucial to the system's processing time, as it is possible for the data of a block on a datanode to become corrupted. Data may get corrupted because of failures during the storage process or network failures, so HDFS checks the content of each block while the user is processing data. When a user writes a file and the Namenode tells it which datanodes will hold each block of the file, the Namenode also computes the checksum of each block and stores it in a hidden file.
When reading a file, HDFS checks whether each block received matches the checksum stored for that file; if not, HDFS requests the block from another datanode. With fewer datanodes there are fewer block replicas available, so fewer datanodes may still hold the block when there is a failure.
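The checksum check and the fallback to another replica can be sketched as follows (an illustrative sketch, not the HDFS client code; MD5 stands in here for whichever checksum the system records at write time):

```python
import hashlib

def checksum(data):
    return hashlib.md5(data).hexdigest()

def read_block(replicas, expected):
    """replicas: (datanode_id, data) pairs. Return the first replica whose
    content matches the checksum stored when the block was written."""
    for node_id, data in replicas:
        if checksum(data) == expected:
            return node_id, data  # verified copy found
        # corrupt or stale copy: fall back to the next datanode
    raise IOError("all replicas of the block are corrupt or unavailable")

good = b"block contents"
expected = checksum(good)
node, data = read_block([("dn1", b"corrupted!"), ("dn2", good)], expected)
print(node)  # dn2
```

This also shows why fewer surviving replicas directly increase the chance that a read fails outright: the loop simply runs out of datanodes to fall back to.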

Scenario 2: Increasing the number of nodes in the cluster

Objectives and Design
The goal of this experiment is to evaluate how the number of nodes affects the system and its time to process data. It will be used to analyse the organisation of data in the cluster, how HDFS responds to a given number of nodes, how the performance changes as the number of nodes increases, and how a given number of nodes processes large data files. The performance is measured as the time taken to process data while increasing the number of nodes. The experiments start by simulating a cluster with 1000 nodes; the system is ordered to process a file, and a data file input stream and output stream are simulated to give the number of blocks of the file. The number of nodes per cluster was then increased from 1000 upwards. The block size remained the same at 64 megabytes.

Results
Several experiments tested the performance of the system for each cluster size, with file sizes of 1 to 10 terabytes per test. With the results of each cluster's performance, a graph was produced plotting file size against the time taken to process the file, which is used to analyse the HDFS performance. The objective of this scenario was to observe the influence of having more datanodes receiving blocks, and to get the average time to process these blocks in the system while varying the file size, with each cluster size performing the same task. This task also includes locating the blocks of the file and addressing each block, sending the id of the datanode where each block is stored, allowing the user to process the file.

[Plot: time (secs) against file size (terabytes); reading files and writing their contents to new files in the cloud; clusters of 1000 and 10000 nodes and larger] Figure 4.6: Using clusters with different numbers of nodes to process files in the cloud

[Plot: time (secs) against file size (terabytes); reading files and writing their contents to new files in the cloud; the larger clusters] Figure 4.6: Using clusters with different numbers of nodes to process files in the cloud

The results of the experiments, represented in the figures above, show that the number of nodes in the cluster has a direct effect on the system performance. The figures compare the time taken by clusters of 1000, 10000 and more nodes to process files. The time difference between the clusters gets smaller as the number of nodes increases. So the more nodes in the cluster, the better the system performs and the less time it takes to process a file. As HDFS was designed to support large data, the number of nodes helps the system perform tasks faster. It gets faster with more nodes because the blocks can be distributed across more datanodes, making it possible to reduce failures in the system. The datanodes can be organised so that each replica of a block goes to a datanode in a different rack. With this method the system remains highly fault-tolerant: even if a rack goes down, other datanodes still hold the requested block, from which the user can read the data. Having more nodes in the cluster also allows the system to increase the replication level, using for example 8 replicas per block, which certainly increases the availability of blocks in the system; more blocks will then be closer to the reading datanode, accelerating the reading process. One more test was performed to confirm the influence of increasing the number of nodes in the cluster. The graph below shows how one file of 10 terabytes would be processed as the number of nodes increases.

[Plot: time (secs) against number of nodes; processing a file of 10 terabytes] Figure 4.7: Increasing the number of nodes in the cluster
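The rack-aware placement described above can be sketched in a few lines (an illustrative policy, not the actual HDFS placement algorithm, which places some replicas within the same remote rack): here each replica of a block is sent to a different rack, so no single rack failure loses every copy.

```python
def place_replicas(block_id, racks, replication=3):
    """Choose a distinct rack for each replica, rotating by block id
    so that blocks spread evenly across the cluster."""
    if len(racks) < replication:
        raise ValueError("need at least as many racks as replicas")
    start = block_id % len(racks)
    return [racks[(start + i) % len(racks)] for i in range(replication)]

racks = ["rack0", "rack1", "rack2", "rack3"]
print(place_replicas(0, racks))  # ['rack0', 'rack1', 'rack2']
print(place_replicas(5, racks))  # ['rack1', 'rack2', 'rack3']
```

Adding nodes (and hence racks) widens the pool this rotation draws from, which is one way to see why larger clusters both balance load better and tolerate more failures.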

A larger number of nodes helps to balance the cluster, as there are more opportunities for HDFS to move blocks from one datanode to another if a datanode starts getting overloaded. Also, when there is an increase in requests for a particular block, the system can create extra replicas of that block and rebalance the other blocks in the cluster. Another factor in the performance gain from more nodes is that more CPUs will be running, which means the system can handle more workload when processing the data. And as the number of nodes in the cluster increases, the probability of data loss decreases. The number of nodes in the cluster also contributes to the system's scalability.

Scenario 3: Increasing the size of blocks

Objectives and Design
This set of experiments aims to evaluate how the size of the blocks influences the system, and how it affects the time the system takes to process data. The experiments help to observe how fast the system can process data if the block size is varied. The performance is measured as the time taken to process data. The scenario simulates a cluster of datanodes; the system is ordered to process a file, and a data file input/output stream is simulated to give the number of blocks of the file. The number of nodes per cluster remains the same, while the block size is changed, using 32, 64 and 128 megabytes.

Results
Each experiment tested the performance of the system using a different block size. The tests also used different file sizes: for each block size, files of 1 to 10 terabytes were used. Using the results of each test, a graph was produced plotting file size against the time taken to process the file, which is used to analyse the HDFS performance. This scenario aims to observe the impact that varying the block size has on the system performance.
It helps to give a better understanding of the way the block size affects the time to process data. Each test provided the average time the system takes to process data, varying the file size and the block size.

[Plot: time (secs) against file size (terabytes); reading files and writing their contents to new files in the cloud; block sizes of 32, 64 and 128 megabytes] Figure 4.8: Increasing the size of blocks

The results in figure 4.8 show that the block size is one of the factors with a big impact on the system performance. The experiment using blocks of 32 megabytes takes more time to read the data than the others, and the processing time reduces as the block size increases. This behaviour happens because as the block size gets bigger, the number of blocks in the system becomes smaller, which means fewer blocks allocated to each datanode. Therefore, if the system uses blocks of 128 megabytes, the user can read and write more data without having to consult the Namenode for every task. Increasing the block size also reduces the seek time and the size of the metadata held by the Namenode, which reduces the Namenode's load, an important consideration when the system is dealing with extremely large files. If the number of files is smaller than the number of tasks the system is running, the data will be processed with maximum parallelism and the system will still have resources available. With fewer blocks in the system, however, a file will be allocated to fewer nodes, which reduces the parallel access throughput. The block size mostly affects the maximum throughput per file; but if the cluster has to deal with small files, the block size can hurt the system performance and increase the percentage of failures in the system.
For example, if a cluster is set up with a block size of 128 megabytes and a user sends a file of 128 megabytes, the system will not achieve the best performance for this file: since the file size equals the block size, the entire file is allocated to a single block, and each replica represents the entire file. The file is therefore not well distributed, because only three machines in the cluster hold the file; if these three machines fail at the same time, the file is lost and the system cannot retrieve the data, as no other machine holds a replica of the file.
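The trade-off in this example can be checked with a line of arithmetic (a simple calculation, assuming the replication factor of 3 used in these experiments): a file no larger than one block lives on only three machines, while a smaller block size spreads the same file over many more datanodes.

```python
import math

def distinct_placements(file_mb, block_mb, replication=3):
    """Number of blocks, and the maximum number of datanodes
    that can hold a piece of the file."""
    blocks = math.ceil(file_mb / block_mb)
    return blocks, blocks * replication

print(distinct_placements(128, 128))  # (1, 3): lose 3 machines, lose the file
print(distinct_placements(128, 32))   # (4, 12): spread over up to 12 nodes
```

Large blocks are therefore a throughput optimisation for large files, bought at the cost of distribution (and hence survivability) for small ones.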

Chapter 5

Evaluation

5.1 Introduction
This chapter evaluates the project, using the results of the simulations presented in chapter 4. It evaluates the performance of the experiments run in both simulators, comparing the results of the two methods for transferring and processing data in the cloud. As presented in the previous chapters, the simulation of data management was divided into two sections, data transfer and data processing. Each section was measured using one simulator, and each simulator showed a different performance for each scenario simulated. Although both simulators provide valid tools for simulating tasks in the cloud, neither could set the right resources for a scenario on its own; thus, several tests were performed to find the appropriate cloud resources to address the problems that can arise when transferring and processing data in the cloud. This chapter therefore includes a review of the results from both simulators, and of the time each task took to complete in each simulator.

5.2 Achieving Minimum Requirements
The objective of this project is to provide a better understanding of data management in the cloud, using cloud simulation to simulate scenarios that evaluate mechanisms for reducing the time to transfer and process data. The minimum requirements set for this project are:

- Investigate cloud simulator architecture (CloudSim and the HDFS simulator)
- Identify the cloud services and cloud resources that are key to implementing a cloud data management scenario
- Understand the communication between entities in CloudSim and the HDFS simulator
- Implement a simulation experiment considering a scenario for data management in the cloud
- Implement a working cloud simulation experiment for data storage in the cloud

To meet these minimum requirements, a literature review is presented in chapter 2, covering cloud computing concepts, features and architecture. The types of services offered in cloud computing were also investigated, along with the resources used by each cloud service. Some issues of data management were reviewed and an appropriate approach to a solution was presented. Several cloud simulators were introduced in the literature review, explaining which ones were used to implement the experiments. Chapter 2 also investigates the architecture of CloudSim and HDFS; this investigation covered how each simulator works, the tools used, the communication between entities, and how the entities are used to simulate scenarios for data transfer and data processing. The design of this project is presented in chapter 3, explaining the problem addressed and the proposed solution. Chapter 3 also covers how each simulator is applied to evaluate the solution, which entities are used, the role of each entity in the cloud simulation scenario, and how each task of the solution is performed in a cloud environment.
Chapter 4 presented the types of scenarios implemented, the characteristics that distinguish them, and the mechanisms used to reduce the data transfer time and the data processing time in each scenario. The scenarios simulated in CloudSim used experiments that varied the entities, to find the resources that help reduce the data transfer time, combining cloud resources through several mechanisms to speed up data transfer. The implementation of the experiments in HDFS also involved a combination of cloud resources, using the HDFS architecture, which helped to find the best combination for reducing the time to read and write data.

5.3 Evaluation of project methodology
The methodology used in this project involved cloud simulation: implementing scenarios that run experiments evaluating techniques to improve data management in the cloud. The simulators used in this project served well, as each offered scalable simulations and a reliable environment in which experiments could be performed, controlled and tested without any cost. CloudSim was helpful, proving effective and flexible for modelling cloud storage, with experiments simulating small and large data being transferred to a datacenter. It also allows the use of policies for task allocation and virtual machine allocation. In addition to virtualization, CloudSim allows experiments that run virtual machines and cloudlets in parallel. However, it was not able to process data, and showed some limitations during the first experiment: as the file size increased, it reached its limit for storing files. The HDFS simulator proved efficient for modelling cloud environments for data processing, where it is possible to simulate scenarios with a large probability of a file surviving a large-scale system failure. It enables the implementation of scenarios simulating a distributed file system, using techniques to increase file durability and replication. It also allows rapid adjustment of the configuration for each experiment, increasing the replication level and the number of nodes per cluster, which helps to improve the mechanisms for processing data. On the other hand, it proved somewhat slower and less accurate in its simulation results: experiments at high scales took more time to run, and it was necessary to run each experiment 10 times to obtain consistent and reliable results. Using both simulators, several scenarios were simulated, which made it possible to find the combination of cloud resources needed to improve data transfer and data processing.

5.4 Evaluation of Experiment Results
As the experiments were performed in two different simulators, the results are evaluated in two respects, data transfer and data processing, and are therefore analysed separately, contrasting the resources used in each experiment to improve data transfer and data processing. These results are also contrasted with the results of the works included in the literature review.

5.5 Evaluation of CloudSim Results
During the experiments it was possible to observe that the cloudlet allocation policy has a big impact on the time taken to transfer files. The work presented by Buyya and Calheiros [23] showed that the experiment using space-shared allocation performed better than the experiment using

time-shared allocation. In the space-shared experiment, every file took the same time to be transferred to the datacenter, giving a stable performance for a system using this policy, as it reduces the time to transfer files. The experiment using the time-shared policy that Buyya and Calheiros [23] presented showed a different behaviour: the time to transfer each file varied as the number of files submitted to the virtual machines increased. In this case the files did not all take the same time to be transferred, as the files submitted first took less time than the later ones. In the results of the experiments simulated in chapter 4, scenario 2.1.4, there is a significant reduction in the time to transfer files using the space-shared policy: the time taken in the experiment using the time-shared allocation policy is 60% greater than the time taken using the space-shared policy. This result comes from the fact that virtual machines using the space-shared policy transfer one file at a time, which means each virtual machine can use its full memory and CPU for one file, while virtual machines using the time-shared policy transfer all the files in parallel, so each virtual machine shares its memory and CPU across all the files. Further mechanisms were found to improve data transfer: the results from scenario 2.1.2 show that using more virtual machines speeds up the data transfer. Combining these two mechanisms, increasing the number of virtual machines and setting the space-shared policy for allocating cloudlets to virtual machines, gives a possible solution for reducing the time to transfer files.
Although this is already an effective solution for data transfer, these mechanisms only work if the physical machines in the datacenter perform well enough. The results from scenario 2.1.3 show that the number of physical machines in a datacenter influences the performance of the virtual machines, which consequently affects the data transfer. The results show that having more physical machines allows the virtual machines to be more distributed across the cluster, and enables migration from one physical machine to another when a machine becomes overloaded.

5.6 Evaluation of HDFS Results
The experimental results of Debains and Togores [24] show that the file size influences the time the system takes to process data: in their experiment, the mean time to repair the system increases as the number of blocks increases. Another observation from their experiment is that the time taken to process files using eight replicas per block is greater than the time taken using 4 replicas per block, since as the number of replicas increases, the number of blocks per node also increases. This behaviour was observed in scenario 2.2.1, where the data processing time increases with the size of each file. Increasing the size of a file also increases the number of blocks being replicated, which results in more blocks being allocated to each node in the system. The other experiments show mechanisms to reduce the time to process data. Scenario 2.2.2 shows that one technique to speed up data processing is to increase the number of nodes in the cluster: the results obtained from this scenario clearly show a reduction in time for the experiments using more nodes, as more nodes increase the system performance, resulting in a better distribution of blocks

across the nodes. The project also included a study of the effect of the block size when processing data. In scenario 2.2.3, a simulation was performed implementing experiments using small and large block sizes. These experiments gave the project a further mechanism to improve data processing, as it was possible to observe a reduction in time when increasing the block size: the time taken to process files using blocks of 32 megabytes is three times greater than the time the experiment modelled with blocks of 128 megabytes took to process the data.

5.7 Future work and Possible Extensions
This project shows a straightforward way to perform experiments and tests in a cloud environment using simulation, evaluating scenarios with CloudSim and the HDFS simulator. Observing the results, there is a need for a bridge linking the results obtained in this project with results obtained in a real cloud environment. One of the next steps is therefore to consider situations in which a user wants to move from one cloud provider to another, and to perform simulations transferring data between two different datacenters. Another point to consider is the validation of the simulated scenarios, to check how reliable they are with respect to real environments. Another interesting direction for future work is an extension to CloudSim to implement scenarios that process large data, and to compare its results with the results obtained from HDFS.

Chapter 6 Conclusion

The minimum requirements for this project were achieved. Overall, the experiments worked well and helped to evaluate appropriate mechanisms for speeding up data transfer and processing. CloudSim offered usable tools that helped in understanding the process of storing data in the cloud. Virtual machines proved to be a good resource for transferring data and improving system performance. The project also provides techniques for reducing workloads on physical machines across the datacenter; the allocation policies are one of the key factors in improving system performance. HDFS proved to be a fast system for processing data, using its technique of splitting files into blocks and distributing these blocks across different nodes, increasing the durability of files in the system and reducing possible failures. Storing data on many nodes gives the system a significant advantage, making it more flexible and increasing the availability of files. The block size can also increase system performance, so service providers can set the size of blocks according to the size of the files being processed in the cloud; this mechanism is most effective for large data, as it reduces the number of blocks on each node.

REFERENCES

[1] Judith Hurwitz, Robin Bloor, Marcia Kaufman and Fern Halper - Cloud Computing For Dummies: What Is Cloud Computing?
[2] Peter Mell and Timothy Grance - The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, September 2011
[3] Anthony T. Velte, Toby J. Velte, Robert Elsenpeter - Cloud Computing: A Practical Approach, 2010
[4] Dan Orlando - Cloud Computing Service Models, February 2011
[5] Marijana - Understanding Cloud Service Models
[6] -
[7] Jason Carolan and Steve Gaede - Introduction to Cloud Computing Architecture, 1st Edition, June 2009
[8] Karthik - Cloud Computing and Types of Cloud, February 2012
[9] Harsh Mahajan - Cloud Computing and Types of Clouds, January 2011
[10] Josh Ames - Types of Cloud Computing: Private, Public and Hybrid Clouds, December 2012
[11] Types of Cloud Computing: Public, Private and Hybrid Clouds Explained, September

[12] Gagan Deep Saini - Cloud Computing: Database as a Service, Cloud Computing Journal
[13] Karen Scarfone, Murugiah Souppaya and Paul Hoffman - Guide to Security for Full Virtualization Technologies, Recommendations of the National Institute of Standards and Technology, January 2011
[14] Abhishek - Types of Virtualization and Cloud Computing and Their Differences, October
[15] Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D. J., Rasin, A., and Silberschatz, A. (2009) - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
[16] Hadoop (2010) - Apache Hadoop
[17] Rodrigo N. Calheiros, Rajiv Ranjan, César A. F. De Rose, and Rajkumar Buyya - CloudSim: A Novel Framework for Modeling and Simulation of Cloud Computing Infrastructure and Services
[18] Jun-Kwon Jung, Nam-Uk Kim, Sung-Min Jung and Tai-Myoung Chung - Improved CloudSim for Simulating QoS-Based Cloud Services
[19] Dhruba Borthakur - The Hadoop Distributed File System: Architecture and Design
[20] DataStax Corporation - Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS), September 2012
[21] David Jacobs - Hadoop enables distributed 'big data' processing across clouds
[22] Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas - The Hadoop Distributed File System
[23] Rajkumar Buyya, Rajiv Ranjan, and Rodrigo N. Calheiros - Modeling and Simulation of Scalable Cloud Computing Environments and the CloudSim Toolkit: Challenges and Opportunities
[24] Corentin Debains, Pedro Alvarez-Tablo Togores, Firat Karakusoglu - Reliability of Data-Intensive

Distributed File System: A Simulation Approach, 2010
[25] -
[26] Ann Westerheim - What is Cloud Computing?, November 2012
[27] Brad Hedlund - Understanding Hadoop Clusters and the Network, September 2011
[28] Gwen Shapira - What Data Should We Store on Hadoop?, October 2012
[29] Jeff Beckham - Cloud Computing vs. Virtualization: The Differences and Benefits, October 2011
[30] VMware - Understanding Full Virtualization, Paravirtualization, and Hardware Assist, November
[31] -
[32] Margaret Rouse - Software as a Service, August
[33] Abe Sultan - SaaS 101: The Benefits, May
[34] SaaS - Software as a Service, Storage as a Service
[35] Gerald Kaefer - Cloud Computing Architecture, Corporate Research and Technologies, Munich, Germany, May 2010
[36] Sam Johnston - Cloud Computing Types: Public Cloud, Hybrid Cloud, Private Cloud, March


More information

Relocating Windows Server 2003 Workloads

Relocating Windows Server 2003 Workloads Relocating Windows Server 2003 Workloads An Opportunity to Optimize From Complex Change to an Opportunity to Optimize There is much you need to know before you upgrade to a new server platform, and time

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts Part V Applications Cloud Computing: General concepts Copyright K.Goseva 2010 CS 736 Software Performance Engineering Slide 1 What is cloud computing? SaaS: Software as a Service Cloud: Datacenters hardware

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Security Benefits of Cloud Computing

Security Benefits of Cloud Computing Security Benefits of Cloud Computing FELICIAN ALECU Economy Informatics Department Academy of Economic Studies Bucharest ROMANIA e-mail: [email protected] Abstract: The nature of the Internet is

More information

Everything You Need To Know About Cloud Computing

Everything You Need To Know About Cloud Computing Everything You Need To Know About Cloud Computing What Every Business Owner Should Consider When Choosing Cloud Hosted Versus Internally Hosted Software 1 INTRODUCTION Cloud computing is the current information

More information

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary Contents Introduction What is the Cloud? How does it work? Types of Cloud Service Cloud Service Providers Summary Introduction The CLOUD! It seems to be everywhere these days; you can t get away from it!

More information

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline References Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Introduction to Database Systems CSE 444

Introduction to Database Systems CSE 444 Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon References Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model

More information

CHAPTER 2 THEORETICAL FOUNDATION

CHAPTER 2 THEORETICAL FOUNDATION CHAPTER 2 THEORETICAL FOUNDATION 2.1 Theoretical Foundation Cloud computing has become the recent trends in nowadays computing technology world. In order to understand the concept of cloud, people should

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Cloud computing - Architecting in the cloud

Cloud computing - Architecting in the cloud Cloud computing - Architecting in the cloud [email protected] 1 Outline Cloud computing What is? Levels of cloud computing: IaaS, PaaS, SaaS Moving to the cloud? Architecting in the cloud Best practices

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information