
Summary

Cloud computing has become one of the key terms in the IT industry. The cloud represents the internet, or an infrastructure for communication between all components, providing and receiving services through the internet. Cloud services offer many benefits to clients, giving them the possibility to store and process data using a scalable architecture that provides almost infinite storage capacity. With the amount of data being moved to the cloud constantly increasing, cloud services deal with millions of files every day, storing and processing them according to client requests. Facebook is a real example of this type of service provided in the cloud, where data ranges from terabytes to petabytes and needs to be stored and processed in a timely manner. This increase in data being transferred to the cloud intensifies the need to develop mechanisms and solutions for data management, to improve the performance of cloud services when attending to requests to store and process large data sets. This project aims to give an understanding of data management in the cloud, evaluating solutions and mechanisms to speed up data transfer and data processing. Several scenarios will be simulated using cloud simulators, with a number of experiments conducted to evaluate techniques to reduce the time taken to transfer and process data.

Acknowledgements

I would like to thank my family for giving me support to study and for believing in my dreams. I would like to thank my tutor Nick for all the support given over the last three years, and my supervisor for his interest in this project, for scheduling weekly meetings, and for the feedback given during the progress of the project.

Contents

1. Introduction
   1.1 Introduction
   1.2 Why Cloud Simulation
   1.3 Why Data Storage
   1.4 Project Problem
   1.5 Objectives
   1.6 Minimum Requirements
   1.7 Project Plan
       1.7.1 Milestones
   1.8 Methodology
   1.9 Summary
2. Background Research
   2.1 Introduction
   2.2 Cloud Computing
   2.3 Cloud Features
   2.4 Cloud Models
       2.4.1 Software as a Service
       2.4.2 Platform as a Service
       2.4.3 Infrastructure as a Service
   2.5 Types of Cloud
       2.5.1 Private Clouds
       2.5.2 Public Clouds
       2.5.3 Hybrid Clouds
   2.6 Cloud Applications
   2.7 Virtualization
   2.8 Data Management
       2.8.1 Data Management Issues
       2.8.2 Data Storage and Query Processing
       2.8.3 Scalability and Consistency
       2.8.4 Database Management Systems with Multi-tenants
   2.9 Evaluation Performance Using Cloud Simulation
   2.10 Data Transfer
       2.10.1 CloudSim
       2.10.2 CloudSim Architecture
       2.10.3 CloudSim Usability
       2.10.4 CloudSim Capabilities
       2.10.5 CloudSim Limitations
   2.11 Data Processing
       2.11.1 Hadoop Distributed File System
       2.11.2 HDFS Goals: a) Hardware Failure, b) Streaming Data Access, c) Large Data Sets
       2.11.3 Block Division
       2.11.4 HDFS Architecture
       2.11.5 Data Replication
   2.12 Related Work
   2.13 Summary
3. Design
   3.1 Introduction
   3.2 Methodology
   3.3 Example Problem
   3.4 Solution
   3.5 Data Transfer: Application of CloudSim
   3.6 Data Processing Performance
   3.7 Summary
4. Implementation
   4.1 Introduction
   4.2 Experiments
       4.2.1 CloudSim Experiments (Scenarios 1-4: Objectives and Design, Results)
       4.2.2 HDFS Experiments (Scenarios 1-3: Objectives and Design, Results)
5. Evaluation
   5.1 Introduction
   5.2 Achieving Minimum Requirements
   5.3 Evaluation of Project Methodology
   5.4 Evaluation of Project Experiments
       5.4.1 Evaluation of CloudSim Results
       5.4.2 Evaluation of HDFS Results
   5.5 Future Work and Possible Extensions
   5.6 Conclusion

Chapter 1 Introduction

1.1 Introduction

The aim of the project is to evaluate mechanisms for data management in the cloud using cloud simulation. Data management in the cloud has become important, as cloud computing offers storage as a service, where clients can move their data to the cloud, gaining the benefit of accessing their data anywhere along with the greater storage capacity offered by their cloud provider. This project will focus on data storage, and show how cloud simulation can help us get a better understanding of storing data as well as processing data in the cloud. The project will involve cloud simulation experiments, to check the performance of transferring and processing data in the cloud.

1.2 Why Cloud Simulation?

In a cloud environment, some services or applications have to be tested before being provided to customers, in order to know how the service or application will behave while in use. Cloud simulation helps to develop experiments that reproduce cloud environments, so most users can easily become familiar with it. It allows them to perform tests on their services, with the power to repeat these tests and control the cloud environment, at no cost to them, and they get an idea of the service's performance before introducing it to the real cloud environment.

1.3 Why Data Storage?

Cloud computing offers many storage services, but one of the biggest concerns is the possibility of data from one client being mixed with data from other clients. Data storage has therefore been a challenge for providers, who must manage and extract data in the cloud. It is important to isolate data belonging to a single client from the others. A good and fast way of accessing data is also necessary, in order to make it easier for users to get their files in the cloud and to reduce the level of data loss. So data management is an important factor for service providers in the cloud, to achieve success when delivering services to clients.

For example, Google offers the Google Cloud Storage service, which allows people to access, store and protect data files. It lets us manage our data on Google's reliable infrastructure, which is scalable and efficient, giving robust storage with quick and easy access. SkyDrive is a cloud storage service from Microsoft that lets us store data in the cloud and offers a set of tools to manage data, such as Word, Excel and PowerPoint. Amazon offers Amazon S3, which has an interface that allows data to be stored and retrieved at any time, from anywhere. It gives access to the scalable, inexpensive and fast infrastructure that Amazon runs on its global network of web sites.

Figure 1: Data storage in cloud [25]

1.4 Project Problem

As the quantity of data that clients want to move to the cloud increases every day, and with it the need for cloud providers to store and process data in a timely manner, it is necessary to use mechanisms to measure the performance of the applications used to store and process data, in terms of the time it takes to complete clients' requests. Using cloud simulation it is possible to generate a set of metrics to evaluate the performance of storing data while it is transferred to the cloud, and of processing clients' data once it is already in the cloud. By analysing the results obtained from simulations using these metrics, it is possible to review the pros and cons of an implementation in a specific scenario to store and process data, and therefore to identify the resources needed to improve a system to offer better performance for storing and processing data in the cloud.

1.5 Objectives

The objectives of the project are:

- Explore cloud computing and its architecture
- Understand data management in the cloud
- Investigate data storage in the cloud
- Understand simulation in the cloud and explore cloud simulators, using experiments to simulate cloud scenarios
- Explore data management issues and their solutions

1.6 Minimum Requirements

The minimum requirements are:

- Investigate cloud simulator architecture (CloudSim and the HDFS simulator)
- Identify the cloud services and cloud resources that are key for the implementation of a cloud data management scenario
- Understand communication between entities in CloudSim and the HDFS simulator
- Implement a simulation experiment considering a scenario for data management in the cloud
- Implement a working cloud simulation experiment for data storage in the cloud

1.7 Project Plan

The project was divided into six stages, setting a number of tasks for each stage, in order to keep the project on schedule. A Gantt chart was produced to set the time period for each stage. In the first stage, a literature review was done, covering theoretical concepts of cloud computing and its architecture, which helped to achieve the goals of the other stages of the project. The second stage includes an analysis of the issues with data management in the cloud, and solution approaches to implement methods improving data management in the cloud. The third stage covers planning the project structure: which data management problem will be covered, and selecting the solution approach and the simulators to evaluate the solution. Stage four includes the design of scenarios to evaluate the solution, and the implementation of these scenarios using the selected simulators. In stage five, tests are performed on each scenario; results are collected and discussed. Finally, in stage six, an evaluation of the solution is performed, with individual and comparative analysis of the simulated scenarios.

1.7.1 Milestones

To track the project's progress, end points for each stage have been set in the project plan, to ensure that each stage has been completed:

A) Literature review (25/02/13)
B) Analysis of the problem (01/03/13)
C) Plan the structure of the project (08/03/13)
D) Design and implementation (19/04/13)
E) Tests (19/04/13)
F) Evaluation and writing up (05/05/13)

1.8 Methodology

To achieve the minimum requirements of the project, research on cloud computing must be completed. The research will include information on cloud computing architecture, cloud simulation features and architecture, as well as a brief review of existing applications in cloud computing. Data management issues will be presented in this project and some solutions will be proposed to minimise these issues.

Performance evaluation in the cloud can be done in three ways: through direct experiments, which involve designing a real cloud environment to provide services; mathematical modelling, where the evaluation is modelled with mathematical algorithms and formulas such as equations; and cloud simulation, which allows users to perform experiments simulating cloud environments to test cloud services. Simulation was used as the methodology for this project, as it offers the tools necessary to simulate scenarios and calculates time, energy consumption, data cost and data processing results for each simulation experiment. These tools allow users to design experiments, setting parameters according to the experiment's objective, then run them and show the results of each experiment. There are several cloud simulators that can help users develop experiments to simulate scenarios, for a better understanding of how cloud services work, and how to implement new

solutions for problems in cloud services. CloudSim and the HDFS (Hadoop Distributed File System) simulator will be used as the framework methodology for this project, as they offer the tools needed to simulate data transfer and calculate the time to transfer data, and also to simulate data processing for large files. Simulation experiments will be performed for a better understanding of data transfer and data processing in the cloud.

1.9 Summary

The project aim and minimum requirements have been stated, and the project methodology has been explained. The next chapter will focus on background research into cloud computing, its architecture and existing applications.

[Gantt chart: project schedule from 21/01/13 to 08/05/13, covering the tasks Literature Review, Problem Analysis, Project Plan, Design, Implementation, Tests, Evaluation and Project Report.]

Chapter 2 Background Research

2.1 Introduction

The objective of this chapter is to go through the concepts of cloud computing and the design and architecture of CloudSim and the HDFS simulator.

2.2 Cloud Computing

Cloud computing is computation performed on distributed servers, where data is stored and processed without users having knowledge of the location of the data. Cloud computing as a system involves features such as scalability, reliability, transparency and redundancy. Cloud computing can be defined as the set of hardware, networks, storage, services, and interfaces that combine to deliver aspects of computing as a service [1]. NIST (the National Institute of Standards and Technology) defines cloud computing as: "A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction." [2]

Cloud computing can be considered the next step in the evolution of the internet, as it provides the means necessary for everything (from applications, computing power, computing infrastructure, storage and business processes to personal collaboration software) to be delivered to clients as a service, wherever and whenever the service is required, over the internet. With these characteristics, cloud computing becomes both flexible, because many machines work together, and centralized, constituting a single system in the cloud that provides services to customers through the internet.

2.3 Cloud Features

A cloud computing service includes several elements: clients, the datacenter and distributed servers.

Clients: a client in cloud computing is anything connected to a Local Area Network (LAN), requesting and receiving services; in other words, clients are the devices that end users interact with to manage their information in the cloud. Clients can be divided into three types [3]:

1. Mobile - mobile devices such as smartphones, tablets and laptops.
2. Thin - devices without internal hard drives, where all the work is done by the servers and the devices only display the results (data).
3. Thick - devices that connect to the cloud (internet) using web browsers.

Datacenter: a set of servers on which the files and applications are stored. It can be a room full of servers that run user requests and are accessed via the internet [3]. Cloud computing also uses server virtualization, where software is installed with the objective of enabling multiple instances of virtual servers. In this way, many virtual servers can run on one physical server.

Distributed servers: these servers are located in different places, but in a cloud environment they look as if they are running next to each other. This gives the service provider more flexibility in security and options. For example, if something goes wrong at one site in the cloud system, since the solution to a problem is available on many servers, the system can still be accessed through another site.

2.4 Cloud Models

Cloud computing offers resources as a service, divided into three classes: Software as a Service, Platform as a Service and Infrastructure as a Service.

2.4.1 Software as a Service (SaaS)

Software as a Service is a means of offering an application as a service to customers, who can access it via the internet [3]. It is a way of performing tasks where the software is the service itself. One of the benefits of this type of service is that the service provider offers the software, and customers do not need to install or manage it, or buy hardware to run it, as the service provider handles everything.

The software can be accessed through a thin client interface (e.g. a web browser). In this model, service providers are able to capture data related to customer behaviour while customers access the application. As the service provider has full control over the version, compatibility is guaranteed, since all users get the same version of the software. The same applies to the infrastructure: the service provider has control over the infrastructure that runs the software, which reduces the cost of implementation and upgrades [4].

2.4.2 Platform as a Service (PaaS)

PaaS offers services with the required resources, such as developer tools, to build applications and services on top of the compute infrastructure, with no need to download or install any software [3]. These applications and services can include web service integration, database integration, storage, scalability and application design. PaaS provides everything needed to create, implement, test and host software in the cloud. PaaS is built using one or more Infrastructure as a Service (IaaS) offerings, which stay invisible to the service providers that use PaaS. The provider takes all responsibility for maintaining and controlling the underlying cloud infrastructure, while the consumer keeps full control of the application [5]. This model offers services that represent a compromise between complexity and flexibility, allowing applications to be implemented quickly and loaded into the cloud with little configuration [6].

2.4.3 Infrastructure as a Service (IaaS)

IaaS does not offer an application service to customers as SaaS and PaaS do. IaaS provides hardware on which customers can put anything they want [3]. IaaS offers access to virtual hardware resources, which include virtual machines, network and memory. It allows consumers to deploy applications more efficiently by removing the complexities related to managing their own infrastructure [5]. IaaS also allows consumers to build their own virtual clusters. This lets service providers purchase hardware resources and equipment to be shared with consumers, which can be used for anything they require.

Figure 2: Cloud Computing Service Model Diagram [26]

Figure 2 shows how cloud services work. Users interact with the cloud through the internet using laptops, mobile phones and tablets. All requests from users are processed in the cloud. The services provided in the cloud work as a pyramid, in the sense that each service is built and run on top of another. Infrastructure as a Service offers the main resources, which are used by the other services. Software as a Service, which is the application in the cloud, stays on top, as it uses resources from all the other services and runs on the resources provided by Platform as a Service. Each service essentially assists the other: a service provider developing an application to be executed in the cloud is assisted by the platform provider, which offers the tools necessary to develop and execute the application in the cloud, while the platform provider in turn receives resources from the infrastructure provider.

2.5 Types of Cloud

There are different methods of implementing applications in the cloud. Cloud computing is based on various types of clouds; the main types are private, public and hybrid.

2.5.1 Private Clouds

Private clouds are those built exclusively for a single user (e.g. a company or organisation). In a private cloud, the company that owns it has total control over the infrastructure used and the applications implemented in the cloud [7]. Normally a private cloud is built in a private datacenter. One benefit of having a private cloud is that data is not shared and can only be accessed by the organisation that owns it.

2.5.2 Public Clouds

Public clouds are those where the software of different users stays in a shared system and can be accessed by anyone in the cloud [8]. Public clouds can be bigger than private clouds, and they allow more scalability of resources. With this characteristic, public clouds reduce the need to buy additional equipment to solve temporary needs, moving the infrastructure risk to the providers of that infrastructure in the cloud, as it is their responsibility to manage software updates, security patches, etc. [9]. There is also the possibility of giving some private features to a public cloud for a single user by creating a virtual private datacenter, which gives its user greater visibility of the whole infrastructure. Public clouds are more efficient for temporary applications.

2.5.3 Hybrid Clouds

A hybrid cloud is a combination of two or more clouds that remain unique entities but are bound together, offering the benefits of each type of cloud [10]. A hybrid cloud can take features from private and public clouds. It allows a private cloud to have its resources enlarged with public cloud resources. In hybrid clouds some applications run exclusively in the public cloud, while the critical ones stay under the responsibility of the private cloud [11]. A hybrid cloud can be implemented either to attend to continuous demand or to satisfy a temporary demand. The quality of the implementation of the hybrid cloud determines its efficiency.

2.6 Cloud Applications

As cloud computing removes the need to buy expensive hardware to host large software applications, applications in the cloud are hosted in such a way that customers do not need to provide the server space; this is done by the service provider. Cloud computing has the power to host applications, and to manipulate and share data. The most common applications in the cloud are based on storage and databases [3].

Storage: cloud storage has some positive things to offer its users. When storing data in the cloud, we can access it from anywhere, and we do not need to use the same computer to access it, as it can be reached with any device with an internet connection. Storage applications in the cloud include:

- Google Apps: a service that offers applications to edit documents (Google Docs), chat (Google Talk) and email (Gmail). Every resource is managed by Google; the client only needs to set up an account.
- Amazon S3 (Simple Storage Service): the best-known cloud storage service, built to make web-scale computing easier for developers. It offers a simple web service interface that can be used to store any data, at any time, from anywhere on the internet.
- YouTube: hosts millions of video files uploaded by its users.
- Panda Cloud Antivirus: an antivirus program from Panda Software where most of the work needed to search for and remove malware is done in the cloud.

Database: a repository that stores information with links within the information, which help make the data searchable [3]. Cloud computing allows multiple applications to connect to one database running on a cluster by using shared services. These applications are isolated from each other, with explicit portions of database processing allocated to each application [12]. This service becomes Database as a Service, which avoids the complexity and cost of running our own database. Database as a Service offers some benefits: it is easy to use, there are no servers to provision and no redundant systems to worry about [3]. Database applications in the cloud include:

- SQL Server Data Services (SSDS): schema-free data storage with SOAP or REST APIs.
- SQL Azure: part of the Windows Azure platform, a set of hosted services, infrastructure, web data and services. It offers the full relational database functionality of SQL Server, but working in the cloud as a computing service, hosted in Microsoft datacenters across the world.

2.7 Virtualization

Virtualization is one of the main elements of cloud computing. Virtualization is important to cloud computing because it is a way to let users access services in the cloud. It creates a virtual environment, with virtual machines, which hides the physical characteristics of the hardware [14]. Virtualization can be done in different ways: one approach lets one server be used as many virtual servers, and another lets multiple servers be used as one virtual server. Virtualization is considered full virtualization when a complete installation of one machine can run on another [3], which is the way virtual machines run in a cloud environment. Virtual machines are also used to emulate operating systems on one platform, recreating the resources of that platform and hosting each system on virtual hardware. Paravirtualization is the technique that allows multiple operating systems to run on a single piece of hardware, sharing system resources such as processors and memory [3]. In full virtualization the entire system is emulated, but in paravirtualization the management module operates with an operating system that has been adapted to run in a virtual machine.

2.8 Data Management

One of the main reasons for adopting cloud computing is the benefit of transferring and processing large amounts of data. Data management and data processing play a big role in many cloud applications, as data stays stored in the cloud, and it is necessary to provide satisfactory service performance, expressed in terms of latency and high availability, and to meet service level agreements regardless of the quantity of data and changes in workloads.

2.8.1 Data Management Issues

Cloud computing has many advantages, but for data management, developers can come across some issues when implementing applications. When implementing an application, in many cases the developer has to deal with large sets of files; when the data is large, the developer has to distribute the data across many systems and use parallel systems, to prevent the data from being processed on a single system, which would increase the time required to process it and offer low efficiency.

2.8.2 Data Storage and Query Processing

With a significant increase in data, and in requests to extract value from that data, service providers have to manage and analyse a large amount of data if they want to offer high performance in their services and isolate data in the cloud. As the data is managed in many partitions, it becomes hard to offer transactional guarantees such as atomicity and isolation. To deal with these problems, some solutions have been developed combining techniques such as MapReduce or parallel Database Management Systems (DBMS) [15]. So the challenge is to define an architecture that focuses on the query processing mechanism and on parallel file systems, such as the Hadoop Distributed File System (HDFS), to give an architecture with different levels of isolation.

HDFS was inspired by GFS (Google File System), which offers reliable and efficient access to data using big clusters. GFS can also be used to measure the performance of a replication system on clusters. HDFS stores large files across various servers and achieves reliability by means of data replication. GFS works with three components: a master server, multiple clients and multiple chunk servers, where the chunks are stored in datacenters managed by servers [16].

MapReduce was introduced by Google. It is a framework in which each task is done with two functions: Map and Reduce. The Map function receives a set of input files and, according to the user's specification, emits a set of tuples in a dictionary (key-value) format. The Reduce function receives the set of values associated with each key and, for each key, emits a set of tuples that are stored in output files (the word-count sketch at the end of this section illustrates this contract).

This project will address issues related to data storage, concerning data processing and response time. The size of the data and the method of distribution are a challenge: people are interested in cloud computing because they can store large files, but at the same time they want to process this data quickly when they need it. Since this project will be implemented using simulators, it will be possible to design cloud environments considering the time to transfer data of large size, and to distribute data across the datacenter. Today these techniques to store and process large data sets are used by organisations such as Facebook, Google, YouTube and Amazon, in order to offer better performance in their services, storing and processing hundreds of terabytes of data every day.
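To make the Map/Reduce contract above concrete, here is the classic word-count example written against the Hadoop MapReduce Java API. It is a minimal sketch: the class names are chosen for this illustration rather than taken from the project.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit one (word, 1) key-value tuple per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // dictionary-format tuple, as described above
            }
        }
    }
}

// Reduce: receives all values associated with one key and emits the aggregated tuple,
// which the framework stores in the job's output files.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}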

2.8.3 Scalability and Consistency

Scalability has to be transparent to users, allowing them to store their data in the cloud without knowing the location of the data or the way it is accessed. But many solutions in the cloud are focused on scalability, and in general they offer weak consistency of data, which means that after the system is updated, if a user accesses the system, there is no guarantee that the returned value will also be updated. This kind of consistency does not permit the development of a wide range of applications, such as online services that cannot work with inconsistent values. Some approaches use aspects of data storage and query processing to guarantee scalability, but the best way to solve this issue is to develop solutions that combine these aspects so as to improve system performance without compromising data consistency [15].

2.8.4 Database Management Systems (DBMS) with Multi-tenants

In DBMS as a cloud service, when tenants access the system and share resources, they can affect the performance of the system. In this case, resource provisioning has to be efficient, as the workloads of a DBMS as a service can vary, with tenants accessing the system more frequently at certain moments.

2.9 Evaluation Performance Using Cloud Simulation

Cloud applications may offer many services, such as social networking, data storage, content delivery and web hosting, and the cloud provider must offer a cloud environment that responds to customers' needs. Data sets have to be evaluated and analysed by the service provider, to offer efficient access to the data sets and to replicate them on several servers. Because of this, cloud scenarios have to be evaluated, to test the replication of data on several servers, data retrieval, and effective methods to send, process and access data. To perform these tasks, simulation tools have been developed to reproduce tests in a cloud environment. These simulation tools let service providers and cloud customers test their services in an environment that allows them to repeat the tests and keep control of them. These tests let service providers determine cloud service quality and quantity, and help optimize the evaluation of their services, as a simulation test is cheaper and faster than performing the same test in a real cloud environment [17]. There are several toolkits which allow service providers and users to simulate their services or applications in a cloud environment; at the moment, the most used toolkit in this area is the CloudSim framework.
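As a concrete example of the kind of metric such simulations report, the sketch below estimates the time to transfer a file to a datacenter from its size, the available bandwidth and a fixed latency. It is purely illustrative: the method name, its parameters and the numbers are assumptions for this example, not part of any toolkit.

// Illustrative transfer-time estimate: size / bandwidth + fixed latency.
public final class TransferTimeEstimate {

    // fileSizeMB: file size in megabytes; bandwidthMBps: effective bandwidth in
    // megabytes per second; latencySec: fixed network latency in seconds.
    static double transferTimeSec(double fileSizeMB, double bandwidthMBps, double latencySec) {
        return fileSizeMB / bandwidthMBps + latencySec;
    }

    public static void main(String[] args) {
        // A 2 TB (2,097,152 MB) dataset over a 100 MB/s link with 0.1 s latency:
        double t = transferTimeSec(2_097_152, 100.0, 0.1);
        // Prints roughly 20971.6 s, i.e. about 5.8 hours.
        System.out.printf("Estimated transfer time: %.1f s (~%.1f h)%n", t, t / 3600.0);
    }
}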

iCanCloud is a platform that models and simulates cloud computing systems; it predicts trade-offs between the performance and cost of a set of applications executed on given hardware, and provides information about both. The GreenCloud simulator is an extension of the network simulator NS-2 which simulates cloud environments. GreenCloud provides users with a detailed model of the energy consumed by the elements of a datacenter, such as switches and servers. CloudSim is a simulation toolkit that reports time, power and traffic consumption. It is based on the Java platform, building on modules already developed such as SimJava and GridSim [18]. The HDFS simulator is a simulator designed in Java; HDFS itself is a distributed file system designed to run on commodity hardware, and it is highly fault-tolerant. Hadoop is a Java-based framework with two subsystems: the Hadoop engine that executes MapReduce applications, and HDFS, which handles data management and access.

As explained in Chapter 1, CloudSim and the HDFS simulator are the ones used in this project. This project focuses on data transfer, and CloudSim offers the tools necessary to simulate users sending data to the cloud and to calculate the time taken to transfer and store data. The HDFS simulator was designed to simulate cloud systems dealing with large data. As the second part of the project is related to data processing, and the HDFS simulator can help developers understand and implement environments that store and process large data files, distributing data across the datacenter while avoiding failures, it was used as the simulator framework to simulate data processing.

2.10 Data Transfer

2.10.1 CloudSim

CloudSim is a simulation framework that allows the modelling of cloud experiments using simulated infrastructures and application services [17]. As a framework, CloudSim offers support to model and simulate large-scale cloud nodes, and a platform to model datacenters, service brokers, scheduling and allocation policies. CloudSim also has some notable features: a virtualization engine, which handles the creation and management of multiple, independent, co-hosted virtualized services on a datacenter node; and the flexibility to switch between space-shared and time-shared allocation of processing cores to virtualized services [17]. These features can speed up the development of algorithms, protocols and implementation approaches in cloud computing.

2.10.2 CloudSim Architecture

Figure 3: CloudSim Architecture [17]

Figure 3 shows the CloudSim architecture and the layers it uses. In the bottom layer there is a simulation engine, which is responsible for the operations that create, manage and delete simulation entities [17]. The next layer holds the main classes used to implement the framework, and is composed of different modules. The network module maps client requests to datacenters and calculates the possible delay of messages between datacenters and clients. The cloud resources module manipulates and coordinates simulation events, and also manages data related to the infrastructure provided by the simulated datacenter. The cloud services module covers virtual machine provisioning and the allocation of resources such as system memory, data storage and communication bandwidth [23]. Above that are the virtual machine services, which manage and execute the cloudlets (tasks) sent by clients.

Just above, there are the user interface structures, where communication between entities occurs; through this interface, virtual machines and cloudlets can be manipulated. The top layer represents the code that the user of the framework implements to create the simulation environment. The scheduling policy covers the creation of decision policies and schedulers, which guide the simulation process [23]. It also uses decision-making entities, called brokers; CloudSim additionally allows the implementation of allocation policies for virtual machines between hosts of the same datacenter, of virtual machine schedulers on hosts, and of cloudlet schedulers on virtual machines.

2.10.3 CloudSim Usability

When using CloudSim, users need a background in Java, as the toolkit is entirely written in Java. With knowledge of Java, users can write code using elements of the CloudSim library to develop experiments, simulating the desired scenario. Using CloudSim is not only about writing code, changing parameters, running the program and collecting the results; it requires an understanding of how the simulator works.

2.10.4 CloudSim Capabilities

CloudSim comes with its source code, which can be extended to make the simulator suit the problem the user wants to solve, so users are free to make their own changes to the source code, adding classes to make CloudSim fit a specific scenario. CloudSim is flexible, and requires less time and effort to perform tests and simulations in the cloud. It can simulate small-scale scenarios as well as large-scale scenarios with many datacenters, with little cost in initialisation time and memory consumption. It also uses virtualisation to create many virtual services, where each service is managed on a node of a datacenter.

2.10.5 CloudSim Limitations

CloudSim is a very good tool, with powerful functions to help users simulate cloud computing environments, but it is not a toolkit that can be used by only setting parameters: users need to write Java code to get access to its library. It also does not support every cloud scenario out of the box, which requires extensions as discussed earlier. A user who is not familiar with the Java language will not be able to use the simulator directly; because of that, CloudReports, an extension of CloudSim, was developed to let users run CloudSim simulations through a simple graphical interface.
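To illustrate how these entities fit together in code, the following is a minimal sketch of a CloudSim experiment written against the CloudSim 3.x Java API: one datacenter with a single host, one broker, one virtual machine and one cloudlet. The MIPS, RAM, bandwidth and cost figures are arbitrary placeholders chosen for the example, not values from this project's experiments.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;
import org.cloudbus.cloudsim.*;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class MinimalCloudSimExperiment {
    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false); // one cloud user, no event trace

        // One host: a single 1000-MIPS core, 2 GB RAM, placeholder bandwidth and storage.
        List<Pe> peList = Arrays.asList(new Pe(0, new PeProvisionerSimple(1000)));
        List<Host> hostList = new ArrayList<>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048),
                new BwProvisionerSimple(1_000_000), 1_000_000, peList,
                new VmSchedulerTimeShared(peList)));

        // Datacenter characteristics: architecture, OS, VMM, time zone and cost figures.
        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);
        new Datacenter("Datacenter_0", characteristics,
                new VmAllocationPolicySimple(hostList), new LinkedList<Storage>(), 0);

        // The broker acts on behalf of the user, submitting one VM and one cloudlet (task).
        DatacenterBroker broker = new DatacenterBroker("Broker_0");
        Vm vm = new Vm(0, broker.getId(), 1000, 1, 512, 1000, 10_000, "Xen",
                new CloudletSchedulerTimeShared());
        Cloudlet cloudlet = new Cloudlet(0, 400_000, 1, 300, 300,
                new UtilizationModelFull(), new UtilizationModelFull(),
                new UtilizationModelFull());
        cloudlet.setUserId(broker.getId());
        broker.submitVmList(Arrays.asList(vm));
        broker.submitCloudletList(Arrays.asList(cloudlet));

        CloudSim.startSimulation();
        CloudSim.stopSimulation();

        // The broker collects completed cloudlets with their simulated finish times.
        for (Cloudlet c : broker.getCloudletReceivedList()) {
            System.out.printf("Cloudlet %d finished at %.2f s%n",
                    c.getCloudletId(), c.getFinishTime());
        }
    }
}

Scenarios are varied by changing these parameters, for example swapping CloudletSchedulerTimeShared for CloudletSchedulerSpaceShared, or adding more hosts, VMs and cloudlets, which is the approach taken in the experiments of Chapter 4.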

2.11 Data Processing

2.11.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS), like every distributed file system, was designed to run across many machines, allowing users to store and process data on many different pieces of hardware, normally connected by a Local Area Network (LAN). But HDFS differs from typical distributed file systems in that it is highly fault-tolerant and can be deployed on hardware which is not very powerful [19]. There are several other implementations of distributed file systems, such as the GNU Cluster File System (GlusterFS), the Moose File System (MooseFS), the General Parallel File System (GPFS), OpenAFS, the Network File System and the Google File System (GFS).

HDFS is an open source framework written in Java that uses a master-slave architecture. An HDFS cluster has a primary NameNode, which is used as the master server and is responsible for managing the file system namespace; in addition, it regulates access to data by clients [20]. The NameNode oversees a number of DataNodes, normally with a one-to-one relationship between a machine and a DataNode. A DataNode manages the storage attached to the machine it runs on. The file system namespace is used in HDFS to allow data to be stored in files [20]. Each file is divided into blocks, which are distributed across DataNodes, supporting parallel processing. Blocks are replicated on many DataNodes, which prevents processing from stopping if a node failure occurs [21]. The number of nodes used in HDFS is proportional to the probability that one of these nodes fails: the more nodes it has, the higher the chance that some node fails. To protect the system against failure, DataNodes receive copies of blocks. HDFS keeps three replicas of each block: two of these replicas go onto nodes sharing the same rack, and the other goes onto a node on a different rack.

2.11.2 HDFS Goals

a) Hardware Failure

HDFS consists of many server machines, each storing part of the file system's data. Given the huge number of elements (machines) in HDFS, and the fact that each element has a non-trivial probability of failure, some elements of HDFS will always be non-functional [21]. One of the HDFS goals is therefore a core architecture with automatic recovery, able to detect faults quickly.

b) Streaming Data Access

Applications running on HDFS require streaming access to their data sets. An application running on the system has a specific purpose and runs against a specific file system. HDFS follows a design that does not focus on interactive use but on batch processing: it gives less priority to low latency and more to high-throughput data access.

c) Large Data Sets

HDFS was designed to support large data sets. A file in HDFS is typically big, from gigabytes to terabytes in size. To support data of this size, HDFS offers high aggregate data bandwidth and is able to scale to hundreds of nodes in a single cluster. HDFS has the ability to support millions of files in one instance [21]. HDFS is also a good solution for people working with blogs, where they have to deal with large data without knowing in advance how the data will be used; with unstructured files, HDFS can help, as it allows them to store data even if it is not well structured, and to process the data where it is stored. Google is a real example of a system dealing with large data sets, as it stores hundreds of terabytes a day.

2.11.3 Block Division

Some files are too big to be stored on a single hard drive. A way out of this problem is to divide files and distribute them across many machines. The file distribution is done implicitly: the developer using HDFS only has to set the correct configuration parameters. Before storing files, HDFS adopts a strategy where the files are divided into a sequence of blocks of equal size. The default size is 64 megabytes, which can be changed if necessary. This is far bigger than the block size used by conventional file systems, which commonly use blocks of around 512 bytes. After splitting the files, HDFS starts the distribution, placing blocks on different nodes. If the data addressed to a block is not big enough to fill the space reserved for it, the rest of the space is not wasted; it can be used for other data.
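A small sketch of the block arithmetic described above, assuming the 64 MB default block size; the class and the 1 GB file size are invented for the illustration:

// Illustrative block-division arithmetic for the default 64 MB HDFS block size.
public final class BlockDivision {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB; configurable in HDFS

    public static void main(String[] args) {
        long fileSize = 1_000_000_000L; // a 1 GB file
        long fullBlocks = fileSize / BLOCK_SIZE;
        long lastBlock = fileSize % BLOCK_SIZE; // a short final block wastes no disk space
        // Prints: 14 full blocks + 1 block of 60475904 bytes
        System.out.printf("%d full blocks + 1 block of %d bytes%n", fullBlocks, lastBlock);
    }
}

Each of these blocks, rather than the whole file, is then the unit that HDFS places, replicates and processes on different nodes.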

2.11.4 HDFS Architecture

An HDFS cluster consists of one NameNode, the master server, usually deployed on one dedicated node, the node with the best performance. It is responsible for managing the file system namespace and regulating how files are accessed by clients. There are many DataNodes, normally one DataNode per node in the cluster, which manage the storage attached to the nodes they run on. HDFS allows user data to be stored in files. It then performs block division, storing the blocks on a set of DataNodes. The file system namespace operations, such as renaming files and opening and closing files and directories, are executed by the NameNode [21]. Serving read and write requests from the client's file system is the responsibility of the DataNodes, which also perform block creation, replication and deletion, as instructed by the NameNode. The first communication between the master node and a slave node occurs when the DataNode registers with the NameNode. The DataNode then communicates with the NameNode, sending information about the blocks it stores, as well as about its local changes. This communication between the NameNode and the DataNodes is crucial in helping the NameNode decide which nodes will store a given block. If the NameNode is unable to receive information from a DataNode, that DataNode is asked to register again.

Figure 4: HDFS Architecture [21]

Figure 4 shows the architecture of HDFS. The NameNode and DataNodes are pieces of software that run on machines, usually machines running the GNU/Linux operating system. As HDFS is written in Java, it can be deployed on any machine that supports Java, and such machines are therefore able to run the NameNode or a DataNode. A typical HDFS deployment has one machine dedicated to running the NameNode, while the other machines run DataNodes, normally one DataNode per machine. The NameNode also acts as the arbiter for all metadata. HDFS is designed with an architecture in which user data never flows through the NameNode [21].

2.11.5 Data Replication

As HDFS splits files into blocks, it also replicates these blocks to increase the security and durability of the file. By default HDFS keeps three replicas, allocated on different nodes. As communication between machines on the same rack is faster than between machines on different racks, when selecting a replica for a process, HDFS gives priority to replicas belonging to the same rack [22]. One of the biggest benefits of replication is that the system gains high fault-tolerance and reliability: if a node fails, the process will be executed by another machine which contains a replica of the block, without any need to transfer data or interrupt the application's execution. All of this is done transparently, as Hadoop offers mechanisms to restart the process without anyone noticing that a node failed during execution [22]. When there is a fault, the number of replicas of a block decreases; to restore reliability, the NameNode consults its metadata about DataNode faults and restarts the replication process on other DataNodes.

2.12 Related Work

Through the literature review I examined some work using experiments in CloudSim, evaluating simulations under space-shared and time-shared allocation. The authors sent sets of tasks to virtual machines, in groups of 50 [23]. They collected the results and found that in the experiments using space-shared allocation, every task completed after 20 minutes, as the number of tasks had no effect on the execution time of a single task. But the execution time of a single task in the experiments using time-shared virtual machines was affected as the number of tasks submitted to the virtual machines increased. The first set of 50 tasks completed before the other ones: at the start of execution the hosts were not overloaded, which allowed them to execute tasks quickly, and as more tasks completed, hosts became available for further tasks [23].

Another piece of work, done with the HDFS simulator, used an experiment to check whether the number of replicas affects system performance. The authors used a method to calculate the mean time taken to repair the system when failures occur. They ran one experiment replicating each block three times, and another with 8 replicas per block. They noticed that as the replication level increases, the expected time to repair the system increases too, because a higher replication level means more blocks per node [24]. If a node fails, the system has to replicate more blocks than it would at a lower replication level.

2.13 Summary

In this chapter I went through some topics related to cloud computing and its architecture, as well as some applications developed and used around the world. The next chapter will cover how this literature review is used and how the experiments in this project are designed, including the use of HDFS and CloudSim for data management.

Chapter 3 Design

3.1 Introduction

With the use of cloud computing, people constantly need to find ways to deal with large data files. Many organisations collect billions of bytes of data every day and need to manage them to satisfy their users' demands. Recently, with the increase in people using cloud computing, some companies want to get the benefit of adopting cloud technologies by transferring their data to the cloud; they then also want to process this data in the cloud. One of the biggest interests of someone sending data to the cloud is the need to have large data files stored and to be able to access them anywhere. So people decide not just to transfer data to the cloud, but also to have their data processed there. One of the big issues when transferring data to the cloud is the time to transfer the data and the time to process it, as these days many requests demand tasks to be done in as short a time as possible.

3.2 Methodology

The methods and data used to design this project were mostly acquired through web sites and a few books. All the material was analysed and then studied in depth, in order to progress with the project, choosing the material best suited to the problem and most helpful in reaching a solution. The experiments in the project were designed by relating the background research to the scenario needed to solve the problem presented in the project. To test the performance of the experiments, two simulators were used: one to check transfer time, and another for data processing time.

3.3 Example Problem

In recent years, PC makers have increased PC storage capacity from bytes to terabytes. This development has made users' lives easier, as they can now benefit from more storage capacity to save their files. So the biggest problem is not the size of the data to store, but the quantity of data that can be processed by a system. Let us suppose we have a company that offers a web service whose systems are accessed daily by thousands of users, sending, reading and writing data in short slots of time. To deal with this, the organisation needs to implement a system architecture with methods capable of classifying data, in order to enable the system to process data in a fast and efficient way. Many companies use techniques based on logging activity, writing users' requests to log files to record the different tasks performed by a single user. This technique can cause serious problems if used in the wrong way: an organisation can accumulate a lot of data in a short time and become unable to process or organize it. In this way data is wasted, losing the opportunities this data would give to improve the performance of the system. Moreover, if it is all done serially, the data is processed on an individual system, leaving only one point to deal with possible failures in the system.

3.4 Solution

To solve the problem presented above, the best approach is to develop distributed parallel environments, which provide high performance compared with systems using serial environments. With parallel computation, an application can be executed concurrently on different elements of a machine, whether that machine is distributed or multi-processor. With this technique it is possible to get a system with many points dealing with failures, which increases the system's fault-tolerance. For such problems, many concepts and strategies have been developed to offer simple and efficient solutions using parallel computation techniques, making it possible to create a robust system able to attend to users' demands. The first idea is to implement a system able to process data in a short time, with many points dealing with possible failures, and also capable of running on low-cost machines. This is where HDFS comes in, as it is designed to process large data files in a short time, offering a system that is highly fault-tolerant.
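As a toy illustration of the serial-versus-parallel contrast described above, the following sketch counts request types in a set of log records, first with a single thread and then concurrently. It runs on one machine with Java streams rather than on a real cluster, and the log format is invented for the example; in the actual design, this role is played by HDFS and MapReduce.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy contrast between serial and parallel processing of log records.
public class LogProcessing {
    public static void main(String[] args) {
        List<String> logLines = List.of(
                "user1 READ /reports/q1.pdf",
                "user2 WRITE /data/upload.bin",
                "user1 READ /reports/q2.pdf");

        // Serial: one thread walks every record in order; one point of failure.
        Map<String, Long> serial = logLines.stream()
                .collect(Collectors.groupingBy(l -> l.split(" ")[1], Collectors.counting()));

        // Parallel: the same computation split across worker threads, analogous to
        // distributing blocks of a log file across the nodes of a cluster.
        Map<String, Long> parallel = logLines.parallelStream()
                .collect(Collectors.groupingBy(l -> l.split(" ")[1], Collectors.counting()));

        System.out.println(serial);   // {READ=2, WRITE=1}
        System.out.println(parallel); // same result, computed concurrently
    }
}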


More information

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies

More information

www.basho.com Technical Overview Simple, Scalable, Object Storage Software

www.basho.com Technical Overview Simple, Scalable, Object Storage Software www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...

More information

Cloud computing an insight

Cloud computing an insight Cloud computing an insight Overview IT infrastructure is changing according the fast-paced world s needs. People in the world want to stay connected with Work / Family-Friends. The data needs to be available

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

HDFS Users Guide. Table of contents

HDFS Users Guide. Table of contents Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9

More information

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India rlrooparl@gmail.com 2 PDA College of Engineering, Gulbarga, Karnataka,

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

CLOUD COMPUTING USING HADOOP TECHNOLOGY

CLOUD COMPUTING USING HADOOP TECHNOLOGY CLOUD COMPUTING USING HADOOP TECHNOLOGY DHIRAJLAL GANDHI COLLEGE OF TECHNOLOGY SALEM B.NARENDRA PRASATH S.PRAVEEN KUMAR 3 rd year CSE Department, 3 rd year CSE Department, Email:narendren.jbk@gmail.com

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE ANALYSIS USING RETAIL APPLICATION TEST DATA)

CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE ANALYSIS USING RETAIL APPLICATION TEST DATA) CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE ANALYSIS USING RETAIL APPLICATION TEST DATA) Abhijeet Padwal Performance engineering group Persistent Systems, Pune email: abhijeet_padwal@persistent.co.in

More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Hadoop Distributed File System (HDFS) Overview

Hadoop Distributed File System (HDFS) Overview 2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

WINDOWS AZURE DATA MANAGEMENT

WINDOWS AZURE DATA MANAGEMENT David Chappell October 2012 WINDOWS AZURE DATA MANAGEMENT CHOOSING THE RIGHT TECHNOLOGY Sponsored by Microsoft Corporation Copyright 2012 Chappell & Associates Contents Windows Azure Data Management: A

More information

Big Data on Cloud Computing- Security Issues

Big Data on Cloud Computing- Security Issues Big Data on Cloud Computing- Security Issues K Subashini, K Srivaishnavi UG Student, Department of CSE, University College of Engineering, Kanchipuram, Tamilnadu, India ABSTRACT: Cloud computing is now

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

yvette@yvetteagostini.it yvette@yvetteagostini.it

yvette@yvetteagostini.it yvette@yvetteagostini.it 1 The following is merely a collection of notes taken during works, study and just-for-fun activities No copyright infringements intended: all sources are duly listed at the end of the document This work

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Storage Architectures for Big Data in the Cloud

Storage Architectures for Big Data in the Cloud Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas

More information

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research Introduction to Cloud : Cloud and Cloud Storage Lecture 2 Dr. Dalit Naor IBM Haifa Research Storage Systems 1 Advanced Topics in Storage Systems for Big Data - Spring 2014, Tel-Aviv University http://www.eng.tau.ac.il/semcom

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi Cloud Platforms, Challenges & Hadoop Aditee Rele Karpagam Venkataraman Janani Ravi Cloud Platform Models Aditee Rele Microsoft Corporation Dec 8, 2010 IT CAPACITY Provisioning IT Capacity Under-supply

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

INTRODUCTION TO CASSANDRA

INTRODUCTION TO CASSANDRA INTRODUCTION TO CASSANDRA This ebook provides a high level overview of Cassandra and describes some of its key strengths and applications. WHAT IS CASSANDRA? Apache Cassandra is a high performance, open

More information

Microsoft Azure Data Technologies: An Overview

Microsoft Azure Data Technologies: An Overview David Chappell Microsoft Azure Data Technologies: An Overview Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Blobs... 3 Running a DBMS in a Virtual Machine... 4 SQL Database...

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Assignment # 1 (Cloud Computing Security)

Assignment # 1 (Cloud Computing Security) Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Introduction to Cloud Services

Introduction to Cloud Services Introduction to Cloud Services (brought to you by www.rmroberts.com) Cloud computing concept is not as new as you might think, and it has actually been around for many years, even before the term cloud

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Research Paper Available online at: www.ijarcsse.com A COMPARATIVE STUDY OF CLOUD COMPUTING SERVICE PROVIDERS

Research Paper Available online at: www.ijarcsse.com A COMPARATIVE STUDY OF CLOUD COMPUTING SERVICE PROVIDERS Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A COMPARATIVE STUDY OF CLOUD

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Multilevel Communication Aware Approach for Load Balancing

Multilevel Communication Aware Approach for Load Balancing Multilevel Communication Aware Approach for Load Balancing 1 Dipti Patel, 2 Ashil Patel Department of Information Technology, L.D. College of Engineering, Gujarat Technological University, Ahmedabad 1

More information

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study DISTRIBUTED SYSTEMS AND CLOUD COMPUTING A Comparative Study Geographically distributed resources, such as storage devices, data sources, and computing power, are interconnected as a single, unified resource

More information

IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY

IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,

More information

Relocating Windows Server 2003 Workloads

Relocating Windows Server 2003 Workloads Relocating Windows Server 2003 Workloads An Opportunity to Optimize From Complex Change to an Opportunity to Optimize There is much you need to know before you upgrade to a new server platform, and time

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts Part V Applications Cloud Computing: General concepts Copyright K.Goseva 2010 CS 736 Software Performance Engineering Slide 1 What is cloud computing? SaaS: Software as a Service Cloud: Datacenters hardware

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Distributed Systems. Cloud & the Internet of Things. Björn Franke University of Edinburgh, 2015

Distributed Systems. Cloud & the Internet of Things. Björn Franke University of Edinburgh, 2015 Distributed Systems Cloud & the Internet of Things Björn Franke University of Edinburgh, 2015 OVERVIEW Cloud Computing vs. Distributed Computing Examples SAAS, PAAS, IAAS Goals, Types, Characteristics

More information

Security Benefits of Cloud Computing

Security Benefits of Cloud Computing Security Benefits of Cloud Computing FELICIAN ALECU Economy Informatics Department Academy of Economic Studies Bucharest ROMANIA e-mail: alecu.felician@ie.ase.ro Abstract: The nature of the Internet is

More information

Everything You Need To Know About Cloud Computing

Everything You Need To Know About Cloud Computing Everything You Need To Know About Cloud Computing What Every Business Owner Should Consider When Choosing Cloud Hosted Versus Internally Hosted Software 1 INTRODUCTION Cloud computing is the current information

More information

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary Contents Introduction What is the Cloud? How does it work? Types of Cloud Service Cloud Service Providers Summary Introduction The CLOUD! It seems to be everywhere these days; you can t get away from it!

More information

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline References Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Introduction to Database Systems CSE 444

Introduction to Database Systems CSE 444 Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon References Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

CHAPTER 2 THEORETICAL FOUNDATION

CHAPTER 2 THEORETICAL FOUNDATION CHAPTER 2 THEORETICAL FOUNDATION 2.1 Theoretical Foundation Cloud computing has become the recent trends in nowadays computing technology world. In order to understand the concept of cloud, people should

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Cloud computing - Architecting in the cloud

Cloud computing - Architecting in the cloud Cloud computing - Architecting in the cloud anna.ruokonen@tut.fi 1 Outline Cloud computing What is? Levels of cloud computing: IaaS, PaaS, SaaS Moving to the cloud? Architecting in the cloud Best practices

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information