
Summary

Cloud computing has become one of the key terms in the IT industry. The cloud represents the internet, or an infrastructure for communication between all components, providing and receiving services through the internet. Cloud services offer many benefits to clients, giving them the possibility to store and process data using a scalable architecture that provides almost infinite storage capacity. With the amount of data being moved to the cloud constantly increasing, cloud services deal with millions of files every day, storing and processing them according to client requests. Facebook is a real example of this type of service provided in the cloud, where data ranges from terabytes to petabytes and needs to be stored and processed in a timely manner. This increase in data being transferred to the cloud intensifies the need to develop mechanisms and solutions for data management, to improve the performance of cloud services when attending to requests to store and process large data sets. This project aims to give an understanding of data management in the cloud, evaluating solutions and mechanisms to speed up data transfer and data processing. Several scenarios will be simulated using cloud simulators, with a number of experiments conducted to evaluate techniques to reduce the time taken to transfer and process data.

Acknowledgements

I would like to thank my family for giving me support to study and for believing in my dreams. I would like to thank my tutor Nick for all the support given over the last three years, and my supervisor for his interest in this project, for scheduling weekly meetings, and for the feedback given during the progress of the project.

Contents

1. Introduction
   1.1 Introduction
   1.2 Why Cloud Simulation
   1.3 Why Data Storage
   1.4 Project Problem
   1.5 Objectives
   1.6 Minimum Requirements
   1.7 Project Plan
       1.7.1 Milestones
   1.8 Methodology
   1.9 Summary
2. Background Research
   2.1 Introduction
   2.2 Cloud Computing
   2.3 Cloud Features
   2.4 Cloud Models
       2.4.1 Software as a Service
       2.4.2 Platform as a Service
       2.4.3 Infrastructure as a Service
   2.5 Types of Cloud
       2.5.1 Private Clouds
       2.5.2 Public Clouds
       2.5.3 Hybrid Clouds
   2.6 Cloud Applications
   2.7 Virtualization
   2.8 Data Management
       2.8.1 Data Management Issues
       2.8.2 Data Storage and Query Processing
       2.8.3 Scalability and Consistency
       2.8.4 Database Management Systems with Multi-tenants
   2.9 Evaluation Performance Using Cloud Simulation
   2.10 Data Transfer
       2.10.1 CloudSim
       2.10.2 CloudSim Architecture
       2.10.3 CloudSim Usability
       2.10.4 CloudSim Capabilities
       2.10.5 CloudSim Limitations
   2.11 Data Processing
       2.11.1 Hadoop Distributed File System
       2.11.2 HDFS Goals: a) Hardware Failure, b) Streaming Data Access, c) Large Data Sets
       2.11.3 Block Division
       2.11.4 HDFS Architecture
       2.11.5 Data Replication
   2.12 Related Work
   2.13 Summary
3. Design
   3.1 Introduction
   3.2 Methodology
   3.3 Example Problem
   3.4 Solution
   3.5 Data Transfer: Application of CloudSim
   3.6 Data Processing Performance
   3.7 Summary
4. Implementation
   4.1 Introduction
   4.2 Experiments
       4.2.1 CloudSim Experiments (Scenarios 1-4: Objectives and Design, Results)
       4.2.2 HDFS Experiments (Scenarios 1-3: Objectives and Design, Results)
5. Evaluation
   5.1 Introduction
   5.2 Achieving Minimum Requirements
   5.3 Evaluation of Project Methodology
   5.4 Evaluation of Project Experiments
       5.4.1 Evaluation of CloudSim Results
       5.4.2 Evaluation of HDFS Results
   5.5 Future Work and Possible Extensions
   5.6 Conclusion

Chapter 1 Introduction

1.1 Introduction

The aim of the project is to evaluate mechanisms for data management in the cloud using cloud simulation. Data management in the cloud has become important, as cloud computing offers storage as a service, where clients can move their data to the cloud, gaining the benefit of accessing their data anywhere along with the greater storage capacity offered by their cloud provider. This project will focus on data storage, and show how cloud simulation can help us get a better understanding of storing data as well as processing data in the cloud. The project will involve cloud simulation experiments, to check the performance of transferring and processing data in the cloud.

1.2 Why Cloud Simulation?

In a cloud environment, some services or applications have to be tested before being provided to customers, in order to know how the service or application will behave while in use. Cloud simulation helps to develop experiments that reproduce cloud environments, so most users can easily become familiar with it. It allows them to perform tests on their services, with the power to repeat these tests and control the cloud environment, at no cost to them, and they get an idea of the service's performance before introducing it to the real cloud environment.

1.3 Why Data Storage?

Cloud computing offers many storage services, but one of the biggest concerns is the possibility of data from one client being mixed with data from other clients. Data storage has therefore been a challenge for providers, who must manage and extract data in the cloud. It is important to isolate data belonging to a single client from the others. A good and fast way of accessing data is also necessary, in order to make it easier for users to get their files in the cloud and to reduce the level of data loss. So data management is an important factor for service providers in the cloud, to achieve success when delivering services to clients.

For example, Google offers the Google Cloud Storage service, which allows people to access, store and protect data files. It lets us manage our data on Google's reliable infrastructure, which is scalable and efficient, giving robust storage with quick and easy access. SkyDrive is a cloud storage service from Microsoft that lets us store data in the cloud and offers a set of tools to manage data, such as Word, Excel and PowerPoint. Amazon offers Amazon S3, which has an interface that allows data to be stored and retrieved at any time, from anywhere. It gives access to the scalable, inexpensive and fast infrastructure that Amazon runs on its global network of web sites.

Figure 1: Data storage in cloud [25]

1.4 Project Problem

As the quantity of data that clients want to move to the cloud increases every day, and with it the need for cloud providers to store and process data in a timely manner, it is necessary to use mechanisms to measure the performance of the applications used to store and process data, in terms of the time it takes to complete clients' requests. Using cloud simulation it is possible to generate a set of metrics to evaluate the performance of storing data while it is transferred to the cloud, and of processing clients' data once it is already in the cloud. By analysing the results obtained from simulations using these metrics, it is possible to review the pros and cons of an implementation in a specific scenario to store and process data, and therefore to identify the resources needed to improve a system to offer better performance for storing and processing data in the cloud.

1.5 Objectives

The objectives of the project are:

- Explore cloud computing and its architecture
- Understand data management in the cloud
- Investigate data storage in the cloud
- Understand simulation in the cloud and explore cloud simulators, using experiments to simulate cloud scenarios
- Explore data management issues and their solutions

1.6 Minimum Requirements

The minimum requirements are:

- Investigate cloud simulator architecture (CloudSim and the HDFS simulator)
- Identify the cloud services and cloud resources that are key for the implementation of a cloud data management scenario
- Understand communication between entities in CloudSim and the HDFS simulator
- Implement a simulation experiment considering a scenario for data management in the cloud
- Implement a working cloud simulation experiment for data storage in the cloud

1.7 Project Plan

The project was divided into six stages, setting a number of tasks for each stage, in order to keep the project on schedule. A Gantt chart was produced to set the time period for each stage. In the first stage, a literature review was done, covering theoretical concepts of cloud computing and its architecture, which helped to achieve the goals of the other stages of the project. The second stage includes an analysis of the issues with data management in the cloud, and solution approaches to implement methods improving data management in the cloud. The third stage covers planning the project structure: which data management problem will be covered, and selecting the solution approach and the simulators to evaluate the solution. Stage four includes the design of scenarios to evaluate the solution, and the implementation of these scenarios using the selected simulators. In stage five, tests are performed on each scenario; results are collected and discussed. Finally, in stage six, an evaluation of the solution is performed, with individual and comparative analysis of the simulated scenarios.

1.7.1 Milestones

To track the project's progress, end points for each stage have been set in the project plan, to ensure that each stage has been completed:

A) Literature review (25/02/13)
B) Analysis of the problem (01/03/13)
C) Plan the structure of the project (08/03/13)
D) Design and implementation (19/04/13)
E) Tests (19/04/13)
F) Evaluation and writing up (05/05/13)

1.8 Methodology

To achieve the minimum requirements of the project, research on cloud computing must be completed. The research will include information on cloud computing architecture, cloud simulation features and architecture, as well as a brief review of existing applications in cloud computing. Data management issues will be presented in this project and some solutions will be proposed to minimise these issues.

Performance evaluation in the cloud can be done in three ways: through direct experiments, which involve designing a real cloud environment to provide services; mathematical modelling, where the evaluation is modelled with mathematical algorithms and formulas such as equations; and cloud simulation, which allows users to perform experiments simulating cloud environments to test cloud services. Simulation was used as the methodology for this project, as it offers the tools necessary to simulate scenarios and calculates time, energy consumption, data cost and data processing results for each simulation experiment. These tools allow users to design experiments, setting parameters according to the experiment's objective, then run them and show the results of each experiment. There are several cloud simulators that can help users develop experiments to simulate scenarios, for a better understanding of how cloud services work, and how to implement new

solutions for problems in cloud services. CloudSim and the HDFS (Hadoop Distributed File System) simulator will be used as the framework methodology for this project, as they offer the tools needed to simulate data transfer and calculate the time to transfer data, and also to simulate data processing for large files. Simulation experiments will be performed for a better understanding of data transfer and data processing in the cloud.

1.9 Summary

The project aim and minimum requirements have been stated, and the project methodology has been explained. The next chapter will focus on background research into cloud computing, its architecture and existing applications.

[Gantt chart: project schedule from 21/01/13 to 08/05/13, covering the tasks Literature Review, Problem Analysis, Project Plan, Design, Implementation, Tests, Evaluation and Project Report.]

Chapter 2 Background Research

2.1 Introduction

The objective of this chapter is to go through the concepts of cloud computing and the design and architecture of CloudSim and the HDFS simulator.

2.2 Cloud Computing

Cloud computing is computation performed on distributed servers, where data is stored and processed without users having knowledge of the location of the data. Cloud computing as a system involves features such as scalability, reliability, transparency and redundancy. Cloud computing can be defined as the set of hardware, networks, storage, services, and interfaces that combine to deliver aspects of computing as a service [1]. NIST (the National Institute of Standards and Technology) defines cloud computing as: "A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction." [2]

Cloud computing can be considered the next step in the evolution of the internet, as it provides the means necessary for everything (from applications, computing power, computing infrastructure, storage and business processes to personal collaboration software) to be delivered to clients as a service, wherever and whenever the service is required, over the internet. With these characteristics, cloud computing becomes both flexible, because many machines work together, and centralized, constituting a single system in the cloud that provides services to customers through the internet.

2.3 Cloud Features

A cloud computing service includes several elements: clients, the datacenter and distributed servers.

Clients: a client in cloud computing is anything connected to a Local Area Network (LAN), requesting and receiving services; in other words, clients are the devices that end users interact with to manage their information in the cloud. Clients can be divided into three types [3]:

1. Mobile - mobile devices such as smartphones, tablets and laptops.
2. Thin - devices without internal hard drives, where all the work is done by the servers and the devices only display the results (data).
3. Thick - devices that connect to the cloud (internet) using web browsers.

Datacenter: a set of servers on which the files and applications are stored. It can be a room full of servers that run user requests and are accessed via the internet [3]. Cloud computing also uses server virtualization, where software is installed with the objective of enabling multiple instances of virtual servers. In this way, many virtual servers can run on one physical server.

Distributed servers: these servers are located in different places, but in a cloud environment they look as if they are running next to each other. This gives the service provider more flexibility in security and options. For example, if something goes wrong at one site in the cloud system, since the solution to a problem is available on many servers, the system can still be accessed through another site.

2.4 Cloud Models

Cloud computing offers resources as a service, divided into three classes: Software as a Service, Platform as a Service and Infrastructure as a Service.

2.4.1 Software as a Service (SaaS)

Software as a Service is a means of offering an application as a service to customers, who can access it via the internet [3]. It is a way of performing tasks where the software is the service itself. One of the benefits of this type of service is that the service provider offers the software, and customers do not need to install or manage it, or buy hardware to run it, as the service provider handles everything.

The software can be accessed through a thin client interface (e.g. a web browser). In this model, service providers are able to capture data related to customer behaviour while customers access the application. As the service provider has full control over the version, compatibility is guaranteed, since all users get the same version of the software. The same applies to the infrastructure: the service provider has control over the infrastructure that runs the software, which reduces the cost of implementation and upgrades [4].

2.4.2 Platform as a Service (PaaS)

PaaS offers services with the required resources, such as developer tools, to build applications and services on top of the compute infrastructure, with no need to download or install any software [3]. These applications and services can include web service integration, database integration, storage, scalability and application design. PaaS provides everything needed to create, implement, test and host software in the cloud. PaaS is built using one or more Infrastructure as a Service (IaaS) offerings, which stay invisible to the service providers that use PaaS. The provider takes all responsibility for maintaining and controlling the underlying cloud infrastructure, while the consumer keeps full control of the application [5]. This model offers services that represent a compromise between complexity and flexibility, allowing applications to be implemented quickly and loaded into the cloud with little configuration [6].

2.4.3 Infrastructure as a Service (IaaS)

IaaS does not offer an application service to customers as SaaS and PaaS do. IaaS provides hardware on which customers can put anything they want [3]. IaaS offers access to virtual hardware resources, which include virtual machines, network and memory. It allows consumers to deploy applications more efficiently by removing the complexities related to managing their own infrastructure [5]. IaaS also allows consumers to build their own virtual clusters. This lets service providers purchase hardware resources and equipment to be shared with consumers, which can be used for anything they require.

Figure 2: Cloud Computing Service Model Diagram [26]

Figure 2 shows how cloud services work. Users interact with the cloud through the internet using laptops, mobile phones and tablets. All requests from users are processed in the cloud. The services provided in the cloud work as a pyramid, in the sense that each service is built and run on top of another. Infrastructure as a Service offers the main resources, which are used by the other services. Software as a Service, which is the application in the cloud, stays on top, as it uses resources from all the other services and runs on the resources provided by Platform as a Service. Each service essentially assists the other: a service provider developing an application to be executed in the cloud is assisted by the platform provider, which offers the tools necessary to develop and execute the application in the cloud, while the platform provider in turn receives resources from the infrastructure provider.

2.5 Types of Cloud

There are different methods of implementing applications in the cloud. Cloud computing is based on various types of clouds; the main types are private, public and hybrid.

2.5.1 Private Clouds

Private clouds are those built exclusively for a single user (e.g. a company or organisation). In a private cloud, the company that owns it has total control over the infrastructure used and the applications implemented in the cloud [7]. Normally a private cloud is built in a private datacenter. One benefit of having a private cloud is that data is not shared and can only be accessed by the organisation that owns it.

2.5.2 Public Clouds

Public clouds are those where the software of different users stays in a shared system and can be accessed by anyone in the cloud [8]. Public clouds can be bigger than private clouds, and they allow more scalability of resources. With this characteristic, public clouds reduce the need to buy additional equipment to solve temporary needs, moving the infrastructure risk to the providers of that infrastructure in the cloud, as it is their responsibility to manage software updates, security patches, etc. [9]. There is also the possibility of giving some private features to a public cloud for a single user by creating a virtual private datacenter, which gives its user greater visibility of the whole infrastructure. Public clouds are more efficient for temporary applications.

2.5.3 Hybrid Clouds

A hybrid cloud is a combination of two or more clouds that remain unique entities but are bound together, offering the benefits of each type of cloud [10]. A hybrid cloud can take features from private and public clouds. It allows a private cloud to have its resources enlarged with public cloud resources. In hybrid clouds some applications run exclusively in the public cloud, while the critical ones stay under the responsibility of the private cloud [11]. A hybrid cloud can be implemented either to attend to continuous demand or to satisfy a temporary demand. The quality of the implementation of the hybrid cloud determines its efficiency.

2.6 Cloud Applications

As cloud computing removes the need to buy expensive hardware to host large software applications, applications in the cloud are hosted in such a way that customers do not need to provide the server space; this is done by the service provider. Cloud computing has the power to host applications, and to manipulate and share data. The most common applications in the cloud are based on storage and databases [3].

Storage: cloud storage has some positive things to offer its users. When storing data in the cloud, we can access it from anywhere, and we do not need to use the same computer to access it, as it can be reached with any device with an internet connection. Storage applications in the cloud include:

- Google Apps: a service that offers applications to edit documents (Google Docs), chat (Google Talk) and email (Gmail). Every resource is managed by Google; the client only needs to set up an account.
- Amazon S3 (Simple Storage Service): the best-known cloud storage service, built to make web-scale computing easier for developers. It offers a simple web service interface that can be used to store any data, at any time, from anywhere on the internet.
- YouTube: hosts millions of video files uploaded by its users.
- Panda Cloud Antivirus: an antivirus program from Panda Software where most of the work needed to search for and remove malware is done in the cloud.

Database: a repository that stores information with links within the information, which help make the data searchable [3]. Cloud computing allows multiple applications to connect to one database running on a cluster by using shared services. These applications are isolated from each other, with explicit portions of database processing allocated to each application [12]. This service becomes Database as a Service, which avoids the complexity and cost of running our own database. Database as a Service offers some benefits: it is easy to use, there are no servers to provision and no redundant systems to worry about [3]. Database applications in the cloud include:

- SQL Server Data Services (SSDS): schema-free data storage with SOAP or REST APIs.
- SQL Azure: part of the Windows Azure platform, a set of hosted services, infrastructure, web data and services. It offers the full relational database functionality of SQL Server, but working in the cloud as a computing service, hosted in Microsoft datacenters across the world.

2.7 Virtualization

Virtualization is one of the main elements of cloud computing. Virtualization is important to cloud computing because it is a way to let users access services in the cloud. It creates a virtual environment, with virtual machines, which hides the physical characteristics of the hardware [14]. Virtualization can be done in different ways: one approach lets one server be used as many virtual servers, and another lets multiple servers be used as one virtual server. Virtualization is considered full virtualization when a complete installation of one machine can run on another [3], which is the way virtual machines run in a cloud environment. Virtual machines are also used to emulate operating systems on one platform, recreating the resources of that platform and hosting each system on virtual hardware. Paravirtualization is the technique that allows multiple operating systems to run on a single piece of hardware, sharing system resources such as processors and memory [3]. In full virtualization the entire system is emulated, but in paravirtualization the management module operates with an operating system that has been adapted to run in a virtual machine.

2.8 Data Management

One of the main reasons for adopting cloud computing is the benefit of transferring and processing large amounts of data. Data management and data processing play a big role in many cloud applications, as data stays stored in the cloud, and it is necessary to provide satisfactory service performance, expressed in terms of latency and high availability, and to meet service level agreements regardless of the quantity of data and changes in workloads.

2.8.1 Data Management Issues

Cloud computing has many advantages, but for data management, developers can come across some issues when implementing applications. When implementing an application, in many cases the developer has to deal with large sets of files; when the data is large, the developer has to distribute the data across many systems and use parallel systems, to prevent the data from being processed on a single system, which would increase the time required to process it and offer low efficiency.

2.8.2 Data Storage and Query Processing

With a significant increase in data, and in requests to extract value from that data, service providers have to manage and analyse a large amount of data if they want to offer high performance in their services and isolate data in the cloud. As the data is managed in many partitions, it becomes hard to offer transactional guarantees such as atomicity and isolation. To deal with these problems, some solutions have been developed combining techniques such as MapReduce or parallel Database Management Systems (DBMS) [15]. So the challenge is to define an architecture that focuses on the query processing mechanism and on parallel file systems, such as the Hadoop Distributed File System (HDFS), to give an architecture with different levels of isolation.

HDFS was inspired by GFS (Google File System), which offers reliable and efficient access to data using big clusters. GFS can also be used to measure the performance of a replication system on clusters. HDFS stores large files across various servers and achieves reliability by means of data replication. GFS works with three components: a master server, multiple clients and multiple chunk servers, where the chunks are stored in datacenters managed by servers [16].

MapReduce was introduced by Google. It is a framework in which each task is done with two functions: Map and Reduce. The Map function receives a set of input files and, according to the user's specification, emits a set of tuples in a dictionary (key-value) format. The Reduce function receives the set of values associated with each key and, for each key, emits a set of tuples that are stored in output files (the word-count sketch at the end of this section illustrates this contract).

This project will address issues related to data storage, concerning data processing and response time. The size of the data and the method of distribution are a challenge: people are interested in cloud computing because they can store large files, but at the same time they want to process this data quickly when they need it. Since this project will be implemented using simulators, it will be possible to design cloud environments considering the time to transfer data of large size, and to distribute data across the datacenter. Today these techniques to store and process large data sets are used by organisations such as Facebook, Google, YouTube and Amazon, in order to offer better performance in their services, storing and processing hundreds of terabytes of data every day.
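To make the Map/Reduce contract above concrete, here is the classic word-count example written against the Hadoop MapReduce Java API. It is a minimal sketch: the class names are chosen for this illustration rather than taken from the project.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit one (word, 1) key-value tuple per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // dictionary-format tuple, as described above
            }
        }
    }
}

// Reduce: receives all values associated with one key and emits the aggregated tuple,
// which the framework stores in the job's output files.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}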

2.8.3 Scalability and Consistency

Scalability has to be transparent to users, allowing them to store their data in the cloud without knowing the location of the data or the way it is accessed. But many solutions in the cloud are focused on scalability, and in general they offer weak consistency of data, which means that after the system is updated, if a user accesses the system, there is no guarantee that the returned value will also be updated. This kind of consistency does not permit the development of a wide range of applications, such as online services that cannot work with inconsistent values. Some approaches use aspects of data storage and query processing to guarantee scalability, but the best way to solve this issue is to develop solutions that combine these aspects so as to improve system performance without compromising data consistency [15].

2.8.4 Database Management Systems (DBMS) with Multi-tenants

In DBMS as a cloud service, when tenants access the system and share resources, they can affect the performance of the system. In this case, resource provisioning has to be efficient, as the workloads of a DBMS as a service can vary, with tenants accessing the system more frequently at certain moments.

2.9 Evaluation Performance Using Cloud Simulation

Cloud applications may offer many services, such as social networking, data storage, content delivery and web hosting, and the cloud provider must offer a cloud environment that responds to customers' needs. Data sets have to be evaluated and analysed by the service provider, to offer efficient access to the data sets and to replicate them on several servers. Because of this, cloud scenarios have to be evaluated, to test the replication of data on several servers, data retrieval, and effective methods to send, process and access data. To perform these tasks, simulation tools have been developed to reproduce tests in a cloud environment. These simulation tools let service providers and cloud customers test their services in an environment that allows them to repeat the tests and keep control of them. These tests let service providers determine cloud service quality and quantity, and help optimize the evaluation of their services, as a simulation test is cheaper and faster than performing the same test in a real cloud environment [17]. There are several toolkits which allow service providers and users to simulate their services or applications in a cloud environment; at the moment, the most used toolkit in this area is the CloudSim framework.
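As a concrete example of the kind of metric such simulations report, the sketch below estimates the time to transfer a file to a datacenter from its size, the available bandwidth and a fixed latency. It is purely illustrative: the method name, its parameters and the numbers are assumptions for this example, not part of any toolkit.

// Illustrative transfer-time estimate: size / bandwidth + fixed latency.
public final class TransferTimeEstimate {

    // fileSizeMB: file size in megabytes; bandwidthMBps: effective bandwidth in
    // megabytes per second; latencySec: fixed network latency in seconds.
    static double transferTimeSec(double fileSizeMB, double bandwidthMBps, double latencySec) {
        return fileSizeMB / bandwidthMBps + latencySec;
    }

    public static void main(String[] args) {
        // A 2 TB (2,097,152 MB) dataset over a 100 MB/s link with 0.1 s latency:
        double t = transferTimeSec(2_097_152, 100.0, 0.1);
        // Prints roughly 20971.6 s, i.e. about 5.8 hours.
        System.out.printf("Estimated transfer time: %.1f s (~%.1f h)%n", t, t / 3600.0);
    }
}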

iCanCloud is a platform that models and simulates cloud computing systems; it predicts trade-offs between the performance and cost of a set of applications executed on given hardware, and provides information about both. The GreenCloud simulator is an extension of the network simulator NS-2 which simulates cloud environments. GreenCloud provides users with a detailed model of the energy consumed by the elements of a datacenter, such as switches and servers. CloudSim is a simulation toolkit that reports time, power and traffic consumption. It is based on the Java platform, building on modules already developed such as SimJava and GridSim [18]. The HDFS simulator is a simulator designed in Java; HDFS itself is a distributed file system designed to run on commodity hardware, and it is highly fault-tolerant. Hadoop is a Java-based framework with two subsystems: the Hadoop engine that executes MapReduce applications, and HDFS, which handles data management and access.

As explained in Chapter 1, CloudSim and the HDFS simulator are the ones used in this project. This project focuses on data transfer, and CloudSim offers the tools necessary to simulate users sending data to the cloud and to calculate the time taken to transfer and store data. The HDFS simulator was designed to simulate cloud systems dealing with large data. As the second part of the project is related to data processing, and the HDFS simulator can help developers understand and implement environments that store and process large data files, distributing data across the datacenter while avoiding failures, it was used as the simulator framework to simulate data processing.

2.10 Data Transfer

2.10.1 CloudSim

CloudSim is a simulation framework that allows the modelling of cloud experiments using simulated infrastructures and application services [17]. As a framework, CloudSim offers support to model and simulate large-scale cloud nodes, and a platform to model datacenters, service brokers, scheduling and allocation policies. CloudSim also has some notable features: a virtualization engine, which handles the creation and management of multiple, independent, co-hosted virtualized services on a datacenter node; and the flexibility to switch between space-shared and time-shared allocation of processing cores to virtualized services [17]. These features can speed up the development of algorithms, protocols and implementation approaches in cloud computing.

2.10.2 CloudSim Architecture

Figure 3: CloudSim Architecture [17]

Figure 3 shows the CloudSim architecture and the layers it uses. In the bottom layer there is a simulation engine, which is responsible for the operations that create, manage and delete simulation entities [17]. The next layer holds the main classes used to implement the framework, and is composed of different modules. The network module maps client requests to datacenters and calculates the possible delay of messages between datacenters and clients. The cloud resources module manipulates and coordinates simulation events, and also manages data related to the infrastructure provided by the simulated datacenter. The cloud services module covers virtual machine provisioning and the allocation of resources such as system memory, data storage and communication bandwidth [23]. Above that are the virtual machine services, which manage and execute the cloudlets (tasks) sent by clients.

Just above, there are the user interface structures, where communication between entities occurs; through this interface, virtual machines and cloudlets can be manipulated. The top layer represents the code that the user of the framework implements to create the simulation environment. The scheduling policy covers the creation of decision policies and schedulers, which guide the simulation process [23]. It also uses decision-making entities, called brokers; CloudSim additionally allows the implementation of allocation policies for virtual machines between hosts of the same datacenter, of virtual machine schedulers on hosts, and of cloudlet schedulers on virtual machines.

2.10.3 CloudSim Usability

When using CloudSim, users need a background in Java, as the toolkit is entirely written in Java. With knowledge of Java, users can write code using elements of the CloudSim library to develop experiments, simulating the desired scenario. Using CloudSim is not only about writing code, changing parameters, running the program and collecting the results; it requires an understanding of how the simulator works.

2.10.4 CloudSim Capabilities

CloudSim comes with its source code, which can be extended to make the simulator suit the problem the user wants to solve, so users are free to make their own changes to the source code, adding classes to make CloudSim fit a specific scenario. CloudSim is flexible, and requires less time and effort to perform tests and simulations in the cloud. It can simulate small-scale scenarios as well as large-scale scenarios with many datacenters, with little cost in initialisation time and memory consumption. It also uses virtualisation to create many virtual services, where each service is managed on a node of a datacenter.

2.10.5 CloudSim Limitations

CloudSim is a very good tool, with powerful functions to help users simulate cloud computing environments, but it is not a toolkit that can be used by only setting parameters: users need to write Java code to get access to its library. It also does not support every cloud scenario out of the box, which requires extensions as discussed earlier. A user who is not familiar with the Java language will not be able to use the simulator directly; because of that, CloudReports, an extension of CloudSim, was developed to let users run CloudSim simulations through a simple graphical interface.
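To illustrate how these entities fit together in code, the following is a minimal sketch of a CloudSim experiment written against the CloudSim 3.x Java API: one datacenter with a single host, one broker, one virtual machine and one cloudlet. The MIPS, RAM, bandwidth and cost figures are arbitrary placeholders chosen for the example, not values from this project's experiments.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;
import org.cloudbus.cloudsim.*;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class MinimalCloudSimExperiment {
    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false); // one cloud user, no event trace

        // One host: a single 1000-MIPS core, 2 GB RAM, placeholder bandwidth and storage.
        List<Pe> peList = Arrays.asList(new Pe(0, new PeProvisionerSimple(1000)));
        List<Host> hostList = new ArrayList<>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048),
                new BwProvisionerSimple(1_000_000), 1_000_000, peList,
                new VmSchedulerTimeShared(peList)));

        // Datacenter characteristics: architecture, OS, VMM, time zone and cost figures.
        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);
        new Datacenter("Datacenter_0", characteristics,
                new VmAllocationPolicySimple(hostList), new LinkedList<Storage>(), 0);

        // The broker acts on behalf of the user, submitting one VM and one cloudlet (task).
        DatacenterBroker broker = new DatacenterBroker("Broker_0");
        Vm vm = new Vm(0, broker.getId(), 1000, 1, 512, 1000, 10_000, "Xen",
                new CloudletSchedulerTimeShared());
        Cloudlet cloudlet = new Cloudlet(0, 400_000, 1, 300, 300,
                new UtilizationModelFull(), new UtilizationModelFull(),
                new UtilizationModelFull());
        cloudlet.setUserId(broker.getId());
        broker.submitVmList(Arrays.asList(vm));
        broker.submitCloudletList(Arrays.asList(cloudlet));

        CloudSim.startSimulation();
        CloudSim.stopSimulation();

        // The broker collects completed cloudlets with their simulated finish times.
        for (Cloudlet c : broker.getCloudletReceivedList()) {
            System.out.printf("Cloudlet %d finished at %.2f s%n",
                    c.getCloudletId(), c.getFinishTime());
        }
    }
}

Scenarios are varied by changing these parameters, for example swapping CloudletSchedulerTimeShared for CloudletSchedulerSpaceShared, or adding more hosts, VMs and cloudlets, which is the approach taken in the experiments of Chapter 4.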

2.11 Data Processing

2.11.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS), like every distributed file system, was designed to run across many machines, allowing users to store and process data on many different pieces of hardware, normally connected by a Local Area Network (LAN). But HDFS differs from typical distributed file systems in that it is highly fault-tolerant and can be deployed on hardware which is not very powerful [19]. There are several other implementations of distributed file systems, such as the GNU Cluster File System (GlusterFS), the Moose File System (MooseFS), the General Parallel File System (GPFS), OpenAFS, the Network File System and the Google File System (GFS).

HDFS is an open source framework written in Java that uses a master-slave architecture. An HDFS cluster has a primary NameNode, which is used as the master server and is responsible for managing the file system namespace; in addition, it regulates access to data by clients [20]. The NameNode oversees a number of DataNodes, normally with a one-to-one relationship between a machine and a DataNode. A DataNode manages the storage attached to the machine it runs on. The file system namespace is used in HDFS to allow data to be stored in files [20]. Each file is divided into blocks, which are distributed across DataNodes, supporting parallel processing. Blocks are replicated on many DataNodes, which prevents processing from stopping if a node failure occurs [21]. The number of nodes used in HDFS is proportional to the probability that one of these nodes fails: the more nodes it has, the higher the chance that some node fails. To protect the system against failure, DataNodes receive copies of blocks. HDFS keeps three replicas of each block: two of these replicas go onto nodes sharing the same rack, and the other goes onto a node on a different rack.

2.11.2 HDFS Goals

a) Hardware Failure

HDFS consists of many server machines, each storing part of the file system's data. Given the huge number of elements (machines) in HDFS, and the fact that each element has a non-trivial probability of failure, some elements of HDFS will always be non-functional [21]. One of the HDFS goals is therefore a core architecture with automatic recovery, able to detect faults quickly.

b) Streaming Data Access

Applications running on HDFS require streaming access to their data sets. An application running on the system has a specific purpose and runs against a specific file system. HDFS follows a design that does not focus on interactive use but on batch processing: it gives less priority to low latency and more to high-throughput data access.

c) Large Data Sets

HDFS was designed to support large data sets. A file in HDFS is typically big, from gigabytes to terabytes in size. To support data of this size, HDFS offers high aggregate data bandwidth and is able to scale to hundreds of nodes in a single cluster. HDFS has the ability to support millions of files in one instance [21]. HDFS is also a good solution for people working with blogs, where they have to deal with large data without knowing in advance how the data will be used; with unstructured files, HDFS can help, as it allows them to store data even if it is not well structured, and to process the data where it is stored. Google is a real example of a system dealing with large data sets, as it stores hundreds of terabytes a day.

2.11.3 Block Division

Some files are too big to be stored on a single hard drive. A way out of this problem is to divide files and distribute them across many machines. The file distribution is done implicitly: the developer using HDFS only has to set the correct configuration parameters. Before storing files, HDFS adopts a strategy where the files are divided into a sequence of blocks of equal size. The default size is 64 megabytes, which can be changed if necessary. This is far bigger than the block size used by conventional file systems, which commonly use blocks of around 512 bytes. After splitting the files, HDFS starts the distribution, placing blocks on different nodes. If the data addressed to a block is not big enough to fill the space reserved for it, the rest of the space is not wasted; it can be used for other data.
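A small sketch of the block arithmetic described above, assuming the 64 MB default block size; the class and the 1 GB file size are invented for the illustration:

// Illustrative block-division arithmetic for the default 64 MB HDFS block size.
public final class BlockDivision {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB; configurable in HDFS

    public static void main(String[] args) {
        long fileSize = 1_000_000_000L; // a 1 GB file
        long fullBlocks = fileSize / BLOCK_SIZE;
        long lastBlock = fileSize % BLOCK_SIZE; // a short final block wastes no disk space
        // Prints: 14 full blocks + 1 block of 60475904 bytes
        System.out.printf("%d full blocks + 1 block of %d bytes%n", fullBlocks, lastBlock);
    }
}

Each of these blocks, rather than the whole file, is then the unit that HDFS places, replicates and processes on different nodes.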

2.11.4 HDFS Architecture

An HDFS cluster consists of one NameNode, the master server, usually deployed on one dedicated node, the node with the best performance. It is responsible for managing the file system namespace and regulating how files are accessed by clients. There are many DataNodes, normally one DataNode per node in the cluster, which manage the storage attached to the nodes they run on. HDFS allows user data to be stored in files. It then performs block division, storing the blocks on a set of DataNodes. The file system namespace operations, such as renaming files and opening and closing files and directories, are executed by the NameNode [21]. Serving read and write requests from the client's file system is the responsibility of the DataNodes, which also perform block creation, replication and deletion, as instructed by the NameNode. The first communication between the master node and a slave node occurs when the DataNode registers with the NameNode. The DataNode then communicates with the NameNode, sending information about the blocks it stores, as well as about its local changes. This communication between the NameNode and the DataNodes is crucial in helping the NameNode decide which nodes will store a given block. If the NameNode is unable to receive information from a DataNode, that DataNode is asked to register again.

Figure 4: HDFS Architecture [21]

Figure 4 shows the architecture of HDFS. The NameNode and DataNodes are pieces of software that run on machines, usually machines running the GNU/Linux operating system. As HDFS is written in Java, it can be deployed on any machine that supports Java, and such machines are therefore able to run the NameNode or a DataNode. A typical HDFS deployment has one machine dedicated to running the NameNode, while the other machines run DataNodes, normally one DataNode per machine. The NameNode also acts as the arbiter for all metadata. HDFS is designed with an architecture in which user data never flows through the NameNode [21].

2.11.5 Data Replication

As HDFS splits files into blocks, it also replicates these blocks to increase the security and durability of the file. By default HDFS keeps three replicas, allocated on different nodes. As communication between machines on the same rack is faster than between machines on different racks, when selecting a replica for a process, HDFS gives priority to replicas belonging to the same rack [22]. One of the biggest benefits of replication is that the system gains high fault-tolerance and reliability: if a node fails, the process will be executed by another machine which contains a replica of the block, without any need to transfer data or interrupt the application's execution. All of this is done transparently, as Hadoop offers mechanisms to restart the process without anyone noticing that a node failed during execution [22]. When there is a fault, the number of replicas of a block decreases; to restore reliability, the NameNode consults its metadata about DataNode faults and restarts the replication process on other DataNodes.

2.12 Related Work

Through the literature review I examined some work using experiments in CloudSim, evaluating simulations under space-shared and time-shared allocation. The authors sent sets of tasks to virtual machines, in groups of 50 [23]. They collected the results and found that in the experiments using space-shared allocation, every task completed after 20 minutes, as the number of tasks had no effect on the execution time of a single task. But the execution time of a single task in the experiments using time-shared virtual machines was affected as the number of tasks submitted to the virtual machines increased. The first set of 50 tasks completed before the other ones: at the start of execution the hosts were not overloaded, which allowed them to execute tasks quickly, and as more tasks completed, hosts became available for further tasks [23].

Another piece of work, done with the HDFS simulator, used an experiment to check whether the number of replicas affects system performance. The authors used a method to calculate the mean time taken to repair the system when failures occur. They ran one experiment replicating each block three times, and another with 8 replicas per block. They noticed that as the replication level increases, the expected time to repair the system increases too, because a higher replication level means more blocks per node [24]. If a node fails, the system has to replicate more blocks than it would at a lower replication level.

2.13 Summary

In this chapter I went through some topics related to cloud computing and its architecture, as well as some applications developed and used around the world. The next chapter will cover how this literature review is used and how the experiments in this project are designed, including the use of HDFS and CloudSim for data management.

Chapter 3 Design

3.1 Introduction

With the use of cloud computing, people constantly need to find ways to deal with large data files. Many organisations collect billions of bytes of data every day and need to manage them to satisfy their users' demands. Recently, with the increase in people using cloud computing, some companies want to get the benefit of adopting cloud technologies by transferring their data to the cloud; they then also want to process this data in the cloud. One of the biggest interests of someone sending data to the cloud is the need to have large data files stored and to be able to access them anywhere. So people decide not just to transfer data to the cloud, but also to have their data processed there. One of the big issues when transferring data to the cloud is the time to transfer the data and the time to process it, as these days many requests demand tasks to be done in as short a time as possible.

3.2 Methodology

The methods and data used to design this project were mostly acquired through web sites and a few books. All the material was analysed and then studied in depth, in order to progress with the project, choosing the material best suited to the problem and most helpful in reaching a solution. The experiments in the project were designed by relating the background research to the scenario needed to solve the problem presented in the project. To test the performance of the experiments, two simulators were used: one to check transfer time, and another for data processing time.

3.3 Example Problem

In recent years, PC makers have increased PC storage capacity from bytes to terabytes. This development has made users' lives easier, as they can now benefit from more storage capacity to save their files. So the biggest problem is not the size of the data to store, but the quantity of data that can be processed by a system. Let us suppose we have a company that offers a web service whose systems are accessed daily by thousands of users, sending, reading and writing data in short slots of time. To deal with this, the organisation needs to implement a system architecture with methods capable of classifying data, in order to enable the system to process data in a fast and efficient way. Many companies use techniques based on logging activity, writing users' requests to log files to record the different tasks performed by a single user. This technique can cause serious problems if used in the wrong way: an organisation can accumulate a lot of data in a short time and become unable to process or organize it. In this way data is wasted, losing the opportunities this data would give to improve the performance of the system. Moreover, if it is all done serially, the data is processed on an individual system, leaving only one point to deal with possible failures in the system.

3.4 Solution

To solve the problem presented above, the best approach is to develop distributed parallel environments, which provide high performance compared with systems using serial environments. With parallel computation, an application can be executed concurrently on different elements of a machine, whether that machine is distributed or multi-processor. With this technique it is possible to get a system with many points dealing with failures, which increases the system's fault-tolerance. For such problems, many concepts and strategies have been developed to offer simple and efficient solutions using parallel computation techniques, making it possible to create a robust system able to attend to users' demands. The first idea is to implement a system able to process data in a short time, with many points dealing with possible failures, and also capable of running on low-cost machines. This is where HDFS comes in, as it is designed to process large data files in a short time, offering a system that is highly fault-tolerant.
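As a toy illustration of the serial-versus-parallel contrast described above, the following sketch counts request types in a set of log records, first with a single thread and then concurrently. It runs on one machine with Java streams rather than on a real cluster, and the log format is invented for the example; in the actual design, this role is played by HDFS and MapReduce.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy contrast between serial and parallel processing of log records.
public class LogProcessing {
    public static void main(String[] args) {
        List<String> logLines = List.of(
                "user1 READ /reports/q1.pdf",
                "user2 WRITE /data/upload.bin",
                "user1 READ /reports/q2.pdf");

        // Serial: one thread walks every record in order; one point of failure.
        Map<String, Long> serial = logLines.stream()
                .collect(Collectors.groupingBy(l -> l.split(" ")[1], Collectors.counting()));

        // Parallel: the same computation split across worker threads, analogous to
        // distributing blocks of a log file across the nodes of a cluster.
        Map<String, Long> parallel = logLines.parallelStream()
                .collect(Collectors.groupingBy(l -> l.split(" ")[1], Collectors.counting()));

        System.out.println(serial);   // {READ=2, WRITE=1}
        System.out.println(parallel); // same result, computed concurrently
    }
}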


More information

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies

More information

www.basho.com Technical Overview Simple, Scalable, Object Storage Software

www.basho.com Technical Overview Simple, Scalable, Object Storage Software www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...

More information

Cloud computing an insight

Cloud computing an insight Cloud computing an insight Overview IT infrastructure is changing according the fast-paced world s needs. People in the world want to stay connected with Work / Family-Friends. The data needs to be available

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

HDFS Users Guide. Table of contents

HDFS Users Guide. Table of contents Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9

More information

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India rlrooparl@gmail.com 2 PDA College of Engineering, Gulbarga, Karnataka,

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

CLOUD COMPUTING USING HADOOP TECHNOLOGY

CLOUD COMPUTING USING HADOOP TECHNOLOGY CLOUD COMPUTING USING HADOOP TECHNOLOGY DHIRAJLAL GANDHI COLLEGE OF TECHNOLOGY SALEM B.NARENDRA PRASATH S.PRAVEEN KUMAR 3 rd year CSE Department, 3 rd year CSE Department, Email:narendren.jbk@gmail.com

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE ANALYSIS USING RETAIL APPLICATION TEST DATA)

CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE ANALYSIS USING RETAIL APPLICATION TEST DATA) CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE ANALYSIS USING RETAIL APPLICATION TEST DATA) Abhijeet Padwal Performance engineering group Persistent Systems, Pune email: abhijeet_padwal@persistent.co.in

More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Hadoop Distributed File System (HDFS) Overview

Hadoop Distributed File System (HDFS) Overview 2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

WINDOWS AZURE DATA MANAGEMENT

WINDOWS AZURE DATA MANAGEMENT David Chappell October 2012 WINDOWS AZURE DATA MANAGEMENT CHOOSING THE RIGHT TECHNOLOGY Sponsored by Microsoft Corporation Copyright 2012 Chappell & Associates Contents Windows Azure Data Management: A

More information

Big Data on Cloud Computing- Security Issues

Big Data on Cloud Computing- Security Issues Big Data on Cloud Computing- Security Issues K Subashini, K Srivaishnavi UG Student, Department of CSE, University College of Engineering, Kanchipuram, Tamilnadu, India ABSTRACT: Cloud computing is now

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

yvette@yvetteagostini.it yvette@yvetteagostini.it

yvette@yvetteagostini.it yvette@yvetteagostini.it 1 The following is merely a collection of notes taken during works, study and just-for-fun activities No copyright infringements intended: all sources are duly listed at the end of the document This work

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Storage Architectures for Big Data in the Cloud

Storage Architectures for Big Data in the Cloud Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas

More information

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research Introduction to Cloud : Cloud and Cloud Storage Lecture 2 Dr. Dalit Naor IBM Haifa Research Storage Systems 1 Advanced Topics in Storage Systems for Big Data - Spring 2014, Tel-Aviv University http://www.eng.tau.ac.il/semcom

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi Cloud Platforms, Challenges & Hadoop Aditee Rele Karpagam Venkataraman Janani Ravi Cloud Platform Models Aditee Rele Microsoft Corporation Dec 8, 2010 IT CAPACITY Provisioning IT Capacity Under-supply

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

INTRODUCTION TO CASSANDRA

INTRODUCTION TO CASSANDRA INTRODUCTION TO CASSANDRA This ebook provides a high level overview of Cassandra and describes some of its key strengths and applications. WHAT IS CASSANDRA? Apache Cassandra is a high performance, open

More information

Microsoft Azure Data Technologies: An Overview

Microsoft Azure Data Technologies: An Overview David Chappell Microsoft Azure Data Technologies: An Overview Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Blobs... 3 Running a DBMS in a Virtual Machine... 4 SQL Database...

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Assignment # 1 (Cloud Computing Security)

Assignment # 1 (Cloud Computing Security) Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Introduction to Cloud Services

Introduction to Cloud Services Introduction to Cloud Services (brought to you by www.rmroberts.com) Cloud computing concept is not as new as you might think, and it has actually been around for many years, even before the term cloud

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Research Paper Available online at: www.ijarcsse.com A COMPARATIVE STUDY OF CLOUD COMPUTING SERVICE PROVIDERS

Research Paper Available online at: www.ijarcsse.com A COMPARATIVE STUDY OF CLOUD COMPUTING SERVICE PROVIDERS Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A COMPARATIVE STUDY OF CLOUD

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Multilevel Communication Aware Approach for Load Balancing

Multilevel Communication Aware Approach for Load Balancing Multilevel Communication Aware Approach for Load Balancing 1 Dipti Patel, 2 Ashil Patel Department of Information Technology, L.D. College of Engineering, Gujarat Technological University, Ahmedabad 1

More information

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study DISTRIBUTED SYSTEMS AND CLOUD COMPUTING A Comparative Study Geographically distributed resources, such as storage devices, data sources, and computing power, are interconnected as a single, unified resource

More information

IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY

IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,

More information

Relocating Windows Server 2003 Workloads

Relocating Windows Server 2003 Workloads Relocating Windows Server 2003 Workloads An Opportunity to Optimize From Complex Change to an Opportunity to Optimize There is much you need to know before you upgrade to a new server platform, and time

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts Part V Applications Cloud Computing: General concepts Copyright K.Goseva 2010 CS 736 Software Performance Engineering Slide 1 What is cloud computing? SaaS: Software as a Service Cloud: Datacenters hardware

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Distributed Systems. Cloud & the Internet of Things. Björn Franke University of Edinburgh, 2015

Distributed Systems. Cloud & the Internet of Things. Björn Franke University of Edinburgh, 2015 Distributed Systems Cloud & the Internet of Things Björn Franke University of Edinburgh, 2015 OVERVIEW Cloud Computing vs. Distributed Computing Examples SAAS, PAAS, IAAS Goals, Types, Characteristics

More information

Security Benefits of Cloud Computing

Security Benefits of Cloud Computing Security Benefits of Cloud Computing FELICIAN ALECU Economy Informatics Department Academy of Economic Studies Bucharest ROMANIA e-mail: alecu.felician@ie.ase.ro Abstract: The nature of the Internet is

More information

Everything You Need To Know About Cloud Computing

Everything You Need To Know About Cloud Computing Everything You Need To Know About Cloud Computing What Every Business Owner Should Consider When Choosing Cloud Hosted Versus Internally Hosted Software 1 INTRODUCTION Cloud computing is the current information

More information

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary Contents Introduction What is the Cloud? How does it work? Types of Cloud Service Cloud Service Providers Summary Introduction The CLOUD! It seems to be everywhere these days; you can t get away from it!

More information

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline References Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Introduction to Database Systems CSE 444

Introduction to Database Systems CSE 444 Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon References Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

CHAPTER 2 THEORETICAL FOUNDATION

CHAPTER 2 THEORETICAL FOUNDATION CHAPTER 2 THEORETICAL FOUNDATION 2.1 Theoretical Foundation Cloud computing has become the recent trends in nowadays computing technology world. In order to understand the concept of cloud, people should

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Cloud computing - Architecting in the cloud

Cloud computing - Architecting in the cloud Cloud computing - Architecting in the cloud anna.ruokonen@tut.fi 1 Outline Cloud computing What is? Levels of cloud computing: IaaS, PaaS, SaaS Moving to the cloud? Architecting in the cloud Best practices

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information