Efficient Support of Big Data Storage Systems on the Cloud


Akshay MS, Suhas Mohan, Vincent Kuri, Dinkar Sitaram, H. L. Phalachandra
PES Institute of Technology, CSE Dept., Center for Cloud Computing and Big Data

Abstract

Due to their advantages over traditional data centers, cloud infrastructures have seen rapid growth in usage. These include public clouds (e.g., Amazon EC2) and private clouds, such as clouds deployed using Openstack. A common factor in many of the well-known infrastructures, for example Openstack and Cloudstack, is that networked storage is used for persistent data. However, traditional Big Data systems, including Hadoop, store data in commodity local storage for reasons of high performance and low cost. We present an architecture for supporting Hadoop on Openstack using local storage. Subsequently, we use benchmarks on Openstack and Amazon to show that, for supporting Hadoop, local storage has better performance and lower cost. We conclude that cloud systems should support local storage for persistent data (in addition to networked storage) so as to provide efficient support for Hadoop and other Big Data systems.

Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage management, secondary storage; D.4.7 [Operating Systems]: Organization and design, distributed systems.

General Terms
Management, Measurement, Performance, Design, Economics, Experimentation.

Keywords
Hadoop, Big Data, Cloud, IaaS, Openstack

1. Introduction

In the recent past, there has been widespread growth in the use of cloud infrastructures. The major reason for this growth is that, in general, it is more efficient and less expensive to host applications on the cloud. Since these considerations also apply to Big Data systems such as Hadoop, it is important to support them efficiently on the cloud. For example, a major factor driving the adoption of cloud technology is resource sharing [Cr2009].
Since the demands of applications are typically bursty, it is possible to share the same server resources between multiple applications, leading to lower costs. Other advantages include the ability to scale server resources rapidly, and to have large spare capacity [Ar2010]. These factors indicate that it is important to support Big Data systems efficiently on cloud infrastructures.

Big Data systems by definition operate on large datasets. Therefore, when designing support for such systems on a cloud system, it is important to consider the storage architecture of the cloud system. Hadoop, Google File System, and other Big Data systems use local storage for reasons of high performance and low cost [Bo2008, Gh2003]. Cloud systems, however, typically use networked storage for persistent data. For example, Openstack supports only networked storage (e.g., iSCSI) for persistent data [Ci2013]. Cloudstack, another widely used cloud system, supports networked storage together with local storage [Ci2012]. However, local storage is recommended only for non-persistent storage [Lo2012].

One of the motivations for using networked storage in cloud systems is to keep the data available in the face of failure. This consideration does not apply to many Big Data systems, including Hadoop, since they replicate the data. Thus, if one copy of the data is lost (due to either server or disk failure), other copies are still available.

The rest of the paper is organized as follows. Section 2 reviews related work in this area. Our local storage-based solution is presented in Section 3, followed by benchmark results in Section 4. Section 5 contains our conclusions.

2. Related Work

There are a number of ways in which Big Data systems can be supported in a cloud. Networked storage is used to support persistent data in Openstack and other cloud systems [Ci2013], and can be leveraged to store Hadoop data. We use benchmarks to show that our solution has higher performance.
Another alternative is to use a high-performance storage technology, for example Amazon EC2 Elastic Block Storage [EB2013]. While this solution can achieve high performance, we provide data to show that it is not as low cost as our proposal.

There are many studies of Hadoop performance [Sh2010, Su2010, Xi2010]; however, these concentrate on the performance of Hadoop running on bare hardware. The work that comes closest to ours is [Sh2010a], which uses Hadoop running on Eucalyptus and Amazon to identify virtualization bottlenecks in the Eucalyptus and Amazon cloud systems. However, the focus of that paper is on improving virtualization technology and not on an efficient design to support Hadoop on cloud systems.

3. Our Solution

Our objective is to provide a high-performance Hadoop system running on a cloud infrastructure. We first describe the design considerations that arise, followed by the description of the actual solution.

Figure 1: Openstack System with Hadoop VMs

Figure 2: Hadoop VMs with Local Storage

3.1 Design Considerations

The design approach of Big Data systems is to move the program to the data and not the data to the program. This ensures that most of the time is spent on data processing and not on network transfer of data to the machines where the jobs are deployed [Bo2008, Gh2003]. A network-attached storage system will not be able to exploit the full capabilities of Hadoop unless the storage system is connected via a very high-speed network (10G or Fibre Channel), which makes the entire setup expensive and requires considerable expertise to set up and maintain the cluster. We note that even with high-performance networking, disk performance cannot be higher than that of local storage since, even with iSCSI, the data eventually comes off a disk that is directly attached to another machine. Therefore, in our solution, we propose that running Hadoop on local storage is ideal from a cost and performance perspective.

The major constraint with current cloud systems such as Openstack is that the data stored in local storage is not persistent. There are two methods of getting around this. Our current method relies on having long-running VMs so that the local storage is available for a long time. The second method is to extend the cloud with persistent local storage.

Having Hadoop data on local storage has the additional advantage that it inter-operates better with Hadoop load-balancing algorithms. When scheduling tasks, Hadoop load-balancing algorithms try to factor in data locality, i.e., information about the nodes on which data resides. These load-balancing algorithms will not work well with network-attached disks.
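The data-locality preference described above can be illustrated with a small sketch. This is a simplified model and not Hadoop's actual scheduler code: the function `pick_task`, the `rack_of` map, and the `pending` map are hypothetical names introduced only for illustration. The sketch prefers a task whose input replica is on the free node, then one whose replica is in the same rack, then any remaining task.

```python
from typing import Dict, List, Optional


def pick_task(free_node: str,
              rack_of: Dict[str, str],
              pending: Dict[str, List[str]]) -> Optional[str]:
    """Pick a pending task for free_node, preferring local input data.

    pending maps a task id to the nodes that hold a replica of its input.
    rack_of maps every node name to its rack.
    (Illustrative model only; all names are assumptions.)
    """
    rack_local = None
    fallback = None
    for task, replica_nodes in pending.items():
        if free_node in replica_nodes:
            return task  # node-local: input is read from the local disk
        if rack_local is None and any(
                rack_of[n] == rack_of[free_node] for n in replica_nodes):
            rack_local = task  # same rack: short network path
        if fallback is None:
            fallback = task  # anything else: data crosses racks
    return rack_local if rack_local is not None else fallback
```

In this model, if every replica sits on a separate storage server, as with network-attached disks, `free_node in replica_nodes` is never true, so no task can be scheduled node-local and every read crosses the network.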
Additionally, VM migration for balancing load, which is commonly used in cloud systems, is probably not useful here since it does not take factors such as data locality into account. Therefore, migration of Hadoop VMs should be disabled.

A final consideration arises from the objective of making sure that Hadoop's replication facility is not inadvertently defeated by running Hadoop on a cloud infrastructure. In a cloud, it is possible that all 3 VMs containing the replicas of a file would be scheduled on the same physical machine. To prevent this, we use Hadoop's rack awareness property. All VMs running on the same physical machine are designated (to Hadoop) as being in the same (virtual) rack. Hadoop then ensures that the replicas are spread over at least two different racks, i.e., that there are replicas on at least two different physical machines. Since rack awareness is a common feature of Big Data systems, this method can be used for other Big Data systems as well.

3.2 Details of Our Solution

Figure 1 contains a high-level overview of our solution. We have a single controller node running core OpenStack services such as Keystone, Glance, Cinder and Quantum. Cinder is the volume management service; volumes created using Cinder reside on the controller and are attached to the virtual machines over iSCSI. We have several compute nodes running the nova-compute service, which can spawn virtual machines. Each physical node has an Intel Xeon processor with 8 MB cache, 16 GB RAM and a 1 TB hard disk. All the nodes are connected to two different networks of 1 Gbps each. One network is used for OpenStack services to communicate with each other and the other is used to connect to a public network.

A number of long-running Hadoop VMs are spawned on OpenStack. These VMs are similar in behaviour to a Hadoop cluster, with each VM being similar to a Hadoop node.
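The virtual-rack designation from Section 3.1 can be sketched as a Hadoop topology script. Hadoop's script-based rack mapping invokes a configured executable with one or more node addresses as arguments and reads one rack path per address from standard output. The sketch below is illustrative only: the `VM_TO_HOST` table, its addresses and the host names are hypothetical, and in a real deployment the table would have to be generated from the cloud's own VM placement information.

```python
#!/usr/bin/env python3
"""Topology script sketch: report the physical host of each VM as its rack."""
import sys
from typing import Dict, List

# Hypothetical mapping from Hadoop VM address to the compute node hosting it.
VM_TO_HOST: Dict[str, str] = {
    "10.0.0.11": "compute1",
    "10.0.0.12": "compute1",
    "10.0.0.21": "compute2",
}

# Rack name Hadoop uses for nodes whose location is unknown.
DEFAULT_RACK = "/default-rack"


def resolve_racks(addresses: List[str]) -> List[str]:
    """Return one rack path per address; VMs on the same host share a rack."""
    return ["/" + VM_TO_HOST[a] if a in VM_TO_HOST else DEFAULT_RACK
            for a in addresses]


if __name__ == "__main__":
    # Hadoop passes the addresses as command-line arguments.
    for rack in resolve_racks(sys.argv[1:]):
        print(rack)
```

With this mapping, the two VMs on `compute1` are reported as one rack, so Hadoop's rack-aware placement would not treat them as independent failure domains.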
Spawning more VMs than needed is not a performance overhead since the VMs consume very few resources when they are not active.

OpenStack instances can have three types of storage: a root disk, an ephemeral disk that is non-persistent, and persistent storage attached over the network through OpenStack's volume service. The root disk of a virtual machine resides on the host machine and is not attached over the network. This implies that the root disk of the virtual machine does not depend on the network latency or bandwidth. Our solution to run Hadoop involves using the root disk for HDFS as shown in Figure 2. This avoids transfer of data over the network while running Hadoop jobs and is cost effective. For the Amazon comparison in Section 4.3, we use a similar setup based on the instance storage of the instance. Instance storage on EC2 is non-persistent, but we have found it to be faster and cheaper than standard EBS volumes.

Since the root disk is not persistent, i.e., data stored on the root disk is lost after the VM is terminated, we need to periodically snapshot the data in the root disk. This can be performed as an asynchronous background task. The overhead of snapshots is generally lower than the overhead of performing all I/O via the network. Most Big Data systems (e.g., Google search) write data once, but read it many times. Since only the writes have to be snapshotted, the overhead is lower. In practice, we also found that the storage does not disappear immediately if the VM crashes, so that if the VM can be re-booted quickly, the storage will not be lost.

3.3 Disk Partitioning Solution

Both the root disk and the ephemeral disk in Openstack are implemented as files on the local storage. An alternative would be to partition the local storage disks, and attach one or more of the partitions to the Hadoop VMs. This would give higher performance than our current solution, and is under implementation. The disadvantage of doing so is that a static partition of the disk would be dedicated to the Hadoop VM, whereas with the current implementation of root and ephemeral disks, it is possible for the amount of storage allocated to these disks to shrink and grow. Nevertheless, we intend to experiment with this alternative solution, as we believe that the gains in performance may outweigh the loss of flexibility for some applications.

Implementing the above solution is also not difficult in the current Openstack architecture. We assume that this storage is a new type of storage called local-persistent. It would be necessary to implement a new Openstack component that keeps track of the local-persistent partitions. Currently, Openstack contains a configuration flag libvirt_images_volume_group that specifies, on each compute node, the volume group that contains ephemeral disks. We plan to add a similar flag libvirt_localpersistent_volume_group that specifies the volume group containing local-persistent volumes. Access to these volumes would be via the usual Openstack access control mechanisms. Long-running Hadoop VMs could be started only on compute nodes that contain local-persistent storage by using the Openstack filter scheduler. This scheduler allows the administrator to filter the list of nodes on which a new VM is launched. The VM initialization sequence must also be modified to avoid formatting the local-persistent disks attached to the instance.

4. Benchmarks and Measurements

TestDFSIO is a standard benchmark used for testing the I/O performance of a Hadoop system [Mi2011]. In the following, we describe TestDFSIO, followed by a comparison of results of running this benchmark using our solution, standard Openstack, and Amazon. The comparison includes both performance and price comparisons of our solution with standard Openstack and Amazon.

Figure 3: Definitions of Throughput and Average I/O Rate

4.1 TestDFSIO

TestDFSIO is a distributed I/O benchmark that works as follows. When TestDFSIO is invoked on a Hadoop cluster, it invokes the MapReduce infrastructure to create a number of parallel tasks on each node (shown diagrammatically in Figure 6). The benchmark, therefore, simulates the operation of a real Hadoop task. Each parallel task does I/O to a separate file at the maximum possible rate. The I/Os can be writes, reads or a mixture of reads and writes. It is conventional to run TestDFSIO and measure the write performance first, so as to create the files for subsequent measurements of read performance [Mi2011]. The test writes into or reads from a specified number of files. File size is specified as a parameter to the test. Each file is accessed in a separate map task [Te2013].
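TestDFSIO reports two headline metrics, Throughput and Average I/O Rate, together with the standard deviation of the I/O rate. As a minimal sketch of how these can be derived from per-task measurements: Throughput is the total number of bytes divided by the sum of the per-task processing times, and Average I/O Rate is the mean of the per-task rates. The function name `dfsio_metrics` and the `(megabytes, seconds)` input format are assumptions made for illustration; they are not taken from the benchmark's source.

```python
import math
from typing import List, Tuple


def dfsio_metrics(tasks: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Compute (throughput, average I/O rate, rate std. deviation) in MB/s.

    tasks holds one (megabytes_processed, seconds_taken) pair per map task.
    """
    n = len(tasks)
    total_mb = sum(mb for mb, _ in tasks)
    total_s = sum(s for _, s in tasks)
    rates = [mb / s for mb, s in tasks]  # per-task I/O rate

    throughput = total_mb / total_s      # total bytes / sum of processing times
    avg_rate = sum(rates) / n            # mean of the per-task rates
    # Standard deviation from the mean of squares, as accumulated by a reducer
    # that collects "I/O rate" and "I/O rate squared".
    variance = sum(r * r for r in rates) / n - avg_rate ** 2
    return throughput, avg_rate, math.sqrt(abs(variance))
```

When all tasks run at the same rate the two metrics coincide; they diverge when some tasks are much slower than others, because slow tasks weigh more heavily in the sum of processing times.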
The reducer collects the following statistics:
- Number of tasks completed
- Number of bytes written/read
- Execution time
- I/O rate
- I/O rate squared

The following statistics are obtained after the job is completed:
- Read or write test
- Date and time the test finished
- Number of files
- Total number of bytes processed
- Throughput in MB/sec (total number of bytes / sum of processing times)
- Average I/O rate in MB/sec per file
- Standard deviation of I/O rate

TestDFSIO generates two important metrics. The Throughput is the total I/O by the cluster per unit time per node. For a TestDFSIO job using N map tasks, where the index 1 <= i <= N denotes the individual map tasks, the throughput is defined by the equation in Figure 3 [Mi2011]. The Average IO Rate measures the average I/O rate per node and is also given by an equation in Figure 3. For N identical nodes, the two values should be almost identical.

4.2 Performance Comparison

We first compare the performance of our solution against the standard method of deploying Hadoop on Openstack. TestDFSIO was run on a 5 node Hadoop cluster with a map capacity of 25. Each virtual machine we used has 4 VCPUs, 8 GB of RAM, a 32 GB root disk and 20 GB of ephemeral storage. The 5 VMs are run on a 5 node OpenStack cluster, i.e., each physical node hosts 1 VM. A total of 10 files, each of 1000 MB, were used to perform the benchmark. The configuration used in measuring our solution is shown in Figure 1. Figure 4 shows the standard method of deploying Hadoop on Openstack.

Figure 4: Standard Deployment of Hadoop on Openstack

Figure 5 compares the write performance of our proposed solution using locally attached disks against the standard method of using iSCSI disks for storing persistent data. The Y-axis of the figure is in units of MB/s. It can be seen that there is a substantial difference in write performance. As expected, the Throughput and Average IO Rate figures are very close. For read performance, the Average IO Rate values for the proposed solution and the standard solution are 230 MB/s and 176 MB/s, respectively.

Figure 5: Comparison of Write Performance

4.3 Cost Comparison

The performance comparison in the previous section shows that the performance of our solution is superior to the performance achievable with the standard method of deploying Hadoop on Openstack. It is possible to replace the iSCSI interconnect with a higher-performance interconnect. In this section, we show that our solution is likely to be more cost effective than solutions that use such interconnects.

The high-performance cloud storage that we use for a cost comparison of our solution is Amazon Elastic Block Store. We compare the Amazon EBS solution with an implementation of our solution on Amazon using Amazon EC2 root disks for storage of local data. While the exact implementation of Amazon EBS is not known, it is believed to be a cluster disk implementation [Bl2010].
As of this writing, Standard EBS disks are capable of supporting a steady I/O rate of up to 100 IOPS (I/O operations per second), with bursts of up to two or three times that rate. Additionally, Amazon also provides Provisioned EBS disks, which can support burst rates of up to 30,000 IOPS.

Detailed cost comparisons require taking many factors into account, for example the cost of hardware, and operational and maintenance costs. To provide an objective basis for such comparisons, we assume that the prices charged by Amazon for their services are indicative of the underlying costs of providing these services. For the cost comparison, we compare the costs of deploying two configurations on Amazon. The first configuration is similar to our solution, while the second configuration leverages high-performance EBS disks for storing Hadoop data.

The details of the configurations are as follows. In our experiments, we set up a 5 node Hadoop cluster on Amazon EC2 consisting of 1 master node and 4 slave nodes. The nodes in the Hadoop cluster were first-generation large instances (m1.large) with a 100 GB EBS Standard volume attached. The root disk is also a standard volume without provisioned IOPS. m1.large machines are known to give moderate I/O performance [Am2013a]. The instance comes with 850 GB of ephemeral storage, which is storage that is locally attached to the machine. The first configuration used the ephemeral disk for HDFS. This is similar to our solution, since the ephemeral disk is local storage. The second configuration used a standard EBS volume for HDFS data. This corresponds to using high-performance networked cloud storage.

Table 1: Services and their respective prices for AWS in North Virginia Region [Am2013]

Service          Cost
EC2 M1.Large     $0.24 per hour
EBS Standard     $0.10 per 1 million I/O operations
EBS-IOPs         $0.10 per IOPS-Month

Table 1 lists the costs of various Amazon services.
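Using the Table 1 prices, the per-hour cost of the two configurations can be sketched as follows. The sketch assumes roughly one million I/O operations per hour of benchmark activity, which is the figure measured for the TestDFSIO run, and considers a single instance; the function names `hourly_cost` and `saving_fraction` are hypothetical and serve only to make the arithmetic explicit.

```python
INSTANCE_PER_HOUR = 0.24   # EC2 m1.large, USD per hour (Table 1)
EBS_PER_MILLION_IO = 0.10  # EBS Standard, USD per million I/O operations (Table 1)


def hourly_cost(io_millions: float, use_ebs: bool) -> float:
    """Cost of one instance-hour; EBS I/O is charged only when EBS holds HDFS."""
    cost = INSTANCE_PER_HOUR
    if use_ebs:
        cost += io_millions * EBS_PER_MILLION_IO
    return cost


def saving_fraction(io_millions: float) -> float:
    """Fraction saved by using local (ephemeral) storage instead of EBS."""
    ebs = hourly_cost(io_millions, use_ebs=True)
    local = hourly_cost(io_millions, use_ebs=False)
    return (ebs - local) / ebs


if __name__ == "__main__":
    # Assumed workload: about one million I/O operations in the hour.
    print(f"EBS: ${hourly_cost(1.0, True):.2f}/h, "
          f"local: ${hourly_cost(1.0, False):.2f}/h, "
          f"saving: {saving_fraction(1.0):.0%}")
```

Under these assumptions the EBS configuration costs $0.34 per hour against $0.24 for local storage, a saving of $0.10 out of $0.34, or about 29%. The saving grows with the I/O intensity of the workload, since the EBS charge scales with the number of operations.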
While running TestDFSIO on AWS, we were able to find the exact number of I/O operations performed on the disk, using Amazon detailed monitoring. Our test run requires slightly over 1 million I/O operations if run continuously for an hour. If Hadoop were run on EBS volumes, we would be charged $0.10 in addition to the cost of running the instance for 1 hour, i.e., $0.24. Since our proposed solution eliminates the need for EBS volumes, we avoid the cost of running the volume. Therefore, our solution of running Hadoop on ephemeral disks in Amazon EC2 proves to be 29% cheaper than running it on EBS volumes.

Figure 6: TestDFSIO Operation

5. Conclusions

In this paper, we have shown that our solution for running Hadoop on a cloud infrastructure using local disks for persistent storage, rather than networked storage, has higher performance and is more cost-effective than the traditional alternatives. Based upon this, we argue that cloud infrastructures should support the use of persistent locally attached storage for efficient support of Big Data and other I/O intensive applications. The traditional argument for using networked storage for data availability does not apply to Big Data systems since they have replication and other availability methods already built in. Persistent local storage can co-exist with existing persistent network storage for other types of applications.

In our future work, we plan to extend our work to other applications, such as databases. We have also proposed allowing the attachment of disk partitions directly to VMs, as described in Section 3.3. This extension of our solution would further improve performance and efficiency.

Acknowledgments

We would like to thank Abhishek B. S., Mahesh A., Rakesh Kumar, Sandeep Raju, Shruti Ranade, Vijesh M and Vivek P for their helpful discussions and comments.

References

1. [Am2013] Amazon Elastic Cloud (EC2) Pricing.
2. [Am2013a] Amazon EC2 Instance Types.
3. [Ar2010] Armbrust, Michael, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee et al. "A view of cloud computing." Communications of the ACM 53, no. 4 (2010).
4. [Bl2010] Bliekertz, Soren. On Amazon EC2's Underlying Architecture. ecture.html
5. [Bo2008] Borthakur, Dhruba. "HDFS architecture guide." HADOOP APACHE PROJECT design. pdf (2008).
6. [Ci2012] Citrix XenServer Installation for CloudStack, Cloudstack Installation Guide, Chapter 8.2. US/Apache_CloudStack/ incubating/html/installation_guide/citrix-xenserverinstallation.html
7. [Ci2013] Volumes. Openstack Compute Administration Manual, Chapter 11.
8. [Cr2009] Creeger, Mache. "Cloud computing: An overview." ACM Queue 7, no. 5 (2009).
9. [EB2013] Amazon Elastic Block Store.
10. [Gh2003] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." In ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM.
11. [Lo2012] Local storage support for data volumes, Apache Cloudstack Project.
12. [Mi2011] Michael G. Noll. Benchmarking and Stress Testing an Hadoop Cluster with TeraSort, TestDFSIO,
13. [Te2013] TestDFSIO.java. elease /src/test/org/apache/hadoop/fs/TestDFSIO.java
14. [Sh2010] Shafer, Jeffrey, Scott Rixner, and Alan L. Cox. "The Hadoop distributed filesystem: Balancing portability and performance." In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE.
15. [Sh2010a] Shafer, Jeffrey. "I/O virtualization bottlenecks in cloud computing today." In Proceedings of the 2nd conference on I/O virtualization. USENIX Association.
16. [Su2010] Sur, Sayantan, Hao Wang, Jian Huang, Xiangyong Ouyang, and Dhabaleswar K. Panda. "Can High-Performance Interconnects Benefit Hadoop Distributed File System." In Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds (MASVDC). Held in Conjunction with MICRO.
17. [Xi2010] Xie, Jiong, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. "Improving mapreduce performance through data placement in heterogeneous hadoop clusters." In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE.


More information

Cloud Computing and Amazon Web Services

Cloud Computing and Amazon Web Services Cloud Computing and Amazon Web Services Gary A. McGilvary edinburgh data.intensive research 1 OUTLINE 1. An Overview of Cloud Computing 2. Amazon Web Services 3. Amazon EC2 Tutorial 4. Conclusions 2 CLOUD

More information

Proceedings of the Federated Conference on Computer Science and Information Systems pp. 737 741

Proceedings of the Federated Conference on Computer Science and Information Systems pp. 737 741 Proceedings of the Federated Conference on Computer Science and Information Systems pp. 737 741 ISBN 978-83-60810-22-4 DCFMS: A Chunk-Based Distributed File System for Supporting Multimedia Communication

More information

The Hidden Extras. The Pricing Scheme of Cloud Computing. Stephane Rufer

The Hidden Extras. The Pricing Scheme of Cloud Computing. Stephane Rufer The Hidden Extras The Pricing Scheme of Cloud Computing Stephane Rufer Cloud Computing Hype Cycle Definition Types Architecture Deployment Pricing/Charging in IT Economics of Cloud Computing Pricing Schemes

More information

Affinity Aware VM Colocation Mechanism for Cloud

Affinity Aware VM Colocation Mechanism for Cloud Affinity Aware VM Colocation Mechanism for Cloud Nilesh Pachorkar 1* and Rajesh Ingle 2 Received: 24-December-2014; Revised: 12-January-2015; Accepted: 12-January-2015 2014 ACCENTS Abstract The most of

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Amazon Cloud Storage Options

Amazon Cloud Storage Options Amazon Cloud Storage Options Table of Contents 1. Overview of AWS Storage Options 02 2. Why you should use the AWS Storage 02 3. How to get Data into the AWS.03 4. Types of AWS Storage Options.03 5. Object

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

PostgreSQL Performance Characteristics on Joyent and Amazon EC2

PostgreSQL Performance Characteristics on Joyent and Amazon EC2 OVERVIEW In today's big data world, high performance databases are not only required but are a major part of any critical business function. With the advent of mobile devices, users are consuming data

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012) 1. Computation Amazon Web Services Amazon Elastic Compute Cloud (Amazon EC2) provides basic computation service in AWS. It presents a virtual computing environment and enables resizable compute capacity.

More information

Tech Report TR-WP3-6-2.9.2013 Analyzing Virtualized Datacenter Hadoop Deployments Version 1.0

Tech Report TR-WP3-6-2.9.2013 Analyzing Virtualized Datacenter Hadoop Deployments Version 1.0 Longitudinal Analytics of Web Archive data European Commission Seventh Framework Programme Call: FP7-ICT-2009-5, Activity: ICT-2009.1.6 Contract No: 258105 Tech Report TR-WP3-6-2.9.2013 Analyzing Virtualized

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform

An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform A B M Moniruzzaman 1, Kawser Wazed Nafi 2, Prof. Syed Akhter Hossain 1 and Prof. M. M. A. Hashem 1 Department

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Hadoop Technology for Flow Analysis of the Internet Traffic

Hadoop Technology for Flow Analysis of the Internet Traffic Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Discovery 2015: Cloud Computing Workshop June 20-24, 2011 Berkeley, CA Introduction to Cloud Computing Keith R. Jackson Lawrence Berkeley National Lab What is it? NIST Definition Cloud computing is a model

More information

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM Julia Myint 1 and Thinn Thu Naing 2 1 University of Computer Studies, Yangon, Myanmar juliamyint@gmail.com 2 University of Computer

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com

More information

Performance Evaluation of the Illinois Cloud Computing Testbed

Performance Evaluation of the Illinois Cloud Computing Testbed Performance Evaluation of the Illinois Cloud Computing Testbed Ahmed Khurshid, Abdullah Al-Nayeem, and Indranil Gupta Department of Computer Science University of Illinois at Urbana-Champaign Abstract.

More information

Cloud Computing. Adam Barker

Cloud Computing. Adam Barker Cloud Computing Adam Barker 1 Overview Introduction to Cloud computing Enabling technologies Different types of cloud: IaaS, PaaS and SaaS Cloud terminology Interacting with a cloud: management consoles

More information

Beyond the Internet? THIN APPS STORE FOR SMART PHONES BASED ON PRIVATE CLOUD INFRASTRUCTURE. Innovations for future networks and services

Beyond the Internet? THIN APPS STORE FOR SMART PHONES BASED ON PRIVATE CLOUD INFRASTRUCTURE. Innovations for future networks and services Beyond the Internet? Innovations for future networks and services THIN APPS STORE FOR SMART PHONES BASED ON PRIVATE CLOUD INFRASTRUCTURE Authors Muzahid Hussain, Abhishek Tayal Ashish Tanwer, Parminder

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo Dipartimento di Automatica e Informatica, Politecnico di Torino,

More information

Lifetime Management of Cache Memory using Hadoop Snehal Deshmukh 1 Computer, PGMCOE, Wagholi, Pune, India

Lifetime Management of Cache Memory using Hadoop Snehal Deshmukh 1 Computer, PGMCOE, Wagholi, Pune, India Volume 3, Issue 1, January 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com ISSN:

More information

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Savanna Hadoop on. OpenStack. Savanna Technical Lead Savanna Hadoop on OpenStack Sergey Lukjanov Savanna Technical Lead Mirantis, 2013 Agenda Savanna Overview Savanna Use Cases Roadmap & Current Status Architecture & Features Overview Hadoop vs. Virtualization

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud

Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud Gunho Lee, Byung-Gon Chun, Randy H. Katz University of California, Berkeley, Yahoo! Research Abstract Data analytics are key applications

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2 CPU Memory Machine Learning, Statistics Classical Data Mining Disk 3 20+ billion web pages x 20KB = 400+ TB

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

Efficient Cloud Management for Parallel Data Processing In Private Cloud

Efficient Cloud Management for Parallel Data Processing In Private Cloud 2012 International Conference on Information and Network Technology (ICINT 2012) IPCSIT vol. 37 (2012) (2012) IACSIT Press, Singapore Efficient Cloud Management for Parallel Data Processing In Private

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

Understanding Data Locality in VMware Virtual SAN

Understanding Data Locality in VMware Virtual SAN Understanding Data Locality in VMware Virtual SAN July 2014 Edition T E C H N I C A L M A R K E T I N G D O C U M E N T A T I O N Table of Contents Introduction... 2 Virtual SAN Design Goals... 3 Data

More information

Building Storage as a Service with OpenStack. Greg Elkinbard Senior Technical Director

Building Storage as a Service with OpenStack. Greg Elkinbard Senior Technical Director Building Storage as a Service with OpenStack Greg Elkinbard Senior Technical Director MIRANTIS 2012 PAGE 1 About the Presenter Greg Elkinbard Senior Technical Director at Mirantis Builds on demand IaaS

More information

Scalable Multiple NameNodes Hadoop Cloud Storage System

Scalable Multiple NameNodes Hadoop Cloud Storage System Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai

More information

POSIX and Object Distributed Storage Systems

POSIX and Object Distributed Storage Systems 1 POSIX and Object Distributed Storage Systems Performance Comparison Studies With Real-Life Scenarios in an Experimental Data Taking Context Leveraging OpenStack Swift & Ceph by Michael Poat, Dr. Jerome

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

The Google File System

The Google File System The Google File System By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Presented at SOSP 2003) Introduction Google search engine. Applications process lots of data. Need good file system. Solution:

More information

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage Clodoaldo Barrera Chief Technical Strategist IBM System Storage Making a successful transition to Software Defined Storage Open Server Summit Santa Clara Nov 2014 Data at the core of everything Data is

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India rlrooparl@gmail.com 2 PDA College of Engineering, Gulbarga, Karnataka,

More information

An Improved Data Placement Strategy in a Heterogeneous Hadoop Cluster

An Improved Data Placement Strategy in a Heterogeneous Hadoop Cluster Send Orders for Reprints to reprints@benthamscience.ae 792 The Open Cybernetics & Systemics Journal, 2015, 9, 792-798 Open Access An Improved Data Placement Strategy in a Heterogeneous Hadoop Cluster Wentao

More information

Berkeley Ninja Architecture

Berkeley Ninja Architecture Berkeley Ninja Architecture ACID vs BASE 1.Strong Consistency 2. Availability not considered 3. Conservative 1. Weak consistency 2. Availability is a primary design element 3. Aggressive --> Traditional

More information

Figure 1. The cloud scales: Amazon EC2 growth [2].

Figure 1. The cloud scales: Amazon EC2 growth [2]. - Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues

More information

Efficient Metadata Management for Cloud Computing applications

Efficient Metadata Management for Cloud Computing applications Efficient Metadata Management for Cloud Computing applications Abhishek Verma Shivaram Venkataraman Matthew Caesar Roy Campbell {verma7, venkata4, caesar, rhc} @illinois.edu University of Illinois at Urbana-Champaign

More information

Dell Reference Configuration for Hortonworks Data Platform

Dell Reference Configuration for Hortonworks Data Platform Dell Reference Configuration for Hortonworks Data Platform A Quick Reference Configuration Guide Armando Acosta Hadoop Product Manager Dell Revolutionary Cloud and Big Data Group Kris Applegate Solution

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Amazon EC2 Product Details Page 1 of 5

Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of

More information

OpenStack Introduction. November 4, 2015

OpenStack Introduction. November 4, 2015 OpenStack Introduction November 4, 2015 Application Platforms Undergoing A Major Shift What is OpenStack Open Source Cloud Software Launched by NASA and Rackspace in 2010 Massively scalable Managed by

More information

A Very Brief Introduction To Cloud Computing. Jens Vöckler, Gideon Juve, Ewa Deelman, G. Bruce Berriman

A Very Brief Introduction To Cloud Computing. Jens Vöckler, Gideon Juve, Ewa Deelman, G. Bruce Berriman A Very Brief Introduction To Cloud Computing Jens Vöckler, Gideon Juve, Ewa Deelman, G. Bruce Berriman What is The Cloud Cloud computing refers to logical computational resources accessible via a computer

More information

StorPool Distributed Storage Software Technical Overview

StorPool Distributed Storage Software Technical Overview StorPool Distributed Storage Software Technical Overview StorPool 2015 Page 1 of 8 StorPool Overview StorPool is distributed storage software. It pools the attached storage (hard disks or SSDs) of standard

More information

HDFS Space Consolidation

HDFS Space Consolidation HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004

More information

STeP-IN SUMMIT 2013. June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

STeP-IN SUMMIT 2013. June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case) 10 th International Conference on Software Testing June 18 21, 2013 at Bangalore, INDIA by Sowmya Krishnan, Senior Software QA Engineer, Citrix Copyright: STeP-IN Forum and Quality Solutions for Information

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Scalable Architecture on Amazon AWS Cloud

Scalable Architecture on Amazon AWS Cloud Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies kalpak@clogeny.com 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Hosting Transaction Based Applications on Cloud

Hosting Transaction Based Applications on Cloud Proc. of Int. Conf. on Multimedia Processing, Communication& Info. Tech., MPCIT Hosting Transaction Based Applications on Cloud A.N.Diggikar 1, Dr. D.H.Rao 2 1 Jain College of Engineering, Belgaum, India

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information