Evaluating HDFS I/O Performance on Virtualized Systems
Xin Tang
University of Wisconsin-Madison
Department of Computer Sciences

Abstract

Hadoop as a Service (HaaS) has received increasing attention because it offers a flexible, low-cost way to do big data analysis. This project studies the I/O performance of the Hadoop Distributed File System (HDFS) on both native and virtualized systems by implementing a new tool that uses Chen and Patterson's workload model to benchmark the throughput of HDFS. The benchmark program consists of two main modules, a worker and a manager, which read in four parameters and handle process requests. The test is divided into two experiments, the second being an optimized version of the first. The results generally match expectations for the virtualized and native systems, and they also suggest new possibilities for future optimization.

1. Introduction

Nowadays, in a time of booming technological innovation, people demand larger hard drives, larger memory, and so on. Things are better if larger. As small data gets outcompeted by big data, Hadoop, which offers big data analysis at a low cost, has been brought into the spotlight. An increasing number of companies have moved their IT infrastructure to the virtualized cloud to lower management costs. When this movement meets the demand for big data, a new market is born: Hadoop as a Service (HaaS) [1]. Several companies already provide Hadoop services in the cloud, such as Amazon EMR (Elastic MapReduce). The HaaS market is growing rapidly even though it is relatively new. It is predicted that the market will grow from 131 million dollars in 2012 to 1.9 billion dollars in 2016 [1]. Such growth potential indicates the significance and necessity of research on the performance of Hadoop in the cloud.

As a big data analysis tool, Hadoop's key component is its file system, which handles the storage and transfer of large data files.
The performance of file input and output in the file system greatly affects overall Hadoop performance. Therefore, a tool that evaluates the I/O performance of Hadoop would play an important role in improving Hadoop's performance in the cloud. The default and most popular benchmark tool for HDFS is TestDFSIO, which provides basic benchmarking functionality. However, this tool is not comprehensive enough to fully support the optimization of Hadoop's performance in the cloud. To address this, I have implemented a benchmark based on Chen and Patterson's workload model for I/O systems [2], which is better suited to Hadoop optimization in this project. I have tested my implementation on HDFS on both native and virtualized systems and obtained some preliminary results.

The rest of this report is organized as follows. Section 2 covers related work. Sections 3 and 4 introduce my benchmark workload model and implementation. Sections 5 and 6 discuss the experiments that I have run. Section 7 concludes the project and discusses future work.

2. Related Works

2.1 Hadoop and HDFS

This section introduces Hadoop and HDFS, the objects of my I/O benchmark, for readers who are unfamiliar with them.

Hadoop: an open source framework that implements the MapReduce programming model [3]. It supports data-intensive distributed applications on large clusters of commodity hardware. Because it offers a low-cost solution to big data analysis, Hadoop is becoming increasingly popular. Prominent users of Hadoop include Yahoo!, Facebook, and Amazon.

Hadoop Distributed File System (HDFS): a distributed, scalable, and portable file system for the Hadoop framework [4]. HDFS stores large files in blocks of 64 MB and accesses them sequentially.

Since Hadoop is a well-known technology, I only briefly introduce the features that affect the design of my benchmark. Readers interested in learning more about Hadoop and HDFS are encouraged to read the online documentation [4].

2.2 TestDFSIO

TestDFSIO is the default I/O test tool for HDFS. It lets users select the file size and buffer size, and it computes the throughput and I/O rate of the current Hadoop file system setup. Users can use it as a basic tool to understand the performance of their Hadoop clusters. However, the tool has a few limitations. First, its benchmark model does not consider concurrency. The model assumes that only one job is reading or writing in the Hadoop system, whereas in a real deployment multiple jobs can issue I/O requests simultaneously. Second, the test has to run within the MapReduce framework, so it cannot be used to benchmark HDFS alone.

3. Workload Model

The core of any benchmark used in performance evaluation is the workload model. At the suggestion of the smart, handsome and excellent program chair of CADAVER 13, I use the canonical I/O system workload model proposed by Chen and Patterson to provide a good synthetic evaluation of HDFS [2]. This workload model defines five parameters that account for the first-order performance effects in I/O systems:

1) uniqueBytes: the number of unique data bytes read and written in a workload.
2) sizeMean: the average size of an I/O request.
3) readFrac: the fraction of requests that are reads.
4) processNum: the number of processes simultaneously issuing I/O requests.
5) seqFrac: the fraction of requests that follow the prior request sequentially.
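The five parameters can be pictured as a small record, with a helper that varies one field while holding the rest fixed. The following is a minimal sketch, not the report's implementation: the names `Workload` and `vary`, the validation rules, and the default of 1.0 for `seq_frac` are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    """One point in the workload parameter space (hypothetical helper type)."""
    unique_bytes: int       # uniqueBytes: total bytes read and written
    size_mean: int          # sizeMean: average request size in bytes
    read_frac: float        # readFrac: fraction of requests that are reads
    process_num: int        # processNum: concurrent request issuers
    seq_frac: float = 1.0   # seqFrac: assumed default, fully sequential

    def __post_init__(self):
        if self.unique_bytes <= 0 or self.size_mean <= 0:
            raise ValueError("unique_bytes and size_mean must be positive")
        if self.process_num < 1:
            raise ValueError("process_num must be at least 1")
        for name in ("read_frac", "seq_frac"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")


def vary(base, field, values):
    """Return copies of `base` that differ only in `field`."""
    return [replace(base, **{field: v}) for v in values]


# Illustrative values only, not a configuration taken from the experiments.
base = Workload(unique_bytes=1 << 20, size_mean=4096, read_frac=0.5, process_num=2)
for w in vary(base, "process_num", [1, 2, 4]):
    print(w)
```

Making the record immutable means every variation is a fresh copy, so a sweep over one parameter cannot accidentally change the baseline.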
This model is comprehensive and easy to implement. Unlike TestDFSIO, which only considers the file and buffer sizes, the parameters of this model also reflect the spatial locality, temporal locality, and concurrency of an I/O system. Because this is a parameterized synthetic benchmark model, I can test the effect of one parameter simply by keeping the others unchanged.

As a final note in this section, HDFS is designed to process large amounts of data sequentially. Hence, although I use this model as the core of my benchmark, the last parameter, seqFrac, is ignored in my implementation.

4. Implementation

This section describes my implementation, which tailors Chen and Patterson's workload model to run in the Hadoop framework. The benchmark program consists of two main modules: worker and manager.

Manager: defines the configuration of the I/O tests and starts workers to do file I/O in HDFS. It sets the default values of all four parameters, selects the parameters and their ranges to test, and determines the number of runs for each test. During each run, it starts a number of workers in Hadoop equal to the processNum of the current run. The manager passes the values of uniqueBytes, sizeMean, and readFrac to the workers and waits. After all the workers complete their tasks, it reads their file I/O processing times and takes the average as the file I/O time of the current run.

Worker: reads and writes files in HDFS and records the total processing time of its I/O requests. After getting the parameters from the manager, the worker clears the memory cache and starts to read data from or write data to two separate files in HDFS. It issues I/O requests repeatedly until the total number of bytes read and written is equal to or greater than uniqueBytes. To simulate the I/O requests of a real system, I use a pseudo-random number generator to randomize the order of reads and writes.
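The manager and worker roles can be sketched as follows. This is a rough illustration under several assumptions, not the report's code: it uses local temporary files instead of HDFS, threads instead of workers launched in Hadoop, a fixed request size equal to sizeMean, and it skips the cache-clearing step. The function names `worker` and `manager` are hypothetical. Each request is a read when a uniform draw is at most readFrac, and a write otherwise.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def worker(worker_id, unique_bytes, size_mean, read_frac, workdir):
    """Issue requests until unique_bytes are transferred; return time and byte counts."""
    rng = random.Random(worker_id)  # seeded so a run is repeatable
    read_path = os.path.join(workdir, f"read_{worker_id}.dat")
    write_path = os.path.join(workdir, f"write_{worker_id}.dat")
    with open(read_path, "wb") as f:  # source file for read requests
        f.write(os.urandom(size_mean))

    payload = b"\0" * size_mean
    bytes_read = bytes_written = 0
    start = time.perf_counter()
    with open(read_path, "rb") as src, open(write_path, "wb") as dst:
        while bytes_read + bytes_written < unique_bytes:
            if rng.random() <= read_frac:   # read request
                src.seek(0)
                bytes_read += len(src.read(size_mean))
            else:                           # write request
                bytes_written += dst.write(payload)
    return time.perf_counter() - start, bytes_read, bytes_written


def manager(unique_bytes, size_mean, read_frac, process_num):
    """Start process_num workers, wait for all of them, and average their I/O times."""
    with tempfile.TemporaryDirectory() as workdir:
        with ThreadPoolExecutor(max_workers=process_num) as pool:
            futures = [
                pool.submit(worker, i, unique_bytes, size_mean, read_frac, workdir)
                for i in range(process_num)
            ]
            results = [f.result() for f in futures]
    mean_time = sum(r[0] for r in results) / len(results)
    return mean_time, results


mean_time, results = manager(unique_bytes=256 * 1024, size_mean=4096,
                             read_frac=0.5, process_num=2)
print(f"mean I/O time: {mean_time:.4f} s")
```

Timings from this sketch say nothing about HDFS itself; it only illustrates the control flow between the two modules.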
The generator returns a number chosen pseudo-randomly with uniform distribution between 0 and 1. The worker reads data if the returned number is smaller than or equal to readFrac and writes data otherwise.

Figure 1. The interactions between the Manager and Workers.

5. First Experiment

This section discusses the setup and results of my first experiment, in which I ran my benchmark tool on HDFS.

5.1 Setup

Due to limited resources, I conducted this experiment on a single laptop with an Intel Core 2 (2.4 GHz) processor and 4 GB of 1067 MHz DDR3 memory. The native OS was Mac OS 10.6, the virtualization platform was VirtualBox, and the virtualized OS was Ubuntu. Since the computer had limited disk space, I used relatively small values in this experiment. The default parameter values were 64 MB for uniqueBytes, 1 KB for sizeMean, 1 for processNum, and 0 for readFrac. I tested how each parameter affects overall throughput by changing its value multiple times while keeping the other three at their defaults. The testing ranges were: uniqueBytes (64 MB to 2 GB), sizeMean (1 KB to 2 MB), processNum (1 to 5), and readFrac (0 to 1).

I expected that an increase in uniqueBytes or processNum would decrease throughput, while an increase in sizeMean or readFrac would increase it. I also expected the throughput changes caused by these parameters to be more pronounced in the virtualized system than in the native one, because the virtual system was running on top of another OS and had less memory than the native system.

5.2 Results

Figures 2 to 5 show the results of my test. Throughput in the virtual environment was lower than in the native environment in most of the tests. However, the results deviated in some ways from what I had expected. The uniqueBytes curve in Figure 2 is relatively flat (horizontal), which differs from my expectation. This may be because uniqueBytes is not large enough to force the file system to continuously swap data in and out.
Figures 3, 4, and 5 generally match my expectations. Figure 3 supports my prediction that throughput increases as sizeMean increases, and it also shows that the virtualized system's throughput is more sensitive to sizeMean than the native system's. Figure 4 matches my expectation that throughput decreases as processNum increases. Figure 5 shows that throughput increases with readFrac in both the native and the virtualized systems. There is also a large jump in the curve for the virtualized system. Because the virtualized system runs on top of the native system, data can still reside in the host's cache even after the virtualized system clears its own memory cache. This explains the dramatic increase in the graph.

Overall, from these graphs I conclude that the tool is a good benchmark that reflects the performance of HDFS, but that the parameter values used in this test are not well chosen enough to yield more significant results. To improve this, I can design a new test with a larger uniqueBytes, so as to reach the point where throughput starts to decrease, and with a larger default sizeMean. 1 KB is so small that it may become the factor that dominates performance (e.g., a run with a default sizeMean of 2 KB can take as long as two hours).
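Counting requests gives a rough sense of why a small sizeMean can dominate run time. The sketch below assumes that every request transfers exactly sizeMean bytes (the model only specifies an average) and that KB and MB are binary units. The helper name `requests_per_worker` is hypothetical.

```python
KB = 1024          # assumption: binary units
MB = 1024 * KB


def requests_per_worker(unique_bytes, size_mean):
    """Requests needed for the bytes transferred to reach unique_bytes (ceiling division)."""
    return (unique_bytes + size_mean - 1) // size_mean


# First-experiment default uniqueBytes of 64 MB at several request sizes.
for size_mean in (1 * KB, 2 * KB, 1 * MB):
    n = requests_per_worker(64 * MB, size_mean)
    print(f"sizeMean = {size_mean} B -> {n} requests per worker")
```

Under these assumptions, a 64 MB workload takes 65,536 requests at 1 KB but only 64 at 1 MB, so any fixed per-request overhead is paid about a thousand times more often at the smaller size.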
Figure 2. Results of testing uniqueBytes from 64 MB to 2 GB in the first experiment.

Figure 3. Results of testing sizeMean from 1 KB to 1 MB in the first experiment.

Figure 4. Results of testing processNum from 1 to 5 in the first experiment.

Figure 5. Results of testing readFrac from 0 to 1 in the first experiment.
6. Second Experiment

Based on the lessons learned from the first experiment, I conducted a second experiment with an improved design. This section discusses its setup and results.

6.1 Setup

This time I used two machines, one running the native system and one running the virtualized system, to speed up the testing process. Both machines have an Intel i5 (2.4 GHz) processor, and the rest of the configuration is the same as in the first experiment. The design of this experiment is generally the same as the previous one except for the default parameter values. The new defaults are: uniqueBytes 4.5 GB, sizeMean 1 MB, processNum 2, and readFrac 0.5. I selected this set of parameters to obtain a more reliable result by imitating a real-life setting. The new test ranges are: uniqueBytes (4 GB to 12 GB), sizeMean (1 KB to 2 MB), processNum (1 to 6), and readFrac (0 to 1).

6.2 Results

Surprisingly, in this experiment the virtualized system outperformed the native system, which deviated from what I expected. This could possibly be attributed to the read fraction of 0.5, a value that is too large for an optimal result. Figure 6 shows another flat graph, similar to Figure 2. It is possible that the data size is still not large enough to reach the bottleneck. The same reason may explain the flat graph in Figure 7. Figure 8 matches the result from the first experiment.

Figure 6. Results of testing uniqueBytes from 4 GB to 12 GB in the second experiment.
Figure 7. Results of testing sizeMean from 32 KB to 2 MB in the second experiment.

Figure 8. Results of testing processNum from 1 to 6 in the second experiment.
Figure 9. Results of testing readFrac from 0 to 1 in the second experiment.

7. Conclusions and Future Work

In this project, I have implemented Chen and Patterson's I/O system workload model as a better test tool than TestDFSIO for benchmarking HDFS. The implementation was tested on both native and virtualized systems. The results show that the tool captures the load and concurrency characteristics of HDFS, but it also has potential that this project did not reveal and that further exploration of the parameter space could uncover. This project inspires me to learn from these tests and possibly focus on self-scaling benchmarks to find the parameters for optimal performance. In addition, the current tests were conducted in Hadoop's pseudo-distributed mode, and it would be essential to see how the tool performs in fully distributed mode. Hadoop's performance on other VM platforms would also be a potential area of study.

8. References

[1] TechNavio. (2013). Global Hadoop-as-a-service Market [Online]. Available:
[2] P. M. Chen and D. A. Patterson, "A New Approach to I/O Performance Evaluation--Self-Scaling I/O Benchmarks, Predicted I/O Performance", ACM Transactions on Computer Systems, November
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI '04: Sixth Symposium on Operating System Design and Implementation, pages ,
[4] The Apache Software Foundation. (2008). Welcome to Apache Hadoop! [Online]. Available:
More informationIndex Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
More informationComparison of Windows IaaS Environments
Comparison of Windows IaaS Environments Comparison of Amazon Web Services, Expedient, Microsoft, and Rackspace Public Clouds January 5, 215 TABLE OF CONTENTS Executive Summary 2 vcpu Performance Summary
More informationVirtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
More informationmarlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
More informationPerformance Analysis: Benchmarking Public Clouds
Performance Analysis: Benchmarking Public Clouds Performance comparison of web server and database VMs on Internap AgileCLOUD and Amazon Web Services By Cloud Spectator March 215 PERFORMANCE REPORT WEB
More informationHDFS Users Guide. Table of contents
Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationNoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
More informationScalability Factors of JMeter In Performance Testing Projects
Scalability Factors of JMeter In Performance Testing Projects Title Scalability Factors for JMeter In Performance Testing Projects Conference STEP-IN Conference Performance Testing 2008, PUNE Author(s)
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationPepper: An Elastic Web Server Farm for Cloud based on Hadoop. Subramaniam Krishnan, Jean Christophe Counio Yahoo! Inc. MAPRED 1 st December 2010
Pepper: An Elastic Web Server Farm for Cloud based on Hadoop Subramaniam Krishnan, Jean Christophe Counio. MAPRED 1 st December 2010 Agenda Motivation Design Features Applications Evaluation Conclusion
More informationScalable Multiple NameNodes Hadoop Cloud Storage System
Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai
More informationHadoop on OpenStack Cloud. Dmitry Mescheryakov Software Engineer, @MirantisIT
Hadoop on OpenStack Cloud Dmitry Mescheryakov Software Engineer, @MirantisIT Agenda OpenStack Sahara Demo Hadoop Performance on Cloud Conclusion OpenStack Open source cloud computing platform 17,209 commits
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationComparison of Web Server Architectures: a Measurement Study
Comparison of Web Server Architectures: a Measurement Study Enrico Gregori, IIT-CNR, enrico.gregori@iit.cnr.it Joint work with Marina Buzzi, Marco Conti and Davide Pagnin Workshop Qualità del Servizio
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationHiBench Installation. Sunil Raiyani, Jayam Modi
HiBench Installation Sunil Raiyani, Jayam Modi Last Updated: May 23, 2014 CONTENTS Contents 1 Introduction 1 2 Installation 1 3 HiBench Benchmarks[3] 1 3.1 Micro Benchmarks..............................
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationComputing in clouds: Where we come from, Where we are, What we can, Where we go
Computing in clouds: Where we come from, Where we are, What we can, Where we go Luc Bougé ENS Cachan/Rennes, IRISA, INRIA Biogenouest With help from many colleagues: Gabriel Antoniu, Guillaume Pierre,
More informationVirtual Machine Based Resource Allocation For Cloud Computing Environment
Virtual Machine Based Resource Allocation For Cloud Computing Environment D.Udaya Sree M.Tech (CSE) Department Of CSE SVCET,Chittoor. Andra Pradesh, India Dr.J.Janet Head of Department Department of CSE
More informationOpen Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)
Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University
More informationWHITE PAPER Optimizing Virtual Platform Disk Performance
WHITE PAPER Optimizing Virtual Platform Disk Performance Think Faster. Visit us at Condusiv.com Optimizing Virtual Platform Disk Performance 1 The intensified demand for IT network efficiency and lower
More informationIntroducing EEMBC Cloud and Big Data Server Benchmarks
Introducing EEMBC Cloud and Big Data Server Benchmarks Quick Background: Industry-Standard Benchmarks for the Embedded Industry EEMBC formed in 1997 as non-profit consortium Defining and developing application-specific
More informationA METHODOLOGY FOR IDENTIFYING THE RELATIONSHIP BETWEEN PERFORMANCE FACTORS FOR CLOUD COMPUTING APPLICATIONS
A METHODOLOGY FOR IDENTIFYING THE RELATIONSHIP BETWEEN PERFORMANCE FACTORS FOR CLOUD COMPUTING APPLICATIONS Luis Bautista 1, 2, Alain April 2, Alain Abran 2 1 Department of Electronic Systems, Autonomous
More informationAccelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
More informationPerformance characterization report for Microsoft Hyper-V R2 on HP StorageWorks P4500 SAN storage
Performance characterization report for Microsoft Hyper-V R2 on HP StorageWorks P4500 SAN storage Technical white paper Table of contents Executive summary... 2 Introduction... 2 Test methodology... 3
More informationGoogle File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationBig Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
More informationWhite Paper. Recording Server Virtualization
White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...
More informationPerformance And Scalability In Oracle9i And SQL Server 2000
Performance And Scalability In Oracle9i And SQL Server 2000 Presented By : Phathisile Sibanda Supervisor : John Ebden 1 Presentation Overview Project Objectives Motivation -Why performance & Scalability
More informationBig Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect
on AWS Services Overview Bernie Nallamotu Principle Solutions Architect \ So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze
More information