The Impact of Virtualization on High Performance Computing Clustering in the Cloud


The Impact of Virtualization on High Performance Computing Clustering in the Cloud Master Thesis Report Submitted in Fall 2013 In partial fulfillment of the requirements for the degree of Master of Science in Software Engineering at the School of Science and Engineering of Al Akhawayn University in Ifrane By Ouidad ACHAHBAR Supervised by Dr. Mohamed Riduan ABID Ifrane, Morocco January, 2014

Acknowledgment

I would like to express my deepest and sincere gratitude to ALLAH for giving me guidance and strength to complete this work, and for having the chance to study and complete my master's degree with great support from my family, friends and professors. Thank you ALLAH.

I would also like to deeply thank my supervisor Dr. Abid for trusting me to conduct this research, providing me with valuable feedback and overseeing my progress on a weekly basis. Thank you Dr. Abid for your motivation and support. My gratitude also goes to Dr. Haitouf, who provided me with valuable comments and shared with me his knowledge of cloud computing and distributed systems. Thank you Dr. Haitouf.

I am most thankful to my dear parents, brothers, sisters, nephews and fiancé for their continuous support, encouragement and love. There are no words to express my gratitude to all of you.

Many thanks go to my very close friends: Nora El Bakraoui Alaoui, Inssaf El Boukari, Sara El Alaoui, Aida Tahditi, Jamila Barroug, Wafa Bouya and Chahrazad Touzani. Thank you for always being by my side, thank you for sharing enjoyable moments with me, and thank you for being my friends.

Last but not least, special acknowledgements go to all my professors for their support, respect and encouragement. Thank you Ms. Hanaa Talei, Ms. Asmaa Mourhir, Dr. Naeem Nizar Sheikh, Mr. Omar Iraqui, Dr. Violetta Cavalli Sforza, Dr. Kevin Smith and Dr. Harroud.

Ouidad Achahbar

Abstract

The ongoing pervasiveness of Internet access is greatly increasing the production of big data. This, in turn, increases the demand for compute power to process these massive data, thus making High Performance Computing (HPC) a highly solicited service. Based on the paradigm of providing computing as a utility, the cloud offers user-friendly infrastructures for processing big data, e.g., High Performance Computing as a Service (HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization technique, since the latter controls the creation of the virtual machine instances that carry out data processing jobs.

In this thesis, we characterize and evaluate the impact of machine virtualization on HPCaaS. We track HPC performance under different cloud virtualization platforms, namely KVM and VMware ESXi, and compare it to the performance of a physical computing cluster infrastructure. The virtualized environment is deployed using Hadoop on top of OpenStack. The resulting HPCaaS runs MapReduce algorithms on benchmarked big data samples using a granularity of 8 physical machines per cluster.

We obtained several interesting results when running the selected benchmarks on the virtualized and physical clusters. Each tested cluster showed different performance trends. Yet, the overall analysis of the research findings showed that the selection of the virtualization technology can lead to significant improvements when running and handling HPCaaS.

Abstract (translated from the Arabic version)

The continued spread of Internet access and use is a main cause of the growing production of big data. This, in turn, increases the demand for high computing capacity to process these data, which has made High Performance Computing an attractive service. Based on the model of providing computing as a utility, cloud computing offers easy-to-use infrastructures for processing big data, for example High Performance Computing as a Service. However, the performance of the latter is strongly tied to the virtualization technique, since it controls the creation of the virtual machines (virtual computers) that carry out data processing jobs.

In this thesis, we describe and evaluate the impact of virtualization on High Performance Computing as a Service. We also track the performance of High Performance Computing on different virtualized cloud platforms and on a physical cluster made up of eight computers. We used OpenStack to build the High Performance Computing as a Service and Hadoop to run MapReduce algorithms on large data.

Through the results of this research, we observed a significant change in the performance of High Performance Computing as the data size, the type of computing infrastructure (physical or virtual) and the size of the cluster change. Nevertheless, the conclusion we reached establishes that the virtualization technique plays an important and considerable role in improving the performance of High Performance Computing.

Table of Contents

Acknowledgment
Abstract
Abstract (Arabic)
Table of Contents
List of Figures
List of Tables
List of Appendices
List of Acronyms

PART I: THESIS OVERVIEW
Chapter 1: Introduction
  1.1. Background
  1.2. Motivation
  1.3. Problem Statement
  1.4. Research Question
  1.5. Research Objective
  1.6. Research Approach
  1.7. Thesis Organization

PART II: THEORETICAL BASELINES
Chapter 2: Cloud Computing
  3.1. Cloud Computing Definition
  3.2. Cloud Computing Characteristics
  3.3. Cloud Computing Service Models
  3.4. Cloud Computing Deployment Models
  3.5. Cloud Computing Benefits
  3.6. Cloud Computing Providers
Chapter 3: Virtualization
  4.1. Definition of Virtualization
  4.2. History of Virtualization
  4.3. Benefits of Virtualization
  4.4. Virtualization Approaches
  4.5. Virtual Machine Manager
Chapter 4: Big Data and High Performance Computing as a Service
  5.1. Big Data
  5.2. High Performance Computing as a Service (HPCaaS)
Chapter 5: Literature Review and Research Contribution
  5.1. Related Work
  5.2. Contribution

PART III: TECHNOLOGY ENABLERS
Chapter 6: Technology Enablers Selection
  6.1. Cloud Platform Selection
  6.2. Distributed and Parallel System Selection
Chapter 7: OpenStack
  7.1. OpenStack Overview
  7.2. OpenStack History
  7.3. OpenStack Components
  7.4. OpenStack Supported Hypervisors
Chapter 8: Hadoop
  8.1. Hadoop Overview
  8.2. Hadoop History
  8.3. Hadoop Architecture
  8.4. Hadoop Implementation
  8.5. Hadoop Cluster Connectivity

PART III: RESEARCH CONTRIBUTION
Chapter 9: Research Methodology
  9.1. Research Approach
  9.2. Research Steps
Chapter 10: Experimental Setup
  10.1. Experimental Hardware
  10.2. Experimental Software and Network
  10.3. Clusters Architecture
  10.4. Experimental Performance Benchmarks
  10.5. Experimental Datasets Size
  10.6. Experiment Execution
Chapter 11: Experimental Results
  11.1. Hadoop Physical Cluster Results
  11.2. Hadoop Virtualized Cluster - KVM Results
  11.3. Hadoop Virtualized Cluster - VMware ESXi Results
  11.4. Results Comparison
Chapter 12: Discussion
  12.1. TeraSort
  12.2. TestDFSIO
  12.3. Conclusion

PART IV: CONCLUSION
Chapter 13: Conclusion and Future Work

Bibliography
Appendix A: OpenStack with KVM Configuration
Appendix B: OpenStack with VMware ESXi Configuration
Appendix C: Hadoop Configuration
Appendix D: TeraSort and TestDFSIO Execution
Appendix E: Data Gathering for TeraSort
Appendix F: Data Gathering for TestDFSIO

List of Figures

Figure 1: Thesis organization
Figure 2: NIST visual model of cloud computing definition
Figure 3: Services provided in cloud computing environment
Figure 4: Full virtualization architecture
Figure 5: Paravirtualization architecture
Figure 6: Hardware assisted virtualization architecture
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor
Figure 8: Xen hypervisor architecture
Figure 9: KVM hypervisor architecture
Figure 10: VMware ESXi architecture
Figure 11: Data growth over 2008 and 2020
Figure 12: Active cloud community population
Figure 13: Active distributed systems population
Figure 14: OpenStack conceptual architecture
Figure 15: Nova subcomponents
Figure 16: Glance subcomponents
Figure 17: Keystone subcomponents
Figure 18: Swift subcomponents
Figure 19: Cinder subcomponents
Figure 20: Quantum subcomponents
Figure 21: Apache Hadoop subprojects
Figure 22: Hadoop Architecture
Figure 23: HDFS and MapReduce representation
Figure 24: Word count MapReduce example
Figure 25: Research steps
Figure 26: Hadoop Physical Cluster
Figure 27: Hadoop Physical Cluster architecture
Figure 28: Hadoop virtualized cluster - KVM
Figure 29: Hadoop virtualized cluster VMware ESXi (a)
Figure 30: Hadoop virtualized cluster VMware ESXi (b)
Figure 31: Experimental execution
Figure 32: TeraSort performance on Hadoop Physical Cluster
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster
Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster
Figure 36: TeraSort performance for 30 GB on Hadoop Physical Cluster
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster
Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster
Figure 39: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster
Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster
Figure 44: TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster
Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster
Figure 47: TeraSort performance on Hadoop KVM Cluster
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster
Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster
Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster
Figure 52: TestDFSIO-Write performance on Hadoop KVM Cluster
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster
Figure 54: TestDFSIO-Write performance for 1 GB on Hadoop KVM Cluster
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster
Figure 56: TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster
Figure 59: TestDFSIO-Read performance for 1 GB on Hadoop KVM Cluster
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 61: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 62: TeraSort performance on Hadoop VMware ESXi Cluster
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 64: TeraSort performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 66: TeraSort performance for 30 GB on Hadoop VMware ESXi Cluster
Figure 67: TestDFSIO-Write performance on Hadoop VMware ESXi Cluster
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 69: TestDFSIO-Write performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 71: TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 72: TestDFSIO-Read performance on Hadoop VMware ESXi Cluster
Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 74: TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 76: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 77: Average time for sorting 100 MB on HPhC, HVC with KVM and ...
Figure 78: Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 79: Average time for sorting 10 GB on HPhC, HVC with KVM and ...
Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and ...
Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 82: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi
Figure 83: Average time for writing 100 GB on HPhC, HVC with KVM
Figure 84: Average time for reading 1 GB on HPC, HVC with KVM and VMware ESXi
Figure 85: Average time for reading 1 GB on HPC, HVC with KVM and HVC VMware ESXi
Figure 86: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 87: Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 88: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs
Figure 89: System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs
Figure 90: OpenStack warning statistics about system resources usage

List of Tables

Table 1: A comparison of cloud deployment models
Table 2: Cloud IaaS selection
Table 3: Parallel and distributed platform selection
Table 4: OpenStack releases
Table 5: OpenStack projects
Table 6: Apache Hadoop subprojects
Table 7: Dell OptiPlex 755 computer features (used for Hadoop physical cluster)
Table 8: Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster
Table 9: OpenStack virtual machines features
Table 10: Experimental performance metrics
Table 11: Datasets size used for Hadoop benchmarks
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Table 17: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Table 18: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster
Table 19: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster
Table 20: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster

List of Appendices

Appendix A: OpenStack with KVM Configuration
Appendix B: OpenStack with VMware ESXi Configuration
Appendix C: Hadoop Configuration
Appendix D: TeraSort and TestDFSIO Execution
Appendix E: Data Gathering for TeraSort
Appendix F: Data Gathering for TestDFSIO

List of Acronyms

HPC: High Performance Computing
HPCaaS: High Performance Computing as a Service
VM: Virtual Machine
VMM: Virtual Machine Manager
EMC: American Multinational Corporation
DCI: Digital Communications Inc.
GFS: Google File System
HDFS: Hadoop Distributed File System
NDFS: Nutch Distributed File System
DOE: Department of Energy National Laboratories
NIST: National Institute of Standards and Technology
SaaS: Software as a Service
PaaS: Platform as a Service
IaaS: Infrastructure as a Service
NoSQL: Not Only Structured Query Language
SNIA: Storage Networking Industry Association
ACID: Atomicity, Consistency, Isolation and Durability
AWS: Amazon Web Services
HPhC: Hadoop Physical Cluster
HVC: Hadoop Virtualized Cluster
SSH: Secure Shell
JSON: JavaScript Object Notation
XML: Extensible Markup Language
API: Application Programming Interface
Amazon EC2: Amazon Elastic Compute Cloud
Amazon S3: Amazon Simple Storage Service
VLAN: Virtual Local Area Network
DHCP: Dynamic Host Configuration Protocol

Part I: Thesis Overview

This part introduces the key points needed to understand the purpose of the present research. It provides an introduction to the research, starting with its background, motivation, problem statement, research question, objective and research methodology.

Chapter 1: Introduction

In this chapter, we first present the background of the present research, and then describe the motivation and the problem behind conducting this study. After that, the questions, objectives, and methodology of the research are stated. Finally, an outline of the thesis is given at the end of this chapter.

1.1. Background

During the last decades, the demand for computing power has steadily increased as data generated from social networks, web pages, sensors, online transactions, etc. keeps growing. A 2012 study by American Multinational Corporation (EMC) estimated that from 2005 to 2020, data will grow by a factor of 300 (from 130 exabytes to 40,000 exabytes), and therefore that digital data will roughly double every two years [1]. This growth of data constitutes the Big Data phenomenon. As Big Data grows in terms of volume, velocity and value, the current technologies for storing, processing and analyzing data become inefficient and insufficient. A Gartner survey stated that data growth is considered the largest challenge for organizations [2]. To address this issue, High Performance Computing (HPC) has started to be widely integrated in managing and handling Big Data. In this case, HPC is used to process and analyze Big Data related to different problems, including scientific, engineering and business problems that require high computation capabilities, high bandwidth, and low-latency networks [3].

However, HPC still lacks the toolsets that fit the current growth of data. Consequently, new paradigms and storage tools were integrated with HPC to deal with the current challenges related to data management. These include providing computing as a utility (cloud computing) and introducing new parallel and distributed paradigms. Cloud computing plays an important role as it provides organizations with the ability to analyze and store data economically and efficiently.
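As a quick sanity check of the EMC figures quoted earlier in this section, the implied doubling time can be computed directly. The sketch below is only an illustration: the variable names are ours, and it assumes steady exponential growth between 2005 and 2020, which the cited study is not claimed to state in this form.

```python
import math

# Figures quoted in the text (EMC estimate, reference [1]).
start_exabytes = 130        # data volume in 2005
end_exabytes = 40_000       # projected data volume in 2020
elapsed_years = 2020 - 2005

# Overall growth factor (the text rounds this to "a factor of 300").
growth_factor = end_exabytes / start_exabytes

# Assumption: growth is exponential at a constant rate, so the number of
# doublings is log2 of the growth factor.
doublings = math.log2(growth_factor)
doubling_time_years = elapsed_years / doublings

print(f"growth factor: {growth_factor:.1f}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under this assumption the doubling time comes out at roughly 1.8 years, which is consistent with the statement that digital data doubles about every two years.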
Performing HPC in the cloud was introduced as data started to be migrated to and managed in the cloud. Digital Communications Inc. (DCI) stated that by 2020, a significant portion of digital data will be managed in the cloud, and even if a byte in the digital universe is not stored in the cloud, it will pass, at some point, through the cloud [4]. Performing HPC in the cloud is known as High Performance Computing as a Service (HPCaaS). In short, HPCaaS offers a high-performance, on-demand, and scalable HPC environment that can handle the complexity and challenges related to Big Data [5].

One of the most widely known and adopted parallel and distributed systems is the MapReduce model, which was developed by Google to meet the growth of its web search indexing process [6]. MapReduce computations are performed with the support of a data storage system known as the Google File System (GFS). The success of both GFS and MapReduce inspired the development of Hadoop, a distributed and parallel system that implements MapReduce and the Hadoop Distributed File System (HDFS) [7]. Nowadays, Hadoop is widely adopted by big players in the market because of its scalability, reliability and low cost of implementation. Hadoop has also been proposed for integration with HPC as an underlying technology that distributes the work across an HPC cluster [8, 9].

1.2. Motivation

Many solutions have been proposed and developed to improve the computation performance of Big Data. Some of them aim to improve the efficiency of algorithms, provide new distributed paradigms or develop powerful clustering environments. However, few of those solutions have addressed the whole picture of integrating HPC with the current emerging technologies in terms of storage and processing. As stated before, some of the most popular technologies currently used in hosting and processing Big Data are cloud computing, HDFS and Hadoop MapReduce [10].

At present, the use of HPC in cloud computing is still limited. The first step towards this research was taken by the Department of Energy National Laboratories (DOE), which started exploring the use of cloud services for scientific computing [11]. In addition, in 2009, Yahoo Inc. launched partnerships with major universities in the United States to conduct more research on cloud computing, distributed systems and high computing applications.
HPCaaS still needs more investigation to decide on appropriate environments that can fit high computing requirements. One of the HPCaaS aspects that has not yet been investigated is the impact of different virtualization technologies on HPC in the cloud. Therefore, the motivation of this research lies in the need to evaluate HPCaaS performance using MapReduce and different virtualization techniques. This motivation is supported by a strong rationale: MapReduce implementations and cloud computing platforms are freely available as open source.
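For readers unfamiliar with the MapReduce model referred to above, the following is a minimal single-process sketch of its three conceptual steps (map, shuffle, reduce) applied to word counting. It is written in plain Python for illustration only: it does not use Hadoop or any Hadoop API, and the function names are our own assumptions.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word of one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce step: combine the values of each key, here by summing them."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: list[str]) -> dict[str, int]:
    """Run the three steps in sequence over a list of input splits."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real deployment the map and reduce steps run in parallel on different nodes and the input splits are read from a distributed file system; the sketch only shows how the data flows between the steps.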

1.3. Problem Statement

Cloud computing offers a set of services for processing Big Data; one of these services is HPCaaS. Still, HPCaaS performance is highly affected by the underlying virtualization techniques, which are considered the heart of cloud computing. The problem addressed in this research is therefore formulated as follows: HPCaaS still suffers from poor performance and still does not fit Big Data requirements.

1.4. Research Question

Addressing the problem statement, this thesis aims to answer the following research questions:

1. What is the performance of HPC on a Hadoop Physical Cluster (HPhC)?
2. Is it worth moving HPC to the cloud?
3. How do virtualization techniques affect HPCaaS performance?
4. Is there an optimal virtualization technique that can ensure good performance?

1.5. Research Objective

The purpose of the present research is to find solutions to the issues and questions raised in the previous sections. Hence, this research introduces a new architecture that can handle HPC complexity and increase its performance. The proposed architecture consists of building a Hadoop Virtualized Cluster (HVC) in a private cloud using OpenStack. The first goal of this research is to investigate the added value of adopting a virtualized cluster, and the second goal is to evaluate the impact of virtualization techniques on HPCaaS.

1.6. Research Approach

To evaluate HPCaaS over different virtualization technologies, we followed both qualitative and quantitative research methodologies. The qualitative approach was adopted to select the appropriate technology enablers used in building an architecture that addresses the issues raised in this study. The quantitative approach was adopted to conduct different experiments on three different clusters: the Hadoop Physical Cluster (HPhC), the Hadoop Virtualized Cluster using KVM (HVC-KVM) [12] and the Hadoop Virtualized Cluster using VMware ESXi (HVC-VMware ESXi) [13]. Each experiment measures the performance of HPC.
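The comparison between the three clusters can be expressed, for example, as the relative overhead of each virtualized cluster against the physical baseline. The following is a minimal sketch under that assumption: the overhead metric, the function names and the sample numbers are hypothetical placeholders chosen for illustration, and they are not measurements or definitions taken from this thesis.

```python
from statistics import mean


def average_time(run_times_s: list[float]) -> float:
    """Average execution time, in seconds, over repeated runs of a benchmark."""
    if not run_times_s:
        raise ValueError("at least one run is required")
    return mean(run_times_s)


def relative_overhead(virtual_avg_s: float, physical_avg_s: float) -> float:
    """Extra time of a virtualized cluster as a fraction of the physical baseline.

    0.25 means the virtualized cluster took 25% longer than the physical one;
    a negative value means it was faster.
    """
    if physical_avg_s <= 0:
        raise ValueError("the baseline time must be positive")
    return (virtual_avg_s - physical_avg_s) / physical_avg_s


if __name__ == "__main__":
    # Made-up placeholder timings (seconds), NOT results from the experiments.
    physical_runs = [100.0, 100.0]
    virtual_runs = [120.0, 130.0]
    overhead = relative_overhead(average_time(virtual_runs), average_time(physical_runs))
    print(f"relative overhead: {overhead:.0%}")
```

Averaging over several runs before comparing reduces the influence of a single unusually slow or fast execution, which is why the sketch separates the averaging step from the comparison step.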

1.7. Thesis Organization

The rest of this thesis is structured as follows (Figure 1):

Part I covers chapter 1 (the current chapter), which introduces the present research.
Part II covers chapters 2, 3, 4 and 5. Chapter 2 provides a basic understanding of cloud computing; chapter 3 introduces virtualization; chapter 4 presents the concepts of Big Data and HPCaaS; and chapter 5 lists some related work and clearly states our contribution.
Part III covers chapters 6, 7 and 8. Chapter 6 explains the steps we followed in selecting the technology enablers of this research, and chapters 7 and 8 present OpenStack and Hadoop in detail, respectively.
Part IV covers chapters 9, 10, 11 and 12. Chapter 9 presents the methodology adopted in conducting this research; chapter 10 describes the preparation of the environment needed to run the experiments; chapter 11 presents the results; and chapter 12 discusses the research findings.
Part V covers chapter 13, which summarizes the research findings and proposes some future work. This part also includes the bibliography and the appendices of this study.

Figure 1: Thesis organization

Part II: Theoretical Baselines

The objective of this part is to elaborate on the scientific concepts, theories and topics that serve as a foundation for understanding the whole picture of the present research. This part is structured as follows: chapter 2 gives the basic background of cloud computing; chapter 3 introduces a technology closely related to cloud computing, namely virtualization; chapter 4 presents Big Data and HPCaaS; and chapter 5 situates this research by introducing previous research done in the domain of evaluating HPC.

Chapter 2: Cloud Computing

Cloud computing has become the current innovative and emerging trend in delivering IT services, attracting the interest of both academia and industry. Using advanced technologies, cloud computing provides end users with a variety of services, ranging from hardware-level services to application-level services. Cloud computing is understood as utility computing over the Internet: computing services have moved from local data centers to hosted services that are offered over the Internet and paid for on a pay-per-use basis [14]. This chapter provides an overview of the cloud computing concept. It gives a precise definition of cloud computing, defines its characteristics, describes cloud service and deployment models, discusses some cloud computing benefits, and finally lists some cloud computing providers.

3.1. Cloud Computing Definition

In the late 1960s, John McCarthy brought a new concept into the computer science field, predicting that technology will not only be provided as tangible products [14]. In other words, computer resources will be provided as a service, like water and electricity. The concept was known as utility computing, and nowadays it is known as cloud computing. Cloud computing is defined by NIST (National Institute of Standards and Technology) [15] in 2009 as:

"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models."

The NIST definition sheds light on the effective use of cloud computing, in that shared resources require minimal management effort. It sets out five characteristics that define cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. Concerning the deployment models, NIST has classified them into private, public, community and hybrid cloud. More details about cloud characteristics, service models and deployment models are provided in the upcoming subsections.

The NIST definition of cloud computing is summarized in Figure 2, which encapsulates cloud computing characteristics, service models, and deployment models.

Figure 2: NIST visual model of cloud computing definition [14]

3.2. Cloud Computing Characteristics

NIST lists five main characteristics that precisely describe cloud computing [15]:

On-demand self-service: end users can use and change computing capabilities as desired without requiring human interaction with each service provider.
Broad network access: resources are accessed over the network using standard mechanisms.
Resource pooling: the provider's computing resources are pooled to serve multiple consumers; these resources are dynamically assigned and reassigned according to consumer demand. Examples of resources include storage, processing, memory, and network bandwidth.
Rapid elasticity: cloud providers can elastically scale resources in and out depending on current end user demand. Resources therefore appear to be available for provisioning in any quantity at any time.
Measured service: resource usage can be monitored, controlled and measured; this enables end users to pay according to the pay-as-you-go model.

Other characteristics were investigated in [16] and are listed as follows:

Reliability: this feature is ensured by implementing and providing multiple redundant sites. With this feature, cloud computing is considered an ideal solution for disaster recovery and business-critical tasks.
Customization: cloud computing allows customization of infrastructure and applications based on end user demand.
Efficient resource utilization: resources are delivered only for as long as they are needed.

3.3. Cloud Computing Service Models

Based on the NIST definition of cloud computing, cloud service models are classified as follows:

Software as a Service (SaaS)
SaaS covers the application software together with the operating system and computing resources it runs on. End users can view the SaaS model as a web-based application interface where services and complete software applications are delivered over the Internet. Some examples of SaaS applications are Google Docs, Microsoft Office Live, and Salesforce Customer Relationship Management.

Platform as a Service (PaaS)
This service allows end users to create and deploy applications on the provider's cloud infrastructure. In this case, end users do not manage or control the underlying cloud infrastructure such as the network, servers, operating systems, or storage. However, they do have control over the deployed applications, since they can design, model, develop and test them. Examples of PaaS are Google App Engine, Microsoft Azure, and Salesforce.

Infrastructure as a Service (IaaS)
This service consists of a set of virtualized computing resources such as network bandwidth, storage capacity, memory, and processing power. These resources can be used to deploy and run arbitrary software, which can include operating systems and applications. An example of an IaaS provider is Amazon Web Services.

Cloud services are summarized in Figure 3.

Figure 3: Services provided in a cloud computing environment [16]

3.4. Cloud Computing Deployment Models

Private Cloud
A private cloud is provisioned for exclusive use by a single organization. The cloud in this case is owned, managed and operated by the organization, a third party, or both. The advantage of a private cloud is high security, since the cloud is accessed only by trusted entities within the organization [15].

Public Cloud
The cloud infrastructure is provisioned for use by the general public. It may be owned, managed, and operated by a cloud service provider who offers services on a pay-per-use basis. In contrast to a private cloud, a public cloud is considered an untrusted environment [15].

Community Cloud
The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from different organizations that share some concerns (e.g., mission, security requirements, policy, and compliance considerations). In this case, the cloud may be owned, managed, and operated by one or more organizations in the community, a third party, or a combination of them [15].

Hybrid Cloud
This cloud is a combination of private and public cloud computing environments. A hybrid cloud gives an organization high flexibility and choice; for instance, the critical core activities of an organization can run under the control of the private part of the hybrid cloud while other tasks are outsourced to the public part [17].

Table 1 summarizes the cloud deployment models discussed above [17].

Table 1: A comparison of cloud deployment models [17]

3.5. Cloud Computing Benefits

Nowadays, the cloud is widely used because of the benefits it provides to end users. Some of the key benefits include [17, 18]:

Initial Cost Savings
Organizations or individuals can avoid the large initial investment needed to launch new hardware, products and services; the cloud computing platform offers the needed resources in terms of infrastructure, platform and applications.

Scalability
Cloud computing ensures high scalability by scaling up resources as needed. When usage increases, resources increase accordingly to respond to end user demand.

Availability
Cloud providers have the infrastructure and bandwidth to accommodate business requirements for high-speed access, storage and systems.

Reliability
Cloud computing implements redundant paths to support business continuity and disaster recovery.

Maintenance
End users are not concerned with resource maintenance, since it is done by the cloud service provider.

3.6. Cloud Computing Providers

Many providers offer cloud services with different features and pricing. Some of them are listed as follows [16, 19]:

Amazon Web Services
Amazon Web Services (AWS) [20] offers a number of cloud services for businesses of all sizes. AWS applies advanced data privacy techniques to protect users' data, and it has obtained various security certifications and audits such as ISO 27001, FISMA Moderate and SAS 70 Type II. Some AWS services are Elastic Compute Cloud, Simple Storage Service, and SimpleDB (a non-relational data storage service that stores, processes and queries data sets in the cloud).

Google
Google [21] offers high accessibility and usability in its cloud services. Some Google services are Google App Engine, Gmail, Google Docs, Google Analytics, and Picasa (a tool used to display products and upload their images to the cloud).

Microsoft
Microsoft [22] offers a well-known cloud platform called Windows Azure, which runs Windows applications. Other services include SQL Azure and Windows Azure Marketplace (an online market for buying and selling applications and data).

OpenStack
OpenStack [23] is an open source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was founded by Rackspace Hosting and NASA.

Other organizations that invest in the cloud include Dell, IBM, Oracle, HP, and Salesforce [16].

Chapter 3: Virtualization

Cloud providers use many different technologies and practices; some of them are Internet protocols for communication, virtual private cloud provisioning, load balancing and scalability, distributed processing, high performance computing technologies and virtualization [24]. This chapter focuses on virtualization technology, as it is considered the core of cloud computing. It describes in detail the history, benefits, types and abstraction layer of virtualization.

4.1. Definition of Virtualization

Virtualization is a widely used term; it has been known for many years as a powerful technology in computer science. The definition of virtualization can change depending on which component of the computer system it is applied to. However, it is broadly defined as an abstraction layer between physical resources and their logical representation [25]. NIST defines virtualization as [26]:

"The simulation of the software and/or hardware upon which other software runs. This simulated environment is called a virtual machine (VM). There are many forms of virtualization, distinguished primarily by computing architecture layer. For example, application virtualization provides a virtual implementation of the application programming interface (API) that a running application expects to use, allowing applications developed for one platform to run on another without modifying the application itself. The Java Virtual Machine (JVM) is an example of application virtualization; it acts as an intermediary between the Java application code and the operating system (OS). Another form of virtualization, known as operating system virtualization, provides a virtual implementation of the OS interface that can be used to run applications written for the same OS as the host, with each application in a separate VM container."

Furthermore, virtualization is defined by SNIA (Storage Networking Industry Association) as follows [27]:

"The act of abstracting, hiding, or isolating the internal functions of a storage (sub) system or service from applications, host computers, or general network resources, for the purpose of enabling application and network-independent management of storage or data."

From both definitions, we can say that virtualization is a methodology for dividing a physical machine into multiple execution environments that allow multiple tasks to run simultaneously. This is done by providing a software abstraction layer called the Virtual Machine Manager (VMM) or hypervisor. The VMM is designed to hide the physical resources from the operating system. It allows multiple guest operating systems (OS) to be created, each guest running inside a software unit called a Virtual Machine (VM) [28].

4.2. History of Virtualization

The roots of virtualization go back to the first virtualized IBM mainframes, designed in the 1960s, which allowed the company to run multiple applications and processes simultaneously. The main drivers behind the introduction of virtualization were the high cost of hardware and the need to run and isolate applications on the same hardware. During the 1970s, the adoption of virtualization technology increased sharply in many organizations because of its cost effectiveness. However, in the 1980s and 1990s, hardware prices dropped and multitasking operating systems emerged. As a result, there was less need to ensure high CPU utilization, and therefore less need for virtualization technology. Yet in the late 1990s, virtualization technology was brought back to the market with the founding of VMware Inc., which originated from research at Stanford University. Nowadays, virtualization is widely used to reduce management costs by replacing many under-utilized servers with a single server [29].

4.3. Benefits of Virtualization

There are many reasons that push organizations to adopt virtualization technology; some of them are listed in [24, 29, 30] as follows:

Server Consolidation
Virtualization condenses multiple servers into one physical server that hosts many virtual machines. This allows the physical server to run at a high rate of utilization, and at the same time reduces hardware maintenance, power and cooling costs.

Application Consolidation
Legacy applications might require newer hardware and/or operating systems. In this case, virtualization can be used to provide the new requirements virtually.
Sandboxing
Virtualization can provide a secure and isolated environment by running virtual machines that host foreign or less-trusted applications.

Multiple Simultaneous OS
Virtualization makes it possible to have multiple operating systems running simultaneously, each running different types of applications.

Reducing Cost
Virtualization reduces deployment and configuration costs by requiring less hardware, less space and less staffing. Furthermore, virtualization reduces the cost of networking by requiring less wiring and fewer switches and hubs.

4.4. Virtualization Approaches

Virtualization can take different forms depending on which component of the computer system it is applied to [31]. In this section, we describe three well-known virtualization techniques: Full Virtualization, Paravirtualization, and Hardware Assisted Virtualization.

4.4.1. Full Virtualization

In full virtualization, the guest OS is fully abstracted from the hardware level by adding a virtualization layer: the VMM or hypervisor. In this case, the guest OS is not aware that it is being virtualized, and it requires no modifications. This approach provides each VM with all the services of the physical system, including a virtual BIOS, virtual devices and virtualized memory management. To manage the communication between the different layers, full virtualization combines binary translation and direct execution (Figure 4). Binary translation is used to convert guest OS instructions into host instructions. Application or user-level instructions, on the other hand, are executed directly on the processor to ensure high performance [32]. Microsoft Virtual Server is an example of full virtualization.

Figure 4: Full virtualization architecture [32]

4.4.2. Paravirtualization

The fundamental issue with full virtualization is the emulation of devices within the hypervisor. Paravirtualization addresses this issue by making the guest OS aware that it is being virtualized and by giving it more direct access to the underlying hardware. In paravirtualization, the guest code is modified to use a different interface that accesses either the hardware directly or the virtual resources controlled by the hypervisor [32]. In more detail, paravirtualization changes the OS kernel to replace non-virtualizable instructions with hypercalls that communicate directly with the hypervisor. Thus, when a privileged operation is to be executed by the guest OS, it is delivered to the hypervisor (instead of the OS) through a hypercall; the hypervisor receives this hypercall and accesses the hardware to return the needed result (Figure 5). Xen is one of the systems that adopt paravirtualization.

Figure 5: Paravirtualization architecture [32]

The downside of paravirtualization is that the guest must be modified to integrate hypervisor awareness. This is a limitation, as some operating systems do not allow such modifications (e.g. Windows 2000/XP), and even the ones that can be modified may need additional resources for maintenance and troubleshooting [32].

4.4.3. Hardware Assisted Virtualization

Hardware Assisted Virtualization allows the VMM to run directly on the hardware. In this case, the VMM controls the access of the guest OS to the hardware resources. As depicted in Figure 6, privileged and sensitive calls are sent directly to the hypervisor, removing the need for binary translation or paravirtualization. VMware ESX Server is one of the main competing VMMs that use this approach [29].
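The three approaches in Sections 4.4.1 to 4.4.3 differ mainly in how a privileged guest operation reaches the hypervisor. The toy model below is only an illustrative sketch of that difference, not a description of any real hypervisor: the class `Hypervisor`, the function `run_guest`, the mode names and the instruction names are all hypothetical and were chosen for this example.

```python
# Toy model (assumption: all names here are illustrative, not a real API).
# User-level instructions run directly; privileged ones are routed to the
# hypervisor by a mechanism that depends on the virtualization approach.

PRIVILEGED = {"set_page_table", "disable_interrupts"}


class Hypervisor:
    def __init__(self):
        self.log = []  # records how each privileged operation arrived

    def binary_translate(self, op):
        # Full virtualization: the unmodified guest instruction is rewritten
        # by the hypervisor before it executes.
        self.log.append(("translated", op))
        return f"done:{op}"

    def hypercall(self, op):
        # Paravirtualization: the modified guest kernel calls the hypervisor
        # explicitly.
        self.log.append(("hypercall", op))
        return f"done:{op}"

    def hardware_trap(self, op):
        # Hardware-assisted: the CPU extension traps the instruction and
        # hands control to the hypervisor.
        self.log.append(("trap", op))
        return f"done:{op}"


def run_guest(instructions, mode, hv):
    """Execute a list of instruction names under one of three modes."""
    route = {
        "full": hv.binary_translate,
        "para": hv.hypercall,
        "hvm": hv.hardware_trap,
    }[mode]
    results = []
    for op in instructions:
        if op in PRIVILEGED:
            results.append(route(op))          # goes through the hypervisor
        else:
            results.append(f"direct:{op}")     # direct execution on the CPU
    return results


if __name__ == "__main__":
    hv = Hypervisor()
    print(run_guest(["add", "set_page_table"], "para", hv))
    print(hv.log)
```

In all three modes the user-level instruction takes the same direct path; only the route taken by the privileged instruction changes, which is the point the section makes.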

Figure 6: Hardware assisted virtualization architecture [32]

4.5. Virtual Machine Manager

As defined before, the hypervisor or VMM is the layer between the host operating system and the guest operating systems, or the layer between the hardware and the guest operating systems. In [25], the author sets out three main properties that a VMM must maintain. First, the VMM has to provide an environment that is identical to the original machine being virtualized. Second, programs running in a VM should show the same performance as on the original machine, or only a minor decrease. Finally, the VMM needs to control all system resources provided to the VMs.

4.5.1. Hypervisor Types

Hypervisors are classified into Type 1 and Type 2. A Type 1 hypervisor runs directly on the system hardware; it monitors the guest operating systems and allocates all the needed resources, including disk, memory, CPU and I/O peripherals. Having no intermediary between the Type 1 hypervisor and the physical layer leads to efficient performance in terms of hardware access and to a higher security level (Figure 7-a). A Type 2 hypervisor, on the other hand, runs on a host operating system that provides virtualization services such as I/O and memory management (Figure 7-b). Having an intermediary layer between the hypervisor and the hardware makes the installation process easier than for a Type 1 hypervisor, since the operating system is in charge of hardware configuration such as networking and storage [33].

Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor [33]

The differences between Type 1 and Type 2 hypervisors can lead to different performance results. The layer between the hardware and the hypervisor in Type 2 makes its performance less efficient than that of Type 1. A sample scenario that illustrates this difference is a virtual machine that requires a hardware interaction (e.g., reading from disk); in this case, a Type 2 hypervisor first needs to pass the request to the host operating system, which then passes it to the hardware layer. Besides performance, the reliability of a Type 1 hypervisor is higher than that of Type 2. For instance, a failure in the host operating system directly affects the guests hosted by a Type 2 hypervisor; the availability of a Type 2 hypervisor therefore depends strongly on the availability of the operating system. However, a Type 2 hypervisor has the advantage of fewer hardware and driver issues, as the host operating system is responsible for interfacing with the hardware [34].

4.5.2. Examples of Hypervisors

a) Xen Hypervisor

The Xen hypervisor is a Type 1, or bare metal, hypervisor that is widely used for paravirtualization [35]. It is managed by a specific privileged guest (privileged VM) called Domain-0 (Dom0). Dom0 runs on the hypervisor and is responsible for managing all aspects of the other, unprivileged virtual machines, which are known as DomainU (DomU). Furthermore, Dom0 has direct access to the resources of the physical machine, which is not the case for DomU guests [36]. The overall architecture of the Xen hypervisor is shown in Figure 8.

Figure 8: Xen hypervisor architecture

Xen supports paravirtualization as well as full virtualization. In paravirtualization, DomU guests are referred to as DomU PV Guests, and they can be modified Linux, Solaris, FreeBSD, and other UNIX operating systems [37]. DomU PV Guests are aware that they are running in a virtualized environment, and they do not have direct access to the hardware resources. In this case, the guest operating system is modified to make special calls (hypercalls) to the hypervisor for privileged operations, instead of the regular system calls of a traditional unmodified operating system. In full virtualization, on the other hand, DomU guests are referred to as DomU HVM Guests and run any standard, unchanged operating system [37]. A DomU HVM Guest is not aware that it is sharing processing time on the hardware, nor of the presence of other virtual machines. In this case, DomU HVM requires processors that specifically support hardware virtualization extensions (Intel VT or AMD-V). Virtualization extensions allow many of the privileged kernel instructions (which in PV were converted to hypercalls) to be handled by the hardware using the trap-and-emulate technique.

b) KVM Hypervisor

The KVM hypervisor provides a full virtualization solution based on the Linux operating system. It works by reusing the hardware assisted virtualization extensions that were already developed; KVM therefore requires the presence of Intel VT or AMD-V extensions on the host system. When KVM is loaded, it converts the kernel into a bare metal hypervisor. As a result, it takes full advantage of many components that are already present within the kernel, such as memory management and scheduling [38]. KVM is implemented using two main components. The first is the KVM loadable module which, when installed in the Linux kernel, provides management of the virtualization hardware (Figure 9). The second component provides PC platform emulation, which is offered by a modified version of QEMU. QEMU is executed as a user-space process, coordinating with the kernel for guest operating system requests [39].

Figure 9: KVM hypervisor architecture

c) VMware ESXi Hypervisor

VMware was an early leader in virtualization technology. One of its virtualization products is VMware ESXi, which is installed directly on top of the physical machine [40]. VMware ESXi was introduced in 2007 to provide high levels of reliability and performance to companies of all sizes. The overall architecture of VMware ESXi is illustrated in Figure 10. The main component is the vmkernel, which contains all the processes necessary to manage VMs. It provides functionality similar to that found in other operating systems, such as process creation and control, signals, file system, and process threads. The vmkernel thus supports running multiple virtual machines and provides core functionality such as resource scheduling, I/O stacks and device drivers [24].

Figure 10: VMware ESXi architecture [40]
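The KVM and Xen HVM descriptions in this section both require a host CPU with Intel VT or AMD-V extensions. As a minimal sketch, assuming an x86 Linux host where `/proc/cpuinfo` exposes the `vmx` (Intel) and `svm` (AMD) CPU flags, the check could look as follows. The function names are hypothetical and introduced only for this example; the check does not detect extensions that are present but disabled in firmware.

```python
def virtualization_extension(cpuinfo_text):
    """Return 'Intel VT-x', 'AMD-V' or None from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # Each logical CPU has a line such as "flags : fpu vme ... vmx ..."
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    if "vmx" in flags:
        return "Intel VT-x"
    if "svm" in flags:
        return "AMD-V"
    return None


def host_extension(path="/proc/cpuinfo"):
    """Read the host's CPU flags; returns None if the file is unavailable."""
    try:
        with open(path) as handle:
            return virtualization_extension(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    found = host_extension()
    print(found if found else "no hardware virtualization flag found")
```

Keeping the parsing separate from the file access makes the logic testable on machines that are not Linux hosts.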

Chapter 4: Big Data and High Performance Computing as a Service

As big companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in terms of users and data generated, the capacity and computing power of current data tools have become insufficient for processing, analyzing, managing, and storing data efficiently. IBM estimates that 2.5 quintillion bytes of data are created every day, and that 90% of the data in the world today has been created in the last two years [41]. Likewise, Oracle estimated that 2.5 zettabytes of data were generated in 2012, and that this amount will grow significantly every year (Figure 11) [42]. The increase in data size to many terabytes and petabytes is known as Big Data. To handle the complexity of Big Data, HPC is adopted to provide high computation capabilities, high bandwidth, and low-latency networking. This chapter provides an overview of the Big Data phenomenon and the HPCaaS concept.

Figure 11: Data growth between 2008 and 2020 [54]

5.1. Big Data

5.1.1. Big Data Definition

Big Data is defined as large and complex datasets that are generated from different sources, including social media, online transactions, sensors, smart meters and administrative services [43]. With all these sources, the size of Big Data goes beyond the ability of typical tools to store, analyze and process data. The literature on Big Data divides the concept into four dimensions: Volume, Velocity, Variety and Value [43].

Volume: the size of the data generated is very large, ranging from terabytes to petabytes.
Velocity: data grows continuously at an exponential rate.
Variety: data is generated in different forms: structured, semi-structured and unstructured. These forms require new techniques that can handle data heterogeneity.
Value: the challenge in Big Data is to identify what is valuable, so as to be able to capture, transform and extract data for analysis.

5.1.2. Big Data Technologies

With the Big Data phenomenon, there is an increasing demand for new technologies that can support the volume, velocity, variety and value of data. Some of these technologies are NoSQL, parallel and distributed paradigms, and new cloud computing trends that can support the four dimensions of Big Data. NoSQL (Not Only Structured Query Language) is the transition from relational databases to non-relational databases [44]. It is characterized by the ability to scale horizontally, the ability to replicate and partition data over many servers, and the ability to provide high performance operations. However, moving from relational to NoSQL systems has meant giving up some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability) [45]. In this context, NoSQL properties are described by the CAP theorem [46], which states that developers must make trade-off decisions between consistency, availability and partition tolerance. Some examples of NoSQL tools are Cassandra [47], HBase [48], MongoDB [49] and CouchDB [50]. Other supporting technologies for Big Data are parallel and distributed paradigms (e.g. Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in the upcoming chapters (Part III- Chapter 8, 9).

5.2. High Performance Computing as a Service (HPCaaS)

5.2.1. HPCaaS Overview

High Performance Computing (HPC) is used to process and analyze large and complex problems, including scientific, engineering and business problems, that require high computation capabilities, high bandwidth, and low-latency networking [3]. HPC meets these requirements by implementing large physical clusters. However, traditional HPC faces a set of challenges: peak demand, high capital cost, and the high level of expertise needed to acquire and operate HPC systems [51]. To deal with these issues, HPC experts have leveraged the benefits of new technology trends, including cloud technologies, parallel processing paradigms and large storage infrastructures. Merging HPC with these new technologies has produced a new HPC model, called HPC as a Service (HPCaaS).

HPCaaS is an emerging computing model in which end users have on-demand access to pre-existing technologies that provide a high performance and scalable HPC computing environment [52]. HPCaaS provides substantial benefits because of the better quality of service provided by cloud technologies, and the better parallel processing and storage provided by, for example, the Hadoop Distributed File System and the MapReduce paradigm. Some HPCaaS benefits are stated in [51] as follows:

High Scalability: resources scale up to fit user demand for processing large and complex datasets.
Low Cost: end users can eliminate the initial capital outlay, time and complexity needed to procure HPC.
Low Latency: achieved by implementing the placement group concept, which ensures that data is executed and processed in the same rack or on the same server.

5.2.2. HPCaaS Providers

There are many HPCaaS providers in the market. One example is Penguin Computing [53], which has been a leader in designing and implementing high performance environments for over a decade. Nowadays, it provides HPCaaS with different options: on-demand, HPCaaS as private services, and hybrid HPCaaS services. Amazon Web Services (AWS) [3] is also an active HPCaaS provider; it offers simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different pricing models: On-Demand, Reserved [54] or Spot Instances [55].
HPCaaS on AWS is currently used for computer-aided engineering, molecular modeling, genome analysis, and numerical modeling across many industries, including oil and gas, financial services and manufacturing [3]. Other leaders of HPCaaS in the market are Microsoft (Windows Azure HPC) [56] and Google (Google Compute Engine) [57].
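The MapReduce paradigm mentioned in Section 5.2.1 can be illustrated with a small single-process word count. This is only a sketch of the map, shuffle and reduce phases written in plain Python; it does not use Hadoop or any of its APIs, and the function names are hypothetical. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases sequentially over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each map call depends only on its own document and each reduce call only on its own key, both phases can be distributed, which is what makes the paradigm attractive for HPCaaS workloads.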

Chapter 5: Literature Review and Research Contribution

In order to bridge the gap between the present research and previous studies, a review was conducted of the current state of HPC and virtualization. This chapter situates the research in relation to previous publications and clearly states the research contribution.

5.1. Related Work

Several studies have evaluated the performance of high performance computing in the cloud. Most of these studies used Amazon EC2 [20] as the cloud environment [58-63]. Only a few studies have evaluated the performance of high performance computing using the combination of new emerging distributed paradigms and a cloud environment [64].

In [58], the authors evaluated HPC on three different cloud providers: Amazon EC2, GoGrid Cloud and IBM Cloud. On each cloud platform, they ran HPC on Linux virtual machines (VMs), and they concluded that the tested public clouds do not seem to be optimized for running HPC applications. This was explained by the fact that public cloud platforms have slow network connections between virtual machines. Furthermore, the authors in [13] evaluated the performance of HPC applications in today's cloud environments (Amazon EC2) to understand the tradeoffs in migrating to the cloud. Overall results indicated that running HPC on the EC2 cloud platform limits performance and causes significant variability. Besides Amazon EC2, the research in [63] evaluated the performance-cost tradeoffs of running HPC applications on three different platforms. The first and second platforms were two physical clusters (Taub and the Open Cirrus cluster), and the third platform was Eucalyptus. Running HPC on these platforms led the authors to conclude that the cloud is more cost-effective for applications with low communication intensity.

In order to understand the performance implications for HPC of using virtualized resources and distributed paradigms, the authors in [64] performed an extensive analysis using Eucalyptus (16 nodes) and other technologies such as Hadoop [7], Dryad and DryadLINQ [65], and MapReduce [6]. The conclusion of this research suggested that most parallel applications can be handled in a fairly easy manner when using cloud technologies (Hadoop, MapReduce, and Dryad); however, scientific applications, which require complex communication patterns, still require more efficient runtime support.

HPC has also been evaluated without relating it to new cloud technologies, using different virtualization technologies [66, 67, 68, 69]. In [66], the authors analyzed virtualization techniques including VMware, Xen, and OpenVZ. Their findings showed that none of the techniques matches the performance of the base system perfectly; yet OpenVZ demonstrates high performance in both file system tests and industry-standard benchmarks. In [67], the authors compared the performance of KVM and VMware. Overall findings showed that VMware performs better than KVM, although in a few cases KVM gave better results than VMware. In [68], the authors conducted a quantitative analysis of two leading open source hypervisors, Xen and KVM. Their study evaluated the performance isolation, overall performance and scalability of virtual machines for each virtualization technology. In short, their findings showed that KVM has substantial problems with guests crashing when the number of guests increases; however, KVM still has better performance isolation than Xen. Finally, in [69] the authors extensively compared four hypervisors: Hyper-V, KVM, VMware, and Xen. Their results demonstrated that there is no perfect hypervisor.

5.2. Contribution

So far, only a few studies have compared different virtualization techniques and their impact on HPC in the cloud. The only study we found was done in [70], where the authors compared the performance of Xen, KVM and VirtualBox. Each virtualization technology was compared with bare metal using a set of high performance benchmarking tools. The results of this research demonstrated that KVM is the best choice for HPC in the cloud because of its rich features and near-native performance.
The present research contributes to filling this gap in the literature by examining the impact of virtualization techniques on HPCaaS, using OpenStack as the cloud platform and Hadoop as the distributed and parallel system.

Part III: Technology Enablers

This part explains the use of OpenStack and Hadoop as the underlying technologies for this research. Its first chapter provides a qualitative study for selecting an appropriate cloud platform and distributed system; its second chapter introduces the OpenStack components in detail; and its third chapter presents Hadoop and its main aspects.

Chapter 6: Technology Enablers Selection

The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built after conducting a qualitative study of the tools available in the market. We targeted mainly open source tools in selecting an appropriate cloud computing platform and distributed system. Hence, this chapter presents the analysis we followed in selecting the cloud platform and the distributed system.

6.1.Cloud Platform Selection

To compare the available open source cloud platforms, we tried to choose the most popular ones. The selection of competing platforms was based on a study that compares the popularity of OpenStack, OpenNebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12, the study showed that OpenStack has the largest total population index, followed by Eucalyptus, CloudStack, and OpenNebula.

Figure 12: Active cloud community population [71]

Based on Figure 12, we chose to compare and study OpenStack, OpenNebula and Eucalyptus. To adopt one of these open source cloud platforms, we relied on other studies that compare their performance and quality [72-75]. In [72], the authors compared some open and commercial cloud platforms. Concerning open platforms, they compared OpenNebula and Eucalyptus. To perform the comparison, they adopted a set of criteria including storage, virtualization, network, management, security and vendor support. The results of the research showed that open-source and commercial solutions

can have comparable features, and that OpenNebula is the more feature-complete cloud platform when compared with Eucalyptus. [73] and [74] provide comparison studies of OpenStack and OpenNebula. In [73], the authors compared the performance of both cloud platforms by measuring the time between the moment the cloud starts instantiating VMs and the moment they are ready to accept SSH connections. The findings demonstrate that OpenStack is slightly better than OpenNebula due to its smaller instantiation time. Moreover, the results showed that OpenStack is more suitable for high performance computing because it instantiates a large number of VMs faster.

In [74], the authors used qualitative and quantitative analysis to compare OpenStack and OpenNebula. For the qualitative analysis, they adopted criteria including security, supported virtualization, access, image support, resource selection, storage support, high-availability support and API support. Based on the results of the qualitative study, the authors concluded that OpenStack is preferable when auto-scaling is needed, while OpenNebula is preferable when persistent storage support is needed. For the quantitative analysis, the authors measured the deployment time, network overhead and clean-up time of VMs. The results of the quantitative analysis showed that either platform can be used depending on user requirements and specifications.

In [75], the authors provided a comparative study of four solutions: Eucalyptus, OpenStack, OpenNebula and CloudStack. To perform the comparison, the authors adopted the following criteria: storage, network, security, hypervisor, scalability, installation and code openness. In short, the results of this study [75] showed that OpenStack is the preferred open source cloud platform.

Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go for OpenStack, as it is known for its flexibility and total openness.

Table 2: Cloud IaaS selection

6.2.Distributed and Parallel System Selection

To compare the distributed and parallel systems available in the market, we again relied on the popularity index of those systems. The selection of competing systems was based on the study in [76], summarized in Figure 13, which compares the popularity index of Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The study was done in 2012, and it reports the total downloads between January 2011 and March 2012. Figure 13 shows that Hadoop is the most popular distributed system, followed by MongoDB and Cassandra.

Figure 13: Active distributed systems population [76]

Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in order to select one system for the present research. MongoDB is a document-oriented database that stores data in a binary form of JSON called Binary JSON (BSON) rather than in tables with columns and rows. To provide high redundancy and make data highly available, MongoDB offers replication across multiple servers. While data is synchronized between servers using replication, MongoDB also supports scaling out through sharding, which partitions a collection and stores the different portions on different machines. MongoDB can be combined with MapReduce in order to process data in parallel at each shard [62]. On the other hand, Hadoop is an open source distributed system that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [7]. More details about Hadoop are included in Chapter 8.

The study in [77] compares the MongoDB and Hadoop systems. It reached three main conclusions: first, it is not appropriate to use MongoDB as an analytics platform; second, using Hadoop for MapReduce jobs is several times faster than using the built-in MongoDB MapReduce capability; and third, MongoDB is much slower than HDFS. In addition, the study in [78] compared the MapReduce performance of Hadoop and MongoDB. In short, it showed that MongoDB is roughly four times slower than Hadoop in fully-distributed mode.

Table 3 summarizes the selected distributed system in [77] and [78]. Based on this table, we decided to go for Hadoop as the analytical and storage tool for the present research.

Table 3: Parallel and distributed platform selection

Chapter 7: OpenStack

OpenStack is an open source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was developed by a wide range of developers and contributors using mainly Python (68%), XML (16%) and JavaScript (5%) [79]. This chapter provides a detailed description of OpenStack, including a brief history, its components, the corresponding architecture, and finally the supported hypervisors.

7.1.OpenStack Overview

The formal definition of OpenStack is given in [80], which defines OpenStack as a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface. From this definition, OpenStack is considered an Infrastructure as a Service (IaaS). An important feature of OpenStack is that it provides a web interface, called the dashboard, and APIs that make its services available via Amazon EC2 and S3 compatible APIs. This feature ensures that all existing tools that work with Amazon's cloud platform can also work with the OpenStack platform [81].

7.2.OpenStack History

OpenStack started as a collaboration project between Rackspace Hosting and NASA. Both organizations planned to release their internal cloud projects for object storage and compute. Rackspace contributed its Cloud Files platform to support the storage part of OpenStack, while NASA contributed its Nebula platform to support the compute part [82]. In July 2010, both organizations released the first version of OpenStack under the Apache 2.0 License. In September 2012, the OpenStack Foundation was established as an independent entity with the mission of protecting, empowering, and promoting OpenStack software. The OpenStack project is currently supported by more than 150 companies including AMD, Intel, Canonical, Red Hat, Cisco, Dell, HP, IBM and Yahoo! [83].

7.1.OpenStack Releases

Each OpenStack release brings new improvements and contributions. All OpenStack releases since 2010 are listed in Table 4 [79].

Table 4: OpenStack releases [79]

7.3.OpenStack Components

The core components of OpenStack software are OpenStack Compute Infrastructure (Nova), OpenStack Object Storage Infrastructure (Swift) and OpenStack Image Service Infrastructure (Glance). Besides these components, OpenStack includes the Identity Service (Keystone), Network Service (Quantum), Dashboard Service (Horizon) and Block Storage (Cinder). Table 5 summarizes the main components of OpenStack and the corresponding code names.

Table 5: OpenStack projects

Taking the previously mentioned OpenStack components into consideration, Figure 14 provides a conceptual architecture of OpenStack that shows how its components are interconnected [79].

Figure 14: OpenStack conceptual architecture [79]

7.3.1. OpenStack Compute (Nova)

Nova provides flexible management for virtual machines by allowing users to create, update, and terminate virtual machines. The overall architecture of Nova (Figure 15) is composed of the following sub-components: nova-api, nova-scheduler, nova-compute, nova-volume, queue and database [82].

Figure 15: Nova subcomponents

Nova-api is responsible for accepting and fulfilling API requests. A request consists of actions that will be performed by the Nova subcomponents. To accept API requests, nova-api provides an endpoint for all API queries and enforces some policies. If the request is about managing virtual machines, nova-compute is in charge of creating or terminating virtual machine instances. Normally, nova-compute receives requests from the queue sub-component. To manage virtual machine instances, nova-compute uses different drivers, such as the libvirt software package, the Xen API and the vSphere API, to support the various virtualization technologies. To decide where to send a request, nova-scheduler retrieves the request from the queue and determines which compute server host it should run on. When persistent storage is needed, nova-volume handles the creation, attachment, and detachment of persistent volumes to virtual machine instances [82].

Nova also provides network management through its subcomponent nova-network. The latter accepts networking tasks from the queue and then performs system commands to manipulate the network. Nova-network defines two types of IP addresses: fixed IPs and floating IPs. A fixed IP is a private IP that is assigned to an instance for its whole life cycle. A floating IP, on the other hand, is a public IP that is used for external connectivity. The network defined by nova-network can be classified into three categories: Flat, FlatDHCP and VLAN [82]. Flat assigns a fixed IP address to an instance and attaches that IP to a common bridge (created by the administrator). FlatDHCP builds upon the Flat manager by providing DHCP services to handle instance addressing and the creation of bridges. VLAN provides a subnet and a separate bridge for each project. The range of IPs of a given project is only accessible within its VLAN.

The last subcomponents of Nova are the queue and the database.
The queue is responsible for passing messages between the Nova sub-components to facilitate communication between them. It is implemented using RabbitMQ. The Nova database stores most of the configuration and run-time state of the cloud infrastructure; it contains a set of tables such as instance types, instances in use, networks available, fixed IPs, projects and virtual interfaces [82].

7.3.2. OpenStack Image Service (Glance)

Glance manages virtual disk images. It consists of three main sub-components: glance-api, glance-registry and glance-database (Figure 16). Glance-api accepts incoming API requests

and then communicates them to the other components (glance-registry and the image store). All information about images is stored in glance-database. The last component, glance-registry, is responsible for retrieving and storing metadata about images [82].

Figure 16: Glance subcomponents

7.3.3. OpenStack Identity Service (Keystone)

Keystone authorizes users' access to OpenStack components. It supports multiple forms of authentication, including standard username and password credentials and token-based systems. The Keystone architecture is represented by the following subcomponents (Figure 17): token backend, catalog backend, policy backend and identity backend [82].

Figure 17: Keystone subcomponents

7.3.4. OpenStack Object Store (Swift)

Swift is the oldest project within OpenStack, and it is the underlying technology that powers Rackspace's Cloud Files service [82]. Swift provides a massively scalable and redundant object store by writing multiple copies of each object to multiple, separate storage

servers so as to handle failures efficiently. The Swift component consists of the Proxy Server, Account Server, Container Server, and Object Server (Figure 18).

Figure 18: Swift subcomponents

Swift-proxy accepts incoming requests, which consist of uploading files, modifying metadata and creating containers. Requests are served by the account server, container server or object server. The object server handles requests about managing pre-existing objects or files in the storage; the account server manages the accounts defined with the object storage service; and the container server manages the mapping of containers (folders) within the object store service [82].

7.3.5. OpenStack Block Storage Service (Cinder)

Cinder allows block devices to be connected to virtual machine instances for better performance. It consists of the following sub-components: cinder-api, cinder-volume, cinder-database and cinder-scheduler (Figure 19). Cinder-api accepts incoming requests and directs them to cinder-volume, which reads from or writes to the cinder database to maintain state and interacts with other processes. Cinder-scheduler is responsible for selecting the optimal block storage node on which to create the volume. A message queue is used to maintain communication between the Cinder components.

Figure 19: Cinder subcomponents

7.3.6. OpenStack Network Service (Quantum)

Quantum allows users to create their own networks and then attach interfaces to them. It consists of quantum-server, quantum-account, quantum-plugin and quantum-database (Figure 20). Quantum-server accepts incoming API requests and then directs them to the correct quantum-plugin. Plugins and agents perform specific actions such as plugging and unplugging ports, creating networks and subnets, and IP addressing. Finally, quantum-database stores the networking state for particular plugins.

Figure 20: Quantum subcomponents

7.4.OpenStack Supported Hypervisors

The abstraction provided by OpenStack Compute makes it possible to support various existing hypervisors. Some of the supported hypervisors are KVM, LXC, QEMU, UML, VMware ESX/ESXi, Xen, PowerVM and Hyper-V [79]. However, KVM is still the most widely used hypervisor in OpenStack deployments. Besides KVM, some existing deployments run Xen, LXC, VMware and Hyper-V, but each of these hypervisors either lacks support for some features or is poorly documented with respect to its use with OpenStack.

Chapter 8: Hadoop

Hadoop has been adopted by big players in the market such as Google, Yahoo, LinkedIn, Facebook, New York Times, IBM, etc. [84]. This chapter provides a detailed overview of Hadoop, starting with a brief history of this open source project, followed by the corresponding architecture, implementation and some related features.

8.1.Hadoop Overview

Hadoop is an open source Apache project, written in Java, that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [85]. Hadoop has been designed to be a reliable, fault-tolerant and scalable project that can scale up from a single machine to thousands of machines.

8.2.Hadoop History

In 2002, Hadoop was created by Doug Cutting as an open source project for web crawling and indexing, and it was first named the Nutch project. Nutch was developed to handle searching issues, but it faced a scalability problem, as it wouldn't scale up to billions of web pages. To deal with this issue, the Nutch team took inspiration from Google's distributed filesystem (GFS). By adopting the GFS architecture in 2004, the Nutch team delivered an open source implementation called the Nutch Distributed Filesystem (NDFS) [86]. When Google published its paper about the MapReduce algorithm, the Nutch team took advantage of that work by introducing MapReduce to its NDFS system. Implementing both NDFS and MapReduce made Nutch a powerful system for web crawling and indexing. This success pushed the Nutch team to build an independent project in 2006, named the Hadoop project. By this time, Doug Cutting had joined Yahoo!, which provided enough resources to improve Hadoop's performance. Although Yahoo! developed and contributed 80% of the Hadoop project, Hadoop became its own top-level project at Apache in January 2008 [87]. Besides implementing MapReduce and HDFS, the Hadoop project includes other subprojects, which are listed in Table 6 [85].

Table 6: Apache Hadoop subprojects

The Hadoop subprojects are grouped together and named the Hadoop Ecosystem. The overall picture of the Hadoop Ecosystem is illustrated in Figure 21.

Figure 21: Apache Hadoop subprojects [85] (components shown: ETL Tools, BI Reporting, RDBMS, Pig (Data Flow), Hive (SQL), Sqoop, Zookeeper, HBase, MapReduce (Job Scheduling / Execution System), Avro, HDFS (Hadoop Distributed File System))

8.3.Hadoop Architecture

Hadoop implements a master/slave architecture, where the master is named NameNode and the slave is named DataNode. The NameNode manages the file system namespace, which consists of a hierarchy of files and directories used for data storage. When a file is created by a client application, it is divided into blocks; each block is replicated and stored in DataNodes. Information about the number of replicas (the number of copies of each block) and the mapping between replicas and blocks is stored in the NameNode. On the other hand, each DataNode is in charge of

managing the storage attached to the node on which it is running. Furthermore, each DataNode handles the read, write, block creation, deletion, and replication operations that come as instructions from the NameNode [86].

Besides the NameNode and DataNodes, a Hadoop cluster consists of a Secondary NameNode (a backup node for the NameNode), a JobTracker and TaskTrackers. The JobTracker is located in the master node, and it is responsible for distributing MapReduce tasks to the other nodes in the cluster. A TaskTracker, on the other hand, locally runs the tasks distributed by the JobTracker; each slave in the cluster contains one TaskTracker, which can also run on the master node [86]. The overall architecture of Hadoop is illustrated in Figure 22.

Figure 22: Hadoop Architecture

8.4.Hadoop Implementation

Hadoop is mainly implemented using HDFS and the MapReduce paradigm. HDFS is used to store large data sets, while MapReduce is used to analyze and process data across the Hadoop cluster. Taking into consideration the architecture provided in Figure 22, HDFS is represented by the NameNode, Secondary NameNode and DataNodes, while MapReduce is represented by the JobTracker and TaskTrackers (Figure 23).
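The block and replica bookkeeping described in Section 8.3 can be illustrated with a minimal Python sketch. It is a toy model and not the HDFS implementation: the function names, the 64 MB block size, the replication factor of 3 and the round-robin placement are all assumptions made for illustration, and real HDFS placement depends on the cluster's configuration and topology.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes (illustrative value)
REPLICATION = 3                # assumed replication factor (illustrative value)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes is divided into."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy NameNode metadata: map each block index to the DataNodes holding a replica.

    Round-robin placement is a simplification chosen for this sketch only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    nodes = [f"datanode{i}" for i in range(1, 9)]  # hypothetical node names
    blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
    for block, holders in place_replicas(len(blocks), nodes).items():
        print(block, blocks[block], holders)
```

In this model the dictionary returned by `place_replicas` plays the role of the metadata kept by the NameNode, while the DataNodes would hold the block contents themselves.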

Figure 23: HDFS and MapReduce representation

8.4.1. HDFS Overview

HDFS is designed as a hierarchy of files and directories. Each file is divided into blocks that are stored in different DataNodes. The NameNode stores only the metadata, which includes information about block locations and the number of copies of each block. Furthermore, HDFS allows the NameNode to perform namespace operations such as opening, closing and renaming files and directories. As stated before, HDFS performs data replication to ensure fault tolerance. The replication factor is set when a file is created, and it can be modified later [85].

The read, write and creation operations illustrate how HDFS works. During the read operation, the HDFS client requests from the NameNode the list of DataNodes that host replicas of the blocks of a given file. The list is sorted by network topology distance from the client. After deciding on the DataNode from which to fetch data, the HDFS client contacts that DataNode directly and requests the desired block. During the write operation, on the other hand, the HDFS client asks the NameNode to choose the DataNodes that will store replicas of the first block of the file, then of the second block, and so on. For each block, the client organizes a node-to-node pipeline and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. Concerning the creation operation, when there is a request to create a file, the HDFS client first caches the file data in a temporary local file. When the latter accumulates data up to the HDFS block size, the HDFS client

contacts the NameNode to insert the file name into the file system namespace and allocate a data block for it. After that, the NameNode selects the DataNodes that will host the data blocks. At this stage, the client moves the block of data from the local temporary file to the specified DataNode [85].

8.4.2. MapReduce Overview

Hadoop MapReduce is a programming paradigm that processes very large data sets in parallel on large clusters. It was first introduced by Google in 2004 [6]. The core idea of MapReduce is to split the input data set into chunks that are processed by map tasks in parallel. The output of each map task is sorted and then passed as input to the reduce task. Following this definition, MapReduce can be divided into two steps: the map step and the reduce step [88].

The map task is itself divided into five phases: read, map, collect, spill and merge. The read phase reads the data chunk from HDFS and creates the input key-value pairs. The map phase executes the user-defined map function to generate the map-output data. The collect phase collects the intermediate (map-output) data into a buffer before spilling. The spill phase sorts the data, compresses it if specified, and writes it to local disk to create file spills. The last step in the map task is the merge phase, which merges all file spills into a single map output file [88].

The reduce task is divided into four phases: shuffle, merge, reduce and write. The shuffle phase transfers the intermediate data (map output) from the mapper slaves to a reducer's node, decompressing it if needed. The merge phase merges the sorted outputs that come from different mappers, to be passed as input to the reduce phase. The reduce phase executes the user-defined reduce function to produce the final output data. Finally, the write phase compresses the final output, if needed, and writes it to HDFS [88].
A popular example that illustrates MapReduce execution is the Word Count example, which counts the number of occurrences of each individual word in a given file (Figure 24) [89].
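The word count flow can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle and reduce steps described above, not Hadoop code; the function names and the sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle/merge step: group the intermediate values by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle, then reduce each group."""
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    # Each string stands for one input split handled by a separate map task.
    print(word_count(["deer bear river", "car car river", "deer car bear"]))
```

On a real cluster the map calls run on different nodes and the grouping happens during the shuffle and merge phases; here everything runs sequentially so that the data flow is easy to follow.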

Figure 24: Word count MapReduce example [89]

8.5.Hadoop Cluster Connectivity

When a Hadoop cluster starts up, each DataNode performs a handshake with the NameNode. The purpose of this operation is to verify the namespace ID and the software version of the DataNode. The namespace ID is assigned to the filesystem instance when it is formatted, and it is stored in all nodes of the cluster. Nodes with a different namespace ID cannot join the cluster. If the namespace ID is the same, however, the handshake between the DataNode and the NameNode succeeds. At this point, each DataNode stores its unique storage ID, which is an internal identifier of the DataNode. The main purpose of this ID is to make the DataNode recognizable even if it is restarted with a different IP address or port [87].

During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode for ten minutes, it considers the DataNode dead. In this case, the NameNode schedules the creation of new replicas, on other DataNodes, of the blocks hosted by the dead DataNode. Heartbeats are not only used to ensure NameNode-DataNode connectivity; they also carry statistical information such as total storage capacity and the fraction of storage in use. Another benefit of heartbeats is that the NameNode uses them to send instructions to DataNodes. Those instructions include commands to replicate blocks to other DataNodes, remove local block

replicas, reregister and send an immediate block report, and shut down the node. These commands are important for maintaining the overall system integrity, and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations [87].
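The heartbeat-based liveness check described in Section 8.5 can be sketched as follows. This is a simplified Python illustration under stated assumptions, not NameNode code: the class and method names are hypothetical, and only the two timing values (a three-second interval and a ten-minute timeout) come from the text.

```python
HEARTBEAT_INTERVAL = 3       # seconds between heartbeats (default stated in the text)
DEAD_NODE_TIMEOUT = 10 * 60  # seconds without a heartbeat before a node is considered dead


class HeartbeatMonitor:
    """Toy model of how a NameNode could track DataNode liveness."""

    def __init__(self, timeout=DEAD_NODE_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id, now):
        """Remember when `node_id` last reported in (`now` is in seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is at least `timeout` seconds old."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen >= self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.record_heartbeat("datanode1", now=0)
    monitor.record_heartbeat("datanode2", now=0)
    # datanode1 keeps sending heartbeats; datanode2 goes silent.
    for t in range(HEARTBEAT_INTERVAL, DEAD_NODE_TIMEOUT + 1, HEARTBEAT_INTERVAL):
        monitor.record_heartbeat("datanode1", now=t)
    print(monitor.dead_nodes(now=DEAD_NODE_TIMEOUT))
```

Passing the current time explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test; the blocks of any node returned by `dead_nodes` would then be re-replicated elsewhere.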

Part IV: Research Contribution

To clarify the steps we followed in this study, we divided this part into four chapters: 9, 10, 11 and 12. Chapter 9 defines the research methodology; Chapter 10 describes the experimental setup that we used to measure the performance of HPCaaS; Chapter 11 presents the results we obtained from each experiment; and finally, Chapter 12 discusses and analyzes the research findings.

Chapter 9: Research Methodology

The choice of research methodology depends mainly on the nature of the research question. This chapter discusses the methodology that was followed in conducting the present study. It first explains the choice of the selected methodology, and then gives an overall picture of the research steps.

9.1.Research Approach

The present research was based on a combination of qualitative and quantitative approaches. The qualitative approach was followed to compare and select appropriate technology enablers for this research (Part III, Chapter 6), whereas the quantitative approach was adopted to provide numeric measurements of HPC on the physical cluster and the virtualized clusters (Part IV, Chapters 10, 11 and 12).

9.2.Research Steps

Figure 25 summarizes the steps followed in conducting the present research.

Figure 25: Research steps

Chapter 10: Experimental Setup

In order to investigate the research question, we conducted three main experiments. The first experiment evaluates the performance of HPC on a Hadoop Physical Cluster (HPhC); the second evaluates the performance of HPC using a Hadoop Virtualized Cluster (HVC) with KVM; and the last evaluates HPC using a Hadoop virtualized cluster with VMware ESXi virtualization technology. This chapter describes the experimental setup used in this research: it provides an overall picture of the three adopted clusters; it gives the hardware, software and network specifications; it introduces the benchmarks used to evaluate the performance of HPC on each cluster; it lists the dataset sizes used in each experiment; and finally, it explains the experimental execution of the present research.

10.1.Experimental Hardware

In our performance study, we built three different clusters: a Hadoop Physical Cluster, a Hadoop Virtualized Cluster using KVM and a Hadoop Virtualized Cluster using VMware ESXi. Each cluster is composed of eight machines. For the physical cluster, we used eight Dell OptiPlex 755 desktop computers with the specifications listed in Table 7. For both Hadoop virtualized clusters (KVM and VMware ESXi), we used a Dell PowerEdge server with the features listed in Table 8. On top of the server, we installed OpenStack to create eight virtual machines, first using the KVM hypervisor and then the VMware ESXi hypervisor. Because of the limited flexibility of OpenStack, we could only create VMs with the features described in Table 9.

Table 7: Dell OptiPlex 755 computer features (used for Hadoop physical cluster)

Table 8: Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster

Table 9: OpenStack virtual machines features

10.2.Experimental Software and Network

As stated in Chapter 6, we opted for Hadoop to process and store small and large datasets; we chose to install Hadoop version 1.2.1. Concerning OpenStack, the version that was adopted is the Folsom release, which supports KVM, Xen, VMware and other hypervisors. The network provided a bandwidth of 100 Mbps per port.

10.3.Clusters Architecture

In this section, we describe each individual cluster in terms of its layers and components.

10.3.1. Hadoop Physical Cluster

Figures 26 and 27 show an overall picture of the Hadoop Physical Cluster. The configuration was done in the Linux Lab at AUI. The lab is connected to a 1 Gbps switch (providing 100 Mbps per port) that is also connected to other offices in the building where the lab is located. As

both figures depict, the cluster contains eight machines, one of which was selected to be the master and a slave node at the same time. The reason for having the master node serve as both master and slave is to increase the cluster performance when processing and storing datasets.

Figure 26: Hadoop Physical Cluster

Figure 27: Hadoop Physical Cluster architecture

10.3.2. Hadoop Virtualized Cluster KVM

The second cluster we built in this research is the Hadoop Virtualized Cluster with KVM technology. As Figure 28 shows, the first step in configuring the cluster is to install an operating system on the Dell PowerEdge server; the OS that was selected is Ubuntu Precise 12.04

LTS (64-bit). The next step is to install and configure the KVM packages, which are loaded into the Linux OS as a KVM driver. After preparing the system with the OS and the KVM hypervisor, the next step is to install OpenStack on top of the OS (OpenStack with KVM documentation is provided in Appendix A). Finally, the last step is to configure Hadoop on top of each OpenStack VM instance (Hadoop documentation is provided in Appendix C).

Figure 28: Hadoop virtualized cluster - KVM

The first OpenStack component that needs to be installed is Keystone, which manages authentication to OpenStack resources. After downloading and installing the keystone package, the next step is to create tenants (OpenStack projects) and OpenStack users that are associated with one or more tenants. Each user can be a member or an admin in a given project; roles therefore need to be created in order to set the rights and privileges of each user. After creating users, tenants and roles, the next step is to create the OpenStack services (the nova, keystone, and glance services), which provide one or more endpoints (URLs) through which users can access OpenStack resources.

The second component to install is OpenStack Glance, which allows creating and managing images in different formats (Ubuntu, Fedora, Windows, etc.). The Glance packages include glance-api, which accepts incoming API requests; glance-database, which stores all information about images; and glance-registry, which is responsible for retrieving and storing metadata about images.

The third component to deploy in OpenStack is the Nova package, which includes nova-compute, nova-scheduler, nova-network, nova-objectstore, nova-api, rabbitmq-server, novnc and nova-consoleauth. All these components collaborate and communicate with each other to create and manage instances, networks and, if needed, volumes. Finally, to access instances, a user-friendly interface can be

installed by configuring the OpenStack dashboard (Horizon). After logging in to the OpenStack dashboard, the user can launch instances and specify the number of CPUs, the disk space, the total RAM per VM, etc.

After creating the VM instances (with the requirements listed in Table 9), we installed Hadoop 1.2.1 on each VM. Hadoop configuration starts with identifying the master node and the slave nodes. For the master node, six files need to be configured: the core-site, hadoop-env, hdfs-site, mapred-site, masters and slaves files. For the slave nodes, the only files that need to be configured are the hadoop-env, core-site, hdfs-site and mapred-site files. Once the nodes are connected, the cluster needs to be formatted so as to clean the file namespace. After formatting, the cluster can be started to run jobs.

10.3.3. Hadoop Virtualized Cluster VMware ESXi

The third cluster that was built in this research is the Hadoop Virtualized Cluster using VMware ESXi technology (Figure 29). The first step in configuring this cluster is to install VMware ESXi on top of the Dell PowerEdge server. Then, OpenStack is configured on top of the hypervisor (OpenStack with VMware ESXi documentation is provided in Appendix B). After configuring OpenStack, instances can be created to build the Hadoop cluster.

Figure 29: Hadoop virtualized cluster VMware ESXi (a)

In fact, when installing OpenStack with VMware ESXi, OpenStack is installed as a VM on top of the VMware ESXi hypervisor. Then, through the OpenStack dashboard, instances can be created as VMs on top of the VMware ESXi hypervisor (Figure 30).
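As an illustration of the Hadoop configuration step described in Section 10.3.2, the following Python sketch renders the contents of the three site files. It is only a sketch under stated assumptions and not the configuration used in this research: the host name `master`, the port numbers and the replication value are hypothetical, and the property names are the Hadoop 1.x names as we understand them.

```python
import xml.etree.ElementTree as ET

MASTER = "master"  # hypothetical host name of the master node

# Hypothetical values; only the overall <configuration>/<property> layout matters here.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": f"hdfs://{MASTER}:54310"},
    "hdfs-site.xml": {"dfs.replication": 3},
    "mapred-site.xml": {"mapred.job.tracker": f"{MASTER}:54311"},
}


def render_site_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, properties in SITE_FILES.items():
        print(f"--- {filename}")
        print(render_site_xml(properties))
```

The same rendered files would be placed on every node, which matches the description above; the masters and slaves files are plain lists of host names and are therefore not generated here.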

Figure 30: Hadoop virtualized cluster - VMware ESXi (b)

10.4. Experimental Performance Benchmarks

To evaluate the impact of machine virtualization on HPCaaS, we adopted two well-known benchmarks: TeraSort and TestDFSIO [90]. TeraSort measures the average time to sort datasets of a given size, while TestDFSIO measures the execution time to write and read datasets. Table 10 summarizes the performance metrics used in evaluating HPCaaS.

Table 10: Experimental performance metrics

10.4.1. TeraSort Description

TeraSort was developed by Owen O'Malley and Arun Murthy at Yahoo Inc. [90]. It won the annual general-purpose terabyte sort benchmark in 2008 and 2009. It performs considerable computation, networking and storage I/O, and is often considered representative of real Hadoop workloads [90]. The benchmark is divided into three main steps: TeraGen, TeraSort and TeraValidate.

TeraGen generates the random data that will be sorted by TeraSort. It writes the generated data as a file of n rows, where each row is 100 bytes. Each row is formatted as follows: a 10-byte key, a 10-byte rowid and a 78-byte filler, followed by a two-byte line terminator. The keys are random characters from the set ' ' .. '~' (the printable ASCII characters from space to tilde), the rowid is an integer that identifies the row, and the filler consists of 7 runs of 10 characters from A to Z.

Once the data is generated, TeraSort sorts it using the quicksort algorithm. The sort is integrated with map/reduce tasks and uses a sorted list of n-1 sampled keys that define the key range for each reduce [9].

Finally, TeraValidate ensures that the output data of TeraSort is sorted. It creates one map task per file in TeraSort's output directory; each map task checks that each key is greater than or equal to the previous one. Furthermore, each map task generates records with the first and last keys of its file; the reduce task then checks that the first key of file i is greater than the last key of file i-1. If any keys are out of order, TeraValidate reports them in the output of the reduce task [90]. (The TeraSort benchmark is documented in Appendix D.)

10.4.2. TestDFSIO Description

The TestDFSIO benchmark is used to check the I/O rate of a Hadoop cluster with write and read operations. It is helpful for testing HDFS, checking network performance, and testing the hardware, OS and Hadoop setup [90]. TestDFSIO is written in Java, and its source code can be found in [91]. TestDFSIO is composed of TestDFSIO-Write and TestDFSIO-Read. Both operations are performed by specifying the number of files and the size of each file in megabytes [90]. (The TestDFSIO benchmark is documented in Appendix D.)

10.5. Experimental Datasets Size

In each experiment, we measured the performance of the Hadoop cluster using different dataset sizes. For TeraSort, we used 100 MB, 1 GB, 10 GB and 30 GB datasets, and for TestDFSIO, we used 100 MB, 1 GB, 10 GB and 100 GB datasets. Table 11 summarizes the dataset sizes used in this research.

Table 11: Dataset sizes used for Hadoop benchmarks
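The three TeraSort steps described in Section 10.4.1 can be sketched on a single machine as follows. This is an illustration under stated assumptions, not Hadoop's implementation: the function names are hypothetical, the filler is a simplified stand-in for TeraGen's filler, the sampling strategy is a guess at the general idea of "n-1 sampled keys", and Python's built-in `sorted` (Timsort) replaces quicksort.

```python
import random
from bisect import bisect_right

# Keys are drawn from the printable ASCII range ' ' .. '~'.
KEY_CHARS = [chr(c) for c in range(ord(" "), ord("~") + 1)]
KEY_LEN = 10


def make_row(row_id, rng):
    """Build one 100-byte row: 10-byte key, 10-byte rowid, 78-byte filler, CR LF."""
    key = "".join(rng.choice(KEY_CHARS) for _ in range(KEY_LEN))
    rowid = str(row_id).rjust(10)
    # Simplified filler: runs of 10 identical letters A..Z, cut to 78 characters.
    filler = "".join(chr(ord("A") + (row_id + i) % 26) * 10 for i in range(8))[:78]
    return key + rowid + filler + "\r\n"


def sample_split_keys(rows, num_reduces, rng, sample_size=100):
    """Pick num_reduces - 1 sorted split keys from a random sample of the rows."""
    picked = rng.sample(rows, min(sample_size, len(rows)))
    sample = sorted(row[:KEY_LEN] for row in picked)
    step = len(sample) / num_reduces
    return [sample[int(step * i)] for i in range(1, num_reduces)]


def terasort(rows, num_reduces, rng):
    """Range-partition rows by split keys, then sort each partition ("reduce")."""
    split_keys = sample_split_keys(rows, num_reduces, rng)
    buckets = [[] for _ in range(num_reduces)]
    for row in rows:
        buckets[bisect_right(split_keys, row[:KEY_LEN])].append(row)
    return [sorted(bucket, key=lambda r: r[:KEY_LEN]) for bucket in buckets]


def teravalidate(output_files):
    """Return a list of ordering errors; an empty list means globally sorted."""
    errors = []
    previous_last = None
    for i, rows in enumerate(output_files):
        keys = [r[:KEY_LEN] for r in rows]
        for earlier, later in zip(keys, keys[1:]):
            if later < earlier:
                errors.append(f"file {i}: key out of order")
        if keys:
            if previous_last is not None and keys[0] < previous_last:
                errors.append(f"file {i}: first key below last key of file {i - 1}")
            previous_last = keys[-1]
    return errors


rng = random.Random(0)
rows = [make_row(i, rng) for i in range(1000)]
output_files = terasort(rows, num_reduces=4, rng=rng)
print(teravalidate(output_files))
```

Because every key in partition i lies below the split key that starts partition i+1, concatenating the sorted partitions yields a globally sorted result, which is the property TeraValidate checks. The sketch tolerates equal keys at file boundaries, which is slightly more lenient than the strict comparison described above.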

10.6. Experiment Execution

We conducted each experiment by scaling the cluster from three machines up to eight machines. In other words, we ran each benchmark on three machines, then four machines, and so on until we reached eight machines. Furthermore, for each individual benchmark, we performed three tests on 100 MB, 1 GB, 10 GB and 30 GB (TeraSort) and on 100 MB, 1 GB, 10 GB and 100 GB (TestDFSIO), and then calculated the mean to reduce the effect of outliers and to provide more accurate results. Figure 31 outlines the steps of running experiment 1 on HPhC using the TeraSort benchmark.

Figure 31: Experimental execution
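The execution plan above can be summarized in a short sketch: every benchmark is run on every dataset size and every cluster size, three times each, and the mean is recorded. The function `run_benchmark` is a hypothetical placeholder for whatever actually launches a Hadoop job and returns its duration in seconds; it is not part of this research's tooling.

```python
from itertools import product
from statistics import mean

NODE_COUNTS = range(3, 9)  # clusters of 3 to 8 machines
DATASETS = {
    "TeraSort": ["100MB", "1GB", "10GB", "30GB"],
    "TestDFSIO": ["100MB", "1GB", "10GB", "100GB"],
}
REPETITIONS = 3  # three tests per scenario


def run_experiments(run_benchmark):
    """Return {(benchmark, size, nodes): mean duration in seconds}."""
    results = {}
    for benchmark, sizes in DATASETS.items():
        for nodes, size in product(NODE_COUNTS, sizes):
            times = [run_benchmark(benchmark, size, nodes) for _ in range(REPETITIONS)]
            results[(benchmark, size, nodes)] = mean(times)
    return results


# Stand-in runner that always reports 1.0 s, only to show the shape of the output.
demo_results = run_experiments(lambda benchmark, size, nodes: 1.0)
print(len(demo_results))
```

The plan covers 2 benchmarks, 4 dataset sizes and 6 cluster sizes, so 48 scenarios and 144 individual runs per cluster type.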

Chapter 11: Experimental Results

This chapter presents the findings obtained from running each experiment. It presents the results of running HPC on HPhC, then on HVC with KVM, and then on HVC with VMware ESXi. The last section compares the results of the three experiments. (The results obtained from running the experiments are listed in Appendices E and F.)

11.1. Hadoop Physical Cluster Results

11.1.1. TeraSort Performance on HPhC

Running the TeraSort benchmark showed that sorting large datasets such as 10 GB and 30 GB takes a long time. However, scaling the cluster to more nodes led to a significant reduction in sorting time. The results of running this benchmark on the Hadoop Physical Cluster are listed in Table 12 and illustrated in Figure 32.

Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different numbers of nodes - Hadoop Physical Cluster

Figure 32: TeraSort performance on Hadoop Physical Cluster

Figures 33 and 34 clearly illustrate the benefit of scaling the cluster. For instance, sorting 100 MB with 3 nodes takes around 21.33 seconds, while with 8 nodes it takes 19.97 seconds (a reduction of 6%). In the case of 1 GB, the average time was reduced by 4% when scaling from 3 to 8 nodes.

Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster

Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster

For 10 GB, the results were somewhat different (Figure 35). The time to sort 10 GB was reduced by 18.55% when scaling from 3 to 6 machines. However, increasing the cluster to 8 nodes significantly degraded sorting performance. This can be explained by a network bottleneck, to which Hadoop is highly sensitive. In contrast, using 8 nodes paid off when sorting large datasets such as 30 GB (Figure 36). In this case, the average time to sort the dataset was reduced by 24.77% (a difference of 42 minutes) when increasing the number of nodes from 3 to 8.

Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster

Figure 36: TeraSort performance for 30 GB on Hadoop Physical Cluster
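The percentage reductions quoted in this chapter can be reproduced with a one-line formula: the time saved divided by the original time. The sketch below applies it to the 100 MB TeraSort figures reported above (21.33 seconds on 3 nodes, 19.97 seconds on 8 nodes); the function name is ours.

```python
def percent_reduction(before_seconds, after_seconds):
    """Relative reduction in execution time, in percent of the original time."""
    return (before_seconds - after_seconds) / before_seconds * 100.0


# TeraSort, 100 MB, physical cluster: 3 nodes versus 8 nodes.
reduction_100mb = percent_reduction(21.33, 19.97)
print(round(reduction_100mb, 1))
```

The result is roughly 6.4%, which the text rounds to 6%.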

11.1.2. TestDFSIO-Write Performance on HPhC

Running TestDFSIO-Write on the Hadoop physical cluster follows one general pattern: as the number of nodes increases, the average time to write the different dataset sizes decreases. Table 13 and Figure 37 list and illustrate the results of running TestDFSIO-Write on HPhC.

Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different numbers of nodes - Hadoop Physical Cluster

Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster

Looking at TestDFSIO-Write for the 100 MB dataset (Figure 38), the average time decreased as the number of slaves increased. In this case, scaling the cluster from 3 machines (including the master) to 8 machines reduced the overall average writing time by 11.25%. The same observation applies to TestDFSIO-Write for the 1 GB dataset (Figure 39), where the average time was reduced by 46.5% when scaling from 3 to 8 slaves.

Figure 38: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster

Figure 39: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster

Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster

Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster

When writing 100 GB (Figure 41), we observe a sharp reduction in TestDFSIO-Write time when scaling from 3 to 8 slaves; this reduction amounted to 12.53%. However, the average time unexpectedly increased when scaling from 4 to 5 machines. Again, this unexpected result can be explained by the overall network performance.

11.1.3. TestDFSIO-Read Performance on HPhC

Running TestDFSIO-Read also led to a significant performance improvement when the physical cluster was scaled up to 8 machines (Table 14 and Figure 42). In general, this observation applies to all dataset sizes.

Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different numbers of nodes - Hadoop Physical Cluster

Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster

When the cluster was scaled from 3 to 7 nodes, the average time was reduced by 4.36% when reading 100 MB (Figure 43) and by 2.46% when reading 1 GB (Figure 44). However, when scaling the cluster from 7 to 8 machines, the average time increased suddenly for both 100 MB and 1 GB. The same observation was made when reading 10 GB and 100 GB (Figures 45 and 46).

Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster

Figure 44: TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster

Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster

Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster

11.2. Hadoop Virtualized Cluster - KVM Results

11.2.1. TeraSort Performance on HVC-KVM

Running TeraSort on the Hadoop KVM Cluster showed an important improvement in sorting the various dataset sizes. However, this improvement holds only when scaling the KVM cluster from 3 to 5 VMs. The results of running this benchmark on the Hadoop KVM Cluster are listed in Table 15 and illustrated in Figure 47.

Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different numbers of nodes - Hadoop KVM Cluster

Figure 47: TeraSort performance on Hadoop KVM Cluster

As Figure 48 shows, sorting 100 MB on 3 VMs takes around 15 seconds, and the sorting time decreases by 2.2% and 5.5% when sorting the dataset on 4 and 5 VMs respectively.

Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster

Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster

When sorting 1 GB, 10 GB and 30 GB (Figures 49, 50 and 51), the performance improved slightly as the number of VMs increased. For example, the sorting time for 10 GB decreased by 0.3%, and the sorting time for 30 GB decreased by 5%, when scaling from 3 to 4 nodes. However, when the cluster was scaled to 5, 6, 7 and 8 nodes, the overall performance of sorting 1 GB, 10 GB and 30 GB degraded sharply.

Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster

Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster

11.2.2. TestDFSIO-Write Performance on HVC-KVM

TestDFSIO-Write performance on Hadoop KVM improved slightly as the number of VMs increased. The results of running TestDFSIO-Write are listed in Table 16 and illustrated in Figure 52.

Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different numbers of nodes - Hadoop KVM Cluster

Figure 52: TestDFSIO-Write performance on Hadoop KVM Cluster

For all dataset sizes (Figures 53, 54, 55 and 56), as stated before, the overall performance improved slightly as the number of VMs increased from 3 to 5. For instance, writing 10 GB improved by 1.6% when scaling from 3 to 5 VMs. Furthermore, when trying to write 100 GB, the system crashed because of the overall system overhead (Figure 56).

Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster

Figure 54: TestDFSIO-Write performance for 1 GB on Hadoop KVM Cluster

Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster

Figure 56: TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster

11.2.3. TestDFSIO-Read Performance on HVC-KVM

TestDFSIO-Read behaves in the same way as TestDFSIO-Write: the performance of reading the different dataset sizes improves as the number of VMs increases from 3 to 5. The results of running TestDFSIO-Read are shown in Table 17 and Figure 57.

Table 17: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different numbers of nodes - Hadoop KVM Cluster

Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster

As Figures 58, 59, 60 and 61 depict, the overall performance of reading the different dataset sizes improves as the number of VMs increases from 3 to 5. For example, the average time for reading 100 GB decreased slightly, by 3%, when scaling from 3 to 5 VMs.

Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster

Figure 59: TestDFSIO-Read performance for 1 GB on Hadoop KVM Cluster

Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop KVM Cluster

Figure 61: TestDFSIO-Read performance for 100 GB on Hadoop KVM Cluster

11.3. Hadoop Virtualized Cluster - VMware ESXi Results

11.3.1. TeraSort Performance on HVC-VMware ESXi

Table 18 and Figure 62 present the performance of running TeraSort on the Hadoop VMware ESXi Cluster; overall, they show a significant improvement in sorting the various dataset sizes. In contrast to the KVM cluster, the VMware ESXi cluster keeps decreasing the average sorting time as the number of VMs increases from 3 to 6 (for large datasets).

Table 18: Average time (in seconds) of running TeraSort on different dataset sizes and different numbers of nodes - Hadoop VMware ESXi Cluster

Figure 62: TeraSort performance on Hadoop VMware ESXi Cluster

As Figure 63 depicts, the average time to sort 1 GB decreased by 23% when scaling the cluster from 3 to 6 VMs. However, the performance starts degrading as the number of VMs increases from 6 to 7 and 8.

Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster

Figure 64: TeraSort performance for 1 GB on Hadoop VMware ESXi Cluster

Significantly higher performance was observed when sorting 30 GB (Figure 66). Relative to 3 VMs, the performance increased by 34% with 6 VMs, by 25% with 7 VMs and by 3% with 8 VMs.

Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster

Figure 66: TeraSort performance for 30 GB on Hadoop VMware ESXi Cluster

11.3.2. TestDFSIO-Write Performance on HVC-VMware ESXi

TestDFSIO-Write performance on Hadoop VMware ESXi improved as the number of VMs increased to 7. The results of running TestDFSIO-Write are listed in Table 19 and illustrated in Figure 67.

Table 19: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different numbers of nodes - Hadoop VMware ESXi Cluster

Figure 67: TestDFSIO-Write performance on Hadoop VMware ESXi Cluster

For all dataset sizes (Figures 68, 69, 70 and 71), the overall performance improved as the number of VMs increased from 3 to 7. For instance, writing 100 MB improved by 37% when scaling from 3 to 7 VMs. Furthermore, when writing a large dataset such as 10 GB, the overall performance increased by 12% when scaling from 3 to 7 VMs. However, in the case of 100 GB, the performance started degrading when scaling from 6 to 7 and 8 VMs.

Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster

Figure 69: TestDFSIO-Write performance for 1 GB on Hadoop VMware ESXi Cluster

Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster

Figure 71: TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster

11.3.3. TestDFSIO-Read Performance on HVC-VMware ESXi

TestDFSIO-Read behaves like TestDFSIO-Write: the performance of reading the different dataset sizes improves as the number of VMs increases from 3 to 7. However, the average time for reading the datasets was less than half of the time needed for the corresponding write operations. The results of running TestDFSIO-Read on VMware ESXi are listed in Table 20 and illustrated in Figure 72.

Table 20: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different numbers of nodes - Hadoop VMware ESXi Cluster

Figure 72: TestDFSIO-Read performance on Hadoop VMware ESXi Cluster

Figures 73, 74, 75 and 76 show the performance of running TestDFSIO-Read on each individual dataset. For most dataset sizes, the performance improved as the number of VMs increased up to 7. For instance, the performance of reading 100 GB improved by 36% when scaling from 3 to 7 VMs. However, reading 1 GB behaved differently, as the corresponding performance started to decline at 6 VMs.

Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster

Figure 74: TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster

Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster

Figure 76: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster

11.4. Results Comparison

11.4.1. TeraSort Performance

The overall performance of the three clusters varies depending on the dataset size and the number of nodes in each cluster. However, the Hadoop VMware ESXi cluster performed much better than the other clusters when running the TeraSort benchmark on large datasets.

Starting with 100 MB (Figure 77), TeraSort performed well when virtualized with VMware ESXi and KVM. With 3 nodes, the VMware ESXi cluster was 25% faster and the KVM cluster 30% faster than the physical cluster. Significant performance was also achieved when scaling the cluster to 4, 5 and 6 nodes; in these cases, both KVM and VMware ESXi were faster than the physical cluster. After increasing the number of nodes to 7 and 8, VMware ESXi performance decreases by 33% and becomes slower than the physical cluster by 18% (when scaling from 3 to 8 nodes). On the other hand, the average time of sorting the 100 MB dataset on the KVM cluster declined as the number of nodes increased to 7 and 8, so the sorting time improved from 15 to 14 seconds. Further, the virtualized KVM cluster performed better than the physical cluster by 29.5% and 27% for 7 and 8 nodes respectively.

Figure 77: Average time for sorting 100 MB on HPhC, HVC with KVM and VMware ESXi

When the dataset size increases, the performance changes in each scenario (dataset size and number of nodes). In the case of 1 GB (Figure 78), the virtualized clusters kept the best performance when compared with the physical cluster. When the cluster was composed of 3-5 nodes, the virtualized clusters sorted the 1 GB dataset in 87-90 seconds, while the physical cluster sorted the same dataset in 182-187 seconds. When increasing the number of nodes from 5 to 8, VMware ESXi was faster than the other clusters; the KVM cluster, however, declined in performance, both in comparison with the KVM cluster of 3-4 nodes and in comparison with the physical cluster. For instance, in the case of 8 machines, the physical cluster was faster than the KVM cluster by 89%.

Figure 78: Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
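The 1 GB ranges quoted above (87-90 seconds for the virtualized clusters, 182-187 seconds for the physical cluster, on 3-5 nodes) can also be read as a speedup factor. The sketch below bounds that factor by pairing the extreme values of the two ranges; this pairing is an assumption, since the text does not say which times belong to which cluster size.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Most conservative pairing: fastest physical run against slowest virtualized run.
lowest_speedup = speedup(182, 90)
# Most favourable pairing: slowest physical run against fastest virtualized run.
highest_speedup = speedup(187, 87)

print(round(lowest_speedup, 2), round(highest_speedup, 2))
```

Under this assumption, the virtualized clusters sort 1 GB roughly twice as fast as the physical cluster on 3-5 nodes.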

The observation made for 1 GB also applies when sorting the 10 GB dataset (Figure 79). In this case, however, the performance of the virtualized clusters was much higher than that of the physical cluster. For instance, in the case of 5 VMs, the VMware ESXi cluster was faster than the physical cluster by 60%, and KVM was faster than the physical cluster by 51%.

Figure 79: Average time for sorting 10 GB on HPhC, HVC with KVM and VMware ESXi

When moving to larger datasets, the VMware ESXi cluster showed significant performance in sorting the 30 GB dataset (Figure 80). For instance, in the case of 4 nodes, VMware ESXi was faster than the KVM cluster by 28% and faster than the physical cluster by 61%. Moreover, KVM performed better than the physical cluster when the cluster was composed of 3, 4, 5 and 6 nodes. When the cluster size was increased to 7 and 8 nodes, the KVM cluster's performance decreased and it became slower than the physical cluster.

Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and VMware ESXi

The last observation concerns VMware ESXi performance on the 8-node cluster. For all datasets, we observed that VMware ESXi performance degraded; for example, for 10 GB, the performance decreased by 51%. Even so, VMware ESXi kept performing better than the other clusters.

11.4.2. TestDFSIO-Write Performance

The results obtained from TestDFSIO differed from those of the TeraSort benchmark. Overall, Figures 81 and 82 show that virtualization still performs better than the physical cluster. In the case of a 3-5 node cluster, the KVM cluster performs much better than the VMware ESXi and physical clusters. For instance, when writing 100 MB using 5 nodes, the KVM cluster was 11% faster than the physical cluster and 24% faster than the VMware ESXi cluster (Figure 81). However, we observed that the physical cluster performed better than VMware ESXi, and the difference was quantified by 48% seconds (100 MB using 5 nodes). When scaling the cluster from 5 to 8 nodes, the KVM cluster suffered a sharp performance degradation. Again, this is due to system overhead. In this case, the physical cluster showed better results than the virtualized clusters.

Figure 81: Average time for writing 100 MB on HPhC, HVC with KVM and VMware ESXi

Figure 82: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi

The same observation applies when writing 100 GB (Figure 84). The only difference is that the KVM cluster with 8 nodes was unable to write the 100 GB dataset.

Figure 83: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi

Figure 84: Average time for writing 100 GB on HPhC, HVC with KVM and VMware ESXi

11.4.3. TestDFSIO-Read Performance

As illustrated in Figures 85 and 86, reading small datasets (100 MB and 1 GB) showed that the virtualized cluster is faster than the physical cluster. However, this applies to the KVM cluster only when it is composed of 3-5 nodes. When the KVM cluster was scaled to 6, 7 and 8 nodes, the performance of reading all datasets degraded. On the other hand, the physical cluster performed better than VMware ESXi in all cases (100 MB and 1 GB on different numbers of nodes).

Figure 85: Average time for reading 100 MB on HPhC, HVC with KVM and HVC VMware ESXi

Figure 86: Average time for reading 1 GB on HPhC, HVC with KVM and HVC VMware ESXi

When increasing the dataset size to 10 GB and 100 GB (Figures 87 and 88), we can see different performance trends. When the cluster is composed of 3-5 nodes, the KVM cluster kept better performance than the other clusters. For instance, for 100 GB and 3 nodes, the KVM cluster was faster than VMware ESXi by 12% and faster than the physical cluster by 44%. However, as in the other benchmarks (TeraSort and TestDFSIO-Write), the KVM cluster showed a sharp degradation in reading 100 GB when the cluster was scaled to 6, 7 and 8 nodes. When reading 10 GB and 100 GB, in contrast to the TestDFSIO-Write results, the VMware ESXi cluster was faster than the physical cluster in all scenarios (numbers of nodes). For instance, using a cluster of 3 nodes, VMware ESXi was faster than the physical cluster by 36% and 55.5% in the case of 7 and 8 nodes respectively. An important observation is that the KVM cluster with 8 VMs was able neither to write nor to read the 100 GB dataset (Figure 88).

Figure 87: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi

Figure 88: Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi

Chapter 12: Discussion

The results obtained in this research showed significant improvements when virtualizing HPC, especially when it was tested with the TeraSort benchmark; in this case, we found that both virtualized clusters (KVM and VMware ESXi) perform better than the physical cluster.

12.1. TeraSort Performance

When running the TeraSort benchmark, the VMware ESXi cluster sorted large datasets (1 GB, 10 GB and 30 GB) quickly. For instance, sorting 30 GB using a cluster of 4 nodes showed that VMware ESXi is faster than KVM by 64% and faster than the physical cluster by 84% (Figure 80). The KVM cluster also proved to be faster than the physical cluster. However, when the number of nodes increases in the virtualized clusters, the performance of TeraSort degrades significantly.

In the case of the KVM cluster, when the number of nodes increases to 6, 7 and 8, TeraSort became slower overall. This degradation is explained by system overhead, especially disk overhead. A study in [92] analyzed KVM scalability on the OpenStack platform and states that KVM is not recommended when many virtual hard disks will be accessed at the same time. Therefore, since TeraSort has both computational and I/O jobs, the KVM VMs affected the overall performance when they were scaled to 6, 7 and 8. Moreover, another study [93] states that KVM has substantial problems with guests crashing once it reaches a certain number of VMs (4 in that study [93]); hence, scalability is a source of system overhead when using KVM virtualization.

In the case of the VMware ESXi cluster, the performance of running TeraSort declines when the cluster is scaled to 8 nodes. As with KVM, the reason is system overhead. However, this overhead is not related to a scalability issue, because VMware ESXi is known to be scalable [94].
To identify the cause of the system overhead, we tracked the performance of sorting the 30 GB dataset on 8 VMware ESXi VMs (using the VMware vSphere Client), and we found that, at some point, the memory required to sort the dataset exceeds the available memory offered by the cluster. This can be observed in Figure 89, which shows that the active memory (in red, memory currently consumed by the VMs) is higher than the granted memory (in grey, memory provided by the hosting hardware) between 5:05 and 5:10 PM. Further evidence of the system overhead is the latency rate; we tracked the latency of sorting 30 GB on 8 VMs, and we observed that system latency reaches its peak (Figure 90) when sorting this dataset. Thus, latency impacts the overall performance when the number of VMs increases to 8. The last piece of evidence was reported by the OpenStack Dashboard (Figure 91), which showed a warning state for resource usage after creating 8 VMware ESXi instances. In short, the VMware ESXi cluster's performance declines at 8 VMs because of a shortage of resources.

Figure 89: Memory overhead when running 30 GB (started at 4:55 PM) on 8 VMware ESXi VMs

Figure 90: System latency reaches its peak (at 12:28 PM) when running 30 GB on 8 VMware ESXi VMs
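The memory check described above, finding the period during which active memory exceeds granted memory, can be expressed as a small sketch over a series of monitoring samples. The sample values below are invented purely for illustration and are not measurements from this research; only the 5:05 to 5:10 PM window mirrors the text. The function name and the tuple layout are also assumptions.

```python
def overcommitted_intervals(samples):
    """Return (start, end) timestamp pairs during which active > granted memory.

    samples: list of (timestamp, active_mb, granted_mb) tuples in time order.
    """
    intervals = []
    start = None
    last = None
    for timestamp, active_mb, granted_mb in samples:
        if active_mb > granted_mb:
            if start is None:
                start = timestamp  # overcommit period begins
            last = timestamp
        elif start is not None:
            intervals.append((start, last))  # overcommit period ended
            start = None
    if start is not None:
        intervals.append((start, last))
    return intervals


# Hypothetical samples (MB); NOT real measurements.
samples = [
    ("16:55", 9000, 16000),
    ("17:00", 14000, 16000),
    ("17:05", 17500, 16000),
    ("17:10", 16800, 16000),
    ("17:15", 12000, 16000),
]
print(overcommitted_intervals(samples))
```

Exporting the same counters for 3 to 7 VMs and running the check on them would show at which cluster size the shortage first appears.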

Figure 91: OpenStack warning statistics about system resources usage

Overall, even though TeraSort performance decreases when the number of VMware ESXi VMs increases to 8, the results still confirm that the Hadoop VMware ESXi cluster performs better than the Hadoop KVM Cluster and the Hadoop Physical Cluster.

12.2. TestDFSIO Performance

The performance behavior of each cluster changed when running the TestDFSIO benchmark. For all dataset sizes, the KVM cluster showed higher performance than the other clusters for both TestDFSIO-Write and TestDFSIO-Read (Figures 81-88). On the other hand, VMware ESXi showed the lowest performance when compared to the KVM and physical clusters. The good results obtained from running TestDFSIO on KVM are related to the virtio API, which is integrated in the KVM hypervisor to provide an efficient abstraction for I/O operations [95]. Virtio was studied in [96] and shown to enhance KVM performance for I/O operations; the authors of that study [96] tested the performance of KVM (with the virtio API) for I/O operations and compared it with the performance of VMware vSphere 5.1. They concluded that KVM with the virtio API achieves I/O rates that are 49% higher than VMware vSphere 5.1. When running TestDFSIO, we observed again that the performance of both virtualized clusters decreases as the number of VMs goes beyond 6 (KVM) and 7 (VMware ESXi).

12.3. Conclusion

In brief, the overall performance of TeraSort and TestDFSIO showed that, first, virtualization performs better than the physical cluster and, second, the choice of the underlying virtualization technology can lead to significant improvements when performing HPCaaS. In this research, VMware ESXi proved to have the best performance, especially when running computational jobs (TeraSort). To deal with the issue of system overhead in virtualized clusters, HPCaaS needs to be run in a cloud environment that has a balanced number of VMs. For this research, the number of VMs that provided high performance was 7 for the VMware ESXi cluster and 5 for the KVM cluster.

Part IV: Conclusion

This part summarizes the research objectives and findings and suggests some related future work. The bibliography of this report is listed after the conclusion, and a set of appendices (OpenStack Documentation, Hadoop Documentation, Benchmarks Execution and Data Gathering) is provided at the end of this report.

Chapter 13: Conclusion and Future Work

This project aimed at demonstrating the impact of running HPCaaS on different virtualization technologies, namely KVM and VMware ESXi. To that end, we built three Hadoop clusters: a Hadoop Physical Cluster, a Hadoop Virtualized Cluster with KVM and a Hadoop Virtualized Cluster with VMware ESXi. For the virtualized clusters, we proposed to build the Hadoop cluster on top of the OpenStack platform. On each cluster, we ran two well-known benchmarks: TeraSort and TestDFSIO. Each benchmark was tested on different dataset sizes and on different numbers of machines (from 3 to 8 machines). To ensure the credibility and reliability of the research, we performed three tests for each scenario; for instance, we tested TeraSort for 30 GB on each cluster three times, and then took the mean to reduce the effect of outliers.

The findings of this research clearly demonstrate that virtualized clusters can perform much better than a physical cluster when processing and handling HPC, especially when there is less overhead on the virtualized cluster. We found that the Hadoop VMware ESXi cluster performs better at sorting big datasets (more computation), and the Hadoop KVM cluster performs better at I/O operations. Finally, this report includes detailed installation guides for OpenStack and Hadoop that will save time and facilitate the work of future students who want to work on related research.

As future work, this research can be extended in different directions. The first proposal is to conduct the experiments using real HPC applications that can show precisely the impact of virtualization on HPCaaS. The second is to conduct this research using other emerging virtualization technologies such as Xen and Hyper-V. The third is to study the impact of cloud platforms on improving HPCaaS; for example, further research could examine whether replacing OpenStack with another cloud infrastructure leads to better results. Finally, since we obtained positive results about the impact of virtualization on HPCaaS, this research can be investigated further by integrating its findings in other domains such as Smart Grid.


Appendix A: OpenStack with KVM Configuration

Pre-configuration

1. Update your machine
    sudo apt-get update
    sudo apt-get upgrade
2. Install bridge-utils
    sudo apt-get install bridge-utils
3. NTP Server
3.1. Install the NTP server
    sudo apt-get install ntp
3.2. Open the file /etc/ntp.conf and add the following lines so that the server's time stays in sync with an external server:
    server ntp.ubuntu.com
    server 127.127.1.0
    fudge 127.127.1.0 stratum 10
3.3. Restart the NTP service
    sudo service ntp restart
4. Network Configuration
Because the public IP address changes periodically, you need to set a static IP address for use in the OpenStack configuration. In this case, we have two network interfaces, eth0 and eth1. eth0 was chosen as the management interface, so it was assigned a static IP address (in this guide, the management IP is 10.60.62.12).

Hypervisor Configuration

1. KVM Configuration
To install OpenStack with the KVM hypervisor, follow these steps:
1.1. Check if your machine supports virtualization
    ouidad@ouidad:~$ egrep -c '(vmx|svm)' /proc/cpuinfo
    8
    ouidad@ouidad:~$
If the output is 0, your machine does not support virtualization; if the output is greater than 0, the machine supports virtualization technology.
1.2. Check if KVM can be supported
    ouidad@ouidad:~$ kvm-ok
    INFO: /dev/kvm exists
    KVM acceleration can be used
    ouidad@ouidad:~$
If the output is as shown above, your machine supports KVM virtualization.
1.3. Install KVM and libvirt
    sudo apt-get install kvm libvirt-bin
1.4. KVM configuration
See the following page to configure the files needed for KVM support: https://help.ubuntu.com/community/kvm/installation
1.5. Reboot your machine
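The check in step 1.1 can be wrapped in a small function so that a setup script can stop early on unsupported hardware. This is only a minimal sketch: the function names and the optional file argument are assumptions added for illustration (the argument makes the check testable on a saved copy of /proc/cpuinfo).

```shell
#!/bin/sh
# Count the cpuinfo lines that advertise hardware virtualization:
# "vmx" is the Intel VT-x flag, "svm" is the AMD-V flag.
# count_virt_flags and supports_virt are hypothetical helper names.
count_virt_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    grep -E -c '(vmx|svm)' "$cpuinfo" || true
}

# Succeeds (exit status 0) only when at least one matching line exists.
supports_virt() {
    [ "$(count_virt_flags "$1")" -gt 0 ]
}
```

A script could then run `supports_virt || { echo "no VT-x/AMD-V support" >&2; exit 1; }` before installing KVM.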

OpenStack Databases Configuration

1. MySQL
1.1. Install MySQL server and related packages
    sudo apt-get install mysql-server python-mysqldb
1.2. Create the root password for MySQL
The password used in this guide is "secret".
1.3. Open /etc/mysql/my.cnf and change the bind address from
    bind-address = 127.0.0.1
to
    bind-address = 0.0.0.0
1.4. Restart MySQL server
    sudo restart mysql
2. Nova Database
2.1. Create a database named nova
    sudo mysql -uroot -psecret -e 'CREATE DATABASE nova;'
2.2. Create a user named novadbadmin
    sudo mysql -uroot -psecret -e 'CREATE USER novadbadmin;'
2.3. Grant all privileges for novadbadmin on the database "nova"
    sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON nova.* TO 'novadbadmin'@'%';"
2.4. Create a password for the user "novadbadmin"; the password in this case is novasecret
    sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'novadbadmin'@'%' = PASSWORD('novasecret');"
3. Glance Database
3.1. Create a database named glance
    sudo mysql -uroot -psecret -e 'CREATE DATABASE glance;'

3.2. Create a user named glancedbadmin
    sudo mysql -uroot -psecret -e 'CREATE USER glancedbadmin;'
3.3. Grant all privileges for glancedbadmin on the database "glance"
    sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON glance.* TO 'glancedbadmin'@'%';"
3.4. Create a password for the user "glancedbadmin"
    sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'glancedbadmin'@'%' = PASSWORD('glancesecret');"
4. Keystone Database
4.1. Create a database named keystone
    sudo mysql -uroot -psecret -e 'CREATE DATABASE keystone;'
4.2. Create a user named keystonedbadmin
    sudo mysql -uroot -psecret -e 'CREATE USER keystonedbadmin;'
4.3. Grant all privileges for keystonedbadmin on the database "keystone"
    sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON keystone.* TO 'keystonedbadmin'@'%';"
4.4. Create a password for the user "keystonedbadmin"
    sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'keystonedbadmin'@'%' = PASSWORD('keystonesecret');"
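The Nova, Glance and Keystone databases are all created with the same four statements, differing only in the database name, user name and password. As a hedged sketch, the repetition could be factored into a helper that prints the SQL for one service. The helper name service_db_sql is hypothetical, and the statements simply mirror the syntax used in this guide, which may not be accepted by newer MySQL releases.

```shell
#!/bin/sh
# Print the four SQL statements this guide runs for one service database.
# service_db_sql is a hypothetical helper, not part of MySQL or OpenStack.
# Arguments: database name, database user, password for that user.
service_db_sql() {
    db="$1"; user="$2"; pass="$3"
    printf "CREATE DATABASE %s;\n" "$db"
    printf "CREATE USER %s;\n" "$user"
    # "%%" in the format string prints a literal "%" (any host).
    printf "GRANT ALL PRIVILEGES ON %s.* TO '%s'@'%%';\n" "$db" "$user"
    printf "SET PASSWORD FOR '%s'@'%%' = PASSWORD('%s');\n" "$user" "$pass"
}

# Usage (needs a running MySQL server, so it is left commented out):
# service_db_sql nova novadbadmin novasecret | sudo mysql -uroot -psecret
# service_db_sql glance glancedbadmin glancesecret | sudo mysql -uroot -psecret
# service_db_sql keystone keystonedbadmin keystonesecret | sudo mysql -uroot -psecret
```

Printing the SQL rather than executing it keeps the helper easy to inspect before anything is sent to the server.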

Keystone Configuration

1. Install Keystone
    sudo apt-get install keystone python-keystone python-keystoneclient
2. Open /etc/keystone/keystone.conf and make the following changes:
Change
    admin_token = ADMIN
to
    admin_token = admin
Change
    connection = sqlite:////var/lib/keystone/keystone.db
to
    connection = mysql://keystonedbadmin:keystonesecret@10.60.62.12/keystone
3. Restart keystone
    sudo service keystone restart
4. Create the keystone schema in the MySQL database
    sudo keystone-manage db_sync
5. Export environment variables
    export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
    export SERVICE_TOKEN=admin
Note: you can also add these variables to ~/.bashrc to avoid exporting them each time.
6. Create tenants
Create the admin and service tenants:
    keystone tenant-create --name admin
    keystone tenant-create --name service
7. Create users
Create the OpenStack users by executing the following commands. In this case, we are creating four users: admin, nova, glance and swift.
    keystone user-create --name admin --pass admin --email admin@foobar.com
    keystone user-create --name nova --pass nova --email nova@foobar.com
    keystone user-create --name glance --pass glance --email glance@foobar.com
    keystone user-create --name swift --pass swift --email swift@foobar.com

8. Create roles
Create the roles by executing the following commands. In this case, we are creating two roles: admin and Member.
    keystone role-create --name admin
    keystone role-create --name Member
9. List tenants, users and roles
    keystone tenant-list
    keystone user-list
    keystone role-list

10. Adding roles to users in tenants
The user, role and tenant IDs below are those of our installation; use the IDs reported by the list commands of step 9.
10.1. Add the role of admin to the user admin of the tenant admin
    keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role 8af19783ac784e0397e0346c7f1ec --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
10.2. Add the role of admin to the user nova of the tenant service
    keystone user-role-add --user 5ce6dd40bf2249e5ab35a95da63d7930 --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.3. Add the role of admin to the user glance of the tenant service
    keystone user-role-add --user 9967843ee4aa421189f3382849700cad --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.4. Add the role of admin to the user swift of the tenant service
    keystone user-role-add --user 24979d9ac31e4b83a58a89c1ad842ffa --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.5. The Member role is used by Horizon and Swift, so add the Member role accordingly (user: admin, role: Member, tenant: admin)
    keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role c2860fd6f3fd4538a07161bdb2691f60 --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
11. Create services
Create the services that users can authenticate with; nova-compute, nova-volume, glance, swift, keystone and ec2 are the services we create.
11.1. Nova Compute Service
    keystone service-create --name nova --type compute --description 'OpenStack Compute Service'

11.2. Volume Service
    keystone service-create --name volume --type volume --description 'OpenStack Volume Service'
11.3. Image Service
    keystone service-create --name glance --type image --description 'Openstack Image Service'
11.4. Object Store Service
    keystone service-create --name swift --type object_store --description 'Openstack Storage Service'
11.5. Identity Service
    keystone service-create --name keystone --type identity --description 'Openstack Identity Service'
11.6. EC2 Service
    keystone service-create --name ec2 --type ec2 --description 'EC2 Service'
12. List keystone services
    keystone service-list

13. Create endpoints
Create an endpoint for each of the services created above (the service IDs are displayed by the keystone service-list command).
13.1. Endpoint for the identity service
    keystone endpoint-create --region RegionOne --service_id 207bf81ddfe1481aa242148f246d091f \
        --publicurl http://localhost:5000/v2.0 \
        --internalurl http://localhost:5000/v2.0 \
        --adminurl http://localhost:35357/v2.0
13.2. Endpoint for the nova service
    keystone endpoint-create --region RegionOne --service_id 72b9d125eaa84aaf9c8ce734027eea21 \
        --publicurl 'http://localhost:8774/v2/%(tenant_id)s' \
        --internalurl 'http://localhost:8774/v2/%(tenant_id)s' \
        --adminurl 'http://localhost:8774/v2/%(tenant_id)s'
13.3. Endpoint for the image service
    keystone endpoint-create --region RegionOne --service_id 581f6a8e337642a0a39090ffe6947e2d \
        --publicurl 'http://localhost:9292/v1' \
        --internalurl 'http://localhost:9292/v1' \
        --adminurl 'http://localhost:9292/v1'
13.4. Endpoint for the EC2 compatibility service
    keystone endpoint-create --region RegionOne --service_id 4b1619d4f9f34cc9aaf473282c2340f0 \
        --publicurl http://localhost:8773/services/cloud \
        --internalurl http://localhost:8773/services/cloud \
        --adminurl http://localhost:8773/services/admin
13.5. Endpoint for the volume service
    keystone endpoint-create --region RegionOne --service_id 6afe27a1768b403b9521418a87646ec4 \
        --publicurl 'http://localhost:8776/v1/%(tenant_id)s' \
        --internalurl 'http://localhost:8776/v1/%(tenant_id)s' \
        --adminurl 'http://localhost:8776/v1/%(tenant_id)s'
13.6. Endpoint for the object storage service
    keystone endpoint-create --region RegionOne --service_id 2ec242420a114671a4fe15e745b45d3f \
        --publicurl 'http://localhost:8888/v1/auth_%(tenant_id)s' \
        --adminurl 'http://localhost:8888/v1' \
        --internalurl 'http://localhost:8888/v1/auth_%(tenant_id)s'
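The role and endpoint steps of this section rely on IDs copied by hand from the list commands, which is easy to get wrong. The sketch below looks an ID up by name instead. It assumes that the list commands print an ASCII table whose header row contains columns titled "id" and "name"; that layout and the helper name id_by_name are assumptions, so check them against the output of your own client version.

```shell
#!/bin/sh
# Read a "| col | col |" style table on stdin and print the id of the
# row whose name column equals $1. id_by_name is a hypothetical helper.
id_by_name() {
    awk -F'|' -v want="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        /^\|/ {
            if (!idcol) {
                # First table row is the header: locate the two columns.
                for (i = 2; i < NF; i++) {
                    h = trim($i)
                    if (h == "id") idcol = i
                    if (h == "name") namecol = i
                }
                next
            }
            if (namecol && trim($namecol) == want) { print trim($idcol); exit }
        }'
}

# Usage (needs a configured keystone client, so it is left commented out):
# ADMIN_USER=$(keystone user-list | id_by_name admin)
# ADMIN_ROLE=$(keystone role-list | id_by_name admin)
# ADMIN_TENANT=$(keystone tenant-list | id_by_name admin)
# keystone user-role-add --user "$ADMIN_USER" --role "$ADMIN_ROLE" --tenant_id "$ADMIN_TENANT"
```

Locating the columns from the header, rather than by fixed position, keeps the lookup independent of the column order.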

Glance Configuration

1. Install Glance packages
    sudo apt-get install glance glance-api glance-client glance-common glance-registry python-glance
2. Open /etc/glance/glance-api-paste.ini and change the following lines:
    admin_tenant_name = %SERVICE_TENANT_NAME%
    admin_user = %SERVICE_USER%
    admin_password = %SERVICE_PASSWORD%
to:
    admin_tenant_name = service
    admin_user = glance
    admin_password = glance
3. Open /etc/glance/glance-registry-paste.ini and change the following lines:
    admin_tenant_name = %SERVICE_TENANT_NAME%
    admin_user = %SERVICE_USER%
    admin_password = %SERVICE_PASSWORD%
to:
    admin_tenant_name = service
    admin_user = glance
    admin_password = glance
4. Open the file /etc/glance/glance-registry.conf and change the line that contains the option "sql_connection =" to:
    sql_connection = mysql://glancedbadmin:glancesecret@10.60.62.12/glance
Add the following lines at the end of the file to allow glance to use keystone for authentication:
    [paste_deploy]
    flavor = keystone

5. Open /etc/glance/glance-api.conf and add the following lines at the end of the file:
    [paste_deploy]
    flavor = keystone
6. Create the glance schema in the MySQL database
    sudo glance-manage version_control 0
    sudo glance-manage db_sync
7. Restart glance-api and glance-registry
    sudo restart glance-api
    sudo restart glance-registry
8. Export the following environment variables
    export SERVICE_TOKEN=admin
    export OS_TENANT_NAME=admin
    export OS_USERNAME=admin
    export OS_PASSWORD=admin
    export OS_AUTH_URL="http://localhost:5000/v2.0/"
    export SERVICE_ENDPOINT=http://localhost:35357/v2.0
Note: you can add these variables to ~/.bashrc.
9. Check if glance was successfully configured
    glance index
If the command prints nothing, glance is configured correctly (no images have been added yet); if it prints an error, check the troubleshooting section.
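The same variables are exported in several sections of this appendix. As an alternative to editing ~/.bashrc, they could be kept in a separate file that is sourced when needed. The sketch below is one way to do that; the helper name write_openrc and the file location are assumptions, and the values are the ones used in this guide.

```shell
#!/bin/sh
# Write the guide's credentials to a file that can be sourced later.
# write_openrc is a hypothetical helper; the path is chosen by the caller.
write_openrc() {
    rc_file="$1"
    cat > "$rc_file" <<'EOF'
export SERVICE_TOKEN=admin
export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
EOF
    # The file holds a password, so restrict it to the owner.
    chmod 600 "$rc_file"
}

# Usage:
# write_openrc "$HOME/openrc"
# . "$HOME/openrc"
```

Keeping the credentials in their own file makes it easier to switch between the admin account and a service account without editing the shell profile.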

Nova Configuration

1. Install Nova packages
    sudo apt-get install nova-api nova-cert nova-compute nova-compute-kvm nova-doc nova-network nova-objectstore nova-scheduler nova-volume rabbitmq-server novnc nova-consoleauth
2. Edit the /etc/nova/nova.conf file
    --dhcpbridge_flagfile=/etc/nova/nova.conf
    --dhcpbridge=/usr/bin/nova-dhcpbridge
    --logdir=/var/log/nova
    --state_path=/var/lib/nova
    --lock_path=/run/lock/nova
    --allow_admin_api=true
    --use_deprecated_auth=false
    --auth_strategy=keystone
    --scheduler_driver=nova.scheduler.simple.SimpleScheduler
    --s3_host=10.60.62.12
    --ec2_host=10.60.62.12
    --rabbit_host=10.60.62.12
    --cc_host=10.60.62.12
    --nova_url=http://10.60.62.12:8774/v1.1/
    --routing_source_ip=10.60.62.12
    --glance_api_servers=10.60.62.12:9292
    --image_service=nova.image.glance.GlanceImageService
    --iscsi_ip_prefix=192.168.4
    --sql_connection=mysql://novadbadmin:novasecret@10.60.62.12/nova
    --ec2_url=http://10.60.62.12:8773/services/cloud
    --keystone_ec2_url=http://10.60.62.12:5000/v2.0/ec2tokens
    --api_paste_config=/etc/nova/api-paste.ini
    --libvirt_type=kvm
    --libvirt_use_virtio_for_bridges=true
    --start_guests_on_host_boot=true
    --resume_guests_state_on_host_boot=true
    --novnc_enabled=true
    --novncproxy_base_url=http://10.60.62.12:6080/vnc_auto.html
    --vncserver_proxyclient_address=10.60.62.12
    --vncserver_listen=10.60.62.12
    --network_manager=nova.network.manager.FlatDHCPManager
    --public_interface=eth0
    --flat_interface=eth0
    --flat_network_bridge=br100
    --network_size=32
    --flat_injected=false
    --force_dhcp_release
    --iscsi_helper=tgtadm
    --connection_type=libvirt
    --root_help
Important Note: 10.60.62.12 has to be replaced by the public IP address of your local machine. Moreover, you need to set the libvirt_type variable to the hypervisor you are using.
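Since the management address appears in many flags, replacing it by hand is error-prone. The following sketch rewrites every occurrence in a configuration file. The helper name retarget_conf is hypothetical, and the approach is a plain text substitution, so an address that is a prefix of another one in the file (for example 10.60.62.12 inside 10.60.62.120) would also be rewritten.

```shell
#!/bin/sh
# Replace every occurrence of one IP address with another in a file.
# retarget_conf is a hypothetical helper, not an OpenStack tool.
# Arguments: old address, new address, path of the file to edit.
retarget_conf() {
    old_ip="$1"; new_ip="$2"; conf="$3"
    # Escape the dots so they match literally instead of "any character".
    old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
    tmp=$(mktemp)
    # Write to a temporary file, then copy back with cat so that the
    # owner and permissions of the original file are preserved.
    sed "s/$old_re/$new_ip/g" "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Usage (run as a user allowed to write the file):
# retarget_conf 10.60.62.12 192.168.1.5 /etc/nova/nova.conf
```

The same helper could be applied to the keystone and glance configuration files, which also embed the management address in their database connection strings.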

3. Change the ownership of the /etc/nova folder and the permissions of /etc/nova/nova.conf
    sudo chown -R nova:nova /etc/nova
    sudo chmod 644 /etc/nova/nova.conf
4. Open /etc/nova/api-paste.ini and change the following configuration:
    admin_tenant_name = %SERVICE_TENANT_NAME%
    admin_user = %SERVICE_USER%
    admin_password = %SERVICE_PASSWORD%
to:
    admin_tenant_name = service
    admin_user = nova
    admin_password = nova
5. Create the nova schema in the MySQL database
    sudo nova-manage db sync
6. Provide a range of IPs to be associated with the instances
    sudo nova-manage network create private --fixed_range_v4=10.60.62.0/27 --bridge=br100 --bridge_interface=eth0 --network_size=32
7. Export the following environment variables
    export OS_TENANT_NAME=admin
    export OS_USERNAME=admin
    export OS_PASSWORD=admin
    export OS_AUTH_URL="http://localhost:5000/v2.0/"
Note: you can add the environment variables at the end of the ~/.bashrc file.
8. Manage nova volumes
Create a physical volume:
    sudo pvcreate /dev/sda3
Create a volume group named nova-volumes:
    sudo vgcreate nova-volumes /dev/sda3

Note: to create a physical volume, you first need to create a primary partition (in this guide, the partition name is /dev/sda3).
9. Restart nova services
    sudo service libvirt-bin restart
    sudo service nova-network restart
    sudo service nova-compute restart
    sudo service nova-api restart
    sudo service nova-objectstore restart
    sudo service nova-scheduler restart
    sudo service nova-volume restart
    sudo service nova-consoleauth restart
10. Check if nova services are running
    sudo nova-manage service list
Note: if the state of a given service is not :-), try running the following commands in separate terminals:
    sudo /usr/bin/nova-compute
    sudo /usr/bin/nova-network
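The service check in step 10 can be scripted so that only the failing services are reported. The sketch below assumes the listing is whitespace-separated with a header line, the service name in the first column and the state (":-)" or "XXX") in the fifth; this layout and the helper name down_services are assumptions to verify against your own output.

```shell
#!/bin/sh
# Read a service listing on stdin and print the name of every service
# whose state column is XXX. down_services is a hypothetical helper.
down_services() {
    # NR > 1 skips the header line; toupper() tolerates "xxx" as well.
    awk 'NR > 1 && toupper($5) == "XXX" { print $1 }'
}

# Usage (needs a nova installation, so it is left commented out):
# sudo nova-manage service list | down_services
```

The printed names can then be restarted one by one, or started in the foreground as described in the note above to see why they fail.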

OpenStack Dashboard

1. Install OpenStack Dashboard
    sudo apt-get install openstack-dashboard
2. Restart the apache service
    sudo service apache2 restart
3. Open a browser and enter the IP address of your machine
If you followed this tutorial, the possible logins are:
    Username: admin    Password: admin
    Username: nova     Password: nova
    Username: glance   Password: glance
    Username: swift    Password: swift

Figure 1: Dashboard authentication page

Image Configuration

To create an image, you can download the needed image files from the following links:
    http://smoser.brickies.net/ubuntu/ttylinux-uec/old/
    http://uec-images.ubuntu.com/

Example: Ubuntu Precise i386 Image
1. Download Ubuntu Precise (12.04 LTS)
Download the Ubuntu Precise package (precise-server-cloudimg-i386.tar.gz) from http://uec-images.ubuntu.com/precise/current/ using the following command:
    wget http://uec-images.ubuntu.com/precise/current/precise-server-cloudimg-i386.tar.gz
2. Extract the downloaded package
    sudo tar fxvz precise-server-cloudimg-i386.tar.gz
The extracted files are:
    precise-server-cloudimg-i386-vmlinuz-virtual
    precise-server-cloudimg-i386-loader
    precise-server-cloudimg-i386.img
3. Add the Ubuntu image to the glance database
3.1. Add the kernel file
    glance add name="precise32-kernel" disk_format=aki container_format=aki < precise-server-cloudimg-i386-vmlinuz-virtual
3.2. Add the loader file
    glance add name="precise32-ramdisk" disk_format=ari container_format=ari < precise-server-cloudimg-i386-loader
3.3. Add the image file
Get the IDs of both the kernel and the loader using:
    glance index

In this case, the ID of the Ubuntu kernel is 8386c173-cd90-4c7d-8540-da484abd0c1a and the ID of the Ubuntu loader is 5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d. Now, add the image file using the kernel and loader IDs:
    glance add name="precise32_image" disk_format=ami container_format=ami kernel_id=8386c173-cd90-4c7d-8540-da484abd0c1a ramdisk_id=5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d < precise-server-cloudimg-i386.img
4. Using Horizon, you can find the uploaded image (precise32_image)

Figure 2: List of OpenStack images
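Instead of copying the kernel and loader IDs by hand, a script could capture them when each file is added. The sketch below assumes that the glance client reports a successful upload with a line of the form "Added new image with ID: <id>"; that message format is an assumption about the client used here and should be checked before relying on it. The helper name image_id_from_output is hypothetical.

```shell
#!/bin/sh
# Extract the image id from the text printed by an upload command.
# Assumes a line of the form "Added new image with ID: <id>".
# image_id_from_output is a hypothetical helper.
image_id_from_output() {
    sed -n 's/^Added new image with ID: *//p'
}

# Usage (needs a configured glance client, so it is left commented out):
# KERNEL_ID=$(glance add name="precise32-kernel" disk_format=aki container_format=aki \
#     < precise-server-cloudimg-i386-vmlinuz-virtual | image_id_from_output)
# RAMDISK_ID=$(glance add name="precise32-ramdisk" disk_format=ari container_format=ari \
#     < precise-server-cloudimg-i386-loader | image_id_from_output)
# glance add name="precise32_image" disk_format=ami container_format=ami \
#     kernel_id="$KERNEL_ID" ramdisk_id="$RAMDISK_ID" < precise-server-cloudimg-i386.img
```

If the message format differs in your version, the IDs can still be read from the glance index listing as described above.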

Keypair Configuration

1. Generate a key for your local machine
If you have not yet generated a key for your local machine, run the following command:
    ssh-keygen -t rsa -P ""
2. Create keypair
The following command can be used either to generate a new keypair or to upload an existing public key:
    cd ~/.ssh
    nova keypair-add --pub_key id_rsa.pub mykey
3. List keypairs
    nova keypair-list
4. Check the created keypair
Confirm that the uploaded keypair matches the local key by checking your key's fingerprint with the ssh-keygen command:
    ssh-keygen -l -f ~/.ssh/id_rsa.pub
Note: you can use the OpenStack Dashboard to perform all operations related to keypair generation.

Security Groups Configuration

1. List default security groups
    nova secgroup-list
2. Enable access to TCP port 22
Allow access to port 22 from all IP addresses (specified in CIDR notation as 0.0.0.0/0) with the following command:
    nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
3. Enable pinging of virtual machine instances by allowing ICMP traffic
    nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

Flavors Configuration

1. Flavor overview
Flavors are used to specify the properties of an instance. The following table lists the arguments needed to define a flavor.

Column / Description
    ID: A unique numeric id.
    Name: A descriptive name. The xx.size_name form is conventional but not required, though some third-party tools may rely on it.
    Memory_MB: Virtual machine memory in megabytes.
    Disk: Virtual root disk size in gigabytes. This is an ephemeral disk the base image is copied into. When booting from a persistent volume it is not used. The "0" size is a special case which uses the native base image size as the size of the ephemeral root volume.
    Ephemeral: Specifies the size of a secondary ephemeral data disk. This is an empty, unformatted disk and exists only for the life of the instance.
    Swap: Optional swap space allocation for the instance.
    VCPUs: Number of virtual CPUs presented to the instance.
    RXTX_Factor: Optional property that allows created servers to have a different bandwidth cap than that defined in the network they are attached to. This factor is multiplied by the rxtx_base property of the network. Default value is 1.0 (that is, the same as the attached network).
    Is_Public: Boolean value indicating whether the flavor is available to all users or private to the tenant it was created in. Defaults to True.
    extra_specs: Additional optional restrictions on which compute nodes the flavor can run on. This is implemented as key/value pairs that must match against the corresponding key/value pairs on compute nodes. Can be used to implement things like special resources (such as flavors that can only run on compute nodes with GPU hardware).

Table 1: Flavor arguments

2. List available flavors
Use the nova flavor-list command to view the list of available flavors:
    nova flavor-list
3. Create a flavor
Create a flavor with the following suggested specifications:
    sudo nova-manage instance_type create --name=m1.cluster --memory=975 --cpu=2 --root_gb=100 --ephemeral_gb=10 --flavor=8

Instances Management

Instances can be created either through the dashboard interface or from the command line.
1. Create an instance with no specifications
    nova boot --flavor ID --image Image-ID MyInstanceName
2. Create an instance with an associated keypair
To associate a key with an instance on boot, add --key_name Mykey to your command line:
    nova boot --image Image-ID --flavor ID --key_name Mykey MyInstanceName
3. Create an instance with a security group
It is also possible to add and remove security groups while an instance is running:
    nova add-secgroup MyInstanceName MysecurityGroup
    nova remove-secgroup MyInstanceName MysecurityGroup
4. Create an instance with a given keypair and security group
    nova boot --flavor ID --image Image-ID --key_name Mykey MyInstanceName
5. Display instance details
    nova show MyInstanceName
6. Access an instance
You can connect to an instance console via VNC. The console can be reached through the Horizon interface, the command line, or other tools such as virt-manager.
Using command line
    nova get-vnc-console host_name novnc
The link displayed in the output can be used to access the instance console.

Using virt-manager
If you cannot connect to the VNC console, you can use virt-manager. Install the virt-manager package with the following command:
    sudo apt-get install virt-manager
To open the virt-manager interface, run:
    sudo virt-manager
Using local machine terminal
If the instance you created asks for a login name and password, you can access it from your local machine by copying your public key to it:
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@instance_ip_address
For Ubuntu, the user name is root or ubuntu.
Example: to access an Ubuntu instance with IP address 10.60.62.8, run the following commands:
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@10.60.62.8
    ssh ubuntu@10.60.62.8

7. Connecting Instances
The following steps can be followed to connect OpenStack instances (assumption: we need to connect the instance with hostname host1 to another instance with hostname host2):
    - Generate the keypair on host1 and host2 to run ssh (ssh-keygen -t rsa)
    - On host2:
        - Check the sshd_config file on that instance (it is located in /etc/ssh/sshd_config)
        - Uncomment the following two lines in sshd_config:
              RSAAuthentication yes
              PubkeyAuthentication yes
        - Append the contents of the id_rsa.pub file of host1 to the authorized_keys file of host2
8. Delete an instance
    nova delete MyInstanceName
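The last part of step 7, appending the public key of host1 to the authorized_keys file of host2, can be made safe to repeat with a small helper run on host2. This is a minimal sketch: the function name authorize_key is hypothetical, and it assumes the public key file of host1 has already been copied to host2.

```shell
#!/bin/sh
# Append a public key to an authorized_keys file unless it is already there.
# authorize_key is a hypothetical helper.
# Arguments: public key file, authorized_keys file (optional).
authorize_key() {
    pubkey_file="$1"
    auth_file="${2:-$HOME/.ssh/authorized_keys}"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # sshd ignores key files that are writable by other users.
    chmod 600 "$auth_file"
    key=$(cat "$pubkey_file")
    # -x matches whole lines and -F treats the key as a fixed string,
    # so the key is appended only if that exact line is missing.
    grep -qxF "$key" "$auth_file" || printf '%s\n' "$key" >> "$auth_file"
}

# Usage on host2, after copying the id_rsa.pub of host1 to /tmp/host1.pub:
# authorize_key /tmp/host1.pub
```

Because the key is appended only when it is missing, running the helper twice does not create duplicate entries.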

OpenStack Troubleshooting

Glance Exceptions

1. Exception 1: glance index error

    ouidad@ouidad:~$ glance index
    Failed to show index. Got error:
    There was an error connecting to a server
    Details: [Errno 111] Connection refused

Solution

In most cases, the above exception occurs because the glance-api service is not running, so you first need to find out why it is not running. In the case shown here, the cause was an error in glance-api-paste.ini, so that file had to be opened and corrected:

    ouidad@ouidad:~$ sudo gedit /etc/glance/glance-api-paste.ini

After fixing the error, restart the glance-api service:

    ouidad@ouidad:~$ sudo /usr/bin/glance-api

Nova Exceptions

1. Exception 1: nova services not running

    sudo nova-manage service list

When running sudo nova-manage service list, if a service is in the XXX state, you need to start that service in a separate terminal.

Solution

For example, if nova-compute is in the XXX state, run the following command:

    sudo /usr/bin/nova-compute

The same solution applies to the other services:

    sudo /usr/bin/nova-network
    sudo /usr/bin/nova-scheduler
    sudo /usr/bin/nova-consoleauth
    sudo /usr/bin/nova-cert
    sudo /usr/bin/nova-volume

2. Exception 2: sudo nova-manage service list does not display the expected output

    ouidad@ouidad:~$ sudo nova-manage service list
    Command failed, please check log for more info
    2013-09-02 19:46:28.050 15999 CRITICAL nova [-] No module named quantumclient.common

Solution

    ouidad@ouidad:~$ sudo apt-get install python-quantumclient

3. Exception 3: Unable to start nova-compute

    libvirtError: operation failed: domain 'instance-.. already exists with uuid

Solution

You need to log in to the nova database and delete the instance id from the instances table. You also need to delete the instance id from the related tables, such as security_group_instance_association and instance_info_caches.

Example: to delete an instance with id = 3, delete instance id = 3 from the security_group_instance_association and instance_info_caches tables, as well as from the virtual_interfaces table.

Dashboard Exceptions

1. Exception 1: Unable to retrieve images/instances

Solution

If you get one of these exceptions, the only fix that worked here was to drop the endpoints and re-create them. Then, you need to reboot your local machine.

References for Appendix A

http://docs.openstack.org/folsom/openstack-ops/content/flavors.html
http://www.hastexo.com/resources/docs/installing-openstack-essex-20121-ubuntu-1204-precise-pangolin
http://docs.openstack.org/essex/openstack-compute/starter/content/introduction_to_openstack_and_its_components-d1e59.html

Appendix B. OpenStack with VMware ESXi Configuration

1. Downloading VMware ESXi

Download VMware ESXi (vSphere 5.5) from: https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55

2. Installing VMware ESXi

After burning the VMware ESXi software to a CD, install it on your hardware.

3. Download vSphere Client

To manage your VMware ESXi host:

• Install vSphere Client on another machine running Windows.
• After opening the software, log in to the VMware ESXi machine with your username and password.
• After login, you get access to the VMware ESXi machine resources. In our case, the VMware ESXi machine has the IP address 10.50.1.166 (Figure 1).

Figure 1: vSphere Client interface: access to VMware ESXi 10.50.1.166

4. Create the OpenStack VM

Create a virtual machine on top of VMware ESXi using vSphere Client. The VM will be used to host OpenStack. Create the VM with an Ubuntu Precise 12.04 LTS 64-bit guest.

5. Download the VMware vSphere Web Services SDK

Download the appropriate SDK from: http://www.vmware.com/support/developer/vc-sdk/

Copy the SDK to the /openstack/vmware directory. Make sure that the WSDL is available by checking that the following path exists; this path will be specified in nova.conf:

    /openstack/vmware/sdk/wsdl/vim25/vimService.wsdl

6. Configure OpenStack on VMware ESXi

Follow the same steps provided in the OpenStack KVM documentation. The main difference is the nova.conf configuration.

7. nova.conf Configuration

In this case, you need to specify the compute_driver, host_ip (the VMware ESXi machine), host_username, host_password and wsdl_location (for the SDK) as follows:

    [vmware]
    host_password = 12357890
    host_username = root
    host_ip = 10.50.1.166
    compute_driver = vmwareapi.VMwareESXDriver
    wsdl_location = file:///openstack/vmware/sdk/wsdl/vim25/vimService.wsdl

8. Dashboard access

Access OpenStack resources from Horizon using the IP address of the OpenStack VM.

9. Verify that OpenStack is running with VMware ESXi

This is done from the Horizon interface. Example:

Figure 2: OpenStack with VMware ESXi hypervisor

10. Manage OpenStack with VMware ESXi

After configuring OpenStack, you can download images and create instances. Each instance you create is displayed in the vSphere Client interface, as depicted in Figure 1. Images must be added in the vmdk format. You can download such images from the free images section of the following website: http://stacklet.com

Figure 3: Access to VMs (OpenStack instances) through the vSphere Client interface

References

http://docs.openstack.org/trunk/config-reference/content/vmware.html
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
https://www.vmware.com/support/developer/vc-sdk/

Appendix C: Hadoop Configuration

Prerequisites for Installing Hadoop

1. Adding a dedicated Hadoop system user (all machines)

Create a Hadoop user account (hduser) for running Hadoop using the following commands:

    ouidad@host1:~$ sudo addgroup hadoop
    ouidad@host1:~$ sudo adduser --ingroup hadoop hduser

2. Configuring SSH

2.1. To manage cluster nodes, Hadoop requires SSH access. In this case, you need to generate an SSH key for the hduser user.

    ouidad@host1:~$ su hduser
    Password:
    hduser@host1:~$ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
    Your identification has been saved in /home/hduser/.ssh/id_rsa.
    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
    The key fingerprint is:
    44:f5:7b:85:32:f7:69:c7:d7:fc:75:38:63:32:be:d7 hduser@host1
    The key's randomart image is:
    [randomart image]

2.2. To allow Hadoop to interact directly with its nodes without prompting for a passphrase, the RSA key pair is created with an empty password (step 2.1). You then need to enable SSH access to your local machine with this newly created key:

    hduser@host1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

3. Install Java

3.1. Download jdk-6u45-linux-i586.bin (for 32-bit architectures) from: http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html

3.2. JDK installation

    chmod +x jdk-6u45-linux-i586.bin
    sudo ./jdk-6u45-linux-i586.bin

3.3. Make sure that the JDK is installed

    ouidad@host1:~$ java -version
    java version "1.6.0_24"
    Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
    Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)

3.4. Copy the JDK folder from its current location to /home/hduser

    ouidad@host1:~$ sudo cp /Downloads/jdk1.6.0_45 /home/hduser -r

3.5. Change the JDK ownership

    ouidad@host1:~$ sudo chown -R hduser:hadoop /home/hduser/jdk1.6.0_45/

Installing Hadoop

1. Download Hadoop version 1.2.1 (hadoop-1.2.1.tar.gz) from http://www.apache.org/dyn/closer.cgi/hadoop/core

2. Extract the downloaded version

    ouidad@host1:~/downloads$ tar -zxvf hadoop-1.2.1.tar.gz

3. Copy the extracted folder (hadoop-1.2.1) from the Downloads folder to /home/hduser

    ouidad@host1:~/downloads$ sudo cp hadoop-1.2.1 /home/hduser/ -r

4. Change the ownership

    ouidad@host1:~/downloads$ sudo chown -R hduser:hadoop /home/hduser/hadoop-1.2.1

5. Bashrc file configuration (all machines)

First log in to the hduser account, then run the following command:

    hduser@host1:~$ sudo gedit ~/.bashrc

At the end of the file, add the following lines:

    export JAVA_HOME=~/jdk1.6.0_45
    export PATH=$JAVA_HOME/bin:$PATH

6. HDFS folder creation (all machines)

First log in to the hduser account, then create the following folder:

    hduser@host1:~$ sudo mkdir -p /home/hduser/hdfs/temp
    hduser@host1:~$ sudo chown hduser:hadoop /home/hduser/hdfs/temp
    hduser@host1:~$ sudo chmod 777 /home/hduser/hdfs/temp/
    hduser@host1:~$ sudo chmod 775 /home/hduser/hdfs/temp/
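The folder-creation step applies two chmod commands in a row, so only the last mode (775) remains in effect. The following sketch, which is only an illustration and not part of the original procedure, uses Python's standard stat module to show what the octal modes mentioned in this appendix (777, 775, and the 750 suggested in the troubleshooting section) mean for a directory. The helper name describe_mode is hypothetical.

```python
import stat


def describe_mode(octal_mode: int) -> str:
    """Return the ls-style permission string for a directory with this mode."""
    # S_IFDIR marks the entry as a directory so the string starts with 'd'.
    return stat.filemode(stat.S_IFDIR | octal_mode)


# Modes that appear in this appendix for /home/hduser/hdfs/temp.
for mode in (0o777, 0o775, 0o750):
    print(oct(mode), describe_mode(mode))
```

With 775, the owner (hduser) and the hadoop group can write to the folder while other users can only read and traverse it.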

7. Hadoop Files Configuration (Slave machines)

Move to the hadoop-1.2.1/conf folder to change the following files.

7.1. hadoop-env.sh File

    hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hadoop-env.sh

Replace the following two lines:

    # The java implementation to use. Required.
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

by (uncomment the second line):

    # The java implementation to use. Required.
    export JAVA_HOME=~/jdk1.6.0_45

Then, add at the end of the file:

    export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

7.2. core-site.xml File

    hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml

Add the following lines between the <configuration> tags:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hduser/hdfs/temp</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
      <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/home/hduser/hdfs/temp</value>
    </property>
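Because the same property blocks have to be typed on every machine, it can help to generate them rather than edit each file by hand. The sketch below is a minimal illustration, assuming Python's standard xml.etree.ElementTree module; the function names make_property and build_configuration are hypothetical and not part of Hadoop. It builds the three core-site.xml properties of step 7.2 (descriptions omitted).

```python
import xml.etree.ElementTree as ET


def make_property(name: str, value: str) -> ET.Element:
    """Build one <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def build_configuration(properties: dict) -> str:
    """Wrap the given name/value pairs in a <configuration> element."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        root.append(make_property(name, value))
    return ET.tostring(root, encoding="unicode")


# Values taken from step 7.2 of this appendix.
core_site = build_configuration({
    "hadoop.tmp.dir": "/home/hduser/hdfs/temp",
    "fs.default.name": "hdfs://master:54310",
    "dfs.name.dir": "/home/hduser/hdfs/temp",
})
print(core_site)
```

The generated string is unindented XML; it would still need to be written to conf/core-site.xml on each node.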

7.3. mapred-site.xml File

    hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml

Add the following lines between the <configuration> tags:

    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>

7.4. hdfs-site.xml File

    hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml

Add the following lines between the <configuration> tags:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
      <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
    </property>

Note: The value 3 is the number of replicas kept for each block. If you have a cluster of 3-10 nodes, set the replication factor to 3.

8. Hadoop Files Configuration (Master)

8.1. core-site.xml File

    hduser@master:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml

Add the following lines between the <configuration> tags:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
      <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hduser/hdfs/temp</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/home/hduser/hdfs/temp</value>
    </property>

8.2. mapred-site.xml File

    hduser@master:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml

Add the following lines between the <configuration> tags:

    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>

8.3. hdfs-site.xml File

    hduser@master:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml

Add the following lines between the <configuration> tags:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
      <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
    </property>

8.4. slaves File

    hduser@master:~/hadoop-1.2.1/conf$ sudo gedit slaves

Comment out the localhost entry and add the host names of your slaves. You can make the master node act as both master and slave by adding the master hostname to the slaves file:

    master
    host1
    host2

8.5. masters File

    hduser@master:~/hadoop-1.2.1/conf$ sudo gedit masters

Comment out the localhost entry and add the name of your master node:

    master

Connecting Nodes

1. IP address configuration (all machines)

1.1. Find out the IP address of each machine

    hduser@host1:~$ ifconfig
    eth0  Link encap:Ethernet HWaddr 00:23:ae:b0:89:ae
          inet addr:10.50.0.170 Bcast:10.50.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:198693 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9134 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:30871002 (30.8 MB) TX bytes:1334436 (1.3 MB)
          Interrupt:21 Memory:fe6e0000-fe700000
    lo    Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:58 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9306 (9.3 KB) TX bytes:9306 (9.3 KB)

1.2. Find out the host name of each machine

    hduser@host1:~$ sudo gedit /etc/hostname

1.3. Open the hosts file (on each machine)

    hduser@host1:~$ sudo gedit /etc/hosts

Replace the content of the file with the IP addresses and host names of all machines in the cluster:

    10.50.0.197 master
    10.50.0.94 slave

2. Connect the hduser on the master with the hduser on the slaves

Example: for the machine with hostname host1

    hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host1

Example: for the machine with hostname host2

    hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host2

3. Test the connection between each slave and the master machine

    hduser@master:~$ ssh host1
    Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic i686)
     * Documentation: https://help.ubuntu.com/
    System information as of Sun Jun 30 19:44:28 WEST 2013
    System load: 0.08            Processes: 159
    Usage of /: 77.7% of 228.23GB  Users logged in: 2
    Memory usage: 35%            IP address for eth0: 10.50.0.170
    Swap usage: 0%
    => There is 1 zombie process.
    Graph this data and manage this system at https://landscape.canonical.com/
    97 packages can be updated.
    66 updates are security updates.
    Last login: Sun Jun 30 18:39:15 2013 from ip6-localhost

Once the connection is established, close it to continue your installation:

    hduser@host5:~$ exit
    logout
    Connection to host5 closed.
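Every machine needs the same host-name mapping in /etc/hosts, so one option is to keep the mapping in one place and render the file content from it. The following is a minimal sketch under that assumption; render_hosts is a hypothetical helper, the two addresses are the ones given in the hosts-file step above, and the result would still have to be copied to each node by hand.

```python
import ipaddress


def render_hosts(nodes: dict) -> str:
    """Render 'ip hostname' lines, one per cluster node."""
    lines = []
    for hostname, ip in nodes.items():
        # Raises ValueError if the address is malformed.
        ipaddress.ip_address(ip)
        lines.append(f"{ip} {hostname}")
    return "\n".join(lines) + "\n"


# Mapping taken from the /etc/hosts example in this appendix.
cluster = {"master": "10.50.0.197", "slave": "10.50.0.94"}
hosts_text = render_hosts(cluster)
print(hosts_text, end="")
```

Validating each address before writing the file catches typing mistakes early, which matters because a wrong entry only shows up later as a failed SSH or DataNode connection.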

Formatting the HDFS & Starting Multi-node Cluster

1. Format the HDFS filesystem via the NameNode

    hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format

Here is the output:

    13/06/30 20:00:42 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = master/10.50.0.197
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 1.2.1
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
    ************************************************************/
    13/06/30 20:00:42 INFO util.GSet: VM type = 32-bit
    13/06/30 20:00:42 INFO util.GSet: 2% max memory = 19.33375 MB
    13/06/30 20:00:42 INFO util.GSet: capacity = 2^22 = 4194304 entries
    13/06/30 20:00:42 INFO util.GSet: recommended=4194304, actual=4194304
    13/06/30 20:00:42 INFO namenode.FSNamesystem: fsOwner=hduser
    13/06/30 20:00:42 INFO namenode.FSNamesystem: supergroup=supergroup
    13/06/30 20:00:42 INFO namenode.FSNamesystem: isPermissionEnabled=true
    13/06/30 20:00:42 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
    13/06/30 20:00:42 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
    13/06/30 20:00:42 INFO namenode.NameNode: Caching file names occuring more than 10 times
    13/06/30 20:00:42 INFO common.Storage: Image file of size 112 saved in 0 seconds.
    13/06/30 20:00:42 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
    13/06/30 20:00:42 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
    13/06/30 20:00:43 INFO common.Storage: Storage directory /home/hduser/hdfs/temp/dfs/name has been successfully formatted.
    13/06/30 20:00:43 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
    ************************************************************/

2. Start the multi-node cluster

    hduser@master:~/hadoop-1.2.1$ bin/start-all.sh

Start both DFS and Hadoop Map/Reduce daemons:

    hduser@master:~/hadoop-1.2.1$ bin/start-dfs.sh
    hduser@master:~/hadoop-1.2.1$ bin/start-mapred.sh

4. On the master machine, check that the following Java processes are running:

    hduser@master:~$ jps
    5721 SecondaryNameNode
    6738 DataNode
    5243 NameNode
    6047 TaskTracker
    8423 Jps
    5805 JobTracker

5. On the slave machines, check that the following Java processes are running:

    hduser@master:~$ jps
    1902 DataNode
    4002 Jps
    2108 TaskTracker

If you get the following output:

    hduser@host1:~/hadoop-1.2.1/conf$ jps
    The program 'jps' can be found in the following packages:
     * openjdk-6-jdk
     * openjdk-7-jdk
    Ask your administrator to install one of them

then install one of the suggested packages:

    hduser@host1:~/hadoop-1.2.1/conf$ sudo apt-get install openjdk-7-jdk

Note: if you did not get the same services, follow the suggestion provided for exception 2.
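The check above amounts to comparing the process names printed by jps with the set of daemons expected for each role. The sketch below illustrates that comparison; the EXPECTED table is copied from the sample outputs in this appendix (where the master also acts as a slave), and the function names are hypothetical.

```python
# Daemons shown in the sample jps outputs of this appendix. The master list
# includes DataNode and TaskTracker because the master is also in the slaves file.
EXPECTED = {
    "master": {"NameNode", "SecondaryNameNode", "JobTracker", "DataNode", "TaskTracker"},
    "slave": {"DataNode", "TaskTracker"},
}


def running_daemons(jps_output: str) -> set:
    """Extract process names from jps lines of the form '<pid> <name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Skip malformed lines and the jps process itself.
        if len(parts) == 2 and parts[0].isdigit() and parts[1] != "Jps":
            names.add(parts[1])
    return names


def missing_daemons(jps_output: str, role: str) -> set:
    """Return the expected daemons for this role that jps did not report."""
    return EXPECTED[role] - running_daemons(jps_output)


slave_sample = """1902 DataNode
4002 Jps
2108 TaskTracker"""
print(missing_daemons(slave_sample, "slave"))
print(missing_daemons(slave_sample, "master"))
```

An empty result means the node runs everything expected for its role; a non-empty one names the daemons whose log files are worth inspecting first.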

Hadoop Troubleshooting

1. Formatting the NameNode. Exception: Cannot lock storage

    hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
    13/06/30 19:57:35 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = master/10.50.0.197
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 1.2.1
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
    ************************************************************/
    13/06/30 19:57:38 ERROR namenode.NameNode: java.io.IOException: Cannot lock storage /home/hduser/hdfs/temp/dfs/name. The directory is already locked.
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:599)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1327)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1345)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1207)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1398)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1419)
    13/06/30 19:57:38 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
    ************************************************************/

Solution

Step 1: Stop all processes

    hduser@master:~/hadoop-1.2.1$ bin/stop-all.sh

Step 2: Move to the /hdfs/temp folder and run the following command

    hduser@master:~/hdfs/temp$ sudo rm -rf *

Step 3: Restart your work by formatting the NameNode

    hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format

2. Formatting the NameNode. Exception: Cannot create directory /home/hduser/hdfs

Solution

In this case, make sure that you set the following permission when creating the /hdfs/temp folder:

    hduser@host1:~$ sudo chmod 750 /home/hduser/hdfs/temp/

3. Exception in the log file hadoop-hduser-datanode-host1.log, or when the Hadoop DataNode does not show up on slave nodes

    hduser@host1:~/hadoop-1.2.1/logs$ sudo gedit hadoop-hduser-datanode-host1.log
    2013-06-30 19:01:09,078 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /home/hduser/hdfs/temp/dfs/data: namenode namespaceID = 1345454277; datanode namespaceID = 1875045188
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:399)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:309)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1651)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1590)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1608)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1734)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1751)

Solution 1

1. On the master machine, open the VERSION file under the /hdfs/temp/dfs/name/current folder:

    hduser@master:~/hdfs/temp/dfs/name/current$ sudo gedit VERSION

Here is the content of the VERSION file:

    #Sun Jun 30 20:00:43 WEST 2013
    namespaceID=1289101159
    cTime=0
    storageType=NAME_NODE
    layoutVersion=-32

Check the value of the namespaceID variable (in this case it is 1289101159); remember it, as you will need it in the next step.

2. On every slave machine where you found the above exception, open the VERSION file under the /hdfs/temp/dfs/data/current folder:

    hduser@host1:~/hdfs/temp/dfs/data/current$ sudo gedit VERSION

Here is the content of the VERSION file:

    #Fri Jun 14 09:22:08 WET 2013
    namespaceID=176572587
    storageID=DS-1900366223-127.0.1.1-50010-1371201728420
    cTime=0
    storageType=DATA_NODE
    layoutVersion=-32

Replace the namespaceID value with the value you found in the VERSION file of the master. The content of the VERSION file under the /hdfs/temp/dfs/data/current folder then becomes:

    #Fri Jun 14 09:22:08 WET 2013
    namespaceID=1289101159
    storageID=DS-1900366223-127.0.1.1-50010-1371201728420
    cTime=0
    storageType=DATA_NODE
    layoutVersion=-32

Solution 2

1. Stop the whole cluster.
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /hdfs/temp/dfs/data.
3. Reformat the NameNode.
4. Restart the cluster.

4. Safe mode exception when running MapReduce examples

    org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /benchmarks/TestDFSIO. Name node is in safe mode. The reported blocks is only 3601 but the threshold is 0.9990 and the total blocks 3748. Safe mode will be turned off automatically.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:2111)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2088)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:832)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

Solution

    hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
    Safe mode is OFF
    hduser@master:~/hadoop-1.2.1$ bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
    TestDFSIO.0.0.4
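The numbers in the safe mode message explain why the NameNode refused the delete: only 3601 of 3748 blocks had been reported, which is below the 0.9990 threshold. The sketch below redoes that arithmetic. It assumes the required block count is the product of the total and the threshold truncated to an integer, which is an assumption about the NameNode's rounding rather than something stated in this appendix; the function name is hypothetical.

```python
def safe_mode_status(reported: int, total: int, threshold: float):
    """Return (reported ratio, blocks required, blocks still missing)."""
    # Assumption: the required count is truncated to an integer.
    required = int(total * threshold)
    ratio = reported / total
    missing = max(0, required - reported)
    return ratio, required, missing


# Values copied from the SafeModeException message above.
ratio, required, missing = safe_mode_status(3601, 3748, 0.9990)
print(f"reported ratio: {ratio:.4f}")
print(f"blocks required: {required}, still missing: {missing}")
```

Under this assumption about 96% of the blocks had been reported, so the NameNode was still waiting for DataNodes to report the rest. Forcing it out of safe mode with dfsadmin -safemode leave, as in the solution above, skips that wait.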

References for Appendix C

[1] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#solution-2-manually-update-the-namespaceid-of-problematic-datanodes

Appendix D: TeraSort and TestDFSIO Execution

1. TeraSort

1.1. Generate the TeraSort input data using TeraGen

TeraGen generates random data that can be conveniently used as input data for a subsequent TeraSort run. The command to run TeraGen in order to generate 100 MB of input data is:

    bin/hadoop jar hadoop-*examples*.jar teragen 1000000 /home/hduser/terasort-input

The value 1000000 specifies the number of rows of input data to generate, each of which has a size of 100 bytes.

1.2. Run the actual TeraSort benchmark using TeraSort

The syntax to run the TeraSort benchmark is as follows:

    bin/hadoop jar hadoop-*examples*.jar terasort /home/hduser/terasort-input /home/hduser/terasort-output

1.3. Validate the sorted output data of TeraSort using TeraValidate

The syntax to run TeraValidate is as follows (the first argument is the TeraSort output directory and the second is a new directory for the validation report):

    bin/hadoop jar hadoop-*examples*.jar teravalidate /home/hduser/terasort-output /home/hduser/terasort-validate

1.4. Check the TeraSort analysis

To check the average time to generate 100 MB, run the following command:

    bin/hadoop job -history /home/hduser/terasort-input

To check the average time to sort 100 MB, run the following command:

    bin/hadoop job -history /home/hduser/terasort-output

1.5. Clean up your temporary files

Before re-running the TeraSort benchmark, you need to clean up all files generated by the first TeraSort test:

    bin/hadoop dfs -rmr /home/hduser/terasort-input
    bin/hadoop dfs -rmr /home/hduser/terasort-output
    bin/hadoop dfs -rmr /home/hduser/terasort-validate
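The TeraGen argument is a row count, not a size, so each dataset size has to be converted to rows of 100 bytes. The sketch below shows that conversion for the dataset sizes used in Appendix E. It assumes decimal units (1 MB = 10^6 bytes), which is what the 1000000-row, 100 MB example in step 1.1 implies; the helper name teragen_rows is hypothetical.

```python
ROW_BYTES = 100      # each TeraGen row is 100 bytes
MB = 10**6           # assumption: decimal megabytes, as in the 100 MB example
GB = 10**9


def teragen_rows(dataset_bytes: int) -> int:
    """Number of rows to pass to teragen for the requested dataset size."""
    return dataset_bytes // ROW_BYTES


for label, size in (("100 MB", 100 * MB), ("1 GB", GB), ("10 GB", 10 * GB), ("30 GB", 30 * GB)):
    print(f"{label}: teragen {teragen_rows(size)}")
```

For example, the 1 GB runs would use ten times the row count of the 100 MB command shown in step 1.1.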

2. TestDFSIO

2.1. Write data using TestDFSIO-write

To generate a 100 MB dataset, specify an input of 10 files of 10 MB each. To do so, execute the following command:

    hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 10

A sample output of the TestDFSIO-write operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time.

    13/11/07 15:37:27 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
    13/11/07 15:37:27 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:37:27 UTC 2013
    13/11/07 15:37:27 INFO fs.TestDFSIO: Number of files: 10
    13/11/07 15:37:27 INFO fs.TestDFSIO: Total MBytes processed: 100
    13/11/07 15:37:27 INFO fs.TestDFSIO: Throughput mb/sec: 5.680527152919791
    13/11/07 15:37:27 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.899490356445312
    13/11/07 15:37:27 INFO fs.TestDFSIO: IO rate std deviation: 7.567628183406918
    13/11/07 15:37:27 INFO fs.TestDFSIO: Test exec time sec: 17.568
    13/11/07 15:37:27 INFO fs.TestDFSIO:

2.2. Read data using TestDFSIO-read

After getting the results of the TestDFSIO-write command, the next step is to run the TestDFSIO-read operation. To read the previously generated data, execute the following command:

    hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 10

A sample output of the read operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time.

    13/11/07 15:38:11 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
    13/11/07 15:38:11 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:38:11 UTC 2013
    13/11/07 15:38:11 INFO fs.TestDFSIO: Number of files: 10
    13/11/07 15:38:11 INFO fs.TestDFSIO: Total MBytes processed: 100
    13/11/07 15:38:11 INFO fs.TestDFSIO: Throughput mb/sec: 70.57163020465772
    13/11/07 15:38:11 INFO fs.TestDFSIO: Average IO rate mb/sec: 73.69004821777344
    13/11/07 15:38:11 INFO fs.TestDFSIO: IO rate std deviation: 16.249892929638822
    13/11/07 15:38:11 INFO fs.TestDFSIO: Test exec time sec: 15.51
    13/11/07 15:38:11 INFO fs.TestDFSIO:

2.3. Clean your cluster

The last step is to clean up the generated data using the following command:

    bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
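When the benchmark is repeated many times, it is easier to extract the numeric fields from the TestDFSIO summary than to copy them by hand. The following sketch is one possible way to do that with a regular expression; the function name parse_testdfsio is hypothetical, and the sample text is an excerpt of the two outputs shown above with the logger prefix removed.

```python
import re

# Numeric summary fields printed by TestDFSIO in the sample outputs above.
FIELD_PATTERN = re.compile(
    r"(Number of files|Total MBytes processed|Throughput mb/sec|"
    r"Average IO rate mb/sec|IO rate std deviation|Test exec time sec):\s*([0-9.]+)"
)


def parse_testdfsio(log_text: str) -> dict:
    """Map each summary field name to its numeric value."""
    return {name: float(value) for name, value in FIELD_PATTERN.findall(log_text)}


write_log = """Number of files: 10
Total MBytes processed: 100
Throughput mb/sec: 5.680527152919791
Test exec time sec: 17.568"""
read_log = """Number of files: 10
Total MBytes processed: 100
Throughput mb/sec: 70.57163020465772
Test exec time sec: 15.51"""

write_stats = parse_testdfsio(write_log)
read_stats = parse_testdfsio(read_log)

# The total size should equal nrFiles x fileSize (10 files of 10 MB each).
expected_mb = 10 * 10
print(write_stats["Total MBytes processed"] == expected_mb)
print(read_stats["Throughput mb/sec"] / write_stats["Throughput mb/sec"])
```

The last line gives the ratio of read to write throughput for this sample run, which is one simple way to compare the two operations.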

Appendix E: Data Gathering for TeraSort

1. Hadoop Physical Cluster

Machines    Dataset  Phase      Test 1   Test 2   Test 3   Mean
3           100 MB   Map        6        7        6        6.33
3           100 MB   Shuffling  10       10       11       10.33
3           100 MB   Reduce     5        5        4        4.67
3           1 GB     Map        13       14       16       14.33
3           1 GB     Shuffling  83       81       85       83.00
3           1 GB     Reduce     99       77       93       89.67
3           10 GB    Map        31       22       19       24.00
3           10 GB    Shuffling  1065     921      930      972.00
3           10 GB    Reduce     1511     1841     1679     1677.00
3           30 GB    Map        26       25       28       26.33
3           30 GB    Shuffling  2971     2312     3081     2788.00
3           30 GB    Reduce     9522     7544     8434     8500.00

4           100 MB   Map        5        7        7        6.33
4           100 MB   Shuffling  10       10       10       10.00
4           100 MB   Reduce     4        6        4        4.67
4           1 GB     Map        14       16       15       15.00
4           1 GB     Shuffling  81       79       80       80.00
4           1 GB     Reduce     99       87       82       89.33
4           10 GB    Map        19       21       19       19.67
4           10 GB    Shuffling  951      921      881      917.67
4           10 GB    Reduce     1680     1714     1421     1605.00
4           30 GB    Map        21       22       20       21.00
4           30 GB    Shuffling  2860     2912     3120     2964.00
4           30 GB    Reduce     5908     6412     6109     6143.00

5           100 MB   Map        5        6        6        5.67
5           100 MB   Shuffling  10       10       10       10.00
5           100 MB   Reduce     5        5        5        5.00
5           1 GB     Map        14       14       13       13.67
5           1 GB     Shuffling  79       87       80       82.00
5           1 GB     Reduce     84       93       83       86.67
5           10 GB    Map        21       19       20       20.00
5           10 GB    Shuffling  937      900      857      898.00
5           10 GB    Reduce     1729     1611     1360     1566.67
5           30 GB    Map        19       23       22       21.33
5           30 GB    Shuffling  2446     2710     2650     2602.00
5           30 GB    Reduce     5437     7118     6821     6458.67

6           100 MB   Map        6        6        5        5.67
6           100 MB   Shuffling  10       10       11       10.33
6           100 MB   Reduce     4        4        5        4.33
6           1 GB     Map        16       18       14       16.00
6           1 GB     Shuffling  86       89       83       86.00
6           1 GB     Reduce     91       75       73       79.67
6           10 GB    Map        18       18       16       17.33
6           10 GB    Shuffling  885      929      906      906.67
6           10 GB    Reduce     1147     1515     1097     1253.00
6           30 GB    Map        20       19       20       19.67
6           30 GB    Shuffling  2731     2694     2725     2716.67
6           30 GB    Reduce     6419     6210     5877     6168.67

7           100 MB   Map        6        5        6        5.67
7           100 MB   Shuffling  10       10       10       10.00
7           100 MB   Reduce     5        5        4        4.67
7           1 GB     Map        16       18       12       15.33
7           1 GB     Shuffling  83       81       87       83.67
7           1 GB     Reduce     85       83       80       82.67
7           10 GB    Map        23       27       25       25.00
7           10 GB    Shuffling  985      910      979      958.00
7           10 GB    Reduce     1681     1591     1514     1595.33
7           30 GB    Map        37       23       40       33.33
7           30 GB    Shuffling  2983     2796     2882     2887.00
7           30 GB    Reduce     6514     5891     5338     5914.33

8           100 MB   Map        5        5        5        5.00
8           100 MB   Shuffling  10       10       10       10.00
8           100 MB   Reduce     5        4        5        4.67
8           1 GB     Map        15       11       10       12.00
8           1 GB     Shuffling  92       91       88       90.33
8           1 GB     Reduce     80       76       75       77.00
8           10 GB    Map        20       25       29       24.67
8           10 GB    Shuffling  925      1020     893      946.00
8           10 GB    Reduce     1043     1679     2092     1604.67
8           30 GB    Map        27       24       30       27.00
8           30 GB    Shuffling  2812     2777     2834     2807.67
8           30 GB    Reduce     5319     6317     5395     5677.00
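The Mean column in the table above is the arithmetic mean of the three test runs, rounded to two decimals. The short sketch below reproduces that calculation for two rows of the 3-machine physical cluster; run_mean is a hypothetical helper name, and the input values are copied from the table.

```python
def run_mean(runs):
    """Arithmetic mean of the test runs, rounded to two decimals."""
    return round(sum(runs) / len(runs), 2)


# 3 machines, 100 MB, Map phase: Test 1..3 = 6, 7, 6
print(run_mean([6, 7, 6]))
# 3 machines, 30 GB, Reduce phase: Test 1..3 = 9522, 7544, 8434
print(run_mean([9522, 7544, 8434]))
```

The same function can be used to cross-check any other row, since every group reports three runs per phase.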

2. Hadoop Virtualized Cluster - KVM

KVM VMs     Dataset  Phase      Test 1   Test 2   Test 3   Mean
3           100 MB   Map        4        6        5        5
3           100 MB   Shuffling  7        7        7        7
3           100 MB   Reduce     3        3        3        3
3           1 GB     Map        12       14       12       12.67
3           1 GB     Shuffling  37       37       38       37.33
3           1 GB     Reduce     41       41       40       40.67
3           10 GB    Map        24       20       23       22.33
3           10 GB    Shuffling  781      737      718      745.33
3           10 GB    Reduce     336      345      392      357.67
3           30 GB    Map        24       24       23       23.67
3           30 GB    Shuffling  2150     2220     2172     2180.67
3           30 GB    Reduce     1559     1542     1539     1546.67

4           100 MB   Map        5        5        5        5.00
4           100 MB   Shuffling  6        7        7        6.67
4           100 MB   Reduce     3        3        3        3.00
4           1 GB     Map        12       15       16       14.33
4           1 GB     Shuffling  28       34       38       33.33
4           1 GB     Reduce     38       40       40       39.33
4           10 GB    Map        28       29       23       26.67
4           10 GB    Shuffling  657      672      657      662.00
4           10 GB    Reduce     438      442      419      433.00
4           100 GB   Map        25       28       25       26.00
4           100 GB   Shuffling  1952     2046     1887     1961.67
4           100 GB   Reduce     1616     1517     1605     1579.33

5           100 MB   Map        5        5        5        5.00
5           100 MB   Shuffling  6        7        6        6.33
5           100 MB   Reduce     3        3        3        3.00
5           1 GB     Map        61       64       85       70.00
5           1 GB     Shuffling  113      109      139      120.33
5           1 GB     Reduce     51       41       42       44.67
5           10 GB    Map        33       29       32       31.33
5           10 GB    Shuffling  746      632      877      751.67
5           10 GB    Reduce     445      477      358      426.67
5           100 GB   Map        37       66       51       51.33
5           100 GB   Shuffling  3446     3332     2816     3198.00
5           100 GB   Reduce     1413     1597     1788     1599.33

6           100 MB   Map        5        5        4        4.67
6           100 MB   Shuffling  6        6        6        6.00
6           100 MB   Reduce     3        4        4        3.67
6           1 GB     Map        224      343      266      277.67
6           1 GB     Shuffling  511      464      492      489.00
6           1 GB     Reduce     56       48       63       55.67
6           10 GB    Map        45       37       42       41.33
6           10 GB    Shuffling  1652     1387     1745     1594.67
6           10 GB    Reduce     404      412      532      449.33
6           100 GB   Map        140      180      50       123.33
6           100 GB   Shuffling  7402     10197    5710     7769.67
6           100 GB   Reduce     1717     1565     1206     1496.00

7           100 MB   Map        5        5        5        5.00
7           100 MB   Shuffling  6        6        6        6.00
7           100 MB   Reduce     4        3        3        3.33
7           1 GB     Map        124      245      365      244.67
7           1 GB     Shuffling  1083     958      1344     1128.33
7           1 GB     Reduce     102      121      81       101.33
7           10 GB    Map        61       63       58       60.67
7           10 GB    Shuffling  1024     1984     2062     1690.00
7           10 GB    Reduce     985      1101     1024     1036.67
7           100 GB   Map        185      163      154      167.33
7           100 GB   Shuffling  12112    10197    12024    11444.33
7           100 GB   Reduce     1987     1851     2106     1981.33

8           100 MB   Map        5        5        5        5.00
8           100 MB   Shuffling  6        6        6        6.00
8           100 MB   Reduce     4        3        3        3.33
8           1 GB     Map        162      193      167      174.00
8           1 GB     Shuffling  1201     1320     1259     1260.00
8           1 GB     Reduce     545.4    244.42   163.62   317.81
8           10 GB    Map        104      121      97       107.33
8           10 GB    Shuffling  2489.52  2440.32  2536.26  2488.70
8           10 GB    Reduce     1211.55  1354.23  2283.52  1616.43
8           100 GB   Map        201      195      168      188
8           100 GB   Shuffling  11087    14587    13214    12962.667
8           100 GB   Reduce     3088     3145     2906     3046.3333

3. Hadoop Virtualized Cluster - VMware ESXi

| Number of VMware VMs | Dataset Size | Phase | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 3 | 100 MB | Map | 5 | 5 | 5 | 5 |
| | | Shuffling | 8 | 7 | 7 | 7 |
| | | Reduce | 4 | 4 | 4 | 4 |
| | 1 GB | Map | 18 | 16 | 16 | 17 |
| | | Shuffling | 42 | 49 | 41 | 44 |
| | | Reduce | 40 | 38 | 39 | 39 |
| | 10 GB | Map | 24 | 22 | 23 | 23 |
| | | Shuffling | 660 | 636 | 645 | 647 |
| | | Reduce | 492 | 483 | 493 | 489 |
| | 30 GB | Map | 44 | 44 | 43 | 44 |
| | | Shuffling | 4108 | 3952 | 3891 | 3984 |
| | | Reduce | 2278 | 2315 | 2101 | 2231 |
| 4 | 100 MB | Map | 5 | 5 | 5 | 5 |
| | | Shuffling | 7 | 7 | 8 | 7 |
| | | Reduce | 4 | 4 | 4 | 4 |
| | 1 GB | Map | 19 | 15 | 15 | 16 |
| | | Shuffling | 38 | 39 | 42 | 40 |
| | | Reduce | 42 | 41 | 40 | 41 |
| | 10 GB | Map | 25 | 24 | 25 | 24.67 |
| | | Shuffling | 672 | 691 | 682 | 682 |
| | | Reduce | 486 | 425 | 411 | 440.67 |
| | 30 GB | Map | 35 | 51 | 43 | 43 |
| | | Shuffling | 2657 | 3257 | 3214 | 3042.67 |
| | | Reduce | 1985 | 1852 | 1865 | 1900.67 |
| 5 | 100 MB | Map | 7 | 5 | 5 | 6 |
| | | Shuffling | 8 | 7 | 7 | 7 |
| | | Reduce | 4 | 3 | 3 | 3 |
| | 1 GB | Map | 19 | 21 | 18 | 19 |
| | | Shuffling | 35 | 30 | 32 | 32 |
| | | Reduce | 39 | 35 | 37 | 37 |
| | 10 GB | Map | 31 | 26 | 28 | 28 |
| | | Shuffling | 553 | 514 | 503 | 523 |
| | | Reduce | 418 | 432 | 421 | 424 |
| | 30 GB | Map | 39 | 36 | 45 | 40 |
| | | Shuffling | 2540 | 2412 | 2286 | 2413 |
| | | Reduce | 2310 | 2245 | 2101 | 2219 |
| 6 | 100 MB | Map | 5 | 6 | 5 | 5 |
| | | Shuffling | 7 | 7 | 6 | 7 |
| | | Reduce | 5 | 4 | 4 | 4 |
| | 1 GB | Map | 18 | 18 | 17 | 18 |
| | | Shuffling | 28 | 29 | 27 | 28 |
| | | Reduce | 32 | 29 | 34 | 32 |
| | 10 GB | Map | 59 | 42 | 41 | 47 |
| | | Shuffling | 536 | 552 | 529 | 539 |
| | | Reduce | 369 | 385 | 336 | 363 |
| | 30 GB | Map | 30 | 32 | 28 | 30 |
| | | Shuffling | 2412 | 2254 | 2114 | 2260 |
| | | Reduce | 2098 | 1671 | 1658 | 1809 |
| 7 | 100 MB | Map | 10 | 10 | 8 | 9 |
| | | Shuffling | 12 | 11 | 8 | 10 |
| | | Reduce | 4 | 4 | 4 | 4 |
| | 1 GB | Map | 24 | 29 | 26 | 26 |
| | | Shuffling | 35 | 32 | 39 | 35 |
| | | Reduce | 26 | 34 | 25 | 28 |
| | 10 GB | Map | 52 | 56 | 52 | 53 |
| | | Shuffling | 536 | 520 | 511 | 522 |
| | | Reduce | 298 | 290 | 302 | 297 |
| | 30 GB | Map | 84 | 76 | 87 | 82 |
| | | Shuffling | 3210 | 2687 | 2968 | 2955 |
| | | Reduce | 1743 | 1523 | 1621 | 1629 |
| 8 | 100 MB | Map | 17 | 16 | 11 | 15 |
| | | Shuffling | 15 | 16 | 14 | 15 |
| | | Reduce | 4 | 4 | 4 | 4 |
| | 1 GB | Map | 81 | 79 | 81 | 80 |
| | | Shuffling | 92 | 93 | 82 | 89 |
| | | Reduce | 36 | 36 | 37 | 36 |
| | 10 GB | Map | 128 | 102 | 127 | 119 |
| | | Shuffling | 1340 | 1102 | 1021 | 1154 |
| | | Reduce | 509 | 562 | 554 | 542 |
| | 30 GB | Map | 144 | 137 | 142 | 141 |
| | | Shuffling | 4481 | 4251 | 4012 | 4248 |
| | | Reduce | 1753 | 1578 | 1697 | 1676 |
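The Mean column in the tables above appears to be the arithmetic mean of the three test runs, rounded to two decimals (or to a whole number in the VMware ESXi table). A minimal sketch of that calculation follows, assuming plain averaging; the helper name `run_mean` is hypothetical and not part of the original study.

```python
from statistics import mean


def run_mean(test1, test2, test3, digits=2):
    """Arithmetic mean of three test runs, rounded to `digits` decimals.

    Assumption: the Mean column is a simple average of Test 1..3.
    """
    return round(mean([test1, test2, test3]), digits)


# KVM, 3 VMs, 1 GB, Map phase: runs of 12, 14 and 12
print(run_mean(12, 14, 12))      # 12.67
# KVM, 3 VMs, 10 GB, Shuffling phase: runs of 781, 737 and 718
print(run_mean(781, 737, 718))   # 745.33
```

Both results match the Mean values reported for those rows in the KVM table.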

Appendix F: Data Gathering for TestDFSIO

1. Hadoop Physical Cluster

Number of Nodes = 3

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.867 | 2.861 | 2.421 | 2.72 |
| | | Average IO rate (mb/sec) | 2.903 | 2.957 | 2.517 | 2.79 |
| | | IO rate standard deviation | 0.363 | 0.505 | 0.486 | 0.45 |
| | | Execution time (sec) | 17.786 | 16.717 | 18.8 | 17.77 |
| | Read | Throughput (mb/sec) | 7.645 | 6.309 | 6.558 | 6.84 |
| | | Average IO rate (mb/sec) | 19.509 | 11.442 | 31.255 | 20.74 |
| | | IO rate standard deviation | 26.167 | 14.595 | 40.655 | 27.14 |
| | | Execution time (sec) | 14.72 | 16.721 | 14.705 | 15.38 |
| 1 GB | Write | Throughput (mb/sec) | 2.507 | 2.713 | 2.204 | 2.47 |
| | | Average IO rate (mb/sec) | 2.889 | 2.866 | 2.498 | 2.75 |
| | | IO rate standard deviation | 1.2632 | 0.765 | 0.929 | 0.99 |
| | | Execution time (sec) | 77.129 | 74.47 | 83.658 | 78.42 |
| | Read | Throughput (mb/sec) | 6.037 | 7.297 | 5.068 | 6.13 |
| | | Average IO rate (mb/sec) | 10.231 | 31.235 | 8.779 | 16.75 |
| | | IO rate standard deviation | 10.784 | 39.149 | 9.712 | 19.88 |
| | | Execution time (sec) | 43.468 | 35.947 | 42.779 | 40.73 |
| 10 GB | Write | Throughput (mb/sec) | 2.503 | 2.589 | 3.288 | 2.79 |
| | | Average IO rate (mb/sec) | 2.671 | 2.761 | 3.318 | 2.92 |
| | | IO rate standard deviation | 0.796 | 0.817 | 0.317 | 0.64 |
| | | Execution time (sec) | 674.535 | 641.232 | 363.144 | 559.64 |
| | Read | Throughput (mb/sec) | 7.956 | 7.799 | 5.458 | 7.07 |
| | | Average IO rate (mb/sec) | 11.289 | 12.452 | 5.786 | 9.84 |
| | | IO rate standard deviation | 6.421 | 12.916 | 1.485 | 6.94 |
| | | Execution time (sec) | 241.896 | 296.722 | 257.708 | 265.44 |
| 100 GB | Write | Throughput (mb/sec) | 3.544 | 3.275 | 3.275 | 3.36 |
| | | Average IO rate (mb/sec) | 3.546 | 3.284 | 3.282 | 3.37 |
| | | IO rate standard deviation | 0.089 | 0.165 | 0.148 | 0.13 |
| | | Execution time (sec) | 3315.61 | 3343.122 | 3338.37 | 3332.37 |
| | Read | Throughput (mb/sec) | 4.746 | 5.109 | 5.603 | 5.15 |
| | | Average IO rate (mb/sec) | 4.791 | 5.238 | 12.875 | 7.63 |
| | | IO rate standard deviation | 0.478 | 0.852 | 18.333 | 6.55 |
| | | Execution time (sec) | 2387.659 | 2467.634 | 2734.46 | 2529.92 |

Number of Nodes = 4

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.65 | 3.303 | 3.639 | 3.20 |
| | | Average IO rate (mb/sec) | 2.661 | 3.543 | 3.932 | 3.38 |
| | | IO rate standard deviation | 0.173 | 0.796 | 1.212 | 0.73 |
| | | Execution time (sec) | 17.665 | 17.039 | 15.674 | 16.79 |
| | Read | Throughput (mb/sec) | 6.405 | 9.038 | 5.827 | 7.09 |
| | | Average IO rate (mb/sec) | 19.631 | 31.433 | 19.547 | 23.54 |
| | | IO rate standard deviation | 28.056 | 37.351 | 29.466 | 31.62 |
| | | Execution time (sec) | 15.35 | 13.684 | 14.676 | 14.57 |
| 1 GB | Write | Throughput (mb/sec) | 2.556 | 2.79 | 2.786 | 2.71 |
| | | Average IO rate (mb/sec) | 2.669 | 2.954 | 2.885 | 2.84 |
| | | IO rate standard deviation | 0.582 | 0.747 | 0.536 | 0.62 |
| | | Execution time (sec) | 59.677 | 58.02 | 61.536 | 59.74 |
| | Read | Throughput (mb/sec) | 12.133 | 6.031 | 8.264 | 8.81 |
| | | Average IO rate (mb/sec) | 27.419 | 7.751 | 25.182 | 20.12 |
| | | IO rate standard deviation | 34.001 | 4.998 | 35.168 | 24.72 |
| | | Execution time (sec) | 33.087 | 40.004 | 32.861 | 35.32 |
| 10 GB | Write | Throughput (mb/sec) | 3.713 | 3.325 | 3.201 | 3.41 |
| | | Average IO rate (mb/sec) | 3.735 | 3.341 | 3.22 | 3.43 |
| | | IO rate standard deviation | 0.283 | 0.236 | 0.248 | 0.26 |
| | | Execution time (sec) | 315.636 | 347.593 | 367.294 | 343.51 |
| | Read | Throughput (mb/sec) | 5.045 | 5.738 | 5.006 | 5.26 |
| | | Average IO rate (mb/sec) | 5.205 | 11.437 | 5.138 | 7.26 |
| | | IO rate standard deviation | 1.779 | 15.779 | 0.884 | 6.15 |
| | | Execution time (sec) | 258.009 | 261.24 | 276.283 | 265.18 |
| 100 GB | Write | Throughput (mb/sec) | 3.533 | 3.354 | 3.366 | 3.42 |
| | | Average IO rate (mb/sec) | 3.538 | 3.356 | 3.37 | 3.42 |
| | | IO rate standard deviation | 0.136 | 0.085 | 0.111 | 0.11 |
| | | Execution time (sec) | 3557.813 | 3507.078 | 3184.76 | 3416.5 |
| | Read | Throughput (mb/sec) | 7.009 | 6.716 | 4.349 | 6.02 |
| | | Average IO rate (mb/sec) | 7.6966 | 12.229 | 10.179 | 10.03 |
| | | IO rate standard deviation | 5.129 | 9.796 | 12.546 | 9.16 |
| | | Execution time (sec) | 2098.422 | 2700.035 | 3046.01 | 2614.8 |

Number of Nodes = 5

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.597 | 2.791 | 3.804 | 3.06 |
| | | Average IO rate (mb/sec) | 2.623 | 2.841 | 3.941 | 3.14 |
| | | IO rate standard deviation | 0.28 | 0.406 | 0.772 | 0.49 |
| | | Execution time (sec) | 16.672 | 15.708 | 16.679 | 16.35 |
| | Read | Throughput (mb/sec) | 8.019 | 7.213 | 10.68 | 8.64 |
| | | Average IO rate (mb/sec) | 32.097 | 35.821 | 46.053 | 37.99 |
| | | IO rate standard deviation | 40.452 | 47.846 | 40.287 | 42.86 |
| | | Execution time (sec) | 14.584 | 14.501 | 14.579 | 14.55 |
| 1 GB | Write | Throughput (mb/sec) | 2.477 | 2.896 | 2.676 | 2.68 |
| | | Average IO rate (mb/sec) | 2.572 | 3.032 | 2.757 | 2.79 |
| | | IO rate standard deviation | 0.533 | 0.64 | 0.533 | 0.57 |
| | | Execution time (sec) | 59.372 | 56.319 | 54.271 | 56.65 |
| | Read | Throughput (mb/sec) | 7.659 | 5.617 | 8.738 | 7.34 |
| | | Average IO rate (mb/sec) | 11.029 | 8.651 | 25.868 | 15.18 |
| | | IO rate standard deviation | 7.049 | 8.984 | 41.954 | 19.33 |
| | | Execution time (sec) | 36.214 | 35.04 | 30.18 | 33.81 |
| 10 GB | Write | Throughput (mb/sec) | 3.309 | 3.337 | 3.382 | 3.34 |
| | | Average IO rate (mb/sec) | 3.329 | 3.367 | 3.415 | 3.37 |
| | | IO rate standard deviation | 0.264 | 0.335 | 0.331 | 0.31 |
| | | Execution time (sec) | 346.239 | 340.622 | 361.257 | 349.37 |
| | Read | Throughput (mb/sec) | 6.309 | 5.741 | 4.771 | 5.61 |
| | | Average IO rate (mb/sec) | 9.178 | 13.109 | 4.839 | 9.04 |
| | | IO rate standard deviation | 9.178 | 23.064 | 0.609 | 10.95 |
| | | Execution time (sec) | 263.224 | 256.89 | 254.85 | 258.32 |
| 100 GB | Write | Throughput (mb/sec) | 3.103 | 3.081 | 3.343 | 3.18 |
| | | Average IO rate (mb/sec) | 3.115 | 3.092 | 3.349 | 3.19 |
| | | IO rate standard deviation | 0.191 | 0.183 | 0.1386 | 0.17 |
| | | Execution time (sec) | 3552.118 | 3478.991 | 3177.001 | 3402.70 |
| | Read | Throughput (mb/sec) | 4.737 | 5.198 | 4.478 | 4.80 |
| | | Average IO rate (mb/sec) | 6.078 | 5.292 | 4.512 | 5.29 |
| | | IO rate standard deviation | 2462.739 | 2243.086 | 2558.051 | 2421.29 |
| | | Execution time (sec) | 2.597 | 2.791 | 3.804 | 3.06 |

Number of Nodes = 6

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 3.603 | 4.19 | 3.536 | 3.78 |
| | | Average IO rate (mb/sec) | 3.726 | 4.329 | 3.949 | 4.00 |
| | | IO rate standard deviation | 0.714 | 0.865 | 1.337 | 0.97 |
| | | Execution time (sec) | 16.679 | 15.703 | 15.792 | 16.06 |
| | Read | Throughput (mb/sec) | 7.017 | 6.681 | 8.877 | 7.53 |
| | | Average IO rate (mb/sec) | 37.162 | 24.101 | 33.456 | 31.57 |
| | | IO rate standard deviation | 49.103 | 36.636 | 41.63 | 42.46 |
| | | Execution time (sec) | 14.165 | 13.942 | 14.833 | 14.31 |
| 1 GB | Write | Throughput (mb/sec) | 3.089 | 3.155 | 3.0178 | 3.09 |
| | | Average IO rate (mb/sec) | 3.369 | 3.239 | 3.088 | 3.23 |
| | | IO rate standard deviation | 1.13 | 0.595 | 0.522 | 0.75 |
| | | Execution time (sec) | 55.472 | 51.491 | 51.749 | 52.90 |
| | Read | Throughput (mb/sec) | 7.809 | 7.593 | 5.651 | 7.02 |
| | | Average IO rate (mb/sec) | 8.239 | 20.751 | 6.391 | 11.79 |
| | | IO rate standard deviation | 1.988 | 34.499 | 2.169 | 12.89 |
| | | Execution time (sec) | 33.23 | 34.396 | 32.392 | 33.34 |
| 10 GB | Write | Throughput (mb/sec) | 3.366 | 3.133 | 3.782 | 3.43 |
| | | Average IO rate (mb/sec) | 3.386 | 3.139 | 3.796 | 3.44 |
| | | IO rate standard deviation | 0.267 | 0.14 | 0.229 | 0.21 |
| | | Execution time (sec) | 347.497 | 353.804 | 297.105 | 332.80 |
| | Read | Throughput (mb/sec) | 5.681 | 6.327 | 14.756 | 8.92 |
| | | Average IO rate (mb/sec) | 10.302 | 14.173 | 27.573 | 17.35 |
| | | IO rate standard deviation | 13.222 | 18.079 | 22.233 | 17.84 |
| | | Execution time (sec) | 269.214 | 270.797 | 176.225 | 238.75 |
| 100 GB | Write | Throughput (mb/sec) | 3.343 | 3.252 | 3.268 | 3.29 |
| | | Average IO rate (mb/sec) | 3.352 | 3.26 | 3.275 | 3.30 |
| | | IO rate standard deviation | 0.178 | 0.173 | 6.127 | 2.16 |
| | | Execution time (sec) | 3254.674 | 3329.312 | 3313.773 | 3299.25 |
| | Read | Throughput (mb/sec) | 5.435 | 5.169 | 6.126 | 5.58 |
| | | Average IO rate (mb/sec) | 7.827 | 5.465 | 11.738 | 8.34 |
| | | IO rate standard deviation | 8.045 | 3.505 | 13.987 | 8.51 |
| | | Execution time (sec) | 2369.118 | 2481.531 | 2168.304 | 2339.65 |
| 300 GB | Write | Throughput (mb/sec) | 3.603 | 4.19 | 3.536 | 3.78 |
| | | Average IO rate (mb/sec) | 3.726 | 4.329 | 3.949 | 4.00 |
| | | IO rate standard deviation | 0.714 | 0.865 | 1.337 | 0.97 |
| | | Execution time (sec) | 16.679 | 15.703 | 15.792 | 16.06 |
| | Read | Throughput (mb/sec) | 7.017 | 6.681 | 8.877 | 7.53 |
| | | Average IO rate (mb/sec) | 37.162 | 24.101 | 33.456 | 31.57 |
| | | IO rate standard deviation | 49.103 | 36.636 | 41.63 | 42.46 |
| | | Execution time (sec) | 14.165 | 13.942 | 14.833 | 14.31 |

Number of Nodes = 7

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 3.475 | 3.028 | 3.475 | 3.33 |
| | | Average IO rate (mb/sec) | 3.928 | 3.263 | 3.928 | 3.71 |
| | | IO rate standard deviation | 1.605 | 0.905 | 1.605 | 1.37 |
| | | Execution time (sec) | 15.793 | 15.679 | 15.793 | 15.76 |
| | Read | Throughput (mb/sec) | 9.034 | 6.669 | 9.034 | 8.25 |
| | | Average IO rate (mb/sec) | 29.731 | 14.642 | 29.731 | 24.70 |
| | | IO rate standard deviation | 35.058 | 25.436 | 35.058 | 31.85 |
| | | Execution time (sec) | 14.071 | 14.688 | 14.071 | 14.28 |
| 1 GB | Write | Throughput (mb/sec) | 3.771 | 3.837 | 3.203 | 3.60 |
| | | Average IO rate (mb/sec) | 3.814 | 3.887 | 3.509 | 3.74 |
| | | IO rate standard deviation | 0.402 | 0.441 | 1.118 | 0.65 |
| | | Execution time (sec) | 44.285 | 41.408 | 52.404 | 46.03 |
| | Read | Throughput (mb/sec) | 6.069 | 6.664 | 6.644 | 6.46 |
| | | Average IO rate (mb/sec) | 13.227 | 19.04 | 7.797 | 13.35 |
| | | IO rate standard deviation | 19.929 | 38.007 | 3.689 | 20.54 |
| | | Execution time (sec) | 42.883 | 37.004 | 32.181 | 37.36 |
| 10 GB | Write | Throughput (mb/sec) | 3.377 | 3.548 | 3.636 | 3.52 |
| | | Average IO rate (mb/sec) | 3.395 | 3.568 | 3.646 | 3.54 |
| | | IO rate standard deviation | 0.248 | 0.28 | 0.194 | 0.24 |
| | | Execution time (sec) | 342.034 | 313.38 | 311.647 | 322.35 |
| | Read | Throughput (mb/sec) | 5.909 | 7.832 | 7.661 | 7.13 |
| | | Average IO rate (mb/sec) | 8.364 | 18.227 | 14.755 | 13.78 |
| | | IO rate standard deviation | 6.168 | 22.808 | 17.955 | 15.64 |
| | | Execution time (sec) | 273.925 | 238.699 | 242.805 | 251.81 |
| 100 GB | Write | Throughput (mb/sec) | 2.698 | 3.49 | 3.609 | 3.27 |
| | | Average IO rate (mb/sec) | 2.77 | 3.493 | 3.611 | 3.29 |
| | | IO rate standard deviation | 0.508 | 0.083 | 0.075 | 0.22 |
| | | Execution time (sec) | 2987.432 | 2972.533 | 2849.33 | 2936.43 |
| | Read | Throughput (mb/sec) | 3.9676 | 4.499 | 4.804 | 4.42 |
| | | Average IO rate (mb/sec) | 6.569 | 9.992 | 6.072 | 7.54 |
| | | IO rate standard deviation | 8.425 | 14.613 | 3.837 | 8.96 |
| | | Execution time (sec) | 1846.735 | 1952.279 | 2653.414 | 2150.81 |
| 300 GB | Write | Throughput (mb/sec) | 3.475 | 3.028 | 3.475 | 3.33 |
| | | Average IO rate (mb/sec) | 3.928 | 3.263 | 3.928 | 3.71 |
| | | IO rate standard deviation | 1.605 | 0.905 | 1.605 | 1.37 |
| | | Execution time (sec) | 15.793 | 15.679 | 15.793 | 15.76 |
| | Read | Throughput (mb/sec) | 9.034 | 6.669 | 9.034 | 8.25 |
| | | Average IO rate (mb/sec) | 29.731 | 14.642 | 29.731 | 24.70 |
| | | IO rate standard deviation | 35.058 | 25.436 | 35.058 | 31.85 |
| | | Execution time (sec) | 14.071 | 14.688 | 14.071 | 14.28 |

Number of Nodes = 8

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 3.229 | 2.197 | 1.828 | 3.51 |
| | | Average IO rate (mb/sec) | 3.485 | 2.235 | 2.198 | 3.78 |
| | | IO rate standard deviation | 1.161 | 0.324 | 1.507 | 1.2115 |
| | | Execution time (sec) | 15.854 | 16.137 | 16.931 | 15.7645 |
| | Read | Throughput (mb/sec) | 5.521 | 5.721 | 5.361 | 7.767 |
| | | Average IO rate (mb/sec) | 18.754 | 18.857 | 16.446 | 30.20 |
| | | IO rate standard deviation | 40.071 | 40.038 | 32.486 | 41.78 |
| | | Execution time (sec) | 14.75 | 15.846 | 15.278 | 14.72 |
| 1 GB | Write | Throughput (mb/sec) | 3.977 | 3.701 | 4.010 | 3.82 |
| | | Average IO rate (mb/sec) | 4.067 | 3.826 | 4.061 | 3.91 |
| | | IO rate standard deviation | 0.614 | 0.758 | 0.473 | 0.63 |
| | | Execution time (sec) | 38.578 | 46.594 | 43.285 | 43.11 |
| | Read | Throughput (mb/sec) | 6.079 | 6.579 | 5.672 | 6.49 |
| | | Average IO rate (mb/sec) | 14.054 | 36.377 | 14.545 | 20.03 |
| | | IO rate standard deviation | 24.771 | 67.854 | 26.687 | 35.06 |
| | | Execution time (sec) | 40.959 | 40.594 | 42.45 | 39.94 |
| 10 GB | Write | Throughput (mb/sec) | 3.718 | 3.441 | 3.432 | 3.52 |
| | | Average IO rate (mb/sec) | 3.733 | 3.466 | 3.460 | 3.54 |
| | | IO rate standard deviation | 0.244 | 0.306 | 0.320 | 0.23 |
| | | Execution time (sec) | 305.018 | 332.807 | 337.531 | 323.16 |
| | Read | Throughput (mb/sec) | 7.692 | 8.208 | 6.057 | 6.43 |
| | | Average IO rate (mb/sec) | 14.645 | 16.074 | 13.983 | 11.94 |
| | | IO rate standard deviation | 12.945 | 16.939 | 19.665 | 12.82 |
| | | Execution time (sec) | 292.329 | 230.154 | 289.889 | 286.00 |
| 100 GB | Write | Throughput (mb/sec) | 3.208 | 3.568 | 3.592 | 3.54 |
| | | Average IO rate (mb/sec) | 3.216 | 3.571 | 3.595 | 3.55 |
| | | IO rate standard deviation | 0.154 | 0.086 | 0.095 | 0.13 |
| | | Execution time (sec) | 3386.666 | 2908.484 | 2918.7 | 2988.4 |
| | Read | Throughput (mb/sec) | 4.959 | 4.648 | 5.303 | 5.23 |
| | | Average IO rate (mb/sec) | 4.941 | 4.694 | 6.544 | 7.66 |
| | | IO rate standard deviation | 1.292 | 0.495 | 4.823 | 6.73 |
| | | Execution time (sec) | 2508.348 | 2343.769 | 2358.37 | 2471.1 |
| 300 GB | Write | Throughput (mb/sec) | 2.641 | 2.629 | 2.638 | 2.63 |
| | | Average IO rate (mb/sec) | 2.657 | 2.644 | 2.654 | 2.65 |
| | | IO rate standard deviation | 0.190 | 0.178 | 0.196 | 0.19 |
| | | Execution time (sec) | 7882.599 | 7862.87 | 8000.19 | 7917.8 |
| | Read | Throughput (mb/sec) | 4.202 | 4.111 | 4.307 | 4.28 |
| | | Average IO rate (mb/sec) | 5.188 | 5.044 | 8.211 | 6.04 |
| | | IO rate standard deviation | 4.432 | 3.660 | 12.171 | 6.15 |
| | | Execution time (sec) | 5386.197 | 5796.385 | 5551.46 | 5546.3 |

2. Hadoop Virtualized Cluster - KVM

Number of KVM VMs = 3

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 6.804 | 9.417 | 6.152 | 7.39 |
| | | Average IO rate (mb/sec) | 11.989 | 20.359 | 15.808 | 7.71 |
| | | IO rate standard deviation | 9.399 | 15.769 | 16.049 | 1.63 |
| | | Execution time (sec) | 15.439 | 15.405 | 13.405 | 14.75 |
| | Read | Throughput (mb/sec) | 101.833 | 96.618 | 104.275 | 16.12 |
| | | Average IO rate (mb/sec) | 102.428 | 102.200 | 105.770 | 16.88 |
| | | IO rate standard deviation | 7.777 | 22.154 | 12.986 | 4.15 |
| | | Execution time (sec) | 13.479 | 14.881 | 13.399 | 13.92 |
| 1 GB | Write | Throughput (mb/sec) | 7.764 | 7.231 | 8.201 | 7.73 |
| | | Average IO rate (mb/sec) | 9.173 | 7.397 | 11.249 | 9.27 |
| | | IO rate standard deviation | 3.808 | 1.115 | 6.539 | 3.82 |
| | | Execution time (sec) | 40.681 | 40.126 | 38.515 | 39.77 |
| | Read | Throughput (mb/sec) | 22.912 | 19.046 | 25.187 | 22.38 |
| | | Average IO rate (mb/sec) | 30.131 | 19.969 | 40.464 | 30.19 |
| | | IO rate standard deviation | 16.609 | 4.422 | 34.112 | 18.38 |
| | | Execution time (sec) | 20.441 | 20.518 | 19.44 | 20.13 |
| 10 GB | Write | Throughput (mb/sec) | 7.409 | 7.429 | 7.323 | 7.39 |
| | | Average IO rate (mb/sec) | 7.837 | 7.68 | 7.616 | 7.71 |
| | | IO rate standard deviation | 1.917 | 1.43 | 1.55 | 1.63 |
| | | Execution time (sec) | 283.681 | 283.554 | 288.894 | 285.38 |
| | Read | Throughput (mb/sec) | 15.179 | 16.526 | 16.663 | 16.12 |
| | | Average IO rate (mb/sec) | 15.23 | 17.456 | 17.96 | 16.88 |
| | | IO rate standard deviation | 0.899 | 5.7934 | 5.753 | 4.15 |
| | | Execution time (sec) | 148.455 | 133.574 | 128.987 | 137.01 |
| 100 GB | Write | Throughput (mb/sec) | 6.704 | 7.621 | 7.621 | 7.32 |
| | | Average IO rate (mb/sec) | 6.883 | 7.557 | 7.247 | 7.23 |
| | | IO rate standard deviation | 1.147 | 1.554 | 1.512 | 1.40 |
| | | Execution time (sec) | 2929.379 | 2666.541 | 2812.221 | 2802.71 |
| | Read | Throughput (mb/sec) | 15.959 | 15.79 | 15.79 | 15.85 |
| | | Average IO rate (mb/sec) | 16.413 | 15.831 | 15.831 | 16.03 |
| | | IO rate standard deviation | 0.717 | 0.818 | 0.724 | 0.75 |
| | | Execution time (sec) | 1316.845 | 1486.787 | 1438.554 | 1414.06 |

Number of KVM VMs = 4

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 4.842 | 5.13 | 4.721 | 5.25 |
| | | Average IO rate (mb/sec) | 8.131 | 12.451 | 12.213 | 6.71 |
| | | IO rate standard deviation | 5.865 | 13.164 | 16.688 | 4.41 |
| | | Execution time (sec) | 14.37 | 15.692 | 15.473 | 15.18 |
| | Read | Throughput (mb/sec) | 95.419 | 77.042 | 96.061 | 11.37 |
| | | Average IO rate (mb/sec) | 100.145 | 84.183 | 97.716 | 11.97 |
| | | IO rate standard deviation | 21.111 | 23.277 | 12.857 | 2.92 |
| | | Execution time (sec) | 14.387 | 14.364 | 15.534 | 14.76 |
| 1 GB | Write | Throughput (mb/sec) | 5.825 | 5.557 | 5.677 | 5.69 |
| | | Average IO rate (mb/sec) | 7.556 | 7.236 | 7.601 | 7.46 |
| | | IO rate standard deviation | 5.199 | 4.868 | 5.323 | 5.13 |
| | | Execution time (sec) | 40.198 | 38.079 | 40.489 | 39.59 |
| | Read | Throughput (mb/sec) | 26.314 | 33.061 | 23.697 | 27.69 |
| | | Average IO rate (mb/sec) | 45.562 | 52.421 | 31.314 | 43.10 |
| | | IO rate standard deviation | 42.962 | 41.111 | 18.684 | 34.25 |
| | | Execution time (sec) | 19.474 | 15.461 | 19.475 | 18.14 |
| 10 GB | Write | Throughput (mb/sec) | 5.817 | 5.188 | 5.182 | 5.40 |
| | | Average IO rate (mb/sec) | 7.263 | 6.567 | 6.457 | 6.76 |
| | | IO rate standard deviation | 4.535 | 4.212 | 3.955 | 4.23 |
| | | Execution time (sec) | 270.133 | 296.114 | 301.458 | 289.24 |
| | Read | Throughput (mb/sec) | 14.008 | 11.722 | 11.517 | 12.42 |
| | | Average IO rate (mb/sec) | 15.293 | 12.759 | 18.052 | 15.37 |
| | | IO rate standard deviation | 4.447 | 3.54 | 16.12 | 8.04 |
| | | Execution time (sec) | 118.331 | 144.184 | 130.603 | 131.04 |
| 100 GB | Write | Throughput (mb/sec) | 5.149 | 5.339 | 5.259 | 5.25 |
| | | Average IO rate (mb/sec) | 6.361 | 6.886 | 6.886 | 6.71 |
| | | IO rate standard deviation | 3.833 | 4.625 | 4.78 | 4.41 |
| | | Execution time (sec) | 2778.663 | 2780.868 | 2824.785 | 2794.77 |
| | Read | Throughput (mb/sec) | 11.655 | 11.181 | 11.269 | 11.37 |
| | | Average IO rate (mb/sec) | 12.193 | 12.002 | 11.724 | 11.97 |
| | | IO rate standard deviation | 2.891 | 3.319 | 2.549 | 2.92 |
| | | Execution time (sec) | 1369.266 | 1318.89 | 1520.755 | 1402.97 |

Number of KVM VMs = 5

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 5.796 | 4.807 | 2.949 | 4.52 |
| | | Average IO rate (mb/sec) | 6.55 | 5.447 | 3.682 | 5.23 |
| | | IO rate standard deviation | 2.342 | 2.171 | 2.446 | 2.32 |
| | | Execution time (sec) | 14.444 | 14.696 | 14.398 | 14.51 |
| | Read | Throughput (mb/sec) | 42.481 | 54.171 | 54.083 | 50.25 |
| | | Average IO rate (mb/sec) | 52.311 | 65.455 | 63.057 | 60.27 |
| | | IO rate standard deviation | 21.799 | 23.039 | 20.053 | 21.63 |
| | | Execution time (sec) | 14.39 | 14.466 | 14.534 | 14.46 |
| 1 GB | Write | Throughput (mb/sec) | 3.962 | 2.168 | 2.552 | 2.89 |
| | | Average IO rate (mb/sec) | 4.422 | 2.215 | 2.626 | 3.09 |
| | | IO rate standard deviation | 1.699 | 0.375 | 0.527 | 0.87 |
| | | Execution time (sec) | 42.716 | 37.287 | 37.65 | 39.22 |
| | Read | Throughput (mb/sec) | 4.883 | 7.708 | 5.135 | 5.91 |
| | | Average IO rate (mb/sec) | 6.698 | 9.251 | 5.884 | 7.28 |
| | | IO rate standard deviation | 4.412 | 4.452 | 2.42 | 3.76 |
| | | Execution time (sec) | 18.364 | 17.669 | 18.061 | 18.03 |
| 10 GB | Write | Throughput (mb/sec) | 3.369 | 3.495 | 3.421 | 3.43 |
| | | Average IO rate (mb/sec) | 3.374 | 3.497 | 3.294 | 3.39 |
| | | IO rate standard deviation | 0.123 | 0.081 | 0.057 | 0.09 |
| | | Execution time (sec) | 262.581 | 287.531 | 291.531 | 280.55 |
| | Read | Throughput (mb/sec) | 8.792 | 7.17 | 8.27 | 8.08 |
| | | Average IO rate (mb/sec) | 8.558 | 7.3 | 7.211 | 7.69 |
| | | IO rate standard deviation | 1.058 | 0.906 | 0.906 | 0.96 |
| | | Execution time (sec) | 128.409 | 125.356 | 133.347 | 129.04 |
| 100 GB | Write | Throughput (mb/sec) | 5.149 | 6.847 | 5.2 | 5.73 |
| | | Average IO rate (mb/sec) | 6.361 | 6.121 | 5.677 | 6.05 |
| | | IO rate standard deviation | 4.811 | 5.255 | 5.75 | 5.27 |
| | | Execution time (sec) | 2679.211 | 2850.744 | 2824.512 | 2784.82 |
| | Read | Throughput (mb/sec) | 11.655 | 11.181 | 11.269 | |
| | | Average IO rate (mb/sec) | 12.193 | 12.002 | 11.724 | 11.97 |
| | | IO rate standard deviation | 2.891 | 3.319 | 2.549 | 2.92 |
| | | Execution time (sec) | 1475.121 | 1214.15 | 1420.575 | 1369.95 |

Number of KVM VMs = 6

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 5.054 | 4.318 | 3.31 | 4.23 |
| | | Average IO rate (mb/sec) | 6.463 | 4.769 | 3.984 | 5.07 |
| | | IO rate standard deviation | 2.986 | 2.163 | 2.446 | 2.53 |
| | | Execution time (sec) | 14.474 | 14.44 | 14.42 | 14.44 |
| | Read | Throughput (mb/sec) | 62.035 | 56.085 | 24.337 | 47.49 |
| | | Average IO rate (mb/sec) | 69.145 | 65.138 | 62.606 | 65.63 |
| | | IO rate standard deviation | 22.529 | 23.831 | 38.026 | 28.13 |
| | | Execution time (sec) | 14.468 | 14.441 | 15.151 | 14.69 |
| 1 GB | Write | Throughput (mb/sec) | 3.089 | 3.155 | 2.982 | 3.08 |
| | | Average IO rate (mb/sec) | 3.369 | 3.239 | 3.262 | 3.29 |
| | | IO rate standard deviation | 1.13 | 0.595 | 1.115 | 0.95 |
| | | Execution time (sec) | 55.472 | 51.491 | 57.861 | 54.94 |
| | Read | Throughput (mb/sec) | 7.809 | 7.593 | 9.488 | 8.30 |
| | | Average IO rate (mb/sec) | 8.239 | 20.751 | 29.577 | 19.52 |
| | | IO rate standard deviation | 1.988 | 34.499 | 36.679 | 24.39 |
| | | Execution time (sec) | 34.23 | 34.396 | 31.08 | 33.24 |
| 10 GB | Write | Throughput (mb/sec) | 1.138 | 0.393 | 0.862 | 0.80 |
| | | Average IO rate (mb/sec) | 1.326 | 0.393 | 0.875 | 0.86 |
| | | IO rate standard deviation | 0.105 | 0.015 | 0.112 | 0.08 |
| | | Execution time (sec) | 310.523 | 372.186 | 359.615 | 347.44 |
| | Read | Throughput (mb/sec) | 0.881 | 1.437 | 1.568 | 1.30 |
| | | Average IO rate (mb/sec) | 3.091 | 1.639 | 1.721 | 2.15 |
| | | IO rate standard deviation | 5.442 | 0.666 | 0.645 | 2.25 |
| | | Execution time (sec) | 144.278 | 115.98 | 155.58 | 138.61 |
| 100 GB | Write | Throughput (mb/sec) | 2.597 | 2.898 | 2.581 | 2.69 |
| | | Average IO rate (mb/sec) | 2.516 | 2.606 | 2.625 | 2.58 |
| | | IO rate standard deviation | 0.155 | 0.157 | 0.21 | 0.17 |
| | | Execution time (sec) | 4130.984 | 4322.184 | 4179.124 | 4210.76 |
| | Read | Throughput (mb/sec) | 4.365 | 4.125 | 4.335 | 4.28 |
| | | Average IO rate (mb/sec) | 4.744 | 4.994 | 3.951 | 4.56 |
| | | IO rate standard deviation | 1.235 | 1.352 | 1.228 | 1.27 |
| | | Execution time (sec) | 3115.787 | 3411.599 | 3954.8 | 3494.06 |

Number of KVM VMs = 7

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.81 | 2.419 | 2.604 | 2.61 |
| | | Average IO rate (mb/sec) | 3.285 | 2.562 | 2.68 | 2.84 |
| | | IO rate standard deviation | 1.731 | 0.719 | 0.477 | 0.98 |
| | | Execution time (sec) | 16.788 | 17.535 | 19.524 | 17.95 |
| | Read | Throughput (mb/sec) | 36.311 | 39.541 | 40.404 | 38.75 |
| | | Average IO rate (mb/sec) | 42.668 | 52.131 | 52.211 | 49.00 |
| | | IO rate standard deviation | 16.107 | 22.127 | 24.376 | 20.87 |
| | | Execution time (sec) | 15.563 | 15.498 | 15.573 | 15.54 |
| 1 GB | Write | Throughput (mb/sec) | 2.027 | 2.263 | 2.234 | 2.17 |
| | | Average IO rate (mb/sec) | 2.072 | 2.342 | 2.273 | 2.23 |
| | | IO rate standard deviation | 0.357 | 0.484 | 2.273 | 1.04 |
| | | Execution time (sec) | 69.088 | 66.069 | 70.888 | 68.68 |
| | Read | Throughput (mb/sec) | 9.77 | 24.537 | 11.26 | 15.19 |
| | | Average IO rate (mb/sec) | 20.969 | 28.656 | 25.211 | 24.95 |
| | | IO rate standard deviation | 26.933 | 13.725 | 23.079 | 21.25 |
| | | Execution time (sec) | 32.326 | 38.813 | 35.21 | 35.45 |
| 10 GB | Write | Throughput (mb/sec) | 3.052 | 3.505 | 3.727 | 3.43 |
| | | Average IO rate (mb/sec) | 3.073 | 3.519 | 3.239 | 3.28 |
| | | IO rate standard deviation | 0.267 | 0.226 | 0.216 | 0.24 |
| | | Execution time (sec) | 390.563 | 318.818 | 301.551 | 336.98 |
| | Read | Throughput (mb/sec) | 7.934 | 8.232 | 10.567 | 8.91 |
| | | Average IO rate (mb/sec) | 18.501 | 8.33 | 11.837 | 12.89 |
| | | IO rate standard deviation | 6.414 | 0.9 | 4.159 | 3.82 |
| | | Execution time (sec) | 167.311 | 153.892 | 163.909 | 161.70 |
| 100 GB | Write | Throughput (mb/sec) | 2.214 | 2.632 | 2.325 | 2.39 |
| | | Average IO rate (mb/sec) | 2.421 | 2.412 | 2.514 | 2.45 |
| | | IO rate standard deviation | 0.195 | 0.157 | 0.21 | 0.19 |
| | | Execution time (sec) | 8303.277 | 8990.14 | 8776.16 | 8689.86 |
| | Read | Throughput (mb/sec) | 3.218 | 3.625 | 5.024 | 3.96 |
| | | Average IO rate (mb/sec) | 4.421 | 5.114 | 2.125 | 3.89 |
| | | IO rate standard deviation | 1.521 | 1.235 | 1.095 | 1.28 |
| | | Execution time (sec) | 7820.6253 | 7573.74 | 9175.136 | 8189.84 |

Number of KVM VMs = 8

| Data Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 7.45965 | 2.58245 | 3.80997 | 4.62 |
| | | Average IO rate (mb/sec) | 7.00618 | 3.8086 | 4.14562 | 4.99 |
| | | IO rate standard deviation | 2.206 | 2.365 | 2.323 | 2.30 |
| | | Execution time (sec) | 26.14782 | 43.34817 | 29.21525 | 32.90 |
| | Read | Throughput (mb/sec) | 49.132 | 46.789 | 52.311 | 49.41 |
| | | Average IO rate (mb/sec) | 34.014 | 43.115 | 66.94 | 48.02 |
| | | IO rate standard deviation | 13.245 | 13.154 | 14.774 | 13.72 |
| | | Execution time (sec) | 24.96277 | 31.83195 | 31.4689 | 29.42 |
| 1 GB | Write | Throughput (mb/sec) | 2.002 | 2.365 | 2.004 | 2.12 |
| | | Average IO rate (mb/sec) | 2.105 | 2.211 | 2.106 | 2.14 |
| | | IO rate standard deviation | 0.311 | 0.12 | 2.185 | 0.87 |
| | | Execution time (sec) | 93.2688 | 83.90763 | 150.2826 | 109.15 |
| | Read | Throughput (mb/sec) | 9.02 | 11.417 | 11.352 | 10.60 |
| | | Average IO rate (mb/sec) | 20.123 | 19.296 | 20.011 | 19.81 |
| | | IO rate standard deviation | 20.923 | 18.665 | 23.001 | 20.86 |
| | | Execution time (sec) | 38.7912 | 22.5756 | 77.462 | 46.28 |
| 10 GB | Write | Throughput (mb/sec) | 3.052 | 3.505 | 3.727 | 3.43 |
| | | Average IO rate (mb/sec) | 3.009 | 3.157 | 3.562 | 3.24 |
| | | IO rate standard deviation | 0.213 | 0.215 | 0.2 | 0.21 |
| | | Execution time (sec) | 515.5432 | 420.8398 | 729.7534 | 555.38 |
| | Read | Throughput (mb/sec) | 7.934 | 8.232 | 10.567 | 8.91 |
| | | Average IO rate (mb/sec) | 18.501 | 8.33 | 11.837 | 12.89 |
| | | IO rate standard deviation | 5.621 | 6.211 | 1.529 | 4.45 |
| | | Execution time (sec) | 830.37 | 249.305 | 368.7953 | 482.82 |
| 100 GB | Write | Throughput (mb/sec) | | | | |
| | | Average IO rate (mb/sec) | | | | |
| | | IO rate standard deviation | | | | |
| | | Execution time (sec) | | | | |
| | Read | Throughput (mb/sec) | | | | |
| | | Average IO rate (mb/sec) | | | | |
| | | IO rate standard deviation | | | | |
| | | Execution time (sec) | | | | |

4. Hadoop Virtualized Cluster - VMware ESXi

Number of VMware ESXi VMs = 3

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 1.534 | 4.382 | 5.396 | 3.77 |
| | | Average IO rate (mb/sec) | 5.854 | 6.476 | 8.613 | 6.98 |
| | | IO rate standard deviation | 5.995 | 4.094 | 4.688 | 4.93 |
| | | Execution time (sec) | 32.586 | 39.459 | 33.961 | 35.34 |
| | Read | Throughput (mb/sec) | 15.813 | 15.489 | 11.664 | 14.32 |
| | | Average IO rate (mb/sec) | 36.691 | 40.070 | 35.419 | 37.39 |
| | | IO rate standard deviation | 17.121 | 18.054 | 17.575 | 17.58 |
| | | Execution time (sec) | 27.836 | 29.189 | 31.256 | 29.43 |
| 1 GB | Write | Throughput (mb/sec) | 2.796 | 2.843 | 2.267 | 2.64 |
| | | Average IO rate (mb/sec) | 3.25 | 3.036 | 2.63 | 2.97 |
| | | IO rate standard deviation | 1.284 | 0.661 | 0.925 | 0.96 |
| | | Execution time (sec) | 98.707 | 99.748 | 105.382 | 101.28 |
| | Read | Throughput (mb/sec) | 14.873 | 16.918 | 15.707 | 15.83 |
| | | Average IO rate (mb/sec) | 17.528 | 18.735 | 17.787 | 18.02 |
| | | IO rate standard deviation | 5.826 | 5.519 | 5.377 | 5.57 |
| | | Execution time (sec) | 45.231 | 45.825 | 44.245 | 45.10 |
| 10 GB | Write | Throughput (mb/sec) | 16.154 | 17.254 | 16.259 | 16.56 |
| | | Average IO rate (mb/sec) | 29.400 | 29.484 | 28.354 | 29.08 |
| | | IO rate standard deviation | 0.002 | 0.003 | 0.03 | 0.01 |
| | | Execution time (sec) | 477.380 | 467.431 | 467.431 | 470.75 |
| | Read | Throughput (mb/sec) | 17.214 | 16.213 | 16.254 | 16.56 |
| | | Average IO rate (mb/sec) | 90.557 | 87.254 | 90.264 | 89.36 |
| | | IO rate standard deviation | 0.0219 | 0.0211 | 0.003 | 0.02 |
| | | Execution time (sec) | 138.808 | 153.864 | 162.121 | 151.60 |
| 100 GB | Write | Throughput (mb/sec) | 8.215 | 8.255 | 6.923 | 7.80 |
| | | Average IO rate (mb/sec) | 6.874 | 6.254 | 7.257 | 6.80 |
| | | IO rate standard deviation | 0.952 | 0.961 | 1.021 | 0.98 |
| | | Execution time (sec) | 4630.131 | 4766.423 | 4621.21 | 4672.59 |
| | Read | Throughput (mb/sec) | 12.214 | 12.214 | 12.214 | 12.21 |
| | | Average IO rate (mb/sec) | 15.24 | 15.24 | 15.24 | 15.24 |
| | | IO rate standard deviation | 2.721 | 2.745 | 2.847 | 2.77 |
| | | Execution time (sec) | 1621.001 | 1569.541 | 1642.21 | 1610.92 |

Number of VMware ESXi VMs = 4

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 3.621 | 3.256 | 4 | 3.63 |
| | | Average IO rate (mb/sec) | 6.094 | 6.509 | 5.994 | 6.20 |
| | | IO rate standard deviation | 3.863 | 4.442 | 4.69 | 4.33 |
| | | Execution time (sec) | 32.953 | 38.188 | 33.315 | 34.82 |
| | Read | Throughput (mb/sec) | 22.207 | 13.139 | 11.465 | 15.60 |
| | | Average IO rate (mb/sec) | 35.199 | 33.706 | 30.311 | 33.07 |
| | | IO rate standard deviation | 11.124 | 17.761 | 22.259 | 17.05 |
| | | Execution time (sec) | 27.896 | 29.868 | 29.815 | 29.19 |
| 1 GB | Write | Throughput (mb/sec) | 2.652 | 3.716 | 4.593 | 3.65 |
| | | Average IO rate (mb/sec) | 2.808 | 4.021 | 5.233 | 4.02 |
| | | IO rate standard deviation | 0.718 | 1.157 | 2.367 | 1.41 |
| | | Execution time (sec) | 91.877 | 90.642 | 82.769 | 88.43 |
| | Read | Throughput (mb/sec) | 18.332 | 19.887 | 12.121 | 16.78 |
| | | Average IO rate (mb/sec) | 24.537 | 37.693 | 21.528 | 27.92 |
| | | IO rate standard deviation | 17.041 | 40.033 | 20.724 | 25.93 |
| | | Execution time (sec) | 43.546 | 43.877 | 41.49 | 42.97 |
| 10 GB | Write | Throughput (mb/sec) | 16.211 | 16.001 | 15.251 | 15.82 |
| | | Average IO rate (mb/sec) | 24.756 | 29.481 | 25.328 | 26.52 |
| | | IO rate standard deviation | 0.004 | 0.002 | 0.002 | 0.00 |
| | | Execution time (sec) | 474.891 | 457.717 | 415.126 | 449.24 |
| | Read | Throughput (mb/sec) | 13.254 | 12.354 | 16.321 | 13.98 |
| | | Average IO rate (mb/sec) | 22.644 | 21.14 | 23.214 | 22.33 |
| | | IO rate standard deviation | 0.001 | 0.014 | 0.003 | 0.01 |
| | | Execution time (sec) | 151.35 | 120.893 | 139.212 | 137.15 |
| 100 GB | Write | Throughput (mb/sec) | 4.215 | 4.101 | 4.259 | 4.19 |
| | | Average IO rate (mb/sec) | 6.214 | 5.214 | 6.254 | 5.89 |
| | | IO rate standard deviation | 0.617 | 1.002 | 0.658 | 0.76 |
| | | Execution time (sec) | 4384.964 | 4514.001 | 3913.98 | 4270.98 |
| | Read | Throughput (mb/sec) | 12.214 | 13.12 | 12.542 | 12.63 |
| | | Average IO rate (mb/sec) | 16.241 | 15.214 | 15.24 | 15.57 |
| | | IO rate standard deviation | 2.125 | 2.155 | 2.314 | 2.20 |
| | | Execution time (sec) | 1573.197 | 1203.144 | 1503.98 | 1426.78 |

Number of VMware ESXi VMs = 5

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 6.787 | 5.783 | 5.797 | 6.12 |
| | | Average IO rate (mb/sec) | 7.487 | 6.311 | 6.085 | 6.63 |
| | | IO rate standard deviation | 2.357 | 1.683 | 1.291 | 1.78 |
| | | Execution time (sec) | 21.832 | 19.816 | 22.687 | 21.45 |
| | Read | Throughput (mb/sec) | 32.553 | 30.599 | 31.699 | 31.62 |
| | | Average IO rate (mb/sec) | 33.458 | 33.203 | 19.672 | 28.78 |
| | | IO rate standard deviation | 9.214 | 8.962 | 9.374 | 9.18 |
| | | Execution time (sec) | 20.689 | 17.884 | 18.865 | 19.15 |
| 1 GB | Write | Throughput (mb/sec) | 1.926 | 2.032 | 2.497 | 2.15 |
| | | Average IO rate (mb/sec) | 2.031 | 2.133 | 2.648 | 2.27 |
| | | IO rate standard deviation | 0.526 | 0.569 | 0.793 | 0.63 |
| | | Execution time (sec) | 76.754 | 79.927 | 68.664 | 75.12 |
| | Read | Throughput (mb/sec) | 18.093 | 9.005 | 14.98 | 14.03 |
| | | Average IO rate (mb/sec) | 23.024 | 9.845 | 17.178 | 16.68 |
| | | IO rate standard deviation | 11.61 | 3.008 | 7.886 | 7.50 |
| | | Execution time (sec) | 26.606 | 34.991 | 32.247 | 31.28 |
| 10 GB | Write | Throughput (mb/sec) | 3.065 | 3.131 | 3.03 | 3.08 |
| | | Average IO rate (mb/sec) | 3.079 | 3.143 | 3.048 | 3.09 |
| | | IO rate standard deviation | 0.217 | 0.203 | 0.238 | 0.22 |
| | | Execution time (sec) | 421.213 | 417.535 | 427.938 | 422.23 |
| | Read | Throughput (mb/sec) | 10.624 | 10.144 | 10.566 | 149.57 |
| | | Average IO rate (mb/sec) | 10.701 | 10.24 | 10.709 | 10.50 |
| | | IO rate standard deviation | 0.912 | 1.023 | 1.236 | 4.21 |
| | | Execution time (sec) | 124.088 | 137.512 | 131.46 | 131.02 |
| 100 GB | Write | Throughput (mb/sec) | 3.202 | 3.147 | 3.144 | 3.16 |
| | | Average IO rate (mb/sec) | 3.298 | 3.182 | 3.249 | 3.24 |
| | | IO rate standard deviation | 0.617 | 0.335 | 0.778 | 0.58 |
| | | Execution time (sec) | 3584.964 | 3607.653 | 3595.375 | 3596.00 |
| | Read | Throughput (mb/sec) | 11.951 | 11.709 | 12.024 | 11.89 |
| | | Average IO rate (mb/sec) | 12.24 | 11.877 | 2.255 | 8.79 |
| | | IO rate standard deviation | 2.372 | 2.201 | 1.667 | 2.08 |
| | | Execution time (sec) | 1163.197 | 1085.013 | 1054.679 | 1100.96 |

Number of VMware ESXi VMs = 6

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 2.528 | 4.331 | 1.161 | 2.67 |
| | | Average IO rate (mb/sec) | 2.767 | 4.794 | 1.924 | 3.16 |
| | | IO rate standard deviation | 0.803 | 1.448 | 1.256 | 1.17 |
| | | Execution time (sec) | 30.602 | 23.713 | 30.648 | 28.32 |
| | Read | Throughput (mb/sec) | 18.91 | 24.085 | 20.934 | 21.31 |
| | | Average IO rate (mb/sec) | 23.593 | 29.105 | 30.76 | 27.82 |
| | | IO rate standard deviation | 10.874 | 12.232 | 13.055 | 12.05 |
| | | Execution time (sec) | 20.554 | 18.235 | 18.076 | 18.96 |
| 1 GB | Write | Throughput (mb/sec) | 3.035 | 1.663 | 1.578 | 2.09 |
| | | Average IO rate (mb/sec) | 4.024 | 1.683 | 1.655 | 2.45 |
| | | IO rate standard deviation | 2.173 | 0.185 | 0.389 | 0.92 |
| | | Execution time (sec) | 58.777 | 86.469 | 75.434 | 73.56 |
| | Read | Throughput (mb/sec) | 3.201 | 7.975 | 8.673 | 6.62 |
| | | Average IO rate (mb/sec) | 5.086 | 7.729 | 9.918 | 7.58 |
| | | IO rate standard deviation | 4.142 | 1.629 | 3.371 | 3.05 |
| | | Execution time (sec) | 32.292 | 44.313 | 44.413 | 40.34 |
| 10 GB | Write | Throughput (mb/sec) | 3.132 | 3.012 | 2.693 | 2.95 |
| | | Average IO rate (mb/sec) | 3.163 | 3.053 | 2.718 | 2.98 |
| | | IO rate standard deviation | 0.329 | 0.366 | 0.261 | 0.32 |
| | | Execution time (sec) | 375.74 | 408.223 | 462.919 | 415.63 |
| | Read | Throughput (mb/sec) | 8.422 | 9.66 | 9.21 | 9.10 |
| | | Average IO rate (mb/sec) | 8.489 | 9.26 | 9.32 | 9.02 |
| | | IO rate standard deviation | 0.774 | 2.301 | 1.254 | 1.44 |
| | | Execution time (sec) | 122.793 | 125.499 | 133.985 | 127.43 |
| 100 GB | Write | Throughput (mb/sec) | 27.459 | 26.086 | 26.888 | 26.81 |
| | | Average IO rate (mb/sec) | 27.459 | 26.086 | 26.888 | 26.81 |
| | | IO rate standard deviation | 7.555 | 0.005 | 0.002 | 2.52 |
| | | Execution time (sec) | 3669.984 | 3881.374 | 3752.234 | 3767.86 |
| | Read | Throughput (mb/sec) | 92.256 | 98.351 | 96.926 | 95.84 |
| | | Average IO rate (mb/sec) | 92.256 | 98.351 | 96.926 | 95.84 |
| | | IO rate standard deviation | 0.009 | 0.0156 | 0.017 | 0.01 |
| | | Execution time (sec) | 1106.825 | 1042.597 | 1049.995 | 1066.47 |

Number of VMware ESXi VMs = 7

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 16.815 | 16.87 | 22.75 | 18.81 |
| | | Average IO rate (mb/sec) | 16.835 | 15.22 | 22.18 | 18.08 |
| | | IO rate standard deviation | 0.021 | 0.003 | 0.005 | 0.01 |
| | | Execution time (sec) | 23.757 | 21.82 | 20.727 | 22.10 |
| | Read | Throughput (mb/sec) | 112.524 | 137.741 | 124.069 | 124.78 |
| | | Average IO rate (mb/sec) | 103.235 | 135.542 | 131.096 | 88.89 |
| | | IO rate standard deviation | 0.019 | 0.027 | 0.015 | 6.01 |
| | | Execution time (sec) | 18.002 | 16.993 | 16.189 | 18.81 |
| 1 GB | Write | Throughput (mb/sec) | 21.989 | 22.135 | 18.241 | 20.79 |
| | | Average IO rate (mb/sec) | 21.989 | 22.187 | 18.215 | 20.80 |
| | | IO rate standard deviation | 0.003 | 0.004 | 1.526 | 0.51 |
| | | Execution time (sec) | 66.387 | 69.104 | 78.357 | 71.28 |
| | Read | Throughput (mb/sec) | 66.366 | 72.849 | 76.694 | 71.97 |
| | | Average IO rate (mb/sec) | 66.366 | 62.325 | 79.241 | 69.31 |
| | | IO rate standard deviation | 0.011 | 0.007 | 0.012 | 0.01 |
| | | Execution time (sec) | 44.061 | 31.754 | 42.002 | 39.27 |
| 10 GB | Write | Throughput (mb/sec) | 25.951 | 21.215 | 22.587 | 23.25 |
| | | Average IO rate (mb/sec) | 23.465 | 26.124 | 25.638 | 25.08 |
| | | IO rate standard deviation | 0.005 | 0.003 | 0.005 | 0.00 |
| | | Execution time (sec) | 412.5 | 400.975 | 417.404 | 410.29 |
| | Read | Throughput (mb/sec) | 92.125 | 86.671 | 80.214 | 86.34 |
| | | Average IO rate (mb/sec) | 98.851 | 97.256 | 92.541 | 96.22 |
| | | IO rate standard deviation | 0.006 | 0.004 | 0.004 | 0.005 |
| | | Execution time (sec) | 121.16 | 132.544 | 132.544 | 128.75 |
| 100 GB | Write | Throughput (mb/sec) | 19.274 | 25.261 | 23.574 | 22.70 |
| | | Average IO rate (mb/sec) | 26.332 | 28.315 | 27.036 | 27.23 |
| | | IO rate standard deviation | 0.002 | 0.005 | 0.004 | 0.004 |
| | | Execution time (sec) | 3826.909 | 3645.909 | 3727.394 | 3733.40 |
| | Read | Throughput (mb/sec) | 45.215 | 55.547 | 65.963 | 55.58 |
| | | Average IO rate (mb/sec) | 95.214 | 94.686 | 84.254 | 91.38 |
| | | IO rate standard deviation | 0.019 | 0.023 | 0.014 | 0.02 |
| | | Execution time (sec) | 1074.606 | 994.225 | 980.919 | 1016.58 |

Number of VMware ESXi VMs = 8

| Dataset Size | Operation | Criteria | Test 1 | Test 2 | Test 3 | Mean |
|---|---|---|---|---|---|---|
| 100 MB | Write | Throughput (mb/sec) | 6.352 | 6.214 | 5.214 | 5.93 |
| | | Average IO rate (mb/sec) | 8.359 | 16.072 | 9.325 | 11.25 |
| | | IO rate standard deviation | 0.001 | 0.035 | 0.001 | 0.01 |
| | | Execution time (sec) | 42.097 | 22.322 | 34.282 | 32.90 |
| | Read | Throughput (mb/sec) | 69.215 | 82.254 | 68.325 | 73.26 |
| | | Average IO rate (mb/sec) | 93.721 | 146.511 | 95.328 | 111.85 |
| | | IO rate standard deviation | 0.018 | 0.02 | 0.014 | 0.02 |
| | | Execution time (sec) | 27.748 | 27.962 | 26.957 | 27.56 |
| 1 GB | Write | Throughput (mb/sec) | 7.652 | 6.521 | 6.241 | 6.80 |
| | | Average IO rate (mb/sec) | 13.067 | 16.873 | 17.89 | 15.94 |
| | | IO rate standard deviation | 0.003 | 4.231 | 0.002 | 1.41 |
| | | Execution time (sec) | 94.415 | 94.711 | 79.465 | 89.53 |
| | Read | Throughput (mb/sec) | 36.678 | 60.214 | 62.124 | 53.01 |
| | | Average IO rate (mb/sec) | 28.352 | 99.265 | 78.019 | 68.55 |
| | | IO rate standard deviation | 0.003 | 0.021 | 0.006 | 0.01 |
| | | Execution time (sec) | 55.273 | 30.137 | 62.601 | 49.34 |
| 10 GB | Write | Throughput (mb/sec) | 18.124 | 19.254 | 18.625 | 18.67 |
| | | Average IO rate (mb/sec) | 26.557 | 24.477 | 24.955 | 25.33 |
| | | IO rate standard deviation | 0.004 | 0.004 | 0.004 | 0.00 |
| | | Execution time (sec) | 400.564 | 438.626 | 432.917 | 424.04 |
| | Read | Throughput (mb/sec) | 89.268 | 119.348 | 78.019 | 95.55 |
| | | Average IO rate (mb/sec) | 69.361 | 80.541 | 59.013 | 69.64 |
| | | IO rate standard deviation | 0.008 | 0.007 | 0.007 | 0.0073 |
| | | Execution time (sec) | 130.975 | 101.048 | 171.601 | 134.54 |
| 100 GB | Write | Throughput (mb/sec) | 18.214 | 19.421 | 19.566 | 19.07 |
| | | Average IO rate (mb/sec) | 27.006 | 24.451 | 25.324 | 25.59 |
| | | IO rate standard deviation | 0.001 | 0.002 | 0.002 | 0.00 |
| | | Execution time (sec) | 3737.64 | 4138.379 | 3981.254 | 3952.43 |
| | Read | Throughput (mb/sec) | 93.514 | 90.291 | 91.157 | 91.65 |
| | | Average IO rate (mb/sec) | 68.325 | 78.245 | 65.247 | 70.61 |
| | | IO rate standard deviation | 0.143 | 0.012 | 0.102 | 0.09 |
| | | Execution time (sec) | 1090.37 | 1130.456 | 1105.645 | 1108.82 |
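The TestDFSIO criteria tabulated in this appendix (throughput, average IO rate, IO rate standard deviation) can differ noticeably for the same run. The sketch below shows how these three figures are commonly understood to be derived from per-file measurements in Hadoop's TestDFSIO benchmark: throughput divides the total data by the summed task time, while the average IO rate is the mean of the individual per-file rates. The function name `dfsio_summary` and the sample inputs are hypothetical and are not taken from the measurements above.

```python
import math


def dfsio_summary(tasks):
    """Summarise per-file IO results.

    tasks: list of (megabytes, seconds) pairs, one per file (map task).
    Assumption: metrics follow the usual TestDFSIO definitions.
    """
    total_mb = sum(mb for mb, _ in tasks)
    total_sec = sum(sec for _, sec in tasks)
    rates = [mb / sec for mb, sec in tasks]  # per-file IO rate in MB/s
    n = len(rates)
    mean_rate = sum(rates) / n
    # Population variance of the per-file rates (abs guards rounding error)
    variance = abs(sum(r * r for r in rates) / n - mean_rate ** 2)
    return {
        "throughput": total_mb / total_sec,
        "average_io_rate": mean_rate,
        "io_rate_std_dev": math.sqrt(variance),
    }


# Two hypothetical files of 100 MB each: one fast task, one slow task
summary = dfsio_summary([(100, 10), (100, 50)])
print(summary)
```

With one fast and one slow task, the average IO rate (6 MB/s) sits well above the throughput (about 3.33 MB/s). This is consistent with the read rows above, where a large IO rate standard deviation tends to accompany a large gap between the two figures.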