The Impact of Virtualization on High Performance Computing Clustering in the Cloud

Transcription

1 The Impact of Virtualization on High Performance Computing Clustering in the Cloud Master Thesis Report Submitted in Fall 2013 In partial fulfillment of the requirements for the degree of Master of Science in Software Engineering at the School of Science and Engineering of Al Akhawayn University in Ifrane By Ouidad ACHAHBAR Supervised by Dr. Mohamed Riduan ABID Ifrane, Morocco January, 2014

2 Acknowledgment I would like to express my deepest and sincere gratitude to ALLAH for giving me guidance and strength to complete this work, and for having the chance to study and accomplish my master degree with high support from my family, friends and professors. Thank you ALLAH. I would also like to deeply thank my supervisor Dr. Abid for trusting me to conduct this research, providing me with valuable feedback and overseeing my progress in a weekly basis. Thank you Dr. Abid for your motivation and support. My gratitude also goes to Dr. Haitouf who provided me with valuable comments and shared with me his knowledge in cloud computing and distributed systems. Thank you Dr. Haitouf. I am most thankful to my dear parents, brothers, sisters, nephews and fiancé for their continuous support, encouragement and love. There are no words to express my gratitude to all of you. Many thanks go to my very close friends: Nora El Bakraoui Alaoui, Inssaf El Boukari, Sara El Alaoui, Aida Tahditi, Jamila Barroug, Wafa Bouya and Chahrazad Touzani. Thank you for being always by my side; thank you for sharing enjoyable moments with me, and thank you for being my friends. Last but not least, special acknowledgements go to all my professors for their support, respect and encouragement. Thank you Ms. Hanaa Talei, Ms. Asmaa Mourhir, Dr. Naeem Nizar Sheikh, Mr. Omar Iraqui, Dr. Violetta Cavalli Sforza, Dr. Kevin Smith and Dr. Harroud. Ouidad Achahbar 2

3 Abstract The ongoing pervasiveness of Internet access is largely increasing big data production. This, in turn, increases demand on compute power to process the massive data, and thus rendering High Performance Computing (HPC) into a high solicited service. Based on the paradigm of providing computing as a utility, the cloud is offering user-friendly infrastructures for processing these big data, e.g., High Performance Computing as a Service (HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization technique since the latter controls the creation of virtual machines instances that carry data processing jobs. In this thesis, we characterize and evaluate the impact of machine virtualization on HPCaaS. We track HPC performance under different cloud virtualization platforms, namely KVM and VMware ESXi, and compare it to the performance in a physical computing cluster infrastructure. The virtualized environment is deployed using Hadoop on top of Openstack. The resulting HPCaaS runs MapReduce algorithms on benchmarked big data samples using a granularity of 8 physical machines per cluster. We got several interesting results when we ran the selected benchmarks on virtualized and physical cluster. Each tested cluster provided different performance trends. Yet, the overall analysis of the research findings proved that the selection of virtualization technology can lead to significant improvements when running and handling HPCaaS. 3

4 ملخص يعتبر التفشي المستمر لظاهرة ولوج واستعمال اإلنترنت سببا رئيسيا في تزايد إنتاج العديد من البيانات الضخمة. هذا بدوره يؤدي إلى زيادة الطلب على قدرات حسابية عالية لمعالجة هذه البيانات. هذه المؤشرات جعلت من خدمة "حوسبة عالية األداء" كخدمة مثيرة لإلهتمام. استنادا إلى نموذج توفير الحوسبة كأداة مساعدة تقدم الحوسبة السحابية بنيات تحتية مرنة اإلستعمال لمعالجة البيانات الضخمة على سبيل المثال "الحوسبة العالية األداء كخدمة". مع ذلك يقترن أداء هذه األخيرة بشكل كبير بتقنية البيئة االفتراضية نظرا إلى تحكمها في إنشاء األالت االفتراضية )الحواسب االفتراضية( التي تقوم بوظائف معالجة البيانات. في هذه األطروحة قمنا بوصف و تقييم تأثير البيئة االفتراضية على "الحوسبة العالية األداء كخدمة". قمنا أيضا بتتبع أداء "الحوسبة العالية األداء" على برامج سحابية افتراضية مختلفة وعلى حوسبة مادية مكونة من ثمان أجهزة كمبيوتر. قمنا باستخدام "أوبن ستاك" لبناء "الحوسبة العالية األداء كخدمة" و "هادوب" لتشغيل خوارزميات "ماب رديوس" على كبيرة. بيانات من خالل نتائج هذا البحث الحظنا تغير مهم في أداء " الحوسبة العالية األداء" بتغير حجم البيانات نوعية الحوسبة )البنية التحتية: المادية واالفتراضية( وحجم الحوسبة. بالرغم من ذالك فاالستناج الذي وصلنا اليه يثبت ان تقنية البيئة االفتراضية لها دور مهم ومعتبر في تحسين أداء "الحوسبة العالية األداء". 4

5 Table of Content Acknowledgment 2 Abstract 3 ملخص 4 Table of Content 5 List of Figures 7 List of Tables 9 List of Appendices 10 List of Acronyms 11 PART I: THESIS OVERVIEW 12 Chapter 1: Introduction Background Motivation Problem Statement Research Question Research Objective Research Approach Thesis Organization 16 PART II: THEORETICAL BASELINES 17 Chapter 2: Cloud Computing Cloud Computing Definition Cloud Computing Characteristics Cloud Computing Service Models Cloud Computing Deployment Models Cloud Computing Benefits Cloud Computing Providers 23 Chapter 3: Virtualization Definition of Virtualization History of Virtualization Benefits of Virtualization Virtualization Approaches Virtual Machine Manager 28 Chapter 4: Big Data and High Performance Computing as a Service Big Data High Performance Computing as a Service (HPCaaS) 33 Chapter 5: Literature Review and Research Contribution Related Work Contribution 36 PART III: TECHNOLOGY ENABLERS 37 Chapter 6: Technology Enablers Selection Cloud Platform Selection Distributed and Parallel System Selection 40 5

6 Chapter 7: Openstack OpenStack Overview OpenStack History OpenStack Components OpenStack Supported Hypervisors 49 Chapter 8: Hadoop Hadoop Overview Hadoop History Hadoop Architecture Hadoop Implementation Hadoop Cluster Connectivity 55 PART III: RESEARCH CONTRIBUTION 57 Chapter 9: Research Methodology Research Approach Research Steps 58 Chapter 10: Experimental Setup Experimental Hardware Experimental Software and Network Clusters Architecture Experimental Performance Benchmarks Experimental Datasets Size Experiment Execution 66 Chapter 11: Experimental Results Hadoop Physical Cluster Results Hadoop Virtualized Cluster- KVM Results Hadoop Virtualized Cluster- VMware ESXi Results Results Comparison 82 Chapter 12: Discussion TeraSort TestDFSIO Conclusion 91 PART IV: CONCLUSION 92 Chapter Conclusion and Future Work 93 Bibliography 94 Appendix A: OpenStack with KVM Configuration 100 Appendix B. OpenStack with VMware ESXi Configuration 127 Appendix C: Hadoop Configuration 131 Appendix D: TeraSort and TestDFSIO Execution 145 Appendix E: Data Gathering for TeraSort 147 Appendix F: Data Gathering for TestDFSIO 153 6

7 List of Figures Figure 1: Thesis organization Figure 2: NIST visual model of cloud computing definition Figure 3: services provided in cloud computing environment Figure 4: Full virtualization architecture Figure 5: Paravirtualization architecture Figure 6: Hardware assisted virtualization architecture Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor Figure 8: Xen hypervisor architecture Figure 9: KVM hypervisor architecture Figure 10: VMware ESXi architecture Figure 11: Data growth over 2008 and Figure 12: Active cloud community population Figure 13: Active distributed systems population Figure 14: OpenStack conceptual architecture Figure 15: Nova subcomponents Figure 16: Glance subcomponents Figure 17: Keystone subcomponents Figure 18: Swift subcomponents Figure 19: Cinder subcomponents Figure 20: Quantum subcomponents Figure 21: Apache Hadoop subprojects Figure 22: Hadoop Architecture Figure 23: HDFS and MapReduce representation Figure 24: Word count MapReduce example Figure 25 : Research steps Figure 26 : Hadoop Physical Cluster Figure 27: Hadoop Physical Cluster architecture Figure 28: Hadoop virtualized cluster - KVM Figure 29: Hadoop virtualized cluster VMware ESXi (a) Figure 30 : Hadoop virtualized cluster VMware ESXi (b) Figure 31 : Experimental execution Figure 32: TeraSort performance on Hadoop Physical Cluster Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster Figure 34 : TeraSort performance for 1 GB on Hadoop Physical Cluster Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster Figure 36 : TeraSort performance for 30 GB on Hadoop Physical Cluster Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster Figure 39 : TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster Figure 41 : TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster Figure 44 : TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster Figure 46 : TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster Figure 47: TeraSort performance on Hadoop KVM Cluster

8 Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster Figure 49 : TeraSort performance for 1 GB on Hadoop KVM Cluster Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster Figure 51 : TeraSort performance for 30 GB on Hadoop KVM Cluster Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster Figure 54 : TestDFSIO-Write performance for 1GB on Hadoop KVM Cluster Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster Figure 56 : TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster Figure 57: TestdFSIO-Read performance on Hadoop KVM Cluster Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster Figure 59 : TestDFSIO-Read performance for 1GB on Hadoop KVM Cluster Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster Figure 61 : TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster Figure 64 : TeraSort performance for 1GB on Hadoop VMware ESXi Cluster Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster Figure 66 : TeraSort performance for 30GB on Hadoop VMware ESXi Cluster Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster Figure 69 : TestDFSIO-Write performance for 1GB on Hadoop VMware ESXi Cluster Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster Figure 71 : TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster Figure 73: TestDFSIO- Read performance for 100 MB on Hadoop VMware ESXi Cluster Figure 74 : TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster Figure 75: TestDFSIO- Read performance for 10 GB on Hadoop VMware ESXi Cluster Figure 76 : TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and Figure 80 : Average time for sorting 30 GB on HPhC, HVC with KVM and Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi Figure 82 : Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi Figure 83: Average time for wrting 100 GB on HPhC, HVC with KVM Figure 84: Average time for reading 1 GB on HPC, HVC with KVM and VMware ESXi Figure 85 : Average time for reading 1 GB on HPC, HVC with KVM and HVC VMware ESXi Figure 86: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi Figure 87 : Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi 87 Figure 88: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs Figure 89 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs89 Figure 90: OpenStack warning statistics about system resources usage

9 List of Tables Table 1 : A Comparison of cloud deployment models Table 2 : Cloud IaaS selection Table 3 : Parallel and distributed platform selection Table 4 : OpenStack releases Table 5 : OpenStack projects Table 6: Apache Hadoop subprojects Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster) Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster Table 9 : OpenStack virtual machines features Table 10 : Experimental performance metrics Table 11 : Datasets size used for Hadoop benchmarks Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes- Hadoop Physical Cluster Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes- Hadoop Physical Cluster Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes- Hadoop Physical Cluster Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes- Hadoop KVM Cluster Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes- Hadoop KVM Cluster Table 17 : Average time (in seconds) of running TestdFSIO-Read on different dataset sizes and different number of nodes- Hadoop KVM Cluster Table 18 : Average time (in seconds) of running TestdFSIO-Read on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster Table 19 : Average time (in seconds) of running TestdFSIO-Write on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster Table 20 : Average time (in seconds) of running TestdFSIO-Read on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster

10 List of Appendices Appendix A : OpenStack with KVM Configuration Appendix B : OpenStack with VMware ESXi Configuration.127 Appendix C: Hadoop Configuration Appendix D: TeraSort and TestDFSIO Execution..145 Appendix E: Data Gathering for TeraSort Appendix F: Data Gathering for TestDFSIO

11 List of Acronyms HPC HPCaaS VM VMM EMC DCI GFS HDFS NDFS DOE NIST SaaS PaaS IaaS NoSQL SNIA ACID AWS HPhC HVC SSH JSON XML API Amazon EC2 Amazon S3 VLAN DHCP High Performance Computing High Performance Computing as a Service Virtual Machine Virtual Machine Manager American Multinational Corporation Digital Communications Inc. Google File System Hadoop Distributed File System Nutch Distributed File System Department of Energy National Laboratories National Institute of Standards and Technology Software as a Service Platform as a Service Infrastructure as a Service Not Only Structured Query Language Storage Networking Industry Association Atomicity, Consistency, Isolation and Durability Amazon Web Services Hadoop Physical Cluster Hadoop Virtualized Cluster Secure Shell JavaScript Object Notation Extensible Markup Language Application Programming Interface Amazon Elastic Compute Cloud Amazon Simple Storage Service Virtual Local Area Network Dynamic Host Configuration Protocol 11

12 Part I: Thesis Overview This part introduces the key points to understand the purpose of the present research. It provides an introduction of the research starting with its background, motivation, problem statement, research question, objective and research methodology. 12

13 Chapter 1: Introduction In this chapter, we first come to the background of the present research, and then describe the motivation and the problem behind conducting this study. After that, questions, objectives, and methodology of the research are stated. Finally, an outline of the thesis is given out at the end of this chapter. 1.1.Background During the last decades, the demand for computing power has steadily increased as data generated from social networks, web pages, sensors, online transactions, etc. is continuously growing. A study done in 2012 by American Multinational Corporation (EMC), has estimated that from 2005 to 2020, data will grow by a factor of 300 (from 130 exabytes to 40,000 exabytes), and therefore, digital data will be doubled every two years [1]. The growth of data constitutes the Big Data phenomenon. As Big Data grows in terms of volume, velocity and value, the current technologies for storing, processing and analyzing data become inefficient and insufficient. Gartner survey stated that data growth is considered as the largest challenge for organizations [2]. Stating this issue, High Performance Computing (HPC) has started to be widely integrated in managing and handling Big Data. In this case, HPC is used to process and analyze Big Data related to different problems including scientific, engineering and business problems that require high computation capabilities, high bandwidth, and low latency network [3]. However, HPC still lacks the toolsets that fit the current growth of data. In this case, new paradigms and storage tools were integrated with HPC to deal with the current challenges related to data management. Some of these technologies include, providing computing as a utility (cloud computing) and introducing new parallel and distributed paradigms. Cloud computing plays an important role as it provides organizations with the ability to analyze and store data economically and efficiently. Performing HPC in the cloud was introduced as data has started to be migrated and managed in the cloud. Digital Communications Inc. (DCI) stated that by 2020, a significant portion of digital data will be managed in the cloud, and even if a byte in the digital universe is not stored in the cloud, it will pass, at some point, through the cloud [4]. Performing HPC in the cloud is known as High Performance Computing as a Service (HPCaaS). In short, HPCaaS offers high- 13

14 performance, on-demand, and scalable HPC environment that can handle the complexity and challenges related to Big Data [5]. One of the most known and adopted parallel and distributed systems is MapReduce model that was developed by Google to meet the growing of their web search indexing process [6]. MapReduce computations are performed with the support of data storage system known as Google File System (GFS). The success of both Google File System and MapReduce inspired the development of Hadoop which is a distributed and parallel system that implements MapReduce and Hadoop Distributed File System (HDFS) [7]. Nowadays, Hadoop is widely adopted by big players in the market because of its scalability, reliability and low cost of implementation. Stating this, Hadoop is also proposed to be integrated with HPC as an underlying technology that distributes the work across HPC cluster [8, 9]. 1.2.Motivation Many solutions have been proposed and developed to improve computation performance of Big Data. Some of them tend to improve algorithms efficiency, provide new distributed paradigms or develop powerful clustering environments. Though, few of those solutions have addressed a whole picture of integrating HPC with the current emerging technologies in terms of storage and processing. As stated before, some of the most popular technologies currently used in hosting and processing Big Data are cloud computing, HDFS and Hadoop MapReduce[10]. At present, the use of HPC in the cloud computing is still limited. The first step towards this research was done by the Department of Energy National Laboratories (DOE), which started exploring the use of cloud services for scientific computing [11]. Besides, in 2009, Yahoo Inc. launched partnership with major top universities in United States to conduct more research about cloud computing, distributed systems and high computing applications. HPCaaS still needs more investigation to decide on appropriate environments that can fit high computing requirements. One of the HPCaaS aspects that is not yet investigated is the impact of different virtualization technologies on HPC in the cloud. Therefore, the motivation of this research consists in the need for evaluating HPCaaS performance using MapReduce and different virtualization techniques. This motivation is accompanied by a strong rational that is addressed by the free accessibility to MapReduce and cloud computing open sources. 14

15 1.3.Problem Statement Cloud computing is offering set of services for processing Big Data; one of these services is HPCaaS. Still, HPCaaS performance is highly affected by the underlying virtualization techniques which are considered as the heart of cloud computing. Stating this, the problem addressed in this research is formulated as follow: HPCaaS is still facing poor performance and still doesn t fit Big Data requirements. 1.4.Research Question Addressing the problem statement, this thesis aims at bringing answers to the following research questions: 1. What is the performance of HPC on Hadoop Physical Cluster (HPhC)? 2. Is it worth moving HPC to the cloud? 3. How virtualization techniques affect HPCaaS performance? 4. Is there an optimal virtualization technique that can ensure good performance? 1.5.Research Objective The purpose of the present research is to find solutions for the addressed issues and questions in the previous sections. Hence, this research introduces a new architecture that can handle HPC complexity and increase its performance. The proposed architecture consists of building a Hadoop Virtualized Cluster (HVC) in a private cloud using OpenStack. Hence, the first goal of this research is to investigate the added value of adopting virtualized cluster, and the second goal is to evaluate the impact of virtualization techniques on HPCaaS. 1.6.Research Approach To evaluate HPCaaS over different virtualization technologies, we followed both qualitative and quantitative research methodologies. The qualitative approach was adopted to select appropriate technology enablers that will be used in building an architecture that will solve the issues addressed in this study. On the other hand, quantitative approach was adopted to conduct different experiments on three different clusters: Hadoop Physical Cluster (HPhC), Hadoop Virtualized Cluster using KVM (HVC- KVM) [12] and Hadoop Virtualized Cluster using VMware ESXi (HVC- VMware ESXi) [13]. Each experiment tends to measure the performance of HPC. 15

16 1.7.Thesis Organization The rest of this thesis is structured as follow (Figure 1): Part I covers chapter 1 (current chapter) which introduces the present research. Part II covers chapter 2, 3, 4 and 5. Chapter 2 provides basic understanding of cloud computing; chapter 3 introduces virtualization; chapter 4 presents the concept of Big Data and HPCaaS, and chapter 5 lists some related work and states clearly our contribution Part III covers chapter 6, 7 and 8. Chapter 6 explains the steps we followed in selecting the technology enablers of this research, and chapter 7 and 8 present in details OpenStack and Hadoop respectively. Part IV covers chapter 9, 10, 11 and 12. Chapter 9 presents the methodology adopted in conducting this research; chapter 10 demonstrates the environment preparation to run the needed experiments; chapter 11 introduces the results, and chapter 12 discusses the research findings. Part V covers chapter 13 which concludes the research findings and proposes some future work; further, this part includes bibliography and appendices of this study. Figure 1: Thesis organization 16

17 Part II: Theoretical Baselines The objective of this part is to elaborate and shed light on some scientific concepts, theories and topics that serve as a foundation to understand the whole picture of the present research. Hence; this part is structured as follow: chapter 2 demonstrates basic background of cloud computing; chapter 3 introduces cloud computing related technologies, namely virtualization; chapter 4 presents Big Data and HPaaS, and chapter 5 situates this research by introducing previous research that were done in the domain of evaluating HPC. 17

18 Chapter 2: Cloud Computing Cloud computing becomes the current innovative and emerging trend in delivering IT services that attract both the interest of academic and industrial fields. Using advanced technologies, cloud computing provides end users with a variety of services, starting from the hardware level services to the application level. Cloud computing is understood as utility computing over the Internet. Meaning, computing services have moved from local data centers to hosted services which are offered over the Internet and paid based on pay-per-use model [14]. This chapter provides an overview of cloud computing concept. It provides a distinct definition of what cloud computing is; defines cloud computing characteristics, describes cloud service and deployment models, discusses some cloud computing benefits, and finally this chapter lists some cloud computing providers. 3.1.Cloud Computing Definition In the late 1960 s, John McCarthy brought a new concept into computer science field which predicts that technology will not be only provided as tangible products [14]. Meaning, computer resources will be provided as a service like water and electricity. The concept was known as utility computing, and nowadays it known as cloud computing. Cloud computing is defined by NIST (National Institute of Standards and Technology) [15] in 2009 as: Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. NIST definition of cloud sheds light on the effective use of cloud computing in terms of providing minimum management efforts of the shared resources. It sets five characteristics that define cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. Concerning the deployment models, NIST has classified them into: private, public, community and hybrid cloud. More details about cloud characteristics, delivery and deployment models are provided in the upcoming subsections. 18

19 The NIST definition of cloud is summarized in Figure 2 which encapsulates cloud computing characteristics, service models, and deployment models. Figure 2: NIST visual model of cloud computing definition [14] 3.2.Cloud Computing Characteristics NIST has listed five main characteristics that describe precisely cloud computing, which are [15]: On-demand self-service: end users can use and change computing capabilities as desired without the need of human interaction with each service provider. Broad network access: resources are accessed over network using standards mechanism. Resource pooling: the provider s computing resources are pooled to serve multiple consumers; these resources are dynamically assigned and reassigned according to consumer demand. Examples of resources include storage, processing, memory, and network bandwidth. Rapid elasticity: cloud providers can elastically scale in and scale out resources depending on current end users demand. Therefore, resources can be available for provisioning in any quantity at any time. Measured service: resources usage can be monitored, controlled and measured; therefore, these features enable end users to pay using the pay as you go model. Other characteristics were investigated in [16], and which are listed as follow: 19

20 Reliability: this feature is ensured by implementing and providing multiple redundant sites. Having this feature, cloud computing is considered as an ideal solution for disaster recovery and business critical tasks. Customization: cloud computing allows customization of infrastructure and applications based on end user demand. Efficient resource utilization: this feature ensures delivering resources as long as they are needed Cloud Computing Service Models Based on NIST definition of cloud computing, cloud deployment models are classified as follow: Software as a Service (SaaS) Software as a Service (SaaS) represents application software, operating system and computing resources. End users can view the SaaS model as a web-based application interface where services and complete software applications are delivered over the Internet. Some examples of SaaS applications are: Google Docs, Microsoft Office Live, Salesforce Customer Relationship Management, etc. Platform as a Service (PaaS) This service allows end users to create and deploy applications on provider s cloud infrastructure. In this case, end users do not manage or control the underlying cloud infrastructure like network, servers, operating systems, or storage. However, they do have control over the deployed applications by being allowed to design, model, develop and test them. Examples of PaaS are: Google App Engine, Microsoft Azure, Salesforce, etc. Infrastructure as a Service (IaaS) This service consists of a set of virtualized computing resources such as network bandwidth, storage capacity, memory, and processing power. These resources can be used to deploy and run arbitrary software which can include operating systems and applications. Examples of IaaS providers are Drop Box, Amazon web service, etc. Cloud services are summarized in Figure 3. 20

21 Figure 3: services provided in cloud computing environment [16] 3.4.Cloud Computing Deployment Models Private Cloud Private cloud computing is provisioned for exclusive use by an organization. The cloud in this case is owned, managed and operated by the organization, a third party, or both of them. The advantage of private cloud consists in providing high security since the cloud is accessed by trusted entities within the organization [15]. Public Cloud The cloud infrastructure is provisioned for general public use. It may be owned, managed, and operated by cloud service provider who offers services based on pay-per-use model. In contrast to private cloud, public cloud is known as untrustworthy environment [15]. Community Cloud The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from different organizations that share some goals (e.g., mission, security requirements, policy, and compliance considerations). In this case, the cloud may be owned, managed, and operated by one or more organizations in the community, a third party, or combination of them [15]. Hybrid Cloud This cloud is a combination of both private and public cloud computing environments. Hybrid cloud provides high flexibility and choices for organization; for instance, critical core activities of an organization can be run under the control of the private part of the hybrid cloud while other tasks may be outsourced to the public part [17]. Table 1 summarizes cloud deployment models discussed above [17]. 21

22 Table 1 : A Comparison of cloud deployment models [17] 3.5.Cloud Computing Benefits Nowadays, cloud is widely used because of the benefits it provides to end users. Some of the key benefits offered by the cloud include [17, 18]: Initial Cost Savings Organizations or individuals can save the big initial investment for launching new hardware, products and services; in this case, cloud computing platform offers the needed resources in terms of infrastructure, platform and applications. Scalability Cloud computing ensures high computing scalability by scaling up resources as needed. Therefore, when the usage increases, resources increase relatively to respond to end user demand. Availability Cloud providers have the infrastructure and bandwidth to accommodate business requirements for high speed access, storage and systems. Reliability Cloud computing implements redundant paths to support business continuity and disaster recovery. 22

23 Maintenance End users are not concerned with the resources maintenance since it is done by the cloud service provider. 3.6.Cloud Computing Providers There are many providers who offer cloud services with different features and pricing. Some of them are listed as follow [16, 19]: Amazon Web Services Amazon (AWS) [20] offers a number of cloud services for all business sizes. AWS ensures advanced data privacy techniques to protect users data. For that reason, AWS got various security certifications and audits such as ISO 27001, FISMA moderate and SAS 70 Type II. Some AWS services are: Elastic Compute Cloud, Simple Storage Service, SimpleDB (relational data storage service that stores, processes and queries data sets in the cloud), etc. Google Google [21] offers high accessibility and usability in its cloud services. Some of Google services include: Google s App Engine, Gmail, Google Docs, Google analytics, Picasa (a tool used to exhibit product and uploading their images in the cloud), etc. Microsoft Microsoft [22] offers a famous cloud platform called Windows Azure which runs Windows applications. Some other services include: SQL Azure, Windows Azure Marketplace (an online market to buy and sell applications and data), etc. OpenStack OpenStack [23] is an open source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was founded by Rackspace hosting and NASA. Some other organizations that invest in the cloud are: Dell, IBM, Oracle, HP, Sales force, etc. [16]. 23

24 Chapter 3: Virtualization There are many different existing technologies and practices used by cloud providers; some of them are internet protocols for communication, virtual private cloud provisioning, load balancing and scalability, distributed processing, high performance computing technologies and virtualization [24]. This chapter emphasizes an understanding of virtualization technology as it is considered the core of cloud computing. It describes in details the history, benefits, types and the abstract layer of virtualization. 4.1.Definition of Virtualization Virtualization is a widely used term; it has been introduced for many years as a powerful technology in computer science. The definition of virtualization can change depending on which component of computer system is applied on. However, it is broadly defined as an abstract layer between physical resources and their logical representation [25]. NIST has defined virtualization as [26]: The simulation of the software and/or hardware upon which other software runs. This simulated environment is called a virtual machine (VM). There are many forms of virtualization, distinguished primarily by computing architecture layer. For example, application virtualization provides a virtual implementation of the application programming interface (API) that a running application expects to use, allowing applications developed for one platform to run on another without modifying the application itself. The Java Virtual Machine (JVM) is an example of application virtualization; it acts as an intermediary between the Java application code and the operating system (OS). Another form of virtualization, known as operating system virtualization, provides a virtual implementation of the OS interface that can be used to run applications written for the same OS as the host, with each application in a separate VM container. Furthermore, Virtualization is defined by SNIA (Storage Networking Industry Association) as follow [27]: The act of abstracting, hiding, or isolating the internal functions of a storage (sub) system or service from applications, host computers, or general network resources, for the purpose of enabling application and networkindependent management of storage or data. From both definitions, we can say that virtualization is a methodology of dividing a physical machine into multiple execution environments that allow multiple tasks to run simultaneously. This is done by providing a software abstract layer that is called Virtual 24

25 Machine Manager (VMM) or Hypervisor. VMM is therefore designed to hide the physical resources from the operating system. In this case, VMM allows creating multiple guest Operating Systems (OS) (each guest is run by software units called Virtual Machines (VM) [28]. 4.2.History of Virtualization The roots of virtualization go back to the first visualized IBM mainframes that were designed in the 1690s, and which allowed the company to run multiple applications and processes simultaneously. In fact, the main drivers behind introducing virtualization were the high cost of hardware and the need for running and isolating applications on the same hardware. During 1970s, the adoption of virtualization technology increased sharply in many organizations because of cost effectiveness. However, in 1980s and 1990s, hardware prices dropped down as well as the emergence of multitasking operating systems. With these facts, there was no need to assure a high CPU utilization, and therefore, there was no need for virtualization technology. Yet, in the 1990s, virtualization technology brought again to the market after introducing VMware Inc. at Stanford University. Nowadays, virtualization is widely used to reduce management costs by replacing a bunch of low-utilized servers by a single server [29]. 4.3.Benefits of Virtualization There a bunch of reasons that push many organizations to go for virtualization technology; some of them are listed in [24, 29, 30] as follow: Server Consolidation It condenses multiple servers into one physical server that would host many virtual machines. This feature allows the physical server to run at high rate of utilization, and it reduces at the same time the hardware maintenance, power and cooling requirements costs. Application Consolidation Legacy applications might require newer hardware and/or operating systems. In this case, virtualization can be used to virtualize the new requirements. Sandboxing Virtualization can provide secure and isolated environment by running virtual machines that can be used to run foreign or less-trusted applications. Multiple Simultaneous OS 25

26 It can provide the facility of having multiple simultaneous operating systems that can run different types of applications. Reducing Cost Virtualization reduces cost deployment and configuration by ensuring less hardware, less space and less staffing. Furthermore, virtualization reduces the cost of networking by requiring less wirings, switches and hubs Virtualization Approaches Virtualization can take different forms depending on which component of computer system is applied on [31]. In this section, we will shed light on three famous virtualization techniques: Full Virtualization, Para-virtualization, and Hardware Assisted Virtualization Full Virtualization In full virtualization, guest OS is fully abstracted from the hardware level by adding virtualization layer: VMM or hypervisor. In this case, the guest OS is not aware it is being virtualized, and it requires no modifications. This approach provides each VM with all services of the physical system, including virtual BIOS, virtual devices and virtualized memory management. To manage the communication between different layers, full virtualization provides both binary translation and direct execution techniques (Figure 4). Binary translation is used to convert guest OS instructions into host instructions. On the other hand, application or user level instructions are directly executed on the processor to ensure high performance [32]. Microsoft Virtual Server is an example of full virtualization. Figure 4: Full virtualization architecture [32] 26

27 Paravirtualization The fundamental issue with full virtualization is the emulation of devices within the hypervisor. This issue was solved by developing paravirtualization technique which allows the guest OS to be aware that it's being virtualized and to have direct access to the underlying hardware. In paravirtualization, the actual guest code is modified to use a different interface that accesses the hardware directly or the virtual resources controlled by the hypervisor [32]. In more details, paravirtualization changes the OS kernel to replace non-virtualized instructions with hypercalls that communicate directly with the hypervisor. Thus, when a privileged command is to be executed on the guest OS, it is delivered to the hypervisor (instead of the OS) by using a hypercall; the hypervisor receives this hypercall and accesses the hardware to returns the needed result (Figure 5). Xen is one of the systems that adopt paravirtualization technology. Figure 5: Paravirtualization architecture [32] The downside of paravirtualization is that the guest must be modified to integrate hypervisor awareness. This is a limitation as some operating systems do not allow such modifications (e.g. Windows 2000/XP), and even the ones that can be modified may need additional resources for maintenance/troubleshooting [32] Hardware Assisted Virtualization Hardware Assisted Virtualization allows VMM to run directly on the hardware. In this case, VMM controls the access of the guest OS to the hardware resources. As depicted in Figure 6, privileged and sensitive calls are sent directly to the hypervisor, removing the need for binary translation and paravirtualization. VMWare ESX Server is one of the main competing VMMs that use this approach [29]. 27

28 Figure 6: Hardware assisted virtualization architecture [32] 4.5.Virtual Machine Manager As defined before, hypervisor or VMM is the layer between the operating system and a guest operating system or the layer between the hardware and the guest operating systems. In [25], the author has set three main features that need to be maintained by VMM. First feature demonstrates that VMM has to provide an environment that is identical with the original machine that we want to virtualize. Second feature shows that programs running on VM or original machine should show the same performance, or, with some minor decrease. Finally, last feature states that VMM needs to control all system resources provided to VMs Hypervisor Types Hypervisors are classified into Type 1 Hypervisor and Type 2 Hypervisor. Type 1 runs directly on the system hardware, and therefore they monitor the operating system guests and they allocate all the needed resources including disk, memory, and CPU and I/O peripherals. Having no intermediary between Type 1 hypervisor and the physical layer has led to an efficient performance in terms of hardware access and security level (Figure 7-a). On the other hand, Type 2 hypervisor runs on host operating system that provides virtualization services such as I/O and memory management (Figure 4-b). Having an intermediary layer between the hypervisor and the hardware makes the installation process easier than Type 1 hypervisor since the operating system is in charge of hardware configuration such as networking and storage [33]. 28

29 Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor [33] The differences between Type 1 and Type 2 hypervisor can lead to different performance results. The layer between the hardware and the hypervisor in Type 2 makes the performance less efficient than in Type 1. A sample scenario that illustrates this difference is when a virtual machine requires a hardware interaction (reading from disk); in this case, Type 2 hypervisor needs first to pass the request to the operating system and then the hardware layer. Besides performance efficiency, the reliability of Type 1 hypervisor is higher than in Type 2 reliability. For instance, the failure in operating system can directly affect the hosted guests in Type 2 hypervisor; therefore, the availability of hypervisor type 2 is highly related to the operating system availability. However, hypervisor type 2 has some advantages which consist in having fewer hardware/driver issues as the host operating system is responsible for interfacing with the hardware [34] Examples of Hypervisors a) Xen Hypervisor Xen hypervisor is a Type 1 or bare metal hypervisor that is widely used for paravirtualization [35]. It is managed by a specific privileged guest (privileged VM) called Domain-0 (Dom0). Dom0 runs on the hypervisor, and it is responsible of managing all aspects of other unprivileged virtual machine that are known as DomainU (DomU). Furthermore, Dom0 has direct access for the resources on the physical, which is not the case for DomU guests [36]. Overall architecture of Xen hypervisor is shown in Figure 8. 29

30 Figure 8: Xen hypervisor architecture Xen uses paravirtualization as well as full virtualization. In paravirtualization, DomU are referred to DomU PV Guests, and they can be modified Linux operating systems, Solaris, FreeBSD, and other UNIX operating systems [37]. DomU PV Guests are aware that they are running in a virtualized environment, and they don t have direct access to the hardware resources. In this case, the guest operating system is modified to make special calls (hypercalls) to the hypervisor for privileged operations, instead of the regular system calls in a traditional unmodified operating system. On the other, in full virtualization, DomU are referred to as DomU HVM Guests and run standard any unchanged operating system [37]. DomU HVM is not aware that it is sharing processing time on the hardware, and it is not aware of the presence of other virtual machines. In this case, DomU HVM requires processors which specifically support hardware virtualization extensions (Intel VT or AMD-V). Virtualization extensions allow for many of the privileged kernel instructions (which in PV were converted to "hypercalls") to be handled by the hardware using the trap-and-emulate technique. b) KVM Hypervisor KVM hypervisor provides a full virtualization solution based on Linux operating system. It works by reusing the hardware assisted virtualization extensions that were already developed. In this case, KVM requires the presence of Intel VT or AMD-V extensions on the host system. When KVM is loaded, it converts the kernel into a bare metal hypervisor. As a result, it takes; as mentioned above, a full advantage of many components which are already present within the kernel such as memory management and scheduling [38]. KVM is implemented using two main components; the first one is the KVM-loadable module that, when installed in the Linux kernel, provides management of the virtualization hardware (Figure 9). The second component provides PC platform emulation, which is offered by a modified version of 30

31 QEMU. QEMU is executed as a user-space process, coordinating with the kernel for guest operating system requests [39]. Figure 9: KVM hypervisor architecture c) VMware ESXi Hypervisor VMware was the first leader company that contributed to virtualization technology. One of its virtualization products is VMware ESXi which is installed directly on top of the physical machine [40]. VMware ESXi was introduced in 2007 to provide the highest levels of reliability and performance to companies of all sizes. The overall architecture of VMware ESXi is illustrated in Figure 10. The main component is the vmkernel which contains all the necessary processes to manage VMs. It provides certain functionality similar to that found in other operating systems, such as process creation and control, signals, file system, and process threads. Therefore, vmkernel supports running multiple virtual machines and provides some core functionalities like: Resource scheduling, I/O stacks and Device drivers [24]. Figure 10: VMware ESXi architecture [40] 31

32 Chapter 4: Big Data and High Performance Computing as a Service As big companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in terms of users and data generated, the capacity and computing power of current data tools lead to inefficient and insufficient data processing, analyzing, managing, and storing. IBM estimates that every day 2.5 quintillion bytes of data are created, and 90% of the data in the world today has been created in the last two years [41]. Besides, Oracle estimated that 2.5 zettabytes of data were generated in 2012, and it will grow significantly every year (Figure 11) [42]. The increase in data size to many terabytes and petabytes is known as Big Data. To handle the complexity of Big Data, HPC is adopted to provide high computation capabilities, high bandwidth, and low latency network. This chapter provides an overview of Big Data phenomena and HPaaS concept. Figure 11: Data growth over 2008 and 2020 [54] 5.1.Big Data Big Data Definition Big Data is defined as large and complex datasets that are generated from different sources including social media, online transactions, sensors, smart meters and administrative services [43]. Having all these sources, the size of Big Data goes beyond the ability of typical tools of storing, analyzing and processing data. Literature reviews on Big Data divided the concept into four dimensions: Volume, Velocity, Variety and Value [43]. 32

33 Volume: the size of data generated is very large, and it goes from terabytes to petabytes. Velocity: data grows continuously at an exponential rate. Variety: data are generated in different forms: structured data, semi-structured and unstructured data. These forms require new techniques that can handle data heterogeneity. Value: the challenge in Big Data is to identify what is valuable as to be able to capture, transform and extract data for analysis Big Data Technologies With Big Data phenomenon, there is an increasing demand for new technologies that can support the volume, velocity, variety and value of data. Some of the new technologies are NoSQL, parallel and distributed paradigms and new cloud computing trends that can support the four dimensions of big data. NoSQL (Not Only Structured Query Language) is the transition from relational databases to non-relational databases [44]. It is characterized by the ability to scale horizontally; the ability to replicate and to partition data over many servers, and the ability to provide high performance operations. However, moving from relational to NoSQL systems has eliminated some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability) [45]. In this context, NoSQL properties are defined by CAP theory [46] which states that developers must make trade-off decisions between consistency, availability and partitioning. Some example of NoSQL tools are: Cassandra [47], HBase [48], MongoDB [49] and CouchDB [50]. Other supporting technologies for Big Data are parallel and distributed paradigms (e.g. Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in the upcoming chapters (Part III- Chapter 8, 9) High Performance Computing as a Service (HPCaaS) HPCaaS Overview High Performance Computing (HPC) is used to process and analyze large and complex problems, including scientific, engineering and business problems that require high computation capabilities, high bandwidth, and low latency network [3]. HPC fits these requirements by implementing large physical clusters. However, traditional HPC faces a set 33

34 of challenges that consist in peak demand, high capital, and high expertise to acquire and operate the HCP [51]. To deal with these issues, HPC experts have leveraged the benefits of new technology trends including, cloud technologies, parallel processing paradigms and large storage infrastructures. Merging HPC with these new technologies has proposed new HPC model, called HPC as a service (HPCaaS). HPCaaS is an emerging computing model where end users have on-demand access to preexisting needed technologies that provide high performance and scalable HPC computing environment [52]. HPCaaS provides unlimited benefits because of the better quality of services provided by the cloud technologies, and the better parallel processing and storage provided by, for example, Hadoop Distributed System and MapReduce paradigm. Some HPCaaS benefits are stated in [51] as follow: High Scalability: resources are scaling up as to ensure essential resources that fit users demand in terms of processing large and complex datasets. Low Cost: End-users can eliminate the initial capital outlay, time and complexity to procure HPC. Low Latency: by implementing the placement group concept that ensures the execution and processing of data in the same rack or on the same server HPCaaS Providers There are many HPCaaS providers in the market. An example of HPCaaS provider is Penguin Computing [53] which has been a leader in designing and implementing high performance environments for over a decade. Nowadays, it provides HPCaaS with different options: ondemand, HPCaaS as private services and hybrid HPCaaS services. Amazon Web Services (AWS) [3] is also an active HPCaaS in the market; it provides simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different pricing models: On-Demand, Reserved [54] or Spot Instances [55]. HPCaaS on AWS is currently used for Computer Aided Engineering, molecular modeling, genome analysis, and numerical modeling across many industries including Oil and Gas, Financial Services and Manufacturing [3]. Other leaders of HPCaaS in the market are Microsoft (Windows Azure HPC) [56] and Google (Google Compute Engine) [57]. 34

35 Chapter 5: Literature Review and Research Contribution In order to bridge the gap between the present research and previous studies, a review was conducted on the current state of HPC and virtualization. Therefore, this chapter situates the research in relation to previous research publications and states clearly the research contribution Related Work There have been several studies that evaluated the performance of high computing in the cloud. Most of these studies used Amazon EC2 [20] as a cloud environment [58-63]. Besides, only few studies have evaluated the performance of high computing using the combination of both new emerging distributed paradigms and cloud environment [64]. In [58], authors have evaluated HPC on three different cloud providers: Amazon EC2, GoGrid Cloud and IBM Cloud. For each cloud platform, they run HPC on Linux virtual machines (VM), and they came up to the conclusion that the tested public clouds do not seem to be optimized for running HPC applications. This was explained by the fact that public cloud platforms have slow network connections between virtual machines. Furthermore, authors in [13] evaluated the performance of HPC applications in today's cloud environments (Amazon EC2) to understand the tradeoffs in migrating to the cloud. Overall results indicated that running HPC on EC2 cloud platform limits performance and causes significant variability. Besides Amazon EC2, a research done in [63] evaluated the performance-cost tradeoffs of running HPC applications on three different platforms. First and second platform consist of two physical clusters (Taub and Open Cirrus cluster), and the third platform consists of Eucalyptus. Running HPC on these platforms led authors to conclude that cloud is more costeffective for low communication-intensive applications. In order to understand the performance implications on HPC using virtualized resources and distributed paradigms, authors in [64] performed an extensive analysis using Eucalyptus (16 nodes) and other technologies such as Hadoop [7], Dryad and DryadLINQ [65], and MapReduce [6]. The conclusion of this research suggested that most parallel applications can be handled in a fairly and easy manner when using cloud technologies (Hadoop, MapReduce, 35

36 and Dryad); however, scientific applications, which require complex communication patterns, still require more efficient runtime support. Evaluating HPC without relating it to new cloud technologies was also performed using different virtualization technologies [66, 67, 68, 69]. In [66], authors performed an analysis of virtualization techniques including VMWare, Xen, and OpenVZ. Their findings showed that none of the techniques match the performance of the base system perfectly; yet, OpenVZ demonstrates high performance in both file system performance and industry-standard benchmarks. In [67], authors compared the performance of KVM and VMware. Overall findings showed that the VMWare performs better than KVM. Still, in few cases KVM gave better results than VMWare. In [68], authors conducted quantitative analysis of two leading open source hypervisors, Xen and KVM. Their study evaluated the performance isolation, overall performance and scalability of virtual machines for each virtualization technology. In short, their findings showed that KVM has substantial problems with guests crashing (when increasing the number of guests); however, KVM still has better performance isolation than Xen. Finally, in [69] authors have extensively compared four hypervisors: Hyper-V, KVM, VMWare, and Xen. Their results demonstrated that there is no perfect hypervisor. 5.2.Contribution So far, there are only few studies that compared different virtualization techniques and its impact on HPC in the cloud. The only study we found was done in [70], where authors compared the performance of adopting Xen, KVM and Virtual Box. Each virtualization technology was compared with bare-metal using a set of high performance benchmarking tools. The results of this research demonstrated that KVM is the best choice for HPC in the cloud because of its rich features and near-native performance. The contribution of this present research will fill the literature gap by examining the impact of virtualization techniques on HPCaaS using OpenStack as a cloud platform and Hadoop as a distributed and parallel system. 36

37 Part III: Technology Enablers This part explains the use of OpenStack and Hadoop as underlying technologies for this research. Hence, this part starts first with providing a qualitative study for selecting an appropriate cloud platform and distributed system; second chapter of this part introduces in details OpenStack components, and third chapter presents Hadoop and its main aspects. 37

38 Chapter 6: Technology Enablers Selection The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built after conducting a qualitative study of available tools in the market. We targeted mainly open sources to select appropriate cloud computing platform and distributed system. Hence, this chapter presents the analysis we followed in selecting cloud platform and distributed system. 6.1.Cloud Platform Selection To compare available cloud open sources, we tried to choose the most popular platforms. The selection of competing platforms was based on a study that compares the popularity of OpenStack, Opennebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12, the study showed that OpenStack has the largest total population index, followed by Eucalyptus, CloudStack, and Opennebula. Figure 12: Active cloud community population [71] Based on Figure 12, we selected to compare and study OpenStack, Opennebula and Eucalyptus. To adopt one of these cloud open sources, we used some other studies that compare their performance and quality [72-75]. In [72], authors compared some open and commercial cloud platforms. Concerning open platforms, they compared OpenNebula and Eucalyptus. To perform the comparison, they adopted a set of criteria, including storage, virtualization, network, management, security and vendor support. The results of the research showed that open-source and commercial solutions 38

39 can have comparable features, and that OpenNebula is the most feature-complete cloud platform when it is compared with Eucalyptus. [73] and [74] provide a comparison study of OpenStack and OpenNebula. In [73], authors compared the performance of both cloud platforms based on measuring the time when the cloud starts instantiating VMs and the time when they are ready to accept SSH connections. The findings of the research demonstrate that OpenStack is slightly better than OpenNebula due to smaller instantiation time. Moreover, the results showed that OpenStack is more suitable for high computing due to faster instantiation of large number of VMs. In [74], authors used qualitative and quantitative analysis to compare OpenStack and OpenNebula. For the qualitative analysis, they adopted some of the following criteria: security, virtualization supported, access, image support, resource selection, storage support, highavailability support and API support. Based on the results of the qualitative study, authors concluded that OpenStack would benefit in case of auto-scaling, while OpenNebula would benefit in case of persistent storage support. For the quantitative analysis, authors measured the deployment, network overhead and the clean-up time of VMs. The results of quantitative analysis showed that each platform can be used depending on user requirements and specifications. In [75], authors provided a comparative study of four solutions: Eucalyptus, OpenStack, OpenNebula and CloudStack. To perform the comparison, authors adopted the following criteria: storage, network, security, hypervisor, scalable and installation code openness. In short, the results of this study [75] showed that OpenStack is the preferred cloud open source. Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go for OpenStack as it is known for its flexibility and total openness. Table 2 : Cloud IaaS selection 39

40 6.2.Distributed and Parallel System Selection To compare available distributed and parallel systems in the market, we opted again for the popularity index of those systems. The selection of competing systems was based on a study done in [76]. The study is summarized in Figure 13 which compares the popularity index of Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The study was done in 2012, and it demonstrates the total downloads between January 2011 and March Figure 13 depicts that Hadoop is the most popular distributed system, followed by MongoDB and Cassandra. Figure 13: Active distributed systems population [76] Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in order to end up with one selected system for the present research. MongoDB is a document-oriented, uses a binary form of JSON called Binary JSON store data in tables with columns and rows. To provide high redundancy and make data highly available, MongoDB offers replication across multiple servers. While data is synchronized between servers using replication, MongoDB also facilitates the scale out option by supporting sharding which partitions a collection and stores the different portions on different machines. MongoDB can be built with MapReduce as to execute data in parallel at each shard [62]. On the other hand, Hadoop is an open source for distributed file system that supports processing, analyzing and storing large data sets across large clusters using MapReduce paradigm and HDFS [7]. More details about Hadoop are included in chapter 8. 40

41 A study done in [77] compares MongoDB and Hadoop systems. The study came up with three main conclusions; first, it is not appropriate to use MongoDB as an analytics platform; second, using Hadoop for MapReduce jobs is several times faster than using the built-in MongoDB MapReduce capability, and third, MongoDB is much slower than HDFS. Besides, a study was done in [78] did a comparison of Map-Reduce Performance of Hadoop and MongoDB. In short, the study showed that MongoDB is roughly four times slower than Hadoop in fully-distributed mode. Table 3 summarizes the selected distributed system in [77] and [78]. Based on this table, we decided to go for Hadoop as an analytical and storage tool for the present research. Table 3 : Parallel and distributed platform selection 41

42 Chapter 7: Openstack OpenStack is an open source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was developed by a wide range of developers and contributors using mainly Python (68%), XML (16%) and JavaScript (5%) [79]. This chapter provides detailed description of Openstack including, brief history; its components, the corresponding architecture, and finally some supported hypervisors. 7.1.OpenStack Overview The formal definition of OpenStack was stated in [80], which defines OpenStack as: a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface. From this definition, OpenStack is considered as an Infrastructure as a Service (IaaS). An important feature of OpenStack is that it provides a web interface called dashboard and APIs that make its services available via Amazon EC2 and S3 compatible APIs. This feature ensures that all existing tools that work with Amazon s cloud platform, can also work with OpenStack platform [81]. 7.2.OpenStack History OpenStack was a collaboration project between Rackspace Hosting and NASA. Both organizations planned to release internal cloud project object storage and compute. Rackspace contributed with their Cloud Files platform to support the storage part of OpenStack, while NASA contributed with their Nebula platform to support the compute part [82]. In July 2010, both organizations released the first version of OpenStack under Apache 2.0 License. In September 2012, OpenStack Foundation was established as an independent entity with the mission of protecting, empowering, and promoting OpenStack software. Now, OpenStack project is currently supported by more than 150 companies including AMD, Intel, Canonical, Red Hat, Cisco, Dell, HP, IBM and Yahoo! [83]. 7.1.OpenStack Releases OpenStack releases different versions with new improvement and contributions. All OpenStack releases since 2010 are listed in Table 4 [79]. 42

43 Table 4 : OpenStack releases [79] 7.3.OpenStack Components The core components of OpenStack software are: OpenStack Compute Infrastructure (Nova); OpenStack Object Storage Infrastructure (Swift) and OpenStack Image Service Infrastructure (Glance). Besides these components, OpenStack include Identity Service (Keystone), Network Service (Quantum), Dashboard Service (Horizon) and Block Storage (Cinder). Table 5 summarizes the main components of OpenStack and the corresponding code name. Table 5 : OpenStack projects Taking into consideration the previous mentioned OpenStack components, a conceptual architecture of OpenStack is provided in Figure 14 which shows how OpenStack components are interconnected [79]. 43

44 Figure 14: OpenStack conceptual architecture [79] OpenStack Compute (Nova) Nova provides flexible management for virtual machines by allowing users to create, update, and terminate virtual machines. The overall architecture of Nova (Figure 15) is composed of the following sub-components: nova-api, nova-scheduler, nova-compute, nova-volume, queue and database [82]. Figure 15: Nova subcomponents 44

45 Nova-api is responsible of accepting and fulfilling the API requests. A request consists of actions that will be performed by nova subcomponents. In order to accept an API request, nova-api provides an endpoint for all API queries and enforcing some policies. If the request is about managing virtual machines, the nova-compute is involved to be in charge of creating or terminating a virtual machine instances. Normally, nova-compute receives requests from the queue sub-component. In order to manage virtual machine instances, nova-compute uses different ways and drivers such as libvirt software package, Xen API, vsphere API, etc. to support virtualization technologies. To specify where to send a request, nova-scheduler retrieves the request from the queue and determines which compute server host it should run on. In case there is a need for memory space, nova-volume does the creation, attachment, and detachment of persistent volumes to virtual machine instances [82]. Nova also provides network management by its subcomponent nova-network. The latter accepts networking tasks from the queue and then performs system commands to manipulate the network. Nova-network defines two types of IP addresses: Fixed IPs and Floating IPs. Fixed IP is considered as a private IP that is assigned to an instance during its life cycle. On the other hand, floating IP is considered as a public IP that will be used for external connectivity. The network itself that is defined in nova-compute can be classified into three categories: Flat, FlatDHCP and VLAN network [82]. Flat assigns a fixed IP address to an instance and attaches that IP on common bridge (created by the administrator). FlatDHCP builds upon the Flat manager by providing DHCP services to handle instance addressing and creation of bridges. VLAN provides a subnet, and a separate bridge for each project. The range of IPs of a given project is only accessible within the VLAN. The last subcomponents of nova are queue and database. Queue is responsible of passing messages between nova sub-components to facilitate the communication between them. It is implemented using RabbitMQ. Nova database stores most of the configuration and run-time state of the cloud infrastructure; it contains a set of tables such as: instance types, instances in use, networks available, fixed IPs, projects and virtual interfaces [82] OpenStack Object Storage (Glance) Glance manages virtual disk images. It consists of three main sub-components glance-api, glance-registry and glance database (Figure 16). Glance-api accepts incoming API requests 45

46 and then communicates them to other components (glance-registry and image store). All information about images is stored in glance-database. Last component which is glanceregistry is responsible of retrieving and storing metadata about images [82]. Figure 16: Glance subcomponents OpenStack Identity Service (Keystone) Keystone authorizes users access to OpenStack components. It supports multiple forms of authentication including standard username and password credentials and token-based systems. Keystone architecture is represented by the following subcomponents (Figure 17): token backend, catalog backend, policy backend and identity backend [82]. Figure 17: Keystone subcomponents OpenStack Object Store (Swift) Swift is the oldest project within OpenStack, and it is the underlying technology that powers Rackspace s Cloud Files service [82]. Swift provides a massively scalable and redundant object store by writing multiple copies of each object to multiple and separated storage 46

47 servers as to handle failures efficiently. Swift component consists of Proxy Server, Account Server, Container Server, and Object Server (Figure 18). Figure 18: Swift subcomponents Swift-proxy accepts incoming requests that consists of uploading files, making modifications to metadata and creating containers. Requests are served by account server, container server or object server. Object servers request about managing pre-existing objects or files in the storage; account server manages accounts defined with the object storage service, and container server manages the mapping of containers, folders, within the object store service [82] OpenStack Block Storage Service (Cinder) Cinder allows block devices to be connected to virtual machine instances for better performance. It consists of the following sub-components: cinder-api, cinder-volume, cinderdatabase and cinder-scheduler (Figure 19). Cinder-api accepts incoming requests and directs them to the cinder-volume which performs reading or writing to the cinder database to maintain states and interacts with other processes. Cinder-scheduler is responsible of selecting the optimal block storage node to create the volume on. In order to maintain communication between cinder components, message queue is used. 47

48 Figure 19: Cinder subcomponents OpenStack Network Service (Quantum) Quantum allows users to create their own networks and then attach interfaces to them. It consists of quantum-server, quantum-account, quantum-plugin and quantum-database (Figure 20). Quantum-server accepts incoming API requests and then directs them to the correct quantum-plugin. Plugins and agents perform special actions such as plug/unplug ports, creating networks, subnets and IP addressing. Finally, quantum-database stores networking state for particular plugins. Figure 20: Quantum subcomponents 48

49 7.4.OpenStack Supported Hypervisors The abstraction feature provided by OpenStack Compute lead to support various existing hypervisors. Some of the supported hypervisors are listed as follow: KVM, LXC, QEMU, UML, VMWare ESX/ESXi, Xen, PowerVM, Hyper-V [79]. However, KVM is still the most widely used hypervisor in deploying OpenStack. Besides KVM, more existing deployments run Xen, LXC, VMWare and Hyper-V, but each of these hypervisors lack some features support or the documentation on how to use them with OpenStack is not well documented. 49

50 Chapter 8: Hadoop Hadoop has been adopted by big players in the market such as Google, Yahoo, LinkedIn, Facebook, New York Times, IBM, etc. [84]. This chapter provides a detailed overview of Hadoop, starting with a brief history of this open source, the corresponding architecture, implementation and some related features. 8.1.Hadoop Overview Hadoop is an Apache Java open source for distributed file system that supports processing, analyzing and storing large data sets across large clusters using MapReduce paradigm and HDFS [85]. Hadoop has been designed to be reliable, fault tolerant and scalable project that can scale up from one single machine to thousands of machines. 8.2.Hadoop History In 2002, Hadoop was created by Doug Cutting as an open source for web crawling and indexing, and it was first named Nutch project. Nutch was developed to handle searching issues, but it faced the scalability problem as it wouldn t scale up to billions of web pages. To deal with this issue, Nutch team got inspired by Google s distributed filesystem (GFS). By adopting GFS architecture in 2004, Nutch team has delivered an open source called Nutch Distributed Filesystem (NDFS) [86]. When Google published its paper about MapReduce algorithm, Nutch team has tried to get advantage of that work by introducing MapReduce to its NDFS system. Implementing both NDFS and MapReduce made Nutch as a powerful system for web crawling and indexing. This success has pushed Nutch team to build an independent project in 2006 named Hadoop project. By this time, Doug Cutting joined Yahoo!, which provided enough resources to improve Hadoop performance. Even if Yahoo! has developed and contributed to 80% of Hadoop project, Hadoop was made its own top-level project at Apache in January 2008 [87]. Besides implementing MapReduce and HDFS algorithms, Hadoop project includes other subprojects that are listed in Table 6 [85]. 50

51 Table 6: Apache Hadoop subprojects Hadoop subprojects are grouped and named Hadoop Ecosystem. The overall picture of Hadoop Ecosystem is illustrated in Figure 21. ETL Tools BI Reporting RDBMS Pig (Data Flow) Hive (SQL) Sqoop Zookeeper HBase MapReduce (Job Scheduling / Excution System) Avro HDFS (Hadoop Distributed File System) Figure 21: Apache Hadoop subprojects [85] 8.3.Hadoop Architecture Hadoop implements master/slave architecture, where master is named NameNode and slave is named DataNode. NameNode manages the file system namespace that consists of a hierarchy of files and directories used for data storage. When a file is created by client application, it is divided into blocks; each block is replicated and stored in DataNodes. In this case, information about the replicas numbers (number of block copies) and the mapping of replicas and blocks are stored in the NameNode. On the other hand, each DataNode is in charge of 51

52 managing storage attached to the node in which it is running on. Furthermore, each DataNode handles the read operation, write, block creation, deletion, and replication that come as instructions from the NameNode [86]. Besides NameNode and DataNodes, Hadoop cluster consists of Secondary NameNode (backup node for NameNode), JobTracker and TaskTracker. JobTracker is located in the master node, and it is responsible of distributing MapReduce tasks to other nodes in the cluster. On the other hand, TaskTracker runs locally tasks distributed by the JobTracker; each slave in the cluster contains one TaskTracker that can also run on master node [86]. The overall architecture of Hadoop is illustrated in Figure 22. Figure 22: Hadoop Architecture 8.4.Hadoop Implementation Hadoop is mainly implemented using HDFS and MapReduce paradigm. HDFS is used to store large data sets while MapReduce is used to analyze and process data across Hadoop cluster. Taking into consideration the architecture provided in Figure 22, HDFS concept is represented by the NameNode, Secondary NameNode and DataNodes, while MapReduce is represented by the JobTracker and TaskTracker (Figure 23). 52

53 Figure 23: HDFS and MapReduce representation HDFS Overview HDFS is designed as a hierarchy of files and directories. Each file is divided into blocks that are stored in different DataNodes. NameNode stores only the metadata that includes information about blocks locations and the number of copies of each block. Furthermore, HDFS allows NameNode to perform the namespace operations such as opening, closing and renaming files and directories. As stated before, HDFS performs data replication to ensure fault-tolerance. The replication factor is set when a file is created, and it can be modified later [85]. An example that illustrates the HDFS process is the read, write and creation operations. During the read operation, the HDFS request from the NameNode the list of DataNodes that host replicas of the blocks of a given file. The list is sorted by the network topology distance from the client. After deciding on the DataNode from where to fetch data, The HDFS client contacts directly the DataNode and requests the desired block. On the other hand, during the write operation, the HDFS asks the NameNode to choose DataNodes that will store replicas of the first block of the file, second block and so on as so far. For each block, the client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. Concerning the creation operation, when there is a request to create a file, the HDFS caches first the file into a temporary local file. When the latter accumulates data up to the HDFS block size, the HDFS 53

54 contacts the NameNode to insert the file name into the file system namespace and allocate a data block for it. After that, the NameNode selects the DataNodes that will host the data blocks. At this stage, the client moves the block of data from the local temporary file to the specified DataNode [85] MapReduce Overview Hadoop MapReduce is a programming paradigm that processes very large data sets in parallel manner on large clusters. It was first introduced by Google in 2004 [6]. The core idea of MapReduce is splitting the input data set into chunks that will be processed by map tasks in a parallel manner. The output of each map task is sorted to be then directed as an input to the reduce task. Taking into consideration the previous definition, MapReduce can be classified into two steps: map step and reduce step [88]. Map task process is divided by itself into five phases: read, map, collect, spill and merge. The read phase consists of reading the data chunk from the HDFS, and then creating the input key-value. Map phase is about executing the user-defined map function to generate the mapoutput data. Collect phase performs the collection of intermediate (map-output) data into a buffer before spilling. Spilling process sorts, performs compression, if specified, and writes to local disk to create file spills. The last step in the map task is the merge phase which merges all file spills into one single map output file [88]. Reduce task is also divided into four phases: shuffle, merge, reduce and reduce phase. Shuffle phase transfers the intermediate data (map output) from the mapper slaves to a reducer's node and decompressing if needed. Merge phase performs the merging of the sorted outputs that come from different mappers to be directed as the input to the reduce phase. Reduce phase executes the user-defined reduce function to produce the final output data. Finally, write phase compresses, if needed, and writes the final output to HDFS [88]. A popular example that illustrates the MapReduce execution is the Words Count example which counts the number of occurrence of each individual word in a given file (Figure 24) [89]. 54

55 Figure 24: Word count MapReduce example [89] 8.5.Hadoop Cluster Connectivity When Hadoop starts connecting, each DataNode performs a handshake with the NameNode. The purpose of this operation is to verify the namespace ID and the software version of the DataNode. The namespace ID is assigned to the filesystem instance when it is formatted, and it is stored in all nodes of the cluster. Nodes with a different namespace ID will not be able to be part of the cluster. However, if the namespace ID is the same, the handshake will be performed successfully between the DataNodes and the NameNode. At this point, each DataNode stores its unique storage ID, which is an internal identifier of the DataNode. The main purpose of this ID is to make the DataNode recognizable even if it is restarted with a different IP address or port [87]. During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. In case the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode as a dead node. In this case, NameNode creates new replicas of those blocks on dead DataNodes. In fact, heartbeats are not only used for ensuring NameNode-DataNodes connectivity, but it is also used to send statistical information such as total storage capacity, and fraction of storage in use. Another benefit of heartbeats is to send instructions from the NameNode to DataNodes. Those instructions include commands to replicate blocks to other DataNodes, remove local block 55

56 replicas, reregister and send an immediate block report, and shut down the node. These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations [87]. 56

57 Part III: Research Contribution To clarify the steps we followed in this study, we divided this part into four chapters 9, 10, 11 and 12. Chapter 9 defines the research methodology; chapter 10 describes the experimental setup that we used to get the performance of HPCaaS; chapter 11 presents the results we got from each experiment, and finally, chapter 12 discusses and analyzes the research findings. 57

58 Chapter 9: Research Methodology The choice of research methodology depends mainly on the nature of the research question. This chapter discusses the methodology that was followed in conducting the present study. It explains first the choice of the selected methodology, and then it demonstrates an overall picture of the research steps. 9.1.Research Approach The present research was based on a combination of qualitative and quantitative approach. Qualitative approach was followed to compare and select appropriate technology enablers for this research (Part III, Chapter 7), whereas quantitative approach was adopted to provide numeric measurements of HPC on physical cluster and virtualized clusters (Part IV, Chapter 10, 11 and 12), 9.2.Research Steps Figure 25 summarizes the steps followed in conducting the present research. Figure 25 : Research steps 58

59 Chapter 10: Experimental Setup In order to investigate the research question, we have conducted three main experiments. The first experiment evaluates the performance of HPC on Hadoop Physical Cluster (HPhC); the second experiment evaluates the performance of HPC using Hadoop Virtualized Cluster (HVC) with KVM, and the last experiment evaluates HPC using Hadoop virtualized cluster with VMware ESXi virtualization technology. This chapter describes the experiment setup used in this research; it provides an overall picture of the three adopted clusters; it specifies the hardware, software and network specifications; it introduces the benchmarks used to evaluate the performance of HPC on each cluster; it lists the datasets sizes used in each experiment, and finally, this chapter explains the experimental execution of the present research Experimental Hardware In our performance study, we have built 3 different clusters: Hadoop Physical Cluster, Hadoop Virtualized Cluster using KVM and Hadoop Virtualized Cluster using VMware ESXi. Each cluster is composed of eight machines. For the physical cluster, we used 8 Dell OptiPlex 755 Desktop computers with specifications listed in Table 7. For both Hadoop virtualized clusters (KVM and VMware ESXi), we used a Dell PowerEdge server with features listed in Table 8. On top of the server, we installed OpenStack to create eight virtual machines using KVM hypervisor and then VMware ESXi hypervisor. Because of some limited flexibility of OpenStack, we cloud create VMs with features described in Table 9. Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster) 59

60 Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster Table 9 : OpenStack virtual machines features 10.2.Experimental Software and Network As stated in chapter 6, we opted for Hadoop to process and store small and large datasets; we chose to install Hadoop version Concerning OpenStack, the version that was adopted is Folsom Release which supports KVM, Xen, VMWare and other hypervisors. Networking configuration was characterized by a bandwidth of 100Mbps per port Clusters Architecture In this section, we will conceptualize each individual cluster in terms of its layers and components Hadoop Physical Cluster Figure 26 and 27 show an overall picture of Hadoop Physical Cluster. The configuration was done in Linux Lab at AUI. The lab is connected to 1 Gbps switch (provides 100 Mbps per port) that is also connected to other offices in the building (where the lab is allocated). As 60

61 both figures depict, the cluster contains eight machines where one machine was selected to be the master and slave node at the same time. The reason behind choosing the master node to serve as both master and slave node is to increase the cluster performance when processing and storing datasets. Figure 26 : Hadoop Physical Cluster Figure 27: Hadoop Physical Cluster architecture Hadoop Virtualized Cluster KVM The second cluster we built in this research is Hadoop Virtualized Cluster with KVM technology. As Figure 28 shows, the first step in configuring the cluster is to install an operating system on Dell PowerEdge server; the OS that was selected is Ubuntu Precise

62 LTS- 64 bits. The next step is to install and configuring KVM packages which are loaded in Linux OS as KVM driver. After preparing the system with OS and KVM hypervisor, next step is to install OpenStack on top of the OS (OpenStack with KVM documentation is provided in Appendix A). Finally, last step is to configure Hadoop on top of each OpenStack VM instance (Hadoop documentation is provided in Appendix C). Figure 28: Hadoop virtualized cluster - KVM The first OpenStack component that needs to be installed is the keystone which manages the authentication to OpenStack resources. After downloading and installing the keystone package, the next step is to create tenants (OpenStack projects) and OpenStack users that are associated to one or more tenants. Each user can be a member or an admin in a given project; in this case, roles need to be created in order to set rights and privileges to each user. After creating users, tenants and roles, next step is to create OpenStack services (nova, keystone, and glance service) that provide one or more endpoints (URLs) through which users can access OpenStack resources. The second component to install is OpenStack glance which allows creating and managing different formats of images (Ubuntu, Fedora, Windows, etc.) Glance packages include glance-api that accepts incoming API requests; glance-database that stores all information about images, and finally glance-registry that is responsible of retrieving and storing metadata about images. Third component to deploy in OpenStack is the Nova package which includes nova-compute, nova-scheduler, nova-network, novaobjectstore, nova-api, rabbitmq-server, novnc and nova-consoleauth. All these components collaborate and communicate with each other to create and manage instances, networks and, if needed, volumes. Finally, to have access to instances, a user friendly insterface can be 62

63 installed through configuring OpenStack dashboard or Horizon. After login to OpenStack Dashboard, the user can launch instances with the possibility of specifying the number of CPUs, disk space, total RAM memory per VM, etc. After creating VM instances (with requirements listed in Table 9), we installed Hadoop on each VM. Hadoop configuration starts with identifying the master node and slave nodes. For master node, there are six files that need to be configured: core-site, hadoop-env, hdfs, mapred-site, master and slaves files. Concerning slave nodes, the only files that need to be configured are hadoop-env, core-site, hdfs and mapred-site files. When connecting nodes, the cluster needs to be formatted as to clean the file namespace. After formatting Hadoop, the cluster can be started to run jobs Hadoop Virtualized Cluster VMware ESXi The third cluster that was built in this research is Hadoop Virtualized Cluster using VMware ESXi technology (Figure 29). The first step in configuring this cluster is to install VMware ESXi on top of Dell PowerEdge server. Then, OpenStack is configured on top of the hypervisor (OpenStack with VMware ESXi documentation is provided in Appendix B). After configuring OpenStack, instances can be then created to build Hadoop cluster. Figure 29: Hadoop virtualized cluster VMware ESXi (a) In fact, when installing OpenStack with VMware ESXi, Openstack is installed as a VM on top of VMware ESXi hypervisor. Then, through OpenStack dashboard, instances can be created as VMs on top of VMware ESXi hypervisor (Figure 30). 63

64 Figure 30 : Hadoop virtualized cluster VMware ESXi (b) 10.4.Experimental Performance Benchmarks To evaluate the impact of machine virtualization on HPCaaS, we adopted two main known benchmarks: Terasort and TestDFSIO benchmarks [90]. TeraSort performance metrics consist of measuring the average time to sort certain datasets, while TestDFSIO performance metrics consist of measuring the execution time to write and read datasets. Table 10 summaries the performance metrics used in evaluating HPCaaS. Table 10 : Experimental performance metrics TeraSort Description TeraSort was developed by Owen O Malley and Arun Murthy at Yahoo Inc [90]. It won the annual general purpose terabyte sort benchmark in 2008 and It does considerable computation, networking, and storage I/O, and is often considered to be representative of real Hadoop workloads [90]. Terasort is divided into three main steps: Teragen, Terasort and Teravalidate. 64

65 Teragen generates random data that will be sorted by Terasort. It writes the generated data as a file of n rows, where each row is 100 bytes. Each row is formatted as follow: 10 bytes key, 10 bytes rowid and 78 bytes filler, where keys are random characters from the set.. ~, rowid is an integer that specifies the row id, and filler consists of 7 runs of 10 characters from A to Z. When data is generated, TeraSort sorts this data using quicksort algorithm. The latter is integrated with map/reduces tasks to use a sorted list of n-1 sampled keys that define the key range for each reduce [9]. Finally, Teravalidate ensures that the output data of TeraSort is sorted. It creates one map task per file in TeraSort s output directory; in this case, each map task ensures that each key is less than or equal to the previous one. Furthermore, map task generates records with the first and last keys of the file; then the reduce tasks ensures that the first key of file i is greater than the last key of file i 1. If there is any unordered keys, Teravalidate reports this as an output of the reduce task [90]. (TeraSort benchmark is documented in Appendix D) TestDFSIO Description TestDFSIO benchmark is used to check the I/O rate of Hadoop cluster with write and read operations. Such benchmark can be helpful for testing HDFS by checking network performance, and testing hardware, OS and Hadoop setup [90]. TestDFSIO is written in Java, and its source code can be found in [91]. TestDFSIO is composed of TestDFSIO-Write and TestDFSIO-Read. Both operations are performed by specifying the number of files and the size of each file in megabyte [90]. (TestDFSIO benchmark is documented in Appendix D) 10.5 Experimental Datasets Size In each experiment, we measured the performance of Hadoop cluster using different dataset sizes. For TeraSort, we used 100 MB, 1 GB, 10 GB and 30 GB datasets, and for TestDFSIO, we used 100 MB, 1 GB, 10 GB and 100 GB datasets. Table 11 summarizes the dataset sizes used in this research. Table 11 : Datasets size used for Hadoop benchmarks 65

66 10.6 Experiment Execution We started conducting each experiment by scaling the cluster from three machines up to eight machines. In other words, we test each benchmark on three machines, four machines until we reached eight machines. Furthermore, for each individual benchmark, we performed three tests on 100MB, 1GB, 10 GB and 30 GB (TeraSort) and 100MB, 1GB, 10 GB and 100 GB (TestDFSIO), then we calculated the mean to avoid any outliers and to provide more accurate results. Figure 31 simplifies the steps of running experiment 1 on HPhC using Terasort benchmark. Figure 31 : Experimental execution 66

67 Chapter 11: Experimental Results This chapter presents the findings we got from running each experiment. It presents the results of running HPC on HPhC; on HVC with KVM, and then the results of running HPC on HVC using VMware ESXi. Last section, compares the results we got from running each experiment. (The results we got from running experiments are listed in Appendix E and F) 11.1.Hadoop Physical Cluster Results TeraSort Performance on HPhC Running TeraSort benchmark showed that it needs much time to sort large datasets like 10 GB and 30 GB. Yet, scaling the cluster to more nodes led to significant time reduction in sorting datasets. The results we got from running this benchmark on Hadoop Physical Cluster are listed in Table 12 and conceptualized in Figure 32. Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes- Hadoop Physical Cluster Figure 32: TeraSort performance on Hadoop Physical Cluster 67

68 Figure 33 and 34 illustrate clearly the benefit of scaling the cluster. For instance, running 100MB with 3 nodes needs around seconds, while with 8 nodes, it needs seconds (reduced by 6%). In the case of 1GB, the average time was reduced by 4% when scaling from 3 to 8 nodes. Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster Figure 34 : TeraSort performance for 1 GB on Hadoop Physical Cluster Concerning 10GB, the results were somehow different (Figure35). Sorting 10 GB was reduced by 18.55% when scaling from 3 to 6 machines. Yet, increasing the number of machines to 8 nodes led to significant reduction in sorting performance. This can be explained by the impact of network bottleneck, especially that Hadoop is highly influenced by this issue. Furthermore, the impact of 8 nodes was important when running large datasets like 30 GB (Figure 36). For this case, the average time to sort the dataset was reduced by 24.77% (difference of 42 minutes) when increasing the number of nodes from 3 to 8. Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster Figure 36 : TeraSort performance for 30 GB on Hadoop Physical Cluster 68

69 TestDFSIO- Write Performance on HPhC Running TestDFSIO-Write on Hadoop physical cluster follows in general one pattern. Meaning, as the number of VMs increases, the average time decreases when writing different dataset sizes. Table 13 and Figure 37 list and illustrate the results we got from running TestDFSIO-Write on HPhC. Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes- Hadoop Physical Cluster Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster Zooming on TestDFSIO-Write for 100MB dataset (Figure 38), the average time for running TestDFSIO-Write decreased as the number the of slaves increases. In this case, scaling the cluster from 3 machines (including the master) to 8 machines led to a reduction of 11.25% in overall writing average time. The same observation is applied when running TestDFSIO- Write for 1GB dataset (Figure 39) where the average time was reduced by 46.5 % when scaling from 3 to 8 slaves. 69

70 Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster Figure 39 : TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster Figure 41 : TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster When running 100 GB (Figure 41), we observe a sharp time reduction in running the TestDFSIO-Write when scaling from 3 to 8 slaves; this reduction was quantified by 12.53%. However, an expected average time was increased when scaling from 4 to 5 machines. Again, this unexpected result can be explained by the overall network performance TestDFSIO- Read Performance on HPhC Running TestDFSIO-Read led also to significant performance improvement when the physical cluster was scaled up to 8 machines (Table 14 and Figure 42). In general, this observation is applied for all dataset sizes. 70

71 Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes- Hadoop Physical Cluster Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster When the cluster was scaled from 3 to 7 nodes, the average time for reading 100MB (Figure 43) was reduced by 4.36% and 2.46% when reading 1GB (Figure 44). However, when scaling the cluster from 7 to 8 machines, the average time increased suddenly when reading both 100MB and 1GB. The same observation was made when reading 10GB and 100GB (Figure 45 and 46). Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster Figure 44 : TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster 71

72 Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster Figure 46 : TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster 11.2.Hadoop Virtualized Cluster- KVM Results TeraSort Performance on HVC-KVM Running TeraSort on Hadoop KVM Cluster showed an important improvement in sorting various dataset sizes. Yet, this observation is applied when scaling the KVM cluster from 3 to 5 VMs. The results we got from running this benchmark on Hadoop KVM Cluster are listed in Table 15 and conceptualized in Figure 47. Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes- Hadoop KVM Cluster Figure 47: TeraSort performance on Hadoop KVM Cluster 72

73 From Figure 48, sorting 100MB on 3 VMs takes around 15 seconds, and it decreases by 2.2% and 5.5% when sorting the dataset on 4 and 5 VMs respectively. Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster Figure 49 : TeraSort performance for 1 GB on Hadoop KVM Cluster When sorting 1GB, 10 GB and 30 GB (Figure 49, 50 and 51), the performance was slightly improved as the number of VMs increases. For example, sorting time of 10GB was decreased by 0.3%, and sorting time of 30 GB was decreased by 5% when scaling from 3 to 4 nodes. However, when the cluster was scaled to 5, 6, 7 and 8 nodes, the overall performance of sorting 1GB, 10 GB and 30 GB was sharply decreased. Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster Figure 51 : TeraSort performance for 30 GB on Hadoop KVM Cluster 73

74 TestDFSIO-Write Performance on HVC-KVM Running TestDFSIO-Write on Hadoop KVM was slightly improved as the number of VMs increases. The results of running TestDFSIO-Write are listed in Table 16 and illustrated in Figure 52. Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes- Hadoop KVM Cluster Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster For all dataset sizes (Figure 53, 54, 55 and 56), as stated before, the overall performance was slightly improved as the number of VMs increased from 3, 4 and 5. For instance, writing 10GB was improved by 1.6% when scaling from 3 to 5 VM. Furthermore, when trying to write 100GB, the system was crashed because of the overall system overhead (Figure 56). 74

75 Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster Figure 54 : TestDFSIO-Write performance for 1GB on Hadoop KVM Cluster Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster Figure 56 : TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster TestDFSIO- Read Performance on HVC-KVM TestDFSIO- Read has the same behavior as TestDFSIO-Write. Meaning, the performance of reading different dataset sizes increases as the number of VMs increases from 3 to 5. The results we got from running TestDFSIO- Read are illustrated in Table 17 and Figure

76 Table 17 : Average time (in seconds) of running TestdFSIO-Read on different dataset sizes and different number of nodes- Hadoop KVM Cluster Figure 57: TestdFSIO-Read performance on Hadoop KVM Cluster As Figure 58, 59, 60 and 61 depict, the overall performance of reading different dataset sizes increases as the number of VMs increases from 3 to 5. For example, the average time for reading 100GB was slightly decreased by 3% when scaling from 3 to 5 VMs. Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster Figure 59 : TestDFSIO-Read performance for 1GB on Hadoop KVM Cluster 76

77 Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster Figure 61 : TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster 11.3.Hadoop Virtualized Cluster- VMware ESXi Results TeraSort Performance on HVC-VMware ESXi Table 18 and Figure 62 present the performance of running TeraSort on Hadoop VMware ESXi Cluster; the overall observation shows significant improvement in sorting various dataset sizes. In contrast to KVM cluster, VMware ESXi keeps decreasing the average time of storing as the number of VMs increases from 3 to 6 (for large datasets). Table 18 : Average time (in seconds) of running TestdFSIO-Read on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster 77

78 As Figure 63 depicts, the performance of sorting 1 GB was decreased by 23% when scaling the cluster from 3 to 6 VMs. Yet, the performance starts degrading as the number of VMs increases from 6 to 7 and 8. Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster Figure 64 : TeraSort performance for 1GB on Hadoop VMware ESXi Cluster A significant high performance was observed when sorting 30GB (Figure 66). The performance was increased by 34% from 3 to 6 VMs, 25% from 3 to 7 VMs and 3% from 3 to 8 VMs. Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster Figure 66 : TeraSort performance for 30GB on Hadoop VMware ESXi Cluster 78

79 TestDFSIO-Write Performance on HVC-VMware ESXi Running TestDFSIO-Write on Hadoop VMware ESXi was improved as the number of VMs increases to 7. The results of running TestDFSIO-Write are listed in Table 19 and illustrated in Figure 67. Table 19 : Average time (in seconds) of running TestdFSIO-Write on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster For all dataset sizes (Figure 68, 69, 70 and 71), the overall performance was improved as the number of VMs increases from 3 to 7. For instance, writing 100 MB was improved by 37% when scaling from 3 to 7 VMs. Furthermore, when writing large dataset like 10GB, the overall performance increased by 12% when scaling from 3 to 7 VMs. However, for the case of 100GB, the performance started degrading when scaling from 6 to 7 and 8 VMs. 79

80 Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster Figure 69 : TestDFSIO-Write performance for 1GB on Hadoop VMware ESXi Cluster Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster Figure 71 : TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster TestDFSIO- Read Performance on HVC-VMware ESXi TestDFSIO- Read behaves as TestDFSIO- Write when the performance of reading different dataset sizes increases as the number of VMs increases from 3 to 7. However, the average time for reading different datasets was less than writing operation (by more than half). The results we got from running TestDFSIO- Read on VMware ESXi are listed in Table 20 and conceptualized in Figure

81 Table 20 : Average time (in seconds) of running TestdFSIO-Read on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster Figures 73, 74, 75 and 76 show the performance of running TestDFSIO-Read on each individual dataset. For most dataset sizes, the performance was improved as the number of VMs inreased up to 7. For instance, the performance of reading 100GB was improved by 36% when scaling from 3 to 7 VMs. However, reading 1GB behavied differently as the correspondding performance started to decline at VM 6. Figure 73: TestDFSIO- Read performance for 100 MB on Hadoop VMware ESXi Cluster Figure 74 : TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster 81

82 Figure 75: TestDFSIO- Read performance for 10 GB on Hadoop VMware ESXi Cluster Figure 76 : TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster Results Comparison TeraSort Performance The overall performance of the 3 clusters varies depending on the datasets size and the number of nodes involved in each cluster. Yet, Hadoop VMware ESXi cluster was performing much better than other clusters when running TeraSort benchmark on large datasets. Starting with 100MB (Figure 77), TeraSort showed high performance when being virtualized with VMware ESXi and KVM. Both clusters were 25% (VMware ESXi) and 30% (KVM) faster than the physical cluster (in case of 3 nodes). Further, a significant performance was achieved when scaling the cluster to 4, 5 and 6 nodes; in this case, both KVM and VMware ESXi were faster than the physical cluster. After increasing the number of nodes to 7 and 8, VMware ESXi performance decreases by 33% and becomes slower than the physical cluster by 18% (when scaling from 3 to 8 nodes). On the other hand, the average time of sorting 100MB dataset on KVM cluster declined as the number of nodes increases to 7 and 8, and therefore, the sorting performance was improved from 15 to 14 seconds. Further, virtualized cluster (KVM) was performing better than the physical cluster by 29.5% and 27% for 7 and 8 nodes respectively. 82

83 Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and VMware ESXi When increasing the dataset size, the performance changes in each scenario (dataset size and number of nodes). In the case of 1GB (Figure 78), virtualized cluster was keeping the best performance when compared with the physical cluster. When the cluster was composed of 3-5 nodes, virtualized clusters sort the 1GB dataset with a range of seconds, while the physical cluster sorts the same dataset with a range of seconds. When increasing the number of nodes from 5 to 8, VMware ESXi was faster than other clusters; however, KVM knew a decline in its performance when being compared with KVM cluster of 3-4 nodes and when being compared with the physical cluster. For instance, in the case of 8 machines, physical cluster was faster than KVM cluster by 89%. Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi 83

84 The same observation on 1GB can be applied when sorting 10GB dataset (Figure 79). Yet, in this case, the performance of virtualized clusters was very high than the physical cluster. For instance, in the case of 5 VMs, VMware ESXi cluster was faster than physical cluster by 60%, and KVM was faster than physical cluster by 51%. Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and VMware ESXi When moving to larger datasets, VMware ESXi cluster proved its significant performance in sorting the 30 GB dataset (Figure 80). For instance, in the case of 4 nodes, VMware ESXi was faster than KVM cluster by 28% and faster than physical cluster by 61%. Moreover, KVM was performing better than physical cluster when the cluster was composed of 3, 4, 5 and 6 nodes. Afterward, when increasing the cluster size to 7 and 8 nodes, the KVM cluster decreased in its performance and became slower than the physical cluster. Figure 80 : Average time for sorting 30 GB on HPhC, HVC with KVM and 84

85 VMware ESXi The last observation consists in VMware ESXi performance on 8 nodes cluster. For all different datasets, we observed that VMware performance degraded; for example, for 10 GB, the performance decreased by 51%. Even though, VMware ESXi kept performing better than other clusters TestDFSIO- Write Performance The results we got from TestDFSIO were different than the ones in TeraSort benchmark. The overall observation of Figure 81 and 82 shows that virtualization is still performing better than the physical cluster. In the case of 3-5 nodes cluster, we can observe that KVM cluster performance is much better than VMware ESXi and physical cluster. For instance, writing 100 MB using 5 nodes, KVM cluster was 11% faster than physical cluster and 24% faster than VMware ESXi cluster (Figure 81). However, we observed that physical cluster was performing better than VMware ESXi, and the difference was quantified by 48% seconds (100 MB using 5 nodes). When scaling the cluster from 5 to 8 nodes, the KVM cluster knew sharp performance degradation. Again, this is due to system overhead. In this case, the physical cluster showed better results than virtualized clusters. Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi Figure 82 : Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi The same observation is applied when sorting 100 GB (Figure 83). The only difference is that KVM cluster with 8 nodes was unable to write the 100 GB. 85

86 Figure 83: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi Figure 84 : Average time for writing 100 GB on HPhC, HVC with KVM and VMware ESXi TestDFSIO- Read Performance As illustrated in Figure 84 and 85, reading small datasets (100MB and 1GB) showed that virtualized cluster is faster than physical cluster. Yet, this applied for KVM cluster when it is composed of 3-5 nodes. Afterwards, when KVM clusters scaled to 6, 7 and 8 nodes, the performance of reading all datasets degraded. On the other hand, physical cluster performed better than VMware ESXi in all case (100MB and 1G on different number of nodes). Figure 85: Average time for reading 1 GB on HPC, HVC with KVM and VMware ESXi Figure 86 : Average time for reading 1 GB on HPC, HVC with KVM and HVC VMware ESXi When increasing the dataset size to 10 GB and 100GB (Figure 86 and 87), we can see different performance trends. When the cluster is composed of 3-5 nodes, KVM cluster kept better performance than other clusters. For instance, for 100 GB and 3 nodes, KVM cluster 86

87 was faster than VMware ESXi by 12% and faster than physical cluster by 44%. However, as other benchmarks (TeraSort and TestDFSIO-Write), KVM cluster showed a sharp degradation in reading 100GB when the cluster scaled to 6, 7 and 8 nodes. When reading 10GB and 100 GB, in contrast to TestDFSIO-Write results, VMware ESXi cluster was faster than physical cluster in all scenarios (number of nodes). For instance, using a cluster of 3 nodes; VMware ESXi was faster than the physical cluster by 36% and 55.5% in the case of 7 and 8 nodes respectively. An important observation is that KVM cluster with 8 VMs was unable neither to write nor to read 100GB dataset (Figure 87). Figure 87: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi Figure 88 : Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi 87

88 Chapter 12: Discussion The results we got in this research proved significant improvements when virtualizing HPC, especially when the latter was tested with TeraSort benchmark; in this case, we found that both virtualized clusters (KVM and VMware ESXi) have better performance than physical cluster TeraSort Performance When running TeraSort benchmark, VMware ESXi cluster proved to have fast sorting of large datasets starting from 1GB, 10 GB and 30 GB. For instance, sorting 30GB using a cluster of 4 nodes showed that VMware ESXi is faster than KVM by 64% and faster than physical cluster by 84% (Figure 80). Concerning the KVM cluster, it was also proved to be faster than the physical cluster. However, when the number of nodes increases in virtualized clusters, the performance of TeraSort degraded significantly. In the case of KVM cluster, when the number of nodes increases to 6, 7 and 8, the overall performance of running TeraSort became slower. In fact, the reason behind facing this degradation is explained by the system overhead, especially disk overhead. A study was done in [92] performed an analysis of KVM scalability in OpenStack platform, and it state that KVM is not recommended to be used when many virtual hard disks will be accessed at the same time. Therefore, since TeraSort has both computational and I/O jobs, KVM VMs affected the overall performance when they were scaled to 6, 7 and 8. Moreover, another study was done in [93] states that KVM has substantial problems with guests crashing when it reaches a certain number of VMs (4 for this study [93]); hence, scalability is considered an issue for system overhead when using KVM virtualization. In the case of VMware ESXi cluster, the performance of running TeraSort declines when the cluster was scaled to 8 nodes. The same as KVM, the reason is due to system overhead. However, the system overhead is not related to scalability issue because VMware ESXi is known to be scalable [94]. To make sure from the cause that led to system overhead, we tracked the performance of sorting 30GB dataset on 8 VMware ESXi VMs (using VMware vsphere Client), and we found that, at some point, the memory required to sort the dataset exceeds the available memory offered by the cluster. This can be observed in Figure 88 which illustrates that active memory (in red, memory currently consumed by VMs) is higher than the granted memory (in grey, memory provided by the hosting hardware) between 5:05 and 5:10 88

89 PM range. Another proof that confirms the system overhead is the latency rate; in this case, we tracked the latency of running 30 GB on 8 VMs, and we observed that system latency reaches its peak (Figure 89) when sorting this dataset. Thus, latency impacts the overall performance when the number of VMs increases to 8. The last proof was reported by OpenStack Dashboard (Figure 90) which showed warning state of resources usage after creating 8 VMware ESXi instances. In short, VMware ESXi cluster performance declines at 8 VMs because of resources shortage. Figure 89: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs Figure 90 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs 89

90 Figure 91: OpenStack warning statistics about system resources usage In short, Even if TeraSort performance decreases when the number of VMware ESXi VMs increases to 8, the results we got still confirm that Hadoop VMware ESXi cluster is better than Hadoop KVM Cluster and Hadoop Physical Cluster TestDFSIO Performance The performance behavior of each cluster changed when running TestDFSIO benchmark. For all dataset sizes, KVM cluster proved to have high performance than other clusters when performing both TestDFSIO-Write and TestDFSIO-Read (Figures 81-87). On the other hand, VMware ESXi showed the lowest performance when compared to KVM and physical cluster. In fact, the reason that explains the good results we got from running TestDFSIO on KVM is related to virtio API. The latter is integrated in KVM hypervisor to provide an efficient abstraction for I/O operations [95]. Virtio was studied in [96] and proved that it enhances KVM performance at I/O operations; the authors of this study [96] tested the performance of KVM (with virtio API) at I/O operations and compared it with VMware vsphere 5.1 performance. They concluded that KVM with virtio API achieves I/O rates that are 49% higher than VMware vsphere 5.1. When running TestDFSIO, we observed again that the performance of both virtualized clusters decreases as the number of VMs goes beyond 6 (KVM) and 7 (VMware ESXi). 90

91 12.3.Conclusion Brief, the overall performance of TeraSort and TestDFSIO proved that, first, virtualization has better performance than physical cluster, and, second, the selection of underlying virtualization technology can lead to significant improvements when performing HPCaaS. Therefore, in this research, VMware ESXi proved to have the best performance especially when running computational jobs (TeraSort). To deal with the issue of system overhead in virtualized clusters, HPCaaS needs to be run in a cloud environment that has balanced number of VMs. For this research, the reasonable number that provides high performance was 7 VMs for VMware ESXi and 5 VMs for KVM cluster. 91

92 Part IV: Conclusion This part summarizes the research objectives and findings and suggests some related future work. Bibliography of this report is listed after the conclusion, and finally, a set of appendices (OpenStack Documentation, Hadoop Documentation, Benchmarks Execution and Data Gathering) are provided at the end of this report. 92

93 Chapter 13 Conclusion and Future Work This project aimed at demonstrating the impact of running HPCaaS on different virtualization technologies, namely, KVM and VMware ESXi cluster. For that, we have built three main Hadoop clusters: Hadoop Physical Cluster, Hadoop Virtualized Cluster with KVM and Hadoop Virtualized Cluster with VMware ESXi. For virtualized clusters, we proposed to build Hadoop cluster on top of OpenStack platform. On each cluster, we run two known benchmarks: TeraSort and TestDFSIO. Each benchmark was tested on different dataset sizes and on different number of machines (from 3 to 8 machines). To ensure the credibility and reliability of the research, we performed three tests on each scenario; for instance, we tested TeraSort for 30GB on each cluster three times, and then we took the mean to avoid any outliers. The findings of this research clearly demonstrate that vitalized clusters can perform much better than physical cluster when processing and handling HPC, especially when there is less overhead on the virtualized cluster. We found that Hadoop VMware ESXi cluster performs better at sorting big datasets (more computations), and Hadoop KVM cluster performs better at I/O operations. Finally, this report includes detailed installation guides of OpenStack and Hadoop that will save time and facilitate the work for future students who want to work on related research. As future work, the possibilities for extending this research can go in different directions. The first proposed work is to conduct the research experiments using real HPC applications that can show precisely the impact of virtualization on HPCaaS. The second proposed future work is to conduct this research using other emerging virtualization technologies such as XEN, and Hyper-V. Third proposed future work is to see the impact of cloud platforms on improving the HPCaaS; meaning, another research can be conducted to see for example, if replacing OpenStack with another cloud infrastructure can lead to better results. Finally, since we got positive results about the impact of visualization on HPCaaS, this research can be investigated more by integrating its findings in other domains such as Smart Grid. 93

94 Bibliography [1] J. Gantz and D. Reinsel, The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, IDC IVIEW, pp. 1-16, 2012 [2] Gartner, Inc., Hunting and Harvesting in a Digital World, in Gartner CIO Agenda Report, pp. 1-8, 2013 [3] Amazon Web Services, High Performance Computing (HPC) on AWS, [4] J. Gantz and D. Reinsel, The Digital Universe Decade Are You Ready?, IDC IVIEW, pp. 1-15, 2010 [5] Ch.Vecchiola1, S. Pandey, R.Buyya, High-Performance Cloud Computing: A View of Scientific Applications, in the 10th International Symposium on Pervasive Systems, Algorithms and Networks I-SPAN, IEEE Computer Society, pp. 4-16, 2009 [6] J. Dean and S. Ghemawat, MapReduce: Simple Data Processing on Large Clusters, in OSDI, pp [7] Hadoop: [8] S. Krishman, M. Tatineni, and C. Baru, MyHaddop Hadoop-on-Demand on Traditional HPC Resources, in the National Science Foundation s Cluster Exploratory Program, pp. 1-7, 2011 [9] E. Molina-Estolano, M. Gokhale, C. Maltzahn1, J. May, J. Bent, S. Brandt, Mixing Hadoop and HPC Workloads on Parallel Filesystems, in the 4th Annual Workshop on Petascale Data Storage, pp. 1-5, 2009 [10] C. Cranor, M. Polte, and G. Gibson, HPC Computation on Hadoop Storage with PLFS, Parallel Data Laboratory at Carnegie Mellon University, pp. 1-9, 2012 [11] Y. Xiaotao, L. Aili, and Z. Lin, Research of High Performance Computing with Clouds, in the Third International Symposium on Computer Science and Computational Technology (ISCSCT), pp , 2010 [12] KVM: [13] VMware ESXi: [14] D. Boulter, Simplify Your Journey to the Cloud, Capgemini and SOGETI, pp. 1-8, [15] P. Mell and T. Grance, The NIST Definition of Cloud Computing, National Institute of Standards and Technology, pp. 1-3, 2011 [16] A. E. Youssef, Exploring Cloud Computing Services and Applications, Journal of Emerging Trends in Computing and Information Sciences, vol. 3, no. 6, pp , 2012 [17] T. Korri, Cloud Computing: Utility Computing over the Internet, Seminar on 94

95 Internetworking, pp. 1-5, 2009 [18] ISACA, Cloud Computing: Business Benefits with Security, Governance and Assurance Perspectives, pp. 1-10, 2009 [19] A. T. Velte, T. J. Velte, R. C. Elsenpeter, Cloud Computing, A practical approach,1st ed., USA : McGraw-Hills, 2009 [20] Amazon Web Services: [21] Google Cloud Platform: [22] Microsoft Cloud Services: trends/cloudcomputing/default.aspx?search=true#fbid=33s2kmnt99z [23] Open Source Software for Building Private and Public Clouds: [24] I. Menken, and G. Blokdijk, Cloud Computing Virtualization Specialist Complete Certification Kit - Study Guide Book and Online Course, Emereo Pty Ltd, 2009 [25] M. Portnoy, Virtualization Essentials, John Wiley & Sons, 2012 [26] K. Scarfone, M. Souppaya, and P. Hoffman, Guide to Security for Full Virtualization Technologies, National Institute of Standards and Technology, 2011 [27] D. Dale, Server and Storage Virtualization with IP Storage, Storage Networking Industry Association (SNIA), 2008 [28] D. Marinescu and R. Kroger; State of the Art in Autonomic Computing and Virtualization, Wiesbaden University of Applied Sciences, pp. 1-21,2007 [29] K. Koganti, E. Patnala2, S. Narasingu, J. Chaitanya,Virtualization Technology in Cloud Computing Environment, in International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 3, 2013 [30] N. Susanta and T. Chiueh, A Survey on Virtualization Technologies, Department of Computer Science at Stony Brook, 2006 [31] Virtualization: A Key to Virtualization World: [32] Virtualization Overview, white paper, VMware, 2006 [33] N. Alam, Survey on Hypervisors, School of Informatics and Computing at Indiana University, 2011 [34] C. D. Graziano, A Performance Analysis of Xen and KVM Hypervisors for Hosting the Xen Worlds Project, Digital Repository at Iowa State University, pp , 2011 [35] N. Yaqub, Comparison of Virtualization Performance: VMWare and KVM, Master Thesis, pp , 2012 [36] How Does Xen Work?, white paper, Xen, 2009 [37] O. Kulkarmi, N. Xinli, and P. K. Swamy, Cutting-Edge Perspective of Security Analysis for Xen Virtual Machines, International Journal of Engineering Research and 95

96 Development, vo. 2, no. 3, pp , 2012 [38] T. Hirt, KVM The Kernel-based Virtual Machine, Red Hat Inc., 2010 [39] M. T. Jones, Anatomy of a Linux Hypervisor, IBM Corporation, 2009 [40] VMware ESXi 5.0 Operations Guide, white paper, VMware, 2011 [41] M. K. Kakhani, S. Kakhani, and S. R. Biradar, Research Issues in Big Data Analytics, Vol. 2, No. 8, pp , 2013 [42] C. Hagen, Big Data and the Creative Destruction of Today s, ATKearney, 2012 [43] Oracle : Big Data for the Enterprise, white paper, Oracle Corp., 2013 [44] Oracle NoSQL Database, white paper, Oracle Corp., 2011 [45] S. Yu, ACID Properties in Distributed Databases, Advanced ebusiness Transactions for B2B-Collaborations, 2009 [46] S. Gilbert and N. Lynch, Brewer s conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, vol. 33, no. 2, p. 51, 2002 [47] A. Lakshman, P. Malik, Cassandra - A Decentralized Structured Storage System, ACM SIGOPS Operating Systems Review, vol. 44, no.2, pp , 2010 [48] G. Lars., Introduction, in HBase: The Definitive Guide, USA: O'Reilly Media, 2011 [49] MongoDB: [50] Apache CouchDB : [51] J.Bernstein, K. McMahon, Computing on Demand HPC as a Service: High Performance Computing for High Performance Business, white paper, Penguin Computing & McMahon Consulting. [52] Y. Xiaotao, L. Aili, Z. Lin, Research of High Performance Computing With Clouds, International Symposium Computer Science and Computational Technology, pp , 2010 [53] Self-service POD Portal: [54] Amazon Cloud Storage: [55] Amazon Cloud Drive: [56] Microsoft High Performance Computing for Developers: [57] Google Cloud Storage: [58] S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, Case Study for Running HPC Applications in Public Clouds, in Science Cloud '10, 2012 [59] K. R. Jackson, Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud, in Cloud Computing Technology and Science 96

97 (CloudCom), 2010 IEEE Second International Conference on, pp , 2010 [60] E. Walker, Benchmarking Amazon EC2 for High-Performance Scientific Computing, Texas Advanced Computing Center at the University of Texas, pp , 2008 [61] J. Ekanayake and G. Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, School of Informatics and Computing at Indiana University, pp. 1-20, [62] Y. Gu and R. L. Grossman, Sector and Sphere: The Design and Implementation of a High Performance Data Cloud, National Science Foundation, pp. 1-11, 2008 [63] A. Gupta and D. Milojicic, Evaluation of HPC Applications on Cloud, Helwett- Packard Development Company, pp. 1-6, 2011 [64] C. Evangelinos and C. N. Hill. Cloud Computing for parallel Scientific HPC Applications: Feasibility of running Coupled Atmosphere-Ocean Climate Models on Amazon s EC2, Department of Earth, Atmospheric and Planetary Sciences at Massachusetts Institute of Technology, pp. 1-6, 2009 [65] Dryad and DryadLINQ for Data Intensive Research : [66] C. Fragni, M. Moreira, D. Mattos, L. Costa, and O. Duarte, Evaluating Xen, VMware, and OpenVZ Virtualization Platforms for Network Virtualization, Federal University of Rio de Janeiro, pp. 1-1, 2010 [67] N. Yaqub, Comparison of Virtualization Performance: VMWare and KVM, Master Thesis, pp , 2012 [68] T. Deshane, M. Ben-Yehuda, A. Shah, and B. Rao, Quantitative Comparison of Xen and KVM, in Xen Summit, pp. 1-3, 2008 [69] J. Hwang, S. Wu, and T. Wood, A Component-Based Performance Comparison of Four Hypervisors, George Washington University and IBM T.J. Watson Research Center, pp. 1-8, 2012 [70] A. J. Younge, R. Henschel, J. T. Brown, G. Laszewski, J. Qiu, and G. C. Fox, Analysis of Virtualization Technologies for High Performance Computing Environments, Pervasive Technology Institute, pp. 1-8, 2012 [71] Q. Jiang. Open Source Iaas Community Analysis, Eucalyptus Systems Inc., 2012 [72] I. Voras, M. Orlic, and B. Mihaljevié, An Early Comparison of Commercial and Open- Spurce Cloud P latforms for Scientific Environments, University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia, 2012 [73] E. Caron, L. Toch, and J. Rouzaud-Cornabas, Performance Comparison between OpenStack and OpenNebula and the multi-cloud Architecture: Application to Cosmology, Research Report N 8421, 2013 [74] K. Kostantos, A. Kapsalis, D. Kyriazis, M. Themistocleous, and P. Cunha, Open-Source IAAS Fit for Purpose: A Comparison between Openbula and OpenStack, International Journal of Electronic Business Management, Vol. 11, No. 3,

98 [75] O. Sefraoui, M. Aissaoui, and M. Eleuldj, Comparison of Multiple IaaS Cloud Platform Solutions, Mohamed I University, 2012 [76] Donnie Berkholz s Story of Data3: [77] E. Dede, M. Govindaraju, D. Gunter, R. Canon, and L. Ramakrishnan, Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis, SUNY Binghamton and Lawrence Berekely National Lab, 2012 [78] J. H. Lee, Log Analysis System Using Hadoop and MongoDB, CUBRID, [79] OpenStack: [80] OpenStack Training Guides, white paper, OpenStack Foundation, 2013 [81] A. Sehgal, Introduction to OpenStack: Running a Cloud Computing Infrastructure with Openstack, in the 6th International Conference on Autonomous Infrastructure, Management and Security, University of Luxembourg, 2012 [82] K. Pepple, Deploying OpenStack, O'Reilly Median, 2011 [83] OpenStack, Companies Supporting the OpenStack Foundation, [84] G. Sasiniveda and N. Revathi, Data Analysis using Mapper and Reducer with Optimal Configuration in Hadoop", International Journal of Computer Trends and Technology, vol. no. 3, 2013 [85] D. Borthakur, The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007 [86] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2010 [87] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop Distributed File System, Sunnyvale, 2010 [88] H. Herodotu, Hadoop Performance Models, Computer Science Department at Duke University, 2011 [89] Blogclub Tworkshops, Hadoop and MapReduce, [90] M. G. Noll, Benchmarking and Stress Testing and Hadoop Cluster with TeraSort, Test DFSIO & Co., 2011 [91] Apache Hadoop, TestDFSIO Apache Hadoop Code Source, client-jobclient/0.23.9/org/apache/hadoop/fs/testdfsio.java [92] F. Rahma*, T. Adji, Widyawan, Scalability Analysis of KVM-Based Private Cloud For IaaS, in International Journal of Cloud Computing and Services Science, Vol.2, No.4, ppt ,

99 [93] T.Deshane, M. Yehuda, A. Shah, B. Rao, Quantitative Comparison of Xen and KVM, in Journal of Physics: Conference, 2010 [94] Virtualizing Resource intensive Applications, white paper, VMware, 2009 [95] Scale-up Virtualization with Red Hat Enterprise Linux 5.4 on an HP ProLiant DL785 G6, white paper, Redhat, 2009 [96] KVM Virtualized I/O Performance, white paper, IBM & Redhat,

100 Appendix A: OpenStack with KVM Configuration Pre-configuration 1. Update your machine sudo apt-get update sudo apt-get upgrade 2. Install bridge-utils sudo apt-get install bridge-utils 3. NTP Server 3.1. Install the NTP Server sudo apt-get install ntp 3.2. Open the file /etc/ntp.conf Add the following lines to make sure that the time on the server stays in sync with an external server. server ntp.ubuntu.com server fudge stratum Restart NTP Service sudo service ntp restart 4. Network Configuration As public IP address changes periodically, you need to set a static IP address that will be used in OpenStack configuration. In this case, we have two network interfaces eth0 and eth1. Eth0 was chosen as the network management; as a result, this interface was set to static IP address (in this guide, we used as an IP management). 100

101 Hypervisor Configuration 1. KVM Configuration If you want to install OpenStack with KVM hypervisor, then you need to follow the following steps: 1.1.Check if your machine supports virtualization egrep -c '(vmx svm)' /proc/cpuinfo 8 ouidad@ouidad:~$ If the output is 0, then your machine does not support virtualization; otherwise, if the output is greater than 0, the machine support virtualization technology Check if KVM can be supported ouidad@ouidad:~$ kvm-ok INFO: /dev/kvm exists KVM acceleration can be used ouidad@ouidad:~$ If the output is as shown above, then your machine supports KVM virtualization. 1.3.Install KVM and libvirt sudo apt-get install kvm libvirt-bin 1.4.KVM configuration You can check the following website to configure the necessary files for KVM support: Reboot your machine 101

102 OpenStack Databases Configuration 1. MySQL 1.1.Install Mysql server and related packages sudo apt-get install mysql-server python-mysqldb 1.2.Create the root password for MySQL The password used in this guide is "secret" 1.3.Open /etc/mysql/my.cnf Change the bind address from bind-address= to bind-address = Restart MySQL server sudo restart mysql 2. Nova Database 2.1. Create Nova database nova sudo mysql -uroot -psecret -e 'CREATE DATABASE nova;' 2.2.Create nova user named novadbadmin sudo mysql -uroot -psecret -e 'CREATE USER novadbadmin;' 2.3.Grant all privileges for novadbadmin on the database "nova" sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON nova.* TO 'novadbadmin'@'%';" 2.4. Create a password for the user "novadbadmin"; the password in this case is secret sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'novadbadmin'@'%' = PASSWORD ('novasecret');" 3. Glance Database 3.1.Create glance database named glance sudo mysql -uroot -psecret -e 'CREATE DATABASE glance;' 102

103 3.2.Create a user named glancedbadmin sudo mysql -uroot -psecret -e 'CREATE USER glancedbadmin; ' 3.3. Grant all privileges for glancedbadmin on the database "glance" sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON glance.* TO 'glancedbadmin'@'%';" 3.4. Create a password for the user "glancedbadmin" sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'glancedbadmin'@'%' = PASSWORD('glancesecret');" 4. Keystone Database 4.1.Create a database named keystone sudo mysql -uroot -psecret -e 'CREATE DATABASE keystone;' 4.2.Create a user named keystonedbadmin. sudo mysql -uroot -psecret -e 'CREATE USER keystonedbadmin;' 4.3. Grant all privileges for keystonedbadmin on the database "keystone". sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON keystone.* TO 'keystonedbadmin'@'%';" 4.4.Create a password for the user "keystonedbadmin" sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'keystonedbadmin'@'%' = PASSWORD('keystonesecret');" 103

104 Keystone Configuration 1. Install Keystone sudo apt-get install keystone python-keystone python-keystoneclient 2. Open /etc/keystone/keystone.conf Make the following changes: Change admin_token = ADMIN to admin_token = admin Change connection = sqlite:////var/lib/keystone/keystone.db to connection = mysql://keystonedbadmin:keystonesecret@ /keystone 3. Restart keystone sudo service keystone restart 4. Create glance schema in MySQL databas sudo keystone-manage db_sync 5. Export environment variables export SERVICE_ENDPOINT=" export SERVICE_TOKEN=admin Note: you can also add these variables to ~/.bashrc as to avoid exporting them each time. 6. Create tenants Create admin and service tenants keystone tenant-create --name admin keystone tenant-create --name service 7. Create users Create OpenStack users by executing the following commands. In this case, we are creating four users - admin, nova, glance and swift keystone user-create --name admin --pass admin -- [email protected] keystone user-create --name nova --pass nova -- [email protected] keystone user-create --name glance --pass glance -- [email protected] keystone user-create --name swift --pass swift -- [email protected] 104

105 8. Create roles Create the roles by executing the following commands. In this case, we are creating two roles - admin and Member. keystone role-create --name admin keystone role-create --name Member Sample output: 9. List tenants, users and roles keystone tenant-list keystone user-list keystone role-list Sample output: 105

106 10. Adding roles to users in tenants Add the role of admin to the user admin of the tenant admin keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role 8af19783ac784e0397e0346c7f1ec --tenant_id ee14adbd1ac cf7a5b7f5f Add the role of admin to the user nova of the tenant service. keystone user-role-add --user 5ce6dd40bf2249e5ab35a95da63d role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c b098f41dae1fa726c Add the role of admin to the user glance of the tenant service. keystone user-role-add --user ee4aa421189f cad --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c b098f41d ae1fa726c Add the role of admin to the user swift of the tenant service. keystone user-role-add --user 24979d9ac31e4b83a58a89c1ad842ffa --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c b098f41d ae1fa726c The Member role is used by Horizon and Swift. So add the Member role accordingly. (user: admin, role: Member, tenant: admin) keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role c2860fd6f3fd4538a07161bdb2691f60 --tenant_id ee14adbd1ac cf7a5b7f5f 11. Create services Create the required services which the users can authenticate with: nova-compute, novavolume, glance, swift, keystone and ec2 are some of the services that we create Nova Compute Service keystone service-create --name nova --type compute --description 'Opensatck Compute Service' 106

107 11.2.Volume Service keystone service-create --name volume --type volume --description 'OpenStack Volume Service' 11.3.Image Service keystone service-create --name glance --type image --description 'Openstack Image Service' Object Store Service keystone service-create --name swift --type object_store --description 'Openstack Storage Service' 11.5.Identity Service keystone service-create --name keystone --type identity --description 'Openstack Identity Service' 11.6.EC2 Service keystone service-create --name ec2 --type ec2 --description 'EC2 Service' 12. List keystone service list keystone service-list Sample output: 107

108 13. Create endpoints Create endpoints for each of the services that have been created above (service id is displayed using keystone service-list command) Endpoint for identity service keystone endpoint-create --region RegionOne --service_id 207bf81ddfe1481aa242148f246d091f --publicurl --internalurl --adminurl Endpoint for nova service keystone endpoint-create --region RegionOne --service_id 72b9d125eaa84aaf9c8ce734027eea21 --publicurl ' -- internalurl ' --adminurl ' Endpoint for the image service keystone endpoint-create --region RegionOne --service_id 581f6a8e337642a0a39090ffe6947e2d --publicurl ' --internalurl ' --adminurl ' Define the EC2 compatibility service: keystone endpoint-create --region RegionOne --service_id 4b1619d4f9f34cc9aaf473282c2340f0 --publicurl -- internalurl --adminurl Endpoint for the Volume service keystone endpoint-create --region RegionOne --service_id 6afe27a1768b403b a87646ec4 --publicurl ' -- internalurl ' --adminurl ' Endpoint for object storage service keystone endpoint-create --region RegionOne --service_id 2ec242420a114671a4fe15e745b45d3f --publicurl ' --adminurl ' -- internalurl ' 108

109 Glance Configuration 1. Install Glance packages sudo apt-get install glance glance-api glance-client glance-common glance-registry python-glance 2. Open /etc/glance/glance-api-paste.ini Change the following lines: admin_tenant_name = %SERVICE_TENANT_NAME% admin_user = %SERVICE_USER% admin_password = %SERVICE_PASSWORD% By: admin_tenant_name = service admin_user = glance admin_password = glance 3. Now open /etc/glance/glance-registry-paste.ini Change the following lines: By: admin_tenant_name = %SERVICE_TENANT_NAME% admin_user = %SERVICE_USER% admin_password = %SERVICE_PASSWORD% admin_tenant_name = service admin_user = glance admin_password = glance 4. Open the file /etc/glance/glance-registry.conf Change the line which contains the option "sql_connection =" to this: sql_connection = mysql://glancedbadmin:glancesecret@ /glance Add the following lines at the end of the file as to allow glance to use keystone for authentication. [paste_deploy] flavor = keystone 109

110 5. Open /etc/glance/glance-api.conf Add the following lines at the end of the file. [paste_deploy] flavor = keystone 6. Create glance schema in MySQL database sudo glance-manage version_control 0 sudo glance-manage db_sync 7. Restart glance-api and glance-registry sudo restart glance-api sudo restart glance-registry 8. Export the following environment variables. export SERVICE_TOKEN=admin export OS_TENANT_NAME=admin export OS_USERNAME=admin export OS_PASSWORD=admin export OS_AUTH_URL=" export SERVICE_ENDPOINT= Note: you can add these variables to ~/.bashrc. 9. Check if glance was successfully configured glance index The above command displays nothing; if you get an output, check the troubleshooting section. 110

111 Nova Configuration 1. Install Nova packages sudo apt-get install nova-api nova-cert nova-compute nova-compute-kvm nova-doc novanetwork nova-objectstore nova-scheduler nova-volume rabbitmq-server novnc novaconsoleauth 2. Edit the /etc/nova/nova.conf file --dhcpbridge_flagfile=/etc/nova/nova.conf --dhcpbridge=/usr/bin/nova-dhcpbridge --logdir=/var/log/nova --state_path=/var/lib/nova --lock_path =/run/lock/nova --allow_admin_api=true --use_deprecated_auth=false --auth_strategy=keystone --scheduler_driver=nova.scheduler.simple.simplescheduler --s3_host = ec2_host= rabbit_host= cc_host = nova_url= --routing_source_ip= glance_api_servers= : image_service=nova.image.glance.glanceimageservice --iscsi_ip_prefix= ec2_url= --keystone_ec2_url= --api_paste_config=/etc/nova/api-paste.ini --libvirt_type=kvm --libvirt_use_virtio_for_bridges=true --start_guests_on_host_boot=true --resume_guests_state_on_host_boot=true --novnc_enabled=true --novncproxy_base_url= --vncserver_proxyclient_address= vncserver_listen= network_manager=nova.network.manager.flatdhcpmanager --public_interface=eth0 --flat_interface=eth0 --flat_network_bridge=br100 --network_size=32 --flat_injected=false --force_dhcp_release --iscsi_helper=tgtadm --connection_type=libvirt --root_help Important Note: has to be replaced by your local machine public IP address. Moreover, you need to change libvirt_type variable by the current hypervisor you are using. 111

112 3. Change the ownership of the /etc/nova folder and permissions for /etc/nova/nova.conf sudo chown -R nova:nova /etc/nova sudo chmod 644 /etc/nova/nova.conf 4. Open /etc/nova/api-paste.ini Change the following configuration By: admin_tenant_name = %SERVICE_TENANT_NAME% admin_user = %SERVICE_USER% admin_password = %SERVICE_PASSWORD% admin_tenant_name = service admin_user = nova admin_password = nova 5. Create nova schema in the MySQL database. sudo nova-manage db sync 6. Provide a range of IPs to be associated to the instances. sudo nova-manage network create private --fixed_range_v4= /27 -- bridge=br100 --bridge_interface=eth0 --network_size=32 7. Export the following environment variables. export OS_TENANT_NAME=admin export OS_USERNAME=admin export OS_PASSWORD=admin export OS_AUTH_URL=" Note: you can add the environment variables at the end of ~/.bashrc file. 8. Manage nova volumes Create a Physical Volume: sudo pvcreate /dev/sda3 Create a Volume Group named nova-volumes: sudo vgcreate nova-volumes /dev/sda3 112

113 Note: to create a physical volume, you need first to create a primary partition (in this guide, the partition name is /dev/sda3). In this case you can follow these steps: 9. Restart nova services sudo service libvirt-bin restart sudo service nova-network restart sudo service nova-compute sudo service nova-api restart sudo service nova-objectstore restart sudo service nova-scheduler restart sudo service nova-volume restart sudo service nova-consoleauth service 10. Check if nova services are running sudo nova-manage service list Sample output: Note: if you the state of a given service is not :-), then try to run the following commands in separate terminals: sudo /usr/bin/nova-compute sudo /usr/bin/nova-network 113

114 OpenStack Dashboard 1. Install OpenStack Dashboard sudo apt-get install openstack-dashboard 2. Restart apache service sudo service apache2 restart 3. Open a browser and enter IP address of your machine If you followed this tutorial, then the possible logins are: Username: admin Password: admin Username: nova Password: nova Username: glance Password: glance Username: swift Password swift Figure 1: Dashboard authentication page 114

115 Image Configuration In order to create an image, you can to access the following links to download the needed images: Example: Ubuntu Precise i386 Image 1. Download Ubuntu Precise Version (12.04 LTS) Download Ubuntu precise version (precise-server-cloudimg-i386-root.tar.gz) from using the following command: wget 2. Extract the downloaded package sudo tar fxvz precise-server-cloudimg-i386.tar.gz The extracted files are: precise- server-cloudimg-i386-vmlinuz-virtual precise-server-cloudimg-i386-loader precise-server-cloudimg-i386.img 3. Add the Ubuntu image into glance database 3.1. Add the kernel file glance add name="precise32-kernel" disk_format=aki container_format=aki < preciseserver-cloudimg-i386-vmlinuz-virtual 3.2. Add the loader file glance add name="precise32-ramdisk" disk_format=ari container_format=ari < preciseserver-cloudimg-i386-loader 3.3.Add the image file Get the id of both the kernel and loader using: glance index glance index Sample output: 115

116 In this case, the id of Ubuntu kernel is 8386c173-cd90-4c7d-8540-da484abd0c1a and the id of Ubuntu loader is 5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d. Now, add the image file using the kernel and loader id: glance add name="precise32_image" disk_format=ami container_format=ami kernel_id=8386c173-cd90-4c7d-8540-da484abd0c1a ramdisk_id=5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d < precise-server-cloudimg-i386.img 4. Using the Horizon, you can find the uploaded image (Precise32_image) Figure 2: List of OpenStack images 116

117 Keypair Configuration 1. Generate for you local machine If you didnt generate akey for you local machine, then run the following command : ssh-keygen -t rsa -P "" 2. Create keypair The following command can be used to either generate a new keypair, or to upload an existing public key. cd.ssh nova keypair-add --pub_key id_rsa.pub mykey nova keypair-list 3. List keypairs nova keypair-list Sample output: 4. Check the created keypair Confirm that the uploaded keypair matches the local key by checking your key's fingerprint with the ssh-keygen command: ssh-keygen l f ~/.ssh/id_rsa.pub Sample output: Note: You can use OpenStack Dashboard to perform all operations related to keypair generation. 117

118 Security Groups Configuration 1. List default security groups nova secgroup-list Sample output: 2. Enable access to TCP port 22 Allow access to port 22 from all IP addresses (specified in CIDR notation as /0) with the following command: nova secgroup-add-rule default tcp /0 Sample output: 3. Enable pinging to virtual machine instance by allowing ICMP traffic nova secgroup-add-rule default icmp /0 Sample output: 118

119 Flavors Configuration 1. Flavor overview Flavors are used to specify the properties of an instance. The following table illustrates the needed arguments to define a flavor. Column ID Name Memory_MB Disk Ephemeral Swap VCPUs TX_Factor Is_Public extra_specs Description A unique numeric id. A descriptive name. xx.size_name is conventional not required, though some third party tools may rely on it. Memory_MB: virtual machine memory in megabytes. Virtual root disk size in gigabytes. This is an ephemeral disk the base image is copied into. When booting from a persistent volume it is not used. The "0" size is a special case which uses the native base image size as the size of the ephemeral root volume. Specifies the size of a secondary ephemeral data disk. This is an empty, unformatted disk and exists only for the life of the instance. Optional swap space allocation for the instance. Number of virtual CPUs presented to the instance. Optional property allows created servers to have a different bandwidth cap than that defined in the network they are attached to. This factor is multiplied by the rxtx_base property of the network. Default value is 1.0 (that is, the same as attached network). Boolean value, whether flavor is available to all users or private to the tenant it was created in. Defaults to True. Additional optional restrictions on which compute nodes the flavor can run on. This is implemented as key/value pairs that must match against the corresponding key/value pairs on compute nodes. Can be used to implement things like special resources (such as flavors that can only run on compute nodes with GPU hardware). Table 1: Flavor arguments 2. List available flavors Use nova flavor-list command to view the list of available flavors: nova flavor-list 3. Create a flavor Create a flavor with the following suggested specifications: sudo nova-manage instance_type create --name=m1.cluster --memory=975 --cpu=2 -- root_gb=100 --ephemeral_gb=10 --flavor=8 119

120 Instances Management Instances can be created either by using the dashboard interface or using command line. 1. Create instances with no specifications nova boot --flavor ID --image Image-ID MyInstanceName 2. Create an instance with an associated keypair To associate a key with an instance on boot add --key_name Mykey to your command line: nova boot --image Image-ID --flavor ID --key_name Mykey MyInstanceName 3. Create an instance with a security group It is also possible to add and remove security groups when an instance is running. nova add-secgroup MyInstanceName MysecurityGroup nova remove-secgroup MyInstanceName MysecurityGroup 4. Create an instance with a given keypair and security group nova boot --flavor ID --image Image-ID --key_name Mykey MyInstanceName 5. Display instance details nova show MyInstanceName 6. Access an instance You can connect to an instance console via VNC. The latter can be accessed either by the Horizon interface, command line or other tools such as virt-manager. Using command line nova get-vnc-console host_name novnc Sample output: The link displayed above can be used to access the instance console. 120

121 Using virt-manager If you cannot connect to VNC console, you can use virt-manager; in this case, use the following command to download the virt-manager package: sudo apt-get install virt-manager To have access to virt-manager inetrface, run the following command, sudo virt-manager Using local machine terminal If the instance you created asked you for login name and password, you can in this case, access the instance through your local machine. In this case you need to follow these steps: ssh-copy-id -i $HOME/.ssh/id_rsa.pub For Ubuntu the user name is root or ubuntu. Example: if you want to access an Ubuntu instance with IP address , you can then run the commands in the following commands: ssh-copy-id -i $HOME/.ssh/id_rsa.pub ssh Sample output: 121

122 7. Connecting Instances The following steps can be followed to connect OpenStack Instances (Assumption: we need to connect instance with hostname host1 to another instance with hostname host2): Generate the keypair on host1 & host2 to run ssh (ssh-keygen -t rsa) On host2 o Check the sshd_config on that instance (It s located in /etc/ssh/sshd_config) o Uncomment the following two lines in sshd_config RSAAuthentication yes PubkeyAuthentication yes o Append the contents of id_rsa.pub file of host 1 to authorized_keys file of host 2 8. Delete an instance nova delete MyInstanceName 122

123 1. Exception 1: glance index error Solution OpenStack Troubleshooting glance index Failed to show index. Got error: There was an error connecting to a server Details: [Errno 111] Connection refused Glance Exceptions In most cases, the above exception is due to glance-api service which may not be running. Therefore, you need to run the following command to check why the glance-api is not running. For the above output, we have an error in the glance-api-paste.ini, so you need to open that file to fix the error. ouidad@ouidad:~$ sudo gedit /etc/glance/glance-api-paste.ini After fixing the error, you need to restart the glance-api service ouidad@ouidad:~$ sud/usr/bin/glance-apini 123

124 Nova Exceptions 1. Exception 1: nova services not running sudo nova-manage service list When running sudo nova-manage service list, if you a service has xxx state, then you need the service in a separate terminal. Solution For example, if nova-compute has xxx state, you need to run the following command: sudo /usr/bin/nova-compute The same solution can be applied for other services: sudo /usr/bin/nova-network sudo /usr/bin/nova-scheduler sudo /usr/bin/nova-consoleauth sudo /usr/bin/nova-cert sudo /usr/bin/nova-volume 2. Exception 2: sudo nova-manage service list doesn t display the expected output ouidad@ouidad:~$ sudo nova-manage service list Command failed, please check log for more info :46: CRITICAL nova [-] No module named quantumclient.common Solution ouidad@ouidad:~$ sudo apt-get install python-quantumclient 3. Exception 3: Unable to start nova compute libvirterror: operation failed: domain 'instance-.. already exists with uuid Sample output: Solution You need to login to nova database and delete the instance id from instances table. Moreover, you need to delete the instance id from related tables such as security_group_instance_association and instance_info_caches. 124

125 Example: we want to delete an instance with id=3 From the tables displayed above, delete the instance id = 3 from security_group_instance_association and instance_info_caches as well as from virtual_interfaces table. 125

126 Dashboard Exceptions 1. Exception 1: Unable to retrieve images/instances Sample output Solution If you get one of the following exceptions, the only way I solved the problem is to drop the endpoint and re-create them again. Then, you need to reboot your local machine. References for Appendix A precise-pangolin 126

127 Appendix B. OpenStack with VMware ESXi Configuration 1. Downloading VMware ESXi Download VMware ESXi (vsphere 5.5) from: 2. Installing VMware ESXi After burning VMware ESXi software into a CD, install it on top of your hardware. 3. Download vsphere Client To manage your VMware ESXi host: Install vsphere Client in another machine with Windows OS. After opening the software, login to VMware ESXi machine with your username and password. After login, you will get access to VMware ESXi machine resources. In our case, VMware ESXi machine has an IP address of (Figure 1) Figure 1: vsphere Client interface: access to VMware ESXi

128 4. Create Openstack VM Create a virtual machine on top of VMware ESXi using vsphere Client. The VM will be used to host OpenStack. Create the VM with Ubuntu Precise LTS bits Guest. 5. Download VMware vsphere Web Services SDK Download appropriate SDK from: Copy the SDK to /openstack/vmware file. Make sure that the WSDL is available by checking if this path is existing /openstack /vmware/sdk/wsdl/vim25/vimservice.wsdl /openstack /vmware/sdk/wsdl/vim25/vimservice.wsdl: this path will be specified in nova.conf. 6. Configure OpenStack on VMware ESXi You need to follow the same steps provided in OpenStack KVM documentation. The main difference here is the nova.conf configuration. 7. Nova.conf Configuration In this case, you need to specify the compute_driver, host_ip (VMware ESXi machine), host_username, host_password and sdl_location (for SDK) as follow [vmware] host_password = host_username = root host_ip = compute_driver = vmwareapi.vmwareesxdriver sdl_location=file:///openstack /vmware/sdk/wsdl/vim25/vimservice.wsdl 8. Dashboard access Access OpenStack resources from the Horizon using the IP address of Openstack VM. 9. Make sure that you OpenStack is installed wth VMware ESXi This is done from Horizon interface Example: 128

129 Figure 2: OpenStack with VMware ESXi hypervisor 10. Manage OpenStack with VMware ESXi After configuring OpenStack, you can now download images and create instances. Each time you create an instance, it will be displayed in vsphere Client interface as depicted in Figure 1. Concerning images, you need to add images with vmdk extension. You can find them in the following website (you can download them from the free images section): 129

130 Figure 3: access to VMs (OpenStack instances) through vsphere Client interface References

131 Appendix C: Hadoop Configuration Prerequisites for Installing Hadoop 1. Adding a dedicated Hadoop system user (all machines) Create a Hadoop user account (hduser) for running Hadoop using the following commands: ouidad@host1:~$ sudo addgroup hadoop ouidad@host1:~$ sudo adduser --ingroup hadoop hduser 2. Configuring SSH 2.1. To manage cluster nodes, Hadoop requires SSH access. In this case, you need to generate an SSH key for the hduser user. ouidad@host1:~$ su hduser Password: hduser@host1:~$ ssh-keygen -t rsa -P "" Generating public/private rsa key pair. Enter file in which to save the key (/home/hduser/.ssh/id_rsa): Your identification has been saved in /home/hduser/.ssh/id_rsa. Your public key has been saved in /home/hduser/.ssh/id_rsa.pub. The key fingerprint is: 44:f5:7b:85:32:f7:69:c7:d7:fc:75:38:63:32:be:d7 hduser@host1 The key's randomart image is: +--[ RSA 2048] o.. = *o S + *ox. =.o*..... E

132 2.2. In order to allow Hadoop interacts directly with its nodes, you need to create an RSA key pair with an empty password. This is done by enable SSH access to your local machine with this newly created key. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 3. Install JAVA 3.1.Download jdk-6u45-linux-i586.bin (for 32 bits architecture) from: JDK Installation chmod +x jdk-6u45-linux-i586.bin sudo./jdk-6u45-linux-i586.bin 3.3.Make sure that JDK is installed java -version java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing) 3.4. Move JDK folder from its current location to /home/hduser path sudo cp /Downloads/jdk1.6.0_45 /home/hduser -r 3.5. Change the JDK ownership sudo chown -R hduser:hadoop /home/hduser/jdk1.6.0_45/ 132

133 Installing Hadoop 1. Download Hadoop version (hadoop tar.gz) from 2. Extract the downloaded version tar -zxvf hadoop tar.gz 3. Move the extracted folder (hadoop-1.2.1) from Downloads folder to /home/hduser sudo cp hadoop /home/hduser/ -r 4. Change the ownership sudo chown -R hduser:hadoop /home/hduser/hadoop Bashrc file configuration (All machines) You need first to login to the hduser account, then you need to run the following command: sudo gedit ~/.bashrc at the end of the file, add the following line: export JAVA_HOME=~/jdk1.6.0_45 export PATH =$JAVA_HOME/bin:$PATH 6. Hdfs folder creation (All machines) You need first to login to the hduser account, then create the following folder: sudo mkdir -p /home/hduser/hdfs/temp chown hduser:hadoop /home/hduser/hdfs/temp chmod 777 /home/hduser/hdfs/temp/ chmod 775 /home/hduser/hdfs/temp/ 133

134 7. Hadoop Files Configuration (Slave machines) Move to /hadoop-1.2.1/conf folder to change the following files 7.1. hadoop-env.sh File sudo gedit hadoop-env.sh Replace the following two lines: # The java implementation to use. Required. # export JAVA_HOME=/usr/lib/j2sdk1.5-sun by (uncomment the second line): # The java implementation to use. Required. export JAVA_HOME=~/jdk1.6.0_45 Then, add at the end of the file: export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true 7.2. core-site.xml File hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml Add the following lines between the <configuration> tags: <property> <name>hadoop.tmp.dir</name> <value>/home/hduser/hdfs/temp</value> <description>a base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://master:54310</value> <description>the name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.scheme.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property> <property> <name>dfs.name.dir</name> <value>/home/hduser/hdfs/temp</value> </property> 134

135 7.3.mapred-site.xml File sudo gedit mapred-site.xml Add the following lines between the <configuration> tags: <property> <name>mapred.job.tracker</name> <value>master:54311</value> <description>the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> 7.4. hdfs-site.xml File hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml <property> <name>dfs.replication</name> <value>3 </value> <description>default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> Note: Number 3 illustrates the total number of block replication. If you have a cluster of 3-10 nodes, set the replication factor to 3 8. Hadoop Files Configuration (Master) 8.1. core-site.xml File hduser@master:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml Add the following lines between the <configuration> tags: <property> <name>fs.default.name</name> <value>hdfs://master:54310</value> <description>the name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.scheme.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property> <property> <name>hadoop.tmp.dir</name> <value>/home/hduser/hdfs/temp</value> <description>a base for other temporary directories.</description> </property> 135

136 <property> <name>dfs.name.dir</name> <value>/home/hduser/hdfs/temp</value> </property> 8.2.mapred-site.xml File sudo gedit mapred-site.xml Add the following lines between the <configuration> tags: <property> <name>mapred.job.tracker</name> <value>master: 54311</value> <description>the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> 8.3. hdfs-site.xml File hduser@master:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml Add the following lines between the <configuration> tags: <property> <name>dfs.replication</name> <value>3 </value> <description>default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> 8.4. slaves File hduser@master:~/hadoop-1.2.1/conf$ sudo gedit slaves Comment the localhost, and add the name of your slaves (you can set your master node as master and slave at the same by adding the master hostname to slaves file. master host1 host masters File hduser@master:~/hadoop-1.2.1/conf$ sudo gedit masters Comment the localhost, and add the name of your master node. 136

137 master Connecting Nodes 1. IP address configuration (All machines) 1.1. Find out the IP address of each machine ifconfig eth0 Link encap:ethernet HWaddr 00:23:ae:b0:89:ae inet addr: Bcast: Mask: UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets: errors:0 dropped:0 overruns:0 frame:0 TX packets:9134 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes: (30.8 MB) TX bytes: (1.3 MB) Interrupt:21 Memory:fe6e0000-fe lo Link encap:local Loopback inet addr: Mask: UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:58 errors:0 dropped:0 overruns:0 frame:0 TX packets:58 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:9306 (9.3 KB) TX bytes:9306 (9.3 KB) 1.2. Find out the host name of each machine sudo gedit /etc/hostname 1.1. Open hosts file (for each machine) sudo gedit /etc/hosts Replace the content of the file by the IP Addresses of all machines, including in the cluster master slave. 2. Connect the master hduser with the hduser on slaves Example: For machine with hostname host1 137

138 ssh-copy-id -i $HOME/.ssh/id_rsa.pub Example: For machine with hostname host2 ssh-copy-id -i $HOME/.ssh/id_rsa.pub 3. Test the connection between each slave and master machine ssh host1 Welcome to Ubuntu LTS (GNU/Linux generic i686) * Documentation: System information as of Sun Jun 30 19:44:28 WEST 2013 System load: 0.08 Processes: 159 Usage of /: 77.7% of GB Users logged in: 2 Memory usage: 35% IP address for eth0: Swap usage: 0% => There is 1 zombie process. Graph this data and manage this system at 97 packages can be updated. 66 updates are security updates. Last login: Sun Jun 30 18:39: from ip6-localhost If the connection is set up, you need then to cancel it to continue your installation hduser@host5:~$ exit logout Connection to host5 closed. 138

139 Formatting the HDFS & Starting Multi-node Cluster 1. Format the HDFS filesystem via the NameNode bin/hadoop namenode -format Here is the output: 13/06/30 20:00:42 INFO namenode.namenode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = master/ STARTUP_MSG: args = [-format] STARTUP_MSG: version = STARTUP_MSG: build = -r ; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013 ************************************************************/ 13/06/30 20:00:42 INFO util.gset: VM type = 32-bit 13/06/30 20:00:42 INFO util.gset: 2% max memory = MB 13/06/30 20:00:42 INFO util.gset: capacity = 2^22 = entries 13/06/30 20:00:42 INFO util.gset: recommended= , actual= /06/30 20:00:42 INFO namenode.fsnamesystem: fsowner=hduser 13/06/30 20:00:42 INFO namenode.fsnamesystem: supergroup=supergroup 13/06/30 20:00:42 INFO namenode.fsnamesystem: ispermissionenabled=true 13/06/30 20:00:42 INFO namenode.fsnamesystem: dfs.block.invalidate.limit=100 13/06/30 20:00:42 INFO namenode.fsnamesystem: isaccesstokenenabled=false accesskeyupdateinterval=0 min(s), accesstokenlifetime=0 min(s) 13/06/30 20:00:42 INFO namenode.namenode: Caching file names occuring more than 10 times 13/06/30 20:00:42 INFO common.storage: Image file of size 112 saved in 0 seconds. 13/06/30 20:00:42 INFO namenode.fseditlog: closing edit log: position=4, editlog=/home/hduser/hdfs/temp/dfs/name/current/edits 13/06/30 20:00:42 INFO namenode.fseditlog: close success: truncate to 4, editlog=/home/hduser/hdfs/temp/dfs/name/current/edits 13/06/30 20:00:43 INFO common.storage: Storage directory /home/hduser/hdfs/temp/dfs/name has been successfully formatted. 13/06/30 20:00:43 INFO namenode.namenode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at master/ ************************************************************/ 2. Start the multi-node cluster hduser@master:~/hadoop-1.2.1$ bin/start-all.sh Start both DFS and Hadoop Map/Reduce daemons: hduser@master:~/hadoop-1.2.1$ bin/start-dfs.sh hduser@master:~/hadoop-1.2.1$ bin/start-mapred.sh 139

140 4. On master machine, check if the following java processes are running : hduser@master:~$ jps 5721 SecondaryNameNode 6738 DataNode 5243 NameNode 6047 TaskTracker 8423 Jps 5805 JobTracker 4. On slave machines, check if the following java processes are running: hduser@master:~$ jps 1902 DataNode 4002 Jps 2108 TaskTracker If you get the following oputput: hduser@host1:~/hadoop-1.2.1/conf$ jps The program 'jps' can be found in the following packages: * openjdk-6-jdk * openjdk-7-jdk Ask your administrator to install one of them Then install one of the suggested packages: hduser@host1:~/hadoop-1.2.1/conf$ sudo apt-get install openjdk-7-jdk Note: if you didn t get the same services, follow the suggestion provided for exception

141 Hadoop Troubleshooting 1. Formatting the Namenode Exception: Cannot lock storage bin/hadoop namenode -format 13/06/30 19:57:35 INFO namenode.namenode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = master/ STARTUP_MSG: args = [-format] STARTUP_MSG: version = STARTUP_MSG: build = -r ; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013 ************************************************************/. 13/06/30 19:57:38 ERROR namenode.namenode: java.io.ioexception: Cannot lock storage /home/hduser/hdfs/temp/dfs/name. The directory is already locked. at org.apache.hadoop.hdfs.server.common.storage$storagedirectory.lock(storage.java:599) at org.apache.hadoop.hdfs.server.namenode.fsimage.format(fsimage.java:1327) at org.apache.hadoop.hdfs.server.namenode.fsimage.format(fsimage.java:1345) at org.apache.hadoop.hdfs.server.namenode.namenode.format(namenode.java:1207) at org.apache.hadoop.hdfs.server.namenode.namenode.createnamenode(namenode.java:1398) at org.apache.hadoop.hdfs.server.namenode.namenode.main(namenode.java:1419) 13/06/30 19:57:38 INFO namenode.namenode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at master/ ************************************************************/ Solution Step 1: Stop all processes hduser@master:~/hadoop-1.2.1$ bin/stop-all.sh Step 2 : move to /hdfs/temp folder and run the following command hduser@master:~/hdfs/temp$ sudo rm -rf * Step 3 : Restart your work by formatting the namenode hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format 141

142 2. Formatting the Namenode Exception: Cannot create directory /home/hduser/hdfs Solution In this case, make sure that you have set the following permission when creating the /hdfs/temp folder chmod 750 /home/hduser/hdfs/temp/ 3. Exception in log file: hadoop-hduser-datanode-host1.log or when Hadoop DataNode doesn t show up in slave nodes hduser@host1:~/hadoop-1.2.1/logs$ sudo gedit hadoop-hduser-datanode-host1.log :01:09,078 ERROR org.apache.hadoop.hdfs.server.datanode.datanode: java.io.ioexception: Incompatible namespaceids in /home/hduser/hdfs/temp/dfs/data: namenode namespaceid = ; datanode namespaceid = at org.apache.hadoop.hdfs.server.datanode.datastorage.dotransition(datastorage.java:232) at org.apache.hadoop.hdfs.server.datanode.datastorage.recovertransitionread(datastorage.java:147) at org.apache.hadoop.hdfs.server.datanode.datanode.startdatanode(datanode.java:399) at org.apache.hadoop.hdfs.server.datanode.datanode.<init>(datanode.java:309) at org.apache.hadoop. hdfs.server.datanode.datanode.makeinstance(datanode.java:1651) at org.apache.hadoop.hdfs.server.datanode.datanode.instantiatedatanode(datanode.java:1590) at org.apache.hadoop.hdfs.server.datanode.datanode.createdatanode(datanode.java:1608) at org.apache.hadoop.hdfs.server.datanode.datanode.securemain(datanode.java:1734) at org.apache.hadoop.hdfs.server.datanode.datanode.main(datanode.java:1751) Solution 1 1. From master machine, open VERSION file under /hdfs/temp/dfs/name/current folder: hduser@master:~/hdfs/temp/dfs/name/current$ sudo gedit VERSION Here is the content of VERSION file: #Sun Jun 30 20:00:43 WEST 2013 namespaceid= ctime=0 storagetype=name_node layoutversion=-32 Check the id of the namespace variable ( in this case it is ); remember the id as you will need it in the next step 2. From all slaves machines where you found the above exception, open the VERSION file under /hdfs/temp/dfs/data/current folder: hduser@host1:~/hdfs/tmp/dfs/data/current$ sudo gedit VERSION 142

143 Here is the content of VERSION file: #Fri Jun 14 09:22:08 WET 2013 namespaceid= storageid=ds ctime=0 storagetype=data_node layoutversion=-32 Replace the namespaceid variable with the value you found in the VERSION file of the master. The content of file VERSION under /hdfs/temp/dfs/data/current folder is: #Fri Jun 14 09:22:08 WET 2013 namespaceid= storageid=ds ctime=0 storagetype=data_node layoutversion=-32 Solution 2 1. Stop the whole cluster 2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /hdfs/temp /dfs/data. 3. Reformat the NameNode. 4. Restart the cluster. 4. Safe mode exception when running MapReduce examples org.apache.hadoop.ipc.remoteexception: org.apache.hadoop.hdfs.server.namenode.safemodeexception: Cannot delete /benchmarks/testdfsio. Name node is in safe mode. The reported blocks is only 3601 but the threshold is and the total blocks Safe mode will be turned off automatically. at org.apache.hadoop.hdfs.server.namenode.fsnamesystem.deleteinternal(fsnamesystem.java:2111) at org.apache.hadoop.hdfs.server.namenode.fsnamesystem.delete(fsnamesystem.java:2088) at org.apache.hadoop.hdfs.server.namenode.namenode.delete(namenode.java:832) at sun.reflect.nativemethodaccessorimpl.invoke0(native Method) at sun.reflect.nativemethodaccessorimpl.invoke(nativemethodaccessorimpl.java:39) Solution hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave Safe mode is OFF hduser@master:~/hadoop-1.2.1$ bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean TestDFSIO

144 References for Appendix C [1] [2] 144

145 Appendix D: TeraSort and TestDFSIO Execution 1. TeraSort 1.1.Generate the TeraSort input data using TeraGen TeraGen generates random data that can be conveniently used as input data for a subsequent TeraSort run. The command to run TeraGen in order to generate 100 MB of input data is: bin/hadoop jar hadoop-*examples*.jar teragen /home/hduser/terasort-input specifies the number of rows of input data to generate, each of which having a size of 100 bytes. 1.2.Run the actual TeraSort benchmark using TeraSort The syntax to run the TeraSort benchmark is as follows: bin/hadoop jar hadoop-*examples*.jar terasort /home/hduser/terasort-input /home/hduser/terasort-output 1.3.Validate the sorted output data of TeraSort using TeraValidate The syntax to run the TeraValidate is as follow: bin/hadoop jar hadoop-*examples*.jar teravalidate /home/hduser/terasort-input /home/hduser/terasort-output 1. Check TeraSort Analysis To check the average time to generate 100 MB, you need to run the following command: bin/hadoop job -history /home/hduser/terasort-input To check the average time to sort 100 MB, you need to run the following command: bin/hadoop job -history /home/hduser/terasort-output 2. Clean up your temporary files When re-running TeraSort Benchmark, you need to clean up all generated files in the first TeraSort test. bin/hadoop dfs -rmr /home/hduser/terasort-input bin/hadoop dfs -rmr /home/hduser/terasort-output 145

146 2. TestDFSIO 2.1. Write data using TestDFSIO-Write To generate 1000MB dataset, you need to specify an input with 10 files, and each file with 10MB. To allow this operation, the following command needs to be executed: hadoop jar hadoop-*test*.jar TestDFSIO -write -nrfiles 10 -filesize 10 A sample output of TestDFSIO-write operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time. 13/11/07 15:37:27 INFO fs.testdfsio: TestDFSIO : write 13/11/07 15:37:27 INFO fs.testdfsio:date & time: Thu Nov 07 15:37:27 UTC /11/07 15:37:27 INFO fs.testdfsio: Number of files: 10 13/11/07 15:37:27 INFO fs.testdfsio: Total MBytes processed: /11/07 15:37:27 INFO fs.testdfsio: Throughput mb/sec: /11/07 15:37:27 INFO fs.testdfsio: Average IO rate mb/sec: /11/07 15:37:27 INFO fs.testdfsio: IO rate std deviation: /11/07 15:37:27 INFO fs.testdfsio: Test exec time sec: /11/07 15:37:27 INFO fs.testdfsio: 2.2.Read data using TestDFSIO-Read After getting the results of TestDFSIO-write command, the next step is to run TestDFSIOread operation. In this case, to read the previous generated data, the following command needs to be executed. hadoop jar hadoop-*test*.jar TestDFSIO -read -nrfiles 10 -filesize 10 A sample output of write operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time. 13/11/07 15:38:11 INFO fs.testdfsio: TestDFSIO : read 13/11/07 15:38:11 INFO fs.testdfsio: Date & time: Thu Nov 07 15:38:11 UTC /11/07 15:38:11 INFO fs.testdfsio: Number of files: 10 13/11/07 15:38:11 INFO fs.testdfsio: Total MBytes processed: /11/07 15:38:11 INFO fs.testdfsio: Throughput mb/sec: /11/07 15:38:11 INFO fs.testdfsio: Average IO rate mb/sec: /11/07 15:38:11 INFO fs.testdfsio: IO rate std deviation: /11/07 15:38:11 INFO fs.testdfsio: Test exec time sec: /11/07 15:38:11 INFO fs.testdfsio: 2.3.Clean your cluster The last step is to clean up the generated data using the following command: bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean 146

147 Appendix E: Data Gathering for TeraSort 1. Hadoop Physical Cluster Number of Machines Dataset Size Map Test 1 Test 2 Test 3 Mean 100 MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce

148 7 8 1 GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce

149 2. Hadoop Virtualized Cluster- KVM Number of Dataset KVM VMs Size Map Test 1 Test 2 Test 3 Mean 100 MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map

150 10 GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce

151 3. Hadoop Virtualized Cluster- VMware ESXi Number of VMware VMs Dataset Size Map Test 1 Test 2 Test 3 Mean 100 MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map

152 10 GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce MB Map MB Shuffling MB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce GB Map GB Shuffling GB Reduce

153 Appendix F: Data Gathering for TestDFSIO 1. Hadoop Physical Cluster Dataset Size 100 MB 1 GB 10 GB 100 GB Operatio n Write Read Write Read Write Read Write Read Number of Nodes = 3 Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

154 Dataset Size 100 MB 1 GB 10 GB 100 GB Number of Nodes = 4 Operation Criteria Test1 Test2 Test3 Mean Write Read Write Read Write Read Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

155 Data Size 100 MB 1 GB 10 GB 100 GB Number of Nodes = 5 Operation Criteria Test1 Test2 Test3 Mean Write Read Write Read Write Read Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

156 Dataset Size 100 MB 1 GB 10 GB 100 GB Number of Nodes = 6 Operation Criteria Test1 Test2 Test3 Mean Write Read Write Read Write Read 300 GB Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

157 Data Size 100 MB 1 GB 10 GB 100 GB 300 GB Number of Nodes = 7 Operation Criteria Test1 Test2 Test3 Mean Write Read Write Read Write Read Write Read Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

158 Dataset Size 100 MB 1 GB 10 GB 100 GB Number of Nodes = 8 Operation Criteria Test1 Test2 Test3 Mean Write Read Write Read Write Read Write Read 300 GB Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

159 2. Hadoop Virtualized Cluster- KVM Data Size Number of KVM VMs = 3 Operation Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Average IO rate (mb/sec) MB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation GB Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec)

160 Data Size Number of KVM VMs = 4 Operation Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Average IO rate (mb/sec) MB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec)

161 Data Size Number of KVM VMs = 5 Operation Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Write Average IO rate (mb/sec) MB IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB Write IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB Write IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB Write IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec)

162 Data Size Number of KVM VMs = 6 Operation Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Write Average IO rate (mb/sec) MB IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Write Average IO rate (mb/sec) GB IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Write Average IO rate (mb/sec) GB IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Write Average IO rate (mb/sec) GB IO rate standard deviation Read Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

163 Number of KVM VMs = 7 Data Size Operation Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Average IO rate (mb/sec) MB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) GB IO rate standard deviation Write Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec)

164 Number of KVM VMs = 8 Data Size Operation Criteria Test1 Test2 Test3 Mean Throughput (mb/sec) Write Average IO rate (mb/sec) IO rate standard deviation MB Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Write Average IO rate (mb/sec) IO rate standard deviation GB Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Write IO rate standard deviation GB Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) Read IO rate standard deviation Execution time (sec) Throughput (mb/sec) Write Average IO rate (mb/sec) IO rate standard deviation 100 GB Execution time (sec) Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) 164

165 4. Hadoop Virtualized Cluster- VMware ESXi Dataset Size 100 MB 1 GB 10 GB Number of VMware ESXi VMs = 3 Operation Criteria Test1 Test2 Test3 Mean Write Read Write Read Write Read 100 GB Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

166 Number of VMware ESXi VMs = 4 Dataset Size 100 MB Operation Criteria Test1 Test2 Test3 Mean Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) GB 10 GB 100 GB Write Read Write Read Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

167 Number of VMware ESXi VMs = 5 Dataset Size Operation Cretiria Test1 Test2 Test3 Mean Throughput (mb/sec) MB 1 GB 10 GB 100 GB Write Read Write Read Write Read Write Read Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

168 Dataset Size 100 MB Number of VMware ESXi VMs = 6 Operation Criteria Test1 Test2 Test3 Mean Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) GB Write Read 10 GB Write Read 100 GB Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

169 Dataset Size 100 MB Write Number of VMware ESXi VMs = 7 Operation Criteria Test1 Test2 Test3 Mean Read 1 GB Write Read 10 GB Write Read 100 GB Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)

170 Dataset Size 100 MB Write Number of VMware ESXi VMs = 8 Operation Cretiria Test1 Test2 Test3 Mean Read 1 GB Write Read 10 GB Write Read 100 GB Write Read Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec) Throughput (mb/sec) Average IO rate (mb/sec) IO rate standard deviation Execution time (sec)