Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo
Dipartimento di Automatica e Informatica, Politecnico di Torino,
Corso Duca degli Abruzzi 24, 10129 Torino (Italy)
{name}.{surname}@polito.it

Abstract. In the last decade, we witnessed an increasing interest in High Performance Computing (HPC) infrastructures, which play an important role in both academic and industrial research projects. At the same time, due to the increasing amount of available data, we also witnessed the introduction of new frameworks and applications based on the MapReduce paradigm (e.g., Hadoop). Traditional HPC systems are usually designed for CPU- and memory-intensive applications. However, the use of already available HPC infrastructures for data-intensive applications is an interesting topic, particularly in academia, where the budget is usually limited and the same cluster is shared by many researchers with different requirements. In this paper, we investigate the integration of Hadoop, and its performance, in an already existing low-budget general purpose HPC cluster characterized by heterogeneous nodes and a small amount of secondary memory per node.

Keywords: HPC, Hadoop, MapReduce, MPI applications

1 Introduction

The amount of available data increases every day. This huge amount of data is a resource that, if properly exploited, provides useful knowledge. However, to extract useful knowledge from it, efficient and powerful systems are needed. One possible solution to this problem consists in adopting the Hadoop framework [6], which exploits the MapReduce [1] paradigm for the efficient implementation of data-intensive distributed applications.

Recent years have also witnessed the increasing availability of general purpose HPC systems [3], such as clusters, commonly installed in many computing centers. They are usually used to provide different services to communities of users (e.g., academic researchers) with different requirements. These systems are usually designed for CPU- and memory-intensive applications. However, there have been some attempts to integrate Hadoop also into general purpose HPC systems, in particular in academia. Due to limited budgets, the integration of Hadoop into already available HPC systems is an interesting and fascinating problem. It allows academic researchers to continue to use their current MPI-based applications and, at the same time, to exploit Hadoop to address new (data-intensive) problems without further costs.
In this paper, we describe the integration of Hadoop into an academic HPC cluster hosted at the Politecnico di Torino computing center HPC@polito. This cluster, called CASPER, hosts dozens of scientific software packages, often based on MPI. We decided to integrate Hadoop into CASPER to understand whether it can also manage huge amounts of data or whether an upgrade is needed for this new purpose. Our main goal is to continue providing the already available services, based on MPI applications, while also offering the new Hadoop-based ones on the same system.

2 HPC@polito

HPC@POLITO (www.hpc.polito.it) aims at providing computational resources and technical support to both academic research and university teaching. To pursue these goals, the computing center has set up a heterogeneous cluster called CASPER (Cluster Appliance for Scientific Parallel Execution and Rendering) with a peak performance of 1.2 TFLOPS. The initiative counts 25 hosted projects and 12 papers, published by groups operating in different research areas, that were developed thanks to our HPC facility. A detailed technical description of the system is available in previous papers [2] [5]. In our vision, CASPER is a continuously evolving system, developed in collaboration with several research groups and renewed regularly.

2.1 Cluster configuration

CASPER is a standard MIMD distributed shared memory InfiniBand heterogeneous cluster. It has 1 master node and 9 computational nodes, for a total of 136 cores and 632 GB of main memory. As shown in Figure 1, the system is composed of three different types of computational nodes, which have been added to the cluster in subsequent stages according to the needs of the research groups. The cluster is evolving into a massively parallel, large memory system with average CPU speed and many cores per node. The nodes have very small and low-performance local hard disks, which were designed to contain only the operating system; experimental data are maintained on a central NAS. CASPER is normally used through the scheduler/resource manager SGE (now OGE) for running custom code or third-party software, often taking advantage of MPI libraries (e.g., Esys-Particle, Matlab, OpenFOAM, Quantum-Espresso). The cluster configuration is therefore a compromise that provides sufficient performance for all of the cited software packages. However, the current applications are rarely data-intensive and do not use huge data.

2.2 Hadoop Deployment

Apache Hadoop was installed trying to harmonize its requirements with those of our specific system. CASPER is an installation of Rocks Cluster 5.4.3 (http://www.rocksclusters.org).
Fig. 1. CASPER cluster configuration, early 2013.

To install Hadoop 1.0.4 on it, we used the packages from the EPEL repository for CentOS, from which the Rocks Cluster distribution derives. We decided to use the master node of the cluster also as the master node of Hadoop, while the computational nodes were configured as slave nodes. HDFS was configured with permissions disabled, the data block size set to 256MB, and the number of replicas per data block set to 2. On five nodes (nodes from 0-0 to 0-4), we added 1TB local hard disks for the exclusive use of HDFS. On the remaining four nodes, data was stored in /state/partition1, which is a partition created by the Rocks node installer and corresponds to all the remaining space on the local disk (the local disk size is 160GB for these four nodes). We configured Hadoop by taking into consideration the heterogeneous nature of CASPER. More specifically, for each node the maximum number of mappers was set equal to the number of cores, while the maximum number of reducers was set to 4 for the five nodes with a large and efficient 1TB local disk, and to 0 for the other four nodes due to the low performance and small size of their hard disks. The package provided by Oracle to integrate Hadoop into our version of SGE is incompatible with the current version of Hadoop. Hence, we adopted an integration approach based on the creation of a dedicated queue and a new Parallel Environment in the SGE scheduler, so that Hadoop jobs remain under the control of Hadoop but are submitted and managed through the queuing system of the cluster.
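For reference, the following minimal sketch summarizes the settings described above using the Hadoop 1.x property names. It is only illustrative: in the actual deployment these values are placed in the hdfs-site.xml and mapred-site.xml files of each node rather than in client code, and the per-node slot counts shown here (8 mappers, 4 reducers) are example values corresponding to one of the five Intel nodes.

import org.apache.hadoop.conf.Configuration;

// Illustrative only: Hadoop 1.x property names corresponding to the
// settings described in Section 2.2. In the real deployment they are set
// in each node's hdfs-site.xml and mapred-site.xml; the values below refer
// to one of the five Intel nodes (8 cores, 1TB HDFS disk).
public class CasperHadoopSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS: permissions disabled, 256MB blocks, 2 replicas per block.
        conf.setBoolean("dfs.permissions", false);
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);
        conf.setInt("dfs.replication", 2);

        // MapReduce slots per TaskTracker: mappers = number of cores of the
        // node; reducers = 4 on the five nodes with a 1TB local disk
        // (0 on the four nodes with small, slow local disks).
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);

        System.out.println("HDFS block size (bytes): " + conf.get("dfs.block.size"));
        System.out.println("HDFS replication:        " + conf.get("dfs.replication"));
        System.out.println("Map slots per node:      " + conf.get("mapred.tasktracker.map.tasks.maximum"));
        System.out.println("Reduce slots per node:   " + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
    }
}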
3 Experimental results

As described in the introduction of the paper, some future research applications require using CASPER on large/huge data. However, on the one hand we cannot buy a new dedicated cluster, and on the other hand we cannot dismiss the software already hosted on CASPER. Hence, we performed an initial set of experiments to understand the scalability, in terms of data size, of an MPI application. These experiments allowed us to identify the data size limit of MPI programs on our cluster. Then, we performed a set of experiments based on Hadoop (i) to evaluate the scalability of Hadoop on CASPER and (ii) to understand which upgrades are needed to be able to use CASPER on big data.

3.1 Experimental setting

To evaluate the scalability of MPI-based algorithms, we used an MPI implementation of the quicksort algorithm derived from a publicly available code [4]. It takes as input a single file containing a set of numbers (one number per line) and generates a single file corresponding to the sorted version of the input one. The algorithm, like most typical MPI applications, works exclusively in main memory. One of the benchmarking tests usually performed to analyze the efficiency and scalability of Hadoop is the Hadoop-based implementation of Terasort [6]. Hence, we also used it to test the scalability of Hadoop on our cluster.

3.2 MPI-based sorting algorithm

The scalability of the MPI sorting algorithm was evaluated on files ranging from 21GB to 100GB. Larger files are not manageable due to the memory-based nature of the algorithm. Detailed results are reported in Table 1(a) for three different configurations, characterized by a different number of nodes/cores. We considered initially only the nodes 0-8 and 0-9 reported in Figure 1, then nodes from 0-6 to 0-9, and finally all nodes. The first configuration is homogeneous but it is characterized by only 64 cores, while the last one is the most heterogeneous but it exploits all the available resources. The results reported in Table 1(a) highlight that our MPI sorting algorithm is not able to process files larger than 100GB. Hence, it cannot manage big data. As expected, the execution time decreases when the number of nodes/cores increases. Figure 2(a) shows that the execution time decreases linearly with respect to the number of cores. However, the slope of the curves depends on the file size. Hence, the availability of more cores is potentially a positive factor. However, increasing the number of nodes has, in some cases, a negative impact. More specifically, if we use all nodes, the sorting process does not end for files larger than or equal to 42GB. The problem is caused by the (limited) size of the RAM of the 5 Intel nodes (24GB per node): they are not able to process the tasks assigned to them by the MPI-based sorting program when the file size is larger than approximately 40GB.

Table 1. Execution time. DNF = did not finish.

(a) MPI-based sorting algorithm.
Dataset size | Configuration (#cores / total RAM) | Execution time
21GB         | 64 cores / 256GB                   | 39m49s
21GB         | 96 cores / 512GB                   | 31m52s
21GB         | 136 cores / 632GB                  | 26m30s
42GB         | 64 cores / 256GB                   | 1h41m32s
42GB         | 96 cores / 512GB                   | 1h24m41s
42GB         | 136 cores / 632GB                  | DNF
100GB        | 64 cores / 256GB                   | 3h38m59s
100GB        | 96 cores / 512GB                   | 2h52m59s
100GB        | 136 cores / 632GB                  | DNF

(b) Hadoop-based Terasort.
Dataset size | Configuration (#cores / total RAM) | Execution time
100GB        | 40 cores / 120GB                   | 1h23m2s
100GB        | 136 cores / 632GB                  | 57m26s
200GB        | 40 cores / 120GB                   | 3h52m57s
200GB        | 136 cores / 632GB                  | 3h21m49s
Fig. 2. Execution time (s) versus number of cores: (a) MPI-based sorting algorithm; (b) Hadoop-based Terasort.

3.3 Terasort (Hadoop-based application)

Since the MPI application does not allow processing large files on CASPER, we decided to test Hadoop on it. Hadoop is usually deployed on commodity hardware; however, CASPER has a set of peculiarities (e.g., it is extremely heterogeneous) and hence it could be unsuitable for Hadoop. We performed the tests by means of a standard benchmark called Terasort. We decided to evaluate the Hadoop-based implementation of Terasort on two extreme configurations. The first configuration is based only on the 5 Intel nodes (40 cores), while the second one exploits all nodes (136 cores). The first configuration is homogeneous (5 Intel 3.2GHz nodes with 1TB of secondary memory per node). The second one is extremely heterogeneous (different CPU frequencies and local disks with size ranging from 160GB to 1TB). The results reported in Table 1(b) and Figure 2(b) show that the Hadoop-based Terasort algorithm can process a 200GB file in less than 4 hours. Hence, also on CASPER, which is not designed for Hadoop, the use of Hadoop allows processing files larger than those manageable by means of MPI algorithms. However, the first configuration (5 homogeneous nodes) is only slightly slower than the second one (composed of all nodes). The second configuration has 240% more cores than the first one, but the execution time decreases by only 13% when the file size is 200GB (3h52m57s vs. 3h21m49s) and by 31% when the file size is 100GB. These results confirm that we need more homogeneous nodes in our cluster and, on average, larger local disks on each computational node if we want to increase the scalability of CASPER for data-intensive applications based on Hadoop. We will consider this important point during the planning of the next upgrade of CASPER.
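As a reference for reproducing these runs, the sketch below drives the benchmark programmatically. It is a minimal example, assuming the TeraGen and TeraSort classes shipped in the hadoop-examples jar of Hadoop 1.x; the HDFS paths and the row count (TeraGen writes 100-byte rows, so 10^9 rows correspond to roughly 100GB) are illustrative choices, not the exact commands we used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

// Minimal driver sketch: generate a ~100GB Terasort input and sort it.
// Paths and row count are illustrative; the classes come from the
// hadoop-examples jar distributed with Hadoop 1.x.
public class TerasortBenchmark {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 10^9 rows x 100 bytes/row = ~100GB of input data.
        int genStatus = ToolRunner.run(conf, new TeraGen(),
                new String[] {"1000000000", "/benchmarks/terasort-in"});

        // Sort the generated data; the number of map tasks follows from the
        // 256MB HDFS block size, while the reduce slots per node are those
        // configured in Section 2.2.
        int sortStatus = ToolRunner.run(conf, new TeraSort(),
                new String[] {"/benchmarks/terasort-in", "/benchmarks/terasort-out"});

        System.exit(genStatus != 0 ? genStatus : sortStatus);
    }
}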
General considerations. Based on the achieved results, we can conclude that CASPER can potentially be used to run both the already hosted MPI-based applications and new Hadoop-based applications. However, some upgrades are needed in order to improve the performance of CASPER on large datasets. The results reported in Sections 3.2 and 3.3 can also be exploited to decide how to allocate the different applications on CASPER. Hadoop seems to achieve better results when homogeneous nodes with large and efficient local disks are used (i.e., the 5 Intel nodes in our current system), while the MPI-based application, which is main-memory intensive, seems to perform better on nodes with more efficient processors and a large amount of main memory (i.e., the AMD nodes in our current system). On CASPER, as on all traditional clusters, a set of queues can be created. Each queue is associated with a set of nodes and can be characterized by a priority level. Based on the discussed results, associating the Hadoop-based applications with a queue that includes the 5 Intel nodes, and the MPI-based applications with a queue that includes the AMD nodes, seems to be, potentially, a good configuration. This configuration should allow executing Hadoop- and MPI-based applications concurrently.

4 Conclusions

Due to the increasing demand for data-intensive applications, we decided to analyze the potential of our low-budget general purpose cluster for this type of application. In this paper, we reported the results of this experience. The performed experiments highlighted the limitations of our current cluster and helped us to identify potential upgrades that should be considered in the future. Further experiments will be performed on other algorithms (e.g., the merge sort algorithm) to confirm the achieved results.

References

1. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
2. F. Della Croce, E. Piccolo, and N. Nepote. A terascale, cost-effective open solution for academic computing: early experience of the DAUIN HPC initiative. In AICA 2011, pages 1-9, 2011.
3. J. Dongarra. Trends in high performance computing: a historical overview and examination of future developments. IEEE Circuits and Devices Magazine, 22(1):22-27, 2006.
4. P. Maier. qsort.c. http://www.macs.hw.ac.uk/~pm175/f21dp2/src/, 2010.
5. N. Nepote, E. Piccolo, C. Demartini, and P. Montuschi. Why and how using HPC in university teaching? A case study at Polito. In DIDAMATICA 2013, pages 1019-1028, 2013.
6. T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.