Building OLAP cubes on a Cloud Computing environment with MapReduce

Billel ARRES, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, Bron, France, Billel.Arres@univ-lyon2.fr
Nadia KABBACHI, Université Claude Bernard Lyon 1, 43 Boulevard du 11 Novembre, Villeurbanne, France, Nadia.Kabachi@univ-lyon1.fr
Omar BOUSSAID, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, Bron, France, Omar.Boussaid@univ-lyon2.fr

Abstract — Large-scale data analysis has become increasingly important for many enterprises, and Cloud Computing, under the impulse of large companies, has recently received special attention in both industry and academic research. Hadoop, based on a new distributed computing paradigm called MapReduce, has facilitated access to such environments, thanks to its impressive scalability and its flexibility in handling structured as well as unstructured data. The goal of our work is to develop a Cloud Computing environment for exploiting data warehouses and performing online analysis. It consists of handling large non-relational databases and supporting data warehouses with a new generation of database management systems (DBMS) such as Hive. To set up such an environment, we implemented a data warehouse under Hadoop and Hive and used the Map and Reduce functions of this environment; we then compared the cost of loading the warehoused data and constructing OLAP cubes between a virtual and a physical cluster, as well as the scaling up of data loading on a physical cluster. The obtained results allow MapReduce developers to compare performance fully, help in the choice of a platform on which a client application can be developed to translate SQL requests into HQL (Hive-QL) requests, and check whether a non-relational model is adequate or not.

I. INTRODUCTION

Data warehouses and OLAP (Online Analytical Processing) systems represent decision aid technologies that allow online analysis of large volumes of data [18].
While these two systems were originally dedicated to organizing, storing and optimally exploiting simple data, they were not designed for the new orders of magnitude of data accumulated since. Notice that it is not uncommon to see petabyte-scale data warehouses nowadays [3] [11]. On the other hand, high-performance architectures are designed to meet the growing computation and storage needs of scientific and industrial applications. Among these architectures is Cloud Computing [9], which globally consists of outsourcing data processing and storage. Popularized by Google, an innovative model of parallel data processing, called MapReduce, was presented in a now famous article [6]. The breakthrough of this model made it possible to solve, with ordinary equipment, in one minute a problem that previously required an hour, on condition of multiplying the number of machines by 60 [15]. In addition, the arrival of the Hadoop project, based on MapReduce, made these high-performance architectures easy to exploit. The construction of data warehouses and online analysis on the Cloud is a new research field. Indeed, if, on the one hand, the University of California has identified research areas for the Cloud (availability, quality of service, security, etc.) [9], on the other, little attention has been paid to what is done, technically, for the construction of data warehouses and OLAP analysis on the Cloud [1]. Hence the necessity to move in this direction. To do so, we set ourselves the goal of building a development environment based on Hadoop and Hive, to test this new generation of DBMS. In this paper, we propose an original architecture to, first, set up a data warehouse on the Cloud for performing online analysis, and second, evaluate the performance of different variants of the architecture set up with Hadoop and Hive.
We decided to use the SSB [22] data sample, a decision-support benchmark designed to measure the performance of a star schema data warehouse. We propose to realize such an environment by implementing a data warehouse under Hive, in order to exploit the Map and Reduce functions of this environment and compare the data warehouse loading time and the OLAP cube building time. This was done, on the one hand, on a virtual cluster with a virtual machine (VM), then on a physical cluster. On the other hand, we evaluate the scalability of Hive on a physical cluster with four and six nodes. The results will serve as a basis for future benchmarking work, help in the choice of a platform on which client applications can be developed to translate SQL queries into Hive-QL, and finally validate the adequacy of non-relational models for warehousing complex data. This paper is organized as follows. Section II describes the background of this work. Section III is devoted to the proposed approach, with an overview of our data warehouse system and schema design. We have performed several experiments to test this environment, which we develop in section IV. The performance evaluation results are presented in section V. We conclude in section VI.

II. STATE OF THE ART AND RELATED WORK

To understand the rest of the paper, the reader may need some basic knowledge of the tools we are using. In this section we summarize the main concepts of Cloud Computing and of MapReduce with its open source implementation Hadoop, as they were presented by Google in [6] and [16], respectively.

A. Cloud Computing

Cloud Computing is a set of services deployed across a network. It is a concept which consists of moving to distant servers the storage and computer processing usually located

on local servers [8]. The architecture of a Cloud Computing environment is generally based on a layered organization, as shown in Fig.1.

Fig. 1. Data management architecture on the cloud.

The first level corresponds to Infrastructure as a Service (IaaS). In general, it is composed of data centers made available by cloud providers. Amazon EC2 [20] and Microsoft Azure [21] are examples of such infrastructures. The second level is dedicated to Platform as a Service (PaaS). It provides a quickly available runtime environment. The best-known example in this regard is MapReduce [6], and its open source implementation Hadoop [16]. The third level is the execution environment (SaaS, Software as a Service). It offers software solutions as hosted services and aims to allow easier use and total transparency of the other layers of the architecture. Hive from Facebook [3], or Scope [12] from Microsoft, are examples of this last layer, based on particular data models such as the column-oriented model or models extending the relational model such as NoSQL (Not Only SQL). Cloud Computing provides elasticity to adjust resources according to applications, increasing computing power and storage at peak usage times and decreasing them during slack periods, while allowing the parallelization of storage and data processing. Depending on the approach, there are two main deployment models for Cloud Computing services [9]: the Public Cloud, accessible to a wide public and belonging to a service provider; and the Private Cloud, where the infrastructure is completely dedicated to a single organization. In our work, a private cloud architecture based on Hadoop and Hive is implemented. It allows the compilation of Hive-QL instructions to set up a data warehouse and construct OLAP cubes in a parallel environment (section III).

B. MapReduce and Hadoop

1) The MapReduce paradigm: MapReduce is a programming model suited to massively parallel processing of very large amounts of data [6].
This programming model is based on two main steps, Map and Reduce. In the Map step, the node (machine) that is submitted a problem cuts it into subproblems and delegates them to other nodes (which can do the same recursively). In the Reduce step, the lowest nodes return their results to the parent nodes that solicited them. At the end of the process, the original node can reconstruct a solution to the problem it was asked to solve. Introduced by Google, MapReduce was used, in 2007, to process more than 400 TB of data in 6 minutes with 436 machines [15]. The most popular implementation of the MapReduce paradigm is the Hadoop framework, which allows applications to be run on large clusters deployed on low-cost machines. Other implementations of the MapReduce paradigm are available for different architectures, such as multicore architectures [13], multiple virtual machine architectures [14], Grid Computing environments [4] or even mobile environments [2].

2) Hadoop: Hadoop is an open source project based on the MapReduce paradigm and the Google File System. It can be considered a processing system for scalable data storage and batch processing of large quantities of data. It is perfectly suited for ad hoc storage and analysis of very large volumes of data [16]. With Hadoop, since it involves managing very large volumes of data, we must also optimize the use of network bandwidth. That is why MapReduce is generally used in combination with a distributed file management system; in the case of Hadoop, this is HDFS (Hadoop Distributed File System). HDFS has a master/slave architecture [17]. In this logic, a Hadoop cluster consists of a single master server, named NameNode, which manages the file system and access rights, and of servers that are both computation and storage tools, named DataNodes, in general one per node.
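To make the two steps concrete, the canonical word-count example can be simulated in a few lines of Python. This is a minimal single-process sketch of the paradigm, not Hadoop code; the function names and the in-memory shuffle are our own illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    # Map: split the input and emit one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum the partial counts collected for one word.
    return word, sum(counts)

lines = ["map reduce map", "reduce map"]

# Shuffle: group the intermediate pairs by key, as the framework
# does between the Map and Reduce phases.
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

counts = dict(reduce_phase(w, c) for w, c in groups.items())
print(counts)  # {'map': 3, 'reduce': 2}
```

In a real cluster, the map calls run in parallel on the DataNodes holding the input blocks, and the shuffle moves intermediate pairs over the network; the logic, however, is exactly this.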
Hadoop has been widely adopted by the decision-support community, and has made the field of data warehousing in the cloud more accessible.

III. PROPOSED APPROACH

We have implemented a private cloud architecture, limited in terms of infrastructure. The objective is to test the feasibility of our approach of building a data warehouse in the Cloud. It does not allow us to study the transition to very large scale. However, it helps us deploy the parallelization of the storage and processing of the data warehouse. This allows us to observe the behavior of this approach under scaling, although it remains relatively limited. The goal is to verify that performance does not degrade when we increase the size of the data warehouse. Our work consists, first, in implementing the parallel data processing paradigm MapReduce across multiple distributed clusters, using Hadoop and Hive. Then, we propose a performance evaluation of Hive's data warehouse loading time and OLAP cube construction time. This evaluation was performed using HiveQL queries, under different variants of the architecture. We propose an architecture (Fig.2) that allows the partitioning of the data warehouse over different clusters (nodes), and the construction and querying of the OLAP cube by the user.

Fig. 2. The proposed architecture.

As can be seen in Fig.2, the user can issue an HQL query through the Web UI, the command line interface (CLI), or Java code via JDBC. The query is sent to the node that runs the Query Driver. The main task of the Query Driver is to translate the query into a MapReduce job, which includes the map phase plan and the reduce phase plan.
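To give an idea of what such a translation produces, the sketch below simulates, in plain Python, a map phase plan (filter the fact rows and emit a grouping key with the measure) and a reduce phase plan (aggregate the measure per key) for a revenue aggregation of the kind used later for the OLAP cube. The row layout and names are illustrative assumptions, not the actual plans generated by Hive's Query Driver.

```python
from collections import defaultdict

# Illustrative LINEORDER-like rows: (year, brand, region, revenue).
rows = [
    (1992, "MFGR#12", "ASIA", 100),
    (1993, "MFGR#12", "ASIA", 40),
    (1992, "MFGR#12", "EUROPE", 70),   # filtered out by the map plan
    (1993, "MFGR#21", "ASIA", 25),
]

def map_plan(row):
    """Map phase plan: filter rows, then emit ((year, brand), revenue)."""
    year, brand, region, revenue = row
    if region == "ASIA" and year >= 1992:
        yield (year, brand), revenue

def reduce_plan(key, values):
    """Reduce phase plan: sum the revenue for one (year, brand) group."""
    return key, sum(values)

# The framework's shuffle, simulated in memory.
groups = defaultdict(list)
for row in rows:
    for key, value in map_plan(row):
        groups[key].append(value)

cube_slice = dict(reduce_plan(k, v) for k, v in groups.items())
print(cube_slice)
# {(1992, 'MFGR#12'): 100, (1993, 'MFGR#12'): 40, (1993, 'MFGR#21'): 25}
```

This is exactly the shape of job the Query Driver produces from a HiveQL GROUP BY: predicates are pushed into the map phase, the GROUP BY columns become the intermediate key, and the aggregate function becomes the reduce phase.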

In the following, we introduce the data warehouse used for the study (section III.A). Then, we explain the HiveQL requests used for building the OLAP cube (section III.B).

A. Construction phase of the data warehouse

To build the data warehouse, we opted for the SSB [22] data model. It is a test bench designed to measure the performance of a star schema data warehouse [10]. Using the classic star schema model of sales stores, the implemented data warehouse (Fig.3) consists of a fact table named LINEORDER. It has seventeen attributes providing information about an order, including a primary key, composed of ORDERKEY and LINENUMBER, and the foreign keys of the dimension tables CUSTOMER, PART, DDATE and SUPPLIER.

Fig. 3. Data warehouse schema.

B. Loading phase and the OLAP cube construction

The majority of the operations supported by HiveQL are very similar to SQL. However, the architectural difference between the two systems on which these languages are based, and especially the use of HDFS as the file management system by Hive, imposes additional operations that require adaptation by the user. Thus, to measure the performance of Hive on data warehouse loading time and OLAP cube construction time, we developed a set of HiveQL queries. In this paper, we consider an OLAP query as a query that involves two levels of aggregation. Indeed, a classic OLAP query [7] consists of a set of queries on different levels of aggregation joined by a union operator, which is also implemented in HiveQL. We chose to build a data cube answering the following decision query: "What is the sum of sales revenues by year and brand in Asia since 1992?". Here, we look for the total income (REVENUE) as a measure according to the dimensions PRODUCT (PART) and TIME (DATE).

IV. EXPERIMENTATION

In this section, we evaluate the performance of data warehouse loading and OLAP cube construction with Hive, according to different variants of the defined architecture. The first two experiments (sections IV.A and IV.B) compare the performance of our platform between virtual and physical environments, by fixing the number of nodes (1 node) and the size of the implemented data warehouse (1 GB). The third experiment (section IV.C) addresses scaling up in a physical environment, varying the number of nodes (4 and 6) and the size of the data warehouse (from 1 GB to 1 TB).

A. Experiments on a virtual cluster (1 node)

For evaluation and familiarization purposes with the Hadoop system, a virtual machine is a good starting choice. It allows the deployment of Hadoop with a minimum of time and material resources. We chose to install the Cloudera-CDH4 development package [23] for virtual machines (VM), which provides a pre-installation of Hadoop and its components, including Hive, on Ubuntu. The package was installed on a physical machine (Fig.4) with standard features: 4 GB of RAM and an Intel 3.10 GHz x4 processor, running Windows 7. The configuration is equivalent to a virtual cluster with a single node, where the machine operates as NameNode and DataNode at the same time. In this first part of the experiment, we used a data warehouse of 1 GB, which is sufficient to test the feasibility of the proposed architecture in a virtual environment.

Fig. 4. Cloudera development environment on Linux.

B. Experiments on a physical cluster (1 node)

The second part of the experiment aimed to compare the performance of Hive deployed on a physical cluster against the virtual cluster already implemented. Thus, the same platform was deployed, this time on a physical machine, with 4 GB of RAM, an Intel 3.10 GHz x4 processor and Ubuntu as operating system. We used the version of Hadoop provided by Apache. The data warehouse is the same as in the first part of the experiment (1 GB).
The configuration set up is equivalent to a cluster with a single node, where the machine operates as NameNode and DataNode at the same time. Hive was installed from stable version 0.9.0, available on Apache's website [19].

C. Scalability on the physical cluster (4-6 nodes)

This part of the experiment aims to measure the response time of the Hadoop/Hive platform under load. It allows us to exploit the heart of the Hadoop system, namely the advanced use of HDFS and the distribution of data and loads over the network. All this with the implementation of a physical cluster according to the proposed architecture, and thus to test, modestly, what is currently done

by the big actors of the web with more resources, while trying to evaluate, at our level, the performance of such infrastructures. The first evaluation of Hive on scalability was performed on a cluster of four identical nodes, including the master machine. The second evaluation was performed, this time, on a configuration of six identical nodes, including the master machine. The machines all have the same features, i.e. 4 GB of RAM and an Intel 3.10 GHz x4 processor, running Ubuntu. This allows assessing the performance of the Hadoop platform while increasing both the size of the data warehouse and the number of nodes.

V. RESULTS PRESENTATION AND ANALYSIS

In this section we present the experimental results in two phases. The first phase consisted in fixing the size of the data warehouse (1 GB) and the number of nodes (1 node) to compare performance between a physical cluster and a virtual cluster. The second phase consists in comparing performance on a physical cluster, this time varying the size of the data warehouse (from 1 GB to 1 TB) and the number of nodes (4 and 6).

A. Configuration with a virtual cluster and a physical cluster

The results of the tests performed on both virtual and physical platforms are illustrated in the charts of Fig.5. The loading time of each table is measured in seconds. The sum of the loading times of all tables equals the total time of loading the data warehouse with Hive. The OLAP cube construction time is the time required for Hive to execute all the MapReduce jobs (tasks) created by compiling the group of HiveQL cube-construction queries.

Fig. 5. Results of data warehouse loading time and OLAP cube construction time between the virtual and the physical cluster (1 node).

The results clearly show the superiority of the physical environment for loading the data warehouse (39.09 s) and building the OLAP cube (17.91 min), consuming half of the time required by the virtual cluster for loading the data warehouse (2.38 min) and building the OLAP cube (35.8 min). The virtual cluster provides a quick and easy installation, especially by exploiting the Cloudera package. It facilitates the handling of the Hadoop system with all its components, and allows the creation of an easy-to-use environment to become familiar with cloud computing and data warehousing tools such as Hive. However, wider use with larger-scale data shows the limits of this virtual environment and favors the physical environment.

B. Configuration on a physical cluster with 4 and 6 nodes

The results of the scalability experiments with Hive are shown in the table below (Fig.6). The size of the data warehouse varies from 1 GB to 1 TB. Times are measured in seconds.

Fig. 6. Results of the scalability on physical clusters with 4 and 6 nodes.

We note that the loading time of the data warehouse grows with the size of the warehouse. Thus, if we take the results of loading and building the OLAP cube for a cluster of 6 nodes, the time clearly increases with the size of the warehouse, ranging from a matter of seconds to load a 1 GB data warehouse to more than 29 minutes for 1 TB; and from a little less than 12 minutes for the OLAP cube construction on a 1 GB data warehouse to less than 4 hours for a 1 TB one, which is at first sight huge. This is due to the fact that the machines used for testing are ordinary low-capacity machines compared to what is commonly used. However, these results remain encouraging, because it should be noted that there is a big difference in the transition from four to six nodes, especially in the OLAP cube construction results. These require the execution of a block of HQL queries and a treatment of results detailed below. Fig.7 compares the results of the tests performed on clusters with 4 and 6 nodes.

Fig. 7.
Results of the scaling up on the physical clusters with 4 and 6 nodes.

The loading time of the data warehouse is almost the same for both clusters with 4 and 6 nodes. These results follow a straight line with a relatively low slope, because the loading phase does not require the creation or execution of MapReduce jobs by the Hive compiler. The data is divided into small units and mounted in the same way

on the different nodes (DataNodes) of the cluster via HDFS. Thus, the larger the warehouse grows, the more time Hive needs; however, the evolution of this time remains relatively low, ranging from 2 minutes for loading a 10 GB data warehouse to 20 minutes for a 100 GB one. The variation in the number of nodes (from 4 to 6) is probably not significant enough to highlight differences in loading time that more specific tests, at the scale of Hadoop with dozens or hundreds of nodes, may reveal. In terms of OLAP cube construction time, test results are almost the same for data warehouses that Hadoop and Hive consider relatively small (less than 10 GB). For example, the OLAP cube construction time for a 1 GB warehouse is 12 minutes 30 seconds with 4 nodes and 12 minutes with 6 nodes, i.e., a difference of 30 seconds. The gap between the OLAP cube building times of the two clusters appears from a 10 GB data warehouse onward, and the benefit of adding two nodes to the cluster becomes appreciable as the size of the warehouse increases. For example, the OLAP cube construction for a 100 GB warehouse takes a little over 3 hours with 6 nodes and almost 5 hours with 4 nodes, i.e., a difference of almost 2 hours. OLAP cube construction times with Hive on the 6-node cluster are clearly better than those obtained with the 4-node cluster. In this case, the construction of the cube is done by executing the group of HiveQL queries previously defined. MapReduce jobs are created automatically by the Hive compiler and executed by the different nodes of the cluster, which also hold the data partitions. Thus, the more the number of nodes increases, the smaller the data partitions are, and the more quickly the MapReduce jobs return their results.
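The relation between node count and partition size can be made concrete: HDFS splits a table into fixed-size blocks spread over the DataNodes, so adding nodes shrinks the share of data each node must scan. The sketch below models this with assumed figures (a 100 GB table and the classic 64 MB HDFS default block size); it is a back-of-the-envelope model, not a measurement from our experiments.

```python
import math

BLOCK_MB = 64           # classic HDFS default block size (assumed here)
TABLE_MB = 100 * 1024   # a 100 GB fact table, as in the experiments

def blocks_per_node(table_mb, nodes, block_mb=BLOCK_MB):
    """Blocks each DataNode scans if the table's blocks are spread evenly."""
    total_blocks = math.ceil(table_mb / block_mb)
    return math.ceil(total_blocks / nodes)

# Going from 4 to 6 DataNodes cuts each node's share by about a third,
# which is why the MapReduce jobs building the cube finish sooner.
share_4 = blocks_per_node(TABLE_MB, 4)
share_6 = blocks_per_node(TABLE_MB, 6)
print(share_4, share_6)  # 400 267
```

The model ignores replication, stragglers and shuffle costs, which is why the measured speedup is smaller than the ideal 4/6 ratio for small warehouses and only approaches it as the warehouse grows.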
In the end, the execution plan remains the same between the two clusters, but the number of nodes increases, which reduces the size of the different data partitions and accelerates the execution of the different jobs, thereby also accelerating the construction of the OLAP cube.

VI. CONCLUSION

The size of the data sets collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. In this paper we have shown how to benefit from cloud computing technologies to build OLAP cubes by using MapReduce and Hive. Our motivation was to manipulate these concepts by creating an environment with a project around these technologies. The aim was the creation of a model environment on Hadoop, with the establishment of a private cloud computing environment for creating data warehouses and performing online analysis. The experiments carried out have allowed handling and mounting various data warehouses on Hive and evaluating the performance of different variants of the architecture implemented with Hadoop. In addition, these experiments allowed us to exploit the important part of the Hadoop system which is its file management system (HDFS), and especially to understand the logic of its operation. Also, the environment of Hadoop on Ubuntu (in addition to the Cloudera Virtual Machine distribution) was used to validate new versions of the Hadoop ecosystem projects such as Hive. These are more and more stable and relatively easy to deploy in order to exploit and benefit from the power of MapReduce in managing large data warehouses. The perspectives of this project are, in the short term, to continue these experimentations by increasing the number of nodes and the size of the data to several terabytes or even petabytes, in order to better evaluate the performance of these systems.
In the medium term, we plan to develop and implement new algorithms with the Map and Reduce functions of this environment, and to exploit the Cloud architecture at the PaaS level. Finally, in the long term, we plan to develop BI (Business Intelligence) tool solutions in this type of environment.

REFERENCES

[1] A. Abello, J. Ferrarons, O. Romero. Building Cubes with MapReduce. DOLAP'11, October 28, 2011, Glasgow, Scotland, UK.
[2] A. Dou, et al. Misco: a MapReduce framework for mobile systems. In Proceedings of PETRA'10, ACM, New York, NY, USA, 2010.
[3] A. Thusoo, et al. Hive: a warehousing solution over a map-reduce framework. Facebook Data Infrastructure Team.
[4] C. Miceli, et al. Programming abstractions for data intensive computing on clouds and grids. In Proceedings of CCGRID'09, IEEE Computer Society, Washington, DC, USA, 2009.
[5] E. Capriolo, D. Wampler, J. Rutherglen. Programming Hive. O'Reilly.
[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
[7] J. Gray, et al. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In Proceedings of the International Conference on Data Engineering, New Orleans, USA.
[8] L. d'Orazio, S. Bimonte. Intégration des tableaux multidimensionnels en Pig pour l'entreposage de données sur les nuages.
[9] M. Armbrust, A. Fox, R. Griffith. Above the Clouds: a Berkeley view of Cloud Computing. Technical Report UCB/EECS, 2009.
[10] P. O'Neil, B. O'Neil, X. Chen. Star Schema Benchmark, 2009. Web page. poneil/starschemab.pdf
[11] Q. Wang, et al. On the correctness criteria of fine-grained access control in relational databases. In Proceedings of VLDB.
[12] R. Chaiken, et al. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2).
[13] R. Chen, et al. Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling. In Proceedings of the 19th PACT'10, ACM, New York, NY, USA, 2010.
[14] S. Ibrahim, et al. Cloudlet: towards MapReduce implementation on virtual machines. In Proceedings of the 18th ACM HPDC'09, ACM, New York, NY, USA, 2009.
[15] S. Genaud. MapReduce : un cadre de programmation parallèle pour l'analyse de grandes données. Université de Strasbourg, 2011.
[16] T. White. Hadoop: The Definitive Guide. O'Reilly.
[17] V. Guana and J. Davidson. On Comparing Inverted Index Parallel Implementations Using MapReduce. University of Alberta.
[18] W. Inmon. Building the Data Warehouse. Wiley, New York, USA.
[19] Apache Hive Releases. Web page.
[20] Amazon EC2. Web page. http://aws.amazon.com/ec2/
[21] Microsoft Azure. Web page.
[22] Star Schema Benchmark. Web page. poneil/StarSchemaB.pdf
[23] Cloudera Enterprise and CDH4.0. Web page. /blog/2012/06/cdh4-and-cloudera-enterprise-4-0-now-available/


Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05 Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970

More information

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data Hive vs. JavaScript for Processing Big Data For some time Microsoft didn t offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Data Technologies Compared June 2014

Big Data Technologies Compared June 2014 Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

Big Data and Market Surveillance. April 28, 2014

Big Data and Market Surveillance. April 28, 2014 Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part

More information

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform... Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Discovery 2015: Cloud Computing Workshop June 20-24, 2011 Berkeley, CA Introduction to Cloud Computing Keith R. Jackson Lawrence Berkeley National Lab What is it? NIST Definition Cloud computing is a model

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Research Article Hadoop-Based Distributed Sensor Node Management System

Research Article Hadoop-Based Distributed Sensor Node Management System Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

More information

From Wikipedia, the free encyclopedia

From Wikipedia, the free encyclopedia Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud) Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.

More information

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

An Approach to Analyze Large Scale Wireless Sensors Network Data

An Approach to Analyze Large Scale Wireless Sensors Network Data An Approach to Analyze Large Scale Wireless Sensors Network Data Soufiane FARRAH * Hanane El Manssouri El Houssaine Ziyati Mohamed Ouzzif IT Department IT Department IT Department IT Department High School

More information

On a Hadoop-based Analytics Service System

On a Hadoop-based Analytics Service System Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

Data Warehouse Optimization

Data Warehouse Optimization Data Warehouse Optimization Embedding Hadoop in Data Warehouse Environments A Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy September 2013 Sponsored by Copyright

More information

Adobe Deploys Hadoop as a Service on VMware vsphere

Adobe Deploys Hadoop as a Service on VMware vsphere Adobe Deploys Hadoop as a Service A TECHNICAL CASE STUDY APRIL 2015 Table of Contents A Technical Case Study.... 3 Background... 3 Why Virtualize Hadoop on vsphere?.... 3 The Adobe Marketing Cloud and

More information

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Enhancing Massive Data Analytics with the Hadoop Ecosystem www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3, Issue 11 November, 2014 Page No. 9061-9065 Enhancing Massive Data Analytics with the Hadoop Ecosystem Misha

More information

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Jongwook Woo Computer Information Systems Department California State University Los Angeles jwoo5@calstatela.edu Abstract As the web, social networking,

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Cloud computing - Architecting in the cloud

Cloud computing - Architecting in the cloud Cloud computing - Architecting in the cloud anna.ruokonen@tut.fi 1 Outline Cloud computing What is? Levels of cloud computing: IaaS, PaaS, SaaS Moving to the cloud? Architecting in the cloud Best practices

More information

Applied research on data mining platform for weather forecast based on cloud storage

Applied research on data mining platform for weather forecast based on cloud storage Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Big Data Analytics OverOnline Transactional Data Set

Big Data Analytics OverOnline Transactional Data Set Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

Modernizing Your Data Warehouse for Hadoop

Modernizing Your Data Warehouse for Hadoop Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist Audie.Wright@Microsoft.com O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

More information

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information