THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL


By VANESSA CEDENO

A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Computer Science

Degree Awarded: Spring Semester, 2010

The members of the committee approve the dissertation of Vanessa Cedeno defended on April 15.

Feifei Li
Professor Directing Dissertation

Zhenghao Zhang
Committee Member

Piyush Kumar
Committee Member

The Graduate School has verified and approved the above-named committee members.

I dedicate this to my parents.

ACKNOWLEDGEMENTS

I would like to acknowledge and thank my major professor Dr. Feifei Li for all his advice and guidance. I would also like to thank Dr. Zhenghao Zhang, along with Dr. Piyush Kumar, for their willingness to serve on my committee. Also, I would like to thank my friend Daniel Gomez for his advice and feedback throughout all the various stages of the project. Lastly, I would like to thank the entire faculty and staff of the Computer Science Department, Florida State University.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract
1. INTRODUCTION
2. PROBLEM DEFINITION
    2.1 Parallel DBMS
    2.2 MapReduce
3. HADOOP
4. HADOOPDB
    4.1 Installing HadoopDB
5. RESULTS
    5.1 Data Loading
    5.2 Grep Task
    5.3 Selection Task
    5.4 Aggregation Task
    5.5 Join Task
6. CONCLUSION
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3.1 Differences between parallel databases and Hadoop
5.1 Summary of results

LIST OF FIGURES

5.1 Grep, Selection and Aggregation Tasks
5.2 Join Task

ABSTRACT

This project presents and evaluates the performance of HadoopDB compared to the DBMS PostgreSQL. The amount of data being produced keeps expanding, and the analysis of this information can be done efficiently with hundreds of machines working in parallel. For analytical management of large amounts of data, HadoopDB tries to combine the efficiency and performance of parallel databases with the scalability, fault tolerance and flexibility of MapReduce.

Parallel databases are analytical DBMS systems deployed on a shared-nothing architecture. The data analysis includes scan operations, multidimensional aggregation and star schema joins. MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. Because the data to be analyzed is growing considerably, MapReduce-based systems are proposed as an alternative that can manage thousands of nodes in a shared-nothing architecture, but they lack the features needed to analyze structured data. HadoopDB is a project from Yale University proposed as a hybrid system that uses MapReduce as the communication layer between multiple nodes, each of them running a DBMS instance.

In this paper, we compare the performance of different types of queries using HadoopDB with PostgreSQL as the database layer against using just PostgreSQL as the tool for the same data analysis.

CHAPTER 1
INTRODUCTION

Each year approximately 5 exabytes (1 exabyte = 10^18 bytes) of new information is produced [8]. One reason for this is that technology drives storage demand. Enterprises are always in search of a system that can manage, process and granularly analyze their data. However, parallel databases do not scale well into the hundreds of nodes. One reason is that failures become more frequent as the number of nodes increases; another is that these databases assume a homogeneous array of machines. Parallel databases have also not been tested on thousands of nodes, but with the growth in data this will be necessary in the future.

To solve these problems a new hybrid system called HadoopDB was proposed. It takes advantage of the MapReduce programming model, developed to scale to thousands of nodes in a shared-nothing architecture, while keeping the performance and efficiency advantages of parallel databases [1]. Queries expressed in SQL are translated into MapReduce for distribution between the nodes and are then handled by the higher-performance single-node databases.

HadoopDB is built out of open source components. Hadoop, the open source version of MapReduce, is used to manage the communication between the growing number of nodes, and the database layer is managed by PostgreSQL. PostgreSQL is a powerful, open source object-relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness [7]. This hybrid system is evaluated to determine how much it improves performance compared to a DBMS, in this case plain PostgreSQL.

In chapter 2 the problem is defined, describing the parallel DBMS and MapReduce technologies. In chapter 3 the Hadoop project is explained. In chapter 4 the HadoopDB system and the installation process are presented. In chapter 5 the results of the experiments are presented. In chapter 6 the conclusion of the project is given.

CHAPTER 2
PROBLEM DEFINITION

HadoopDB is a hybrid system that takes advantage of two existing technologies, DBMS and MapReduce. It targets analytical workloads and is designed to run on a shared-nothing cluster of commodity machines. It tries to offer a free and open source parallel DBMS. It is as scalable as Hadoop, while achieving superior performance on structured data analysis workloads.

2.1 Parallel DBMS

A parallel database system tries to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. Although data may be stored in a distributed fashion, the distribution is governed solely by performance considerations. Parallel databases improve processing and input/output speeds by using multiple CPUs and disks in parallel [5]. The advantage of parallel databases is the analysis of large-scale data workloads. Automated tools exist that can design, deploy, tune and maintain these databases. However, parallel databases do not handle fault tolerance or heterogeneous environments well. The main reason is that parallel databases treat a failure as a rare event and assume that a typical large cluster has a few dozen nodes, not hundreds or thousands.

2.2 MapReduce

MapReduce is a programming model in which the user specifies a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [9]. First, a set of Map tasks is processed in parallel, each node working individually. Then the intermediate data is redistributed across all the nodes in the cluster. Finally, each node takes the partition it has received and performs the Reduce task in parallel with the other nodes.
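The map, redistribute and reduce phases described above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (map_fn, shuffle, reduce_fn, run_mapreduce) and the sample records are hypothetical, and the nodes are simulated as plain lists. The example sums a rating per name.

```python
from collections import defaultdict

def map_fn(record):
    """Map: turn one input record into intermediate (key, value) pairs."""
    name, rating = record
    yield name, rating

def shuffle(pairs):
    """Redistribution step: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: merge all values that share the same intermediate key."""
    return key, sum(values)

def run_mapreduce(partitions):
    # Each partition stands in for the data held by one node.
    intermediate = []
    for partition in partitions:
        for record in partition:
            intermediate.extend(map_fn(record))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

# Hypothetical sample data split over two simulated nodes.
partitions = [
    [("alice", 3), ("bob", 5)],
    [("alice", 4), ("carol", 1)],
]
result = run_mapreduce(partitions)
print(result)
```

In a real cluster the map calls run on different machines and the grouping step moves data over the network; the sketch only shows the data flow between the three phases.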

MapReduce checkpoints the output of each Map task to local disk so that it can recover when facing a failure. Together with redundant task execution, this is what gives it its fault tolerance and its ability to operate in heterogeneous environments. If a failure is detected, Map tasks are reassigned to other live nodes [1]. Because MapReduce requires the user to develop a certain structure for the existing data, it is ideal to combine its advantages with the performance of parallel databases.

CHAPTER 3
HADOOP

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop is a MapReduce implementation for processing large data sets over thousands of nodes [6]. Hadoop includes various subprojects such as Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig and ZooKeeper. HadoopDB uses Hadoop Common because it contains the utilities that support the other Hadoop subprojects [2]. MapReduce-based systems do not accept SQL commands by themselves, but Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying [4].

Hadoop implements MapReduce and, in addition, provides a distributed file system, HDFS, that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Table 3.1: Differences between parallel databases and Hadoop [1].

                   Parallel Database                          MapReduce
Data               Designed for structured, relational data   Designed for unstructured data
Query Interface    SQL                                        MapReduce programs written in a variety of languages, including SQL
Query Execution    Pipelines results between operators        Materializes results between Map and Reduce phases
Job Granularity    Entire query                               Determined by data storage block size

The table shows that parallel databases provide high performance, but scalability is not one of their properties, while MapReduce is scalable, but high performance is not one of its strengths. It would be ideal to combine scalability and high performance in one tool.
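To make the "Job Granularity" row of Table 3.1 concrete, the following minimal sketch counts how many fixed-size blocks a file occupies. In plain Hadoop the number of Map tasks usually follows the number of blocks. The 64 MB block size and the 560 MB file size are assumptions chosen for illustration, and num_blocks is a hypothetical helper, not part of Hadoop.

```python
import math

# Assumed block size for illustration; the thesis does not state the value used.
BLOCK_SIZE_MB = 64

def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of fixed-size blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

# A hypothetical 560 MB file: eight full blocks plus one partial block.
blocks = num_blocks(560)
print(blocks)
```

A parallel database, by contrast, would treat a query over the same file as a single unit of work.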

CHAPTER 4
HADOOPDB

HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workloads [3]. HadoopDB gives Hadoop access to multiple single-node DBMS servers deployed across the cluster; in this project the DBMS is PostgreSQL. HadoopDB pushes as much data processing as possible into the database engine by issuing SQL queries, creating a system that resembles a shared-nothing parallel database. Applying techniques taken from the database world leads to a performance boost, and because HadoopDB relies on the MapReduce framework it guarantees scalability and fault and heterogeneity tolerance similar to Hadoop.

The Hadoop framework in HadoopDB consists of two layers, the Hadoop Distributed File System (HDFS) and the MapReduce framework. HDFS is a block-structured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas.

The MapReduce framework follows a master-slave architecture. The master is a single JobTracker and the slaves, or worker nodes, are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker's load and available resources. Each job is broken down into Map tasks, based on the number of data blocks that require processing, and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing.

4.1 Installing HadoopDB

There are not many tutorials that help with the installation of HadoopDB. The official tutorial is very helpful but does not describe the process in detail. The following tutorial is intended to help future students install HadoopDB more quickly.

The required software includes Java 1.6.x, Hadoop, and a single-node database installed on each slave node in the cluster. In our project the database is PostgreSQL.

After unpacking the downloaded Hadoop distribution, edit the file conf/hadoop-env.sh to define JAVA_HOME to be the root of your Java installation:

    export JAVA_HOME=/usr/

To start the Hadoop cluster in fully distributed mode, configure the file conf/hadoop-site.xml. The fs.default.name value specifies the name of the machine that will become the master node. The mapred.job.tracker value specifies the name of the machine that will become the job tracker; it could be the master node itself, but in our case it is a different node. The dfs.replication value indicates how many times the data is replicated across the nodes; in our project it is 1.

Now check that you can ssh to the localhost without a passphrase. Also, the file conf/slaves needs to list all the slave nodes that are going to be part of the cluster.

To start a Hadoop cluster you need to start both HDFS and MapReduce. First format a new distributed file system:

    bin/hadoop namenode -format

To start HDFS, run the following command on the designated NameNode. The script also consults the conf/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves:

    bin/start-dfs.sh

To stop HDFS, run the following command on the designated NameNode. The script also consults the conf/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves:

    bin/stop-dfs.sh

To start MapReduce, run the following command on the designated JobTracker. The script also consults the conf/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves:

    bin/start-mapred.sh

To stop MapReduce, run the following command on the designated JobTracker. The script also consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves:

    bin/stop-mapred.sh

After installing Hadoop, HadoopDB's binary version, hadoopdb.jar, needs to be placed under HADOOP_HOME/lib on each node. The JDBC driver jar is also required and needs to be placed in the lib directory. For Hadoop to recognize these changes a restart may be required: stop and then start Hadoop.

After installing PostgreSQL, the pg_hba.conf file should be edited as follows:

    local all all trust
    host all all /0 password
    host all all ::1/128 trust
    host all all xxx.xxx.xxx.x password

Then edit the file postgresql.conf as follows:

    listen_addresses = '*'    # what IP address(es) to listen on;
    port = 5432               # (change requires restart)
    shared_buffers = 512MB
    work_mem = 1024MB

Then restart Postgres using pg_ctl restart. If PostgreSQL refuses to start with such large memory settings, you may need to run:

    sysctl -w kernel.shmmax=

This setting should also be added to /etc/sysctl.conf.

The SMS Planner setup refers to the SQL to MapReduce to SQL (SMS) planner. The planner consists of a slightly modified Hive build and SMS-specific classes. Installing Hive is easy, and after the installation the command HIVE_HOME/bin/hive will run it.

Before executing any HadoopDB jobs, the data needs to be prepared and loaded. First, the data needs to be loaded into HDFS:

    bin/hadoop dfs -put file.txt name_of_the_file_on_hdfs.txt

To list the data in HDFS and make sure the loading was successful:

    bin/hadoop fs -lsr /

Then a custom-made Hadoop job, GlobalHasher, repartitions the data into a specified number of partitions. In our case we were working with 8 nodes:

    bin/hadoop jar /usr/local/hadoop /lib/hadoopdb.jar edu.yale.cs.hadoopdb.dataloader.GlobalHasher /file.txt /parts/ 6 \ 0

Each partition should be downloaded into a node's local file system, for example:

    bin/hadoop fs -get /parts/ part /usr/local/hadoop /p0

Then we created a table in PostgreSQL on each node:

    CREATE TABLE grep (
        userid int,
        NAME varchar(300),
        rating int
    );

Each chunked file is then bulk-loaded into a separate database within a node and indexed appropriately:

    COPY grep FROM '/usr/local/hadoop /p0' WITH DELIMITER E'\t';

The HadoopDB Catalog needs to reflect the location of all chunks. In the current implementation the catalog is an XML file stored in HDFS. The SimpleCatalogGenerator tool assumes uniformly distributed chunks across all nodes, with each node having exactly the same number of chunks. It associates each relation's chunks with a partition id and allows both chunked and complete relations to exist on a node. The following command generates the HadoopDB.xml catalog:

    java -cp hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator path_to_catalog.properties

The Catalog.properties file contains the parameters necessary for Catalog generation:

    nodes_file=machines.txt                    #specifies the nodes in the cluster
    relations_unchunked=grep, EntireRankings   #specifies the table names
    relations_chunked=rankings, UserVisits     #specifies the table names

After generating the catalog, upload it to HDFS:

    hadoop dfs -put HadoopDB.xml HadoopDB.xml

If you modify an existing HadoopDB.xml file, do not forget to remove the old copy from HDFS before uploading it again:

    bin/hadoop dfs -rmr HadoopDB.xml

Before any query execution, we update the MetaStore with references to our database tables. Hive allows tables to exist externally, outside HDFS.

    CREATE EXTERNAL TABLE grep (
        userid int,
        NAME string,
        rating int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS
        INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/db/grep';

Note that the location path needs to have the relation name as the final directory in the path.
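The repartitioning step performed by GlobalHasher earlier in this chapter can be pictured with a small hash-partitioning sketch. This is an illustration of the general technique only, not the GlobalHasher implementation: the helper names (partition_id, hash_partition), the choice of CRC32 as the hash, and the sample rows are all assumptions.

```python
import zlib
from collections import defaultdict

def partition_id(key, num_partitions):
    """Map a key to a partition number with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def hash_partition(lines, num_partitions, field_pos=0, delimiter="\t"):
    """Group delimited text lines into partitions by hashing one field."""
    parts = defaultdict(list)
    for line in lines:
        key = line.split(delimiter)[field_pos]
        parts[partition_id(key, num_partitions)].append(line)
    return parts

# Hypothetical rows in the (userid, NAME, rating) layout of the grep table.
rows = ["1\tabc\t7", "2\txyz\t3", "1\tdef\t9", "3\tghi\t5"]
parts = hash_partition(rows, num_partitions=6)
for pid in sorted(parts):
    print(pid, parts[pid])
```

Because the partition is a function of the key alone, all rows that share a key land in the same chunk, which is what later allows each node to process its chunk independently.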

CHAPTER 5
RESULTS

The following experiments compare the performance of HadoopDB and PostgreSQL. Hadoop running on Java was used. The characteristics of the machines used as nodes were the following:

    Operating system: Fedora Release 9
    Memory: 2.0 GiB
    Processor: Intel(R) Xeon(R) CPU 1.86 GHz
    Available disk space: 22.2 GiB

We worked with random data of 560 MB on each node.

5.1 Data Loading

Putting the 3.3 GB of data on HDFS before performing the partitioning took 291 seconds, while the GlobalHasher took 482 seconds to split the data into 6 chunks.

5.2 Grep Task

Each record consists of a unique key in the first 10 bytes, followed by a 90-byte character string. The pattern XYZ is searched for in the 90-byte field. Each node contains 560 MB of data. HadoopDB and PostgreSQL executed the identical SQL:

    SELECT * FROM grep WHERE NAME LIKE '%xyz%';

None of the benchmarked systems contained an index on the searched attribute. This means that for all systems this query requires a full table scan and is mostly limited by disk speed. HadoopDB's SMS planner pushes the WHERE clause into the PostgreSQL instances. The performance of each system is presented in Figure 5.1.
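The grep query of Section 5.2 can be tried on a toy table with the sketch below. SQLite, from the Python standard library, is used purely as a stand-in so that the example is self-contained; the experiments themselves ran on PostgreSQL, and the four sample rows are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in here; the thesis experiments used PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grep (userid INTEGER, NAME TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO grep VALUES (?, ?, ?)",
    [(1, "aaxyzbb", 7), (2, "hello", 3), (3, "xyz", 9), (4, "xy-z", 5)],
)

# No index exists on NAME, so the engine has to scan every row.
# Note: SQLite's LIKE is case-insensitive for ASCII by default; PostgreSQL's is not.
matches = conn.execute(
    "SELECT * FROM grep WHERE NAME LIKE '%xyz%' ORDER BY userid"
).fetchall()
print(matches)
conn.close()
```

A leading wildcard in the pattern means that an ordinary index on the column would not help, which is one more reason the task is dominated by the cost of the scan.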

5.3 Selection Task

The first structured data task evaluates a simple selection predicate on the rating attribute of the grep table. There are approximately 36,000 tuples on each node that pass this predicate.

    SELECT NAME, rating FROM grep WHERE rating > 5;

HadoopDB's SMS planner pushes the selection and projection clauses into the PostgreSQL instances. The performance of each system is presented in Figure 5.1.

5.4 Aggregation Task

Unlike the previous tasks, this task requires intermediate results to be exchanged between different nodes in the cluster so that the final aggregate can be calculated.

    SELECT SUM(rating), NAME FROM grep GROUP BY NAME;

The SMS planner for HadoopDB pushes the entire SQL query into the PostgreSQL instances. The output is then sent to Reduce jobs inside Hadoop that perform the final aggregation after collecting all pre-aggregated sums from each PostgreSQL instance. The performance of each system is presented in Figure 5.1.

5.5 Join Task

The key difference between this task and the previous tasks is that it must read in two different data sets and join them together. We push the selection, join, and partial aggregation into the PostgreSQL instances with the following SQL:

    SELECT g.name, AVG(h.userid), SUM(g.rating)
    FROM grep g JOIN grep h ON (g.name = h.name)
    GROUP BY g.name;

The performance of each system is presented in Figure 5.2.

[Figure 5.1: Grep, Selection and Aggregation Tasks. Bar chart of execution time in seconds for HadoopDB and PostgreSQL.]

[Figure 5.2: Join Task. Bar chart of execution time in seconds for HadoopDB and PostgreSQL.]
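The two-phase execution of the aggregation task in Section 5.4 can be sketched in a few lines: every node first computes its own partial sums (the part pushed into PostgreSQL), and a reduce step then combines them. The function names and the sample rows below are hypothetical, and the nodes are simulated as lists in a single process.

```python
from collections import Counter

def local_aggregate(rows):
    """Per-node step: SUM(rating) GROUP BY NAME over the node's own rows."""
    sums = Counter()
    for name, rating in rows:
        sums[name] += rating
    return sums

def final_aggregate(partials):
    """Reduce step: add up the pre-aggregated sums from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(total)

# Hypothetical (NAME, rating) rows held by two nodes.
node_rows = [
    [("a", 2), ("b", 5), ("a", 1)],
    [("b", 4), ("c", 6)],
]
partials = [local_aggregate(rows) for rows in node_rows]
totals = final_aggregate(partials)
print(totals)
```

Pre-aggregating on each node means that only one row per group and node has to cross the network, rather than every input tuple.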

Table 5.1: Summary of results.

Task           HadoopDB    PostgreSQL
Grep             47.4 s         180 s
Selection        46.8 s         182 s
Aggregation           s         231 s
Join             1709 s        6494 s
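As a quick check on Table 5.1, the speedup of HadoopDB over PostgreSQL can be computed as the ratio of the two execution times. The sketch below uses only the rows for which both times appear in the table; the Aggregation row is left out because its HadoopDB time is missing.

```python
# Execution times in seconds, copied from Table 5.1 as (HadoopDB, PostgreSQL).
# The Aggregation row is omitted: its HadoopDB time is missing from the table.
times = {
    "Grep": (47.4, 180.0),
    "Selection": (46.8, 182.0),
    "Join": (1709.0, 6494.0),
}

# Speedup = PostgreSQL time divided by HadoopDB time, rounded to two decimals.
speedups = {task: round(pg / hdb, 2) for task, (hdb, pg) in times.items()}
for task, s in speedups.items():
    print(f"{task}: {s}x")
```

All three ratios fall between 3.8 and 3.9, which agrees with the factor of about 3.8 quoted in the conclusion.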

CHAPTER 6
CONCLUSION

The experiments show that PostgreSQL performance is improved by using the hybrid system HadoopDB. The MapReduce-based system allows the data to be distributed among the different nodes in the cluster. In this way the load is distributed and the execution time is reduced. Table 5.1 shows that a cluster of 8 nodes running HadoopDB performed about 3.8 times faster than the same SQL executed on PostgreSQL with the same amount of data.

HadoopDB is a hybrid of the parallel DBMS and Hadoop technologies for data analysis. It achieves the performance and efficiency of parallel databases, with the scalability, fault tolerance, and flexibility of MapReduce-based systems. The ability of HadoopDB to incorporate Hadoop and open source DBMS software makes HadoopDB flexible and extensible for performing data analysis at the large scales expected of future workloads. The experiments support this by comparing against the same data load processed on a single machine.

REFERENCES

[1] A. Abouzeid, K. Bajda-Pawlikowski, and D. Abadi. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
[2] Hadoop. Web Page.
[3] HadoopDB. Web Page.
[4] Hive. Web Page.
[5] Parallel database. Web Page.
[6] Apache Hadoop. Web Page.
[7] PostgreSQL. Web Page.
[8] C. Mullins. The DBA corner. Web Page. July.
[9] MapReduce. Web Page.

BIOGRAPHICAL SKETCH

Vanessa Cedeno

In the spring of 2007, Vanessa Cedeno completed her Bachelor's degree in Computer Engineering at ESPOL, Escuela Superior Politecnica del Litoral, in Guayaquil, Ecuador. She enrolled in the master's program at Florida State University in a later fall semester. Vanessa's research interests include database systems, distributed systems and analysis of algorithms.