Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data




For some time Microsoft didn't offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes of data is limited. Hadoop, which was designed for this purpose, is written in Java and was not available to .NET developers. So, Microsoft launched the Hadoop on Windows Azure service to make it possible to distribute the load and speed up big data computations. But it is hard to find guides explaining how to work with Hadoop on Windows Azure, so here we present an overview of two out-of-the-box ways of processing big data with Hadoop on Windows Azure and compare their performance.

By Sergey Klimov and Andrei Paleyes, Senior R&D Engineers at Altoros.

Contents
1. Introduction
2. Testing environment
3. The results of the research
3.1. Test results for an 8 MB HDFS block
3.2. Test results for a 64 MB HDFS block
3.3. Dependency between the block size and the number of Map tasks
3.4. Dependency between performance and the number of MapReduce tasks
3.5. Dependency between performance and the type of a query
4. Conclusion
5. About the authors

Altoros Systems

1. Introduction

When the R&D department at Altoros Systems, Inc. started this research, we only had access to a community technology preview (CTP) release of the Apache Hadoop-based Service on Windows Azure. To connect to the service, Microsoft provides a Web panel and Remote Desktop Connection. We analyzed two ways of querying with Hadoop that were available from the Web panel: HiveQL querying and a JavaScript implementation of MapReduce jobs.

A test data set was generated based on the US Air Carrier Flight Delays information downloaded from Windows Azure Marketplace. It was used to test how the system would handle big data. We created the following eight types of queries in both languages and measured how fast they were processed:

1. Count the number of flight delays by year.
2. Count the number of flight delays and display information by year and month.
3. Count the number of flight delays and display information by year, month, and day of month.
4. Calculate the average flight delay time by year.
5. Calculate the average flight delay time and display information by year and month.
6. Calculate the average flight delay time and display information by year, month, and day of month.
7. Count the number of delays by the origin airport.
8. Count the number of delays by the destination airport.

This analysis presents the performance test results and shows how the throughput varies depending on the block size. The research contains two tables and three diagrams that demonstrate the findings.

2. Testing environment

As a testing environment we used a small Windows Azure cluster of three nodes. The capacities of its physical CPU were divided among the three virtual machines that served as nodes. Obviously, this could introduce some errors into performance measurements. Therefore, we launched each query several times and used the average value for our benchmark.

The data we used for the tests consisted of five CSV files of 1.83 GB each. In total, we processed 9.15 GB of data. The replication factor was equal to three. This means that each data set had a synchronized replica on each node in the cluster. The speed of data processing varied depending on the block size; therefore, we compared results achieved with 8 MB, 64 MB, and 256 MB blocks.
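To make the query types from the introduction more concrete, here is a minimal Python sketch of the map and reduce steps behind a count query (such as query 1) and an average query (such as query 4). It is not the code used in the research, and it runs in memory rather than on Hadoop. The row layout, the sample values, and the rule that a flight counts as delayed when its delay is positive are all assumptions made for illustration; the article does not describe the data set's schema.

```python
from collections import defaultdict

# Hypothetical rows: (year, month, day_of_month, delay_minutes).
# The real column layout of the data set is not given in the article.
SAMPLE_ROWS = [
    (2007, 1, 3, 12.0),
    (2007, 1, 3, 0.0),
    (2007, 2, 9, 30.0),
    (2008, 5, 1, 6.0),
]


def map_delay(row, key_len=1):
    """Map step: emit (grouping key, delay) for delayed flights only."""
    *date_parts, delay = row
    if delay > 0:  # assumption: "delayed" means a positive delay
        yield tuple(date_parts[:key_len]), delay


def reduce_count_and_average(pairs):
    """Reduce step: group by key and return {key: (count, average delay)}."""
    grouped = defaultdict(list)
    for key, delay in pairs:
        grouped[key].append(delay)
    return {k: (len(v), sum(v) / len(v)) for k, v in grouped.items()}


def run_query(rows, key_len=1):
    """key_len=1 groups by year, 2 by year and month, 3 adds day of month."""
    pairs = (pair for row in rows for pair in map_delay(row, key_len))
    return reduce_count_and_average(pairs)


by_year = run_query(SAMPLE_ROWS, key_len=1)
by_year_month = run_query(SAMPLE_ROWS, key_len=2)
```

Changing key_len is the only difference between grouping by year, by year and month, or by year, month, and day of month, which mirrors how the queries in each group differ from one another.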

3. The results of the research

3.1. Test results for an 8 MB HDFS block

Table 1

Query type | Total MapReduce CPU time, HQL (min:sec.msec) | Total MapReduce CPU time, JavaScript (min:sec.msec) | Map/Reduce tasks, HQL | Map/Reduce tasks, JavaScript
1. Count the number of flight delays by year | 7:21.635 | 106:33.722 | 38/10 | 1170/1
2. Count the number of flight delays and display information by year and month | 7:56.113 | 111:13.209 | 38/10 | 1170/1
3. Count the number of flight delays and display information by year, month, and day of month | 9:27.940 | 115:59.188 | 38/10 | 1170/1
4. Calculate the average flight delay time by year | 12:41.158 | 99:4.989 | 38/10 | 1170/1
5. Calculate the average flight delay and display information by year and month | 13:33.45 | 103:54.367 | 38/10 | 1170/1
6. Calculate the average flight delay time and display information by year, month, and day of month | 14:48.310 | 110:18.658 | 38/10 | 1170/1
7. Count the number of delays by the origin airport | 7:26.364 | 106:43.959 | 38/10 | 1170/1
8. Count the number of delays by the destination airport | 7:26.57 | 106:50.534 | 38/10 | 1170/1

Brief summary

As the table shows, the time spent on processing JavaScript queries didn't vary significantly and ranged from 99 to 116 minutes. Although Hive queries were processed 7-15 times faster than the JavaScript implementations, their computation speed differed greatly depending on the type of the query. We'll expand on this dependency in Figure 3, "Dependency between performance and the type of a query."
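The "7-15 times faster" figure can be checked directly from Table 1. The sketch below converts a few of the reported CPU times to seconds and divides the JavaScript time by the Hive time. It assumes the part after the colon can be read as decimal seconds, which is unambiguous only for values with three fractional digits, so only such rows are used here; the helper name is our own.

```python
def to_seconds(stamp):
    """Convert a 'min:sec.msec' string such as '7:21.635' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# (HQL, JavaScript) CPU times copied from Table 1 for queries 1, 4, and 6.
TABLE_1 = {
    1: ("7:21.635", "106:33.722"),
    4: ("12:41.158", "99:4.989"),
    6: ("14:48.310", "110:18.658"),
}

# Ratio of JavaScript CPU time to Hive CPU time for each query.
speedups = {
    query: to_seconds(js) / to_seconds(hql)
    for query, (hql, js) in TABLE_1.items()
}

for query, ratio in speedups.items():
    print(f"query {query}: Hive used about {ratio:.1f}x less CPU time")
```

The count query lands near the upper end of the range and the two average queries near the lower end, because the Hive times grow for the average queries while the JavaScript times stay roughly flat.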

3.2. Test results for a 64 MB HDFS block

Table 2

Query type | Total MapReduce CPU time, HQL (min:sec.msec) | Total MapReduce CPU time, JavaScript (min:sec.msec) | Map/Reduce tasks, HQL | Map/Reduce tasks, JavaScript
1. Count the number of flight delays by year | 7:0.277 | 50:29.8 | 37/10 | 150/1
2. Count the number of flight delays and display information by year and month | 7:40.574 | 52:2.86 | 37/10 | 150/1
3. Count the number of flight delays and display information by year, month, and day of month | 9:9.143 | 55:56.7 | 37/10 | 150/1
4. Calculate the average flight delay time by year | 12:45.775 | 47:40.880 | 37/10 | 150/1
5. Calculate the average flight delay and display information by year and month | 13:21.515 | 50:54.123 | 37/10 | 150/1
6. Calculate the average flight delay time and display information by year, month, and day of month | 14:35.23 | 53:55.645 | 37/10 | 150/1
7. Count the number of delays by the origin airport | 7:17.265 | 49:54.347 | 37/10 | 150/1
8. Count the number of delays by the destination airport | 7:11.670 | 49:15.914 | 37/10 | 150/1

Brief summary

As you can see, it took seven minutes to process the first query created with Hive, while processing the same query based on JavaScript took 50 minutes and 29 seconds. The rest of the Hive queries were also processed several times faster than the queries based on JavaScript.

To provide a more detailed picture, we indicated the total number of Map and Reduce tasks in both tables. The test results showed that the HDFS block size and the number of MapReduce tasks affect computation speed. As you can see in Table 1, the Hive query generated 38 Map tasks and 10 Reduce tasks, but for the JavaScript implementations this ratio was 1170 Map tasks to one Reduce task. Similar results were achieved for the 64 MB HDFS block. As you can see in Table 2, the first Hive query produced 37 Map tasks and 10 Reduce tasks. The JavaScript query generated 150 Map tasks and one Reduce task.

This can be explained by the fact that JavaScript is not a native language for Hadoop, which was written in Java. Hive features a task manager that analyzes the load, divides the data set into a number of Map and Reduce tasks, and chooses a certain ratio of Map tasks to Reduce tasks to ensure the fastest computation speed. Unfortunately, it is not clear yet how to optimize the JavaScript code and configure the task manager so that it uses the available resources more efficiently. The results of a JavaScript query are written to the outputfile of the runjs command (codefile, inputfiles, outputfile) using a single Reduce task, which may be one reason for such a great difference in performance.

3.3. Dependency between the block size and the number of Map tasks

We have also analyzed how the size of a block in a distributed file system influenced the number of Map tasks triggered in Hive and JavaScript queries. For a 64 MB block, the HQL query ran 37 Map tasks and 10 Reduce tasks. When a JavaScript query was processed, the task manager divided the total load into 150 Map tasks and one Reduce task.

Referring to the table, we can conclude that the number of Reduce tasks does not depend on the block size and is equal to 10 for Hive queries and to 1 for JavaScript queries.

3.4. Dependency between performance and the number of MapReduce tasks

We also analyzed how the number of Map and Reduce tasks influenced the speed of processing Hive and JavaScript queries. From this diagram, you can see that Hive queries were properly optimized and the block size had almost no impact on execution time. In JavaScript, on the contrary, the processing speed depended directly on the number of Map tasks.
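The Map task counts reported for the JavaScript jobs are close to what one would get if a Map task were launched for every HDFS block of every input file. The sketch below works through that arithmetic for the five 1.83 GB files. The one-task-per-block rule is our assumption, not something the article states, and the function name is hypothetical.

```python
import math


def estimated_map_tasks(file_size_gb, file_count, block_size_mb):
    """Estimate Map tasks as one per HDFS block, counted per file.

    Assumption: each file is split into blocks independently and each
    block becomes one Map task.
    """
    blocks_per_file = math.ceil(file_size_gb * 1024 / block_size_mb)
    return blocks_per_file * file_count


for block_mb in (8, 64):
    tasks = estimated_map_tasks(1.83, 5, block_mb)
    print(f"{block_mb} MB blocks: about {tasks} Map tasks")
```

Under this assumption the estimate is 150 Map tasks for 64 MB blocks, which matches Table 2, and 1175 for 8 MB blocks, which is close to the 1170 in Table 1; the small gap is plausibly due to the file size being rounded to 1.83 GB. Hive's 37 to 38 Map tasks do not follow this rule, which is consistent with the observation that Hive chooses the task split itself.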

3.5. Dependency between performance and the type of a query

Below you can see the diagram that shows how processing speed depends on the query type for a 64 MB HDFS block. Within queries 1-3 and within queries 4-6, the difference was in the number of grouping parameters. The first query counted flight delays by year. In the second query, we added month as a grouping parameter, and in the third, day of month. The fourth query returned the average flight delay time by year, which is a different arithmetic operation. The fifth and sixth queries calculated the average flight delay time by year and month, and by year, month, and day of month, respectively.

Judging by the diagram, additional grouping parameters had a much greater influence on JavaScript queries than the arithmetic operations performed. In the case of Hive, such operations as transforming, converting, and computing data caused the processing speed to degrade significantly, which can be seen from the difference in processing time between the first and the fourth queries. The sixth query calculated average values and included three grouping parameters, which resulted in the slowest processing speed.

4. Conclusion

When we started this analysis, only the community technology preview release was available. Hadoop on Windows Azure had no documentation and there were no manuals showing how to optimize JavaScript queries. On the other hand, Hive had been built on top of Apache Hadoop long before Microsoft offered their solution. That is why Hive is much faster when performing basic operations, such as various selections, or data manipulations like creating, updating, and deleting data, random data sampling, statistics, etc. However, you would have to opt for JavaScript for more complex algorithms, such as data mining or machine learning, since they cannot be implemented with Hive.

In October 2012 Microsoft released a new CTP version of this service, which is now called Windows Azure HDInsight. Some of the issues we mentioned before were fixed, since the improvements included:

- updated versions of Hive, Pig, Hadoop Common, etc.
- an updated version of the Hive ODBC driver
- a local version of the HDInsight community technology preview (for Windows Server)
- guides and documentation describing how to use the service

Now Microsoft offers a browser-based console that serves as an interface for running MapReduce jobs on Azure. The implementation of Hadoop on Windows Azure also simplifies installation, configuration, and management of a cloud cluster. In addition, the updated platform can be used with such tools as Excel, PowerPivot, Power View, SQL Server Analysis Services, and Reporting Services. There is also the ODBC driver that connects Hive to Microsoft's products and helps to deal with business intelligence tasks.

A solution that would enable .NET developers to process huge amounts of data fast was long awaited. Although this article describes the out-of-the-box querying methods, the .NET community is contributing to the .NET SDK for Hadoop. Currently, version 0.1.0.0 is available to the public at CodePlex. This library already enables developers to implement MapReduce jobs using any of the CLI languages (the solution comes with examples written in C# and F#) and provides tools for building Hive queries using LINQ to Hive.
Therefore, .NET developers will soon be able to build native Hadoop-based applications, employing other libraries that conform to the Common Language Infrastructure. This SDK will be an even more efficient tool for in-depth data analysis, data mining, machine learning, and creating recommendation systems with .NET.

5. About the authors

Andrei Paleyes has 5+ years of experience in MS .NET-related technologies applied in large-scale international projects. Having a master's degree in mathematics, he is interested in big data analysis and implementation of mathematical methods used in data mining. He is a knowledge discovery enthusiast and has presented a number of sessions on data science at local conferences. Recently, Andrei participated in architecture development of analytical cloud-based platforms for genome sequencing and energy consumption solutions.

Sergey Klimov is proficient in developing large-scale applications and corporate systems, as well as process automation using MS .NET and cloud technologies. He has degrees in software engineering and technical automatics. Sergey focuses on projects that require processing large volumes of data using Hadoop and cloud technologies, in particular Windows Azure.

Altoros is a global software delivery acceleration specialist that provides focused product engineering to technology companies and start-ups. Founded in 2001 and headquartered in Silicon Valley (Sunnyvale, California), Altoros has a sales office in Western Massachusetts, branch offices in Norway, Denmark, Switzerland, and the UK, and software development centers in Argentina and Eastern Europe (Minsk, Belarus). For more information, please visit www.altoros.com