Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data


For some time, Microsoft didn't offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes of data is limited. Hadoop, which was designed for this purpose, is written in Java and was not available to .NET developers. So Microsoft launched the Hadoop on Windows Azure service to make it possible to distribute the load and speed up big data computations. However, guides explaining how to work with Hadoop on Windows Azure are hard to find, so here we present an overview of two out-of-the-box ways of processing big data with Hadoop on Windows Azure and compare their performance.

By Sergey Klimov and Andrei Paleyes, Senior R&D Engineers at Altoros.

Contents

1. Introduction
2. Testing environment
3. The results of the research
3.1. Test results for an 8 MB HDFS block
3.2. Test results for a 64 MB HDFS block
3.3. Dependency between the block size and the number of Map tasks
3.4. Dependency between performance and the number of MapReduce tasks
3.5. Dependency between performance and the type of a query
4. Conclusion
5. About the authors

1. Introduction

When the R&D department at Altoros Systems, Inc. started this research, we only had access to a community technology preview (CTP) release of Apache Hadoop-based Service on Windows Azure. To connect to the service, Microsoft provides a Web panel and Remote Desktop Connection. We analyzed two ways of querying with Hadoop that were available from the Web panel: HiveQL querying and a JavaScript implementation of MapReduce jobs.

A test data set was generated based on US Air Carrier Flight Delays information downloaded from Windows Azure Marketplace. It was used to test how the system would handle big data. We created the following eight types of queries in both languages and measured how fast they were processed:

1. Count the number of flight delays by year.
2. Count the number of flight delays and display information by year and month.
3. Count the number of flight delays and display information by year, month, and day of month.
4. Calculate the average flight delay time by year.
5. Calculate the average flight delay time and display information by year and month.
6. Calculate the average flight delay time and display information by year, month, and day of month.
7. Count the number of delays by the origin airport.
8. Count the number of delays by the destination airport.

From this analysis, you will see the results of the performance tests and observe how the throughput varies depending on the block size. The research contains two tables and three diagrams that demonstrate the findings.

2. Testing environment

As a testing environment, we used a Windows Azure cluster that consisted of three nodes (a small cluster). The capacities of its physical CPU were divided among three virtual machines that served as nodes. Obviously, this could introduce some errors into performance measurements. Therefore, we launched each query several times and used the average value for our benchmark.

The data we used for the tests consisted of five CSV files of 1.83 GB each. In total, we processed 9.15 GB of data. The replication factor was equal to three, which means that each data set had a synchronized replica on each node in the cluster.

The speed of data processing varied depending on the block size. Therefore, we compared the results achieved with 8 MB, 64 MB, and 256 MB blocks.
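To make the query types from the introduction more concrete, here is a minimal Python sketch of the map and reduce steps behind the "count by ..." and "average by ..." queries. It is not the HiveQL or JavaScript code used in the tests. The row layout, the sample values, the rule that a delay is any value greater than zero, and the names `map_row`, `reduce_groups`, and `run_query` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, month, day_of_month, delay_minutes).
# The real column layout of the flight delays data set is not given in the text.
ROWS = [
    (2010, 1, 5, 12.0),
    (2010, 1, 5, 0.0),
    (2010, 2, 9, 30.0),
    (2011, 1, 3, 18.0),
    (2011, 7, 21, 6.0),
]


def map_row(row, key_len):
    """Map step: emit (grouping key, delay) for delayed flights only.

    Assumption: a flight counts as delayed when its delay is above zero.
    """
    delay = row[3]
    if delay > 0:
        yield row[:key_len], delay


def reduce_groups(pairs):
    """Reduce step: per key, return (number of delays, average delay)."""
    totals = defaultdict(lambda: [0, 0.0])
    for key, delay in pairs:
        totals[key][0] += 1
        totals[key][1] += delay
    return {key: (n, total / n) for key, (n, total) in totals.items()}


def run_query(rows, key_len):
    """Run the map step over all rows and feed the pairs to the reduce step."""
    pairs = (pair for row in rows for pair in map_row(row, key_len))
    return reduce_groups(pairs)


by_year = run_query(ROWS, key_len=1)        # grouping used by queries 1 and 4
by_year_month = run_query(ROWS, key_len=2)  # grouping used by queries 2 and 5
```

In this sketch, adding a grouping parameter only lengthens the key (`key_len`), while switching from a count to an average changes what the reduce step computes from the same pairs.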

3. The results of the research

3.1. Test results for an 8 MB HDFS block

Total MapReduce CPU time spent is given as min:sec.msec.

Query type | HQL CPU time | JavaScript CPU time | HQL Map/Reduce tasks | JavaScript Map/Reduce tasks
1. Count the number of flight delays by year | 7:… | … | …/… | …/1
2. Count the number of flight delays and display information by year and month | 7:… | … | …/… | …/1
3. Count the number of flight delays and display information by year, month, and day of month | 9:… | … | …/… | …/1
4. Calculate the average flight delay time by year | 12:… | … | …/… | …/1
5. Calculate the average flight delay and display information by year and month | 13:… | … | …/… | …/1
6. Calculate the average flight delay time and display information by year, month, and day of month | 14:… | … | …/… | …/1
7. Count the number of delays by the origin airport | 7:… | … | …/… | …/1
8. Count the number of delays by the destination airport | 7:… | … | …/… | …/1

Table 1

Brief summary

As seen from the table, the time spent on processing JavaScript queries didn't vary significantly and ranged from 99 to 116 seconds. Although Hive queries were processed 7-15 times faster than the JavaScript implementations, their computation speed differed greatly depending on the type of the query. We'll expand on this dependency in Figure 3, "Dependency between performance and the type of a query."

3.2. Test results for a 64 MB HDFS block

Total MapReduce CPU time spent is given as min:sec.msec.

Query type | HQL CPU time | JavaScript CPU time | HQL Map/Reduce tasks | JavaScript Map/Reduce tasks
1. Count the number of flight delays by year | 7:… | … | …/10 | 150/1
2. Count the number of flight delays and display information by year and month | 7:… | … | …/10 | 150/1
3. Count the number of flight delays and display information by year, month, and day of month | 9:… | … | …/10 | 150/1
4. Calculate the average flight delay time by year | 12:… | … | …/10 | 150/1
5. Calculate the average flight delay and display information by year and month | 13:… | … | …/10 | 150/1
6. Calculate the average flight delay time and display information by year, month, and day of month | 14:… | … | …/10 | 150/1
7. Count the number of delays by the origin airport | 7:… | … | …/10 | 150/1
8. Count the number of delays by the destination airport | 7:… | … | …/10 | 150/1

Table 2

Brief summary

As you can see, it took seven minutes to process the first query created with Hive, while processing the same query based on JavaScript took 50 minutes and 29 seconds. The rest of the Hive queries were also processed several times faster than the queries based on JavaScript.

To provide a more detailed picture, we indicated the total number of Map and Reduce tasks in both tables. The test results showed that the HDFS block size and the number of MapReduce tasks affect computation speed. As you can see in Table 1, the Hive query generated 38 Map tasks and 10 Reduce tasks, while the JavaScript implementation ran 1170 Map tasks and a single Reduce task. Similar results were achieved for the 64 MB HDFS block. As you can see in Table 2, the first Hive query produced 37 Map tasks and 10 Reduce tasks, while the JavaScript query generated 150 Map tasks and a single Reduce task.

This can be explained by the fact that JavaScript is not a native language for Hadoop, which was written in Java. Hive features a task manager that analyzes the load, divides the data set into a number of Map and Reduce tasks, and chooses a certain ratio of Map tasks to Reduce tasks to ensure the fastest computation speed. Unfortunately, it is not clear yet how to optimize the JavaScript code and configure the task manager so that it uses the available resources more efficiently. The results of the JavaScript query are written to the outputfile of the runjs command (codefile, inputfiles, outputfile) using a single Reduce task, which may be a reason for such a great difference in performance.

3.3. Dependency between the block size and the number of Map tasks

We also analyzed how the size of a block in a distributed file system influenced the number of Map tasks triggered in Hive and JavaScript queries. For a 64 MB block, the HQL query ran 37 Map tasks and 10 Reduce tasks. When a JavaScript query was processed, the task manager divided the total load into 150 Map tasks and a Reduce task.
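The Map task counts reported for the JavaScript queries can be roughly cross-checked with simple arithmetic. The sketch below assumes that one Map task is started per HDFS block of each input file, which is a simplification and not something the test logs confirm. The function name `estimated_map_tasks` is hypothetical, and the file size is the rounded 1.83 GB figure from the testing environment.

```python
import math

FILE_SIZE_GB = 1.83  # size of each CSV file, as stated in the testing environment
FILE_COUNT = 5       # number of CSV files in the data set


def estimated_map_tasks(block_size_mb, file_size_gb=FILE_SIZE_GB, files=FILE_COUNT):
    """Estimate Map tasks, assuming one task per HDFS block of every input file."""
    file_size_mb = file_size_gb * 1024
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return blocks_per_file * files


for block_size_mb in (8, 64, 256):
    print(block_size_mb, "MB blocks:", estimated_map_tasks(block_size_mb), "Map tasks (estimate)")
```

Under this assumption, the estimate for 64 MB blocks is 150, which equals the number of Map tasks reported for the JavaScript queries, and the estimate for 8 MB blocks is 1175, close to the reported 1170 (the difference is within the rounding of the file size). The Hive figures of 37 and 38 Map tasks do not follow this rule, which is consistent with the observation that Hive chooses its own task layout.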

Referring to the table, we can conclude that the number of Reduce tasks does not depend on the block size and is equal to 10 for Hive queries and to 1 for JavaScript queries.

3.4. Dependency between performance and the number of MapReduce tasks

We also analyzed how the number of Map and Reduce tasks influenced the speed of processing Hive and JavaScript queries. From the diagram, you can see that Hive queries were properly optimized and the block size had almost no impact on execution time. In JavaScript, on the contrary, the processing speed depended directly on the number of Map tasks.
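One way to picture why a single Reduce task can become a bottleneck is to look at how grouping keys are spread across Reduce tasks. Hadoop's default partitioner assigns a key to a Reduce task by hashing the key and taking the remainder modulo the number of Reduce tasks. The Python sketch below imitates that idea with `zlib.crc32` instead of Java's hash function, so the exact distribution differs from a real cluster. The year-month keys and the function names are hypothetical.

```python
import zlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Pick a Reduce task for a key: stable hash of the key modulo the task count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def load_per_reducer(keys, num_reducers):
    """Count how many distinct keys each Reduce task would have to aggregate."""
    return Counter(reducer_for(key, num_reducers) for key in keys)


# Hypothetical grouping keys: year-month pairs such as "2010-01".
keys = [f"{year}-{month:02d}" for year in range(2005, 2011) for month in range(1, 13)]

single = load_per_reducer(keys, 1)  # JavaScript queries in the tests: 1 Reduce task
ten = load_per_reducer(keys, 10)    # Hive queries in the tests: 10 Reduce tasks

print("1 Reduce task:", dict(single))
print("10 Reduce tasks:", dict(sorted(ten.items())))
```

With one Reduce task, every key lands on the same task, so the aggregation and the final write are serialized. With ten tasks, the same keys are divided among up to ten tasks that can work in parallel.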

3.5. Dependency between performance and the type of a query

Below you can see the diagram that shows how processing speed depends on the query type for a 64 MB HDFS block. Within queries 1-3 and within queries 4-6, the difference was in the number of grouping parameters. The first query counted flight delays by year. In the second query, we added month as a grouping parameter, and in the third, day of month. The fourth query returned the average flight delay time by year, which is a different arithmetic operation. The fifth and sixth queries calculated the average flight delay time by year and month, and by year, month, and day, respectively.

Judging by the diagram, additional grouping parameters had a much greater influence on JavaScript queries than the arithmetic operations performed. In the case of Hive, such operations as transforming, converting, and computing data caused the processing speed to degrade significantly, which can be seen from the difference in processing time between the first and the fourth queries. The sixth query calculated average values and included three grouping parameters, which resulted in the slowest processing speed.

4. Conclusion

When we started this analysis, only the community technology preview release was available. Hadoop on Windows Azure had no documentation, and there were no manuals showing how to optimize JavaScript queries. On the other hand, HiveQL had been built on top of Apache Hadoop long before Microsoft offered their solution. That is why Hive is much faster when performing basic operations, such as selections, data manipulations (creating, updating, deleting), random data sampling, statistics, etc. However, you would still have to opt for JavaScript for more complex algorithms, such as data mining or machine learning, since they cannot be implemented with Hive.

In October 2012, Microsoft released a new CTP version of this service, which is now called Windows Azure HDInsight. Some of the issues we mentioned were fixed, since the improvements included:

- updated versions of Hive, Pig, Hadoop Common, etc.
- an updated version of the Hive ODBC driver
- a local version of the HDInsight community technology preview (for Windows Server)
- guides and documentation describing how to use the service

Now Microsoft offers a browser-based console that serves as an interface for running MapReduce jobs on Azure. The implementation of Hadoop on Windows Azure also simplifies installation, configuration, and management of a cloud cluster. In addition, the updated platform can be used with such tools as Excel, PowerPivot, Power View, SQL Server Analysis Services, and Reporting Services. There is also an ODBC driver that connects Hive to Microsoft's products and helps to deal with business intelligence tasks.

A solution that would enable .NET developers to process huge amounts of data fast was long awaited. Although this article describes the out-of-the-box querying methods, the .NET community is contributing to the .NET SDK for Hadoop. Currently, a version is available to the public at CodePlex. This library already enables developers to implement MapReduce jobs using any of the CLI languages (the solution comes with examples written in C# and F#) and provides tools for building Hive queries using LINQ to Hive.
Therefore, .NET developers will soon be able to build native Hadoop-based applications, employing other libraries that conform to the Common Language Infrastructure. This SDK will be an even more efficient tool for in-depth data analysis, data mining, machine learning, and creating recommendation systems with .NET.

5. About the authors

Andrei Paleyes has 5+ years of experience in MS .NET-related technologies applied in large-scale international projects. Having a master's degree in mathematics, he is interested in big data analysis and implementation of mathematical methods used in data mining. He is a knowledge discovery enthusiast and has presented a number of sessions on data science at local conferences. Recently, Andrei participated in architecture development of analytical cloud-based platforms for genome sequencing and energy consumption solutions.

Sergey Klimov is proficient in developing large-scale applications and corporate systems, as well as in process automation using MS .NET and cloud technologies. He has degrees in software engineering and technical automatics. Sergey focuses on projects that require processing large volumes of data using Hadoop and cloud technologies, in particular Windows Azure.

Altoros is a global software delivery acceleration specialist that provides focused product engineering to technology companies and start-ups. Founded in 2001 and headquartered in Silicon Valley (Sunnyvale, California), Altoros has a sales office in Western Massachusetts, branch offices in Norway, Denmark, Switzerland, and the UK, and software development centers in Argentina and Eastern Europe (Minsk, Belarus). For more information, please visit


More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Modernizing Your Data Warehouse for Hadoop

Modernizing Your Data Warehouse for Hadoop Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist Audie.Wright@Microsoft.com O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking

More information

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

Designing a Data Solution with Microsoft SQL Server 2014

Designing a Data Solution with Microsoft SQL Server 2014 20465C - Version: 1 22 June 2016 Designing a Data Solution with Microsoft SQL Server 2014 Designing a Data Solution with Microsoft SQL Server 2014 20465C - Version: 1 5 days Course Description: The focus

More information

Sisense. Product Highlights. www.sisense.com

Sisense. Product Highlights. www.sisense.com Sisense Product Highlights Introduction Sisense is a business intelligence solution that simplifies analytics for complex data by offering an end-to-end platform that lets users easily prepare and analyze

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Big Data Training - Hackveda

Big Data Training - Hackveda Big Data Training - Hackveda Become a Hackveda Certified Big Data Professional - (Beginner) Skill level: Beginner Training fee: INR 9000 only (Topics covered: 108) Chief Trainer: Mr. Devanshu Shukla Training

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Hadoop in the Hybrid Cloud

Hadoop in the Hybrid Cloud Presented by Hortonworks and Microsoft Introduction An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure. Big

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information
