Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data
|
|
- Nigel Hampton
- 8 years ago
- Views:
Transcription
1 Hive vs. JavaScript for Processing Big Data For some time Microsoft didn t offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes of data is limited. Hadoop, which was designed for this purpose, is written in Java and was not available to.net developers. So, Microsoft launched the Hadoop on Windows Azure service to make it possible to distribute the load and speed up big data computations. But it is hard to find guides explaining how to work with Hadoop on Windows Azure, so here we present an overview of two out-of-the-box ways of processing big data with Hadoop on Windows Azure and compare their performance. By Sergey Klimov and Andrei Paleyes, Senior R&D Engineers at Altoros.
2 Contents 1. Introduction Testing environment The results of the research Test results for an 8 MB HDFS block Test results for 64 a MB HDFS block Dependency between the block size and the number of Map tasks Dependency between performanceand the number of MapReduce tasks Dependency between performance and the type of a query Conclusion About the authors:...9 Altoros Systems 2
3 1. Introduction When the R&D department at Altoros Systems, Inc. started this research, we only had access to a community technology preview (CTP) release of Apache Hadoop-based Service on Windows Azure. To connect to the service, Microsoft provides a Web panel and Remote Desktop Connection. We analyzed two ways of querying with Hadoop that were available from the Web panel: HiveQL querying and a JavaScript implementation of MapReduce jobs. A test data set was generated based on US Air Carrier Flight Delays information downloaded from Windows Azure Marketplace. It was used to test how the system would handle big data. We created the following eight types of queries in both languages and measured how fast they were processed: Count the number of flight delays by year. Count the number of flight delays and display information by year and month. Count the number of flight delays and display information by year, month, and day of month. Calculate the average flight delay time by year. Calculate the average flight delay and display information by year and month. Calculate the average flight delay time and display information by year, month, and day of month. From this analysis you will see performance results tests and observe how the throughput varies depending on the block size. The research contains twotables and three diagrams that demonstrate the findings. 2. Testing environment As a testing environment we used a Windows Azure cluster. The capacities of its physical CPU were divided among three virtual machines that served as nodes.obviously, this could introduce some errors into performance measurements. Therefore, we launched each query several times and used the average value for our benchmark. The cluster consisted of three nodes (a small cluster). The data we used for the tests consisted of five CSV files of 1.83 GB each. In total, we processed 9.15 GB of data. The replication factor was equal to three. This means that each data set had a synchronized replica on each node in the cluster. The speed of data processing varied depending on the block size therefore, we compared results achieved with 8 MB, 64 MB, and 256 MB blocks. Altoros Systems 3
4 3. The results of the research 3.1.Test results for an 8 MB HDFS block Query Type Total MapReduce CPU time spent (min:sec.msec) Number of Map/Reduce tasks HQL JavaScript HQL JavaScript 1. Count the number of flight delays by year 7: : / /1 2. Count the number of flight delays and display information by year and month 3. Count the number of flight delays and display information by year, month, and day of month 7: : / /1 9: : / /1 4. Calculate the average flight delay time by year 12: : / /1 5. Calculate the average flight delay and display information by year and month 6. Calculate the average flight delay time and display information by year, month, and day of month 13: : / /1 14: : / /1 7. Count the number of delays by the origin airport 7: : / /1 8. Count the number of delays by the destination airport Table 1 7: : / /1 Brief summary As it is seen form the table,time spent on processing JavaScript queries didn t vary significantly and ranged from 99 to 116 seconds. Although Hive queries were processed 7-15 times faster than JavaScript implementations, their computation speed differed greatly depending on the type of the query. We ll expand on this dependency in Figure 3 Dependency between performance and the type of a query. Altoros Systems 4
5 3.2.Test results for 64 a MB HDFS block Query Type Total MapReduce CPU time spent (min:sec.msec) Number of Map/Reduce tasks HQL JavaScript HQL JavaScript 1. Count the number of flight delays by year 7: : /10 150/1 2. Count the number of flight delays and display information by year and month 3. Count the number of flight delays and display information by year, month, and day of month 7: : /10 150/1 9: : /10 150/1 4. Calculate the average flight delay time by year 12: : /10 150/1 5. Calculate the average flight delay and display information by year and month 6. Calculate the average flight delay time and display information by year, month, and day of month 13: : /10 150/1 14: : /10 150/1 7. Count the number of delays by the origin airport 7: : /10 150/1 8. Count the number of delays by the destination airport Table2 7: : /10 150/1 Brief summary As you can see, it took us seven minutes to process the first query created with Hive, while processing the same query based on JavaScript took 50 minutes and 29 seconds. The rest of the Hive queries were also processed several times faster than queries based on JavaScript. To provide a more detailed picture, we indicated the total number of Map and Reduce tasks in both tables. ThetestresultsshowedthatHDFSblocksizeandthe number of MapReduce tasks affects computation speed.as you can see in Table 1, the Hive query generated 38 Map tasks and 10 Reduce tasks, but for the JavaScript implementations this ratio was 1170 Map tasks to a Reduce Altoros Systems 5
6 task. The similar results were achieved for 64 MB HDFS block. As you can see in Table 2, the first Hive query produced 37 Map tasks and 10 Reduce tasks. The JavaScript query generated 150 Map tasks and a Reduce task. This can be explained by the fact that JavaScript is not a native language for Hadoop, which was written in Java. Hive features a task manager that analyses the load, divides the data set into a number of Map and Reduce tasks, and chooses a certain ratio of Map tasks to Reduce tasks to ensure the fastest computation speed. Unfortunately, it is not clear yet how to optimize the JavaScript code and configure the task manager, so that it uses available resources in a more efficient manner. The results of the JavaScript query are written to the outputfileof the runjs command ( codefile, inputfiles, outputfile ) using a single Reduce task, that may be a reason of such a great difference in performance Dependency between the block size and the number of Map tasks We have also analyzed how the size of a block in a distributed file system influenced the number of Map tasks triggered in Hive and JavaScript queries. For a 64 MB block, the HQL query ran 37 Map tasks and 10 Reduce tasks. When a JavaScript query was processed, the task manager divided the total load into 150 Map tasks and a Reduce task. Altoros Systems 6
7 Referring to the table, we can conclude that the number of Reduce tasks does not depend on the block size and is equal to 10 for Hive queries and to 1 for JavaScript queries. 3.4.Dependency between performanceand the number of MapReduce tasks We also analyzed how the number of Map and Reduce tasks influenced the speed of processing Hive and JavaScript queries. From this diagram, you can see that Hive queries were properly optimized and the block size had almost no impact on execution time. In JavaScript, on the contrary, the processing speed depended directly on the number of Map tasks. Altoros Systems 7
8 3.5.Dependency between performance and the type of a query Below you can see the diagram that shows how processing speed depends on the query type for a 64 MB HDFS block. The difference between the queries 1-3 and 4-6 was in the number of grouping parameters. The first query calculated flight delay times by year. In the second query, we added such parameters as month and in the third day. The fourth query returned the average flight delay times by ear, which is a different arithmetic operation. The fifth and sixths queries calculated the average flight delay times by year, month, and by year, month, and day respectively. Judging by the diagram, additional grouping parameters had much greater influence on JavaScript queries than the performed arithmetic operations. In case of Hive, such operations as transforming, converting, and computing data caused the processing speed to degrade significantly, which can be seen from the difference in processing time between the first and the fourth queries. The sixth query calculated average values and included three grouping parameters, which resulted in the slowest processing speed. 4. Conclusion When we started this analysis, only the community technology preview release was available. Hadoop on Windows Azure had no documentation and there were no manuals showing how to optimize JavaScript queries. On the other hand, HiveQL had been built on top of Apache Hadoop Altoros Systems 8
9 long before Microsoft offered their solution. That is why Hive is much faster when performing basic operations, such as various selections or doing various data manipulations like creating/updating/deleting, random data sampling, statistics, etc. However, you would have to opt for JavaScript for more complex algorithms, such as data mining or machine learning, anyway, since they cannot be implemented with Hive. In October 2012 Microsoft released a new CTP version of this service, which is now called Windows Azure HDInsight. Some of the issues we mentioned before were fixed, since the improvements included: updated versions of Hive, Pig, Hadoop Common, etc. an updated version of the Hive ODBC driver a local version of the HDInsight community technology preview (for Windows Server) guides and documentation describing how to use the service Now Microsoft offers a browser-based console that serves as an interface for running MapReduce jobs on Azure. The implementation of Hadoop on Windows Azure also simplifies installation, configuration, and management of a cloud cluster. In addition, the updated platform can be used with such tools as Excel, PowerPivot, Powerview, SQL Server Analysis Services, and Reporting Services.There is also the ODBC driver that connects Hive to Microsoft s products and helps to deal with business intelligence tasks. Such a solution that would enable.net developers to process huge amounts of data fast was long awaited. Although this article describes the out-of-the-box querying methods, the.net community is contributing to.net SDK for Hadoop. Currently, the version is available to public at CodePlex. This library already enables developers to implement MapReduce jobs using any of the CLI Languages the solution comes with examples written in C# and F# and provides tools for building Hive queries using LINQ to Hive. Therefore, soon.net developers will be able to build native Hadoop-based applications, employing other libraries that conform to the Common Language Infrastructure. This SDK will be an even more efficient tool for in-depth data analysis, data mining, machine learning, and creating recommendation systems with.net. 5. About the authors: Andrei Paleyes has 5+ years of experience in MS.NET-related technologies applied in largescale international projects. Having a master s degree in mathematics, he is interested in big data analysis and implementation of mathematical methods used in data mining. He is a knowledge discovery enthusiast and presented a number of sessions on data science at local conferences. Recently, Andrei participated in architecture development of the analytical cloud-based platforms for genome sequencing and energy consumption solutions. Altoros Systems 9
10 Sergey Klimov is proficient in developing large-scale applications and corporate systems, as well as processes automation using MS.NET and cloud technologies. He has degrees in softwareengineering and technical automatics. Sergey focuses on projects that require processing large volumes of data using Hadoop and cloud technologies, in particular Windows Azure. Altoros is a global software delivery acceleration specialist that provides focused product engineering to technology companies and start-ups. Founded in 2001 and headquartered in Silicon Valley (Sunnyvale, California), Altoros has a sales office in Western Massachusetts, branch offices in Norway, Denmark, Switzerland, and UK, and a software development centers in Argentina and Eastern Europe (Minsk, Belarus). For more information, please visit Altoros Systems 10
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationMicrosoft Big Data. Solution Brief
Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationHow To Extend An Enterprise Bio Solution
Course 20467C: Designing Self-Service Business Intelligence and Big Data Solutions Module 1: Introduction to Self-Service Business Intelligence This module introduces self-service BI. Extending Enterprise
More informationBig Data & Cloud. 4 th European Summit on the Future Internet. António Miguel Ferreira, CEO, Lunacloud. Aveiro, 13 to 14th June 2013
Big Data & Cloud 4 th European Summit on the Future Internet António Miguel Ferreira, CEO, Lunacloud Aveiro, 13 to 14th June 2013 ? About Lunacloud is a cloud infrastructure and platform services provider
More informationVOL. 5, NO. 2, August 2015 ISSN 2225-7217 ARPN Journal of Systems and Software 2009-2015 AJSS Journal. All rights reserved
Big Data Analysis of Airline Data Set using Hive Nillohit Bhattacharya, 2 Jongwook Woo Grad Student, 2 Prof., Department of Computer Information Systems, California State University Los Angeles nbhatta2
More informationCourse MS20467C Designing Self-Service Business Intelligence and Big Data Solutions
3 Riverchase Office Plaza Hoover, Alabama 35244 Phone: 205.989.4944 Fax: 855.317.2187 E-Mail: rwhitney@discoveritt.com Web: www.discoveritt.com Course MS20467C Designing Self-Service Business Intelligence
More informationBenchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk
Benchmarking Couchbase Server for Interactive Applications By Alexey Diomin and Kirill Grigorchuk Contents 1. Introduction... 3 2. A brief overview of Cassandra, MongoDB, and Couchbase... 3 3. Key criteria
More informationDesigning Self-Service Business Intelligence and Big Data Solutions
This five-day instructor-led course teaches students how to implement self-service Business Intelligence (BI) and Big Data analysis solutions using the Microsoft data platform. The course discusses the
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
More informationAn Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
More informationWINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies
More informationAzure Data Lake Analytics
Azure Data Lake Analytics Compose and orchestrate data services at scale Fully managed service to support orchestration of data movement and processing Connect to relational or non-relational data
More informationBig Data Analytics with PowerPivot and Power View
Big Data Analytics with PowerPivot and Power View Peter Myers Global Sponsors: Presenter Introduction Peter Myers BI Expert BBus,MCSE, MCT, SQL Server MVP 15 years of experience designing, developing and
More informationPlease contact Cyber and Technology Training at (410)777-1333/technologytraining@aacc.edu for registration and pricing information.
Course Name Start Date End Date Start Time End Time Active Directory Services with Windows Server 8/31/2015 9/4/2015 9:00 AM 5:00 PM Active Directory Services with Windows Server 9/28/2015 10/2/2015 9:00
More informationThe Inside Scoop on Hadoop
The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop
More informationAccelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
More informationUnderstanding NoSQL on Microsoft Azure
David Chappell Understanding NoSQL on Microsoft Azure Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Data on Azure: The Big Picture... 3 Relational Technology: A Quick
More informationAnalytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
More informationSQLSaturday #399 Sacramento 25 July, 2015. Big Data Analytics with Excel
SQLSaturday #399 Sacramento 25 July, 2015 Big Data Analytics with Excel Presenter Introduction Peter Myers Independent BI Expert Bitwise Solutions BBus, SQL Server MCSE, SQL Server MVP since 2007 Experienced
More informationCourse 20467: Designing Self-Service Business Intelligence and Big Data Solutions
Course 20467: Designing Self-Service Business Intelligence and Big Data Solutions Type:Course Audience(s):IT Professionals Technology:Microsoft SQL Server Level:300 This Revision:C Delivery method: Instructor-led
More informationIntroduction. Various user groups requiring Hadoop, each with its own diverse needs, include:
Introduction BIG DATA is a term that s been buzzing around a lot lately, and its use is a trend that s been increasing at a steady pace over the past few years. It s quite likely you ve also encountered
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationIntroduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationHow In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
More informationITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationMicrosoft Azure Data Technologies: An Overview
David Chappell Microsoft Azure Data Technologies: An Overview Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Blobs... 3 Running a DBMS in a Virtual Machine... 4 SQL Database...
More informationEnabling High performance Big Data platform with RDMA
Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery
More informationBIG DATA AND MICROSOFT. Susie Adams CTO Microsoft Federal
BIG DATA AND MICROSOFT Susie Adams CTO Microsoft Federal THE WORLD OF DATA IS CHANGING Cloud What s making this possible? Electrical efficiency of computers doubles every year and ½. Laptops and mobile
More informationAssignment # 1 (Cloud Computing Security)
Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual
More information32-bit and 64-bit BarTender. How to Select the Right Version for Your Needs WHITE PAPER
32-bit and 64-bit BarTender How to Select the Right Version for Your Needs WHITE PAPER Contents Overview 3 The Difference Between 32-bit and 64-bit 3 Find Out if Your Computer is Capable of Running 64-bit
More informationInformation Architecture
The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to
More informationTesting 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationMaximizing Hadoop Performance with Hardware Compression
Maximizing Hadoop Performance with Hardware Compression Robert Reiner Director of Marketing Compression and Security Exar Corporation November 2012 1 What is Big? sets whose size is beyond the ability
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationSQL Server 2014. What s New? Christopher Speer. Technology Solution Specialist (SQL Server, BizTalk Server, Power BI, Azure) v-cspeer@microsoft.
SQL Server 2014 What s New? Christopher Speer Technology Solution Specialist (SQL Server, BizTalk Server, Power BI, Azure) v-cspeer@microsoft.com The evolution of the Microsoft data platform What s New
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More information343 Industries Gets New User Insights from Big Data in the Cloud
343 Industries Gets New User Insights from Big Data in the Cloud Published: May 2013 The following content may no longer reflect Microsoft s current position or infrastructure. This content should be viewed
More informationBig Data Analytics OverOnline Transactional Data Set
Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,
More informationHDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationData Management in SAP Environments
Data Management in SAP Environments the Big Data Impact Berlin, June 2012 Dr. Wolfgang Martin Analyst, ibond Partner und Ventana Research Advisor Data Management in SAP Environments Big Data What it is
More informationBig Data Open Source Stack vs. Traditional Stack for BI and Analytics
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at spoozhikala@stratapps.com.
More informationAGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
More informationChase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
More informationproject collects data from national events, both natural and manmade, to be stored and evaluated by
Joseph Sebastian CS 2994 Spring 2014 Undergraduate Research Final Paper GOALS The goal of my research was to assist the Integrated Digital Event Archive (IDEAL) team in transferring their Twitter data
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationModernizing Your Data Warehouse for Hadoop
Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist Audie.Wright@Microsoft.com O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking
More informationISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationPeers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
More informationDesigning a Data Solution with Microsoft SQL Server 2014
20465C - Version: 1 22 June 2016 Designing a Data Solution with Microsoft SQL Server 2014 Designing a Data Solution with Microsoft SQL Server 2014 20465C - Version: 1 5 days Course Description: The focus
More informationSisense. Product Highlights. www.sisense.com
Sisense Product Highlights Introduction Sisense is a business intelligence solution that simplifies analytics for complex data by offering an end-to-end platform that lets users easily prepare and analyze
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationBig Data Training - Hackveda
Big Data Training - Hackveda Become a Hackveda Certified Big Data Professional - (Beginner) Skill level: Beginner Training fee: INR 9000 only (Topics covered: 108) Chief Trainer: Mr. Devanshu Shukla Training
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationVirtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationHadoop in the Hybrid Cloud
Presented by Hortonworks and Microsoft Introduction An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure. Big
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationDeploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
More informationData Visualization Frameworks: D3.js vs. Flot vs. Highcharts by Igor Zalutsky, JavaScript Developer at Altoros
Data Visualization Frameworks: D3.js vs. Flot vs. Highcharts by Igor Zalutsky, JavaScript Developer at Altoros 2013 Altoros, Any unauthorized republishing, rewriting or use of this material is prohibited.
More informationProcessing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
More informationSector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
More informationWhere We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
More informationCloud Computing. Big Data. High Performance Computing
Cloud Computing Big Data High Performance Computing Intel Corporation copy right 2013 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationMicrosoft SQL Server 2012 with Hadoop
Microsoft SQL Server 2012 with Hadoop Debarchan Sarkar Chapter No. 1 "Introduction to Big Data and Hadoop" In this package, you will find: A Biography of the author of the book A preview chapter from the
More informationSpring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
More informationUbuntu and Hadoop: the perfect match
WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely
More informationInfomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
More informationSavanna Hadoop on. OpenStack. Savanna Technical Lead
Savanna Hadoop on OpenStack Sergey Lukjanov Savanna Technical Lead Mirantis, 2013 Agenda Savanna Overview Savanna Use Cases Roadmap & Current Status Architecture & Features Overview Hadoop vs. Virtualization
More informationBig Data Technologies Compared June 2014
Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationIntegrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
More informationIn-Memory Analytics for Big Data
In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationBIG DATA HADOOP TRAINING
BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationCreating a universe on Hive with Hortonworks HDP 2.0
Creating a universe on Hive with Hortonworks HDP 2.0 Learn how to create an SAP BusinessObjects Universe on top of Apache Hive 2 using the Hortonworks HDP 2.0 distribution Author(s): Company: Ajay Singh
More informationHPC in Finance on Windows Azure. Philip Bull Azure Business Development Manager Microsoft UK
HPC in Finance on Windows Azure Philip Bull Azure Business Development Manager Microsoft UK Major datacenter West US, North Central US, S. Central US, East US, N. Europe, W. Europe, E. Asia, S.E. Asia
More informationTAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP
Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify
More informationManaged Services for the Cloud Foundry PaaS
Managed Services for the Cloud Foundry PaaS Managed Services for the Cloud Foundry PaaS Contents Who can benefit? How Altoros s managed services can help you? Support and response time Altoros s service
More informationBig Data Analytics in LinkedIn. Danielle Aring & William Merritt
Big Data Analytics in LinkedIn by Danielle Aring & William Merritt 2 Brief History of LinkedIn - Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/) - 2005: Introduced first business lines
More informationFile S1: Supplementary Information of CloudDOE
File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.
More informationApache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source
Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org #apacheignite Agenda Apache Ignite (tm) In- Memory
More informationUse of Hadoop File System for Nuclear Physics Analyses in STAR
1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources
More information