Hadoop vs Apache Spark


Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com

Introduction

"Any sufficiently advanced technology is indistinguishable from magic," said Arthur C. Clarke. Big data technologies and implementations are gaining traction and moving at a fast pace, with novel innovations happening in the space. Hadoop, NoSQL, MongoDB and Apache Spark are the buzzwords of big data, as every individual's life now leaves a digital trace of data that can be used for analysis. The future of the big data analytics market will revolve around the Internet of Things (IoT), sentiment analysis, trend analysis, the rise of sensor-driven wearables, and more.

Hadoop and Spark are popular Apache projects in the big data ecosystem. Apache Spark is an open-source platform that grew out of, and generalizes, the original Hadoop MapReduce component of the Hadoop ecosystem. Here we present a comparative analysis of Hadoop and Apache Spark in terms of performance, storage, reliability, architecture, and more.

Comparative Analysis: Hadoop vs Apache Spark

Hadoop - Overview

Apache developed the Hadoop project as open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows distributed processing of large datasets across clusters of computers using simple programming models. Hadoop can easily be scaled up to clusters of many machines, each offering local storage and computation. The Hadoop libraries are designed to detect failed nodes at the application layer and to handle those failures themselves, which ensures high availability by default. The project includes these modules:

Hadoop Common: The Java libraries and utilities required for running the other Hadoop modules. These libraries provide OS-level and filesystem abstractions and contain the Java files and scripts required to start and run Hadoop.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large datasets.

Hadoop - Architecture

Hadoop MapReduce, HDFS and YARN provide a scalable, fault-tolerant and distributed platform for storing and processing very large datasets across clusters of commodity computers. Hadoop uses the same set of nodes for data storage and for computation. This allows Hadoop to improve the performance of large-scale computations by moving the computation to where the data is stored.

[Figure: YARN job execution - a client submits a job to the ResourceManager; NodeManagers report node status and launch containers that host the MapReduce ApplicationMaster, map tasks 1..N and reduce tasks 1..M.]

Hadoop Distributed File System (HDFS)

HDFS is a distributed filesystem designed to store large volumes of data reliably. HDFS splits a single large file into blocks stored on different nodes across a cluster of commodity machines, overlaying each node's existing filesystem. Data is stored in fine-grained blocks, with a default block size of 128 MB. HDFS also stores redundant copies of these data blocks on multiple nodes to ensure reliability and fault tolerance. In short, HDFS is a distributed, reliable and scalable file system.

Hadoop YARN

YARN (Yet Another Resource Negotiator), a central component of the Hadoop ecosystem, is a framework for job scheduling and cluster resource management. The basic idea of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons.

Hadoop MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. The mapper maps each input key/value pair to a set of intermediate pairs; the reducer takes the intermediate pairs, grouped by key, and processes them to produce the required output values. Map tasks run in parallel across the cluster, and reduce tasks run on any available nodes as directed by YARN.

Hadoop - Advantages

For organizations considering a big data implementation, Hadoop has many features that make it worth considering, including:

Flexibility: Hadoop enables businesses to easily access new data sources and tap into different types of data to generate value from that data.
Scalability: Hadoop is a highly scalable storage platform; it can store and distribute very large datasets across hundreds of inexpensive commodity servers operating in parallel.

Affordability: Hadoop is open source and runs on commodity hardware, including instances from cloud providers such as Amazon.

Fault tolerance: Automatic replication and the ability to handle failures at the application layer make Hadoop fault-tolerant by default.
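The storage and fault-tolerance points above come down to simple arithmetic. A small back-of-the-envelope sketch, assuming HDFS's default 128 MB block size and the common replication factor of 3 (both are configurable):

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # common default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage = file_size_mb * REPLICATION
    return blocks, raw_storage

blocks, raw = hdfs_footprint(1024)  # a 1 GB file
print(blocks)  # → 8 blocks of 128 MB, spread across the cluster
print(raw)     # → 3072 MB of raw cluster storage for 1 GB of data
```

With three replicas on different nodes, losing any single node leaves at least two copies of every block, which is why replication alone gives Hadoop its fault tolerance.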

Apache Spark - Overview

Apache Spark is a framework for performing data analytics on a distributed computing cluster. It provides in-memory computation, which increases processing speed compared with MapReduce. It can utilize the Hadoop Distributed File System (HDFS) and run on top of an existing Hadoop cluster. It can also process both structured data in Hive and streaming data from sources such as HDFS, Flume, Kafka, and Twitter.

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Input data can come from sources such as web streams (TCP sockets), Flume, Kafka, etc., and can be processed using complex algorithms built from high-level functions like map, reduce and join. Finally, the processed data can be pushed out to filesystems (HDFS), databases, and live dashboards. Spark's graph processing algorithms and machine learning can also be applied to data streams.

Spark SQL

Apache Spark provides a separate module, Spark SQL, for processing structured data. The Spark SQL interfaces give the engine detailed information about the structure of the data and of the computation being performed; internally, Spark SQL uses this additional information to perform extra optimizations.

Datasets and DataFrames

In Apache Spark, a distributed collection of data is called a Dataset. A Dataset provides the benefits of RDDs together with Spark SQL's optimized execution engine. A Dataset can be constructed from objects and then manipulated using functional transformations. A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or to an R/Python data frame, but with richer optimizations under the hood. A DataFrame can be constructed from various data sources, such as structured data files, Hive tables, external databases, or existing RDDs.
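Spark Streaming, described above, works by slicing a live stream into small batches and running an ordinary Spark job on each one. A plain-Python sketch of that micro-batch idea, assuming no Spark installation (the batch size and word-count logic are illustrative, not Spark's API):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    # Slice the incoming stream into fixed-size micro-batches,
    # the way Spark Streaming slices a live stream by time interval.
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

def process_batch(batch):
    # Each micro-batch is processed like a normal (small) batch job.
    return Counter(word for line in batch for word in line.split())

events = ["error disk full", "ok", "error timeout", "ok", "ok"]
running_total = Counter()
for batch in micro_batches(events, batch_size=2):
    running_total += process_batch(batch)  # update a running aggregate
print(running_total["error"])  # → 2
print(running_total["ok"])     # → 3
```

In real Spark Streaming the batches arrive continuously from a source such as Kafka and each batch is processed in parallel across the cluster, but the batching-plus-aggregation structure is the same.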
Resilient Distributed Datasets (RDDs)

Spark works on fault-tolerant collections of elements that can be operated on in parallel, a concept called the resilient distributed dataset (RDD). RDDs can be created in two ways: by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system such as a shared filesystem, HDFS, or HBase.
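The "resilient" part of an RDD comes from lineage: an RDD remembers how it was derived from its parent, so a lost partition can be recomputed instead of restored from replicas. A minimal sketch of that idea (a hypothetical toy class, not Spark's implementation):

```python
class ToyRDD:
    """A toy RDD: records its parent and transformation (its lineage)
    instead of materializing data, and recomputes on demand."""

    def __init__(self, data=None, parent=None, transform=None):
        self._data = data            # only set for a source RDD
        self._parent = parent        # lineage: where this RDD came from
        self._transform = transform  # how to derive it from the parent

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage graph.
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # An action: replay the lineage from the source data.
        if self._parent is None:
            return list(self._data)
        return self._transform(self._parent.collect())

source = ToyRDD(data=[1, 2, 3, 4, 5])  # like parallelizing a collection
result = source.map(lambda x: x * x).filter(lambda x: x > 5)
print(result.collect())  # → [9, 16, 25]
```

Because `result` carries its whole derivation, any lost piece of it can be rebuilt by re-running the transformations, which is how Spark recovers from worker failures without replicating every intermediate dataset.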

Apache Spark - Advantages

Speed: Spark processes data in memory and keeps its intermediate results there, greatly reducing the number of reads and writes to disk. Applications have run up to 100x faster than on Hadoop MapReduce, and up to 10x faster even when running on disk.

Ease of use: Spark lets developers write applications in Java, Python or Scala, so they can create and run applications in languages they already know.

Combines SQL, streaming and complex analytics: In addition to simple map and reduce operations, Spark supports streaming data, SQL queries, and complex analytics such as graph algorithms and machine learning, out of the box.

Runs everywhere: Spark runs on Hadoop, standalone, or in the cloud, and can access diverse data sources including HDFS, HBase, S3 and Cassandra.
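The speed claim rests largely on caching: once an intermediate dataset is materialized in memory, later actions reuse it instead of recomputing it from the source. A plain-Python sketch that counts recomputations (a hypothetical helper, not Spark's cache()/persist() API):

```python
compute_calls = 0

def expensive_transform(data):
    # Stand-in for a costly transformation (e.g., parsing or a shuffle).
    global compute_calls
    compute_calls += 1
    return [x * 10 for x in data]

class Cached:
    """Materialize a transformation once and serve later reads from
    memory, the way Spark keeps a cached RDD's partitions around."""

    def __init__(self, fn, data):
        self._fn, self._data, self._result = fn, data, None

    def get(self):
        if self._result is None:          # compute only on first access
            self._result = self._fn(self._data)
        return self._result

cached = Cached(expensive_transform, [1, 2, 3])
first = cached.get()   # computed
second = cached.get()  # served from memory
print(compute_calls)   # → 1: the transformation ran only once
```

Disk-oriented MapReduce, by contrast, writes intermediate results to HDFS between jobs, so an iterative algorithm pays the transform-and-I/O cost on every pass.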

Conclusion

Hadoop stores data on disk, whereas Spark stores data in memory. Hadoop uses replication to achieve fault tolerance, whereas Spark uses resilient distributed datasets (RDDs), which provide fault tolerance while minimizing network I/O. Spark can run on top of HDFS alongside other Hadoop components; it has become one more data processing engine in the Hadoop ecosystem, adding capabilities to the Hadoop stack for businesses.

For developers, the difference lies in the programming model: Hadoop is a framework in which one writes MapReduce jobs as Java classes, whereas Spark is a library that enables parallel computation via function calls. For administrators running a cluster, there is an overlap in general skills, such as cluster management, code deployment, and monitoring configuration.

Here is a quick comparison guideline before concluding:

Difficulty: Hadoop MapReduce is difficult to program and needs extra abstractions, while Spark is easy to program and does not require such abstractions.
Interactive mode: Hadoop has no built-in interactive mode, apart from tools like Pig and Hive; Spark has an interactive mode.
Streaming: Hadoop MapReduce only processes batches of stored data, while Spark can process data in real time through Spark Streaming.
Performance: MapReduce does not leverage the memory of the Hadoop cluster to the maximum, while Spark has been said to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
Latency: MapReduce is completely disk-oriented, while Spark ensures lower-latency computations by caching partial results in the memory of its distributed workers.
Ease of coding: Writing Hadoop MapReduce pipelines is a complex and lengthy process, while Spark code is always more compact.

ABOUT ALTEN CALSOFT LABS

ALTEN Calsoft Labs is a next-gen digital transformation, enterprise IT and product engineering services provider.
The company enables clients to innovate, integrate, and transform their business by leveraging disruptive technologies like mobility, big data, analytics, cloud, IoT and software-defined networking (SDN/NFV). ALTEN Calsoft Labs provides concept-to-market offerings for industry verticals like education, healthcare, networking & telecom, hi-tech, ISV and retail. Headquartered in Bangalore, India, the company has offices in the US, Europe and Singapore. ALTEN Calsoft Labs is part of the ALTEN group, a leader in technology consulting and engineering services. www.altencalsoftlabs.com business@altencalsoftlabs.com © ALTEN Calsoft Labs. All rights reserved.