Brave New World: Hadoop vs. Spark
Dr. Kurt Stockinger, Associate Professor of Computer Science, Director of Studies in Data Science, Zurich University of Applied Sciences
Datalab Seminar, Zurich, Oct. 7, 2015
Outline
  Motivation: main-memory processing
  Spark: MapReduce, SQL processing, stream processing, machine learning, graph processing
  ZHAW Big Data project & industry trends
What is Apache Spark?
  General-purpose cluster computing system
  Originally developed at UC Berkeley, now one of the largest Apache projects
  Typically faster than Hadoop due to main-memory processing
  High-level APIs in Java, Scala, Python and R
  Functionality for:
    MapReduce
    SQL processing
    Real-time stream processing
    Machine learning
    Graph processing
Iterative Processing with Hadoop
What is wrong with this approach?
Each iteration writes its intermediate results back to disk (HDFS) and re-reads them in the next iteration, so iterative jobs pay the disk I/O and replication cost over and over.
Throughput of Main Memory vs. Disk
  Typical throughput of disk: ~100 MB/sec
  Typical throughput of main memory: ~50 GB/sec
  => Main memory is ~500 times faster than disk
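To make the gap concrete, here is a back-of-the-envelope calculation, using the throughput figures above, of how long a full scan of 1 TB of data would take from disk versus from main memory (the 1 TB workload size is an illustrative assumption, not from the slides):

```python
# Back-of-the-envelope scan times for 1 TB of data,
# using the throughput figures quoted above.
DISK_MB_PER_SEC = 100        # ~100 MB/sec
MEM_MB_PER_SEC = 50 * 1024   # ~50 GB/sec

data_mb = 1024 * 1024        # 1 TB expressed in MB

disk_seconds = data_mb / DISK_MB_PER_SEC   # ~10486 s, i.e. ~2.9 hours
mem_seconds = data_mb / MEM_MB_PER_SEC     # ~20.5 s

print(f"disk: {disk_seconds / 3600:.1f} h, memory: {mem_seconds:.1f} s")
print(f"speedup: {disk_seconds / mem_seconds:.0f}x")   # 512x
```

With binary units the speedup comes out at 512x, matching the ~500x figure on the slide.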
Apache Spark Approach: In-Memory Data Sharing
(from Matei Zaharia 2012, UC Berkeley)
Spark vs. Hadoop #1
http://www.datasciencecentral.com/profiles/blogs/the-big-big-data-question-hadoop-or-spark
Spark vs. Hadoop #2
(from Ameet Talwalkar, UCLA, 2015)
Spark Runtime
  Driver:
    The application developer writes the driver program
    The driver connects to a cluster of workers
  Workers:
    Read data blocks from a distributed file system
    Process and store data partitions in RAM across operations
  Storage: local storage, HDFS, Amazon S3, …
Resilient Distributed Dataset (RDD)
  Resilience: "the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation" (Wikipedia) => fault tolerance
  Resilient Distributed Dataset (RDD):
    Immutable, partitioned collection of objects spread across a cluster
    Controllable persistence: stored in memory or on disk
    Built and manipulated through:
      parallel transformations (map, filter, join)
      parallel actions (count, collect, save)
    Automatically rebuilt on machine failure
Spark Transformations and Actions
(from Matei Zaharia et al. 2012, UC Berkeley)
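Since real Spark code needs a cluster (or a local Spark installation) to run, here is a plain-Python sketch of what the transformations (map, filter) and actions (count, collect) listed above do to a partitioned collection. The `rdd_*` helper names and the sample data are mine, purely for illustration; note that real RDD transformations are lazy, while this sketch evaluates eagerly for simplicity:

```python
# Plain-Python sketch of RDD semantics: an "RDD" is modeled as a list of
# partitions, transformations build new per-partition collections, and
# actions pull results back to the driver. Not real Spark code.

partitions = [[1, 2, 3], [4, 5, 6]]   # an "RDD" with two partitions

def rdd_map(parts, f):
    # transformation: apply f element-wise within each partition
    return [[f(x) for x in p] for p in parts]

def rdd_filter(parts, pred):
    # transformation: keep only elements satisfying pred
    return [[x for x in p if pred(x)] for p in parts]

def rdd_count(parts):
    # action: total number of elements across all partitions
    return sum(len(p) for p in parts)

def rdd_collect(parts):
    # action: flatten all partitions into one list at the driver
    return [x for p in parts for x in p]

squares = rdd_map(partitions, lambda x: x * x)
evens = rdd_filter(squares, lambda x: x % 2 == 0)
print(rdd_count(evens))     # 3
print(rdd_collect(evens))   # [4, 16, 36]
```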
Outline
  Motivation: main-memory processing
  Spark: MapReduce, SQL processing, stream processing, machine learning, graph processing
  ZHAW Big Data project & industry trends
Word Count Example
  Hadoop 2.0:
    Java: ~130 lines
    Python: 2 programs, ~30 lines
  Spark: Python
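The Spark version of word count fits in a handful of lines: the classic pipeline is flatMap → map → reduceByKey. Since PySpark is not runnable here, the sketch below mimics that pipeline with plain Python; the comments show the corresponding RDD calls, and the input lines are sample data of my own:

```python
lines = ["to be or not to be", "to do or not to do"]

# Spark: lines.flatMap(lambda line: line.split(" "))
words = [w for line in lines for w in line.split(" ")]

# Spark: words.map(lambda w: (w, 1))
pairs = [(w, 1) for w in words]

# Spark: pairs.reduceByKey(lambda a, b: a + b)
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(sorted(counts.items()))
# [('be', 2), ('do', 2), ('not', 2), ('or', 2), ('to', 4)]
```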
Spark + HDFS
  HDFS: Hadoop Distributed File System
    Fault-tolerant distributed file system
    One of its main strengths: data replication
  15 years back in time: data replication was a major research topic at CERN:
    W. Hoschek, J. Jaen-Martinez, A. Samar, H. Stockinger, K. Stockinger. Data Management in an International Data Grid Project. ACM International Workshop on Grid Computing (Grid 2000), Bangalore, India, December 2000, IEEE Computer Society Press (distinguished paper award)
  Spark can leverage this Hadoop technology
Spark SQL Processing Example
Load data from a text file into an RDD, create a table, and query it.

The CSV file people.txt looks as follows:
  Michael, 29
  Andy, 30
  Justin, 19

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="SparkSQLPeople")
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("data/people.txt", 4)
words = lines.map(lambda l: l.split(","))
people = words.map(lambda p: Row(name=p[0], age=int(p[1])))

schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("peoplet")

# SQL can be run over SchemaRDDs that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM peoplet WHERE age >= 13 AND age <= 19")
teenagers.collect()
Reflections on SQL Processing
  Pros:
    Good integration with core Spark concepts (RDDs)
    Easy to use
    Integration with Apache Hive (the Hadoop data warehouse)
  Cons:
    No full SQL support yet (no views, indexing, etc.)
    No easy overview of the schema
    Cloudera is not happy about it
Spark Stream Processing Example #1
Spark Stream Processing Example #2
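The code from the two example slides above is not preserved in this text. As a stand-in, here is a plain-Python sketch of Spark Streaming's micro-batch model: the stream is treated as a sequence of small batches, and the same word-count logic runs on each batch in turn. The batch contents are invented sample data; real Spark Streaming code would use a StreamingContext and DStream operations instead of these plain lists:

```python
from collections import Counter

# Each element models one micro-batch of lines arriving on the stream.
batches = [
    ["hello spark", "hello streaming"],
    ["spark streaming"],
]

def word_count(batch_lines):
    # Spark Streaming equivalent:
    # lines.flatMap(split).map(lambda w: (w, 1)).reduceByKey(add)
    return Counter(w for line in batch_lines for w in line.split(" "))

# The driver processes one micro-batch per interval (e.g. every second).
for i, batch in enumerate(batches):
    print(f"batch {i}: {dict(word_count(batch))}")
```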
Reflections on Stream Processing
  Pros:
    Streaming functionality well integrated with core Spark (RDDs)
    Similar functionality to Apache Storm, but with a simpler API
  Cons:
    Main support currently only for Scala and Java; Python support is still an open question
Spark Machine Learning Example
Reflections on Machine Learning
  Pros:
    Many machine learning algorithms, such as gradient descent or clustering, are iterative
    Can fully take advantage of the main-memory architecture
    Similar functionality to Apache Mahout
  Cons: ?
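To see why iterative algorithms benefit from Spark's main-memory architecture: gradient descent rescans the entire dataset on every iteration, so keeping the data cached in RAM (rdd.cache() in Spark) avoids re-reading it from disk on each pass. A minimal plain-Python sketch, with invented toy data, fitting y ≈ w·x by gradient descent:

```python
# Minimal gradient descent for 1-D linear regression (y ≈ w * x).
# Each iteration makes a full pass over the data — exactly the access
# pattern that makes Spark's in-memory caching pay off.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # true slope w = 2

w = 0.0
lr = 0.05
for _ in range(200):   # 200 full passes over the dataset
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))   # 2.0
```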
Spark Graph Processing Example
Use case: analysis of the author graph in the Medline database
Source: Axel Rogg, Analyse von strukturierten und unstrukturierten Daten mit Apache Spark (Analysis of Structured and Unstructured Data with Apache Spark), Bachelor Thesis, ZHAW, June 2015
Reflections on Graph Processing
  Pros:
    Powerful graph processing library
    SQL integration into graph algorithms
    Similar functionality to Apache Giraph
  Cons:
    Main support mostly for Scala
    Graph processing supports only graphs with simple node types
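Since the graph library is mainly usable from Scala (as noted above), here is a plain-Python sketch of the iterative structure such graph algorithms share, using PageRank on a tiny invented graph: in each iteration, every node distributes its rank along its outgoing edges, and ranks are recomputed from the incoming contributions. This per-iteration pattern is what the Spark library parallelizes across the cluster:

```python
# Tiny PageRank in plain Python on an invented 3-node graph.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 for v in edges}
damping = 0.85

for _ in range(20):   # iterate until ranks (nearly) converge
    # Each node sends rank / out_degree along every outgoing edge.
    contribs = {v: 0.0 for v in edges}
    for v, outs in edges.items():
        for u in outs:
            contribs[u] += ranks[v] / len(outs)
    # Recompute ranks from incoming contributions.
    ranks = {v: (1 - damping) + damping * c for v, c in contribs.items()}

print({v: round(r, 3) for v, r in sorted(ranks.items())})
```

Node "c" ends up with the highest rank, since it receives links from both "a" and "b".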
Outline
  Motivation: main-memory processing
  Spark: MapReduce, SQL processing, stream processing, machine learning, graph processing
  ZHAW Big Data project & industry trends
Data-Driven Financial System Modeling
  Motivation: the crisis of 2007-2009
  US Secretary of the Treasury Hank Paulson in September 2008 (when Lehman Brothers filed for bankruptcy): "The problems at Lehman have been known for many months. The counterparties have had ample opportunity to adjust their exposure. Therefore, we can safely let Lehman go down."
  ZHAW project to make stress tests and risk exposure calculations comparable across banks
  Collaboration with regulators such as the European Central Bank and the Bank of England
ZHAW Project
  ACTUS: Algorithmic Contract Type Unifying Standard
  Brammertz et al., Unified Financial Analysis. Chichester, 2009.
  The project requires an interdisciplinary approach:
    Finance policies, mathematics, economics
    Data warehousing, big data, cloud computing
    Parallel programming
    Security
ZHAW Project: Data-Driven Financial Risk Modeling
  Implemented in Hadoop; simulated cash flows of millions of financial contracts
  Now being re-implemented in Apache Spark, leveraging HDFS + Hive
  More info at DW 2015, November, in Zurich:
  http://www.dw-konferenz.ch/dw2015/konferenz/konferenzprogramm/vortrag/m6/title/large-scale-data-driven-financial-risk-assessment.html
Trends from Industry
Personal impressions & insights from ISC Cloud & Big Data (Frankfurt, Sept 29-30, 2015):
  Hadoop is gaining traction in Europe
  Spark is picking up
  Intel seems to be focused on Hadoop
  IBM Research Zurich is working on Spark over RDMA
  Swisscom has embarked heavily on Spark (especially for streaming)
  Many companies use the concept of a Data Lake: a wild collection of data management/processing tools
    Data warehousing technology
    Big Data technology (Hadoop, Spark, …)
Conclusions
  Spark combines functionality that is spread across various Apache Big Data tools:
    MapReduce
    SQL processing
    Stream processing
    Graph processing
    Machine learning
  Main concept: the resilient distributed dataset (RDD)
  The main-memory architecture is well suited for iterative data processing, for both Big and Small Data
  Spark can leverage the strengths of HDFS + Hive
Contact: Kurt.Stockinger@zhaw.ch