Introduction to Spark



Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)

Quick Demo

API Hooks
- Scala / Java: all Java libraries (*.jar), http://www.scala-lang.org
- Python: Anaconda, https://store.continuum.io/cshop/anaconda/

Introduction

Spark Structure
- Start Spark on a cluster
- Submit code to be run on it

Another Perspective

Step by step

Example: WordCount
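
Word count can be sketched in plain Python as the three steps a Spark job would chain together: split lines into words, pair each word with 1, and sum the counts per word. This is an illustration only and does not use Spark. The sample `lines` and the helper name `word_count` are made up for the sketch. In PySpark the corresponding RDD methods are `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict

def word_count(lines):
    """Count words the way a flatMap -> map -> reduceByKey chain would."""
    # flatMap step: one input line yields many words
    words = (word for line in lines for word in line.split())
    # map step: pair each word with a count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical input standing in for a text file
lines = ["to be or not to be", "that is the question"]
counts = word_count(lines)
print(counts["to"], counts["be"], counts["question"])
```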

Limitations of MapReduce
- Performance bottlenecks: not all jobs can be cast as batch processes
  - Graphs?
- Programming in Hadoop is hard
  - Boilerplate, boilerplate everywhere

Initial Workaround: Specialization

Along Came Spark
- Spark's goal was to generalize MapReduce to support new applications within the same engine
- Two additions:
  - Fast data sharing
  - General DAGs (directed acyclic graphs)
- Best of both worlds: easy to program and a more efficient engine in general

Codebase Size

More on Spark
- More general
  - Supports map/reduce paradigm
  - Supports vertex-based paradigm
  - General compute engine (DAG)
- More API hooks
  - Scala, Java, and Python
- More interfaces
  - Batch (Hadoop), real-time (Storm), and interactive (???)

Interactive Shells
- Spark creates a SparkContext object (cluster information)
- In either shell (Scala or Python), the context is available as sc
- External programs use a static constructor to instantiate the context

Interactive Shells spark-shell --master

Interactive Shells
- Master connects to the cluster manager, which allocates resources across applications
- Acquires executors on cluster nodes: worker processes to run computations and store data
- Sends app code to executors
- Sends tasks for executors to run

Resilient Distributed Datasets (RDDs)
- Resilient Distributed Datasets (RDDs) are the primary data abstraction in Spark
  - Fault-tolerant
  - Can be operated on in parallel
    1. Parallelized Collections
    2. Hadoop datasets
- Two types of RDD operations
  1. Transformations (lazy)
  2. Actions (immediate)
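
A parallelized collection can be pictured as a local list cut into slices that are processed independently. The sketch below is plain Python, not Spark: the helper `parallelize` is a hypothetical stand-in (PySpark's real entry point is `sc.parallelize(data, numSlices)`), and a thread pool plays the role of the executors.

```python
from concurrent.futures import ThreadPoolExecutor

def parallelize(items, num_slices):
    """Split a local list into roughly equal contiguous slices."""
    size, extra = divmod(len(items), num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        # The first `extra` slices take one additional element
        end = start + size + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices

slices = parallelize(list(range(10)), 3)

# Each slice is processed independently, here by a thread pool
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, slices))

print(slices, sum(partial_sums))
```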

Resilient Distributed Datasets (RDDs)
- Can create RDDs from any file stored in
  - HDFS
  - Local filesystem
  - Amazon S3
  - HBase
- Text files, SequenceFiles, or any other Hadoop InputFormat
- Any directory or glob
  - /data/201414*

Resilient Distributed Datasets (RDDs)
- Transformations
  - Create a new RDD from an existing one
  - Lazily evaluated: results are not immediately computed
    - Pipeline of subsequent transformations can be optimized
    - Lost data partitions can be recovered
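
Lazy evaluation can be illustrated with a toy class that only records what to do until a result is requested. This is a plain-Python sketch, not Spark: `LazyDataset` and `square` are hypothetical names, and `map`, `filter`, and `collect` only mimic the RDD methods of the same names. Because the source and the recorded steps are kept, a result can always be rebuilt by replaying them, which is the idea behind recovering lost partitions.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self.source = source   # the original data
        self.steps = steps     # lineage: functions to apply, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset, computes nothing yet
        return LazyDataset(self.source, self.steps + (lambda it: map(fn, it),))

    def filter(self, pred):
        # Transformation: also lazy
        return LazyDataset(self.source, self.steps + (lambda it: filter(pred, it),))

    def collect(self):
        # Action: only now is the recorded pipeline executed
        it = iter(self.source)
        for step in self.steps:
            it = step(it)
        return list(it)

calls = []

def square(x):
    calls.append(x)   # record when work actually happens
    return x * x

squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
print(len(calls))          # no element has been squared yet
print(squares.collect())   # the pipeline runs here
```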

Closures in Java

Resilient Distributed Datasets (RDDs)
- Actions
  - Return a value to the driver program (or write to storage) rather than creating a new RDD
  - Eagerly evaluated: results are immediately computed
    - Applies previous transformations (cache results?)
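
The difference between a transformation and an action can be mimicked with a Python generator: building it does no work, and consuming it produces plain values. This is a plain-Python analogy, not Spark, and the comments only point to the RDD actions (`collect`, `reduce`, `count`) that each line resembles.

```python
from functools import reduce

# Transformation modelled as a lazy generator expression: nothing runs yet
numbers = range(1, 6)
doubled = (n * 2 for n in numbers)

# Actions modelled as calls that consume the pipeline and return plain values
values = list(doubled)                        # like collect(): all elements
total = reduce(lambda a, b: a + b, values)    # like reduce(): a single value
count = len(values)                           # like count(): number of elements
print(values, total, count)
```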

Resilient Distributed Datasets (RDDs)
- Spark can persist / cache an RDD in memory across operations
- Each slice (partition) is persisted in memory and reused in subsequent actions involving that RDD
- The cache is fault-tolerant: if a partition is lost, it will be recomputed using the transformations that created it
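
A minimal sketch of the caching behaviour, in plain Python rather than Spark: computed partitions are kept in a dictionary, reused on later calls, and rebuilt from the recorded transformation if one goes missing. `CachedDataset` and its fields are hypothetical names for illustration. In PySpark the real calls are `rdd.cache()` and `rdd.persist()`.

```python
class CachedDataset:
    """Toy model of a persisted dataset split into partitions (slices)."""

    def __init__(self, partitions, fn):
        self.partitions = partitions   # source data, one list per partition
        self.fn = fn                   # the transformation applied to each element
        self.cache = {}                # partition index -> computed results
        self.computations = 0          # how many partition computations ran

    def _compute(self, index):
        self.computations += 1
        return [self.fn(x) for x in self.partitions[index]]

    def collect(self):
        out = []
        for i in range(len(self.partitions)):
            if i not in self.cache:    # missing or lost: recompute it
                self.cache[i] = self._compute(i)
            out.extend(self.cache[i])
        return out

data = CachedDataset([[1, 2], [3, 4]], lambda x: x * 10)
first = data.collect()     # computes both partitions
second = data.collect()    # served from the cache, no recomputation
del data.cache[1]          # simulate losing a cached partition
third = data.collect()     # only the lost partition is recomputed
print(first, data.computations)
```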

Broadcast Variables
- Spark's version of Hadoop's DistributedCache
- Read-only variable cached on each node
- Spark [internally] distributes broadcast variables in a way that minimizes communication cost
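
The read-only, shared nature of a broadcast variable can be sketched in plain Python with a read-only mapping that every task consults. This is not Spark: `country_names`, `run_task`, and the sample partitions are made up for illustration. In PySpark a broadcast variable is created with `sc.broadcast(value)` and read through its `.value` attribute.

```python
from types import MappingProxyType

# Hypothetical lookup table that every task needs; the proxy makes it read-only
country_names = MappingProxyType({"gr": "Greece", "us": "United States"})

def run_task(partition, lookup):
    # Each task reads the shared table and never modifies it
    return [lookup.get(code, "unknown") for code in partition]

partitions = [["gr", "us"], ["us", "xx"]]
results = [run_task(p, country_names) for p in partitions]
print(results)
```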

Accumulators
- Spark's version of Hadoop's Counter
- Variables that can only be added to through an associative operation
- Native support for numeric accumulator types and standard mutable collections
  - Users can extend to new types
- Only the driver program can read the accumulator value
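
An add-only counter of this kind can be sketched in plain Python as follows. This is not Spark: the `Accumulator` class, `count_blank`, and the sample partitions are hypothetical, and the "tasks" run in a simple loop. In PySpark an accumulator is created with `sc.accumulator(0)`, updated with `add`, and read on the driver through `.value`.

```python
class Accumulator:
    """Toy add-only counter: tasks may add, only the driver reads the total."""

    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        # Addition is associative, so updates can be merged in any grouping
        self._value += amount

    @property
    def value(self):
        # In Spark, only the driver program would read this
        return self._value

blank_lines = Accumulator()

def count_blank(partition, acc):
    # Task side: add to the accumulator, never read it
    for line in partition:
        if not line.strip():
            acc.add(1)

partitions = [["a", "", "b"], ["", " ", "c"]]
for p in partitions:
    count_blank(p, blank_lines)

print(blank_lines.value)   # driver side: read the final total
```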

Key/Value Pairs

Resources
- Original slide deck: http://cdn.liber118.com/workshop/itas_workshop.pdf
- Code samples:
  - https://gist.github.com/ceteri/f2c3486062c9610eac1d
  - https://gist.github.com/ceteri/8ae5b9509a08c08a1132
  - https://gist.github.com/ceteri/11381941