Scaling Out With Apache Spark
DTL Meeting, 17-04-2015
Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Your hosts
- Mathijs Kattenberg, technical consultant
- Jeroen Schot, technical consultant
Data-centric Computing
The data cannot be ignored
Frameworks that hide complexity from the user:
- Transparent scalability
- Automatic failure recovery
Hadoop: DFS + MapReduce
Less suitable for fine-grained parallelism - use MPI for that
Hadoop at SURFsara
Current:
- 86 nodes, 8 cores, 64GB RAM, 8TB disks
- 90 nodes, 8 cores, 64GB RAM, 16TB disks
- 1408 container slots
- 2PB distributed filesystem
- Hortonworks HDP 2.2 (Hadoop 2.6)
- Kerberos authentication
Frameworks:
- MapReduce
- Pig
- Spark
- Cascading
- Giraph
What is Apache Spark?
Yet another computing framework for large-scale data processing
Interfaces nicely with Hadoop
API in Scala, Java and Python
Compared to MapReduce: richer set of operators, in-memory computing
Has additional components for SQL, streaming, graph analysis and machine learning
Motivation
MapReduce greatly simplified big data analysis on large, unreliable clusters
But users want more:
- Iterative jobs (machine learning, etc.)
- Interactive analytics
Motivation
Both iterative and interactive queries need something that MapReduce lacks: efficient primitives for data sharing
In MapReduce, data between jobs is shared via the (slow) distributed filesystem
Replication makes this even slower, but is necessary for fault tolerance
Proposed solution
In-Memory Data Processing and Sharing
Proposed solution
How do we design a distributed memory abstraction that is both fault tolerant and efficient?
Resilient Distributed Datasets (RDDs)!
Resilient Distributed Datasets
A distributed memory abstraction
Immutable collections of objects spread across a cluster
An RDD is divided into a number of partitions; these partitions can be stored on different nodes of the cluster
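As a small illustration (a sketch for the Scala spark-shell, where sc is the SparkContext the shell provides), a local collection can be turned into an RDD with a chosen number of partitions:

// parallelize turns a local collection into an RDD, here split into
// 4 partitions that Spark can place on different nodes of the cluster.
val numbers = sc.parallelize(1 to 1000000, 4)
numbers.partitions.size   // => 4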
Spark Programming Model
The programming model is based on parallelizable operators (just like MapReduce!)
Parallelizable operators are higher-order functions that execute user-defined functions in parallel
RDD operators
There are two types of RDD operators:
- Transformations: lazy operators that create new RDDs (remember immutability!)
- Actions: computations that return a value or write to external storage
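A small sketch of the difference (the input path is made up; sc is the shell's SparkContext):

// Transformations are lazy: nothing is computed yet when these lines run.
val lines  = sc.textFile("input.txt")           // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))  // still lazy, just a new RDD

// Only an action triggers the actual computation across the cluster.
val numErrors = errors.count()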
Example operators - transformations
Some common supported transformations are:
- map(func) - pass each element through function func
- filter(func) - select elements for which func returns true
- union(otherRDD) - return the union of the two datasets
- join(otherRDD) - on datasets of type (K, V), perform a join on K
- cartesian(otherRDD) - return a dataset with all pairs of elements
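A sketch of these transformations in Scala (the tiny example collections are made up for illustration):

// All of these are transformations: they build new RDDs lazily.
val a = sc.parallelize(Seq(1, 2, 3, 4, 5))
val b = sc.parallelize(Seq(4, 5, 6))

val doubled = a.map(x => x * 2)           // map: 2, 4, 6, 8, 10
val evens   = a.filter(x => x % 2 == 0)   // filter: 2, 4
val both    = a.union(b)                  // union: 1, 2, 3, 4, 5, 4, 5, 6
val pairs   = a.cartesian(b)              // cartesian: all (x, y) pairs

// join works on (K, V) datasets and joins on the key K.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (2, "about")))
val joined = users.join(visits)           // (1, ("alice", "home")), (2, ("bob", "about"))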
Example operators - actions
Some common supported actions are:
- reduce(func) - aggregate the elements of the dataset using func
- count() - return the number of elements
- take(n) - return the first n elements
- saveAsTextFile(path) - store the dataset on disk as text files
- foreach(func) - apply func to every element (used for side effects)
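A sketch of these actions (the output path is hypothetical); each of them triggers an actual computation:

val nums = sc.parallelize(1 to 10)

nums.reduce(_ + _)                        // 55: aggregates all elements
nums.count()                              // 10
nums.take(3)                              // Array(1, 2, 3)
nums.saveAsTextFile("/tmp/numbers-out")   // hypothetical output path
nums.foreach(x => println(x))             // side effect, runs on the executors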
Demo time
Demo: Spark Shell
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions (due to crashes, memory evictions, etc.)
No replicas needed
Only the missing partitions of an RDD are recomputed
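The lineage Spark keeps for an RDD can be inspected from the shell (a sketch; the input path is made up):

// Build an RDD through a couple of transformations ...
val counts = sc.textFile("input.txt")        // hypothetical path
  .map(line => (line.length, 1))
  .reduceByKey(_ + _)

// ... and print the lineage Spark would use to recompute lost partitions.
println(counts.toDebugString)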
Spark SQL and DataFrames
Uses an abstraction on top of RDDs to work with columnar data: DataFrames
DataFrames can be queried using a DSL or with the SQL query engine
Data sources can be CSV/JSON/Parquet files or remote databases
The schema of the source data is specified (or inferred) on read
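A minimal sketch of the DataFrame API, assuming a SQLContext with the read.json method (Spark 1.4-style); people.json is a made-up input file:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: the SparkContext from the shell

// Read a JSON file; the schema is inferred from the data.
val people = sqlContext.read.json("people.json")   // hypothetical input file

// Query with the DataFrame DSL ...
people.filter(people("age") > 30).select("name").show()

// ... or register the DataFrame and use plain SQL.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()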
When to use Spark?
Batch operations that apply a series of operations to all elements of a dataset
SQL-like querying of large data volumes
Not a replacement for the database backend of a web service (no efficient updates!)
Questions?
RDD dependencies - Narrow
Narrow: each partition of a parent RDD is used by at most one partition of the child RDD
Narrow dependencies allow pipelined execution on one cluster node, e.g. a map followed by a filter
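A sketch of a narrow-dependency pipeline (no data moves between nodes):

// map and filter both have narrow dependencies, so Spark pipelines them:
// each partition is processed on the node that holds it.
val nums     = sc.parallelize(1 to 100, 4)
val pipeline = nums.map(_ * 2).filter(_ > 50)
pipeline.count()   // the action runs the whole pipeline, no shuffle involved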
RDD dependencies - Wide
Wide: each partition of a parent RDD may be used by multiple partitions of the child RDD
Communication between nodes is needed
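A sketch of a wide dependency: reduceByKey must bring together all values with the same key, so data is shuffled across the network.

// reduceByKey has a wide dependency: each output partition may need records
// from every input partition, so Spark shuffles data between nodes.
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
val summed = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2), ("c", 4)
summed.collect()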
Job Scheduling
When the user runs an action, the scheduler constructs a DAG of stages
A stage contains many pipelined transformations with narrow dependencies
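A sketch of how a job splits into stages: the narrow map steps are pipelined, and the wide reduceByKey introduces a stage boundary.

val result = sc.parallelize(1 to 100, 4)
  .map(x => (x % 10, x))      // narrow: pipelined into the first stage
  .reduceByKey(_ + _)         // wide: shuffle, hence a stage boundary
  .map { case (_, v) => v }   // narrow: pipelined into the second stage

println(result.toDebugString) // the indentation shows the shuffle boundary
result.collect()              // the action triggers the two-stage job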