CS555: Distributed Systems [Fall 2015] Dept. of Computer Science, Colorado State University


CS 555: DISTRIBUTED SYSTEMS [SPARK]
Shrideep Pallickara, Computer Science, Colorado State University

Frequently asked questions from the previous class survey
- Streaming
  - Significance of minimum delays?
  - Interleaving of packets in live streaming?
  - Who handles spacing in the buffers?
  - Dealing with major packet losses?
  - Buffering is primarily for syncing up audio and video
- Publish/subscribe
  - Difference between AMQP and JMS? Wire formats vs. interfaces
  - Popular type-based pub/sub?
  - Does flooding-based pub/sub ever make sense?
- Efficiency of distributed object protocols? Lots of overhead?
- Can message queuing handle multiple recipients?
- Circuit-switched networks?

Topics covered in this lecture
- Spark software stack
- Interactive shells in Spark
- Core Spark concepts
- Resilient Distributed Datasets

GOSSIP MESSAGING

Gossips: Variant of the epidemic scheme
- Also called rumor spreading
- If P has been updated for data item x, it finds Q and tries to update it
- If Q was already updated? P may lose interest in spreading the update with probability p = 1/k

Gossiping: An excellent way to spread news
- But cannot guarantee that all nodes will actually be updated
- When a large number of nodes participate in epidemics, a fraction s of the nodes can miss updates:

    s = e^(-(k+1)(1-s))

SLIDES CREATED BY: SHRIDEEP PALLICKARA
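The slide's formula for the fraction s of nodes that miss an update can be checked numerically. Below is a minimal sketch in plain Python (the helper name `miss_fraction` is illustrative, not from the slides) that solves s = e^(-(k+1)(1-s)) by fixed-point iteration:

```python
import math

def miss_fraction(k, iterations=200):
    """Solve s = e^(-(k+1)(1-s)) by fixed-point iteration.

    s is the fraction of nodes that miss the update when a node
    loses interest with probability 1/k after contacting an
    already-updated node.
    """
    s = 0.5  # initial guess; the iteration converges for these k
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1 - s))
    return s

for k in (1, 2, 3, 4, 5):
    print(f"k={k}: s = {miss_fraction(k):.4f}")
```

Running this shows why rumor spreading cannot guarantee full coverage: for k = 1 roughly 20% of the nodes miss the update, and the fraction shrinks quickly (but never to zero) as k grows.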

Managing Web Content Delivery
- Akamai: websites redirect users to Akamaized URLs
- The IP address associated with the client is used to select the server farm closest to the client
- Most popular content served up from caches: benefits of caching and network proximity
- Server farms sync up with managed websites to track content changes

APACHE SPARK

Spark: What is it?
- Cluster computing platform designed to be fast and general purpose
- Speed: extends MapReduce to support more types of computations, including interactive queries, iterative tasks, and stream processing
- Why is speed important? The difference between waiting for hours versus exploring data interactively

Key enabling idea in Spark: memory-resident data
- Spark loads data into the memory of worker nodes
- Processing is performed on memory-resident data

A look at the memory hierarchy

  Item              Time                      Scaled time in human terms
                                              (2 billion times slower)
  Processor cycle   0.5 ns (2 GHz)            1 second
  Cache access      1 ns (1 GHz)              2 seconds
  Memory access     70 ns                     140 seconds
  Context switch    5,000 ns (5 μs)           167 minutes
  Disk access       7,000,000 ns (7 ms)       162 days
  Quantum           100,000,000 ns (100 ms)   6.3 years

  Source: Kay Robbins & Steve Robbins. Unix Systems Programming, 2nd edition, Prentice Hall.

Spark covers a wide range of workloads
- Batch applications
- Iterative algorithms
- Queries
- Stream processing
- This has previously required multiple, independent tools
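The "scaled time" column is just each latency multiplied by the table's 2-billion slowdown factor; a quick arithmetic check in plain Python (the helper name `scaled_seconds` is illustrative):

```python
SLOWDOWN = 2_000_000_000  # the table's "2 billion times slower" factor

def scaled_seconds(ns):
    """Latency in nanoseconds -> seconds after the 2e9x slowdown."""
    return ns * 1e-9 * SLOWDOWN

print(scaled_seconds(0.5))                  # processor cycle: 1 second
print(scaled_seconds(70))                   # memory access: 140 seconds
print(scaled_seconds(5_000) / 60)           # context switch: ~167 minutes
print(scaled_seconds(7_000_000) / 86_400)   # disk access: ~162 days
```

The five-orders-of-magnitude gap between memory access ("140 seconds") and disk access ("162 days") is the whole argument for Spark's memory-resident design.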

APIs
- Java, Python, Scala, and SQL
- Integrates well with other tools
  - Can run in Hadoop clusters
  - Access Hadoop data sources, including Cassandra

At its core, Spark is a computational engine
- Spark is responsible for several aspects of applications that comprise many tasks across many machines (compute clusters)
- Responsibilities include:
  1. Scheduling
  2. Distribution
  3. Monitoring

THE SPARK SOFTWARE STACK

The Spark stack
- Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph processing)
- Spark Core
- Cluster managers: Standalone Scheduler, YARN, Mesos

Spark Core
- Basic functionality of Spark: task scheduling, memory management, fault recovery, and interacting with storage systems
- Also the API that defines Resilient Distributed Datasets (RDDs), Spark's main programming abstraction
  - An RDD represents a collection of data items dispersed across many compute nodes that can be manipulated in parallel

Spark SQL
- Package for working with structured data
- Allows querying data using SQL and HQL (Hive Query Language)
- Data sources: Hive tables, Parquet, and JSON
- Allows intermixing SQL queries with the programmatic data manipulations supported by RDDs, using Scala, Java, and Python

Spark Streaming
- Enables processing of live streams of data
- Sources: log files generated by production web servers; messages containing web service status updates

MLlib
- Library that contains common machine learning functionality
- Algorithms include: clustering, classification, regression, and collaborative filtering
- Low-level primitives, e.g. a generic gradient descent optimization algorithm
- Alternatives? Mahout, scikit-learn, Vowpal Wabbit, WEKA, and R, among others

GraphX
- Library for manipulating graphs and performing graph-parallel computations
- Extends the Spark RDD API: create a directed graph with arbitrary properties attached to each vertex and edge

Cluster Managers
- Spark runs over a variety of cluster managers, including:
  - Hadoop YARN
  - Apache Mesos
  - Standalone Scheduler (included within Spark)

Storage Layers for Spark
- Spark can create distributed datasets from any file stored in HDFS
- Plus other storage systems supported by the Hadoop API: Amazon S3, Cassandra, Hive, HBase, etc.

INTERACTIVE SHELLS IN SPARK

Spark Shells
- Interactive (Python and Scala), similar to shells like Bash or the Windows command prompt
- Ad hoc data analysis
- Traditional shells manipulate data using disk and memory on a single machine
- Spark shells allow interaction with data that is distributed across many machines; Spark manages the complexity of distributing the processing

Several software systems were designed to run on the Java Virtual Machine
- Languages that compile to run on the JVM can interact with Java software packages but are not actually Java
- There are a number of non-Java JVM languages
- The two most popular ones used in real-time application development: Scala and Clojure

Scala
- Has spent most of its life as an academic language and is still largely developed at universities
- Has a rich standard library that has made it appealing to developers of high-performance server applications
- Like Java, Scala is a strongly typed object-oriented language
- Includes many features from functional programming languages that are not in standard Java
- Interestingly, Java 8 incorporates several of the more useful features of Scala and other functional languages

What is functional programming?
- When a method is compiled by Java, it is converted to instructions called byte code and then largely disappears from the Java environment, except when it is called by other methods
- In a functional language, functions are treated the same way as data: they can be stored in objects (much like integers or strings), returned from functions, and passed to other functions

What about Clojure?
- Based on Lisp
- JavaScript? The name was a marketing gimmick; the language is closer to Clojure and Scala than it is to Java

CORE SPARK CONCEPTS
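The "functions as data" idea described for Scala applies equally in Python, the other Spark shell language. A minimal sketch (the names `shout` and `make_repeater` are illustrative, not from the slides) showing functions stored, returned, and passed like any other value:

```python
def shout(s):
    """An ordinary function -- a value we can store and pass around."""
    return s.upper() + "!"

def make_repeater(n):
    """Return a brand-new function: functions can be built at runtime."""
    def repeat(s):
        return " ".join([s] * n)
    return repeat

# Functions stored in a data structure, like integers or strings
pipeline = [shout, make_repeater(3)]

msg = "spark"
for f in pipeline:   # functions passed around and invoked as data
    msg = f(msg)
print(msg)           # SPARK! SPARK! SPARK!
```

This is exactly the capability Spark's API leans on: operators such as filter() accept a function as an argument.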

Core Spark Concepts
- Drivers
- SparkContext
- Executors

Drivers
- Every Spark application consists of a driver program
- The driver launches various parallel operations on the cluster
- Constituent elements:
  - The application's main function
  - Defines distributed datasets on the cluster
  - Applies operations to these datasets

SparkContext
- Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster
- Within the shell? Created as the variable sc; you can even print out sc to see its type
- Once you have a SparkContext, you can use it to build RDDs and then run operations on the data

Executors
- Driver programs manage a number of nodes, called executors
- Executors are responsible for running operations
- For example, if we were running a count() operation on a cluster, different machines might count lines in different ranges of the file

Components for distributed execution in Spark
- Driver Program (holding the SparkContext) coordinates Worker Nodes, each running an Executor that executes Tasks

A lot of Spark's API revolves around passing functions to its operators

  Python, with a named function:

    def hasPython(line):
        return "Python" in line

    pythonLines = lines.filter(hasPython)

  Scala, also known as the lambda or => syntax:

    val pythonLines = lines.filter(line => line.contains("Python"))
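Outside a cluster, the same filter-by-predicate pattern can be tried with an ordinary Python list standing in for the RDD (a local sketch only: `lines` here is made-up test data, not a real Spark RDD, and Python's built-in filter runs on one machine rather than across executors):

```python
lines = [
    "Introduction to Big Data with Apache Spark",
    "Hadoop MapReduce overview",
    "Fast and Expressive Big Data Analytics with Python",
]

def has_python(line):
    """Predicate passed to the operator, mirroring lines.filter(hasPython)."""
    return "Python" in line

# Named-function form
python_lines = list(filter(has_python, lines))

# Lambda form, mirroring lines.filter(lambda line: "Python" in line)
python_lines_lambda = list(filter(lambda line: "Python" in line, lines))

print(python_lines)
```

Either form keeps only the lines containing "Python"; on a real RDD the same call would be distributed, with each executor filtering its own partition of the data.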

A lot of Spark's API revolves around passing functions to its operators (Java)

  Java, with an anonymous inner class:

    JavaRDD<String> pythonLines = lines.filter(
        new Function<String, Boolean>() {
            public Boolean call(String line) { return line.contains("Python"); }
        });

  Java 8, with a lambda:

    JavaRDD<String> pythonLines = lines.filter(line -> line.contains("Python"));

The contents of this slide set are based on the following references
- Learning Spark: Lightning-Fast Big Data Analysis. 1st Edition. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. O'Reilly, 2015. ISBN-13: 978-1449358624. [Chapters 1-4]
- Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data. Byron Ellis. Wiley. [Chapter 2]