Writing Standalone Spark Programs

Similar documents

Using Apache Spark Pat McDonough - Databricks

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

Data Science in the Wild

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Big Data Frameworks: Scala and Spark Tutorial

How to Run Spark Application

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Introduction to Big Data with Apache Spark UC BERKELEY

Big Data Analytics. Lucas Rego Drumond

Spark. Fast, Interactive, Language- Integrated Cluster Computing

LAB 2 SPARK / D-STREAM PROGRAMMING SCIENTIFIC APPLICATIONS FOR IOT WORKSHOP

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Introduction to Spark

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Spark: Making Big Data Interactive & Real-Time

Apache Spark and Distributed Programming

Architectures for massive data management

MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Spark and the Big Data Library

Certification Study Guide. MapR Certified Spark Developer Study Guide

Big Data and Scripting Systems beyond Hadoop

Spark and Shark: High-speed In-memory Analytics over Hadoop Data

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Running Knn Spark on EC2 Documentation

Unified Big Data Analytics Pipeline. 连城

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Big Data Analytics with Cassandra, Spark & MLLib

Hadoop Tutorial. General Instructions

HiBench Introduction. Carson Wang Software & Services Group

MLlib: Scalable Machine Learning on Spark

Moving From Hadoop to Spark

IDS 561 Big data analytics Assignment 1

Beyond Hadoop with Apache Spark and BDAS

Unified Big Data Processing with Apache Spark. Matei

Spark: Cluster Computing with Working Sets

An Introduction to Apostolos N. Papadopoulos

MarkLogic Server. MarkLogic Connector for Hadoop Developer s Guide. MarkLogic 8 February, 2015

Streaming items through a cluster with Spark Streaming

How To Create A Data Visualization With Apache Spark And Zeppelin

BIG DATA STATE OF THE ART: SPARK AND THE SQL RESURGENCE

Big Data Analytics Hadoop and Spark

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Conquering Big Data with Apache Spark

SparkLab May 2015 An Introduction to

Big Data and Scripting Systems build on top of Hadoop

Implementations of iterative algorithms in Hadoop and Spark

Testing Spark: Best Practices

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Spark Job Server. Evan Chan and Kelvin Chu. Date

Extreme Computing. Hadoop MapReduce in more detail.

This is a brief tutorial that explains the basics of Spark SQL programming.

The Spark Big Data Analytics Platform

Brave New World: Hadoop vs. Spark

Why Spark Is the Next Top (Compute) Model

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

BIG DATA APPLICATIONS

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Tachyon: memory-speed data sharing

MapReduce Evaluator: User Guide

Other Map-Reduce (ish) Frameworks. William Cohen

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Flink Next-gen data analysis. Kostas

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Mrs: MapReduce for Scientific Computing in Python

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Hadoop Ecosystem B Y R A H I M A.

The Easiest Way to Run Spark Jobs. How-To Guide

Getting Started with Hadoop with Amazon s Elastic MapReduce

A. Aiken & K. Olukotun PA3

H2O on Hadoop. September 30,

Fair Scheduler. Table of contents

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol

Hadoop Masterclass. Part 4 of 4: Analyzing Big Data. Lars George EMEA Chief Architect Cloudera

Machine- Learning Summer School

HDFS. Hadoop Distributed File System

COURSE CONTENT Big Data and Hadoop Training

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks

HOD Scheduler. Table of contents

An Architecture for Fast and General Data Processing on Large Clusters

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Xiaoming Gao Hui Li Thilina Gunarathne

Hadoop Setup. 1 Cluster

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Improving MapReduce Performance in Heterogeneous Environments

HADOOP MOCK TEST HADOOP MOCK TEST I

Beyond Hadoop MapReduce Apache Tez and Apache Spark

Assignment 2: More MapReduce with Hadoop

Developing a MapReduce Application

! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering

Transcription:

Writing Standalone Spark Programs Matei Zaharia UC Berkeley www.spark- project.org UC BERKELEY

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Building Spark Requires: Java 6+, Scala 2.9.1+ git clone git://github.com/mesos/spark cd spark sbt/sbt compile # Build Spark + dependencies into single JAR # (gives core/target/spark*assembly*.jar) sbt/sbt assembly # Publish Spark to local Maven cache sbt/sbt publish-local

Adding it to Your Project Either include the Spark assembly JAR, or add a Maven dependency on: groupid: artifactid: spark-core_2.9.1 version: 0.5.1-SNAPSHOT org.spark-project

Creating a SparkContext import spark.sparkcontext import spark.sparkcontext._ Important to get some implicit conversions val sc = new SparkContext( masterurl, name, sparkhome, Seq( job.jar )) Mesos cluster URL, or local / local[n] Job name Spark install path on cluster List of JARs with your code (to ship)

Complete Example import spark.sparkcontext import spark.sparkcontext._ Static singleton object object WordCount { def main(args: Array[String]) { val sc = new SparkContext( local, WordCount, args(0), Seq(args(1))) val file = sc.textfile(args(2)) file.map(_.split( )).flatmap(word => (word, 1)).reduceByKey(_ + _).saveastextfile(args(3)) } }

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Why PageRank? Good example of a more complex algorithm» Multiple stages of map & reduce Benefits from Spark s in- memory caching» Multiple iterations over the same data

Basic Idea Give pages ranks (scores) based on links to them» Links from many pages è high rank» Link from a high- rank page è high rank Image: en.wikipedia.org/wiki/file:pagerank- hi- res- 2.png

Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank p / neighbors p to its neighbors 3. Set each page s rank to 0.15 + 0.85 contribs 1.0 1.0 1.0 1.0

Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank p / neighbors p to its neighbors 3. Set each page s rank to 0.15 + 0.85 contribs 1 1.0 1.0 1.0 0.5 1.0 0.5 1 0.5 0.5

Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank p / neighbors p to its neighbors 3. Set each page s rank to 0.15 + 0.85 contribs 1.85 0.58 1.0 0.58

Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank p / neighbors p to its neighbors 3. Set each page s rank to 0.15 + 0.85 contribs 0.58 1.85 0.29 0.29 1.85 0.5 0.58 1.0 0.58 0.5

Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank p / neighbors p to its neighbors 3. Set each page s rank to 0.15 + 0.85 contribs 1.31 0.39 1.72... 0.58

Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank p / neighbors p to its neighbors 3. Set each page s rank to 0.15 + 0.85 contribs Final state: 1.44 0.46 1.37 0.73

Spark Program val links = // RDD of (url, neighbors) pairs var ranks = // RDD of (url, rank) pairs for (i <- 1 to ITERATIONS) { val contribs = links.join(ranks).flatmap { case (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) } ranks = contribs.reducebykey(_ + _).mapvalues(0.15 + 0.85 * _) } ranks.saveastextfile(...)

Coding It Up

PageRank Performance Time per Iteration (s) 180 160 140 120 100 80 60 40 20 0 171 72 80 28 30 60 Hadoop Spark Number of machines

Other Iterative Algorithms Logistic Regression K- Means Clustering Time per Iteration (s) 80 60 40 20 0 76 Hadoop 3 Spark Time per Iteration (s) 125 100 75 50 25 0 106 Hadoop 33 Spark

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Differences in Java API Implement functions by extending classes» spark.api.java.function.function, Function2, etc Use special Java RDDs in spark.api.java» Same methods as Scala RDDs, but take Java Functions Special PairFunction and JavaPairRDD provide operations on key- value pairs» To maintain type safety as in Scala

Examples import spark.api.java.*; import spark.api.java.function.*; JavaSparkContext sc = new JavaSparkContext(...); JavaRDD<String> lines = ctx.textfile( hdfs://... ); JavaRDD<String> words = lines.flatmap( new FlatMapFunction<String, String>() { public Iterable<String> call(string s) { return Arrays.asList(s.split(" ")); } } ); System.out.println(words.count());

Examples import spark.api.java.*; import spark.api.java.function.*; JavaSparkContext sc = new JavaSparkContext(...); JavaRDD<String> lines = ctx.textfile(args[1], 1); class Split extends FlatMapFunction<String, String> { public Iterable<String> call(string s) { return Arrays.asList(s.split(" ")); } ); JavaRDD<String> words = lines.flatmap(new Split()); System.out.println(words.count());

Key- Value Pairs import scala.tuple2; JavaPairRDD<String, Integer> ones = words.map( new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(string s) { return new Tuple2(s, 1); } } ); JavaPairRDD<String, Integer> counts = ones.reducebykey( new Function2<Integer, Integer, Integer>() { public Integer call(integer i1, Integer i2) { return i1 + i2; } } );

Java PageRank

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Developing in Local Mode Just pass local or local[k] as master URL Still serializes tasks to catch marshaling errors Debug in any Java/Scala debugger

Running on a Cluster Set up Mesos as per Spark wiki» github.com/mesos/spark/wiki/running- spark- on- mesos Basically requires building Mesos and creating config files with locations of slaves Pass master:port as URL (default port is 5050)

Running on EC2 Easiest way to launch a Spark cluster git clone git://github.com/mesos/spark.git cd spark/ec2./spark-ec2 -k keypair i id_rsa.pem s slaves \ [launch stop start destroy] clustername Details: tinyurl.com/spark- ec2

Viewing Logs Click through the web UI at master:8080 Or, look at stdout and stdout files in the Mesos work directories for your program, such as: /tmp/mesos/slaves/<slaveid>/frameworks/ <FrameworkID>/executors/0/runs/0/stderr FrameworkID is printed when Spark connects, SlaveID is printed when a task starts

Common Problems Exceptions in tasks: will be reported at master 17:57:00 INFO TaskSetManager: Lost TID 1 (task 0.0:1) 17:57:00 INFO TaskSetManager: Loss was due to java.lang.arithmeticexception: / by zero at BadJob$$anonfun$1.apply$mcII$sp(BadJob.scala:10) at BadJob$$anonfun$1.apply(BadJob.scala:10) at... Fetch failure: couldn t communicate with a node» Most likely, it crashed earlier» Always look at first problem in log

Common Problems NotSerializableException:» Set sun.io.serialization.extendeddebuginfo=true to get a detailed trace (in SPARK_JAVA_OPTS)» Beware of closures using fields/methods of outer object (these will reference the whole object)

For More Help Join the Spark Users mailing list: groups.google.com/group/spark- users Come to the Bay Area meetup: www.meetup.com/spark- users