Spark Job Server. Evan Chan and Kelvin Chu

Overview

Why We Needed a Job Server

Created at Ooyala in 2013. Our vision for Spark is as a multi-team big data service. What gets repeated by every team:
- Bastion box for running Hadoop/Spark jobs
- Deploys and process monitoring
- Tracking and serializing job status, progress, and job results
- Job validation
- No easy way to kill jobs

Spark as a Service

- REST API for Spark jobs and contexts: easily operate Spark from any language or environment
- Runs each job in its own context, or shares one context among jobs
- Great for sharing cached RDDs across jobs and for low-latency jobs
- Works for Spark Streaming as well!
- Works with Standalone, Mesos, and any Spark config
- Jars, job history, and config are persisted via a pluggable API
- Async and sync APIs; JSON job results

Open Source!! http://github.com/ooyala/spark-jobserver

Creating a Job Server Project

In your build.sbt, add this:

    resolvers += "Ooyala Bintray" at "http://dl.bintray.com/ooyala/maven"
    libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"

Then: sbt assembly -> fat jar -> upload to job server. "provided" is used because we don't want sbt-assembly to include the whole job server jar in the fat jar. Java projects should be possible too.
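A fuller build.sbt sketch may help; the project name, Scala version, and Spark version below are illustrative assumptions for that era of the project, not taken from the talk:

    // Hypothetical build.sbt for a job server job project
    name := "jobserver-examples"
    scalaVersion := "2.10.4"                                      // assumed; match your Spark build
    resolvers += "Ooyala Bintray" at "http://dl.bintray.com/ooyala/maven"
    libraryDependencies ++= Seq(
      "ooyala.cnd" % "job-server" % "0.3.1" % "provided",         // the running server supplies this at runtime
      "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"   // assumed Spark version for job-server 0.3.1
    )

With the sbt-assembly plugin installed, sbt assembly then produces the fat jar you upload to the server.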

Example Job Server Job

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import scala.util.Try
    import spark.jobserver._  // SparkJob, SparkJobValid, SparkJobInvalid, SparkJobValidation

    /**
     * A super-simple Spark job example that implements the SparkJob trait and
     * can be submitted to the job server.
     */
    object WordCountExample extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
        Try(config.getString("input.string"))
          .map(x => SparkJobValid)
          .getOrElse(SparkJobInvalid("No input.string"))
      }

      override def runJob(sc: SparkContext, config: Config): Any = {
        val dd = sc.parallelize(config.getString("input.string").split(" ").toSeq)
        dd.map((_, 1)).reduceByKey(_ + _).collect().toMap
      }
    }

What's Different?

- The job does not create the SparkContext; the Job Server does
- You decide when the job runs: in its own context, or in a pre-created context
- Upload new jobs to diagnose your RDD issues:
    POST /contexts/newcontext
    POST /jobs... context=newcontext
- Upload a new diagnostic jar: POST /jars/newdiag
- Run the diagnostic jar to dump info on cached RDDs

Submitting and Running a Job

    curl --data-binary @../target/mydemo.jar localhost:8090/jars/demo
    OK

    curl -d "input.string = A lazy dog jumped mean dog" 'localhost:8090/jobs?appName=demo&classPath=WordCountExample&sync=true'
    {
      "status": "OK",
      "result": {
        "lazy": 1,
        "jumped": 1,
        "A": 1,
        "mean": 1,
        "dog": 2
      }
    }

Retrieve Job Statuses

    curl 'localhost:8090/jobs?limit=2'
    [{
      "duration": "77.744 secs",
      "classPath": "ooyala.cnd.CreateMaterializedView",
      "startTime": "2013-11-26T20:13:09.071Z",
      "context": "8b7059dd-ooyala.cnd.CreateMaterializedView",
      "status": "FINISHED",
      "jobId": "9982f961-aaaa-4195-88c2-962eae9b08d9"
    }, {
      "duration": "58.067 secs",
      "classPath": "ooyala.cnd.CreateMaterializedView",
      "startTime": "2013-11-26T20:22:03.257Z",
      "context": "d0a5ebdc-ooyala.cnd.CreateMaterializedView",
      "status": "FINISHED",
      "jobId": "e9317383-6a67-41c4-8291-9c140b6d8459"
    }]

Use Case: Fast Query Jobs

Spark as a Query Engine

- Goal: Spark jobs that run in under a second and answer queries over shared RDD data
- Query params are passed in as the job config
- Need to minimize context-creation overhead, so many jobs share the same SparkContext
- On-heap RDD caching means no serialization loss
- Need to consider concurrent jobs (fair scheduling)

Low-Latency Query Jobs (diagram): the REST Job Server creates a query context (new SparkContext) on the Spark executors and runs a job to load data from Cassandra into a cached RDD; subsequent query jobs hit the cached RDD and return results over REST.

Sharing Data Between Jobs

- RDD caching benefit: no need to serialize data; especially useful for indexes etc.
- The job server provides a NamedRdds trait for thread-safe CRUD of cached RDDs by name (compare to SparkContext's API, which uses an integer ID and is not thread-safe); see the sketch below
- For example, at Ooyala a number of fields are...
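A minimal sketch of a query job using named RDDs, assuming the NamedRddSupport trait and the getOrElseCreate method as they appear in the spark-jobserver repo of this era; the package name, RDD name, and input path are assumptions:

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver._  // SparkJob, SparkJobValid, SparkJobValidation, NamedRddSupport (assumed package)

    object SharedRddQueryJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

      override def runJob(sc: SparkContext, config: Config): Any = {
        // The first job in the context builds and caches the RDD; later jobs reuse it by name.
        val words = namedRdds.getOrElseCreate("shared.words") {
          sc.textFile("hdfs:///data/words.txt").cache()  // hypothetical input path
        }
        val term = config.getString("query.term")        // query param passed in as job config
        words.filter(_.contains(term)).count()           // fast when the RDD is already cached
      }
    }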

Data Concurrency

- Single writer, multiple readers
- Managing multiple updates to RDDs: the cache keeps track of which RDDs are being updated
- Example: thread A's Spark job creates RDD A at t0; thread B fetches RDD A at t1 > t0. Both threads A and B, using NamedRdds, get the RDD at time t2, when thread A finishes creating it (sketched below).
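A hedged sketch of that timeline, assuming update and get methods on the NamedRdds trait (method names follow the repo of this era; exact signatures may differ, and the RDD name and path are hypothetical):

    // Thread A (writer): replaces the named RDD; the cache marks it as being updated.
    val index = namedRdds.update("user.index", sc.textFile("hdfs:///data/users").cache())

    // Thread B (reader): a fetch that arrives while the RDD is still being created
    // waits until the writer finishes, so it sees the fully built RDD (time t2 above).
    val users = namedRdds.get[String]("user.index")
      .getOrElse(sys.error("user.index not cached"))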

Production Usage

Persistence

What gets persisted?
- Job status (success, error, why it failed)
- Job configuration
- Jars

JDBC database configuration:

    spark.sqldao.jdbc.url = "jdbc:mysql://dbserver:3306/jobserverdb"

Deployment and Metrics

- The spark-jobserver repo comes with a full suite of tests and deploy scripts:
    - server_deploy.sh for regular server pushes
    - server_package.sh for Mesos and Chronos (.tar.gz)
- /metricz route for codahale-metrics monitoring
- /healthz route for health checks

Challenges and Lessons

- Spark is built around contexts; we need a job server oriented around logical jobs
- Running multiple SparkContexts in the same process is hard:
    - Global use of system properties makes it impossible to start multiple contexts at the same time (but see pull request...)
    - Have to be careful with SparkEnv
- Dynamic jar and class loading is tricky
- Manage threads carefully: each context uses lots of threads

Future Work

Future Plans

- Spark-contrib project list, so this and other projects can gain visibility (SPARK-1283)
- HA mode using Akka Cluster or Mesos
- HA and hot failover for Spark drivers/contexts
- REST API for job progress
- Swagger API documentation

HA and Hot Failover for Jobs (diagram): Job Server 1 runs the active job context, which checkpoints to HDFS; Job Server 2 maintains a standby job context and communicates with Job Server 1 via gossip. When the job context dies, Job Server 2 notices, spins up the standby context, and restores from the checkpoint.

Thanks for your contributions! All of these were community contributed:
- index.html main page
- saving and retrieving job configuration
Your contributions are very welcome on GitHub!

Architecture

Completely Async Design

- http://spray.io - probably the fastest JVM HTTP microframework
- Akka Actor based, non-blocking
- Futures are used to manage individual jobs (note that Spark itself now uses Scala futures to manage job stages); see the sketch below
- Single JVM for now, but easy to distribute later via remote actors / Akka Cluster
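A minimal sketch of the futures idea (illustrative, not the job server's actual code): each job body runs in its own Future, so the HTTP layer never blocks on a running job.

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // Hypothetical helper: run a job body on a pool thread and return immediately.
    def runJobAsync(work: => Any): Future[Any] =
      Future {
        work  // executes off the HTTP request path
      }

In an actor-based design like this one, the resulting Future is typically piped back to the requesting actor (e.g. with Akka's pipeTo) rather than awaited on a thread.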

Async Actor Flow (diagram): the Spray web API sends requests to a request actor, which goes through a local supervisor to the Job Manager; the Job Manager runs Job 1 and Job 2 as Futures and reports to the Job Status Actor and the Job Result Actor.

Message flow fully documented

Thank you! And Everybody is Hiring!!

Using Tachyon

Pros:
- Off-heap storage: no GC
- Can be shared across multiple processes
- Data can survive process loss
- Backed by HDFS

Cons:
- ByteBuffer API: need to pay deserialization cost
- Does not support random-access writes