Write Once, Run Anywhere Pat McDonough

1 Write Once, Run Anywhere Pat McDonough

2 Write Once, Run Anywhere

3 Write Once, Run Anywhere You Might Have Heard This Before!

4 Java, According to Wikipedia

5 Java, According to Wikipedia Java is a computer programming language specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA)

6 Java & WORA in the First Decade Java Client Applications Apps with GUIs (AWT or Swing) could be deployed to any OS with a JVM

7 Java & WORA in the First Decade Java Client Applications Apps with GUIs (AWT or Swing) could be deployed to any OS with a JVM Neat! but not all that useful - people don't want non-native GUIs

8 Java & WORA in the First Decade

9 Java & WORA in the First Decade Applets A way to deliver rich GUIs to many different platforms through the browser [Insert Ugly Applet Here]

10 Java & WORA in the First Decade Applets A way to deliver rich GUIs to many different platforms through the browser [Insert Ugly Applet Here] Neat!

11 Java & WORA in the First Decade Applets A way to deliver rich GUIs to many different platforms through the browser [Insert Ugly Applet Here] Neat! but in practice mostly ended up producing gimmicky website animations

12 Java & WORA in the First Decade Back-end Applications! Windows Desktop for an IDE Unix Server for Production Neat! And actually useful too

13 Java & WORA in the First Decade Back-end Java starts to formalize around standards > J2EE: core libraries, deployment formats, etc. Vendors offer J2EE app servers. Ironically, this immediately led to no more WORA: specific app servers required a specific SDK or even a specific IDE

14 Java & WORA in the First Decade Fixing WORA on the back-end: Fall back to the Least Common Denominator > Servlets (usually via Tomcat). Spring comes to dominate as the SDK of choice for Java back-end applications: specifically designed to have as few implementation dependencies as possible

15 So yes, you've heard this before

16 So yes, you've heard this before Which examples apply to the state of the Big Data ecosystem?

17 Important Changes Since Then Vendor Standards Open Source Data has overwhelmed us Distributed Systems Are The New Standard (specifically, Data Parallel systems)

18 Big Data Platforms Are Everywhere Now But Where Are the Big Data Applications? Big Data Applications don't exist very far beyond connecting ODBC/JDBC or simple ETL integrations. Why?! Too many disparate systems to piece together. Complicated matrix of compile-time and runtime dependencies across distributions, i.e. each distribution effectively has its own SDK

19 The Big Data Ecosystem Needs a Common SDK

20 The Big Data Ecosystem Needs a Common SDK Apache Spark is the answer

21 Spark An SDK for Big Data Applications SQL MLlib Streaming GraphX Core

22 Spark An SDK for Big Data Applications SQL MLlib Streaming GraphX Core Unified System With Libraries to Build a Complete Solution! Full-featured Programming Environment

23 Spark An SDK for Big Data Applications SQL MLlib Streaming GraphX Core Unified System With Libraries to Build a Complete Solution! Full-featured Programming Environment Single, Consistent Interface for Developers to Write Against! Runtimes available on several platforms

24 Develop Big Data Applications Python/Scala/Java SQL MLlib Streaming GraphX Dependencies Core Your Application

25 Develop Big Data Applications SQL MLlib Streaming GraphX Python/Scala/Java Dependencies Your Application Spark APIs Core Develop Applications using your preferred language, using existing libraries, using Spark's Public APIs (SparkContext, RDDs)

26 Work With Data SQL MLlib Streaming GraphX Python/Scala/Java Dependencies Core Your Application Data HDFS* Local S3 JDBC Cassandra

27 Work With Data Python/Scala/Java Dependencies Your Application Spark APIs SQL MLlib Streaming GraphX Core Spark Internals Care For Scheduling Data Operations Data Access & Scheduling Data HDFS* Local S3 JDBC Cassandra

28 Run Your Applications Python/Scala/Java SQL MLlib Streaming GraphX Dependencies Core Your Application YARN Mesos Spark Standalone Cluster

29 Run Your Applications Python/Scala/Java SQL MLlib Streaming GraphX Dependencies Core Your Application Submit Your Application and the Spark Runtime to a Cluster Manager YARN Mesos Spark Standalone Cluster

30 The Complete Picture Python/Scala/Java SQL MLlib Streaming GraphX Dependencies Core Your Application YARN Mesos Spark Standalone Clusters Data HDFS* Local S3 JDBC Cassandra

31 The Complete Picture Python/Scala/Java SQL MLlib Streaming GraphX Dependencies Core Your Application Spark Abstracts Runtime Dependencies from Developers YARN Mesos Spark Standalone Clusters Data HDFS* Local S3 JDBC Cassandra

32 How Spark Handles Hadoop Dependencies The Spark library is compiled for compatibility with a specific Hadoop version. At runtime, Spark uses reflection to find any Hadoop classes it needs. SQL MLlib Streaming GraphX Core Examples:
    # Apache Hadoop 2.2.X
    mvn -Pyarn -Phadoop-2.2 \
      -Dhadoop.version=2.2.0 \
      -DskipTests clean package
    # CDH with MapReduce v1
    mvn -Dhadoop.version= mr1-cdh -DskipTests \
      clean package

33 How Spark Handles Hadoop Dependencies The Spark library is compiled for compatibility with a specific Hadoop version. At runtime, Spark uses reflection to find any Hadoop classes it needs. SQL MLlib Streaming GraphX Hadoop Client Core Examples:
    # Apache Hadoop 2.2.X
    mvn -Pyarn -Phadoop-2.2 \
      -Dhadoop.version=2.2.0 \
      -DskipTests clean package
    # CDH with MapReduce v1
    mvn -Dhadoop.version= mr1-cdh -DskipTests \
      clean package
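
The runtime lookup mentioned on slides 32 and 33 can be illustrated with plain Java reflection. This is only a minimal sketch under stated assumptions, not Spark's actual code: the class `HadoopClassProbe` and its method names are hypothetical, and the two Hadoop class names are used purely as illustrative candidates that may or may not be on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: picks, at runtime, whichever candidate class
// is actually present on the classpath.
public class HadoopClassProbe {

    // Try to load a class by name; empty if it is not on the classpath.
    public static Optional<Class<?>> findClass(String className) {
        try {
            return Optional.of(Class.forName(className));
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }

    // Return the first candidate that can be loaded, checking in order.
    public static Optional<Class<?>> firstAvailable(List<String> candidates) {
        for (String name : candidates) {
            Optional<Class<?>> found = findClass(name);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative candidates only (assumption): a newer class name
        // first, then an older fallback.
        List<String> candidates = Arrays.asList(
            "org.apache.hadoop.mapreduce.task.JobContextImpl",
            "org.apache.hadoop.mapreduce.JobContext");
        Optional<Class<?>> found = firstAvailable(candidates);
        System.out.println(
            found.map(Class::getName).orElse("no Hadoop classes on the classpath"));
    }
}
```

Because the class is resolved by name at runtime, the calling code compiles without a hard dependency on either candidate; which one is used depends on the Hadoop client jars the platform provides.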

34 Spark Support Included on Big Data Platforms While this build process is very easy, it's even easier to have the runtime pre-built. Platform support also indicates stronger integration testing, vendor support, and integrated management tools SQL MLlib Streaming GraphX Hadoop Client Core

35 Spark 1.0

36 Spark-Submit Spark-submit provides a consistent way to launch jobs regardless of platform. Includes an important clean-up to make configurations more consistent
    # Run on a Spark standalone cluster
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark:// :7077 \
      --executor-memory 20G \
      --total-executor-cores 100 \
      /path/to/examples.jar \
      1000

    # Run on a YARN cluster
    export HADOOP_CONF_DIR=XXX
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn-cluster \
      --executor-memory 20G \
      --num-executors 50 \
      /path/to/examples.jar \
      1000

37 Spark SQL We actually wrestled with the name a bit because it's not only about SQL. SQL is actually not the only developer interface - there is also a DSL. Spark SQL introduces SchemaRDDs and an Optimizer (Catalyst). This provides a deeper integration for any structured data
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    val people: RDD[Person] = ... // An RDD of case class objects

    // The following is the same as
    // 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
    val teenagers = people.where('age >= 10).where('age <= 19).select('name)

38 Databricks Is Committed to Growing Apache Spark's Developer Ecosystem Developer Training, Online Materials, Free Resources Strong Commitment to Open Source Certification Programs

39 We're Hiring! Evangelists Trainers Solutions Architects Software Engineers
