
Spark Lab 10 (ΕΡΓΑΣΤΗΡΙΟ 10). Prepared by George Nikolaides, 4/19/2015

Introduction to Apache Spark
Another cluster computing framework
Developed in the AMPLab at UC Berkeley
Started in 2009, open-sourced in 2010 under a BSD license
Proven scalability to over 8000 nodes in production

Key points about Spark
Native integration with Java, Scala and Python through its API
Programming at a higher level of abstraction
Handles batch, interactive, and real-time processing within a single framework
Map/reduce is just one of the supported sets of constructs; others include structured and relational query processing and machine learning

Installation
In your EPL451 Virtual Machine, download Spark:
  wget http://www.webhostingjams.com/mirror/apache/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.3.tgz
Unpack the archive:
  tar xvzf spark-1.2.1-bin-hadoop2.3.tgz
Change directory to the unpacked folder:
  cd spark-1.2.1-bin-hadoop2.3
Start the Spark shell:
  ./bin/spark-shell
NOTE: a JDK is required (already installed on your VM)

Spark Web UI
In a browser, go to localhost:4040

Spark Essentials: SparkContext
The first thing a Spark program does is create a SparkContext object, which tells Spark how to access a cluster
In the shell (Scala or Python) this is the sc variable, which is created automatically
Other programs must use a constructor to instantiate a new SparkContext
The SparkContext is then used to create the other variables (RDDs, broadcast variables, accumulators)

Hands on - Spark Shell
From the scala> REPL prompt, type:
  scala> sc
  res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@470d1f30

Spark Essentials: Master
The master parameter for a SparkContext determines which cluster to use:
  local              run Spark locally with one worker thread (no parallelism)
  local[K]           run Spark locally with K worker threads (ideally set to the number of cores)
  spark://HOST:PORT  connect to a Spark standalone cluster; PORT depends on config (7077 by default)
  mesos://HOST:PORT  connect to a Mesos cluster; PORT depends on config (5050 by default)

Hands on - Spark Shell
Change the master parameter
Through a program:
  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  val conf = new SparkConf().setMaster("local[2]")
  val sc = new SparkContext(conf)
From the command line:
  ./bin/spark-submit --master local[2]

Spark Essentials: Master
See spark.apache.org/docs/1.2.0/cluster-overview.html
[Cluster overview diagram: the Driver Program (SparkContext) talks to a Cluster Manager, which allocates worker Nodes; each Node runs an Executor holding a cache and running tasks.]

Spark Essentials: Master
The SparkContext:
1. connects to a cluster manager, which allocates resources across applications
2. acquires executors on cluster nodes - worker processes that run computations and store data
3. sends application code to the executors
4. sends tasks for the executors to run
[Same cluster overview diagram as on the previous slide.]

Spark Essentials: RDD
Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel
There are currently two types:
parallelized collections - take an existing Scala collection and run functions on it in parallel
Hadoop datasets - run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop
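A minimal sketch of both creation paths in the Scala shell; the file name data.txt is just a placeholder, not part of the lab:
  // Parallelized collection: distribute an existing Scala collection
  val nums = sc.parallelize(1 to 100)
  nums.reduce(_ + _)
  // Hadoop dataset: one element per line of a text file (data.txt is a placeholder)
  val lines = sc.textFile("data.txt")
  lines.count()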

Spark Essentials: RDD
Two types of operations on RDDs: transformations and actions
Transformations are lazy (not computed immediately)
A transformed RDD is recomputed each time an action is run on it (by default)
However, an RDD can be persisted into storage, in memory or on disk

Hands on - Spark Shell
From the scala> REPL prompt, let's create some data:
  val data = 1 to 10000
then create an RDD based on that data:
  val distData = sc.parallelize(data)
finally use a filter to select values less than 10:
  distData.filter(_ < 10).collect()

Spark Essentials: RDD
Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., the local file system, Amazon S3, Hypertable, HBase, etc.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /data/201404*)
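A short sketch of the file-based variants in the shell; the paths below are placeholders and do not exist on the VM:
  // Every file under a directory, one RDD element per line
  val logs = sc.textFile("hdfs://localhost:54310/logs/")
  // A glob over many files
  val april = sc.textFile("/data/201404*")
  // A SequenceFile of (Text, IntWritable) pairs, read as (String, Int)
  val pairs = sc.sequenceFile[String, Int]("/data/pairs.seq")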

Hands on - Spark Shell
Now let's run the word count example:
  val f = sc.textFile("README.md")
  val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
  words.reduceByKey(_ + _).collect.foreach(println)
  words.reduceByKey(_ + _).saveAsTextFile("words_output")
Of course, README.md could be located in HDFS as well:
  val f = sc.textFile("hdfs://localhost:54310/user/hduser/input/README.md")
  val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
  words.reduceByKey(_ + _).collect.foreach(println)
  words.reduceByKey(_ + _).saveAsTextFile("words_output")

Spark Essentials: Transformations
Transformations create a new dataset from an existing one
All transformations in Spark are lazy: they do not compute their results right away
Instead, Spark remembers the transformations applied to some base dataset, which lets it:
optimize the required calculations
recover from lost data partitions
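A minimal illustration of laziness in the shell, assuming the README.md from the word-count example is in the current directory:
  val f = sc.textFile("README.md")           // no file I/O yet
  val longLines = f.filter(_.length > 80)    // still nothing computed
  longLines.toDebugString                    // prints the recorded lineage of transformations
  longLines.count()                          // action: now the file is read and filtered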

Spark Essentials: Transformations & Actions
A full list of the supported transformations and actions can be found at the following links:
http://spark.apache.org/docs/0.9.0/scala-programming-guide.html#transformations
http://spark.apache.org/docs/0.9.0/scala-programming-guide.html#actions

Spark Examples: Estimate Pi
Now, try using a Monte Carlo method to estimate the value of Pi:
  ./bin/run-example SparkPi 2 local

Spark Examples: Estimate Pi
  import scala.math.random
  import org.apache.spark._

  /** Computes an approximation to pi */
  object SparkPi {
    def main(args: Array[String]) {
      val conf = new SparkConf().setAppName("Spark Pi")
      val spark = new SparkContext(conf)
      val slices = if (args.length > 0) args(0).toInt else 2
      val n = 100000 * slices
      val count = spark.parallelize(1 to n, slices).map { i =>
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x*x + y*y < 1) 1 else 0
      }.reduce(_ + _)
      println("Pi is roughly " + 4.0 * count / n)
      spark.stop()
    }
  }

Spark Examples: Estimate Pi
The same computation, annotated:
  val count = sc.parallelize(1 to n, slices)  // base RDD
    .map { i =>                               // transformation -> transformed RDD
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }
    .reduce(_ + _)                            // action -> value
[Diagram: transformations chain one RDD into the next; the final action turns the last RDD into a value on the driver.]

Spark Essentials: Persistence
Spark can persist (or cache) a dataset in memory across operations
Each node stores in memory any partitions of the dataset that it computes and reuses them in other actions on that dataset, often making future actions more than 10x faster
The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it

Spark Essentials: Persistence
Storage levels:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY - Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
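A short sketch of requesting a non-default level explicitly, reusing the word-count RDD from earlier (a minimal example, not part of the original lab):
  import org.apache.spark.storage.StorageLevel
  val f = sc.textFile("README.md")
  val words = f.flatMap(_.split(" ")).map((_, 1))
  // Serialized in memory, spilling to disk when it does not fit
  words.persist(StorageLevel.MEMORY_AND_DISK_SER)
  words.reduceByKey(_ + _).count()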

Spark Essentials: Persistence
  val f = sc.textFile("README.md")
  val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
  w.reduceByKey(_ + _).collect.foreach(println)
How to choose a storage level:
If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If not, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, re-computing a partition may be as fast as reading it from disk.
Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by re-computing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to re-compute a lost partition.

Spark Deconstructed: Log Mining Example
  // load error messages from a log into memory
  // then interactively search for various patterns

  // base RDD
  val lines = sc.textFile("hdfs://...")

  // transformed RDDs
  val errors = lines.filter(_.startsWith("ERROR"))
  val messages = errors.map(_.split("\t")).map(r => r(1))
  messages.cache()

  // action 1
  messages.filter(_.contains("mysql")).count()

  // action 2
  messages.filter(_.contains("php")).count()

Spark Deconstructed: Log Mining Example
The following slides step through how the cluster executes the code above:
The driver builds the base RDD (lines) and the transformed RDDs (errors, messages); no work happens yet.
The input file's HDFS blocks (block 1, 2, 3) are located and tasks are scheduled on the worker nodes holding them.
Each worker reads its HDFS block.
When action 1 (the "mysql" count) runs, each worker processes its block and, because of messages.cache(), keeps the resulting partition in its cache (cache 1, 2, 3).
When action 2 (the "php" count) runs, each worker processes the data straight from its cache - no HDFS read is needed.
[Diagram on each slide: the Driver on one side and three worker nodes on the other, each node holding an HDFS block and a cache.]

Spark Examples: K-Means
Next, try using K-Means to cluster a set of vector values:
  cp data/mllib/kmeans_data.txt .
  ./bin/run-example SparkKMeans kmeans_data.txt 3 0.01 local
Based on the data set:
  0.0 0.0 0.0
  0.1 0.1 0.1
  0.2 0.2 0.2
  9.0 9.0 9.0
  9.1 9.1 9.1
  9.2 9.2 9.2
Please refer to the source code in:
  examples/src/main/scala/org/apache/spark/examples/SparkKMeans.scala
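As an alternative to the example program, MLlib's KMeans can be called directly from the shell; a minimal sketch, assuming kmeans_data.txt has been copied to the working directory as above (k = 3 and 20 iterations are arbitrary choices):
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors
  // Parse each line "x y z" into a dense vector
  val points = sc.textFile("kmeans_data.txt")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()
  val model = KMeans.train(points, 3, 20)   // k = 3 clusters, 20 iterations
  model.clusterCenters.foreach(println)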

Spark Examples: PageRank
Next, try using PageRank to rank the relationships in a graph:
  cp data/mllib/pagerank_data.txt .
  ./bin/run-example SparkPageRank pagerank_data.txt 10 local
Based on the data set (each line is an edge "source target"):
  1 2
  1 3
  1 4
  2 1
  3 1
  4 1
Please refer to the source code in:
  examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala

Spark Examples: Spark SQL
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  // Define the schema using a case class.
  case class Person(name: String, age: Int)

  // Create an RDD of Person objects and register it as a table.
  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
  people.registerAsTable("people")

  // SQL statements can be run using the sql method provided by sqlContext.
  val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

  // The results of SQL queries are SchemaRDDs and support all the
  // normal RDD operations.
  // The columns of a row in the result can be accessed by ordinal.
  teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
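The same query pattern works for self-describing formats; a hedged sketch using a line-delimited JSON file, letting Spark SQL infer the schema (the people.json path is an assumption, not part of the lab):
  // people.json is assumed to contain lines like {"name":"Ann","age":15}
  val peopleJson = sqlContext.jsonFile("examples/src/main/resources/people.json")
  peopleJson.registerTempTable("people_json")
  val teens = sql("SELECT name FROM people_json WHERE age >= 13 AND age <= 19")
  teens.map(t => "Name: " + t(0)).collect().foreach(println)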

Spark Essentials: Broadcast Variables
Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
For example, to give every node a copy of a large input dataset efficiently
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
  val broadcastVar = sc.broadcast(Array(1, 2, 3))
  broadcastVar.value
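A minimal sketch of the usual pattern, with a hypothetical lookup table that every task needs: broadcast it once and read it through .value inside the closure.
  // Hypothetical country-code lookup table, cached on every executor
  val codes = sc.broadcast(Map("CY" -> "Cyprus", "GR" -> "Greece"))
  val records = sc.parallelize(Seq("CY", "GR", "CY"))
  // Tasks read the broadcast value instead of shipping the map with each task
  records.map(c => codes.value.getOrElse(c, "unknown")).collect()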

Spark Essentials: Accumulators
Accumulators are variables that can only be added to through an associative operation
Used to implement counters and sums efficiently in parallel
Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend this to new types
Only the driver program can read an accumulator's value, not the tasks
  val accum = sc.accumulator(0)
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  accum.value   // read on the driver side
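A short sketch of the typical counter use on made-up data (not part of the original lab): count malformed records while parsing, without a separate pass.
  val badRecords = sc.accumulator(0)
  val raw = sc.parallelize(Seq("1", "2", "oops", "4"))
  val parsed = raw.flatMap { s =>
    try Some(s.toInt)
    catch { case _: NumberFormatException => badRecords += 1; None }
  }
  parsed.count()      // action forces the parse to run
  badRecords.value    // read the counter on the driver
Note that updates made inside transformations can be applied more than once if a task is re-executed; for exact counts, Spark recommends updating accumulators inside actions.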

Spark Examples: Spark Streaming
  # in one terminal run the NetworkWordCount example in Spark Streaming,
  # expecting a data stream on the localhost:8080 TCP socket
  $ ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 8080

  # in another terminal use Netcat (http://nc110.sourceforge.net/)
  # to generate a data stream on the localhost:8080 TCP socket
  $ nc -lk 8080
  hello world
  hi there fred
  what a nice world there

Spark Examples: Spark Streaming
  import org.apache.spark.streaming._
  import org.apache.spark.streaming.StreamingContext._

  // Create a StreamingContext with a SparkConf configuration
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  // Create a DStream that will connect to serverIP:serverPort
  val lines = ssc.socketTextStream(serverIP, serverPort)

  // Split each line into words
  val words = lines.flatMap(_.split(" "))

  // Count each word in each batch
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)

  // Print a few of the counts to the console
  wordCounts.print()

  ssc.start()             // Start the computation
  ssc.awaitTermination()  // Wait for the computation to terminate
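The snippet above assumes sparkConf, serverIP and serverPort are defined elsewhere; a self-contained sketch under those assumptions (the host and port simply mirror the Netcat example, this is not the official example source):
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  object NetworkWordCountSketch {
    def main(args: Array[String]) {
      // Local master with two threads: one for the socket receiver, one for processing
      val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 8080)
      val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }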