Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany

Apache Spark 1 / 26
Outline
1. What is Spark
2. Working with Spark
2.1 Spark Context
2.2 Resilient Distributed Datasets
3. MLLib: Machine Learning with Spark
1. What is Spark
Spark Overview
Apache Spark is an open source framework for large scale data processing and analysis.
Main ideas:
- Processing occurs where the data resides
- Avoid moving data over the network
- Works with the data in memory
Technical details:
- Written in Scala
- Works seamlessly with Java and Python
- Developed at UC Berkeley
Apache Spark Stack
- Data platform: distributed file system / database, e.g. HDFS, HBase, Cassandra
- Execution environment: single machine or a cluster (Standalone, EC2, YARN, Mesos)
- Spark Core: the Spark API
- Spark Ecosystem: libraries of common algorithms (MLLib, GraphX, Streaming)
Apache Spark Ecosystem
[Figure: the Apache Spark ecosystem]
How to use Spark
Spark can be used through:
- The Spark shell: available in Python and Scala; useful for learning the framework
- Spark applications: available in Python, Java and Scala; for serious large scale processing
2. Working with Spark
Working with Spark
Working with Spark requires accessing a Spark Context:
- Main entry point to the Spark API
- Already preconfigured in the shell
Most of the work in Spark is a set of operations on Resilient Distributed Datasets (RDDs):
- Main data abstraction
- The data used and generated by the application is stored as RDDs
Spark Interactive Shell (Python)

$ ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkContext available as sc, SQLContext available as sqlContext.
>>>
Spark Context
The Spark Context is the main entry point for Spark functionality. It:
- Represents the connection to a Spark cluster
- Allows creating RDDs
- Allows broadcasting variables on the cluster
- Allows creating accumulators
Resilient Distributed Datasets (RDDs)
A Spark application stores data as RDDs:
- Resilient: if data in memory is lost, it can be recreated (fault tolerance)
- Distributed: stored in memory across different machines
- Dataset: data coming from a file or generated by the application
A Spark program is essentially a set of operations on RDDs.
RDDs are immutable: operations on RDDs may create new RDDs but never change an existing one.
Resilient Distributed Datasets (RDDs)
[Figure: an RDD and its elements]
RDD elements:
- can be stored on different machines (transparent to the developer)
- can have various types
RDD Data Types
An element of an RDD can be of any type as long as it is serializable. Examples:
- Primitive types: integers, characters, strings, floating point numbers, ...
- Sequences: lists, arrays, tuples, ...
- Pair RDDs: key-value pairs
- Serializable Scala/Java objects
A single RDD may have elements of different types.
Some specific element types have additional functionality.
Example: Text File to RDD
File: mydiary.txt
I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

RDD: my (one element per line of the file)
I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.
RDD operations
There are two types of RDD operations:
Actions: return a value based on the RDD. Examples:
- count(): returns the number of elements in the RDD
- first(): returns the first element in the RDD
- take(n): returns an array with the first n elements in the RDD
- reduce(f): aggregates all elements of the RDD using the function f
Transformations: create a new RDD based on the current one. Examples:
- filter(f): returns the elements of the RDD which match a given criterion
- map(f): applies a particular function to each RDD element
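The action/transformation split above can be modeled in a few lines of plain Python. The `MiniRDD` class below is purely illustrative (it is not part of Spark, and it runs on a single machine): transformations return a new `MiniRDD`, actions return a plain value.

```python
# A minimal, hypothetical plain-Python model of the RDD API described above.
# It only illustrates the action/transformation distinction, not distribution.
class MiniRDD:
    def __init__(self, elements):
        self.elements = list(elements)  # treated as immutable by convention

    # --- Actions: return a plain value ---
    def count(self):
        return len(self.elements)

    def first(self):
        return self.elements[0]

    def take(self, n):
        return self.elements[:n]

    def reduce(self, f):
        # reduce is an action: it folds all elements into one value
        result = self.elements[0]
        for x in self.elements[1:]:
            result = f(result, x)
        return result

    # --- Transformations: return a NEW MiniRDD, never modify self ---
    def filter(self, pred):
        return MiniRDD(x for x in self.elements if pred(x))

    def map(self, f):
        return MiniRDD(f(x) for x in self.elements)


rdd = MiniRDD([1, 2, 3, 4, 5])
evens = rdd.filter(lambda x: x % 2 == 0)    # transformation -> new MiniRDD
print(evens.count())                        # action -> 2
print(rdd.map(lambda x: x * x).take(3))     # [1, 4, 9]
print(rdd.reduce(lambda x, y: x + y))       # 15
```

Note how `rdd` itself is never changed: `filter` and `map` produce new datasets, mirroring the immutability of real RDDs.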
Actions vs. Transformations
[Figure: an action turns an RDD into a value; a transformation turns a base RDD into a new RDD]
Action examples
File: mydiary.txt, loaded into the RDD my (one element per line):
I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

>>> my = sc.textFile("mydiary.txt")
>>> my.count()
5
>>> my.first()
u'I had breakfast this morning.'
>>> my.take(2)
[u'I had breakfast this morning.', u'The coffee was really good.']
Transformation examples
RDD: my
I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

filter(lambda line: "I " in line) keeps only the elements matching the criterion:
RDD: filtered
I had breakfast this morning.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

map(lambda line: line.upper()) applies the function to every element:
RDD: filterMap
I HAD BREAKFAST THIS MORNING.
I DIDN'T LIKE THE BREAD THOUGH.
BUT I HAD CHEESE.
OH I LOVE CHEESE.
Transformation examples

>>> filtered = my.filter(lambda line: "I " in line)
>>> filtered.count()
4
>>> filtered.take(4)
[u'I had breakfast this morning.', u"I didn't like the bread though.",
 u'But I had cheese.', u'Oh I love cheese.']
>>> filterMap = filtered.map(lambda line: line.upper())
>>> filterMap.count()
4
>>> filterMap.take(4)
[u'I HAD BREAKFAST THIS MORNING.', u"I DIDN'T LIKE THE BREAD THOUGH.",
 u'BUT I HAD CHEESE.', u'OH I LOVE CHEESE.']
Operations on specific types
Numeric RDDs have special operations: mean(), min(), max(), stdev(), ...

>>> linelens = my.map(lambda line: len(line))
>>> linelens.collect()
[29, 27, 31, 17, 17]
>>> linelens.mean()
24.2
>>> linelens.min()
17
>>> linelens.max()
31
>>> linelens.stdev()
6.0133185513491636
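The numbers in the session above can be checked in plain Python. Note that Spark's stdev() is the population standard deviation (it divides by n, not by n - 1):

```python
# Plain-Python check of the numeric RDD results above: mean(), min(), max()
# and stdev() on the line lengths [29, 27, 31, 17, 17].
import math

linelens = [29, 27, 31, 17, 17]

mean = sum(linelens) / float(len(linelens))
# Population standard deviation: divide the squared deviations by n.
stdev = math.sqrt(sum((x - mean) ** 2 for x in linelens) / len(linelens))

print(mean)                          # 24.2
print(min(linelens), max(linelens))  # 17 31
print(stdev)                         # 6.0133185513491636
```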
Operations on Key-Value Pairs
Pair RDDs contain two-element tuples (K, V):
- Keys and values can be of any type
- Extremely useful for implementing MapReduce algorithms
Examples of operations: groupByKey, reduceByKey, aggregateByKey, sortByKey, ...
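What these operations compute can be sketched in plain Python (the snake_cased helper functions below are hypothetical stand-ins for the camel-cased Spark operations, and ignore distribution entirely):

```python
# Plain-Python sketch of the semantics of groupByKey, reduceByKey and
# sortByKey on a list of (key, value) pairs. Illustration only -- the real
# Spark operations run distributed across the cluster.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(pairs):
    # groupByKey: collect all values sharing a key into one list
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, f):
    # reduceByKey: fold the values of each key with f, two at a time
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

print(group_by_key(pairs))                       # {'a': [1, 3, 5], 'b': [2, 4]}
print(reduce_by_key(pairs, lambda x, y: x + y))  # {'a': 9, 'b': 6}
print(sorted(pairs))                             # sortByKey: order pairs by key
```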
Word Count Example
Map:
- Input: document-word list pairs
- Output: word-count pairs
(d_k, ⟨w_1, ..., w_m⟩) → [(w_i, c_i)]
Reduce:
- Input: word-(count list) pairs
- Output: word-count pairs
(w_i, [c_i]) → (w_i, Σ_{c ∈ [c_i]} c)
Word Count Example
[Figure: mappers split the documents d1 "love ain't no stranger", d2 "crying in the rain", d3 "looking for love", d4 "I'm crying", d5 "the deeper the love", d6 "is this love" and d7 "Ain't no love" into (word, 1) pairs; reducers sum the counts per word, e.g. (love, 5), (crying, 2), (ain't, 2)]
Word Count on Spark
[Figure: flatMap splits each line of my into words, map emits (word, 1) pairs, reduceByKey sums the counts per word]

>>> counts = my.flatMap(lambda line: line.split()) \
...            .map(lambda word: (word, 1)) \
...            .reduceByKey(lambda x, y: x + y)
ReduceByKey
.reduceByKey(lambda x, y: x + y)
reduceByKey works a little differently from the MapReduce reduce function. The function f passed to it:
- takes two arguments: it combines two values at a time associated with the same key
- must be commutative: f(x, y) = f(y, x)
- must be associative: f(f(x, y), z) = f(x, f(y, z))
Spark does not guarantee in which order the reduce functions are executed!
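A plain-Python illustration of why both properties matter: since Spark may combine partial results in any order and grouping, only functions whose result is independent of order and grouping are safe. Addition qualifies; subtraction does not:

```python
# Why the function passed to reduceByKey must be commutative and associative.
f_add = lambda x, y: x + y   # safe for reduceByKey
f_sub = lambda x, y: x - y   # NOT safe for reduceByKey

x, y, z = 10, 3, 2

# Addition: any grouping (associativity) and any order (commutativity)
# yields the same result, so every execution plan agrees.
assert f_add(f_add(x, y), z) == f_add(x, f_add(y, z)) == 15
assert f_add(x, y) == f_add(y, x)

# Subtraction: the result depends on grouping, so a reduceByKey using it
# would return different answers depending on how Spark schedules the work.
print(f_sub(f_sub(x, y), z))  # (10 - 3) - 2 = 5
print(f_sub(x, f_sub(y, z)))  # 10 - (3 - 2) = 9
```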
Considerations
Spark provides a much more efficient MapReduce implementation than Hadoop:
- Higher level API
- In-memory storage (less I/O overhead)
- Chaining MapReduce operations is simplified: a sequence of MapReduce passes can be done in one job
Spark vs. Hadoop on training a logistic regression model:
[Figure: runtime comparison]
Source: Apache Spark. https://spark.apache.org/
3. MLLib: Machine Learning with Spark
Overview
MLLib is a Spark machine learning library containing implementations for:
- Computing basic statistics from datasets
- Classification and regression
- Collaborative filtering
- Clustering
- Feature extraction and dimensionality reduction
- Frequent pattern mining
- Optimization algorithms
Logistic Regression with MLLib
Import the necessary packages:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD

Read the data (LibSVM format):

dataset = MLUtils.loadLibSVMFile(sc, "/mllib/sample_libsvm_data.txt")
Logistic Regression with MLLib
Train the model:

model = LogisticRegressionWithSGD.train(dataset)

Evaluate:

labelsAndPreds = dataset.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(dataset.count())
print("Training Error = " + str(trainErr))
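To make the pipeline above less of a black box, here is a hedged plain-Python sketch of the idea behind LogisticRegressionWithSGD: minimize the logistic loss by stochastic gradient descent. Everything here (the helper names, the tiny hand-made dataset, the fixed step size) is a toy illustration, not MLlib's actual implementation:

```python
# Toy stochastic gradient descent for logistic regression, mirroring the
# train / predict / evaluate steps of the MLlib example above.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg_sgd(data, n_features, iterations=200, step=0.5):
    """data: list of (label, features) pairs with label in {0, 1}."""
    w = [0.0] * n_features
    for i in range(iterations):
        label, x = data[i % len(data)]             # cycle through the points
        pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        grad_scale = pred - label                  # gradient of logistic loss
        w = [wi - step * grad_scale * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0

# Tiny separable dataset: first feature is a bias term, label is 1 iff the
# second feature is large.
data = [(0, [1.0, 0.1]), (0, [1.0, 0.3]), (1, [1.0, 2.5]), (1, [1.0, 2.9])]
w = train_logreg_sgd(data, n_features=2)
train_err = sum(predict(w, x) != y for y, x in data) / float(len(data))
print("Training Error = " + str(train_err))
```

MLlib performs the same loop, but computes gradients in parallel over the partitions of the RDD and averages them per iteration.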