CS 378 Big Data Programming. Lecture 24 RDDs

1 CS 378 Big Data Programming Lecture 24 RDDs

2 Review
Assignment 10
- Download and run Spark
- WordCount implementation
- WordCount alternative implementation

3 Basic RDD
Transformations we've discussed:
- filter(function)
- map(function)
- flatMap(function)
- mapToPair(function)
How do these work for multiple partitions? Is a shuffle over the network required?
MLSS 2015 Big Data Programming
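
One way to approach the partition question is to model an RDD as a list of partitions and apply the function inside each partition on its own. The sketch below is plain Java, not Spark code; the class and method names are hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java stand-in (not Spark): an "RDD" is modelled as a list of partitions.
class PartitionSketch {
    // Applies the predicate inside each partition separately.
    // No element moves to another partition, so nothing crosses the network.
    static <T> List<List<T>> filterPerPartition(List<List<T>> partitions, Predicate<T> keep) {
        List<List<T>> result = new ArrayList<>();
        for (List<T> partition : partitions) {
            result.add(partition.stream().filter(keep).collect(Collectors.toList()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5, 6));
        // prints [[2], [4, 6]]
        System.out.println(filterPerPartition(partitions, x -> x % 2 == 0));
    }
}
```

The same per-partition reasoning applies to map, flatMap, and mapToPair, since each output element depends on exactly one input element.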

4 Filter
Apply a function to each element of an RDD; return those elements for which the function evaluates to true.
- Source RDD element type: T
- Result RDD element type: T
- Java function (class) type: Function<T, Boolean>
- Java method: Boolean call(T t)

5 Map
Apply a function to each element of an RDD; return the result of applying the function.
- Source RDD element type: T
- Result RDD element type: R
- Java function (class) type: Function<T, R>
- Java method: R call(T t)
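
The type change from T to R can be illustrated with a small local example. This is a minimal sketch in plain Java that uses java.util.function.Function (whose method is apply) as a stand-in for the Spark Function interface (whose method is call); MapSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java stand-in (not Spark): a List plays the role of the RDD.
class MapSketch {
    // One output element of type R for every input element of type T.
    static <T, R> List<R> map(List<T> source, Function<T, R> f) {
        return source.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd");
        // T is String, R is Integer; prints [5, 3]
        System.out.println(map(words, String::length));
    }
}
```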

6 Flat Map
Apply a function to each element of an RDD; return the result of applying the function (an Iterable). The elements of each Iterable become elements of the result RDD.
- Source RDD element type: T
- Result RDD element type: R
- Java function (class) type: FlatMapFunction<T, R>
- Java method: Iterable<R> call(T t)
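
The difference from map is that each input element may yield zero, one, or many output elements. A minimal sketch, assuming lines of text split on single spaces; it uses Java streams as a local stand-in rather than the Spark API, and FlatMapSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java stand-in (not Spark): a List plays the role of the RDD.
class FlatMapSketch {
    // Each line produces several words; the per-line results are
    // flattened into one list instead of a list of lists.
    static List<String> splitLines(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two input elements become three output elements: [a, b, c]
        System.out.println(splitLines(Arrays.asList("a b", "c")));
    }
}
```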

7 Map to Pair
Apply a function to each element of an RDD; return the result of applying the function (a key/value pair).
- Source RDD element type: T
- Result RDD element type: <K, V>
- Java function (class) type: PairFunction<T, K, V>
- Java method: Tuple2<K, V> call(T t)
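
For WordCount, the pair function typically turns each word into a (word, 1) pair. The sketch below is plain Java and uses Map.Entry as a stand-in for Tuple2; that substitution and the name MapToPairSketch are assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in (not Spark): Map.Entry plays the role of Tuple2<K, V>.
class MapToPairSketch {
    // T is String, K is String, V is Integer: each word becomes (word, 1).
    static List<Map.Entry<String, Integer>> toPairs(List<String> words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [a=1, b=1, a=1]
        System.out.println(toPairs(Arrays.asList("a", "b", "a")));
    }
}
```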

8 Basic RDD
Additional transformations:
- distinct()
- union(otherRDD)
- intersection(otherRDD)
- subtract(otherRDD)
- cartesian(otherRDD)
- sample(withReplacement, fraction, [seed])

9 Distinct
Remove duplicates.
- Source RDD element type: T
- Result RDD element type: T

10 Union
Compute the union of two RDDs. Duplicates are not removed.
- Source RDD element type: T
- Other RDD element type: T
- Result RDD element type: T
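
The statement that union keeps duplicates, and that distinct is needed to remove them, can be checked on small lists. This is a minimal plain-Java sketch rather than Spark code; UnionSketch is a hypothetical name, and the stable output order is an artifact of the sketch, not something an RDD guarantees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java stand-in (not Spark): a List plays the role of the RDD.
class UnionSketch {
    // Concatenates the two inputs; repeated values are kept.
    static <T> List<T> union(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Keeps one copy of each value (first-seen order, for readability only).
    static <T> List<T> distinct(List<T> a) {
        return new ArrayList<>(new LinkedHashSet<>(a));
    }

    public static void main(String[] args) {
        List<Integer> both = union(Arrays.asList(1, 2), Arrays.asList(2, 3));
        System.out.println(both);           // prints [1, 2, 2, 3]
        System.out.println(distinct(both)); // prints [1, 2, 3]
    }
}
```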

11 Intersection
Compute the intersection of two RDDs.
- Source RDD element type: T
- Other RDD element type: T
- Result RDD element type: T
Requires a shuffle over the network. Why? Does union require a shuffle?
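
One way to think about the shuffle question: equal elements can sit in different partitions of the two RDDs, so they have to be brought together before they can be compared. The sketch below imitates this in plain Java by routing every element to a bucket chosen by its hash and intersecting bucket by bucket. It is an illustration of the idea under that assumption, not a description of how Spark implements intersection; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in (not Spark): buckets imitate partitions after a shuffle.
class IntersectionSketch {
    // Sends each element to the bucket selected by its hash code,
    // so equal elements from either input land in the same bucket.
    static <T> List<Set<T>> shuffle(List<T> data, int buckets) {
        List<Set<T>> out = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            out.add(new HashSet<>());
        }
        for (T x : data) {
            out.get(Math.floorMod(x.hashCode(), buckets)).add(x);
        }
        return out;
    }

    // After the shuffle, each bucket pair can be intersected locally.
    static <T> Set<T> intersection(List<T> a, List<T> b, int buckets) {
        List<Set<T>> left = shuffle(a, buckets);
        List<Set<T>> right = shuffle(b, buckets);
        Set<T> result = new HashSet<>();
        for (int i = 0; i < buckets; i++) {
            Set<T> common = new HashSet<>(left.get(i));
            common.retainAll(right.get(i));
            result.addAll(common);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(intersection(Arrays.asList(1, 2, 3, 3), Arrays.asList(3, 4, 1), 2));
    }
}
```

Union, by contrast, never compares elements across inputs, which is why it can simply keep the existing partitions side by side.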

12 Subtract
Remove the elements of one RDD from another.
- Source RDD element type: T
- Other RDD element type: T
- Result RDD element type: T
Does it require a shuffle over the network?

13 Cartesian
Compute the cross product (Cartesian product) of two RDDs.
- Source RDD element type: T
- Other RDD element type: U
- Result RDD type: JavaPairRDD<T, U>
Does it require a shuffle over the network?
Very expensive in time and space.
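
The cost comes from the result size: every element of one input is paired with every element of the other, so the output has |A| x |B| elements. A minimal plain-Java sketch, with Map.Entry standing in for the pair type and CartesianSketch as a hypothetical name:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in (not Spark): Map.Entry plays the role of a (T, U) pair.
class CartesianSketch {
    // Pairs every element of a with every element of b.
    static <T, U> List<Map.Entry<T, U>> cartesian(List<T> a, List<U> b) {
        List<Map.Entry<T, U>> out = new ArrayList<>();
        for (T x : a) {
            for (U y : b) {
                out.add(new AbstractMap.SimpleEntry<>(x, y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> pairs =
                cartesian(Arrays.asList(1, 2, 3), Arrays.asList("a", "b"));
        // 3 elements times 2 elements gives 6 pairs
        System.out.println(pairs.size() + " pairs: " + pairs);
    }
}
```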

14 Sample
Sample from an RDD.
- Source RDD element type: T
- Result RDD element type: T
- sample(withReplacement, fraction, [seed])
Does it require a shuffle over the network?
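
A rough mental model for sampling without replacement is that each element is kept independently with probability equal to fraction, so the result size is only approximately fraction times the input size, and the seed makes the choice repeatable. The sketch below implements that model in plain Java with java.util.Random; it is an assumption-laden illustration, not Spark's sampler, and SampleSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Plain-Java stand-in (not Spark): keeps each element with probability `fraction`.
class SampleSketch {
    static <T> List<T> sample(List<T> source, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T x : source) {
            // Each decision looks at one element only, so it can be made
            // inside whichever partition holds that element.
            if (rng.nextDouble() < fraction) {
                out.add(x);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Same seed, same sample; the size varies around fraction * 10.
        System.out.println(sample(data, 0.5, 42L));
        System.out.println(sample(data, 0.5, 42L));
    }
}
```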

15 Basic RDD
Actions we've discussed:
- collect()
- count()
- take(n)
- first()
- saveAsTextFile()

16 Basic RDD
Additional actions:
- countByValue(): returns a map from RDD element to count (integer). Could we use this for WordCount?
- takeOrdered(n, comparator): sort the RDD elements, then take().
- takeSample(withReplacement, num): start sampling to get num elements.
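
To make the WordCount question concrete, the sketch below mimics countByValue and takeOrdered on an in-memory list. It is plain Java rather than Spark code, the class and method names are hypothetical, and the counts are stored as Long purely as a choice of this sketch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java stand-in (not Spark): a List plays the role of the RDD.
class ActionsSketch {
    // Builds a map from each distinct element to its number of occurrences.
    static Map<String, Long> countByValue(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Sorts with the comparator, then keeps the first n elements.
    static <T> List<T> takeOrdered(List<T> source, int n, Comparator<T> cmp) {
        return source.stream().sorted(cmp).limit(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(countByValue(Arrays.asList("a", "b", "a")));
        // prints [1, 2]
        System.out.println(takeOrdered(Arrays.asList(3, 1, 2), 2, Comparator.<Integer>naturalOrder()));
    }
}
```

Because the result here is an ordinary map held in one place, this approach to WordCount is only reasonable when the number of distinct words is small enough to fit in memory on a single machine.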

17 Basic RDD
Additional actions:
- reduce(Function2<T, T, T> f): returns an object of type T.
The function f should be:
- commutative
- associative
Does summing counts meet the specs for f? Does concatenating strings meet the specs for f?
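
The two questions can be explored by reducing inside each partition first and then combining the partial results, while varying the order in which partitions are combined. The sketch below does this in plain Java with BinaryOperator standing in for Function2<T, T, T>; it assumes every partition is non-empty, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java stand-in (not Spark): an "RDD" is modelled as a list of partitions.
class ReduceSketch {
    // Reduces each (non-empty) partition, then reduces the partial results
    // in the order the partitions are listed.
    static <T> T reducePartitions(List<List<T>> partitions, BinaryOperator<T> f) {
        List<T> partials = new ArrayList<>();
        for (List<T> partition : partitions) {
            partials.add(partition.stream().reduce(f).get());
        }
        return partials.stream().reduce(f).get();
    }

    public static void main(String[] args) {
        List<List<Integer>> nums = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
        List<List<Integer>> numsSwapped = Arrays.asList(Arrays.asList(3, 4), Arrays.asList(1, 2));
        // Addition: 10 in both orders.
        System.out.println(reducePartitions(nums, Integer::sum));
        System.out.println(reducePartitions(numsSwapped, Integer::sum));

        List<List<String>> strs = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c", "d"));
        List<List<String>> strsSwapped = Arrays.asList(Arrays.asList("c", "d"), Arrays.asList("a", "b"));
        // Concatenation: "abcd" versus "cdab", so the order matters.
        System.out.println(reducePartitions(strs, String::concat));
        System.out.println(reducePartitions(strsSwapped, String::concat));
    }
}
```

Addition gives the same answer whichever partition finishes first. Concatenation is associative but not commutative, so its result depends on the order in which partial results arrive.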

18 Assignment 11
Inverted index, implemented in Spark.
- We'll use the same data (Assignment 3)
- Remove punctuation
- For each word, output a list of verses containing the word, in sorted order
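
The target output can be pictured with a small in-memory version of the index. The sketch below is plain Java, not a Spark solution to the assignment. The verse identifiers ("v1", "v2"), the input shape (a map from verse id to text), the punctuation rule, and the use of string order for sorting are all assumptions made for illustration; the real Assignment 3 data format is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java stand-in (not Spark); input format and ids are hypothetical.
class InvertedIndexSketch {
    // Hypothetical sample data: verse id -> verse text.
    static Map<String, String> sampleVerses() {
        Map<String, String> verses = new LinkedHashMap<>();
        verses.put("v2", "the earth");
        verses.put("v1", "In the beginning, God");
        return verses;
    }

    // Maps each word to the sorted set of verse ids that contain it.
    static Map<String, TreeSet<String>> buildIndex(Map<String, String> verses) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> verse : verses.entrySet()) {
            // Remove punctuation: keep only letters, digits, and whitespace.
            String cleaned = verse.getValue().replaceAll("[^\\p{L}\\p{N}\\s]", "");
            for (String word : cleaned.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new TreeSet<>()).add(verse.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // "the" appears in both verses; its list comes out sorted as [v1, v2].
        buildIndex(sampleVerses()).forEach((word, ids) -> System.out.println(word + " " + ids));
    }
}
```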
