Other Map-Reduce (ish) Frameworks. William Cohen

Other Map-Reduce (ish) Frameworks William Cohen 1

Outline More concise languages for map-reduce pipelines Abstractions built on top of map-reduce General comments Specific systems: Cascading, Pipes PIG, Hive Spark, Flink 2

Y: Y=Hadoop+X or Hadoop~=Y What else are people using instead of Hadoop (not really covered this lecture) or on top of Hadoop? 3

Issues with Hadoop Too much typing programs are not concise Too low level missing abstractions hard to specify a workflow Not well suited to iterative operations E.g., EM, k-means clustering, Workflow and memory-loading issues 4

STREAMING AND MRJOB: MORE CONCISE MAP-REDUCE PIPELINES 5

Hadoop streaming start with a stream & sort pipeline: cat data | mapper.py | sort -k1,1 | reducer.py then run with Hadoop streaming instead: bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py -input /hdfs/path/to/inputDir -output /hdfs/path/to/outputDir with mapred.map.tasks=20 and mapred.reduce.tasks=20 set as job configuration 6
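A minimal sketch, assuming a word-count task, of what mapper.py and reducer.py could contain; the file names and the tab-separated key/value format are assumptions, but they follow the usual streaming convention (the reducer sees its input sorted by key):

# mapper.py -- emit (word, 1) pairs, one per line, tab-separated
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

# reducer.py -- input arrives sorted by key, so counts for one word are adjacent
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))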

mrjob word count Python level over map-reduce very concise Can run locally in Python Allows a single job or a linear chain of steps 7
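The word-count code on the slide was a screenshot; a sketch along the same lines, using the standard mrjob API (an MRJob subclass with mapper and reducer methods), would look roughly like this:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # one (word, 1) pair per token
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Run locally with e.g. python wordcount.py input.txt, or on a cluster by adding -r hadoop.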

mrjob most freq word 8
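The "most frequent word" screenshot is not reproduced here; a sketch of how it is typically written with mrjob's multi-step API (MRStep): a word-count step whose output is funneled to a single reducer that takes the max.

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostFreqWord(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # send every (count, word) pair to the same (None) key
        yield None, (sum(counts), word)

    def reducer_find_max(self, _, count_word_pairs):
        # yields one (count, word) pair: the most frequent word
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRMostFreqWord.run()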

MAP-REDUCE ABSTRACTIONS 9

Abstractions On Top Of Hadoop MRJob and other tools to make Hadoop pipelines more concise (Dumbo, ...) still keep the same basic language of map-reduce jobs How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse? 10

Abstractions On Top Of Hadoop Some obvious streaming processes: for each row in a table, transform it and output the result; or decide if you want to keep it with some boolean test, and copy out only the ones that pass the test. Example: stem words in a stream of word-count pairs: (aardvarks,1) → (aardvark,1). Proposed syntax: f(row) → row; table2 = MAP table1 TO λ row : f(row). Example: apply stop words: (aardvark,1) → (aardvark,1), (the,1) → deleted. Proposed syntax: f(row) → {true,false}; table2 = FILTER table1 BY λ row : f(row). 11
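A minimal sketch of the MAP and FILTER abstractions over an in-memory stream of (word, count) rows; the toy stemmer and the stopword set are stand-ins, not part of the slide:

stopwords = {"the", "a", "an"}

def stem(word):
    # toy stemmer: strip a trailing "s" (stand-in for a real stemmer)
    return word[:-1] if word.endswith("s") else word

def MAP(table, f):            # f(row) -> row
    return (f(row) for row in table)

def FILTER(table, f):         # f(row) -> True/False
    return (row for row in table if f(row))

table1 = [("aardvarks", 1), ("the", 1), ("agent", 2)]
table2 = MAP(table1, lambda row: (stem(row[0]), row[1]))
table3 = FILTER(table2, lambda row: row[0] not in stopwords)
print(list(table3))           # [('aardvark', 1), ('agent', 2)]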

Abstractions On Top Of Hadoop A (non-obvious?) streaming process: for each row in a table, transform it to a list of items, then splice all the lists together to get the output table (flatten). Example: tokenizing a line: "I found an aardvark" → [i, found, an, aardvark]; "We love zymurgy" → [we, love, zymurgy] ... but the final table is one word per row (i / found / an / aardvark / we / love / ...). Proposed syntax: f(row) → list of rows; table2 = FLATMAP table1 TO λ row : f(row). 12
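A corresponding sketch of FLATMAP over a plain Python stream; the tokenizer here is just lower-casing plus str.split, an assumption for illustration:

def FLATMAP(table, f):        # f(row) -> list of rows; splice all the lists together
    return (item for row in table for item in f(row))

docs = ["I found an aardvark", "We love zymurgy"]
words = FLATMAP(docs, lambda line: line.lower().split())
print(list(words))  # ['i', 'found', 'an', 'aardvark', 'we', 'love', 'zymurgy']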

Abstractions On Top Of Hadoop Another example from the Naïve Bayes test program 13

NB Test Step (Can we do better?) Event counts. How: stream and sort: for each event count C[X=w^Y=y]=n, print "w   C[X=w^Y=y]=n"; then sort, and build a list of values associated with each key w, like an inverted index. [Tables on the slide: event counts, e.g. X=w1^Y=sports 5245, X=w1^Y=worldNews 1054, ...; and the resulting per-word lists, e.g. aardvark → C[w^Y=sports]=2; agent → C[w^Y=sports]=1027, C[w^Y=worldNews]=564; ...; zynga → C[w^Y=sports]=21, C[w^Y=worldNews]=4464]

NB Test Step 1 (Can we do better?) The general case: we're taking rows from a table in a particular format (event, count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value → a grouping operation, a special case of map-reduce. Proposed syntax: f(row) → field; GROUP table BY λ row : f(row). Output: key, list of rows with that key. Could define f via a function, a field of a defined record structure, ... [The slide repeats the event-count table and the per-word counter lists from the previous slide.]

NB Test Step 1 (Can we do better?) The general case, again: take rows in (event, count) format, apply a function to get a new value (the word for the event), and group the rows of the table by this new value → a grouping operation, a special case of map-reduce. Proposed syntax: f(row) → field; GROUP table BY λ row : f(row). Could define f via a function, a field of a defined record structure, ... Aside: you guys know how to implement this, right? 1. Output pairs (f(row), row) with a map/streaming process 2. Sort pairs by key, which is f(row) 3. Reduce and aggregate by appending together all the values associated with the same key
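A small sketch of exactly that three-step recipe in Python, with sorted() plus itertools.groupby standing in for the shuffle/sort; the event-count rows are made-up illustrations:

from itertools import groupby

rows = [("aardvark^Y=sports", 2), ("agent^Y=sports", 1027),
        ("agent^Y=worldNews", 564), ("zynga^Y=sports", 21)]

def event_word(row):                      # f(row) -> field (the word for the event)
    return row[0].split("^")[0]

# 1. output (f(row), row) pairs   2. sort by key   3. append together the values per key
pairs = sorted((event_word(row), row) for row in rows)
grouped = {key: [row for _, row in group]
           for key, group in groupby(pairs, key=lambda p: p[0])}
print(grouped["agent"])   # [('agent^Y=sports', 1027), ('agent^Y=worldNews', 564)]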

Abstractions On Top Of Hadoop And another example from the Naïve Bayes test program 17

Request-and-answer Test data: one record per test case, id_i followed by its words w_{i,1} w_{i,2} ... w_{i,k_i}; plus the record of all event counts for each word (the inverted index from before: aardvark → C[w^Y=sports]=2; agent → C[w^Y=sports]=1027, C[w^Y=worldNews]=...; zynga → C[w^Y=sports]=21, C[w^Y=worldNews]=...). Step 2: stream through, and for each test case id_i w_{i,1} w_{i,2} ... w_{i,k_i}, the classification logic requests the event counters needed to classify id_i from the event-count DB, then classifies using the answers.

Request-and-answer Break it down into stages: Generate the data being requested (indexed by key, here a word), e.g. with GROUP BY. Generate the requests as (key, requestor) pairs, e.g. with FLATMAP TO. Join these two tables by key; join is defined as (1) cross-product and (2) filter out pairs with different values for the keys. This replaces the step of concatenating two different tables of key-value pairs and reducing them together. Postprocess the joined result.

[Tables on the slide: a Requests table of (word, request) rows such as (found, ~ctr to id1), (aardvark, ~ctr to id1), (zynga, ~ctr to id1), (..., ~ctr to id2); a Counters table of (word, counters) rows such as (aardvark, C[w^Y=sports]=2), (agent, C[w^Y=sports]=1027, C[w^Y=worldNews]=...), (zynga, C[w^Y=sports]=21, C[w^Y=worldNews]=...); and the joined (word, Counters, Requests) table, e.g. (agent, C[w^Y=sports]=..., ~ctr to id345), (agent, C[w^Y=sports]=..., ~ctr to id9854), ...]

[The slide repeats the Requests, Counters, and joined tables from the previous slide, with each request abbreviated to just the requesting id.] Proposed syntax: table3 = JOIN table1 BY λ row : f(row), table2 BY λ row : g(row). Could define f and g via a function, a field of a defined record structure, ...
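A toy sketch of this request-and-answer join in Python, grouping one table by key and pairing it with matching rows of the other (logically: cross-product, then keep only pairs whose keys agree); the ids and counter strings are illustrative only:

from collections import defaultdict

counters = [("aardvark", "C[w^Y=sports]=2"),
            ("agent", "C[w^Y=sports]=1027,C[w^Y=worldNews]=564")]
requests = [("aardvark", "~ctr to id1"), ("agent", "~ctr to id1"), ("agent", "~ctr to id345")]

def JOIN(table1, table2):
    by_key = defaultdict(list)
    for key, value in table1:
        by_key[key].append(value)
    # keep only (row1, row2) pairs that share the same key
    return [(key, v1, v2) for key, v2 in table2 for v1 in by_key.get(key, [])]

for word, ctrs, req in JOIN(counters, requests):
    print(word, ctrs, req)   # e.g. agent  C[w^Y=sports]=1027,...  ~ctr to id345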

MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING 22

Y: Y=Hadoop+X Cascading Java library for map-reduce workflows Also some library operations for common mappers/reducers 23

Cascading WordCount Example Input format, bind to HDFS path Output format: pairs, bind to HDFS path A pipeline of map-reduce jobs Replace line with bag of words Append a step: apply function to the line field Append step: group a (flattened) stream of tuples Append step: aggregate grouped values Run the pipeline 24

Cascading WordCount Example Many of the Hadoop abstraction levels have a similar flavor: Define a pipeline of tasks declaratively Optimize it automatically Run the final result The key question: does the system successfully hide the details from you? Is this inefficient? We explicitly form a group for each word, and then count the elements? We could be saved by careful optimization: we know we don't need the GroupBy intermediate result when we run the assembly.

Cascading WordCount Example Many of the Hadoop abstraction levels have a similar flavor: define a pipeline of tasks declaratively, optimize it automatically, run the final result. The key question: does the system successfully hide the details from you? Another pipeline: words = FLATMAP docs BY λ d: tokenize(d) contentwords = FILTER words BY λ w: !contains(stopwords, w) stems = MAP contentwords BY λ w: stem(w) stemgroups = GROUP stems BY λ s: s stemcounts = MAP stemgroups BY λ (stem, listofstems): (stem, listofstems.length()) How many passes do we need to make over the data? Optimize to one reduce.
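Putting the earlier pieces together, a self-contained sketch of the same stem-count pipeline in plain Python (the toy tokenizer, stemmer, and stopword list are again stand-ins); everything before the GROUP streams in a single pass, and the GROUP/sort corresponds to the one reduce:

from itertools import groupby

stopwords = {"the", "a", "an"}
stem = lambda w: w[:-1] if w.endswith("s") else w             # toy stemmer again

docs = ["I found an aardvark", "We love aardvarks"]
words = (w for d in docs for w in d.lower().split())          # FLATMAP ... tokenize(d)
contentwords = (w for w in words if w not in stopwords)       # FILTER ... !contains(stopwords, w)
stems = (stem(w) for w in contentwords)                       # MAP ... stem(w)
pairs = sorted((s, s) for s in stems)                         # GROUP stems BY s (the shuffle/sort)
stemcounts = [(s, len(list(g))) for s, g in groupby(pairs, key=lambda p: p[0])]
print(stemcounts)   # [('aardvark', 2), ('found', 1), ('i', 1), ('love', 1), ('we', 1)]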

Y: Y=Hadoop+X Cascading Java library for map-reduce workflows, expressed as Pipes, to which you add Each, Every, GroupBy, ... Also some library operations for common mappers/reducers, e.g. RegexGenerator Turing-complete since it's an API for Java Pipes C++ library for map-reduce workflows on Hadoop Scalding More concise Scala library based on Cascading 27

MORE DECLARATIVE LANGUAGES 28

Hive and PIG: word count Declarative ... fairly stable. A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. allowed. 29

More on Pig Pig Latin atomic types + compound types like tuple, bag, map execute locally/interactively or on Hadoop can embed Pig in Java (and Python and ...) can call out to Java from Pig Similar(ish) system from Microsoft: DryadLINQ 30

TOKENIZE built-in function FLATTEN special keyword, which applies to the next step in the process, i.e. the FOREACH is transformed from a MAP to a FLATMAP 31

PIG parses and optimizes a sequence of commands before it executes them It's smart enough to turn GROUP ... FOREACH ... SUM ... into a map-reduce PIG Features LOAD 'hdfs-path' AS (schema) schemas can include int, double, bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation DESCRIBE alias / ILLUSTRATE alias -- debugging GROUP alias BY ... FOREACH alias GENERATE group, SUM(...) GROUP/GENERATE ... aggregate op together act like a map-reduce JOIN r BY field, s BY field, ... inner join to produce rows: r::f1, r::f2, ... s::f1, s::f2, ... CROSS r, s, ... use with care unless all but one of the relations are singleton User-defined functions as operators also for loading, aggregates, ... 32

ANOTHER EXAMPLE: COMPUTING TFIDF IN PIG LATIN 33

(docid, token) → (docid, token, tf(token in doc)) (docid, token, tf) → (docid, token, tf, length(doc)) (docid, token, tf, n) → (..., tf/n) (docid, token, tf, n, tf/n) → (..., df) ndocs.total_docs relation-to-scalar casting (docid, token, tf, n, tf/n) → (docid, token, tf/n * idf) 34
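A sketch of the same dataflow in plain Python over (docid, token) pairs; the idf formula log(total_docs / df) is one common choice and an assumption here, not necessarily exactly what the Pig script on the following slide computes:

import math
from collections import Counter

pairs = [("d1", "aardvark"), ("d1", "agent"), ("d1", "agent"), ("d2", "agent")]

tf = Counter(pairs)                                   # (docid, token) -> tf(token in doc)
doclen = Counter(d for d, _ in pairs)                 # docid -> length(doc)
df = Counter(t for _, t in set(pairs))                # token -> df (docs containing token)
ndocs = len(doclen)                                   # the relation-to-scalar step

tfidf = {(d, t): (tf[(d, t)] / doclen[d]) * math.log(ndocs / df[t])
         for (d, t) in set(pairs)}
print(tfidf)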

Other PIG features Macros, nested queries, ... FLATTEN operation transforms a bag or a tuple into its individual elements; this transform affects the next level of the aggregate STREAM and DEFINE ... SHIP: DEFINE myfunc `python myfun.py` SHIP('myfun.py'); r = STREAM s THROUGH myfunc AS (...); 35

TF-IDF in PIG - another version 36

Issues with Hadoop Too much typing programs are not concise Too low level missing abstractions hard to specify a workflow Not well suited to iterative operations E.g., EM, k-means clustering, Workflow and memory-loading issues First: an iterative algorithm in Pig Latin 37

How to use loops, conditionals, etc.? Embed PIG in a real programming language. Julien Le Dem, Yahoo 38

39

An example from Ron Bekkerman 40

Example: k-means clustering An EM-like algorithm: Initialize k cluster centroids E-step: associate each data instance with the closest centroid Find expected values of cluster assignments given the data and centroids M-step: recalculate centroids as an average of the associated data instances Find new centroids that maximize that expectation 41

k-means Clustering centroids 42

Parallelizing k-means 43

Parallelizing k-means 44

Parallelizing k-means 45

k-means on MapReduce Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2 46

k-means in Apache Pig: input data Assume we need to cluster documents, stored in a 3-column table D (Document, Word, Count), e.g. (doc1, Carnegie, 2), (doc1, Mellon, 2), ... Initial centroids are k randomly chosen docs, stored in table C in the same format as above. 47

k-means in Apache Pig: E-step (slides 48-53). Each document d is assigned to the centroid c with the highest cosine similarity, c = argmax_c ( Σ_w i_{d,w} · i_{c,w} ) / sqrt( Σ_w i_{c,w}² ):
D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, i_d * i_c AS i_d_i_c;
PROD_g = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PROD_g GENERATE d, c, SUM(i_d_i_c) AS dxc;
SQR = FOREACH C GENERATE c, i_c * i_c AS i_c2;
SQR_g = GROUP SQR BY c;
LEN_C = FOREACH SQR_g GENERATE c, SQRT(SUM(i_c2)) AS len_c;
DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len_c;
SIM_g = GROUP SIM BY d;
CLUSTERS = FOREACH SIM_g GENERATE TOP(1, 2, SIM);

k-means in Apache Pig: M-step
D_C_W = JOIN CLUSTERS BY d, D BY d;
D_C_W_g = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_W_g GENERATE c, w, SUM(i_d) AS sum;
D_C_W_gg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_W_gg GENERATE c, COUNT(D_C_W) AS size;
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS i_c;
Finally, embed in Java (or Python or ...) to do the looping 54
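A sketch of what that outer loop might look like using Pig's embedding API (the Jython org.apache.pig.scripting interface); the script file name, parameter names, and the fixed iteration cap are assumptions for illustration, not part of the slides:

# run with something like: pig kmeans_driver.py   (Pig's Jython embedding)
from org.apache.pig.scripting import Pig

# kmeans_iter.pig is assumed to hold the E-step and M-step above,
# reading centroids from $centroids and writing new ones to $out
P = Pig.compileFromFile("kmeans_iter.pig")

centroids = "initial_centroids"
for i in range(20):                          # fixed cap as a stand-in for a real convergence test
    out = "centroids_%d" % (i + 1)
    stats = P.bind({"centroids": centroids, "out": out}).runSingle()
    if not stats.isSuccessful():
        raise RuntimeError("iteration %d failed" % i)
    centroids = out                          # next iteration reads this run's output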

The problem with k-means in Hadoop I/O costs 55

Data is read, and model is written, with every iteration Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2 56

SCHEMES DESIGNED FOR ITERATIVE HADOOP PROGRAMS: SPARK AND FLINK 57

Spark word count example Research project, based on Scala and Hadoop Now APIs in Java and Python as well Familiar-looking API for abstract operations (map, flatMap, reduceByKey, ...) Most API calls are lazy, i.e., counts is a data structure defining a pipeline, not a materialized table. Includes the ability to store a sharded dataset in cluster memory as an RDD (resilient distributed dataset) 58
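The word-count code on the slide is a screenshot (Scala); the equivalent PySpark pipeline looks roughly like this, with the HDFS paths as placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
lines = sc.textFile("hdfs://.../input")                 # placeholder path
counts = (lines.flatMap(lambda line: line.split())      # lazy: builds the pipeline
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://.../output")              # action: triggers execution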

Spark logistic regression example 59

Spark logistic regression example Allows caching data in memory 60

Spark logistic regression example 61
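The logistic regression screenshots are not reproduced here; a PySpark sketch in the spirit of the classic Spark example follows (parse_point, the input path, the feature count D, and the iteration count are assumptions), showing the .cache() call that keeps the points in cluster memory across iterations:

import numpy as np
from pyspark import SparkContext

D = 10                                                   # number of features (assumed)
sc = SparkContext(appName="logreg")

def parse_point(line):
    vals = np.array([float(x) for x in line.split()])
    return vals[0], vals[1:]                             # (label y in {-1,+1}, features x)

points = sc.textFile("hdfs://.../points").map(parse_point).cache()   # kept in memory
w = np.random.ranf(size=D)                               # initial weights

for i in range(100):
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w:", w)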

FLINK Recent Apache project, just moved to top-level at version 0.8; formerly Stratosphere. 62

FLINK Java API Apache Project just getting started. 63

FLINK 64

FLINK Like Spark, in-memory or on disk Everything is a Java object Unlike Spark, contains operations for iteration, allowing query optimization Very easy to use and install in local mode Very modular Only needs Java 65

MORE EXAMPLES IN PIG 66

Phrase Finding in PIG 67

Phrase Finding 1 - loading the input 68

69

PIG Features comments: -- like this /* or like this */ shell-like commands: fs -ls -- any hadoop fs command some shorter cuts: ls, cp, ... sh ls -al -- escape to shell 70

71

PIG Features comments: -- like this /* or like this */ shell-like commands: fs -ls -- any hadoop fs command some shorter cuts: ls, cp, ... sh ls -al -- escape to shell LOAD 'hdfs-path' AS (schema) schemas can include int, double, ... schemas can include complex types: bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation operators include +, -, and, or, ... can extend this set easily (more later) DESCRIBE alias -- shows the schema ILLUSTRATE alias -- derives a sample tuple 72

Phrase Finding 1 - word counts 73

74

PIG Features LOAD 'hdfs-path' AS (schema) schemas can include int, double, bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation DESCRIBE alias / ILLUSTRATE alias -- debugging GROUP r BY x like a shuffle-sort: produces a relation with fields group and r, where r is a bag 75

PIG parses and optimizes a sequence of commands before it executes them It's smart enough to turn GROUP ... FOREACH ... SUM ... into a map-reduce 76

PIG Features LOAD 'hdfs-path' AS (schema) schemas can include int, double, bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation DESCRIBE alias / ILLUSTRATE alias -- debugging GROUP alias BY ... FOREACH alias GENERATE group, SUM(...) GROUP/GENERATE ... aggregate op together act like a map-reduce aggregates: COUNT, SUM, AVERAGE, MAX, MIN, ... you can write your own 77

PIG parses and optimizes a sequence of commands before it executes them It's smart enough to turn GROUP ... FOREACH ... SUM ... into a map-reduce 78

Phrase Finding 3 - assembling phrase- and word-level statistics 79

80

81

PIG Features LOAD 'hdfs-path' AS (schema) schemas can include int, double, bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation DESCRIBE alias / ILLUSTRATE alias -- debugging GROUP alias BY ... FOREACH alias GENERATE group, SUM(...) GROUP/GENERATE ... aggregate op together act like a map-reduce JOIN r BY field, s BY field, ... inner join to produce rows: r::f1, r::f2, ... s::f1, s::f2, ... 82

Phrase Finding 4 - adding total frequencies 83

84

How do we add the totals to the phrasestats relation? STORE triggers execution of the query plan; it also limits optimization 85

Comment: schema is lost when you store. 86

PIG Features LOAD 'hdfs-path' AS (schema) schemas can include int, double, bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation DESCRIBE alias / ILLUSTRATE alias -- debugging GROUP alias BY ... FOREACH alias GENERATE group, SUM(...) GROUP/GENERATE ... aggregate op together act like a map-reduce JOIN r BY field, s BY field, ... inner join to produce rows: r::f1, r::f2, ... s::f1, s::f2, ... CROSS r, s, ... use with care unless all but one of the relations are singleton newer Pigs allow a singleton relation to be cast to a scalar 87

Phrase Finding 5 - phrasiness and informativeness 88

How do we compute some complicated function? With a UDF 89
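Pig UDFs can be written in Java or, via Jython, in Python. A hedged sketch of what a Python UDF for a phrasiness-style score might look like; the function name, fields, toy score, and the exact registration/decorator details are assumptions that may differ by Pig version:

# scorers.py -- registered from Pig with something like:
#   REGISTER 'scorers.py' USING jython AS scorers;
#   scored = FOREACH phrasestats GENERATE phrase, scorers.phrasiness(phraseCount, w1Count, w2Count);
from math import log

@outputSchema('phrasiness:double')       # decorator provided by Pig for Jython UDFs (assumed available)
def phrasiness(phrase_count, word1_count, word2_count):
    # toy score: log of phrase frequency relative to the product of its words' frequencies
    return log(max(phrase_count, 1) / float(max(word1_count, 1) * max(word2_count, 1)))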

90

PIG Features LOAD 'hdfs-path' AS (schema) schemas can include int, double, bag, map, tuple, ... FOREACH alias GENERATE ... AS ..., ... transforms each row of a relation DESCRIBE alias / ILLUSTRATE alias -- debugging GROUP alias BY ... FOREACH alias GENERATE group, SUM(...) GROUP/GENERATE ... aggregate op together act like a map-reduce JOIN r BY field, s BY field, ... inner join to produce rows: r::f1, r::f2, ... s::f1, s::f2, ... CROSS r, s, ... use with care unless all but one of the relations are singleton User-defined functions as operators also for loading, aggregates, ... 91

The full phrase-finding pipeline 92

93