Other Map-Reduce (ish) Frameworks. William Cohen

Size: px
Start display at page:

Download "Other Map-Reduce (ish) Frameworks. William Cohen"

Transcription

1 Other Map-Reduce (ish) Frameworks William Cohen 1

2 Outline More concise languages for map- reduce pipelines Abstractions built on top of map- reduce General comments Speci<ic systems Cascading, Pipes PIG, Hive Spark, Flink 2

3 Y:Y=Hadoop+X or Hadoop~=Y What else are people using? instead of Hadoop Not really covered this lecture on top of Hadoop 3

4 Issues with Hadoop Too much typing programs are not concise Too low level missing abstractions hard to specify a work<low Not well suited to iterative operations E.g., E/M, k- means clustering, Work<low and memory- loading issues 4

5 STREAMING AND MRJOB: MORE CONCISE MAP-REDUCE PIPELINES 5

6 Hadoop streaming start with stream & sort pipeline cat data mapper.py sort k1,1 reducer.py run with hadoop streaming instead bin/hadoop jar contrib/streaming/hadoop- *streaming*.jar - <ile mapper.py <ile reducer.py - mapper mapper.py - reducer reducer.py - input /hdfs/path/to/inputdir - output /hdfs/path/to/outputdir - mapred.map.tasks=20 - mapred.reduce.tasks=20 6

7 mrjob word count Python level over map- reduce very concise Can run locally in Python Allows a single job or a linear chain of steps 7

8 mrjob most freq word 8

9 MAP-REDUCE ABSTRACTIONS 9

10 Abstractions On Top Of Hadoop MRJob and other tools to make Hadoop pipelines more concise (Dumbo, ) still keep the same basic language of map- reduce jobs How else can we express these sorts of computations? Are there some common special cases of map- reduce steps we can parameterize and reuse? 10

11 Abstractions On Top Of Hadoop Some obvious streaming processes: for each row in a table Transform it and output the result Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test Example: stem words in a stream of word- count pairs: ( aardvarks,1) è ( aardvark,1) Proposed syntax: f(row)!row table2 = MAP table1 TO λ row : f(row)) Example: apply stop words ( aardvark,1) è ( aardvark,1) ( the,1) è deleted Proposed syntax: f(row)! {true,false} table2 = FILTER table1 BY λ row : f(row)) 11

12 Abstractions On Top Of Hadoop A non- obvious? streaming processes: for each row in a table Transform it to a list of items Splice all the lists together to get the output table (<latten) Proposed syntax: Example: tokenizing a line I found an aardvark è [ i, found, an, aardvark ] We love zymurgy è [ we, love, zymurgy ]..but <inal table is one word per row f(row)!list of rows table2 = FLATMAP table1 TO λ row : f(row)) i found an aardvark we love 12

13 Abstractions On Top Of Hadoop Another example from the Naïve Bayes test program 13

14 NB Test Step (Can we do better?) Event counts How: Stream and sort: for each C[X=w^Y=y]=n print w C[Y=y]=n sort and build a list of values associated with each key w Like an inverted index X=w 1^Y=sports X=w 1^Y=worldNews X=.. X=w 2^Y= X= w aardvark agent zynga Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world News]=564 C[w^Y=sports]=21,C[w^Y=worldNe ws]=4464

15 NB Test Step 1 (Can we do better?) The general case: We re taking rows from a table In a particular format (event,count) Applying a function to get a new value The word for the event And grouping the rows of the table by this new value è Grouping operation Special case of a map-reduce Event counts X=w 1^Y=sports X=w 1^Y=worldNews X=.. X=w 2^Y= X= Proposed syntax: f(row)!field GROUP table BY λ row : f(row) Output: key, listofrowswithkey Could de<ine f via: a function, a <ield of a de<ined record structure, w aardvark agent zynga Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world News]=564 C[w^Y=sports]=21,C[w^Y=worldNe ws]=4464

16 NB Test Step 1 (Can we do better?) The general case: We re taking rows from a table In a particular format (event,count) Applying a function to get a new value The word for the event And grouping the rows of the table by this new value è Grouping operation Special case of a map-reduce Proposed syntax: f(row)!field Aside: you guys know how to implement this, right? 1. Output pairs (f(row),row) with a map/streaming process 2. Sort pairs by key which is f(row) 3. Reduce and aggregate by appending together all the values associated with the same key GROUP table BY λ row : f(row) Could de<ine f via: a function, a <ield of a de<ined record structure,

17 Abstractions On Top Of Hadoop And another example from the Naïve Bayes test program 17

18 Request-and-answer Test data Record of all event counts for each word id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 id 5 w 5,1 w 5,2... w aardvark agent zynga Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Step 2: stream through and for each test case id i w i,1 w i,2 w i,3. w i,ki Classification logic request the event counters needed to classify id i from the event-count DB, then classify using the answers

19 Request-and-answer Break down into stages Generate the data being requested (indexed by key, here a word) Eg with group by Generate the requests as (key, requestor) pairs Eg with flatmap to Join these two tables by key Join defined as (1) cross-product and (2) filter out pairs with different values for keys This replaces the step of concatenating two different tables of key-value pairs, and reducing them together Postprocess the joined result

20 w found Request ~ctr to id1 w aardvark Counters C[w^Y=sports]=2 aardvark ~ctr to id1 agent C[w^Y=sports]=1027,C[w^Y=worldNews zynga ~ctr to id1 zynga C[w^Y=sports]=21,C[w^Y=worldNews]= ~ctr to id2 w Counters Requests aardvark C[w^Y=sports]=2 ~ctr to id1 agent C[w^Y=sports]= ~ctr to id345 agent C[w^Y=sports]= ~ctr to id9854 agent C[w^Y=sports]= ~ctr to id345 C[w^Y=sports]= ~ctr to id34742 zynga C[ ] ~ctr to id1 zynga C[ ]

21 w found Request id1 w aardvark Counters C[w^Y=sports]=2 aardvark id1 agent C[w^Y=sports]=1027,C[w^Y=worldNews zynga id1 zynga C[w^Y=sports]=21,C[w^Y=worldNews]= id2 w Counters Requests aardvark C[w^Y=sports]=2 id1 agent C[w^Y=sports]= id345 agent C[w^Y=sports]= id9854 agent C[w^Y=sports]= id345 C[w^Y=sports]= id34742 zynga C[ ] id1 Proposed syntax: zynga C[ ] table3 = JOIN table1 BY λ row : f(row), table2 BY λ row : g(row) Could de<ine f and g via: a function, a <ield of a de<ined record structure,

22 MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING 22

23 Y:Y=Hadoop+X Cascading Java library for map- reduce work<lows Also some library operations for common mappers/reducers 23

24 Cascading WordCount Example Input format Bind to HFS path Output format: pairs Bind to HFS path A pipeline of map-reduce jobs Replace line with bag of words Append a step: apply function to the line field Append step: group a (flattened) stream of tuples Append step: aggregate grouped values Run the pipeline 24

25 Cascading WordCount Example Many of the Hadoop abstraction levels have a similar flavor: Define a pipeline of tasks declaratively Optimize it automatically Run the final result The key question: does the system successfully hide the details from you? Is this inefficient? We explicitly form a group for each word, and then count the elements? We could be saved by careful optimization: we know we don t need the GroupBy intermediate result when we run the assembly.

26 Cascading WordCount Example Many of the Hadoop abstraction levels have a similar flavor: Define a pipeline of tasks declaratively Optimize it automatically Run the final result The key question: does the system successfully hide the details from you? Another pipeline: words = FLATMAP docs BY λ d: tokenize( d) contentwords = FILTER words BY λ w:!contains(stopwords,w) stems = MAP contentwords BY λ w: stem(w) stemgroups = GROUP stems BY λ s: s stemcounts = MAP stemgroups BY λ stem,listofstems: (stem,listofstems.length()) How many passes do we need to make over the data? Optimize to one reduce

27 Y:Y=Hadoop+X Cascading Java library for map- reduce work<lows expressed as Pipe s, to which you add Each, Every, GroupBy, Also some library operations for common mappers/ reducers e.g. RegexGenerator Turing- complete since it s an API for Java Pipes C++ library for map- reduce work<lows on Hadoop Scalding More concise Scala library based on Cascading 27

28 MORE DECLARATIVE LANGUAGES 28

29 Hive and PIG: word count Declarative.. Fairly stable PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc allowed. 29

30 More on Pig Pig Latin atomic types + compound types like tuple, bag, map execute locally/interactively or on hadoop can embed Pig in Java (and Python and ) can call out to Java from Pig Similar (ish) system from Microsoft: DryadLinq 30

31 Tokenize built-in function Flatten special keyword, which applies to the next step in the process ie foreach is transformed from a MAP to a FLATMAP 31

32 PIG parses and optimizes a sequence of commands before it executes them It s smart enough to turn GROUP FOREACH SUM into a map-reduce PIG Features LOAD hdfs- path AS (schema) schemas can include int, double, bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias - - debugging GROUP alias BY FOREACH alias GENERATE group, SUM(.) GROUP/GENERATE aggregate op together act like a map- reduce JOIN r BY Nield, s BY Nield, inner join to produce rows: r::f1, r::f2, s::f1, s::f2, CROSS r, s, use with care unless all but one of the relations are singleton User de<ined functions as operators also for loading, aggregates, 32

33 ANOTHER EXAMPLE: COMPUTING TFIDF IN PIG LATIN 33

34 (docid,token) è (docid,token,tf(token in doc)) (docid,token,tf) è (docid,token,tf,length(doc)) (docid,token,tf,n)è (,tf/n) (docid,token,tf,n,tf/n)è (,df) ndocs.total_docs relation-to-scalar casting (docid,token,tf,n,tf/n)è (docid,token,tf/n * id) 34

35 Other PIG features Macros, nested queries, FLATTEN operation transforms a bag or a tuple into its individual elements this transform affects the next level of the aggregate STREAM and DEFINE SHIP DEFINE myfunc `python myfun.py` SHIP( myfun.py ) r = STREAM s THROUGH myfunc AS ( ); 35

36 TF-IDF in PIG - another version 36

37 Issues with Hadoop Too much typing programs are not concise Too low level missing abstractions hard to specify a work<low Not well suited to iterative operations E.g., E/M, k- means clustering, Work<low and memory- loading issues First: an iterative algorithm in Pig Latin 37

38 How to use loops, conditionals, etc? Embed PIG in a real programming language. Julien Le Dem Yahoo 38

39 39

40 An example from Ron Bekkerman 40

41 Example: k-means clustering An EM- like algorithm: Initialize k cluster centroids E- step: associate each data instance with the closest centroid Find expected values of cluster assignments given the data and centroids M- step: recalculate centroids as an average of the associated data instances Find new centroids that maximize that expectation 41

42 k-means Clustering centroids 42

43 Parallelizing k-means 43

44 Parallelizing k-means 44

45 Parallelizing k-means 45

46 k-means on MapReduce Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2 46

47 k-means in Apache Pig: input data Assume we need to cluster documents Stored in a 3- column table D: Document Word Count doc1 Carnegie 2 doc1 Mellon 2 Initial centroids are k randomly chosen docs Stored in table C in the same format as above 47

48 k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dxc; SQR = c FOREACH = arg C GENERATE max c, i c * i c AS w i c2 ; d SQR g = GROUP d SQR BY c; LEN_C = FOREACH SQR g GENERATE c c, SQRT(SUM(i c2 )) AS 2 len c ; w c w d ( w ) i DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len c ; i c i w c SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 48

49 k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dxc; SQR = c FOREACH = arg C GENERATE max c, i c * i c AS w i c2 ; d SQR g = GROUP d SQR BY c; LEN_C = FOREACH SQR g GENERATE c c, SQRT(SUM(i c2 )) AS 2 len c ; w c w d ( w ) i DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len c ; i c i w c SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 49

50 k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dxc; SQR = c FOREACH = arg C GENERATE max c, i c * i c AS w i c2 ; d SQR g = GROUP d SQR BY c; LEN_C = FOREACH SQR g GENERATE c c, SQRT(SUM(i c2 )) AS 2 len c ; w c w d ( w ) i DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len c ; i c i w c SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 50

51 k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dxc; SQR = c FOREACH = arg C GENERATE max c, i c * i c AS w i c2 ; d SQR g = GROUP d SQR BY c; LEN_C = FOREACH SQR g GENERATE c c, SQRT(SUM(i c2 )) AS 2 len c ; w c w d ( w ) i DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len c ; i c i w c SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 51

52 k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dxc; SQR = c FOREACH = arg C GENERATE max c, i c * i c AS w i c2 ; d SQR g = GROUP d SQR BY c; LEN_C = FOREACH SQR g GENERATE c c, SQRT(SUM(i c2 )) AS 2 len c ; w c w d ( w ) i DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len c ; i c i w c SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 52

53 k-means in Apache Pig: E-step D_C = JOIN C BY w, D BY w; PROD = FOREACH D_C GENERATE d, c, i d * i c AS i d i c ; PROD g = GROUP PROD BY (d, c); DOT_PROD = FOREACH PROD g GENERATE d, c, SUM(i d i c ) AS dxc; SQR = FOREACH C GENERATE c, i c * i c AS i c2 ; SQR g = GROUP SQR BY c; LEN_C = FOREACH SQR g GENERATE c, SQRT(SUM(i c2 )) AS len c ; DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c; SIM = FOREACH DOT_LEN GENERATE d, c, dxc / len c ; SIM g = GROUP SIM BY d; CLUSTERS = FOREACH SIM g GENERATE TOP(1, 2, SIM); 53

54 k-means in Apache Pig: M-step D_C_W = JOIN CLUSTERS BY d, D BY d; D_C_W g = GROUP D_C_W BY (c, w); SUMS = FOREACH D_C_W g GENERATE c, w, SUM(i d ) AS sum; D_C_W gg = GROUP D_C_W BY c; SIZES = FOREACH D_C_W gg GENERATE c, COUNT(D_C_W) AS size; SUMS_SIZES = JOIN SIZES BY c, SUMS BY c; C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS i c ; Finally - embed in Java (or Python or.) to do the looping 54

55 The problem with k-means in Hadoop I/O costs 55

56 Data is read, and model is written, with every iteration Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2 56

57 SCHEMES DESIGNED FOR ITERATIVE HADOOP PROGRAMS: SPARK AND FLINK 57

58 Spark word count example Research project, based on Scala and Hadoop Now APIs in Java and Python as well Familiar-looking API for abstract operations (map, flatmap, reducebykey, ) Most API calls are lazy ie, counts is a data structure defining a pipeline, not a materialized table. Includes ability to store a sharded dataset in cluster memory as an RDD (resiliant distributed database) 58

59 Spark logistic regression example 59

60 Spark logistic regression example Allows caching data in memory 60

61 Spark logistic regression example 61

62 FLINK Recent Apache Project just moved to top- level at 0.8 formerly Stratosphere. 62

63 FLINK Java API Apache Project just getting started. 63

64 FLINK 64

65 FLINK Like Spark, in- memory or on disk Everything is a Java object Unlike Spark, contains operations for iteration Allowing query optimization Very easy to use and install in local model Very modular Only needs Java 65

66 MORE EXAMPLES IN PIG 66

67 Phrase Finding in PIG 67

68 Phrase Finding 1 - loading the input 68

69 69

70 PIG Features comments - - like this /* or like this */ shell- like commands: fs - ls - - any hadoop fs command some shorter cuts: ls, cp, sh ls - al - - escape to shell 70

71 71

72 PIG Features comments - - like this /* or like this */ shell- like commands: fs - ls - - any hadoop fs command some shorter cuts: ls, cp, sh ls - al - - escape to shell LOAD hdfs- path AS (schema) schemas can include int, double, schemas can include complex types: bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation operators include +, -, and, or, can extend this set easily (more later) DESCRIBE alias - - shows the schema ILLUSTRATE alias - - derives a sample tuple 72

73 Phrase Finding 1 - word counts 73

74 74

75 PIG Features LOAD hdfs- path AS (schema) schemas can include int, double, bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias - - debugging GROUP r BY x like a shufnle- sort: produces relation with Nields group and r, where r is a bag 75

76 PIG parses and optimizes a sequence of commands before it executes them It s smart enough to turn GROUP FOREACH SUM into a map-reduce 76

77 PIG Features LOAD hdfs- path AS (schema) schemas can include int, double, bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias - - debugging GROUP alias BY FOREACH alias GENERATE group, SUM(.) GROUP/GENERATE aggregate op together act like a map- reduce aggregates: COUNT, SUM, AVERAGE, MAX, MIN, you can write your own 77

78 PIG parses and optimizes a sequence of commands before it executes them It s smart enough to turn GROUP FOREACH SUM into a map-reduce 78

79 Phrase Finding 3 - assembling phrase- and word-level statistics 79

80 80

81 81

82 PIG Features LOAD hdfs- path AS (schema) schemas can include int, double, bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias - - debugging GROUP alias BY FOREACH alias GENERATE group, SUM(.) GROUP/GENERATE aggregate op together act like a map- reduce JOIN r BY Nield, s BY Nield, inner join to produce rows: r::f1, r::f2, s::f1, s::f2, 82

83 Phrase Finding 4 - adding total frequencies 83

84 84

85 How do we add the totals to the phrasestats relation? STORE triggers execution of the query plan. it also limits optimization 85

86 Comment: schema is lost when you store. 86

87 PIG Features LOAD hdfs- path AS (schema) schemas can include int, double, bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias - - debugging GROUP alias BY FOREACH alias GENERATE group, SUM(.) GROUP/GENERATE aggregate op together act like a map- reduce JOIN r BY Nield, s BY Nield, inner join to produce rows: r::f1, r::f2, s::f1, s::f2, CROSS r, s, use with care unless all but one of the relations are singleton newer pigs allow singleton relation to be cast to a scalar 87

88 Phrase Finding 5 - phrasiness and informativeness 88

89 How do we compute some complicated function? With a UDF 89

90 90

91 PIG Features LOAD hdfs- path AS (schema) schemas can include int, double, bag, map, tuple, FOREACH alias GENERATE AS, transforms each row of a relation DESCRIBE alias/ ILLUSTRATE alias - - debugging GROUP alias BY FOREACH alias GENERATE group, SUM(.) GROUP/GENERATE aggregate op together act like a map- reduce JOIN r BY Nield, s BY Nield, inner join to produce rows: r::f1, r::f2, s::f1, s::f2, CROSS r, s, use with care unless all but one of the relations are singleton User de<ined functions as operators also for loading, aggregates, 91

92 The full phrase-finding pipeline 92

93 93

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

http://glennengstrand.info/analytics/fp

http://glennengstrand.info/analytics/fp Functional Programming and Big Data by Glenn Engstrand (September 2014) http://glennengstrand.info/analytics/fp What is Functional Programming? It is a style of programming that emphasizes immutable state,

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

Introduction to Apache Pig Indexing and Search

Introduction to Apache Pig Indexing and Search Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013 Organizational

More information

Hadoop Hands-On Exercises

Hadoop Hands-On Exercises Hadoop Hands-On Exercises Lawrence Berkeley National Lab July 2011 We will Training accounts/user Agreement forms Test access to carver HDFS commands Monitoring Run the word count example Simple streaming

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. [email protected] (CRS4) Luca Pireddu March 9, 2012 1 / 18

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 [email protected] (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is

More information

Big Data Frameworks: Scala and Spark Tutorial

Big Data Frameworks: Scala and Spark Tutorial Big Data Frameworks: Scala and Spark Tutorial 13.03.2015 Eemil Lagerspetz, Ella Peltonen Professor Sasu Tarkoma These slides: http://is.gd/bigdatascala www.cs.helsinki.fi Functional Programming Functional

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

@Scalding. https://github.com/twitter/scalding. Based on talk by Oscar Boykin / Twitter

@Scalding. https://github.com/twitter/scalding. Based on talk by Oscar Boykin / Twitter @Scalding https://github.com/twitter/scalding Based on talk by Oscar Boykin / Twitter What is Scalding? Why Scala for Map/Reduce? How is it used at Twitter? What s next for Scalding? Yep, we re counting

More information

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Getting Started with Hadoop with Amazon s Elastic MapReduce

Getting Started with Hadoop with Amazon s Elastic MapReduce Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson [email protected] http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson

More information

Introduction to Spark

Introduction to Spark Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://

More information

Data Management in the Cloud

Data Management in the Cloud Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server

More information

Cloud Computing. Chapter 8. 8.1 Hadoop

Cloud Computing. Chapter 8. 8.1 Hadoop Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer

More information

Scaling Up Machine Learning

Scaling Up Machine Learning Scaling Up Machine Learning Parallel and Distributed Approaches Ron Bekkerman, LinkedIn Misha Bilenko, MSR John Langford, Y!R http://hunch.net/~large_scale_survey Outline Introduction Tree Induction Break

More information

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations [email protected] What is Apache Hadoop? Distributed File System and Map-Reduce programming platform

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc. Introduction to Pig Content developed and presented by: Outline Motivation Background Components How it Works with Map Reduce Pig Latin by Example Wrap up & Conclusions Motivation Map Reduce is very powerful,

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

Unified Big Data Analytics Pipeline. 连 城 [email protected]

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an

More information

Machine- Learning Summer School - 2015

Machine- Learning Summer School - 2015 Machine- Learning Summer School - 2015 Big Data Programming David Franke Vast.com hbp://www.cs.utexas.edu/~dfranke/ Goals for Today Issues to address when you have big data Understand two popular big data

More information

Introduction To Hive

Introduction To Hive Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin(Evion) Kim Stanford University References: Cloudera Tutorials, CS345a session slides, Hadoop - The

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

ITG Software Engineering

ITG Software Engineering Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant

More information

Hadoop Streaming. Table of contents

Hadoop Streaming. Table of contents Table of contents 1 Hadoop Streaming...3 2 How Streaming Works... 3 3 Streaming Command Options...4 3.1 Specifying a Java Class as the Mapper/Reducer... 5 3.2 Packaging Files With Job Submissions... 5

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Spark Albert Bifet [email protected] October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Beyond Hadoop with Apache Spark and BDAS

Beyond Hadoop with Apache Spark and BDAS Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Introduction to Big Data with Apache Spark UC BERKELEY

Introduction to Big Data with Apache Spark UC BERKELEY Introduction to Big Data with Apache Spark UC BERKELEY This Lecture Programming Spark Resilient Distributed Datasets (RDDs) Creating an RDD Spark Transformations and Actions Spark Programming Model Python

More information

Parquet. Columnar storage for the people

Parquet. Columnar storage for the people Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li [email protected] Software engineer, Cloudera Impala Outline Context from various

More information

Scaling Up 2 CSE 6242 / CX 4242. Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Scaling Up 2 CSE 6242 / CX 4242. Duen Horng (Polo) Chau Georgia Tech. HBase, Hive CSE 6242 / CX 4242 Scaling Up 2 HBase, Hive Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le

More information

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am

More information

Data Intensive Computing Handout 6 Hadoop

Data Intensive Computing Handout 6 Hadoop Data Intensive Computing Handout 6 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

More information

Extreme computing lab exercises Session one

Extreme computing lab exercises Session one Extreme computing lab exercises Session one Miles Osborne (original: Sasa Petrovic) October 23, 2012 1 Getting started First you need to access the machine where you will be doing all the work. Do this

More information

Hadoop Scripting with Jaql & Pig

Hadoop Scripting with Jaql & Pig Hadoop Scripting with Jaql & Pig Konstantin Haase und Johan Uhle 1 Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 2 Introduction Goal: Compare two high level scripting languages

More information

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo. Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.com Hadoop Hadoop is an open source distributed platform for data storage

More information

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1 Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010

More information

Spark and the Big Data Library

Spark and the Big Data Library Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12 Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage

More information

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

CS 378 Big Data Programming. Lecture 2 Map- Reduce

CS 378 Big Data Programming. Lecture 2 Map- Reduce CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments

More information

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol Data Algorithms Mahmoud Parsian Beijing Boston Farnham Sebastopol Tokyo O'REILLY Table of Contents Foreword xix Preface xxi 1. Secondary Sort: Introduction 1 Solutions to the Secondary Sort Problem 3 Implementation

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh [email protected] October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide Hadoop: The Definitive Guide Tom White foreword by Doug Cutting O'REILLY~ Beijing Cambridge Farnham Köln Sebastopol Taipei Tokyo Table of Contents Foreword Preface xiii xv 1. Meet Hadoop 1 Da~! 1 Data

More information

Big Data Analytics with Cassandra, Spark & MLLib

Big Data Analytics with Cassandra, Spark & MLLib Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE

More information

Extreme computing lab exercises Session one

Extreme computing lab exercises Session one Extreme computing lab exercises Session one Michail Basios ([email protected]) Stratis Viglas ([email protected]) 1 Getting started First you need to access the machine where you will be doing all

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti

More information

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving

More information

Big Data Technology Pig: Query Language atop Map-Reduce

Big Data Technology Pig: Query Language atop Map-Reduce Big Data Technology Pig: Query Language atop Map-Reduce Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class MR Implementation This class

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

CS 378 Big Data Programming

CS 378 Big Data Programming CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information