Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone


1 Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone, Twitter: @carbone

2 Agenda
- Map Reduce
- Hadoop v1 limits
- Hadoop v2 and YARN
- Apache Spark
- Streaming: Spark vs Storm
- Machine Learning: Recommender System
- Use Case: Next Product To Buy
- Q&A

3 What's Hadoop
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
- Java framework for storing data and running data transformations on large clusters of commodity hardware
- Licensed under the Apache v2 license
- Created from Google's MapReduce, BigTable and Google File System (GFS) papers

4 HDFS: Distributed Storage
- Distributed, scalable, portable, reliable file system for the Hadoop framework.
- Metadata / data separation: Name Nodes hold the metadata, Data Nodes hold the data

5 Map Reduce
- Map(): parses the inputs and generates 0 to n <key, value> pairs
- Reduce(): aggregates (in WordCount, sums) all values of the same key and generates a <key, value> pair
WordCount example
- Each map takes a line as input and breaks it into words
- It emits a key/value pair of the word and 1
- Each reducer sums the counts for each word
- It emits a key/value pair of the word and the sum
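The WordCount steps above can be sketched with plain Scala collections. This is only an illustration of the map, shuffle and reduce phases on a single machine, not Hadoop's Java API; the object and function names (`WordCountSketch`, `mapLine`, `shuffle`, `reduceWord`) are made up for this example.

```scala
object WordCountSketch {
  // Map phase: one input line in, zero or more (word, 1) pairs out.
  def mapLine(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: group the emitted values by key, which the framework
  // would do between the map and reduce phases.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: one key and all of its values in, one (word, sum) pair out.
  def reduceWord(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Chains the three phases over a small in-memory input.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapLine)).map { case (w, cs) => reduceWord(w, cs) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog", "the fox")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

On a real cluster the map calls run in parallel on the Data Nodes that hold the input blocks, and the shuffle moves data over the network; here everything happens in one process.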

6 Map Reduce Data Node 1 Data Node 2

7 Map Reduce

8 Map Reduce

9 Map Reduce

10 Map Reduce

11 Hadoop MapReduce v1

12 Hadoop MapReduce v1

13 Hadoop MapReduce v1

14 Hadoop MapReduce v1 Not good for low-latency jobs on small datasets

15 Hadoop MapReduce v1 Good for off-line batch jobs on massive data

16 Hadoop 1: Batch ONLY, high-latency jobs
- HIVE (query), Pig (scripting), Cascading (accelerate dev.)
- MapReduce1 (cluster resource management + data processing): BATCH
- HDFS (redundant, reliable storage)

17 Hadoop2: Big Data Operating System
- Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
- Simultaneously and with predictable levels of service
- Data analysts and real-time applications
Stack, top to bottom:
- MapReduce1 (data processing, BATCH) alongside other data processing
- YARN (cluster resource management)
- HDFS (redundant, reliable storage)

18 Hadoop2: Big Data Operating System
- Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
- Simultaneously and with predictable levels of service
- Data analysts and real-time applications
Stack, top to bottom:
- BATCH (MapReduce)
- INTERACTIVE (Tez)
- ONLINE (HBase, HOYA)
- STREAMING (Storm, Samza, Spark Streaming)
- GRAPH (Giraph, GraphX)
- Machine Learning (Spark MLlib)
- In-Memory (Spark)
- OTHER (ElasticSearch)
- YARN (cluster resource management)
- HDFS (redundant, reliable storage)

19 Stinger.next

20 Stinger.next

21 Apache Spark is a fast and general engine for large-scale data processing.

22 The most active project
[Charts: patches and lines added, compared across MapReduce, Storm, YARN and Spark]

23 Spark won the Daytona GraySort contest!
- Sorted 100 TB of data on disk 3x faster than Hadoop MapReduce, using 10x fewer machines.
- Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

24 RDD & Operation
- Resilient Distributed Datasets (RDDs)
- Operations
  - Transformations (e.g. map, filter, groupBy)
  - Actions (e.g. count, collect, save)
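The split between transformations and actions matters because transformations are lazy: nothing is computed until an action asks for a result. The sketch below imitates that behaviour with a lazy view over a plain Scala list. It is an analogy only and does not use Spark; `LazyPipelineSketch` and `run` are names invented for this example.

```scala
object LazyPipelineSketch {
  // Returns (lines inspected before the action, lines inspected after it, matches).
  def run(lines: List[String]): (Int, Int, Int) = {
    var inspected = 0

    // "Transformation": describes a filter but does not run it yet,
    // because a view is evaluated lazily.
    val linesWithSpark = lines.view.filter { line =>
      inspected += 1
      line.contains("Spark")
    }
    val inspectedBeforeAction = inspected

    // "Action": asking for the size forces the pipeline to run.
    val matches = linesWithSpark.size
    (inspectedBeforeAction, inspected, matches)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, matches) =
      run(List("# Apache Spark", "Spark is fast", "Hadoop MapReduce"))
    println(s"inspected before action: $before, after action: $after, matches: $matches")
  }
}
```

In Spark the same idea lets the engine see the whole chain of transformations before scheduling any work on the cluster.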

25 Spark
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count()
res0: Long = 126
scala> textFile.first()
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4
scala> textFile.filter(line => line.contains("Spark")).count()
res3: Long = 15

26 Streaming

27 Storm

28 Storm

29 Storm vs Spark

                       | Spark Streaming                  | Storm            | Storm Trident
Scope                  | Batch, Streaming, Graph, ML, SQL | Streaming only   | Streaming only
Processing model       | Micro batches                    | Record-at-a-time | Micro batches
Throughput             |                                  |                  |
Latency                | Second                           | Sub-second       | Second
Reliability models     | Exactly once                     | At least once    | Exactly once
Embedded Hadoop distro | HDP, CDH, MapR                   | HDP              | HDP
Support                | Databricks                       | N/A              | N/A
Community              |                                  |                  |
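The "Processing model" row is the main reason for the latency difference: a record-at-a-time engine handles each event as it arrives, while a micro-batch engine waits for an interval to close and then processes everything that arrived in it. The following is a minimal sketch of the bucketing step in plain Scala. It uses neither the Spark Streaming nor the Storm API, and `Event`, `microBatches` and the 1000 ms interval are assumptions made for the example.

```scala
object MicroBatchSketch {
  // A hypothetical event: arrival time in milliseconds plus a payload.
  final case class Event(timestampMs: Long, payload: String)

  // Record-at-a-time: the handler runs once per event, in arrival order.
  def recordAtATime(events: Seq[Event])(handle: Event => Unit): Unit =
    events.foreach(handle)

  // Micro-batching: events are bucketed by a fixed interval. Each result entry
  // is (batch start time, events of that batch), ordered by start time.
  // Intervals in which no event arrived produce no entry in this sketch.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (bucket, es) => (bucket * intervalMs, es) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1100, "c"), Event(2500, "d"))
    microBatches(events, 1000).foreach { case (start, es) =>
      println(s"batch starting at $start ms: ${es.map(_.payload).mkString(", ")}")
    }
  }
}
```

With a one-second interval, an event that arrives just after a batch starts waits almost a full second before it is processed, which is consistent with the "Second" versus "Sub-second" latency entries in the table.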

30 Machine Learning Library (MLlib)

31 Collaborative Filtering

32 Collaborative Filtering (learning)

33 Collaborative Filtering (learning)

34 Collaborative Filtering (learning)

35 Collaborative Filtering: Let's use the model

36 Collaborative Filtering : similar behaviors

37 Collaborative Filtering Prediction

38 Netflix Prize (2009) Netflix is a provider of on-demand Internet streaming media

39 Input Data
UserID::MovieID::Rating::Timestamp
1::1193::5::
::661::3::
::914::3::
Etc
2::1357::5::
::3068::4::
::1537::4::
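A record in the `UserID::MovieID::Rating::Timestamp` layout can be parsed as sketched below. The `Rating` case class and `parse` function are names chosen for this example, and the timestamp in the usage line is a placeholder value, not data from the slides.

```scala
object RatingParserSketch {
  final case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // Splits on the "::" separator and converts the four fields.
  // Lines with missing or non-numeric fields yield None instead of throwing.
  def parse(line: String): Option[Rating] =
    line.split("::") match {
      case Array(u, m, r, t) =>
        scala.util.Try(Rating(u.toInt, m.toInt, r.toDouble, t.toLong)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // 1000 is a placeholder timestamp used only for illustration.
    println(parse("1::1193::5::1000"))
    // An incomplete record is rejected.
    println(parse("1::1193::5::"))
  }
}
```

Returning an `Option` keeps malformed lines from aborting a long-running job; the caller can count or log the rejected lines separately.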

40 Matrix Factorization
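In matrix factorization, each user and each item is represented by a short vector of latent factors, and a predicted rating is the dot product of the two vectors. The sketch below shows only the prediction and ranking step, assuming the factors have already been learned. The factor values are invented for illustration, and `FactorizationSketch`, `predict` and `recommend` are not MLlib functions.

```scala
object FactorizationSketch {
  type Factors = Vector[Double]

  // Predicted rating = dot product of the user's and the item's factor vectors.
  def predict(user: Factors, item: Factors): Double =
    user.zip(item).map { case (x, y) => x * y }.sum

  // Scores every item the user has not rated yet and keeps the n best,
  // highest predicted rating first.
  def recommend(user: Factors,
                items: Map[Int, Factors],
                alreadyRated: Set[Int],
                n: Int): Seq[(Int, Double)] =
    items.toSeq
      .filterNot { case (id, _) => alreadyRated(id) }
      .map { case (id, factors) => (id, predict(user, factors)) }
      .sortBy { case (_, score) => -score }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Hypothetical two-dimensional factors; real models typically use more dimensions.
    val user = Vector(1.0, 2.0)
    val items = Map(858 -> Vector(2.0, 1.0), 318 -> Vector(1.0, 2.0), 527 -> Vector(0.0, 1.0))
    recommend(user, items, alreadyRated = Set(527), n = 2).foreach(println)
  }
}
```

Learning the factors themselves is the expensive part and is what a library such as MLlib provides; once they exist, producing a top-N list per user is a cheap ranking step like the one above.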

41 The result
1 ; Lyndon Wilson ; ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; ; 356 ; Forrest Gump (19

42 Real Time Big Data Use Case Next Gen Data Marketing Platform Next Product To Buy

43 Ready for Omni-channel?
Traditional marketing: the current approach cannot keep up
- 200m people on the Do Not Call list
- 99.9% of online banners are never clicked.
- 44% of direct marketing is never opened.
- 86% of TV viewers skip commercials
- Buyers complete 60% of their research before reaching out to vendors.

44 Statement: Multi Channel, Cross Channel, Omni Channel, Consumer Graph

45 Next Product to Buy in Action 1 Open data Premium data

46 Next Product to Buy in Action 1 ERP Brand data CRM Loyalty Open data Premium data

47 Next Product to Buy in Action 2 ERP Brand data CRM Loyalty Open data Premium data

48 Next Product to Buy in Action 3 ERP Brand data CRM Loyalty Open data Premium data

49 Next Product to Buy in Action 4 ERP Brand data CRM Loyalty Open data Premium data

50 Next Product to Buy in Action 4 ERP Brand data CRM Loyalty Open data Premium data

51 Next Product to Buy in Action 4 ERP Brand data CRM Loyalty Open data Premium data

52 Next Product to Buy in Action 5 ERP Brand data CRM Loyalty Open data Premium data

53 Brand Premium Open Social Influans OnBoard Graph Suggest + Fine Tune + Social Interactions Engage Sales

54 Real Time Big Data Use Case: Next Gen Data Marketing Platform, Next Product To Buy
- Right Person
- Right Product
- Right Price
- Right Time
- Right Channel

55 Questions? We graph consumers. Cédric Carbone, cedric@influans.
