Hadoop2, Spark: Big Data, real time, machine learning & use cases
  Cédric Carbone
  Twitter: @carbone
Agenda
  Map Reduce
  Hadoop v1 limits
  Hadoop v2 and YARN
  Apache Spark
  Streaming: Spark vs Storm
  Machine Learning: Recommender System
  Use Case: Next Product To Buy
  Q&A
What is Hadoop?
  The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
  Java framework for storing and transforming data on large clusters of commodity hardware
  Licensed under the Apache v2 license
  Based on the ideas in Google's MapReduce, BigTable and Google File System (GFS) papers
HDFS: Distributed Storage
  Distributed, scalable, portable, reliable file system for the Hadoop framework.
  Metadata / data separation: NameNodes hold the metadata, DataNodes hold the data blocks.
Map Reduce
  Map(): parses the inputs and generates 0 to n <key, value> pairs
  Reduce(): combines all values of the same key (a sum, in WordCount) and generates a <key, value> pair
  WordCount example
  Each map takes a line as input and breaks it into words
  It emits a key/value pair of the word and 1
  Each reducer sums the counts for each word
  It emits a key/value pair of the word and the sum
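The WordCount steps above can be sketched with plain Scala collections. This is only an illustration that can be pasted into the Scala REPL: it does not use the Hadoop API, the names `mapPhase`, `reducePhase` and `shuffled` are made up for this example, and the in-memory `groupBy` stands in for the shuffle that the framework performs between the map and reduce phases.

```scala
// Plain Scala collections stand in for a cluster; no Hadoop API is used.
val lines = Seq("to be or not to be", "to do is to be")

// Map phase: each line is split into words, and each word emits a (word, 1) pair.
def mapPhase(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

// Shuffle: group the emitted pairs by key (done by the framework on a real cluster).
val shuffled: Map[String, Seq[Int]] =
  lines.flatMap(line => mapPhase(line))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce phase: sum the counts of each word and emit (word, sum).
def reducePhase(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

val wordCounts: Map[String, Int] =
  shuffled.map { case (word, counts) => reducePhase(word, counts) }

println(wordCounts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reducer receives all the values for its keys.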
Map Reduce [Diagram: Data Node 1, Data Node 2]
Hadoop MapReduce v1
Hadoop MapReduce v1: not good for low-latency jobs on small datasets
Hadoop MapReduce v1: good for off-line batch jobs on massive data
Hadoop 1: batch ONLY, high-latency jobs
  Stack, top to bottom:
  Hive (query) | Pig (scripting) | Cascading (accelerate development)
  MapReduce1 (cluster resource management + batch data processing)
  HDFS (redundant, reliable storage)
Hadoop2: Big Data Operating System
  Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service
  Data analysts and real-time applications
  Stack, top to bottom:
  MapReduce1 (batch data processing) | Other data processing
  YARN (cluster resource management)
  HDFS (redundant, reliable storage)
Hadoop2: Big Data Operating System
  Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service
  Data analysts and real-time applications
  Stack, top to bottom:
  BATCH (MapReduce) | INTERACTIVE (Tez) | ONLINE (HBase, HOYA) | STREAMING (Storm, Samza, Spark Streaming) | GRAPH (Giraph, GraphX) | MACHINE LEARNING (Spark MLlib) | IN-MEMORY (Spark) | OTHER (ElasticSearch)
  YARN (cluster resource management)
  HDFS (redundant, reliable storage)
Stinger.next
https://spark.apache.org Apache Spark is a fast and general engine for large-scale data processing.
The most active project [Charts: patches and lines added, compared across MapReduce, Storm, YARN and Spark]
Spark won the Daytona GraySort contest!
  Sorted 100 TB of data on disk 3x faster than Hadoop MapReduce, using 10x fewer machines.
  Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
RDDs & Operations
  Resilient Distributed Datasets (RDDs)
  Operations
  Transformations (e.g. map, filter, groupBy)
  Actions (e.g. count, collect, save)
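The split between transformations and actions can be illustrated without a cluster. The sketch below is an analogy only: it uses a lazy Scala collection view instead of an RDD, so it is local and has none of the distribution or fault tolerance of Spark. The sample lines and the counter `linesInspected` are made up for the example. It shows that describing a filter does no work until something asks for a result.

```scala
// Analogy only: a Scala view is lazy like a chain of RDD transformations,
// but it is a local collection, not a distributed dataset.
val lines = Seq("# Apache Spark", "Spark is fast", "Hadoop MapReduce", "Spark runs on YARN")

var linesInspected = 0 // counts how many times the filter predicate actually runs

// "Transformation": describes the computation but does not run it yet.
val linesWithSpark = lines.view.filter { line =>
  linesInspected += 1
  line.contains("Spark")
}
val inspectedBeforeAction = linesInspected

// "Action": forces evaluation and returns a result to the caller.
val matching = linesWithSpark.toList
val inspectedAfterAction = linesInspected

println(s"before: $inspectedBeforeAction, after: $inspectedAfterAction, matches: ${matching.size}")
```

In Spark the same idea lets the engine see the whole chain of transformations before it schedules any work.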
Spark
  scala> val textFile = sc.textFile("README.md")
  textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
  scala> textFile.count()
  res0: Long = 126
  scala> textFile.first()
  res1: String = # Apache Spark
  scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
  linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4
  scala> textFile.filter(line => line.contains("Spark")).count()
  res3: Long = 15
Streaming
Storm
Storm vs Spark

                          Spark Streaming                     Storm               Storm Trident
  Scope                   Batch, Streaming, Graph, ML, SQL    Streaming only      Streaming only
  Processing model        Micro batches                       Record-at-a-time    Micro batches
  Throughput              ++++                                ++                  ++++
  Latency                 Second                              Sub-second          Second
  Reliability models      Exactly once                        At least once       Exactly once
  Embedded Hadoop distro  HDP, CDH, MapR                      HDP                 HDP
  Support                 Databricks                          N/A                 N/A
  Community               ++++                                ++                  ++
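The difference between the two processing models in the table can be sketched in a few lines of plain Scala. Nothing here uses the Storm or Spark Streaming APIs: the `Event` type, the sample events and the 1-second `batchIntervalMs` are assumptions made for the illustration.

```scala
// Hypothetical event type: a timestamp in milliseconds and a payload.
case class Event(timestampMs: Long, payload: String)

val events = Seq(
  Event(100L, "a"), Event(900L, "b"), Event(1200L, "c"),
  Event(1800L, "d"), Event(2500L, "e")
)

// Record-at-a-time: the handler is invoked once per event.
val perRecord: Seq[String] = events.map(e => s"handled ${e.payload}")

// Micro-batch: events are bucketed by a fixed interval (1 second, an assumed value)
// and each bucket is then processed as one small batch.
val batchIntervalMs = 1000L
val microBatches: Seq[(Long, Seq[String])] =
  events
    .groupBy(_.timestampMs / batchIntervalMs)
    .toSeq
    .sortBy(_._1)
    .map { case (batchId, batch) => (batchId, batch.map(_.payload)) }

microBatches.foreach { case (id, payloads) => println(s"batch $id: ${payloads.mkString(",")}") }
```

A result for a micro-batch is only available once its interval has closed, which is consistent with the second-level latency shown in the table, while a record-at-a-time handler can react to each event as soon as it arrives.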
Machine Learning Library (MLlib)
Collaborative Filtering
Collaborative Filtering (learning)
Collaborative Filtering: Let's use the model
Collaborative Filtering : similar behaviors
Collaborative Filtering Prediction
Netflix Prize (2009) Netflix is a provider of on-demand Internet streaming media
Input Data
  UserID::MovieID::Rating::Timestamp
  1::1193::5::978300760
  1::661::3::978302109
  1::914::3::978301968
  Etc
  2::1357::5::978298709
  2::3068::4::978299000
  2::1537::4::978299620
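A minimal sketch of how lines in this format could be parsed, in plain Scala. The `Rating` case class and the `parseRating` helper are hypothetical names introduced for this example, and the choice of `Int`, `Double` and `Long` for the fields is an assumption based on the sample values.

```scala
// Hypothetical record type mirroring the UserID::MovieID::Rating::Timestamp header.
case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

// Returns None for lines that do not have four well-formed fields (e.g. "Etc").
def parseRating(line: String): Option[Rating] =
  line.trim.split("::") match {
    case Array(user, movie, score, ts) =>
      scala.util.Try(Rating(user.toInt, movie.toInt, score.toDouble, ts.toLong)).toOption
    case _ => None
  }

val sample = Seq("1::1193::5::978300760", "1::661::3::978302109", "Etc", "2::1357::5::978298709")

// Malformed lines are dropped; well-formed lines become Rating records.
val ratings: Seq[Rating] = sample.flatMap(line => parseRating(line))

ratings.foreach(println)
```

Returning an `Option` keeps malformed lines from aborting the whole load, which matters when the input is large.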
Matrix Factorization
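In matrix factorization, each user and each movie is represented by a short vector of latent factors, and a predicted rating is the dot product of the two vectors. The sketch below shows only that prediction step in plain Scala. The factor values are made up for illustration and are not the output of a trained model, and `predict` and `recommend` are hypothetical helper names. Learning the factors (for example with alternating least squares) is not shown.

```scala
// Latent factors below are made-up illustrative values, not a trained model.
val userFactors: Map[Int, Array[Double]] =
  Map(1 -> Array(1.0, 0.5), 2 -> Array(0.2, 1.5))
val movieFactors: Map[Int, Array[Double]] =
  Map(858 -> Array(3.0, 2.0), 318 -> Array(2.0, 3.0))

// Predicted rating = dot product of the user's and the movie's factor vectors.
def predict(userId: Int, movieId: Int): Double =
  userFactors(userId).zip(movieFactors(movieId)).map { case (u, m) => u * m }.sum

// Rank the candidate movies for one user by predicted rating, highest first.
def recommend(userId: Int, topN: Int): Seq[(Int, Double)] =
  movieFactors.keys.toSeq
    .map(movieId => (movieId, predict(userId, movieId)))
    .sortBy { case (_, score) => -score }
    .take(topN)

println(recommend(1, 2))
```

Sorting every movie by predicted score for a given user yields a ranked list of the same shape as the per-user results in this deck.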
The result
  1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
  1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
  1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
  1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
  1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
  2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
  2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
  2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
  2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
  2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
  3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
  3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
  3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19
Real Time Big Data Use Case Next Gen Data Marketing Platform Next Product To Buy
Ready for Omni-channel?
  Traditional marketing: the current approach cannot keep up
  200m people are on the Do Not Call list
  99.9% of online banners are never clicked
  44% of direct marketing is never opened
  86% of TV viewers skip commercials
  Buyers complete 60% of their research before reaching out to vendors
Statement
  2000: Multi Channel
  2010: Cross Channel
  2013: Omni Channel
  2015: Consumer Graph
Next Product to Buy in Action [Diagram, steps 1 to 5. Data sources: ERP, Brand data, CRM, Loyalty, Open data, Premium data]
[Diagram: Brand, Premium, Open and Social data feed Influans: OnBoard, Graph, Suggest (+ Fine Tune, + Social Interactions), Engage, Sales]
Real Time Big Data Use Case
  Next Gen Data Marketing Platform
  Next Product To Buy
  Right Person, Right Product, Right Price, Right Time, Right Channel
Questions?
  We graph consumers
  Cédric Carbone
  cedric@influans.com
  @carbone
  www.hugfrance.fr
  hug-france-orga@googlegroups.com
  @hugfrance