Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone

Transcription:

Agenda
  Map Reduce
  Hadoop v1 limits
  Hadoop v2 and YARN
  Apache Spark
  Streaming: Spark vs Storm
  Machine Learning: Recommender System
  Use Case: Next Product To Buy
  Q&A

What's Hadoop?
  The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
  Java framework for storing data and running data transformations on large clusters of commodity hardware
  Licensed under the Apache v2 license
  Based on Google's MapReduce, BigTable and Google File System (GFS) papers

HDFS: Distributed Storage
  Distributed, scalable, portable, reliable file system for the Hadoop framework.
  Metadata / data separation: Name Nodes (metadata), Data Nodes (data)

Map Reduce
  Map(): parses inputs and generates 0 to n <key, value> pairs
  Reduce(): aggregates all values of the same key (sums them, in the WordCount example) and generates a <key, value> pair
  WordCount example
    Each map takes a line as input and breaks it into words
    It emits a key/value pair of the word and 1
    Each reducer sums the counts for each word
    It emits a key/value pair of the word and its sum
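The WordCount steps above can be sketched with plain Scala collections. This is only an illustration of the map, shuffle and reduce phases on a single machine, not Hadoop's Java API; the object and function names (WordCountSketch, mapPhase, shuffle, reducePhase) are made up for this example.

```scala
// Word count written as explicit map, shuffle and reduce phases,
// using plain Scala collections (no Hadoop or Spark required).
object WordCountSketch {
  // Map phase: one input line produces zero or more (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: gather all emitted values that share the same key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum the values of one key and emit (word, sum).
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Full pipeline: map every line, shuffle by word, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (w, cs) => reducePhase(w, cs) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves data across the network so that each reducer sees every value for its keys.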

Map Reduce (diagram: Data Node 1, Data Node 2)

Hadoop MapReduce v1

Hadoop MapReduce v1: not suited to low-latency jobs on small datasets

Hadoop MapReduce v1: well suited to offline batch jobs on massive data

Hadoop 1: batch ONLY, high-latency jobs
  Stack (top to bottom):
    Hive (query), Pig (scripting), Cascading (accelerate dev.)
    MapReduce1 (cluster resource management + data processing, BATCH)
    HDFS (redundant, reliable storage)

Hadoop2: Big Data Operating System
  Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service
  Data analysts and real-time applications
  Stack (top to bottom):
    MapReduce1 (data processing, BATCH), other data processing
    YARN (cluster resource management)
    HDFS (redundant, reliable storage)

Hadoop2: Big Data Operating System
  Stack (top to bottom):
    BATCH (MapReduce)
    INTERACTIVE (Tez)
    ONLINE (HBase, HOYA)
    STREAMING (Storm, Samza, Spark Streaming)
    GRAPH (Giraph, GraphX)
    Machine Learning (Spark MLlib)
    In-Memory (Spark)
    OTHER (ElasticSearch)
    YARN (cluster resource management)
    HDFS (redundant, reliable storage)

Stinger.next

https://spark.apache.org Apache Spark is a fast and general engine for large-scale data processing.

The most active project (charts: patches and lines added, comparing MapReduce, Storm, YARN and Spark)

Spark won the Daytona GraySort contest! It sorted 100 TB of data on disk 3x faster than Hadoop MapReduce, using 10x fewer machines. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

RDDs & Operations
  Resilient Distributed Datasets (RDDs)
  Operations
    Transformations (e.g. map, filter, groupBy)
    Actions (e.g. count, collect, save)
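The split between transformations and actions matters because, in Spark, transformations are lazy and only an action triggers computation. The sketch below imitates that behaviour with a plain Scala Iterator rather than the Spark API, so it is only an analogy: an Iterator can be consumed once, whereas an RDD can be reused. The names LazyPipelineSketch and run are made up for this example.

```scala
// Lazy "transformations" versus an eager "action", imitated with a Scala Iterator.
object LazyPipelineSketch {
  // Returns (evaluations before the action, evaluations after it, result of the action).
  def run(): (Int, Int, Int) = {
    var evaluated = 0 // counts how many elements the map function has touched

    // "Transformations": chaining map and filter does no work yet.
    val pipeline = (1 to 10).iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 3 == 0)

    val before = evaluated // nothing has been computed so far

    // "Action": asking for a result forces the whole pipeline to run.
    val count = pipeline.size

    (before, evaluated, count)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, count) = run()
    println(s"evaluated before action: $before, after: $after, count: $count")
  }
}
```

Deferring the work until an action is requested lets the engine see the whole chain of operations before it runs anything.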

Spark
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count()
res0: Long = 126
scala> textFile.first()
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4
scala> textFile.filter(line => line.contains("Spark")).count()
res3: Long = 15

Streaming

Storm

Storm vs Spark
Scope: Spark = Batch, Streaming, Graph, ML, SQL; Storm = Streaming only
                        Spark Streaming   Storm             Storm Trident
Processing model        Micro batches     Record-at-a-time  Micro batches
Throughput              ++++              ++                ++++
Latency                 Second            Sub-second        Second
Reliability models      Exactly once      At least once     Exactly once
Embedded Hadoop distro  HDP, CDH, MapR    HDP               HDP
Support                 Databricks        N/A               N/A
Community               ++++              ++                ++
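The "Processing model" row is the main architectural difference in the comparison. A minimal sketch of the two models in plain Scala follows; it uses neither the Storm nor the Spark Streaming API, and the names ProcessingModelsSketch, recordAtATime and microBatches are made up for this example. Real micro-batch systems cut batches by time interval; a fixed batch size is assumed here to keep the sketch deterministic.

```scala
// Record-at-a-time versus micro-batch processing, in plain Scala.
object ProcessingModelsSketch {
  // Record-at-a-time: the handler is invoked once per incoming record.
  // Returns the number of handler invocations.
  def recordAtATime(records: Seq[String])(handle: String => Unit): Int = {
    records.foreach(handle)
    records.size
  }

  // Micro-batching: records are buffered into small groups and the handler
  // is invoked once per group. Returns the number of handler invocations.
  def microBatches(records: Seq[String], batchSize: Int)(handle: Seq[String] => Unit): Int = {
    val batches = records.grouped(batchSize).toList
    batches.foreach(handle)
    batches.size
  }

  def main(args: Array[String]): Unit = {
    val events = Seq("e1", "e2", "e3", "e4", "e5")
    val perRecord = recordAtATime(events)(e => println(s"record: $e"))
    val perBatch = microBatches(events, batchSize = 2)(b => println(s"batch: ${b.mkString(",")}"))
    println(s"handler calls: record-at-a-time = $perRecord, micro-batch = $perBatch")
  }
}
```

Fewer, larger handler calls amortise per-call overhead, which is consistent with the higher throughput in the table, while waiting for a batch to fill is consistent with the latency of about a second.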

Machine Learning Library (MLlib)

Collaborative Filtering

Collaborative Filtering (learning)

Collaborative Filtering: Let's use the model

Collaborative Filtering : similar behaviors

Collaborative Filtering Prediction

Netflix Prize (2009)
  Netflix is a provider of on-demand Internet streaming media

Input Data
UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
Etc.
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620
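Lines in the format shown above can be parsed by splitting on the "::" separator. The sketch below is one possible way to do it in plain Scala; the case class RatingRecord and the function parse are hypothetical names for this example and are not taken from the slides or from MLlib.

```scala
// Parsing "UserID::MovieID::Rating::Timestamp" lines into typed records.
object RatingParserSketch {
  // One parsed input line (hypothetical type for this sketch).
  final case class RatingRecord(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // Returns None for lines that do not contain four well-formed fields.
  def parse(line: String): Option[RatingRecord] =
    line.trim.split("::") match {
      case Array(u, m, r, t) =>
        scala.util.Try(RatingRecord(u.toInt, m.toInt, r.toDouble, t.toLong)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1::1193::5::978300760", "1::661::3::978302109", "not a rating")
    // flatMap silently drops the lines that failed to parse.
    lines.flatMap(l => parse(l)).foreach(println)
  }
}
```

Returning an Option keeps malformed lines from aborting the whole load, which is a design choice for this sketch rather than a requirement.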

Matrix Factorization
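The slide gives only the title, so the formulation below is the usual textbook one and an assumption about what was presented. Each user u gets a latent vector p_u and each movie i a latent vector q_i; the predicted rating is their dot product, and the vectors are fitted on the set K of known ratings r_ui with a regularization weight lambda. The symbols p_u, q_i, K and lambda are notation chosen for this sketch.

```latex
\hat{r}_{ui} = p_u^{\top} q_i,
\qquad
\min_{P,\,Q} \sum_{(u,i) \in K}
  \left[ \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]
```

Spark MLlib's collaborative filtering fits this kind of model with alternating least squares (ALS): it fixes the movie vectors and solves for the user vectors, then swaps the roles, and repeats.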

The result
1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19

Real Time Big Data Use Case: Next Gen Data Marketing Platform, Next Product To Buy

Ready for Omni-channel?
Traditional marketing: the current approach cannot keep up
  200m people on the Do Not Call list
  99.9% of online banners are never clicked
  44% of direct marketing is never opened
  86% of TV viewers skip commercials
  Buyers complete 60% of their research before reaching out to vendors

Statement
  2000: Multi Channel
  2010: Cross Channel
  2013: Omni Channel
  2015: Consumer Graph

Next Product to Buy in Action 1 Open data Premium data

Next Product to Buy in Action 1 ERP Brand data CRM Loyalty Open data Premium data

Next Product to Buy in Action 2 ERP Brand data CRM Loyalty Open data Premium data

Next Product to Buy in Action 3 ERP Brand data CRM Loyalty Open data Premium data

Next Product to Buy in Action 4 ERP Brand data CRM Loyalty Open data Premium data

Next Product to Buy in Action 5 ERP Brand data CRM Loyalty Open data Premium data

Brand Premium Open Social Influans OnBoard Graph Suggest + Fine Tune + Social Interactions Engage Sales

Real Time Big Data Use Case: Next Gen Data Marketing Platform, Next Product To Buy
  Right Person, Right Product, Right Price, Right Time, Right Channel

Questions?
We graph consumers
Cédric Carbone cedric@influans.com @carbone
www.hugfrance.fr hug-france-orga@googlegroups.com @hugfrance