Building Recommenda/on Pla1orm with Hadoop. Jayant Shekhar

Similar documents
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Hadoop Ecosystem B Y R A H I M A.

Workshop on Hadoop with Big Data

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Hadoop Job Oriented Training Agenda

ITG Software Engineering

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Cloudera Certified Developer for Apache Hadoop

Big Data Course Highlights

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Big Data and Data Science: Behind the Buzz Words

Data processing goes big

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Hadoop in Social Network Analysis - overview on tools and some best practices - Headline Goes Here

The Future of Data Management

Case Study : 3 different hadoop cluster deployments

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

ITG Software Engineering

Comprehensive Analytics on the Hortonworks Data Platform

BIG DATA - HADOOP PROFESSIONAL amron

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Analysis of Web Archives. Vinay Goel Senior Data Engineer

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

Mammoth Scale Machine Learning!

Constructing a Data Lake: Hadoop and Oracle Database United!

Implement Hadoop jobs to extract business value from large and varied data sets

Qsoft Inc

Data Analyst Program- 0 to 100

The Future of Data Management with Hadoop and the Enterprise Data Hub

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

How to Hadoop Without the Worry: Protecting Big Data at Scale

Apache Hama Design Document v0.6

How Companies are! Using Spark

HDP Hadoop From concept to deployment.

HADOOP. Revised 10/19/2015

Complete Java Classes Hadoop Syllabus Contact No:

Big Data: Tools and Technologies in Big Data

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Building Scalable Big Data Pipelines

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

The Hadoop Eco System Shanghai Data Science Meetup

Hadoop: The Definitive Guide

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Hadoop and Map-Reduce. Swati Gore

Dominik Wagenknecht Accenture

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec

COURSE CONTENT Big Data and Hadoop Training

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

BIG DATA HADOOP TRAINING

Peers Techno log ies Pv t. L td. HADOOP

Important Notice. (c) Cloudera, Inc. All rights reserved.

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Big Data and Scripting Systems build on top of Hadoop

Big Data Too Big To Ignore

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

KNIME & Avira, or how I ve learned to love Big Data

The Internet of Things and Big Data: Intro

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Deploying Hadoop with Manager

Unified Big Data Processing with Apache Spark. Matei

Advanced Big Data Analytics with R and Hadoop

Apache Hadoop: Past, Present, and Future

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Moving From Hadoop to Spark

Apache HBase. Crazy dances on the elephant back

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Advanced In-Database Analytics

How To Scale Out Of A Nosql Database

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Using distributed technologies to analyze Big Data

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Certified Big Data and Apache Hadoop Developer VS-1221

Best Practices for Hadoop Data Analysis with Tableau

Transcription:

Building Recommenda/on Pla1orm with Hadoop Jayant Shekhar 1

Agenda Why Big Data Recommenda/on Pla1orm? Architecture Design & Implementa/on Common Recommenda/on PaEerns 2

Recommendations is one of the commonly used use cases of Hadoop Recommendations Broader Use Cases Product Recommendation People/Social Recommendation Merchant Recommendation Content Recommendation Query Recommendation Sponsored Search Advertising Recommendations can be Realtime News Recommendations Merchant/Offer Recommendations on mobile Offline Similar Profiles/Resumes In Between Recommendation Patterns People who viewed this also viewed Content/Product Recommendation Frequently Bought Together Related Searches or Query Recommendation Related Articles/News People/Social Recommendation 3

Recommendations are delivered through Data Sets Involved Web Mobile Email Postal mail Newspaper/Magazine ads Different Time to Action Targeted Items/Products/Content Transaction Data User Data Logs & User Activity Additional 3 rd Party Data Geo Social Reviews User would view the content now/buy the Product Now User would buy the product in a week Next when he/she goes grocery shopping User would buy the product in the next 3 months TV/Dishwasher etc. Vacation Also need to be able to determine/differentiate between the users in a household 4

In Summary Users have lots of options for most things Their attention span is very limited Highly targeted Recommendations stand a much better chance of conversions/ better user experience Larger datasets of items, user behavior, related data (from 3 rd party etc.) are needed for targeting Recommendations Use Cases are Many, Fast Changing & Complex Having an agile, flexible, scalable, end-to-end Platform is the way to go Hence we need a Big Data Platform for Recommendations 5

Architecture 6

Components A Hadoop based Recommendation Platform has to provide the following: ETL Feature Generation Recommendation Algorithms/Model Generation Workflows & Scheduling Serving Systems Support Testing of the Models Measure Performance 7

Recommendation Platform RDBMS Logs Sqoop HIVE/PIG/ Impala HBase Real-time Interactions Web Server Logs Flume Giraph HDFS Mahout, etc Recommenda/on Server Web Server Logs Oozie ZooKeeper Solr Web Server SAN 8

Feature Generation & Model Building HIVE/PIG/Impala HBase Real- &me Analy&cs Data Data Model Generation Model Model Data Data Feature Generation Model Model Data Data HDFS Model Model Hadoop enables easy iteration over the process of Model Generation and testing it out offline. 9

Serving Systems Real-time Content HBase Web Server Recommenda/on Server Web Server Docs HDFS Pre- calculated Results Features Models Solr Web Server Real-time Content Web Server 10

Components/Design 11

ETL RDBMS Logs Sqoop HIVE/PIG/ Impala HBase Mahout, etc Logs Flume HDFS Logs Oozie ZooKeeper HTTP SAN 12

ETL/Flume Web Server Interceptors Serializers Web Server Flume Flume HBase HDFS Web Server Flume Web Server Has concept of Source/Channel/Sink *Flume User Guide 13

Flume agent.sources.s1.interceptors.i1.type = org.apache.flume.interceptor.hostinterceptor$builder agent.sinks.hdfs- sink.serializer = com.cloudera.logs.flume.avro.myserializer$builder agent_foo.sources = s1 agent_foo.sinks = k1 agent_foo.channels = c1 # proper/es of s1 agent_foo.sources.s1.type = avro agent_foo.sources.s1.bind = localhost agent_foo.sources.s1.port = 10000 agent_foo.sources.s1.channels = c1 # proper/es of c1 agent_foo.channels.c1.type = memory agent_foo.channels.c1.capacity = 1000 agent_foo.channels.c1.transac/oncapacity = 100 # proper/es of k1 agent_foo.sinks.k1.type = hdfs agent_foo.sinks.k1.hdfs.path = hdfs:// analy/cs- 1.cloudera.com/user/jayant/flume1 agent_foo.sinks.k1.hdfs.rollsize = 3000000 agent_foo.sinks.k1.hdfs.rollinterval = 60 agent_foo.sinks.k1.hdfs.rollcount = 0 agent_foo.sinks.k1.hdfs.filetype = DataStream agent_foo.sinks.k1.hdfs.fileprefix = logs- agent_foo.sinks.k1.hdfs.writeformat = Text agent_foo.sinks.k1.serializer = avro_event agent_foo.sinks.k1.channel = c1 flume- ng agent - - conf- file avro.conf - - name agent_foo - Dflume.root.logger=INFO,console flume- ng avro- client - H localhost - p 10000 - F./error_log 14

ETL Transform Load Flume Interceptors, Serializers etc. Map Reduce HIVE PIG Streaming Data loaded onto HDFS/HIVE Data loaded into HBase Data loaded into external systems RDBMS/No SQL Sessionization Field Extraction Match/Dedup.. 15

HIVE & Apache Logs CREATE EXTERNAL TABLE apachelog ( remote_ip STRING, 66.249.68.6 - - [14/Jan/2012:06:25:03-0800] request_date STRING, "GET /example.com HTTP/1.1" 200 708 "- " method STRING, "Mozilla/5.0 (compa/ble; Googlebot/2.1; request STRING, +hep://www.google.com/bot.html)" protocol STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.regexserde' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]).. [([^]]+)] \"([^ ]) ([^ ]) ([^ \"])\" *", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s" ) STORED AS TEXTFILE loca/on '/user/jayant/apachelog' ; select remote_ip from apachelog; 16

HIVE & Transforms add file t2.pl; INSERT OVERWRITE TABLE processed_logs SELECT t.x, t.y, t.z from ( ) t; SELECT transform( body) using 't2.pl' as (x, y, z) from mylogs 17

Feature Generation Generates the features needed for the Recommendation Algorithms Various groups can share the features / work done by others on the Platform Daily Query Volume Number of transaction for a user at a Merchant in the last X months Sentiment of a user for a given merchant Distance of a User from a Merchant List of merchants visited by a user in a trip Number of times the user viewed an item in the last week Items bought together 18

Feature Generation Hadoop and HBase are ideal for Building and Sharing of features offline on HDFS MR/HIVE/PIG/Impala/Mahout/Streaming provide extreme flexibility in writing code for Feature Generation Makes it tremendously easy to share features, results, iterate and innovate etc. MR/HIVE/PIG/Streaming Hbase Real- &me Analy&cs Data Data Data Transformation Feature Generation Applying ML Data Data HDFS Data 19

Workflows with Oozie Start Map Reduce PIG Java/ Mahout End FS HIVE Fork Join Decision Sqoop oozie job -oozie http://xyz.com:11000/oozie -config job.properties -run Can also be triggered by time frequency and data availability 20

ML on Hadoop Different users prefer different set of tools for Data Science/ ML on Hadoop Mahout Impala/HIVE/ PIG RHIPE Rhadoop 21

Model Generation Mahout/HIVE/PIG/R Data Data Model Generation Model Model Data Data Clustering / Collaborative Filering Model Model Data Data HDFS Model Model 22

ML Tools on Hadoop Mahout ML Algorithms in Java/MR Input as text or SequenceFiles. Can write your own embedded Driver Programs Can control the output format with your own embedded code PIG RANK/sample functions Supports UDFs Has built in math UDFs Streaming foreach/group by/order by HIVE Sampling Supports sampling bucketized tables Block sampling Hook into streaming with Transform UDFs Group by/order by CREATE TABLE mymodel AS SELECT fit_logis/c(actual, features) model FROM obs1; CREATE TABLE test_model AS SELECT obs1.actual, predict(features, model) as predicted FROM obs1 JOIN model_out; 23

ML Tools on Hadoop Streaming with PIG trans= load my_transac/ons' as (id, item_id, dt, amt); highdivs = stream trans through `calculate.pl` as (id, score); calculate.pl is invoked once on every map or reduce task 24

ML Tools R Popular tool with Data Scientists & Analysts Has numerical and visualization methods for statistics & machine learning Strong support for statistical analysis, predictive analytics, data mining, etc. Use Reduce Script to invoke R script on a list of records RHadoop rmr MapReduce rhdfs R interface to HDFS rhbase R interface to HBase RHIPE Provides access to HDFS and MR from within the R environment Requires R, RHIPE and protobuf installed on all the data nodes. 25

ML Algorithms & Mahout Collaborative Filtering Clustering Classification Pattern Mining MR with Aggregation Decision Tree Content Analysis Collaborative Filtering Item Based User Based Slope One Recommender Clustering K-means Clustering Canopy Clustering Fuzzy K-Means Classification Pattern Mining Logistic Regression Naïve Bayes Complementary Naïve Bayes Random Forests Parallel FP Growth 26

Mahout - Clustering K-Means hadoop jar my.jar com.cloudera.clustering.featuregen.kmeansinputgenerator input output hadoop jar org.apache.mahout.vectorizer.sparsevectorsfromsequencefiles -i kmeans_input mahout seq2sparse -i kmeans_input o tfidf-vectors hadoop jar my.jar com.cloudera.clustering.kmeans.kmeansdriver -i../tfidf-vectors -k 8000 - dm org.apache.mahout.common.distance.cosinedistancemeasure -x 10 -ow 27

Item Based Collaborative Filtering Build an item-item matrix determining relationships between pairs of items Using the matrix, and the data on the current user, infer his/her taste UserID, ItemID, Ra.ng 10, 1000, 5.0 10, 1001, 3.0 10, 1004, 2.5 13, 1001, 3.5 13, 1002, 4.5 13, 1003, 1.0 13, 1004, 3.5 15, 1000, 4.5 15, 1001, 3.5 15, 1002, 2.5 1000 1001 1002 1003 1004 1000 2 2 1 0 1 1001 2 3 2 1 2 1002 1 2 2 1 1 1003 0 1 1 1 1 1004 1 2 1 1 2 28

Collaborative Filtering 1000 1001 1002 1003 1004 UserID R 1000 2 2 1 0 1 5.0 18.5 1001 2 3 2 1 2 1002 1 2 2 1 1 x 3.0 0.0 = 24 13.5 1003 0 1 1 1 1 0.0 5.5 1004 1 2 1 1 2 2.5 16 R value for item 1002: 1 ( 5.0 ) + 2 ( 3.0 ) + 2 ( 0.0 ) + 1 ( 0.0 ) + 1 ( 2.5 ) == 13.5 29

Mahout Collaborative Filtering Item Based Collaborative Filtering hadoop jar mahout.jar org.apache.mahout.cf.taste.hadoop.similarity.item.itemsimilarityjob - - input /user/xyz/ra/ngs - - output preference - m 10 Can also be used to generate features for say Logis/c Regression 30

Trends, Aggregation & Counters Most Popular Searches/Downloads/News Articles/Movies/Products Load results into HBase Use HBase where we need NRT count of things (categories/products etc.) Impala is very useful here for faster SLAs HBase Counters Has concept of Incrementing column values Avoids lock row/read value/increment it/write it back/unlock rows Great for counting specific metrics over time Example - count per URL/Product Can disable write to WAL on puts 31

Graph Processing with Giraph on Hadoop Provides a more natural way to model graph problems One implements a Vertex, which has a value and edges and is able to send and receive messages to other ver/ces in the graph as the computa/on iterates. Bulk Synchronous Processing compu/ng model. Sent messages are available at the next superstep during compute Vertex to Vertex messages delivered in the next superstep Computa/on completes when all components complete Runs as a single map only job. Executes the compute method for every Vertex it is assigned once per Superstep 32

Graph Processing with Giraph on Hadoop Use Cases Recommend Missing Links in a Social Network How are users connected Clustering find related people in groups Iterative Graph Ranking 33

Offline Training & Testing of Models Hadoop provides an excellent platform to train and test out the Models and various Algorithms Train Model Test Score Training Set Test Set 34

A/B Testing A/B testing is used to test the performance of the Models online X% New Model Traffic (100-X)% Old Model A/B testing involves: Partitioning real traffic to two models and then measuring the performance to the desired result (maximize CTR, revenue, page views etc.). The partitioning logic can get complicated. In such cases they can be pre-computed on Hadoop offline and pushed to an online store. 35

Serving Systems Real-time Content HBase Web Server Recommenda/on Server Web Server Docs HDFS Pre- calculated Results Features Models Index Solr Web Server Real-time Content Load Pre-computed stuff into HBase Models generated offline are loaded into the serving systems to compute in real-time Cache recommendation for certain time periods and re-calculate on expiry Web Server 36

Bulk Loading into HBase Prepare via a MR job using HFileOutputFormat Or, use importtsv to prepare the StoreFiles from tsv files Import the prepared data using completebulkload tool If the target table does not already exist in HBase, it would be created automatically. $ bin/hbase org.apache.hadoop.hbase.mapreduce.importtsv - Dimporttsv.columns=a,b,c - DimporEsv.bulk.output=/user/xyz/myoutput <tablename> <hdfs-inputdir> hadoop jar hbase-version.jar completebulkload /user/xyz/myoutput mytable Takes the same output path where the imporesv put its results 37

Measuring Performance Metrics CTR Revenue Engagement Why use Hadoop? Store all tracking events on HDFS Easier to scale to joining high number of tracking streams Look back far in time Track Behavior & Performance over time Retrain a Model if performance drops Measure the performance of any newly deployed model Try out new stuff 38

Common Recommenda/on PaEerns Content/Product Recommenda/on People who viewed this Product/Profile also viewed Frequently bought together Related Searches or Query Recommenda/on Related Ar/cles/News People/Social Recommenda/ons 39

Content/Product Recommendation Use Cases Recommend Product Recommend Movies/Videos Design Collaborative Filtering Logistic Regression Frequently bought together Use Cases Find items that are frequently bought together Related Searches Design Parallel FP Growth http://infolab.stanford.edu/ ~echang/recsys08-69.pdf 40

Parallel FP Growth mahout fpg - i core/src/test/resources/retail.dat - o paeerns \ - k 50 \ - method mapreduce \ - regex '[\ ] 32 41 59 60 61 62 3 39 48 63 64 65 66 67 68 32 69 48 70 71 72 39 73 74 75 76 77 78 79 36 38 39 41 48 79 80 81 82 83 84 41 85 86 87 88 41

People who viewed this Product also viewed Use Cases View Item Page View Content (Movies, Videos, News) Design This data does not change often Record the views in a session co-view Collaborative Filtering 42

Related Searches or Query Recommendation Design Use Query Log Data Cluster similar queries Use Parallel FP Growth to find the related searches Query Distance Based on keywords or phrases Based on searches in the same session Based on common clicked URLs Based on the distance of the clicked documents Related Articles/News Batch clustering with K-Means NRT clustering using the centroids Perform canopy on left over articles 43

Social/People Recommendations Use Case Recommend Missing Links in a Social Network Bipartite Matching Recommend Men/Women Design Take existing edges and friend of friends Build Regression Models based on latest activity Scale easily offline with Hadoop as number of friends of friends and activities could be very high. Giraph 44

June 13 at SF MarrioE Marquis Call for Speakers is open un/l April 1 Early Bird reg open as of Monday night, and Strata aeendees get a $25 discount (use code promo- strata) hbasecon.com 45

Thank you! Jayant Shekhar, Solu/ons Architect, Cloudera @jshekhar