Building Data Products using Hadoop at Linkedin. Mitul Tiwari Search, Network, and Analytics (SNA) LinkedIn

Similar documents

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Hadoop Job Oriented Training Agenda

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Putting Apache Kafka to Use!

Click Stream Data Analysis Using Hadoop

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Big Data : Experiments with Apache Hadoop and JBoss Community projects

Workshop on Hadoop with Big Data

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Chapter 7. Using Hadoop Cluster and MapReduce

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Introduction To Hive

L1: Introduction to Hadoop

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Cloudera Certified Developer for Apache Hadoop

BIG DATA SOLUTION DATA SHEET

A Brief Outline on Bigdata Hadoop

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Bigtable is a proven design Underpins 100+ Google services:

Map Reduce & Hadoop Recommended Text:

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Large scale processing using Hadoop. Ján Vaňo

Hadoop Parallel Data Processing

How To Scale Out Of A Nosql Database

HADOOP MOCK TEST HADOOP MOCK TEST II

ITG Software Engineering

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

How To Use Big Data For Telco (For A Telco)

Data Pipeline with Kafka

The Big Data Ecosystem at LinkedIn

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Ecosystem B Y R A H I M A.

Integrating Big Data into the Computing Curricula

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Big Data Pipeline and Analytics Platform

Similarity Search in a Very Large Scale Using Hadoop and HBase

SAP and Hortonworks Reference Architecture

Real-time Big Data Analytics with Storm

Lecture 10 - Functional programming: Hadoop and MapReduce

BIG DATA TRENDS AND TECHNOLOGIES

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Hadoop implementation of MapReduce computational model. Ján Vaňo

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster. Dan Şerban

HDFS. Hadoop Distributed File System

Beyond "Big data": Introducing the EOI framework for analytics teams to drive business impact

Big Data for Investment Research Management

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Kafka & Redis for Big Data Solutions

Big Data Weather Analytics Using Hadoop

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

CRITEO INTERNSHIP PROGRAM 2015/2016

Big data blue print for cloud architecture

Apache Hadoop Ecosystem

Fast Data in the Era of Big Data: Twitter s Real-

Using Data Mining and Machine Learning in Retail

How To Handle Big Data With A Data Scientist

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data Course Highlights

Applications for Big Data Analytics

MapReduce with Apache Hadoop Analysing Big Data

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

ITG Software Engineering

Constructing a Data Lake: Hadoop and Oracle Database United!

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Big Data and Data Science: Behind the Buzz Words

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectures for massive data management

NOT IN KANSAS ANY MORE

XpoLog Competitive Comparison Sheet

Transcription:

Building Data Products using Hadoop at Linkedin Mitul Tiwari Search, Network, and Analytics (SNA) LinkedIn 1 1

Who am I? 2 2

What do I mean by Data Products? 3 3

People You May Know 4 4

Profile Stats: WVMP 5 5

Viewers of this profile also... 6 6

Skills 7 7

InMaps 8 8

Data Products: Key Ideas Recommendations People You May Know, Viewers of this profile... Analytics and Insight Profile Stats: Who Viewed My Profile, Skills Visualization InMaps 9 9

Data Products: Challenges LinkedIn: 2nd largest social network 120 million members on LinkedIn Billions of connections Billions of pageviews Terabytes of data to process 10 10

Outline What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 11 11

Systems and Tools Kafka (LinkedIn) Hadoop (Apache) Azkaban (LinkedIn) Voldemort (LinkedIn) 12 12

Systems and Tools Kafka publish-subscribe messaging system transfer data from production to HDFS Hadoop Azkaban Voldemort 13 13

Systems and Tools Kafka Hadoop Java MapReduce and Pig process data Azkaban Voldemort 14 14

Systems and Tools Kafka Hadoop Azkaban Hadoop workflow management tool to manage hundreds of Hadoop jobs Voldemort 15 15

Systems and Tools Kafka Hadoop Azkaban Voldemort Key-value store store output of Hadoop jobs and serve in production 16 16

Outline What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 17 17

People You May Know How do people know each other? Alice Bob Carol 18 18

People You May Know How do people know each other? Alice Bob Carol 19 19

People You May Know How do people know each other? Alice Bob Carol Triangle closing 20 20

People You May Know How do people know each other? Alice Bob Carol Triangle closing Prob(Bob knows Carol) ~ the # of common connections 21 21

Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 22 22

Pig Overview Load: load data, specify format Store: store data, specify format Foreach, Generate: Projections, similar to select Group by: group by column(s) Join, Filter, Limit, Order,... User Defined Functions (UDFs) 23 23

Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 24 24

Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 25 25

Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 26 26

Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 27 27

Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 28 28

Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) 2.(A,{B,C}),(B,{A}),(C,{A}) 3.(A,{B,C}),(A,{C,B}) 4.(B,C,1), (C,B,1) 29 connections = LOAD `connections` USING PigStorage(); 29

Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) 2.(A,{B,C}),(B,{A}),(C,{A}) 3.(A,{B,C}),(A,{C,B}) 4.(B,C,1), (C,B,1) group_conn = GROUP connections BY source_id; 30 30

Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) 2.(A,{B,C}),(B,{A}),(C,{A}) 3.(A,{B,C}),(A,{C,B}) 4.(B,C,1), (C,B,1) pairs = FOREACH group_conn GENERATE generatepair(connections.dest_id) as (id1, id2); 31 31

Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) 2.(A,{B,C}),(B,{A}),(C,{A}) 3.(A,{B,C}),(A,{C,B}) 4.(B,C,1), (C,B,1) common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; 32 32

Our Workflow triangle-closing 33 33

Our Workflow triangle-closing top-n 34 34

Our Workflow triangle-closing top-n push-to-prod 35 35

Outline What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 36 36

Our Workflow triangle-closing top-n push-to-prod 37 37

Our Workflow triangle-closing remove connections top-n push-to-prod 38 38

Our Workflow triangle-closing remove connections top-n push-to-qa push-to-prod 39 39

PYMK Workflow 40 40

Workflow Requirements Dependency management Regular Scheduling Monitoring Diverse jobs: Java, Pig, Clojure Configuration/Parameters Resource control/locking Restart/Stop/Retry Visualization History Logs 41 41

Workflow Requirements Dependency management Regular Scheduling Monitoring Diverse jobs: Java, Pig, Clojure Configuration/Parameters Resource control/locking Restart/Stop/Retry Visualization History Logs Azkaban 42 42

Sample Azkaban Job Spec type=pig pig.script=top-n.pig dependencies=remove-connections top.n.size=100 43 43

Azkaban Workflow 44 44

Azkaban Workflow 45 45

Azkaban Workflow 46 46

Our Workflow triangle-closing remove connections top-n push-to-prod 47 47

Our Workflow triangle-closing remove connections top-n push-to-prod 48 48

Outline What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 49 49

Production Storage Requirements Large amount of data/scalable Quick lookup/low latency Versioning and Rollback Fault tolerance Offline index building 50 50

Voldemort Storage Large amount of data/scalable Quick lookup/low latency Versioning and Rollback Fault tolerance through replication Read only Offline index building 51 51

Data Cycle 52 52

Voldemort RO Store 53 53

Our Workflow triangle-closing remove connections top-n push-to-prod 54 54

Outline What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 55 55

Data Quality Verification QA store with viewer Explain Versioning/Rollback Unit tests 56 56

Outline What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 57 57

Performance 58 58

Performance Symmetry Bob knows Carol then Carol knows Bob 58 58

Performance Symmetry Bob knows Carol then Carol knows Bob Limit Ignore members with > k connections 58 58

Performance Symmetry Bob knows Carol then Carol knows Bob Limit Ignore members with > k connections Sampling Sample k-connections 58 58

Things Covered What do I mean by Data Products? Systems and Tools we use Let s build People You May Know Managing workflow Serving data in production Data Quality Performance 59 59

SNA Team Thanks to SNA Team at LinkedIn http://sna-projects.com We are hiring! 60 60

Questions? 61 61