Big Data Analytics in LinkedIn. Danielle Aring & William Merritt




Brief History of LinkedIn
- Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/)
- 2005: Introduced first business lines: Jobs and Subscriptions
- 2006: Launched public profiles (achieved profitability/new features)
- 2008: LinkedIn goes global (https://business.linkedin.com/)
- 2012: Site transformation and rapid growth
- 2013: ~225 million members (27% of LinkedIn subscribers are recruiters)
- 2014: Next decade focused on a map of the digital economy


Three Major Data Dimensions @LinkedIn

LinkedIn Challenges for Web-scale OLAP
- Horizontal scalability: over 200 million users, with 2 new members added per second
- Quick response time to users' queries
- High availability
- High read and write throughput (billions of monthly page views)
- Heavy dependency on the slowest node's response, because data is spread across various nodes

Current OLAP Solutions - not suited for a high-traffic website
What is OLAP? Online Analytical Processing:
- Long transactions
- Complex queries
- Mining and analyzing large amounts of data
- Infrequent updates of data
Existing approaches:
- Traditional, for Business Intelligence (e.g. SAP, Oracle): retrieve and consolidate partial results across nodes, causing slow responses
- Distributed: problems with latency, availability and cost
- Materialized cubes: loading billions of page views makes the load too high

Avatara: Solution for Web-scale Analytics Products
- Provides a fast, scalable OLAP system
- Handles "small cubes" scenarios
- Simple grammar for cube construction and querying at scale
- Shards cube dimensions into a key-value model
- Leverages a distributed key-value store for low-latency, high-availability access to cubes
- Leverages Hadoop for joins
Two examples of analytics features:
- Who's viewed my profile? (WVMP): cube sharded by member ID
- Who's viewed this job? (WVTJ): cube sharded across jobs

Avatara: Solution (cont'd)
- Sharding (i.e. horizontal scaling) divides the data set and distributes it over multiple servers. Each shard is an independent database, and together the shards make up a single logical database.
  - Sharding on a primary key turns a big cube into smaller ones
  - Storing a cube's data in one location means retrieval requires a single disk fetch
- Offline Batch Engine
  - High throughput
  - Batch processing (Hadoop jobs)
- Online Query Engine
  - Low latency, high availability
  - Key-value paradigm for storing data (Voldemort)
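
A minimal sketch of the sharding idea above, not Avatara's or Voldemort's actual code. The hash function, the shard count, the `ShardedStore` class and the sample cube fields are all assumptions made for illustration. It shows how a shard key such as a member ID maps to one server, so a whole small cube is fetched with a single lookup.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key (e.g. a member ID) to one of num_shards servers."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy stand-in for a distributed key-value store: one dict per shard."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, cube):
        self.shards[shard_for(key, len(self.shards))][key] = cube

    def get(self, key):
        # One lookup on one shard returns the whole small cube.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore()
# Hypothetical small cube for one member: rows of dimensions plus a measure.
store.put("member:42", [{"viewer_industry": "Software", "week": "2012-W01", "views": 3}])
cube = store.get("member:42")
```

Because the same key always hashes to the same shard, reads never need to consolidate partial results across nodes.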

Avatara: Architecture

Avatara: Offline Batch Engine
Three phases, driven by a simple configuration file:
- Preprocessing
  - prepares the data, using built-in functions to roll up data
  - customized scripting for further processing
- Projections and Joins
  - builds the dimension and fact tables
  - a join key ties the dimension and fact tables together
- Cubification
  - partitions the data by cube shard key and produces small cubes
  - data can be retrieved in a single disk fetch for faster responses
  - cubes are bulk loaded into a distributed key-value store (i.e. Voldemort)
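
The three phases can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions: the real engine runs as Hadoop jobs, and the event fields, the `viewers` dimension table and the function names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical profile-view events (fact data) and viewer attributes (dimension data).
events = [
    {"member_id": 1, "viewer_id": 10, "day": "2012-01-02"},
    {"member_id": 1, "viewer_id": 11, "day": "2012-01-03"},
    {"member_id": 2, "viewer_id": 10, "day": "2012-01-03"},
]
viewers = {10: {"industry": "Software"}, 11: {"industry": "Finance"}}


def preprocess(rows):
    """Phase 1: roll the daily timestamp up to a coarser unit (here, the month)."""
    return [{**r, "month": r["day"][:7]} for r in rows]


def join(rows, dim):
    """Phase 2: tie fact rows to the dimension table through the join key viewer_id."""
    return [{**r, **dim[r["viewer_id"]]} for r in rows]


def cubify(rows, shard_key):
    """Phase 3: partition by shard key and aggregate into one small cube per key."""
    cubes = defaultdict(lambda: defaultdict(int))
    for r in rows:
        cubes[r[shard_key]][(r["month"], r["industry"])] += 1
    return {k: dict(v) for k, v in cubes.items()}


cubes = cubify(join(preprocess(events), viewers), "member_id")
```

Each value in `cubes` is what would be bulk loaded into the key-value store under its shard key.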

Avatara: Online Query Engine
- Serves queries in real time
- Retrieves and processes data from the key-value store (i.e. Voldemort)
- Fast retrieval because of compact cubes per shard key (e.g. member_id)
- SQL-like syntax for clients
  - supports select, where, group-by, having, order and similar operations
  - simplifies development
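
To make the select / where / group-by / order operations concrete, here is a small sketch that runs them over one already-fetched cube. The `query` helper and the cube fields are hypothetical; this is not Avatara's query grammar.

```python
from collections import defaultdict

# Hypothetical small cube for one member, as it might be stored under its shard key.
cube = [
    {"industry": "Software", "month": "2012-01", "views": 3},
    {"industry": "Software", "month": "2012-02", "views": 2},
    {"industry": "Finance", "month": "2012-01", "views": 1},
]


def query(rows, where, group_by, measure):
    """Filter rows, sum `measure` per value of `group_by`, order by the sum descending."""
    totals = defaultdict(int)
    for row in rows:
        if where(row):
            totals[row[group_by]] += row[measure]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


result = query(
    cube,
    where=lambda r: r["month"] >= "2012-01",
    group_by="industry",
    measure="views",
)
```

Since the cube is small, this in-memory aggregation can happen at page load after a single key-value fetch.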

Cube Thinning
Avatara's mechanism for thinning cubes that are too large to process on page load (such as those for President Obama or LeBron James). It allows developers to:
- set priorities and constraints on dimensions
- aggregate dimension values into a specific value (such as an "other" category)
- drop data across predefined dimensions
  - e.g. WVMP can opt to drop data across the time dimension, resulting in a shorter history
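
A rough sketch of the two thinning strategies above, assuming a cube is a list of rows. The function names, the `keep_top` constraint and the sample rows are hypothetical and only illustrate the idea. Note that `thin_cube` aggregates along a single dimension for brevity.

```python
from collections import defaultdict


def thin_cube(rows, keep_top, dimension, measure, other_label="Other"):
    """Keep the keep_top largest values of `dimension`; fold the rest into other_label."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[measure]
    kept = set(sorted(totals, key=totals.get, reverse=True)[:keep_top])
    thinned = defaultdict(int)
    for value, total in totals.items():
        thinned[value if value in kept else other_label] += total
    return dict(thinned)


def drop_old(rows, time_dimension, cutoff):
    """Drop data across the time dimension: keep only rows at or after the cutoff."""
    return [r for r in rows if r[time_dimension] >= cutoff]


rows = [
    {"industry": "Software", "week": "2012-W01", "views": 5},
    {"industry": "Finance", "week": "2012-W01", "views": 3},
    {"industry": "Retail", "week": "2012-W02", "views": 1},
    {"industry": "Legal", "week": "2012-W02", "views": 1},
]
thinned = thin_cube(rows, keep_top=2, dimension="industry", measure="views")
recent = drop_old(rows, "week", "2012-W02")
```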


In Summary
- Avatara (i.e. LinkedIn's in-house OLAP system) has been running at LinkedIn for several years
- Allows developers to build OLAP cubes with a single configuration file
- Hybrid offline/online strategy combined with sharding into a key-value store
- Powers large web-scale applications such as WVMP, WVTJ and Jobs You May Be Interested In
- Uses Hadoop as its batch computing infrastructure
- SQL-like query interaction
- The Hadoop batch engine can handle TBs of data and process it in a matter of hours
- Voldemort can respond to online queries in milliseconds
Future work:
- Near real-time cubing
- Streaming joins
- Dimension and schema changes

BIG DATA PROJECT

Data Mining with LinkedIn using AJAX call to REST API
Overview:
1. Extract a large quantity of data from LinkedIn using an AJAX call to the REST API
2. Transform the data into structured CSV file format via scripts
3. Create tables in Hive, a data warehouse installed on top of HDFS
4. Query the database to derive analytic insights on people, jobs, and companies
5. Visualize the Hive query results via Tableau

Issue: Extracting Data From LinkedIn API
Followed instructions on LinkedIn Developer: Authenticating with OAuth 2.0
- Success: configuring the app, requesting an authorization code
- Fail: exchanging the authorization code for a request token (INVALID)
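
For reference, the step that failed is a form-encoded POST in the standard OAuth 2.0 authorization-code grant (RFC 6749, section 4.1.3). The sketch below only builds the request body and sends nothing. The parameter names come from the RFC; all values are hypothetical placeholders, and passing `client_secret` in the body is one of the client-authentication options the RFC allows, assumed here for simplicity.

```python
from urllib.parse import urlencode


def token_request_body(code, redirect_uri, client_id, client_secret):
    """Build the form-encoded body for exchanging an authorization code for a token."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        # Must be the same redirect URI that was used when requesting the code.
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })


# Hypothetical placeholder values, for illustration only.
body = token_request_body(
    code="AUTH_CODE_FROM_REDIRECT",
    redirect_uri="https://example.com/callback",
    client_id="MY_CLIENT_ID",
    client_secret="MY_CLIENT_SECRET",
)
```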

How to Extract Data From LinkedIn When Refused by LinkedIn API?
- Unable to download streamed data using CONVENTIONAL tools
- Solution: data extraction via AJAX call to REST API
- Jase Clamp tutorial on YouTube: "How to Extract Data from LinkedIn"

Tools For Data Extraction/Transformation
- Firefox Web Console
- Firebug
- JavaScript
- AJAX
- jQuery
- Scripts: company, person and jobs
Since we cannot stream data into HDFS, data transformation to a structured file format is done externally!

REST API, AJAX, jQuery
REST API (REpresentational State Transfer):
- stateless
- separates server from client
- leverages a layered system
jQuery:
- JavaScript library for DOM manipulation
- makes client-side scripting easier
- simplifies syntax for finding, selecting and manipulating DOM elements
AJAX request (Asynchronous JavaScript and XML), based on XMLHttpRequest:
- uses client-side scripting to exchange data with a web server
- request types: GET, POST, PUT, DELETE
- loads all data at once

theurl='https://www.linkedin.com/vsearch/jj?orig=trnv&rsid=3824972411448555669155&trk=vsrp_jobs_sel&trkinfo=vsrpsearchid%3a3824972411448555665180,VSRPcmpt%3Atrans_nav&locationType=I&countryCode=us&openFacets=L,C&page_num='+i+'&pt=jobs&rnd=1448555696180';
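
The string concatenation above substitutes the loop counter `i` for the page number. The same pagination idea, sketched in Python with the standard library, looks like this. The helper name `paged_urls` and the example.com URL are hypothetical; only the `page_num` parameter name is taken from the URL above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_urls(url, pages, param="page_num"):
    """Yield one copy of `url` per page, with the `param` query value replaced."""
    parts = urlsplit(url)
    # Keep every query parameter except the page number itself.
    base = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    for page in pages:
        query = urlencode(base + [(param, str(page))])
        yield urlunsplit(parts._replace(query=query))


# Hypothetical search URL copied from page 2 of some result list.
urls = list(paged_urls("https://www.example.com/search?pt=jobs&page_num=2", range(1, 4)))
```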

Retrieving Data Structure

Structure of Companies Data

Structure of Jobs Data

Structure of Person Data

Data Extraction/Transformation Steps
1. Log in to your LinkedIn account
2. Create a search (on Jobs, Companies, People etc.)
3. Right click and select "Inspect Element" using Firebug (the console should display)
4. Copy/paste the script into the right side of the console window
5. Move to the "All" tab
6. Navigate to page 2 of the search results
7. Locate the GET REST API call (in the console window), right click and copy its location
8. Paste the call into the theurl variable inside the script
9. Change page_num=2 in the URL to page_num='+i+'
10. Run the script (it parses the JSON into comma-separated output) and navigate to Info in the console
11. Copy and paste the data into a text editor, remove empty lines and save as CSV
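
Steps 10 and 11 (turning JSON records into CSV rows and removing empty lines) could also be scripted instead of done by hand. A minimal sketch, assuming the extracted data is a JSON array of flat objects; the field names and sample records are hypothetical.

```python
import csv
import io
import json


def records_to_csv(json_text, fields):
    """Convert a JSON array of objects to CSV text, skipping records with no data."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for rec in records:
        row = [str(rec.get(f, "")).strip() for f in fields]
        if any(row):  # drop rows that would otherwise become empty lines
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: two company records and one empty record.
sample = '[{"name": "Acme", "followercount": 120}, {}, {"name": "Globex", "followercount": 45}]'
csv_text = records_to_csv(sample, ["name", "followercount"])
```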


Hadoop and Hive
- Versions: 1.2.1
- Hive chosen because the data was transformed into structured CSV

Hive: Table Creation and Data Upload
- Created 3 tables: companies, jobs, person
- Hive HQL is similar to SQL: DDL statements for table creation and DML for insert
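
The slides do not show the DDL itself. As a sketch of what table creation and upload for a comma-delimited file might look like, the helpers below generate HiveQL statements as strings. The column names are taken from the queries later in the deck; the column types, the file path and the helper names are assumptions.

```python
def create_table_hql(table, columns):
    """Return a CREATE TABLE statement for a comma-delimited text table."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        "STORED AS TEXTFILE;"
    )


def load_data_hql(table, local_path):
    """Return a LOAD DATA statement that uploads a local CSV file into the table."""
    return f"LOAD DATA LOCAL INPATH '{local_path}' OVERWRITE INTO TABLE {table};"


# Column types are assumed; the deck only shows the column names in its queries.
companies_columns = [
    ("name", "STRING"),
    ("followercount", "INT"),
    ("location", "STRING"),
    ("jobcount", "INT"),
]
ddl = create_table_hql("companies", companies_columns)
dml = load_data_hql("companies", "/tmp/companies.csv")  # hypothetical path
```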

HQL (Hive Query Language)
Used HQL queries to derive insights from our data, including:
- Top companies with highest number of followers
- Top locations with highest job count
- Job title and count per location
- Top job titles recently listed
- Location of jobs listed 1 day ago
- Comparison of number of connections of people with and without a profile image
- Comparison of profile headlines with highest connection count vs. those with lower connection count
Query visualization done in Tableau

insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT followercount, name, rank
FROM (SELECT followercount, name, rank() OVER (ORDER BY followercount DESC) AS rank FROM companies) ranked_followers
WHERE ranked_followers.rank < 10
ORDER BY followercount DESC;
Top companies with highest number of followers (F1: number of followers)
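
The queries in this part of the deck filter on the value of a `rank()` window function. A small sketch of how standard rank numbering behaves: tied rows share a rank, and the next rank skips ahead. The sample rows and the helper name are hypothetical.

```python
def rank_desc(rows, key):
    """Assign SQL-style RANK() numbers over rows ordered by `key` descending."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    ranked = []
    for i, row in enumerate(ordered):
        if i > 0 and row[key] == ordered[i - 1][key]:
            rank = ranked[-1][0]  # a tie shares the previous rank
        else:
            rank = i + 1          # otherwise the rank is the 1-based position
        ranked.append((rank, row))
    return ranked


# Hypothetical rows with a tie on followercount.
rows = [
    {"name": "A", "followercount": 50},
    {"name": "B", "followercount": 30},
    {"name": "C", "followercount": 50},
]
ranked = rank_desc(rows, "followercount")
top = [row["name"] for rank, row in ranked if rank < 3]
```

With ties, a filter such as `rank < 10` can return more or fewer than nine rows.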

insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT location, jobcount
FROM (SELECT location, rank() OVER (ORDER BY jobcount DESC) AS rank, jobcount FROM companies) ranked_jobs
WHERE ranked_jobs.rank < 51
ORDER BY location, jobcount DESC;
Top locations that have the highest number of jobs (F2: number of jobs)

insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT c.location, j.jobtitle
FROM companies c LEFT OUTER JOIN jobs j ON (c.location = j.location);
Join on the companies and jobs tables, selecting location and jobtitle (looking at the number of jobs listed in each area)

insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT companyname, jobtitle, jobrecency
FROM (SELECT companyname, jobtitle, rank() OVER (ORDER BY jobrecency DESC) AS rank, jobrecency FROM jobs) ranked_jobtitles
WHERE ranked_jobtitles.rank < 11
ORDER BY jobtitle, jobrecency DESC;
Top job titles recently listed

insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT location, companyname, jobtitle FROM jobs WHERE jobrecency = "1 day ago";
Locations of jobs listed 1 day ago

insert overwrite local directory '/usr/local/hql'
SELECT count(*), sum(connectioncount) FROM person WHERE imageurl != "undefined";
insert overwrite local directory '/usr/local/hql'
SELECT count(*), sum(connectioncount) FROM person WHERE imageurl = "undefined";
Comparison: number of connections of people with and without a profile photo on the webpage
- Ratio 5 : 454
- On average, those without a profile pic: ~470 connections; with a profile pic: ~394
- 76-connection difference!
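
Each query returns a count and a sum, and the average is the sum divided by the count. A minimal sketch of the same comparison in Python; the sample records are made-up toy values, not the project's data, and the function name is hypothetical.

```python
def average_connections(people):
    """Average connectioncount for people with and without a profile image."""
    groups = {"with_photo": [0, 0], "without_photo": [0, 0]}  # [count, sum]
    for p in people:
        key = "without_photo" if p["imageurl"] == "undefined" else "with_photo"
        groups[key][0] += 1
        groups[key][1] += p["connectioncount"]
    return {
        key: (total / count if count else None)
        for key, (count, total) in groups.items()
    }


# Toy records for illustration only.
people = [
    {"imageurl": "https://example.com/a.jpg", "connectioncount": 100},
    {"imageurl": "https://example.com/b.jpg", "connectioncount": 300},
    {"imageurl": "undefined", "connectioncount": 500},
]
averages = average_connections(people)
```

With only a handful of people in one group, as in the 5 : 454 ratio on the slide, the average for that group is sensitive to a single profile.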

insert overwrite local directory '/usr/local/hql'
SELECT connectioncount, firstname, headline FROM person WHERE connectioncount > 500;
Profile headlines with highest connections

insert overwrite local directory '/usr/local/hql'
SELECT connectioncount, firstname, headline FROM person WHERE connectioncount < 200;
Profile headlines with lowest connections

Interested in trying on your own?
Links:
- Firebug add-on for Firefox: https://addons.mozilla.org/en-us/firefox/addon/firebug/
- Jase Clamp tutorial "Extracting Data From LinkedIn": https://www.youtube.com/watch?v=s-9bwrtxodw
- Data extraction script on GitHub: https://gist.github.com/jaseclamp/2c74062bac1cc4dd929f
- Tableau download: http://www.tableau.com/products/desktop/download?os=windows

Sources
1. http://vldb.org/pvldb/vol5/p1874_liliwu_vldb2012.pdf
2. http://www.slideshare.net/liliwu/avatara-olap-for-webscale-analytics-products
3. https://ourstory.linkedin.com/#year-2004
4. http://www.slideshare.net/michaelli17/how-business-analytics-drives-businessvalue-teradata-partners-conference-nashvile-2014?next_slideshow=1
5. https://engineering.linkedin.com/olap/avatara-olap-web-scale-analytics-products
6. https://www.youtube.com/watch?v=9s-vsewej1u

THANK YOU