Big Data Analytics in LinkedIn by Danielle Aring & William Merritt
Brief History of LinkedIn
- Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/)
- 2005: Introduced first business lines: Jobs and Subscriptions
- 2006: Launched public profiles (achieved portability/new features)
- 2008: LinkedIn goes global (https://business.linkedin.com/)
- 2012: Site transformation and rapid growth
- 2013: ~225 million members (27% of LinkedIn subscribers are recruiters)
- 2014: Next decade focused on mapping the digital economy
Three Major Data Dimensions @LinkedIn
LinkedIn Challenges for Web-scale OLAP
- Horizontally scalable: currently over 200 million users, adding 2 new members per second
- Quick response time to users' queries
- High availability
- High read & write throughput (billions of monthly page views)
- Heavy dependency on the slowest node's response, as data is spread across many nodes
Current OLAP Solutions - not suited for a high-traffic website
What is OLAP? Online Analytical Processing:
- Long transactions
- Complex queries
- Mining and analyzing large amounts of data
- Infrequent updates of data
Traditional Business Intelligence solutions (e.g. SAP, Oracle):
- Retrieve & consolidate partial results across nodes (causing slow responses)
- Distributed (problems with latency, availability, and cost)
- Materialized cubes (loading billions of page views - load too high)
Avatara: solution for Web-scale Analytics Products
Provides a fast, scalable OLAP system that:
- handles small-cube scenarios
- offers a simple grammar for cube construction and query at scale
- shards cube dimensions into a key-value model
- leverages a distributed key-value store for low-latency, highly available access to cubes
- leverages Hadoop for joins
Two examples of analytics features:
- WVMP - "Who's viewed my profile?" - cube sharded by member ID
- WVTJ - "Who's viewed this job?" - cube sharded across jobs
Avatara: solution (cont'd)
Sharding (i.e. horizontal scaling) divides the data set and distributes it over multiple servers. Each shard is an independent database, and together the shards make up a single logical database.
- Sharding on a primary key turns one big cube into many smaller ones
- Storing a cube's data in one location means it can be read with a single disk fetch
Offline Batch Engine: high-throughput batch processing (Hadoop jobs)
Online Query Engine: low latency, high availability; key-value paradigm for storing data (Voldemort)
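A minimal sketch of the sharding idea, assuming hypothetical field names (memberId, viewerCompany, week) rather than LinkedIn's actual schema: each member's slice of the big cube becomes one small cube stored under its shard key, so a profile-views lookup is a single key-value fetch.

```javascript
// Sketch of Avatara-style sharding: partition fact rows by a shard key
// (here memberId, as in the WVMP cube) so each member's small cube can
// be fetched from a key-value store in one lookup. Field names are
// illustrative, not LinkedIn's real schema.
function shardCube(rows, shardKey) {
  const store = new Map(); // stands in for the Voldemort key-value store
  for (const row of rows) {
    const key = row[shardKey];
    if (!store.has(key)) store.set(key, []);
    store.get(key).push(row);
  }
  return store;
}

const pageViews = [
  { memberId: 1, viewerCompany: 'Acme', week: '2012-W01' },
  { memberId: 1, viewerCompany: 'Initech', week: '2012-W02' },
  { memberId: 2, viewerCompany: 'Acme', week: '2012-W01' },
];

const store = shardCube(pageViews, 'memberId');
// store.get(1) is member 1's entire small cube -> one "disk fetch"
```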
Avatara: Architecture
Avatara: Offline Batch Engine - Three Phases
Driven by a simple configuration file:
1. Preprocessing: prepare the data, using built-in functions to roll it up, plus customized scripting for further processing
2. Projections and Joins: build the dimension & fact tables; a join key ties the dimension and fact tables together
3. Cubification: partition the data by the cube shard key & produce small cubes, so data can be retrieved in a single disk fetch for faster responses; cubes are bulk-loaded into a distributed key-value store (i.e. Voldemort)
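The three phases can be sketched as a small config-driven pipeline. The config object, field names, and rollup function below are invented for illustration; the real Avatara configuration format is not shown in the source.

```javascript
// Hypothetical config standing in for Avatara's configuration file.
const config = {
  shardKey: 'memberId',   // cube shard key (as in WVMP)
  joinKey: 'companyId',   // ties dimension and fact tables
  // Preprocessing rollup: collapse a full date to a month bucket.
  rollup: (row) => ({ ...row, month: row.date.slice(0, 7) }),
};

function buildCubes(facts, dimensions, cfg) {
  // Phase 1: preprocessing - roll up raw fact rows.
  const rolled = facts.map(cfg.rollup);
  // Phase 2: projections and joins - attach dimension attributes
  // to each fact row via the join key.
  const dimIndex = new Map(dimensions.map((d) => [d[cfg.joinKey], d]));
  const joined = rolled.map((r) => ({ ...r, ...dimIndex.get(r[cfg.joinKey]) }));
  // Phase 3: cubification - partition by shard key into small cubes
  // (bulk-loaded into a key-value store in the real system).
  const cubes = new Map();
  for (const row of joined) {
    const k = row[cfg.shardKey];
    if (!cubes.has(k)) cubes.set(k, []);
    cubes.get(k).push(row);
  }
  return cubes;
}

const facts = [{ memberId: 1, companyId: 10, date: '2012-03-05' }];
const dims = [{ companyId: 10, companyName: 'Acme' }];
const cubes = buildCubes(facts, dims, config);
```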
Avatara: Online Query Engine
- Serves queries in real time
- Retrieves & processes data from the key-value store (i.e. Voldemort)
- Fast retrieval thanks to compact cubes per shard key (e.g. member_id)
- SQL-like syntax for clients: supports select, where, group-by, having, order-by, etc.
- Simplifies development for client teams
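A toy version of the online query path, under the assumption that the heavy lifting is a single key-value fetch followed by in-memory operators: the real engine exposes this through a SQL-like client syntax, so this JavaScript translation is only illustrative.

```javascript
// Apply SQL-like operations (where, group-by with count, order-by) to
// one member's small cube after it has been fetched from the store.
function queryCube(cube, { where, groupBy, orderBy }) {
  const rows = where ? cube.filter(where) : cube;
  const groups = new Map();
  for (const r of rows) {
    const k = r[groupBy];
    groups.set(k, (groups.get(k) || 0) + 1);
  }
  const result = [...groups].map(([key, count]) => ({ [groupBy]: key, count }));
  if (orderBy === 'count') result.sort((a, b) => b.count - a.count); // ORDER BY count DESC
  return result;
}

// "Who's viewed my profile?", grouped by viewer company:
const myCube = [
  { viewerCompany: 'Acme' },
  { viewerCompany: 'Acme' },
  { viewerCompany: 'Initech' },
];
const top = queryCube(myCube, { groupBy: 'viewerCompany', orderBy: 'count' });
// top[0] -> { viewerCompany: 'Acme', count: 2 }
```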
Cube Thinning
Avatara's mechanism for thinning cubes too large to process on page load (e.g. for very-visible profiles such as President Obama or LeBron James). It allows developers to:
- set priorities and constraints on dimensions, aggregating low-priority values into a specific bucket (such as an "other" category)
- drop data across pre-defined dimensions; e.g. WVMP can opt to drop data across the time dimension, resulting in a shorter history!
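The two thinning tactics above might look like this in miniature. The limits and field names (keepTopCompanies, maxWeeks, viewerCompany, week) are made up for the example; the real dimension priorities live in Avatara's configuration.

```javascript
// Sketch of cube thinning: collapse low-priority company values into
// an "other" bucket, then drop the oldest slices of the time dimension
// so an outsized cube fits a page-load budget.
function thinCube(rows, { keepTopCompanies, maxWeeks }) {
  // Rank companies by row count; everything past the cutoff -> "other".
  const counts = new Map();
  for (const r of rows) {
    counts.set(r.viewerCompany, (counts.get(r.viewerCompany) || 0) + 1);
  }
  const keep = new Set(
    [...counts].sort((a, b) => b[1] - a[1]).slice(0, keepTopCompanies).map(([c]) => c)
  );
  const collapsed = rows.map((r) =>
    keep.has(r.viewerCompany) ? r : { ...r, viewerCompany: 'other' });
  // Keep only the most recent maxWeeks buckets (a shorter history).
  const weeks = [...new Set(collapsed.map((r) => r.week))].sort().slice(-maxWeeks);
  return collapsed.filter((r) => weeks.includes(r.week));
}

const views = [
  { viewerCompany: 'Acme', week: 'W01' },
  { viewerCompany: 'Acme', week: 'W02' },
  { viewerCompany: 'Initech', week: 'W02' },
  { viewerCompany: 'Globex', week: 'W03' },
];
const thinned = thinCube(views, { keepTopCompanies: 1, maxWeeks: 2 });
```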
In Summary
- Avatara has been running at LinkedIn for several years as its in-house OLAP system
- Allows developers to build OLAP cubes with a single configuration file
- Hybrid offline/online strategy, combined with sharding into a key-value store
- Powers large web-scale applications such as WVMP, WVTJ, and Jobs You May Be Interested In
- Uses Hadoop as its batch computing infrastructure, with SQL-like query interaction
- The Hadoop batch engine can process TBs of data in a matter of hours, while Voldemort answers online queries in milliseconds
Future Work: near-real-time cubing, streaming joins, dimension and schema changes
BIG DATA PROJECT
Data Mining with LinkedIn using AJAX calls to the REST API
Overview:
1. Extract a large quantity of data from LinkedIn using AJAX calls to the REST API
2. Transform the data into structured CSV file format via scripts
3. Create tables in Hive, installed on top of HDFS
4. Query the database to derive analytic insights on people, jobs, and companies
5. Visualize the Hive queries via Tableau
Issue: Extracting Data From the LinkedIn API
Followed the instructions on LinkedIn Developer: Authenticating with OAuth 2.0
- Success: configuring the app, requesting an authorization code
- Fail: exchanging the authorization code for an access token (INVALID)
How to Extract Data From LinkedIn When Refused by the LinkedIn API?
- Unable to download streamed data using conventional tools
- Solution: data extraction via AJAX calls to the REST API
- Jase Clamp's tutorial on YouTube: "How to Extract Data from LinkedIn"
Tools For Data Extraction/Transformation
- FireFox Web Console
- Firebug
- JavaScript, AJAX, JQuery
- Scripts for company, person, and jobs data
Since we cannot stream data into HDFS, transformation to a structured file format is done externally!
REST API, AJAX, JQuery
REST API (REpresentational State Transfer): stateless; separates server from client; leverages a layered system
JQuery: JavaScript and DOM manipulation library; makes client-side scripting easier; simplifies the syntax for finding, selecting, and manipulating DOM elements
AJAX request (Asynchronous JavaScript and XML, ~XMLHttpRequest): uses client-side scripting to exchange data with a web server without a full page reload; request types: GET, POST, PUT, DELETE
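A hedged sketch of the extraction approach: build the paged search URL (the real URL is the long vsearch string shown on the next slide; most of its session-specific parameters are omitted here) and issue one GET per page. The page-fetching function is injected so the loop itself can be exercised without a browser; in Firefox it would be an AJAX GET such as jQuery's $.getJSON.

```javascript
// Mirrors the slide's 'theurl' pattern with the page number templated
// in. All query parameters besides page_num are dropped because the
// real values (rsid, rnd, ...) are session-specific.
function buildSearchUrl(i) {
  return 'https://www.linkedin.com/vsearch/jj?pt=jobs&page_num=' + i;
}

// Walk the result pages, delegating the actual HTTP GET to fetchPage
// (e.g. a jQuery AJAX call in the browser console).
function collectPages(fetchPage, lastPage) {
  const results = [];
  for (let i = 1; i <= lastPage; i++) {
    results.push(...fetchPage(buildSearchUrl(i)));
  }
  return results;
}
```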
theurl='https://www.linkedin.com/vsearch/jj?orig=trnv&rsid=3824972411448555669155&trk=vsrp_jobs_sel&trkinfo=vsrpsearchid%3a3824972411448555665180,VSRPcmpt%3Atrans_nav&locationType=I&countryCode=us&openFacets=L,C&page_num='+i+'&pt=jobs&rnd=1448555696180';
Retrieving Data Structure
Structure of Companies Data
Structure of Jobs Data
Structure of Person Data
Data Extraction/Transformation Steps
1. Log in to a LinkedIn account
2. Create a search (on Jobs, Companies, People, etc.)
3. Right-click and select Inspect Element using Firebug (the console should display)
4. Code/paste the script into the right side of the console window, then move to the All tab
5. Navigate to page 2 of the search results
6. Locate the GET REST API call in the console window; right-click and copy its location
7. Paste the call into the theurl variable inside the script
8. Change page_num=2 in the URL to page_num='+i+'
9. Run the script (parses results into comma-separated JSON)
10. Navigate to the output in the console; copy and paste the data into a text editor
11. Remove empty lines and paste into a CSV file
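The final transformation steps (parse to comma-separated values, drop empty lines, save as CSV) can be sketched as below. The field names (companyname, location) are assumptions chosen to match the Hive tables described later, not the exact script from the tutorial.

```javascript
// Flatten extracted records into CSV lines, dropping records that
// produced only empty fields (the "remove empty lines" step).
function toCsv(records, columns) {
  const header = columns.join(',');
  const lines = records
    .map((r) => columns.map((c) => r[c] ?? '').join(','))
    .filter((line) => line.replace(/,/g, '').trim() !== '');
  return [header, ...lines].join('\n');
}

const csv = toCsv(
  [{ companyname: 'Acme', location: 'NYC' }, {}],
  ['companyname', 'location']
);
// -> "companyname,location\nAcme,NYC" (the empty record is dropped)
```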
Hadoop and Hive (version 1.2.1)
Hive was chosen because the data had been transformed into structured CSV
Hive ~ Table Creation and Data Upload
- Created 3 tables: companies, jobs, person
- Hive HQL is similar to SQL: DDL statements for table creation, DML for inserts
HQL (Hive Query Language)
Used HQL queries to derive insights from our data, including:
- Top companies with the highest number of followers
- Top locations with the highest job count
- Job title and count per location
- Top job titles recently listed
- Locations of jobs listed 1 day ago
- Comparison of connection counts for people with and without a profile image
- Comparison of profile headlines with the highest vs. lowest connection counts
Query visualization done in Tableau
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT followercount, name
FROM (SELECT followercount, name, rank() OVER (ORDER BY followercount DESC) AS rank FROM companies) ranked_followers
WHERE ranked_followers.rank < 10
ORDER BY followercount DESC;
Top companies with the highest number of followers (F1 ~ # of followers)
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT location, jobcount
FROM (SELECT location, rank() OVER (ORDER BY jobcount DESC) AS rank, jobcount FROM companies) ranked_jobs
WHERE ranked_jobs.rank < 51
ORDER BY location, jobcount DESC;
Top locations with the highest number of jobs (F2 ~ # of jobs)
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT c.location, j.jobtitle
FROM companies c LEFT OUTER JOIN jobs j ON (c.location = j.location);
Join on the companies and jobs tables, selecting location and job title (looking at the number of jobs listed in each area)
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT companyname, jobtitle, jobrecency
FROM (SELECT companyname, jobtitle, rank() OVER (ORDER BY jobrecency DESC) AS rank, jobrecency FROM jobs) ranked_jobtitles
WHERE ranked_jobtitles.rank < 11
ORDER BY jobtitle, jobrecency DESC;
Top job titles recently listed
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT location, companyname, jobtitle FROM jobs WHERE jobrecency = "1 day ago";
Locations of jobs listed 1 day ago
insert overwrite local directory '/usr/local/hql'
SELECT count(*), sum(connectioncount) FROM person WHERE imageurl != "undefined";
insert overwrite local directory '/usr/local/hql'
SELECT count(*), sum(connectioncount) FROM person WHERE imageurl = "undefined";
Comparison: connection counts of people with and without a profile photo (ratio 5:454 in our sample).
On average, those without a profile picture had ~470 connections and those with one had ~394 - a 76-connection difference!
insert overwrite local directory '/usr/local/hql'
SELECT connectioncount, firstname, headline FROM person WHERE connectioncount > 500;
Profile headlines with the highest connection counts
insert overwrite local directory '/usr/local/hql'
SELECT connectioncount, firstname, headline FROM person WHERE connectioncount < 200;
Profile headlines with the lowest connection counts
Interested in trying it on your own? Links:
- FireBug add-on for FireFox: https://addons.mozilla.org/en-us/firefox/addon/firebug/
- Jase Clamp tutorial, "Extracting Data From LinkedIn": https://www.youtube.com/watch?v=s-9bwrtxodw
- Data extraction script on GitHub: https://gist.github.com/jaseclamp/2c74062bac1cc4dd929f
- Tableau download: http://www.tableau.com/products/desktop/download?os=windows
Sources
1. http://vldb.org/pvldb/vol5/p1874_liliwu_vldb2012.pdf
2. http://www.slideshare.net/liliwu/avatara-olap-for-webscale-analytics-products
3. https://ourstory.linkedin.com/#year-2004
4. http://www.slideshare.net/michaelli17/how-business-analytics-drives-businessvalue-teradata-partners-conference-nashvile-2014?next_slideshow=1
5. https://engineering.linkedin.com/olap/avatara-olap-web-scale-analytics-products
6. https://www.youtube.com/watch?v=9s-vsewej1u
THANK YOU