The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn



Presented by: Ishank Kumar, Aakash Patel, and Vishnu Dev Yadav

CONTENT Abstract, Introduction, Related Work, The Ecosystem, Ingress, Workflows, Egress, Applications, Conclusions & Future Work

ABSTRACT The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich, active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of the solution is that these distributed-systems concerns are completely abstracted away from researchers. The paper also presents case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for LinkedIn's members.

INTRODUCTION Data mining and machine learning have been transformed into core production use cases. For instance, many social networking and e-commerce web sites provide derived data features, which usually consist of some data mining application offering insights to the user. A typical example is collaborative filtering, a technique also used by Amazon, YouTube, and LinkedIn. LinkedIn has over 200 million members; collaborative filtering is used for people, job, company, group, and other recommendations and is one of the principal components of engagement. This paper describes ingress into and egress out of the Hadoop system and presents case studies of how data mining applications are built at LinkedIn.

RELATED WORK There is little literature available on productionizing machine learning workflows. Twitter has developed a Pig extension. Facebook uses Hive for its data analytics and warehousing platform, but little is known about how it productionizes its machine learning applications. In terms of ingress, ETL has been studied extensively. For Hadoop ingress, Lee et al. describe Twitter's transport mechanisms and data layout for Hadoop, which use Scribe (originally developed at Facebook). Yahoo! has a similar system for log ingestion, called Chukwa. OLAP is a well-studied problem in data warehousing, but there is little related work on using MapReduce for cubing or on serving queries in the request/response path of a website. None of these egress systems explore the data pipeline or describe an end-to-end pipeline for data mining applications.

THE ECOSYSTEM

THE ECOSYSTEM (cont.) In LinkedIn's data pipeline architecture, member and activity data generated by the online portion of the website flows into the offline systems for building various derived data sets, which are then pushed back to the online serving side. LinkedIn uses HDFS, the distributed filesystem for Hadoop, as the sink for all this data; HDFS then acts as the input for further processing into data features. The data coming into HDFS falls into two categories: core database snapshots and activity data. The database change-logs are periodically compacted into timestamped snapshots for easy access. The activity data consists of streaming events generated by the services handling requests on LinkedIn.

THE ECOSYSTEM (cont.) Once data is available in an ETL HDFS instance, it is replicated to two further Hadoop instances, one for development and one for production. LinkedIn uses Azkaban, its open source scheduler, to manage these workflows. Azkaban is a general-purpose execution framework and supports diverse job types such as native MapReduce, Pig, Hive, shell scripts, and others. LinkedIn's production workflows are predominantly Pig, though native MapReduce is sometimes used for performance reasons.
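To make the scheduler concrete: an Azkaban flow is declared as a set of small properties-style job files, each naming its job type and its dependencies. The job and script names below are hypothetical, a minimal sketch of the format rather than an actual LinkedIn workflow:

```properties
# score-members.job -- a Pig step (all names here are hypothetical)
type=pig
pig.script=score_members.pig

# push-results.job -- a shell step that runs only after score-members succeeds
type=command
command=sh push_results.sh
dependencies=score-members
```

Azkaban derives the workflow DAG from these `dependencies` declarations, so a researcher only lists each job's immediate prerequisites.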

INGRESS Data loaded into Hadoop comes in two forms: database data and event data. The primary challenge in supporting a healthy data ecosystem is providing infrastructure that can make all this data available without manual intervention or processing. First, datasets are large, so all data loading must scale horizontally. Second, data is diverse. Third, data schemas evolve as the functionality of the site itself changes rather rapidly. Monitoring and validating data quality is of utmost importance. The focus here is on the activity data pipeline: LinkedIn has developed a system called Kafka as the infrastructure for all activity data. Activity data is very high in volume, several orders of magnitude larger than database data. Kafka's persistence layer is optimized for fast linear reads and writes; data is batched end-to-end to avoid small I/O operations; and the network layer is optimized for zero-copy data transfer.
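The end-to-end batching idea can be sketched in a few lines. This is not Kafka's actual API, only an illustration of the principle that buffering events and flushing them in large groups turns many small I/O operations into a few linear writes:

```python
class BatchingProducer:
    """Buffers events in memory and emits them in large batches."""

    def __init__(self, flush_fn, batch_size=1000):
        self.flush_fn = flush_fn      # e.g. an append to a commit log
        self.batch_size = batch_size
        self.buffer = []

    def send(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one large sequential write
            self.buffer = []

# Usage: 7 events with batch_size=3 produce 3 writes instead of 7.
log = []
producer = BatchingProducer(flush_fn=log.append, batch_size=3)
for i in range(7):
    producer.send({"event": "PageView", "id": i})
producer.flush()
```

The real system applies this at every hop (producer, broker, consumer), which is what "batched end-to-end" refers to.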

WORKFLOWS The data stored on HDFS is processed by numerous chained MapReduce jobs that form a workflow, a directed acyclic graph of dependencies. Workflows are built using a variety of tools available in the Hadoop ecosystem (Hive, Pig, and native MapReduce are the three primary processing interfaces). To support data interoperability, Avro is used as the serialization format. In the case of Hive, partitions are created for every event data topic during the data pull, thereby allowing users to run queries within partitions. A wrapper restricts the data read in Pig and native MapReduce to a certain time range, based on the parameters specified. Because these workflows can get fairly complex, LinkedIn uses Azkaban, an open source workflow scheduler, to manage them. LinkedIn maintains three Azkaban instances, one corresponding to each of its Hadoop environments.
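The "directed acyclic graph of dependencies" can be made concrete with a tiny scheduler sketch: each job runs only after all of its dependencies have completed, which is exactly a topological order of the DAG. The job names below are hypothetical:

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical names).
workflow = {
    "extract-events": set(),
    "join-profiles": {"extract-events"},
    "score-recommendations": {"join-profiles"},
    "push-to-voldemort": {"score-recommendations"},
}

def run_workflow(dag):
    # static_order() yields every job after all of its dependencies.
    order = list(TopologicalSorter(dag).static_order())
    for job in order:
        # a real scheduler would launch a Pig/Hive/MapReduce job here
        print(f"running {job}")
    return order

order = run_workflow(workflow)
```

A production scheduler like Azkaban additionally runs independent branches in parallel and handles retries, but the dependency ordering is the same idea.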

EGRESS The results of these workflows are usually pushed to other systems, either back to the online serving side or as a derived data set for further consumption. This is achieved by the workflows adding an extra job (consisting of a simple one-line command) at the end of their pipeline for data delivery out of Hadoop. There are three main mechanisms available, depending on the needs of the application: Key-value: the derived data can be accessed as an associative array or collection of associative arrays; Streams: data is published as a change-log of data tuples; OLAP: data is transformed offline into multi-dimensional cubes for later online analytical queries.
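The key-value mechanism works by building the store files offline: the workflow's derived output is partitioned by key hash into one chunk per serving node, and the serving layer (Voldemort, in LinkedIn's case) then swaps the new chunks in. The hashing and layout below are simplified assumptions for illustration, not Voldemort's actual on-disk format:

```python
import hashlib

def partition(key, num_nodes):
    """Deterministically assign a key to a serving node by hashing it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

def build_store(records, num_nodes):
    """Batch job: split derived (key, value) output into per-node chunks."""
    chunks = [{} for _ in range(num_nodes)]
    for key, value in records:
        chunks[partition(key, num_nodes)][key] = value
    return chunks  # in a real system, each chunk is written as store files

# Hypothetical derived data: member id -> recommended jobs.
records = [("member:1", ["job-a"]), ("member:2", ["job-b"]), ("member:3", ["job-c"])]
chunks = build_store(records, num_nodes=2)
```

Because the same `partition` function is used at serve time, an online lookup for `"member:2"` goes straight to the node holding its chunk, and the batch-built files can be swapped in atomically without write traffic to the live store.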

APPLICATIONS Most features at LinkedIn rely on this data pipeline either explicitly (data being the product) or implicitly (derived data infusing into the application). Many applications leverage the Kafka ingress flow and use Azkaban as their workflow and dependency manager, scheduling their computation on some cadence (e.g., every 4 hours) or on some trigger (e.g., when new data arrives). Key-Value: key-value access using Voldemort is the most common egress mechanism from Hadoop. Over 40 different products use this mechanism, and it accounts for approximately 70% of all Hadoop data deployments at LinkedIn. Some common mechanisms and features are described later.

Streams

OLAP

CONCLUSION AND FUTURE WORK This paper presented the end-to-end Hadoop-based analytics stack at LinkedIn: how data flows into the offline system, how workflows are constructed and executed, and the mechanisms available for sending data back to the online system. Avenues for future work: first, MapReduce is not suited for processing large graphs, as it can lead to suboptimal performance and usability issues; second, the batch-oriented nature of Hadoop is limiting for some derived data applications.

Thank You