The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Similar documents
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

The Big Data Ecosystem at LinkedIn

Introduction To Hive

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

ITG Software Engineering

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Data Analyst Program- 0 to 100

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Complete Java Classes Hadoop Syllabus Contact No:

Workshop on Hadoop with Big Data

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Dominik Wagenknecht Accenture

Cloudera Certified Developer for Apache Hadoop

Hadoop Development & BI- 0 to 100

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

BIG DATA - HADOOP PROFESSIONAL amron

COURSE CONTENT Big Data and Hadoop Training

Putting Apache Kafka to Use!

Hadoop Ecosystem B Y R A H I M A.

Qsoft Inc

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Using Kafka to Optimize Data Movement and System Integration. Alex

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

HADOOP. Revised 10/19/2015

Implement Hadoop jobs to extract business value from large and varied data sets

Constructing a Data Lake: Hadoop and Oracle Database United!

Kafka & Redis for Big Data Solutions

Big Data Course Highlights

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Upcoming Announcements

Certified Big Data and Apache Hadoop Developer VS-1221

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Apache Hadoop: Past, Present, and Future

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Comprehensive Analytics on the Hortonworks Data Platform

Native Connectivity to Big Data Sources in MSTR 10

HDP Hadoop From concept to deployment.

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Dell In-Memory Appliance for Cloudera Enterprise

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Hadoop: The Definitive Guide

Introduction to Big Data Training

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Bringing Big Data to People

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Big Data Infrastructure at Spotify

Apache Hadoop Ecosystem

TRAINING PROGRAM ON BIGDATA/HADOOP

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Unified Batch & Stream Processing Platform

Hadoop & Spark Using Amazon EMR

Big Data Management and Security

Hadoop Job Oriented Training Agenda

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Streaming items through a cluster with Spark Streaming

Deploying Hadoop with Manager

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop and Map-Reduce. Swati Gore

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

June Production Hadoop systems in the enterprise

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Cloud Scale Distributed Data Storage. Jürmo Mehine

Building Scalable Big Data Pipelines

How to Enhance Traditional BI Architecture to Leverage Big Data

BIG DATA HADOOP TRAINING

Introduction to Apache Kafka And Real-Time ETL. for Oracle DBAs and Data Analysts

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Data Security in Hadoop

Important Notice. (c) Cloudera, Inc. All rights reserved.

Internals of Hadoop Application Framework and Distributed File System

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Analytics - Accelerated. stream-horizon.com

Transcription:

The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang

Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah.

The Ecosystems

Hadoop Ecosystem

Sqoop Flume ZooKeeper Pig Hive Oozie MapReduce HBase HDFS

Data Integration Problem

Some analysts performed this integration themselves. In other cases, analysts especially application users and scripters relied on the IT team to assemble this data for them.! Sean Kandel et al. Enterprise Data Analysis and Visualization: An Interview Study!!

Ecosystem at LinkedIn

Online datacenters Hadoop for ETL Offline production datacenters Offline development datacenters

Online datacenters Offline production datacenters

ETL Extract-Transform-Load More information about ETL is available from Cloudera

Online Datacenters Voldemort! Oracle

Voldemort Combines In-memory caching and storage system! Possible to emulate the storage layer! Reads and Writes scale horizontally! Simple API! Transparent data partitioning More information about Voldemort is available at: http://www.project-voldemort.com/voldemort/

Next Ingress and egress out of Hadoop system! Data flows

Ingress From online to offline

Online datacenters Hadoop for ETL

Online datacenters Database includes information about users, companies, connections, and other primary site data.! Event data includes logs of page views Hadoop for ETL being served, search queries, and clicks.

Challenges Provide infrastructure that can make all data available without manual intervention or processing.

Challenges Datasets are so large! Data is diverse! Data schemas are evolving! Data monitoring and validating

Kafka A high-throughput distributed messaging system, for all activity data.

Persists messages in a write-ahead log, partitioned and distributed over multiple brokers! Allows data publishers to add records to a log! New data streams to be added and scaled independent of one another.

Example Collecting data about searches being performed. The search service would produce these records and publish them to a topic named SearchQueries where any number of subscribers might read these messages.

Data Evolution Requires schemas changes! Two common solutions:! Load data streams in whatever form they appear.! Manually map the source data into a stable, well-thoughtout schema.

LinkedIn s Solution Retains the same structure throughout our data pipeline! Enforces compatibility and other correctness conventions! Schema evolves automatically! Schema check is done at compile and run time.! Review process.

Load into Hadoop Pull data from Kafka to Hadoop with a Map-only job.! Recurrence every 10 minutes.! Scheduled by Azkaban scheduling system

Online Datacenters Services Kafka Kafka Services Two Kafka clusters kept synchronized automatically.! The primary Kafka supports production. Offline Development Datacenters

Each step in the flow all publish an audit trail! Consist of the topic, Audit machine name, etc.! Confirm all published event reached all consumers by aggregating audit data.

Workflow Processed offline

A directed acyclic graph of dependencies.! Wrappers help restrict data Workflow in Hadoop being read to a certain time range based on parameters specified.! Hadoop-Kafka data pull job places the data into timepartitioned folders.

Workflow in Hadoop A directed acyclic graph of dependencies.! One workflow can have a size of 50-100 jobs.

Azkaban An open-sourced workflow scheduler.

Supports multiple job types! Run as individual or chained! Azkaban Configurations and dependencies are maintained! Visualize and manipulate dependencies via graphs in UI.

Experimenting, massage it into Construction of Workflows a useful form! Transform features into feature vectors! Trained into models! Iterate workflows

Kafka Azkaban Azkaban Hadoop for Hadoop for ETL development Offline Development Datacenters

Azkaban Staging Voldemort Read- Hadoop for production Offline Production Datacenters

Egress Back online!

Results Usually pushed to other systems (Back online or for further consumption)

Three main mechanism Key-Value! Streams! OLAP

Based on Amazon s Dynamo! Distributed and Elastic! Voldemort Horizontally scalable! Bulk load pipeline from Hadoop! Simple to use Reference: Slides of The big data ecosystem at LinkedIn, presented at SIGMOD 2013

Implemented using Hadoop API! Stream output performed by" Kafka Each MR slot acts as a Kafka produce More details available at http://kafka.apache.org

Automatically generates corresponding Azkaban job OLAP by Avatara pipelines.! Small cubes are multidimensional arrays of tuples! Each tuple is combination of dimension and measure pairs! Output cubes to Voldemort as a read-only store.

Applications

People you may know Collaborative Filtering Key-Value Skill Endosements Related Searches Applications Stream News Feed Updates Email Relationship Strength OLAP Who s viewed my profile? Who s viewed this job?

Future Work

MapReduce for processing large graphs! Streaming System for near line low-latency data processing