An Industrial Perspective on the Hadoop Ecosystem
Eldar Khalilov, Pavel Valov




agenda
- Introduction
- Research goals
- Hadoop ecosystem in Facebook
- Hadoop ecosystem in LinkedIn
- Research progress
- References

03.12.2015

introduction
- Apache Hadoop is an open-source framework for distributed storage and processing of Big Data on clusters of commodity hardware
- Core parts: a storage part (HDFS) and a processing part (MapReduce)
- Data is split into blocks and distributed among cluster nodes
- Code is transferred to the nodes for parallel processing, based on which data each node holds
- Nodes manipulate their data locally using the assigned code
- Hadoop also refers to the ecosystem of related projects (Apache Pig, Apache Hive, Apache Spark, etc.)
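The block-splitting and "move the code to the data" idea from the introduction can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names, the block size, and the word-count task are all assumptions chosen for illustration, and the "nodes" are just list entries processed in a loop.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks, loosely mirroring how a file is cut into blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map phase: the same code is applied to one block and emits (word, 1) pairs."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key across the outputs of all map tasks."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: combine all values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # On a real cluster each block would be processed in parallel on the node that stores it.
    mapped = [map_block(block) for block in blocks]
    return reduce_counts(shuffle(mapped))
```

Calling `word_count(["a b", "b c", "a a"])` processes two blocks independently and only combines results in the shuffle and reduce steps, which is the property that lets the map work run where the data already lives.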

research goals
- How is the Hadoop ecosystem used in real-world scenarios?
- Selected a set of Big Data companies that use the Hadoop ecosystem: Facebook, LinkedIn, Twitter, Yahoo
- Explore how the Hadoop ecosystem was implemented
- Investigate what problems occurred during ecosystem development
- Examine how the problems were solved in each case
- Summarize the experience of Hadoop ecosystem implementation
- Provide patterns and advice for real-world implementation of the Hadoop ecosystem

hadoop ecosystem in facebook
- Facebook handles tremendous amounts of data daily
- Daily intake of compressed data grew from 5-6 TB to 10-15 TB in half a year
- Strong scalability requirements, using commodity hardware
- Hadoop and Hive provide storage and computation capabilities
- Hive brings SQL, metadata, etc., to the Hadoop ecosystem
- Scribe is a service that aggregates logs from thousands of web servers
- Scribe together with Hadoop provides a scalable log aggregation solution
- Together, Hadoop, Hive and Scribe form the log collection, storage and analytics infrastructure

hadoop ecosystem in facebook
[Figure: data flow architecture. Components shown:]
- Web Servers (viewing ads)
- Scribe-Hadoop Clusters (log aggregation)
- Production Hive-Hadoop Cluster (strict deadlines)
- Adhoc Hive-Hadoop Cluster (relaxed deadlines)
- Federated MySQL (ads info)

hadoop ecosystem in facebook
[Figure: Hive on Hadoop architecture. Components shown:]
- HIVE: JDBC, ODBC, Command Line Interface, Web Interface, Thrift Server, Driver (Compiler, Optimizer, Executor), Metastore
- HADOOP: Job Tracker, Name Node, Data Node + Task Tracker (three shown)

big data in linkedin
- Collaborative filtering
- People you may know
- Analytical dashboards

hadoop ecosystem in linkedin
[Figure: LinkedIn data flow. Components shown: Online datacenter, Oracle, Web apps, Apache Kafka, Hadoop, Apps, Offline datacenters]
Apache Kafka
- Publish-subscribe system
- All messages are divided into topics
- Subscribers can read these messages from the system
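The topic-based publish-subscribe model described for Apache Kafka can be sketched with a toy in-memory class. This is not the Kafka client API: `TopicLog`, `publish`, `poll`, and the topic and subscriber names are hypothetical, and the sketch ignores partitioning, persistence, and networking entirely.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a topic-based publish-subscribe system."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only list of messages
        self._offsets = defaultdict(int)   # (subscriber, topic) -> next position to read

    def publish(self, topic, message):
        """Append a message to the end of the given topic."""
        self._topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return the messages this subscriber has not yet read from the topic."""
        start = self._offsets[(subscriber, topic)]
        messages = self._topics[topic][start:]
        self._offsets[(subscriber, topic)] = len(self._topics[topic])
        return messages
```

Because each subscriber keeps its own read position, an offline consumer such as a Hadoop load job and an online consumer can read the same topic independently, each at its own pace.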

data deployment
- Output: large, key-value
- Problem: How to serve these massive outputs to all members?
- Solution: Project Voldemort
  - Distributed key-value store
  - Supports fast online read-writes
  - Scalable
  - Open source
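To make the idea of a distributed key-value store concrete, here is a minimal sketch that spreads keys over several "nodes" by hashing. It is an assumption-laden illustration, not Project Voldemort's API or implementation: `PartitionedStore` and its methods are hypothetical names, each node is a plain dictionary, and simple modulo hashing stands in for whatever partitioning scheme a real store uses.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that assigns each key to one of several in-memory nodes."""

    def __init__(self, num_nodes):
        self._nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash, so that a key always maps to the same node across runs.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self._nodes)

    def put(self, key, value):
        self._nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self._nodes[self._node_for(key)].get(key, default)

    def node_sizes(self):
        """Number of keys held by each node, useful for checking the spread."""
        return [len(node) for node in self._nodes]
```

A read touches only the one node that owns the key, which is what keeps online lookups fast as the number of nodes and the size of the Hadoop-produced output grow.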

workflow management
- Workflows are built using different Hadoop tools
- Workflows can be really complex
- LinkedIn Azkaban: a Hadoop workflow manager
  - Open source
  - Easy to use
- LinkedIn maintains two different Azkaban instances
  - Developer instance
  - Production instance
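A workflow manager has to run jobs in an order that respects their dependencies. The sketch below shows that core idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Azkaban's job format or API: the function names and the example job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return the jobs in an order where every job comes after the jobs it depends on.

    `dependencies` maps each job name to the set of jobs that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, run_job):
    """Run every job once, in dependency order, by calling run_job(job_name)."""
    for job in execution_order(dependencies):
        run_job(job)


# Hypothetical workflow: two analysis jobs share a cleaning step and feed one publish step.
EXAMPLE_WORKFLOW = {
    "load_logs": set(),
    "clean": {"load_logs"},
    "features": {"clean"},
    "report": {"clean"},
    "publish": {"features", "report"},
}
```

`run_workflow(EXAMPLE_WORKFLOW, print)` prints the job names in a valid order. `static_order()` raises `graphlib.CycleError` if the dependencies contain a cycle, so a malformed workflow fails before any job runs.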

research progress and future work
Done:
- Selected a set of Big Data companies: Facebook, LinkedIn, Twitter, Yahoo
- Explored how the Hadoop ecosystem was implemented
- Examined how design problems were solved in each case
In progress:
- Finalize analysis of ecosystem implementation
- Summarize experience of Hadoop ecosystem implementation
- Provide patterns and recommendations for real-world implementation of the Hadoop ecosystem

references
- Thusoo, Ashish, et al. "Data warehousing and analytics infrastructure at Facebook." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
- Thusoo, Ashish, et al. "Hive: a petabyte scale data warehouse using Hadoop." Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010.
- Borthakur, Dhruba, et al. "Apache Hadoop goes realtime at Facebook." Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011.
- Sumbaly, Roshan, Jay Kreps, and Sam Shah. "The big data ecosystem at LinkedIn." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
- Lin, Jimmy, and Alek Kolcz. "Large-scale machine learning at Twitter." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.
- Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
- Vavilapalli, Vinod Kumar, et al. "Apache Hadoop YARN: Yet another resource negotiator." Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013.
- Islam, Mohammad, et al. "Oozie: towards a scalable workflow management system for Hadoop." Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies. ACM, 2012.

questions?