An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov, Pavel Valov

8 agenda
- Introduction
- Research goals
- Hadoop ecosystem in Facebook
- Hadoop ecosystem in LinkedIn
- Research progress
- References

15 introduction
- Apache Hadoop is an open-source framework for distributed storage and processing of Big Data on clusters of commodity hardware
- Core parts: a storage part (HDFS) and a processing part (MapReduce)
- Data is split into blocks and distributed among cluster nodes
- Code is transferred to the nodes that hold the data, for parallel processing
- Nodes manipulate their data locally using the assigned code
- Hadoop also refers to the ecosystem of related projects (Apache Pig, Apache Hive, Apache Spark, etc.)
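
The flow on this slide (split data into blocks, send code to the data, process locally, combine results) can be sketched in plain Python. This is a single-process toy and not Hadoop's API: the function names `split_into_blocks`, `map_block` and `reduce_counts`, the sample lines, and the block size are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Split the input into fixed-size blocks (stand-in for HDFS blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """The 'code sent to a node': count words in that node's local block."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Merge the per-block partial counts into one final result."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["hadoop hive", "hive pig", "hadoop spark", "hadoop"]
blocks = split_into_blocks(lines, block_size=2)
partials = [map_block(block) for block in blocks]  # each call stands in for one node
word_counts = reduce_counts(partials)
print(word_counts["hadoop"], word_counts["hive"])
```

In a real cluster each `map_block` call would run on the node that already stores the block, so only the small partial results travel over the network.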

23 research goals
- How is the Hadoop ecosystem used in real-world scenarios?
- Select a set of Big Data companies that use the Hadoop ecosystem: Facebook, LinkedIn, Twitter, Yahoo
- Explore how the Hadoop ecosystem was implemented
- Investigate what problems occurred during ecosystem development
- Examine how the problems were solved in each case
- Summarize the experience of Hadoop ecosystem implementation
- Provide patterns and advice for real-world implementation of the Hadoop ecosystem

32 hadoop ecosystem in facebook
- Facebook deals with tremendous amounts of data daily
- Growth from 5-6 TB to 10-15 TB of compressed data daily in half a year
- Strong scalability requirements, using commodity hardware
- Hadoop and Hive provide storage and computation capabilities
- Hive brings SQL, metadata, etc., to the Hadoop ecosystem
- Scribe is a service that aggregates logs from thousands of web servers
- Scribe with Hadoop provides a scalable log aggregation solution
- Hadoop, Hive and Scribe together cover log collection, storage and analytics
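
The division of labour between log collection and SQL-style analytics might look roughly like the sketch below. None of this is the Scribe or Hive API: the server names, log categories, and table layout are hypothetical, and Python's built-in `sqlite3` is used only as a stand-in for a SQL layer over stored logs.

```python
import sqlite3
from collections import defaultdict

# Hypothetical per-server logs: server name -> list of (category, ad_id).
server_logs = {
    "web01": [("ad_view", 7), ("ad_click", 7)],
    "web02": [("ad_view", 7), ("ad_view", 9)],
}

# Collection step (the role Scribe plays): merge per-server logs by category.
aggregated = defaultdict(list)
for server, entries in server_logs.items():
    for category, ad_id in entries:
        aggregated[category].append((server, ad_id))

# Storage and analytics step (the role Hadoop and Hive play).
# sqlite3 is only a stand-in here, chosen because it ships with Python.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ad_view (server TEXT, ad_id INTEGER)")
db.executemany("INSERT INTO ad_view VALUES (?, ?)", aggregated["ad_view"])
views_per_ad = dict(
    db.execute("SELECT ad_id, COUNT(*) FROM ad_view GROUP BY ad_id").fetchall()
)
print(views_per_ad)
```

The point of the sketch is the separation of concerns: analysts write a declarative query and never touch the collection path.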

38 hadoop ecosystem in facebook
Diagram components:
- Web Servers (viewing ads)
- Scribe-Hadoop Clusters (log aggregation)
- Production Hive-Hadoop Cluster (strict deadlines)
- Adhoc Hive-Hadoop Cluster (relaxed deadlines)
- Federated MySQL (ads info)

41 hadoop ecosystem in facebook
Diagram components:
- HIVE: JDBC, ODBC, Command Line Interface, Web Interface, Thrift Server, Driver (Compiler, Optimizer, Executor), Metastore
- HADOOP: Job Tracker, Name Node, three Data Node + Task Tracker pairs

45 big data in linkedin
- Collaborative filtering
- People you may know
- Analytical dashboards

51 hadoop ecosystem in linkedin
Diagram components: Online datacenter, Oracle, Web apps, Apache Kafka, Hadoop Apps, Offline datacenters
Apache Kafka:
- Publish-subscribe system
- All messages are divided into topics
- Subscribers can read these messages from the system
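
The publish-subscribe idea on this slide can be illustrated with a small in-memory model. This is not the Kafka client API: the class `TopicLog`, its methods, the topic names, and the subscriber names are hypothetical, and the per-subscriber read position is an assumption of the sketch rather than something the slide states.

```python
from collections import defaultdict


class TopicLog:
    """A toy in-memory publish-subscribe log, grouped by topic."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only message list
        self._offsets = defaultdict(int)   # (subscriber, topic) -> next index

    def publish(self, topic, message):
        """Append a message to the end of its topic."""
        self._topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not read yet, then advance."""
        start = self._offsets[(subscriber, topic)]
        messages = self._topics[topic][start:]
        self._offsets[(subscriber, topic)] = len(self._topics[topic])
        return messages


log = TopicLog()
log.publish("page_views", {"member": 1, "page": "/jobs"})
log.publish("searches", {"member": 2, "query": "hadoop"})
log.publish("page_views", {"member": 2, "page": "/feed"})

first = log.poll("hadoop_loader", "page_views")   # both page views
second = log.poll("hadoop_loader", "page_views")  # nothing new
other = log.poll("dashboard", "page_views")       # independent subscriber
print(len(first), len(second), len(other))
```

Because each subscriber keeps its own position, an offline consumer such as a Hadoop loader and an online consumer can read the same topic without interfering with each other.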

56 data deployment
- Output: large, key-value
- Problem: how to serve these massive outputs to all members?
- Solution: Project Voldemort
  - Distributed key-value store
  - Supports fast online reads and writes
  - Scalable
  - Open source
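
A distributed key-value store spreads keys across nodes so that each lookup touches only one of them. The sketch below shows that idea with simple hash partitioning. It is not Voldemort's API or its actual partitioning scheme: `PartitionedStore`, the `member:<id>` key format, and the stored values are hypothetical.

```python
import hashlib


class PartitionedStore:
    """A toy key-value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = PartitionedStore(num_nodes=4)
# Hypothetical batch output: a short list of suggestions for each member.
for member_id in range(100):
    store.put(f"member:{member_id}", [member_id + 1, member_id + 2])

print(store.get("member:42"))
print(sum(len(node) for node in store.nodes))
```

Reads stay fast as the output grows because adding nodes shrinks the share of keys each node must hold.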

63 workflow management
- Workflows are built using different Hadoop tools
- Workflows can be really complex
- LinkedIn Azkaban
  - Hadoop workflow manager
  - Open source
  - Easy to use
- LinkedIn maintains two different Azkaban instances
  - Developer instance
  - Production instance
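
At its core, a workflow manager runs jobs in an order that respects their dependencies. The sketch below shows that ordering step with Python's standard `graphlib` module (Python 3.9+). It is not Azkaban's job format or API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "extract_logs": set(),
    "build_features": {"extract_logs"},
    "train_model": {"build_features"},
    "score_members": {"train_model", "build_features"},
    "push_to_store": {"score_members"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
for step, job in enumerate(order, start=1):
    print(step, job)
```

`TopologicalSorter` raises `graphlib.CycleError` if the jobs depend on each other in a loop, which is the kind of mistake a separate developer instance can catch before a workflow reaches production.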

72 research progress and future work
Done:
- Selected a set of Big Data companies: Facebook, LinkedIn, Twitter, Yahoo
- Explored how the Hadoop ecosystem was implemented
- Examined how design problems were solved in each case
In progress:
- Finalize analysis of ecosystem implementation
- Summarize the experience of Hadoop ecosystem implementation
- Provide patterns and recommendations for real-world implementation of the Hadoop ecosystem

73 references
- Thusoo, Ashish, et al. "Data warehousing and analytics infrastructure at facebook." Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM.
- Thusoo, Ashish, et al. "Hive-a petabyte scale data warehouse using hadoop." Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE.
- Borthakur, Dhruba, et al. "Apache Hadoop goes realtime at Facebook." Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM.
- Sumbaly, Roshan, Jay Kreps, and Sam Shah. "The big data ecosystem at linkedin." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM.
- Lin, Jimmy, and Alek Kolcz. "Large-scale machine learning at twitter." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM.
- Shvachko, Konstantin, et al. "The hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE.
- Vavilapalli, Vinod Kumar, et al. "Apache hadoop yarn: Yet another resource negotiator." Proceedings of the 4th annual Symposium on Cloud Computing. ACM.
- Islam, Mohammad, et al. "Oozie: towards a scalable workflow management system for hadoop." Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies. ACM.

74 questions?