An Industrial Perspective on the Hadoop Ecosystem
Eldar Khalilov, Pavel Valov
03.12.2015
agenda
- Introduction
- Research goals
- Hadoop ecosystem in Facebook
- Hadoop ecosystem in LinkedIn
- Research progress
- References
introduction
- Apache Hadoop is an open-source framework for distributed storage and processing of Big Data on clusters of commodity hardware
- Core parts: a storage part (HDFS) and a processing part (MapReduce)
- Data is split into blocks and distributed among cluster nodes
- Code is transferred to the nodes for parallel processing, based on where the data resides
- Nodes manipulate their local data using the assigned code
- Hadoop also refers to the ecosystem of related projects (Apache Pig, Apache Hive, Apache Spark, etc.)
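The map, shuffle and reduce phases described above can be sketched in miniature. This is a single-process toy, not Hadoop itself; the `blocks` list stands in for data blocks that would live on different cluster nodes:

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in a block of text."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Each "block" would be processed in parallel on a different node.
blocks = ["big data big cluster", "big data"]
pairs = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

The point of the real framework is that the map and reduce calls run on the nodes holding the blocks, so only the small intermediate pairs move over the network.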
research goals
- How is the Hadoop ecosystem used in real-world scenarios?
- Selected a set of Big Data companies that use the Hadoop ecosystem: Facebook, LinkedIn, Twitter, Yahoo
- Explore how the Hadoop ecosystem was implemented
- Investigate what problems occurred during ecosystem development
- Examine how the problems were solved in each case
- Summarize the experience of Hadoop ecosystem implementation
- Provide patterns and advice for real-world implementation of the Hadoop ecosystem
hadoop ecosystem in facebook
- Facebook deals with tremendous amounts of data daily
- Growth from 5-6 TB to 10-15 TB of compressed data daily within half a year
- Strong scalability requirements, using commodity hardware
- Hadoop and Hive provide storage and computation capabilities
- Hive brings SQL, metadata, etc. to the Hadoop ecosystem
- Scribe is a service that aggregates logs from thousands of web servers
- Scribe with Hadoop provides a scalable log aggregation solution
- Hadoop, Hive and Scribe together form the log collection, storage and analytics pipeline
hadoop ecosystem in facebook
[Diagram: Web Servers (viewing ads) feed Scribe-Hadoop Clusters (log aggregation); these, together with Federated MySQL (ads info), load data into the Production Hive-Hadoop Cluster (strict deadlines) and the Adhoc Hive-Hadoop Cluster (relaxed deadlines)]
hadoop ecosystem in facebook
[Diagram: Hive architecture on top of Hadoop. Hive layer: JDBC, ODBC, Command Line Interface and Web Interface connect through a Thrift Server to the Driver (Compiler, Optimizer, Executor) and the Metastore. Hadoop layer: Job Tracker, Name Node, and several Data Node + Task Tracker machines]
big data in linkedin
- Collaborative filtering
- People you may know
- Analytical dashboards
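A "People you may know" feature can be illustrated with the simplest possible heuristic: rank non-friends by the number of mutual friends (triangle closing). This toy graph and scoring rule are illustrative assumptions, not LinkedIn's actual algorithm, which runs over Hadoop at a vastly larger scale:

```python
# Toy friendship graph: user -> set of friends.
# A real system would read these edges from HDFS.
friends = {
    "ann":  {"bob", "carl"},
    "bob":  {"ann", "dave"},
    "carl": {"ann", "dave"},
    "dave": {"bob", "carl"},
}

def people_you_may_know(user):
    """Rank non-friends of `user` by the number of mutual friends."""
    scores = {}
    for other in friends:
        if other != user and other not in friends[user]:
            scores[other] = len(friends[user] & friends[other])
    return sorted(scores, key=scores.get, reverse=True)

print(people_you_may_know("ann"))  # ['dave']: two mutual friends (bob, carl)
```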
hadoop ecosystem in linkedin
[Diagram: online datacenter (Oracle, web apps) connected through Apache Kafka to offline datacenters running Hadoop apps]
- Apache Kafka is a publish-subscribe system
- All messages are divided into topics
- Subscribers read these messages from the system
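The topic-based publish-subscribe model can be sketched with a tiny in-memory broker. This is only loosely Kafka-like (an append-only log per topic, consumers pulling by offset); all class and topic names here are made up for illustration:

```python
from collections import defaultdict

class MiniBroker:
    """Toy publish-subscribe broker: per-topic message logs,
    consumers pull everything after their last-seen offset."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered message log
        self.offsets = defaultdict(int)   # (consumer, topic) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        """Return all messages this consumer has not yet seen on the topic."""
        offset = self.offsets[(consumer, topic)]
        messages = self.topics[topic][offset:]
        self.offsets[(consumer, topic)] = len(self.topics[topic])
        return messages

broker = MiniBroker()
broker.publish("page-views", "user1 /home")
broker.publish("page-views", "user2 /jobs")
first = broker.consume("hadoop-loader", "page-views")
second = broker.consume("hadoop-loader", "page-views")
print(first)   # ['user1 /home', 'user2 /jobs']
print(second)  # [] (already consumed)
```

Tracking a per-consumer offset, rather than deleting delivered messages, is what lets many independent subscribers (web apps, Hadoop loaders) read the same stream.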
data deployment
- Output of Hadoop jobs is large and key-value oriented
- Problem: how to serve these massive outputs to all members?
- Solution: Project Voldemort
  - Distributed key-value store
  - Supports fast online read-writes
  - Scalable
  - Open source
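The core idea of a partitioned key-value store can be sketched in a few lines: hash each key to one of N partitions, which in a real deployment would map to different nodes. This is a minimal single-process sketch, not Voldemort's actual design (which adds replication, versioning and bulk loading of Hadoop output):

```python
import hashlib

class MiniStore:
    """Toy partitioned key-value store: keys are hashed to one of
    N partitions, roughly how data is spread across nodes."""

    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, key):
        # Stable hash so the same key always lands on the same partition.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition(key)][key] = value

    def get(self, key, default=None):
        return self.partitions[self._partition(key)].get(key, default)

store = MiniStore()
store.put("member:42", {"name": "Ann"})
print(store.get("member:42"))  # {'name': 'Ann'}
```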
workflow management
- Workflows are built using different Hadoop tools
- Workflows can be really complex
- LinkedIn Azkaban, a Hadoop workflow manager:
  - Open source
  - Easy to use
- LinkedIn maintains two separate Azkaban instances: a developer instance and a production instance
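At its core, a workflow manager runs jobs in dependency order over a DAG. The sketch below shows that core idea with Python's standard-library topological sorter; the job names are hypothetical, and a real manager like Azkaban adds scheduling, retries, resource management and a web UI on top:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: job -> set of jobs it depends on.
workflow = {
    "extract-logs": set(),
    "clean-logs": {"extract-logs"},
    "join-profiles": {"extract-logs"},
    "build-report": {"clean-logs", "join-profiles"},
}

def run_workflow(workflow):
    """Execute jobs in an order that respects all dependencies."""
    order = list(TopologicalSorter(workflow).static_order())
    for job in order:
        print(f"running {job}")
    return order

order = run_workflow(workflow)  # extract-logs first, build-report last
```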
research progress and future work
Done:
- Selected a set of Big Data companies: Facebook, LinkedIn, Twitter, Yahoo
- Explored how the Hadoop ecosystem was implemented
- Examined how design problems were solved in each case
In progress:
- Finalize analysis of ecosystem implementation
- Summarize the experience of Hadoop ecosystem implementation
- Provide patterns and recommendations for real-world implementation of the Hadoop ecosystem
references
- Thusoo, Ashish, et al. "Data warehousing and analytics infrastructure at Facebook." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
- Thusoo, Ashish, et al. "Hive: a petabyte scale data warehouse using Hadoop." Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010.
- Borthakur, Dhruba, et al. "Apache Hadoop goes realtime at Facebook." Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011.
- Sumbaly, Roshan, Jay Kreps, and Sam Shah. "The big data ecosystem at LinkedIn." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
- Lin, Jimmy, and Alek Kolcz. "Large-scale machine learning at Twitter." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.
- Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
- Vavilapalli, Vinod Kumar, et al. "Apache Hadoop YARN: Yet another resource negotiator." Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013.
- Islam, Mohammad, et al. "Oozie: towards a scalable workflow management system for Hadoop." Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies. ACM, 2012.
questions?