The Hadoop ecosystem: an overview
Lamine Aouad, University College Dublin


Hadoop [1] has become the de facto standard in the big data space. It has also become the kernel of a large distributed ecosystem. A collection of projects, many of them at Apache, has emerged as essential tooling to be used alongside Hadoop for a number of important operations, including data management, collection and aggregation, transfer, high-level interfaces, BI reporting, and so on. The community of applications and use cases running on Hadoop, and using these projects, is constantly growing. In this paper, we will have a look at these different projects, how they are used, and how they complement Hadoop. A high-level map of the ecosystem is illustrated in Figure 1.

Before describing these projects, it is worth mentioning a very interesting emerging project in the ecosystem called Bigtop [2]. Its primary goal is to build a community around the Hadoop ecosystem by providing common packaging and interoperability testing. Indeed, the Hadoop ecosystem includes a range of sophisticated projects, and although each of these projects has a rigorous community practicing a high degree of software development discipline, there was a gap when it came to inter-project integration, with various dependency issues between the projects. The Hadoop space is now actually full of companies filling this gap and delivering fully integrated, pre-packaged Hadoop stacks, such as Cloudera [4] and Karmasphere [3], among others. Bigtop also fills this gap for the community, by focusing on the ecosystem as a whole rather than on individual projects.

Figure 1. The Hadoop ecosystem: Core (MapReduce, HDFS); Data collection / I/O (Scribe, Flume, Chukwa, Sqoop); Workflow (Oozie, Azkaban, Cascading); OLTP & NoSQL (HBase, Cassandra); UI & BI, monitoring / Packaging (Hue, Cloudera, Karmasphere); Management / Support (Zookeeper, Avro); High-level interfaces (Pig, Hive); Cloud deployments (Whirr, Amazon EMR, Infochimps).

A vanilla Hadoop distribution comes with two core components: its MapReduce engine and its file system, HDFS (Hadoop Distributed File System). The MapReduce engine coordinates and executes jobs via two roles: the jobtracker and the tasktrackers. Note that, in Hadoop terminology, a job is the full program, which is then broken down into a set of tasks. On the other hand, as already mentioned, Hadoop comes with HDFS by default, but it can be (and has been) integrated and used with other file systems and storage systems.
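To make the job and task terminology concrete, here is a minimal sketch of a native Java MapReduce application: a job that counts records per key, anticipating the per-location check-in analysis developed later in this paper. The class names, paths, and tab-delimited input format are illustrative assumptions, not taken from any particular deployment.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocationCount {

  // Map phase: emit (locationId, 1) for each tab-delimited check-in record
  public static class CheckinMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[4]), ONE); // field 4: location id
    }
  }

  // Reduce phase: sum the counts emitted for each location
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "location-count");
    job.setJarByClass(LocationCount.class);
    job.setMapperClass(CheckinMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The jobtracker schedules the map and reduce tasks of this job across the tasktrackers; the mapper and reducer classes are essentially all the developer writes.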

Many distributions and deployments actually use custom file systems, such as QFS from Quantcast [44], recently released as open source, IBM's GPFS [49], and Amazon's S3 [45], among others. HDFS also has two basic roles: a single namenode storing the metadata (and not involved in ordinary data activity), and the datanodes storing the actual data chunks (typically collocated with the tasktrackers). There is a lot to say about Hadoop execution and the way MapReduce and HDFS work, how to develop an application (using the native Java interface or streaming), and so on. However, this is outside the scope of this paper, as the primary focus is to introduce the ecosystem around these two core components; there are plenty of introductory courses, training resources, and tutorials available online. In the following, we start by describing data integration within the ecosystem, which I believe is one of the most important aspects here.

Data connections

Data comes in many shapes and formats, is captured and stored using a range of technologies, and is distributed across different places. A number of tools deal with delivering that data, structured or unstructured, to Hadoop. Many companies have built their own tools to serve their own needs; some have been released as open source to the community, and have gained in functionality and support. These include Scribe [5], Flume [6], and Sqoop [7], among others. There are also a number of proprietary tools providing connectivity, extraction, and mapping capabilities, such as Pervasive's data integrator. The next few paragraphs briefly introduce these tools.

Unstructured data

Reliable and efficient data preparation is paramount. Data preparation here means the basic assembly and delivery of the data: collection, aggregation, and movement of distributed data, as opposed to the data mining point of view, where it includes an extra step of manipulating the data to enhance its utility for the mining process. The challenge is how to get large amounts of data, logs or any other data, to HDFS to be processed by Hadoop, and to do so in a reliable and fault-tolerant manner: losing all the click logs of a website for a given period of time, because of a namenode failure for instance, is not tolerable. A number of (quite similar) tools have been developed to stream large flows of data to Hadoop. These include Scribe and Flume (already mentioned), and Chukwa [8]. They provide similar functionality, but differ in the way they are configured, their monitoring, the levels of reliability available, etc. Note that basic tools exist and can be scripted, such as the HDFS copy commands, which provide a way to get data in and out of Hadoop, but these cannot realistically be used in a large-scale production environment.
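The HDFS copy commands just mentioned also have a programmatic counterpart. A minimal sketch using Hadoop's FileSystem API, with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster location (core-site.xml) from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop fs -put /var/log/access.log /logs/access.log
    fs.copyFromLocalFile(new Path("/var/log/access.log"),
                         new Path("/logs/access.log"));

    // Equivalent of: hadoop fs -get /logs/access.log /tmp/access.log
    fs.copyToLocalFile(new Path("/logs/access.log"),
                       new Path("/tmp/access.log"));
    fs.close();
  }
}

This kind of scripted copying is fine for ad-hoc transfers; continuous, fault-tolerant collection at scale is precisely the gap that Scribe, Flume, and Chukwa fill.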

Facebook's Scribe, for instance, was initially designed to reliably scale to tens of billions of messages a day; it is also used at Twitter and Digg, among others. Cloudera's Flume is likewise used in production by a number of companies, including Photobucket. An additional similar system is Kafka [9], initially developed at LinkedIn. It is comparable to Flume and Scribe and can be used for activity stream processing, but it differs in its primitives, as it was initially designed as a distributed messaging system. Although these tools offer similar functionality, they might serve different use cases in terms of set-up, manageability, integration, etc., in addition to the support and development momentum around them, which could be a deciding factor in choosing one over the other.

Structured data

The same challenge applies to structured data sources, including relational databases, enterprise warehouses, and also NoSQL systems. Although some tools exist, properly placing Hadoop alongside existing sources of data is still quite challenging. In this case, it is not a collection issue but rather an integration or mapping issue, i.e. the integration of Hadoop into the existing infrastructure, and the mapping of the existing data to Hadoop and related systems. Efficient integration and mapping processes will certainly make the transition to Hadoop a lot easier. Sqoop [7] is one of the tools allowing the transfer of large amounts of data between Hadoop and structured data stores. Sqoop can be used to import data into HDFS or into systems working alongside Hadoop, including HBase [10] (cf. the NoSQL section). It comes with various connectors to popular databases, including MySQL and PostgreSQL, as well as proprietary ones such as Oracle, Teradata, etc., and it can be extended with new connectors. Sqoop also provides an API for operation and management, which is very handy for integration with external support and workflow systems. Sqoop was started at Cloudera, and is now quite established in enterprise systems. Another system, Hiho [11], from a company called Nubetech, aims at providing similar integration functionality with existing data stores; however, it does not seem to be in active development anymore.
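Sqoop is usually driven from its command line, but the operation and management API just mentioned also allows it to be embedded in other systems. A hedged sketch, assuming Sqoop 1's Sqoop.runTool entry point; the connection string, credentials, and table name are hypothetical:

import org.apache.sqoop.Sqoop;

public class OrdersImport {
  public static void main(String[] args) {
    // Import a MySQL table into HDFS as delimited text files;
    // each argument pair mirrors a flag of the sqoop CLI
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://db.example.com/sales",
        "--username", "etl",
        "--table", "orders",
        "--target-dir", "/data/sales/orders",
        "--num-mappers", "4"
    };
    int ret = Sqoop.runTool(importArgs);
    System.exit(ret);
  }
}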

The importance of properly bridging existing data stores with Hadoop has led many companies, especially those in the fields of data delivery and integration, to enter the Hadoop space. A data integration company called Pervasive [12] has developed a Hadoop edition of its data integrator, which enables users to roll data in and out of more than 200 different data sources. It offers a browser-based user interface and visual drag-and-drop mappers integrating sources and targets. Another visual drag-and-drop data extraction and integration tool is Pentaho's Kettle [13], which has an open source community edition. Kettle offers data migration, export, and loading capabilities for a range of formats and database engines, including CSV, XML, Oracle, SQL Server, MySQL, and PostgreSQL, among others.

Management - support

Deploying, configuring, managing, and maintaining distributed platforms and applications are complex matters. One of the main reasons is the very large scale, and thus failure, which is intrinsic to these systems. There have been efforts by the community to develop coordination and support tools around Hadoop. One of these tools is Yahoo!'s Zookeeper [14], which can basically be described as a fault-tolerant middleware providing the common services needed in distributed infrastructures to maintain the general availability, configuration, and synchronisation of the system. Zookeeper is in essence a centralised service, but it can maintain a hierarchical tree of servers containing what are called znodes, a core concept in Zookeeper. A znode is basically a unified notion of a node that holds the data and metadata managing the cluster-wide configuration and status of the system and applications.
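A minimal sketch of the Zookeeper Java client maintaining such a configuration znode; the ensemble address, path, and payload are illustrative, and a production client would also wait for the connection event before issuing calls:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigZnode {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble (session timeout in milliseconds)
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000,
        new Watcher() {
          public void process(WatchedEvent event) {
            System.out.println("event: " + event);
          }
        });

    // Create a persistent znode holding a piece of cluster-wide configuration
    zk.create("/app-config", "maxWorkers=42".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any node in the cluster can now read (and watch) the same znode
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}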

In another context, a useful support tool for Hadoop is a serialisation tool called Avro [15]. Note that Hadoop has its own marshalling format and does not use Java's native one, which, incidentally, according to Doug Cutting (the creator of Hadoop), was big and hairy, whereas he thought Hadoop needed something lean and mean. Avro defines a serialisation format with support for richer data types (such as nested data), as well as for RPC, versioning, and many languages. The variety of possible pieces in a Hadoop-based system has indeed made data interoperability a primary issue, and Avro is a good support for managing data throughout the potentially complex data flows generated by Hadoop set-ups; it is becoming a common format for the ecosystem. A Cloudera tool called RecordBreaker [53] automatically turns text-formatted data (server logs, sensor readings, etc.) into structured Avro data, without the need to write parsers or extractors, which is extremely useful.

High-level interfaces

Hadoop's MapReduce is still fairly low-level. Developers must think in terms of maps and reduces, keys and values, partitioning, etc. They also need skills in Java, which is the native interface, and/or in a scripting language using the streaming interface. Implementing complicated applications in MapReduce can be quite challenging: mapping applications onto the map and reduce patterns, which usually requires multiple stages, is far from trivial. The Hadoop community has therefore developed high-level interfaces raising the level of abstraction, essentially implementing common building blocks and operators for ad-hoc data analysis. Pig [16] and Hive [17] are two widely used high-level interfaces. I am going to be a bit biased towards Pig, since I have been using it for a little while. Pig was initially developed at Yahoo!, and is a data analysis platform made up of two pieces: a data flow language called PigLatin, and an execution environment mapping statements (written in PigLatin) into maps and reduces and executing them on Hadoop. In Pig, the data structure is much richer, typically multivalued and nested. A PigLatin script defines a DAG (Directed Acyclic Graph), a step-by-step set of operations, each performing some transformation on the data. PigLatin will be quite straightforward for people familiar with a scripting language. It also allows rapid prototyping of algorithms for processing large datasets, and it has a pretty decent compiler: PigMix [18], a set of queries used to compare Pig performance against pure MapReduce jobs, currently shows Pig at about 1.1 times the MapReduce speed on average (over 20+ benchmarks).

Pig runs as a client-side application, and can actually reside outside of the Hadoop cluster. It can be used in different ways (the embedded route is sketched below):

o the shell, called Grunt,
o submitting scripts directly, or
o embedded in Java, using the PigServer Java class (similar to SQL using JDBC), or in Python.
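As a sketch of the embedded route, the following assumes Pig's PigServer class and previews the check-in analysis developed below; paths and aliases are illustrative:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TopLocationsDriver {
  public static void main(String[] args) throws Exception {
    // ExecType.MAPREDUCE submits to the cluster; ExecType.LOCAL runs locally
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each registerQuery call adds a statement to Pig's logical plan
    pig.registerQuery("records = LOAD 'input/checkindata' AS " +
        "(userid:int, date:chararray, lat:double, lg:double, locid:int);");
    pig.registerQuery("checkins = FOREACH records GENERATE userid, locid;");
    pig.registerQuery("grouped = GROUP checkins BY locid;");
    pig.registerQuery("counts = FOREACH grouped GENERATE $0, COUNT($1);");

    // store() forces compilation into MapReduce jobs and executes them
    pig.store("counts", "output/checkins_per_location");
    pig.shutdown();
  }
}

Note that registerQuery only extends the logical plan; nothing runs until store (or an iterator) forces execution, a point discussed further below.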

Pig scripts can be executed in two modes: local and Hadoop. In local mode, Pig still depends on Hadoop's local job runner (in versions >0.7), which reads from the local file system and executes jobs locally. Hadoop mode needs access to a Hadoop cluster and an HDFS installation. There is also a range of tools and plugins that ease the editing, monitoring, debugging, and integration of Pig workflows, including PigPen, Penny, and PigPy, among many others [19].

In terms of the language, PigLatin has multiple statements that make it look like SQL; however, it is actually quite different from relational systems. There are many features Pig has that are absent in relational database systems, and vice versa. These include more complex data structures in Pig, as opposed to the flatter data structures of SQL, as well as Pig's ability to use user-defined functions and streaming, which makes it more customisable. On the other hand, transactions and indexes are absent in Pig, as are the random reads and writes supported in relational systems, and so on. The other high-level interface we have mentioned, Hive, actually sits between Pig and relational database systems, as its language is a variant of SQL and it operates on data stored in tables.

Let's have a look at a simple PigLatin example. The following calculates the top locations in check-in data. The data comes from Gowalla [20], which was a location-based social networking site; this very simple analysis will show the top social landmarks across a city.

records = LOAD 'input/checkindata' AS (userid:int, date:chararray, lat:double, lg:double, locid:int);
-- get the users and their check-in locations
checkins = FOREACH records GENERATE userid, locid;
-- group checkins by locid
grouped_records = GROUP checkins BY locid;
-- number of checkins per location
how_many_per_location = FOREACH grouped_records GENERATE $0, COUNT($1);
ordered = ORDER how_many_per_location BY $1 DESC;
-- get the top 50
top_locations = LIMIT ordered 50;
STORE top_locations INTO 'output/toplocations';

This script loads the input data from a field-delimited text file, each line holding the user id, the date, the latitude and longitude, and the location id. Pig supports different formats, and can be customised to load and store additional ones. The AS clause in the LOAD statement associates a schema with records: the schema simply associates names and types with the fields of the relation, e.g. the user id is an integer. This is optional but recommended; if the AS clause is absent, Pig treats all fields as bytearray, which may lead to errors. LOAD takes a URI argument. It is important to note that the output of any statement in Pig is a set of tuples, a tuple being a sequence of fields of any type. This statement produces a set of (user id, date, latitude, longitude, location id) tuples. The user can give any name, or alias, to the successive statements. Note that each statement is not immediately executed, but rather added to a logical plan; execution does not start until the whole flow has been defined, at which point the logical plan is compiled into a physical one and executed. In the meantime, there are a few useful diagnostic operators in the Grunt shell, which allow the user to interact with each statement and see what it produces:

DUMP: triggers the execution and displays the results,
ILLUSTRATE: shows a sample execution,
DESCRIBE: shows the schema,
EXPLAIN: shows the logical and physical plans.

The second statement in the example uses FOREACH GENERATE, which acts on columns of every row of a relation; in this case, from records it generates only two fields in the relation checkins. The checkins are then grouped using locid as the grouping key. At this stage, we have one tuple per location in the input data, containing the location id and a bag of tuples for that location. A bag is just a collection of tuples, represented by curly braces in Pig. It looks like this:

grouped_records: (group: int, checkins: {(userid: int, locid: int)})

By grouping checkins in this way, we have created a row per location. The grouping field (the location) is given the alias group, and the second field has the same structure as checkins. What remains is simply to add up the tuples of each bag to find the actual number of check-ins per location. The next statement carries out this counting, using positional references ($0 and $1), which is equivalent to using the names of the fields. The following statements order the results, take the top 50, and store the result. In a handful of lines of PigLatin we have carried out an analytics operation that would have been a lot more complex using maps and reduces. An interesting read about the mapping of PigLatin statements into MapReduce jobs can be found in [21].

On the other hand, Hive, as already mentioned, sits between Pig and relational systems, as it uses a variant of SQL called HiveQL. Hive defines tables for the data and keeps its metadata, such as schemas and partitions (subdirectories in HDFS), in either a shared or a local database. The schema is explicit and the types are defined upfront (unlike in Pig). As in Pig, on the other hand, users can define custom functions, called User Defined Functions. Hive provides both shell and web interfaces. Hive is a good option for data analysts already familiar with SQL, and is consequently better suited to traditional data analytics. Hive was initially developed at Facebook, and is currently used by many sites, including Digg and Scribd, among others. There is also a query language called Jaql [50], based on JSON (JavaScript Object Notation). It was open sourced by IBM, but its development has been on hold for a little while now; it is, however, part of BigInsights, IBM's Hadoop distribution [51].
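To make Hive's positioning concrete, the top-locations analysis from the Pig example collapses into a single HiveQL query. A hedged sketch using Hive's JDBC interface; the server address and table name are hypothetical, and the original HiveServer1 driver is assumed:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopLocationsHive {
  public static void main(String[] args) throws Exception {
    // HiveServer1 JDBC driver; later HiveServer2 uses a different driver/URL
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://hive.example.com:10000/default", "", "");
    Statement stmt = con.createStatement();

    // Same analysis as the Pig script: check-ins per location, top 50
    ResultSet rs = stmt.executeQuery(
        "SELECT locid, COUNT(*) AS checkins " +
        "FROM checkin_data GROUP BY locid " +
        "ORDER BY checkins DESC LIMIT 50");
    while (rs.next()) {
      System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}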

Workflow systems

Workflow systems have become increasingly attractive in a range of fields, including scientific computing, the coordination of business processes, etc. Workflow systems basically describe, coordinate, and map a sequence of jobs onto one of several underlying execution systems (or onto individuals). A workflow engine supporting different types of Hadoop jobs (Java, streaming, Pig, etc.) is Oozie [22]. Many people in distributed systems are actually very familiar with workflow systems, and would prefer to use them in developing their distributed applications, as they offer simple interfaces and languages coupled with resource-independent management technologies able to efficiently schedule and map the workflow specification onto a large pool of resources. Like many workflow languages, Oozie uses XML syntax to define the dependent set of actions as a DAG. There is another workflow tool called Azkaban [23], from LinkedIn, which has more or less similar functionality.

Another useful system, in the same spirit of allowing the composition of rich analytics applications on top of Hadoop, is Cascading [24]. It consists of a data processing API and query planner to define, share, and execute workflows on top of Hadoop. One of the core concepts in Cascading is the notion of pipes, which define a series of reusable data processing operations such as parsing, filtering, joining, etc.; a workflow is then a set of pipes. Cascading can be seen as a different way of doing what Pig does, but with a Java API. This differs from the workflow systems mentioned above, which describe the workflow specification using a dedicated language (XML, for instance); in this sense, one could end up with an implementation combining both Oozie and Cascading, for example. There is another tool, from Cloudera, called Crunch [52]: a Java library focused on providing a set of primitive operations and user-defined functions that can be combined to create complex multi-stage pipelines. Crunch then compiles these pipelines into a sequence of MapReduce jobs and manages their execution. This sounds a lot like Cascading; the main difference seems to be that Crunch is more suitable for complex data types that do not map naturally onto the tuple-based model, avoiding the implementation of complex user-defined functions to support such data types under Cascading or Pig.
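A hedged sketch of a small Crunch pipeline, assuming the MRPipeline API with illustrative names and paths: the parallelDo primitive applies a user-defined function, and count() compiles down to a MapReduce aggregation, mirroring the GROUP/COUNT stages of the Pig script above.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class TopLocationsCrunch {
  public static void main(String[] args) {
    MRPipeline pipeline = new MRPipeline(TopLocationsCrunch.class);
    PCollection<String> lines = pipeline.readTextFile("input/checkindata");

    // Primitive operation: extract the location id from each record
    PCollection<String> locations = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            emitter.emit(line.split("\t")[4]); // field 4: location id
          }
        }, Writables.strings());

    // count() becomes a MapReduce aggregation when the pipeline runs
    PTable<String, Long> counts = locations.count();
    pipeline.writeTextFile(counts, "output/checkins_per_location");
    pipeline.done();
  }
}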

NoSQL data stores

There are too many aspects to NoSQL technologies for us even to begin to do them justice in this small section; there will be a dedicated paper on these technologies and on NoSQL data management on the cloud. Briefly, the core aspects of NoSQL have initially been unstructured data, a variety of data models, and the sacrifice of consistency in favour of availability and scalability. The origin of the latter is Brewer's CAP theorem, from 2000. It basically stated that consistency and availability cannot both be maintained when a database is partitioned across a fallible large network; or, in general, that for any given system, one can strongly support only two of the three requirements: Consistency, Availability, and Partition tolerance. Many of the NoSQL technologies rose alongside the big Internet companies, and the list of existing NoSQL systems is now quite impressive [25].

As opposed to the master/slave setup of relational databases, most NoSQL stores feature a peer-to-peer protocol to maintain their data (though not all of them: MongoDB [28], for instance, uses a master/slave scheme). Decentralisation has the advantage of removing the single point of failure, since all the nodes play identical roles; such systems are also easier to maintain and configure for the same reason. It also serves scalability, which is an essential architectural feature: a core requirement is that the data store must be able to accept new data, nodes, and users without major disruption of the system, reconfiguration, or performance degradation. Existing NoSQL data stores have different data models:

o Key-value stores, such as Redis [30],
o Column family stores, such as Cassandra [27],
o Document stores, such as CouchBase [26] or MongoDB [28],
o Graph-based stores, such as Neo4j [29].

The integration of these technologies with Hadoop & co. is also a hot topic, and is of great interest to the community and the big data players. A company called DataStax [41], for instance, has built its business model around the integration of Cassandra and Hadoop.

Packaged distributions - deployments

The Hadoop marketplace is quite crowded nowadays! Hadoop-based solution packaging and distributions have flourished in the last couple of years. The best-known (and most used?) distribution would be Cloudera's [4] (the company that employs the original author of Hadoop, Doug Cutting). Others include MapR [31], Hortonworks [32], and Karmasphere [3], among others. The Bigtop project, already mentioned earlier in this paper, serves a similar purpose of packaging the ecosystem, but under the Apache umbrella. On the other hand, there are also various ready-to-use deployments on the cloud, including Amazon Elastic MapReduce (EMR) [33], Infochimps [34], and Hadoop on Azure [35], among others. Google's AppEngine [46] also offers a MapReduce facility that can be used as part of an application, but it is not an analytics offering per se (Google has its own big data offerings, including BigQuery). Most of these offerings can be programmed in the usual ways, using native Java, streaming, Pig, Hive, etc., and they offer SDKs, tools, and interfaces to develop, access, and manage resources and jobs, in addition to NoSQL stores and working and big data storage solutions. There is also the possibility for users to deploy their own Hadoop cluster on the cloud. Whirr [36], for instance, provides a set of libraries for running cloud-neutral services using a common API. It started as a set of bash scripts in the Hadoop project allowing deployment on EC2, and was then ported to Python to include extra services and a wider range of providers. Its CLI provides a quite easy and convenient way to launch a cluster in a few minutes.

A quick word on the multi-cloud trend, which allows the use of various cloud providers to deploy a given application or service: many libraries (and also PaaS offerings) now support a range of provider operations, languages, frameworks, and services. These include libcloud [37] and jclouds [38] (which was used to create the Java version of the initial scripts in Whirr). On the PaaS side, Cloud Foundry would be an interesting project to mention here.

Bigtop Hadoop distribution

Installing Hadoop and each of its projects independently can be very cumbersome. Installing the Bigtop Hadoop distribution instead provides a running instance of Hadoop, with various ecosystem projects, quickly and easily. In [39], the interested reader will find a walkthrough of how to install Hadoop from Bigtop on a range of Linux distributions; this avoids the pain of figuring out the dependency issues the hard way! Alternatively, a user can download pre-packaged and configured virtual machines, from vendors and other sources, which can be used to explore Hadoop and the ecosystem straight away (such as Cloudera's Hadoop demo VMs [4]).

User interfaces / BI

Packaged solutions also provide additional user interfaces facilitating interaction with Hadoop and the other tools in the ecosystem. Cloudera's distributions, for instance, offer a web-based interface called Hue [40], which includes a file browser, a job designer and browser, a Hive UI (called Beeswax), workflow scheduling, profiling, etc. Note that Hadoop itself comes with basic web interfaces that provide some information about what is happening in the cluster in terms of job tracking, capacity, job statistics, etc. Hue is open source, but seems to work only with Cloudera's distributions. There are also quite a few proprietary tools; we have mentioned DataStax, for instance, which has a management tool called OpsCenter providing advanced functionality for managing their Cassandra/Hadoop stack. Another visual development interface is Pentaho [42] (we have already mentioned Kettle, the open source data integration tool from this company). In addition to development, data access, integration, and so on, these tools also offer business intelligence and data mining capabilities. Another dedicated tool in this area is Mahout [43], a machine learning library implementing a range of algorithms in clustering, classification, filtering, etc. on top of Hadoop. Mahout has also built a vibrant community around a range of use cases in frequent itemset generation, personalisation, recommendation, and so on; Yahoo!, Foursquare, and Twitter, among many others, use Mahout implementations for various services. Another tool, dedicated to web mining, is Bixo [47]. Bixo uses Cascading to run specialised web mining applications (pipes, to use the Cascading terminology) on top of Hadoop: it fetches content from the web, parses it (using Tika parsers [48]), and analyses the data (tokenising, classifying, etc.).

Conclusion

This article briefly presented the Hadoop ecosystem, which is still rapidly growing. There are, and will be, many business and research opportunities around the whole ecosystem in terms of implementation, integration, support, management, user experience, and so on. On the business side, for instance, one of the big challenges is to help potential users sort out their options to suit their objectives, and end up using the right stack and components for their needs. In terms of research, many areas are of prime importance, including availability (the story of the namenode being a single point of failure has certainly been widely looked at, and many distributions and research projects provide solutions to it), as well as performance issues, the need for concurrency-optimised and a wider range of input/output patterns, efficient data placement and movement, versioning, improved data integration, more resource-aware scheduling, support for real-time data streaming, and so on. These issues, among others, will make up an important part of the future of Hadoop and its ecosystem.

Links

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53]


More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

TRAINING PROGRAM ON BIGDATA/HADOOP

TRAINING PROGRAM ON BIGDATA/HADOOP Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Big Data and Hadoop for the Executive A Reference Guide

Big Data and Hadoop for the Executive A Reference Guide Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

Lightweight Stack for Big Data Analytics. Department of Computer Science University of St Andrews

Lightweight Stack for Big Data Analytics. Department of Computer Science University of St Andrews Lightweight Stack for Big Data Analytics Muhammad Asif Saleem Dissertation 2014 Erasmus Mundus MSc in Dependable Software Systems Department of Computer Science University of St Andrews A dissertation

More information

Open Source Technologies on Microsoft Azure

Open Source Technologies on Microsoft Azure Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information