Pydoop: a Python MapReduce and HDFS API for Hadoop
Simone Leo, CRS4, Pula, CA, Italy ([email protected])
Gianluigi Zanetti, CRS4, Pula, CA, Italy ([email protected])

ABSTRACT

MapReduce has become increasingly popular as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop, written in Java. There is considerable interest, however, in tools that enable Python programmers to access the framework, due to the language's high popularity. Here we present a Python package that provides an API for both the MapReduce and the distributed file system sections of Hadoop, and show its advantages with respect to the other available solutions for Hadoop Python programming, Jython and Hadoop Streaming.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features (Modules, packages)

1. INTRODUCTION

In the past few years, MapReduce [16] has become increasingly popular, both commercially and academically, as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop [5], which also includes a distributed file system, HDFS, inspired by the Google File System [17]. Hadoop, a top-level Apache project, scales up to thousands of computing nodes and is able to store and process data on the order of petabytes. It is widely used across a large number of organizations [2], most notably Yahoo, which is also the largest contributor [7]. It is also included as a feature by cloud computing environments such as Amazon Web Services [1]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPDC'10, June 20-25, 2010, Chicago, Illinois, USA. Copyright 2010 ACM. Hadoop is fully written in Java, and provides Java APIs to interact with both MapReduce and HDFS. However, programmers are not limited to Java for application writing. The Hadoop Streaming library (included in the Hadoop distribution), for instance, allows the user to provide the map and reduce logic through executable scripts, according to a fixed text protocol. This, in principle, allows an application to be written in any programming language, but with restrictions concerning both the set of available features and the format of the data to be processed. Hadoop also includes a C library to interact with HDFS (interfaced with the Java code through JNI) and a C++ MapReduce API which takes advantage of the Hadoop Pipes package. Pipes splits the application-specific C++ code into a separate process, exchanging serialized objects with the framework via a socket. As a consequence, a C++ application is able to process any type of input data and has access to a large subset of the MapReduce components. Hadoop's standard distribution also includes Python examples that are, however, meant to be run on Jython [13], a Java implementation of the language which has several limitations compared to the official C implementation (CPython). In this work, we focused on building Python APIs for both the MapReduce and the HDFS parts of Hadoop. Python is an extremely popular programming language characterized by very high level data structures and good performance. Its huge standard library, complemented by countless third-party packages, allows it to handle practically any application domain.
Being able to write MapReduce applications in Python therefore constitutes a major advantage for any organization that focuses on rapid application development, especially when there is a substantial amount of internally developed libraries that could be reused. The standard and probably most common way to develop Python Hadoop programs is to either take advantage of Hadoop Streaming or use Jython. In Hadoop Streaming, an executable Python script can act as the mapper or reducer, interacting with the framework through standard input and standard output. The communication protocol is a simple text-based one, with newline characters as record delimiters and tabs as key/value separators. Therefore, it cannot process arbitrary data streams, and the user directly controls only the map and reduce parts (i.e., one cannot write a RecordReader or Partitioner). Jython is a Java implementation of the Python language, which allows a Python programmer to import and use Java packages. This, of course, allows access to the same features available to Java. However, this comes at a cost to the Python programmer: Jython is typically, at any given time, one or more releases older than CPython; it does not implement the full range of standard library modules; and it does not allow the use of modules written as C/C++ extensions. Using existing Python libraries for Hadoop with Jython is therefore only possible if they meet the above restrictions. Moreover, the majority of publicly available third-party packages are not compatible with Jython, most notably the numerical computation libraries (e.g., numpy [10]) that constitute an indispensable complement to Python for scientific programming. Our main goal was to provide access to as many as possible of the Hadoop features available to Java, while allowing compact CPython code development with little impact on performance. Since the Pipes/C++ API seemed to meet these requirements well, we developed a Python package, Pydoop, by wrapping its C++ code with Boost.Python [15]. We also wrapped the C libhdfs code to make HDFS operations available to Python. To evaluate the package in terms of performance, we ran a series of tests, purposely running a very simple application in order to focus on interaction with the Java core framework (any application with a nontrivial amount of computation performed inside the mapper and/or reducer would likely run faster if written in C++, independently of how it interfaces with the framework). We compare Pydoop with the other two solutions for Python application writing in Hadoop, Jython and Hadoop Streaming (with Python scripts), and also with Java and C++, to give a general idea of how much performance loss is to be expected when choosing Python as the programming language for Hadoop. Pydoop is currently being used in production for bioinformatics applications and other purposes [18]. It currently supports Python 2.5 and Python 2.6. The rest of the paper is organized as follows. After discussing related work, we introduce Pydoop's architecture and features; we then present a performance-wise comparison of our implementation with the other ones; finally, we give our conclusions and plans for future work.
2. RELATED WORK

Efforts aimed at easing Python programming on Hadoop include Happy [6] and Dumbo [4]. These frameworks, rather than providing a Java-like Python API for Hadoop, focus on building high-level wrappers that hide all job creation and submission details from the user. As far as MapReduce programming is concerned, they are built upon, respectively, Jython and Hadoop Streaming, and thus suffer from the same limitations. Existing non-Java HDFS APIs use Thrift [14] to make HDFS calls available to other languages [8] (including Python) by instantiating a Thrift server that acts as a gateway to HDFS. In contrast, Pydoop's HDFS module, being built as a wrapper around the C libhdfs code, is specific to Python but does not require a server to communicate with HDFS. Finally, there are several non-Hadoop MapReduce implementations available to write applications in languages other than Java. These include Starfish [12] (Ruby), Octopy [11] (Python) and Disco [3] (Python, framework written in Erlang). Since Hadoop is still the most popular and widely deployed implementation, however, we expect our Python bindings to be useful for the majority of MapReduce programmers.

Figure 1: Hadoop Pipes data flows. Hadoop communicates with user-supplied executables by means of a specialized protocol. Almost all components can be delegated to the user-supplied executables, but, at a minimum, it is necessary to provide the Mapper and Reducer classes.

3. ARCHITECTURE

3.1 MapReduce Jobs in Hadoop

MapReduce [16] is a distributed computing paradigm tailored for the analysis of very large datasets. It operates on an input key/value pair stream that is converted into a set of intermediate key/value pairs by a user-defined map function; the user also provides a reduce function that merges together all intermediate values associated with a given key to yield the final result.
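The paradigm just described can be sketched in a few lines of plain Python, independent of Hadoop: a map function emits intermediate pairs, a shuffle step groups them by key, and a reduce function merges each group. The names below are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_fn(key, value):
    # emit one (word, 1) pair per word; the input key is ignored
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # merge all intermediate values associated with one key
    yield key, sum(values)

def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):   # map phase
            groups[ik].append(iv)     # shuffle: group by intermediate key
    out = {}
    for ik in sorted(groups):         # reduce phase, one call per key
        for ok, ov in reduce_fn(ik, groups[ik]):
            out[ok] = ov
    return out

counts = mapreduce([(0, "a b a"), (6, "b a")], map_fn, reduce_fn)
# counts == {"a": 3, "b": 2}
```

In a real Hadoop job the shuffle is performed by the framework across machines; the point here is only the contract of the two user-defined functions.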
In Hadoop, MapReduce is implemented according to a master-slave model: the master, which performs task dispatching and overall job monitoring, is called the Job Tracker; the slaves, which perform the actual work, are called Task Trackers. A client launching a job first creates a JobConf object whose role is to set required (Mapper and Reducer classes, input and output paths) and optional job parameters (e.g., the number of mappers and/or reducers). The JobConf is passed on to the JobClient, which divides input data into InputSplits and sends job data to the Job Tracker. Task Trackers periodically contact the Job Tracker for work and launch tasks in separate Java processes. Feedback is provided in two ways: by incrementing a counter associated with some application-related variable or by sending a status report (an arbitrary text message). The RecordReader class is responsible for reading InputSplits and providing a record-oriented view to the Mapper. Users can optionally write their own RecordReader (the default one yields a stream of text lines). Other key components are the Partitioner, which assigns key/value pairs to reducers, and the RecordWriter, which writes output data to files. Again, these are optionally written by the users (the defaults are, respectively, to partition based on a hash value of the key and to write tab-separated output key/value pairs, one per line).
>>> import os
>>> from pydoop.hdfs import hdfs
>>> fs = hdfs("localhost", 9000)
>>> fs.open_file("f", os.O_WRONLY).write(open("f").read())

Figure 3: A compact HDFS usage example. In this case, a local file is copied to HDFS.

Figure 2: Integration of Pydoop with C++. In Pipes, method calls flow from the framework through the C++ and the Pydoop API, ultimately reaching user-defined classes; Python objects are wrapped by Boost.Python and returned to the framework. In the HDFS wrapper, instead, function calls are initiated by Pydoop.

3.2 Hadoop Pipes

Fig. 1 shows data flows in Hadoop Pipes. Hadoop uses a specialized class of tasks, Pipes tasks, to communicate with user-supplied executables by means of a protocol that uses persistent socket connections to exchange serialized objects. The C++ application provides a factory that is used by the framework to create the various components it needs (Mapper, Reducer, RecordReader, Partitioner...). Almost all Hadoop framework components can be overridden by C++ implementations, but, at a minimum, the factory should provide Mapper and Reducer object creation. Fig. 2 shows the integration of Pydoop with the C++ code. In the Pipes wrapper, method calls flow from the framework through the C++ code and the Pydoop API, ultimately reaching user-defined classes; Python objects resulting from these calls are returned to the framework wrapped in Boost.Python structures. In the HDFS wrapper, the control flow is inverted: function calls are initiated by Pydoop and translated into their C equivalents by the Boost.Python wrapper; resulting objects are wrapped back and presented as Python objects to the application level.

4. FEATURES

Pydoop allows the development of full-fledged MapReduce applications with HDFS access.
Its key features are:

- access to most MapReduce application components: Mapper, Reducer, RecordReader, RecordWriter, Partitioner;
- access to the context object passed by the framework, which allows the application to get JobConf parameters, set counters and report status;
- a programming style similar to that of the Java and C++ APIs: developers define classes that are instantiated by the framework, with methods also called by the framework (compare this to the Streaming approach, where the entire key/value stream must be handled manually);
- a CPython implementation: any Python module can be used, either pure Python or a C/C++ extension (this is not possible with Jython);
- HDFS access from Python.

from pydoop.hdfs import hdfs

MB = float(2**20)

def treewalker(fs, root_info):
    yield root_info
    if root_info["kind"] == "directory":
        for info in fs.list_directory(root_info["name"]):
            for item in treewalker(fs, info):
                yield item

def usage_by_bs(fs, root):
    stats = {}
    root_info = fs.get_path_info(root)
    for info in treewalker(fs, root_info):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

def main(argv):
    fs = hdfs("localhost", 9000)
    root = fs.working_directory()
    for k, v in usage_by_bs(fs, root).iteritems():
        print "%.1f %d" % (k / MB, v)
    fs.close()

Figure 4: A more complex, although somewhat contrived, HDFS example. Here, a directory tree is walked recursively and statistics on file system usage by block size are built.

4.1 A Simple HDFS Example

One of the strengths of Python is interactivity. Fig. 3 shows an almost one-liner that copies a file from the local file system to HDFS. Of course, in this case the Hadoop HDFS shell equivalent would be more compact, but a full API provides more flexibility. The Pydoop HDFS interface is written as a Boost.Python wrapper around the C libhdfs, itself a JNI wrapping of the Java code, so it essentially supports the same array of features.
We also added a few extensions, such as a readline method for HDFS files, in order to provide the Python user with an interface that is reasonably close to that of standard Python file objects. Fig. 4 shows a more detailed, even though somewhat contrived, example: a script that walks through a directory tree and builds statistics on HDFS usage by block size. This is an example of a useful operation that cannot be performed with the Hadoop HDFS shell, yet requires writing only a small amount of Python code.
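Note that the aggregation logic of Fig. 4 does not actually depend on HDFS: it only needs an object exposing `get_path_info` and `list_directory`. It can therefore be exercised against a hypothetical in-memory stand-in (the `FakeFs` class below is ours, not part of Pydoop):

```python
class FakeFs(object):
    """In-memory stand-in implementing only the two methods Fig. 4 uses."""

    def __init__(self, infos):
        # infos maps path -> {"kind", "name", and, for files, "size", "block_size"}
        self.infos = infos

    def get_path_info(self, path):
        return self.infos[path]

    def list_directory(self, path):
        # direct children of `path`
        return [i for p, i in self.infos.items()
                if p != path and p.rsplit("/", 1)[0] == path]

def treewalker(fs, root_info):
    yield root_info
    if root_info["kind"] == "directory":
        for info in fs.list_directory(root_info["name"]):
            for item in treewalker(fs, info):
                yield item

def usage_by_bs(fs, root):
    # tally total file size per block size, as in Fig. 4
    stats = {}
    for info in treewalker(fs, fs.get_path_info(root)):
        if info["kind"] == "directory":
            continue
        stats[info["block_size"]] = stats.get(info["block_size"], 0) + info["size"]
    return stats

infos = {
    "/d": {"kind": "directory", "name": "/d"},
    "/d/a": {"kind": "file", "name": "/d/a", "size": 10, "block_size": 64},
    "/d/b": {"kind": "file", "name": "/d/b", "size": 30, "block_size": 64},
    "/d/c": {"kind": "file", "name": "/d/c", "size": 5, "block_size": 128},
}
stats = usage_by_bs(FakeFs(infos), "/d")
# stats == {64: 40, 128: 5}
```

The same `usage_by_bs` function then works unchanged when handed a real Pydoop `hdfs` instance.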
from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class WordCountReducer(Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))

Figure 5: The simplest implementation of the classic word count example in Pydoop.

4.2 A Simple MapReduce Example

Fig. 5 shows the simplest implementation of the classic word count example in Pydoop. All communication with the framework is handled through the context object. Specifically, through the context, Mapper objects get input key/value pairs (in this case the key, equal to the byte offset within the input file, is not needed) and emit intermediate key/value pairs; reducers get intermediate keys along with their associated sets of values and emit output key/value pairs. Fig. 6 shows how to include counter and status updates in the Mapper and Reducer. Counters are defined by the user: they are usually associated with relevant application parameters (in this case, the mapper counts input words and the reducer counts output words). Status updates are simply arbitrary text messages that the application reports back to the framework. As shown in the code snippet, communication of counter and status updates happens through the context object. Fig. 7 shows how to implement a RecordReader for the word count application. The RecordReader processes the InputSplit, a raw byte chunk from an input file, and divides it into key/value pairs to be fed to the Mapper. In this example, we show a Python reimplementation of Hadoop's default RecordReader, where keys are byte offsets with respect to the whole file and values are text lines: this RecordReader is therefore not specific to word count (although it is the one that the word count Mapper expects).
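The key trick such a line-oriented RecordReader must implement is split-boundary handling: a split usually starts mid-line, so every reader whose split offset is nonzero discards the first, partial line, which is instead consumed by the reader of the previous split. A minimal pure-Python sketch over an in-memory buffer (not the Pydoop API) illustrates why every line is read exactly once:

```python
import io

def read_split(data, offset, length):
    """Yield the full lines belonging to the split [offset, offset + length)."""
    f = io.BytesIO(data)
    f.seek(offset)
    bytes_read = 0
    if offset > 0:
        # partial first line: it belongs to the previous split's reader
        bytes_read += len(f.readline())
    # keep reading while the split still has unconsumed bytes; a line that
    # starts inside the split is read to completion even if it overruns it
    while bytes_read <= length:
        record = f.readline()
        if not record:
            break
        bytes_read += len(record)
        yield record

data = b"one\ntwo\nthree\nfour\n"  # 19 bytes
first = list(read_split(data, 0, 10))    # covers bytes 0-9
second = list(read_split(data, 10, 9))   # covers bytes 10-18
# together the two splits yield every line exactly once
```

The reader of the first split keeps "three\n" because that line starts inside its split; the reader of the second split discards the "ree\n" tail for the same reason.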
Note that we are using the readline method we added to Pydoop HDFS file objects. Finally, Fig. 8 shows how to implement the RecordWriter and Partitioner for the word count application. Again, these are Python versions of the corresponding standard general-purpose Hadoop components, so they are actually not specific to word count. The RecordWriter is responsible for writing key/value pairs to output files: the standard behavior, which is replicated here, is to write one key/value pair per line, separated by a configurable separator that defaults to the tab character. This example also shows how to use the JobConf object to retrieve configuration parameters: in this case we are reading standard Hadoop parameters, but an application is free to define any number of arbitrary options whose values are read from the application's configuration file (exactly as in C++ Pipes applications). The Partitioner decides how to assign keys to reducers: again, in this example we show the standard way of doing this, by means of a hash function of the key itself. The framework passes the total number of reduce tasks to the partition method as the second argument.

class WordCountMapper(Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.inputWords = context.getCounter("WC", "INPUT_WORDS")

    def map(self, context):
        k = context.getInputKey()
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.inputWords, len(words))

class WordCountReducer(Reducer):

    def __init__(self, context):
        super(WordCountReducer, self).__init__(context)
        context.setStatus("initializing")
        self.outputWords = context.getCounter("WC", "OUTPUT_WORDS")

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))
        context.incrementCounter(self.outputWords, 1)

Figure 6: A word count implementation that includes counters and status updates: through these, the application can report information on its progress to the framework.

class WordCountReader(RecordReader):

    def __init__(self, context):
        super(WordCountReader, self).__init__()
        self.isplit = InputSplit(context.getInputSplit())
        self.host, self.port, self.fpath = split_hdfs_path(
            self.isplit.filename)
        self.fs = hdfs(self.host, self.port)
        self.file = self.fs.open_file(self.fpath, os.O_RDONLY)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # read by the reader of the previous split
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            return (False, "", "")
        key = struct.pack(">q", self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":
            return (False, "", "")
        self.bytes_read += len(record)
        return (True, key, record)

    def getProgress(self):
        return min(float(self.bytes_read) / self.isplit.length, 1.0)

Figure 7: Word count RecordReader example. The RecordReader converts the byte-oriented view of the InputSplit to the record-oriented view needed by the Mapper. Here, we show some code snippets from a plug-in replacement of Hadoop's standard Java LineRecordReader, where keys are byte offsets with respect to the whole file and values (records) are text lines.

class WordCountWriter(RecordWriter):

    def __init__(self, context):
        super(WordCountWriter, self).__init__(context)
        jc = context.getJobConf()
        jc_configure_int(self, jc, "mapred.task.partition", "part")
        jc_configure(self, jc, "mapred.work.output.dir", "outdir")
        jc_configure(self, jc, "mapred.textoutputformat.separator",
                     "sep", "\t")
        outfn = "%s/part-%05d" % (self.outdir, self.part)
        host, port, fpath = split_hdfs_path(outfn)
        self.fs = hdfs(host, port)
        self.file = self.fs.open_file(fpath, os.O_WRONLY)

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class WordCountPartitioner(Partitioner):

    def partition(self, key, numOfReduces):
        reducer_id = (hash(key) & sys.maxint) % numOfReduces
        return reducer_id

Figure 8: RecordWriter and Partitioner examples for word count. The former is responsible for writing key/value pairs to output files, while the latter decides how to assign keys to reducers. Again, these are Python implementations of the corresponding standard components. Note how the JobConf is used to retrieve application parameters.

5. COMPARISON

In the previous sections, we compared Pydoop with the other solutions for Hadoop application writing in terms of characteristics such as convenience, development speed and flexibility. In this section we present a performance comparison, obtained by running the classic word count example in the different implementations. We chose word count for two main reasons: it is a well-known, representative MapReduce example, and it is simple enough to make comparisons between different languages sufficiently fair.
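Since this section compares against Hadoop Streaming, it is worth recalling what a Streaming word count looks like: two stand-alone Python scripts exchanging tab-separated lines over standard input/output, with the framework sorting the mapper output by key between the two phases. The following is a hedged sketch, combined into one file for illustration; in a real Streaming job the two functions would be separate executables passed via -mapper and -reducer:

```python
import sys

def run_mapper(lines):
    # Streaming mapper: one "word\t1" line per input word
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def run_reducer(lines):
    # Streaming reducer: input arrives sorted by key, so counting
    # reduces to detecting key changes
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

if __name__ == "__main__":
    # locally, `sorted` plays the role of the framework's shuffle/sort
    for out in run_reducer(sorted(run_mapper(sys.stdin))):
        print(out)
```

Note how much bookkeeping the reducer needs compared to Fig. 5: with Streaming, grouping values by key is the programmer's responsibility.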
We ran our tests on a cluster of 48 identical machines, each one equipped with two dual-core 1.8GHz AMD Opterons, 4GB of RAM and two 160GB hard disks (one dedicated to the operating system and the other to HDFS and MapReduce temporary local directories), connected through Gb Ethernet. To run the word count example, we generated 20GB of random text data by sampling from a list of English words for spell checkers [9]. Specifically, we merged the english-words.* lists from the SCOWL package, including levels 10 through 70, for a total of 144,577 words (each word in the final 20 GB database appeared about 15 thousand times). The data generation itself was developed as a Pydoop map-only MapReduce application. The input stream consisted of N lines of text, each containing a single integer that represents the amount of data to be generated. We made the word list available to all machines via the Hadoop Distributed Cache. Word sampling from the list was uniform, while line lengths followed a Gaussian distribution with a mean of 120 characters and unit variance. Since each mapper processes a subset of the input records, N was set equal to a multiple of the maximum map task capacity (that is, the number of nodes multiplied by the number of concurrent map tasks per node). Due to the relatively low amount of memory available on the test cluster, and to the fact that only one disk was available, we set the maximum number of concurrent tasks (both map and reduce) to two per node, obtaining a total capacity of 96. Consequently, we configured the data generation application to distribute the random text over 96 files, with an HDFS block size of 128 MB. We ran the actual word count application, in all cases, with 192 mappers and 90 reducers. All tests described here were run with Hadoop, Java, Python and Jython (we did not use Jython 2.5 because it no longer supports jythonc).
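The sampling scheme described above (uniform word choice, Gaussian line length) is easy to reproduce outside Hadoop; here is a minimal sketch, where the word list, the helper name and the RNG seed are illustrative stand-ins, not the actual generator used in the paper:

```python
import random

def random_line(words, rng, mean_len=120, sigma=1.0):
    # uniform word sampling; line length targets a Gaussian distribution
    target = rng.gauss(mean_len, sigma)
    line = []
    length = 0
    while length < target:
        w = rng.choice(words)
        line.append(w)
        length += len(w) + 1  # word plus separating space
    return " ".join(line)

words = ["apple", "banana", "cherry", "date"]  # stand-in word list
rng = random.Random(42)  # seeded for reproducibility
lines = [random_line(words, rng) for _ in range(5)]
```

In the actual map-only job, each mapper would loop over calls like this until it has emitted its share of the requested output size.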
Our main goal was to compare Pydoop with the two other main options for writing Python Hadoop applications: Jython and Hadoop Streaming (in the latter case, we used executable Python scripts for the mapper and the reducer). In order to see how Pydoop compares with the two main other-language frameworks, we also compared the Pydoop implementation of word count with the Java and the C++ ones. Since there is no way to add a combiner executable to Streaming (although you can provide a Java one), we ran two separate test sessions: in the first one, we ran the official word count example (with the combiner class set equal to the reducer class) in the Java, C++, Pydoop and Jython versions; in the second one, we ran a word count without a combiner for all implementations included in the previous session, plus Streaming (with the mapper and reducer scripts implemented in Python). Fig. 9 shows results for the first test session (timings averaged over five iterations of the same run): Java is the fastest, C++ is second best, and Pydoop and Jython are the slowest (Pydoop yielded slightly better results, but the two are comparable within the error range). The fact that Java has the best performance is not surprising: this is the standard way of writing Hadoop applications, where everything runs inside the framework. The C++ application, on the other hand, communicates with the Java framework through an external socket-based protocol, adding overhead. Pydoop, being in turn built upon the C++ code, obviously adds more overhead. Moreover, in general, C++ code is expected to be much faster than its Python equivalent, even for simple tasks such as this one. The better integration of Jython with the framework is probably counterbalanced by the fact that it is generally slower than CPython. Fig. 10 shows results for the second test session (again, timings are averaged over five iterations of the same run).
For all implementations except Streaming, performance ranks are similar to those of the previous session: again, Jython's and Pydoop's performances are equal within the error range, this time slightly better for the former. The Hadoop Streaming implementation is considerably slower, mostly because of the plain text serialization of each key/value pair.

class WordCountReducer(Reducer):

    def reduce(self, context):
        context.emit(context.input_key, str(sum(context.itervalues())))

Figure 11: Pydoop word count reducer code written according to the new planned interface.

Figure 9: Total elapsed times, scaled by the average Java elapsed time (238s), for a word count on 20GB of data with Java, C++, Pydoop and Jython implementations. In this case we ran the official example, where the combiner class is set equal to the reducer class. We used 96 CPU cores distributed over 48 computing nodes. Timings are averaged over five iterations of the same run.

6. CONCLUSIONS AND FUTURE WORK

Pydoop is a Python MapReduce and HDFS API for Hadoop that allows object-oriented Java-style MapReduce programming in CPython. It constitutes a valid alternative to the other two main solutions for Python Hadoop programming, Jython and Hadoop Streaming-driven Python scripts. With respect to Jython, Pydoop has the advantage of being a CPython package, which means application writers have access to all Python libraries, either built-in or third-party, including any C/C++ extension module. Performance-wise, there is no significant difference between Pydoop and Jython.
With respect to Streaming, there are several advantages: application writing is done through a consistent API that handles communication with the framework through the context object, while in Streaming key/value passing must be handled manually through the standard input and output of the executable mapper and reducer scripts; almost all MapReduce components are accessible, while in Streaming only the map and reduce functions can be written; the text protocol used by Streaming imposes limits on data types; finally, performance with Streaming is considerably worse. Moreover, Pydoop also provides access to HDFS, allowing nontrivial tasks to be performed with the compactness and speed of development of Python. Although it is already being used internally in production, Pydoop is still under development. One of the most relevant enhancements we plan to add in the near future is a more Pythonic interface to objects and methods, in order to help Python programmers get familiar with it more easily. The features we plan to add include property access to keys and values and Python-style iterators for traversing the set of values associated with a given key. Fig. 11 shows how the word count reducer code will change after the aforementioned features are added.

Figure 10: Total elapsed times, scaled by the average Java elapsed time (338s), for a word count on 20GB of data with Java, C++, Jython, Pydoop and Hadoop Streaming (Python scripts) implementations. In this case we did not set a combiner class. We used 96 CPU cores distributed over 48 computing nodes. Timings are averaged over five iterations of the same run.

7. AVAILABILITY

Pydoop is available at

8. ACKNOWLEDGMENTS

The work described here was partially supported by the Italian Ministry of Research under the CYBERSAR project.

9. REFERENCES

[1] Amazon Elastic MapReduce.
[2] Applications and organizations using Hadoop.
[3] Disco.
[4] Dumbo.
[5] Hadoop.
[6] Hadoop + Python = Happy.
[7] Hadoop Common Credits.
[8] Hadoop Distributed File System (HDFS) APIs in Perl, Python, Ruby and PHP.
[9] Kevin's Word List Page.
[10] NumPy.
[11] Octopy: Easy MapReduce for Python.
[12] Starfish.
[13] The Jython Project.
[14] Thrift.
[15] D. Abrahams and R. Grosse-Kunstleve. Building hybrid systems with Boost.Python. C/C++ Users Journal, 21(7):29-36, 2003.
[16] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: 6th Symposium on Operating Systems Design and Implementation, 2004.
[17] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. ACM SIGOPS Operating Systems Review, 37(5), 2003.
[18] S. Leo, P. Anedda, M. Gaggero, and G. Zanetti. Using virtual clusters to decouple computation and data management in high throughput analysis applications. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, Pisa, Italy, February 2010.
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Evaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang [email protected] University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Comparison of Different Implementation of Inverted Indexes in Hadoop
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
HADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Hadoop Distributed File System Propagation Adapter for Nimbus
University of Victoria Faculty of Engineering Coop Workterm Report Hadoop Distributed File System Propagation Adapter for Nimbus Department of Physics University of Victoria Victoria, BC Matthew Vliet
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Hadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
Cloud Computing. Chapter 8. 8.1 Hadoop
Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer
Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)
using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed
Lecture 3 Hadoop Technical Introduction CSE 490H
Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated
Use of Hadoop File System for Nuclear Physics Analyses in STAR
1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
Facilitating Consistency Check between Specification and Implementation with MapReduce Framework
Facilitating Consistency Check between Specification and Implementation with MapReduce Framework Shigeru KUSAKABE, Yoichi OMORI, and Keijiro ARAKI Grad. School of Information Science and Electrical Engineering,
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Introduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Big Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
Data-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
Processing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200
Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Storage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
MapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
How To Write A Map Reduce In Hadoop Hadooper 2.5.2.2 (Ahemos)
Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Big Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. [email protected] http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
