Integrating NLTK with the Hadoop Map Reduce Framework
Human Language Technology Project


Paul Bone
June 2008

Contents

1 Introduction
2 Method
2.1 Hadoop and Python
2.2 Data and MapReduce
3 Implementation
3.1 Dependencies
3.2 Hadoop Pipes 4 Python
3.3 Distributional Similarity Test Case
4 Results
5 Evaluation
References
Appendices
A Dependencies

1 Introduction

Modern Natural Language Processing (NLP) makes heavy use of prepared data in order to gather statistics. Many NLP analyses require large amounts of data before meaningful statistics can be gathered, and many are also computationally complex, whether over a single sentence or over an entire data set. These two factors can make NLP very slow, and the problem may worsen as more data is collected for use in NLP. Cluster computing makes it easier to process large data sets.

The Natural Language Toolkit (NLTK) is a framework and suite of software for NLP [4], whilst Hadoop is an open source MapReduce framework [1]. MapReduce-style clustering is especially suited to processing large amounts of data [8]. It is also able to scale well as new cluster nodes are added, which may be necessary as a dataset grows. This report shows how NLTK and Hadoop can be combined to allow processing of very large datasets.

2 Method

2.1 Hadoop and Python

NLTK is implemented in the Python programming language, whereas Hadoop is implemented in Java. These languages have separate runtime systems, which makes it difficult to integrate them directly. Jython, an implementation of Python in Java [10], makes it easy to integrate the two languages and to make function calls from one into the other. However, Jython does not yet fully implement Python 2.4, the language version that NLTK is written in. It is possible to back-port NLTK to Python 2.2, but doing so would take a great deal of effort and would introduce maintenance problems for NLTK.

Hadoop provides two other ways of interacting with it. The first is Hadoop Streaming [7], which allows a regular Unix process to communicate with Hadoop over standard input and standard output. However, this will not correctly handle arbitrary data, since it uses newline characters to separate records and tab characters to separate keys from values. The other option is Hadoop Pipes [6], a C++ library that communicates with Hadoop. It is more flexible than Hadoop Streaming, and it is Swig-able: an interface for a scripting language such as Python can be generated by the Swig tool [2], and a Python program can then use the Hadoop Pipes C++ library via the generated interface. The Hadoop Pipes library has other benefits. For example, it allows a programmer to write a Combiner (a reduce-like task that runs inside the map task; see Section 2.2 for information about Hadoop tasks) in C++. It may also be possible to extend Hadoop Pipes to allow Python programs to directly access the Hadoop distributed file-system.
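To illustrate the framing restriction in Hadoop Streaming, the following is a minimal sketch of a streaming mapper in Python. Streaming delivers one record per line on standard input and expects each output record as a key and a value separated by a tab, so any key or value that itself contained a newline or a tab, such as pickled binary data, would be mis-split.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper (word-count style).
    # Records arrive one per line on stdin; output records are written
    # as "key<TAB>value" lines, so only plain line-oriented text is safe.
    import sys

    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write("%s\t%d\n" % (word, 1))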

2.2 Data and MapReduce

MapReduce [8] is a recent (and simple) idea for parallelising a computation across a cluster, provided that the computation is embarrassingly parallel. We define an embarrassingly parallel problem as one that can easily be divided into a large number of parts to be computed in parallel. When a large amount of data is the input to the computation, the data is split up into a number of map tasks. One or more reduce tasks then run to aggregate the results of the map tasks into a single set of results. (The names map and reduce are inspired by functions that perform a similar, often non-parallel, task in functional languages.)

Hadoop must split up input data when creating map tasks. By default this works on simple text files; extra formats can be supported by writing Java code that extends Hadoop. NLTK is distributed with many corpora (collections of texts), stored in different formats. Rather than writing Java code to read and write each individual format, it is easier to create a single new format capable of storing arbitrary data. NLTK can be used to process different types of data: words, sentences, and annotated text, including text in different languages. The format used with Hadoop must therefore preserve all annotations and encodings.

Using the binary encoding provided by Python's Pickle library (Pickle is the Python term for object serialisation), any Python object can be encoded into a binary representation. By reserving a byte (0xFF) as a marker that begins each record, a Hadoop task can seek to any arbitrary position in the file, scan for a marker, and begin processing records from that position. To prevent this marker from occurring within pickled data, we replace it with the two-byte sequence 0xFE 0xFD, and replace any 0xFE byte with the sequence 0xFE 0xFE.

Each record is made up of a key and value pair: the marker byte is followed by the key, then by the value. So that Hadoop can tell where the key ends and the value begins, each is preceded by its size as measured in bytes. The size is encoded using only seven bits per byte, to avoid producing the marker byte. (There are better ways to encode numbers; this method was selected for its simplicity.)

Hadoop sorts and groups records by their key between the map and reduce phases, which ensures that records with the same key are collected together and sent to the same reduce task in sequence. Because keys are stored as pickled data, this sorting may not order keys as expected; the grouping of records is unaffected, but reduce tasks have to take this into consideration.
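As a concrete illustration of this record format, the sketch below shows one plausible implementation of the escaping and of the seven-bit size encoding. It is reconstructed from the description above rather than taken from the HP4P Encode module: the function names are invented, and the fixed four-byte width of encode_size is an assumption, since the report does not say how the end of a size is detected.

    import cPickle as pickle

    MARKER = '\xff'

    def escape(data):
        # 0xFE becomes 0xFE 0xFE and 0xFF becomes 0xFE 0xFD, so the
        # marker byte can never occur inside an escaped record body.
        return data.replace('\xfe', '\xfe\xfe').replace('\xff', '\xfe\xfd')

    def unescape(data):
        # The inverse must scan left to right; a pair of naive string
        # replacements would mis-read sequences such as 0xFE 0xFD that
        # were present in the original data.
        out = []
        i = 0
        while i < len(data):
            if data[i] == '\xfe':
                if data[i + 1] == '\xfd':
                    out.append('\xff')
                else:
                    out.append('\xfe')
                i += 2
            else:
                out.append(data[i])
                i += 1
        return ''.join(out)

    def encode_size(n):
        # Assumed scheme: a fixed four bytes of seven bits each, most
        # significant first. The high bit of every byte stays clear, so
        # neither the marker nor the escape byte can appear in a size.
        return ''.join([chr((n >> shift) & 0x7f) for shift in (21, 14, 7, 0)])

    def write_record(out, key, value):
        # Record layout: marker, key size, key, value size, value.
        # Sizes here measure the escaped bytes, so a reader can consume
        # exactly that many bytes and then unescape them.
        k = escape(pickle.dumps(key, 2))
        v = escape(pickle.dumps(value, 2))
        out.write(MARKER + encode_size(len(k)) + k +
                  encode_size(len(v)) + v)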

3 Implementation

This project is built out of several components. Each component is described below, along with how to build and configure it.

3.1 Dependencies

A complete list of dependencies can be found in Appendix A. Most of these are simple to install and are available from the operating system's package manager. Configuration of Hadoop is non-trivial, so extra information is provided below. At least Hadoop 0.17 is required to use Hadoop Pipes 4 Python; it can be downloaded from the Apache Hadoop web site. See the Hadoop documentation [1] and the NLTK documentation [3] for more information and installation instructions.

deploy.py is provided with Hadoop Pipes 4 Python to aid in the deployment and configuration of a Hadoop cluster. Place this file and the Hadoop .tar.gz file in the same directory, then edit deploy.py to configure its settings; instructions are provided within deploy.py. When ready, execute deploy.py with a Python interpreter, and copy the resulting directory and deploy.py to a consistent location on each of the cluster nodes (consider an NFS-hosted location). Run deploy.py -s to start the cluster.

3.2 Hadoop Pipes 4 Python

Hadoop Pipes 4 Python (HP4P) is the main component of this project. It is implemented in Java, C++ and Python, and is a derived work of Hadoop Pipes [6] with minor modifications and extensions. Original work to handle the data encoding problems described in Section 2.2 has been added to HP4P. The Encode module of HP4P defines functions for encoding data for use on the cluster: Python objects are pickled to strings, and the resulting strings are encoded before being sent to the Java code in HP4P. The Java code adds the encoded sizes and prefixes every key-value pair with a marker byte. The reverse process is used to send data to the map and reduce components of a program. HP4P also contains a Python library that can perform the entire encoding before uploading data to the cluster.

As of the time of writing, HP4P is not yet part of NLTK or Hadoop. It is intended that one of these projects will host and maintain this code, so no URL can currently be provided for this work.

To build and install the Python section of this work, execute the commands shown in Figure 1 in the root directory of Hadoop Pipes 4 Python. This will also compile the C++ code and prepare the Python bindings that use it.

    $ python setup.py build
    $ sudo python setup.py install

Figure 1: Build and install the HP4P code

To build the Java section, change into the Java subdirectory of the project and edit the build.xml file; instructions are provided within this file. When done, run ant to compile the Java code. Copy the resulting file (hdp4p.jar) to the directory from which programs will be run on the cluster.
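In practice the deployment steps above might look like the following. The directory and host names are illustrative only; the -s flag is the cluster start option described above.

    $ python deploy.py            # build the configured Hadoop directory
    $ scp -r hadoop-0.17 deploy.py node01:/opt/hadoop/
                                  # repeat for each node, or use NFS
    $ python deploy.py -s         # start the cluster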

3.3 Distributional Similarity Test Case

To demonstrate this project, a distributional similarity program has been written. A corpus is scanned for all uses of a word, and the contexts in which that word appears are returned. In a second pass over the corpus, the most frequent of those contexts are found and the words appearing in them are returned. The result of this process is the set of words commonly seen in the same contexts as the first word. What is considered a context may vary between implementations; this program uses the word before and the word after. A simpler algorithm using only a single pass can be found in [5]. The algorithm used here requires two passes over the data: the first to find the contexts, and the second to find the other words used in those contexts.

This is performed over a portion of the Gigaword Corpus [9], using six years' worth of New York Times articles, roughly 2.1GB when compressed. This data is made up of stories, each of which has a by-line and multiple paragraphs; it is not otherwise processed, so the input must be tokenised. This is done in both phases, and the tokenised text is discarded after each phase.

The map and reduce tasks for each of the jobs are defined as global strings, and the use of combine phases can also be seen. Within the main() function, calls to Hadoop Pipes 4 Python create the job, describe the data, run the job, and retrieve and process the results. HP4P implements many of these features by executing the hadoop script, just as a user would to control the cluster.

A control has been established by modifying this program so that it does not use the cluster. The cluster-specific parts of the program have been removed; the map and reduce interface has been kept, however, and code has been added that uses this interface to drive the program. This kept the modifications as simple as possible, making them less likely to introduce bugs. The sequential program also uses the HP4P library to read and decode the input data from the distributed file system.
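The HP4P programming interface is not reproduced in this report, but based on the description above (map and reduce tasks defined at the top level, with job setup in main()), the first pass of the distributional similarity program might be sketched roughly as follows. The hp4p module and every name on it are invented for illustration; only the overall shape, a map emitting context keys with combine and reduce phases summing counts, follows the text.

    # Sketch of the first pass: find the contexts in which TARGET occurs.
    # The hp4p API used here is hypothetical.
    import hp4p

    TARGET = 'bank'

    def map_task(key, tokens, emit):
        # tokens is assumed to be the tokenised text of one story.
        for i in range(1, len(tokens) - 1):
            if tokens[i] == TARGET:
                # A context is the word before and the word after.
                emit((tokens[i - 1], tokens[i + 1]), 1)

    def combine_task(key, counts, emit):
        # Runs inside the map task to shrink intermediate data.
        emit(key, sum(counts))

    def reduce_task(key, counts, emit):
        # Total occurrences of TARGET in this context over the corpus.
        emit(key, sum(counts))

    def main():
        job = hp4p.Job(mapper=map_task, combiner=combine_task,
                       reducer=reduce_task)
        job.set_input('/data/gigaword-nyt')     # illustrative paths
        job.set_output('/data/contexts')
        job.run()
        return job.results()

    if __name__ == '__main__':
        main()

A second job of the same shape would then read the most frequent contexts found here and emit the words occurring in them.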

4 Results

The times taken to run the two programs discussed in Section 3.3 are compared. Both run on the same hardware, except that dist_sim.py runs on a 20-node cluster; two of these nodes manage the cluster, while each of the other 18 nodes stores data and runs tasks. Several runs were made of each program and the wall-time recorded for each run. The average times can be seen in Table 1; these times exclude converting the data into a suitable format for use on the cluster.

    Program           Nodes   Average runtime (s)     Std. deviation
    dist_sim_seq.py   1       Killed after 24 hours   N/A
    dist_sim.py       20

Table 1: Average runtime of the distributional similarity programs.

Interestingly, each run of the distributed program recorded a slower time than the previous run, as can be seen in Table 2. This phenomenon is worth investigating; however, it is left for further work.

    Run   Runtime (s)

Table 2: Runtimes of the dist_sim.py program.

Six instances of the sequential program were started on separate cluster nodes so that complete results would be available sooner. All of these failed to complete within 24 hours and were killed.

It is obvious that the cluster version of the program is faster. However, the sequential version should not have been as slow as it was: 18 cluster nodes were used to run the distributed program's tasks, so the sequential program should have been at most about 18 times slower. The entire sequential program was written in Python, whereas the clustered version included Java and C++ components. This suggests that implementing parts of the program in Java and C++ yielded additional performance.

5 Evaluation

Encoding the data was slower than it should be. The data encoding work was created to avoid having an individual node scan through large amounts of data in order to reach the section it is interested in. Given that the encoding process is slow, it may be faster for each process to naively scan through all the data, although this does not allow arbitrary objects to be encoded using Python's Pickle library. Alternatively, the cluster itself can be used to encode the data into the correct format, making the process much faster. This was done to encode the Gigaword data for use in the experiment; the distributed program that encoded the Gigaword data was written solely in Java.

When running MapReduce jobs, a minimal speed increase may be available by moving part of the string encoding work from Python into Java. This cannot be done for all of the code, since Python code will still need to pickle objects.

Unfortunately, the sequential version of the program did not finish, and the clustered version became slower each time it was executed. This reduces the confidence that can be placed in these results, and these problems should be investigated. Different programs should also be tested with different numbers of cluster nodes before the benefits of this work are clear.

Hadoop Pipes 4 Python has some rough edges. Further work may include making it easier to set up and use, as well as implementing new features. For example, it should allow a user to place the hp4p.jar file in the Hadoop home directory, and it should be simpler to use the output of one job as the input to another; these are rather minor changes. It may also be possible to extend HP4P to allow MapReduce jobs to access the distributed file system directly. This would be a large change involving all levels of HP4P.

This report has shown that it is possible to use the Hadoop clustering software with NLTK, and more generally with any Python program. Measurements have shown that this greatly improves the performance of the example program, although more testing is required before this conclusion can be extended to other NLP and Python programs.

References

[1] The Apache Software Foundation, 1901 Munsey Drive, Forest Hill, MD, U.S.A. Hadoop 0.17 Documentation, May 2008. r0.17.0/.

[2] David M. Beazley. SWIG 1.1 Users Manual. Department of Computer Science, University of Utah, Salt Lake City, Utah 84112, 1.1 edition, June.

[3] Steven Bird, Ewan Klein, and Edward Loper. NLTK Documentation. Online; accessed April 2008.

[4] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing in Python. Draft book, available online.

[5] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing in Python, chapter four. Draft available online. doc/en/tag.html.

[6] The Apache Software Foundation. Hadoop Pipes. Online manual, April 2008. apache/hadoop/mapred/pipes/package-summary.html.

[7] The Apache Software Foundation. Hadoop Streaming. Online manual, April 2008. streaming.html.

[8] Google. MapReduce: Simplified Data Processing on Large Clusters, December 2004.

[9] David Graff. English Gigaword. One DVD of structured text, January 2003. ISBN:

[10] Jython Project. Jython FAQ. Online; accessed April 2008. jython.org/project/userfaq.html.

Appendices

A Dependencies

The following basic dependencies are required.

C/C++ Compiler. No particular compiler is required.
Python 2.4.
Java 1.5. Tested with Sun Java.
Apache Ant.
Swig. See [2].
Apache Hadoop 0.17. See [1].

NLTK is recommended for writing NLP programs; see [4].
