A. Aiken & K. Olukotun PA3


Programming Assignment #3: Hadoop N-Gram
Due Tue, Feb 18, 11:59PM

In this programming assignment you will use Hadoop's implementation of MapReduce to search Wikipedia. This is not a course in search, so the algorithm focuses on simplicity rather than search quality or performance.

Disclaimers

We STRONGLY recommend that you start this assignment early. Because the datasets used in this assignment are larger than those in any other assignment, queue times may be unexpectedly long, especially immediately before the deadline.

When looking at any Hadoop examples or documentation, make sure the materials match the version of Hadoop we are using for this assignment.

N-Grams

The n-grams of a sequence of symbols are all of the contiguous subsequences of length n. If each symbol is a word, for example, then the 3-grams of "now is the time for all good men" are "now is the", "is the time", "the time for", "time for all", "for all good", and "all good men". N-grams have many uses in natural language processing.

In this assignment you will use n-grams to compute a (toy) similarity metric between a query document and some sizeable subsets (up to 8GB) of the English Wikipedia. The score for a document is the number of n-grams from its text that also occur in the query document. (Note that this metric is not symmetric: the score for document A against query document B is not the same as the score for B when A is the query document.) Your program will find the Wikipedia page with the highest score.

Specification Details

- Words are any consecutive sequence of alphanumeric characters.
- The punctuation marks '.', '!' and '?', and consecutive sequences of those marks, are treated as a special word "#". (This gives the algorithm access to sentence boundaries.)
- All non-alphanumeric characters that are not one of the special punctuation marks are treated as whitespace.
- Pages are concatenated together in the input document. To detect a new page, look for (and parse) a line containing "<title>" + new-title + "</title>".
- The line with the title is not included in the n-grams for a page.
- No n-gram should include symbols from more than one page.
- If two pages have the same score, prefer the title that is lexicographically largest.
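The scoring rule and the tie-break above can be sketched in plain Java, without any Hadoop classes. This is a minimal illustration, not the expected solution: the class and method names are hypothetical, the input is assumed to be already tokenized, repeated n-grams on a page are assumed to count once per occurrence, and "lexicographically largest" is assumed to mean the ordering of String.compareTo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; all names here are hypothetical. */
class NGramScoreSketch {

    /** Returns every contiguous run of n tokens, joined by single spaces. */
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    /**
     * Counts the page's n-grams that also occur in the query.
     * Assumption: a repeated n-gram on the page counts once per occurrence.
     */
    static int score(List<String> pageTokens, Set<String> queryNgrams, int n) {
        int score = 0;
        for (String gram : ngrams(pageTokens, n)) {
            if (queryNgrams.contains(gram)) {
                score++;
            }
        }
        return score;
    }

    /** True if (score, title) should replace the current best; ties go to the larger title. */
    static boolean beats(int score, String title, int bestScore, String bestTitle) {
        if (score != bestScore) {
            return score > bestScore;
        }
        return title.compareTo(bestTitle) > 0;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("now is the time for all good men".split(" "));
        Set<String> queryNgrams = new HashSet<>(ngrams(query, 3));
        List<String> page = Arrays.asList("it is the time for all of us".split(" "));
        System.out.println(ngrams(query, 3));
        System.out.println(score(page, queryNgrams, 3));
    }
}
```

Building the query's n-grams into a hash set once means each page n-gram costs a single expected constant-time lookup.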

- You are not required to compute an answer if all pages have a score of 0.
- Don't worry about getting the first or last pages in a file correct.

What We Give You

Contents of /usr/class/cs149/assignments/pa3:

- Tokenizer.java: A Java class that performs the tokenization procedure detailed above. Note that once you pass a line to Tokenizer it will remove all punctuation, so you will have to check for a title line (as defined above) before passing the line to Tokenizer.
- query1.txt and query2.txt: Example query documents.
- pa3-8gb-q1.sh and pa3-8gb-q2.sh: Example Torque scripts for running query1.txt and query2.txt over the 8gb subset of Wikipedia. We will run these scripts to evaluate your program.

Amazon EC2 only:

- /wikipedia: Contains 8gb, 4gb, 2gb, and 1gb subsets of the text of English Wikipedia, broken up into 64MB chunks. Note that these files are stored in HDFS, not the local filesystem.

Your Tasks

Implement a map-reduce program Ngram.java that finds the title of the page with the highest matching metric. The output should consist of a single line containing the score, followed by a tab, followed by the title of the page. Your program should take as input parameters n, the name of the query file, a directory that contains the input files, and the name of a directory to create to store the output.

You may find it helpful to start from the Hadoop MapReduce tutorial, which walks through a simple word-count example. After completing the tutorial, you can adapt the tutorial code to complete the assignment.

Although it is not covered in the basic tutorial, you will need to provide a custom implementation of the InputFormat interface. The default implementation, TextInputFormat, splits files by line. For n-grams this is unacceptable, because all words on a page must be processed together so that they can be properly attributed to the page (and pages generally span multiple lines). The tutorial section Job Input contains links to the relevant APIs.

Note that Hadoop provides multiple APIs:

- org.apache.hadoop.mapred: Older API; formerly deprecated (then un-deprecated).
- org.apache.hadoop.mapreduce: Newer API.

So, for example, InputFormat can be found in both packages. Ironically, however, the older mapred API is guaranteed to work in Hadoop 2.x, while backwards compatibility with the newer mapreduce API was broken. For the purposes of this course, we do not care which API you use, as long as your code compiles and runs on the version of Hadoop used for this assignment.
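The page-grouping logic that a custom input format has to perform, as described under Your Tasks, can be sketched independently of either Hadoop API. The following is a rough illustration in plain Java, not an InputFormat implementation: the class and method names are hypothetical, the regular expression is one assumed reading of the title-line rule, and titles are assumed to be unique within the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is not a Hadoop InputFormat. */
class PageSplitSketch {

    // Assumption: a title line contains <title>some title</title>.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /**
     * Groups lines into pages keyed by title, in input order.
     * Title lines are not stored as page text, and lines that appear
     * before the first title are dropped.
     * Assumption: titles are unique; a repeated title would replace the earlier page.
     */
    static Map<String, List<String>> splitPages(List<String> lines) {
        Map<String, List<String>> pages = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            Matcher m = TITLE.matcher(line);
            if (m.find()) {
                // A title line starts a new page.
                current = new ArrayList<>();
                pages.put(m.group(1), current);
            } else if (current != null) {
                current.add(line);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "tail of a page cut off by the chunk boundary",
            "<title>Alpha</title>",
            "first line of Alpha.",
            "second line of Alpha!",
            "<title>Beta</title>",
            "only line of Beta?");
        System.out.println(splitPages(lines));
    }
}
```

Because the title check runs on the raw line, it happens before any tokenization strips the angle brackets, which matches the note about Tokenizer.java above.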

Hadoop, HDFS and MapReduce

Hadoop consists of two components: HDFS, a distributed file system, and the MapReduce framework, which controls job execution over the cluster. A typical Hadoop cluster consists of the following:

- One NameNode.
- One or more DataNodes.
- One JobTracker.
- One or more TaskTrackers.

The NameNode in a cluster keeps track of where each block of data is stored in the filesystem. Blocks are typically stored three times, on different DataNodes, to avoid data loss in the event of hardware failure. Similarly, the JobTracker in a cluster keeps track of all MapReduce jobs running in the system, and TaskTrackers are responsible for actually running the individual pieces of those jobs.

In a typical setup, every DataNode is also a TaskTracker, to improve locality of data access and minimize the use of the cluster's network. The NameNode and JobTracker are typically kept separate, as both become bottlenecks in larger clusters. However, for our purposes a single combined NameNode/JobTracker is sufficient, as our cluster sizes are relatively small.

Amazon EC2

For this assignment, we recommend that you start using Amazon EC2 right away. We will send an email with login credentials to your address. The Torque head node controls access to a number of Hadoop clusters. Use the qsub command to submit jobs to run on one of the clusters:

    qsub -d "$PWD" pa3-8gb-q1.sh

or

    qsub -I

Compiling

Hadoop requires that programs be provided as jar files. (A jar file is a zip file containing class files and metadata.) To create a jar file, first compile your Java code to class files as normal. Then use the jar command (which resembles tar) to combine the class files into a jar file. (Note that the first command should be a single line; the backslashes are continuations.)

    find . -name '*.java' -print0 | \
        xargs -0 javac -cp ${HADOOP_HOME}/hadoop-core-<version>.jar -d class_dir
    jar -cvf ngram.jar -C class_dir/ .

Running

Note that although you can compile on the head node on Amazon EC2, you cannot call any of the following commands without first using qsub.

To run a Hadoop job, use the following command:

    hadoop jar ngram.jar Ngram 4 query1.txt /wikipedia/8gb output

This will run your Ngram code with n-grams of size 4, using query1.txt as the query document, /wikipedia/8gb as the input directory, and output as the output directory. Note that the arguments to the command (query1.txt, /wikipedia/8gb, and output) must all be located inside HDFS. The ngram.jar file, on the other hand, is in the local filesystem.

You can use the following commands to create and (recursively) delete directories in HDFS:

    hadoop fs -mkdir <dirname>
    hadoop fs -rmr <dirname>

You can also copy files to and from HDFS:

    hadoop fs -put <local_file> <hdfs_file>
    hadoop fs -get <hdfs_file> <local_file>

For more information on the hadoop command, see the documentation online.

Hints

- Do NOT attempt to copy the Wikipedia dataset out of HDFS and into your home directory on Amazon EC2. The machines do not have enough space available for you to do this. We will provide a separate, high-speed download link for the dataset.
- The Hadoop codebase reuses object instances by mutating them, in an attempt to reduce the amount of work required by the GC. As a result, your reduce function must copy the object returned from the values Iterable if it wishes to keep it after the next call to next().
- Running jobs on a distributed file system can often result in failed reads or writes of files. Hadoop will report these errors as exceptions. In almost all cases Hadoop will recover from errors by itself. If you receive these messages, scrutinize them carefully and make sure you understand them, as well as whether Hadoop handled them correctly, before contacting the course staff. If your output directory contains a _SUCCESS file, then Hadoop successfully recovered.
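The object-reuse hint is easy to trip over, so here is a small simulation of the effect in plain Java. It does not use Hadoop: MutableText is a hypothetical stand-in for a mutable value type, and reusing() is a hypothetical Iterable that imitates the reuse behavior by returning the same mutated instance from every call to next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative simulation only; no Hadoop classes are involved. */
class ReuseSketch {

    /** Hypothetical mutable value, standing in for a reused Hadoop value object. */
    static class MutableText {
        String value;
        MutableText(String value) { this.value = value; }
        MutableText copy() { return new MutableText(value); }
    }

    /** Returns an Iterable whose iterator hands back the SAME instance, mutated in place. */
    static Iterable<MutableText> reusing(List<String> data) {
        MutableText shared = new MutableText(null);
        return () -> new Iterator<MutableText>() {
            private int i = 0;
            public boolean hasNext() { return i < data.size(); }
            public MutableText next() {
                shared.value = data.get(i++);
                return shared;
            }
        };
    }

    /** Buggy: stores references, so every element aliases the one shared object. */
    static List<String> keepWithoutCopy(Iterable<MutableText> values) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText v : values) {
            kept.add(v);
        }
        return toStrings(kept);
    }

    /** Correct: copies each value before the iterator overwrites it. */
    static List<String> keepWithCopy(Iterable<MutableText> values) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText v : values) {
            kept.add(v.copy());
        }
        return toStrings(kept);
    }

    private static List<String> toStrings(List<MutableText> kept) {
        List<String> out = new ArrayList<>();
        for (MutableText t : kept) {
            out.add(t.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println(keepWithoutCopy(reusing(data))); // prints [c, c, c]
        System.out.println(keepWithCopy(reusing(data)));    // prints [a, b, c]
    }
}
```

In a real reducer the copy would be made with whatever the value type offers; for Hadoop's Text, for instance, the copy constructor new Text(value) plays the role of copy() here.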

Correctness (80%)

There are several opportunities for performance results to be affected by contention, both on a per-node basis and on the cluster interconnect, so we are not going to include a specific performance target in the assignment. The bulk of the grade will be based on the correctness of your implementation.

The scripts pa3-8gb-q1.sh and pa3-8gb-q2.sh should run your code against the /wikipedia/8gb dataset with n=4, using query1.txt and query2.txt, respectively. The initial versions of both scripts should work for many students; however, please make any changes necessary to ensure your scripts work flawlessly with your code. We will evaluate your code for correctness by running both scripts through qsub and comparing the resulting .o* file against our solution. If there is not an exact match, we will grade for partial credit depending on how close your answer is to the ideal solution.

Scalability (20%)

We're not worried about the exact constants (map-reduce is a truck, not a sports car), but we care about the expected runtime of your algorithm in the limit. Let q be the size of the query document, n the size of the input pages within which to search, and P the number of processors. Your algorithm should complete in time $O(q + n/P + \log P)$ or better, and it should require only a single map-reduce pass. Include in your README.txt a brief (half a page is more than enough), informal argument that your algorithms and implementation meet these requirements.

Extra Credit (15%)

Modify your code to compute the 20 best matches and their scores, rather than just the single best. The output will now contain multiple lines, each with the score of a page, a tab, and the title of the page. Your code should still make only a single map-reduce pass over the data.

Extra Extra Credit (15%)

In addition to the above, you may elect to implement the assignment in the Apache Spark framework (formerly a research project at Berkeley). Note that because we are not running Hadoop 2.x, you will not be able to run Spark on our Amazon EC2 machines. Please contact the course staff for more details if you are interested.

Submission Instructions

You should submit the complete source code for your working solution, as well as a brief text file named README.txt (maximum 1 page) with your name, your SUNet ID, an explanation of how your code works and why it is correct, and the scalability argument described above.
To submit the contents of the current directory and all subdirectories, log into a Leland machine such as corn.stanford.edu and run

    /usr/class/cs149/bin/submit pa3

This will copy a snapshot of the current directory into the submission area along with a timestamp. We will take your last submission before the midnight deadline. Please send email to the staff list if you encounter a problem. You may run the submission script on a Leland machine.
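Returning to the Extra Credit task of reporting the 20 best matches: one common way to keep only the k best items while streaming over many candidates is a bounded min-heap. The sketch below is plain Java under stated assumptions, not a prescribed design: the class and method names are hypothetical, and it assumes the single-best tie-break (lexicographically largest title wins) carries over to ranking pages with equal scores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only; all names here are hypothetical. */
class TopKSketch {

    static class Scored {
        final int score;
        final String title;
        Scored(int score, String title) {
            this.score = score;
            this.title = title;
        }
        @Override
        public String toString() { return score + "\t" + title; }
    }

    // Orders the weakest candidate first: lower score, then smaller title.
    // Assumption: ties rank the lexicographically larger title higher.
    static final Comparator<Scored> WORST_FIRST =
        Comparator.comparingInt((Scored s) -> s.score).thenComparing(s -> s.title);

    /** Returns the k best candidates, best first, holding at most k+1 items at a time. */
    static List<Scored> topK(Iterable<Scored> pages, int k) {
        PriorityQueue<Scored> heap = new PriorityQueue<>(WORST_FIRST);
        for (Scored p : pages) {
            heap.add(p);
            if (heap.size() > k) {
                heap.poll(); // evict the current weakest candidate
            }
        }
        List<Scored> best = new ArrayList<>(heap);
        best.sort(WORST_FIRST.reversed());
        return best;
    }

    public static void main(String[] args) {
        List<Scored> pages = Arrays.asList(
            new Scored(3, "A"), new Scored(5, "B"),
            new Scored(5, "C"), new Scored(1, "D"));
        for (Scored s : topK(pages, 2)) {
            System.out.println(s);
        }
    }
}
```

Each candidate costs O(log k) heap work, so for a fixed k such as 20 the selection adds only a constant factor per page.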