Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Acquisition System

Transcription

1 Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Acquisition System ABSTRACT Donald Metzler Information Sciences Institute University of Southern California Marina del Rey, CA metzler@isi.edu Paraphrase acquisition is an important natural language processing (NLP) task that has received a great deal of interest recently. Proposed solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that heavily rely on numerous language-dependent resources. Despite all of the work, there are no publicly available toolkits to support large-scale paraphrase mining research. There has also never been a direct empirical evaluation comparing the merits of simple, scalable approaches and those that make extensive use of expensive NLP resources. This paper introduces Mavuno, a Hadoop-based paraphrase acquisition toolkit that is both scalable and robust. Within the context of Mavuno, we empirically examine the tradeoffs between simple and sophisticated paraphrase acquisition approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that simple approaches have many advantages, including strong effectiveness, good coverage, low redundancy, and ability to handle noisy data. Categories and Subject Descriptors I.2.7 [Natural Language Processing]: Text Analysis General Terms Algorithms, Experimentation Keywords paraphrase acquisition, Hadoop, large-scale text mining 1. INTRODUCTION A popular idiom states that variety is the spice of life. Variety also adds spice and appeal to language. Paraphrases make it possible to express the same meaning in an almost This is an extended version of a short paper published at ACL-HLT 2011 [17]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LDMTA 11 August 24, 2011, San Diego, CA USA Copyright 2011 ACM /11/11...$ Eduard Hovy Information Sciences Institute University of Southern California Marina del Rey, CA hovy@isi.edu unbounded number of ways. While variety prevents language from being overly rigid and boring, it also makes it difficult to algorithmically determine if two phrases or sentences express the same meaning. In an attempt to address this problem, a great deal of recent research has focused on identifying, generating, and acquiring phrase- and sentencelevel paraphrases [1, 16]. Many data-driven approaches to the paraphrase problem have been proposed. The approaches vastly differ in their complexity and the amount of NLP resources that they rely on. On one end of the spectrum are approaches that acquire paraphrases from a single (typically large) monolingual corpus and minimally rely on NLP tools. On the other end of the spectrum are more complex approaches that require access to bilingual parallel corpora and may also rely on part-of-speech taggers, chunkers, parsers, and statistical machine translation tools. Constructing large comparable and bilingual corpora is expensive and, in some cases, impossible. Additionally, reliance on language-dependent NLP tools typically hinders scalability, limits inputs to well-formed English, and increases data sparsity. Despite all of the previous research, there have not been any evaluations comparing the quality of simple and sophisticated data-driven paraphrase acquisition approaches. Evaluation is not only important from a practical perspective, but also from a methodological standpoint, as well, since it is often more fruitful to devote attention to building upon the current state-of-the-art as opposed to building upon less effective approaches. Although the more sophisticated approaches have garnered considerably more attention from researchers, from a practical perspective, simplicity, quality, and flexibility are the most important properties. But are simple methods adequate for the task? The primary goal of this paper is to show that simple, scalable paraphrase acquisition approaches are undervalued and have a number of desirable properties. To achieve this goal, we introduce our new (soon-to-be open source) paraphrase toolkit named Mavuno (Swahili for harvest ). We use Mavuno to implement a family of simple paraphrase acquisition approaches. To address the lack of comparative evaluations, we empirically evaluate six different paraphrase approaches, ranging from simple Mavuno-based approaches to more sophisticated approaches, such as the ones recently proposed by Bannard and Callison-Burch [2] and Callison- Burch [8]. Our evaluation helps develop a better understanding of the strengths and weaknesses of each approach. This paper has three primary contributions. First, we introduce Mavuno, a Hadoop-based paraphrasing toolkit that

2 we use to implement a simple paraphrase acquisition approach that does not use any language-dependent NLP tools and can successfully be applied to very large, noisy corpora. Second, we utilized Amazon s Mechanical Turk service to collect over 50,000 paraphrase annotations. The annotations are used to evaluate the quality of our system against several previously proposed approaches. The results show that our simple approaches are capable of achieving stateof-the-art effectiveness. Finally, we plan to make Mavuno available to the research community to help stimulate further large-scale NLP research. There are very few, if any, highly scalable text mining toolkits of this kind currently available. Therefore, our toolkit would provide a new, potentially useful resource to the broader research community. The remainder of this paper is laid out as follows. Section 2 briefly describes related work. Section 3 provides an overview of Mavuno and our proposed paraphrase acquisition approach. Section 4 details our experimental evaluation, respectively. Finally, Section 5 concludes the paper and describes a number of possible directions for future work. 2. RELATED WORK There are two primary lines of data-driven paraphrase research that are most related to our work. The first is the family of approaches that acquire paraphrases from monolingual corpora using some form of distributional similarity. The DIRT algorithm, proposed by Lin and Pantel [15], uses parse tree paths as contexts for computing distributional similarity. In this way, two phrases were considered similar if they occurred in similar contexts within many parse trees. Although parse trees serve as rich representations, they are costly to construct and yield sparse representations. The approach proposed by Pasca and Dienes [20] avoided the issues associated with parsing by using n-gram contexts. Given the simplicity of the approach, the authors were able to acquire paraphrases from a large collection of news articles. Bhagat and Ravichandran [7] proposed a similar approach as Pasca and Dienes. Instead of using n-gram contexts, the approach used noun phrase chunks. Furthermore, locality sensitive hashing was used to reduce the dimensionality of the context vectors. Despite how similar these approaches are, there have not been any attempts to compare their quality directly. Related research concerns computing distributional similarity at scale. Recent work by Pantel et al. [19] proposes an inverted indexing approach to efficiently compute the distributional similarity between two phrases. They show that the approach easily scales to a Web-scale corpus that contains over 5 billion sentences. When cosine similarity is used to compute distributional similarity, paraphrase acquisition can be reduced to an all-pairs similarity problem. There have been several approaches proposed for efficiently computing all-pairs similarity [6, 10]. Our system, Mavuno, was primarily designed to be a research toolkit for large data sets, hence we attempted to strike a delicate balance between efficiency, scalability, and flexibility. Mavuno is just as scalable, considerably more flexible, but less efficient than these specialized approaches. It is important to note that nearly all of the previously proposed paraphrase acquisition systems have come from within industry, and hence their code has never been made publicly available. One of the goals of Mavuno, as described in the introduction, is to help build a community resource that can help stimulate largescale research within the broader research community. The second line of research uses comparable or bilingual corpora as the pivot that binds paraphrases together [5, 4, 2, 8, 18]. Amongst the most effective recent work, Bannard and Callison-Burch [2] show how different English translations of the same entry in a statistically-derived translation table can be viewed as paraphrases. The primary drawback of these approaches is that they require a considerable amount of resource engineering. Furthermore, their quality has not been compared directly to the distributional paraphrase learning methods. 3. MAVUNO: A SCALABLE PARAPHRASE ACQUISITION TOOLKIT This section introduces Mavuno, a new Hadoop-based toolkit that was developed specifically for acquiring paraphrases from large monolingual corpora. The toolkit s scalability derives from MapReduce [9], which serves as Hadoop s distributed computation framework. We have successfully used the toolkit to acquire paraphrases from up to 4.5 terabytes of text, and expect it to easily scale well beyond that. Mavuno is meant to serve as a general-purpose research toolkit for implementing, testing, and evaluating new paraphrase acquisition approaches. Although the toolkit was designed for paraphrase acquisition, as we will show later, the framework employed makes it very easy to adapt the toolkit to many other types of large-scale text mining applications. 3.1 General Framework Before delving into the specific details of our framework, we first briefly provide a conceptual overview of how paraphrases are typically acquired from monolingual corpora using distributional similarity. First, each phrase is mapped onto a point in some representation space, such as a vector space. The representation usually encodes which contexts the phrase occurs in, and how related each context is to the phrase. Contexts and relatedness measures will be covered in more detail in Sections 3.1 and 3.3, respectively. The similarity between two phrases can then be measured by comparing the context-based representations. Pairs of phrases with high contextual similarity are then assumed to be likely paraphrases. Our proposed framework decomposes paraphrase acquisition into two primitive operations that implicitly encode contextual information. Thus, the framework can be viewed as computing distributional similarity in an indirect way. The two main operations are phrase context and context phrase. These operations are conceptually simple and align well with our MapReduce implementation. The first operation, phrase context, takes as input a weighted list of phrases and outputs a weighted list of contexts that the phrase occurs in within some document corpus. The underlying semantics of the operation assume that the phrase weights are indicative of the importance of each phrase and that the context weights represent how (semantically) related each context is to the set of input phrases. Given a set of input phrases P, we define the score s(c) of a context c to be: s(c) = p P s(c; p)s(p) where s(c; p) measures how related c is to phrase p, and s(p) is the weight of p. The intuition behind this formulation

3 Context type Left Context Phrase Right Context n-grams George Bush campaigned in a POS n-grams George:N Bush:N campaigned:v in:in a:dt Chunks Vice President George Bush campaigned in Typed Chunks Vice President George Bush:N campaigned:v in:p Parse Tree Paths George:N nn Bush N nsubj campaigned prep IN:in pobj N condominium Table 1: Context types currently supported by Mavuno. Example contexts of each type are shown for the sentence George Bush campaigned in a snow-bound condominium complex Friday. is that contexts that are highly related to many important input phrases should be assigned a large score. The phrase context operation can be implemented within a Hadoop MapReduce framework in a straightforward manner by defining a mapper and reducer algorithm as follows. The input to the mapper algorithm is a weighted set of phrases and a document corpus. The phrases are loaded into an efficient in-memory lookup table (i.e., a hash table). If there is not enough memory available to load the entire set, then the set is divided into smaller chunks and the following algorithm is applied to each separately. The documents in the corpus are the items that are mapped over to achieve scalability. The mapper takes a document as input and identifies all contexts within the document associated with the input phrases. Within Mavuno, extractors define the semantics of what constitutes a phrase and a context. As we will see, these notions are quite general and can be applied to other tasks beyond paraphrase acquisition. The mapper simply outputs a (context, phrase, weight) tuple for each occurrence of an input phrase in the document. In the output tuple, the context is the context in which the phrase occurs and the weight is the weight of the phrase (from the input). The following pseudo-code demonstrates the mapper code just described: Algorithm 1 phrase context mapper algorithm. e extractor d input document P input phrases W input phrase weights T e.extract(d) for (p, c) T do if P.contains(p) then emit(c, p, W.get(p)) end if where e is an extractor, and the extract function extracts all phrase-context pairs from a document (see Algorithm 3 for an example extractor). If one of the input phrases occurs in the document, then the context in which it occurs, the phrase, and the phrase weight are emitted (output) by the algorithm. This operation is highly scalable and can be massively distributed by running a large number of map tasks in parallel and partitioning the incoming documents across the tasks. The distribution and partitioning of the documents across map tasks is handled by Hadoop. All of the tuples output by the mappers that share the same key (i.e., context) are then gathered together by the Hadoop framework and input to a reducer. More specifically, for each key (context), its associated set of values (phrase-weight pairs) are input to a reducer algorithm that aggregates statistics across the values to distill a score for the context. In this way, the local information about each context mined by the mappers is combined into a global score for the context via the reducer. The following pseudo-code represents a conceptual implementation of the reducer algorithm just described: Algorithm 2 phrase context reducer algorithm. c input context T input tuples {(p, w)} score 0 for (p, w) T do score score + s(c; p) w emit(c, score) where the input is a context c, and its associated key values T. Each key (phrase-weight pair) is scored with respect to the context and gets added into the context s score in a weighted manner as described earlier. After all of the values have been processed, the context-score pair is output by the algorithm. The second operation, context phrase, is defined similarly to the phrase context operation. It takes as input a weighted list of contexts and produces a weighted list of phrases that occur within the input contexts. The score s(p) for each phrase p is computed according to: s(p) = c C s(p; c)s(c) where s(p; c) measures how related phrase p is to input context c, s(c) is the weight of c, and C is the set of input contexts. The mapper and reducer algorithms for the context phrase operation are defined analogously to those used for the phrase context operation. Now that the basic operations have been defined, we can now describe how they can be used for generating paraphrases. Given an input phrase p, paraphrases p can be acquired simply by chaining the two basic operations together (i.e., p contexts p ). In our framework, this is equivalent to setting s(p) to 1 for the input phrase and to 0 for all other phrases. The score of a paraphrase candidate p is scored using the two-step chain as follows: s(p ; p) = s(p ; c)s(c; p) (1) c C(p) where C(p) is the set of contexts that p occurs in. The operation chain used to generate paraphrases can be repeated multiple times by chaining together MapReduce jobs. By repeating the process, more contexts and phrases will be considered, possibly improving recall. In the limit, this process can be thought of as a random walk defined over

4 a bipartite graph consisting of phrases and contexts. Recent work by Kok et al. [12] showed that expanding the scope of the candidate paraphrases, via the use of random walks, can improve both precision and recall. Since their method is complementary to our approach, it would be interesting to synthesize them as part of future work. However, for simplicity, we focus our attention on the single step version of our algorithm throughout the remainder of the paper. To actually instantiate an instance of this framework, we must define an extractor, for defining the semantics of phrases and contexts, and a scoring function, for measuring the (semantic) similarity between a phrase and a context. Mavuno provides plug-and-play support for new extractors and scoring function, which makes it easy to implement and evaluate new techniques on large data sets without having to know anything about how the data and computation are being distributed. We cover several possible options for extractors and scoring functions in the remainder of this section. By using different combinations of methods our framework is capable of implementing a rich class of distributional similarity-based paraphrase acquisition approaches. 3.2 Contexts The quality of distributional similarity-based paraphrase acquisition systems depends on the types of contexts that are used to represent phrases. Researchers have explored a range of contexts, including n-grams of words surrounding the phrase [20], adjacent noun phrase chunks [7], and parse tree paths [15]. The n-gram contexts are the simplest, most computationally efficient, and result in the most dense representations. At the opposite end of the spectrum, parse tree path contexts are computationally expensive to compute and typically result in sparse representations. Table 1 summarizes all of the context types currently available in Mavuno (as extractors). It is straightforward to experiment with new contexts beyond the ones mentioned here by implementing new extractors. The types currently supported include n-grams, part-of-speech tagged n-grams, chunks, typed chunks, and parse tree paths. As the table shows, each context is typically decomposed into a left part and a right part, which is a technique often used to orient the contexts with respect to the phrase they are being used to represent. In this way, it is possible to represent phrases according to their left contexts, their right contexts, the union of the left and right contexts, or jointly (i.e., the concatenation of the left and right contexts). The best representation depends on the task and the amount of data sparsity. However, our extensive experiments have suggested that the representing contexts using both their left and right part together yields the best quality paraphrases. The types shown in Table 1 represent a natural progression from simple/dense to complex/sparse contextual representations. The goal of this work is not to identify the best context type, but rather to examine the largely ignored tradeoff between simple/dense contextual representations and complex/sparse ones. As a step towards better understanding, we compare n-gram (simple/dense) and typed chunk-based (complex/sparse) contexts in our experimental evaluation (Section 4). 3.3 Scoring In Mavuno, scoring functions are defined in terms of s(p; c)and s(c; p), the two relatedness measures used by the two primi- Algorithm 3 Pasca and Dienes extractor. c min minimum context length c max maximum context length p max maximum phrase length d input document T [ ] for pos = 1 to length(d) do for plen = 1 to p max do for clen = c min to c max do p getngram(d, pos, plen) c l getngram(d, pos clen, clen) c r getngram(d, pos + plen, clen) c (c l, c r) T.add((p, c)) return T tive operations. Instead of developing new scoring methods, we utilize two commonly used ones. The overlap scoring function asserts that a phrase is related to all of the contexts that it occurs in, regardless of the frequency. Relatedness is symmetric under this scheme and is defined as: { 1 if #(c, p) > 0 s(p; c) = s(c; p) = 0 otherwise where #(c, p) is the number of times that phrase p occurs in context c. Plugging these scores into Equation 1 gives the following paraphrase scoring function: s(p ; p) = { c : #(c, p) > 0 #(c, p ) > 0 } which is simply the total number of contexts that are shared between p and p. Although derived differently, this is the same scoring function used by Pasca and Dienes [20]. The other scoring function that we consider is cosine similarity, which measures the cosine distance between two vectors. Following common practice, the vectors are weighted using point-wise mutual information (PMI) [15]. Within our framework, this is equivalent to defining relatedness as: s(p; c) = s(c; p) = pmi(p, c) c C(p) pmi2 (p, c ) where pmi(p, c) is the PMI between p and c. Substituting these two relatedness measures into Equation 1 yields the cosine similarity between p and p (i.e., s(p ; p) = cos(p, p )). Although both of the measures described here use the same functional form for s(p; c) and s(c; p), other, possibly non-symmetric, forms can also be used, including probabilistic (e.g., s(p; c) = P (p c) and s(c; p) = P (c p)) and parameterized functions that can be learned from data [13, 21]. 4. EXPERIMENTS The goal of our experimental evaluation is to analyze the effectiveness of a variety of paraphrase acquisition approaches, ranging from the simple and scalable to the more sophisticated.

5 4.1 Paraphrasing Verb Phrase Chunks Acquiring paraphrases for verb phrase chunks is the first task that we evaluate. We chose this particular task because verb phrases tend to exhibit more variation than other types of phrases. Furthermore, our interest in paraphrase acquisition was initially inspired by challenges encountered during research related to machine reading [3]. Information extraction systems, which are key component of machine reading systems, can use paraphrases to automatically expand seed sets of relation triggers, which commonly contain verbs Systems Our evaluation compares the effectiveness of the following paraphrase acquisition approaches: BCB: The Bannard Callison-Burch [2] approach that acquires paraphrases from bilingual corpora using word alignment techniques. BCB-S: An extension of the BCB approach that constrains the paraphrases to have the same syntactic type as the original phrase [8]. We constrained all paraphrases to be verb phrases. BR: The Bhagat and Ravichandran [7] approach that uses noun phrase chunks as contexts and locality sensitive hashing to reduce the dimensionality of the contextual vectors. Scoring is achieved using the cosine similarity between PMI weighted contextual vectors. PD: The Pasca and Dienes [20] approach, as illustrated by Algorithm 3, that uses variable-length n-gram contexts and overlap-based scoring. The context of a phrase is defined as the concatenation of the n-grams immediately to the left and right of the phrase. We set the minimum length of an n-gram context to be 2 and the maximum length to be 3. The maximum length of a phrase is set to 5. Mav-N: Our proposed paraphrase acquisition approach using variable length n-gram contexts and cosine similarity for scoring. This is identical to the PD approach, except cosine similarity scoring is used in place of overlap scoring. Mav-C: Same as Mav-N, except typed chunk contexts are used instead. The context of a phrase is defined as the concatenation of the chunk (and its type) immediately before and after the phrase. This is similar to the Bhagat and Ravichandran approach, except we do not limit ourselves to noun phrases as contexts and do not use locality sensitive hashing. It is also similar to the BCB-S approach, in that it requires paraphrases to have the same syntactic type (i.e., same chunk type) as the input phrase. In our experiments, we use the publicly available implementations of BCB and BCB-S 1. For the BR method, we obtained a large database of paraphrases acquired using the approach from the authors of [7]. The PD, Mav-N, and Mav-C approaches are implemented in Mavuno, with paraphrases acquired from the English Gigaword Fourth Edition corpus 2. 1 Available at 2 Linguistic Data Consortium catalog number LDC2009T Data and Methods We randomly sampled 50 verb phrase chunks from 1000 news articles about terrorism and another 50 verb phrase chunks from 500 news articles about American football. Following the methodology used in previous paraphrase evaluations [2, 8, 12], we present annotators with two sentences. This first sentence is randomly selected from amongst all of the sentences in the evaluation corpus that contain the original phrase. The second sentence is the same as the first, except the original phrase is replaced with the system generated paraphrase. Annotators are given the following options, which were adopted from those described by Kok and Brockett [12], for each sentence pair: 0) Different meaning; 1) Same meaning; revised is grammatically incorrect; and 2) Same meaning; revised is grammatically correct. Table 2 shows three example sentence pairs and their corresponding annotations according to the guidelines just described. We used Amazon s Mechanical Turk service to obtain crowdsourced annotations. For each paraphrase system, we retrieve (up to) 10 paraphrases for each phrase in the evaluation set. This yields a total of 6,465 unique (phrase, paraphrase) pairs after pooling results from all systems. Each Mechanical Turk HIT consisted of 12 sentence pairs. To ensure high quality annotations and help identify spammers, 2 of the 12 sentence pairs per HIT were actually hidden tests for which the correct answer was known by us. We automatically rejected any HITs where the worker failed either of these hidden tests. We also rejected all work from annotators who failed at least 25% of their hidden tests. We collected a total of 51,680 annotations. We rejected 65% of the annotations based on the hidden test filtering just described, leaving 18,150 annotations for our evaluation. Each sentence pair received a minimum of 1, a median of 3, and maximum of 6 annotations. The raw agreement of the annotators (after filtering) was 77% and the Fleiss Kappa was 0.43, which signifies moderate agreement [11, 14]. Anonymized versions of the crowdsourced annotations will be made available to the research community. The systems were evaluated in terms of coverage and expected precision at k. Coverage is defined as the percentage of phrases for which the system returned at least one paraphrase. Expected precision at k is the expected number of correct paraphrases amongst the top k returned, and is computed as: E[p@k] = 1 k p i k where p i is the proportion of positive annotations for item i. It is important to keep in mind that expected precision is computed over the set of phrases each system returns one or more paraphrases for. If precision were to be averaged over all 100 phrases, then systems with poor coverage would perform significantly worse. We consider two binarized evaluation criteria. The lenient criterion allows for grammatical errors in the paraphrased sentence, while the strict criterion does not Basic Results Table 3 summarizes the results of our evaluation. Bolded values represent the best result for a given metric. As expected, the results show that the systems perform significantly worse under the strict evaluation criteria, which requires the paraphrased sentences to be grammatically cor- i=1

6 Sentence Pair A five-man presidential council for the independent state newly proclaimed in south Yemen was named overnight Saturday, it was officially announced in Aden. A five-man presidential council for the independent state newly proclaimed in south Yemen was named overnight Saturday, it was cancelled in Aden. Dozens of Palestinian youths held rally in the Abu Dis Arab village in East Jerusalem to protest against the killing of Sharif. Dozens of Palestinian youths held rally in the Abu Dis Arab village in East Jerusalem in protest of against the killing of Sharif. It says that foreign companies have no greater right to compensation establishing debts at a 1/1 ratio of the dollar to the peso than Argentine citizens do. It says that foreign companies have no greater right to compensation setting debts at a 1/1 ratio of the dollar to the peso than Argentine citizens do. Annotation Table 2: Example annotated sentence pairs. In each pair, the first sentence is the original and the second has a system-generated paraphrase filled in (denoted by the bold text). Method C Lenient Strict P1 P5 P10 P1 P5 P10 BCB BCB-S BR PD Mav-N Mav-C Table 3: Coverage (C) and expected precision at k (Pk) under lenient and strict evaluation criteria. Method Lenient Strict P1 P5 P10 P1 P5 P10 BCB-S BR Mav-N Table 4: Expected precision at k (Pk) when considering redundancy under lenient and strict evaluation criteria. rect. None of the approaches tested used any information from the evaluation sentences (other than the fact a verb phrase was to be filled in). Recent work showed that using language models and/or syntactic clues from the evaluation sentence can improve the grammaticality of the paraphrased sentences [8]. Such approaches could likely be used to improve the quality of all of the approaches under the strict evaluation criteria. Another interesting result is that the best systems under the lenient criteria (BR, BCB-S) are not the same as the best under the strict criteria (Mav-N, PD). The more sophisticated systems that made use of various NLP tools tended to perform better under the lenient criterion, while the approaches that did not use NLP tools and were based solely on n-gram contexts performed best with the strict criterion. In terms of coverage, the approaches that rely the least on language-dependent NLP tools have the best coverage, while those that made use of chunkers (BR, Mav-C) and dependency parsers (BCB-S) had the worst coverage Deeper Analysis After manually inspecting the results returned by the various paraphrase systems, we noticed that some approaches returned highly redundant paraphrases that were of limited practical use. For example, for the phrase were losing, the P@K K ClueWeb Gigaword TREC12 Figure 1: Comparison of precision at k for paraphrases acquired from three different corpora. BR system returned are losing, have been losing, have lost, lose, might lose, had lost, stand to lose, who have lost and would lose within the top 10 paraphrases. All of these are simple variants that contain different forms of the verb lose. Under the lenient evaluation criterion almost all of these paraphrases would be marked as correct, since the same verb is being returned with some grammatical modifications. Highly redundant output of this form is clearly less useful than output that provides new variants of the input verb. Therefore, we carried out another evaluation aimed at penalizing highly redundant outputs. For three of the approaches (BCB-S, BR, and Mav-N), we manually identified all of the paraphrases that contained the same verb as the main verb in the original phrase. During evaluation, these redundant paraphrases were regarded as non-related. The results from this experiment are provided in Table 4. The results are dramatically different compared to those in Table 3, suggesting that evaluations that do not consider this type of redundancy may over-estimate actual system quality. The percentage of results marked as redundant for the BCB-S, BR, and Mav-N approaches were 22.6%, 52.5%, and 22.9%, respectively. Thus, the BR system, which appeared to have excellent (lenient) precision in our initial evaluation, returns a very large number of redundant paraphrases. This remarkably reduces the lenient precision at 1 from 0.83 in our initial evaluation to just 0.05 in our redundancy-based evaluation. The BCB-S and Mav-N approaches return a comparable number of redundant results. As with our previous evaluation, the BCB-S approach tends to perform better un-

7 abt about r are ru en b4 before btw btwn chk check cld could deets details fab great fb facebook fav fave 4 3 fyi what a farce fwd forward gr8 great ic express itz syrenz vlog jk just kidding jsyk brusday mil corners on the fly k ck omw on my way 1 2 ppl people plz please shld shud tx thx da the 2 to yr year Table 5: Example Twitter paraphrases extracted using Mavuno. Bold entries indicate correct paraphrases. der the lenient evaluation, while Mav-N is better under the strict evaluation. Estimated 95% confidence intervals show all differences between BCB-S and Mav-N are statistically significant, except for lenient P10. Our final analysis looks at the effects of monolingual corpus choice on paraphrase quality. Table 1 plots expected precision at k versus k for the Mav-N approach with paraphrases acquired from three different corpora. The corpora included in the plot are TREC Disks 1 and 2 (800K news articles, 2.2GB of text), English Gigaword Fourth Edition (6.9M news articles, 21GB of text), and ClueWeb09 category B (50 million Web pages, 1.5TB of text). The results show that the paraphrases extracted from Gigaword are consistently better than those extracted from the other two corpora. Hence, both size and quality are important factors when acquiring paraphrases from a monolingual corpus. As part of future work we are interested in exploring whether spam and document quality information can help improve Web-based paraphrase acquisition. Of the baselines that we compared against, BCB-S was by far the most consistent and competitive across the different types of evaluations. However, its low coverage and reliance on bilingual corpora are drawbacks compared to our simple Mav-N approach that requires no resource engineering and can be applied to very large corpora. We also feel that it is important to note that the precision of these state-of-the-art paraphrase acquisition approaches is still very poor. After accounting for redundancy, the best approaches achieve a precision at 1 of less than 20% using the strict criteria and less than 26% when using the lenient criteria. This suggests that there is still substantial work left to be done before the output of these systems can reliably used in support of other tasks. 4.2 Paraphrasing Tweetspeak We also carried out an informal study on identifying paraphrases/synonyms for so-called Tweetspeak, which is abbreviated language used on Twitter. This is a challenging domain since each Twitter update is very short (maximum 140 characters). This results in widespread misuse of grammar and spelling. As a result, most paraphrase systems that depend on NLP tools would very likely fail due to noise. This is also a domain where it would be challenging, if not impossible, to construct comparable or bilingual corpora. Therefore, it is an excellent domain to show the true robustness of simple approaches, such as Mav-N approach that performed well in our previous evaluation. For this experiment, we gathered a list of 30 common examples of Tweetspeak from various sources on the Web. We then applied the Mav-N approach to a collection of approximately 500 million English tweets (61GB of text) that were collected during the latter part of The results are shown in Table 5. For each Tweetspeak term we show the highest scoring phrase returned by our system. Bold entries indicate correct results. For these phrases, our system achieves an accuracy of 67%. This small, but illustrative, evaluation again demonstrates the flexibility of large-scale approaches that do not rely on sophisticated NLP machinery. 4.3 Other Tasks Many text mining tasks can be cast within the Mavunostyle mining paradigm. For example, the framework can be used to computer the similarity between Web pages ( phrases ) based on their shared links, incoming anchor texts, and/or clicks (the contexts ). Implementing this within the Mavuno framework would be as simple as implementing a new extractor that emits Web pages and their contexts. The framework can then handle everything else. From the information retrieval perspective, query expansion can be viewed as finding the terms that are distributionally similar to the query terms, where a context here os defined as a documents in which a term occurs. Although it is beyond the scope of this paper to prove this, but in fact most query expansion approaches can be re-cast in this manner. Finally, the framework is also useful for a variety of information extraction-related tasks, including learning relation trigger patters (as discussed earlier) and class instance learning. For example, within the medical domain, given a small seed set of example diseases (phrases), Mavuno was able to acquire (from a large Web collection) the patterns suffering from *, with *, cases of *, and diagnosed with * via the phrase context operation. These patterns can then be fed back into the context phrase operation to mine new instances of diseases from unstructured texts. Therefore, although this paper primarily focused on paraphrase acquisition, the proposed framework can be used to solve a wide variety of large-scale mining tasks. 5. CONCLUSIONS AND FUTURE WORK This paper examined the tradeoffs between simple, scalable paraphrasing approaches that do not make use of any NLP tools and more sophisticated approaches that use a variety of computationally expensive NLP tools. These aspects were investigated within the context of Mavuno, a new Hadoop-based open source paraphrase acquisition toolkit. Our empirical evaluation compared simple and sophisticated paraphrase approaches. The evaluation demonstrated that simple acquisition approaches implemented in Mavuno achieve state-of-the-art effectiveness, can scale to very large data sets, and can be successfully applied to very noisy data. In the future, we are interested in making our simple, scalable approaches domain-adaptable to make paraphrasing more targetable. We are also interested in exploring more sophisticated (but simple/fast) contexts beyond n-grams and investigating the benefits possible from combining our framework with a random walk model.

8 Acknowledgments The authors gratefully acknowledge the support of the DARPA Machine Reading Program under AFRL prime contract no. FA C Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government. We would also like to thank the anonymous reviewers for their valuable feedback. 6. REFERENCES [1] I. Androutsopoulos and P. Malakasiotis. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38: , [2] C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 05, pages , Morristown, NJ, USA, Association for [3] K. Barker, B. Agashe, S.-Y. Chaw, J. Fan, N. Friedland, M. Glass, J. Hobbs, E. Hovy, D. Israel, D. S. Kim, R. Mulkar-Mehta, S. Patwardhan, B. Porter, D. Tecuci, and P. Yeh. Learning by reading: a prototype system, performance baseline and lessons learned. In Proceedings of the 22nd national conference on Artificial intelligence - Volume 1, pages AAAI Press, [4] R. Barzilay and L. Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL 03, pages 16 23, Morristown, NJ, USA, Association for [5] R. Barzilay and K. R. McKeown. Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL 01, pages 50 57, Morristown, NJ, USA, Association for [6] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web, WWW 07, pages , New York, NY, USA, ACM. [7] R. Bhagat and D. Ravichandran. Large scale acquisition of paraphrases for learning surface patterns. In Proceedings of ACL-08: HLT, pages , Columbus, Ohio, June Association for [8] C. Callison-Burch. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 08, pages , Morristown, NJ, USA, Association for [9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages , San Francisco, California, [10] T. Elsayed, J. Lin, and D. W. Oard. Pairwise document similarity in large collections with mapreduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT 08, pages , Morristown, NJ, USA, Association for [11] J. L. Fleiss. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin, 76(5): , [12] S. Kok and C. Brockett. Hitting the right paraphrases in good time. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT 10, pages , Morristown, NJ, USA, Association for [13] Z. Kozareva and A. Montoyo. Paraphrase identification on the basis of supervised machine learning techniques. In 5th International Conference on NLP, pages , [14] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1): , March [15] D. Lin and P. Pantel. Discovery of inference rules for question-answering. Nat. Lang. Eng., 7: , December [16] N. Madnani and B. J. Dorr. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Comput. Linguist., 36: , [17] D. Metzler, E. Hovy, and C. Zhang. An empirical evaluation of data-driven paraphrase generation techniques. In Proceedings of ACL-11: HLT, [18] B. Pang, K. Knight, and D. Marcu. Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL 03, pages , Morristown, NJ, USA, Association for [19] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2, EMNLP 09, pages , Morristown, NJ, USA, Association for [20] M. Pasca and P. Dienes. Aligning needles in a haystack: Paraphrase acquisition across the web. In R. Dale, K.-F. Wong, J. Su, and O. Y. Kwong, editors, Natural Language Processing âăş IJCNLP 2005, volume 3651 of Lecture Notes in Computer Science, pages Springer Berlin / Heidelberg, [21] S. Zhao, C. Niu, M. Zhou, T. Liu, and S. Li. Combining multiple resources to improve SMT-based paraphrasing model. In Proceedings of ACL-08: HLT, pages , Columbus, Ohio, June Association for