Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Acquisition System


Donald Metzler
Information Sciences Institute
University of Southern California
Marina del Rey, CA
metzler@isi.edu

Eduard Hovy
Information Sciences Institute
University of Southern California
Marina del Rey, CA
hovy@isi.edu

ABSTRACT

Paraphrase acquisition is an important natural language processing (NLP) task that has received a great deal of interest recently. Proposed solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely heavily on numerous language-dependent resources. Despite all of this work, there are no publicly available toolkits to support large-scale paraphrase mining research. There has also never been a direct empirical evaluation comparing the merits of simple, scalable approaches and those that make extensive use of expensive NLP resources. This paper introduces Mavuno, a Hadoop-based paraphrase acquisition toolkit that is both scalable and robust. Within the context of Mavuno, we empirically examine the tradeoffs between simple and sophisticated paraphrase acquisition approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that simple approaches have many advantages, including strong effectiveness, good coverage, low redundancy, and the ability to handle noisy data.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text Analysis

General Terms
Algorithms, Experimentation

Keywords
paraphrase acquisition, Hadoop, large-scale text mining

1. INTRODUCTION

A popular idiom states that variety is the spice of life. Variety also adds spice and appeal to language. Paraphrases make it possible to express the same meaning in an almost unbounded number of ways. While variety prevents language from being overly rigid and boring, it also makes it difficult to algorithmically determine whether two phrases or sentences express the same meaning. In an attempt to address this problem, a great deal of recent research has focused on identifying, generating, and acquiring phrase- and sentence-level paraphrases [1, 16].

This is an extended version of a short paper published at ACL-HLT 2011 [17].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LDMTA '11, August 24, 2011, San Diego, CA, USA. Copyright 2011 ACM /11/11...$

Many data-driven approaches to the paraphrase problem have been proposed. The approaches differ vastly in their complexity and in the amount of NLP resources that they rely on. On one end of the spectrum are approaches that acquire paraphrases from a single (typically large) monolingual corpus and rely minimally on NLP tools. On the other end of the spectrum are more complex approaches that require access to bilingual parallel corpora and may also rely on part-of-speech taggers, chunkers, parsers, and statistical machine translation tools. Constructing large comparable and bilingual corpora is expensive and, in some cases, impossible. Additionally, reliance on language-dependent NLP tools typically hinders scalability, limits inputs to well-formed English, and increases data sparsity.
Despite all of the previous research, there have not been any evaluations comparing the quality of simple and sophisticated data-driven paraphrase acquisition approaches. Evaluation is important not only from a practical perspective, but also from a methodological standpoint, since it is often more fruitful to devote attention to building upon the current state-of-the-art than upon less effective approaches. Although the more sophisticated approaches have garnered considerably more attention from researchers, from a practical perspective, simplicity, quality, and flexibility are the most important properties. But are simple methods adequate for the task?

The primary goal of this paper is to show that simple, scalable paraphrase acquisition approaches are undervalued and have a number of desirable properties. To achieve this goal, we introduce our new (soon-to-be open source) paraphrase toolkit named Mavuno (Swahili for "harvest"). We use Mavuno to implement a family of simple paraphrase acquisition approaches. To address the lack of comparative evaluations, we empirically evaluate six different paraphrase approaches, ranging from simple Mavuno-based approaches to more sophisticated approaches, such as those recently proposed by Bannard and Callison-Burch [2] and Callison-Burch [8]. Our evaluation helps develop a better understanding of the strengths and weaknesses of each approach.

This paper has three primary contributions. First, we introduce Mavuno, a Hadoop-based paraphrasing toolkit that we use to implement a simple paraphrase acquisition approach that does not use any language-dependent NLP tools and can successfully be applied to very large, noisy corpora. Second, we utilized Amazon's Mechanical Turk service to collect over 50,000 paraphrase annotations. The annotations are used to evaluate the quality of our system against several previously proposed approaches. The results show that our simple approaches are capable of achieving state-of-the-art effectiveness. Finally, we plan to make Mavuno available to the research community to help stimulate further large-scale NLP research. There are very few, if any, highly scalable text mining toolkits of this kind currently available. Therefore, our toolkit would provide a new, potentially useful resource to the broader research community.

The remainder of this paper is laid out as follows. Section 2 briefly describes related work. Section 3 provides an overview of Mavuno and our proposed paraphrase acquisition approach. Section 4 details our experimental evaluation. Finally, Section 5 concludes the paper and describes a number of possible directions for future work.

2. RELATED WORK

There are two primary lines of data-driven paraphrase research that are most related to our work. The first is the family of approaches that acquire paraphrases from monolingual corpora using some form of distributional similarity. The DIRT algorithm, proposed by Lin and Pantel [15], uses parse tree paths as contexts for computing distributional similarity. In this way, two phrases are considered similar if they occur in similar contexts within many parse trees. Although parse trees serve as rich representations, they are costly to construct and yield sparse representations. The approach proposed by Pasca and Dienes [20] avoided the issues associated with parsing by using n-gram contexts. Given the simplicity of the approach, the authors were able to acquire paraphrases from a large collection of news articles. Bhagat and Ravichandran [7] proposed an approach similar to that of Pasca and Dienes; instead of using n-gram contexts, their approach used noun phrase chunks. Furthermore, locality sensitive hashing was used to reduce the dimensionality of the context vectors. Despite how similar these approaches are, there have not been any attempts to compare their quality directly.

Related research concerns computing distributional similarity at scale. Recent work by Pantel et al. [19] proposes an inverted indexing approach to efficiently compute the distributional similarity between two phrases. They show that the approach easily scales to a Web-scale corpus that contains over 5 billion sentences. When cosine similarity is used to compute distributional similarity, paraphrase acquisition can be reduced to an all-pairs similarity problem, and several approaches have been proposed for efficiently computing all-pairs similarity [6, 10]. Our system, Mavuno, was primarily designed to be a research toolkit for large data sets, hence we attempted to strike a delicate balance between efficiency, scalability, and flexibility. Mavuno is just as scalable and considerably more flexible, but less efficient, than these specialized approaches. It is important to note that nearly all of the previously proposed paraphrase acquisition systems have come from within industry, and hence their code has never been made publicly available.
One of the goals of Mavuno, as described in the introduction, is to provide a community resource that can help stimulate large-scale NLP research.

The second line of research uses comparable or bilingual corpora as the pivot that binds paraphrases together [5, 4, 2, 8, 18]. Amongst the most effective recent work, Bannard and Callison-Burch [2] show how different English translations of the same entry in a statistically-derived translation table can be viewed as paraphrases. The primary drawback of these approaches is that they require a considerable amount of resource engineering. Furthermore, their quality has not been compared directly to that of the distributional paraphrase learning methods.

3. MAVUNO: A SCALABLE PARAPHRASE ACQUISITION TOOLKIT

This section introduces Mavuno, a new Hadoop-based toolkit that was developed specifically for acquiring paraphrases from large monolingual corpora. The toolkit's scalability derives from MapReduce [9], which serves as Hadoop's distributed computation framework. We have successfully used the toolkit to acquire paraphrases from up to 4.5 terabytes of text, and expect it to easily scale well beyond that. Mavuno is meant to serve as a general-purpose research toolkit for implementing, testing, and evaluating new paraphrase acquisition approaches. Although the toolkit was designed for paraphrase acquisition, as we will show later, the framework it employs makes it easy to adapt the toolkit to many other types of large-scale text mining applications.

3.1 General Framework

Before delving into the specific details of our framework, we first briefly provide a conceptual overview of how paraphrases are typically acquired from monolingual corpora using distributional similarity. First, each phrase is mapped onto a point in some representation space, such as a vector space. The representation usually encodes which contexts the phrase occurs in, and how related each context is to the phrase. Contexts and relatedness measures will be covered in more detail in Sections 3.2 and 3.3, respectively. The similarity between two phrases can then be measured by comparing the context-based representations. Pairs of phrases with high contextual similarity are then assumed to be likely paraphrases.

Our proposed framework decomposes paraphrase acquisition into two primitive operations that implicitly encode contextual information. Thus, the framework can be viewed as computing distributional similarity in an indirect way. The two main operations are phrase → context and context → phrase. These operations are conceptually simple and align well with our MapReduce implementation.

The first operation, phrase → context, takes as input a weighted list of phrases and outputs a weighted list of contexts that the input phrases occur in within some document corpus. The underlying semantics of the operation assume that the phrase weights are indicative of the importance of each phrase and that the context weights represent how (semantically) related each context is to the set of input phrases. Given a set of input phrases P, we define the score s(c) of a context c to be:

    s(c) = \sum_{p \in P} s(c; p) \, s(p)

where s(c; p) measures how related c is to phrase p, and s(p) is the weight of p. The intuition behind this formulation is that contexts that are highly related to many important input phrases should be assigned a large score.
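To make these semantics concrete, the following toy in-memory sketch (our illustration, not Mavuno code; the single-token phrases, neighboring-word extractor, and count-based relatedness are all simplifying assumptions) computes s(c) for every context of a weighted phrase set:

    # A toy in-memory sketch of the phrase -> context operation (illustrative
    # names and simplifications; not Mavuno code).
    from collections import defaultdict

    def extract(tokens):
        # Toy extractor: each single token is a phrase; its context is the
        # pair of immediately neighboring tokens.
        for i in range(1, len(tokens) - 1):
            yield tokens[i], (tokens[i - 1], tokens[i + 1])

    def phrase_to_context(corpus, phrase_weights):
        # Computes s(c) = sum_{p in P} s(c; p) * s(p), taking the relatedness
        # s(c; p) to be 1 for every observed co-occurrence of p and c.
        scores = defaultdict(float)
        for doc in corpus:
            for p, c in extract(doc.split()):
                if p in phrase_weights:
                    scores[c] += phrase_weights[p]
        return dict(scores)

    corpus = ["george bush campaigned in iowa", "the senator campaigned in ohio"]
    print(phrase_to_context(corpus, {"campaigned": 1.0}))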

The phrase → context operation can be implemented within a Hadoop MapReduce framework in a straightforward manner by defining mapper and reducer algorithms as follows. The input to the mapper algorithm is a weighted set of phrases and a document corpus. The phrases are loaded into an efficient in-memory lookup table (i.e., a hash table). If there is not enough memory available to load the entire set, then the set is divided into smaller chunks and the following algorithm is applied to each chunk separately. The documents in the corpus are the items that are mapped over to achieve scalability. The mapper takes a document as input and identifies all contexts within the document associated with the input phrases. Within Mavuno, extractors define the semantics of what constitutes a phrase and a context. As we will see, these notions are quite general and can be applied to other tasks beyond paraphrase acquisition. The mapper simply outputs a (context, phrase, weight) tuple for each occurrence of an input phrase in the document. In the output tuple, the context is the context in which the phrase occurs and the weight is the weight of the phrase (from the input). The following pseudo-code demonstrates the mapper just described:

Algorithm 1: phrase → context mapper algorithm.

    e ← extractor
    d ← input document
    P ← input phrases
    W ← input phrase weights
    T ← e.extract(d)
    for (p, c) ∈ T do
        if P.contains(p) then
            emit(c, p, W.get(p))
        end if
    end for

where e is an extractor, and the extract function extracts all phrase-context pairs from a document (see Algorithm 3 for an example extractor). If one of the input phrases occurs in the document, then the context in which it occurs, the phrase, and the phrase weight are emitted (output) by the algorithm. This operation is highly scalable and can be massively distributed by running a large number of map tasks in parallel and partitioning the incoming documents across the tasks. The distribution and partitioning of the documents across map tasks is handled by Hadoop.

All of the tuples output by the mappers that share the same key (i.e., context) are then gathered together by the Hadoop framework and input to a reducer. More specifically, for each key (context), its associated set of values (phrase-weight pairs) is input to a reducer algorithm that aggregates statistics across the values to distill a score for the context. In this way, the local information about each context mined by the mappers is combined into a global score for the context via the reducer. The following pseudo-code represents a conceptual implementation of the reducer algorithm just described:

Algorithm 2: phrase → context reducer algorithm.

    c ← input context
    T ← input tuples {(p, w)}
    score ← 0
    for (p, w) ∈ T do
        score ← score + s(c; p) · w
    end for
    emit(c, score)

where the input is a context c and its associated values T. Each value (a phrase-weight pair) is scored with respect to the context and added into the context's score in a weighted manner, as described earlier. After all of the values have been processed, the context-score pair is output by the algorithm.
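For intuition about how the same logic runs distributed, here is a minimal Hadoop Streaming-style sketch in Python (our illustration, not Mavuno's implementation; the tab-separated record format, the in-line phrase table, and the single-token extractor are assumptions):

    #!/usr/bin/env python
    # A minimal Hadoop Streaming sketch of Algorithms 1 and 2 (illustrative only).
    # Usage: cat docs.txt | python job.py map | sort | python job.py reduce
    # Hadoop Streaming performs the sort/group step between the two phases.
    import sys
    from itertools import groupby

    PHRASES = {"campaigned": 1.0}  # toy phrase weights; read from a file in practice

    def mapper(lines):
        # Algorithm 1: emit (context, phrase, weight) for each input-phrase
        # occurrence, with neighboring tokens standing in for the context.
        for line in lines:
            tokens = line.split()
            for i in range(1, len(tokens) - 1):
                if tokens[i] in PHRASES:
                    context = tokens[i - 1] + "|" + tokens[i + 1]
                    print("%s\t%s\t%f" % (context, tokens[i], PHRASES[tokens[i]]))

    def reducer(lines):
        # Algorithm 2: sum the weighted scores of all tuples sharing a context,
        # taking the relatedness s(c; p) to be 1 per occurrence.
        rows = (line.rstrip("\n").split("\t") for line in lines)
        for context, group in groupby(rows, key=lambda row: row[0]):
            print("%s\t%f" % (context, sum(float(w) for _, _, w in group)))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)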
The second operation, context → phrase, is defined similarly to the phrase → context operation. It takes as input a weighted list of contexts and produces a weighted list of phrases that occur within the input contexts. The score s(p) for each phrase p is computed according to:

    s(p) = \sum_{c \in C} s(p; c) \, s(c)

where s(p; c) measures how related phrase p is to input context c, s(c) is the weight of c, and C is the set of input contexts. The mapper and reducer algorithms for the context → phrase operation are defined analogously to those used for the phrase → context operation.

Now that the basic operations have been defined, we can describe how they are used to generate paraphrases. Given an input phrase p, paraphrases p′ can be acquired simply by chaining the two basic operations together (i.e., p → contexts → p′). In our framework, this is equivalent to setting s(p) to 1 for the input phrase and to 0 for all other phrases. A paraphrase candidate p′ is then scored using the two-step chain as follows:

    s(p'; p) = \sum_{c \in C(p)} s(p'; c) \, s(c; p)    (1)

where C(p) is the set of contexts that p occurs in. The operation chain used to generate paraphrases can be repeated multiple times by chaining together MapReduce jobs. By repeating the process, more contexts and phrases will be considered, possibly improving recall.
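As a concrete, if tiny, illustration of Equation (1), the sketch below (ours; it uses raw co-occurrence counts for both relatedness measures) chains the two operations to score paraphrase candidates for a single input phrase:

    # Illustrative sketch of Equation (1): chain phrase -> context -> phrase,
    # using raw co-occurrence counts for both s(c; p) and s(p'; c).
    from collections import defaultdict

    def context_index(corpus):
        # Map each single-token phrase to counts of its (left, right) contexts.
        index = defaultdict(lambda: defaultdict(int))
        for doc in corpus:
            t = doc.split()
            for i in range(1, len(t) - 1):
                index[t[i]][(t[i - 1], t[i + 1])] += 1
        return index

    def paraphrases(corpus, phrase):
        index = context_index(corpus)
        scores = defaultdict(float)
        for c, n in index[phrase].items():            # phrase -> context step
            for p2, counts in index.items():          # context -> phrase step
                if c in counts:
                    scores[p2] += counts[c] * n       # s(p'; c) * s(c; p)
        scores.pop(phrase, None)  # a phrase trivially shares contexts with itself
        return sorted(scores.items(), key=lambda kv: -kv[1])

    corpus = ["troops raided the compound", "police raided the compound",
              "police stormed the compound"]
    print(paraphrases(corpus, "raided"))  # -> [('stormed', 1.0)]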

In the limit, this process can be thought of as a random walk defined over a bipartite graph consisting of phrases and contexts. Recent work by Kok and Brockett [12] showed that expanding the scope of the candidate paraphrases via the use of random walks can improve both precision and recall. Since their method is complementary to our approach, it would be interesting to synthesize the two as part of future work. For simplicity, however, we focus our attention on the single-step version of our algorithm throughout the remainder of the paper.

To instantiate this framework, we must define an extractor, which specifies the semantics of phrases and contexts, and a scoring function, which measures the (semantic) similarity between a phrase and a context. Mavuno provides plug-and-play support for new extractors and scoring functions, which makes it easy to implement and evaluate new techniques on large data sets without having to know anything about how the data and computation are being distributed. We cover several possible options for extractors and scoring functions in the remainder of this section. By using different combinations of methods, our framework is capable of implementing a rich class of distributional similarity-based paraphrase acquisition approaches.

3.2 Contexts

The quality of distributional similarity-based paraphrase acquisition systems depends on the types of contexts that are used to represent phrases. Researchers have explored a range of contexts, including n-grams of words surrounding the phrase [20], adjacent noun phrase chunks [7], and parse tree paths [15]. The n-gram contexts are the simplest and most computationally efficient, and they result in the densest representations. At the opposite end of the spectrum, parse tree path contexts are computationally expensive to compute and typically result in sparse representations.

Table 1 summarizes all of the context types currently available in Mavuno (as extractors); it is straightforward to experiment with new contexts beyond these by implementing new extractors. The types currently supported include n-grams, part-of-speech tagged n-grams, chunks, typed chunks, and parse tree paths. As the table shows, each context is typically decomposed into a left part and a right part, a technique often used to orient the contexts with respect to the phrase they represent. In this way, it is possible to represent phrases according to their left contexts, their right contexts, the union of the left and right contexts, or jointly (i.e., the concatenation of the left and right contexts). The best representation depends on the task and the amount of data sparsity. However, our extensive experiments have suggested that representing contexts using both their left and right parts together yields the best quality paraphrases.

Table 1: Context types currently supported by Mavuno. Example contexts of each type are shown for the sentence "George Bush campaigned in a snow-bound condominium complex Friday."

    Context type     | Left context                 | Phrase       | Right context
    n-grams          | George Bush                  | campaigned   | in a
    POS n-grams      | George:N Bush:N              | campaigned:V | in:IN a:DT
    Chunks           | Vice President George Bush   | campaigned   | in
    Typed chunks     | Vice President George Bush:N | campaigned:V | in:P
    Parse tree paths | George:N nn Bush:N nsubj     | campaigned   | prep IN:in pobj N:condominium

The types shown in Table 1 represent a natural progression from simple/dense to complex/sparse contextual representations. The goal of this work is not to identify the best context type, but rather to examine the largely ignored tradeoff between simple/dense contextual representations and complex/sparse ones. As a step towards better understanding, we compare n-gram (simple/dense) and typed chunk-based (complex/sparse) contexts in our experimental evaluation (Section 4).

3.3 Scoring

In Mavuno, scoring functions are defined in terms of s(p; c) and s(c; p), the two relatedness measures used by the two primitive operations. An example extractor, the variable-length n-gram extractor of Pasca and Dienes [20], is shown in Algorithm 3 and sketched in Python immediately below.

Algorithm 3: Pasca and Dienes extractor.

    c_min ← minimum context length
    c_max ← maximum context length
    p_max ← maximum phrase length
    d ← input document
    T ← [ ]
    for pos = 1 to length(d) do
        for plen = 1 to p_max do
            for clen = c_min to c_max do
                p ← getngram(d, pos, plen)
                c_l ← getngram(d, pos − clen, clen)
                c_r ← getngram(d, pos + plen, clen)
                c ← (c_l, c_r)
                T.add((p, c))
            end for
        end for
    end for
    return T
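The following runnable Python rendering of Algorithm 3 is our own (the function and variable names, and the document-boundary handling, are ours); it enumerates every (phrase, context) pair with phrases of up to p_max tokens and left/right context n-grams of c_min to c_max tokens:

    # A direct Python rendering of Algorithm 3 (the Pasca and Dienes extractor).
    def pd_extract(tokens, c_min=2, c_max=3, p_max=5):
        pairs = []
        n = len(tokens)
        for pos in range(n):
            for plen in range(1, p_max + 1):
                for clen in range(c_min, c_max + 1):
                    # skip positions where the phrase or a context would
                    # run off the edge of the document
                    if pos - clen < 0 or pos + plen + clen > n:
                        continue
                    p = tuple(tokens[pos:pos + plen])
                    c_l = tuple(tokens[pos - clen:pos])                 # left n-gram
                    c_r = tuple(tokens[pos + plen:pos + plen + clen])   # right n-gram
                    pairs.append((p, (c_l, c_r)))
        return pairs

    doc = "the troops raided the compound at dawn on friday".split()
    for phrase, context in pd_extract(doc)[:3]:
        print(phrase, context)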
Instead of developing new scoring methods, we utilize two commonly used ones. The overlap scoring function asserts that a phrase is related to all of the contexts that it occurs in, regardless of frequency. Relatedness is symmetric under this scheme and is defined as:

    s(p; c) = s(c; p) = \begin{cases} 1 & \text{if } \#(c, p) > 0 \\ 0 & \text{otherwise} \end{cases}

where #(c, p) is the number of times that phrase p occurs in context c. Plugging these scores into Equation 1 gives the following paraphrase scoring function:

    s(p'; p) = \left| \{ c : \#(c, p) > 0 \wedge \#(c, p') > 0 \} \right|

which is simply the total number of contexts that are shared between p and p′. Although derived differently, this is the same scoring function used by Pasca and Dienes [20].

The other scoring function that we consider is cosine similarity, which measures the cosine of the angle between two vectors. Following common practice, the vectors are weighted using point-wise mutual information (PMI) [15]. Within our framework, this is equivalent to defining relatedness as:

    s(p; c) = s(c; p) = \frac{\mathrm{pmi}(p, c)}{\sqrt{\sum_{c' \in C(p)} \mathrm{pmi}^2(p, c')}}

where pmi(p, c) is the PMI between p and c. Substituting these two relatedness measures into Equation 1 yields the cosine similarity between p and p′ (i.e., s(p'; p) = cos(p, p′)). Although both of the measures described here use the same functional form for s(p; c) and s(c; p), other, possibly non-symmetric, forms can also be used, including probabilistic forms (e.g., s(p; c) = P(p | c) and s(c; p) = P(c | p)) and parameterized functions that can be learned from data [13, 21].
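To ground these definitions, here is a small self-contained sketch (ours; the count table and the maximum-likelihood PMI estimate are illustrative assumptions, not Mavuno internals) that builds PMI-weighted, length-normalized context vectors and scores a pair of phrases by cosine similarity:

    # Sketch of PMI-weighted cosine scoring over (phrase, context) counts.
    import math
    from collections import Counter

    counts = Counter({("raided", "troops|the"): 4, ("raided", "police|the"): 6,
                      ("stormed", "police|the"): 5, ("stormed", "troops|the"): 2,
                      ("ate", "cat|food"): 7})

    total = sum(counts.values())
    p_marg = Counter(); c_marg = Counter()
    for (p, c), n in counts.items():
        p_marg[p] += n
        c_marg[c] += n

    def pmi(p, c):
        # Maximum-likelihood PMI; can be negative under this toy estimate.
        joint = counts[(p, c)] / total
        return math.log(joint / ((p_marg[p] / total) * (c_marg[c] / total))) if joint else 0.0

    def vector(p):
        # PMI-weighted context vector for phrase p, normalized to unit length.
        v = {c: pmi(p, c) for (q, c) in counts if q == p}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        return {c: x / norm for c, x in v.items()}

    def cosine(p1, p2):
        v1, v2 = vector(p1), vector(p2)
        return sum(v1[c] * v2.get(c, 0.0) for c in v1)  # s(p'; p) = cos(p, p')

    print(cosine("raided", "stormed"))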

4. EXPERIMENTS

The goal of our experimental evaluation is to analyze the effectiveness of a variety of paraphrase acquisition approaches, ranging from the simple and scalable to the more sophisticated.

4.1 Paraphrasing Verb Phrase Chunks

Acquiring paraphrases for verb phrase chunks is the first task that we evaluate. We chose this particular task because verb phrases tend to exhibit more variation than other types of phrases. Furthermore, our interest in paraphrase acquisition was initially inspired by challenges encountered during research related to machine reading [3]. Information extraction systems, which are a key component of machine reading systems, can use paraphrases to automatically expand seed sets of relation triggers, which commonly contain verbs.

4.1.1 Systems

Our evaluation compares the effectiveness of the following paraphrase acquisition approaches:

BCB: The Bannard and Callison-Burch [2] approach that acquires paraphrases from bilingual corpora using word alignment techniques.

BCB-S: An extension of the BCB approach that constrains the paraphrases to have the same syntactic type as the original phrase [8]. We constrained all paraphrases to be verb phrases.

BR: The Bhagat and Ravichandran [7] approach that uses noun phrase chunks as contexts and locality sensitive hashing to reduce the dimensionality of the contextual vectors. Scoring is achieved using the cosine similarity between PMI-weighted contextual vectors.

PD: The Pasca and Dienes [20] approach, as illustrated by Algorithm 3, that uses variable-length n-gram contexts and overlap-based scoring. The context of a phrase is defined as the concatenation of the n-grams immediately to the left and right of the phrase. We set the minimum length of an n-gram context to 2 and the maximum length to 3. The maximum length of a phrase is set to 5.

Mav-N: Our proposed paraphrase acquisition approach using variable-length n-gram contexts and cosine similarity for scoring. This is identical to the PD approach, except that cosine similarity scoring is used in place of overlap scoring.

Mav-C: Same as Mav-N, except that typed chunk contexts are used instead. The context of a phrase is defined as the concatenation of the chunk (and its type) immediately before and after the phrase. This is similar to the Bhagat and Ravichandran approach, except that we do not limit ourselves to noun phrases as contexts and do not use locality sensitive hashing. It is also similar to the BCB-S approach, in that it requires paraphrases to have the same syntactic type (i.e., same chunk type) as the input phrase.

In our experiments, we use the publicly available implementations of BCB and BCB-S.¹ For the BR method, we obtained a large database of paraphrases acquired using the approach from the authors of [7]. The PD, Mav-N, and Mav-C approaches are implemented in Mavuno, with paraphrases acquired from the English Gigaword Fourth Edition corpus.²

¹ Available at
² Linguistic Data Consortium catalog number LDC2009T13.

4.1.2 Data and Methods

We randomly sampled 50 verb phrase chunks from 1000 news articles about terrorism and another 50 verb phrase chunks from 500 news articles about American football. Following the methodology used in previous paraphrase evaluations [2, 8, 12], we present annotators with two sentences. The first sentence is randomly selected from amongst all of the sentences in the evaluation corpus that contain the original phrase. The second sentence is the same as the first, except that the original phrase is replaced with the system-generated paraphrase.
Annotators are given the following options, adopted from those described by Kok and Brockett [12], for each sentence pair: 0) different meaning; 1) same meaning, but the revised sentence is grammatically incorrect; and 2) same meaning, and the revised sentence is grammatically correct. Table 2 shows three example sentence pairs and their corresponding annotations according to the guidelines just described.

Table 2: Example annotated sentence pairs. In each pair, the first sentence is the original and the second has a system-generated paraphrase filled in (denoted by the bold text).

    Pair 1 (annotation: 0):
        A five-man presidential council for the independent state newly proclaimed in south Yemen was named overnight Saturday, it was officially announced in Aden.
        A five-man presidential council for the independent state newly proclaimed in south Yemen was named overnight Saturday, it was cancelled in Aden.
    Pair 2 (annotation: 1):
        Dozens of Palestinian youths held rally in the Abu Dis Arab village in East Jerusalem to protest against the killing of Sharif.
        Dozens of Palestinian youths held rally in the Abu Dis Arab village in East Jerusalem in protest of against the killing of Sharif.
    Pair 3 (annotation: 2):
        It says that foreign companies have no greater right to compensation establishing debts at a 1/1 ratio of the dollar to the peso than Argentine citizens do.
        It says that foreign companies have no greater right to compensation setting debts at a 1/1 ratio of the dollar to the peso than Argentine citizens do.

We used Amazon's Mechanical Turk service to obtain crowdsourced annotations. For each paraphrase system, we retrieve (up to) 10 paraphrases for each phrase in the evaluation set. This yields a total of 6,465 unique (phrase, paraphrase) pairs after pooling results from all systems. Each Mechanical Turk HIT consisted of 12 sentence pairs. To ensure high quality annotations and help identify spammers, 2 of the 12 sentence pairs per HIT were actually hidden tests for which the correct answer was known by us. We automatically rejected any HIT where the worker failed either of these hidden tests. We also rejected all work from annotators who failed at least 25% of their hidden tests. We collected a total of 51,680 annotations. We rejected 65% of the annotations based on the hidden test filtering just described, leaving 18,150 annotations for our evaluation. Each sentence pair received a minimum of 1, a median of 3, and a maximum of 6 annotations. The raw agreement of the annotators (after filtering) was 77% and the Fleiss Kappa was 0.43, which signifies moderate agreement [11, 14]. Anonymized versions of the crowdsourced annotations will be made available to the research community.

The systems were evaluated in terms of coverage and expected precision at k. Coverage is defined as the percentage of phrases for which the system returned at least one paraphrase. Expected precision at k is the expected number of correct paraphrases amongst the top k returned, and is computed as:

    E[P@k] = \frac{1}{k} \sum_{i=1}^{k} p_i

where p_i is the proportion of positive annotations for item i. It is important to keep in mind that expected precision is computed over the set of phrases for which each system returns one or more paraphrases. If precision were instead averaged over all 100 phrases, then systems with poor coverage would perform significantly worse. We consider two binarized evaluation criteria: the lenient criterion allows for grammatical errors in the paraphrased sentence, while the strict criterion does not.
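For clarity, the two metrics can be computed directly from the filtered judgments; the short sketch below (ours, with toy data) implements them exactly as defined:

    # Toy sketch of the evaluation metrics: coverage and expected P@k.
    def coverage(results):
        # Fraction of phrases with at least one returned paraphrase.
        return sum(1 for paras in results.values() if paras) / len(results)

    def expected_precision_at_k(annotations, k):
        # annotations: per-item proportions of positive judgments, in system
        # rank order; returns E[P@k] = (1/k) * sum_{i<=k} p_i.
        return sum(annotations[:k]) / k

    results = {"were losing": ["are losing", "collapsed"], "raided": []}
    print(coverage(results))                              # -> 0.5
    print(expected_precision_at_k([2/3, 1/3, 0.0], k=3))  # -> 0.333...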

4.1.3 Basic Results

Table 3 summarizes the results of our evaluation, where bolded values represent the best result for a given metric. As expected, the results show that the systems perform significantly worse under the strict evaluation criterion, which requires the paraphrased sentences to be grammatically correct. None of the approaches tested used any information from the evaluation sentences (other than the fact that a verb phrase was to be filled in). Recent work showed that using language models and/or syntactic clues from the evaluation sentence can improve the grammaticality of the paraphrased sentences [8]. Such approaches could likely be used to improve the quality of all of the approaches under the strict evaluation criterion.

Table 3: Coverage (C) and expected precision at k (P@1, P@5, P@10) under the lenient and strict evaluation criteria for the BCB, BCB-S, BR, PD, Mav-N, and Mav-C systems.

Table 4: Expected precision at k (P@1, P@5, P@10) under the lenient and strict evaluation criteria when considering redundancy, for the BCB-S, BR, and Mav-N systems.

Another interesting result is that the best systems under the lenient criterion (BR, BCB-S) are not the same as the best under the strict criterion (Mav-N, PD). The more sophisticated systems that made use of various NLP tools tended to perform better under the lenient criterion, while the approaches that did not use NLP tools and were based solely on n-gram contexts performed best under the strict criterion. In terms of coverage, the approaches that rely the least on language-dependent NLP tools have the best coverage, while those that made use of chunkers (BR, Mav-C) and dependency parsers (BCB-S) had the worst coverage.

4.1.4 Deeper Analysis

After manually inspecting the results returned by the various paraphrase systems, we noticed that some approaches returned highly redundant paraphrases that were of limited practical use. For example, for the phrase "were losing", the BR system returned "are losing", "have been losing", "have lost", "lose", "might lose", "had lost", "stand to lose", "who have lost", and "would lose" within the top 10 paraphrases. All of these are simple variants that contain different forms of the verb "lose". Under the lenient evaluation criterion almost all of these paraphrases would be marked as correct, since the same verb is being returned with some grammatical modifications. Highly redundant output of this form is clearly less useful than output that provides new variants of the input verb. Therefore, we carried out another evaluation aimed at penalizing highly redundant outputs.
For three of the approaches (BCB-S, BR, and Mav-N), we manually identified all of the paraphrases that contained the same verb as the main verb in the original phrase. During evaluation, these redundant paraphrases were regarded as non-related. The results of this experiment are provided in Table 4. They are dramatically different from those in Table 3, suggesting that evaluations that do not consider this type of redundancy may over-estimate actual system quality. The percentages of results marked as redundant for the BCB-S, BR, and Mav-N approaches were 22.6%, 52.5%, and 22.9%, respectively. Thus, the BR system, which appeared to have excellent (lenient) precision in our initial evaluation, returns a very large number of redundant paraphrases; this reduces its lenient precision at 1 from 0.83 in the initial evaluation to just 0.05 in the redundancy-based evaluation. The BCB-S and Mav-N approaches return a comparable number of redundant results. As with our previous evaluation, the BCB-S approach tends to perform better under the lenient evaluation, while Mav-N is better under the strict evaluation. Estimated 95% confidence intervals show that all differences between BCB-S and Mav-N are statistically significant, except for lenient P@10.
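This redundancy check was performed manually in our evaluation, but a crude automated approximation is easy to sketch; in the toy code below (ours), a deliberately naive suffix-stripping stemmer stands in for the human judgment of "same verb":

    # Toy sketch of redundancy filtering: flag paraphrases that share a
    # stem with the original phrase. The naive stemmer below is a crude
    # stand-in for the manual identification used in the actual evaluation.
    def stem(word):
        for suffix in ("ing", "es", "ed", "e", "s", "t"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def is_redundant(original, candidate):
        orig_stems = {stem(w) for w in original.split()}
        return any(stem(w) in orig_stems for w in candidate.split())

    candidates = ["are losing", "have lost", "stand to lose", "collapsed"]
    print([c for c in candidates if not is_redundant("were losing", c)])
    # -> ['collapsed']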

Our final analysis looks at the effects of monolingual corpus choice on paraphrase quality. Figure 1 plots expected precision at k versus k for the Mav-N approach with paraphrases acquired from three different corpora: TREC Disks 1 and 2 (800K news articles, 2.2GB of text), English Gigaword Fourth Edition (6.9M news articles, 21GB of text), and ClueWeb09 category B (50 million Web pages, 1.5TB of text). The results show that the paraphrases extracted from Gigaword are consistently better than those extracted from the other two corpora. Hence, both size and quality are important factors when acquiring paraphrases from a monolingual corpus. As part of future work, we are interested in exploring whether spam and document quality information can help improve Web-based paraphrase acquisition.

Figure 1: Comparison of precision at k (P@k) for paraphrases acquired from three different corpora (ClueWeb, Gigaword, and TREC12).

Of the baselines that we compared against, BCB-S was by far the most consistent and competitive across the different types of evaluations. However, its low coverage and reliance on bilingual corpora are drawbacks compared to our simple Mav-N approach, which requires no resource engineering and can be applied to very large corpora. We also feel that it is important to note that the precision of these state-of-the-art paraphrase acquisition approaches is still very poor. After accounting for redundancy, the best approaches achieve a precision at 1 of less than 20% under the strict criterion and less than 26% under the lenient criterion. This suggests that there is still substantial work left to be done before the output of these systems can reliably be used in support of other tasks.

4.2 Paraphrasing Tweetspeak

We also carried out an informal study on identifying paraphrases/synonyms for so-called Tweetspeak, the abbreviated language used on Twitter. This is a challenging domain since each Twitter update is very short (maximum 140 characters), which results in widespread misuse of grammar and spelling. As a result, most paraphrase systems that depend on NLP tools would very likely fail due to noise. This is also a domain where it would be challenging, if not impossible, to construct comparable or bilingual corpora. Therefore, it is an excellent domain for demonstrating the true robustness of simple approaches, such as the Mav-N approach that performed well in our previous evaluation.

For this experiment, we gathered a list of 30 common examples of Tweetspeak from various sources on the Web. We then applied the Mav-N approach to a collection of approximately 500 million English tweets (61GB of text) that were collected during the latter part of the year. The results are shown in Table 5. For each Tweetspeak term we show the highest scoring phrase returned by our system. Bold entries indicate correct results. For these phrases, our system achieves an accuracy of 67%. This small, but illustrative, evaluation again demonstrates the flexibility of large-scale approaches that do not rely on sophisticated NLP machinery.

Table 5: Example Twitter paraphrases extracted using Mavuno. Bold entries indicate correct paraphrases.

    abt → about          r → are              ru → en
    b4 → before          btw → btwn           chk → check
    cld → could          deets → details      fab → great
    fb → facebook        fav → fave           4 → 3
    fyi → what a farce   fwd → forward        gr8 → great
    ic → express         itz → syrenz vlog    jk → just kidding
    jsyk → brusday       mil → corners on the fly
    k → ck               omw → on my way      1 → 2
    ppl → people         plz → please         shld → shud
    tx → thx             da → the             2 → to
    yr → year

4.3 Other Tasks

Many text mining tasks can be cast within the Mavuno-style mining paradigm. For example, the framework can be used to compute the similarity between Web pages (the "phrases") based on their shared links, incoming anchor texts, and/or clicks (the "contexts"). Implementing this within the Mavuno framework would be as simple as implementing a new extractor that emits Web pages and their contexts; the framework can then handle everything else. From the information retrieval perspective, query expansion can be viewed as finding the terms that are distributionally similar to the query terms, where a context is defined as a document in which a term occurs. Although it is beyond the scope of this paper to prove, most query expansion approaches can in fact be re-cast in this manner.

Finally, the framework is also useful for a variety of information extraction-related tasks, including learning relation trigger patterns (as discussed earlier) and class instance learning. For example, within the medical domain, given a small seed set of example diseases (phrases), Mavuno was able to acquire (from a large Web collection) the patterns "suffering from *", "with *", "cases of *", and "diagnosed with *" via the phrase → context operation. These patterns can then be fed back into the context → phrase operation to mine new instances of diseases from unstructured texts (a sketch of this bootstrapping loop is given at the end of this section). Therefore, although this paper has primarily focused on paraphrase acquisition, the proposed framework can be used to solve a wide variety of large-scale mining tasks.
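As a concrete illustration of this bootstrapping loop, the toy sketch below (ours; the corpus, seed terms, and pattern syntax are illustrative, and simple string matching stands in for Mavuno's extractors) alternates the two primitive operations to recover a new disease instance:

    # Toy sketch of class-instance bootstrapping with the two primitive
    # operations: seeds -> patterns (phrase -> context), then
    # patterns -> new instances (context -> phrase).
    import re
    from collections import Counter

    corpus = [
        "patients suffering from diabetes were monitored closely",
        "new cases of influenza were reported this week",
        "she was diagnosed with influenza last winter",
        "patients suffering from asthma responded to treatment",
    ]

    def seeds_to_patterns(corpus, seeds):
        # phrase -> context: collect "two-words *" patterns around seed terms.
        patterns = Counter()
        for sent in corpus:
            for seed in seeds:
                for m in re.finditer(r"(\w+ \w+) " + re.escape(seed), sent):
                    patterns[m.group(1) + " *"] += 1
        return patterns

    def patterns_to_instances(corpus, patterns):
        # context -> phrase: fill each pattern's * slot with the next word.
        instances = Counter()
        for pat in patterns:
            prefix = pat[:-2]  # strip the " *" slot marker
            for sent in corpus:
                for m in re.finditer(re.escape(prefix) + r" (\w+)", sent):
                    instances[m.group(1)] += 1
        return instances

    patterns = seeds_to_patterns(corpus, {"diabetes", "influenza"})
    print(patterns)  # e.g., "suffering from *", "cases of *", "diagnosed with *"
    print(patterns_to_instances(corpus, patterns))  # recovers "asthma"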
This small, but illustrative, evaluation again demonstrates the flexibility of large-scale approaches that do not rely on sophisticated NLP machinery. 4.3 Other Tasks Many text mining tasks can be cast within the Mavunostyle mining paradigm. For example, the framework can be used to computer the similarity between Web pages ( phrases ) based on their shared links, incoming anchor texts, and/or clicks (the contexts ). Implementing this within the Mavuno framework would be as simple as implementing a new extractor that emits Web pages and their contexts. The framework can then handle everything else. From the information retrieval perspective, query expansion can be viewed as finding the terms that are distributionally similar to the query terms, where a context here os defined as a documents in which a term occurs. Although it is beyond the scope of this paper to prove this, but in fact most query expansion approaches can be re-cast in this manner. Finally, the framework is also useful for a variety of information extraction-related tasks, including learning relation trigger patters (as discussed earlier) and class instance learning. For example, within the medical domain, given a small seed set of example diseases (phrases), Mavuno was able to acquire (from a large Web collection) the patterns suffering from *, with *, cases of *, and diagnosed with * via the phrase context operation. These patterns can then be fed back into the context phrase operation to mine new instances of diseases from unstructured texts. Therefore, although this paper primarily focused on paraphrase acquisition, the proposed framework can be used to solve a wide variety of large-scale mining tasks. 5. CONCLUSIONS AND FUTURE WORK This paper examined the tradeoffs between simple, scalable paraphrasing approaches that do not make use of any NLP tools and more sophisticated approaches that use a variety of computationally expensive NLP tools. These aspects were investigated within the context of Mavuno, a new Hadoop-based open source paraphrase acquisition toolkit. Our empirical evaluation compared simple and sophisticated paraphrase approaches. The evaluation demonstrated that simple acquisition approaches implemented in Mavuno achieve state-of-the-art effectiveness, can scale to very large data sets, and can be successfully applied to very noisy data. In the future, we are interested in making our simple, scalable approaches domain-adaptable to make paraphrasing more targetable. We are also interested in exploring more sophisticated (but simple/fast) contexts beyond n-grams and investigating the benefits possible from combining our framework with a random walk model.

5. CONCLUSIONS AND FUTURE WORK

This paper examined the tradeoffs between simple, scalable paraphrasing approaches that do not make use of any NLP tools and more sophisticated approaches that use a variety of computationally expensive NLP tools. These aspects were investigated within the context of Mavuno, a new Hadoop-based open source paraphrase acquisition toolkit. Our empirical evaluation compared simple and sophisticated paraphrase approaches, and demonstrated that the simple acquisition approaches implemented in Mavuno achieve state-of-the-art effectiveness, can scale to very large data sets, and can be successfully applied to very noisy data. In the future, we are interested in making our simple, scalable approaches domain-adaptable in order to make paraphrasing more targeted. We are also interested in exploring more sophisticated (but simple and fast) contexts beyond n-grams, and in investigating the benefits possible from combining our framework with a random walk model.

Acknowledgments

The authors gratefully acknowledge the support of the DARPA Machine Reading Program under AFRL prime contract no. FA C. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, or the US government. We would also like to thank the anonymous reviewers for their valuable feedback.

6. REFERENCES

[1] I. Androutsopoulos and P. Malakasiotis. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38.
[2] C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, Morristown, NJ, USA. Association for Computational Linguistics.
[3] K. Barker, B. Agashe, S.-Y. Chaw, J. Fan, N. Friedland, M. Glass, J. Hobbs, E. Hovy, D. Israel, D. S. Kim, R. Mulkar-Mehta, S. Patwardhan, B. Porter, D. Tecuci, and P. Yeh. Learning by reading: a prototype system, performance baseline and lessons learned. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1. AAAI Press.
[4] R. Barzilay and L. Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 16-23, Morristown, NJ, USA. Association for Computational Linguistics.
[5] R. Barzilay and K. R. McKeown. Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL '01, pages 50-57, Morristown, NJ, USA. Association for Computational Linguistics.
[6] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, New York, NY, USA. ACM.
[7] R. Bhagat and D. Ravichandran. Large scale acquisition of paraphrases for learning surface patterns. In Proceedings of ACL-08: HLT, Columbus, Ohio, June. Association for Computational Linguistics.
[8] C. Callison-Burch. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, Morristown, NJ, USA. Association for Computational Linguistics.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California.
[10] T. Elsayed, J. Lin, and D. W. Oard. Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT '08, Morristown, NJ, USA. Association for Computational Linguistics.
[11] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5).
[12] S. Kok and C. Brockett. Hitting the right paraphrases in good time. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, Morristown, NJ, USA. Association for Computational Linguistics.
[13] Z. Kozareva and A. Montoyo. Paraphrase identification on the basis of supervised machine learning techniques. In 5th International Conference on NLP.
[14] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1), March.
[15] D. Lin and P. Pantel. Discovery of inference rules for question-answering. Natural Language Engineering, 7, December.
[16] N. Madnani and B. J. Dorr. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36.
[17] D. Metzler, E. Hovy, and C. Zhang. An empirical evaluation of data-driven paraphrase generation techniques. In Proceedings of ACL-11: HLT.
[18] B. Pang, K. Knight, and D. Marcu. Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, Morristown, NJ, USA. Association for Computational Linguistics.
[19] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing - Volume 2, EMNLP '09, Morristown, NJ, USA. Association for Computational Linguistics.
[20] M. Pasca and P. Dienes. Aligning needles in a haystack: Paraphrase acquisition across the web. In R. Dale, K.-F. Wong, J. Su, and O. Y. Kwong, editors, Natural Language Processing - IJCNLP 2005, volume 3651 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg.
[21] S. Zhao, C. Niu, M. Zhou, T. Liu, and S. Li. Combining multiple resources to improve SMT-based paraphrasing model. In Proceedings of ACL-08: HLT, Columbus, Ohio, June. Association for Computational Linguistics.


Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

DEPENDENCY PARSING JOAKIM NIVRE

DEPENDENCY PARSING JOAKIM NIVRE DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes

More information

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,

More information

Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output

Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Christian Federmann Language Technology Lab, German Research Center for Artificial Intelligence, Stuhlsatzenhausweg 3, D-66123 Saarbrücken,

More information

Robust Sentiment Detection on Twitter from Biased and Noisy Data

Robust Sentiment Detection on Twitter from Biased and Noisy Data Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research lbarbosa@research.att.com Junlan Feng AT&T Labs - Research junlan@research.att.com Abstract In this

More information

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy The Deep Web: Surfacing Hidden Value Michael K. Bergman Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy Presented by Mat Kelly CS895 Web-based Information Retrieval

More information

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED 17 19 June 2013 Monday 17 June Salón de Actos, Facultad de Psicología, UNED 15.00-16.30: Invited talk Eneko Agirre (Euskal Herriko

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Exploring Big Data in Social Networks

Exploring Big Data in Social Networks Exploring Big Data in Social Networks virgilio@dcc.ufmg.br (meira@dcc.ufmg.br) INWEB National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG May 2013 Some thoughts about

More information

Keyphrase Extraction for Scholarly Big Data

Keyphrase Extraction for Scholarly Big Data Keyphrase Extraction for Scholarly Big Data Cornelia Caragea Computer Science and Engineering University of North Texas July 10, 2015 Scholarly Big Data Large number of scholarly documents on the Web PubMed

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Fast Data in the Era of Big Data: Twitter s Real-

Fast Data in the Era of Big Data: Twitter s Real- Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1 AGENDA Motivation

More information

Analysis of Social Media Streams

Analysis of Social Media Streams Fakultätsname 24 Fachrichtung 24 Institutsname 24, Professur 24 Analysis of Social Media Streams Florian Weidner Dresden, 21.01.2014 Outline 1.Introduction 2.Social Media Streams Clustering Summarization

More information

Semi-Supervised Learning for Blog Classification

Semi-Supervised Learning for Blog Classification Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES ULFAR ERLINGSSON, MARK MANASSE, FRANK MCSHERRY MICROSOFT RESEARCH SILICON VALLEY MOUNTAIN VIEW, CALIFORNIA, USA ABSTRACT Recent advances in the

More information

How To Make Sense Of Data With Altilia

How To Make Sense Of Data With Altilia HOW TO MAKE SENSE OF BIG DATA TO BETTER DRIVE BUSINESS PROCESSES, IMPROVE DECISION-MAKING, AND SUCCESSFULLY COMPETE IN TODAY S MARKETS. ALTILIA turns Big Data into Smart Data and enables businesses to

More information

Connecting library content using data mining and text analytics on structured and unstructured data

Connecting library content using data mining and text analytics on structured and unstructured data Submitted on: May 5, 2013 Connecting library content using data mining and text analytics on structured and unstructured data Chee Kiam Lim Technology and Innovation, National Library Board, Singapore.

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

Text Analytics with Ambiverse. Text to Knowledge. www.ambiverse.com

Text Analytics with Ambiverse. Text to Knowledge. www.ambiverse.com Text Analytics with Ambiverse Text to Knowledge www.ambiverse.com Version 1.0, February 2016 WWW.AMBIVERSE.COM Contents 1 Ambiverse: Text to Knowledge............................... 5 1.1 Text is all Around

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web

More information

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow A Framework-based Online Question Answering System Oliver Scheuer, Dan Shen, Dietrich Klakow Outline General Structure for Online QA System Problems in General Structure Framework-based Online QA system

More information

TREC 2003 Question Answering Track at CAS-ICT

TREC 2003 Question Answering Track at CAS-ICT TREC 2003 Question Answering Track at CAS-ICT Yi Chang, Hongbo Xu, Shuo Bai Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China changyi@software.ict.ac.cn http://www.ict.ac.cn/

More information

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS Divyanshu Chandola 1, Aditya Garg 2, Ankit Maurya 3, Amit Kushwaha 4 1 Student, Department of Information Technology, ABES Engineering College, Uttar Pradesh,

More information

Text Analytics. A business guide

Text Analytics. A business guide Text Analytics A business guide February 2014 Contents 3 The Business Value of Text Analytics 4 What is Text Analytics? 6 Text Analytics Methods 8 Unstructured Meets Structured Data 9 Business Application

More information

EFFICIENTLY PROVIDE SENTIMENT ANALYSIS DATA SETS USING EXPRESSIONS SUPPORT METHOD

EFFICIENTLY PROVIDE SENTIMENT ANALYSIS DATA SETS USING EXPRESSIONS SUPPORT METHOD EFFICIENTLY PROVIDE SENTIMENT ANALYSIS DATA SETS USING EXPRESSIONS SUPPORT METHOD 1 Josephine Nancy.C, 2 K Raja. 1 PG scholar,department of Computer Science, Tagore Institute of Engineering and Technology,

More information

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Programming Languages in a Liberal Arts Education

Programming Languages in a Liberal Arts Education Programming Languages in a Liberal Arts Education Kim Bruce Computer Science Department Pomona College Claremont, CA 91711 Stephen N. Freund Computer Science Department Williams College Williamstown, MA

More information

MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT

MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT 1 SARIKA K B, 2 S SUBASREE 1 Department of Computer Science, Nehru College of Engineering and Research Centre, Thrissur, Kerala 2 Professor and Head,

More information

DEALMAKER: An Agent for Selecting Sources of Supply To Fill Orders

DEALMAKER: An Agent for Selecting Sources of Supply To Fill Orders DEALMAKER: An Agent for Selecting Sources of Supply To Fill Orders Pedro Szekely, Bob Neches, David Benjamin, Jinbo Chen and Craig Milo Rogers USC/Information Sciences Institute Marina del Rey, CA 90292

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Big Data and Natural Language: Extracting Insight From Text

Big Data and Natural Language: Extracting Insight From Text An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace

More information

MEMBERSHIP LOCALIZATION WITHIN A WEB BASED JOIN FRAMEWORK

MEMBERSHIP LOCALIZATION WITHIN A WEB BASED JOIN FRAMEWORK MEMBERSHIP LOCALIZATION WITHIN A WEB BASED JOIN FRAMEWORK 1 K. LALITHA, 2 M. KEERTHANA, 3 G. KALPANA, 4 S.T. SHWETHA, 5 M. GEETHA 1 Assistant Professor, Information Technology, Panimalar Engineering College,

More information

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A Mixed Trigrams Approach for Context Sensitive Spell Checking A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA dfossa1@uic.edu, bdieugen@cs.uic.edu

More information