Unsupervised Paraphrase Acquisition via Relation Discovery

Takaaki Hasegawa
Cyberspace Laboratories, Nippon Telegraph and Telephone Corporation
1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan
hasegawa.takaaki@lab.ntt.co.jp

Satoshi Sekine and Ralph Grishman
New York University
715 Broadway, 7th floor, New York, NY 10003, U.S.A.
sekine@cs.nyu.edu

Abstract

One of the difficulties in Natural Language Processing is the fact that there are many ways to express the same thing or event. These expressions are called paraphrases. Paraphrase is important in applications such as IR, QA and IE, and one of the difficulties in paraphrase research is acquiring the requisite paraphrase knowledge. In this paper, we describe an unsupervised method to discover paraphrases containing two named entities from a large untagged corpus. The proposed method consists of two stages. First, it finds relations between named entities using similarity of context and clustering. Then, the phrases which express the relation are selected from each cluster to acquire paraphrases. Our experiments with one year of newspaper text reveal that we can discover a variety of paraphrases with high precision and high recall.

1 Introduction

One of the difficulties in Natural Language Processing is the fact that there are many ways to express the same thing or event. If the expression is a word or a short phrase (like "corporation" and "company"), it is called a synonym. There has been a lot of research on such lexical relations, along with the creation of resources such as WordNet. If the expression is longer or more complicated (like "A buys B" and "A's purchase of B"), it is called a paraphrase, i.e. a set of phrases which express the same thing or event. Recently, this topic has been getting more attention, as is evident from the Paraphrase Workshops in 2003 and 2004, driven by the needs of various NLP applications.
For example, in Information Retrieval (IR), we have to match a user's query to the expressions in the desired documents, while in Question Answering (QA), we have to find the answer to the user's question even if the formulation of the answer in the document is different from the question. Also, in Information Extraction (IE), in which the system tries to extract elements of some events (e.g. the date and company names of a corporate merger event), several event instances from different news articles have to be aligned even if they are expressed differently. The importance of paraphrase is well recognized; however, the major obstacle is the construction of paraphrase knowledge. For example, we can easily imagine that the number of paraphrases for "A buys B" is enormous, and it is not possible to create comprehensive knowledge by hand. Also, we don't know how many such paraphrase sets are necessary to cover even everyday things or events. Up to now, most IE researchers have created paraphrase knowledge (or IE patterns) by hand and for specific tasks, so there is a limitation that IE can only be performed for a pre-defined task, like corporate mergers or management succession. In order to create an IE system for a new domain, one has to spend a long time creating the knowledge, so it is too costly to make IE technology open-domain like IR or QA.

In this paper, we propose an unsupervised method to discover paraphrases from a large untagged corpus. We focus on phrases which contain two Named Entities (NEs), as such phrases are very important for IE applications. After tagging a large corpus with an automatic NE tagger, the method tries to find sets of paraphrases automatically, without being given a seed phrase or any kind of cue. The proposed approach uses the relation discovery method described in (Hasegawa et al. 04), an unsupervised method for finding common relations in a large corpus. We will describe this method below, as it is integral to our paraphrase discovery procedure.

The rest of this paper is organized as follows. We discuss prior work in paraphrase discovery and its limitations in section 2. We describe our method in section 3. Then we report experiments and evaluations in section 4, and discuss the results in section 5.

2 Prior Work

There have been several efforts to discover paraphrases automatically from corpora. One general approach uses comparable documents, i.e. sets of documents whose content is known to be almost the same. In other words, those methods need comparable corpora, implicit or explicit, such as different newspaper stories about the same event (Shinyama and Sekine 03) or different translations of the same story (Barzilay 01). They basically try to find paraphrases in the comparable parts of documents using clues such as named entities. However, the availability of comparable corpora is limited; in particular, in the case of Barzilay's approach, the availability of multiple translations of the same story is clearly limited. This is a significant limitation on this general approach.

Another approach is to find phrases which take similar subjects and objects in large corpora, using the mutual information of word distributions (Lin and Pantel 01). This approach is designed to accumulate phrases useful for the QA task, given a pair consisting of two important phrases from the question and the answer. It therefore needs a phrase as an initial seed, so the relationships that can be extracted are naturally limited. There has also been work using a bootstrapping approach (Brin 98; Agichtein and Gravano 00; Ravichandran and Hovy 02).
Their basic strategy is, for a given pair of entity types, to start with some examples, like several famous book title and author pairs, find expressions which contain those names, and then use the found expressions to find more author and book title pairs. This process can be repeated several times to collect a list of author and book title pairs and expressions. Ravichandran demonstrated that the collected list improved the accuracy of a QA system. However, those methods need initial seeds, so the relation between the entities has to be known in advance. This limitation is the obstacle to making the technology open-domain.

3 Paraphrase Acquisition

3.1 Overview

Our goal is to discover the paraphrases that represent a particular relation between two named entities. If we could identify pairs of named entities (such as Cingular and AT&T Wireless) which have a particular relation (such as merger & acquisition), we could also find paraphrases expressing the relation between these two names. Under this assumption, we propose an approach of paraphrase acquisition via relation discovery from large text collections. Our approach is fully unsupervised; we only need a named entity tagger and large text corpora. The outline of the method is as follows:

1. Tag named entities in text corpora
2. Discover particular relations by clustering named entity pairs by their context
3. Select phrases which express the relation from those in the cluster

Figure 1 shows an overview of the method. First, from the NE-tagged newspaper corpus, we extract expressions containing frequently appearing pairs of named entities; in the figure, these are expressions containing the company pairs A and B, C and D, and E and F. Then, we accumulate the context words intervening between these entities, such as "is offering to buy" and "negotiates to acquire" for A and B.
If the contexts for A and B and those for E and F are similar, it is likely that these pairs represent the same relation; in the figure, A and B and E and F have an M&A relation. By this method, we believe we can accumulate instances of phrases, as well as instances of the relations which are important in the text.
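As a concrete illustration of this extraction step, here is a minimal sketch in Python. The inline <COM>...</COM> tag format, the sentences and the entity names are hypothetical, and the 5-word window anticipates the co-occurrence definition given in section 3.3.

```python
import re

# Toy NE-tagged sentences (the inline <COM>...</COM> format is hypothetical).
sentences = [
    "<COM>Cingular</COM> is offering to buy <COM>AT&T Wireless</COM> today",
    "<COM>Cingular</COM> negotiates to acquire <COM>AT&T Wireless</COM>",
]

TAG = re.compile(r"<COM>(.*?)</COM>")

def extract_pairs(sentence, max_gap=5):
    """Yield (NE1, NE2, intervening words) for adjacent NE mentions,
    keeping only pairs separated by at most max_gap words."""
    mentions = [(m.group(1), m.start(), m.end()) for m in TAG.finditer(sentence)]
    for (e1, _, end1), (e2, start2, _) in zip(mentions, mentions[1:]):
        gap = sentence[end1:start2].split()
        if len(gap) <= max_gap:
            yield e1, e2, gap

# Accumulate the context words for each NE pair over the corpus.
contexts = {}
for s in sentences:
    for e1, e2, words in extract_pairs(s):
        contexts.setdefault((e1, e2), []).extend(words)
```

After this loop, the pair (Cingular, AT&T Wireless) carries the bag of intervening words from both sentences, which later stages stem, weight and compare.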
    NE-tagged corpus (newspaper)
      1) Extract expressions between two NE instances:
         <Company-A -- Company-B>: A is offering to buy B / A's proposed acquisitions of B /
             A's interest in B / A negotiates to acquire B / A is discussing with B
         <Company-C -- Company-D>: C's parent company D / C is a subsidiary of D
         <Company-E -- Company-F>: E's acquisition of F / E would buy F
      2) Build clusters of NE pairs
      3) Find phrases which express the relation (paraphrases) in each cluster

    Figure 1. Overview of the method

Next, we try to acquire phrases representing the relation from the expressions found in each cluster. The expressions in a cluster include expressions irrelevant to the relation, such as "A is discussing with B", which does not really express the M&A relation, so we apply two constraints in order to select only the phrases expressing the relation. One is the phrase duplication constraint: a phrase has to appear with some minimum number of NE pair instances in the cluster. The other is the common word constraint: we select phrases which contain a frequent word in the cluster. For example, if the word "acquisition" appears frequently in the cluster, phrases including the word "acquisition" are likely to express the relation, here the M&A relation.

3.2 Named entity tagging

Our proposed method is fully unsupervised. We do not need comparable corpora or any manually selected initial seeds. Instead, we use a named entity (NE) tagger. Current automatic named entity taggers have quite satisfactory performance. In addition, the set of NE types has been extended; for example, (Sekine et al. 02) proposed 150 NE types. Extending the NE types would lead to more effective relation discovery. For example, if the type ORGANIZATION is divided into several subtypes, like COMPANY, MILITARY, GOVERNMENT and so on, the discovery procedure could detect more specific relations, such as those between COMPANYs.
We use an extended NE tagger (reference to be provided in the final paper).

3.3 Relation Discovery

We define co-occurrence between named entities as follows: two named entities are considered to co-occur if they appear with no more than 5 intervening words in the same sentence. We collect the intervening words between the two named entities for each co-occurrence. These words, which are stemmed, can be regarded as the context of the pair of named entities. Different orders of occurrence of the same named entities are considered as
different co-occurrences. Less frequent NE pairs are eliminated because they might be less reliable for relation discovery; we set the co-occurrence frequency threshold to 30. The vector space model of the context words and the cosine similarity of the vectors are used to calculate the similarities between NE pairs. The context vector of an NE pair consists of the bag of words formed from all the intervening words (excluding stop words) of the two named entities. Each word of a context vector is weighted by tf*idf, the product of term frequency and inverse document frequency. Term frequency is the number of occurrences of a word in the collected context words; document frequency is the number of documents which include the word. The similarity of two pairs of named entities is calculated as the cosine similarity of the two vectors. We compare NE pairs of the same NE types, e.g. PERSON-GPE (a geographical-political entity -- a region with a government) pairs. In this paper, we refer to a pair of named entity types as a domain. In addition to the PERSON-GPE domain, we will report on our experiment on the COMPANY-COMPANY domain.

3.4 Clustering

After we calculate the similarities among the context vectors of the NE pairs, we cluster the NE pairs based on this similarity. We adopt hierarchical clustering and use complete linkage to avoid the chain effect of single-link clustering, which could join two not-so-similar members into a single cluster. In the complete linkage method, the distance between clusters is taken to be the distance of the furthest nodes in the two clusters. Now we have sets of named entity pairs which are likely to express the same relation.

3.5 Selection of Paraphrases

Even though a set of named entity pairs in the same relation has been found, not all of the phrases used in those clusters express the relation.
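The context-vector similarity described in section 3.3 can be sketched as follows. This is a minimal sketch with hypothetical toy data: the context bags are invented, and the idf computation approximates document frequency by the number of NE-pair contexts containing the word, whereas the paper counts actual documents.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical context bags: stemmed intervening words per NE pair,
# with stop words already removed.
contexts = {
    ("A", "B"): ["offer", "buy", "propos", "acquisit", "negoti", "acquir"],
    ("E", "F"): ["acquisit", "buy"],
    ("C", "D"): ["parent", "compani", "subsidiari"],
}

# Approximate document frequency: number of NE-pair contexts with the word.
df = Counter()
for bag in contexts.values():
    for w in set(bag):
        df[w] += 1
n = len(contexts)

def tfidf(bag):
    # Weight each word by term frequency times inverse document frequency.
    return {w: c * log(n / df[w]) for w, c in Counter(bag).items()}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = {pair: tfidf(bag) for pair, bag in contexts.items()}
sim_same = cosine(vectors[("A", "B")], vectors[("E", "F")])  # share "acquisit", "buy"
sim_diff = cosine(vectors[("A", "B")], vectors[("C", "D")])  # no common words
```

With the cosine threshold just above 0 used in section 4.1, A-B and E-F (which share context words) would be clustered together, while C-D, sharing no words with them, would not.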
In order to filter out the phrases which do not express the relation, we apply two constraints:

[Phrase duplication constraint:] A phrase must be shared by at least two NE pairs in a cluster.

[Common word constraint:] A phrase must include one of the frequent common words in a cluster.

The phrase duplication constraint requires a phrase to appear with multiple NE pairs in the same cluster. It is intended to delete phrases which appear accidentally or are specific to a particular NE pair. The common word constraint relies on the idea that words appearing frequently in a cluster are relevant to the relation of the cluster; if a phrase contains one or more such words, the phrase is considered to express the relation.

4 Experiment

We report on our experiment in two successive stages: the first stage is relation discovery and the second stage is paraphrase acquisition. We conducted the experiment with one year of The New York Times (1995) as our corpus to verify the method.

4.1 Relation discovery

First, the frequent NE pairs are found, and the NE pairs along with their intervening words are extracted and clustered. In order to evaluate the result, we analyzed all the extracted NE pair instances manually and identified the relations for two different domains. One was the PERSON-GPE domain, in which 177 distinct NE pairs were obtained and manually classified into 38 relations. The other was the COMPANY-COMPANY domain, in which 65 distinct NE pairs were obtained and manually classified into 10 relations. We evaluated the automatically extracted clusters consisting of two or more pairs. For each cluster, the most frequent relation represents the relation of the cluster; for example, if a cluster contains seven NE pairs of relation A and three NE pairs of relation B, the cluster is labeled A.
When the relation of an NE pair is the same as the label of its cluster, it is counted as correct; the correct pair count, N_correct, is defined as the total number of correct pairs in all clusters. Other NE pairs in the clusters are counted as incorrect; the incorrect pair count, N_incorrect, is defined as the total number of incorrect pairs in all clusters. We evaluate the clusters based on Recall, Precision
and F-measure. The definitions of these measures are as follows.

[Recall (R)] How many correct pairs are detected out of all the key pairs? The key pair count, N_key, is defined as the total number of pairs manually classified into clusters of two or more pairs.

    Recall = N_correct / N_key

[Precision (P)] How many correct pairs are detected among the pairs clustered automatically?

    Precision = N_correct / (N_correct + N_incorrect)

[F-measure (F)] F-measure is defined as a combination of recall and precision according to the following formula:

    F-measure = 2 * Recall * Precision / (Recall + Precision)

These values vary depending on the threshold of cosine similarity. We fixed the cosine threshold at a single value just above 0 for both domains, which gives almost the maximum F value in both domains. This setting requires no parameter optimization, and we believe it works for other domains as well, because it means that every member of a cluster has to have at least one word in common with the other members of the same cluster. We obtained 34 clusters in the PER-GPE domain and 15 clusters in the COM-COM domain. Table 1 shows the results for both domains: we achieved an F-measure of 80 in the PER-GPE domain and 75 in the COM-COM domain.

    Domain     Prec.  Recall  F
    PER-GPE     79     83     80
    COM-COM     76     74     75

    Table 1. Result of relation clustering

4.2 Paraphrase acquisition

In the second stage, we acquire paraphrases from the clusters of the same relation. Although we obtained some meaningful relations in smaller clusters, we focus on the larger clusters, those with more than 4 members. We found that all large clusters have meaningful major relations and that the common words in those clusters accurately represent the relations. The large clusters represent the President, Senator, Prime Minister, Governor, Secretary, Republican and Coach relations in the PER-GPE domain, and the M&A, Parent and Alliance relations in the COM-COM domain.
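The cluster evaluation of section 4.1 can be sketched as follows; the cluster content is the hypothetical seven-A/three-B example from the text.

```python
def evaluate_clusters(clusters, n_key):
    """Each cluster is a list of manually assigned relation labels, one per
    NE pair; a cluster is labeled with its most frequent relation."""
    n_correct = n_incorrect = 0
    for labels in clusters:
        majority = max(set(labels), key=labels.count)
        n_correct += labels.count(majority)
        n_incorrect += len(labels) - labels.count(majority)
    recall = n_correct / n_key
    precision = n_correct / (n_correct + n_incorrect)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# One cluster with seven pairs of relation A and three of relation B,
# out of ten key pairs: the cluster is labeled A, so 7 correct, 3 incorrect.
r, p, f = evaluate_clusters([["A"] * 7 + ["B"] * 3], n_key=10)
```

Here recall, precision and F-measure all come out to 0.7, matching the definitions above.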
We made a reference data set of paraphrases by examining the phrases in each cluster for both domains. We eliminated from the evaluation phrases occurring only once and phrases consisting only of symbols and stop words. With respect to the major relation, each phrase was categorized into one of the following four classes; Table 2 shows the distribution of the phrases.

[Class 1:] Phrases which represent the major relation (i.e. strict paraphrases)

[Class 2:] Phrases which almost represent the major relation but include extra words (i.e. more restrictive relations)

[Class 3:] Phrases which suggest a broader meaning than just the major relation (i.e. more general relations)

[Class 4:] Phrases which cannot be regarded as representing the major relation (i.e. others)

    Phrase class    1    2    3    4   total
    PER-GPE        59   31   17  222    329
    COM-COM       102   10   29  177    318

    Table 2. Reference data of phrase classes

                      Baseline      Phrase dup.    Common word    Phrase+Common
         Classes     R   P   F     R   P   F      R   P   F      R   P   F
    P-G  1         100  18  31    62  44  51     82  38  52     92  35  50
         1+2       100  28  44    53  57  55     80  57  66     89  51  65
         1+2+3     100  33  50    48  62  54     69  57  62     80  54  64
    C-C  1         100  33  49    33  77  47     58  82  68     70  76  72
         1+2       100  36  53    30  77  44     57  89  70     68  81  74
         1+2+3     100  45  62    26  84  40     47  92  62     57  86  69

    Table 3. Evaluation result for paraphrase discovery
Then, we evaluated the results of the paraphrase acquisition experiment for the PER-GPE domain and the COM-COM domain. There are three settings for the key phrases: 1) those in Class 1 (strict paraphrases), 2) those in Classes 1 and 2, and 3) those in Classes 1, 2 and 3 (loose paraphrases). The loose paraphrases could be useful in an IE application: even though such phrases are not interchangeable in general, they can be used to extract information once the task is specified. The evaluation metrics are the usual Recall, Precision and F-measure. Table 3 shows the evaluation results under different constraints (no constraint, the phrase duplication constraint, the common word constraint, and the combined constraints). In the combined setting, phrases which satisfy either constraint are kept (i.e. disjunction, rather than conjunction). In the common word constraint, we select the phrases for which the sum of the relative frequencies of the common words is above 0.4. Recall is calculated relative to the case of no constraint (the baseline), as we are comparing phrase sets among the phrases in the baseline, so the recall of the baseline is 100%. However, the precision of the baseline is low, because the reference data include many irrelevant phrases; the aim of the two constraints is to push the precision higher while keeping the recall high. The best result is obtained with the common word constraint in the PER-GPE domain, and with the combined constraints in the COM-COM domain. In general, the common word constraint improves precision more than the phrase duplication constraint does. This means that the paraphrases in the clusters are not often shared by different NE pair instances, even though they share some words. There is a greater variety of phrases in the COM-COM domain than in the PER-GPE domain.
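One plausible reading of the constraint combination described above can be sketched as follows. The cluster statistics are hypothetical, the 0.4 threshold is the paper's, and treating it as a bound on the summed relative frequencies of a phrase's words is our interpretation, not necessarily the authors' exact formulation.

```python
from collections import Counter

# Hypothetical cluster statistics.
pair_support = {              # distinct NE pairs each phrase appeared with
    "acquisition of": 3,
    "agreed to buy": 1,
    "is discussing with": 1,
}
word_freq = Counter({"acquisition": 5, "of": 5, "buy": 4, "is": 2,
                     "agreed": 1, "to": 1, "discussing": 1, "with": 1})
total = sum(word_freq.values())

def duplication_ok(phrase):
    # Phrase duplication constraint: shared by at least two NE pairs.
    return pair_support[phrase] >= 2

def common_word_ok(phrase, threshold=0.4):
    # Our reading of the common word criterion: the relative frequencies
    # of the phrase's words must sum to more than the threshold.
    return sum(word_freq[w] / total for w in phrase.split()) > threshold

# The combined setting is a disjunction: keep a phrase if EITHER holds.
kept = [p for p in pair_support if duplication_ok(p) or common_word_ok(p)]
```

On this toy cluster, "acquisition of" passes both constraints, while "agreed to buy" and "is discussing with" pass neither and are filtered out.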
In the PER-GPE domain, there is a rather small number of typical phrases for each relation (e.g. "A is the President of B"). We believe that the PER-GPE domain contains more static relations, while the COM-COM domain contains more event relations. This assumption is also supported by the result that the phrase duplication constraint works better in the PER-GPE domain. Table 4 shows some examples of successfully acquired paraphrases for the President relation in the PER-GPE domain and the M&A and Parent relations in the COM-COM domain, using the combined constraints. These paraphrases would be useful for applications like Information Retrieval, Question Answering or Information Extraction.

    President:
      A, the president of B
      B's new President A
      B's newly elected President, A
      A becomes president of B
      B under President A

    M&A:
      A bought B
      A has agreed to buy B
      A, which is buying B
      A's proposed acquisition of B
      A's acquisition of B
      A's agreement to buy B
      A's purchase of B
      A bid for B
      A's takeover of B
      A merger with B
      A succeeded in buying B
      B, which was acquired by A
      B would become a subsidiary of A
      B agreed to be bought by A

    Parent:
      A, a unit of B
      A, owned by B
      A's parent, B
      B, the parent company of A
      B, holding company for A
      B, the company that owns A

    Table 4. Examples of discovered paraphrases

5 Discussion

In this section, we discuss several issues regarding the proposed method.

Error Analysis

We analyzed the errors which lower the precision (false alarms), a problem primarily in the PER-GPE domain. The analysis was done on the data obtained with the combined constraints. We categorized the errors into the following four types; their distribution is shown in Table 5.
[Error 1:] Phrase contains two different phrases

[Error 2:] Relation discovery error

[Error 3:] Relations dependent on context

[Error 4:] Other errors

    Error    1    2    3    4
    P-G     34   21    8   10
    C-C      3    3    0    7

    Table 5. Error distribution

The most severe error type (Error 1) involves phrases which actually contain two different phrases. An example occurs in "... visited France (GPE), when President Chirac (PERSON) invited the world leaders." Because France and Chirac are a frequently co-occurring NE pair, and the phrase (actually a word sequence) "GPE, when President PERSON" satisfies the common word constraint, it was taken as a paraphrase candidate. This kind of error lowers the precision, but we believe that if we could use a parser to find phrase boundaries, such errors might be eliminated. Error 2 involves examples like "U.S. (GPE) Vice President Al Gore (PERSON)". As its context contains the word "President", the NE pair is assigned the President relation. This could be addressed by using frequent multi-word terms as keywords, but this remains future work. An example of Error 3 is the phrase "Tommy Thompson (PERSON), a Republican from Wisconsin (GPE)". Mr. Tommy Thompson is not a senator or a representative, but a governor. When the sentence appears in a context discussing different governors (i.e. it is obvious from the context that he is a governor), the text does not mention "Governor" explicitly. So the phrase can be a paraphrase of the Governor relation in such a context, but not always. We do not have a good idea for solving this kind of error.

Limitations and Future Directions

Our method has some limitations. We set several frequency thresholds, so we cannot find less frequent relations between NE pairs, nor paraphrases for such relations. However, we think this limitation could be addressed in two ways. One approach is to increase the amount of text. We used only a one-year corpus for this experiment, but much larger corpora exist, e.g.
newspaper corpora covering more than 10 years, or much larger corpora of Web text. If we can use such corpora, the sparseness problem should be diminished. The other approach is to combine bootstrapping methods (Brin 98; Agichtein and Gravano 00) with our relation discovery stage: we first find reliable paraphrases using frequent instances, and then use the obtained knowledge to find less frequent instances.

Unsupervised Methods

The proposed method is fully unsupervised. Looking back over the last decade, there have been great advances in many fields of NLP using supervised machine learning, including corpus-based POS taggers, NE taggers and treebank-based parsers. We believe this was possible because those tasks can be decomposed into simple categorization tasks, and the amount of training text required is small enough to be prepared with reasonable time and effort. However, most serious NLP applications require a higher level of knowledge, in particular semantic knowledge, and we believe that problem cannot be solved by small categorization tasks. So, recently we have observed an increasing focus on discovering semantic knowledge from untagged corpora, for example (Hearst 92; Riloff 96; Sudo et al. 03). The work in this paper aims at the same objective: finding useful semantic knowledge in untagged corpora using unsupervised methods. As we are fortunate to be able to use enormous corpora, which were not available 10 years ago, we believe this is a fruitful direction for investing our efforts to advance NLP technologies.

6 Conclusion

In this paper, we proposed an unsupervised method to discover paraphrases via relation discovery. The basic idea is, first, to discover the relations between named entities by clustering their contexts, and then to select phrases expressing the major relation of each cluster using the phrase duplication constraint and the common word constraint.
Our experiments with one year of newspaper text reveal that we were able to discover a variety of paraphrases with high precision and
high recall through the phrase selection constraints as well as the relation discovery process.

References

Agichtein, Eugene and Gravano, Luis. 2000. Snowball: Extracting relations from large plain-text collections. In Proc. of the 5th ACM International Conference on Digital Libraries (ACM DL00), pp 85-94.

Barzilay, Regina and McKeown, Kathleen. 2001. Extracting paraphrases from a parallel corpus. In Proc. of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-EACL01), pp 50-57.

Brin, Sergey. 1998. Extracting patterns and relations from the world wide web. In Proc. of the WebDB Workshop at the 6th International Conference on Extending Database Technology (WebDB98), pp 172-183.

Hasegawa, Takaaki, Sekine, Satoshi and Grishman, Ralph. 2004. Discovering relations among named entities from large corpora. In Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL04), pp 415-422.

Hearst, Marti A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of the Fourteenth International Conference on Computational Linguistics (COLING92).

Lin, Dekang and Pantel, Patrick. 2001. DIRT - discovery of inference rules from text. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD01), pp 323-328.

Ravichandran, Deepak and Hovy, Eduard. 2002. Learning surface text patterns for a question answering system. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL02).

Riloff, Ellen. 1996. Automatically generating extraction patterns from untagged text. In Proc. of the 13th National Conference on Artificial Intelligence (AAAI96), pp 1044-1049.

Sekine, Satoshi, Sudo, Kiyoshi and Nobata, Chikashi. 2002. Extended named entity hierarchy. In Proc. of the Third International Conference on Language Resources and Evaluation (LREC02), pp 1818-1824.

Shinyama, Yusuke and Sekine, Satoshi. 2003. Paraphrase acquisition for information extraction. In Proc. of the Second International Workshop on Paraphrasing (IWP03).

Sudo, Kiyoshi, Sekine, Satoshi and Grishman, Ralph. 2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL03).