3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work


Unsupervised Paraphrase Acquisition via Relation Discovery

Takaaki Hasegawa
Cyberspace Laboratories, Nippon Telegraph and Telephone Corporation
1-1 Hikarinooka, Yokosuka, Kanagawa, Japan

Satoshi Sekine and Ralph Grishman
New York University
715 Broadway, 7th floor, New York, NY 10003, U.S.A.
sekine@cs.nyu.edu

Abstract

One of the difficulties in Natural Language Processing is the fact that there are many ways to express the same thing or event. These expressions are called paraphrases. Paraphrase is important in applications such as IR, QA and IE, and one of the difficulties in paraphrase research is acquiring the requisite paraphrase knowledge. In this paper, we describe an unsupervised method to discover paraphrases containing two named entities from a large untagged corpus. The proposed method consists of two stages. First, it finds relations between named entities using similarity of context and clustering. Then, the phrases which express the relation are selected from each cluster to acquire paraphrases. Our experiments with one year of newspaper text reveal that we can discover a variety of paraphrases with high precision and high recall.

1 Introduction

One of the difficulties in Natural Language Processing is the fact that there are many ways to express the same thing or event. If the expression is a word or a short phrase (like "corporation" and "company"), it is called a synonym. There has been a lot of research on such lexical relations, along with the creation of resources such as WordNet. If the expression is longer or more complicated (like "A buys B" and "A's purchase of B"), it is called a paraphrase, i.e. one of a set of phrases which express the same thing or event. Recently, this topic has been getting more attention, as is evident from the Paraphrase Workshops in 2003 and 2004, driven by the needs of various NLP applications.
For example, in Information Retrieval (IR), we have to match a user's query to the expressions in the desired documents, while in Question Answering (QA), we have to find the answer to the user's question even if the formulation of the answer in the document is different from the question. Also, in Information Extraction (IE), in which the system tries to extract elements of some events (e.g. date and company names of a corporate merger event), several event instances from different news articles have to be aligned even if they are expressed differently.

The importance of paraphrase is widely recognized; the major obstacle is the construction of paraphrase knowledge. For example, we can easily imagine that the number of paraphrases for "A buys B" is enormous and that it is not possible to create comprehensive knowledge by hand. Also, we don't know how many kinds of such paraphrase sets are necessary to cover even some everyday things or events. Up to now, most IE researchers have been creating paraphrase knowledge (or IE patterns) by hand and for specific tasks. As a result, IE can only be performed for a pre-defined task, like corporate mergers or management succession. Creating an IE system for a new domain requires a long time spent building the knowledge, so it is too costly to make IE technology open-domain like IR or QA.

In this paper, we propose an unsupervised method to discover paraphrases from a large untagged corpus. We focus on phrases which contain two Named Entities (NEs), as those types of phrases are very important for IE applications. After tagging a large corpus with an automatic NE tagger, the method tries to find sets of paraphrases automatically without being given a seed phrase or any kind of cue. The proposed approach uses the relation discovery method described in (Hasegawa et al. 04), an unsupervised method for finding common relations in a large corpus. We will describe this method below, as it is integral to our paraphrase discovery procedure.

The rest of this paper is organized as follows. We discuss the prior work in paraphrase discovery and its limitations in section 2. We describe our method in section 3. Then we report experiments and evaluations in section 4, and discuss the results in section 5.

2 Prior Work

There have been several efforts to discover paraphrases automatically from corpora. One general approach uses comparable documents, which are sets of documents whose content is known to be almost the same. In other words, those methods need comparable corpora, implicit or explicit, such as different newspaper stories about the same event (Shinyama and Sekine 03) or different translations of the same story (Barzilay 01). They basically try to find paraphrases in the comparable parts of documents using clues like named entities. However, the availability of comparable corpora is limited; in particular, in the case of Barzilay's approach, the availability of multiple translations of the same story is clearly limited. This is a significant limitation on this general approach.

Another approach to finding paraphrases is to find phrases which take similar subjects and objects in large corpora by using mutual information of word distribution (Lin and Pantel 01). This approach is designed to accumulate phrases useful for the QA task by giving a pair consisting of two important phrases from the question and the answer. So, this approach needs a phrase as an initial seed and thus the possible relationships to be extracted are naturally limited.

There has also been work using a bootstrapping approach (Brin 98; Agichtein and Gravano 00; Ravichandran and Hovy 02).
Their basic strategy is, for a given pair of entity types, to start with some examples, like several famous book title and author pairs; find expressions which contain those names; and then, using the found expressions, find more author and book title pairs. This can be repeated several times to collect a list of author and book title pairs and expressions. Ravichandran demonstrated that the collected list improved the accuracy of a QA system. However, those methods need initial seeds, so the relation between entities has to be known in advance. This limitation is the obstacle to making the technology open-domain.

3 Paraphrase Acquisition

3.1 Overview

Our goal is to discover the paraphrases that represent a particular relation between two named entities. If we could identify pairs of named entities (such as "Cingular" and "AT&T Wireless") which have a particular relation (such as "merger & acquisition"), we could also find paraphrases expressing the relation between these two names. Under this assumption, we propose an approach of paraphrase acquisition via relation discovery from large text documents. Our approach is fully unsupervised and we only need a named entity tagger and large text corpora. The outline of the method is as follows:

1. Tag named entities in text corpora
2. Discover particular relations by clustering named entity pairs by their context
3. Select phrases which express the relation from those in the cluster

Figure 1 shows the overview of the method. First, from the NE-tagged newspaper corpus, we extract expressions containing frequently-appearing pairs of named entities; in the figure, these are expressions containing the pairs COMPANY A and B, C and D, and E and F. Then, we accumulate the context words intervening between these entities, such as "is offering to buy" and "negotiates to acquire" for A and B.
If the contexts for A and B and those for E and F are similar, it is likely that these pairs represent the same relation; in the figure, the pairs A-B and E-F both have the M&A relation. By this method, we believe we can accumulate the instances of phrases, as well as the instances of relations which are important in the text.
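The first step in the overview, collecting the words that appear between two named entities, can be illustrated with a minimal Python sketch. The input format (NE tokens as `(text, type)` tuples), the function name `collect_contexts`, the pairing of adjacent entities only, and the `max_gap` parameter are assumptions made for illustration; Section 3.3 gives the actual co-occurrence definition used in the paper.

```python
from collections import defaultdict

def collect_contexts(sentences, max_gap=5):
    """Map each ordered NE pair to the word sequences found between the two NEs.

    Each sentence is a list of tokens; an NE token is a (text, ne_type) tuple
    and any other token is a plain string (hypothetical input format).
    """
    contexts = defaultdict(list)
    for tokens in sentences:
        ne_positions = [i for i, tok in enumerate(tokens) if isinstance(tok, tuple)]
        # Pair each NE with the next NE in the sentence, so the gap holds words only.
        for left, right in zip(ne_positions, ne_positions[1:]):
            gap = tokens[left + 1:right]
            if len(gap) <= max_gap:
                pair = (tokens[left], tokens[right])  # order of occurrence matters
                contexts[pair].append(tuple(gap))
    return contexts

sentences = [
    [("A", "COMPANY"), "is", "offering", "to", "buy", ("B", "COMPANY")],
    [("E", "COMPANY"), "would", "buy", ("F", "COMPANY")],
    [("A", "COMPANY"), "negotiates", "to", "acquire", ("B", "COMPANY")],
]
contexts = collect_contexts(sentences)
for pair, gaps in contexts.items():
    print(pair, gaps)
```

Keeping the pair as an ordered tuple means that "A ... B" and "B ... A" accumulate separate contexts, which matches the treatment of occurrence order described later in the paper.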

[Figure 1. Overview of the method]
Input: NE-tagged corpus (newspaper)
1) Extract expressions between two NE instances
2) Build clusters of NE pairs
   <Company-A, Company-B>: "A is offering to buy B", "A's proposed acquisitions of B", "A's interest in B", "A negotiates to acquire B", "A is discussing with B"
   <Company-C, Company-D>: "C's parent company D", "C is a subsidiary of D"
   <Company-E, Company-F>: "E's acquisition of F", "E would buy F"
3) Find phrases which express the relation (paraphrases) in each cluster

Next, we try to acquire phrases that represent the relation from the expressions found in each cluster. The expressions in the cluster include some that are irrelevant to the relation, such as "A is discussing with B", which is not really the M&A relation, so we apply two constraints in order to select only the phrases expressing the relation. One is the phrase duplication constraint, where the phrase has to appear with some minimum number of NE pair instances in the cluster. The other is the common word constraint, which selects phrases that contain a frequent word in the cluster. For example, if the word "acquisition" appears frequently in the cluster, phrases including the word "acquisition" are likely to be phrases expressing the relation, here the M&A relation.

3.2 Named entity tagging

Our proposed method is fully unsupervised. We do not need comparable corpora or any manually selected initial seeds. Instead, we use a named entity (NE) tagger. Current automatic named entity taggers have quite satisfactory performance. In addition, the set of NE types has been extended. For example, (Sekine et al. 02) proposed 150 NE types. Extending the NE types would lead to more effective relation discovery. For example, if the type ORGANIZATION is divided into several subtypes, like COMPANY, MILITARY, GOVERNMENT and so on, the discovery procedure could detect more specific relations such as those between COMPANYs.
We use an extended NE tagger (reference to be provided in the final paper).

3.3 Relation Discovery

We define the co-occurrence between NE pairs as follows: two named entities are considered to co-occur if they appear with no more than 5 intervening words in the same sentence. We collect the intervening words between two named entities for each co-occurrence. These words, which are stemmed, can be regarded as the context of the pair of named entities. Different orders of occurrence of the same named entities are considered as different co-occurrences. Less frequent NE pairs are eliminated because they might be less reliable for relation discovery. We set the co-occurrence frequency threshold to 30.

The vector space model of the context words and the cosine similarity of the vectors are used to calculate the similarities between NE pairs. A context vector for each NE pair consists of the bag of words formed from all intervening words (excluding stop words) of the two named entities. Each word of a context vector is weighted by tf*idf, the product of term frequency and inverse document frequency. Term frequency is the number of occurrences of a word in the collected context words. Document frequency is the number of documents which include the word. The similarity of two pairs of named entities is calculated as the cosine similarity of the two vectors.

We compare NE pairs of the same NE types, e.g., PERSON-GPE pairs (a GPE is a geographical-political entity, i.e. a region with a government). In this paper, we will refer to a pair of named entity types as a domain. In addition to the PERSON-GPE domain, we will report on our experiment on the COMPANY-COMPANY domain.

3.4 Clustering

After we calculate the similarity among context vectors of NE pairs, we make clusters of NE pairs based on the similarity. We adopt hierarchical clustering and use complete linkage to avoid the chain effect of single-link clustering, which could join two not-so-similar members into a single cluster. In the complete linkage method, the distance between clusters is taken to be the distance between the furthest nodes in the two clusters. At this point we have sets of named entity pairs which are likely to express the same relation.

3.5 Selection of Paraphrases

Even though a set of named entity pairs in the same relation has been found, not all of the phrases used in those clusters express the relation.
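The similarity computation and clustering of Sections 3.3 and 3.4 can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the paper does not give its exact idf formula, so `log(N / df)` is assumed here with each NE pair's collected context treated as one "document"; the function names, the toy stemmed contexts and the threshold value `1e-6` are likewise hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(contexts):
    """contexts: dict mapping an NE pair to its list of (stemmed) context words."""
    n = len(contexts)
    df = Counter()
    for words in contexts.values():
        df.update(set(words))
    vectors = {}
    for pair, words in contexts.items():
        tf = Counter(words)
        # Assumed weighting: tf * log(N / df); the paper only says "tf*idf".
        vectors[pair] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vectors

def cosine(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def complete_linkage(vectors, threshold):
    """Agglomerative clustering; stops when no two clusters exceed the threshold."""
    clusters = [[pair] for pair in vectors]

    def link(a, b):
        # Complete linkage: cluster similarity is that of the least similar members.
        return min(cosine(vectors[x], vectors[y]) for x in a for y in b)

    while len(clusters) > 1:
        best, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best <= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

contexts = {
    ("A", "B"): ["offer", "buy", "acquisit"],
    ("E", "F"): ["acquisit", "buy"],
    ("C", "D"): ["parent", "subsidiari"],
    ("G", "H"): ["subsidiari", "unit"],
}
clusters = complete_linkage(tfidf_vectors(contexts), threshold=1e-6)
print(clusters)
```

With a threshold just above zero, two clusters are merged only if every member of one shares at least one weighted context word with every member of the other, which is the property complete linkage is chosen for here.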
In order to filter out the phrases which do not express the relation, we applied two constraints:

[Phrase duplication constraint:] A phrase must be shared by at least two NE pairs in a cluster.
[Common word constraint:] A phrase must include one of the frequent common words in a cluster.

The phrase duplication constraint requires a phrase to have appeared with multiple NE pairs in the same cluster. It is intended to delete phrases which appear accidentally or are specific to a particular NE pair. In the common word constraint, we rely on the idea that words appearing frequently in the cluster are relevant to the relation of the cluster, so if a phrase contains one or more such words, the phrase is considered to express the relation.

4 Experiment

We report on our experiment in two successive stages. The first stage was relation discovery and the second stage was paraphrase acquisition. We conducted the experiment with one year of The New York Times (1995) as our corpus to verify the method.

4.1 Relation discovery

First, the frequent NE pairs are found, and the NE pairs along with their intervening words are extracted and clustered. In order to evaluate the result, we analyzed all the extracted NE pair instances manually and identified the relations for two different domains. One was the PERSON-GPE domain, in which 177 distinct NE pairs were obtained and manually classified into 38 relations. The other was the COMPANY-COMPANY domain, in which we got 65 distinct NE pairs and manually classified them into 10 relations.

We evaluated automatically extracted clusters consisting of two or more pairs. For each cluster, the most frequent relation represents the relation of the cluster. For example, if a cluster contains seven NE pairs of relation A and three NE pairs of relation B, the cluster is labeled as A.
When the relation of an NE pair instance is the same as the label of the cluster, it is counted as correct; the correct pair count, N_correct, is defined as the total number of correct pairs in all clusters. Other NE pairs in the cluster are counted as incorrect; the incorrect pair count, N_incorrect, is likewise defined as the total number of incorrect pairs in all clusters. We evaluate the clusters based on Recall, Precision and F-measure. The definitions of these measures are as follows.

[Recall (R)] How many correct pairs are detected out of all the key pairs? The key pair count, N_key, is defined as the total number of pairs manually classified in clusters of two or more pairs.
  Recall = N_correct / N_key
[Precision (P)] How many correct pairs are detected among the pairs clustered automatically?
  Precision = N_correct / (N_correct + N_incorrect)
[F-measure (F)] F-measure is defined as a combination of recall and precision according to the following formula:
  F-measure = 2 * Recall * Precision / (Recall + Precision)

These values vary depending on the threshold of cosine similarity. We fixed the cosine threshold at a single value just above 0 for both domains, which gives almost maximum F values for both domains. This setting does not require parameter optimization and we believe it works for other domains as well, because it means that all members of a cluster have to have at least one word in common with the other members of the same cluster. We got 34 clusters in the PER-GPE domain and 15 clusters in the COM-COM domain. Table 1 shows the result in both domains. We achieved an F-measure of 80 in the PER-GPE domain and 75 in the COM-COM domain.

[Table 1. Result of relation clustering. Columns: Domain, Prec., Recall, F; rows: PER-GPE, COM-COM. The cell values are missing from this transcription.]

4.2 Paraphrase acquisition

In the second stage, we acquire paraphrases from the clusters of the same relation. Although we obtained some meaningful relations in smaller clusters, we focus on the larger clusters, those with more than 4 members. We found that all large clusters have meaningful major relations and that the common words in those clusters accurately represented the relations. The large clusters represent the President, Senator, Prime Minister, Governor, Secretary, Republican and Coach relations in the PER-GPE domain, and the M&A, Parent and Alliance relations in the COM-COM domain.
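The clustering measures defined in Section 4.1 can be made concrete with a short Python sketch. The function name `evaluate`, the data layout and the toy labels are hypothetical; the sketch follows the stated definitions (majority relation as cluster label, only clusters of two or more pairs evaluated, N_key counted over manually defined relations with two or more pairs) and assumes at least one cluster of that size exists.

```python
from collections import Counter

def evaluate(clusters, gold):
    """clusters: list of lists of NE pairs; gold: dict NE pair -> manual relation."""
    n_correct = n_incorrect = 0
    for cluster in clusters:
        if len(cluster) < 2:
            continue  # only clusters of two or more pairs are evaluated
        labels = Counter(gold[pair] for pair in cluster)
        majority_count = labels.most_common(1)[0][1]  # size of the majority relation
        n_correct += majority_count
        n_incorrect += len(cluster) - majority_count
    # N_key: pairs whose manually assigned relation has two or more members.
    relation_sizes = Counter(gold.values())
    n_key = sum(size for size in relation_sizes.values() if size >= 2)
    precision = n_correct / (n_correct + n_incorrect)
    recall = n_correct / n_key
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

gold = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "B", "p6": "C"}
clusters = [["p1", "p2", "p3"], ["p4", "p5"], ["p6"]]
precision, recall, f_measure = evaluate(clusters, gold)
print(precision, recall, f_measure)
```

In this toy case the first cluster is labeled A (two of three pairs), so p3 counts as incorrect; four of the five clustered pairs are correct and five pairs belong to relations with at least two members, so precision and recall are both 4/5.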
We made a reference data set of paraphrases by looking at the phrases in each cluster for both domains. We eliminated from the evaluation the phrases that occur only once and the phrases which consist only of symbols and stop words. With respect to the major relation, each phrase is categorized into one of the following 4 classes. Table 2 shows the distribution of the phrases.

[Class 1:] Phrases which represent the major relation (i.e. strict paraphrases)
[Class 2:] Phrases which almost represent the major relation but include extra words (i.e. more restrictive relations)
[Class 3:] Phrases which suggest a broader meaning than just the major relation (i.e. more general relations)
[Class 4:] Phrases which cannot be regarded as representing the major relation (i.e. others)

[Table 2. Reference data of phrase classes. Columns: phrase class, total; rows: PER-GPE, COM-COM. The cell values are missing from this transcription.]

[Table 3. Evaluation result for paraphrase discovery. Rows: P-G, C+C; column groups: Baseline, Phrase duplication, Common word, Phrase+Common, each with R, P and F. The cell values are missing from this transcription.]

Then, we evaluated the result of the paraphrase acquisition experiment for the PER-GPE domain and the COM-COM domain. There are three criteria for setting the key phrases: 1) those in Class 1 (strict paraphrases), 2) those in Class 1 plus Class 2, and 3) those in Class 1, Class 2 plus Class 3 (loose paraphrases). The loose paraphrases could be useful in an IE application. Even though the phrases are not interchangeable in general, they can be used to extract information once the task is specific. The evaluation metrics are the usual Recall, Precision and F-measure.

Table 3 shows the evaluation results using different constraints (i.e. no constraint, the phrase duplication constraint, the common word constraint and the combined constraints). In the combined constraints, phrases which satisfy either constraint are kept, rather than only those satisfying both (i.e. disjunction, rather than conjunction). In the common word constraint, we select the phrases for which the sum of the relative frequencies for each common word was above 0.4. The recall is calculated relative to the case of no constraint (baseline), as we are comparing the phrase sets among the phrases in the baseline, so the recall is 100% for the baseline experiment. However, the precision for the baseline is low because the reference data included a lot of irrelevant phrases. The aim of the two constraints is to push the precision higher while keeping the recall high.

The best result is obtained with the common word constraint in the PER-GPE domain, and with the combined constraints in the COM-COM domain. In general, the common word constraint improves the precision more than the phrase duplication constraint does. This means that the paraphrases in the clusters are not shared by different NE pair instances so much, even though the paraphrases share some words in common. There is a greater variety of phrases in the COM-COM domain than in the PER-GPE domain.
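The two selection constraints of Section 3.5 and their disjunctive combination can be sketched as follows. The paper does not spell out how the "relative frequency" of a common word is computed, so this sketch assumes it is the share of NE pairs in the cluster whose phrases contain the word; the stop word list, the function names and the toy cluster are also hypothetical, and only the 0.4 threshold and the "either constraint" rule come from the text.

```python
from collections import defaultdict

# Toy stop word list (an assumption); the placeholders A and B are skipped too.
STOP_WORDS = {"a", "b", "is", "to", "of", "in", "with", "would", "the", "'s"}

def content_words(phrase):
    return {w for w in phrase.lower().split() if w not in STOP_WORDS}

def select_phrases(cluster, threshold=0.4):
    """cluster: dict NE pair -> set of phrases seen with that pair."""
    pairs_with_phrase = defaultdict(set)
    pairs_with_word = defaultdict(set)
    for pair, phrases in cluster.items():
        for phrase in phrases:
            pairs_with_phrase[phrase].add(pair)
            for word in content_words(phrase):
                pairs_with_word[word].add(pair)
    n_pairs = len(cluster)
    selected = set()
    for phrase, pairs in pairs_with_phrase.items():
        # Phrase duplication constraint: shared by at least two NE pairs.
        duplicated = len(pairs) >= 2
        # Common word constraint (assumed reading): summed share of NE pairs
        # containing each content word of the phrase must exceed the threshold.
        score = sum(len(pairs_with_word[w]) / n_pairs for w in content_words(phrase))
        if duplicated or score > threshold:  # disjunction of the two constraints
            selected.add(phrase)
    return selected

cluster = {
    ("A1", "B1"): {"A is offering to buy B", "A is discussing with B"},
    ("A2", "B2"): {"A would buy B", "A 's acquisition of B"},
    ("A3", "B3"): {"A 's acquisition of B", "A 's interest in B"},
}
selected = select_phrases(cluster)
print(sorted(selected))
```

In this toy cluster "buy" and "acquisition" each occur with two of the three pairs, so phrases containing them pass the common word test, while "A is discussing with B" and "A 's interest in B" are dropped because their only content word occurs with a single pair and they are not duplicated.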
In the PER-GPE domain, there are a rather small number of typical phrases for the relation (e.g., "A is the President of B"). We believe that the PER-GPE domain contains more static relations, compared with the COM-COM domain, which contains more event relations. This assumption is also suggested by the result that the phrase duplication constraint works better in the PER-GPE domain.

Table 4 shows some examples of successfully acquired paraphrases for the M&A and Parent relations in the COM-COM domain using the combined constraints. These phrases are paraphrases and would be useful for applications like Information Retrieval, Question Answering or Information Extraction.

[Table 4. Examples of discovered paraphrases]
President:
  A, the president of B
  B's new President A
  B's newly elected President, A
  A becomes president of B
  B under President A
M&A:
  A bought B
  A has agreed to buy B
  A, which is buying B
  A's proposed acquisition of B
  A's acquisition of B
  A's agreement to buy B
  A's purchase of B
  A bid for B
  A's takeover of B
  A merger with B
  A succeeded in buy B
  B, which was acquired by A
  B would become a subsidiary of A
  B agreed to be bought by A
Parent:
  A, a unit of the B
  A, owned by B
  A' parent, B
  B, the parent company of A
  B, hold company for A
  B, the company that own A

5 Discussion

In this section, we discuss several issues regarding the proposed method.

Error Analysis

We analyze the errors which lower the precision (false alarms), a problem primarily in the PER-GPE domain. This analysis was done for the data using the combined constraints. We categorized the errors into the following four types. The distribution of the errors is shown in Table 5.

[Error 1:] Phrase contains two different phrases
[Error 2:] Relation discovery error
[Error 3:] Relations dependent on context
[Error 4:] Other errors

[Table 5. Error distribution. Columns: Error, P-G, C-C. The cell values are missing from this transcription.]

The most severe error type (Error 1) involves phrases which actually contain two different phrases. An example is "visited France (GPE), when President Chirac (PERSON) invite the world leaders". Because France and Chirac form a frequently co-occurring NE pair and the phrase (actually just a sequence of words) "GPE, when President PERSON" satisfies the common word constraint, it was taken as a paraphrase candidate. This kind of error lowers the precision, but we believe that if we use a parser to find the boundaries of phrases, this error might be eliminated.

Error 2 involves an example like "U.S. (GPE) Vice President Al Gore (PERSON)". As its context contains the word "President", the NE pair is regarded as an instance of the president relation. This could be solved by using frequent multi-word terms as keywords, but this remains future work.

An example of Error 3 is the phrase "Tommy Thompson (PERSON), a Republican from Wisconsin (GPE)". Tommy Thompson is not a senator or a representative, but a governor. When the sentence appears in a context that discusses the views of different governors (i.e. it is obvious from the context that he is a governor), it does not mention "Governor" explicitly. So the phrase can be a paraphrase of the governor relation in such a context, but not always. We don't have a good idea for solving this kind of error.

Limitation and Future Direction

Our method has some limitations. We set several frequency thresholds, so we can't find less frequent relations between NE pairs and can't find paraphrases for such relations. However, we think that this limitation could be addressed by two approaches. One approach is to increase the amount of text. We used only a one-year corpus for this experiment, but much more text is available, e.g.
newspaper corpora covering more than 10 years, or much larger corpora of Web text. If we can use such corpora, hopefully the sparseness problem will be diminished. The other approach is to combine bootstrapping methods (Brin 98; Agichtein and Gravano 00) with our relation discovery stage. We would first find reliable paraphrases using frequent instances, and then use the obtained knowledge to find less frequent instances.

Unsupervised Methods

The proposed method is fully unsupervised. When we look back over the last decade, there have been great advances in many fields of NLP using supervised machine learning. These include corpus-based POS taggers, NE taggers and treebank-based parsers. We believe that this was possible because those tasks can be decomposed into simple categorization tasks, and the amount of training text required is small enough to be prepared with reasonable time and effort. However, most serious NLP applications require a higher level of knowledge, in particular semantic knowledge. We believe that this problem can't be solved as a small categorization task. So, recently we have observed an increasing focus on discovering semantic knowledge from untagged corpora, for example (Hearst 92; Riloff 96; Sudo et al. 03). The work in this paper aims at the same objective, which is to find useful semantic knowledge from untagged corpora using unsupervised methods. As we are fortunate to be able to use enormous corpora, which was not possible 10 years ago, we believe this will be a fruitful direction for investing our efforts to advance NLP technologies.

6 Conclusion

In this paper, we proposed an unsupervised method to discover paraphrases via relation discovery. The basic idea was, first, to discover the relation between named entities by clustering their contexts, and then to select phrases expressing a major relation of the cluster by using the phrase duplication constraint and the common word constraint.
Our experiments with one year of newspaper text reveal that we were able to discover a variety of paraphrases with high precision and high recall through the phrase selection constraints as well as the relation discovery process.

References

Agichtein, Eugene and Gravano, Luis. 2000. Snowball: Extracting relations from large plain-text collections. In Proc. of the 5th ACM International Conference on Digital Libraries (ACM DL00).

Barzilay, Regina and McKeown, Kathleen. 2001. Extracting paraphrases from a parallel corpus. In Proc. of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-EACL01).

Brin, Sergey. 1998. Extracting patterns and relations from the world wide web. In Proc. of the WebDB Workshop at the 6th International Conference on Extending Database Technology (WebDB98).

Hasegawa, Takaaki, Sekine, Satoshi and Grishman, Ralph. 2004. Discovering Relations among Named Entities from Large Corpora. In Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL04).

Hearst, Marti A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of the Fourteenth International Conference on Computational Linguistics (COLING92).

Lin, Dekang and Pantel, Patrick. 2001. DIRT - discovery of inference rules from text. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD01).

Ravichandran, Deepak and Hovy, Eduard. 2002. Learning Surface Text Patterns for a Question Answering System. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL02).

Riloff, E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proc. of the 13th National Conference on Artificial Intelligence (AAAI96).

Sekine, Satoshi, Sudo, Kiyoshi and Nobata, Chikashi. 2002. Extended Named Entity Hierarchy. In Proc. of the Third International Conference on Language Resources and Evaluation (LREC02).

Shinyama, Yusuke and Sekine, Satoshi. 2003. Paraphrase acquisition for information extraction. In Proc. of the Second International Workshop on Paraphrasing (IWP03).

Sudo, Kiyoshi, Sekine, Satoshi and Grishman, Ralph. 2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL03).


More information

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Topics in Computational Linguistics Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Regina Barzilay and Lillian Lee Presented By: Mohammad Saif Department of Computer

More information

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction

More information

Mining Opinion Features in Customer Reviews

Mining Opinion Features in Customer Reviews Mining Opinion Features in Customer Reviews Minqing Hu and Bing Liu Department of Computer Science University of Illinois at Chicago 851 South Morgan Street Chicago, IL 60607-7053 {mhu1, liub}@cs.uic.edu

More information

Mining Text Data: An Introduction

Mining Text Data: An Introduction Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Resolving Common Analytical Tasks in Text Databases

Resolving Common Analytical Tasks in Text Databases Resolving Common Analytical Tasks in Text Databases The work is funded by the Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD15010B. Database Systems and Text-based Information

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline! Introduction to relation extraction and its applications! Motivation of domain adaptation in big text data analytics! Solutions!

More information
