Constructing Dictionaries for Named Entity Recognition on Specific Domains from the Web
Keiji Shinzato 1, Satoshi Sekine 2, Naoki Yoshinaga 3, and Kentaro Torisawa 4

1 Graduate School of Informatics, Kyoto University
2 Computer Science Department, New York University
3 Japan Society for the Promotion of Science
4 Graduate School of Information Science, Japan Advanced Institute of Science and Technology
skeiji@nlp.kuee.kyoto-u.ac.jp, sekine@cs.nyu.edu, {n-yoshi,torisawa}@jaist.ac.jp

Abstract. This paper describes an automatic dictionary construction method for Named Entity Recognition (NER) on specific domains such as restaurant guides. NER is the first step toward Information Extraction (IE), and we believe that such a dictionary construction method is crucial for developing IE systems for a wide range of domains on the World Wide Web (WWW). One serious problem in NER on specific domains is that performance depends heavily on the size of the training corpus, which requires much human labor to develop. We attempt to improve NER performance by using dictionaries automatically constructed from HTML documents instead of preparing a large annotated corpus. Our dictionary construction method exploits the cooccurrence strength of two expressions in HTML itemizations, calculated from average mutual information. Experimental results show that the constructed dictionaries improved NER performance on a restaurant guide domain. Our method increased the F1-measure by 2.3 without any additional manual labor.

1 Introduction

Methodologies for selecting necessary information from the huge number of documents on the World Wide Web (WWW) and providing it to a user in a concise manner are very important these days. Although Information Extraction (IE) can be regarded as one such methodology, the diversity of the domains found on the WWW does not allow us to apply existing IE methods to the WWW.
A major problem is that an existing Named Entity (NE) tagger, which is a key component of IE, cannot be applied to a wide range of domains on the WWW, and developing a new NE tagger for a new domain is a time-consuming task. A variety of methods have been proposed so far for NE Recognition (NER) [1-4]. These studies targeted a rather small number of predefined NE categories for competitions [5, 6], and achieved high accuracies by relying on
large annotated corpora prepared for the competitions. However, if one tries to develop NE taggers for a new domain with new NE categories, the cost of preparing annotated corpora for the categories is quite large, and it is still difficult to achieve high performance without much labor for annotating a large number of documents.

Fig. 1. A procedure flowchart for constructing domain-specific dictionaries

One possible way to solve this problem is to define general-purpose fine-grained NE categories and develop large annotated corpora for them. Sekine et al. have tried to define 200 fine-grained NE categories, including PRODUCT NAME and CONFERENCE, and are developing NE taggers by using annotated corpora [7]. Although their set of NE categories may look like a sufficiently detailed classification, it is still too coarse to conduct IE on specific domains such as the restaurant domain, which is addressed in this work. For instance, Sekine's categories do not contain names of dishes or ingredients. Another method of developing NE taggers for new domains is to employ existing generic handcrafted dictionaries, such as WordNet [8]. Nevertheless, handcrafted dictionaries often fail to cover domain-specific expressions, such as names of dishes and restaurants.

The aim of this work is to improve NER performance for a new domain at small cost by using a WWW-based automatic dictionary construction method for the NE categories of the domain. In other words, we try to achieve higher NER performance by using dictionaries automatically constructed from the WWW instead of enlarging an annotated corpus, which is costly to develop. (Note that the use of a small annotated corpus is unavoidable anyway. The point is that we can achieve higher accuracy without enlarging the corpus.)
As a basic method for NER, we follow an existing machine learning based approach. The major contribution of this work is a method for automatically constructing dictionaries for specific domains and using them in NER. Our dictionary construction algorithm uses the NEs in the annotated corpus as seeds, which are expanded by using a large number of HTML documents downloaded from the WWW. More specifically, our method uses itemizations in HTML documents to obtain expressions that are semantically similar to the seeds, as depicted in Figure 1. A similar idea has been proposed for hyponymy relation acquisition [9]. One difference is that we consider the frequency of cooccurrences in itemizations and try to clean up erroneous dictionary entries. We show that NER performance on the restaurant domain can be improved by using the automatically constructed dictionaries.
ADDRESS (248), AREA (251), ATMOSPHERE (364), BGM (26), BUSINESS STYLE (27), CARD (223), CHEF (76), CHILD CARE (39), CLEANNESS (16), CUISINE (307), C DAY (31), C EVALUATION (140), C NUMBER (107), C PROFILE (47), C PURPOSE (193), DAY (397), DISH MATERIAL (974), DISH QUALITY (1,188), DISH (2,064), DISTANCE (212), DRESS (2), (13), EMPLOYEE (103), ENTERTAINMENT (17), EQUIPMENT (211), EXAMPLE (2), EXTERIOR (43), FAX (51), FORM (260), HANDICAPPED CARE (0), HISTORY (81), HOW TO EAT (261), IF POSSIBLE (1), ILLUMINATION (23), INTERIOR (69), LIKE (2), LINE (162), LOCATION (68), MANAGER (99), MEDIA (15), NAME (736), NEAR FACILITY (79), NOT (124), OK (0), OR (0), OTHER SPECIALITY (24), PARKING (35), PET CARE (13), POPULARITY (90), PRICE (474), QUIETNESS (7), REGULAR CUSTOMER (182), RESERVATION (55), SERVICE OTHER (159), SMOKING CARE (5), SPACE (43), STATION (230), STOCK (156), TABLES (63), TABLEWARE (22), TEL (247), TIME (373), URL (47), VIEW (2)
Fig. 2. NE categories for the restaurant domain (# of instances)

In the remainder of this paper, Section 2 describes existing machine learning based Japanese NER methods and our small annotated corpus for the restaurant domain. Section 3 explains an automatic dictionary construction method using HTML documents. Section 4 gives an overview of our NE tagger, which utilizes the automatically constructed dictionaries. Section 5 gives experimental results.

2 Background

2.1 Machine Learning Based Japanese Named Entity Recognition

Several machine learning techniques, such as Support Vector Machines (SVMs) [10] and the Maximum Entropy model [11], have been employed for IREX [6], a Japanese NER competition [3, 12, 13, 14]. The SVM-based approach, originally proposed in [3], showed the best performance [14]. We followed the method proposed by Yamada et al.
[3], briefly overviewed below, in implementing our NE tagger, and augmented it with automatically constructed dictionaries. Yamada's method decomposes a given sentence into a sequence of words by using an existing morphological analyzer, and then deterministically classifies subsequences of words into appropriate NE categories from the end of the sentence to the beginning. For the annotation of NE categories, Yamada et al. employed IOB2 [15] as a chunk tag set for the eight categories defined in the IREX competition. The feature set includes the word itself, part-of-speech tags, character types, and the preceding and succeeding two words. The NE tags of succeeding words are also used, since the tagger works from the end of the sentence and has already determined them. See [3] for details.

As for the use of dictionaries in machine learning based NER, NTT goitaikei [16], a manually tailored large-scale generic dictionary, has already been employed for the IREX competition in some studies [13, 14]. The improvements were around 1.0 in F1-measure, which is less than the improvement achieved by our method.

2.2 Restaurant Corpus

Although our aim is to achieve high performance in NER without a large annotated corpus, the use of a small annotated corpus is unavoidable. The small corpus
is used not only for training classifiers for NER but also for collecting seed expressions for automatically constructing dictionaries. This section describes the corpus we used in this work.

(A) An <DISH><DISH MATERIAL> apple </DISH MATERIAL> tart </DISH> was <DISH QUALITY> good taste </DISH QUALITY>.
(B) <RESERVATION>If you are going to visit our restaurant, please make reservations so we can give you better service. We have a webpage for your convenience, or you can contact us by telephone or fax.</RESERVATION>
Fig. 3. Examples of annotated sentences in the restaurant corpus

Since our main objective is to extract from a specific domain (restaurants, in this paper) information that is useful to people, we predefined 64 NE categories that roughly correspond to the aspects of restaurants addressed by frequently asked questions about restaurants. We collected inquiries about restaurants posted on Internet bulletin boards and defined a set of NE categories. The defined NE categories are listed in Figure 2. Note that most of the categories in the figure have not been considered in existing NE category sets [5-7].

To develop an NE tagger for the restaurant domain, we collected documents that describe restaurants and annotated them with the NE tags. We call this corpus the restaurant corpus. We simply collected names of restaurants located in Jiyugaoka (one of the popular shopping areas in Tokyo) from a certain web site. We gave each restaurant name as a search query to a commercial search engine to gather HTML documents that describe the restaurant. We then manually extracted sentences that describe the restaurant from the gathered HTML documents. We obtained 745 documents including 6,080 sentences and 118 restaurant names. One person spent six weeks annotating the documents with tags corresponding to the 64 NE categories for the restaurant domain.
Some examples of annotated sentences are shown in Figure 3.

An important point is that the annotated corpus for IREX consists of 1,174 documents, including about 11,000 sentences [6], and that the restaurant corpus is smaller than this. Note that IREX assumed only eight NE categories. Considering that our task has many finer-grained NE categories and that a data sparseness problem is more likely to occur, achieving high accuracy in our task is expected to be more difficult than in the IREX competition. This is the motivation behind the introduction of automatically constructed dictionaries for our NE categories.

3 Automatic Dictionary Construction from HTML documents

We automatically constructed dictionaries from HTML documents according to the following hypotheses. Hypothesis 1 is the same as the one proposed in [9], while Hypothesis 2 is newly introduced in this study.
Hypothesis 1: Expressions included in identical itemizations are likely to be semantically similar to each other.
Hypothesis 2: Expressions that frequently cooccur with many instances of an NE category in itemizations are likely to be proper dictionary entries of the category.

Our dictionary construction procedure consists of three steps.

Step 1: Extract expressions annotated as instances of each NE category from the annotated corpus. Note that the extracted expressions include not only single words but also multiword expressions, and even sequences of sentences such as the one shown in Figure 3 (B).
Step 2: Extract sets of expressions listed in each itemization from HTML documents. We call the extracted set an Itemized Expression Set (IES).
Step 3: For each NE category, select from the IESs extracted in Step 2 those expressions that cooccur in the IESs with many instances of that NE category extracted in Step 1, and regard them as dictionary entries.

In Step 3, the procedure tries to select from the IESs extracted in Step 2 expressions that can be regarded as proper dictionary entries of an NE category. Steps 2 and 3 are explained in detail below.

3.1 Step 2: Extracting IESs

We follow the approach described in [9] to extract IESs from HTML documents. First, we associate each expression in an HTML document with a path that specifies both the HTML tags enclosing the expression and their order.

<UL><LI>LOVE Food!</LI>
<OL><LI>Canlis Steak</LI>
<LI>Sushi</LI>
<LI>Pan Fried Dumpling</LI>
<LI>Chocolate Cake</LI>
</OL></UL>
Fig. 4. Sample HTML code of an itemization.

Consider the HTML document in Figure 4. The expression "LOVE Food!" is enclosed by the tags <LI>, </LI> and <UL>, </UL>. If we sort these tags by nesting order, we obtain a path (UL, LI) that specifies the location of the expression. We write ⟨(UL, LI), LOVE Food!⟩
if (UL, LI) is a path for the expression "LOVE Food!". We then obtain the following paths for the expressions in the document.

⟨(UL, LI), LOVE Food!⟩,
⟨(UL, OL, LI), Canlis Steak⟩,
⟨(UL, OL, LI), Sushi⟩,
⟨(UL, OL, LI), Pan Fried Dumpling⟩,
⟨(UL, OL, LI), Chocolate Cake⟩

Our method extracts a set of expressions associated with the same path as an IES. In the above example, we obtain the IES {Canlis Steak, Sushi, Pan Fried Dumpling, Chocolate Cake}.
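The path-based grouping of Section 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work or in [9]: the names `PathCollector`, `extract_ies` and `min_size` are hypothetical, expressions are grouped by path over the whole document, and groups with a single expression are dropped because the example above reports only the four-item set as an IES.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Records a (path, expression) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.pairs = []  # collected (path, expression) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append((tuple(self.stack), text))


def extract_ies(html, min_size=2):
    """Group expressions by path; min_size is an assumption of this sketch."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    groups = defaultdict(list)
    for path, text in collector.pairs:
        groups[path].append(text)
    return {p: items for p, items in groups.items() if len(items) >= min_size}


SAMPLE = """<UL><LI>LOVE Food!</LI>
<OL><LI>Canlis Steak</LI>
<LI>Sushi</LI>
<LI>Pan Fried Dumpling</LI>
<LI>Chocolate Cake</LI>
</OL></UL>"""

if __name__ == "__main__":
    for path, items in extract_ies(SAMPLE).items():
        print(path, items)
```

On the HTML of Figure 4, the only group that survives is the one for the path (UL, OL, LI), which matches the IES given in the text.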
3.2 Step 3: Selecting Dictionary Entities Based on Average Mutual Information

Suppose that we construct a dictionary for the DISH category. We refer to the set of DISH category instances extracted from the restaurant corpus in Step 1 as I_DISH. The procedure collects, from all extracted IESs, the IESs that include at least one element of I_DISH. We denote the set of expressions included in the collected IESs as E_DISH. Note that we discarded from E_DISH expressions included in only one IES and expressions that cooccurred with only one element of I_DISH, since such expressions are less likely to be proper dictionary entries.

Although we could regard each element of E_DISH as an entry in the dictionary of the DISH category, such a dictionary would erroneously include a large number of expressions that are not dish names. We thus filter out such expressions by using a score, which is the average mutual information between each expression in E_DISH and the instances in I_DISH. This score reflects Hypothesis 2. We sort the E_DISH entries according to the scores, and use only the top N entries in NER. The score for an expression e ∈ E_DISH is defined as follows.

\[ \mathrm{score}_{\mathrm{DISH}}(e) = \sum_{i \in I_{\mathrm{DISH}}} P(e, i) \log_2 \frac{P(e, i)}{P(e)\,P(i)} \]

where P(x) is the probability of observing expression x in all extracted IESs gathered in Step 2, and P(x, y) is the probability of observing expressions x and y in the same IES. The score gives a large value to expressions that frequently cooccur with many instances of the NE category in I_DISH and that infrequently cooccur with expressions other than the instances.

A problem with the above score is that it tends to give large values to expressions that frequently appear in itemizations. This has an undesirable effect on the quality of resulting dictionaries.
Although we prefer to include specific dish descriptions such as "baked cheesecake" in the dictionary, the score tends to favor more generic dish names such as "cheesecake", and the top entries tend to include only generic single words, which are often inappropriate as dish names. This is because a single word tends to be more frequent than a multiword expression, and our score is likely to give large values to single words. We therefore increase the score of each multiword expression by using the score of its head (e.g., "cheesecake" in the case of "baked cheesecake"). We finally used the following score:

\[ \mathrm{score}'_{\mathrm{DISH}}(e) = \mathrm{score}_{\mathrm{DISH}}(e) + \mathrm{score}_{\mathrm{DISH}}(e_{\mathrm{head}}) \]

where e_head is the head of e. In Japanese, the head of a multiword expression e is usually its suffix substring. We thus collected the other expressions in E_DISH that were included in e as its suffix substring, and regarded these expressions as candidates for the head of e. We then assumed that the longest of these expressions was the head of e. When e did not include any other expression in E_DISH as its suffix, we used 0 as the value of score_DISH(e_head).
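The scoring of Section 3.2 can be sketched as follows. This is a minimal illustration, not the code used in the experiments. It assumes that P(x) and P(x, y) are estimated as the fraction of IESs that contain x, or both x and y; the text does not state the estimator. The function names `ami_scores` and `boost_with_head` and the toy data are hypothetical, the pre-filtering of E_DISH described above is omitted, and skipping the term where the seed equals the candidate is also an assumption of this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def ami_scores(ies_list, seeds):
    """Average-mutual-information score of each candidate against the seeds."""
    seeds = set(seeds)
    n = len(ies_list)
    single = Counter()  # number of IESs containing x
    pair = Counter()    # number of IESs containing both x and y
    for ies in ies_list:
        items = set(ies)
        single.update(items)
        pair.update(frozenset(p) for p in combinations(items, 2))

    # Candidates: expressions in IESs that contain at least one seed.
    candidates = {e for ies in ies_list if seeds & set(ies) for e in ies}
    scores = {}
    for e in candidates:
        total = 0.0
        for i in seeds:
            if i == e:
                continue  # assumption: no self-cooccurrence term
            p_joint = pair[frozenset((e, i))] / n
            if p_joint > 0:  # a zero joint probability contributes nothing
                p_e, p_i = single[e] / n, single[i] / n
                total += p_joint * math.log2(p_joint / (p_e * p_i))
        scores[e] = total
    return scores


def boost_with_head(scores):
    """Add the score of the head: the longest other candidate that is a suffix."""
    boosted = {}
    for e, s in scores.items():
        heads = [h for h in scores if h != e and e.endswith(h)]
        head = max(heads, key=len, default=None)
        boosted[e] = s + (scores[head] if head is not None else 0.0)
    return boosted


# Toy IESs and seeds, invented for illustration only.
TOY_IES = [
    ["cheesecake", "baked cheesecake", "tiramisu"],
    ["cheesecake", "tiramisu", "pudding"],
    ["baked cheesecake", "pudding"],
    ["train", "bus"],
]
TOY_SEEDS = {"tiramisu", "pudding"}

if __name__ == "__main__":
    base = ami_scores(TOY_IES, TOY_SEEDS)
    for expr, value in sorted(boost_with_head(base).items()):
        print(f"{expr}: {value:.3f}")
```

In the toy data, "baked cheesecake" receives the score of its head "cheesecake" on top of its own, which is the effect the modified score is meant to have on multiword expressions.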
4 Named Entity Taggers for the Restaurant Domain

We now describe our NE tagger for the restaurant domain. As mentioned before, we basically follow Yamada's method in implementing our NE tagger. Our NE tagger first decomposes a given sentence into a word sequence by using MeCab. Next, for each word, it obtains a feature vector including the word itself, part-of-speech tags, character types defined in [3], NE tags of the two succeeding words, and the preceding and succeeding two words. Then, the tagger sets the feature values concerning the dictionary entries automatically constructed by the method described in Section 3, as follows. Basically, it gives a chunk tag to all words in subsequences that match dictionary entries, according to the method proposed in [13]. As chunk tags, we employed a Start/End tag model [12]. For example, assume that the sentence "I ate a Kobe hamburger steak as a light meal." is given as input and that "Kobe hamburger steak" and "hamburger" are included in the dictionary for the DISH category; the features are set as below.

features      I   ate   Kobe   Steak   Hamburger   as ...
DIC DISH-S
DIC DISH-B
DIC DISH-I
DIC DISH-E

Note that the feature DIC DISH-S means that a word is a single-word entry in the dictionary for the DISH category. The values of DIC DISH-B and DIC DISH-E indicate whether a word is the beginning or the end of a dictionary entry, respectively. DIC DISH-I is assigned to a word inside an entry other than its beginning and end.

The tagger gives the obtained feature vectors to an SVM and deterministically assigns tags according to the IOB2 scheme from the end of the sentence to its beginning. We chose TinySVM ( taku/software/tinysvm/) as an SVM implementation. We used the polynomial kernel of degree 1 provided in TinySVM, based on observations from experiments on the development set. Another important point is that, although Yamada et al.
employed a pairwise method for extending SVMs to multi-class classifiers, we employed a one-vs-rest method. According to [13], there is no significant difference between the performances of the two methods. In addition, the one-vs-rest method requires fewer classifiers than the pairwise method does. This is crucial for our NER because the number of categories is rather large.

5 Experiments

5.1 Setting

In our experiments, we disregarded the following tags from the restaurant corpus because they are difficult to recognize with current NER methods.
- NE tags annotated across a period (see Figure 3 (B)).
- NE tags representing logical conditions (e.g., NOT and OR).
- NE tags whose total frequency is less than 10.

After removing these tags, we conducted experiments on the remaining 53 tags and evaluated the performance of our NE taggers by 5-fold cross-validation on the restaurant corpus described in Section 2.2.

Table 1. Size of constructed dictionaries
NE categories | # of entities in a dictionary | # of instances in each training set
AREA 8, CARD CUISINE 7, DAY 12, DISH 35, DISH MATERIAL 27, FORM 3, LINE NAME 1, NEAR FACILITY 1, NEAR STATION
These numbers are the average numbers of instances and dictionary entities in each evaluation.

DISH: (chop suey), (fried shrimp), (powdered green tea), *(worcestershire sauce), *(kidney bean), (caramel), (rice), *(sugar), *(name of a sushi bar), *(egg), *(rush hour)
DISH MATERIAL: (carrot), (green pepper), (milk), (pepper), *(cooking oil), (soy sauce), (wheat), *(material), (lobster), (tofu), *(iced coffee)
Expressions starting with * are inappropriate entries.
Fig. 5. Examples of entries in constructed dictionaries

For constructing dictionaries, we downloaded HTML documents (103 GB with HTML tags) and extracted IESs including individual expressions by the method described in Section 3.1. We constructed dictionaries from these IESs for the 11 categories listed in Table 1. We selected these NE categories because their instances were likely to be noun phrases and because they frequently appeared in the restaurant corpus. For each NE category, our dictionary construction method can collect more than 10 times as many expressions as those annotated as its instances. In other words, our method can generate a large number of dictionary entries from the given instances of each NE category.
Some examples of dictionary entities are listed in Figure 5.

5.2 Contribution of Constructed Dictionaries

We investigated the contribution of the dictionaries automatically constructed from HTML documents. We checked the NER performance while increasing the size of the dictionary of each NE category in steps of 10%. Note that when the size
of a dictionary becomes larger, its coverage also becomes larger, but the number of inappropriate entries in the dictionary increases as well.

The performance of the NE taggers is shown in Table 2. This table shows the performance of the NE taggers without the dictionaries and with the top 10%, 20%, 30% and all dictionary entries (i.e., top 100%) in terms of precision, recall and F1-measure. Basically, each row shows the performance of the NE tagger on one NE category. The row AVERAGE refers to the average performance of the NE tagger only on the NE categories for which we constructed dictionaries. The row TOTAL is the average performance for all the NE categories (i.e., 53 categories), whether or not we prepared dictionaries for them.

Table 2. The performances of NE taggers by using different-sized dictionaries.
Columns: NE categories | # of NEs | Prec., Rec. and F1 for each of None, TOP 10%, TOP 20%, TOP 30% and TOP 100%
Rows: AREA, CARD, CUISINE, DAY, DISH 2, DISH M, FORM, LINE, NAME, FACILITY, STATION, AVERAGE 5, TOTAL 11,
DISH M, FACILITY and STATION correspond to DISH MATERIAL, NEAR FACILITY and NEAR STATION respectively.

The table shows that we successfully improved the performance of the NE taggers by using dictionary entries as features. When we used dictionary entries whose scores were in the top 20%, the performance of the NE taggers was 55.7 in F1-measure in the AVERAGE row. The improvement over the tagger without the dictionaries is 2.3 in F1-measure. In the TOTAL row, the maximum improvement is 1.0 in F1-measure. The improvement may not be so large, but for categories such as DISH and DISH M, the improvement ranges from 3.5 to 5.7.

Note that one may expect that the overall performance of NER can be improved by determining an optimal dictionary size for each category and by combining the classifiers that use dictionaries of the optimal sizes.
However, because the performance of an NE tagger for each category depends heavily on the NE taggers for the other categories, we cannot determine an optimal size for each dictionary independently. This means that even if we combine NE taggers with dictionaries of the sizes that performed best in our experiments (e.g., the NE tagger for CARD with the top 10% dictionary), this will not necessarily lead to better overall performance.
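For reference, the precision, recall and F1-measure used in Table 2 can be computed as sketched below. This is a generic illustration rather than the evaluation script used in the experiments: it assumes that a predicted NE counts as correct only when both its span and its category match the annotation exactly, a criterion the text does not spell out. The function name `precision_recall_f1`, the (start, end, category) tuples and the toy data are hypothetical. The values are fractions, whereas the paper reports F1 on a 0 to 100 scale.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of NE spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotation and tagger output, invented for illustration only.
# Each NE is (start word index, end word index, category).
GOLD = {(0, 2, "DISH"), (3, 4, "DISH QUALITY"), (5, 6, "PRICE")}
PREDICTED = {(0, 2, "DISH"), (3, 4, "PRICE")}

if __name__ == "__main__":
    p, r, f = precision_recall_f1(GOLD, PREDICTED)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

F1 is the harmonic mean of precision and recall, so a dictionary that raises recall by adding entries improves F1 only as long as the inappropriate entries it brings in do not lower precision by more.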
6 Conclusion

We proposed an automatic dictionary construction method for Named Entity Recognition (NER) on specific domains. The method expands seed expressions extracted from an annotated corpus by using itemizations in HTML documents. Through a series of experiments on a restaurant domain, we showed that the constructed dictionaries improved NER accuracy. The dictionaries increased the F1-measure by 2.3 without any additional manual labor, such as additional corpus annotation. We will apply our dictionary construction method to NER in other domains. In addition, we are going to evaluate the constructed dictionaries directly by hand. We will also compare dictionaries built by our method with those built by existing methods [17] in terms of their impact on NER performance.

References

1. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble: a high-performance learning name-finder. In: Proc. ANLP
2. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proc. EMNLP
3. Yamada, H., Kudoh, T., Matsumoto, Y.: Japanese named entity extraction using support vector machine. IPSJ Journal 43(1) (2002) (in Japanese)
4. Sekine, S., Nobata, C.: Definition, dictionaries and tagger for extended named entity hierarchy. In: Proc. LREC
5. Grishman, R., Sundheim, B.: Message understanding conference 6: A brief history. In: Proc. COLING
6. IREX Committee editor: IREX workshop. (1999)
7. Sekine, S., Sudo, K., Nobata, C.: Extended named entity hierarchy. In: Proc. LREC
8. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to wordnet: An on-line lexical database. In: Journal of Lexicography. (1990)
9. Shinzato, K., Torisawa, K.: Acquiring hyponymy relations from web documents. In: Proc. HLT-NAACL
10. Vapnik, V.: The Nature of Statistical Learning Theory. Springer (1995)
11. Berger, A.L., Pietra, S.A.D., Pietra, V.J.D.: A maximum entropy approach to natural language processing.
Computational Linguistics 22(1) (1996)
12. Uchimoto, K., Ma, Q., Murata, M., Ozaku, H., Utiyama, M., Isahara, H.: Named entity extraction based on a maximum entropy model and transformation rules. Natural Language Processing 7(2) (2000) (in Japanese)
13. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proc. HLT-NAACL
14. Nakano, K., Hirai, Y.: Japanese named entity extraction with bunsetsu features. IPSJ Journal 45(3) (2004) (in Japanese)
15. Tjong Kim Sang, E., Veenstra, J.: Representing text chunks. In: Proc. EACL
16. Ikehara, S., Masahiro, M., Satoshi, S., Akio, Y., Hiromi, N., Kentaro, O., Yoshihumi, O., Yoshihiko, H.: Nihongo Goi Taikei A Japanese Lexicon. Iwanami Syoten (1997)
17. Thelen, M., Riloff, E.: A bootstrapping method for learning semantic lexicons using extraction pattern context. In: Proc. EMNLP
More informationA Survey on Product Aspect Ranking Techniques
A Survey on Product Aspect Ranking Techniques Ancy. J. S, Nisha. J.R P.G. Scholar, Dept. of C.S.E., Marian Engineering College, Kerala University, Trivandrum, India. Asst. Professor, Dept. of C.S.E., Marian
More informationBoosting the Feature Space: Text Classification for Unstructured Data on the Web
Boosting the Feature Space: Text Classification for Unstructured Data on the Web Yang Song 1, Ding Zhou 1, Jian Huang 2, Isaac G. Councill 2, Hongyuan Zha 1,2, C. Lee Giles 1,2 1 Department of Computer
More informationExtraction of Hypernymy Information from Text
Extraction of Hypernymy Information from Text Erik Tjong Kim Sang, Katja Hofmann and Maarten de Rijke Abstract We present the results of three different studies in extracting hypernymy information from
More informationCustomizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
More informationAutomated Extraction of Vulnerability Information for Home Computer Security
Automated Extraction of Vulnerability Information for Home Computer Security Sachini Weerawardhana, Subhojeet Mukherjee, Indrajit Ray, and Adele Howe Computer Science Department, Colorado State University,
More informationAutomatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines
, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing
More informationETL Ensembles for Chunking, NER and SRL
ETL Ensembles for Chunking, NER and SRL Cícero N. dos Santos 1, Ruy L. Milidiú 2, Carlos E. M. Crestana 2, and Eraldo R. Fernandes 2,3 1 Mestrado em Informática Aplicada MIA Universidade de Fortaleza UNIFOR
More informationYou can eat healthy on any budget
You can eat healthy on any budget Is eating healthy food going to cost me more money? Eating healthy meals and snacks does not have to cost you more money. In fact, eating healthy can even save you money.
More informationNamed Entity Recognition Experiments on Turkish Texts
Named Entity Recognition Experiments on Dilek Küçük 1 and Adnan Yazıcı 2 1 TÜBİTAK - Uzay Institute, Ankara - Turkey dilek.kucuk@uzay.tubitak.gov.tr 2 Dept. of Computer Engineering, METU, Ankara - Turkey
More informationMining Opinion Features in Customer Reviews
Mining Opinion Features in Customer Reviews Minqing Hu and Bing Liu Department of Computer Science University of Illinois at Chicago 851 South Morgan Street Chicago, IL 60607-7053 {mhu1, liub}@cs.uic.edu
More informationArchitecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
More informationSelected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms
Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká hladka@ufal.mff.cuni.cz
More informationPoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
More informationEvaluation of Bayesian Spam Filter and SVM Spam Filter
Evaluation of Bayesian Spam Filter and SVM Spam Filter Ayahiko Niimi, Hirofumi Inomata, Masaki Miyamoto and Osamu Konishi School of Systems Information Science, Future University-Hakodate 116 2 Kamedanakano-cho,
More informationSINAI at WEPS-3: Online Reputation Management
SINAI at WEPS-3: Online Reputation Management M.A. García-Cumbreras, M. García-Vega F. Martínez-Santiago and J.M. Peréa-Ortega University of Jaén. Departamento de Informática Grupo Sistemas Inteligentes
More informationGet the most value from your surveys with text analysis
PASW Text Analytics for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That
More informationDomain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu
Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline! Introduction to relation extraction and its applications! Motivation of domain adaptation in big text data analytics! Solutions!
More informationExploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 125-132. Association for Computational Linguistics. Exploiting Strong Syntactic Heuristics
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationTerm extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
More informationSentiment analysis for news articles
Prashant Raina Sentiment analysis for news articles Wide range of applications in business and public policy Especially relevant given the popularity of online media Previous work Machine learning based
More informationData Deduplication in Slovak Corpora
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain
More informationClustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
More informationExtracting Opinions and Facts for Business Intelligence
Extracting Opinions and Facts for Business Intelligence Horacio Saggion, Adam Funk Department of Computer Science University of Sheffield Regent Court 211 Portobello Street Sheffield - S1 5DP {H.Saggion,A.Funk}@dcs.shef.ac.uk
More informationBridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project
Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project Ahmet Suerdem Istanbul Bilgi University; LSE Methodology Dept. Science in the media project is funded
More informationA Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA tafvizi@csail.mit.edu Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA mpacula@csail.mit.edu Abstract
More informationCombining Contextual Features for Word Sense Disambiguation
Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, July 2002, pp. 88-94. Association for Computational Linguistics. Combining
More informationRecognizing Medication related Entities in Hospital Discharge Summaries using Support Vector Machine
Recognizing Medication related Entities in Hospital Discharge Summaries using Support Vector Machine Son Doan and Hua Xu Department of Biomedical Informatics School of Medicine, Vanderbilt University Son.Doan@Vanderbilt.edu,
More informationWeb based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Jian Qu, Nguyen Le Minh, Akira Shimazu School of Information Science, JAIST Ishikawa, Japan 923-1292
More informationTesting Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
More informationTransition-Based Dependency Parsing with Long Distance Collocations
Transition-Based Dependency Parsing with Long Distance Collocations Chenxi Zhu, Xipeng Qiu (B), and Xuanjing Huang Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science,
More informationOnline Large-Margin Training of Dependency Parsers
Online Large-Margin Training of Dependency Parsers Ryan McDonald Koby Crammer Fernando Pereira Department of Computer and Information Science University of Pennsylvania Philadelphia, PA {ryantm,crammer,pereira}@cis.upenn.edu
More informationMEMBERSHIP LOCALIZATION WITHIN A WEB BASED JOIN FRAMEWORK
MEMBERSHIP LOCALIZATION WITHIN A WEB BASED JOIN FRAMEWORK 1 K. LALITHA, 2 M. KEERTHANA, 3 G. KALPANA, 4 S.T. SHWETHA, 5 M. GEETHA 1 Assistant Professor, Information Technology, Panimalar Engineering College,
More informationPhase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde
Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction
More informationSearch and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
More informationExploiting Collective Hidden Structures in Webpage Titles for Open Domain Entity Extraction
Exploiting Collective Hidden Structures in Webpage Titles for Open Domain Entity Extraction Wei Song, Shiqi Zhao, Chao Zhang, Hua Wu, Haifeng Wang, Lizhen Liu, Hanshi Wang College of Information Engineering,
More informationFOOD QUESTIONNAIRE RESULTS
FOOD QUESTIONNAIRE RESULTS QUESTION 1 How many meals do you usually eat every day? At what times do you eat your meals? STUDENTS ANSWERS 3-4 meals a day Breakfast 7.00 Lunch 12.00 Dinner- 16-20.00 Supper
More informationSchema documentation for types1.2.xsd
Generated with oxygen XML Editor Take care of the environment, print only if necessary! 8 february 2011 Table of Contents : ""...........................................................................................................
More information2-3 Automatic Construction Technology for Parallel Corpora
2-3 Automatic Construction Technology for Parallel Corpora We have aligned Japanese and English news articles and sentences, extracted from the Yomiuri and the Daily Yomiuri newspapers, to make a large
More informationCoding science news (intrinsic and extrinsic features)
Coding science news (intrinsic and extrinsic features) M I G U E L Á N G E L Q U I N T A N I L L A, C A R L O S G. F I G U E R O L A T A M A R G R O V E S 2 Science news in Spain The corpus of digital
More informationCombining Ontological Knowledge and Wrapper Induction techniques into an e-retail System 1
Combining Ontological Knowledge and Wrapper Induction techniques into an e-retail System 1 Maria Teresa Pazienza, Armando Stellato and Michele Vindigni Department of Computer Science, Systems and Management,
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationForecasting stock markets with Twitter
Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,
More informationHow To Write A Summary Of A Review
PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,
More informationTampa s Best Black Restaurants
Tampa s Best Black Restaurants I eat out a lot and over the years I have dined at many establishments of different cultures in the Tampa area. As I visited the restaurants, I would make mental notes. In
More informationExtracting Events from Web Documents for Social Media Monitoring using Structured SVM
IEICE TRANS. FUNDAMENTALS/COMMUN./ELECTRON./INF. & SYST., VOL. E85A/B/C/D, No. xx JANUARY 20xx Letter Extracting Events from Web Documents for Social Media Monitoring using Structured SVM Yoonjae Choi,
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationExtraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate
More informationCINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
More informationCross-Language Information Retrieval by Domain Restriction using Web Directory Structure
Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Fuminori Kimura Faculty of Culture and Information Science, Doshisha University 1 3 Miyakodani Tatara, Kyoutanabe-shi,
More informationUsing Knowledge Extraction and Maintenance Techniques To Enhance Analytical Performance
Using Knowledge Extraction and Maintenance Techniques To Enhance Analytical Performance David Bixler, Dan Moldovan and Abraham Fowler Language Computer Corporation 1701 N. Collins Blvd #2000 Richardson,
More informationRRSS - Rating Reviews Support System purpose built for movies recommendation
RRSS - Rating Reviews Support System purpose built for movies recommendation Grzegorz Dziczkowski 1,2 and Katarzyna Wegrzyn-Wolska 1 1 Ecole Superieur d Ingenieurs en Informatique et Genie des Telecommunicatiom
More informationSentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies
Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts
More informationAnnotation and Evaluation of Swedish Multiword Named Entities
Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden dimitrios.kokkinakis@svenska.gu.se Introduction
More informationReliable and Cost-Effective PoS-Tagging
Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve
More informationMultiword Expressions and Named Entities in the Wiki50 Corpus
Multiword Expressions and Named Entities in the Wiki50 Corpus Veronika Vincze 1, István Nagy T. 2 and Gábor Berend 2 1 Hungarian Academy of Sciences, Research Group on Artificial Intelligence vinczev@inf.u-szeged.hu
More informationUniversity of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
More informationSemantic Class Induction and Coreference Resolution
Semantic Class Induction and Coreference Resolution Vincent Ng Human Language Technology Research Institute University of Texas at Dallas Richardson, TX 75083-0688 vince@hlt.utdallas.edu Abstract This
More informationSemantic parsing with Structured SVM Ensemble Classification Models
Semantic parsing with Structured SVM Ensemble Classification Models Le-Minh Nguyen, Akira Shimazu, and Xuan-Hieu Phan Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa,
More informationFacilitating Business Process Discovery using Email Analysis
Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process
More informationSemantic Sentiment Analysis of Twitter
Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He & Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 11 th International Semantic Web Conference
More informationContext Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 dick.chen@lexisnexis.com don.loritz@lexisnexis.com
More informationWord Taxonomy for On-line Visual Asset Management and Mining
Word Taxonomy for On-line Visual Asset Management and Mining Osmar R. Zaïane * Eli Hagen ** Jiawei Han ** * Department of Computing Science, University of Alberta, Canada, zaiane@cs.uaberta.ca ** School
More informationCS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis
CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction
More informationTowards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,
More informationBagged Ensemble Classifiers for Sentiment Classification of Movie Reviews
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie
More informationKnowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD
Knowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD Eneko Agirre and Oier Lopez de Lacalle and Aitor Soroa Informatika Fakultatea, University of the Basque Country 20018,
More informationHomework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class
Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining
More informationGenerating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
More informationAN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS Alok Ranjan Pal 1, 3, Anirban Kundu 2, 3, Abhay Singh 1, Raj Shekhar 1, Kunal Sinha 1 1 College of Engineering and Management,
More informationREPENTINO A Wide-Scope Gazetteer for Entity Recognition in Portuguese
REPENTINO A Wide-Scope Gazetteer for Entity Recognition in Portuguese Luís Sarmento, Ana Sofia Pinto, and Luís Cabral Faculdade de Engenharia da Universidade do Porto (NIAD&R), Rua Dr. Roberto Frias, s/n
More informationExploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization
Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it
More informationChapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
More informationA Survey on Product Aspect Ranking
A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,
More informationAutomated Extraction of Security Policies from Natural-Language Software Documents
Automated Extraction of Security Policies from Natural-Language Software Documents Xusheng Xiao 1 Amit Paradkar 2 Suresh Thummalapenta 3 Tao Xie 1 1 Dept. of Computer Science, North Carolina State University,
More information