A Test Collection for Email Entity Linking
Ning Gao, Douglas W. Oard
College of Information Studies/UMIACS, University of Maryland, College Park

Mark Dredze
Human Language Technology Center of Excellence, Johns Hopkins University

Abstract

Most prior work on entity linking has focused on linking name mentions found in third-person communication (e.g., news) to broad-coverage knowledge bases (e.g., Wikipedia). A restricted form of domain-specific entity linking has, however, been tried with email, linking mentions of people to specific email addresses. This paper introduces a new test collection for the task of linking mentions of people, organizations, and locations to Wikipedia. Annotation of 200 randomly selected entities of each type from the Enron collection indicates that domain-specific knowledge bases are indeed required to get good coverage of people and organizations, but that Wikipedia provides good (93%) coverage for the named mentions of locations in the Enron collection. Furthermore, experiments with an existing entity linking system indicate that the absence of a suitable referent in Wikipedia can easily be recognized by automated systems, with NIL precision (i.e., correct detection of the absence of a suitable referent) above 90% for all three entity types.

1 Introduction

Connecting content found in free text to structured knowledge sources is a useful step for information access, question answering and knowledge base population. Given a document, named entity recognition (NER) first identifies the names mentioned in the document. Given a knowledge base (KB) and the recognized name mentions, entity linking, also known as named entity disambiguation (NED), links the name mentions to the referent entities in the KB, or returns NIL if the referents are not in the KB. With the rise of large-scale KBs such as Wikipedia, Freebase and Yago, entity linking has drawn considerable attention.
The Text Analysis Conference Knowledge Base Population (TAC-KBP) track introduced the first shared entity linking task [1] in 2009. Since then, entity linking studies have explored a variety of data types and settings. Traditionally, many systems have sought to link to a KB derived from Wikipedia infoboxes [1, 2]. More recently, Freebase and Yago have also been used as even more extensive sources of structured knowledge [3, 4]. Some shared tasks, such as TAC-KBP, have focused only on linking named entities, but others have extended the task to link mentions of any concept found in Wikipedia, so-called Wikification [5, 6]. The TAC-KBP shared task has also evolved, adding NIL clustering (clustering of entities not found in the KB) [7, 8], extending from an initial focus on linking from news articles to also include linking from blogs [8], and adding tasks that required linking across languages [7, 9]. Furthermore, several recent papers have considered social media, such as Twitter, as a new source for mentions to be linked to entities [10, 11, 12, 13].

There has been another thread of work on identity resolution in email, which we can think of as a specialized entity linking task. The focus of that work has been on automatically tagging named mentions of a person in the body text of an email message with the email address of that person. If we think of the email address inventory as a domain-specific KB, then this is an entity linking task. However, identity resolution research has to date focused only on person entities, leaving scope for
future work on organizations, locations, and other entity types. Moreover, identity resolution research has to date focused only on linking to entities that are uniquely identified by email addresses, leaving scope for future work on resolving mentions of people for which no suitable referent email address is known (e.g., mentions of public figures).

As an initial step towards addressing these challenges, we have developed what we believe to be the first test collection for linking named mentions of person (PER), organization (ORG) and location (LOC) entities from email message bodies to Wikipedia (see footnote 1). As a second step, we have used the existing HLTCOE entity linker [14] to establish a baseline for entity linking accuracy to which future, more highly specialized systems can be compared.

In the remainder of this paper, we first describe our new test collection in Section 2. We then report the entity linking accuracy of the HLTCOE entity linker on the new collection in Section 3. In Section 4, we recap some related work. Section 5 concludes the paper with a brief discussion of next steps.

Table 1: Statistics for the sampling of named mentions, and the NER accuracy.
Columns: Type, All Msg, Per Msg All, 300 Msg, Per Msg 300, Sample, Correct, Accuracy.
Rows: PER, ORG, LOC. Only truncated fragments of the cell values survive: PER 922, / ORG 1,149, / LOC 492,

Table 2: Wikipedia coverage statistics.
Columns: Type, Non-NIL, NIL, NIL%.
Rows: PER, ORG, LOC. The cell values do not survive.

Table 3: Entity linking accuracy, HLTCOE Entity Linker.
Columns: Type, Non-NIL, Exact, NIL, Overall.
Rows: PER, ORG, LOC. The cell values do not survive.

2 Test Collection

We use a de-duplicated Enron collection as the basis for our new test collection. Within the CMU Enron collection [15], there are a substantial number of duplicate messages (because the same email could, for example, appear in the sender's Sent Mail folder and the recipient's Inbox folder, and because some users keep multiple copies of messages, so that the same message might be in the Inbox and also in a folder called "East Oil").
We therefore adopted Elsayed's de-duplication process [16], in which messages are considered to be duplicates if they contain exactly the same From, To, Cc, Bcc, Subject, Time, and Body fields. Before de-duplication we normalized the date and time of each message to a standard time zone (Coordinated Universal Time). After de-duplication there are 248,573 messages in the collection.

To automatically identify named mentions (i.e., omitting nominal and pronominal mentions), we first use the Illinois Named Entity Tagger (INET; see footnote 2) to recognize three categories of named entity mentions: person (PER), organization (ORG), and location (LOC). In Table 1, the column labeled All Msg shows the number of named mentions recognized by INET in the whole collection, with the Per Msg All column showing the average number of mentions in each email. We then randomly selected 300 messages that each contained more than 10 words of body text. The Per Msg 300 column shows the average number of mentions in the sampled 300 messages (see footnote 3). Finally, from the detected entities in those 300 messages (shown in the 300 Msg column) we randomly selected 200 named mentions (the Sample column) of each type. Comparing Per Msg All with Per Msg 300, we can see that the prevalence of named entity mentions is somewhat higher in the sampled collection than in the whole collection, because the statistics for the whole collection are reduced by the presence of short messages that contain relatively few named mentions.

Footnotes:
1. The test collection is available at ninggao/publications
2. INET is available at
3. We randomly selected 20 documents and manually recognized 135 PER, ORG or LOC mentions. The estimated recall of INET on the sampled documents is 98.5%.
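The de-duplication rule described at the start of this section (identical From, To, Cc, Bcc, Subject, Time, and Body fields, with times first normalized to UTC) can be sketched in a few lines. This is a minimal illustration and not the code of [16]: the dictionary layout, the field names, and the helper names `to_utc`, `dedup_key`, and `deduplicate` are all assumptions made for the example, and it assumes every timestamp carries time zone information.

```python
from datetime import datetime, timedelta, timezone

# Text fields that must match exactly (hypothetical field names).
KEY_FIELDS = ("from", "to", "cc", "bcc", "subject", "body")


def to_utc(ts):
    """Normalize a timezone-aware timestamp to UTC."""
    return ts.astimezone(timezone.utc)


def dedup_key(msg):
    """Comparison key: the six text fields plus the UTC-normalized time."""
    return tuple(msg[f] for f in KEY_FIELDS) + (to_utc(msg["time"]),)


def deduplicate(messages):
    """Keep the first message seen for each distinct key, preserving order."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Toy data (invented): the same message stored in two folders with the
# timestamp expressed in different time zones, plus one different message.
central = timezone(timedelta(hours=-6))
base = {"from": "a@example.com", "to": "b@example.com", "cc": "", "bcc": "",
        "subject": "East Oil", "body": "See attached."}
sent_copy = dict(base, time=datetime(2001, 5, 1, 9, 0, tzinfo=central))
inbox_copy = dict(base, time=datetime(2001, 5, 1, 15, 0, tzinfo=timezone.utc))
other = dict(base, body="Different text.",
             time=datetime(2001, 5, 1, 15, 0, tzinfo=timezone.utc))

unique = deduplicate([sent_copy, inbox_copy, other])
print(len(unique))
```

Normalizing the time before building the key matters here: without it, the two copies of the same message would differ in their Time field and would both be kept.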
Six annotators then each labeled 100 of the 600 sampled entity mentions for whether the text span recognized by INET was a correctly delimited named mention of an entity of the corresponding type. For example, in the sentence "I will meet him in Washington DC," the only correct text span would be "Washington DC" (not "Washington" or "in Washington DC"), and "Washington DC" should be classified by INET as a LOC, not a PER. Because each named mention was judged for correctness by a single annotator, measures of inter-annotator agreement for this task are not yet available.

As the Correct and Accuracy columns in Table 1 show, the accuracy for PER and LOC mention detection is comparable to levels typically achieved on newswire (90% and 97%, respectively), but detection accuracy for ORG mentions is considerably lower (76%). For the 10% of PER mentions that were judged incorrect, informal writing styles in which plausible (but incorrect) names such as "Hope" or "May" can begin a sentence make the task more challenging, particularly for systems like INET that are not trained specifically for email. As expected, we also see classes of errors that are typical in newswire, such as incorrectly segmenting "George Bush" from "George Bush Intercontinental Airport." The few errors on LOC all resulted from typical causes that are also seen in newswire (e.g., misrecognizing "Turkey" as a LOC when, read in context, it is clearly referring to a bird). In the 24% of automatically detected ORG mentions marked by our assessors as incorrect, we find product names mislabeled as ORG (e.g., "MAC") and job positions mislabeled as ORG (e.g., "ITC" in "if you want to be an ITC"). We also see some domain-specific names that our assessors simply lacked the knowledge to judge with confidence (e.g., "RTO" in "standardizing RTO").

To measure the accuracy of automated entity linking systems, we also asked each assessor to provide us with the correct Wikipedia page for each correctly recognized mention that they assessed, or NIL if they believed that no such Wikipedia page yet existed. The Non-NIL and NIL columns in Table 2 show the number of entities that the annotator had judged as correct that were or were not in Wikipedia, respectively. A sample of 60 of these mentions (20 of each type) was dual annotated by a second assessor, yielding exact agreement (i.e., both designate the same entity or both designate NIL) on 85% of the named mentions. The Cohen's kappa agreement on whether a mention was NIL or Non-NIL is [value missing].

From Table 2, we can see that only 32% of the PER mentions could be linked to Wikipedia entities. These PER entities include sports stars, politicians, and well-known people who worked for Enron such as the former CEO Kenneth Lay. For ORG mentions, about half (53%) of the referenced entities were found in Wikipedia (e.g., "Justice Department") with the other half annotated as NIL (e.g., "Southward Energy Ltd"). Most (93%) of the LOC mentions could be found in Wikipedia; the relatively few LOC mentions that were resolved by our annotators as NIL included references to specific locations that had not achieved sufficient notability for inclusion in Wikipedia (e.g., "1455 Pennsylvania Ave").

3 Entity Linking Result on the New Collection

We then ran an existing entity linking system, the HLTCOE entity linker [14], on the 524 correctly recognized name mentions that had been automatically found by INET. Table 3 shows the resulting accuracy for mentions that had been resolved by our annotators to Wikipedia (Non-NIL) and for mentions that had been marked by our annotators as NIL. Aggregate accuracy statistics (Overall) are also shown.
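One way to compute the three accuracy figures just described (Non-NIL, NIL, and Overall, per entity type) is sketched below. It is an illustration only: it assumes that each figure is the fraction of mentions in that gold-standard group whose predicted referent equals the annotated one, which is our reading of the column descriptions rather than a definition stated by the evaluation. The record layout, the function names `ratio` and `linking_accuracy`, and the toy page identifiers are all hypothetical.

```python
from collections import defaultdict

NIL = "NIL"


def ratio(correct, total):
    """Fraction correct, or None when the group is empty."""
    return correct / total if total else None


def linking_accuracy(records):
    """records: iterable of (entity_type, gold_referent, predicted_referent).

    Returns {entity_type: {"non_nil": ..., "nil": ..., "overall": ...}},
    grouping mentions by whether the gold annotation is NIL.
    """
    # Per type and group, store [number correct, number of mentions].
    counts = defaultdict(lambda: {"non_nil": [0, 0], "nil": [0, 0]})
    for etype, gold, pred in records:
        group = "nil" if gold == NIL else "non_nil"
        counts[etype][group][1] += 1
        if pred == gold:
            counts[etype][group][0] += 1

    report = {}
    for etype, c in counts.items():
        correct = c["non_nil"][0] + c["nil"][0]
        total = c["non_nil"][1] + c["nil"][1]
        report[etype] = {
            "non_nil": ratio(*c["non_nil"]),
            "nil": ratio(*c["nil"]),
            "overall": ratio(correct, total),
        }
    return report


# Toy records (invented identifiers, not data from the collection).
records = [
    ("PER", "Kenneth_Lay", "Kenneth_Lay"),    # correct Non-NIL link
    ("PER", "Larry_Parker", NIL),             # linker wrongly returned NIL
    ("PER", NIL, NIL),                        # correct NIL detection
    ("LOC", "Orlando,_Florida", "Orlando,_Florida"),
]
report = linking_accuracy(records)
print(report["PER"])
```

Keeping the Non-NIL and NIL groups separate makes it visible when a linker scores well overall mainly by returning NIL, which matters for entity types such as PER where most gold annotations are NIL.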
The Exact column in Table 3 shows the fraction of the correctly resolved Non-NIL mentions that resulted from an exact match to the canonical form in Wikipedia (which is unique); the other correct resolutions resulted from finding candidates using a fuzzy string match, ranking those candidates using match degree, entity popularity and contextual similarity features, and then selecting the top-ranking match. Our inspection of the results indicates that the exact matches were always correct, but that the difficulty of fuzzy matching varied by entity type. For PER mentions, the entity linker tends to resolve single-token name mentions, such as "Parker" in "Parker could get a touchdown pass Sunday against Seattle Seahawks," to NIL when there is not enough useful evidence available to the entity linker. However, our human assessor could resolve the mention to the American football player Larry Parker after some easy reasoning. A frequently observed fuzzy match pattern for ORG name mentions that the linker often got right was abbreviation expansion (e.g., matching a mention of "the US congress" to "United States Congress"). Only for LOC mentions were more mentions correctly resolved through fuzzy matching than through exact match. This reflects the rarity of fully qualified location references in Wikipedia (e.g., we see what Wikipedia calls "Orlando, Florida" mentioned simply as "Orlando").
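The two resolution paths described above (an exact match on the canonical title, otherwise fuzzy candidates ranked by match degree, popularity, and contextual similarity) can be illustrated with a toy sketch. This is not the HLTCOE entity linker: the tiny KB, the popularity numbers, the hand-set weights, the token-overlap similarity, and the fixed NIL threshold are all assumptions made for the example, and a real system would learn its ranking function rather than use fixed weights.

```python
from difflib import SequenceMatcher

# Toy KB (invented): canonical title -> (popularity in [0, 1], description).
TOY_KB = {
    "Orlando, Florida": (0.9, "city in central florida"),
    "Orlando Bloom": (0.7, "english actor"),
    "United States Congress": (0.95, "federal legislature"),
}


def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def context_similarity(context, description):
    """Jaccard overlap between context tokens and description tokens."""
    a, b = tokens(context), tokens(description)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def link(mention, context, kb=TOY_KB, threshold=0.5, weights=(0.5, 0.2, 0.3)):
    """Return a KB title for the mention, or "NIL" if nothing scores well."""
    # Path 1: exact match on the (unique) canonical title.
    if mention in kb:
        return mention
    # Path 2: score every entry on match degree, popularity, and context,
    # then keep the top-scoring title if it beats the NIL threshold.
    best_title, best_score = "NIL", threshold
    for title, (popularity, description) in kb.items():
        match = SequenceMatcher(None, mention.lower(), title.lower()).ratio()
        score = (weights[0] * match
                 + weights[1] * popularity
                 + weights[2] * context_similarity(context, description))
        if score > best_score:
            best_title, best_score = title, score
    return best_title


print(link("Orlando", "conference in central Florida next week"))
print(link("Southward Energy Ltd", "contract with Southward Energy Ltd"))
```

In this toy setting, the string match alone slightly favors "Orlando Bloom" for the mention "Orlando"; popularity and the surrounding context are what tip the ranking towards the city. A character-level matcher like this one would not handle abbreviation expansion such as "US" to "United States", which needs dedicated features.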
Looking at Tables 2 and 3 together, we can conclude that Wikipedia KB linking achieves good coverage (a true non-NIL rate of 93%) with fairly high accuracy (0.73, despite the frequent need for fuzzy matching) for LOC entities. The situation for PER entities is just the opposite, with Wikipedia KB linking achieving quite poor coverage (with a true non-NIL rate of 32%) and with automated entity linking doing somewhat less well for those cases (with an accuracy of 0.52, about half of which are exact matches on a full name). The situation for ORG entities is somewhere in between, with a substantial number of true non-NIL links to the Wikipedia KB that are possible (53%), but rather low linking accuracy (0.69). From this we conclude that the research to date on linking PER entities to domain-specific KBs has been well justified, that new research on linking ORG entities to domain-specific KBs is called for, and that research on linking LOC entities to domain-specific KBs can reasonably continue to be given lower priority.

Notably, Table 3 also shows that the HLTCOE entity linker is already rather good at NIL detection on mentions found in email messages, with detection accuracy above 0.9 for true NILs on all three types, despite not having been tuned specifically to email. This result suggests that it should be fairly easy to integrate the results of entity linking to broad-coverage and domain-specific KBs.

4 Related Work

Klimt and Yang [15] published the CMU Enron collection, built from 152 users' folders, totaling 517,424 messages (but no attachments). Minkov et al. [17] built the first evaluation collection from Enron emails using messages from the folders of two users (Sager and Shapiro) that contained named mentions that corresponded uniquely to names in the CC field. The corresponding names were removed from the CC field for evaluation purposes. Diehl et al.
[18] built a test collection with 78 named mentions that were manually annotated as resolving to specific Enron email addresses. Elsayed [16] built what we can think of as the first collection-specific KB for the Enron collection, containing not just the email address, but also known name variants and known alternate addresses for the same person. Elsayed also built what is currently the largest manually annotated test collection for the task of person entity linking in Enron email.

McNamee et al. [14] created the HLTCOE Entity Linker, which we used in Section 3. The system first computes a set of triage features (e.g., string equality, approximate string match) and selects candidate KB entities based on those features. NIL is then added to every candidate set, and ranked in the same way as any other candidate. An extended set of more expensive features (e.g., document similarity, popularity, and plausible NIL cues) is then computed for each candidate. Finally, the candidates are ranked using those features and a learned ranking function, and the top-ranked candidate is returned as the system's prediction of the referent entity. The HLTCOE Entity Linker has proven to be one of the most effective systems in the TAC-KBP entity linking task [1].

5 Conclusion and Future Work

From Table 2, we can see that about two-thirds of the mentioned person entities and about half of the mentioned organization entities in the Enron collection are not covered by Wikipedia. It is, therefore, potentially useful to build collection-specific KBs for those entity types; location entities (for which only about 7% are missing from Wikipedia) seem to us currently to be less of a priority. Although Elsayed et al. [19] have produced a collection-specific KB for persons who sent or received messages in the Enron collection, no comparable collection-specific KB yet exists for organizations. We therefore believe that research on that topic deserves attention.
Further, the NIL mentions in the proposed test collection could be resolved to the collection-specific KBs. We expect that linking to collection-specific KBs will substantially reduce the percentage of NIL mentions. We are also interested in clustering the remaining NIL mentions. Our current test collection only includes three entity types: PER, ORG and LOC. In our future work, we are interested in extending the collection to other entity types such as events and products.

We also note that the NER and entity linking systems that we have used were not designed to exploit characteristics of email that could make the task easier (e.g., thread structure, the communicant graph, or bursty communication patterns). Leveraging such features could be important because in some ways entity linking in email is harder than entity linking in news. For example, news articles
might be expected to be explicitly self-contextualizing more often than isolated email messages in which individual senders and recipients rely more heavily on shared context.

Finally, we note that forcing a choice between broad-coverage and collection-specific KBs would likely be suboptimal. We would expect any collection to contain references to well-known and less well-known entities, so developing techniques for linking from one collection to multiple KBs deserves attention from researchers. Email provides a particularly attractive testbed around which to develop such capabilities since, as our work has shown, both types of references are naturally present in substantial quantities. In this way, research on entity linking from email collections can help to develop capabilities from which the field as a whole can ultimately benefit.

References

[1] P. McNamee and H. T. Dang, "Overview of the TAC 2009 knowledge base population track," in TAC.
[2] S. Cucerzan, "Large-scale named entity disambiguation based on Wikipedia data," in EMNLP-CoNLL.
[3] Z. Zheng, X. Si, F. Li, E. Y. Chang, and X. Zhu, "Entity disambiguation with Freebase," in IEEE/WIC/ACM.
[4] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum, "Robust disambiguation of named entities in text," in EMNLP.
[5] D. N. Milne and I. H. Witten, "Learning to link with Wikipedia," in CIKM.
[6] L. Ratinov, D. Roth, D. Downey, and M. Anderson, "Local and global algorithms for disambiguation to Wikipedia," in ACL.
[7] H. Ji, R. Grishman, and H. Dang, "Overview of the TAC 2011 knowledge base population track," in TAC.
[8] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis, "Overview of the TAC 2010 knowledge base population track," in TAC.
[9] P. McNamee, J. Mayfield, D. W. Oard, T. Xu, K. Wu, V. Stoyanov, and D. Doermann, "Cross-language entity linking in Maryland during a hurricane," in TAC.
[10] T. Cassidy, H. Ji, L.-A. Ratinov, A. Zubiaga, and H. Huang, "Analysis and enhancement of Wikification for microblogs with context expansion," in COLING.
[11] S. Guo, M.-W. Chang, and E. Kiciman, "To link or not to link? A study on end-to-end tweet entity linking," in HLT-NAACL.
[12] X. Liu, Y. Li, H. Wu, M. Zhou, F. Wei, and Y. Lu, "Entity linking for Tweets," in ACL.
[13] W. Shen, J. Wang, P. Luo, and M. Wang, "Linking named entities in Tweets with knowledge base via user interest modeling," in SIGKDD.
[14] P. McNamee, M. Dredze, A. Gerber, N. Garera, T. Finin, J. Mayfield, C. Piatko, D. Rao, D. Yarowsky, and M. Dreyer, "HLTCOE approaches to knowledge base population at TAC 2009," in TAC.
[15] B. Klimt and Y. Yang, "The Enron corpus: A new dataset for email classification research," in ECML.
[16] T. Elsayed, Identity Resolution in Email Collections. PhD thesis, University of Maryland, College Park.
[17] E. Minkov, W. W. Cohen, and A. Y. Ng, "Contextual search and name disambiguation in email using graphs," in SIGIR.
[18] C. P. Diehl, L. Getoor, and G. Namata, "Name reference resolution in organizational email archives," in SIAM.
[19] T. Elsayed and D. W. Oard, "Modeling identity in archival collections of email: A preliminary study," in CEAS.
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Wei Zhang Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei,tancl}@comp.nus.edu.sg Abstract
Name Reference Resolution in Organizational Email Archives
Name Reference Resolution in Organizational Email Archives Christopher P. Diehl Lise Getoor Galileo Namata Abstract Online communications provide a rich resource for understanding social networks. Information
Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines
, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing
Task Description for English Slot Filling at TAC- KBP 2014
Task Description for English Slot Filling at TAC- KBP 2014 Version 1.1 of May 6 th, 2014 1. Changes 1.0 Initial release 1.1 Changed output format: added provenance for filler values (Column 6 in Table
Text Analytics with Ambiverse. Text to Knowledge. www.ambiverse.com
Text Analytics with Ambiverse Text to Knowledge www.ambiverse.com Version 1.0, February 2016 WWW.AMBIVERSE.COM Contents 1 Ambiverse: Text to Knowledge............................... 5 1.1 Text is all Around
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,
Statistical Analyses of Named Entity Disambiguation Benchmarks
Statistical Analyses of Named Entity Disambiguation Benchmarks Nadine Steinmetz, Magnus Knuth, and Harald Sack Hasso Plattner Institute for Software Systems Engineering, Potsdam, Germany, [email protected]
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
Entity Disambiguation for Knowledge Base Population
Entity Disambiguation for Knowledge Base Population Mark Dredze and Paul McNamee and Delip Rao and Adam Gerber and Tim Finin Human Language Technology Center of Excellence, Center for Language and Speech
Blog Post Extraction Using Title Finding
Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School
Discovering and Querying Hybrid Linked Data
Discovering and Querying Hybrid Linked Data Zareen Syed 1, Tim Finin 1, Muhammad Rahman 1, James Kukla 2, Jeehye Yun 2 1 University of Maryland Baltimore County 1000 Hilltop Circle, MD, USA 21250 [email protected],
Plato: A Selective Context Model for Entity Resolution
Plato: A Selective Context Model for Entity Resolution Nevena Lazic, Amarnag Subramanya, Michael Ringgaard, Fernando Pereira Google Inc., Mountain View CA 94043, USA {nevena, asubram, ringgaard, pereira}@google.com
Detection of Imperative and Declarative Question-Answer Pairs in Email Conversations
Detection of Imperative and Declarative Question-Answer Pairs in Email Conversations Helen Kwong Stanford University, USA [email protected] Neil Yorke-Smith SRI International, USA [email protected]
Entity Linking: Finding Extracted Entities in a Knowledge Base
Entity Linking: Finding Extracted Entities in a Knowledge Base Delip Rao 1, Paul McNamee 2, and Mark Dredze 1,2 1 Department of Computer Science Johns Hopkins University (delip mdredze)@cs.jhu.edu 2 Human
Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help?
Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help? Sumit Bhatia 1, Prakhar Biyani 2 and Prasenjit Mitra 2 1 IBM Almaden Research Centre, 650 Harry Road, San Jose, CA 95123,
Stanford s Distantly-Supervised Slot-Filling System
Stanford s Distantly-Supervised Slot-Filling System Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, Christopher D. Manning Computer Science Department,
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Research on Sentiment Classification of Chinese Micro Blog Based on
Research on Sentiment Classification of Chinese Micro Blog Based on Machine Learning School of Economics and Management, Shenyang Ligong University, Shenyang, 110159, China E-mail: [email protected] Abstract
Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources
Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
Automatic Knowledge Base Construction Systems. Dr. Daisy Zhe Wang CISE Department University of Florida September 3th 2014
Automatic Knowledge Base Construction Systems Dr. Daisy Zhe Wang CISE Department University of Florida September 3th 2014 1 Text Contains Knowledge 2 Text Contains Automatically Extractable Knowledge 3
Enhanced Information Access to Social Streams. Enhanced Word Clouds with Entity Grouping
Enhanced Information Access to Social Streams through Word Clouds with Entity Grouping Martin Leginus 1, Leon Derczynski 2 and Peter Dolog 1 1 Department of Computer Science, Aalborg University Selma Lagerlofs
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services
21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions
Semi-Supervised Learning for Blog Classification
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,
Taxonomies in Practice Welcome to the second decade of online taxonomy construction
Building a Taxonomy for Auto-classification by Wendi Pohs EDITOR S SUMMARY Taxonomies have expanded from browsing aids to the foundation for automatic classification. Early auto-classification methods
An Empirical Study of Chinese Name Matching and Applications
An Empirical Study of Chinese Name Matching and Applications Nanyun Peng 1 and Mo Yu 2 and Mark Dredze 1 1 Human Language Technology Center of Excellence Center for Language and Speech Processing Johns
KEYWORD SEARCH IN RELATIONAL DATABASES
KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to
Combining textual and non-textual features for e-mail importance estimation
Combining textual and non-textual features for e-mail importance estimation Maya Sappelli a Suzan Verberne b Wessel Kraaij a a TNO and Radboud University Nijmegen b Radboud University Nijmegen Abstract
Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -
Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de
Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
Twitter Stock Bot. John Matthew Fong The University of Texas at Austin [email protected]
Twitter Stock Bot John Matthew Fong The University of Texas at Austin [email protected] Hassaan Markhiani The University of Texas at Austin [email protected] Abstract The stock market is influenced
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Web Information Mining and Decision Support Platform for the Modern Service Industry
Web Information Mining and Decision Support Platform for the Modern Service Industry Binyang Li 1,2, Lanjun Zhou 2,3, Zhongyu Wei 2,3, Kam-fai Wong 2,3,4, Ruifeng Xu 5, Yunqing Xia 6 1 Dept. of Information
Modeling Relations and Their Mentions without Labeled Text
Modeling Relations and Their Mentions without Labeled Text Sebastian Riedel, Limin Yao, and Andrew McCallum University of Massachusetts, Amherst, Amherst, MA 01002, U.S. {riedel,lmyao,mccallum,lncs}@cs.umass.edu
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,
Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu
Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline: Introduction to relation extraction and its applications; Motivation of domain adaptation in big text data analytics; Solutions
Challenges of Cloud Scale Natural Language Processing
Challenges of Cloud Scale Natural Language Processing Mark Dredze Johns Hopkins University My Interests? Information Expressed in Human Language Machine Learning Natural Language Processing Intelligent
Identifying Focus, Techniques and Domain of Scientific Papers
Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of
Robust Sentiment Detection on Twitter from Biased and Noisy Data
Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research [email protected] Junlan Feng AT&T Labs - Research [email protected] Abstract In this
Micro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph
MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of
Emoticon Smoothed Language Models for Twitter Sentiment Analysis
Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Emoticon Smoothed Language Models for Twitter Sentiment Analysis Kun-Lin Liu, Wu-Jun Li, Minyi Guo Shanghai Key Laboratory of
Local and Global Algorithms for Disambiguation to Wikipedia
ACL 11 Local and Global Algorithms for Disambiguation to Wikipedia Lev Ratinov 1 Dan Roth 1 Doug Downey 2 Mike Anderson 3 1 University of Illinois at Urbana-Champaign {ratinov2 danr}@uiuc.edu 2 Northwestern
Subordinating to the Majority: Factoid Question Answering over CQA Sites
Journal of Computational Information Systems 9: 16 (2013) 6409-6416 Available at http://www.jofcis.com Subordinating to the Majority: Factoid Question Answering over CQA Sites Xin LIAN, Xiaojie YUAN, Haiwei
Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation
Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation Huidan Liu Institute of Software, Chinese Academy of Sciences, Graduate University of the [email protected]
Sentiment Analysis on Hadoop with Hadoop Streaming
Sentiment Analysis on Hadoop with Hadoop Streaming Piyush Gupta Research Scholar Pardeep Kumar Assistant Professor Girdhar Gopal Assistant Professor ABSTRACT Ideas and opinions of peoples are influenced
Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presentation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presentation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presentation, 06/17/2015
Predictive Coding Defensibility and the Transparent Predictive Coding Workflow
Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive coding is one of the most promising technologies to reduce the high cost of review by
Microblog Sentiment Analysis with Emoticon Space Model
Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory
A Sentiment Detection Engine for Internet Stock Message Boards
A Sentiment Detection Engine for Internet Stock Message Boards Christopher C. Chua Maria Milosavljevic James R. Curran School of Computer Science Capital Markets CRC Ltd School of Information and Engineering
Data Cleansing for Remote Battery System Monitoring
Data Cleansing for Remote Battery System Monitoring Gregory W. Ratcliff Randall Wald Taghi M. Khoshgoftaar Director, Life Cycle Management Senior Research Associate Director, Data Mining and Emerson Network
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process "When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean, neither more nor less." Lewis Carroll [87, p. 214] In
Named Entity Recognition Experiments on Turkish Texts
Named Entity Recognition Experiments on Turkish Texts Dilek Küçük 1 and Adnan Yazıcı 2 1 TÜBİTAK - Uzay Institute, Ankara - Turkey [email protected] 2 Dept. of Computer Engineering, METU, Ankara - Turkey
Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information
Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information Zhongqing Wang, Sophia Yat Mei Lee, Shoushan Li, and Guodong Zhou Natural Language Processing Lab, Soochow University,
Method of Fault Detection in Cloud Computing Systems
pp. 205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,
Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based on Machine Learning
3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC 2016) Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 21-23 September, 2005,
CSV886: Social, Economics and Business Networks. Lecture 2: Affiliation and Balance. R Ravi [email protected]
CSV886: Social, Economics and Business Networks Lecture 2: Affiliation and Balance R Ravi [email protected] Granovetter's Puzzle Resolved Strong Triadic Closure holds in most nodes in social networks
Estimating Twitter User Location Using Social Interactions A Content Based Approach
2011 IEEE International Conference on Privacy, Security, Risk, and Trust, and IEEE International Conference on Social Computing Estimating Twitter User Location Using Social Interactions A Content Based
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1 Faculty of Science and Technology, Queensland University
Effective Mentor Suggestion System for Collaborative Learning
Effective Mentor Suggestion System for Collaborative Learning Advait Raut 1 Upasana G 2 Ramakrishna Bairi 3 Ganesh Ramakrishnan 2 (1) IBM, Bangalore, India, 560045 (2) IITB, Mumbai, India, 400076 (3)
Sentiment analysis: towards a tool for analysing real-time students feedback
Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: [email protected] Mihaela Cocea Email: [email protected] Sanaz Fallahkhair Email:
Bhilai Institute of Technology Durg at TAC 2010: Knowledge Base Population Task Challenge
Bhilai Institute of Technology Durg at TAC 2010: Knowledge Base Population Task Challenge Ani Thomas, Arpana Rawal, M K Kowar, Sanjay Sharma, Sarang Pitale, Neeraj Kharya ARPANI NLP group, Research and
A Systematic Cross-Comparison of Sequence Classifiers
A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel [email protected], [email protected],
The Guide to: Email Marketing Analytics
The Guide to: Email Marketing Analytics This guide has been lovingly created by Sign-Up.to. We provide email marketing, mobile marketing and social media tools and services to organisations of all shapes
Software-assisted document review: An ROI your GC can appreciate. kpmg.com
Software-assisted document review: An ROI your GC can appreciate kpmg.com Contents: Introduction, Approach, Metrics to compare quality and effectiveness, Results, Matter 1
Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms
Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká [email protected]
An Iterative Link-based Method for Parallel Web Page Mining
An Iterative Link-based Method for Parallel Web Page Mining Le Liu 1, Yu Hong 1, Jun Lu 2, Jun Lang 2, Heng Ji 3, Jianmin Yao 1 1 School of Computer Science & Technology, Soochow University, Suzhou, 215006,
