VectorSLU: A Continuous Word Vector Approach to Answer Selection in Community Question Answering Systems
|
|
|
- Beverly Taylor
- 10 years ago
- Views:
Transcription
1 VectorSLU: A Continuous Word Vector Approach to Answer Selection in Community Question Answering Systems Yonatan Belinkov, Mitra Mohtarami, Scott Cyphers, James Glass MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA {belinkov, mitra, cyphers, glass}@csail.mit.edu Abstract Continuous word and phrase vectors have proven useful in a number of NLP tasks. Here we describe our experience using them as a source of features for the SemEval-2015 task 3, consisting of two community question answering subtasks: Answer Selection for categorizing answers as potential, good, and bad with regards to their corresponding questions; and YES/NO inference for predicting a yes, no, or unsure response to a YES/NO question using all of its good answers. Our system ranked 6th and 1st in the English answer selection and YES/NO inference subtasks respectively, and 2nd in the Arabic answer selection subtask. 1 Introduction Continuous word and phrase vectors, in which similar words and phrases are associated with similar vectors, have been useful in many NLP tasks (Al- Rfou et al., 2013; Bansal et al., 2014; Bowman et al., 2014; Boyd-Graber et al., 2012; Chen and Rudnicky, 2014; Guo et al., 2014; Iyyer et al., 2014; Levy and Goldberg, 2014; Mikolov et al., 2013c). To evaluate the effectiveness of continuous vector representations for Community question answering (CQA), we focused on using simple features derived from vector similarity as input to a multi-class linear SVM classifier. Our approach is language independent and was evaluated on both English and Arabic. Most of the vectors we use are domain-independent. CQA services provide forums for users to ask or answer questions on any topic, resulting in high variance answer quality (Màrquez et al., 2015). Searching for good answers among the many responses can be time-consuming for participants. This is illustrated by the following example of a question and subsequent answers. Q: Can I obtain Driving License my QID is written Employee? A1: the word employee is a general term that refers to all the staff in your company... you are all considered employees of your company A2: your qid should specify what is the actual profession you have. I think for me, your chances to have a drivers license is low. A3: his asking if he can obtain. means he have the driver license. Answer selection aims to automatically categorize answers as: good if they completely answer the question, potential if they contain useful information about the question but do not completely answer it, and bad if irrelevant to the question. In the example, answers A1, A2, and A3 are respectively classified as potential, good, and bad. The Arabic answer selection task uses the labels direct, related, and irrelevant. YES/NO inference infers a yes, no, or unsure response to a question through its good answers, which might not explicitly contain yes or no keywords. For example, the answer for Q is no with respect to A2 that can be interpreted as a no answer to the question. The remainder of this paper describes our features and our rationale for choosing them, followed by an analysis of the results, and a conclusion. 282 Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages , Denver, Colorado, June 4-5, c 2015 Association for Computational Linguistics
2 Text-based features Text-based similarities yes/no/probably-like words existing Vector-based features Q&A vectors OOV Q&A yes/no/probably-based cosine similarity Metadata-based features Q&A identical user Rank-based features Normalized ranking scores 2 Method Table 1: The different types of features. Continuous vector representations, described by Schütze (Schütze, 1992a; Schütze, 1992b), associate similar vectors with similar words and phrases. Most approaches to computing vector representations use the observation that similar words appear in similar contexts (Firth, 1957). The theses of Sahlgren (Sahlgren, 2006), Mikolov (Mikolov, 2012), and Socher (Socher, 2014) provide extensive information on vector representations. Our system analyzes questions and answers with a DkPro (Eckart de Castilho and Gurevych, 2014) uimafit (Ogren and Bethard, 2009) pipeline. The DkPro OpenNLP (Apache Software Foundation, 2014) segmenter and chunker tokenize and find sentences and phrases in the English questions and answers, followed by lemmatization with the Stanford lemmatizer (Manning et al., 2014). In Arabic, we only apply lemmatization, with no chunking, using MADAMIRA (Pasha et al., 2014). Stop words are removed in both languages. As shown in Table 1, we compute text-based, vector-based, metadata-based and rank-based features from the pre-processed data. The features are used for a linear SVM classifier for answer selection and YES/NO answer inference tasks. YES/NO answer inference is only performed on good YES/NO question answers, using the YES/NO majority class, and unsure otherwise. SVM parameters are set by grid-search and cross-validation. Text-based features These features are mainly computed using text similarity metrics that measure the string overlap between questions and answers: The Longest Common Substring measure (Gusfield, 1997) identifies uninterrupted common strings, while the Longest Common Subsequence measure (Allison and Dix, 1986) and the Longest Common Subsequence Norm identify common strings with interruptions and text replacements, while Greedy String Tiling measure (Wise, 1996) allows reordering of the subsequences. Other measures which treat text as sequences of characters and compute similarities include the Monge Elkan Second String (Monge and Elkan, 1997) and Jaro Second String (Jaro, 1989) measures. A Cosine Similarity-type measure based on term frequency within the text is also used. Sets of (1-4)-grams from the question and answer are compared with Jaccard coefficient (Lyon et al., 2004) and Containment measures (Broder, 1997). 1 Another group of text-based features identifies answers that contain yes-like (e.g., yes, oh yes, yeah, yep ), no-like (e.g., no, none, nope, never ) and unsure-like (e.g., possibly, conceivably, perhaps, might ) words. These word groups were determined by selecting the top 20 nearest neighbor words to the words yes, no and probably based on the cosine similarity of their Word2Vec vectors. These features are particularly useful for the YES/NO answer inference task. Vector-based features Our vector-based features are computed from Word2Vec vectors (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013d). For English word vectors we use the GoogleNews vectors dataset, available on the Word2Vec web site, 2 which has a 3,000,000 word vocabulary of 300-dimensional word vectors trained on about 100 billion words. For Arabic word vectors we use Word2Vec to train 100-dimensional vectors with default settings on a lemmatized version of the Arabic Gigaword (Linguistic Data Consortium, 2011), obtaining a vocabulary of 120,000 word lemmas. We also use Doc2Vec, 3 an implementation of (Le and Mikolov, 2014) in the gensim 1 These features are mostly taken from the QCRI baseline system: task3/index.php?id=data-and-tools
3 toolkit (Řehůřek and Sojka, 2010). Doc2Vec provides vectors for text of arbitrary length, so it allows us to directly model answers and questions. The Doc2Vec vectors were trained on the CQA English data, creating a single vector for each question or answer. These are the only vectors that were trained specifically for the CQA domain. We implemented a UIMA annotator that associates a Word2Vec word vector with each invocabulary token (or lemma). No vectors are assigned for out of vocabulary tokens. Another annotator computes the average of the vectors for the entire question or answer, with no vector assigned if all tokens are out of vocabulary. We initially used the cosine similarity of the question and answer vectors as a feature for the SVM classifier, but we found that we had better results using the normalized vectors themselves. We hypothesize that the SVM was able to tune the importance of the components of the vectors, whereas cosine similarity weights each component equally. If the question or answer has no vector, we use a 0 vector. To make it easier for the classifier to ignore the vectors in these cases, we add boolean features indicating out of vocabulary, OOV Question and OOV Answer. Even though the bag of words approach showed encouraging results, we found it to be too coarse, so we also compute average vectors for each sentence. For English, we also compute average vectors for each chunk. Then we look for the best matches between sentences (and chunks) in the question and answer in terms of cosine similarity, and use the pairs of (unnormalized) vectors as features. 4 More formally, given a question with sentence vectors {q i } and an answer with sentence vectors {a j }, we take as features the values of the vector pair (ˆq, â) defined as: (ˆq, â) = arg max (q i,a j ) q i a j q i a j We also have six features corresponding to the greatest cosine similarity between the comment word vectors and the vectors for the words yes, Yes, no, No, probably and Probably. These features are more effective for the YES/NO classification task. doc2vec.html. 4 Post-evaluation testing showed no significant difference between using normalized or unnormalized vectors. Metadata-based features As a metadata-based indicator, the Q&A identical user identifies if the user who posted the question is the same user who wrote the answer. This indicator is useful for detecting irrelevant dialogue answers. Rank-based features We employ SVM Rank 5 to compute ranking scores of answers with respect to their corresponding questions. After generating all other features, SVM Rank is run to produce ranking scores for each possible answer. For training SVM Rank, we convert answer labels to ranks according to the following heuristic: good answers are ranked first, potential ones second, and bad ones third. Ranking scores are then used as features for the classifier. The normalization of these scores can be used as rank-based features to provide more information to the classifier, although these scores are also used without any other features as explained in Section 3. 3 Evaluation and Results We evaluate our approach on the answer selection and YES/NO answer inference tasks. We use the CQA datasets provided by the Semeval 2015 task that contain 2600 training and 300 development questions and their corresponding answers (a total number of 16,541 training and 1,645 development answers). About 10% of these questions are of the YES/NO type. We combined the training and development datasets for training purposes. The test dataset includes 329 questions and 1976 answers. About 9% of the test questions are bipolar. We also evaluate our performance on the Arabic answer selection task. The dataset contains 1300 training questions, 200 development questions, and 200 test questions. This dataset does not include YES/NO questions. English answer selection Our approach for the answer selection task in English ranked 6th out of 12 submissions and its results are shown in Table 2. VectorSLU-Primary shows the results when we include all the features listed in Table 1 except the rank-based features. VectorSLU-Contrastive shows the results when we include all the features except 5 svm_light/svm_rank.html. 284
4 Method Macro-F1 Accuracy VectorSLU-Primary VectorSLU-Contrastive JAIST (best) Baseline Table 2: Results for the English answer selection task. Method Macro-F1 Accuracy VectorSLU-Primary VectorSLU-Contrastive QCRI (best) Baseline Table 3: Results for the Arabic answer selection task. the rank-based and text-based features. Interestingly, VectorSLU-Contrastive leads to a better performance than VectorSLU-Primary. The lower performance of VectorSLU-Primary could be due to the high overlap between text-based features in different classes that can clearly mislead classifiers. For example, A1, A2 and A3 (see Section 1) all have a considerable word overlap with their question, while only A2 is a good answer. The last two rows of the table are respectively related to the best performance among all submissions and the majority class baseline that always predicts good. Arabic answer selection Our approach for answer selection in Arabic ranked 2nd out of 4 submissions. Table 3 shows the results. In these experiments, we employ all features listed in Table 1 except for yes/no/probably-based features, since the Arabic task does not include YES/NO answer inference. Vectors were trained from the Arabic Gigaword (Linguistic Data Consortium, 2011). We found lemma vectors to work better than token vectors. We computed ranking scores with SVM Rank for both VectorSLU-contrastive and VectorSLU- Primary. In the case of VectorSLU-contrastive, we used these scores to predict labels according to the following heuristic: the top scoring answer is labeled as direct, the second scoring answer as related, and all other answers as irrelevant. This decision mechanism is based on the distribution in the training and development data, and proved to work well on the test data. However, for our primary Method Macro-F1 Accuracy VectorSLU-Primary (best) VectorSLU-Contrastive Baseline Table 4: Results for the English YES/NO inference task. submission we were interested in a more principled mechanism. Thus, in the VectorSLU-primary system we computed 10 extra classification features from the ranking scores. These features are used to provide prior knowledge about relative ranking of answers with respect to their corresponding questions. To compute these features, we first rank answers with respect to questions and then scale the resultant scores into the [0,1] range. We then consider 10 binary features that indicate whether the score of each input answer is the range of [0,0.1), [0.1,0.2),..., [0.9,1), respectively. Note that each feature vector contains exactly one 1 and nine 0s. The last two rows of the table are related to the best performance and the majority class baseline that always predicts irrelevant. English YES/NO inference For the indirect YES/NO answer inference task, we achieve the best performance and ranked 1st out of 8 submissions. Table 4 shows the results. VectorSLU-Primary and VectorSLU-Contrastive have the same definition as in Table 2. Both approaches with or without the textbased features outperform the baseline that always predicts yes as the majority class and other submissions. This indicates the effectiveness of the vectorbased features. 4 Related Work We are not aware of any previous CQA work using continuous word vectors. Our vector features were somewhat motivated by existing text-based features, taken from the QCRI baseline system, replacing text-similarity heuristics with cosine similarity. Some of the approaches to classifying answers can be found in the general CQA literature, such as (Toba et al., 2014; Bian et al., 2008; Liu et al., 2008). 285
5 5 Conclusion In summary, we represented words, phrases, sentences and whole questions and answers in vector space, and computed various features from them for a classifier, for both English and Arabic. We showed the utility of these vector-based features for addressing the answer selection and the YES/NO answer inference tasks in community question answering. Acknowledgments This research was supported by the Qatar Computing Research Institute (QCRI). We would like to thank Alessandro Moschitti, Preslav Nakov, Lluís Màrquez, Massimo Nicosia, and other members of the QCRI Arabic Language Technologies group for their collaboration on this project. References Rami Al-Rfou, Bryan Perozzi, and Steven Skiena Polyglot: Distributed Word Representations for Multilingual NLP. CoRR, abs/ Lloyd Allison and Trevor I. Dix A bit-string longest-common-subsequence algorithm. Information Processing Letters, 23(5): Apache Software Foundation OpenNLP. Mohit Bansal, Kevin Gimpel, and Karen Livescu Tailoring Continuous Word Representations for Dependency Parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages Jiang Bian, Yandong Liu, Eugene Agichtein, and Hongyuan Zha Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media. In Proceedings of the 17th International Conference on World Wide Web, WWW 08, pages , New York, NY, USA. Samuel R. Bowman, Christopher Potts, and Christopher D. Manning Recursive Neural Networks for Learning Logical Semantics. CoRR, abs/ Jordan L. Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III Besting the Quiz Master: Crowdsourcing Incremental Classification Games. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP- CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, pages Andrei Z. Broder On the Resemblance and Containment of Documents. In Proceedings of the Compression and Complexity of Sequences 1997, SE- QUENCES 97, pages 21, Washington, DC, USA. Yun-Nung Chen and Alexander I. Rudnicky Dynamically Supporting Unexplored Domains in Conversational Interactions by Enriching Semantics with Neural Word Embeddings. In Proceedings of the 2014 Spoken Language Technology Workshop, December 7-10, 2014, South Lake Tahoe, Nevada, USA, pages Richard Eckart de Castilho and Iryna Gurevych A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Nancy Ide and Jens Grivolla, editors, Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1 11, Dublin, Ireland, August. John Firth A Synopsis of Linguistic Theory, Studies in Linguistic Analysis, pages Daniel (Zhaohan) Guo, Gokhan Tur, Wen-tau Yih, and Geoffrey Zweig Joint Semantic Utterance Classification and Slot Filling with Recursive Neural Networks. pages Dan Gusfield Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA. Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daumé III A Neural Network for Factoid Question Answering over Paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages Matthew A. Jaro Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406): Quoc V. Le and Tomas Mikolov Distributed Representations of Sentences and Documents. CoRR, abs/ Omer Levy and Yoav Goldberg Dependency- Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages Linguistic Data Consortium Arabic Gigaword Fifth Edition. edu/ldc2011t
6 Yandong Liu, Jiang Bian, and Eugene Agichtein Predicting Information Seeker Satisfaction in Community Question Answering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SI- GIR 08, pages , New York, NY, USA. Caroline Lyon, Ruth Barrett, and James Malcolm A theoretical basis to the automated detection of copying between texts, and its practical implementation in the ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies 2004 Conference. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages Lluís Màrquez, James Glass, Walid Magdy, Alessandro Moschitti, Preslav Nakov, and Bilal Randeree SemEval-2015 Task 3: Answer Selection in Community Question Answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/ Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. CoRR, abs/ Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013c. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013d. Linguistic Regularities in Continuous Space Word Representations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages Tomàš Mikolov Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology. Alvaro Monge and Charles Elkan An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Philip Ogren and Steven Bethard Building Test Suites for UIMA Components. In Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP 2009), pages 1 4, Boulder, Colorado, June. Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan Roth MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 14), pages , Reykjavik, Iceland. Radim Řehůřek and Petr Sojka Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45 50, Valletta, Malta, May. Magnus Sahlgren The Word-Space Model. Ph.D. thesis, Stockholm University. Hinrich Schütze. 1992a. Dimensions of Meaning. In Proceedings of Supercomputing 92, pages Hinrich Schütze. 1992b. Word Space. In NIPS, pages Richard Socher Recursive Deep Learning for Natural Language Processing and Computer Vision. Ph.D. thesis, Stanford University. Hapnes Toba, Zhao-Yan Ming, Mirna Adriani, and Tat- Seng Chua Discovering high quality answers in community question answering archives using a hierarchy of classifiers. Information Sciences, 261(0): Michael J. Wise YAP3: Improved Detection Of Similarities In Computer Program And Other Texts. In SIGCSEB: SIGCSE Bulletin (ACM Special Interest Group on Computer Science Education), pages
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Leveraging Frame Semantics and Distributional Semantic Polieties
Leveraging Frame Semantics and Distributional Semantics for Unsupervised Semantic Slot Induction in Spoken Dialogue Systems (Extended Abstract) Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky
Package syuzhet. February 22, 2015
Type Package Package syuzhet February 22, 2015 Title Extracts Sentiment and Sentiment-Derived Plot Arcs from Text Version 0.2.0 Date 2015-01-20 Maintainer Matthew Jockers Extracts
Sentiment Lexicons for Arabic Social Media
Sentiment Lexicons for Arabic Social Media Saif M. Mohammad 1, Mohammad Salameh 2, Svetlana Kiritchenko 1 1 National Research Council Canada, 2 University of Alberta [email protected], [email protected],
Ming-Wei Chang. Machine learning and its applications to natural language processing, information retrieval and data mining.
Ming-Wei Chang 201 N Goodwin Ave, Department of Computer Science University of Illinois at Urbana-Champaign, Urbana, IL 61801 +1 (917) 345-6125 [email protected] http://flake.cs.uiuc.edu/~mchang21 Research
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
A Look Into the World of Reddit with Neural Networks
A Look Into the World of Reddit with Neural Networks Jason Ting Institute of Computational and Mathematical Engineering Stanford University Stanford, CA 9435 [email protected] Abstract Creating, placing,
Micro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
Subordinating to the Majority: Factoid Question Answering over CQA Sites
Journal of Computational Information Systems 9: 16 (2013) 6409 6416 Available at http://www.jofcis.com Subordinating to the Majority: Factoid Question Answering over CQA Sites Xin LIAN, Xiaojie YUAN, Haiwei
Identifying Focus, Techniques and Domain of Scientific Papers
Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services
21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
An empirical study of semantic similarity in WordNet and Word2Vec. A Thesis
An empirical study of semantic similarity in WordNet and Word2Vec A Thesis Submitted to the Graduate Faculty of the University of New Orleans in partial fulfillment of the requirements for the degree of
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Learning is a very general term denoting the way in which agents:
What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);
Learning and Inference over Constrained Output
IJCAI 05 Learning and Inference over Constrained Output Vasin Punyakanok Dan Roth Wen-tau Yih Dav Zimak Department of Computer Science University of Illinois at Urbana-Champaign {punyakan, danr, yih, davzimak}@uiuc.edu
CIKM 2015 Melbourne Australia Oct. 22, 2015 Building a Better Connected World with Data Mining and Artificial Intelligence Technologies
CIKM 2015 Melbourne Australia Oct. 22, 2015 Building a Better Connected World with Data Mining and Artificial Intelligence Technologies Hang Li Noah s Ark Lab Huawei Technologies We want to build Intelligent
A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks
A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati
Microblog Sentiment Analysis with Emoticon Space Model
Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
THUTR: A Translation Retrieval System
THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for
How To Bet On An Nfl Football Game With A Machine Learning Program
Beating the NFL Football Point Spread Kevin Gimpel [email protected] 1 Introduction Sports betting features a unique market structure that, while rather different from financial markets, still boasts
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India
Florida International University - University of Miami TRECVID 2014
Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,
A Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA [email protected] Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA [email protected] Abstract
Stanford s Distantly-Supervised Slot-Filling System
Stanford s Distantly-Supervised Slot-Filling System Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, Christopher D. Manning Computer Science Department,
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition
POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics
Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances
Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and
Issues in evaluating semantic spaces using word analogies
Issues in evaluating semantic spaces using word analogies Tal Linzen LSCP & IJN École Normale Supérieure PSL Research University [email protected] Abstract The offset method for solving word analogies
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
Probabilistic topic models for sentiment analysis on the Web
University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as
Employer Health Insurance Premium Prediction Elliott Lui
Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than
Searching biomedical data sets. Hua Xu, PhD The University of Texas Health Science Center at Houston
Searching biomedical data sets Hua Xu, PhD The University of Texas Health Science Center at Houston Motivations for biomedical data re-use Improve reproducibility Minimize duplicated efforts on creating
Benchmarking Machine Translated Sentiment Analysis for Arabic Tweets
Benchmarking Machine Translated Sentiment Analysis for Arabic Tweets Eshrag Refaee Interaction Lab, Heriot-Watt University EH144AS Edinburgh, UK [email protected] Verena Rieser Interaction Lab, Heriot-Watt
Semi-Supervised Learning for Blog Classification
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,
A Systematic Cross-Comparison of Sequence Classifiers
A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel [email protected], [email protected],
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Semantically Enhanced Web Personalization Approaches and Techniques
Semantically Enhanced Web Personalization Approaches and Techniques Dario Vuljani, Lidia Rovan, Mirta Baranovi Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, HR-10000 Zagreb,
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
The Seven Practice Areas of Text Analytics
Excerpt from: Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications G. Miner, D. Delen, J. Elder, A. Fast, T. Hill, and R. Nisbet, Elsevier, January 2012 Available now:
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia Sungchul Kim POSTECH Pohang, South Korea [email protected] Abstract In this paper we propose a method to automatically
CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis
CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Mining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
Simple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
RRSS - Rating Reviews Support System purpose built for movies recommendation
RRSS - Rating Reviews Support System purpose built for movies recommendation Grzegorz Dziczkowski 1,2 and Katarzyna Wegrzyn-Wolska 1 1 Ecole Superieur d Ingenieurs en Informatique et Genie des Telecommunicatiom
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
How To Write A Summary Of A Review
PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Amir H. Moin and Günter Neumann Language Technology (LT) Lab. German Research Center for Artificial Intelligence (DFKI)
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Natural Language Processing in the EHR Lifecycle
Insight Driven Health Natural Language Processing in the EHR Lifecycle Cecil O. Lynch, MD, MS [email protected] Health & Public Service Outline Medical Data Landscape Value Proposition of NLP
Neural Word Embedding as Implicit Matrix Factorization
Neural Word Embedding as Implicit Matrix Factorization Omer Levy Department of Computer Science Bar-Ilan University [email protected] Yoav Goldberg Department of Computer Science Bar-Ilan University [email protected]
Automatic Identification of Arabic Language Varieties and Dialects in Social Media
Automatic Identification of Arabic Language Varieties and Dialects in Social Media Fatiha Sadat University of Quebec in Montreal, 201 President Kennedy, Montreal, QC, Canada [email protected] Farnazeh
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
Incorporating Window-Based Passage-Level Evidence in Document Retrieval
Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological
Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources
Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle
Applying Machine Learning to Stock Market Trading Bryce Taylor
Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,
Why is Internal Audit so Hard?
Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets
Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum
Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?
Deposit Identification Utility and Visualization Tool
Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in
Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search
Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search Orland Hoeber and Hanze Liu Department of Computer Science, Memorial University St. John s, NL, Canada A1B 3X5
Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie
Exploiting Bilingual Translation for Question Retrieval in Community-Based Question Answering
Exploiting Bilingual Translation for Question Retrieval in Community-Based Question Answering Guangyou Zhou, Kang Liu and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection
Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Jian Qu, Nguyen Le Minh, Akira Shimazu School of Information Science, JAIST Ishikawa, Japan 923-1292
Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION
Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych and Torsten Zesch UKP Lab, Technische Universität Darmstadt Information
An Introduction to Random Indexing
MAGNUS SAHLGREN SICS, Swedish Institute of Computer Science Box 1263, SE-164 29 Kista, Sweden [email protected] Introduction Word space models enjoy considerable attention in current research on semantic indexing.
