Methods for Augmenting Semantic Models with Structural Information for Text Classification


Jonathan M. Fishbein¹ and Chris Eliasmith¹,²

¹ Department of Systems Design Engineering
² Department of Philosophy
Centre for Theoretical Neuroscience, University of Waterloo
200 University Avenue West, Waterloo, Canada
{jfishbei,celiasmith}@uwaterloo.ca

Abstract. Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words or concepts. Past attempts to encode syntactic structure have treated part-of-speech information as just another word-like feature, and have been shown to be less effective than non-structural approaches. Here, we investigate three methods to augment semantic modelling with syntactic structure, each of which encodes the structure across all features of the document vector while preserving text semantics. We present classification results for these methods versus the Bag-of-Concepts semantic modelling representation to determine which method best improves classification scores.

Keywords: Vector Space Model, Text Classification, Part-of-Speech Tagging, Syntactic Structure, Semantics.

1 Introduction

Successful text classification is highly dependent on the representations used. Currently, most approaches to text classification adopt the bag-of-words document representation, in which the frequency of occurrence of each word is considered the most important feature. This is largely because past approaches that tried to include more complex structure or semantics have often been found lacking [1], [2]. However, these negative conclusions are premature. Recent work that employs automatically generated semantics using Latent Semantic Analysis and Random Indexing has shown such representations to be more effective than bag-of-words approaches in some circumstances [3].
As a result, it seems more a matter of determining how best to represent semantics than of whether or not semantics is useful for classification. Here we demonstrate that the same is true of including syntactic structure. A recent comprehensive survey suggests that including parse information will not help classification [2]. However, the standard method for including syntactic information is simply to add it as a completely new, independent feature of the document. In contrast, the methods we investigate in this paper take a very different approach to feature generation by distributing syntactic information across the document representation, thus avoiding the limitations of past approaches.

C. Macdonald et al. (Eds.): ECIR 2008, LNCS 4956, pp. 575-579, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Bag-of-Concepts and Context Vectors

The Bag-of-Concepts (BoC) [3] text representation is a recent scheme meant to address the deficiencies of Bag-of-Words (BoW) representations by implicitly representing synonymy relations between document terms. BoC representations are based on the intuition that the meaning of a document can be considered the union of the meanings of the terms in that document. This is accomplished by generating a term context vector for each term within the document, and generating a document vector as the weighted sum of the context vectors of the terms contained within that document. Reducing the dimensionality of document term frequency count vectors is a key component of BoC context vector generation. We use the random indexing technique [4] to produce these context vectors in a more computationally efficient manner than using principal component analysis (PCA).

BoC representations still ignore the large amount of syntactic data in documents that is not captured implicitly through word context co-occurrences. For instance, although BoC representations can successfully model some synonymy relations, since different words with similar meanings will occur in the same contexts, they cannot model polysemy relations. For example, consider the word "can". Even though the verb form (i.e., "I can perform that action.") and the noun form (i.e., "The soup is in the can.") occur in different contexts, the generated term vector for "can" will be a combination of these two contexts in BoC. As a result, the representation will not be able to correctly model polysemy relations involving a word that can be used as different parts of speech.
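As an illustrative aside, the BoC pipeline described above, context vectors built by random indexing and document vectors formed as weighted sums, can be sketched as follows. This is a minimal sketch: the window size, sparsity, and term-frequency weighting are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of Bag-of-Concepts document vectors via random indexing.
# DIM, NONZERO, the context window, and tf weighting are illustrative choices.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
DIM = 512     # dimensionality of all context vectors (as in the paper)
NONZERO = 8   # number of +/-1 entries in each sparse random index vector

def index_vector():
    """Sparse ternary random vector: mostly zeros, a few +/-1 entries."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def context_vectors(docs, window=2):
    """A term's context vector = sum of index vectors of its neighbours."""
    index = {}   # term -> fixed random index vector
    ctx = {}     # term -> accumulated context vector
    for doc in docs:
        for i, term in enumerate(doc):
            ctx.setdefault(term, np.zeros(DIM))
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    neighbour = doc[j]
                    if neighbour not in index:
                        index[neighbour] = index_vector()
                    ctx[term] += index[neighbour]
    return ctx

def boc_document_vector(doc, ctx):
    """BoC document vector: term-frequency-weighted sum of context vectors."""
    counts = Counter(doc)
    return sum(n * ctx[t] for t, n in counts.items())

docs = [["the", "soup", "is", "in", "the", "can"],
        ["i", "can", "perform", "that", "action"]]
ctx = context_vectors(docs)
dvec = boc_document_vector(docs[0], ctx)
print(dvec.shape)  # (512,)
```

Note how the single context vector for "can" mixes its noun and verb contexts: exactly the polysemy problem the binding methods below address.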
3 Methods for Syntactic Binding

To solve the problem of modelling certain polysemy relations in natural language text, we need a representation scheme that can encode both the semantics and the syntax of documents. We limit syntactic information to a collapsed part-of-speech (PoS) tag set (i.e., nouns, verbs, pronouns, prepositions, adjectives, adverbs, conjunctions, and interjections), and look at three methods to augment BoC semantic modelling with this information.

3.1 Multiplicative Binding

The simplest method that we investigate is multiplicative binding. For each PoS tag in our collapsed set, we generate a unique random vector of the same dimensionality as the term context vectors. For each term context vector, we perform element-wise multiplication between that term's context vector and its identified PoS tag vector to obtain our combined representation for the term. Document vectors are then created by summing the document's combined term vectors.

3.2 Circular Convolution

Combining vectors using circular convolution is motivated by Holographic Reduced Representations [5]. For each PoS tag in our collapsed set, we generate a unique random vector of the same dimensionality as the term context vectors. For each term context vector, we perform circular convolution, which binds two vectors A = (a_0, a_1, ..., a_{n-1}) and B = (b_0, b_1, ..., b_{n-1}) to give C = (c_0, c_1, ..., c_{n-1}), where C = A ⊛ B with c_j = Σ_{k=0}^{n-1} a_k b_{j-k} for j = 0, 1, ..., n-1 (all subscripts are modulo n). Document vectors are then created by summing the document's combined term vectors.

There are a number of properties of circular convolution that make it ideal as a binding operation. First, the expected similarity between a convolution and its constituents is zero, thus differentiating the same term acting as different parts of speech in similar contexts. As well, similar semantic concepts bound to the same part of speech will result in similar vectors, thereby usefully preserving the original semantic model.

3.3 Text-Based Binding

Text-based binding combines a word with its PoS identifier before the semantic modelling is performed. This is accomplished by concatenating each term's identified PoS tag name with the term's text. The concatenated text is then used as the input for random indexing to determine the term's context vector. Document vectors are then created by summing the document's term vectors.

4 Experimental Setup

We performed Support Vector Machine (SVM) classification experiments¹ to investigate the classification effectiveness of our syntactic binding methods compared against the standard BoC representation.
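Before turning to the experiments, the two vector-binding operations of Sections 3.1 and 3.2 can be sketched as follows. This is an illustrative sketch only: the Gaussian vector distribution and FFT-based convolution are standard choices for HRR-style binding, not details given in the paper.

```python
# Sketch of multiplicative binding and circular-convolution binding between a
# term context vector and a random PoS tag vector. Distributions are assumed.
import numpy as np

rng = np.random.default_rng(1)
n = 512
term = rng.normal(0, 1 / np.sqrt(n), n)      # a term's context vector
pos_tag = rng.normal(0, 1 / np.sqrt(n), n)   # random vector for one PoS tag

# 3.1 Multiplicative binding: element-wise product of the two vectors.
mult_bound = term * pos_tag

# 3.2 Circular convolution: c_j = sum_k a_k * b_{(j-k) mod n}, computed in
# O(n log n) via the FFT (convolution theorem), as is usual for HRRs.
def circ_conv(a, b):
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

conv_bound = circ_conv(term, pos_tag)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The convolution is nearly orthogonal to its constituents, so the same term
# bound to different PoS tags yields dissimilar vectors.
print(round(cos(conv_bound, term), 3))
```

The near-zero cosine similarity printed at the end illustrates the first property claimed above: a convolution does not resemble either of its constituents.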
For the experiments in this paper, we used a linear SVM kernel function (with a slack parameter of 160.0) and fixed the dimensionality of all context vectors at 512 dimensions². We used the 20 Newsgroups corpus³ as the natural language text data for our experiments. In these classification experiments, we used a one-against-all learning method

¹ We used the SVMperf implementation, which optimizes for F1 classification score, available at http://svmlight.joachims.org/svm_perf.html.
² The dimensionality of the vectors has been chosen to be consistent with other work. There is as yet no systematic characterization of the effect of dimensionality on performance.
³ Available at http://people.csail.mit.edu/jrennie/20newsgroups/.

employing 10-fold stratified cross-validation⁴. The SVM classifier effectiveness was evaluated using the F1 measure. We present our aggregate results for the corpus as macro-averages⁵ over each document category for each classifier.

5 Results

Table 1 shows the macro-averaged F1 scores for all our syntactic binding methods and the baseline BoC representation under SVM classification. All of the syntactic binding methods produced higher F1 scores than the BoC representation, showing that integrating PoS data with a text representation method is beneficial for classification. The circular convolution method produced the best score, with a macro-averaged F1 score of 58.19, which was calculated to be statistically significant at a 93.7% confidence level.

Table 1. Macro-averaged SVM F1 scores of all methods

Syntactic Binding Method | F1 Score
BoC (No Binding)         | 56.55
Multiplicative Binding   | 57.48
Circular Convolution     | 58.19
Text-based Binding       | 57.41

Fig. 1. Learning curves of SVM F1 scores of all methods (y-axis: F1 score, 51-59; x-axis: percentage of corpus used for training, 10-90%).

⁴ This cross-validation scheme was chosen as it better reflects the statistical distribution of the documents, although it produces lower F1 scores.
⁵ Since the sizes of the document categories are roughly the same, micro-averaging yields similar results and has been omitted for brevity.
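As an aside, the macro-averaged F1 measure used above, per-category F1 from one-against-all decisions, averaged with equal weight per category, can be sketched as follows. The labels are toy data for illustration only, not results from the 20 Newsgroups experiments.

```python
# Sketch of macro-averaged F1: compute F1 per category, then take the
# unweighted mean over categories. Toy labels are illustrative assumptions.
from collections import defaultdict

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, y_pred):
    counts = defaultdict(lambda: [0, 0, 0])  # class -> [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted class
            counts[t][2] += 1  # false negative for the true class
    return sum(f1(*counts[c]) for c in counts) / len(counts)

y_true = ["comp", "comp", "sci", "sci", "talk", "talk"]
y_pred = ["comp", "sci",  "sci", "sci", "talk", "comp"]
print(round(macro_f1(y_true, y_pred), 3))
```

Because every category contributes equally to the average, macro-averaging does not let large categories dominate; as footnote 5 notes, with roughly equal category sizes it closely tracks micro-averaging.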

The learning curves for the methods are shown in Fig. 1. The graph shows that circular convolution consistently produces better SVM classification results than the other methods once 20% or more of the data is used for training. This result indicates that in situations where there is limited class data from which to learn a classification rule, combining the PoS data using circular convolution is the most effective way to help the classifier distinguish the classes.

6 Conclusions and Future Research

Of all the methods investigated, the circular convolution method of binding a document's PoS information to its semantics was found to be the best. The circular convolution method had the best SVM F1 score and was superior across the various amounts of data used to train the classifiers. Our results suggest areas of further research. One area is to investigate alternative binding schemes to augment text semantics, since all of the methods can bind more information than just PoS data. As well, further investigations using different corpora, such as the larger Reuters corpus, should be undertaken to examine the effectiveness of the syntactic binding methods in different text domains.

Acknowledgment. This research is supported in part by the Canada Foundation for Innovation, the Ontario Innovation Trust, the Natural Science and Engineering Research Council and the Open Text Corporation.

References

1. Kehagias, A., et al.: A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems 21(3), 227-247 (2003)
2. Moschitti, A., Basili, R.: Complex Linguistic Features for Text Classification: A Comprehensive Study. In: McDonald, S., Tait, J.I. (eds.) ECIR 2004. LNCS, vol. 2997, pp. 181-196. Springer, Heidelberg (2004)
3. Sahlgren, M., Cöster, R.: Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization. In: Proceedings of the 20th International Conference on Computational Linguistics, pp. 487-493 (2004)
4. Sahlgren, M.: An Introduction to Random Indexing. In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (2005)
5. Plate, T.A.: Holographic Reduced Representation: Distributed Representation for Cognitive Structures. CSLI Publications, Stanford (2003)