Weakly-Supervised Techniques for the Analysis of Evaluation in Text. Jonathon Read


Weakly-Supervised Techniques for the Analysis of Evaluation in Text. Jonathon Read. Submitted for the degree of Doctor of Philosophy, University of Sussex, July 2009.

Declaration

I hereby declare that this thesis has not been and will not be submitted in whole or in part to another University for the award of any other degree. Signature: Jonathon Read

UNIVERSITY OF SUSSEX

Jonathon Read, Doctor of Philosophy

Weakly-Supervised Techniques for the Analysis of Evaluation in Text

Summary

A common approach to sentiment analysis is to employ supervised machine-learning methods to acquire prominent features of sentiment. However, the success of these methods is dependent on the domain, topic and time-period represented by the training data. This thesis explores an alternative approach to sentiment analysis, whereby the polarity of text is found by comparing the similarity of its constituents with prototypical examples of positivity and negativity. The techniques proposed are evaluated on various tasks in sentiment analysis and, while they are inferior to well-trained supervised techniques, they perform consistently across different domains, topics and time-periods.

The second aspect of this thesis concerns Appraisal, a functional linguistic theory of evaluation in English. The Appraisal theory describes a hierarchy of the language used to communicate evaluation, detailing types of Attitude (how writers communicate their point of view), Engagement (how writers align themselves with respect to the position of others) and Graduation (how writers amplify or diminish their opinions), the recognition of which may assist in performing other tasks in sentiment analysis. The thesis describes the creation of a corpus of book reviews annotated according to the Appraisal theory, and an assessment of the difficulty of performing analyses of Appraisal by way of an inter-annotator agreement study. The corpus is used to evaluate the weakly-supervised methods' performance when identifying Appraisal-bearing words. The methods are then used to investigate the application of Appraisal recognition to the broader field of sentiment analysis.

Acknowledgements

Most importantly, thanks go to my supervisor, John Carroll. John made this thesis possible by organising my studentship from the EPSRC. He has offered careful and considered advice throughout the course of my studies and taught me research method. Thanks also to Diana McCarthy, who kindled my interest in sentiment analysis with a pointer to a research paper and later offered advice as a member of my thesis committee. Bill Keller also served on the committee, and was particularly helpful in advising on the design of the annotation study. David Hope's help went beyond what one might reasonably expect from a fellow D.Phil. student, in spending far too many sunny summer afternoons annotating book reviews. Kentaro Inui kindly provided exposure to research in Japan by arranging my visit to the Nara Institute of Science and Technology. Thanks to all the NLCL faculty, students and visitors for informative seminars and discussions.

I am also grateful to the following people, who provided assistance by making their data/software available: Thorsten Joachims (SVMLight), Roy Lipski (newswire articles labelled with sentiment), Bo Pang (movie reviews labelled with sentiment), John Trenkle (language classification software).

Thanks also to my friends, who bought me more than a few beers over the course of my frugal student years. Special thanks to Lis for her enthusiasm in the face of my pessimism about my research, and provision of ample distraction while shopping for plants.

For my parents; this is the outcome of all the opportunities you have provided me.

Contents

List of Tables
List of Figures
1 Introduction: Background; Overview of the Thesis
2 Background: Subjectivity Analysis; Annotating Expressions of Subjectivity; Learning Subjective Language; Multi-lingual Subjectivity Analysis Resources; Sentiment Analysis; Supervised Machine-learning Approach; Lexical Approach; Determining Degrees of Sentiment; Labelling On-Topic Sentiments; The Effect of Context on Sentiment; Granularity of Sentiment Analysis; Opinion Mining; Opinion Extraction; Opinion-Oriented Question Answering; Identifying Opinion Sources; Identifying Opinion Topic Coreference; Affect Recognition; Lexical Approaches; A Common-sense Approach; Supervised Machine-Learning for Mood Classification
3 Dependency in Supervised Techniques for Sentiment Classification: Dependencies in Sentiment Classification; General Experimental Setup; Topic Dependency; Domain Dependency; Temporal Dependency; Sentiment Classification using Emoticons; Emoticon Corpus Construction; Emoticon-trained Sentiment Classification; Discussion; Coverage; Noise in Usenet Article Extracts; Related Work; Resolving Dependencies in Supervised Sentiment Classification; Emoticons for Sentiment Analysis and Affect Recognition
4 Weakly Supervised Techniques for Sentiment Analysis: Applying Measures of Word Similarity for Weakly Supervised Text Classification; Lexical Association; Semantic Spaces; Distributional Similarity; Experimental Set-up; Selecting Class Prototypes; Constructing a Polarity Lexicon; Sentiment Classification of Movie Reviews; Performance across Topics, Domains and Time-Periods; Scoring Sentences According to Strength of Valence and Affect; Discussion
5 Introduction to Appraisal Theory: Systemic Functional Linguistics; Attitude: Ways of Feeling; Types of attitude; Attitudinal realisations; Engagement: Appraisals of appraisals; Dialogic Expansion; Dialogic Contraction; Graduation: Strength of evaluation; Focus; Force; Computational uses of Appraisal
6 Annotating Expressions of Appraisal: Annotation Methodology; Inter-annotator Agreement; Text Anchor and Appraisal Type Agreement; Text Anchor and Appraisal Type Contingencies; Measuring inter-annotator agreement beyond chance; Ambiguous Appraisal-bearing expressions; Measuring agreement amongst many annotators; Agreement in the Ambiguous Term Categorisation Task; Creating a Gold-Standard for Appraisal Analysis
7 Computational Appraisal Analysis: Identifying Appraisal-bearing Words and Expressions; Classifying Expressions of Appraisal; Extracting Appraisal-bearing Words; Determining the Polarity of Attitude; Determining the Direction of Graduation; Discussion; Appraisal Extraction for Sentiment Analysis; Classifying Reviews by Sentiment; Scoring Sentences by Strength of Sentiment and Affect
8 Conclusions: Thesis Contributions; Dependency in Supervised Methods for Sentiment Classification; Weakly-supervised Methods for Sentiment Analysis; Computational Appraisal Analysis; Future Work
Bibliography
A Appraisal Corpus Articles
B Ambiguous Expressions Questionnaire
C Appraisal Corpus Excerpts
D Appraisal Classification Contingency Tables
E Appraisal Extraction Output Example

List of Tables

2.1 Riloff and Wiebe's (2003) syntactic templates and examples of patterns of subjective expressions
Pang and Lee's (2008) exploration of sentiment classification of movie reviews using keywords selected by human judges and simple statistics, with percentage accuracy and ties in number of keywords found
Patterns of part-of-speech tags used by SO-PMI-IR (Turney, 2002) for extracting phrases from problem documents
Attributes for the two main MPQA annotation types
Accuracies of supervised classifiers when training and testing on different topics. Best performance on a test set for each model is highlighted in bold
Accuracies of supervised classifiers when training and testing on different domains. Best performance on a test set for each model is highlighted in bold
The top twenty most divergent features found when comparing sentiment probabilities in the Newswire and Polarity 1.0 data sets
Accuracies of supervised classifiers when training and testing on different time-periods. Best performance on a test set for each model is highlighted in bold
Examples of emoticons and the frequency of usage observed in Usenet articles, in percent
Accuracy of Emoticon-trained sentiment classifiers across topics
Accuracy of Emoticon-trained sentiment classifiers across domains
Accuracy of Emoticon-trained sentiment classifiers across time periods
Coverage of classifiers, in percent
4.1 Lund and Burgess's (1996) example matrix for "the horse raced past the barn fell", computed for a window width of five words
Distance metrics employed by Levy et al. (1998)
Prototypes selected for Sentiment classes
Prototypes selected for the six Basic emotions
The performance of weakly supervised methods in classifying POSITIV and NEGATIV entries in the General Inquirer, with respect to polarity
The performance of weakly supervised methods in determining the sentiment of movie reviews in Pang and Lee's (2004) data set
The accuracies of supervised and weakly-supervised methods in classifying newswire articles according to sentiment in various topics, with the harmonic means of the accuracies
The accuracies of supervised and weakly-supervised methods in classifying documents in the domains of newswire articles and movie reviews, with the harmonic means of the accuracies
The accuracies of supervised and weakly-supervised methods in classifying movie reviews from data sets representing different time-periods, with the harmonic means of the accuracies
The mean annotators' correlation scores for each type in the Affective Text shared task (Strapparava and Mihalcea, 2007)
The performance of the weakly-supervised techniques in the Valence test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007). Systems are in alphabetical order, with highest performers in each measure highlighted in bold
The performance of the weakly-supervised techniques in the Affect sub task compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007) and the results reported by Strapparava and Mihalcea (2008) (SM). The best results in each measure and each emotion are highlighted in bold
The mean performance across all six emotions of the weakly-supervised techniques in the Affect test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007) and the results reported by Strapparava and Mihalcea (2008) (SM). The best results in each measure are highlighted in bold
4.14 Contingency tables of the labels output by the weakly-supervised methods in the word classification task. Rows indicate the labels chosen by the method while columns represent the correct label. The distribution of the labels in the test set was 44.6% Positive and 55.4% Negative
The word similarity methods' coverage of types and instances in the document level sentiment classification task
Illustrations of Affect
Illustrations of Judgement
Illustrations of Appreciation
MUC-7 test score definitions (Chinchor, 1998)
MUC-7 test scores, evaluating the agreement in text anchors selected by the annotators. When considering agreement in text anchors there is only one class of interest, hence the SUB measure will always be zero. The average between the two annotators is calculated using the harmonic mean
Harmonic means of MUC-7 test scores evaluating the agreement in text anchors selected by the annotators for various matching constraints
The contingency table showing d's choices (columns) in terms of percentage of j's annotations (rows). The MIS column indicates the percentage of j's annotations where d did not provide a match
The contingency table showing j's choices (columns) in terms of percentage of d's annotations (rows). The MIS column indicates the percentage of d's annotations where j did not provide a match
κ values at the different levels of the Appraisal taxonomy over all annotation types and over Attitude, Engagement, and Graduation types only
Interpretations of κ values, suggested by Landis and Koch (1977)
Senses of the verb "abandon" listed in WordNet 2.1, accompanied by an Appraisal class proposed by annotator j based on the gloss
All possible combinations of questionnaire respondents with English as a first language and/or familiarity with Appraisal, with corresponding respondent frequencies (n) and Kappa (κ) scores. All values are significant at p <
Prototypes selected for each Appraisal type
7.2 The distribution of annotations in the development and test data sets, according to the number of grams
The distribution of Appraisal types found in the development data, at various levels of the Appraisal hierarchy
The performance of word similarity algorithms in classifying expressions according to the various levels of the Appraisal framework. (w) indicates weighted versions of the algorithms
The performance of word similarity algorithms in extracting features according to the various levels of the Appraisal framework. (w) indicates weighted versions of the algorithms
The mean and variance of the cross-validated optimal thresholds for each method at each level of the Appraisal hierarchy. (w) indicates weighted versions of the algorithms
The performance of word similarity algorithms in determining the polarity of instances of Attitude
Prototypical words of Graduation
The performance of word similarity algorithms in determining the direction of instances of Graduation
Descriptive statistics summarising the unexpectedness of the selections made by each method
The calculation of Lexical Association scores using Pointwise Mutual Information, for the word "but" for Counter and Deny
Similarity scores for the prototypes of Happiness and Security from the Semantic Space algorithm for the words "happy" and "unhappy"
A summary of proposed heuristics based on the Appraisal Theory types
The optimal weights for each class when applying Appraisal heuristics on the sentiment classification task
The performance of various approaches to determining the sentiment of movie reviews in Pang and Lee's (2004) Polarity 2.0 dataset. Items marked with are supervised approaches
The performance of the weakly-supervised techniques in the Valence test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007). Systems are in alphabetical order, with highest performers in each measure highlighted in bold
7.17 The mean performance across all six emotions of the weakly-supervised techniques in the Emotions test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007) and the results reported by Strapparava and Mihalcea (2008) (SM). Systems are in alphabetical order, with highest performers in each measure highlighted in bold
D.1 A contingency table with unexpectedness values of the selections made by the unweighted Lexical Association method when performing the Appraisal Classification task. For ease of reading only the contingencies where u > x + σ are displayed. Columns indicate the type chosen by the method while the correct type is listed by row
D.2 A contingency table with unexpectedness values of the selections made by the unweighted Semantic Space method when performing the Appraisal Classification task. For ease of reading only the contingencies where u > x + σ are displayed. Columns indicate the type chosen by the method while the correct type is listed by row
D.3 A contingency table with unexpectedness values of the selections made by the unweighted Distributional Similarity method when performing the Appraisal Classification task. For ease of reading only the contingencies where u > x + σ are displayed. Columns indicate the type chosen by the method while the correct type is listed by row

List of Figures

2.1 Hatzivassiloglou and McKeown's (1997) example of how the adjective "simplistic" deviates from its semantic group, in that its semantic orientation is opposite to the highly related word "simple"
Change in performance of the supervised classifiers when constraining the number of reviews permitted of any given movie, in percent
Change in performance of the SVM classifier on held out reviews from Polarity 1.0, varying training set size and window context size. The datapoints represent 2,200 experiments in total
A subsumption hierarchy describing the types of relations output by the RASP system (Briscoe et al., 2006)
The results of each optimising General Inquirer test carried out on the Lexical Association method
The results of each optimising General Inquirer test carried out on the Semantic Space algorithm
The results of each optimising General Inquirer test carried out on the Distributional Similarity algorithm
The results of each optimising Movie Review test carried out on the Lexical Association method
The results of each optimising Movie Review test carried out on the Semantic Space algorithm
The results of each optimising Movie Review test carried out on the Distributional Similarity algorithm
The results of each optimising Affective Text valence test carried out on the Lexical Association method
4.9 The results of each optimising Affective Text valence test carried out on the Semantic Space method
The results of each optimising Affective Text valence test carried out on the Distributional Similarity method
The results of each optimising Affective Text emotion test carried out on the Lexical Association method
The results of each optimising Affective Text emotion test carried out on the Semantic Space method
The results of each optimising Affective Text emotion test carried out on the Distributional Similarity method
Frequencies of annotations by valence score in the test set compared with the valence score assigned by the Semantic Space method
A systems network depicting the structure of Appraisal resources (Martin and White, 2005)
The Cognitive Structure of Emotions (from Ortony et al., 1988)
Types of structural prosody in discourse. Examples from Martin and White (2005)
The attitude system
Strategies for inscribing and invoking attitude (Martin and White, 2005)
The engagement system
The graduation system
The custom-made Appraisal annotation tool
The Appraisal framework showing the hierarchical levels. Labels are accompanied by the harmonic mean of the F1 of the annotators for appraisal type/overall types for that level
The harmonic mean of the recall (REC) exhibited by the annotators at the various levels of the Appraisal taxonomy
The harmonic mean of the precision (PRE) exhibited by the annotators at the various levels of the Appraisal taxonomy
The harmonic mean of the annotators' substitution (SUB) rates at the various levels of the Appraisal taxonomy
The harmonic mean of the annotators' error (ERR) rates at the various levels of the Appraisal taxonomy
7.1 The results of each optimising test carried out on the unweighted lexical association algorithm, using lemmatised tokens and PMI
The results of each optimising test carried out on the weighted lexical association algorithm, using lemmatised tokens and PMI
The results of each optimising test carried out on the unweighted semantic space algorithm, using lemmatised tokens and PMI
The results of each optimising test carried out on the weighted semantic space algorithm, using lemmatised tokens and PMI
The results of each optimising test carried out on the distributional similarity algorithm

Chapter 1 Introduction

1.1 Background

The past decade has witnessed a swell of interest in the analysis of authors' opinions as expressed in written documents. This is in no small part due to the proliferation of electronically-published opinion, which presents a wealth of easily-accessible text of interest to governments, companies and individuals seeking to automatically distill public opinion. This thesis studies aspects of the computational analysis of opinion in text.

Opinion is conveyed in text in a wide variety of domains and genres. Prior to the proliferation of the Internet, most publicly-available opinion was limited to reports and editorials in newspapers. More recently, however, the World Wide Web has allowed both traditional providers and the general public to distribute written content on a scale not previously possible. Newspapers reproduce much of their content online, while the blogosphere enables suitably skilled Web users to easily publish their thoughts for the consideration of the rest of the community. There are numerous professional and enthusiast review websites, and many online retailers, such as Amazon and iTunes, encourage their customers to review their purchases for the benefit of other shoppers. There is such a wealth of product reviews, in fact, that it has prompted the development of opinion-aggregation websites such as Metacritic.com. A further important source of opinion may be found in collections of e-mails received by customer relations departments.

Reliable methods of automatically analysing evaluative language would be useful in a number of application areas which currently rely on manual analysis. For example, stock market traders often employ manual analyses of sentiment in news articles about a company in order to predict fluctuations in its share price. However, a high degree of accuracy can be obtained automatically by training supervised machine learning classifiers

such as Naïve Bayes or Support Vector Machines (Pang et al., 2002). The same technology might be applied by opinion-aggregation websites, whose staff manually collate the scores of reviews for products such as films, music, games and television programmes in order to derive averages. Sentence-level sentiment classification may be of benefit to researchers investigating social networks in academic and online communities. Often such networks are constructed using citations made by authors (Wasserman and Faust, 1994); classifying these citations by sentiment would enable the network analysis to discriminate between favourable and unfavourable references.

Other applications require more detail about the types of opinions expressed. For example, political parties and governmental departments are often interested in understanding public opinion on some contentious issue, and so commission person-to-person surveys. Similarly, traditional business market research techniques involve conducting surveys or organising focus group sessions to collect the opinions of a small number of members of the public. These techniques are time-consuming and costly, and findings determined in this way are questionable if the participants are not sufficiently representative of the public. Instead, these tasks could be accomplished automatically. Opinion-oriented information retrieval techniques could obtain articles relevant to a given topic such as a political issue or a company's product. Opinion-mining techniques might then be employed to identify facets of opinions expressed within the document, including the holder, target and nature of the opinion (Wiebe et al., 2003). These expressions of opinion could then be clustered in order to generate descriptive statistics to summarise public opinion. This approach might also be insufficiently representative of the public; however, this could be mitigated by employing techniques to identify author demographics (Liu and Mihalcea, 2007; Argamon et al., 2009).

An application of interest to government intelligence agencies is that of automatically monitoring dissent on extremist websites. For example, Abbasi and Chen (2007) described how affect recognition techniques can be employed to automatically analyse the forums of extremist groups.

Techniques to process evaluative language would also be of benefit to other avenues of computing research that are concerned with interaction with humans. For instance, analysis of emotion in text might be useful to researchers working in Affective Computing (Picard, 1997). Affective Computing is a branch of human-computer interaction research that seeks to adapt interfaces to users' emotional state. This is typically achieved by

monitoring speech patterns, facial expressions and body gestures, but a deeper analysis of users' emotions and opinions could provide more detailed information about their interactive experience. Other aspects of human-computer interaction that could benefit from a textual analysis of emotion include: providing clues for prosody in text-to-speech synthesis (Alm et al., 2005); computationally generated humour (Stock and Strapparava, 2005); generation of gestures or facial expressions in avatars and robots (Nakano et al., 2005); and generation of emotive or persuasive language (Strapparava and Mihalcea, 2008). Such analysis could also be applied in Expressive Artificial Intelligence art projects (Mateas, 2001), particularly in interactive dramas in which computer-controlled characters need to respond appropriately to the opinions and emotions expressed by the player.

Spertus (1997) described a system for detecting flames in online forums. This might be generalised to detect inappropriate language in formal communications. For example, in much the same way as applications highlight incorrect spelling and grammar, one could develop word processing applications that warn users of unsuitably affective language, such as being overly familiar with customers or aggressive towards colleagues.

Automatic processing of the kind mentioned above is typically achieved using techniques rooted in one or more of: supervised machine learning, weakly-supervised machine learning, and linguistically-inspired heuristics. Pang et al. (2002), for example, classified reviews as being positive or negative in sentiment using Naïve Bayes, Maximum Entropy and Support Vector Machine classifiers. Turney (2002) showed how this could also be achieved (with a lesser degree of accuracy) using weakly-supervised machine learning by comparing the similarity of target words with prototypical examples of positive and negative sentiment. Polanyi and Zaenen (2004) explored the application of contextual valence shifters: lexical items noted as having an effect on the sentiment conveyed by a sentence.

1.2 Overview of the Thesis

This thesis is particularly concerned with weakly-supervised methods for the classification of text according to its positivity or negativity. Previous research has found that supervised machine-learning techniques can be very effective at this task. A number of studies, however, have found that this performance is dependent on a good match between training and testing data with respect to topic. The thesis shows that the data must also match with respect to domain and time-period, and so proposes and evaluates classification techniques based on the similarity of words. These techniques are only weakly-supervised as

they simply require a small set of prototypical words and a large unlabelled corpus of general text, and therefore potentially do not suffer from the types of dependency exhibited by supervised techniques. The experiments reported in the thesis show that, while the performance of the weakly-supervised methods is inferior to traditional supervised techniques, their accuracy is reasonably consistent across domains, topics and time-periods.

Applications such as those described above can potentially be supported by a number of frameworks that describe aspects of evaluative and emotional language, such as types of emotion and opinions, and the variables that affect their intensity. These frameworks typically originate in the fields of cognitive science, linguistics and psychology, and could be informative for computational experiments with evaluative language. Ekman (1993), for instance, derived from facial expressions a list of basic emotions which can provide a set of classes for affect recognition. Gratch and Marsella (2004) developed a cognitive model of appraisal that considered several variables affecting the strength of appraisal, such as the relevance and urgency of an event, and the degree to which the ego is involved. This model was created for use by avatars simulating an emotional reaction, but could be used to inform analyses of evaluative language if suitable indicators of these variables could be found. Wiebe et al. (2005) created a scheme for the annotation of the mental and emotional state conveyed by text. Their scheme distinguished between explicit expressions such as "The U.S. fears a spill-over" and subjective expressive elements where the affective state is implied by words that carry negative connotations (e.g. "We foresaw electoral fraud but not daylight robbery"). Hyland (1998) described the linguistic phenomenon of hedging, where writers express the degree to which an opinion is speculative or unconfirmed (e.g. "perhaps" or "somewhat"). Di Marco and Mercer (2004) used features based on hedging to determine the nature of the relationships between scientific articles.

The thesis extends the breadth of analysis of evaluation in language by investigating the computational analysis of text according to the Appraisal Theory (Martin and White, 2005), a Systemic Functional Linguistic theory of evaluation which is couched in terms of English, but potentially applicable to other languages. It distinguishes between types of attitude (personal affect, judgement of people and appreciation of objects) and describes how authors use language to communicate their engagement with other writers, and to amplify or diminish the strength of their opinions. Knowledge of these types of language could enhance existing techniques for the analysis of evaluation in language by considering the type and strength of evaluation communicated, and identifying when and how authors report the opinions of others. The thesis presents a method employed to manually

annotate a corpus of book reviews according to the theory, the performance of the weakly-supervised methods in performing an analysis according to the Appraisal framework, and an application of the theory to the task of sentiment classification. The content presented in each chapter of this thesis is summarised below.

Chapter 2: Background
This chapter reviews previous research conducted in the area of automatically analysing evaluative language in text. Four related lines of research are considered: subjectivity analysis, which seeks to distinguish between facts and expressions of emotion or opinion; sentiment analysis, which focuses on whether text is generally positive or negative in feeling; opinion mining, which extends subjectivity analysis by identifying aspects of opinion such as holders and targets; and affect recognition, which attempts to recognise and label different types of emotions.

Chapter 3: Dependency in Supervised Techniques for Sentiment Classification
This chapter presents experiments that demonstrate that good performance of supervised machine learning techniques for sentiment classification is dependent on a good match between the training and testing data, with respect to domain, topic and the time-period represented by that data. It then proposes that these dependencies might be mitigated by using a body of general text, and therefore discusses the results of training supervised classifiers on text collected by extracting paragraphs containing a smile or frown emoticon from Usenet postings.

Chapter 4: Weakly Supervised Techniques for Sentiment Analysis
This chapter proposes that the best way to avoid the problems of domain, topic and time-period dependency in sentiment analysis is to instead employ word similarity methods that relate problem words in a very large corpus to prototypical examples of sentiment. It reviews three word similarity techniques: Lexical Association, Semantic Spaces and Distributional Similarity. Each of these methods is applied to three tasks: constructing a polarity lexicon, in which entries are labelled as being positive or negative in sentiment; classifying documents as being positive or negative; and scoring sentences according to the strength of sentiment and six basic emotions. It concludes by discussing the strengths and weaknesses of the similarity methods for analysis of evaluative language.

Chapter 5: Introduction to Appraisal Theory
As the subsequent chapters of the thesis are concerned with Appraisal Theory, this chapter provides an outline of the basics of the theory. It describes the types of language utilised in communicating

22 6 the three parallel systems of Appraisal: Attitude, which is concerned with types of evaluations; Engagement, which describes how authors align with or distance themselves from the opinions of others; and Graduation, which considers how language can amplify or diminish the strength of opinions. This chapter also discusses previous computational work based on Appraisal Theory. Chapter 6: Annotating Expressions of Appraisal This chapter describes the annotation of a corpus of book reviews according to Appraisal Theory. It presents the results of an inter-annotator agreement study, and considers instances of systematic disagreement that suggest areas in which the theory might be improved. It also reports the results of a survey designed to evaluate the difficulty of Appraisal analysis in particularly ambiguous situations. Although the annotation task is difficult, there are many instances where the annotators agree; these are used to create a gold-standard corpus for the appraisal analysis experiments. Chapter 7: Computational Appraisal Analysis This chapter presents the results of evaluating the word similarity techniques described in Chapter 4 in labelling expressions and extracting words according to Appraisal Theory, and discusses the strengths and weaknesses of the methods for this task. It also proposes a number of heuristics based on Appraisal Theory, and applies these to the task of sentiment classification. Chapter 8: Conclusion This chapter summarises the techniques and experiments reported in the thesis, presents overall conclusions derived from the results, and proposes several directions for future work.

Chapter 2

Background

Researchers investigating the computational analysis of evaluation in natural language have labelled their work using a number of terms, including opinion mining, sentiment analysis and subjectivity analysis. This chapter reviews each of these related areas and considers the differences between these names with respect to which aspects of evaluative language they focus on. Section 2.1 discusses the analysis of subjectivity in text, where researchers look for clues as to whether a proposition represents a private state (that is, an aspect of a writer's psychological state, which is not open to objective verification (Quirk et al., 1985)) or instead a matter of fact. When considering sentiment in text (reviewed in Section 2.2), the task is to determine the polarity of opinion-bearing expressions, that is, whether they are generally positive or negative in sentiment. Section 2.3 discusses the area of opinion mining, which analyses these expressions in more detail, determining opinion holders, targets and types. Finally, Section 2.4 discusses the related area of affect recognition, which seeks to identify evaluative language in more dimensions than polarity, specifically distinguishing between different types of emotion.

2.1 Subjectivity Analysis

Subjective texts represent aspects of some individual's point of view, such as their beliefs, emotions and perceptions (Banfield, 1982). In contrast, objective sentences express factual information (or at least information that is believed to be factual by the individual). Automatic recognition of subjective text is beneficial in a number of natural language processing applications, such as tracking point-of-view (Wiebe, 1994) and answering questions or extracting information with regard to matters of fact or opinion (Wiebe and Wilson, 2002; Riloff et al., 2005). Recognising subjective language is also a useful starting point when considering any aspect of opinions in text, since disregarding objective content can expedite and simplify further analysis (Pang and Lee, 2004).

Wiebe (1994) observes that the subjective or objective status of propositions is rarely presented explicitly, making classification a challenging task for computational methods. The problem is complicated further if one considers that documents are never wholly either subjective or objective (Wiebe et al., 2001b), which makes evaluation a difficult endeavour. Nevertheless, several studies demonstrate that automatic identification of subjective language is possible to some extent.

Annotating Expressions of Subjectivity

Wiebe et al. (1999) presented a case study of human capability in judging sentences as subjective or objective. The coding of subjectivity status was intention-based:

"If the primary intention of a sentence is objective presentation of material that is factual to the reporter, the sentence is objective. Otherwise, it is subjective." (Wiebe et al., 1999)

Four judges independently annotated the same 14 articles chosen at random from the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993). The judges classified non-compound sentences and each conjunct of each compound sentence as subjective or objective and assigned a certainty value (Bruce and Wiebe, 1999). The annotations were corrected for bias using a variety of statistical methods and then discussed by the judges in order to create an updated coding manual. The same judges used the updated coding manual to annotate a disjoint set of documents. The authors employed the Kappa coefficient (Carletta, 1996) to evaluate inter-judge pairwise agreement, finding that if items on which the annotators were uncertain were excluded then pairwise agreements between judges yielded a Kappa value of over 0.87, a score high enough to allow definite conclusions (Krippendorff, 1980).
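As an illustration of the agreement statistic, the following minimal sketch computes a two-annotator Kappa from paired labels. It assumes Cohen's formulation, in which chance agreement is estimated from each judge's own label distribution; the variant actually used in the study is not specified here, and the function name `cohen_kappa` and the toy judgements are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: proportion of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: probability that both annotators choose the same
    # label by chance, given each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both annotators used a single identical label.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgements for ten sentences (not data from the study).
judge_1 = ["subj", "subj", "obj", "obj", "subj", "obj", "subj", "obj", "obj", "subj"]
judge_2 = ["subj", "subj", "obj", "obj", "subj", "obj", "obj", "obj", "obj", "subj"]
kappa = cohen_kappa(judge_1, judge_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the judges agree on nine of ten items, but half of that agreement would be expected by chance, which is why Kappa is lower than the raw agreement rate.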
The agreement study therefore shows that the judges were able to agree on the subjective/objective nature of sentences, provided they had confidence in their own classifications.

Wiebe et al. employed the conjuncts labelled in their corpus annotation study to evaluate machine learning models for the classification of sentences according to subjectivity. Sentences were represented using the following features: the presence of a pronoun, an adjective, a cardinal number, a modal other than "will" and an adverb other than "not"; whether the sentence begins a new paragraph or not; and co-occurrence of word tokens and punctuation with respect to subjective/objective classifications. The average accuracy across a ten-fold cross validation was 72.2%, compared with a random-choice baseline of 51.0% and an upper bound of 89.5% (estimated from human performance).

Learning Subjective Language

Having demonstrated the feasibility of automatically recognising subjective language, even when using simple features, Wiebe and colleagues continued to investigate methods of learning more about the nature of subjective language (collated in Wiebe et al., 2004). Wiebe (2000) clustered adjectives according to distributional similarity (using Lin's (1998) method) in order to grow sets of clues of subjectivity from a small number of seed terms.

Hatzivassiloglou and Wiebe (2000) examined the effect of various adjective features in determining the subjectivity of sentences. They found that the semantic orientation of an adjective (whether it was positive or negative in sentiment) and its gradability (whether it accepted modifiers that serve to intensify or diminish its strength) to a large extent predicted the subjectivity of the sentences in which it appeared. Furthermore, employing automatic methods for determining semantic orientation and gradability improved the precision of automatic subjectivity classification.

Wiebe et al. (2001b) presented a method for learning collocational clues of subjectivity in text. Their method identified collocations of fixed word stems and a generalised collocational pattern (a collocation where one position can be filled by infrequently appearing words). The precision of an n-gram (n being from 1 to 4) was calculated as the number of instances of that n-gram in subjective elements relative to the total number of instances of that n-gram.
The method labelled an n-gram as a subjective fixed-n-gram if its precision was at least 0.1 and greater than or equal to the precision of each of its components. To extract generalised collocational patterns, Wiebe et al. replaced hapax legomena (words that appear only once in the corpus) with a placeholder, Unique. That is, they treated the set of unique words as a single frequently occurring word. The criteria used to evaluate regular n-grams were then applied to each n-gram with Unique as a constituent to determine whether it was a subjective collocational pattern; those that qualified were called ugen-n-grams (unique generalised n-grams).
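The selection criteria above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes that subjective elements are approximated by whole sentences carrying a boolean label, that the "components" of an n-gram are its individual words, and that the placeholder is written `UNIQUE`; all function names and the toy corpus are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n=4):
    """Yield every n-gram of the token list, for n from 1 to max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def replace_hapaxes(sentences, placeholder="UNIQUE"):
    """Replace words that occur exactly once in the corpus with a placeholder."""
    counts = Counter(tok for tokens, _ in sentences for tok in tokens)
    return [
        ([tok if counts[tok] > 1 else placeholder for tok in tokens], is_subj)
        for tokens, is_subj in sentences
    ]


def ngram_precision(sentences, max_n=4):
    """Precision = instances in subjective text / total instances."""
    total, subjective = Counter(), Counter()
    for tokens, is_subj in sentences:
        for gram in ngrams(tokens, max_n):
            total[gram] += 1
            if is_subj:
                subjective[gram] += 1
    return {gram: subjective[gram] / total[gram] for gram in total}


def is_subjective_ngram(gram, precision, threshold=0.1):
    """Accept if precision >= threshold and >= the precision of each word
    in the n-gram (assumed reading of 'components')."""
    p = precision[gram]
    return p >= threshold and all(p >= precision[(word,)] for word in gram)


# Hypothetical corpus of (tokens, is_subjective) pairs.
corpus = [
    (["drives", "me", "up", "the", "wall"], True),
    (["drives", "me", "to", "the", "station"], False),
    (["up", "the", "wall"], True),
]
precision = ngram_precision(corpus)
generalised = replace_hapaxes(corpus)
generalised_precision = ngram_precision(generalised)
```

Running `ngram_precision` on the output of `replace_hapaxes` applies the same criteria to n-grams containing the placeholder, which corresponds to the ugen-n-gram case.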

Wiebe and Wilson (2002) observed that while some expressions are subjective in all contexts (the exclamation mark, for example), most are dependent on the surrounding context. Having identified potential subjective elements in previous studies such as those mentioned above, they attempted to disambiguate these elements in context. Their method followed that of Wiebe (1994), where an element was considered more likely to be subjective if nearby elements were subjective. A potential subjective element (PSE) was considered to be high density if the number of subjective elements within a window W around the PSE was greater than some threshold value T. The authors used a corpus marked with subjective element annotations collected in previous studies (Wiebe et al., 1999, 2001a), finding that a high density of PSEs was strongly indicative of opinionated text.

Riloff and Wiebe (2003) investigated the use of information extraction patterns as a means of representing subjective expressions. They utilised high-precision subjectivity and objectivity classifiers to obtain a large number of automatically labelled sentences. An extraction pattern learning algorithm was applied to this training data to learn lexico-syntactic patterns of subjectivity. The patterns were used to identify further subjective sentences and these in turn were used to provide more automatically labelled sentences. This process was then bootstrapped until an optimal set of patterns was found.

Riloff and Wiebe (2003) argued that information extraction patterns are linguistically richer than simple n-grams and described an example of a potential pattern of subjectivity: "<x> drives <y> up the wall", where x and y are noun phrases. They argued that this pattern could match many sentences, such as "George drives me up the wall" or "The nosy old man drives his quiet neighbours up the wall". The first stage of their method was to employ high-precision subjectivity and objectivity classifiers.
The classifiers used a lexicon of subjectivity clues from previous research, divided into sets of strongly subjective and weakly subjective clues. The classifier judged a sentence as subjective if it contained two or more strongly subjective clues. The classifier labelled a sentence as objective if the current, the previous and the following sentences contained no strongly subjective clues and at most one weakly subjective clue. However, a possible shortcoming of this method is that it may mistakenly label as objective a sentence that contains a strongly subjective clue, because that clue has not been encountered before and does not co-occur with subjectivity clues that have been seen previously. The concern here then is that a depth of subjectivity clues of a certain type may be learnt as opposed to a variety of types.
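The two labelling rules can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' classifier: clues are treated as single lower-cased tokens, the "at most one weakly subjective clue" condition is counted across the three-sentence window as a whole, sentences matching neither rule are left unlabelled (`None`), and the clue sets and function names are hypothetical.

```python
def count_clues(tokens, clues):
    """Number of tokens in the sentence that appear in the clue set."""
    return sum(1 for tok in tokens if tok in clues)


def label_sentences(sentences, strong, weak):
    """Label each tokenised sentence 'subjective', 'objective' or None."""
    labels = []
    for i, tokens in enumerate(sentences):
        # Subjective rule: two or more strongly subjective clues.
        if count_clues(tokens, strong) >= 2:
            labels.append("subjective")
            continue

        # Objective rule: inspect the previous, current and following
        # sentences (fewer at the document boundaries).
        window = sentences[max(0, i - 1): i + 2]
        strong_total = sum(count_clues(s, strong) for s in window)
        weak_total = sum(count_clues(s, weak) for s in window)
        if strong_total == 0 and weak_total <= 1:
            labels.append("objective")
        else:
            labels.append(None)  # neither rule fires with high precision
    return labels


# Hypothetical clue lexicon and document.
strong_clues = {"hate", "terrible", "love"}
weak_clues = {"feel", "quite"}
document = [
    "i hate this terrible film".split(),
    "the film runs for two hours".split(),
    "it opened in march".split(),
    "tickets cost five pounds".split(),
]
labels = label_sentences(document, strong_clues, weak_clues)
```

The second sentence is left unlabelled because its neighbour contains strongly subjective clues, which shows how the rules trade recall for precision.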

Having automatically constructed a training set of subjective and objective sentences, the authors used Riloff's (1996) AutoSlog-TS to learn extraction patterns indicative of subjectivity. AutoSlog-TS uses a set of syntactic templates to describe a search space of possible patterns (listed in the left column of Table 2.1). These templates are exhaustively applied to the training data, generating every possible instantiation of each template (the right column of Table 2.1 lists some examples). The algorithm ranks the patterns according to the frequency of the pattern in subjective sentences relative to the total frequency of the pattern. Patterns are accepted based on two thresholds: one for the pattern's overall frequency and another for its frequency in subjective sentences. The authors found that augmenting the high-precision subjectivity classifier with the learned extraction patterns improved recall by over 7 percentage points and reduced precision by only around 1 percentage point.

Wiebe and Riloff (2005) extended this approach in their OpinionFinder system by attempting to learn patterns of objective expressions as well as subjective ones. In this study they employed an additional bootstrapping step involving self-training Naïve Bayes classifiers which, once trained on sentences labelled by the extraction pattern process described above, labelled an additional large collection of unannotated data. The most confidently labelled sentences from this data set were used to bootstrap the extraction-pattern learner (and subsequently the Naïve Bayes classifiers once again). This additional bootstrapping led to a large increase in recall with a relatively minor drop in precision, with results comparable to supervised methods.

Wiebe and Mihalcea (2006) discussed the integration of word sense disambiguation techniques with subjectivity analysis, asking if it is possible to label word senses as being subjective or objective.
For instance, consider the following two senses of the word alarm: "His alarm grew" versus "The alarm went off"; subjectivity analysis and further analysis of evaluation might benefit from methods for automatically discriminating the subjectivity of word senses. The authors began with an annotation study where two judges independently annotated 138 senses of 32 words from WordNet with the labels subjective, objective, both, or uncertain. Overall agreement was 85.5%, while the Kappa value of 0.74 indicated strong agreement beyond that expected by chance. (Disregarding uncertain cases resulted in a Kappa value of 0.90.) Their method for classifying word senses of a target word as subjective or objective involved finding the distributionally similar words (DSW) using Lin's (1998) method and, for each sense of the target word, computing a WordNet-based similarity score (WNSS) with each of the DSW. They then scored a sense as the sum of