UNIVERSITÀ DEGLI STUDI DI PAVIA Facoltà di Lettere e Filosofia


UNIVERSITÀ DEGLI STUDI DI PAVIA
Facoltà di Lettere e Filosofia
Corso di Laurea Specialistica in Linguistica Teorica e Applicata

TOWARDS A DISCOURSE RESOURCE FOR ITALIAN: DEVELOPING AN ANNOTATION SCHEMA FOR ATTRIBUTION

Supervisor: Prof. Irina Prodanof
Co-supervisor: Dr. Claudia Soria
Co-supervisor: Prof. Cecilia Maria Andorno
Thesis by: Silvia Pareti
Academic Year 2008/2009

Nobody believes the official spokesman... but everybody trusts an unidentified source.
Ron Nessen

Abstract

This thesis investigates the complex phenomenon of attribution and addresses the issue of annotating attribution relations. By means of a pilot study, it develops a possible annotation schema to be applied to the Italian Syntactic Semantic Treebank (ISST), a corpus of newspaper articles. Attribution is the relation holding between assertions, but also e.g. beliefs, feelings and intentions, and the agents they belong to (e.g. The minister says that taxes will rise in 2010). As this relation deeply affects the way we perceive information, information should not be considered in isolation from it. It is fundamental to recognise attribution in order to deal with the reliability of information and with opinions. The development of an annotation schema for attribution aims at providing a resource in which information is overtly linked to its source. Such an annotated resource could serve a number of purposes, especially in the fields of Information Retrieval, Multi-Perspective Question Answering and Opinion Mining.

To date, attribution has only been annotated when associated with a discourse connective or one of its arguments (Prasad et al., 2007), or only at the document, sentence (Skadhauge and Hardt, 2005) or even word level (Wiebe, 2002), thus only partially approaching the phenomenon. The present study addresses attributions independently and also regards them as a discourse phenomenon. After analysing the features and issues connected to attribution, e.g. scope definition, nested attributions, factuality of the relation and co-reference resolution, an annotation schema will be proposed following the identification of a set of possible attribution devices. To test its feasibility and accuracy against data, a pilot annotation will be performed on a portion of the ISST corpus. This will allow the definition of annotation guidelines and the identification of additional issues that remained unnoticed at the theoretical level. In order to select a suitable tool to perform the pilot annotation, several available annotation tools will be compared.

This thesis not only contributes constructively to the development of a discourse resource for Italian, but also approaches attribution relations from a new, independent perspective, raising problematic issues and providing a deeper account of the phenomenon. Further developments of the project should perform a complete pilot annotation of all the types of attribution and features intended to be included and develop, together with the appropriate tool, a final annotation schema to be applied to the whole corpus.

Acknowledgments

It might seem banal, but every time a challenging project is over it is useful to look back and consider who made it possible, not only in order to recognise other people's merits and efforts, but especially to realise that we have not been alone. It is because of that very feeling that every time an endeavour finally comes to its conclusion, I can once again think about starting a new one.

I can start by remembering how much of what I am I owe to the Erasmus Scheme and the chances it gave me, first as a student and recently as an intern, to study and research at two amazing UK universities: Reading University and the University of Edinburgh. First of all I would like to thank my supervisor at Edinburgh, Bonnie Webber, for the unforgettable opportunity and the many hours she devoted to listening to my progress and my many doubts, every time with a solution to propose or the name of someone who might have one. Echoes of the enlightening conversations I had there with Theresa Wilson, Janyce Wiebe, Jean Carletta, Nicoletta Calzolari, John Niekrasz, Katja Markert and other colleagues can be found in this thesis, as they were fundamental in shaping my choices and widening my perspective on the topic. A special acknowledgement is also due to Rashmi Prasad, who patiently answered all my e-mails, providing me with material and clarifications about the PDTB and with precious suggestions. The contacts I had with Roser Saurí, Tommaso Caselli and Massimo Poesio were also constructive. Lastly, I cannot forget the contribution of Jasmine to the revision of the thesis and the technical and loving support Gregor unfailingly provided.

Contents

Abstract
Acknowledgments
List of Figures and Tables
  List of Figures
  List of Tables

1 Introduction
  1.1 An Independent Approach to Attribution
  1.2 Methodology
  1.3 Terminology
  1.4 Outline of the Thesis

2 Discourse and Attribution
  2.1 What is Discourse?
  Definition
  Theories of Discourse Coherence and Cohesion
  Constituency vs. Dependency
  2.2 Discourse Annotation Projects
  RST-DT
  The Penn Discourse TreeBank PDTB
  Other Projects
  Attribution
  Towards a Definition of Attribution
  Are Attribution Relations a Discourse Phenomenon?
  Related Studies
  GraphBank
  Opinion Corpus
  PDTB - The Penn Discourse TreeBank
  2.5 Summary

3 An Analysis of Attribution
  The Components of Attribution
  The Source
  The Content
  Elements Functioning as Cue
  Some Issues
  Nested Attributions
  Source of the Source
  Multiple Sources, Contents, Cues
  Co-reference Resolution
  Scope Definition
  Summary

4 Features to Include in the Annotation
  Type
  Assertion
  Belief
  Fact
  Eventuality
  Issues Concerning Type Definition
  Source
  Writer
  Arbitrary
  Other
  Factuality
  Factual
  Non-factual
  Scopal Change
  Scopal Polarity
  Other Elements Affecting the Factuality
  4.5 Summary

5 Performing a Pilot Annotation
  Corpus
  ISST Architecture
  Subcorpus Selection
  Tool Selection
  Requirements
  Comparison of Available Tools
  Selection and Tool Specifics
  Setting MMAX
  Scheme
  Customization
  Style
  Feasibility of the Schema and Issues
  Summary

6 Annotation Schema and Guidelines
  Text Spans Selection
  Source Span
  Cue Span
  Content Span
  Supplement
  Feature Annotation Guidelines
  Type Attribute
  Factuality Attribute
  Scopal Change Attribute
  Source Type Attribute
  Collecting a List of Italian Cues
  Extracting Verb Cues from the PDTB
  Summary

7 Conclusion
  Future Work
  And Beyond

Bibliography
Abbreviations and Acronyms
Appendix 1 MMAX2 Code
Appendix 2 Italian Attribution Cues
Appendix 3 PDTB Verb Cues

List of Figures and Tables

List of Figures

Figure A - Reported news example
Figure B - RST schemas
Figure C - (L-TAG) Tree examples (Cristea and Webber, 1997)
Figure D - Sense classification of discourse connectives in the PDTB
Figure E - Graphic extra-linguistic attribution
Figure F - Newspaper article source
Figure G - Nested attribution schema
Figure H - Truth values of a nested content
Figure I - Design Process
Figure J - ISST orthographic level (sole002)
Figure K - ISST morpho-syntactic level (sole002)
Figure L - ISST syntactic constituent level (sole002)
Figure M - ISST table format
Figure N - GATE annotation environment
Figure O - GATE annotation exported in XML
Figure P - Knowtator annotation environment
Figure Q - Knowtator annotation exported in XML
Figure R - MMAX2 Project Wizard
Figure S - MMAX2 Base Data (ISST cs001)
Figure T - The annotation of cue, content and source as separate levels
Figure U - MMAX2 Annotation window
Figure V - MMAX2 Annotation window (attributes)
Figure W - MMAX2 Annotation of relations
Figure X - Nested attributions visible through handles
Figure Y - Attribution relation components
Figure Z - Annotation, text spans selection
Figure AA - Annotation, elements which could function as a markable
Figure BB - Annotation, attributes selection

List of Tables

Table 1 - Factuality values (Saurí and Pustejovsky, 2008)
Table 2 - N. of articles selected per section
Table 3 - Knowtator/MMAX2 feature comparison
Table 4 - Annotation schema features
Table 5 - Factuality and Scopal change values assignment

1 Introduction

Discourse relations represent a fundamental aspect of discourse understanding and generation. Research in many areas, such as Information Extraction, Discourse Generation and Question Answering, would therefore benefit from a discourse-annotated corpus as a basis for their studies. The aim of this thesis is to contribute towards providing Italian with complete linguistic resources, in particular by designing and testing the addition of a discourse level of annotation to the ISST corpus, a multi-level annotated corpus of Italian newspaper texts. The corpus already consists of 5 levels of annotation: orthographic, morpho-syntactic, syntactic (constituents), syntactic (dependencies) and semantic. The addition of a layer for discourse annotation comes as a natural development of the ISST corpus.

Most of the work in this area, to date, concentrates on analysing and annotating discourse connectives or anaphoric relations. For the purpose of the present study, however, these issues will not be addressed and the focus will be on attribution relations. This topic is especially relevant for research dealing with Information Retrieval, Multi-Perspective Question Answering and Opinion Mining. Tools able to discern information according to the relevance of its source, or to identify different opinions on a given topic, would dramatically improve the quality of the information we are constantly exposed to.

People increasingly refer to the internet as a source of information and knowledge, interrogating search engines instead of encyclopaedias or experts. A number of projects, most recently the Microsoft search engine Bing, are trying to outperform Google and break its monopoly, with little success, as they introduce interesting small changes without remarkably improving the reliability of the responses to our queries. Search engines usually classify the source only at the macro-level, i.e. the webpage a certain text or piece of information was taken from. The urge to retrieve answers quickly does not always allow users to take into consideration the context in which the information was found, or to address the troublesome question: where does this knowledge come from? Quite often, for example, we hear people supporting their views by stating that they read something about them on the internet, or even that "the internet says so". This generalisation is also due to the difficulty of linking the information to its exact source, often hidden by several levels of attribution, all nested one inside another like a Matryoshka doll.

The practice of reporting information is particularly pervasive in the journalistic field, and especially in news reviews, where what is stated is always second-hand if it does not originate from even further away. In the example below (Figure A), the website First Bell reports that the UK has the largest gender gap in science achievement. This, however, is according to the UK's Telegraph, which in turn reports a study from the OECD whose data is taken from the Program for International Student Assessment (2006).

Figure A - Reported news example

In the last few years the Web has become the indistinct repository of all human knowledge. However, although it surely is the shallowest source of the data we learn from it, it is never the only one. Knowing all the passages a certain statement has gone through is as fundamental as knowing, for example, its temporal anchor in order to verify its veracity, understand it and interpret it. Just consider example (1) below:

(1) According to The Times the President wants to buy the Amazon Forest and turn the trees into toothpicks.

The intentions attributed to the President seem to come from a trustworthy source, The Times, and would hopefully provoke immediate reactions, at least from environmentalists. But what if this statement was part of another attribution relation, as in the paragraph that follows (2)?

(2) According to The Times the President wants to buy the Amazon Forest and turn the trees into toothpicks. The comedian pronounced these words, joking about the President's disregard for environmental issues.

A last remark concerns the utility and importance of developing such a project in a language other than English. First of all, findings and results from studies on English cannot always be entirely valid for other languages. Secondly, the importance and life of a language also depend on these efforts to make it available for every possible use. Having language resources for Italian means providing support for studies and research and allowing the development of tools specific to this language, thus enabling its speakers to rely on it for the full range of their needs. Lastly, developing resources in several languages provides precious data for cross-linguistic comparison, thus making it possible to identify aspects which are common to several languages and aspects which are peculiar to each one.

1.1 An Independent Approach to Attribution

Being able to automatically link together attributed material and its source would represent a big advantage for a number of tasks. At present, this is still not possible. A manually annotated corpus for attribution is surely not the solution; however, it represents an important step towards it. Studies aiming at developing tools for the recognition of attribution would in fact need a complete description of how the phenomenon functions and is expressed, together with an already annotated corpus to test their reliability.
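The nesting shown in example (2) can be made concrete with a small data structure. The following Python sketch is only an illustration and not the annotation schema developed in this thesis: the class name `Attribution`, its fields and the helper `source_chain` are hypothetical names introduced here, and the segmentation of example (2) into source, cue and content is an assumption made for the sake of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribution:
    """One attribution relation: a source, a cue and the attributed content."""
    source: str
    cue: str
    content: str
    # Hypothetical field: the attribution this one is embedded in, if any.
    parent: Optional["Attribution"] = None


def source_chain(attribution: Attribution) -> list[str]:
    """Return the sources from the outermost attribution to the innermost one."""
    chain = []
    current = attribution
    while current is not None:
        chain.append(current.source)
        current = current.parent
    return list(reversed(chain))


# Assumed segmentation of example (2), outermost level first.
outer = Attribution(
    source="the comedian",
    cue="pronounced these words",
    content="According to The Times the President wants to buy the Amazon "
            "Forest and turn the trees into toothpicks",
)
middle = Attribution(
    source="The Times",
    cue="According to",
    content="the President wants to buy the Amazon Forest and turn the "
            "trees into toothpicks",
    parent=outer,
)
inner = Attribution(
    source="the President",
    cue="wants",
    content="to buy the Amazon Forest and turn the trees into toothpicks",
    parent=middle,
)

print(" > ".join(source_chain(inner)))
```

Reading the chain from left to right reproduces the Matryoshka effect: the intention ascribed to the President can only be evaluated once the two outer sources have been taken into account.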
Although attribution relations have already been annotated in a few other projects (Wolf and Gibson, 2005; Wiebe, 2002; Prasad et al., 2007), a systematic and independent account of the phenomenon is still lacking. Studies aiming at capturing the complexity of discourse relations recognise the importance of attribution, but reserve a rather secondary role for it (Wolf and Gibson, 2005; Prasad et al., 2007). Other approaches instead distance themselves from discourse and assume a more independent perspective, or pair attribution with subjective language (Wiebe, 2002). None of them, however, investigates attribution completely, as they limit the annotation to only some of the attribution levels: word, clause, sentence or document.

In the present project, attribution relations will be investigated as the starting point towards the construction of a discourse resource for Italian and not as an additional feature of it. Moreover, all levels of attribution will be considered and annotated. This way of proceeding will make it possible to explore the topic independently from other discourse relations and to reach a deeper understanding and a broader account of the phenomenon.

1.2 Methodology

In order to annotate the corpus for attribution, some preliminary work needs to be carried out. First of all, attribution relations have to be analysed in order to identify their characteristics and spot issues which represent an obstacle to the annotation. Afterwards, possible solutions to these problems will be proposed and an annotation schema outlined. This will then be applied to a section of the corpus with the help of an annotation tool. The tool has been selected after comparing and testing several available software applications. The choice of the annotation tool places constraints on the annotation schema, as its limited functionality determines what is feasible and what is not (e.g. some tools do not allow the selection of overlapping text spans). Although ideally the tool is determined by the annotation schema and should be developed according to its requirements, this was not realistic at this stage.
Since an existing tool has to be relied upon, the annotation schema initially proposed will therefore be adapted to the selected tool. Performing the pilot annotation will raise additional issues and determine new changes to the annotation schema. The schema will then reach its final stage, a proposal for the annotation of attributions, which should be applicable to the rest of the corpus with the help of annotators, presumably leading to good inter-annotator agreement.

1.3 Terminology

Before moving on, it is useful to briefly introduce some of the terminology employed. Although text is used in linguistics to refer to "any passage, spoken or written, of whatever length, that does form a unified whole" (Halliday and Hasan, 1976:1), the texts within the scope of this study are solely newspaper articles, so the term will refer to written language only. The account of attribution provided in this thesis should also hold for spoken language; however, further investigations are necessary in order to determine to what extent this is true. When discussing attribution in general, the term writer will mostly be employed to refer to both the writer and the speaker of the text.

Discourse is often characterised as a coherent text, as opposed to text lacking semantic unity. As incoherent texts will not be taken into consideration, both discourse and text will generally be used to refer to a coherent unit of language. The lexical material signalling an attribution relation will mostly be identified as cue or text anchor.

1.4 Outline of the Thesis

In the second chapter, the framework of discourse studies will be briefly introduced together with a survey of discourse annotation projects. Afterwards, attribution will be defined and projects involving its annotation reviewed.

The third chapter presents the phenomenon of attribution and provides an analysis of its constitutive elements, with particular attention to the elements expressing them in the text. Some of the most problematic issues connected to attribution relations and their annotation are also investigated.

A first annotation schema proposal is described in chapter four. The description focuses on the features to include in the annotation. These attributes and their possible values are carefully analysed and described with the help of examples.

The fifth chapter illustrates the stages towards performing a pilot project in order to test the feasibility of the schema on the corpus. These include the specification of the tool requirements, the analysis and selection of the most suitable tool among the ones currently available, and the setting up of the selected tool. Afterwards, some additional issues identified through the pilot annotation are also presented.

In the sixth chapter the final annotation schema proposed for the annotation of attribution relations is briefly summarised. Guidelines concerning the annotation are provided, as they were adopted for the pilot, in order to facilitate the selection of the relevant text spans and the assignment of the attribute values.

In the last chapter conclusions are drawn and future developments discussed.

2 Discourse and Attribution

2.1 What is Discourse?

Definition

Aristotle already understood it and warned us in his Metaphysics that "the whole is more than the sum of its parts". This also holds for wholes such as texts, where the meaning deriving from the juxtaposition of clauses, as pointed out by Moore and Wiemer-Hastings (2003), may not coincide with the meaning of the individual clauses and may imply more than that. Discourse could therefore be defined as propositions in context (Péry-Woodley and Scott, 2007).

Units of language are usually organised in a coherent way, and researchers agree that coherent text has a structure and that understanding the way it functions is fundamental for the understanding of discourse (Grosz and Sidner, 1986; Hobbs, 1985). This structure needs to be taken into consideration when dealing with natural language generation, but also with tasks such as co-reference resolution, temporal relations and attribution relations.

Coherence not only depends on the relations holding between strings but also has to do with extralinguistic components such as the writer/speaker, the recipient, the knowledge they share and the communicative situation. Another concept strongly connected to coherence and contributing to it is that of cohesion. Cohesive elements are linguistic devices employed to signal connections between text units. Both coherence and cohesion will be referred to in the next section: some approaches ground discourse relations semantically and therefore focus on the elements giving coherence to the discourse, while others try to account for the cohesive means by which this coherence is linguistically expressed.

Theories of Discourse Coherence and Cohesion

"Between sentences, there are no structural relations, and this is where the study of cohesion becomes important." (Halliday and Hasan, 1976:146)

Two metaphors are usually employed by theories of coherence: that of focus, which holds between entities referred to in a text and can involve more than two text spans; and that of relation, which is binary in nature and links sections of text instead. Different theories have taken different approaches to discourse relations, which Knott (1996) describes as deep and surface structure. Deep structure theories investigate discourse relations by identifying the semantic relations which underlie surface syntactic relations (Grimes, 1975). Surface structure approaches, on the contrary, consider the deep semantic relations less important and characterise discourse relations from the outside, identifying possible resources signalling them on the surface structure (Halliday and Hasan, 1976).

Computational models of discourse processing usually employ three types of structure. The informational structure, "the relation between the information conveyed in consecutive elements of a coherent discourse" (Moore and Pollack, 1992:537), deals with semantic relations, e.g. the causal relation. The attentional structure (Grosz and Sidner, 1986) determines instead the focus or centre of attention: the information or entities which are most relevant at any given point. The third type is the intentional structure, which deals with the intentions of the speaker/writer and therefore with what they are trying to accomplish through the communicative act. This kind of structure underlies Grosz and Sidner's (1986) concept of discourse relation. In their theory, relations apply to discourse segments (DS) and combine them into larger DSs.
The intentional relations are those of dominance, when the satisfaction of the subordinate segment contributes to the satisfaction of the dominant one, and satisfaction-precedes, in which the satisfaction of one segment precedes the satisfaction of another segment and together they contribute to the satisfaction of a third, dominant one. Discourse segments are therefore organised in a hierarchical structure of goals and sub-goals. Considering discourse as a composite of linguistic structure, intentional structure and attentional state, Grosz and Sidner also account for the interaction between relation and focus, and they present every discourse segment as also having a focus space determined by dominance relations.

Two additional structures should be added to the three already mentioned. One is the information structure, which has to do with the concepts of theme and rheme, the former being the part connected to the rest of the discourse and the latter the new information which is introduced about it. The other one is the rhetorical structure, which defines a set of rhetorical relations that can connect consecutive discourse elements. Rhetorical relations are the core of Rhetorical Structure Theory (RST), formulated by Mann and Thompson (1988). Rhetorical relations (RR) are functionally defined as the effect the writer intends to achieve and are expressed by linguistic devices. RRs entail the concept of nuclearity, that is, the centrality of the span with respect to the writer's purposes. Nucleus and satellite relations, and less commonly multinuclear relations, structure the text and can be exemplified by schema applications (Figure B), which can then be mapped onto text. A hierarchical system of schema applications produces a Rhetorical Structure (RS) tree. For a text to be coherent, it should be possible to represent it with a single RS tree.

Figure B - RST schemas (labels: text span, relation, nucleus, satellite; nucleus, nucleus)

Constituency vs. Dependency

Webber (2006) argues that the approaches to discourse structure can also be grouped according to the concepts of constituency and dependency. The RST approach is based on constituency, "the idea of linguistic units as parts within parts, having specific roles or functions" (Webber, 2006:340), and considers this the only basis for discourse relations. Its instantiated schemas represent the constituency structure and correspond to discourse relations between consecutive spans (i.e. clauses or projections of instantiated schemas).

Also based on constituency is Polanyi's Linguistic Discourse Model (LDM). This is similar to RST; however, it separates discourse structure, formed by a hierarchy of discourse units, from discourse interpretation. Its discourse parse tree (DPT) can be described by a context-free grammar consisting of 3 re-write rules: an N-ary branching rule for discourse coordination, a binary branching rule for discourse subordination and an N-ary branching rule with sisters related by a logical or rhetorical relation and contributing to the interpretation of their parent node. The DPT is right open, which means that every discourse unit resuming an interrupted constituent also closes it off, thus making it impossible for any subsequent coordinate or subordinate discourse unit to attach to it. This claim, which does not allow for incrementation, is similar to the Intention Stack mechanism depicted by Grosz and Sidner (1986) and the notion of Right Frontier in Webber (1988).

Another approach to the structure of discourse is that of relating discourse cohesion to dependency, which can be of three kinds: syntactic, semantic and anaphoric. In Halliday and Hasan (1976) this is solely anaphoric dependency. Their idea of cohesion is that of a part whose interpretation requires the interpretation of another part in order to be enabled. Five types of cohesion can be identified on this basis: anaphora, substitution, ellipsis, lexical cohesion (e.g. repetition, synonymy) and conjunction, the latter being the only one responsible for discourse relations. As pointed out by Webber (2006), however, anaphoric relations have no constraint on their locality, no constraint on the number of text parts a given unit can depend on and no constraint on the discourse units that can be linked together. The lack of constraints in this approach results in embedded relations and cross-relations being allowed.

Other approaches have taken a perspective that combines constituency and dependency in order to account for discourse relations. In these mixed approaches constituency and dependency both participate in shaping the discourse structure and determining its cohesion.
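The right-open property of the DPT described above, and the related notion of Right Frontier, can be illustrated with a toy tree. The Python sketch below is a deliberately simplified reading and not Polanyi's grammar or Webber's formalisation: the names `Node`, `right_frontier` and `attach` are hypothetical, and the only constraint modelled is that new material may attach solely to nodes on the path from the root to the rightmost leaf.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A discourse unit with its (ordered) child units."""
    label: str
    children: list["Node"] = field(default_factory=list)


def right_frontier(root: Node) -> list[Node]:
    """Nodes on the path from the root to its rightmost descendant."""
    frontier = [root]
    while frontier[-1].children:
        frontier.append(frontier[-1].children[-1])
    return frontier


def attach(root: Node, parent_label: str, new_label: str) -> bool:
    """Attach a new unit under `parent_label` only if that node is still open,
    i.e. only if it lies on the right frontier. Return True on success."""
    for node in right_frontier(root):
        if node.label == parent_label:
            node.children.append(Node(new_label))
            return True
    return False


discourse = Node("D")
print(attach(discourse, "D", "u1"))   # u1 becomes the rightmost unit
print(attach(discourse, "D", "u2"))   # attaching u2 closes off u1
print(attach(discourse, "u1", "u3"))  # refused: u1 is no longer on the frontier
print(attach(discourse, "u2", "u3"))  # accepted: u2 is still open
```

In this simplified picture, once a later unit has been attached higher up, earlier constituents are closed off, which is the behaviour the text attributes to a right-open tree.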
Wolf and Gibson's (2005) discourse structure relations, a set of informational relations based on Hobbs (1985), are associated with constituency alone. However, they do not separately account for anaphoric dependency, which is responsible for relations between non-adjacent discourse segments. In this way their approach can be seen as part of the mixed approaches. In their theory of discourse graphs, Wolf and Gibson identify discourse segments as non-overlapping spans of text constituted either by a clause or by an attribution. Segments which are related on the basis of a common topic or attribution are grouped together. Groups can also engage in a discourse relation with a clause or another group. This results in a sort of hierarchical structure which can be related to constituency. Moreover, Wolf and Gibson argue that tree structures are not adequate to account for discourse coherence and propose a chain graph in order to represent problematic aspects such as nodes with multiple parents and cross-relations. Although their approach tends to associate discourse structure solely with constituency, dependency plays an important role in determining their claims. Cross-relations, which appear to be quite frequent and mainly associated with the relation of elaboration, could be explained through dependency. Webber (2006) notices that cross-relations, which represent the main argument against the tree structure, are often anaphoric dependencies.

Also mixed, although very different, is the Lexicalized Tree-Adjoining Grammar for Discourse (D-LTAG) approach (Cristea and Webber, 1997; Webber et al., 2003). Discourse relations are lexicalised in the sense that this theory provides an account of the lexical anchors bearing them. The arguments to these relations are also lexicalised. Each lexical entry is associated with a set of tree structures specifying its syntactic configuration. In this lexical variant of TAG, the adjoining operation, which is available at the right frontier, is paired with the operation of substitution. Adjoining is the operation of identifying "a discourse relation between the new material and material in the previous discourse that still is open for elaboration" (Cristea and Webber, 1997:91). Cristea and Webber introduce substitution in order to account for discourse features (e.g. although, on the one hand, suppose) raising expectations about what is to come in the following discourse. Figure C shows the grammatical categories above (where * marks the foot of an auxiliary tree and ↓ a substitution site) and the adjoining and substitution operations below.

Figure C - (L-TAG) Tree examples (Cristea and Webber, 1997)

Structural connectives (i.e. conjunctions, subordinators) and empty connectives are the anchors of elementary trees. These discourse relations between arguments produce a compositionally interpreted structure. Discourse adverbials instead exploit anaphoric dependency, establishing a discourse relation that connects the interpretation of a clause to the interpretation of a previous clause or group of clauses.

2.2 Discourse Annotation Projects

Discourse annotation projects have become popular in recent years due to a growing interest in better understanding discourse structures in order to interpret or reproduce them automatically. A survey of these projects is presented in this section.

RST-DT

The Rhetorical Structure Theory Discourse Treebank (Carlson et al., 2003) is a corpus of 176,000 words from the Penn TreeBank, hence consisting of articles from the Wall Street Journal (WSJ). Realised in the framework of RST, the RST-DT corpus is annotated for rhetorical relations holding between two or more adjacent and non-overlapping text spans. In order to construct the discourse tree, the annotators first identify its minimal building block, the elementary discourse unit (EDU), which is the clause.

Once the text has been segmented, adjacent EDUs are linked via rhetorical relations, thus creating a hierarchical structure. The inventory of rhetorical relations employed consists of 53 mononuclear relations, where one of the spans is more salient (nucleus) and the other conveys additional information (satellite), and 25 multinuclear relations, with equally salient spans.

The Penn Discourse TreeBank PDTB

The PDTB (Prasad et al., 2004; Webber et al., 2005; Prasad and Dinesh et al., 2008) represents a fundamental work in the area of discourse, both for its unique approach to discourse relations based on the D-LTAG theory and for the echo it has produced, inspiring a number of recent studies and providing them with a strong knowledge base. The PDTB is a discourse resource built on top of the PTB, the Penn Wall Street Journal corpus. It consists of a million words annotated for discourse connectives and their arguments. A stand-off annotation was chosen, as this is generally clearer than in-line XML annotation and because the arguments of different connectives can overlap, which would violate the syntax of XML.

Although not tied to any particular theory of discourse, the approach taken is grounded in the D-LTAG approach to discourse. The idea of a lexicalised grammar for discourse results in a bottom-up approach that avoids resorting to a pre-defined set of discourse relations, as happens in other theories (e.g. RST). The annotation focuses on discourse connectives, considered as discourse predicates taking two text spans as their arguments, and on their arguments Arg1 and Arg2. In example (3) below, Arg1 is in italic and Arg2 in bold, while the connective is underlined. Discourse relations hold between Abstract Objects (AO), such as propositions, events and states.
The annotation was performed proceeding with annotating a single connective throughout the whole corpus before taking into consideration the following one as this was perceived as an easier task for the annotators. (3) Most oil companies, when they set exploration and production budgets for this year, forecast revenue of $15 for each barrel of crude produced. (Prasad and Dinesh, 2008:2)
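
The stand-off representation described above can be illustrated with example (3). The following is a minimal Python sketch: the class names `Span` and `Relation`, the helper `find`, the use of character offsets and the span boundaries chosen for Arg1 and Arg2 are assumptions made for illustration, not the actual PDTB file format. It shows how a discontinuous Arg1 and an embedded Arg2 can be recorded without inserting any markup into the text.

```python
from dataclasses import dataclass
from typing import List

TEXT = ("Most oil companies, when they set exploration and production "
        "budgets for this year, forecast revenue of $15 for each barrel "
        "of crude produced.")


@dataclass(frozen=True)
class Span:
    """Half-open character interval [start, end) into the raw text."""
    start: int
    end: int

    def text(self, source: str) -> str:
        return source[self.start:self.end]


@dataclass
class Relation:
    """Hypothetical stand-off record: the text itself is never modified."""
    connective: Span
    arg1: List[Span]  # a list, because an argument may be discontinuous
    arg2: List[Span]


def find(source: str, fragment: str) -> Span:
    """Locate the first occurrence of a fragment (raises ValueError if absent)."""
    start = source.index(fragment)
    return Span(start, start + len(fragment))


# Assumed reading of example (3): Arg2 is the clause introduced by "when",
# Arg1 is the discontinuous main clause that surrounds it.
relation = Relation(
    connective=find(TEXT, "when"),
    arg1=[find(TEXT, "Most oil companies"),
          find(TEXT, "forecast revenue of $15 for each barrel of crude produced")],
    arg2=[find(TEXT, "they set exploration and production budgets for this year")],
)

for label, spans in (("connective", [relation.connective]),
                     ("Arg1", relation.arg1),
                     ("Arg2", relation.arg2)):
    print(label, [(s.start, s.end, s.text(TEXT)) for s in spans])
```

Keeping each argument as a list of spans is one possible design choice; it lets overlapping or interleaved arguments of different connectives coexist, which in-line XML could not express.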

Discourse connectives belong to three grammatical classes: subordinating conjunctions (e.g. because, when), coordinating conjunctions (e.g. and, or) and discourse adverbials (e.g. for example, instead). They can also appear in a modified or conjoined form (e.g. only because, if and when) or in a parallel form (e.g. either ... or, on the one hand ... on the other hand). The senses of the connectives are also annotated, paying attention to their polysemous nature (e.g. since can have a temporal, causal or temporal-causal sense). Senses are hierarchically classified according to their class, type and subtype, as exemplified in Figure D.

[Figure D - Sense classification of discourse connectives in the PDTB. Class: temporal, contingency, comparison, expansion. Type (under contingency): condition, cause. Subtype (under cause): reason, result.]

Between adjacent text spans, discourse relations are also annotated when they are not explicit, that is, when the relation can be inferred although a discourse connective is lacking. In these cases a presumed connective is added to the annotation, except when the relation is lexicalised by some other expression (AltLex), when the arguments are linked by an entity-based coherence relation (EntRel), or when no relation is perceived (NoRel). The arguments of a connective can be non-consecutive (3), can occur anywhere in the text, and consist of single or multiple clauses or sentences. A principle of minimality applies to them, which prescribes for each argument the selection of the minimum sufficient span. Additional text related to the arguments can also be included in a discourse relation as a supplement (Sup1, Sup2). The annotation in the PDTB contains additional information, as it also specifies the attribution of connectives and their arguments. This aspect of the annotation will be considered and analysed at a later stage in this thesis.
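
The class, type and subtype hierarchy of Figure D can be sketched as a nested mapping. The Python names below (`SENSES`, `sense_path`) are hypothetical, and only the branch shown in the figure is filled in; the other classes are left empty rather than guessed.

```python
from typing import Dict, List, Optional

# class -> type -> list of subtypes; only the branch shown in Figure D is filled.
SENSES: Dict[str, Dict[str, List[str]]] = {
    "temporal": {},
    "contingency": {"condition": [], "cause": ["reason", "result"]},
    "comparison": {},
    "expansion": {},
}


def sense_path(label: str) -> Optional[List[str]]:
    """Return the path from class down to `label`, or None if it is unknown."""
    for sense_class, types in SENSES.items():
        if label == sense_class:
            return [sense_class]
        for sense_type, subtypes in types.items():
            if label == sense_type:
                return [sense_class, sense_type]
            if label in subtypes:
                return [sense_class, sense_type, label]
    return None


print(sense_path("reason"))
print(sense_path("condition"))
```

Under this representation a polysemous connective such as since would simply be associated with more than one path, one for each of its senses.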

Other Projects

The Chinese Discourse Treebank

The Chinese Discourse Treebank (CDTB) project (Xue, 2005) is based on the same principles as the PDTB. As in the PDTB, and unlike in the RST approach, discourse relations are not drawn from a predefined inventory but are lexically grounded and anchored by discourse connectives. Implicit and explicit Chinese discourse connectives were investigated in order to add a discourse layer of annotation to the Penn Chinese Treebank. Here too, discourse connectives are regarded as predicates taking two abstract objects as their arguments. In the CDTB coordinating and subordinating conjunctions as well as discourse adverbials are annotated. The main challenges to the realisation of this project were disambiguating lexical items which in Chinese can function both as discourse connectives and as non-discourse connectives, determining the sense of polysemous connectives, and defining the argument scope. Owing to the long morphological evolution of the Chinese language, another issue was posed by discourse relations realised by more than one discourse connective. Hence, different morphological forms had to be grouped as diverse realisations of the same discourse relation. The task of annotating attribution is not included in the CDTB.

Discourse and the Prague Dependency Treebank (PDT)

Also inspired by the PDTB is the initial analysis conducted on the Prague Dependency Treebank for the addition of a layer of annotation for discourse (Mladová et al., 2008). The PDT is a corpus of 2 million words of Czech journalistic texts from the Czech National Corpus. Three levels of annotation are already available: morphological, surface syntactic and deep syntactic. In the latter, each sentence is represented by a dependency tree connecting clauses but not crossing sentence boundaries.
In addition, however, some basic co-reference relations are also marked, among them some textual co-reference relations going beyond sentence boundaries. Discourse relations will be added to PDT 3.0 in a fourth level of annotation, containing various types of relations going beyond the sentence. The discourse layer to be added to the PDT will use the PDTB as a background, define a new hierarchy of discourse sense labels and exploit the discourse information already carried by the deep syntactic level of annotation. Co-reference relations are already marked for coordination, dependency and reference to the preceding context; however, these need to be explicitly marked for discourse, and the PREC label for relations going beyond sentence boundaries has to be further specified.

Discourse Annotation of the METU Turkish Corpus and the Hindi Discourse Relation Bank

Still at an early stage are two recent projects aiming at developing a discourse resource for Turkish and Hindi respectively. Both are based on the theoretical assumptions of the PDTB and focus on the analysis of discourse connectives. These studies also point to a certain cross-linguistic validity of the PDTB schema, and the similar approach adopted makes them a valid source for cross-linguistic comparison. Most of the work done to date to prepare the ground for the discourse annotation of the METU Turkish Corpus (Zeyrek and Webber, 2008) has concerned the identification and classification of discourse connectives, together with a preliminary analysis of the argument scope. The attribution of discourse relations and other aspects such as the annotation of implicit connectives are still unexplored. Similarly, the project for a Hindi Discourse Relation Bank adopts a lexically grounded approach to discourse relations and focuses on the analysis of different types of discourse connectives and their realisation in Hindi. Implicit connectives and their semantic classification, together with the attribution of connectives and their arguments, are left for future developments of that research (Prasad and Husain et al., 2008).
2.3 Attribution

This thesis originates in the framework of developing an Italian Discourse Treebank which, like the discourse annotation projects under development for Chinese, Czech, Hindi and Turkish, is theoretically inspired by the PDTB. However, unlike these projects, it does not focus on the classification and annotation of discourse connectives but on attributions, an aspect included in the PDTB but only in a subordinate way.

Towards a Definition of Attribution

Defining what attributions are seems a trivial task, yet precisely because the notion appears so obvious it is not at all easy. Although the annotation scheme for attribution in this project is derived from the one in the PDTB, the definition it provides of attribution as a relation of ownership between abstract objects and individuals or agents (Prasad and Miltsakaki, 2008:40) is not suitable to fully describe the relations that will be considered and investigated here. AOs refer to propositions, events or states and do not include smaller units such as noun phrases or even single words.

Another definition is given in the RST annotation manual:

Speech acts verbs that are used to report both direct and indirect speech-- should be segmented and marked for the rhetorical relation of ATTRIBUTION [...] Cognitive predicates, including verbs that express feelings, thoughts, hopes, etc., should also be segmented and marked for the rhetorical relation of ATTRIBUTION. (Carlson and Marcu 2001:7, 9)

More than attribution itself, what is defined here, by describing the means by which it is signalled, is the way of spotting attribution in the text. However, it can be inferred that attribution is bound only to reporting or cognitive predicates, leaving out the cases in which attribution is conveyed by prepositions (4) or just by punctuation (5).

(4) According to the police, crime rate has fallen this month.
(5) The Pope: I will pray for the victims.

Murphy (2005:131) provides a partial definition of attribution as the transferral of responsibility for what is being said to a third party. This simple explanation, meant to capture only the attribution of assertions, nevertheless highlights the embedded nature of attribution, recognising a third party in the relation. This is because any attribution in a text or speech event is already part of a communicative event having in the writer/speaker its natural primary source. The insertion of a third party allows the writer/speaker to change this default attribution and transfer the responsibility or ownership of a certain part to another source.

As none of the above-mentioned definitions is sufficient on its own to capture the phenomenon under consideration in this thesis, a new definition is proposed here: attribution in a text is ascribing the ownership of an attitude towards some linguistic material, i.e. the text itself, a portion of it or their semantic content, to an entity. This ownership is expressed by explicitly inserting the agent or experiencer holding the intellectual property of the linguistic material, which can express an assertion or a mental state such as an opinion, a will or some knowledge. Attributions as described above will be considered and investigated in the current research.

Are Attribution Relations a Discourse Phenomenon?

In order to decide whether attribution relations are a kind of discourse relation, it is necessary to specify what discourse relations are. The label of discourse holds for those texts having a structure. This structure originates from cohesive elements. Where the interpretation of any item in the discourse requires making reference to some other item in the discourse, there is cohesion (Halliday and Hasan, 1976:11). If generating cohesion were the necessary and sufficient condition for identifying a discourse relation, attribution would surely belong to this class. The interpretation of an attributed element is highly dependent on its source. Bergler (1991) distinguishes between primary and circumstantial information, the former being the pure information and the latter the primary information within a perspective, a belief or a modality, and argues that the interest of tasks such as knowledge extraction lies in primary information.
She nevertheless acknowledges the importance of the additional information carried by the circumstantial information and stresses the intimacy of this relation. Although primary information is still, after 18 years, the focus of knowledge extraction tasks, the recent flourishing of studies aiming at capturing this intimate relation shows a general understanding of the fundamental contribution of circumstantial information to the interpretation of primary information.

Attribution relations are therefore undoubtedly cohesive relations. Cohesion, however, is not enough to characterise discourse relations specifically, as it could be a property of relations in general: a syntactic relation would then also be classified as a discourse one, and this should not be the case. The second necessary condition identifying a discourse relation is that it should hold between discourse segments. These should be non-overlapping spans of text; however, a single agreed definition is still lacking in the literature. Different discourse approaches also adopt different discourse units. These can be intentional units (Grosz and Sidner, 1986), sentences (Hobbs, 1985), clauses or phrasal units (Mann and Thompson, 1988; Webber et al., 1999; Wolf and Gibson, 2005). Relations of attribution can hold between sentences or, inside them, between clauses or groups of clauses; attribution could therefore be considered a discourse phenomenon.

(6) "There's no question that some of those workers and managers contracted asbestos-related diseases," said Darrell Phillips, vice president of human resources for Hollingsworth & Vose. "But you have to recognize that these events took place 35 years ago. It has no bearing on our work force today." (PDTB 0003)

Skadhauge and Hardt (2005) argue in this respect that attribution is an intra-sentential relation, referring to the RST Treebank where it is actually treated as such, and develop a system that they claim can automatically identify it. The assumption is that, being an intra-sentential relation, attribution is encoded at the syntactic level. Attribution is also a syntactic phenomenon, but surely not only that. The premises leading Skadhauge and Hardt to this conclusion are grounded in the RST Treebank approach to attribution, which considers only intra-sentential instances of it and only under particular conditions (i.e. a verb immediately followed or preceded by a sentential complement, and the phrase according to). The rather obvious conclusion should be that a subset of attribution relations which are syntactically grounded, those selected by the RST-DT, can be syntactically derived and automatically identified. Although a certain number of attributions are expressed at the intra-sentential level, verbs are not the only cues signalling them (see examples (4) and (5) above). They are certainly the most common ones; however, the attributed span is often separated from the verb by intervening material, such as adverbs, complements or even clauses. Only by eluding the complexity of attribution relations and considering just a subset of them could Skadhauge and Hardt easily provide a solution for the automatic identification of this problematic phenomenon. This very partial solution demonstrates the importance of reaching a better theoretical description of attribution and a full account of its characteristics.

From the present work attribution emerged as being also a discourse phenomenon. This is because it often operates at a higher level than the sentence, connecting larger units such as sentences (6), but also clauses in separate sentences. Moreover, it very frequently involves co-reference relations or, better, is bound to them ((7), (9)).

(7) LONDRA - Con i soldi della lotteria nazionale sarà creata un'Accademia Britannica per lo Sport. Lo ha deciso il primo ministro, John Major, (ISST re050)
LONDON - With the money from the National lottery it will be instituted a British Sport Academy. It was decided by the Prime Minister, John Major,

Through the analysis of attribution it also became clear that it can be a syntactically encoded phenomenon, intra-sentential and even intra-clausal, with as little as a single word functioning as the attributed material ((8), (9)).

(8) Sì, le risponde convinta un'amichetta. (ISST cs060)
Yes, answers to her confident a friend.

(9) L'umanità deve proclamare uno storico sciopero ad oltranza fino alla distruzione di tutti gli armamenti nucleari. Le parole registrate di Gheddafi, (ISST cs039)
The world should proclaim a non-stop strike till the destruction of all nuclear armaments. Gheddafi's recorded words,

On the other hand, attribution relations can involve much larger units than sentences or clauses and extend to the whole text or speech, reaching the shallowest level of attribution, the one already easily captured by search engines, in which the source is the writer of the text or the person giving a speech, or even the newspaper or website hosting the article. At this level the attribution is often conveyed by prosodic or extra-linguistic means, e.g. inclusion in a web page, newspaper or book, a graphic pointer (Figure E), or the provenance of the sound.

Figure E - Graphic extra-linguistic attribution

For the purpose of the present study attribution will be considered at every level at which it can be found; however, it will mainly be treated as a discourse phenomenon. It will be considered at the discourse level itself, when sentences, propositions or clauses or groups of them are attributed, and at the sentence or even clause level, when single words or noun phrases are attributed. In the latter case, however, the analysis will be limited to those instances that are co-referential with a discourse unit ((7), (9)), which could hence also be considered, in combination with the co-referential relations, a discourse relation. The shallow level of attribution will also be included in the annotation, as the text, which is always a newspaper article, will have the writer as its primary source, even when the writer is not directly mentioned in the article. The attribution of the entire article to its writer will be assumed as default and left implicit, with some exceptions.

2.4 Related Studies

Attribution relations have already been included in some studies. These either focus on some other discourse aspect and account for attribution only marginally, or limit their analysis to some level of attribution, e.g. the macro-level or the intra-sentential or word level, thus neglecting attribution at the discourse level. Nonetheless they represent a knowledge base and a starting point for the present study. The annotation schema is in fact derived from the annotation schemas proposed by these projects. In this section the most influential ones will be reviewed.

GraphBank

The relation of attribution is included in the GraphBank (Wolf and Gibson, 2005) as an asymmetrical or directed relation, together with cause-effect, condition, violated expectation, elaboration, example and generalization. In contrast to symmetrical or undirected relations, i.e. similarity, contrast and same, directed relations hold from satellite to nucleus nodes and are related to Mann and Thompson's (1988) mononuclear and multi-nuclear relations. Attribution relations go from the discourse segment (DS) containing the source to the DS which is the content of the attribution. Attributions in the GraphBank are separated only when the attributed material is a sentence, a group of sentences or a complementizer phrase (10). These DSs are grouped if they are attributed to the same source. In the other cases source and content are treated as a single discourse segment (11).

(10) 1. John said that 2. the weather would be nice tomorrow. (Wolf and Gibson, 2005:254)
(11) 1. The restaurant operator cited transaction costs from its 1988 recapitalization. (Wolf and Gibson, 2005:251)

Wolf and Gibson added attribution to the relations in Hobbs (1985) as they are dealing with text taken from news corpora. However, they consider attributions, more than coherence relations themselves, just as carriers of coherence structures (Wolf and Gibson, 2005:251).

Opinion Corpus

Connected to attribution are also works in the fields of opinion and emotion annotation and recognition. The most consistent and closely related study in this respect is the Opinion Corpus (Wiebe, 2002; Wiebe et al., 2005; Wilson and Wiebe, 2005). It consists of more than sentences from the world press, annotated for private states. This term covers opinions, beliefs, thoughts, feelings, emotions, goals, evaluations and judgements. A private state consists of an experiencer holding an attitude, optionally toward an object (Wiebe, 2002:4). Private states partly overlap with the types of attribution considered in the present study. Feelings and emotions are not part of the present annotation, so that the object of the private state is not optional here; other categories such as beliefs and thoughts are however included, together with assertions. For the annotation of private states Wiebe et al. (2005) create three frames, each corresponding to a type of private state expression: explicit mention of private states, speech event expressing private states and expressive subjective elements. Key elements of these frames are: the text anchor, namely the text span representing the speech act or the private state; the source, which refers to both the experiencer of a private state and the writer or speaker of a speech event; the target, although this is only included in the first two frames; and some properties. Properties include: the intensity of the private state; the expression intensity, which denotes the contribution of the text anchor to the intensity of the private state; insubstantial, used when a private state is e.g. in the scope of a conditional and is therefore not presented as real in the discourse; and attitude type, accounting for the polarity of the private state.
Assertions are annotated through the objective speech event frame if the target is presented as an objective fact. Another important aspect of their annotation is the inclusion of an agent frame in order to identify every source in the text with a unique ID. This feature is particularly significant for dealing with bridging or pronominal anaphora, that is, when the same source is mentioned several times by means of different nouns or pronouns, making the identification of a unique source quite challenging. Sentences presenting private states and speech events are analysed in three parts. On designates the text anchor corresponding to the private state or speech event itself. Inside is the material in the scope of the private state or speech event, while outside includes the source and everything else in the sentence.

(12) outside: On Tuesday, John while hanging up the phone.
on: said that
inside: he was leaving (Wiebe, 2002:8)

The Opinion Corpus surely represents a model and a knowledge base for the present study regarding the annotation of attributions. This model needs however to be expanded to go beyond sentence boundaries, in order to avoid approaching attribution once again merely as a syntactic intra-sentential phenomenon.

PDTB - The Penn Discourse TreeBank

Apart from annotating lexically grounded discourse relations in the form of discourse connectives and their arguments, the PDTB goes further and also includes attribution relations in the annotation. Considered as a relation of ownership between abstract objects and individuals or agents (Prasad and Miltsakaki et al., 2008:40), attribution often overlaps with discourse connectives and their arguments. Discourse connectives also establish relations between AOs and can therefore hold between attributions (13) or just between the AOs representing the content of attributions (14). The discourse relation itself can be the AO representing the content of an attribution relation (15). In the examples that follow, taken from the PDTB 2.0, the text spans corresponding to Arg1 are shown in italics, those for Arg2 are in bold, the discourse connectives are underlined and the attribution phrases are identified by small capitals.

(13) ADVOCATES SAID the 90-cent-an-hour rise, to $4.25 an hour by April 1991, is too small for the working poor, while OPPONENTS ARGUED that the increase will still hurt small business and cost many thousands of jobs. (PDTB 0098)

(14) Factory orders and construction outlays were largely flat in December while PURCHASING AGENTS SAID manufacturing shrank further in October. (PDTB 0178)

(15) The public is buying the market when in reality there is plenty of grain to be shipped, SAID BILL BIEDERMANN, ALLENDALE INC. DIRECTOR. (PDTB 0192)

Discourse connectives and attribution relations appear as separate layers that can occur independently or coexist, overlapping or even being included one in another. The approach taken by the PDTB, however, treats attribution as subordinate to the identification and annotation of discourse connectives; as the focus is on the latter, attribution appears more as an additional feature of connectives and their arguments than as an independent phenomenon. Attribution is in fact annotated in the PDTB only when, and whenever, a discourse relation exists, thus leaving out those instances of attribution that occur independently. Moreover, what is actually marked is the attribution of the discourse connective and of its two arguments Arg1 and Arg2. Therefore, a nested attribution included e.g. in one of the arguments cannot be accounted for and is left unmarked. In example (16) below, the discourse relation in quotes is attributed to Gov. Nelson Rockefeller of New York, and there is no account of the nested attribution of an intention expressed by want and concerning the span: to keep the crime rates high.

(16) In 1966, on route to a re-election rout of Democrat Frank O'Connor, GOP GOV. NELSON ROCKEFELLER OF NEW YORK appeared in person SAYING, If you want to keep the crime rates high, O'Connor is your man. (PDTB 0041)

Key properties of attribution included in the PDTB annotation scheme are: source, type, scopal polarity and determinacy. The source feature specifies whether the source of the attribution, i.e. the agent in the relation of ownership, is the writer (Wr), another specific agent (Ot), or an arbitrary source (Arb). The writer is always marked as the source when no explicit attribution is made (17). While Other refers to a determinate source either explicitly mentioned (15) or inferable from some other occurrence in the text, Arbitrary sources lack a referential agent. This happens for example in the case of an impersonal source or of an attribution with an agentless passive verb or an adverb (17) as the reporting phrase. In the following example, the relation and Arg1 are attributed to the writer, while Arg2 is labelled as arbitrary.

(17) East Germans rallied as officials REPORTEDLY sought Honecker's ouster. (PDTB 2278)

Another feature of attribution in the PDTB is the type. This partly accounts for the degree of factuality of the AOs. Type can take four values: assertions, beliefs, facts and eventualities. Assertion propositions (Comm) are generally conveyed by verbs of communication (18), e.g. say, explain, announce. Implicit attributions to the writer (19) also take this value. Belief propositions, which partly correspond to the private states of opinions, beliefs and thoughts, are instead expressed by propositional attitude verbs (20), i.e. verbs entailing a mental process such as think, believe, doubt, and are labelled as PAtt.

(18) We won't put any burden on Farmers, HE SAID. (PDTB 2403)

(19) Besides, to a large extent, Mr. Jones may already be getting what he wants out of the team, even though it keeps losing. (PDTB 1411)

(20) Scientists need to understand that while THEY TEND TO BELIEVE their work is primarily about establishing new knowledge or doing good, today it is also about power. (PDTB 1495)

Facts are associated with factive and semi-factive verbs and involve the attribution of an AO presented as factual. To this type belong verbs of perception and knowledge such as hear, know, remember. The last type of attribution has to do instead with agents holding an intention or attitude towards the AO. Prasad and Miltsakaki et al. (2008) present eventualities (Ctrl) as conveyed by control verbs. These are: verbs of influence (21), such as order, allow and persuade; verbs of commitment, such as agree, promise and accept; and verbs of orientation, such as hope, want and wish.

(21) Eward and Whittington had planned to leave the bank earlier, but MR. CRAVEN HAD PERSUADED THEM to remain until the bank was in a healthy position. (PDTB 1949)

Another feature added in the PDTB to attribution is scopal polarity. This feature identifies cases in which a negation that on the surface appears to take scope over the attribution verb instead changes the polarity of the attributed AO (22). It is important to recognise the real scope of the negation, as this affects the last feature present in the annotation of attribution in the PDTB: determinacy. Determinacy has to do with the truth value of the attribution. When an attribution verb is in the scope of a negation (23), or e.g. in a conditional or infinitival context, the attribution itself is not presented as real, and it should be handled as such when drawing conclusions about the AO on the basis of this relation. This does not mean that the attribution is certainly unreal, as it may just be presented as possible (24) or probable.

(22) I DON'T THINK it's a main consideration. (PDTB 0090) = I THINK it's not a main consideration.

(23) Yet the Soviet leader's readiness to embark on foreign visits and steady accumulation of personal power, ..., DO NOT SUGGEST that Mr. Gorbachev is on the verge of being toppled; (PDTB 0439)

(24) SOME MAY BE TEMPTED TO ARGUE that the idea of a strategic review merely resurrects the infamous Zero-Based Budgeting (ZBB) concept of the Carter administration. (PDTB 0692)

A last issue addressed by the PDTB concerns the annotation of the attribution text span. The attribution span corresponds to the material containing the information about the source, the type, the scopal polarity and the determinacy of the attribution. The AO is usually annotated separately. The attribution spans are often left unexpressed in the sentence in which the AO is realized, and have to be inferred from the prior discourse (Prasad and Miltsakaki et al., 2008:48). When the attribution is to the writer and implicit, no text span is selected. The text span also includes, for every element that is part of it, its non-clausal modifiers, e.g. adverbs and appositive noun phrases. In some cases the text anchor of the attribution is not a verb but a prepositional group such as in the eyes of and according to (25), or an adverb like reportedly and allegedly. When one of these constructions signals the attribution relation, the attribution span is a non-clausal phrase. Non-clausal attributions are included in the argument span corresponding to their AO, as the PDTB annotation conventions do not allow keeping phrasal modifiers separate from the span they modify (25).

(25) No foreign companies bid on the Hiroshima project, ACCORDING TO THE BUREAU. But the Japanese practice of deep discounting often is cited by Americans as a classic barrier to entry in Japan's market. (PDTB 0501)

2.5 Summary

This chapter has presented a review of different approaches to discourse structure and coherence relations, introducing different theories and surveying the main projects regarding the construction of discourse-annotated resources. Attribution relations have been shown to be not only a syntactic, intra-sentential phenomenon, as some studies have regarded them, but also to scope over discourse units and even to relate extra-textual material. A new definition of attribution has also been proposed, in order to meet the need for one adequate to describe the scope of the present study. Some annotation projects involving attribution were also reviewed. The annotation schema developed in this thesis will be grounded on these projects, though with some modifications. In order to provide a complete account of attribution, it is necessary to extend and adapt these annotation schemas to the range of linguistic units between which this relation can hold (i.e. word, clause, sentence, discourse segment, discourse). Moreover, as the complexity of such a wide scope suggests, an approach to attribution independent from other syntactic or discourse phenomena will be adopted. The benefit of this approach will be a better description of the phenomenon and the development of a complete resource to be employed for attribution-related studies.
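
The four PDTB attribution features reviewed in this chapter (source, type, scopal polarity and determinacy) can be summarised in a small Python sketch. The class and member names are hypothetical. The abbreviations Wr, Ot, Arb, Comm, PAtt and Ctrl are those quoted in this chapter, whereas the labels Ftv, Neg, Indet and Null are not given here and are assumed. The feature values assigned to examples (18), (19) and (22) are a reading of the description above rather than values copied from the PDTB files.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    WRITER = "Wr"
    OTHER = "Ot"
    ARBITRARY = "Arb"


class AttrType(Enum):
    ASSERTION = "Comm"
    BELIEF = "PAtt"
    FACT = "Ftv"          # label assumed, not quoted in this chapter
    EVENTUALITY = "Ctrl"


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record of the four features for one attributed AO."""
    source: Source
    attr_type: AttrType
    scopal_negation: bool = False   # surface negation really applies to the AO
    indeterminate: bool = False     # attribution itself not presented as real
    span: str = ""                  # empty for implicit attribution to the writer

    def label(self) -> str:
        # "Neg", "Indet" and "Null" are assumed label names.
        polarity = "Neg" if self.scopal_negation else "Null"
        determinacy = "Indet" if self.indeterminate else "Null"
        return ", ".join([self.source.value, self.attr_type.value,
                          polarity, determinacy])


# Example (18): explicit assertion by a specific agent.
ex18 = Attribution(Source.OTHER, AttrType.ASSERTION, span="HE SAID")
# Example (19): implicit attribution to the writer, so no span is selected.
ex19 = Attribution(Source.WRITER, AttrType.ASSERTION)
# Example (22): the negation on "think" reverses the polarity of the AO.
ex22 = Attribution(Source.OTHER, AttrType.BELIEF, scopal_negation=True,
                   span="I DON'T THINK")

for name, attribution in (("18", ex18), ("19", ex19), ("22", ex22)):
    print(name, attribution.label())
```

Scopal polarity and determinacy are modelled here as two independent boolean flags, which is only one possible encoding of the distinction between examples (22) and (23).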

40 3 An Analysis of Attribution 3 An Analysis of Attribution Before proceeding with the definition of an annotation schema for attribution, a deeper understanding and description of the phenomenon is required. Attribution will be segmented in its constitutive elements, which represent the fundamental units of the annotation. This will also enable a more considerate selection of the features to be included in the schema. Moreover, the analysis of the different components playing a role in the attribution relation will provide an account of the different lexical elements possibly representing them. Finally, some characteristics of attribution and various issues representing a challenge for the annotation will be discussed and possible solutions proposed. 3.1 The Components of Attribution Attribution relations are intuitively composed by at least two elements: the attributed linguistic material and the entity this is attributed to. The latter is usually referred to as the source (Prasad et al., 2007; Wiebe, 2002), which includes the experiencer of an emotional state as well as the writer or speaker of a text. The former, due to the multiplicity of its possible referents, has not a unique label. In the literature the attributed element has been termed as the text or document, when dealing with document-level attribution, as the AO (Prasad et al., 2007), representing a discourse unit, or interchangeably, when annotating opinions, as the object, content, inside (Wiebe, 2002) or target (Wiebe et al., 2005) towards which a certain attitude is held by the source. As AO refers to a discourse segment, which is not always the case in this study, this term will not be used. The terms proposed for the annotation of opinions are all equally valid, however, in order to avoid confusion, the attributed linguistic material will be univocally identified here as the content. 
In addition to source and content, a third element is fundamental to the relation: the lexical anchor signalling the existence of an attribution. In the PDTB this is merged with the source and the two are jointly annotated as the attribution phrase. In the manual for the sentential annotation of opinions (Wiebe, 2002:6) the private-state or speech event phrase itself is identified as on. In the annotation scheme proposed later (Wiebe et al., 2005), this element is included in both the private state and speech event frames as the text anchor. Although text anchor and lexical anchor will occasionally be employed, the element connecting source and content will be labelled in this work as the cue. In the examples that follow, when the cue, the source or the content are highlighted, this will be done as follows: the source span in bold, the cue underlined and the text corresponding to the content in italics.

The Source

The source of an attribution relation is the entity the content is ascribed to. Sources are usually the agents (26) of a speech event, when it is a statement that is attributed, or the experiencers (27), when dealing with a private state.

(26) Chairman Krebs says the California pension fund is getting a bargain price that wouldn't have been offered to others. (PDTB 0331)

(27) Sue thinks that the election was fair. (Wiebe et al., 2005:9)

However, the picture is often more complicated. Quite frequently the mentioned sources are not animate agents. Contents are often attributed to institutions or knowledge repositories, such as law codes, studies, reports and newspapers. Although these usually stand in a metonymical relation to the actual animate source, the latter is deliberately left out of the attribution as unknown, irrelevant or even a plurality (28). In example (29) below, the content is a piece of information which needs a reliable source to be considered trustworthy. Ascribing the content to a major newspaper is here more effective than directly citing an unknown journalist.

(28) La Costituzione prevede la mozione di fiducia per battezzare un governo, quella di sfiducia per farlo cadere. (ISST els035)
The Constitution provides for a confidence motion to install a government and a no-confidence motion to bring it down.

(29) Il quotidiano Ma'ariv riporta che è stato rafforzato il servizio di sorveglianza attorno a Rabin, al capo di stato maggiore Shahak, al ministro degli Esteri Peres, a quello della Polizia Shahal e dell'Ambiente Sarid. (ISST cs042)
The newspaper Ma'ariv reports that the security detail has been strengthened around Rabin, the Chief of Staff Shahak, the Foreign Minister Peres, the Police Minister Shahal and the Environment Minister Sarid.

In other cases the source is not an agent but a specification or an adjective of its metonymic referent, e.g. the words for the speaker, the document for the writer, in agentive position ((30), (31)).

(30) According to John's declaration, Mary left the party before midnight.

(31) The presidential report announced that the Defence Minister resigned today.

When a source adds credibility to the content it is related to, it is usually explicitly mentioned through the attribution relation. However, especially in journalistic texts, attribution relations serve another purpose: they remove liability from the writer by interposing another source. Sometimes this strategy is used when the provenance of the information in the content is not certain or not known. In this case the metonymic source deliberately lacks a specific referent (32). In this way the writer does not assume responsibility for the given statement, without really attributing it to another specific source.

(32) secondo indiscrezioni avrebbe sostenuto davanti agli investigatori che non intendeva fare nulla di male e che per lui si è trattato di un gioco. (ISST cs004)
according to leaks, he reportedly told the investigators that he did not intend to do anything wrong and that for him it was just a game.

(33) Secondo anticipazioni l'esame del Consiglio di Stato avrebbe avuto un esito positivo e il regolamento dovrebbe ricevere il semaforo verde ai primi di giugno. (ISST sole153)
According to advance reports, the Council of State examination appears to have had a positive outcome and the regulation should get the green light in early June.

Sources without a corresponding referent can also be indefinite entities, e.g. the people, someone, or impersonal pronouns, e.g. one, you. Moreover, an attribution relation can exist even though, paradoxically, one of its constitutive elements, the source, is missing. This effect is achieved through the use of e.g. a passive attribution verb lacking the agent (34), a past participle (35) or an infinitive.

(34) È stato detto che si tratta di sport, non bisogna farne una tragedia; (ISST els060)
It has been said that this is sport, we shouldn't make a tragedy of it;

(35) L'accordo annunciato ieri (ISST sole101)
The agreement announced yesterday

As Italian is a pro-drop language, the source is quite often left implicit. This, however, does not mean that it is missing. It corresponds in fact to the implicit personal pronoun of the attribution verb, usually coreferential with an explicit entity mentioned elsewhere in the text.

(36) Probabilmente Vialli non ha dimenticato le voci sulla sua presunta vita allegra durante i Mondiali del 1990 rivelate su Italia1 da Maurizio Mosca. E Ø non crede che la recente alleanza tra Juventus e Milan possa cambiare molto il comportamento dei commentatori sulle emittenti di Berlusconi. (ISST cs043)
Probably Vialli has not forgotten the rumours about his alleged high life during the 1990 World Cup revealed on Italia1 by Maurizio Mosca. And (he) doesn't believe that the recent alliance between Juventus and Milan could really change the commentators' behaviour on Berlusconi's channels.

The Content

The content of an attribution could be regarded as the nucleus (Wolf and Gibson, 2005) of the relation. The source and the cue act as satellite elements and therefore, according to RST (Mann and Thompson, 1988), convey additional information. As already mentioned, the content can be realised by different linguistic units.

Word or phrase

A single word or phrase can already constitute the content of the attribution, as in (37) and (38). This is not only the case when the word or phrase represents, although short, a complete utterance directly reported. Yes and no function in such cases as sentence substitutes and therefore contribute to textual cohesion (Renzi, 1995). Very often, what is attributed is not directly the content, but its container (39). The main reason for creating an annotated resource for attribution is to link the content with its source, so as to allow a more correct semantic interpretation of it and to account for its provenance; the attribution of a container of information therefore appears at first not to be relevant, and these words or clauses would not require annotation. In example (39), knowing that the press release has been issued by Palazzo Chigi is not necessary, as the press release neither represents linguistic material directly asserted by the source, nor conveys any piece of information that could be ascribed to the source.

(37) The minister addressed the president calling him padrino.

(38) Sì, le risponde convinta un'amichetta. (ISST cs060)
Yes, a little friend answers her with conviction.

(39) Palazzo Chigi emette un nuovo comunicato. (ISST els048)
Palazzo Chigi (seat of the Italian Government) issues a new press release.

However, the case is different with event anaphora, when the content is also expressed, although elsewhere in the text (40). Annotating the attribution relation that binds the source to the container of the attributed span would allow the actual content, once this metonymic co-reference relation is resolved, to inherit the attribution relation. Similarly, the content is often expressed by just a pronoun (41) co-referentially recalling the attributed utterance. In the examples below, the content represents an instance of event anaphora, a relation often intertwined with attribution.

(40) Palazzo Chigi emette UN NUOVO COMUNICATO. <<Sarà il governo>> scrive <<a prendere una decisione in piena autonomia e responsabilità>>. (ISST els048)
Palazzo Chigi issues A NEW PRESS RELEASE. <<It will be the Government>> (it) writes <<that takes a decision in full autonomy and responsibility>>.

(41) Dobbiamo fare un ulteriore salto di qualità, entrare in una nuova mentalità. A dirlo è Giuseppe Signori, (ISST re126)
We have to make a further leap in quality, enter a new mentality. The one saying IT is Giuseppe Signori,

Finally, it is also possible to find a verb as the content, and at the same time the cue, of the attribution. This happens with verbs such as confermare (to confirm), accettare (to accept), negare / rifiutare / smentire (to deny), which, because of their semantics, implicitly involve the production of a yes/no utterance. In this case, however, it is not necessary to link source and content, as the verb is already syntactically connected to its subject, or to its object in the case of a passive verb.

Clauses

More often it is a larger linguistic unit that is attributed. This can still happen intrasententially, when the content is a single clause or more than one (42). Reported speech, direct (43) or indirect, is also usually represented at the sentence level.
Source and verbal cue together often constitute the main clause, while the content is the direct object (42) of the attribution verb. The attributed span can be expressed by a subordinate or embedded clause. The content might also represent the main clause, and the attributing span an incidental clause.

(42) Mr. Marcus believes spot steel prices will continue to fall through early 1990 and then reverse themselves. (PDTB 0336)

(43) "Vi daremo le statistiche alla fine", promettono i generali croati. (ISST cs030)
"We'll give you the statistics at the end", promise the Croatian generals.

Sentences and larger units

Nevertheless, it is also common to find one or more clauses in a separate sentence, or one or more full sentences (44), as the content of an attribution relation. Discontinuous contents spreading over several sentences are often associated with interviews (45) or testimonies, where the source and the cue do not change and do not need to be constantly repeated.

(44) "There's no question that some of those workers and managers contracted asbestos-related diseases," said Darrell Phillips, vice president of human resources for Hollingsworth & Vose. "But you have to recognize that these events took place 35 years ago. It has no bearing on our work force today." (PDTB 0003)

(45) Dunque, Ghezzi, che cosa significa "non cinema"? Per intenderci, Moretti potrebbe girare tutta la vita ma non arriverebbe mai alla sensuosità o fatalità cinematografica di un Michael Cimino.... Sensuosità? E' un concetto che ha a che fare con la forma? " Fino a Palombella rossa...". (ISST cs050)
So, Ghezzi, what does "non cinema" mean? To be clear, Moretti could shoot all his life but he would never reach the cinematic sensuousness or fatality of a Michael Cimino... Sensuousness? Is it a concept that has to do with form? Up to Palombella rossa...

Finally, when dealing with news articles, the article itself represents a content, whose source is the writer. The content of the article is in fact the responsibility of its author, who is usually explicitly mentioned (Figure F).

Figure F - Newspaper article source

Elements Functioning as Cue

How is it possible to detect the existence of an attribution relation? The simple juxtaposition of a source and a content is not enough unless some other element provides the textual anchor that links them together. This element is the attribution cue and it is realised by different linguistic elements: graphic elements such as punctuation, or grammatical and lexical devices. Apart from establishing the relation, the cue has another function: it determines the kind of attribution, e.g. a belief, a thought, an assertion, etc. While punctuation cues always refer to asserted contents and prepositions alone do not specify the nature of the relation, nouns and verbs can express several types of attribution.
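The three components described in this section can be pictured as a simple record. The following Python sketch is only an illustration: the class and field names (AttributionRelation, source, cue, content, attribution_type) are assumptions made here, not the annotation schema defined later in the thesis. The source is optional, reflecting the cases discussed above in which it is left unexpressed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributionRelation:
    """Hypothetical record for one attribution relation (illustrative names)."""
    content: str                             # the attributed linguistic material
    cue: str                                 # the element linking source and content
    source: Optional[str] = None             # None when the source is not expressed
    attribution_type: Optional[str] = None   # e.g. "assertion", "belief"

    def has_explicit_source(self) -> bool:
        """True when a source span is recorded for this relation."""
        return self.source is not None


# Example (27): "Sue thinks that the election was fair."
belief = AttributionRelation(
    content="that the election was fair",
    cue="thinks",
    source="Sue",
    attribution_type="belief",
)

# Example (34): agentless passive, so no source is recorded.
agentless = AttributionRelation(
    content="che si tratta di sport, non bisogna farne una tragedia",
    cue="È stato detto",
    attribution_type="assertion",
)
```

Keeping the source optional lets the same record cover both fully expressed relations and those, such as (34), where only cue and content are present.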

Punctuation Cues

Punctuation, i.e. double and single quotation marks (" " and ' ') and, less frequently, double angle brackets (<< >>) or hyphens (- -), represents, in Italian as well as in English, the simplest cue to look for when searching for an attribution, although it is seldom the only one (46). However, this is not a reliable cue, as it only accounts for the attribution of assertions directly reported, leaving out indirect speech and the attribution of mental states such as opinions, intentions or knowledge. Moreover, the same punctuation marks may also be employed in Italian to mention a word or a title, to signal an unusual usage (47) of one or a few words, such as an ironic or metaphoric use, or even to give emphasis to them. In addition, single quotation marks are also used for the apostrophe and in some cases, in order to avoid special characters, to render accented glyphs.

(46) Il Papa: La cultura ha bisogno del genio femminile. (ISST cs014)
The Pope: Culture needs the female genius.

(47) Settembre, mese tradizionalmente <<caldo>>, non fa registrare vistosi strappi al rialzo, sottolineando l'andamento verso il basso del costo della vita. (ISST els020)
September, traditionally a <<hot>> month, does not record any conspicuous price surges, underlining the downward trend in the cost of living.

Preposition and Prepositional Groups

Syntactic cues can be expressed by several word classes. Although attribution verbs are by far the most common signal of the existence of an attribution relation, nouns, adjectives, prepositions and adverbs can also function as cues. While only one cue is required, it is common to find two or even more cues combined (48). A partial account of Italian cues, although limited to reported speech, can be found in Renzi (1995). This grammar lists the prepositions per (for) and secondo (according to) (48).
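The punctuation cues described in the previous subsection can be located with a simple pattern search. The sketch below is a hedged illustration, not part of the pilot annotation: the function name quoted_spans and the set of delimiters are assumptions, and a match is only a candidate, since the same marks also signal titles, unusual usage or emphasis. Single quotation marks are deliberately left out because they also render apostrophes.

```python
import re
from typing import List

# Delimiter pairs treated as candidate punctuation cues (an assumed, partial set).
QUOTE_PATTERN = re.compile(
    r'"([^"]+)"'                      # straight double quotation marks
    r"|<<(.+?)>>"                     # double angle brackets as rendered in the ISST examples
    "|\u201c([^\u201d]+)\u201d"       # curly double quotation marks
)


def quoted_spans(text: str) -> List[str]:
    """Return the text enclosed by each pair of quotation delimiters."""
    spans = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one of the alternative groups is filled for each match.
        spans.append(next(g for g in match.groups() if g is not None))
    return spans


# Example (43): a directly reported assertion.
reported = quoted_spans('"Vi daremo le statistiche alla fine", promettono i generali croati.')

# Example (47): the same delimiters marking an unusual usage, not an attribution.
scare = quoted_spans("Settembre, mese tradizionalmente <<caldo>>, non fa registrare vistosi strappi al rialzo")
```

The second call returns a span even though no attribution is present, which illustrates why punctuation alone cannot be relied upon as a cue.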
To the prepositions per and secondo should be added, although they are not very frequent and some do not even occur in the ISST corpus, the prepositional groups a detta di (according to), a parere di (in the opinion of), agli occhi di (in the eyes of), nell'ottica di (from the perspective of), per quanto riguarda (as far as ... is concerned) and stando a (according to) (49).

(48) Non solo, ma secondo lo stesso Tronchetti Provera da fornitore di cavi siamo diventati fornitori di sistemi integrati. (ISST re062)
Not only that, but according to Tronchetti Provera himself, from a supplier of cables we have become suppliers of integrated systems.

(49) Oltre ai missili di questo tipo, stando alle stesse fonti, le navi partite dalla Corea del Nord ne trasporterebbero:cond altri del tipo Styx... (ISST els075)
Besides missiles of this kind, according to the same sources, the ships that left North Korea are (apparently) carrying others of the Styx type...

Adverbials

Prasad and Miltsakaki et al. (2008: 43) identify some adverbials which may function in English as attribution cues, such as reportedly (50), allegedly, supposedly, etc. Italian, however, has no corresponding class. These adverbials usually have an equivalent in a prepositional phrase: a quanto si dice (according to what is said).

(50) East Germans rallied as officials reportedly sought Honecker's ouster. (PDTB 2278)

Nouns and Adjectives

In discussing the ways the source can be expressed (3.1.1), it was shown how adjectives can assume this function. Adjectives establishing a relation of possession between possessor and owned entity function as the cue of an attribution relation if the possessor is the source and the owned entity is the content, or an element coreferential with the content, as in example (51).

(51) The Defence Minister resigned today. The presidential announcement at the press conference came unexpectedly.

Although nouns alone do not establish any relation between source and content, they can function as introductory elements (Renzi, 1995) following or preceding the attributed material they represent. These nouns or NPs are very informative about the type of attribution, e.g. assertion (declaration, release, observation, etc.), belief (doubt, idea, etc.) or intention (agreement, promise, desire, etc.). Knowing the type of attribution is highly relevant in order to discern whether the attributed material is, for example, an opinion, a statement or an intention. In example (52), la dichiarazione (the declaration) is the only element signalling that the following material (highlighted in italics) is not attributed to the writer but to another source, which is not mentioned at all. The noun itself, representing a speech act, presupposes the existence of a source.

(52) Mi ha sconvolto la dichiarazione che tutto questo non vale niente. (Renzi, 1995:435)
The declaration that all this is worth nothing upset me.

Renzi (1995) also observes that nouns or NPs functioning as attribution cues usually have an argument structure or refer to speech acts, but also to acts of e.g. thought (53) or will.

(53) It is nice to die for what you believe in; who is afraid, dies every day, who is not afraid, dies only once. With this idea the anti-mafia magistrate Paolo Borsellino worked till he was assassinated in

Grammatical Cues: Quotative Conditional

Some languages grammatically mark the fact that the writer/speaker is not directly presenting the information but that there is an intermediary source. This grammatical category is called evidentiality, as it accounts for the evidence a speaker has for his/her statement (De Haan, 2008:77). As the WALS map of the Semantic distinction of evidentiality (De Haan, 2008:77) shows, the encoding of evidentiality is a relatively common feature. In Europe, however, it is almost only indirect evidentiality that can be expressed, without further distinguishing among different modes of sensory evidence.

De Haan (2008) points out that the languages presenting indirect evidentials in Europe are mainly Germanic, with the exclusion of English, and suggests that Finnish and French may have developed evidentiality because of Germanic influence, as Finno-Ugric and Romance languages do not present this feature. However, this is not accurate, as Italian possesses a grammatical structure to express hearsay, i.e. the quotative conditional (54), similar to the French conditionnel de la rumeur. Neither language, however, has a dedicated grammatical category for evidentiality, as the conditional is also used for other purposes, e.g. unreality, attenuated wish, etc., expressing a number of factuality degrees and epistemic modality. Epistemic modality is often associated with evidentiality, as the information source influences the degree of certainty the speaker expresses towards a proposition. Although epistemic modality may be intertwined with the conditional, this Italian mood is reportive and not inferential (Giacalone, 2007).

(54) Un incendio, che si sarebbe sviluppato:cond per cause accidentali, ha gravemente danneggiato a Fiano (Torino), uno chalet di proprietà di Umberto Agnelli, attiguo alla sua abitazione. (ISST cs010)
A fire, which (is said to have) started from accidental causes, has severely damaged a chalet in Fiano (Turin) belonging to Umberto Agnelli, next to his residence.

According to Aikhenvald (2004), languages like Italian and French do not have evidentiality, as they do not have dedicated morphemes expressing it, but just evidentiality strategies which originate from the verb mood and represent a secondary function. Nonetheless, quotative conditionals are very common in Italian and are an important indicator of attribution, although, as Knott (1996) also remarks, they can only be recognised in context, as exemplified in (55a-b). Moreover, rather than attributing the content to a source, quotative conditionals mark that the default attribution to the writer is not suitable. They always refer to an indeterminate unknown source, unless this is explicitly expressed by other means ((49), (56)).

(55) a. Il presidente sarebbe:cond morto.
The president (is said) to be dead.
b. Il presidente sarebbe:cond morto, se non avesse usato la cintura.
The president would have died/would be dead, if he had not used the seatbelt.

(56) Secondo anticipazioni l'esame del Consiglio di Stato avrebbe avuto:cond un esito positivo e il regolamento dovrebbe ricevere il semaforo verde ai primi di giugno. (ISST sole153)
According to advance reports, (it seems that) the Council of State examination had a positive outcome and the regulation should get the green light in early June.

Verb cues

Verbs are the most significant attribution cue in Italian as well as in English. When occurring at the intra-sentential level, they usually constitute the main clause together with the source, while the content is expressed by a dependent clause with (57) or without (58) the complementizer che (that). The attribution clause may occur not only before or after the content text span, but also enclosed in it as an incidental clause, or even, although this is not a very frequent strategy, around the content (e.g. Giovanni: Tutto qui? chiese con un sorriso. / John: Is that all? (he) asked with a smile.). Renzi (1995) groups these verbs in three categories:
A) verbs expressing a linguistic action, e.g. raccontare (to tell), telefonare (to phone), rispondere (to answer), scrivere (to write), ordinare (to order) etc.;
B) verbs expressing the reception of a linguistic act, e.g. sentire (to hear), intendere (to understand), leggere (to read), etc.;
C) verbs conveying a cognitive process, e.g. pensare (to think), ricordare (to remember), etc.
The PDTB adopts instead a different and more fine-grained classification (see 2.4.3). Assertions and eventualities partly overlap with A), facts should correspond to B), and beliefs to C).
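The rough correspondence between Renzi's three verb classes and the PDTB attribution types stated above can be written down as a small lookup table. This is a minimal sketch under stated assumptions: the dictionary names and the function are hypothetical, the lemma list contains only the verbs cited in this section, and the mapping to PDTB types is the approximate alignment suggested in the text, not an established equivalence.

```python
from typing import Optional, Tuple

# Renzi's (1995) classes, with only the example lemmas cited in the text.
RENZI_CLASS = {
    # A) linguistic action
    "raccontare": "A", "telefonare": "A", "rispondere": "A",
    "scrivere": "A", "ordinare": "A",
    # B) reception of a linguistic act
    "sentire": "B", "intendere": "B", "leggere": "B",
    # C) cognitive process
    "pensare": "C", "ricordare": "C",
}

# Approximate alignment with PDTB attribution types (only a partial overlap for A).
PDTB_TYPES = {
    "A": ("assertion", "eventuality"),
    "B": ("fact",),
    "C": ("belief",),
}


def candidate_pdtb_types(lemma: str) -> Optional[Tuple[str, ...]]:
    """Return the PDTB types roughly matching a verb lemma, or None if unlisted."""
    renzi = RENZI_CLASS.get(lemma)
    return PDTB_TYPES[renzi] if renzi is not None else None
```

A lemma outside the table, such as sorridere, simply returns None, which matches the observation that some cue verbs do not fit any of the three classes.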

(57) Nella morte di Ivan Ilic, Tolstoj sostiene che in quel momento si va verso una grande luce. (ISST els034)
In The Death of Ivan Ilyich, Tolstoy claims that in that moment one goes towards a great light.

(58) The BPC Fine Arts Committee think she had a literal green thumb. (PDTB 0984)

Another category of verbs can also be found functioning as attribution cues that does not match any of the above. As they cannot themselves function as introductory devices for the content, Renzi (1995) suggests that these verbs should be considered as implicitly presupposing one of the attribution verbs, probably a hypernym such as say or think. These verbs fall into two groups. One includes verbs such as iniziare (to begin), continuare (to continue), aggiungere (to add) (59) and concludere (to conclude), which suggest the existence of another attribution that they usually follow, but may also precede. These verbs could therefore be considered as inheriting their type from the verb they are linked to, which is usually an assertion, as they correspond to the chronological phases of a speech event.

(59) Finché c'è chi lo difende e lo incoraggia, lui continuerà a comportarsi così, profetizza Storace. E aggiunge: Ancora più grave poi è l'atteggiamento del governo, che non prende posizione davanti alle stronzate di Bossi perché la sua sopravvivenza dipende dai voti della Lega. (ISST cs027)
As long as there is someone defending and encouraging him, he will go on behaving like that, predicts Storace. And (he) adds: Even worse is the attitude of the government, which does not take a stand against Bossi's absurdities because its survival depends on the votes of the Lega.

The other includes verbs such as sorridere (to smile) (60), alzare le spalle (to shrug), adombrarsi (to grow dark) (61), rallegrarsi (to rejoice), acquietarsi (to calm down), etc. These verbs occur mainly in incidental position (Renzi, 1995). Most of them belong to what Levin (1993: ) classifies as verbs of nonverbal expression and of gestures, observing that they are usually associated with an emotion and mainly involve a facial expression or body parts, e.g. annuire (to nod), ammiccare (to wink), corrugare (to wrinkle), etc. The rest of the verbs in this group refer directly to an emotional change, often involving a change in intonation. Talmy (2000:152) defines manner as a subsidiary action or state that a Patient manifests concurrently with its main action or state. Therefore manner is expressed in languages that cannot normally express it on the verb (e.g. Italian) as two sub-events. As these attribution verbs express the manner of the verbs conveying the attribution, this could be considered a sort of metonymical use (manner for the action/manner). Like the continuative verbs, these verbs are usually associated with speech acts; they could therefore be seen as specifying the hypernym say, which is left implicit. In the following example the verb used as cue, sorride (smiles), could be replaced by dice sorridendo (says while smiling).

(60) Arlacchi sorride: Pura paranoia politica. Non ho partecipato ai lavori solo a causa di un impegno privato. (ISST re095)
Arlacchi smiles: Pure political paranoia. I didn't take part in the proceedings only because of a private engagement.

(61) E' vero che doveva interpretare lei la parte di Bruce Willis in Pulp Fiction? "Sì - si adombra Matt - Un ruolo interessante: con Tarantino eravamo a buon punto, poi è arrivato Bruce. I suoi film incassano un po' più dei miei, no? Hanno scelto lui", ride nervoso, tormentando il tappo a vite di una bottiglia d'acqua minerale. (ISST cs060)
Is it true that you were supposed to play Bruce Willis's part in Pulp Fiction? "Yes - Matt grows dark - An interesting role: Tarantino and I were well along, then Bruce arrived. His films gross a bit more than mine, don't they? They chose him", (he) laughs nervously, fiddling with the screw cap of a bottle of mineral water.

3.2 Some Issues

The annotation of attribution relations raises several questions about how to deal with peculiar aspects of attribution which represent a challenge to the annotation. These aspects arose from theoretical considerations as well as from the pilot annotation, and are particularly important as they determine the choice of a suitable tool and shape the annotation schema. In this section some of these features will be presented.

Nested Attributions

A pervasive characteristic of attribution is its recursiveness. Any attribution relation can constitute the content of another attribution relation, and this the content of another one, and so forth. Nesting an attribution into another attribution is a potentially never-ending process. This could be exemplified as follows, the capital letter representing the source and the brackets marked by the same small letter its content.

A [B {C (D d ) c } b ] a

Although not annotated in the PDTB, nested attributions need to be accounted for in order to determine the truth or trustworthiness value of the embedded content. Considering just the shallowest source, that is the leftmost, or the most embedded one, hence the rightmost, or even an arbitrary intermediate one, could lead to ignoring characteristics of the other ones which might call for an important re-reading of the information in the content. Wiebe (2002; Wiebe et al., 2005) includes nested sources in the annotation schema, listing in the source ID all the sources in the sentence, with the addition of the writer, that comprise a certain text span in their content.

(62) [Sue said {that Mary believes (that Gore won the election)}].
Sources: [writer] {writer, Sue} (writer, Sue, Mary) (Wiebe, 2002:5 - with the addition of brackets)

Formalising the effect the sources have on the different embedded contents would allow, once attribution relations have been recognised, the automatic derivation of the truth value of information at different levels of embedding. This represents a simplistic abstraction, as sources almost never divide so sharply into sincere and lying ones, but usually imply different degrees of reliability or bias which they project onto the content. This also varies according to the content topic, as the source's expertise also varies. Making use of Boolean logic it is however possible to draw some considerations. Figure G represents a possible scheme of nested attributions. Source A is related to the content a through the attitude it holds towards it (belief, statement, desire, etc.). The content a is formed by the relation Bb occurring between the source B and its content b, which in turn is composed of Cc and so forth, plus optional additional material which is not part of the relation. The trustworthiness and knowledge of the source A determine the truth and reliability of its content a, as in a relation of implication A → a. Substituting a with its correspondent Bb, the implication becomes A → Bb, that is, A implies the attribution relation embedded in it. If A is trustworthy, the attribution relation Bb nested in it is too, and should be considered factual. Similarly, every source implies its content and therefore the attribution relation included in it: B → b, C → c, D → d, ..., N → n.

Figure G - Nested attribution schema (sources A, B, C, D and their contents a, b, c, d)

However, when deriving the truth value of a content d, all the sources of the contents it is included in need to be considered. It is not sufficient that D is considered reliable; all sources to its left (i.e. A, B, C) need to be reliable as well (Figure Ha). They can therefore be joined with an AND relation: (A Λ B Λ C Λ D) → d. For simplicity, sources and contents which are reliable and taken into consideration are here labelled as true, while sources and contents that are not are labelled as false. Figure H shows the truth values (T/F) of a nested content d; the arrows point to the attribution relation between the content (small letter) and the source (capital letter).
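The conjunction described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function names and the trust dictionary are hypothetical, reliability is treated as a strict true/false value, and the judgement of which sources are reliable is supplied by hand. The first function lists, in the manner of the nested sources in (62), the chain of sources standing over each level of content; the second accepts a content only if every source in its chain is trusted.

```python
from typing import Dict, List


def nested_source_chains(sources: List[str]) -> List[List[str]]:
    """For sources ordered from outermost to innermost, return the chain of
    sources standing over the content at each level of embedding."""
    return [sources[: depth + 1] for depth in range(len(sources))]


def content_is_trusted(chain: List[str], trust: Dict[str, bool]) -> bool:
    """Boolean reading of (A AND B AND ... AND N) -> n: the content is accepted
    only if every source in its chain is trusted. Sources missing from the
    dictionary count as untrusted (an assumption of this sketch)."""
    return all(trust.get(source, False) for source in chain)


# Example (62): [Sue said {that Mary believes (that Gore won the election)}].
chains = nested_source_chains(["writer", "Sue", "Mary"])
# chains == [["writer"], ["writer", "Sue"], ["writer", "Sue", "Mary"]]

# Hand-assigned, purely illustrative judgements: Sue is taken as unreliable.
trust = {"writer": True, "Sue": False, "Mary": True}
verdicts = [content_is_trusted(chain, trust) for chain in chains]
# verdicts == [True, False, False]
```

With these values the outermost content is accepted while the two embedded ones are not, since Sue appears in both of their chains.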

Proceeding from the inside to the outside, if D is false (Figure Hb), d is already uncertain, and it is not necessary to also check A, B and C. Considering example (63) and supposing an answer to the question Is John innocent? is required, his mother should probably not be considered a reliable source. In that case, the piece of information representing the most embedded content is not relevant and cannot be considered as the answer. Did she really make such a declaration? If The Times and the police are considered reliable sources, and in this case too this is an arbitrary decision, then it should be assumed that this attribution relation is correct.

(63) The Times writes about the police saying that the murderer's mother declared: John is innocent.

Moreover, a false source implies that everything to its right, that is towards the more embedded attribution relations, cannot be trusted, even if the other sources to its right are T. Figure Hc shows a case in which A is true, and so are its content a and therefore the attribution of b to B. Since B is false, however, the content b and everything contained in it cannot be considered true.

Figure H - Truth values of a nested content. Panels a), b) and c) each show the chain A Λ B Λ C Λ D over the contents a, b, c, d, with the truth values T T T T T in a), T T T T T/F in b) and T F T T T/F in c).

Discerning between sources which should be considered and sources which should lead to the rejection of the content depends on subjective and domain-specific considerations. Once the sources have been sorted, determining whether the content is to be taken into consideration can potentially be turned into a Boolean problem, as presented above. As already anticipated, however, determining the relevance of a piece of information is not a Boolean problem, as it involves variables with domains larger than the binary true/false. That is, sources are almost never completely reliable or completely unreliable, but occupy intermediate positions on a continuum between the true and false poles, according to the field of information under consideration and to the personal orientations and subjective characteristics of both the source and the person considering the information. Algorithms for non-Boolean problems would be more appropriate to fully deal with the degree of truth of the content. This is better captured by fuzzy logic, in which the sources and the content have a truth value ranging between 0 and 1. Example (64), taken from real language use, presents four levels of nested attribution. First from the outside is the writer of the article, or more generally the newspaper publishing it, of which the whole sentence represents a content. Second is the New York Times, which is reporting rumours, the third source, and last is Blinder, holding the innermost content. None of these four sources needs to be discarded a priori, although rumours is surely less reliable as it has a nonspecific referent.

(64) Blinder, secondo voci riferite dal New York Times, sperava di succedere al presidente Greenspan quando a marzo scadrà la sua nomina. (ISST re070)
Blinder, according to rumours reported by the New York Times, hoped to succeed President Greenspan when his appointment expires in March.

The more sources there are, the more passages a piece of information has gone through, and therefore the greater the chance that it has undergone transformations from its original form, as in the well-known whisper game. Although not so common, and usually shallower than the nesting of an attribution relation in the content of another attribution relation, example (64) above shows that the source may also be an attribution relation itself. Rumours reported by the New York Times is in fact the source of the content Blinder hoped to (...). Alternatively, this could be interpreted and analysed as a different perspective on the same issue. A source can present an attribution relation in its content, and the source of a nested attribution relation can in turn be an attribution relation whose content is its local source and whose source is the superordinate one (e.g. Blinder, according to rumours reported by the New York Times ... / The New York Times reports rumours saying that Blinder ...).

Source of the Source

A special case of nesting is what is here called source of the source. In this case, the attribution relation makes explicit the presence of another source which is not on the same level of embedding as the actual source; if it were, it would just represent an instance of multiple sources (see 3.2.3). This added source can be more internal, as in example (65), where the source of the source is in small capitals and the most embedded source, Maurizio Damilano, is not directly connected to the content by an attribution relation, although this is semantically inferable.
This type of additional source usually depends on verbs of perception and knowledge, the ones labelled as facts in the PDTB, as these correspond to the verbs representing the reception of a linguistic act (Renzi, 1995) and therefore more or less implicitly recall the production of a linguistic act. The sentence in (65) could in fact be transformed into its reciprocal equivalent Maurizio Damilano told me about the disqualification of Garciano, with the original source becoming the indirect object, the recipient of the speech act, while the source of the source, signalling the provenance of the information, becomes the new source.

(65) (Ø) Ho saputo della squalifica di Garciano DA MAURIZIO DAMILANO, vi giuro, non pensavo di arrivare primo. (ISST cs071)
(I) heard of the disqualification of Garciano FROM MAURIZIO DAMILANO, I swear, I didn't think I would come first.

Both the recipient and the source of the source are relevant to the attribution relation, as they inform about the source of a piece of information and the entity it was addressed to. Both these elements can influence the way the content is perceived ((66)a-b).

(66) a. The pope/scientist says we do not derive from monkeys.
b. The scientist told THE PRESIDENT/THE SCHOOLCHILDREN that asbestos is harmful.

Source of the source could also cover those instances where a more external source is mentioned without directly relating it to its whole content, but just to the source embedded in it, as in ((67)-(68)). This strategy is often adopted when the intermediary source is not particularly prominent and is not expressing anything other than reporting the most embedded content. Example (68) could be paraphrased as The president's spokesman Rossi said that the president announced that a new anti-mafia pool has been appointed. If the attribution relation involving the source of the source were made explicit, the spokesman would become the subject of the sentence and therefore occupy a prominent role, attracting unneeded attention and diverting it from the more prominent source, the president.

(67) Poi però, TRAMITE LA FIGLIA che sta a Santiago, prima limita la portata del colloquio con Gaston Salvatore ("non è stata una vera intervista, solo una conversazione"), poi smentisce. (ISST period005)
Afterwards however, THROUGH THE DAUGHTER who lives in Santiago, first diminishes the importance of the conversation with Gaston Salvatore ("it wasn't a real interview, just a conversation"), then she denies it.

(68) The president has announced THROUGH HIS SPOKESMAN ROSSI that a new anti-mafia pool has been appointed.

Sources of the source of this second type are expressed as an adjunct indicating means. In that position they are also presented as less relevant, as if they were neutral and did not affect the content. However, both types should be included in the annotation, as they signal that the attribution relation is second-hand material, and they need to be considered when computing the distorting effect of the Whisper Game, as they add a level of nesting to the attribution.

Multiple Sources, Contents, Cues

An attribution relation involves three main elements; however, more than one of each can be involved in the relation at a time. The most common case is that of a source holding the same attitude towards more than one content, as in examples ((69), (72)).

(69) (Ø) Ho detto che ero dalla sua parte e che ritenevo giusta la sua protesta. (ISST cs063)
(I) said that I was on his side and that I considered his complaint fair.

Also quite common is the presence of more than one mentioned source (70). This is different from collective sources, such as institutions, organisations, pluralities or groups, as multiple sources are separate entities, or at least are presented as such, e.g. John and Mary; the government, the army, and the civilians, etc. Often, as in the example below, one source semantically includes the other, which represents a specification of the more general source. Multiple sources are more common when expressing beliefs or knowledge, as assertions or even opinions usually belong to a single entity or to an entity presented as unanimous.

(70) Tutti, incluse le autorità, conoscono la loro provenienza, ma nessuno dice e fa nulla per prevenire il massacro di capi selvatici. (cs.morph020)
Everyone, including the authorities, knows their provenance, but no one says or does anything to prevent the massacre of wild animals.

Lastly, the cue itself, or the attitude the source holds towards the content, can be multiple. Often an attribution relation is signalled by several strategies, e.g. According to what John suggests, the market is not ready yet; these, however, do not interfere with one another, as they all convey that the content is a statement, a belief or another kind of attribution. When the cues instead represent two separate attitudes a source holds towards the same content, as in example (72), this could be considered a multiple cue. In (71), instead, both verb cues refer to linguistic productions and could be grouped together. Multiple cues are not very common; more frequently the presence of two different cues does not reflect a different attitude but an evaluation of the content by the writer, who suggests in a way the key to interpreting the utterance, as in (73), where a directly reported speech act is bound to a cue labelling it as an opinion.

(71) <<domani questa stessa gente é pronta a scendere in piazza per rivendicare>> dicono e scrivono in molti. (ISST els063)
<<tomorrow the same people are ready to take to the streets to claim>> many say and write.

(72) The men can defeat immunities that states often assert in court by showing that officials knew or should have known that design of the structure was defective and that they failed to make reasonable changes. (PDTB 1160)

(73) "The journalists shouldn't morbidly write about people's sorrow" thinks Mary.

Co-reference Resolution

Since source and content are often recalled by a pronoun or a coreferential element, co-reference resolution becomes a fundamental issue when dealing with attribution relations. Manual annotation could simply mark the coreferential text span; the automatic capturing of the phenomenon, however, would require the resolution of anaphora and co-reference relations. Research in this area is progressing, but a tool able to resolve the kind of co-references involved in attribution is still lacking. Co-reference regarding the source is usually either bridging, e.g. El Sayed ... l'arabo / El Sayed ... the Arab (ISST els001), or pronominal anaphora. Source anaphora often involves pronouns (74) recalling full nouns or NPs, but also, in Italian, Ø subjects (75). The coreferential source or content is presented in small capitals in the examples below.

(74) Secondo il governo di Pechino, le accuse in base alle quali due diplomatici cinesi sono stati espulsi la settimana scorsa dagli Stati Uniti, sono una montatura. LO ha detto ieri un portavoce del ministero degli Esteri, IL QUALE ha anche annunciato che il governo cinese ha protestato con quello degli Stati Uniti e che si riserva il diritto di ulteriori reazioni. (ISST els075)
According to the Beijing government, the charges on the basis of which two Chinese diplomats were expelled from the United States last week are a frame-up. IT was said yesterday by a spokesman of the Foreign Ministry, WHO also announced that the Chinese government has protested to that of the United States and that it reserves the right to further reactions.

(75) Probabilmente Vialli non ha dimenticato le voci sulla sua presunta vita allegra durante i Mondiali del 1990 rivelate su Italia1 da Maurizio Mosca. E Ø non crede che la recente alleanza tra Juventus e Milan possa cambiare molto il comportamento dei commentatori sulle emittenti di Berlusconi. (ISST cs043)
Probably Vialli has not forgotten the rumours about his alleged merry life during the 1990 World Cup, revealed on Italia1 by Maurizio Mosca. And (HE) doesn't believe that the recent alliance between Juventus and Milan could much change the commentators' behaviour on Berlusconi's channels.

The content is instead usually formed by clauses or sentences recalled by a pronoun (74), but also by a noun of which it represents an elaboration (see 3.1.2), as in (76), where words refers back to the whole direct quotation. Example (74) contains three attribution relations, of which two involve co-reference. The first sentence/attribution is in fact attributed to a spokesman of the Foreign Ministry by recalling it with the pronoun it. The first attribution is nested in the second not, as usual, by being inside its content span (it is in fact in a separate sentence), but because of the event anaphora relating it to the content of the second attribution. The source, a spokesman of the Foreign Ministry, is afterwards recalled by the relative pronoun who and becomes part of the last attribution relation.

(76) "L'umanità deve proclamare uno storico sciopero ad oltranza fino alla distruzione di tutti gli armamenti nucleari". LE PAROLE registrate di Gheddafi, (ISST cs039)
"Humanity should proclaim a historic indefinite strike until the destruction of all nuclear armaments". Gheddafi's recorded WORDS,

While still challenging, anaphoric expressions such as pronouns have been deeply investigated, and some studies are also analysing event co-reference, which is closely related to studies about temporality and time references. The co-references included in attribution relations partly overlap with both research areas, as the source falls in the first group, i.e. bridging and pronominal anaphora, while the content is partly of interest to the second one, i.e. event anaphora. The resolution of co-reference is crucial in order to retrieve the specific provenance of information, as pronouns alone do not carry information about the reliability, expertise or bias of the source. Similarly, it is necessary to be able to establish a relation between the source and what it has actually said, thought, dreamt of, etc.
In a sentence like John has an idea, linking John to idea is not informative and would be of no use unless we can retrieve what John's idea was. As attribution is a bidirectional relation, linking not only linguistic material to the entity expressing it but also entities to what they express, co-reference also needs to point in both directions. Only a co-reference tool able to account for this bidirectionality would make it possible in example (76), once the material in quotes has been retrieved, to realise that it is coreferential with an NP which is part of an attribution relation, from which it should therefore inherit the source. On the other hand, if the task is retrieving Gheddafi's declarations, words as such, although attributed to him, is not what he said, and it should be possible to recover the coreferential quotation it stands for.

Scope Definition

The main challenge for the annotation of discourse phenomena, and for annotation in general, is reaching a scope definition precise enough not to invalidate any attempt to reach satisfactory inter-annotator agreement scores. As far as attribution is concerned, it is important to define what to include in the cue and over which text span the attribution relation holds. The content is not always as easily detectable as when it is delimited by quotes. Sometimes it is expressed by a pronoun or full noun recalling it, as discussed in (3.2.4); other times, due to the ambiguity of language, it is not clear what exactly the attributed span is and what is possibly just additional material. In the case of multiple contents, for example, the presence of a conjunction (77) is not sufficient to assume that the second span is also a content, as it often represents some additional information or even a comment by the writer. To be sure that the second span is also attributed, the subordinator that should be present as well.

(77) The president said that the economy is on the verge of a severe crisis AND he is going to meet the ministers to talk about possible solutions.

Concerning the source, this is often a noun phrase; however, attributes, appositives (78) or relative clauses need to be considered, as they might be necessary to the characterisation of the entity they refer to.
Other times this material constitutes a colourful description (79) which does not help identify the source referent and would just make the annotation less neat and manageable.

(78) Per il presidente dei deputati progressisti, Luigi Berlinguer, la maggioranza <<ha fatto una proposta di natura consociativa che abbiamo rifiutato>> (ISST sole013)
For the president of the progressive deputies, Luigi Berlinguer, the majority <<has made a proposal of a consociational nature that we have refused>>

(79) "Poi stasera torno a Zagabria", grida Kasim Zdionica, un signore con una pancia enorme, le ciabatte di gomma e un pugnale infilato nella cintura. (ISST cs030)
"Then this evening I'll go back to Zagreb", shouts Kasim Zdionica, a man with a huge belly, rubber slippers and a dagger stuck in his belt.

The span to be included in the cue itself is also sometimes unclear. Although the verb, noun, preposition, etc., functioning as the textual anchor of the attribution relation is not difficult to recognise, there might be supplementary information necessary to characterise the context in which the relation takes place, such as a temporal specification, a reference to the situation or entity (80) the content refers to, and so forth.

(80) PARLANDO DI VERGA, Pirandello scriveva: "i siciliani, quasi tutti, hanno un'istintiva paura della vita". (ISST els034)
WHILE TALKING ABOUT VERGA, Pirandello wrote: "Sicilians, almost all of them, have an instinctive fear of life".

Deciding what to include in and what to leave out of the annotation must take into account not only the relevance of each element to the interpretation of the content, but also the difficulty its inclusion could cause, making the task of the annotators more complicated and uncertain. Suggestions on how to deal with this issue are reported in (6.1).

3.3 Summary

In this chapter attribution has been analysed in order to highlight its constitutive elements and some problematic characteristics which are of particular interest for the annotation. Attribution relations can be considered as being composed of three constitutive elements: the content, representing the attributed material; the source, which is the entity the content is related to; and the cue, the textual anchor linking source and content together. Each of these constitutive elements can be expressed by a number of linguistic structures, which makes it more difficult to describe the phenomenon, e.g. for the purpose of recognising it automatically. Sources can be expressed by proper nouns or pronouns but can also be left implicit. The content span can range from a single word to the entire discourse. The cue is usually a verb, but can also be a noun, an adjective, an adverb, a preposition or prepositional group, a graphical device (i.e. punctuation) or even a grammatical one (i.e. the quotative conditional).

To this complex picture it is also necessary to add some features and problematic issues that need to be considered when developing the annotation schema. First of all, attribution relations can recursively nest into each other. Moreover, a level of nesting may not be made explicit, with the relevant source added as a source of the source representing the means through which a more embedded source expresses the content. Other times a speech event is presented from the hearer's perspective, leading to a change in roles: the perception of the information is attributed to the hearer, and its source is expressed as a specification of its provenance. Another issue involves the occurrence of multiple sources, contents and even cues as part of the same attribution relation. Furthermore, attribution is heavily intertwined with co-reference, and the understanding of attribution relations depends on the prior resolution of anaphora and co-references. A last challenge is the definition of the scope: the text span to include in each of the three components of attribution needs to be defined so that elements important for the interpretation of the content or the identification of the source are not left out, but without making the annotation too complex or too arbitrary, which would decrease inter-annotator agreement.
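
As a rough illustration of the elements discussed in this chapter, the following Python sketch models an attribution relation as a source, a cue and a content, where the content may itself be another attribution relation (nesting). It is only a sketch under stated assumptions: the class name Attribution, the function innermost_trust and the numeric reliability scores are hypothetical and not part of the proposed annotation schema, and the use of the minimum as a fuzzy AND is just one common choice for combining degrees of truth along a chain of nested sources.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Attribution:
    """One attribution relation: a source linked to a content by a cue."""
    source: str                          # entity the content is related to
    cue: str                             # textual anchor linking source and content
    content: Union[str, "Attribution"]   # plain text, or a nested attribution
    reliability: float = 1.0             # hypothetical score in [0, 1], not from the thesis


def innermost_trust(attr: Attribution) -> float:
    """Combine the reliability of every source on the way to the innermost content.

    The minimum acts as a fuzzy AND: with scores restricted to 0.0 and 1.0
    it reduces to the Boolean case, where one false source invalidates
    everything embedded in its content.
    """
    trust = attr.reliability
    node = attr.content
    while isinstance(node, Attribution):
        trust = min(trust, node.reliability)
        node = node.content
    return trust


# Example (64): writer > New York Times > rumours > Blinder.
# All scores below are invented for illustration only.
blinder = Attribution("Blinder", "sperava", "di succedere al presidente Greenspan", 0.9)
rumours = Attribution("voci", "secondo", blinder, 0.4)
nyt = Attribution("New York Times", "riferite da", rumours, 0.9)
article = Attribution("writer", "(implicit)", nyt, 0.9)

print(innermost_trust(article))  # 0.4: the nonspecific source "voci" is the weakest link
```

With Boolean scores the same function reproduces the case in which a true source A reports a false source B: everything embedded in the content of B receives the value 0.0, whatever the values of the more embedded sources.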

4 Features to Include in the Annotation

The annotation of an attribution relation basically requires marking the link between source and content. However, additional features could also be included in the annotation to provide useful information about the nature and veracity of this relation. These features, or attributes, have been derived from the PDTB annotation scheme for attribution (Prasad et al., 2007). As presented in (2.4.3), the scheme includes the attributes of type, source, determinacy and scopal polarity. After analysing the phenomenon of attribution, however, the PDTB scheme had to be partially modified and adapted in order to suit the present project. In the following sections each feature that has been included in the annotation will be presented, and the values it can assume discussed with the help of examples from the ISST corpus.

4.1 Type

The feature type, marking the type of attribution, has been included in the annotation schema employed for the pilot without any changes from the PDTB. The type, which is anchored to the cue, determines the kind of attitude the speaker holds towards the content of the relation. As in the PDTB scheme (Prasad, Miltsakaki et al., 2008), it can assume four values: assertion, fact, belief and eventuality. The distinction seems quite viable, especially if compared to the more fine-grained categorisation adopted by Wiebe (2002; Wiebe et al., 2005) for the annotation of speech events and private states: assertions (writing or speaking), opinions, beliefs, thoughts, feelings, emotions, goals, evaluations and judgements. However, some issues arose that would suggest a revision of this classification before applying it to the whole corpus.

Assertion

Assertions are conveyed by verbs of communication, e.g. dire (to say), affermare (to claim), riferire (to relate), spiegare (81) (to explain), and suggest that the attribution content has been verbally expressed, in writing (82) or speaking (81).

(81) Ha spiegato Sciandri dopo l'arrivo: "Ho imparato dagli errori del passato, quando spesso esitavo troppo prima di partire" (ISST cs082)
Sciandri explained after the finish: "I've learnt from past mistakes, when I often hesitated too long before starting."

(82) L'obiettivo, dice sempre il comunicato dell'Olp, <<è quello di assicurare una gestione trasparente e altamente professionale delle risorse palestinesi>>. (ISST sole023)
The goal, the PLO release goes on to say, <<is to guarantee a transparent and highly professional management of Palestinian resources>>.

Belief

Beliefs are associated with verbs expressing a mental attitude, such as pensare (to think), credere (83) (to believe), immaginare (to imagine). The content in this case reflects a mental orientation more than conveying an event, and it also expresses a slightly lower level of factuality: while the content of assertions is presented in a factual way, beliefs bind the content to a point of view (83), an opinion with no claim to general validity.

(83) Ø credo che vivesse nella villa dei Pietroiusti anche d'inverno. (ISST re118)
(I) think that she was living in the Pietroiustis' villa also in winter.

Fact

Facts are attributions of the reception of a speech act or of the knowledge of a piece of information whose truth is not questioned. Cues in this category include verbs of perception, e.g. sentire (84) (to hear), vedere (84) (to see), and verbs expressing knowledge such as sapere (to know), ricordare (85) (to recall), rimpiangere (to regret).

(84) Ø abbiamo visto e sentito, assieme, un'antica ira e uno stato di grazia. (ISST re011)
(We) have seen and heard, together, an ancient anger and a state of grace.

(85) "Era di ottimo umore", ricorda Francesco. (ISST els077)
"She was in a very good mood", recalls Francesco.

Eventuality

Eventuality conveys instead an intention the source holds towards the content. This group is quite heterogeneous and includes, under the label of control verbs, three classes (Sag and Pollard, 1991:65): verbs of the order/permit type, with the source trying to influence another agent to perform what is in the content, e.g. ordinare (to order), consentire (to allow), proibire (86) (to forbid); verbs of promise, e.g. promettere (87) (to promise), accettare (to accept), accordarsi (to agree), expressing the commitment of the source to performing a certain action; and verbs of the want/expect type, e.g. desiderare (to desire), sperare (88) (to hope), volere (to want), expressing a mental orientation of the source.

(86) E le autorità di Zagabria hanno proibito ai giornalisti di andare a Petrinja e nelle altre località appena riconquistate. (ISST cs030)
And the Zagreb authorities have forbidden journalists to go to Petrinja and the other places just reconquered.

(87) Il governo di Zagabria smentisce seccamente e promette di punire i responsabili se venissero portate delle prove del fatto. (ISST cs031)
The Zagreb government sharply denies it and promises to punish those responsible if evidence of the deed is provided.

(88) Gli operatori del mercato fisico sperano che la chiusura americana segni la fine dell'esplosivo rialzo delle quotazioni. (ISST sole150)
The physical market operators hope that the American close marks the end of the explosive rise in prices.
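
The four type values and their prototypical verb cues can be summarised in a small lookup table. The following Python sketch is only illustrative: the names AttributionType, CUE_LEXICON and type_of are hypothetical, the lexicon contains only the verbs cited in this section, and the input is assumed to be an already lemmatised verb (lemmatisation itself is not shown).

```python
from enum import Enum
from typing import Optional


class AttributionType(Enum):
    ASSERTION = "assertion"
    BELIEF = "belief"
    FACT = "fact"
    EVENTUALITY = "eventuality"


# Verb lemmas cited in section 4.1, grouped by the type they prototypically signal.
_VERBS_BY_TYPE = {
    AttributionType.ASSERTION: ["dire", "affermare", "riferire", "spiegare"],
    AttributionType.BELIEF: ["pensare", "credere", "immaginare"],
    AttributionType.FACT: ["sentire", "vedere", "sapere", "ricordare", "rimpiangere"],
    AttributionType.EVENTUALITY: [
        "ordinare", "consentire", "proibire",      # order/permit type
        "promettere", "accettare", "accordarsi",   # promise type
        "desiderare", "sperare", "volere",         # want/expect type
    ],
}

# Flatten into a lemma -> type dictionary for direct lookup.
CUE_LEXICON = {
    lemma: attr_type
    for attr_type, lemmas in _VERBS_BY_TYPE.items()
    for lemma in lemmas
}


def type_of(lemma: str) -> Optional[AttributionType]:
    """Return the type prototypically associated with a verb lemma, or None if unlisted."""
    return CUE_LEXICON.get(lemma.lower())


print(type_of("sperare"))    # AttributionType.EVENTUALITY
print(type_of("sostenere"))  # None: not listed in this section
```

A flat lexicon of this kind covers only prototypical verbal cues; it says nothing about non-verbal cues or about verbs whose type depends on context.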

Issues Concerning Type Definition

The definition of the type feature presents some problems. First of all, it refers only to verbal cues, while the textual anchor signalling an attribution relation can be expressed by different means, as listed in (3.1.3), e.g. prepositions, nouns, punctuation. The latter is employed to report direct speech and can therefore be interpreted as indicating an assertion-type attribution. Nouns are often deverbal, e.g. suggerire > suggerimento (suggestion), permettere > permesso (permission), comunicare > comunicato (82) (release), and can generally be traced back easily to the verb they implicitly involve, e.g. pensiero/idea (thought/idea) > pensare (to think), parola (word) > dire/scrivere (to say/write). Prepositions instead do not explicitly specify the type of attitude the source holds towards the content. However, it could be argued that they express an opinion, a point of view ((89), (90)), although derived from an assertion.

(89) secondo indiscrezioni avrebbe sostenuto davanti agli investigatori che non intendeva fare nulla di male e che per lui si è trattato di un gioco. (ISST cs004)
according to leaks he reportedly told the investigators that he didn't intend to do anything bad and that for him it was just a game.

(90) Secondo il giornale gli Stati Uniti sperano di siglare un <<memorandum di intesa>> sul programma <<Sdi>> con Italia, Israele e Giappone entro la fine del (ISST els015)
According to the newspaper the United States hope to sign a <<memorandum of understanding>> concerning the <<Sdi>> program with Italy, Israel and Japan by the end of

All types of attribution, however, presuppose some kind of assertion allowing the entity reporting the attribution relation to acquire the information. In example (91) the content represents the thought of some people; this does not mean, however, that it was acquired through mind-reading techniques. It is implicit that it is possible to learn about opinions only if they are expressed, usually through assertions, but also through other means of communication, e.g. facial expressions. This bond connecting assertions and beliefs is even more striking with self-attributions, as in (92). A speaker or writer wanting to express a personal belief has to assert it. The source in (92) believes the assertion expressed by the content but at the same time is saying it. Similarly, wills, intentions, orders, etc. also more or less directly presuppose an assertion.

(91) C'è gente che pensa siamo professionisti super pagati e invece la situazione è molto diversa. (ISST cs077)
There are people who think that we are highly paid professionals, whereas the situation is very different.

(92) Ø credo anche che forse convenga parlarsi tra le parti prima di spedire lettere. (ISST re012)
(I) also believe that maybe it would be appropriate for the parties to talk to each other before sending letters.

On the other hand, assertions quite often reflect what the source is thinking, as in the two attributions in (93). In the example, what the sources say, in quotes, is also an expression of their opinion. Less common are attributions like (94), where the assertion itself is what matters and the content is not an expression of the source's thought but just the sequence of words she pronounced; that is, the attention is on the cue rather than on the content. The entity reporting the two attributions in (93) could have substituted the verbal cue dicono (say) with pensano (think).

(93) "S'é pentita d'aver rotto il silenzio" dicono alcuni. "L'hanno costretta", dicono gli altri. (ISST period005)
"She regretted having broken the silence" say some. "She's been forced", say the others.

(94) Shana, meglio ricordata per la pubblicità dove Ø dice: "Toglietemi tutto ma non il mio Breil" (ISST re028)
Shana, better remembered for the commercial in which (she) says: "Take everything away from me but my Breil"

Another issue is determining the type when different types of cue co-exist. This is different from multiple cues (95) (3.2.3), which should be analysed as separate attribution relations. Relatively often a direct quotation occurs combined with a verbal cue other than assertion, sometimes providing an interpretation of the content (3.2.3). In example (96) the quotes suggest that the content corresponds to reported direct speech, therefore an assertion, while the cue promise refers to an eventuality, of the kind expressing a commitment.

(95) The men can defeat immunities that states often assert in court by showing that officials knew or should have known that design of the structure was defective and that they failed to make reasonable changes. (PDTB 1160)

(96) "Vi daremo le statistiche alla fine", promettono i generali croati. (ISST cs030)
"We'll give you the statistics at the end", promise the Croatian generals.

The strategy adopted here for these cases of composite cues of different types is to give priority to the punctuation. A direct quote is surely the most reliable of attributions, as the content is reported without any mediation. Moreover, the assertion precedes the attitude expressed by the other cue, as the latter was derived from the semantics of the content. In (96) what the Croatian generals said was perceived as a promise, at least by the journalist reporting the information. By establishing the predominance of punctuation, these instances are classified as assertions. Consequently manner verbs, with an implicit general reporting verb, functioning as cues in combination with quotes (3.1.2), e.g. sorridere (97) (to smile/to say while smiling), will also be classified as assertions, avoiding possible confusion.

(97) Arlacchi sorride: "Pura paranoia politica. Non ho partecipato ai lavori solo a causa di un impegno privato". (ISST re095)
Arlacchi smiles: "Pure political paranoia. I didn't take part in the proceedings only because of a private engagement".

A last issue concerns the semantics of the verb cues. On the one hand, the myriad of attribution verbs cannot always be unquestionably assigned to one of the four possible types. While verbs like dire (to say), pensare (to think), sapere (to know), volere (to want), are quite prototypical and central to their category, other verbs such as criticare (to criticise), avvertire (to warn), leggere (to read), elogiare (to praise), suggerire (to suggest), are more peripheral and uncertain. On the other hand, a conspicuous number of verbs are polysemous and can belong to one type or another according to which of their meanings is in use. This can only be determined by the context, as in ((98), (99)). The same verb cue sostenere assumes in (98) an assertive function, corresponding to claim, while in (99) it expresses a commitment, meaning support, which represents an eventuality.

(98) Il governo di Zagabria, invece, sostiene che sono solo 100 mila le persone in cammino. (ISST cs031)
The Zagreb government claims instead that only 100 thousand people are on the move.

(99) Ma ieri sera i parlamentari serbi hanno sostenuto senza riserve la decisione di Karadzic. (ISST cs034)
However, yesterday evening the Serbian parliamentarians wholeheartedly supported Karadzic's decision.

The issues presented in this section partly arose from the pilot annotation, partly from previous considerations and from the attempt to list and classify attribution cues (6.3). Although the type classification has been adopted unchanged for the pilot in this study, the problems it raises strongly suggest testing its feasibility by evaluating the inter-annotator agreement it yields and possibly introducing some changes.
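
The priority given to punctuation and the problem of polysemous verb cues can be sketched as a simple decision procedure. The Python fragment below is a hypothetical illustration, not the procedure used in the pilot: the table CANDIDATE_TYPES and the function resolve_type are invented names, the table lists only a few verbs discussed in this section, and cases that can only be decided from context are deliberately returned as None rather than guessed.

```python
from typing import Optional

# Candidate types per verb lemma (illustrative subset only).
CANDIDATE_TYPES = {
    "dire": {"assertion"},
    "promettere": {"eventuality"},
    "sostenere": {"assertion", "eventuality"},  # "claim" (98) vs. "support" (99)
    "sorridere": set(),                          # manner verb, no type of its own
}


def resolve_type(lemma: str, has_direct_quote: bool) -> Optional[str]:
    """Assign a type to a cue, giving priority to punctuation.

    A direct quote always yields an assertion, whatever the verb.
    Otherwise the verb decides, provided it has exactly one candidate type;
    polysemous or unlisted verbs are left unresolved (None) for the context
    or the annotator to decide.
    """
    if has_direct_quote:
        return "assertion"
    candidates = CANDIDATE_TYPES.get(lemma, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


print(resolve_type("promettere", True))   # assertion, as in (96)
print(resolve_type("sorridere", True))    # assertion, as in (97)
print(resolve_type("promettere", False))  # eventuality
print(resolve_type("sostenere", False))   # None: needs context, see (98) and (99)
```

Returning None for the ambiguous cases keeps the sketch honest about what a lexicon alone cannot decide; the disambiguation of verbs such as sostenere would have to rely on contextual information that is not modelled here.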

4.2 Source

The source is one of the key components of the attribution relation and as such it is marked in the annotation. It can occupy any position with respect to its content, i.e. before, around, after or in between, and can be expressed by a number of elements (3.1.2). All the variation in their linguistic realisation aside, the entities the sources refer to can be very different, and this deeply affects the content, hence the need to retrieve this relation. The annotation could therefore mark a basic distinction of source types which would facilitate evaluating their reliability or relevance. The source type has been included in the annotation schema and can assume the same values as in the PDTB. These are: writer, other, and arbitrary. Aikhenvald (2004:64) distinguishes between QUOTATIVE, that is, reported information with an overt reference to the source (writer and other are of this kind), and HEARSAY, referring instead to reported information without an overt reference to those it was reported by. The source of a hearsay takes the value arbitrary in the annotation.

Writer

The writer is the default source of any journalistic text, and he or she holds the shallowest level of attribution, the content being the entire news article. Relatively often, at least in Italian newspapers, authors are not even explicitly mentioned, or they are indicated by their initials only. Even when they are mentioned, writers are never part of the article body since, as in any other attribution relation, the source is usually not part of the content it holds but occupies an external or peripheral position with respect to it. Unlike in the PDTB, where discourse connectives and their arguments are always attributed even without an explicit attribution relation, so that most of the attributions are to the writer, this will be left implicit here in order to simplify the annotation process.
The writer is external to the article intended as a discourse unit and usually not the only external source involved. Apart from the writer of the article, the newspaper publishing it could be considered another source, and even the website reporting it, in the case of news published on the web. The attribution of the entire news article to the writer and subsequently to the newspaper should be easily inferable and can be added at a later stage if needed. Nonetheless, in case the writer is directly and explicitly reporting his or her opinion or words, the annotation should mark the writer as the source. By explicitly mentioning himself or herself, the writer presents information in a less factual way, making explicit that it is not shared knowledge but a personal point of view ((100), (101)).

(100) È questo a mio parere il dato politico-sociale rilevante: (ISST re085)
It is this in my opinion the relevant socio-political data:

(101) Un arbitro corrotto caro Brera, è possibile che in tanti anni di calcio non sia venuto fuori il nome di un arbitro corrotto? Io non ci credo. (ISST els027)
A corrupted referee dear Brera, is it possible that in many years of football no name of a corrupted referee has come up? I don't believe it.

Arbitrary

All those sources which do not really attribute the content to a specific entity, or to an entity having a real referent in the world, should be marked as arbitrary. This category includes impersonal sources such as si (102) and uno (one); personal and indefinite pronouns used as impersonals, e.g. tu (you), qualcuno (someone) (103), nessuno (no one); relative pronouns, e.g. chi (who) (104); and missing sources, as with verbal moods having no explicit subject, e.g. infinito (infinitive) and gerundio (gerund), and passive constructions (3.1.2) with omitted agent.

(102) Spesso in questi casi si dice la mobilitazione popolare è più importante di mille altre ricerche. (ISST els032)
Often in these cases one says that the popular intervention is more important than thousands of other investigations.

(103) Qualcuno pensa che questo sia un quartiere privilegiato. (ISST cs092)
Someone thinks that this is a privileged district.

(104) C'è chi sostiene che stiamo vivendo il ritmo giusto di un mercato azionario come quello italiano, considerato la sua dimensione e le sue strutture. (ISST els055)
There is who claims that we are living the right pace of a share market like the Italian one, considered its size and its structures.

Personal plural pronouns, i.e. noi (we), voi (you), loro (they), or indefinite pronouns, e.g. tutti (everyone), molti (many) (105), can also be used as arbitrary sources, as can collective nouns, such as la gente (the people) (106), referring to an indistinct plurality. Especially with plural impersonals, the effect achieved is often that of attributing the content to everyone, as if it were some kind of general truth, the expression of common sense or general knowledge (107).

(105) <<domani questa stessa gente è pronta a scendere in piazza per rivendicare>> dicono e scrivono in molti. (ISST els063)
<<tomorrow the same people are ready to take to the streets to claim>> many say and write.

(106) C'è gente che pensa siamo professionisti super pagati e invece la situazione è molto diversa. (ISST cs077)
There are people who think that we are super paid professionals instead the situation is very different.

(107) Tutti gli esseri umani sanno di poter essere più di ciò che sono. (ISST cs012)
Every human being knows they can be more than what they are.

Indefinite pronouns, however, are not arbitrary when their referent is restricted by a specification (108) or when they assume an adjectival role, as in (109).

(108) Ma, con una reazione molto comune in casi del genere, nessuna delle vittime ha pensato... (ISST els072)
However, having a very common reaction in similar cases, no one of the victims thought

(109) La decisione di convocarla fu presa domenica 23 dicembre, dopo che alcuni ministri affermarono: <<sentiremo l'opinione dei familiari e decideremo>>. (ISST els048)
The decision of convoking it was made Sunday, 23 December, after that some ministers affirmed: <<we will listen to the relatives' opinion and decide>>.

Another group of arbitrary sources is formed by those nouns referring to containers or means of information, such as voci (voices/rumours) (110), resoconto (report), indiscrezione (indiscretion) (111), proverbio (proverb) etc., when the entity producing them is not expressed.

(110) In Italia si è fermi ai progetti e alle intenzioni, nonostante le voci che da anni pronosticano l'avvento di Warner o Paramount nella gestione di sale. (ISST sole036)
In Italy we are still at projects and intentions, despite the voices that since years predict the arrival of Warner or Paramount in the management of movie theatres.

(111) Secondo indiscrezioni la prima segnalazione è stata inviata alla Procura della Repubblica. (ISST cs015)
According to indiscretions the first report has been sent to the Public Prosecutor's office.

Arbitrary is a very informative attribute which allows distinguishing between attributions to a real referent, which are labelled as other, and attributions whose source is not really clear. Having this data marked, it is possible to decide whether to include these attributions when considering the content or to leave them out, as if they were just a device the above source is employing to distance itself from the content and from the responsibility deriving from being its direct source.
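The use of the source type as a filter can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation of annotated attributions; the class and function names (SourceType, Attribution, traceable_only) are illustrative and are not part of the annotation schema or of any existing tool. The two sample records follow the source types discussed for examples (103) and (109).

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    """The three source values adopted from the PDTB."""
    WRITER = "writer"
    OTHER = "other"
    ARBITRARY = "arbitrary"


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record for one annotated attribution relation."""
    source: str              # surface form of the source span
    cue: str                 # surface form of the cue span
    content: str             # surface form of the content span
    source_type: SourceType


def traceable_only(attributions):
    """Keep only attributions whose source is not marked as arbitrary."""
    return [a for a in attributions if a.source_type is not SourceType.ARBITRARY]


sample = [
    # (103): indefinite pronoun used as an impersonal, hence arbitrary
    Attribution("Qualcuno", "pensa",
                "che questo sia un quartiere privilegiato", SourceType.ARBITRARY),
    # (109): indefinite in an adjectival role, hence not arbitrary
    Attribution("alcuni ministri", "affermarono",
                "sentiremo l'opinione dei familiari e decideremo", SourceType.OTHER),
]

kept = traceable_only(sample)
```

The opposite selection, keeping only the arbitrary sources, would follow the same pattern with the comparison reversed.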

When information with a traceable source is sought, contents having an arbitrary referent could be automatically discarded, as they do not meet this requirement and cannot be verified. On the other hand, when looking for general truths, rumours about forecasts, moods concerning an event and so forth, arbitrary sources are particularly relevant.

Other

The value other is associated with those sources which refer to a specific entity. This is often the proper noun of a person, e.g. Kasim Zdionica (112), Angela Merkel, or of an organisation, e.g. the Parliament, The Times, etc. The specific referent can also be mentioned somewhere else in the article and recalled (bridging anaphora) by a general noun or pronoun in the attribution relation, as in (113). The borderline between arbitrary and other, however, is far from sharp. Sources can be more or less generic and detectable. Common nouns sometimes refer to an entity whose identity can be more or less easily reconstructed, such as the president when talking about a specific company, the judge referring to a precise trial, Angelina Jolie's husband, and so forth, but other times the term is too generic to really allow identifying its referent.

(112) Poi stasera torno a Zagabria, grida Kasim Zdionica, un signore con una pancia enorme, le ciabatte di gomma e un pugnale infilato nella cintura. (ISST cs030)
Besides, this evening I'll go back to Zagreb, shouts Kasim Zdionica, a man with a huge belly, plastic slippers and a dagger inserted in the belt.

(113) La Fermenta, a sentire l'arabo, è organizzata in modo che oggi consegue un utile pari al 35 per cento del fatturato. Questo il vero traguardo che dovrà nel tempo raggiungere la Pierrel. Ma come? Con tagli di mano d'opera? Nemmeno per sogno, dice El Sayed. (ISST els001)
Fermenta, according to the Arabian, is organised so that it earns at present a profit of 35 per cent of the turnover. This is the real goal that in the long distance Pierrel will have to achieve. But how? Cutting down on workforce? No way, says El Sayed.

Although included in the present study as other, common nouns such as residente (resident), passante (passer-by), donna/signora (woman) (114), esperti (experts), whilst referring to a specific referent in the real world, are general terms which do not allow any identification or characterisation of the source. In example (115) the journalist is trying to characterise the unknown referent he is quoting by adding a detail about the way she was dressed, i.e. in grey, as if this would make the lady recognisable. However, these sources are not to be confused with arbitrary ones. A lady in grey, unless the writer is lying, is not a generic entity, a plurality, a hearsay, but a specific human being in the real world, like the man in (112). It is desirable to provide the final annotation with an additional distinction that can account for this type of source, introducing an additional value for the source feature such as common.

(114) Una donna afferma di aver assistito all'uccisione a sangue freddo del marito. (ISST re084)
A woman claims she has witnessed the cold blood killing of her husband.

(115) <<Voto no>> diceva una signora in grigio <<tanto c'è già chi ha deciso per noi>>. (ISST els048)
<<I vote no>> was saying a lady in grey <<anyway there is already who has decided for us>>.

4.3 Factuality

Hunter et al. (2006) remark that many reportative verbs have, in addition to their intensional use, an evidential use, such as the one making B in (116) an appropriate answer to A. They argue that theories of discourse interpretation should account for these different uses. According to their analysis, the intensional use is conceptually primary and the evidential use derives from it. They therefore

introduce two discourse relations in order to account for the different interpretations: an evidence relation and an attribution relation. While evidence is a subordinating relation veridical in both arguments, with attribution it is the embedded clause that is subordinate to the main claim, and the relation is non-veridical with respect to the right argument.

(116) A: Why is John absent from the meeting?
B: Sharon said that he is out of town. (Hunter et al., 2006:99)

When considering the factuality of an attribution relation, it should be clear to which of these two uses it refers. The evidential relation is true depending on the veracity of the evidence, which is the information in the content of the attribution relation (116). The attribution is instead true if the actual source-content relation via the attitude expressed by the cue is real, e.g. the assertive event in (116) really took place. The factuality of an attribution relation, however, does not entail the factuality of its content. Sources can in fact lie or just be wrong. As the intensional use precedes the evidential use, the content of an attribution relation can constitute evidence for something else only if the attribution relation itself is factual.

While it is very complex to account for the factuality of the content, the factuality of the attribution relation can be syntactically computed considering the cue and the source or other elements scoping over it. Some information about the factuality of the content can be derived from the type of cue, as suggested by Prasad, Miltsakaki et al. (2008:44). The feature factuality accounts here for the factuality of the attribution relation only, marking whether this relation really exists, i.e. answering the question: is this content really presented as attributed to this source?
In their account of event factuality, Saurí and Pustejovsky (2008) distinguish among situations presented as corresponding to real situations in the world, situations which instead are unreal, and uncertain situations. They characterise factuality as involving polarity and epistemic modality, which could be defined as the commitment of a source towards the content of a proposition. Polarity takes two values, i.e. positive and negative, while epistemic modality can assume a range of values varying from absolutely certain to uncertain. The combination of these two features determines a range of factuality values (Table 1).

            Positive    Negative
Certain     Fact        Counterfact
Probable    Probable    Not probable
Possible    Possible    Not certain

Table 1 - Factuality values (Saurí and Pustejovsky, 2008)

For the present annotation scheme, the factuality of the attribution relation can assume only two values: factual and non-factual. The first accounts for the attribution relation being presented as a fact in the world (certain and positive). Non-factual represents underspecified factuality and should not be confused with counterfactual: it accounts not only for attributions presented as not real, but also for the intermediate values expressing different degrees of possibility and probability. Further distinctions within the non-factual value of the factuality attribute are left for future developments of the annotation schema.

Factual

The factuality of the attribution relation is marked in the PDTB with the determinacy feature. This can take only two values: indet, accounting for the attributions presented as not factual, and null for the factual ones. This substantially corresponds to the present account of factuality of attribution, with null corresponding to factual and indet to non-factual. The term has, however, been changed here to factuality, as this seems more specific and easily recognisable. In news language factual attributions are by far the most frequent, as journalists tend to report facts and to present information and events as real facts rather than making suppositions or hypotheses. An attribution presented as factual may nonetheless not correspond to a real event. Whether or not to believe that the attribution relation is genuine can be decided only on the basis of the above source, i.e. the source, or sources, of the content in which the attribution relation is nested.
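The reduction of the six values in Table 1 to the two values of the present schema, and the correspondence with the PDTB determinacy feature, can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout, the lowercase labels and the function name collapse are hypothetical choices, not part of either annotation scheme.

```python
# Table 1 (Saurí and Pustejovsky, 2008) keyed by (epistemic modality, polarity).
# The lowercase string labels are an assumption made for this sketch.
FACTUALITY_TABLE = {
    ("certain", "positive"): "fact",
    ("certain", "negative"): "counterfact",
    ("probable", "positive"): "probable",
    ("probable", "negative"): "not probable",
    ("possible", "positive"): "possible",
    ("possible", "negative"): "not certain",
}

# Correspondence with the PDTB determinacy feature described above.
PDTB_DETERMINACY = {"null": "factual", "indet": "non-factual"}


def collapse(modality, polarity):
    """Map a Table 1 cell to the two-valued factuality of the present schema.

    Only the certain and positive cell counts as factual; every other cell,
    including the intermediate degrees, falls under non-factual.
    """
    value = FACTUALITY_TABLE[(modality, polarity)]
    return "factual" if value == "fact" else "non-factual"


collapsed = {cell: collapse(*cell) for cell in FACTUALITY_TABLE}
```

Keeping the full table behind the two-valued feature would make it possible to refine the non-factual value later without changing existing labels.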
In examples (117) and (118) the attributions are presented as factual by the writer. In order to judge the veracity of the content it is instead necessary to determine whether the source in (117), Evtuscenko, is mendacious and whether the source in (118), the public prosecutors, really hold the attitude towards the content or merely feign it. This could be decided with the help of the context, but common sense and extra-linguistic knowledge also contribute to the conclusion.

(117) Evtuscenko, nel suo articolo, afferma che Pasternak gli fece pervenire una copia del romanzo poco dopo la sua prima pubblicazione. (ISST els076)
Evtuscenko, in his article, claims that Pasternak sent him a copy of the novel shortly after its first publication.

(118) Monreale, i pm vogliono Cassisa alla sbarra. (ISST re124)
Monreale, the public prosecutors want Cassisa before the bar.

The analysis of the content factuality as well as of the source trustworthiness is complex, as it is not inferable from syntax and grammatical features. The factuality of the attribution itself is instead easily determined: the source should be an entity and the cue should not be in the scope of a negation or of an element expressing uncertainty or probability.

Non-factual

Non-factual attributions can be considered as negated attributions: they express that there is no link between source and content or that this link is just hypothetical. It could be argued that these instances therefore do not represent attribution relations and could be left out of the annotation. However, they nonetheless convey relevant information and have for this reason been included in the annotation. Non-attributions can, for example, correct false attributions or just remark that there is no link between that particular content and a specific source (119).

(119) John is under investigation. The police, however, haven't said that he is presumed guilty.

Non-factual attributions expressing possibility or probability, moreover, are very useful when there is an interest in retrieving hypotheses or forecasts and not just facts. Modal verbs can be employed to express an attribution which is just possible, desired, ordered or urged (120). However, this is not the case when the source is in the first person, as the modal then reflects a more idiomatic use, as in (121). While the source in (120) has never really asserted the content, the one in (121) necessarily did.

(120) L'umanità deve proclamare uno storico sciopero ad oltranza fino alla distruzione di tutti gli armamenti nucleari. (ISST cs039)
The world should proclaim a non-stop strike till the destruction of all nuclear armaments.

(121) No, Ø devo dire anzi che in queste prime due settimane il mondo sindacale è stato in attesa e mi auguro che sia possibile intessere un dialogo forte. (ISST sole011)
No, on the contrary (I) have to say that in these first two weeks the union world has been lying in wait and I wish that it will be possible to intertwine a strong dialogue.

Similarly, when the cue is in the scope of a conditional or part of a hypothetical sentence (122), the attribution should be marked as non-factual. Other structures or contexts making an attribution non-factual are: the imperative, usually with verbs of belief and assertion such as think, imagine, but also say and admit; interrogative forms, as in (123); the future tense ((124), (125)), as an event happening in the future is not yet a real event and it is not certain it will ever become one; and the infinitive used to make a conjecture, as in (126).

(122) Se Ø vuoi che il fast relax sia davvero efficace tieni d'occhio l'orologio e scegli: l'intervallo di pranzo e il ritorno a casa. (ISST period003)
If (you) want that the fast relax is really effective keep an eye on the watch and choose: the lunch break and the homecoming.

(123) Pensa anche lei come tanti critici che, con il suo romanzo incompiuto, lo scrittore si trovasse a una svolta esistenziale? (ISST els034)
Do you also think like many literary critics that, with his unfinished novel, the writer was at an existential turning-point?

(124) E Ø diranno all'ONU che il problema dei profughi non li riguarda. (ISST re084)
And (they) will tell the UN that the refugee problem does not concern them.

(125) E naturalmente molti diranno che ha usurpato il posto in finale. (ISST els062)
And surely many will say that it has usurped the presence in the final.

(126) It is silly libel on our teachers to think they would educate our children better if only they got a few thousand dollars a year more. (PDTB 1286)

The presence of a grammatical cue, i.e. the quotative conditional (see 3.1.3), could also be taken as a sign of uncertainty, as in example (127). However, although often related to epistemic modality, and therefore involving some degree of uncertainty, the quotative conditional is a sign of an additional level of attribution, namely a level of nesting left implicit. In (127) what the quotative conditional expresses is not uncertainty about the attribution. The uncertainty is a consequence of the quotative conditional which, scoping over the cue, presents the attribution relation as second-hand material, similarly to hearsay. Attributions including a quotative conditional should therefore be considered factual.

(127) Manlio Averna avrebbe infatti riferito al pm che, in base agli accertamenti finora effettuati, è molto improbabile che Castellari si sia sparato. (ISST sole016)
Manlio Averna has told (QUOT.COND) the public prosecutor that, according to the verifications done till now, it is very unlikely that Castellari shot himself.
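The contexts listed above can be read as a small decision procedure over properties of the cue. The sketch below is only an approximation under stated assumptions: the boolean features and the names CueContext and factuality are hypothetical, and the features are assumed to have been identified beforehand by an annotator or a parser. Negation is deliberately left out, since a negated cue does not always make the attribution non-factual (4.4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CueContext:
    """Hypothetical feature bundle describing the context of a cue."""
    modal: bool = False                   # cue governed by a modal verb
    first_person_source: bool = False     # source is in the first person
    hypothetical: bool = False            # conditional or hypothetical clause
    imperative: bool = False
    interrogative: bool = False
    future: bool = False
    conjectural_infinitive: bool = False  # infinitive used to make a conjecture
    quotative_conditional: bool = False   # signals implicit nesting, not doubt


def factuality(ctx):
    """Return 'factual' or 'non-factual' for the attribution relation."""
    # A modal makes the attribution non-factual, except with a first-person
    # source, where the use is idiomatic, as in (121).
    if ctx.modal and not ctx.first_person_source:
        return "non-factual"
    if any((ctx.hypothetical, ctx.imperative, ctx.interrogative,
            ctx.future, ctx.conjectural_infinitive)):
        return "non-factual"
    # The quotative conditional alone leaves the attribution factual, as in (127).
    return "factual"


ex120 = factuality(CueContext(modal=True))
ex121 = factuality(CueContext(modal=True, first_person_source=True))
ex124 = factuality(CueContext(future=True))
ex127 = factuality(CueContext(quotative_conditional=True))
```

Under these assumptions the sketch labels (120) and (124) as non-factual and (121) and (127) as factual, in line with the discussion of those examples.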

Apart from being connected with the cue, non-factual attributions are also found when the source is negated, as in (128). The attribution to a no-source does not link the content to any entity and is therefore non-factual.

(128) Nessuno parla più di baratro imminente e di crisi finanziaria. (ISST cs025)
No one is talking anymore about imminent precipice and financial crisis.

4.4 Scopal Change

It is not always the case that an attribution cue in the scope of a negation is non-factual. It is possible, for example, that a negative particle affecting a verbal cue on the surface instead reverses the polarity of the content. This feature is included in the PDTB (Prasad, Miltsakaki et al., 2008:46) under the name of scopal polarity. Annotating this feature is not essential in order to account for the attribution relation; however, it is crucial for the interpretation of the content. The feature takes two values: scopal change and none. In case an attribution is factual but its cue is in the scope of a negation, presumably the negation is affecting the content and not the relation itself. If it were possible to determine separately the scope of negations and other elements, this would be preferable and the scopal change feature would no longer be needed.

Scopal Polarity

Most commonly, the scopal change affects the polarity of the content. The surface negation can be expressed syntactically (e.g. don't say, don't think), or lexically, e.g. negare (to deny), escludere (to exclude), smentire (to deny). Lexical negations which are part of the verb semantics, as in example (129) below, always scope over the content of the relation. The relation between the Croatian government and the contents it holds is factual and it could be paraphrased as: the Croatian government affirms that they have NOT been banned and affirms also TO HAVE NO ethnic cleansing intention in the newly conquered areas, or alternatively says it is not true that.
(129) Qualunque sia il numero di sfollati, il governo croato nega che siano stati espulsi e nega anche qualsiasi volontà di pulizia etnica nelle regioni appena riconquistate. (ISST cs031)
Whatever the number of evacuees, the Croatian government denies that they have been banned and denies also any ethnic cleansing intention in the newly conquered areas.

In case of a double negation as in (130), containing a syntactic and a lexical negation, the first scoping over the verb and the second over the content, the result is again a positive, and therefore factual, attribution. Not deny corresponds to affirm, and as the negation scoping over the verb changes its semantics, its reversed reading can no longer affect the polarity of the content. In these cases the annotation should assign the feature scopal change the value none.

(130) Ieri circa mille giovani hanno lasciato la città, ma la polizia non esclude che possa esserci qualche altra esplosione di violenza. (ISST cs037)
Yesterday around a thousand young people have left town, but the police don't exclude that there could be some other act of violence.

Scopal changes do not occur with verbs of the fact type (131), as noted by Kiparsky and Kiparsky (1971). They can occur, however, with the other types of cue and relatively often with beliefs. Determining whether an attribution is non-factual or whether there is a change in the scope of the polarity is often problematic. In example (132) below, the attribution relation contains a no-entity source, no one, and should therefore be non-factual. However, no one would like that to happen in their town could also be rewritten as everyone would like that not to happen in their town, involving a shift of the polarity from the source to the content. Are these sentences equivalent? Probably not. The correspondence is especially difficult with wills or intentions: not wanting something does not exactly correspond to wanting the opposite.

(131) Ma lui si strapazza, lavora troppo, Ø non ha capito che deve stare più attento. (ISST cs059)
But he tires himself out, he works too much, (he) hasn't understood that he has to take more care of himself.

(132) Strano destino, quello di Civitavecchia: finire spesso, troppo spesso, sulle pagine dei giornali per eventi misteriosi, oppure per fatti che nessuno vorrebbe accadessero nella sua città. (ISST cs090)
Strange destiny, that of Civitavecchia: ending up often, too often, in the news because of mysterious events, or because of events that no one would like to happen in their town.

Part of the problem derives from the fact that beliefs and some eventualities do not refer to events in the way assertions do. While negating an event makes it non-factual, a negative belief or will does not cancel the attribution relation: a negative mental state is still a mental state. With non-factual attributions included in the annotation, the issue of determining the presence of a scopal change in order to account for the veracity of the content is less crucial. Uncertain instances, i.e. those still involving an attribution (therefore not completely non-factual) but not exactly attributing the negation of the content (hence also not involving a real scopal change), could be annotated according to two strategies. One possible solution would be to mark them as non-factual, since the attribution of the content does not actually take place. In this case, the factuality attribute would be restricted to the veracity of the attribution of the unchanged content to the source. This non-factual attribution could still suggest, however, that the reverse of the content, or a different content, is presupposed. If this solution is adopted, the content of a non-factual attribution should be considered more carefully, as it could still carry useful information. On the other hand, the opposite strategy could be adopted and the attribution marked as factual but involving a scopal change.
In this case it should be clear that the scopal change attribute does not necessarily imply the exact reverse of the content polarity, but just that the negation is not really scoping over the attribution relation itself, and that the content or the attitude the source holds is affected by it. Since in some cases it is not possible to determine, despite the help of the context, whether the attribution relation itself is negated or just the attitude the source holds, e.g. John doesn't want to become president (he never expressed this intention / he expressed a negative intention towards becoming president), the first strategy seems more appropriate. The annotators should be invited to decide whether they perceive an existing attitude, positive or negative, that the source holds towards the content, and in case this is not clear they should mark the attribution as non-factual. By analysing the inter-annotator agreement it will then be possible to determine whether the issue of scopal change requires further clarification.

Other Elements Affecting the Factuality

Scopal polarity (PDTB annotation) has been relabelled in the present annotation project as scopal change, as polarity is not the only element affecting the factuality, and not the only one which can change in scope and affect the content of an attribution instead of the attribution itself. Other constructions, although uncommon, can occur. For example, the cue could be in the scope of a condition, as in (133). However, the condition in the first clause is not required for the attribution relation to be factual, namely for the belief event in (133) to take place. The condition affects instead the content of the attribution and is part of the belief: If there is a majority [...] the legislature could continue.

(133) Se c'è, cioè, una maggioranza in Parlamento in grado di affrontare seriamente una fase di riforme anche elettorali, Ø penso che la legislatura possa utilmente proseguire. (ISST re075)
If there is a majority at the Parliament able to seriously face a phase of reforms, also electoral, (I) think that the legislature could usefully continue.

It is possible that other elements or constructions manifest a change in scope, although further investigations are necessary to detect which ones and how to recognise them.
This is especially difficult because of their infrequency. The annotation could, however, allow other changes in scope affecting the content to be detected.

4.5 Summary

Apart from annotating the spans corresponding to the three components of the attribution relation, i.e. source, cue and content, the annotation schema should include attributes which carry relevant information affecting the relation itself or the interpretation of its content. In this chapter, these features have been presented and compared with the features included in the PDTB scheme, adopted as a model for the present one. One aspect to annotate is the type of the cue (4.1), expressing the kind of attitude the entity is holding towards the content: assertion, fact, belief or eventuality. This feature provides information partially affecting the factuality of the content and the values other features can assume, e.g. facts do not support any scopal change. The feature type, however, is often complex to determine, as this categorisation is partially ambiguous. Before applying it to the whole corpus, it should be tested for inter-annotator agreement and, in case of a poor score, refined by changing or reducing the values. Another useful feature to be marked is the source type (4.2). This allows a basic distinction among writer, other and arbitrary. The first (4.2.1) can be connected to information presented as the personal point of view of the writer. Other (4.2.2) stands for a specific source corresponding to a real entity. The last (4.2.3), arbitrary, should be used when referring to sources without a real or certain referent, thus labelling e.g. general knowledge, hearsay and rumours. Factuality (4.3) allows distinguishing between real attributions, corresponding to a real event or mental attitude in the world, and hypothetical or unreal attribution events. The annotation of this feature enables keeping these separate, without losing the information carried by non-factual attributions. Lastly, a change in the scope affecting the content is also annotated and labelled as scopal change.
This usually affects the polarity of the content, although on the surface it appears to involve the cue and to make the attribution non-factual. Determining when it is correct to identify a scopal change is a problematic issue. Although a scopal change cannot occur with cues of the type fact, this matter needs to be addressed in context, with particular attention to discerning between negations affecting the existence of the attitude the source holds and negations reversing this attitude instead.
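The features summarised in this section can be gathered into a single record. The following is a minimal sketch rather than a proposed storage format: the class and field names are hypothetical, the spans are kept as plain strings instead of corpus offsets, and the only constraint enforced is the one stated above, namely that cues of the type fact do not support a scopal change. The cue type assigned to the sample record is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CueType(Enum):
    ASSERTION = "assertion"
    FACT = "fact"
    BELIEF = "belief"
    EVENTUALITY = "eventuality"


class SourceType(Enum):
    WRITER = "writer"
    OTHER = "other"
    ARBITRARY = "arbitrary"


class Factuality(Enum):
    FACTUAL = "factual"
    NON_FACTUAL = "non-factual"


class ScopalChange(Enum):
    SCOPAL_CHANGE = "scopal change"
    NONE = "none"


@dataclass(frozen=True)
class AttributionRelation:
    """Hypothetical record: three spans plus the four features of chapter 4."""
    source: str
    cue: str
    content: str
    cue_type: CueType
    source_type: SourceType
    factuality: Factuality = Factuality.FACTUAL
    scopal_change: ScopalChange = ScopalChange.NONE

    def __post_init__(self):
        # Cues of the type fact do not support any scopal change.
        if (self.cue_type is CueType.FACT
                and self.scopal_change is ScopalChange.SCOPAL_CHANGE):
            raise ValueError("cues of the type fact do not support a scopal change")


# Example (129): a lexical negation scoping over the content.
# The assertion label is an illustrative assumption.
denial = AttributionRelation(
    source="il governo croato",
    cue="nega",
    content="che siano stati espulsi",
    cue_type=CueType.ASSERTION,
    source_type=SourceType.OTHER,
    factuality=Factuality.FACTUAL,
    scopal_change=ScopalChange.SCOPAL_CHANGE,
)
```

Rejecting the inconsistent combination when the record is created would let an annotation tool flag it at entry time rather than during later checks.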

5 Performing a Pilot Annotation

Developing an annotation schema goes hand in hand with testing it on the corpus that is going to be annotated. Applying the schema to the corpus allows intuitions and solutions to be assessed, so that choices are better informed and based on the data and not only on theoretical considerations. Real language examples, moreover, on the one hand reflect real language use, thus having few or even no occurrences of some possible but uncommon features, and on the other hand represent a repository of special cases which do not match descriptions of general occurrences and characteristics. Designing an annotation schema follows a path similar to that of any design process (Figure I): (1) a preliminary stage in which objectives and requirements are defined; (2) a phase in which the problem is analysed; (3) a planning phase in which possible solutions are presented; and (4) a subsequent testing phase with the development of a prototype. The latter leads to the identification of viable or unfeasible solutions and to the discovery of new issues. This in turn leads to a new planning phase, and the process is iterated until a satisfactory solution is reached.

[Figure I - Design Process: requirements (1), analysis (2), planning (3), evaluation (4), release (5)]

In order to perform an annotation, a suitable tool is required. The selection of the most appropriate one to employ for the pilot annotation is the result of a detailed analysis of several available tools. To be able to make such a decision, the characteristics they should possess so as to match the annotation schema requirements had to be identified, thus allowing the definition of desired tool specifications. These represent the basis for the development of appropriate software especially designed to perform the task of annotating attribution relations.

In this chapter, the Italian corpus to which a layer for attribution will be added is presented, and a subsection of it is sampled to be employed in the pilot annotation. Afterwards, several tools for performing annotation are compared in the light of the specific requirements of the current annotation schema. Finally, one of these tools is selected and set up before proceeding with the annotation of a sample of the corpus, thus leading to the identification of new issues and a partial redesign of the annotation scheme.

5.1 Corpus

The present study originates in the framework of a project aiming at the addition of a layer for discourse to the ISST corpus. It takes, however, a different perspective, leaving for later the analysis of discourse relations in general and concentrating instead on attribution, which is only partially a discourse phenomenon. The ISST corpus employed for this study is the Italian Syntactic-Semantic Treebank, developed between 1999 and 2001 in the frame of the SI-TAL project, a collaboration of several Italian research and university institutions with the purpose of developing a suite of resources and tools for Natural Language Processing applications. For the pilot annotation a subcorpus of the ISST has been selected as described in the relevant section (5.1.2).

ISST Architecture

The ISST corpus (Montemagni et al., 2003) consists of word tokens and was built to reflect contemporary language use.
It is formed by a collection of 484 newspaper and periodical articles published between 1985 and [...]. One section of the corpus, about two thirds, represents general language use and contains articles about different subjects from Repubblica, identified in the examples as re, Corriere della Sera (cs), and other newspapers (els) and periodicals (period). The other section of the corpus, about [...] tokens, is instead specialised as it deals with the financial domain. Articles in this section are taken from a single financial newspaper, Il Sole 24 Ore (sole), and were all published in [...].

The ISST has a five-level structure encoding orthographic, morpho-syntactic, syntactic and semantic information. Only the financial section of the corpus has been fully annotated with all five levels. The syntactic level is split into two separate levels so as to account separately for the constituent and dependency structures, thus providing independent views of the same surface syntax, as one level does not presuppose the other.

The orthographic level (Figure J) contains the word tokens and information about lower-case or capital letters and punctuation. A unique ID number is assigned to each token.

<w id="w_001" case="cap"> Bruxelles </w>
<w id="w_002" case="low"> all' </w>
<w id="w_003" case="cap"> Italia </w>
<w id="w_004"> : </w>
<w id="w_005" case="low"> urgente </w>
<w id="w_006" case="low"> ridurre </w>
<w id="w_007" case="low"> il </w>
<w id="w_008" case="low"> deficit </w>
<w id="w_009">. </w>

Figure J - ISST orthographic level (sole002)

The morpho-syntactic annotation (Figure K) includes the mark-up of POS, lemma, number, person, gender, etc. Multi-word expressions are analysed as a whole, e.g. in_mezzo_a (between/among), while morphologically complex words, such as cliticised verbs, are instead treated so as to account for their constitutive parts, e.g. impedendoci > impedire + ci (prevent us).

<mw id="mw_001" pos="sp" mfeats="nn" lemma="bruxelles" sfeats="np" href="sole.orth002#id(w_001)"> Bruxelles </mw>
<mw id="mw_002" pos="e" mfeats="fs" lemma="a" sfeats="part" href="sole.orth002#id(w_002)"> all' </mw>
<mw id="mw_003" pos="sp" mfeats="nn" lemma="italia" sfeats="np" href="sole.orth002#id(w_003)"> Italia </mw>
<mw id="mw_004" pos="pu" lemma=":" sfeats="dirs" href="sole.orth002#id(w_004)"> : </mw>
<mw id="mw_005" pos="a" mfeats="ns" lemma="urgente" sfeats="ag" href="sole.orth002#id(w_005)"> urgente </mw>
<mw id="mw_006" pos="v" mfeats="f" lemma="ridurre" sfeats="vit" href="sole.orth002#id(w_006)"> ridurre </mw>
<mw id="mw_007" pos="rd" mfeats="ms" lemma="il" sfeats="art" href="sole.orth002#id(w_007)"> il </mw>
<mw id="mw_008" pos="s" mfeats="ms" lemma="deficit" sfeats="n" href="sole.orth002#id(w_008)"> deficit </mw>
<mw id="mw_009" pos="pu" lemma="." sfeats="tit" href="sole.orth002#id(w_009)">. </mw>

Figure K - ISST morpho-syntactic level (sole002)

The ISST takes a distributed approach to syntax, keeping functional annotation and constituent structure on two separate levels, which can however be combined if required. According to Montemagni et al. (2003), this strategy represents a more suitable way of describing languages like Italian, whose syntactically free constituent order and pro-drop property would otherwise require the insertion of a number of empty elements, with a consequent loss of annotation transparency. The annotation of constituency (Figure L) produces shallow tree structures. It was performed with a Shallow Parser and then manually revised. The functional annotation is word-based and includes relations such as dependency, coordination and intra-sentential co-reference.

[F3 [SN Bruxelles [SP a [SN Italia SN] SP] SN] F3]
[CP [SA urgente SA] [F [SV2 ridurre SV2] [COMPT [SN il deficit SN] COMPT] F] CP]

Figure L - ISST syntactic constituent level (sole002)

Lastly, the ISST presents a lexico-semantic level of annotation, assigning semantic tags. These convey: the sense of each word, based on the ItalWordNet (IWT) lexical resource; special uses, e.g. idiomatic expressions, proper nouns, neologisms, etc.; and additional comments of the annotators.

A tool has been especially developed for the task of annotating and combining the five levels of annotation of the ISST: GesTALt. The tool also provides a visual representation of the annotation, e.g. functional annotation makes use of graphs, while constituent structure is visualised as a strip tree. This tool is unfortunately not open-source and could not be tested or employed for the pilot annotation in the present study. The ISST corpus is available in a number of formats, i.e. text, XML and CoNLL.

5.1.2 Subcorpus Selection

As attribution is a very pervasive relation in journalistic language, since newspaper articles commonly report opinions, statements and information expressed by other people, only a part of the corpus could be annotated for the present study. Extending the annotation to the whole ISST represents a subsequent stage which would require employing annotators and possibly the development of a specific tool. In order to test the feasibility and effectiveness of the annotation schema that is the object of the present study, a pilot annotation was performed on a sample of the ISST corpus.

Since the financial section is the only one already annotated at all five levels, the addition of a sixth level for discourse and attribution would be better performed on this part of the corpus so as to have a complete resource. However, in order to avoid interference deriving from the specificity of the financial domain, the articles for the pilot annotation have not been drawn only from this part, which corresponds to the articles from Il Sole 24 Ore. The subcorpus has been designed to be balanced with respect to the language contained in the ISST corpus, as articles from every section are represented.
Table 2 reports the total number of articles in each section (first row) and the number of articles from that section included in the sample (second row). A total of 50 articles out of the 484 constituting the corpus have been annotated, representing approximately a tenth of the ISST (roughly [...] tokens). The phenomenon of attribution appeared to be well represented in this subsection, which contains a wide range of occurrences of attribution relations.

Cs    Els    Period    Re    Sole

Table 2 - N. of articles selected per section

The subcorpus was obtained from a single file (Figure M) containing the whole corpus in table format, with each line corresponding to a new token and each tab-separated column to a different annotation feature. The first column refers to the article ID, the second to the sentence number and the third to the word counter in the relevant article. The following columns add information about constituency, POS and lemma, and the seventh contains the tokens.

[Figure M - ISST table format]

In order to reconstruct the articles, so as to have them available in the text format the tool required, the table file was split into one file per article containing the word tokens only, separated by a single space. This was achieved by writing a few lines of code in the scripting language Python. Subsequently, it was necessary to correct some errors detected in the original file that led to an incorrect word order. Moreover, some characters, such as hyphens and angle brackets, were identified as responsible for a crash at the launch of the tool software. In this case it was necessary to substitute the relative ASCII character codes for the problematic characters.

5.2 Tool Selection

A myriad of tools have been developed for annotating natural language, though the search for an existing tool perfectly matching the requirements of a specific annotation project is in most cases doomed to fail. One obstacle is the availability of the tool since, due to the high costs involved in producing software, some tools are sold commercially. Among the many open-source tools, developed mainly by research and university institutes and made available for academic purposes in order to promote their use and share resources, the great majority were developed in the frame of a specific project. These tools do not support all the annotation requirements of another project and their code is often difficult or impossible to change in order to adapt it to the new task.

A last group of open-source tools supports a wider range of annotation projects and a high level of customizability. These annotation tools were designed not just for a specific project, but to support the annotation of one or a group of phenomena, e.g. anaphora relations, speech interactions, temporal references, etc. However, it is unlikely that a tool developed for a specific phenomenon succeeds in capturing all its possible aspects, as it might take an approach grounded in a specific theory or miss aspects which another project wishes to consider and include in the annotation.
In the case of attribution relations, a further and more relevant issue must be added to those mentioned above, which already make the identification of a suitable tool challenging: there is no tool specifically designed to support the annotation of attribution.

5.2.1 Requirements

In order to find the best matching available tool, it is first necessary to define what it should match in order to support the annotation scheme, i.e. the annotation requirements.

First of all, the tool should be able to take advantage of the other layers of annotation already available for the ISST corpus, especially to facilitate the annotators' task of retrieving possible attribution relations through the corpus. For this reason, the tool should be able to read in a file like the table format (Figure M) containing information from other layers of annotation. Only the bare text should be displayed in order to avoid confusion; however, the tool should possess a search function capable of retrieving e.g. the lemma of a given verb that could be associated with attribution, such as say, think or order, or the POS of a token, in order to disambiguate between e.g. a verb and an adjective with words like ordinato (ordered/tidy). This would support proceeding cue by cue to annotate attribution, a strategy also adopted by the PDTB (Prasad, Miltsakaki et al., 2008) for the annotation of discourse connectives.

Once a cue is identified, it should be possible to select it and mark the existence of an attribution relation at that point of the text. This should be done on the cue as it represents the only constituent of attribution which is always expressed and singularly considered (in case of multiple cues, separate relations are annotated). The relation should require the selection of one or multiple text spans for the content and the optional selection, as the source might be left implicit, of one or multiple spans corresponding to the source. Each element constituting a single source, cue or content will from now on be called a markable (Mueller and Strube, 2001:48).
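A search of the kind required here can be pictured directly on the table format. The following Python sketch is illustrative only: the column positions for POS and lemma (fifth and sixth) are an assumption, since the description of the table format only fixes the first three columns and the seventh; the function names and the sample rows are hypothetical and not taken from the ISST.

```python
import csv
import io
from collections import defaultdict

# Assumed 0-based column positions in the tab-separated table format.
ARTICLE, SENTENCE, WORD, POS, LEMMA, TOKEN = 0, 1, 2, 4, 5, 6


def read_rows(handle):
    """Yield one list of column values per token line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > TOKEN:  # skip empty or malformed lines
            yield row


def find_candidate_cues(rows, lemmas, pos_prefix="v"):
    """Return (article, sentence, word, token) for tokens whose lemma is in
    `lemmas` and whose POS tag starts with `pos_prefix`."""
    hits = []
    for row in rows:
        if row[LEMMA] in lemmas and row[POS].lower().startswith(pos_prefix):
            hits.append((row[ARTICLE], row[SENTENCE], row[WORD], row[TOKEN]))
    return hits


def article_texts(rows):
    """Rebuild each article as its tokens joined by single spaces."""
    tokens = defaultdict(list)
    for row in rows:
        tokens[row[ARTICLE]].append(row[TOKEN])
    return {article: " ".join(words) for article, words in tokens.items()}


# Hypothetical sample in the assumed layout.
SAMPLE = (
    "re040\t1\t1\tSN\ts\tsindaco\tsindaco\n"
    "re040\t1\t2\tSV\tv\tdichiarare\tdichiara\n"
    "re040\t1\t3\tSA\ta\tordinato\tordinato\n"
)

rows = list(read_rows(io.StringIO(SAMPLE)))
print(find_candidate_cues(rows, {"dichiarare", "ordinare"}))
print(article_texts(rows))
```

In this sample only the verb form is returned: the adjective reading of ordinato carries a different lemma and POS, which is the disambiguation the requirement asks for. The same rows also yield the bare text, which is all the annotator would need to see.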
As source, cue and content might be fragmented and separated by intervening material (134), it should also be possible to select discontinuous text spans as a single markable.

(134) <<La responsabilità è politica - aveva aggiunto il Procuratore capo - ed è il potere politico che deve far funzionare i servizi>>. (ISST els046)
<<The responsibility is political - had added the Chief Prosecutor - and it is the political power that has to make services work>>

Moreover, overlapping text spans should also be selectable, as attribution relations are often nested into each other (see 3.2.1). It should also be possible to associate with each selected markable the features it possesses through the selection of predefined values, thus speeding up the annotation process and avoiding the spelling errors annotators could make when typing these values manually. Finally, in case this is not done automatically when adding an attribution relation, the tool should support linking two or more markables to establish relations in both directions.

Lastly, concerning the output of the tool, the annotation should be saved as stand-off in a separate file for each article, identified by the same index as the files containing the other levels of annotation for the same article (i.e. cs.morph001, cs.orth001, etc., cs.attr001). In-line annotation, which consists of adding XML tags to the original text as in example (135), cannot represent overlapping markables because of XML syntax, and is therefore not suitable for describing attribution relations. The annotation should preferably refer to the word index (136), thus establishing a unique pointer to each token in the corpus, corresponding to each line in the table format (Figure M), and not to the byte, as e.g. white spaces and multi-words could cause a mismatch between the bytes in the original files and those the tool refers to. Although possible, transforming the byte reference into the word index reference can lead to additional errors and should be avoided.

(135) <content> In città non abbiamo uno scippo </content>, <cue>ha dichiarato</cue> <source>il sindaco</source>. (ISST re040)
In town we do not have a single bag-snatching, declared the mayor.

(136) <markable id="1" span="token_001 token_008" role="content">
<markable id="1" span="token_010 token_011" role="cue">
<markable id="1" span="token_012 token_013" role="source">
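The difference between the two kinds of reference can be made concrete with a small Python sketch. It assumes that the plain-text view of an article is its tokens joined by single spaces; the function names are hypothetical, the tokenisation of the sentence is invented for the example, and token indices are 1-based simply to mirror example (136). The sketch maps a character-offset span onto token indices and rejects spans that do not fall on token boundaries, which is the kind of mismatch such a conversion is exposed to.

```python
def token_offsets(tokens):
    """Return (start, end) character offsets of each token in the text
    obtained by joining the tokens with single spaces (end is exclusive)."""
    offsets, position = [], 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # one separating space
    return offsets


def offsets_to_token_span(tokens, start, end):
    """Convert a character span into 1-based (first, last) token indices.

    Raises ValueError when the span does not start and end exactly on
    token boundaries."""
    first = last = None
    for index, (tok_start, tok_end) in enumerate(token_offsets(tokens), 1):
        if tok_start == start:
            first = index
        if tok_end == end:
            last = index
    if first is None or last is None or first > last:
        raise ValueError(f"span {start}-{end} is not aligned with tokens")
    return first, last


tokens = "In città non abbiamo uno scippo , ha dichiarato il sindaco .".split()
text = " ".join(tokens)

cue_start = text.index("ha dichiarato")
cue_end = cue_start + len("ha dichiarato")
print(offsets_to_token_span(tokens, cue_start, cue_end))
```

With this tokenisation the cue ha dichiarato maps to tokens 8 and 9. The same text is one unit longer when measured in UTF-8 bytes than in characters, because of the accented letter in città, which is one way byte references and token references can drift apart.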

5.2.2 Comparison of Available Tools

In order to select the most appropriate software for performing the pilot annotation of attribution relations, features of different available tools have been compared in the light of the requirements listed above (5.2.1). Only open-source tools have been taken into consideration. The analysis that follows is not intended to provide a full account of every tool described but just to highlight positive and negative aspects with respect to the present annotation project. While the most promising tools have been tested by setting up a sample annotation schema and annotating a single file, tools which appeared to be incompatible with the most important requirements were soon dismissed and not further investigated, together with those tools potentially meeting these requirements but in practice requiring complex modifications to their code.

A selection of possibly suitable tools has been drawn from surveys available on the internet, such as David Lee's Corpus-based Linguistics LINKS, and by considering the tools adopted by similar annotation projects. Since there is no tool specifically developed for the annotation of attribution, general annotation tools or tools for the annotation of anaphora or discourse, phenomena relatively similar to or overlapping with attribution and therefore also likely to require a similar description, have been considered. A brief analysis of the main tools taken into account is reported below.

GATE

GATE (Cunningham et al., 2002), General Architecture for Text Engineering, is a very complete, freely downloadable architecture allowing the development of language processing software. The tool supports a variety of formats, such as XML, RTF, HTML and plain text, although only the last was easily accepted and used for the sample annotation. A set of NLP resources is provided with the tool, including a POS tagger, a semantic tagger and a coreferencer. Setting up an annotation schema was a relatively easy task which could be performed in a few minutes.

[Figure N - GATE annotation environment]

The tool supports nested annotation, as the same portion of text can be selected several times; however, the annotation of discontinuous spans is not possible, as non-adjacent spans cannot be included in the same markable. It also does not seem possible to establish relations between markables. Moreover, the annotation itself is quite problematic, as the selection of the text spans, their deletion or modification, and the addition of features are not intuitive. GATE, which was used for example for the annotation of the MPQA Opinion Corpus (Wiebe et al., 2005), also includes a query tool. The annotation is stored in XML format with reference to the byte, as in the example below (Figure O).

<?xml version='1.0' encoding='windows-1252'?>
<GateDocument>
  <GateDocumentFeatures>
    <Feature>
      <Name classname="java.lang.string">gate.sourceurl</Name>
      <Value classname="java.lang.string">file:/c:/documents%20and%20settings/prova1.txt</Value>
    </Feature>
    <Feature>
      <Name classname="java.lang.string">mimetype</Name>
      <Value classname="java.lang.string">text/plain</Value>
    </Feature>
    <Feature>
      <Name classname="java.lang.string">docnewlinetype</Name>
      <Value classname="java.lang.string">crlf</Value>
    </Feature>
  </GateDocumentFeatures>
  <AnnotationSet>
    <Annotation Id="8" Type="Source" StartNode="2736" EndNode="2746">
      <Feature>
        <Name classname="java.lang.string">type</Name>
        <Value classname="java.lang.string">arbitrary</Value>
      </Feature>
    </Annotation>
    <Annotation Id="9" Type="cue" StartNode="2747" EndNode="2771">
      <Feature>
        <Name classname="java.lang.string">factuality</Name>
        <Value classname="java.lang.string">non-factual</Value>
      </Feature>
      <Feature>
        <Name classname="java.lang.string">scopal change</Name>
        <Value classname="java.lang.string">none</Value>
      </Feature>
      <Feature>
        <Name classname="java.lang.string">type</Name>
        <Value classname="java.lang.string">fact</Value>
      </Feature>
    </Annotation>
    <Annotation Id="10" Type="content" StartNode="2772" EndNode="2864">
    </Annotation>
  </AnnotationSet>
  <AnnotationSet Name="Original markups">
    <Annotation Id="0" Type="paragraph" StartNode="0" EndNode="2887">
    </Annotation>
  </AnnotationSet>
</GateDocument>

Figure O - GATE annotation exported in XML

Knowtator

Meant to serve a wide range of annotation purposes, Knowtator (Ogren, 2006) is a plug-in of the knowledge representation system Protégé (both freely downloadable) which allows the definition of annotation schemas. Setting up an annotation schema is not particularly complicated; however, for the attributes it is not possible to set pre-defined values to choose from, but only one default element. This means that values have to be typed in manually by the annotators, which represents an additional difficulty and a consequent chance for errors. The tool, however, supports establishing a relation between markables as well as multiple selections. A multiple slot for instances of source or cue could for example be inserted in the cue class, as in Figure P (right-hand side, in the middle). This shows a sample annotation project consisting of a single file and a single attribution relation. The file containing the annotation is presented in Figure Q. Nested and discontinuous selections are also supported. A search function is instead not available. Another drawback is that, although relatively easy to set up, the tool is quite complicated to use and requires some training, as markable selection and the addition of features rely on icon buttons in a not very user-friendly manner.

[Figure P - Knowtator annotation environment]

A collection of texts can be defined for a project. These should be plain text; however, XML and database table formats should also be supported. The tool provides stand-off annotation with reference to the byte (Figure Q). The output is relatively redundant, as every annotation and feature is saved as a separate annotation instance with explicit mention of the annotator and creation date.

<?xml version="1.0" encoding="utf-8"?>
<annotations textsource="01.txt">
  <annotation>
    <mention id="attributionprova_instance_20000" />
    <annotator id="attributionprova_instance_6">Pareti, Edinburgh University</annotator>
    <span start="439" end="494" />
    <spannedText>il presidente della Banca Centrale, Jean-Claude Trichet</spannedText>
    <creationDate>sun Aug 16 18:24:30 CEST 2009</creationDate>
  </annotation>
  <annotation>
    <mention id="attributionprova_instance_20003" />
    <annotator id="attributionprova_instance_6">Pareti, Edinburgh University</annotator>
    <span start="496" end="506" />
    <spannedText>ha parlato</spannedText>
    <creationDate>sun Aug 16 18:24:49 CEST 2009</creationDate>
  </annotation>
  <annotation>
    <mention id="attributionprova_instance_20007" />
    <annotator id="attributionprova_instance_6">Pareti, Edinburgh University</annotator>
    <span start="511" end="544" />
    <spannedText>grave rallentamento dell'economia</spannedText>
    <creationDate>sun Aug 16 18:25:24 CEST 2009</creationDate>
  </annotation>
  <classmention id="attributionprova_instance_20007">
    <mentionclass id="content">content</mentionclass>
  </classmention>
  <classmention id="attributionprova_instance_20000">
    <mentionclass id="source">source</mentionclass>
  </classmention>
  <classmention id="attributionprova_instance_20003">
    <mentionclass id="cue">cue</mentionclass>
    <hasslotmention id="attributionprova_instance_20009" />
    <hasslotmention id="attributionprova_instance_20010" />
  </classmention>
  <stringslotmention id="attributionprova_instance_20009">
    <mentionslot id="type" />
    <stringslotmentionvalue value="assertion" />
  </stringslotmention>
  <complexslotmention id="attributionprova_instance_20010">
    <mentionslot id="attribution_source" />
    <complexslotmentionvalue value="attributionprova_instance_20000" />
    <complexslotmentionvalue value="attributionprova_instance_20007" />
  </complexslotmention>
</annotations>

Figure Q - Knowtator annotation exported in XML

Callisto

The annotation tool Callisto (open-source) was adopted for a part of the annotation of temporal relations in the frame of developing the ITB, Italian TimeBank (Caselli et al., 2008), on a portion of the ISST corpus. The tool has a very neat and basic interface which nonetheless allows user preferences to be set. Overlapping text spans can be selected, as well as single characters, by changing the annotation from word to character swiping. Some annotation schemas, e.g. POS or coreference, are already available; to create a new task, it is necessary to define a DTD. However, this possibility does not seem to work easily and therefore the tool was not set up for the annotation of attribution on a sample article. This was in any case not necessary, since the tool does not meet some important requirements: it does not seem possible to select discontinuous spans as a single markable or to establish relations between markables. Callisto annotation is saved as stand-off with reference to the byte; however, conversion into word index reference is supported.

MMAX2

Written in Java, MMAX2 (Mueller and Strube, 2006) is a general purpose tool (available as open source) with a special focus on the annotation of anaphoric/coreferential expressions, word sense disambiguation and POS tagging. Starting a project requires some time, as the annotation schema has to be specified externally prior to launching the program. Nonetheless, MMAX2 is a very flexible instrument that allows personalising the display of the annotation tool using XSL style sheets. The tool requires text input files; however, XML support is under development and should be available shortly. The tool can be set up so as to guide the annotation by presenting default and pre-defined values to choose from for the attributes. MMAX2 allows the selection of overlapping and discontinuous text spans as well as the possibility of linking markables together using relations.
The stand-off annotation provided by the tool points to the word index and not to the byte, unlike most other tools. Every markable level is saved in a different XML file in which each markable is associated with an ID, the pointer to the text span and any other feature or relation associated with it. The result is a very compact and easy-to-read annotation. The tool was employed, among others, for the annotation of anaphora and deixis in the VENEX corpus (Poesio et al., 2009).
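How a pointer to the word index can cover discontinuous spans may be sketched as follows in Python. The sketch assumes a span notation in which a range is written as first..last and the fragments of a discontinuous markable are separated by commas, e.g. word_1..word_3,word_7; this notation, the helper names and the word numbering of the sample sentence are assumptions made for the example, not a specification of the MMAX2 file format.

```python
import re

WORD_ID = re.compile(r"^word_(\d+)$")


def parse_word_id(word_id):
    """Return the integer index in an identifier such as 'word_12'."""
    match = WORD_ID.match(word_id.strip())
    if match is None:
        raise ValueError(f"not a word identifier: {word_id!r}")
    return int(match.group(1))


def expand_span(span):
    """Expand a span string into the sorted list of word indices it covers.

    Assumed notation: 'word_1..word_3' is a range, and commas separate the
    fragments of a discontinuous markable."""
    indices = set()
    for fragment in span.split(","):
        first, _, last = fragment.partition("..")
        start = parse_word_id(first)
        end = parse_word_id(last) if last else start
        if end < start:
            raise ValueError(f"reversed range: {fragment!r}")
        indices.update(range(start, end + 1))
    return sorted(indices)


def markable_text(span, words):
    """Join the words a span points to; `words` maps index -> token."""
    return " ".join(words[i] for i in expand_span(span))


sentence = "La responsabilità è politica aveva aggiunto il Procuratore capo"
words = {i: w for i, w in enumerate(sentence.split(), 1)}

print(markable_text("word_1..word_4", words))
print(markable_text("word_7..word_9", words))
print(expand_span("word_1..word_2,word_5"))
```

Because every markable resolves to a set of word indices, overlapping or nested markables are simply sets that share indices, which an in-line XML encoding cannot express.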

Annotator

Annotator is the tool specifically developed for the annotation of discourse connectives and their arguments in the PDTB (Prasad, Miltsakaki et al., 2008). The tool supports the annotation of attribution on raw text files according to the schema adopted by the PDTB (see 2.4.3). The interface is very user-friendly and guides the annotation by listing the possible values from which to select and by employing constraints. Unfortunately, however, the tool could not be adapted to the present annotation schema, as Annotator does not support the setting of different markables or features. The tool could in principle be adapted by changing the source code; however, this was not available, and the change would in any case be a time-consuming task similar to writing completely new annotation software. Annotator was not designed to account for attribution relations not occurring in correspondence with the discourse connective structure. In addition, nested attributions are not contemplated and it is not possible to specify the role of each of the three elements constituting an attribution (i.e. source, cue, content) and to establish relations between them. The tool produces stand-off annotation with reference to the byte. Even though the tool represents a good example of how an annotation tool for attribution could be designed and implemented, since it was not possible to adapt Annotator to the annotation schema developed in this study, it could not be considered a possible candidate for the pilot annotation.

Other tools

Among other tools, NITE and EXMARaLDA were also briefly taken into consideration. The NITE XML Toolkit (open-source) is a very powerful instrument aimed at software developers which allows building specialised annotation schemas and interfaces for a wide range of purposes. It is especially intended to support multimedia language data and it has been employed in a number of meeting and dialogue corpora.
NITE, however, is quite complex to set up and a sample annotation project could not be developed to test it. EXMARaLDA (Schmidt, 2001), Extensible Mark-up Language for Discourse Annotation, is a system of Java-based tools with XML data formats especially designed for the annotation and assisted transcription of spoken language. The tool is not suitable for the annotation of attribution relations, as it is not meant to relate markables and it does not support the definition of an annotation schema as would be required for attribution.

5.2.3 Selection and Tool Specifics

Concerning the requirements specified in 5.2.1, priority was given first to those features enabling the selection of the text spans involved in attribution, i.e. discontinuous and nested markables, together with the possibility of establishing relations among cue, source and content markables, including the possibility of having more than one source and/or content for each relation. Subsequently, the tool's customizability and user-friendliness were also considered, with particular attention to the possibility of setting guided choices for the markable features. Part of this second group of requirements was also the tool's annotation format, ideally neat and compact stand-off XML annotation with reference to the word index. Other aspects, such as the possibility of querying the corpus with reference to other levels of annotation in order to retrieve possible cues, or the support of input data in a format other than text, were temporarily left aside.

Of the tools considered above (5.2.2), only two, Knowtator and MMAX2, appeared to meet the first set of requirements and were therefore compared more closely to check other relevant characteristics.

Supported features            Knowtator           MMAX2
Discontinuous text selection  Yes                 Yes
Nested selection              Yes                 Yes
Relations                     Yes                 Yes
Multiple sources/contents     Yes                 Yes
Pre-defined values selection  No (one default)    Yes (menus)
Display customizability       Yes (partial)       Yes (complete)
Ease of setting a scheme      Simple (internal)   Medium (external)
Ease of annotation            Medium              Simple
XML stand-off output          Yes                 Yes
Reference to word index       No (byte)           Yes

Table 3 - Knowtator/MMAX2 feature comparison

Knowtator and MMAX2 differ in some aspects concerning the second group of requirements. These are listed in the lower half of Table 3. Knowtator is certainly easier to set up, as the annotation schema and customization can be defined internally through the interface. MMAX2 instead requires the modification of external files (XML and XSL) both for setting the annotation schema and for customizing the interface and display of the annotation. On the other hand, MMAX2 can be more personalised and the annotation scheme better specified so as to have pre-defined values to select from, thus facilitating the annotation by reducing the annotators' cognitive load. The annotation itself, i.e. selection, deletion and extension of a text span, is also easier. Lastly (last row in Table 3), MMAX2 saves the annotation as stand-off with reference to the word index, whereas Knowtator refers to the byte.

Considering all their characteristics, the higher set-up costs of MMAX2 seem to be well compensated by a subsequently more structured annotation and a more flexible interface. This, together with the possibility of anchoring the markable spans to the original text through references to the word indexes, made this tool prevail as the most suitable for the present purpose of annotating attribution relations according to the proposed schema.

5.3 Setting MMAX2

Installing MMAX2 is easy, though it requires a current Java version installed on the machine to run. Once the program is launched, it is possible to start a project using the Project Wizard shown in Figure R. In this window the raw text input file that will be used for the annotation has to be selected. This file is then analysed by the program and tokenised. The articles from the ISST had previously been tokenised and corrected; it was therefore not necessary to do so again, as in this case it is possible to tick the "Input file is one token per line" box.
Afterwards it is required to specify at least one markable level for the annotation, and eventually some display preferences related to it. The last section in the window (Figure R) contains the paths to where the different project components are stored and allows selecting a name for the project and the stored input file
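The relation between a one-token-per-line input file and the word-indexed Base Data (Figure S) can be illustrated with a minimal Python sketch. This is not MMAX2's own import code: the function name tokens_to_base_data is hypothetical, and the UTF-8 encoding declaration is an assumption chosen so that accented Italian characters survive; only the element layout follows Figure S.

```python
from xml.sax.saxutils import escape


def tokens_to_base_data(lines):
    """Build word-indexed XML in the layout shown in Figure S.

    `lines` is an iterable of strings, one token per line; empty lines
    are skipped. The function name and the UTF-8 declaration are
    illustrative assumptions, not part of MMAX2.
    """
    tokens = [line.strip() for line in lines if line.strip()]
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE words SYSTEM "words.dtd">',
        "<words>",
    ]
    # Word indexes are progressive and start at 1, as in Figure S.
    for index, token in enumerate(tokens, start=1):
        out.append(f'<word id="word_{index}">{escape(token)}</word>')
    out.append("</words>")
    return "\n".join(out)


if __name__ == "__main__":
    print(tokens_to_base_data(["londra", ".", "gas", "dalla", "statua"]))
```

Escaping each token keeps characters such as & or < from breaking the XML, which matters for newspaper text containing symbols or quotation marks like << >>.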

Figure R - MMAX2 Project Wizard

Each MMAX2 annotation project has five different components (and a common_paths file specifying where these are stored):

- the Base Data, that is the data on which the annotation is performed. For the present project this consists of one XML file per article, derived from the raw text files provided to the tool as input. The file has one token per line, to which a progressive word index is assigned (Figure S).

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE words SYSTEM "words.dtd">
<words>
<word id="word_1">londra</word>
<word id="word_2">.</word>
<word id="word_3">gas</word>
<word id="word_4">dalla</word>
<word id="word_5">statua</word>
<word id="word_6">evacuata</word>
<word id="word_7">la</word>
<word id="word_8">tate</word>
<word id="word_9">gallery</word>
<word id="word_10">.</word>
<word id="word_n"> </word>
</words>

Figure S - MMAX2 Base Data (ISST cs001)

- the Scheme, an XML file for each markable level containing the annotation schema. This file specifies markable attributes and relations. Attributes represent descriptive information, while relations account for structural or associative information. Relations in MMAX2 can be of two kinds: markable-set, an undirected relation between two or more markables, and markable-pointer, a directed relation from one markable to one or more target markables. Attributes can be simple FREETEXT, thus accepting any string as their value, NOMINAL_LIST, a pre-defined closed set of possible values presented as a drop-down menu, or NOMINAL_BUTTON, similar to the former except that the values are presented as a sequence of radio buttons. In the Scheme file it is possible not only to set the type of attributes, with their pre-defined values, and relations, but also to determine a hierarchy of attributes. Dependencies can be expressed by adding a next value to an attribute value, specifying which other attribute or set of attributes to enable when that value is selected.

- the Style, an XSL file which defines the display. Here the way the text and the annotation are presented can be modified, for example by adding handles to the markables, inserting empty lines or structuring dialogue turns.

- the Customization file (XML), containing a description of how each markable should be visualised, i.e. foreground and background colour, size

and font aspect, according to its attributes and relations. A markable that has not yet been assigned attribute values could be associated, e.g., with a different background colour so as to be easily spotted as requiring the completion of the annotation.

- the Markable directory, containing the annotation in XML format. The annotation of each article is stored in a separate file, as for the Base Data, while Scheme, Style and Customization are common to the entire project. This file represents the stand-off annotation and lists all the markables for the specific level of annotation. Markables are assigned a unique ID and a reference span which points to the original text, stored in the Base Data, by means of the word index (e.g. span="word_62..word_78"). For each markable, attribute values and relations are specified.

In a preliminary stage, cue, content and source were defined as three separate markable levels. This allows keeping the three components completely distinct during the annotation process and makes it possible to select the role of each markable immediately when the text span is selected, as shown in Figure T on the right-hand side.

Figure T - The annotation of cue, content and source as separate levels

On the other hand, however, this results in the annotation of each article being stored in three separate files, one per markable level. As having the annotation in one single file guarantees better access to it and requires less storage space, the Scheme was changed and the components of attribution were subsequently annotated on the same markable level as different attributes. The pilot annotation project consists of an input directory containing the 50 articles, tokenised and in XML format, i.e. the Base Data, and the Markable directory containing 50 corresponding files where the annotation output is stored. Scheme, Customization and Style are instead common and had to be written only once. For each article a .mmax file is also produced; it contains the reference to the corresponding Base Data file, and it is this file that needs to be loaded when opening the relevant annotation project with the tool.

Scheme

The Scheme, included in Appendix 1, is the most interesting component of MMAX2, as it describes the annotation schema and the way it is presented during the annotation. After selecting a markable in the text, it is possible to assign attributes to it through the annotation window. This is initially displayed as in Figure U, where just the role of the markable in the attribution relation can be selected, none being the default value and all the possible values being displayed as radio buttons. Only when the relevant role has been selected are other features activated. The type feature is definitely related to the cue, together with the factuality. The source type is an attribute of the source; however, implicit sources occur frequently, in which case no source markable is available. In order not to lose the information about the source type, this attribute is made available when selecting the cue. As the cue is the textual anchor of the attribution relation, it is never missing and can therefore carry information about weaker elements.
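The way a stand-off span such as span="word_62..word_78" points back into the Base Data can be sketched in Python as follows. The helper names (load_words, expand_span, span_text) are hypothetical, and the handling of discontinuous markables as comma-separated fragments is an assumption about the span notation, not something stated above.

```python
import xml.etree.ElementTree as ET


def load_words(base_data_xml):
    """Map each word id (e.g. 'word_3') to its token text."""
    root = ET.fromstring(base_data_xml)
    return {w.get("id"): (w.text or "") for w in root.iter("word")}


def expand_span(span):
    """Expand 'word_2..word_4' into ['word_2', 'word_3', 'word_4'].

    Comma-separated fragments (assumed here for discontinuous
    markables) are expanded one by one and concatenated in order.
    """
    ids = []
    for fragment in span.split(","):
        if ".." in fragment:
            start, end = fragment.split("..")
            first = int(start.rsplit("_", 1)[1])
            last = int(end.rsplit("_", 1)[1])
            ids.extend(f"word_{i}" for i in range(first, last + 1))
        else:
            ids.append(fragment)
    return ids


def span_text(span, words):
    """Return the tokens covered by a span, joined by single spaces."""
    return " ".join(words[word_id] for word_id in expand_span(span))


if __name__ == "__main__":
    base = (
        "<words>"
        '<word id="word_1">evacuata</word>'
        '<word id="word_2">la</word>'
        '<word id="word_3">tate</word>'
        '<word id="word_4">gallery</word>'
        "</words>"
    )
    print(span_text("word_2..word_4", load_words(base)))
```

Because the markable file stores only word indexes, the text of a markable is always recovered from the Base Data, which is what keeps the annotation stand-off.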
As far as the scopal change is concerned, since the change in scope usually involves reversing the polarity or factuality of the content, this feature could have been associated with the content. Considering, however, that the element changing the scope is usually included in the cue span, e.g. the negation of the attribution verb, and that it would be easy to forget this relatively infrequent attribute if it were separate from the other ones, the scopal change has also been made available for selection through the cue.

Figure U - MMAX2 Annotation window

When a markable is defined as the cue, the annotation panel also shows all the features connected to it (Figure V). The type attribute has by default the value none. When assertion, belief or eventuality is selected, the scopal_change feature is also made available. As a change in scope cannot occur with facts, this feature remains disabled for the fact type in order to facilitate the annotation.

Figure V - MMAX2 Annotation window (attributes)
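The attribute dependencies just described can be summarised in a short Python sketch. It only models which features the annotation window would enable; it is not the Scheme file itself (reported in Appendix 1). The function name and the attribute spelling source_type are assumptions, while scopal_change and the type values are taken from the description above.

```python
# Cue types for which a scopal change can occur (facts are excluded).
SCOPAL_TYPES = {"assertion", "belief", "eventuality"}


def enabled_attributes(role="none", cue_type="none"):
    """Return the attributes selectable for a markable.

    Only the role is available at first; the remaining features are
    attached to the cue and become active once that role is chosen.
    """
    enabled = ["role"]
    if role == "cue":
        enabled += ["type", "factuality", "source_type"]
        # scopal_change is offered only for types that admit it.
        if cue_type in SCOPAL_TYPES:
            enabled.append("scopal_change")
    return enabled


if __name__ == "__main__":
    print(enabled_attributes())                 # only the role
    print(enabled_attributes("cue", "belief"))  # cue features + scopal_change
    print(enabled_attributes("cue", "fact"))    # cue features, no scopal_change
```

Keeping every feature on the cue mirrors the design choice above: the cue is the only component that is never missing, so no information is lost when the source is implicit.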

Factuality is by default factual, as this is by far the more frequent, unmarked case; the value non-factual has therefore to be deliberately selected. The source type is by default writer, the writer being in fact the shallowest source of any attribution relation. The annotation scheme also includes a free text slot for the source_id attribute. This was left blank in the pilot; however, it has been included in the tool Scheme as it represents a highly desirable feature that the final annotation should possess. The same source can in fact be mentioned in an article in a number of different ways, e.g. proper name, common name, profession, pronoun, etc. This feature is included in the Opinion Corpus (Wiebe, 2002), where it not only provides the source with a unique ID, assigned by the annotator, but also accounts for embedded attributions. In this slot the annotator should list the sources from the shallowest, i.e. the first source the writer is mentioning or the writer itself when explicit, to the most embedded one, i.e. the one directly holding the content in the attribution relation. The source ID slot should ideally be redundant. A coreference tool should in the future be able to relate pronouns and alternative full nouns to the original source automatically and reliably, just as it should be possible to do for coreference relations involving the content. It should also be possible to derive the additional sources to the left of an attribution by identifying the text spans containing the one corresponding to the attribution. Once an attribution relation is included in the content of another attribution, it should inherit its source. By performing this task starting from the outside, a nested attribution would simply inherit all the external sources from the attribution immediately above, as they would all be already listed in its source ID.
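The inheritance of sources by nested attributions can be sketched as follows. This is a minimal Python illustration under stated assumptions: the Attribution class, its field names and the representation of spans as (first, last) word-index pairs are all hypothetical, and nesting is decided purely by span containment, as suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class Attribution:
    source: str      # identifier of the source directly holding the content
    extent: tuple    # (first, last) word index covered by the whole relation
    content: tuple   # (first, last) word index of the content span
    chain: list = field(default_factory=list)  # shallowest to most embedded


def contains(outer, inner):
    """True if span `outer` fully covers span `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def assign_source_chains(attributions):
    """Fill in `chain` for every attribution, working from the outside in."""

    def resolve(att):
        if att.chain:
            return att.chain
        # Attributions whose content includes this whole relation.
        parents = [p for p in attributions
                   if p is not att and contains(p.content, att.extent)]
        if parents:
            # The attribution immediately above has the tightest content span.
            parent = min(parents, key=lambda p: p.content[1] - p.content[0])
            att.chain = resolve(parent) + [att.source]
        else:
            att.chain = [att.source]
        return att.chain

    for att in attributions:
        resolve(att)
    return attributions


if __name__ == "__main__":
    outer = Attribution("minister", extent=(0, 20), content=(3, 20))
    inner = Attribution("spokesman", extent=(5, 12), content=(8, 12))
    assign_source_chains([inner, outer])
    print(inner.chain)
```

Each nested attribution copies the complete chain of the attribution immediately above it, so the outer sources never have to be recomputed for deeper levels.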
Relations are not established in the annotation window, although they are shown there as the last element (Figure V), but directly in the window displaying the text. This is done by selecting a markable and then right-clicking on the markable it should be related to; the option add to markable set should then be available, as in Figure W. When an element that is part of a relation is selected (Figure W below), the markables belonging to the same relation are shown with a grey background and linked by a red line. The type of relation adopted for the annotation of attribution is the markable set. This allows relating as many markables as required. As the relation is

undirected, it can be retrieved from the annotation of any markable in the set, and not only from the annotation of the markable from which the relation originates, as with the markable pointer relation. This was especially important because attribution relations are bidirectional: it is necessary to trace the source from the content, but also vice versa.

Figure W - MMAX2 Annotation of relations

Customization

The Customization file was written as reported in Appendix 1. The third line specifies the display preferences for all the markables. Every selected span is highlighted by surrounding it with black handles and showing its text in a blue, bold font. The following lines in the file define specific display preferences for each attribute. This could have been done for every value of every feature connected to attribution; however, an excessive differentiation, instead of helping the annotation and the use of the annotation by visually characterising different elements, would simply confuse. It would in fact be necessary to memorise the association of many colours and font effects with the different features involved in the annotation. Only the components of attribution were therefore visually characterised: once a span is marked as the cue, its background is changed to orange; the content has instead a cyan background, the source a green one and the supplement a light grey one. Apart from allowing the immediate identification in the text of cue, content, source and supplement, these display settings provide

feedback on the successful annotation of a markable. In case the annotator forgets to assign a role to a selected markable or, in case of uncertainty, intentionally leaves that for a later stage, the display will continue showing the markable (blue, bold font and handles) without a coloured background, thus making it easier to identify later on when completing the annotation. Similarly, a failure to save the annotation can be immediately noticed; this easily happens, as it is easy to forget to select the auto-save function every time a different annotation project is loaded, and even easier to forget to save the annotation manually for every single markable. The annotator can then select the auto-save option and repeat only the last markable selection.

Style

The Style sheet (also reported in Appendix 1) has not been deeply modified. While dialogues, for example, are surely better displayed by separating turns and differentiating the actual text from the speaker, news articles have a simpler structure. Apart from the body text, the other elements are the title(s) and the author, when explicitly mentioned. However, distinguishing these elements for the annotation is not necessary. On the other hand, since attribution relations can often be found nested one in another, adding handles to the right and to the left of each markable represents the only way to make it possible to identify these instances (Figure X). Handles were therefore added in the Style sheet.

Figure X - Nested attributions visible through handles

5.4 Feasibility of the Schema and Issues

While performing the pilot annotation, several issues arose, leading to a reconsideration of the annotation schema. This was partially modified and reapplied to the sample corpus. Some changes were determined by the tool characteristics, in order to better exploit its potential or make up for its shortcomings, so that the schema was adequately represented and the annotation process relatively easy and intuitive. Other issues were brought up by evidence of real language occurrences of attribution relations presenting aspects not yet considered. Finally, doubts and difficulties in applying the schema shed light on features of the schema requiring further investigation to reach a more appropriate description. These issues, which have already been analysed in the relevant chapters, will be only briefly presented here. First of all, the annotation process highlighted the necessity of determining more precisely the scope of the attribution relation, i.e. the text span to select as source, cue or content. Adverbs, relative clauses, appositives or other elements can represent either highly informative material contributing to the interpretation of the attribution, or disruptive additional information which would be better left out of the annotation. In addition, the annotation revealed the necessity of a solution to preserve the information carried by source of the source elements (3.2.2), namely the provenance of the knowledge acquired through verbs of the fact type (e.g. John knows FROM MARY, that ...), and by recipients of messages presupposing a perlocutionary act, i.e. the indirect object of eventualities (especially influence verbs, e.g. The pope prohibits CATHOLICS to ...). In order to account for these elements, which do not correspond to any of the three components of an attribution relation, a supplement role was added and included in the annotation.
Moreover, it was necessary to account for instances of multiple sources belonging to different types. As all attributes have been added on the cue, i.e. associated with the text span corresponding to the cue, it is not possible to give different values for the source type feature. This would instead have been possible by marking the type directly on the source, thus assigning a type to each source in the attribution relation; however, the null or hidden sources would then have been problematic, as they have no corresponding text span. Hidden sources are much more frequent than multiple sources belonging to different source types, and therefore the former issue was given priority. The solution adopted was that of including a value mixed for the source type attribute. In addition, the frequency of coreference relations involving the source led to the addition of a source ID attribute as described in (3.2.1, 5.3.1). Lastly, assigning a value to the features type (4.1) and scopal change (4.4) turned out in some cases to be uncertain and dependent on subjective considerations. Hence the necessity of a statistical analysis of the problem, measuring inter-annotator agreement on these features, in order to estimate its extent and introduce the appropriate changes to the schema if required.

5.5 Summary

In order to develop an annotation schema for the phenomenon considered in this study and test its efficacy, it was decided to carry out a pilot annotation. The pilot was performed on a balanced portion of the ISST corpus consisting of 50 articles. The annotation was carried out with the help of an annotation tool. In order to select the most suitable tool, specifications for the proposed annotation schema were listed and compared against the available software. Some requirements, such as the possibility of selecting discontinuous text spans for a single markable and of relating markables through relations, were considered to have priority, and tools not meeting them were discarded. The two remaining tools, Knowtator and MMAX2, were compared with respect to the additional requirements, e.g. their customisability, how the annotation is saved and their user-friendliness. MMAX2 was eventually adopted and set up for the annotation. This required delineating the annotation Scheme, i.e. how to organise cue, source and content together with their attributes and constraints, as well as defining Style and Customisation.
The articles had to be prepared, that is corrected and provided as raw text, to become the XML Base Data of the annotation. The annotation of each article was stored stand-off in a single XML file with reference to the word index. The pilot allowed identifying some issues by confronting the annotation schema with real language instances. These, together with the constraints determined by the tool characteristics, resulted in the partial modification of the annotation schema, in order to account, for example, for the problem of coreference resolution and for phenomena such as sources of the source and mixed sources. Although some modifications might still be required, e.g. to the type and scopal change features, once the applicability of the schema has been statistically evaluated, the annotation scheme developed so far proved to be feasible and the annotation, with the help of the tool and of annotation constraints, rather reliable, although at times problematic.
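The statistical evaluation mentioned above could start from a chance-corrected agreement measure computed on the values two annotators assign to the same cues. The Python sketch below uses Cohen's kappa purely as an illustration: the choice of this coefficient is an assumption, since the measure to be adopted is not specified here, and the labels in the usage example are invented for demonstration and are not pilot results.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (observed - expected) / (1 - expected), where `expected`
    is the agreement that would arise by chance given each
    annotator's own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented values of the type feature for four cues.
    annotator_1 = ["assertion", "belief", "fact", "assertion"]
    annotator_2 = ["assertion", "belief", "fact", "belief"]
    print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Computing the coefficient separately for type and for scopal change would show which of the two features is the less reliable and therefore the first candidate for revision.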

6 Annotation Schema and Guidelines

The annotation process starts with loading one article at a time in the MMAX2 tool and is generally performed in five phases. First of all, it is necessary to identify the presence of an attribution relation. This is usually done starting from the identification of an attribution cue, typically punctuation marks and reportive verbs. However, for the relation to be annotated, it is not enough to find elements linked by the cue which function as source and content. The content should in fact express the object of the attribution and not just its description. An attribution like John said two words is not relevant (while John said: two words would be), unless it is necessary in order to relate John to the actual two words he pronounced, which may be expressed somewhere else in the article, similarly to coreferential pronouns functioning as content. Idiomatic or false attributions ((137), (138), (139)) should not be annotated either. These attributions are in fact not meant to establish a relation between source and content. The source of idiomatic attributions is also generally hidden. Examples (137) and (138) represent a specification or a concession with respect to what was previously said. In (139) the reportive verb say is just employed to express an equivalence (Biedermeier = il buon Meier).

(137) C'È DA DIRE CHE d'arminio Monforte non sarà scelto dall'intraprendente El Sayed ad interpretare ed eseguire le nuove strategie che dovranno portare a così alti traguardi. (ISST els001)
IT SHOULD BE SAID THAT Arminio Monforte won't be chosen by the enterprising El Sayed to interpret and execute the new strategies that will have to lead to such high achievements.

(138) Perché VA DETTO CHE il signor B. spesso ha una casa fuori porta, in mezzo al verde. (ISST perod001)
Because IT HAS TO BE SAID THAT mister B. often has a house out of town, in the countryside.

(139) Biedermeier, COME DIRE "il buon Meier": il cittadino medio del secolo scorso, protagonista di un'epoca, un gusto, uno stile. (ISST perod001)
Biedermeier, AS IF TO SAY "the good Meier": the average citizen of the last century, protagonist of an epoch, a taste, a style.

In the second phase, after having identified an attribution relation, the relevant text spans need to be selected and labelled as markables. They are then displayed as blue bold text in between square brackets. Afterwards, a role (Figure Y) is assigned to each markable, which is consequently shown with a specific background colour. The following step consists in assigning values to each attribute in the annotation. Lastly, the markables need to be linked in a relation. This can be done by selecting a markable and right-clicking on the elements which should be included in the same set. When a markable in a relation is selected, the markables belonging to that relation set are displayed joined by red arcs.

relation: SOURCE(S) - CUE - CONTENT(S) - (SUPPLEMENT)

Figure Y - Attribution relation components

In this chapter the annotation schema developed in this thesis will be summarised and presented as it has been employed in the pilot annotation. Indications will be provided regarding the selection of the relevant text span for each of the constitutive elements of the attribution relation. With the use of examples from the corpus, instructions on how to assign the values for each annotated feature will also be given. All the recommendations reported in the following sections have, however, to be regarded as suggestions, a reference repository of good practice examples aimed at facilitating the annotation process, rather than as prescriptions. The context and a full awareness of the goals to achieve should alone be sufficient to drive the annotation reliably. Should this prove incorrect, the strategy adopted here should be abandoned in favour of a more controlled one.

6.1 Text Spans Selection

Once an attribution relation is found, it is necessary first of all to identify its constitutive elements (Figure Z) and determine which span represents them. Each relation requires at least three components: the cue, i.e. the textual anchor signalling the relation; the content, that is the attributed material; and the source, the entity the content is attributed to. The source can be missing, as it is sometimes left implicit. It should however be clear, when annotating, which implicit entity the attribution refers to. In some cases it is instead possible to have multiple instances of source and content. In addition to these three components there is a fourth one, the supplement, which can optionally be used to mark additional relevant information.

Figure Z - Annotation, text spans selection.

The text spans corresponding to cue, source and content should first be selected (Figure Z), thus enabling the option of creating a markable with the selected text. In case extensions or reductions to the text span corresponding to a markable are required, it is possible to make them by choosing add or remove from this markable from the menu on the selected span.
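The constraints on the components of a relation described above can be made explicit in a small Python sketch. The class AttributionRelation and its field names are hypothetical, and spans are written as strings in the word-index notation used by the tool (e.g. "word_5..word_7"), with invented indexes; the sketch only encodes that the cue and at least one content are obligatory, while sources may be absent or multiple and the supplement is optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributionRelation:
    cue: str                                          # obligatory textual anchor
    contents: List[str]                               # one or more content spans
    sources: List[str] = field(default_factory=list)  # empty when implicit
    supplement: Optional[str] = None                  # optional fourth component

    def __post_init__(self):
        if not self.cue:
            raise ValueError("an attribution relation always needs a cue")
        if not self.contents:
            raise ValueError("an attribution relation needs at least one content")

    @property
    def has_implicit_source(self):
        """True when no source span could be marked in the text."""
        return not self.sources


if __name__ == "__main__":
    # Invented spans for a relation whose subject is not expressed.
    relation = AttributionRelation(cue="word_20..word_22",
                                   contents=["word_23..word_45"])
    print(relation.has_implicit_source)
```

Even when has_implicit_source is true, the annotator is still expected to know which entity is meant, so that the source type can be recorded on the cue.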

Elements that can possibly constitute each markable type are listed in Figure AA. Deciding what is in the scope of the attribution relation, i.e. what exactly to include in each markable, should not be taken for granted. In the following sections indications will be provided about each markable type and what should be included in or left out of its text span.

SOURCE(S)        CUE               CONTENT(S)        (SUPPLEMENT)
- noun phrase    - verb            - word            - cue modifier
- adjective      - noun            - phrase          - indirect object
- prep. phrase   - adjective       - clause          - source of source
                 - preposition     - sentence        - event specification
                 - prep. group     - entire article
                 - graphic marker

Figure AA - Annotation, elements which could function as a markable

Source Span

In general, the source span should include all those elements relevant to the identification of the entity having this role. However, what is to be considered relevant needs to be defined. The source should always comprise the full noun phrase expressing it ((140) attribution 1) or, in case the source is represented by an adjective or a prepositional phrase (141), these elements have to be included.

(140) [Il ministro del Tesoro] 1 [ha indicato anche] 1 [l'obiettivo del prossimo anno: 4 per cento] 1. Ø [Ha anche aggiunto] 2 [che i risultati positivi derivano soprattutto dalla caduta dei prezzi del petrolio, da quello delle altre materie prime e dal calo del dollaro] 2. (els020)
[The Secretary of the Treasury] 1 [has also indicated] 1 [next year's goal: 4 per cent] 1. (He) [has also added] 2 [that the positive results mainly derive from the drop of the petrol price, from that of other raw materials and the dollar decrease] 2.

(141) Le parole registrate di Gheddafi, (ISST cs039)
Gheddafi's recorded words,

In case of appositives or relative clauses referring to the entity in the noun phrase and contributing to its characterisation, these should also be selected together with the noun phrase, as in example (142). When they instead digress from the task of identifying the scope, as in (143), and constitute a mere description or provide additional details which are not necessary, they should not be annotated.

(142) il presidente della casa giapponese, Osamu Suzuki, ha previsto un'ulteriore flessione dei profitti anche per quest'anno. (ISST sole100)
the president of the Japanese company, Osamu Suzuki, has predicted a further fall in profits also for this year.

(143) <<Un'idea geniale>> l'ha definita Cesare Verlucca, editore piemontese pronto ad affrontare i salotti dopo il successo di vendite ottenuto dal Salone. (ISST sole040)
<<A genius idea>> has defined it Cesare Verlucca, publisher from Piedmont ready to face the salotti after the sales success obtained at the Salone fair.

When the relation is part of a relative clause with the source expressed by a relative pronoun, just the pronoun should be annotated, as in (144). The full noun the relative pronoun refers to, in this case Milan vice-president, Galliani, should be syntactically retrievable; moreover, it will be reported in the source ID slot. Null or missing subjects, having no corresponding span, should not be marked on the text ((140) attribution 2).

(144) Una provocazione collegata a un recente colloquio con il vicepresidente del Milan, Galliani, il quale ha convenuto con me circa l'insostenibilità della situazione. (ISST cs064)
A provocation connected to a recent conversation with Milan vice-president, Galliani, who agreed with me about the situation being unbearable.

Cue Span

The cue can be expressed by a considerable number of elements, which makes it difficult to recognise automatically. Most commonly, however, cues are reportive verbs. Apart from the particle or expression reversing its polarity (145), adverbs modifying the attitude ((146), (147)) should also be included in the cue span, while complements or specifications, which should be considered only if they provide relevant information, can be included in the supplement.

(145) Ø Non ho mai pensato che <<Il Dottor Zhivago>> potesse essere considerato un'opera ostile al socialismo. (ISST els076)
(I) have never thought that <<Doctor Zhivago>> could be considered a work against socialism.

(146) Afferma ufficialmente l'Antitrust : <<Le modalità di pubblicizzazione del prezzo consigliato >>. (ISST sole049)
The Antitrust officially affirms: <<The advertisement modalities of the suggested price >>.

(147) Ieri sera i segretari generali hanno esplicitamente detto di essere d'accordo con una delicata proposta contenuta nel documento dei giuristi sulle sanzioni da applicare ai singoli lavoratori che si rifiutassero di prestare il lavoro richiesto per garantire il minimo di servizio. (ISST els079)
Yesterday evening the general secretaries explicitly said that they agree with a delicate proposal included in the lawyers' document concerning the penalties to inflict on the individual workers who would refuse to do the work required to guarantee the minimum service.

Similarly, cue of the cue particles, i.e. usually complements expressing means or provenance, should not be labelled as cue but included in the annotation as supplement, together with the indirect object, when expressed. In this way these elements and the information they carry remain retrievable. When more than one cue belonging to the same type, i.e. conveying the same attitude the source holds, is expressed, these should all be included in a

single cue markable, as each cue markable corresponds to exactly one attribution relation. Cases like (148), with a redundant cue and source (since Martino and the Foreign Secretary are the same person), are also possible. In this case the co-referential sources should be included in the same source markable, shown in the square brackets labelled as s. Similarly, the two cues will constitute a single cue markable; the corresponding text span is marked with c. While cues of different types should be split into separate attribution relations, those of the same type concur in signalling the presence of one attribution and should be grouped. An exception is made only for punctuation cues, which should be annotated only when the relation is not signalled by any other means, as in (149).

(148) [Secondo] c [il ministro degli Esteri] s, [la prossima ondata di ottimismo ci sarà], [ha detto] c [Martino] s, [quando comunicheremo le nostre prime iniziative concrete]. (ISST sole017)
[According to] c [the Foreign Secretary] s, [the next wave of optimism will take place], [said] c [Martino] s, [when we announce our first concrete initiatives].

(149) Il Papa: La cultura ha bisogno del genio femminile. (ISST cs014)
The Pope: Culture needs the female genius.

Lastly, when the attribution is a question, the cue should include the element giving the utterance its interrogative form, i.e. the question mark in case of direct questions (150). This element should not be included in the content: it is in fact the cue, and therefore the attribution itself, which is questioned; the content of the attribution is not a question.

(150) Pensa anche lei come tanti critici che, con il suo romanzo incompiuto, lo scrittore si trovasse a una svolta esistenziale? (ISST els034)
Do you also think, like many literary critics, that, with his unfinished novel, the writer was at an existential turning-point?

127 6 Annotation Schema and Guidelines Content Span The selection of the content should obey to a principle of limiting the annotation to that portion of text which is surely meant to be attributed to the source. This means that the content span should not include utterances of uncertain attribution due to syntactic ambiguities. An example is when a clause constituting the content is joined to another utterance via a coordinating conjunction. In this case, only if the complementizer che (that) is included ((151), (152)) the second clause is also surely attributed, otherwise it could represent material added by the source above, usually the writer. (151) Più positive, invece, il giudizio di Fim-Cisl e Uilm-Uil, che hanno annunciato per oggi una conferenza stampa e che sono favorevoli ad una votazione referendaria sulla bozza di accordo. (ISST els002) More positive, instead, the opinion of Fim-Cisl and Uilm-Uil, that have announced a press release for today and that they are positive about a referendum poll concerning the agreement draft. (152) Lo ha detto ieri un portavoce del ministero degli Esteri, il quale ha anche annunciato che il governo cinese ha protestato con quello degli Stati Uniti e che si riserva il diritto di ulteriori reazioni. (ISST els075) It was said yesterday by a spokesman of the Foreign Ministry, who has also announced that the Chinese government has complained to the one of the United States and that they reserve themselves the right of further reactions. Also part of the content span should be the IO of verbs requiring one, e.g. to order, to forbid (153). In the example below in fact the prohibition would be incomplete without the IO to which it is addressed. Zagreb authorities did not prohibit to go to Petrinja, this could even be considered an incorrect attribution, but they forbid the journalists to go to Petrinja. (153) E le autorità di Zagabria hanno proibito ai giornalisti di andare a Petrinja e nelle altre località appena riconquistate. 
(ISST cs030)

And the Zagreb authorities have forbidden journalists to go to Petrinja and to the other recently reconquered places.

When the content span is interrupted by an incidental phrase or clause, it should be annotated as a single markable unless, as in (154), the content is also divided by sentence boundaries. In that case it seems more appropriate to add the second part of the content to the same relation, as a second content markable.

(154) "There's no question that some of those workers and managers contracted asbestos-related diseases," said Darrell Phillips, vice president of human resources for Hollingsworth & Vose. "But you have to recognize that these events took place 35 years ago. It has no bearing on our work force today." (PDTB 0003)

The complementizer that should always be included in the content span, together with the quotation marks (i.e. " " or << >>) (155). When source and cue are expressed incidentally, set off by hyphens, the hyphens should also be included in the content (155).

(155) E' vero che doveva interpretare lei la parte di Bruce Willis in Pulp Fiction? ["Sì -] [si adombra] [Matt] [- Un ruolo interessante: con Tarantino eravamo a buon punto, poi é arrivato Bruce. I suoi film incassano un po' più dei miei, no? Hanno scelto lui"] (ISST cs060)
Is it true that you were going to play the part of Bruce Willis in Pulp Fiction? ["Yes -] [Matt] [grows dark] [- An interesting role: with Tarantino we were well on the way, then Bruce arrived. His films earn a bit more than mine, don't they? They chose him"]

Punctuation at the end of a content span should be included only if it is part of the content itself. For example, a final full stop should be included when the content is expressed by a full sentence, a question mark when the content itself is a question (156), and so forth.

(156) Ø Sospende il racconto e formula una domanda, in inglese: Sai cos'è un rabbit?. (ISST cs030)
(He) pauses the story and asks a question, in English: Do you know what a rabbit is?

Supplement

The supplement span accounts for optional additional elements which, although not fundamental to an attribution relation (they are in fact often missing), do carry useful information. These can be: elements contributing to the identification of the source, and the provenance (157) or means by which the information was acquired; further specifications of the attitude the source holds; the recipient of a reportive verb of the assertion type (e.g. to tell); and event specifications providing contextual indications that are decisive for the interpretation and comprehension of the content. The latter also include instances like (158), where the content has been asserted or expressed about a certain entity or event ("it"). In the example this element is necessary, as it is required by the verb, and in an indirect quotation it could be included in the content. Here, however, it is not directly part of the content, since the source was talking about this event without mentioning it. In the examples below the supplement span is in small capitals.

(157) (Ø) Ho saputo della squalifica di Garciano DA MAURIZIO DAMILANO, vi giuro, non pensavo di arrivare primo. (ISST cs071)
(I) heard of Garciano's disqualification FROM MAURIZIO DAMILANO, I swear, I didn't think I would come first.

(158) <<Un'idea geniale>> L'ha definita Cesare Verlucca, editore piemontese pronto ad affrontare i salotti dopo il successo di vendite ottenuto dal Salone. (ISST sole040)
<<A genius idea>> has defined IT Cesare Verlucca, a publisher from Piedmont ready to face the salotti after the sales success achieved at the Salone fair.

6.2 Feature Annotation Guidelines

After selecting the text spans corresponding to the elements of an attribution relation, it is necessary to assign a role to each markable in the annotation window. When the role cue is chosen, the window (Figure BB) also displays the attributes and the values that need to be assigned.

Figure BB - Annotation, attributes selection.

The features included in the attribution are summarised in Table 4. They are all marked on the cue, although some refer to characteristics of the source, i.e. source type and source ID.

Role         Feature         Values
Cue          Type            None, Assertion, Belief, Fact, Eventuality
             Factuality      Factual, Non-factual
             Scopal change   None, Scopal change
             Source type     Writer, Other, Arbitrary, Mixed
             Source ID       free text
Source       -               -
Content      -               -
Supplement   -               -

Table 4 - Annotation schema features
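The roles and features of Table 4 could be represented in code roughly as follows. This is only a minimal sketch and not the format of the annotation tool used in the study: the class names, the character-offset spans and the helper find_span are assumptions made for illustration, and the feature values assigned to example (153) are illustrative guesses rather than the annotation actually produced in the pilot.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

# Hypothetical representation: a markable is a (start, end) character span.
Span = Tuple[int, int]


class CueType(Enum):
    NONE = "none"
    ASSERTION = "assertion"
    BELIEF = "belief"
    FACT = "fact"
    EVENTUALITY = "eventuality"


class Factuality(Enum):
    FACTUAL = "factual"
    NON_FACTUAL = "non-factual"


class ScopalChange(Enum):
    NONE = "none"
    SCOPAL_CHANGE = "scopal change"


class SourceType(Enum):
    WRITER = "writer"
    OTHER = "other"
    ARBITRARY = "arbitrary"
    MIXED = "mixed"


@dataclass
class Cue:
    """The cue markable; all features are marked here, as in Table 4."""
    span: Span
    cue_type: CueType
    factuality: Factuality
    scopal_change: ScopalChange
    source_type: SourceType
    source_id: str  # free text


@dataclass
class AttributionRelation:
    """One relation; content may consist of several markables, as in (154)."""
    cue: Cue
    sources: List[Span] = field(default_factory=list)
    contents: List[Span] = field(default_factory=list)
    supplements: List[Span] = field(default_factory=list)


def find_span(text: str, fragment: str) -> Span:
    """Return the span of the first occurrence of fragment (illustrative helper)."""
    start = text.index(fragment)
    return (start, start + len(fragment))


# Example (153); the feature values below are illustrative assumptions.
sentence = ("E le autorità di Zagabria hanno proibito ai giornalisti di andare "
            "a Petrinja e nelle altre località appena riconquistate.")

relation = AttributionRelation(
    cue=Cue(
        span=find_span(sentence, "hanno proibito"),
        cue_type=CueType.EVENTUALITY,
        factuality=Factuality.FACTUAL,
        scopal_change=ScopalChange.NONE,
        source_type=SourceType.OTHER,
        source_id="autorità di Zagabria",
    ),
    sources=[find_span(sentence, "le autorità di Zagabria")],
    # The indirect object "ai giornalisti" is part of the content span;
    # the final full stop is left out because the content is not a full sentence.
    contents=[find_span(sentence, "ai giornalisti di andare a Petrinja "
                                  "e nelle altre località appena riconquistate")],
)
```

Keeping the content as a list of spans allows a second content markable to be attached to the same relation when the content is divided by sentence boundaries.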