Text Analysis beyond Keyword Spotting



Bastian Haarmann, Lukas Sikorski, Ulrich Schade
{ bastian.haarmann | lukas.sikorski | ulrich.schade }@fkie.fraunhofer.de
Fraunhofer Institute for Communication, Information Processing, and Ergonomics (FKIE)
Neuenahrer Straße 20, 53343 Wachtberg, GERMANY

ABSTRACT

Texts, e.g. written reports, can be pre-analyzed automatically in order to find those which might be of interest with respect to a specific problem. Often this is done by a more or less sophisticated version of keyword spotting. However, the field of computational linguistics offers more potent and more precise tools for automatic text analysis. In this paper we present such a system, tailored to the pre-analysis of military texts such as reports. We discuss the modules of the system and the improvements we installed to meet the needs of the military, like the need for speed. In conclusion, we show how the system can be used in the process of automatic threat recognition.

MOTIVATION

The success of military operations of all kinds (battlefield, anti-terrorism, peacekeeping, disaster relief) relies on information. Commanders must be aware of the current situation, they have to understand it, and they have to grasp how the situation might develop [20]. In the past, critical pieces of information often were simply not available; today they are often hidden in the haystack of gathered sensor data, SIGINT data, HUMINT reports, and open sources. This huge amount of data can no longer be processed by human reconnaissance specialists in toto. It has to be reduced by automatic means. Systems have to check for pieces of information that may be relevant to a specific problem. The human experts can then analyze the promising pieces to find the pearls sought after. With respect to texts, from reports written by members of one's own forces to web pages of possible opponents, the automatic pre-processing is often limited to keyword spotting.
However, the field of

computational linguistics offers more potent and more precise tools for automatic text analysis. This paper is about a system that applies these tools and methods.

INFORMATION EXTRACTION

The process of extracting limited kinds of semantic content from texts is called information extraction (IE) [12, p. 759]. Our work in this field is conceptually based on Hecking (e.g., [8, 9]), who applied IE techniques to the analysis of battlefield and HUMINT reports, the domains we also aim at. The technical basis for our work is the freely available open-source tool GATE [2, 7]. GATE provides a toolbox for building the required IE processing pipeline, which can then be adjusted and enlarged according to one's needs. The standard IE processing pipeline consists at least of the following processing modules: tokenizer, gazetteer, sentence splitter, part-of-speech tagger, recognizer for named entities, parser, and a module for semantic role labeling.

The tokenizer determines the individual tokens of the text, i.e. single words, numbers, abbreviations, and punctuation marks. The gazetteer then compares the tokens to the elements of several lists which contain names of various types. Usually there are at least a list of person and organization names and a list of names for relevant geographic entities, e.g. countries, provinces, towns, villages, rivers and the like. Tokens matching an element in one of the lists are annotated with the respective type; e.g., the token "Kabul" might be tagged as type = location, subtype = city.

After dealing with tokens, the IE process looks for pieces of the text which consist of one or more tokens that belong together according to linguistic theory. First, sentences are determined by the sentence splitter. This task is less trivial than it may seem at first glance because one has to prevent the splitter from suspecting the end of a sentence after every period. Otherwise, a sentence would never make it past "Mr." or "Dr." or any other abbreviation of that kind.
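The abbreviation problem can be illustrated with a minimal sketch. The following is a toy illustration in Python (GATE itself is Java-based), with a small, made-up abbreviation list; a real sentence splitter would draw on a much larger resource:

```python
import re

# Known abbreviations after which a period does NOT end a sentence.
# This list is illustrative only.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof.", "e.g.", "i.e.", "vs."}

def split_sentences(text):
    """Split text at sentence-final periods, skipping known abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        end = match.start() + 1                     # include the period itself
        last_word = text[:end].rsplit(None, 1)[-1]  # the token carrying the period
        if last_word in ABBREVIATIONS:
            continue                                # not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Without the abbreviation check, `split_sentences("Dr. Smith arrived. The tanks move towards Kabul.")` would break after "Dr."; with it, the call yields the two intended sentences.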
Next, the word tokens need to be annotated with their syntactic category (e.g., for "the tanks move", "the" has to be tagged as determiner, "tanks" as noun and "move" as verb). This is the task of the part-of-speech tagger (POS tagger). GATE comes shipped with a tagger that annotates the word tokens according to the categories of the Penn Treebank tag set [15, 17]. On the basis of the annotations provided by the gazetteer and by the POS tagger, the recognizer for named entities and the parser identify the larger pieces, the constituents and the subordinate clauses, within

the sentences. The recognizer for named entities combines elements annotated by the gazetteer. For example, for the sequence "Dr. Mohammed el-Baradei", the gazetteer will provide the annotations title for "Dr.", male forename for "Mohammed" and surname for "el-Baradei", so that the recognizer can annotate the whole sequence with the tags person and noun phrase according to its rules.

The parser operates on the tags provided by the POS tagger and by the recognizer for named entities. For example, in "The tanks move towards Kabul", "the" had been labeled determiner and "tanks" had been labeled noun. The parser recognizes a sequence of a determiner and a noun as a noun phrase, so "the tanks" is annotated noun phrase. In addition, "towards" is labeled preposition and "Kabul" location; together they are labeled prepositional phrase by the parser.

The last module in the pipeline is the module for semantic role labeling. In short, this module assigns semantic roles to the recognized constituents. In our example, the noun phrase "the tanks" should be labeled with the role theme, and the prepositional phrase with the role direction. We will come back to semantic role labeling in the next section.

Obviously, the information extraction pipeline as described above can be enlarged by additional modules. For example, a spell checker can be integrated following the tokenizer in order to prevent wrong annotations resulting from spelling errors. Another module that can be added is a module for coreference resolution. Such a module should follow the parser so that syntactic information, available through the annotations provided by the parser, can be exploited for coreference resolution. At the moment, our system does not include these modules; however, the module for coreference resolution is under development.

The process of information extraction has to be adapted to the domain the texts to be analyzed come from.
When the texts are military reports, these reports are about events that have happened or are ongoing in a specific area, or they are about persons or organizations operating in that area. Even if the reports convey status information, the information is about persons, organizations, facilities and more, all located in that area. Thus, the gazetteer has to be tailored to that area: its geographical lists have to include the names of the villages and towns of that area, the names of its rivers and so on.
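The interplay of gazetteer lookup and named-entity combination can be sketched as follows. This is an illustrative Python toy, not GATE/ANNIE code; the gazetteer entries and the single title + forename + surname rule are examples modeled on the "Dr. Mohammed el-Baradei" case, whereas a deployed gazetteer would hold domain-tailored lists with many thousands of entries:

```python
# Illustrative domain-tailored gazetteer lists (entries are examples only).
GAZETTEER = {
    "Kabul":      {"type": "location", "subtype": "city"},
    "Helmand":    {"type": "location", "subtype": "province"},
    "Dr.":        {"type": "title"},
    "Mohammed":   {"type": "male forename"},
    "el-Baradei": {"type": "surname"},
}

def annotate(tokens):
    """Attach gazetteer annotations to matching tokens (None if no match)."""
    return [(tok, GAZETTEER.get(tok)) for tok in tokens]

def recognize_persons(annotated):
    """Combine title + forename + surname sequences into person annotations."""
    result, i = [], 0
    while i < len(annotated):
        types = [a[1]["type"] if a[1] else None for a in annotated[i:i + 3]]
        if types == ["title", "male forename", "surname"]:
            span = " ".join(tok for tok, _ in annotated[i:i + 3])
            result.append((span, {"type": "person"}))
            i += 3
        else:
            result.append(annotated[i])
            i += 1
    return result
```

For the token list `["Dr.", "Mohammed", "el-Baradei", "visited", "Kabul"]`, the sketch annotates "Kabul" as a city and merges the first three tokens into one person annotation.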

SEMANTIC ROLE LABELING

After the aforementioned steps in the data processing pipeline have been completed, we next need to determine the actions, events and situations reported in the text and assign semantic roles to the constituents the sentences of the text consist of. The process of assigning semantic roles to constituents is called semantic role labeling. Sometimes the term thematic role is used for semantic role (e.g., by Sowa [18]), so that thematic role labeling also denotes this process.

The process of semantic role labeling links word meanings to sentence meaning. For example, the sentence "Legio IX Hispana Eburaci castra posuit" consists of the verb "posuit" and the constituents "Legio IX Hispana", "Eburaci", and "castra". Semantic role labeling assigns agent to "Legio IX Hispana", location to "Eburaci", and theme to "castra". Taking these assignments together with the words' semantics ("Legio IX Hispana" refers to the 9th Legion of the Roman Army, which had earned the surname Hispana [5]; "Eburacum" refers to the ancient city which is now York; "castra" denotes a military camp; and "posuit" is the Latin verb, past tense, third person singular, active, indicative, for "to set up"), it becomes clear that the sentence means that the 9th Spanish Legion set up a camp in York.

In order to build a module for semantic role labeling, one first has to choose a set of semantic roles. Different such sets have been discussed in the linguistic literature. Our system mainly relies on the work of Sowa [18], although we have added a few roles. For example, Sowa proposes the roles location, origin, destination, and path as spatial roles, to which we have added direction.

In general, roles can be assigned to constituents in a process that exploits syntactic, lexical, and semantic information. With respect to English, syntactic information here mainly means word order information.
For example, in a sentence in active voice, the subject constituent, i.e. the constituent which precedes the verb, receives either the role agent or the role effector. Lexical information refers to information provided mainly by verbs and prepositions. For example, a prepositional phrase that starts with the preposition "at" normally opens a constituent that will be labeled location, as in "at the marketplace", or point in time, as in "at noon". Semantic information normally provides constraints that can be used to decide among alternatives. For example, the role agent means "an active animate entity that voluntarily initiates an action" [18, p. 508], whereas effector means "an active determinant source, either animate or inanimate, that initiates an action, but without

voluntary intention" [ibid., p. 509]. Thus, if semantic information tells us that a constituent in subject position denotes an inanimate object, the role to be assigned to that constituent has to be effector and cannot be agent.

In our system, the process of semantic role labeling starts by identifying the verb group within a sentence. The reason for this is quite obvious: our reports are about actions and events, and the expression of actions and events is the domain of the verbal vocabulary, i.e. of verbs and, to some degree, of deverbal nouns (nouns derived from verbs; e.g., by "detonation" in "the IED's detonation" the event of an IED detonating is referred to). By identifying the main verb of the verb group (or the verb that is the base of a deverbal noun), we get what linguists call the head of a sentence. As a next step, our system uses the lexical information of that head to identify the set of roles that are compatible with or even demanded by it. For example, a verb that denotes an action involving movement, like "advance", is compatible with the spatial semantic roles origin, path and destination or direction (e.g., "The company advanced from Wilderness Church via Dowdall's Tavern towards Chancellorsville"). In contrast, a verb that denotes an action without movement is compatible with the spatial role location only (e.g., "The army stayed at Stafford Heights"). In the next section, subsections Preparing Semantic Role Labeling and Ontology, we will take a more detailed look at this step.

After the set of compatible and mandatory roles has been identified, the system calculates for each constituent which kind of role it might take. For this step, syntactic information as well as lexical information from prepositions is used. Finally, the calculated possible roles for the constituents are matched against the set of compatible and mandatory roles so that each of the constituents receives an appropriate role.
For this step, the constraints provided by semantic information are exploited.

IMPROVEMENTS

In this section, we discuss the changes we made to the standard information extraction process and to the subsequent process of semantic role labeling. Our main motivation for these changes was the need for speed: the resulting system for automatic text analysis is supposed to be integrated into a C2 system, so incoming reports have to be processed in real time. Thus, the changes we made improve the speed of the

process. In the following, we discuss the topics related to these changes: the revision of GATE's ANNIE, the use of a chunker instead of a parser, the addition of a module to prepare semantic role labeling, and, with respect to semantic role labeling itself, the integration of a specific ontology that provides lexical and semantic knowledge.

MIETER

In order to extract information from military reports, a high rate of correctly identified constituents and structures is as crucial as a reasonable processing speed. The open-source tool GATE already comes with an IE toolbox called ANNIE (A Nearly-New Information Extraction system), which consists of comprehensive rules and linguistic resources such as bulky gazetteer lists and an extensive full-form lexicon based on different corpora [3]. The richness of these rules and resources, however, comes along with several problems with respect to our purpose: the amount of time needed for computing increases with the number of rules, and the resources and grammars contain insufficient military-specific definitions. These problems result in a lower rate of detected and extracted information and in a processing speed that does not match our needs.

In order to avoid these problems, we developed our own version of an information extraction system, called MIETER, which is based on ANNIE. MIETER stands for Military Information Extraction from Texts and its Electronic Representation. It is constructed such that it uses smaller and more specific linguistic resources (e.g. the lexicon), it does not look for irrelevant information (such as the names of business companies), and it uses grammars adapted to the specific structures found in military reports. The use of fewer and adapted rules and resources leads to a significantly higher rate of information detected and extracted from military reports and to a higher processing speed.

In order to keep the recognition rates high, MIETER includes additional features which are able to correct false markings made by previous components of the pipeline. The POS tagger often makes false assumptions regarding the category of a word. Its decision for a tag is based on a full-form lexicon and a set of template rules, both derived from corpus work. For example, the word "prevailing" would be recognized as the verb "prevail" in continuous form, but in the phrase "under the prevailing conditions" it takes the function of an adjective and hence needs to be recognized as such. The MIETER Re-Tagger component detects such tagging errors and corrects them.

CHUNKING

Parsers normally calculate complete parse trees which represent the syntactic structure of a sentence. Parse trees contain the syntactic information we need for semantic role labeling. However, the complete and sometimes rather complex tree is normally not needed: it is often sufficient to know the verb group and the other constituents of a sentence as well as their sequence. Thus, a deep syntactic analysis is not necessary; partial parsing does the job as well. The kind of partial parsing we apply is called chunking. It is the process of identifying and classifying the (consecutive, non-overlapping) constituents within sentences by statistical or rule-based heuristics. MIETER includes a chunker that operates with rules.

The major advantages of chunking are its robustness with respect to unseen words and possible ambiguities, its ability to provide at least partial results even if a full analysis is not feasible, and its speed. Deep syntactic analysis, on the other hand, requires that the entire syntactic structure be calculated for each sentence. It produces much more information than chunking (more than we need for our purpose), but it might fail because of unknown words and ambiguity.
Additionally, deep syntactic analysis is both highly time-consuming and computationally resource-intensive. Nevertheless, within the context of the work on Hecking's ZENON system [9], an approach is being developed and implemented that uses a deep parser to calculate the syntactic structures of report sentences and uses these structures to assign semantic annotations. See [16] for details on the deep approach and [11] for a more detailed discussion of the pros and cons of both approaches.
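A rule-based chunker of the kind sketched above can be illustrated as follows. This is a hedged Python toy, not MIETER's actual chunking grammar: a handful of made-up rules over Penn Treebank tags are applied bottom-up, merging adjacent items until no rule fires:

```python
# Minimal rule-based chunker over POS tag sequences.
# Rules are illustrative: DT+NN(S) -> NP, bare NNP -> NP, IN+NP -> PP.
CHUNK_RULES = [
    (["DT", "NN"],  "NP"),   # determiner + singular noun -> noun phrase
    (["DT", "NNS"], "NP"),   # determiner + plural noun   -> noun phrase
    (["NNP"],       "NP"),   # bare proper noun           -> noun phrase
    (["IN", "NP"],  "PP"),   # preposition + noun phrase  -> prepositional phrase
]

def chunk(tagged):
    """Repeatedly merge adjacent items that match a chunk rule."""
    items = list(tagged)
    changed = True
    while changed:
        changed = False
        for pattern, label in CHUNK_RULES:
            for i in range(len(items) - len(pattern) + 1):
                window = [tag for _, tag in items[i:i + len(pattern)]]
                if window == pattern:
                    text = " ".join(word for word, _ in items[i:i + len(pattern)])
                    items[i:i + len(pattern)] = [(text, label)]
                    changed = True
                    break
            if changed:
                break
    return items
```

For the tagged sentence "The/DT tanks/NNS move/VBP towards/IN Kabul/NNP", the sketch yields the flat constituent sequence NP, verb, PP, which is exactly the partial structure needed for semantic role labeling.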

PREPARING SEMANTIC ROLE LABELING

In order to prepare semantic role labeling, MIETER incorporates a module called the General Identifier. This module annotates the constituents identified by MIETER's chunker with preliminary semantic roles such as agent, affected, completion, time and location, which are then refined by MIETER itself (cf. the following paragraph) and by the process of semantic role labeling (cf. the next section).

Agent and affected annotations can in most cases easily be calculated from the position of a noun phrase (preceding or following the verb) and the voice of the verb group (active or passive). In contrast, syntactic and lexical information together are often needed, but also sufficient, to decide whether there is temporal or spatial information in a sentence. The preposition "under" might at first glance be judged to be a location marker, as in "under the bridge". But prepositional phrases with "under" can also bear a completion, as in "under heavy bombardment", or temporal information, as in "under 9.58 seconds". The General Identifier uses gazetteer lists to annotate whether a prepositional phrase most probably contains temporal or spatial information. These lists include trigger words, e.g. "morning" or "Thursday" for temporal information, or "Oslo" or "Helmand" for spatial information.

After classifying constituents as temporal, spatial, or completion, MIETER refines the classification by dividing the temporal constituents into start time constituents (a trigger word would, e.g., be "from"), end time constituents ("until"), or point in time constituents ("on"). In the same way, MIETER sub-classifies the spatial constituents as location ("at"), direction ("towards"), origin ("from"), destination ("to") or path ("via"). Figure 1 shows an example of the labeling that is already achieved by the information extraction process.

Figure 1: The preliminary labeling as calculated by MIETER.
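The two-stage classification performed by the General Identifier can be sketched as follows. This is an illustrative Python toy with tiny, made-up trigger-word lists; the actual gazetteer lists are far larger, and the real module also consults syntactic context:

```python
# Illustrative trigger-word lists and preposition cues mirroring the
# General Identifier's two-stage classification. Entries are examples only.
TEMPORAL_TRIGGERS = {"morning", "Thursday", "noon"}
SPATIAL_TRIGGERS  = {"Oslo", "Helmand", "Kabul"}

# Sub-classification of temporal and spatial constituents by preposition.
TEMPORAL_SUBTYPES = {"from": "start time", "until": "end time", "on": "point in time"}
SPATIAL_SUBTYPES  = {"at": "location", "towards": "direction",
                     "from": "origin", "to": "destination", "via": "path"}

def classify_pp(preposition, words):
    """Assign a preliminary semantic label to a prepositional phrase."""
    if any(w in TEMPORAL_TRIGGERS for w in words):
        return TEMPORAL_SUBTYPES.get(preposition, "time")
    if any(w in SPATIAL_TRIGGERS for w in words):
        return SPATIAL_SUBTYPES.get(preposition, "location")
    return "completion"  # neither temporal nor spatial triggers found
```

Note how the ambiguous preposition "from" is resolved by the trigger words: "from Thursday" becomes a start time constituent, "from Oslo" an origin, while "under heavy bombardment" falls through to completion.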

It can be said that the General Identifier already does the main share of the semantic role labeling, namely those parts of the work that can be done by exploiting syntactic information and simple lexical information only. However, some further processing has to be done, taking more complex lexical information and semantic information into account. This means that the labels the General Identifier has assigned to the constituents have to be refined. For example, constituents that have been preliminarily labeled agent will be examined once more to check whether they are indeed to be labeled agent or whether they have to be relabeled, e.g. as effector or as experiencer. Similarly, the constituents that have been labeled affected may be relabeled, e.g., as patient, beneficiary, or theme.

ONTOLOGY

The final process of semantic role labeling exploits lexical information and semantic information. In our system, the respective knowledge is stored in an ontology (for general information about ontologies, cf. [19]). The specific process is as follows: at the end of the information extraction process, for each sentence of the report its verb group is identified and its constituents are marked. For semantic role labeling, the main verb of the verb group is identified and looked up in our ontology. This ontology is focused on verbs. It provides information about the verbs in general and about their semantic frames in particular. Semantic frames are a further development of Fillmore's case grammar [7]; a verb's frame tells us which semantic roles come with the verb and which of these roles are mandatory, optional or forbidden. For each verb looked up, its frame is taken, and the slots of that frame are then filled with the constituents of the sentence the verb in question came from. In order to map the constituents to the correct slots, semantic information is exploited.
This semantic information consists, for example, of restrictions that refer to the slots and that are also represented in the ontology. We created the verb ontology based on FrameNet [1, 14] and VerbNet [4]. Its construction was also influenced by the work of Helbig [10], Levin [13], and Sowa [18]. Whereas most existing ontologies concentrate on the objects in the domain of interest, a verb ontology has to focus on situations in general and on actions in particular. As can be seen in figure 2, left panel, the verbs in the ontology are classified into those that refer to static situations and those that refer to dynamic situations. The latter are further divided into verbs that refer to actions and verbs that refer to events. Action verbs

demand an agent ("an active animate entity that voluntarily initiates an action" [18, p. 508]). The action verbs are divided into many classes, among them cognition verbs (e.g., "consider"), exchange verbs (e.g., "receive") and motion verbs (e.g., "advance"). Actions belonging to the same class share the semantic roles they demand and allow. For example, motion verbs like "advance" inherit their frame of semantic roles from the Motion class, cf. figure 2, bottom right panel.

Figure 2: This snippet from a Protégé screen shows the semantic properties of the verb "advance".

The task of making the ontology available as a resource for the process of semantic role labeling is assigned to the so-called Frame Slot Creator. This module takes the main verb out of the verb group of each sentence and sends it to an Ontology Web Service, which acts as an interface to the ontology and returns the semantic frame for each verb requested. The Frame Slot Creator then stores the frame in the verb group annotation as a matrix of attribute-value pairs in which the semantic roles of the frame serve as attributes. The values of these attributes have to be the constituents of the respective sentence. Filling the constituents into the appropriate slots is done by a module called the Frame Slot Filler, which calculates this mapping from the constituents' tags, especially their preliminary semantic role tags, and the semantic constraints the ontology provides for appropriate fillers.
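The interplay of Frame Slot Creator and Frame Slot Filler can be sketched as follows. This is an illustrative Python toy: the frame entries for "advance" and "stay" are hypothetical stand-ins for the answers of the Ontology Web Service, and the real Frame Slot Filler additionally checks the ontology's semantic constraints on fillers:

```python
# Hypothetical verb frames in the spirit of the ontology's Motion class:
# each frame lists the mandatory and optional semantic roles of the verb.
FRAMES = {
    "advance": {"mandatory": ["agent"],
                "optional":  ["origin", "path", "destination", "direction"]},
    "stay":    {"mandatory": ["agent", "location"],
                "optional":  []},
}

def fill_frame(verb, constituents):
    """Map preliminarily labeled constituents into the verb's frame slots.

    `constituents` is a list of (text, preliminary_role) pairs, e.g. as
    produced by the General Identifier.  Returns the filled frame, or None
    if a mandatory role cannot be filled.
    """
    frame = FRAMES[verb]
    allowed = set(frame["mandatory"]) | set(frame["optional"])
    slots = {role: text for text, role in constituents if role in allowed}
    if any(role not in slots for role in frame["mandatory"]):
        return None  # the sentence does not satisfy the verb's frame
    return slots
```

For "The company advanced towards Chancellorsville", the agent and direction constituents land in the corresponding slots of the "advance" frame, whereas "stay" without a location constituent leaves a mandatory slot empty and the frame unfilled.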

CONCLUSION

In the sections above, we presented a process by which texts in general and military reports in particular can be analyzed. As a result of this process, the text in question is annotated; the annotation assigns a semantic role to most of the text's constituents. Since the text in question is written in natural language, the anomalies which are characteristic of natural language, like ambiguity, vagueness, and creative usage, provoke some gaps as well as some errors in the assignment. Nevertheless, as long as the texts refer to a restricted domain, as military reports normally do ("attack" in these reports means a specific military action and not a situation on the soccer field or on the racetrack, and so on), the assignment is nearly complete and correct.

The question remains how an assignment of semantic roles to text constituents helps text analysis and why it means an advancement beyond keyword spotting. In order to illustrate that advancement, let us take a look at an application of written text analysis, namely automatic pre-processing for threat recognition. Through this pre-processing, those reports in a stream of reports can be identified that may indicate threats. Following the automatic pre-processing, a human expert receives the selected reports to check them for threat indicators. Obviously, keyword spotting can help in the automatic pre-processing. For example, if person X is known to be a member of an organization Y that carries out terrorist attacks, spotting X or Y in text A should result in adding A to the list of texts to be given to the human expert. However, keyword spotting results in many false hits. For example, if a report says that X died in a car accident the day before, the report might be of importance for updating Y's member list but does not indicate a threat. In contrast, if a report tells us that X bought lots of items that are used in IED construction, that report does indicate a threat.
The example tells us that keywords can be used as indicators for threats, but that they are not very precise indicators. However, under the assumption that the reports in question are represented with the semantic role annotations we discussed above, indicators can be constructed that are significantly more precise. This is even more true if we additionally use our ontology to do some simple reasoning. For example, instead of searching for X or Y, we can construct an indicator which says: check for X as agent of a procure action in which the theme is an item needed for IED construction. The annotation provides the information as to which constituents are agent or theme, respectively, and the ontology provides the knowledge regarding which verbs can describe a procure action

(like "buy" or "steal") and the knowledge regarding which objects are part of an IED. The former knowledge is represented in the ontology's action branch and the latter in its object branch. In sum, annotating text with semantic roles allows for focused checks and thus provides a fine basis for applications like automatic threat pre-processing.

The technologies described above have been developed for and are integrated in the AuGE system, a demonstrator for automatic threat recognition built by IABG in a project sponsored by the German Air Force's Transformation Center. In that project, Fraunhofer FKIE served as a subcontractor to IABG.

REFERENCES

[1] FrameNet. http://framenet.icsi.berkeley.edu/.
[2] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
[3] GATE's ANNIE: http://gate.ac.uk/ie/annie.html.
[4] VerbNet: A Class-Based Verb Lexicon. http://verbs.colorado.edu/mpalmer/projects/verbnet.html.
[5] http://en.wikipedia.org/wiki/legio_ix_hispana.
[6] Lukas Sikorski, Bastian Haarmann, and Ulrich Schade. Computational Linguistics Tools Exploited for Automatic Threat Recognition. To be published in Proceedings of the NATO RTO IST-099. Madrid, 2011.
[7] Charles J. Fillmore. The Case for Case. In Emmon Bach and Robert T. Harms, editors, Universals in Linguistic Theory. Holt, Rinehart and Winston, New York, 1968.
[8] Matthias Hecking. Information Extraction from Battlefield Reports. In Proceedings of the 8th International Command and Control Research and Technology Symposium (ICCRTS), Washington, DC, 2003.
[9] Matthias Hecking. System ZENON: Semantic Analysis of Intelligence Reports. In Proceedings of LangTech 2008, Rome, Italy, 2008.
[10] Hermann Helbig. Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, 2006.
[11] Constantin Jenge, Silverius Kawaletz, and Ulrich Schade. Combining Different NLP Methods for HUMINT Report Analysis. In NATO RTO IST Panel Symposium, Stockholm, Sweden, October 2009.

[12] Bastian Haarmann and Lukas Sikorski. Applied Text Mining for Military Intelligence Necessities. To be published in Proceedings of the Future Security Conference. Berlin, 2011.
[13] Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL, 1993.
[14] Birte Lönneker-Rodman and Collin F. Baker. The FrameNet Model and its Applications. Natural Language Engineering, 15, 415-453, 2009.
[15] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19, 313-330, 1993.
[16] Bastian Haarmann. Semantic Role Labeling im modernen Text-Analyse-Prozess [Semantic role labeling in the modern text analysis process]. To be published. KnowTech Conference 2011, Bad Homburg v.d.H., 2011.
[17] Martha Palmer, Dan Gildea, and Paul Kingsbury. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31, 75-105, 2005.
[18] John F. Sowa. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, 2000.
[19] Steffen Staab and Rudi Studer, editors. Handbook on Ontologies. International Handbooks on Information Systems. Springer, 2004.
[20] Jake Thackray. The Holy Grail. In David Potts, editor, The Big Issue: Command and Combat in the Information Age. CCRP, Washington, DC, 2003.