Using Knowledge Extraction and Maintenance Techniques To Enhance Analytical Performance

David Bixler, Dan Moldovan and Abraham Fowler
Language Computer Corporation
1701 N. Collins Blvd #2000
Richardson, TX, 75080, USA
{bixler, moldovan,

Keywords: Information Sharing and Collaboration, Search and Retrieval, Novel Intelligence from Massive Data, Knowledge Discovery and Dissemination

Abstract

Analysts are constantly overwhelmed by large amounts of data which lack meaningful or useful structure. LCC is working on two tools which help to alleviate this problem, Jaguar and Polaris. This paper discusses the technical contributions of these tools, namely automatic extraction of semantic relations, automatic ontology construction, and metrics to evaluate ontology quality, along with experimental results.

1. Introduction

Intelligence analysts are constantly plagued with an overabundance of information. Individual analysts approach this problem in a variety of ways, using organizational methods which work on a small scale but do not lend themselves to interoperability with methods used by other analysts. Even these methods do not solve the problem, as analysts can only handle a tiny fraction of the information available to them. Unfortunately, many of the clues and answers they are looking for reside in the vast amounts of information left untouched, and even the information they do have at their disposal lacks many of the data bridges which could help drive inferences and hypotheses.

LCC has been developing two tools that help address these problems by enabling technologies which leverage prior and tacit knowledge, such as Question Answering (QA), Information Extraction (IE), and Summarization. These two tools are Polaris, a semantic parser, and Jaguar, an automatic ontology builder. Both Polaris and Jaguar operate automatically on text, allowing an analyst to perform other tasks while these tools run in the background.
The end result of Jaguar (which uses Polaris in its processing) is a set of automatically generated, semantically rich, domain-specific ontologies which analysts can use while working on a task related to a domain or set of domains. These ontologies can capture data specific to a given analyst as well as data for broader use, allowing analysts to keep their own specific knowledge while sharing and exchanging information with other analysts in an efficient, streamlined fashion. The ontologies and semantic clusters can also be integrated with other tools to boost their accuracy and performance.

2. Motivation

Analysts need tools which assist them in higher modes of critical thinking in order to improve analysis on complex issues, but they currently lack such tools [Heuer]. One method is to structure the information in a way which is easy to understand and allows the analyst to be more efficient. More information, however, is not necessarily better. Many psychological studies have demonstrated that accuracy generally increases very little, if at all, as more information is given to an expert; what is needed is "more truly useful information" [Heuer]. Since analysis tends not to improve with more information, it is important that the information used is the most important available and is structured in a useful fashion.

It is also well known that the capacity of short term memory (STM) is very limited, and that long term memory (LTM) retrieval is difficult for tasks not performed recently. Humans are also not good at identifying patterns between chunks of data, structuring data in useful ways, or drawing analogies. External memory aids help with these issues, and semantically enriched ontologies can serve as external memory aids both by identifying patterns between concepts and groups of concepts and by simulating a highly structured LTM from which information is simple to retrieve.
Heuer also notes that human memory rarely changes retroactively, and well-maintained knowledge bases can accommodate this shortcoming.

3. Approach

3.1 Polaris

Polaris is based on a set of 40 semantic relations which LCC has defined. Semantic relations are abstractions of underlying relations between concepts, and can occur within a word, between words, between phrases, and between sentences. Semantic relations are useful because they provide denser connectivity between concepts and contexts. Also, detecting semantic relations is one essential step toward the ultimate goal of machine text understanding. Semantic relations allow for richer ontologies and knowledge bases which can capture contextual knowledge, events, and firmer assertions. LCC's set of 40 relations is summarized in Table 1.

 #  Semantic Relation            #  Semantic Relation            #  Semantic Relation
 1  Possession                  15  Source-From                 29  Possibility
 2  Kinship                     16  Topic                       30  Certainty
 3  Property-Attribute Holder   17  Manner                      31  Theme-Patient
 4  Agent                       18  Means                       32  Result
 5  Temporal                    19  Accompaniment-Companion     33  Stimulus
 6  Depiction                   20  Experiencer                 34  Extent
 7  Part-Whole                  21  Recipient                   35  Predicate
 8  Hyponymy                    22  Frequency                   36  Belief
 9  Entail                      23  Influence                   37  Goal
10  Cause                       24  Associated-with/Other       38  Meaning
11  Make-Produce                25  Measure                     39  Justification
12  Instrument                  26  Synonymy-Name               40  Explanation
13  Location-Space              27  Antonymy
14  Purpose                     28  Probability-of/Existence

Table 1: LCC's 40 Semantic Relations

These 40 relations have been carefully selected for their usefulness in natural language processing, for the feasibility of their automatic extraction from text, and for the broadest semantic coverage with the least amount of overlap. While no list will ever be perfect, LCC feels this list strikes a good balance between being too specific (too many relations, making reasoning difficult) and too general (not enough information to be useful).

As an example of semantic relations, consider the sentence "He carefully disarmed the letter bomb." The compound nominal "letter bomb" alone contains at least 5 semantic relations: letter bomb IS-A bomb, letter bomb IS-A letter, letter is the LOCATION of the bomb, bombing is the PURPOSE of letter bomb, and letter is the MEANS of bombing. The sentence also includes several other relations: He is the AGENT of disarmed; carefully is the MANNER of disarmed; and the letter bomb is the THEME (or object) of disarmed.
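The relations listed for this sentence can be held as simple typed triples and queried by role. The following Python sketch is illustrative only: the `Relation` class, the `arguments_of` and `event_summary` helpers, and the (argument, head) ordering are assumptions made for the example, not Polaris's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One semantic relation, written NAME(arg1, arg2)."""
    name: str  # relation label, e.g. "AGENT"
    arg1: str  # the argument filling the role
    arg2: str  # the concept the argument is related to


# Relations the text lists for "He carefully disarmed the letter bomb."
RELATIONS = [
    Relation("IS-A", "letter bomb", "bomb"),
    Relation("IS-A", "letter bomb", "letter"),
    Relation("LOCATION", "letter", "bomb"),
    Relation("PURPOSE", "bombing", "letter bomb"),
    Relation("MEANS", "letter", "bombing"),
    Relation("AGENT", "He", "disarmed"),
    Relation("MANNER", "carefully", "disarmed"),
    Relation("THEME", "letter bomb", "disarmed"),
]


def arguments_of(relations, name, target):
    """Return every arg1 linked to `target` by the relation `name`."""
    return [r.arg1 for r in relations if r.name == name and r.arg2 == target]


def event_summary(relations, verb):
    """Collect who did it, how, and to what, for one verb."""
    roles = ("AGENT", "MANNER", "THEME")
    return {role: arguments_of(relations, role, verb) for role in roles}


summary = event_summary(RELATIONS, "disarmed")
```

Here `summary` maps each role to the phrases that fill it for the verb, which is the kind of structured view of an event that a set of relations makes possible.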
Together, these semantic relations give a structured picture of the event: who was involved, what was done and to what, and the purpose of the object involved.

To find semantic relations in text, Polaris uses a combination of state-of-the-art text processing and machine learning techniques. In the first step, low-level NLP processing, such as named entity recognition, part-of-speech tagging, syntactic parsing and word sense disambiguation, is used to structure the text. The parse tree is then broken down into a number of syntactic patterns that Polaris can analyze. These syntactic patterns include verbs and their arguments, complex nominals, adjective phrases, adjective clauses, and others.

Polaris next runs classifiers on each section of text that matched a syntactic pattern. The classifiers examine features of the text and attempt to determine whether any of the 40 relations apply between the elements of the pattern. Most of the classifiers are based on one of four machine learning algorithms: Decision Trees, Naïve Bayes, Support Vector Machine (SVM), and Semantic Scattering (a new learning algorithm that uses WordNet classes to find the most probable relation that holds between two nouns [Badulescu]). Some of these machine-learning classifiers use a per-relation approach and output only the one specific relation they were trained to recognize, while others use a per-pattern approach which could potentially output any of the 40 semantic relations. Additionally, some classifiers containing human-coded rules are used for the most explicit and unambiguous cases. These three methods form a hybrid approach which produces better results than any one approach on its own.

As an example of actual system performance, Table 2 compares human-generated relations with the output Polaris produced for the sentence "Bin Laden reportedly purchased anthrax a half decade ago from a supplier in North Korea."
Human-generated relations                           System output
AGENT(Bin Laden, purchased)                         AGENT(Bin Laden, purchased)
TOPIC(purchased, reportedly)
THEME(anthrax, purchased)                           THEME(anthrax, purchased)
RECIPIENT(a supplier in North Korea, purchased)     LOCATION(from a supplier in North Korea, purchased)
TEMPORAL(a half decade ago, purchased)              TEMPORAL(a half decade ago, purchased)
MEASURE(a half, decade)                             PROPERTY(half, decade)
LOCATION(in North Korea, a supplier)                LOCATION(in North Korea, a supplier)

Table 2: List of relations discovered from example sentence

3.2 Jaguar

Jaguar automatically builds domain-specific ontologies by processing plain text from a variety of sources. These ontologies can be fine-tuned to contain the level of detail desired by an analyst. Ontologies built by Jaguar contain (i) ontological concepts, which are the basic building blocks of an ontology, (ii) a hierarchy, consisting of a structure imposed on certain ontological concepts via transitive relations that generally hold to be universally true (e.g. IS-A, part-whole, locative, etc.), and (iii) the contextual knowledge base, consisting of semantic contexts that encapsulate knowledge of events via semantic relations. Current work also includes a fourth component called Axioms on Demand, which captures assertions about knowledge and is useful for reasoning.

Jaguar is a complex text processing project, using both basic and advanced NLP tools to accomplish its task. The first step in the process is to filter and clean up the input text. Raw input to Jaguar can come from many types of sources, including Word documents, PDF files and web pages in HTML format, and is therefore prone to irregularities such as incomplete or strangely formatted sentences, headings, and tabular information. Jaguar's filtering mechanism is a crucial step that makes the input acceptable for the subsequent NLP tools.

A single run of Jaguar can be divided into two major processes: (i) text processing, and (ii) classification/hierarchy formation. In text processing, Jaguar is provided with a set of seeds which are used to determine the set of sentences of interest. Until recently, these were always selected manually; now, seeds can be generated automatically if desired and used in place of, or to augment, the manually selected seed set. The set of sentences selected based on the seeds goes through a set of NLP processing tools: named-entity recognition, part-of-speech tagging, parsing, word-sense disambiguation, coreference resolution, and semantic relation discovery (Polaris).
The resulting data structure is processed and used to populate one or more semantic contexts, which are groups of relations or nested contexts that hold true around a common central concept. Another aspect of text processing is concept discovery, which entails the discovery of noun concepts in sentences which are related to the target words or seeds. Each processed sentence is scanned for noun phrases, and targeted noun concepts are added to a local data structure for subsequent processing into the ontology's hierarchy. Figure 1 shows an example hierarchy and semantic context.

Classification is the determination of a hierarchical structure within a group of concepts. Isolated IS-A (hypernymy) relations are discovered in the text processing stage. Classification uses a set of well-formed and tested procedures to impose a hierarchical structure on the set of discovered concepts, and it uses WordNet [Miller] as its upper ontology. Details of these procedures are presented in [Moldovan and Girju]. Hypernymy relations discovered via classification may contain anomalies or redundancies. Jaguar contains a conflict resolution engine which detects and corrects possible inconsistencies. The hierarchies in Jaguar are created link by link (or relation by relation) and follow a conflict avoidance technique, wherein each new relation is tested for anomalies and redundancies before being added to the hierarchy.

Figure 1: Example Hierarchy and Semantic Context within a Knowledge Base

Although single runs of Jaguar yield rich ontologies, its real power lies in the option to layer ontologies from many different runs. Jaguar can currently merge disparate ontologies into one by using the aforementioned conflict resolution technique. The merge tool merges the two ontologies' concept sets, their hierarchies (using conflict resolution), and their knowledge bases (sets of semantic contexts).
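The link-by-link conflict avoidance and the hierarchy part of merging can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Jaguar's actual engine: the only anomaly checked is a cycle, the only redundancy checked is a link already implied by transitivity, and the `Hierarchy` class, its method names, and the example concepts are all hypothetical.

```python
class Hierarchy:
    """IS-A hierarchy built one link at a time with conflict avoidance."""

    def __init__(self):
        self.parents = {}  # concept -> set of direct hypernyms

    def ancestors(self, concept):
        """Return all direct and transitive hypernyms of `concept`."""
        seen, stack = set(), list(self.parents.get(concept, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents.get(node, ()))
        return seen

    def add_isa(self, child, parent):
        """Add 'child IS-A parent' unless it is anomalous or redundant."""
        # Anomaly: the link would make a concept its own ancestor.
        if child == parent or child in self.ancestors(parent):
            return "rejected: cycle"
        # Redundancy: the link is already present or implied transitively.
        if parent in self.ancestors(child):
            return "skipped: redundant"
        self.parents.setdefault(child, set()).add(parent)
        self.parents.setdefault(parent, set())
        return "added"

    def links(self):
        """Return all direct IS-A links as (child, parent) pairs."""
        return [(c, p) for c, ps in self.parents.items() for p in sorted(ps)]


def merge(first, second):
    """Merge two hierarchies by replaying every link through add_isa."""
    merged = Hierarchy()
    for child, parent in first.links() + second.links():
        merged.add_isa(child, parent)
    return merged


a = Hierarchy()
a.add_isa("anthrax", "biological agent")
a.add_isa("biological agent", "agent")

b = Hierarchy()
b.add_isa("sarin", "chemical agent")
b.add_isa("anthrax", "agent")  # implied by a's two links once merged

combined = merge(a, b)
```

Replaying links through the same check used during construction means the merged hierarchy satisfies the same constraints as each input. A fuller merge would also combine concept sets and semantic contexts, which this sketch leaves out.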
Merging is useful for distributed or parallel systems where small chunks of the input text may be processed on some portions of the system and then subsequently merged. It also provides a foundation for future work in contextual reasoning and epistemic logic. The result is a rich knowledge base which can be viewed at many different levels of granularity, providing an analyst with the level of detail desired.

4. Results

4.1 Polaris

As mentioned earlier, Polaris uses four machine learning algorithms to discover semantic relations in syntactic patterns: Semantic Scattering, Decision Trees, Naïve Bayes and Support Vector Machine. There are six primary pattern types discovered within noun phrases: N-N and Adj-N (which comprise compound nominals), 's and of (genitive patterns), adjective phrases, and adjective clauses. The first five are further subdivided into nominalized and non-nominalized occurrences, giving a total of 11 patterns discovered within noun phrases. Table 3 summarizes the accuracy over the training data of each machine learning algorithm for each noun phrase pattern. In this table, non-al refers to nominalized forms and al refers to non-nominalized. The training corpus source for the noun phrase patterns is Wall Street Journal (TreeBank 2), L.A. Times (TREC 9), and XWN 2.0 [Harabagiu and Moldovan].

Table 3: Machine Learning Accuracy for Noun Phrase Level (rows: Semantic Scattering, Decision Tree, Naïve Bayes, SVM; columns: complex nominals (NN, AdjN), genitives (of, 's), adjective phrases, and adjective clauses, split into al and non-al forms) [accuracy values not preserved]

There are also five argument level patterns being discovered: NP, NP, PP, ADVP, and S. Table 4 summarizes the accuracy over the training data for two machine learning algorithms. The training corpus source for the argument patterns is FrameNet [Baker]. Neither table is an indication of overall system score; however, if all inputs were perfect, each would indicate the expected best performance for the current system.

Table 4: Machine Learning for Verb Argument Level (rows: Decision Tree, SVM; columns: NP, NP, PP, ADVP, Verb S) [accuracy values not preserved]

LCC has created a benchmark corpus to evaluate the Polaris system. The corpus contains 300 sentences, but currently only 51 have been fully annotated due to the large manual effort required. Within these 51 sentences, human annotators discovered 683 total relations; 290 of these match the syntactic patterns that Polaris currently recognizes. A scorer program runs Polaris over these same 51 sentences and compares the generated relations to the human annotations. As of March 29, 2005, Polaris discovered 265 relations within the syntactic patterns that it uses. Of these, 94 were exact matches to the human annotations. Partial matches added a further 38.2 in discounted credit; in a partial match the relation type was correct and the argument bracketing at least overlapped, but the generated arguments had some extra or missing tokens. The partial matches are scored using precision, recall, and F-measure on the overlapping tokens. The total score for all matches, including discounting for partial matches, is shown in Table 5. The first column indicates performance on all human annotations, including those on syntactic patterns Polaris currently cannot see.
The second column shows performance within the syntactic patterns Polaris currently recognizes, and is a better indication of the overall potential of Polaris' approach if it were extended to include more syntactic patterns.

Measured over:   All relations   Only relations covered by syntactic patterns
Precision        49.89% (shared by both columns)
Recall           19.63%          50.04%
F-Measure        28.18%          49.96%

Table 5: Polaris System Score

The numbers continue to improve but are obviously not perfect. There are many reasons for this, resulting from both external and internal factors. The external NLP techniques which Polaris depends on offer varying degrees of precision. Automatic word sense disambiguation is percent accurate for nouns, and lower than that for verbs. Syntactic parsing is close to 90 percent accurate for subtrees, but this precision degenerates to somewhere between 50 and 70 percent for an entire, complex sentence. The part of speech tagger is around 95 percent accurate, and the named entity tagger ranges from percent accuracy. Additionally, there is currently no true coreference resolution library. Multiplying the accuracies of each tool which Polaris depends upon suggests that the likelihood of a fully accurate analysis is below 50 percent on real-world, complex sentences.

Internally, there are also many issues which affect precision and recall. The training data has a fair number of issues: insufficient examples for some syntactic patterns or semantic relations; narrow domain for the training corpora; inconsistency in the order of relation arguments; noisy data; and lack of a one-to-one mapping to the source. Additionally, there are currently not enough features for each of the semantic relations. Relation arguments are often ambiguous within a parse tree structure, and syntactic patterns do not always capture all relations. The machine learning classifiers tend to return only one relation per syntactic pattern even if there are multiple possibilities.
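The reported scores and the compounding argument can be checked with a short calculation. The Python sketch below rests on assumptions the text does not state explicitly: it uses the standard balanced F-measure 2PR/(P+R), reads the 38.2 partial-match figure as discounted credit added to the 94 exact matches, and treats the errors of the component tools as independent. The function names are illustrative.

```python
from math import prod


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def compound_accuracy(accuracies):
    """Chance that every stage is right, assuming independent errors."""
    return prod(accuracies)


# Precision: exact matches plus discounted partial credit, over the
# 265 relations Polaris generated.
precision = (94 + 38.2) / 265

# F-measure for each column of Table 5, from its precision and recall.
f_all_relations = f_measure(0.4989, 0.1963)
f_covered_only = f_measure(0.4989, 0.5004)

# Only two component accuracies survive in the text: the part-of-speech
# tagger (about 0.95) and whole-sentence parsing (0.50 to 0.70).
best_case = compound_accuracy([0.95, 0.70])
worst_case = compound_accuracy([0.95, 0.50])
```

Under these assumptions the tagger and whole-sentence parser alone give a combined accuracy between 0.475 and 0.665. Each further imperfect component, such as word sense disambiguation or named entity tagging, multiplies this by a factor below one, which is consistent with the estimate of under 50 percent.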
There are also issues caused by metonymy (figures of speech) and multiple relations found within the same phrase. Work is being done on all of these areas to help improve precision and recall.

4.2 Jaguar

LCC has recently developed a battery of evaluation metrics to assess the quality of ontologies. They are summarized in Table 6.

Metric Name                   Metric Description
Conceptual Precision (CP)     number of well-formed and relevant concepts in the ontology divided by the total number of concepts in the ontology
Subsumption Precision (SP)    number of correct subsumption links in the ontology divided by the total number of subsumption links in the ontology
Conceptual Recall (CR)        number of well-formed and relevant concepts in the ontology divided by the union of this number and this number from a reference ontology
Subsumption Recall (SR)       number of correct subsumption links in the ontology divided by the union of this number and this number from a reference ontology
Unlinked Concepts (UC)        proportion of orphan concepts in the ontology
Conceptual Expansion (CE)     proportional difference between number of seed concepts and number of concepts in generated ontology

Table 6: Ontology Evaluation Metrics

These ontology evaluation metrics were used to evaluate two versions of Jaguar, one which uses a manually selected set of seed concepts and one which selects seeds automatically. The document collection used for this evaluation was 5.67 megabytes of text from a CNS (Center for Nonproliferation Studies) corpus focused on chemical and biological weapons. The manually selected seed set consisted of 158 concepts associated with biological agents and weapons, and the automatically selected seed set consisted of 100 concepts. Both sets were used as input to Jaguar to create two separate ontologies for the biological weapons and agents domain. Two manually built, hand-edited ontologies focusing on the biological weapons domain were used as reference ontologies.
These reference ontologies were pruned from the original ontologies to remove information about chemical and nuclear weapons, and one of them was additionally pruned to remove concepts not found in the document collection. The first reference ontology, which contains 151 concepts and 208 subsumption links, will be referred to as BW-manual, and the second one, which contains 68 concepts and 93 subsumption links, will be referred to as BW-manual-filtered.

Jaguar was run two times, first with the 158 manually selected seeds (labeled BW-KAT1), and second with the 100 automatically selected seeds (labeled BW-KAT2). BW-KAT1 contained 4,712 concepts, with 896 considered to be well-formed and relevant to the domain; 85 of these were unsubsumed, and 756 of the remaining 811 subsumed concepts were considered to be accurate when checked manually. BW-KAT2 contained 7,197 concepts, with 1,147 considered to be well-formed and relevant to the domain; 68 of these were unsubsumed, and 977 of the remaining 1,079 subsumed concepts were considered to be accurate. The metrics described above are summarized for BW-KAT1 and BW-KAT2 in Table 7.

With the exception of conceptual precision, which is below 20% in both runs, the results are good: subsumption precision exceeds 90% in both runs. The results are also very comparable between the manual and automatic selection of seeds. There are, however, still issues which need to be addressed to improve the results. Due to its dependency on Polaris, Jaguar also depends on a number of lower level NLP components. Their shortcomings, and their effect on Polaris, were discussed above and carry over to Jaguar. Improvement in lower level components should increase the performance of Jaguar. There is still a fair amount of noise in the input to Jaguar, and better filtering techniques will increase the overall quality of the resultant ontology. The classifier uses a variety of heuristics, many of which possess some degree of ambiguity.
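The metric definitions can be applied directly to the counts reported for the two runs. The Python sketch below is a recomputation under assumptions, not LCC's scorer: it reads the "union" in the recall definitions as a simple sum of the run's count and the reference ontology's count, takes the number of subsumed concepts as the number of subsumption links, and computes Unlinked Concepts relative to the well-formed, relevant concepts. The dictionary layout and the `evaluate` function are hypothetical.

```python
# Counts reported in the text for the two Jaguar runs.
RUNS = {
    "BW-KAT1": {"concepts": 4712, "relevant": 896, "unlinked": 85,
                "links": 811, "correct_links": 756, "seeds": 158},
    "BW-KAT2": {"concepts": 7197, "relevant": 1147, "unlinked": 68,
                "links": 1079, "correct_links": 977, "seeds": 100},
}

# Sizes of the two hand-built reference ontologies, in the order used
# for the numbered recall variants (1) and (2).
REFERENCES = {
    "BW-manual": {"concepts": 151, "links": 208},
    "BW-manual-filtered": {"concepts": 68, "links": 93},
}


def pct(value):
    """Express a ratio as a percentage rounded to two decimals."""
    return round(100 * value, 2)


def evaluate(run_name):
    """Compute the ontology evaluation metrics for one run."""
    r = RUNS[run_name]
    scores = {
        "CP": pct(r["relevant"] / r["concepts"]),
        "SP": pct(r["correct_links"] / r["links"]),
        "UC": pct(r["unlinked"] / r["relevant"]),
        "CE": pct((r["relevant"] - r["seeds"]) / r["seeds"]),
    }
    for i, ref in enumerate(REFERENCES.values(), start=1):
        # Assumption: the "union" is the plain sum of the two counts.
        scores[f"CR{i}"] = pct(
            r["relevant"] / (r["relevant"] + ref["concepts"]))
        scores[f"SR{i}"] = pct(
            r["correct_links"] / (r["correct_links"] + ref["links"]))
    return scores


kat1 = evaluate("BW-KAT1")
kat2 = evaluate("BW-KAT2")
```

Reading the union as a sum is a guess, but it reproduces the recall percentages reported for BW-KAT2 against both reference ontologies, which suggests overlap between the generated and reference ontologies was not subtracted.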
Additionally, anomalies in the hypernymy tree, such as two very different concepts sharing the same hypernym several levels removed, introduce more noise into the data. Conflict resolution is still being researched, and though an initial implementation is in place, further refinement should also improve the quality of the built ontologies.

Metric                        BW-KAT1                       BW-KAT2
Conceptual Precision (CP)     19.02% (896/4712)             15.94% (1147/7197)
Subsumption Precision (SP)    93.22% (756/811)              90.55% (977/1079)
Conceptual Recall (1) CR      85.58% (896/(896+151))        88.37% (1147/(1147+151))
Conceptual Recall (2) CR      92.95% (896/(896+68))         94.40% (1147/(1147+68))
Subsumption Recall (1) SR     78.42% (756/(756+208))        82.45% (977/(977+208))
Subsumption Recall (2) SR     89.05% (756/(756+93))         91.31% (977/(977+93))
Conceptual Expansion (CE)     467% ((896-158)/158)          1047% ((1147-100)/100)
Unlinked Concepts (UC)        9.49% (85/896)                5.93% (68/1147)

Table 7: Results of Jaguar Evaluation

Much effort has been made to build a collection of domain-specific ontologies on a regular and automatic basis. Using web harvesting tools developed at LCC, Jaguar has been extended to build ontologies automatically from the web. Seed concepts are used as query keywords for a search engine like Google, and found documents are ranked accordingly and then processed by Jaguar. Over 30 different ontologies have been built which include IS-A hierarchies; work is being done to augment them with other relation types, such as part-whole and locative. Example domains which have been built and made available via the web include HR, biological weapons, Al Qaeda, North Korean Nuclear Program, acid rain, and trains.

5. Conclusion

LCC has made great strides toward extracting, structuring, and maintaining knowledge which can assist an analyst in higher levels of critical thinking for better analysis, but there is still much work to be done. Continued improvement of the quality of knowledge extracted and the relationships between chunks of knowledge is needed to ensure that the most useful information is always available to the analyst. More detailed work on extracting and formulating Axioms on Demand will allow ontologies to become more useful knowledge bases. Work on reasoning will allow the system to perform preliminary analysis and present it to the analyst to aid the critical thinking process. Mechanisms for connecting with disparate knowledge bases and ontologies are also being explored to improve the utility and structure of knowledge available to the analyst. This work has already had a large impact on text processing by helping to bridge the gap to machine text understanding, enabling powerful technologies like QA, reasoning and inference, IE, and summarization. Overall, the current system provides a very strong foundation for future endeavors and possesses a great deal of utility in its own right.

Acknowledgments

This material is based upon work funded in part by the U.S. Government and any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government. Thanks to Altaf Mohammed, Lowell Boggs, Adriana Badulescu, and Ian Niles for their contributions.

References

Adriana Badulescu. Classification of Semantic Relations Between Nouns. Ph.D. Dissertation, University of Texas at Dallas.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet Project. In Proceedings of COLING/ACL '98: Montreal, Canada.

Roxana Girju, et al. Support Vector Machines Applied to the Classification of Semantic Relations in Nominalized Noun Phrases. In Proc. of the Lexical Semantics Workshop, HLT 2004, Boston.

Sanda Harabagiu and Dan Moldovan. Knowledge Processing on an Extended WordNet. WordNet: An Electronic Lexical Database. MIT Press, C. Fellbaum editor.

Richards J. Heuer, Jr. Psychology of Intelligence Analysis, Center for the Study of Intelligence, Central Intelligence Agency.

George Miller. WordNet: a lexical database for English. Communications of the ACM, Vol. 38, No. 11: 39-41.

Dan I. Moldovan and Roxana C. Girju. An Interactive Tool for the Rapid Development of Knowledge Bases. International Journal on Artificial Intelligence Tools, Vol. 10, No. 1-2, March.