An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models

Transcription

1 Dissertation (Ph.D. Thesis) An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models Christian Siefkes Disputationen: 16th February 2007 Primary Supervisor: Prof. Dr. Heinz F. Schweppe Database and Information Systems Group Fachbereich Mathematik und Informatik Freie Universität Berlin

2 Supervisors: Prof. Dr. Heinz F. Schweppe Database and Information Systems Group Institute for Computer Science Freie Universität Berlin Prof. Dr. Bernhard Thalheim Systems for Information Management Institute of Computer Science and Applied Mathematics Christian-Albrechts-Universität zu Kiel This thesis has been supported by the German Research Society, Berlin-Brandenburg Graduate School in Distributed Information Systems (DFG grant no. GRK 316).

3 Dedicated to the memory of my parents, Uta Siefkes ( ) and Harm Siefkes ( )

4

5 Abstract Most of the information stored in digital form is hidden in natural language (NL) texts. While information retrieval (IR) helps to locate documents which might contain the facts needed, there is no way to answer queries. The purpose of information extraction (IE) is to find desired pieces of information in NL texts and store them in a form that is suitable for automatic querying and processing. The goal of this thesis has been the development and evaluation of a trainable statistical IE approach. This approach introduces new functionality not supported by current IE systems, such as support for incremental training to reduce the human training effort by allowing a more interactive workflow. The IE system introduced in this thesis is designed as a generic framework for statistical classification-based information extraction that allows modifying and exchanging all core components (such as classification algorithm, context representations, tagging strategies) independently of each other. The thesis includes a systematic analysis of switching one such component (the tagging strategies). Several new sources of information are explored for improving extraction quality. Especially we introduce rich tree-based context representations that combine document structure and generic XML markup with more conventional linguistic and semantic sources of information. Preparing these rich context representations makes it necessary to unify various and partially conflicting sources of information (such as structural markup and linguistic annotations) in XML-style trees. For this purpose, we develop a merging algorithm that can repair nesting errors and related problems in XML-like input. As the core of the classification-based IE approach, we introduce a generic classification algorithm (Winnow+OSB) that combines online learning with novel feature combination techniques. We show that this algorithm is not only suitable for information extraction, but also for other tasks such as text classification. Among other good results, the classifier was found to be one of the two best filters submitted for the 2005 Spam Filtering Task of the Text REtrieval Conference (TREC). The thesis includes a detailed evaluation of the resulting IE which shows that the results reached by our system are better than or competitive with those of other stateof-the-art IE systems. The evaluation includes an ablation study that measures the influence of various factors on the overall results and finds that all of them contribute to the good results of our system. It also includes an analysis of the utility of interactive incremental training that confirms that this newly introduced training regimen can be very helpful for reducing the human training effort. The quantitative evaluation is complemented with an analysis of the kinds of mistakes made during extraction and their likely causes that allows a better understanding of where and how we can expect further improvements in information extraction quality to be made and which limits might exist for information extraction systems in general.

6

7 Wir sehen ein kompliziertes Netz von Ähnlichkeiten, die einander übergreifen und kreuzen. Ähnlichkeiten im Großen und Kleinen. Ich kann diese Ähnlichkeiten nicht besser charakterisieren als durch das Wort Familienähnlichkeiten ; denn so übergreifen und kreuzen sich die verschiedenen Ähnlichkeiten, die zwischen den Gliedern einer Familie bestehen: Wuchs, Gesichtszüge, Augenfarbe, Gang, Temperament, etc. etc. Und ich werde sagen: die Spiele bilden eine Familie. [... ] Wie würden wir denn jemandem erklären, was ein Spiel ist? Ich glaube, wir werden ihm Spiele beschreiben, und wir könnten der Beschreibung hinzufügen: das, und Ähnliches, nennt man,spiele. Und wissen wir selbst denn mehr? Können wir etwa nur dem Anderen nicht genau sagen, was ein Spiel ist? Aber das ist nicht Unwissenheit. Wir kennen die Grenzen nicht, weil keine gezogen sind. Ludwig Wittgenstein, Philosophische Untersuchungen Leeloo: Hello. Korben Dallas: Oh, so you speak English now. Leeloo: Yes. I learned. The Fifth Element (1997)

8

9 Contents 1 Introduction Motivation and Goals Contributions Outline of this Work Acknowledgments I The Field of Information Extraction 13 2 Information Extraction Information Retrieval, Text Mining, and Other Related Areas Overview and Classification of Approaches Architecture and Workflow Tasks to Handle Architecture of a Typical IE System Active Learning and Incremental Learning Workflow Statistical Approaches Probabilistic Semantic Parsing Hidden Markov Models Maximum Entropy Markov Models and Conditional Random Fields Token Classification Fragment Classification and Bayesian Networks Non-Statistical Approaches Covering Algorithms Relational Rule Learners Wrapper Induction Hybrid Approaches Knowledge-based Approaches Comparison of Existing Approaches Types of Tasks Handled Types of Texts Handled Considered Features Tagging Requirements and Learning Characteristics

10 Contents II Analysis 47 7 Aims and Requirements Aims of Our Approach Further Requirements Chosen Approach Non-Goals Assumptions Novel Assumptions General Assumptions Suitability of Tasks Target Schemas and Input/Output Models Target Schemas Formats for Input Texts Input Formats for Answer Keys Serialization of Extracted Attribute Values III Algorithms and Models Modeling Information Extraction as a Classification Task Idea and Concept Tagging Strategies Classification Algorithm and Feature Combination Techniques The Winnow Classification Algorithm Feature Combination Techniques Alternative Classification Algorithms and Implementations Preprocessing and Context Representation Preprocessing Tree-based Context Representation Tokenization Merging Conflicting and Incomplete XML Markup Introduction and Motivation Types of Errors in XML-like Input Configurable Settings and Heuristics for Repair Algorithm Description Limitations Application in Our Approach Related Work Weakly Hierarchical Extraction 99 2

11 Contents 14.1 Introduction Inheritance Hierarchies of Attributes Strictly Hierarchical Approach and Related Problems Weakly Hierarchical Approach Integration into Information Extraction Approach IV Evaluation Evaluation Goals and Metrics Goals and Limitations of Quantitative Evaluation Evaluation Methodology Evaluation Metrics Text Classification Experiments Introduction Text Classification Setup for Spam Filtering Experimental Results on the SpamAssassin Corpus TREC Spam Filtering Challenge Concluding Remarks Extraction of Attribute Values Test Corpora Evaluation Results for the Seminar Announcements Corpus Evaluation Results for the Corporate Acquisitions Corpus Ablation Study and Utility of Incremental Training Ablation Study Utility of Interactive Incremental Training Comparison of Tagging Strategies Idea and Setup Comparison Results Analysis Weakly Hierarchical Extraction Experimental Setup Experimental Results Concluding Remarks Mistake Analysis Mistake Types Distribution of Mistakes Type Confusion Additional Manual Analysis Length Analysis

12 Contents V Conclusions Conclusion and Outlook Discussion of Results Summary of Contributions Future Work Bibliography 175 A Schema for Augmented Text 185 B Curriculum Vitae 189 C Zusammenfassung in deutscher Sprache 191 4

13 List of Tables 2.1 Applications of IR, IE, and Text Mining Overview of the Selected Approaches and Systems Example of External Answer Keys Properties of Tagging Strategies Labeling Example SBPH 5 Feature Combinations Containing f Features Generated by SBPH and OSB Regular Expressions Used for Tokenization Promotion and Demotion Factors Threshold Thickness Comparison of SBPH and OSB with Different Feature Storage Sizes Utility of Single Tokens (Unigrams) Sliding Window Size Preprocessing Comparison with Naive Bayes and CRM Results on the Seminar Corpus System Comparison on the Seminar Corpus (F-measure) Results on the Acquisitions Corpus System Comparison on the Acquisitions Corpus (F-measure) Ablation Study: Seminar Announcements Ablation Study: Corporate Acquisitions Results with Incremental Feedback Incremental Feedback: User Effort for Correcting the Training Set Incremental Feedback: User Effort for Correcting the Evaluation Set F-measure Percentages for Incremental Training F-measure Percentages for Batch Training Incremental Training: Significance of Changes Compared to IOB Batch Training: Significance of Changes Compared to IOB Recall Reached by Supertype Recognizers on Subtype Answer Keys Seminar Corpus: Length Distribution of Answer Keys Acquisitions Corpus: Length Distribution of Answer Keys

14

15 List of Figures 3.1 Tasks to Be Handled Architecture of a Typical IE System Sample Interface: Information Extraction from Messages Sample Input Text Sample Text with Inline Annotations Partial DOM Tree of a Simple HTML Document with Linguistic Annotations Processed File from the Seminar Announcements Corpus Inverted Subtree of the Elements Considered for a Context Representation Learning Curve for the best setting (Winnow 1.23,0.83,5%, 600,000 features, OSB 5 ) ROC curve for the best filters (Source: [Cor05, Fig. 2]) Results on the Seminar Corpus Seminar Corpus: Precision and Recall Improvements System Comparison: F-measure Averages on the Seminar Corpus Results on the Acquisitions Corpus Acquisitions Corpus: Precision and Recall Improvements System Comparison: F-measure Averages on the Acquisitions Corpus Ablation Study: Seminar Announcements Ablation Study: Corporate Acquisitions Incremental Feedback: Learning Curve (average precision, recall, and F-measure on all documents processed so far) Incremental Feedback: Correct, Missing, and Spurious Predictions in the Training Set Incremental Feedback: Correct, Missing, and Spurious Predictions in the Evaluation Set Seminar Corpus: Inheritance Hierarchy Acquisitions Corpus: Inheritance Hierarchy Seminar Corpus: F-measure Results Acquisitions Corpus: F-measure Results Acquisitions Corpus: Collapsing Short and Long Names Seminar Corpus: Temporal Predictions Seminar Corpus: Mistakes Combinations

16 List of Figures 21.2 Seminar Corpus: Distribution of Mistake Types Acquisitions Corpus: Mistakes Combinations Acquisitions Corpus: Distribution of Mistake Types Seminar Corpus: Confusion Matrix (expected type predicted type) Acquisitions Corpus: Confusion Matrix (expected type predicted type) Seminar Corpus: Precision and Recall by Token Length Seminar Corpus: F-Measure by Token Length Seminar Corpus: Weighted Averages by Token Length Acquisitions Corpus: Precision and Recall by Token Length Acquisitions Corpus: F-Measure by Token Length Acquisitions Corpus: Weighted Averages by Token Length