Introduction to Text Mining. Module 2: Information Extraction in GATE




Introduction to Text Mining Module 2: Information Extraction in GATE The University of Sheffield, 1995-2013 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence

Aims of this module This module follows from Module 1, which introduces the concepts of text mining, information extraction (and its evaluation), and gives an overview of the GATE architecture Here we introduce the applications for Named Entity Recognition, term extraction and event detection developed in GATE for ARCOMEM We also discuss some alternative approaches using Machine Learning

Research Directions Addressed In ARCOMEM, we advance the state of the art in adapting language processing resources to new domains and languages, especially in the domain of social media. Development of new methodologies for language processing on social media, particularly on degraded texts (tweets etc.). Development of new event recognition components for GATE, using new linguistic techniques and resources, and experiments with ML; in particular, event recognition for social media is a new direction. Investigation of how entities and events can be effectively used for opinion mining, topic detection, and crawling.

ARCOMEM offline processing architecture GATE components

Entity Recognition The application for Entity Recognition is based on (modified versions of) ANNIE and TermRaider It consists of a set of Processing Resources (PRs) executed sequentially over the corpus of documents Document Pre-processing Linguistic Pre-processing Named Entity Recognition Term Extraction RDF generation

Document annotated by ANNIE

Named Entity Recognition Adapted default ANNIE NE system to deal with specific aspects of social media (e.g. formatting issues of forums and tweets, relaxed use of English) More about processing social media in Module 4 Terms and NEs are assimilated (e.g. a term is removed if it's part of a longer NE) and output as Entities Entities are then used as input for the Event recognition module

Document Pre-processing Separate the body of the content from the rest, for Facebook pages and tweets, using information from the original document metadata (HTML/XML tags), so that only the relevant parts of the document are annotated. Use the Boilerpipe plugin to separate other kinds of non-content information (JavaScript, navigational links etc.) from the content we want to analyse

The Boilerpipe PR in GATE In a closed domain, you can often write some JAPE rules to separate real document content from headers, footers, menus etc. In many cases, or when dealing with texts of different kinds or in different formats, it can get much trickier Boilerpipe PR provides algorithms to separate the surplus clutter (boilerplate, templates) from the main textual content of a web page. Applies the Boilerpipe Library to a GATE document in order to annotate the content, the boilerpipe, or both. Due to the way in which the library works, not all features from the library are currently available through the GATE PR
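The intuition behind this kind of content/boilerplate separation can be sketched with a toy link-density heuristic. This is an illustration of the idea only, not the actual Boilerpipe algorithm, which combines several text features:

```python
# Toy illustration of boilerplate detection by link density (NOT the real
# Boilerpipe algorithm): navigational blocks contain many links relative
# to their text length, while main content is mostly plain text.

def link_density(text_len, linked_text_len):
    """Fraction of a block's characters that sit inside links."""
    return linked_text_len / text_len if text_len else 1.0

def classify_block(text_len, linked_text_len, threshold=0.33):
    """Label a text block as 'content' or 'boilerplate'."""
    if link_density(text_len, linked_text_len) > threshold:
        return "boilerplate"
    return "content"

print(classify_block(text_len=400, linked_text_len=20))  # long paragraph, one link
print(classify_block(text_len=60, linked_text_len=55))   # navigation menu
```

The threshold value here is an arbitrary illustration; real extractors learn or tune such decision boundaries over many block-level features.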

Original HTML document

Processed Document Only the Content parts are used for further analysis

Linguistic Pre-Processing We use the following PRs: Tokeniser Sentence Splitter Language Identification POS tagger Morphological analyser For social media such as twitter, we use a specially adapted version of some of these (see Module 4)

Named Entity Recognition The following entity types are recognised by the NER component, corresponding to the data model Persons (e.g. artists, politicians, web 2.0 users) Organizations (e.g. companies, music bands, political parties) Locations (e.g. cities, countries) Dates and times (of events and content publication) NER consists of: Gazetteer lookup JAPE grammars Co-reference

Document annotated with Gazetteer Lookups

Document annotated with NEs

Document annotated with co-references items in the co-reference chain

TermRaider GATE plugin for detecting single and multi-word terms Based on a simple web service developed in the NeOn project, now extended to run in GATE, with visualisation tools and extended functionality (new scoring systems, and an adaptation for German). Runs in GATE Developer (GUI) or on the command line with RDF and CSV output Terms are ranked according to three possible scoring systems: tf.idf = term frequency (nbr of times the term occurs in the corpus) divided by document frequency (nbr of documents in which the term occurs) augmented tf.idf = after scoring tf.idf, the scores of hypernyms are boosted by the scores of hyponyms Kyoto domain relevance = document frequency × (1 + nbr of hyponyms in the corpus) (Bosma and Vossen 2010)
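Following the slide's own definitions, the basic tf.idf score (corpus term frequency divided by document frequency) can be sketched like this; the other two scoring systems additionally need the hypernym/hyponym relations between term candidates:

```python
# Sketch of the basic term-scoring step, using the slide's definitions:
# tf = occurrences of the term across the whole corpus,
# df = number of documents containing the term.
from collections import Counter

def score_terms(corpus):
    """corpus: list of documents, each given as a list of candidate terms."""
    tf = Counter(term for doc in corpus for term in doc)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: tf[term] / df[term] for term in tf}

corpus = [
    ["financial crisis", "bailout", "financial crisis"],
    ["bailout", "election"],
]
print(score_terms(corpus))
# "financial crisis": tf=2, df=1 -> 2.0; "bailout": tf=2, df=2 -> 1.0
```

Under this scheme a term scores highly when it occurs often but is concentrated in few documents, which is the usual termhood intuition.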

TermRaider: Methodology After linguistic pre-processing (tokenisation, lemmatisation, POS tagging etc.), nouns and noun phrases are identified as initial term candidates Noun phrases include post-modifiers such as prepositional phrases, and are marked with head information for determining hyponymy. Nested nouns and noun phrases are all marked as candidates. Term candidates are then scored in 3 ways. The results can be viewed in the GATE GUI, exported as RDF according to the ARCOMEM data model, or saved as CSV files The viewer can be used to adjust the cutoff parameter. This is used to determine the score threshold for a term to be considered valid Terms can also be shown as a tag cloud

Term candidates in a document

Try TermRaider in GATE Load the TermRaider plugin in GATE (click the plugin icon) Load a corpus (around 20-100 documents on a similar topic is ideal, e.g. the US elections) Load TermRaider from the Ready-made Applications and run it on the corpus Note: this is the default TermRaider which is a little different from the one used in ARCOMEM (and not optimised for social media) Inspect the results (click on SingleWord, MultiWord or Candidate Term in the document viewer)

Top terms from Greek Financial Crisis corpus

Terms can be exported as a tag cloud

Viewing termbanks in GATE Double click on any termbank in GATE to view the list of terms Expand the tree to view the documents where the term was found Click the frequency tab to show a sortable list of term, term frequency and document frequency in the corpus Click the term cloud tab to view the term clouds. Drag the slider towards cloudy to show more terms, or towards sunny to show fewer terms Term clouds can be sorted by type: term frequency, term score or document frequency

German Entity Extraction Adaptation of TermRaider for German texts (uses some different pre-processing) System for German NER is based on ANNIE (with similar functionality), and uses German-specific resources (TreeTagger for lemmatisation and POS tagging, tailored gazetteers and grammars)

Entities in the Rock am Ring forum

How to process documents in multiple languages In GATE, you can set a processing resource in your application to run or not depending on certain circumstances You can have several different PRs loaded, and let the system automatically choose which one to run, for each document. This is very helpful when you have texts in multiple languages, or of different types, which might require different kinds of processing For example, if you have a mixture of German and English documents in your corpus, you might have some PRs which are language-dependent and some which are not You can set up the application to run the relevant PRs on the right documents automatically.
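The conditional execution described above can be sketched as a pipeline in which each PR carries an optional condition on the document's language feature. The processor names below are hypothetical stand-ins for real GATE PRs:

```python
# Sketch of conditional processing: each document carries a 'lang' feature
# (as set by a language identifier such as TextCat), and language-specific
# PRs run only when their condition matches. Processor names are toy stand-ins.

def tokenise(doc):     doc["annotations"].append("Tokeniser")
def english_ner(doc):  doc["annotations"].append("English NER")
def german_ner(doc):   doc["annotations"].append("German NER")

pipeline = [
    (tokenise, None),                                 # language-neutral: always runs
    (english_ner, lambda d: d["lang"] == "english"),  # conditional PRs
    (german_ner, lambda d: d["lang"] == "german"),
]

def run(doc):
    for pr, condition in pipeline:
        if condition is None or condition(doc):
            pr(doc)
    return doc

doc = run({"lang": "german", "annotations": []})
print(doc["annotations"])  # ['Tokeniser', 'German NER']
```

This mirrors the GUI workflow described below, where clicking the yellow dot on a PR in a conditional corpus pipeline sets the feature name and value it should test.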

What if a single document is in multiple languages? We can run TextCat over each sentence separately and label it with the language We can then use a plugin in GATE called the Segment Processing PR This enables us to process labelled sections of a document independently, one at a time, and merges back the individual sections once they've been processed Useful when: you want annotations in different sections to be independent of each other; you only want to process certain sections within a document; you are processing multiple languages within a single document
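A minimal sketch of this per-segment approach, with a toy language identifier standing in for TextCat:

```python
# Sketch of per-segment processing: each sentence is labelled with a language
# (as TextCat would do), processed independently with the matching resources,
# and the results are merged back. 'identify_language' is a toy stand-in.

def identify_language(sentence):
    # Crude illustration: treat the German article "der" as a language cue.
    return "german" if "der" in sentence.split() else "english"

def process(sentence, lang):
    # Stand-in for running the language-specific application on one segment.
    return f"[{lang}] {sentence}"

def segment_process(sentences):
    """Label, process, and merge each sentence independently."""
    return [process(s, identify_language(s)) for s in sentences]

print(segment_process(["GATE is a text mining platform.",
                       "Hier ist der deutsche Satz."]))
# ['[english] GATE is a text mining platform.', '[german] Hier ist der deutsche Satz.']
```

The key property, as in the Segment Processing PR, is that annotations produced for one segment cannot interfere with those of another.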

How to run ANNIE conditionally Load ANNIE with defaults and add the TextCat PR Now let's make the PRs conditional We only want ANNIE to run on English documents, not German ones For each PR after the TextCat PR in the pipeline, we can click on the yellow dot and set it to run only if the value of the feature lang is english Only the English one should have annotations

The application should look like this

What if we want to process the German too? If we want to process both German and English documents with different resources, we have a couple of options 1. We can just call some language-specific PRs conditionally, and use the language-neutral PRs on all documents 2. We can call different applications from within the main application

Running both applications conditionally Load ANNIE with defaults Load the German IE application from Ready-made applications Create a new conditional corpus pipeline Load a TextCat PR and add it to the new pipeline created Add the ANNIE and German applications to the pipeline (in either order) after the TextCat Set ANNIE to run on English documents and the German app to run on German ones Save the main application and run it on your corpus

Running applications conditionally

Expression of events Events can be expressed by: verbal predicates and their arguments (e.g. "The committee dismissed the proposal"); noun phrases headed by nominalizations (e.g. "economic growth"); adjective-noun combinations (e.g. "governmental measure", "public money"); event-referring nouns (e.g. "crisis", "cash injection").

Rule-based event extraction in GATE Recognition of entities and the relations between them in order to find domain-specific events and situations. In a (semi-)closed domain, this approach is preferable to an open IE-based approach which holds no preconceptions about the kinds of entities and relations possible. The application for event recognition is designed to follow the entity recognition application, so no linguistic pre-processing is necessary Basic approach involves finding event-indicative seed words (e.g. "downturn" might indicate an economic event) and some linguistic relation to existing entities (e.g. "Greece") in the sentence.
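A toy version of the seed-word rule: a sentence yields an event candidate only when an event-indicative seed word co-occurs with a recognised entity. The seed list and entity list below are illustrative, and the real JAPE grammars test linguistic relations rather than bare co-occurrence:

```python
# Toy seed-word event rule: fire an event candidate when an event-indicative
# seed word co-occurs with a previously recognised entity in the sentence.
# Seed and entity lists are illustrative samples only.

EVENT_SEEDS = {"downturn": "EconomicEvent", "election": "PoliticalEvent"}
ENTITIES = {"Greece": "Location", "ECB": "Organization"}

def extract_events(sentence):
    """Return (event_type, seed_word, entities) triples for one sentence."""
    tokens = sentence.split()
    seeds = [(t, EVENT_SEEDS[t]) for t in tokens if t in EVENT_SEEDS]
    entities = [t for t in tokens if t in ENTITIES]
    # Fire only when a seed co-occurs with at least one entity.
    return [(etype, seed, entities) for seed, etype in seeds if entities]

print(extract_events("The downturn hit Greece hard"))
# [('EconomicEvent', 'downturn', ['Greece'])]
```

A seed word with no entity in the sentence (or vice versa) produces nothing, which is why the event application is designed to run after entity recognition.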

Event extraction application The application in GATE consists of the following Processing Resources: event gazetteer verb phrase chunker event recognition JAPE grammars RDF generation As with the NE recognition module, a German version is also available, and conditional processors decide, based on the language identified for each sentence, which version to run.

Events annotated in GATE

Recognising TimeML Events TimeML is a robust specification language for events and temporal expressions in natural language. There have been a number of evaluation challenges associated with event recognition using TimeML, which provide a source of annotated data for experimentation The guidelines are a little different to our generic event definition in ARCOMEM, but they make a useful experimentation ground for us

TIMEML specification 7 classes of event specified: Reporting: action of a person or organisation declaring or narrating an event (e.g. "say") Perception: physical perception of another event (e.g. "see", "hear") Aspectual: aspectual predication of another event (e.g. "start", "continue") I_Action: intensional action (e.g. "try") I_State: intensional state (e.g. "feel", "hope") State: circumstance in which something holds true (e.g. "war", "in danger") Occurrence: events that describe things that happen (e.g. "erupt", "arrive"). Of these, I_Action and Occurrence are most relevant for ARCOMEM

ML application in GATE for TimeML event detection We use the PAUM algorithm (Perceptron with Uneven Margins) as it's most suitable and efficient for this kind of classification task You can also use SVM, which is slightly better, but much slower We use lemmatised tokens, POS tags and gazetteer lookup as input for the Machine Learning. Based on this linguistic information, an input vector is constructed for each token, as we iterate through the tokens in each document (including word, number, punctuation and other symbols) to see if the current token belongs to an information entity or not. Since in event recognition the context of the token is usually as important as the token itself, the features in the input vector come not only from the current token, but also from preceding and following ones.

Feature weighting scheme We use the reciprocal scheme, which weights the surrounding tokens reciprocally to the distance to the token in the centre of the context window. This reflects the intuition that the nearer a neighbouring token is, the more important it is for classifying the given token. Our previous experiments have shown that such a weighting scheme typically obtains better results than the commonly used equal weighting of features. Best results so far have been obtained with a window size of 4, which means the algorithm uses features derived from 9 tokens (4 preceding, current, and 4 following)
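A sketch of the reciprocal scheme, assuming weight 1/(d+1) for a token at distance d from the token being classified (so the current token gets weight 1; the exact normalisation in the real system may differ):

```python
# Sketch of reciprocal context weighting: a token at distance d from the
# token being classified contributes its features with weight 1/(d+1),
# so influence decays with distance. Window size 4 covers 9 tokens.

def context_weights(window=4):
    """Weight for each offset in [-window, window]."""
    return {offset: 1.0 / (abs(offset) + 1) for offset in range(-window, window + 1)}

def weighted_features(tokens, index, window=4):
    """Collect (feature, weight) pairs for classifying tokens[index]."""
    weights = context_weights(window)
    feats = []
    for offset, w in sorted(weights.items()):
        j = index + offset
        if 0 <= j < len(tokens):
            feats.append((f"token[{offset}]={tokens[j]}", w))
    return feats

tokens = "the economic downturn hit Greece in 2011".split()
for feat, w in weighted_features(tokens, index=2, window=2):
    print(f"{w:.2f}  {feat}")
```

In the real application the same weighting is applied not just to the token strings but to lemmas, POS tags and gazetteer lookups, as described above.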

Evaluation Current results using PAUM are almost equivalent to state-of-the-art systems in the latest TimeML evaluation in terms of F1-measure. In terms of recall, it actually outperforms the existing state of the art by several percentage points (achieving 84.99%) There is still plenty of scope for further experimentation with additional features

Summary We have introduced applications in GATE for NER, term and event extraction, including an alternative application for event recognition using Machine Learning. Module 3 will look at opinion mining, while Module 4 will deal with the specific adaptations of text mining for social media, including the TwitIE system

Further materials
T. Risse, S. Dietze, D. Maynard and N. Tahmasebi. Using Events for Content Appraisal and Selection in Web Archives. In Proceedings of DeRiVE 2011: Workshop in conjunction with the 10th International Semantic Web Conference, October 2011, Bonn, Germany.
S. Dietze, D. Maynard, E. Demidova, T. Risse, W. Peters, K. Doka, Y. Stavrakas. Preservation of Social Web Content based on Entity Extraction and Consolidation. International Workshop on Semantic Digital Archives (SDA 2012), Paphos, Cyprus, September 2012, Vol. 912, CEUR-WS.org (2012), pp. 18-29.
Y. Li, K. Bontcheva and H. Cunningham. Adapting SVM for Data Sparseness and Imbalance: A Case Study on Information Extraction. Natural Language Engineering, 15(02), 241-271, 2009.
T. Risse, W. Peters, P. Senellart, D. Maynard. Documenting Contemporary Society by Preserving Relevant Information from Twitter. In 'Twitter and Society', edited by K. Weller, A. Bruns, J. Burgess, M. Mahrt and C. Puschmann (forthcoming 2013).
D. Maynard, K. Bontcheva. Natural Language Processing. In J. Lehmann and J. Voelker, editors, Perspectives of Ontology Learning. IOS Press (forthcoming 2013).